All cleanup tasks have been completed:
- Backend: 3 service files removed (~1,200 lines)
- Frontend: 2 components + 2 API files removed (~758 lines)
- Total: 1,958 lines of redundant code removed
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Backend cleanup:
- Remove ocr_service_original.py (legacy OCR service, replaced by ocr_service.py)
- Remove preprocessor.py (unused, functionality absorbed by layout_preprocessing_service.py)
- Remove pdf_font_manager.py (unused, never referenced by any service)
Frontend cleanup:
- Remove MarkdownPreview.tsx (unused component)
- Remove ResultsTable.tsx (unused, replaced by TaskHistoryPage)
- Remove services/api.ts (legacy API client, migrated to apiV2)
- Remove types/api.ts (legacy types, migrated to apiV2.ts)
API migration:
- Add export rules CRUD methods to apiClientV2
- Update SettingsPage.tsx to use apiClientV2
- Update Layout.tsx to use only apiClientV2 for logout
This reduces ~1,500 lines of redundant code and unifies the API client.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Backup commit before executing remove-unused-code proposal.
This includes all pending changes and new features.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Implements detection of embedded images used for redaction/covering:
- Analyzes embedded images for mostly black (avg RGB <= 30) or white (>= 245)
- Uses PIL to efficiently sample image colors
- Gets image position on page via get_image_rects()
- Integrates with existing preprocessing pipeline
- Adds covering_images to page metadata and quality report
Detection results:
- demo_docs/edit3.pdf: 10 black covering images detected (7 on P1, 3 on P2)
Quality report now includes:
- total_covering_images count
- Per-page covering_images details with bbox, color_type, size
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
- Increase white threshold from 0.95 to 0.98 (pure white only)
- Decrease black threshold from 0.05 to 0.02 (pure black only)
- Remove "other solid" detection (caused false positives on gray backgrounds)
This prevents light gray table cell backgrounds (RGB ~0.93) from being
incorrectly detected as covering/redaction rectangles.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
Expands whiteout detection to handle:
- White rectangles (RGB >= 0.95) - correction tape / white-out
- Black rectangles (RGB <= 0.05) - redaction / censoring
- Other solid fills (very dark or very light) - potential covering
Adds color_type to covered text results for better logging.
Logs now show breakdown by cover type (white, black, other).
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
Adds test script for validating PDF preprocessing pipeline:
- Garble rate detection unit tests
- Page number pattern detection unit tests
- Integration tests with demo_docs/edit*.pdf files
- Quality report generation verification
Usage: PYTHONPATH=backend python3 scripts/run_preprocessing_tests.py
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
Key fixes:
- Skip large vector_graphics charts (>50% page coverage) that cover text
- Fix font fallback to use NotoSansSC for CJK support instead of Helvetica
- Improve translated table rendering with dynamic font sizing
- Add merged cell (row_span/col_span) support for reflow tables
- Skip text elements inside table bboxes to avoid duplication
Archive openspec proposal: fix-pdf-table-rendering
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
- Add generate_translated_layout_pdf() method for layout-preserving translated PDFs
- Add generate_translated_pdf() method for reflow translated PDFs
- Update translate router to accept format parameter (layout/reflow)
- Update frontend with dropdown to select translated PDF format
- Fix reflow PDF table cell extraction from content dict
- Add embedded images handling in reflow PDF tables
- Archive improve-translated-text-fitting openspec proposal
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
Task model doesn't have file_path attribute directly. Use the files
relationship to access TaskFile.stored_path for source file path.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
- Kill entire process tree (parent + children) when stopping
- Add port-based cleanup as fallback for orphaned processes
- Remove 'set -e' to allow graceful failure handling
- Pass port numbers to stop_service for cleanup
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
Add extra='ignore' to Settings Config to prevent ValidationError when
.env files contain deprecated variables (e.g., PADDLEOCR_MODEL_DIR).
This ensures backwards compatibility with Docker deployments.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
- Update config.py to read both .env and .env.local (with .env.local priority)
- Move DIFY API settings from hardcoded values to environment configuration
- Remove unused PADDLEOCR_MODEL_DIR setting (models stored in ~/.paddleocr/)
- Remove deprecated argostranslate translation settings
- Add DIFY settings: base_url, api_key, timeout, max_retries, batch limits
- Update dify_client.py to use settings from config.py
- Update translation_service.py to use settings instead of constants
- Fix frontend env files to use correct variable name VITE_API_BASE_URL
- Update setup_dev_env.sh with correct PaddlePaddle version (3.2.0)
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
- Add unified start.sh script with subcommands (all/backend/frontend)
- Add process management (--stop, --status)
- Remove separate start_backend.sh and start_frontend.sh
- Update setup_dev_env.sh with pre-flight checks and --cpu-only/--skip-db options
- Update .env.example to remove sensitive data and add DIFY translation config
- Add .pid/ to .gitignore for process management
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
Adds the ability to download translated documents as PDF files while
preserving the original document layout. Key changes:
- Add apply_translations() function to merge translation JSON with UnifiedDocument
- Add generate_translated_pdf() method to PDFGeneratorService
- Add POST /api/v2/translate/{task_id}/pdf endpoint
- Add downloadTranslatedPdf() method and PDF button in frontend
- Add comprehensive unit tests (52 tests: merge, PDF generation, API endpoints)
- Archive add-translated-pdf-export proposal
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
Implement document translation feature using DIFY AI API with batch processing:
Backend:
- Add DIFY client with batch translation support (5000 chars, 20 items per batch)
- Add translation service with element extraction and result building
- Add translation router with start/status/result/list/delete endpoints
- Add translation schemas (TranslationRequest, TranslationStatus, etc.)
Frontend:
- Enable translation UI in TaskDetailPage
- Add translation API methods to apiV2.ts
- Add translation types
Features:
- Batch translation with numbered markers [1], [2], [3]...
- Support for text, title, header, footer, paragraph, footnote, table cells
- Translation result JSON with statistics (tokens, latency, batch_count)
- Background task processing with progress tracking
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
- Force Office documents (PPTX, DOCX, XLSX) to use Direct track after
LibreOffice conversion, since converted PDFs always have extractable text
- Fix PDF generator to not exclude text in image regions for Direct track,
allowing text to render on top of background images (critical for PPT)
- Increase file_type column from VARCHAR(50) to VARCHAR(100) to support
long MIME types like PPTX
- Remove reference to non-existent total_images metadata attribute
This significantly improves processing time for Office documents
(from ~170s OCR to ~10s Direct) while preserving text quality.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
Archived the extract-table-cell-boxes proposal which implemented:
- Table cell boxes extraction from PP-StructureV3 table_res_list
- Layered rendering for tables with cell borders
- CV-based table line detection (disabled)
- Scan artifact removal preprocessing
- PDF orientation detection for rotated documents
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
When rotation is detected, the OCR coordinate system needs to be swapped:
- Original OCR dimensions: 1242 x 1755 (portrait image)
- Content coordinates: up to x=1593 (exceeds image width, indicates rotation)
- Rotated OCR dimensions: 1755 x 1242 (matching content coordinate system)
Previously, page_dimensions was incorrectly set to target PDF dimensions,
causing scale factors to be ~1.0 instead of ~0.48.
Now correctly:
- original_page_sizes[0] = target PDF dimensions (842.4 x 595.68)
- page_dimensions[0] = swapped OCR dimensions (1755 x 1242)
- Scale = 842.4/1755 ≈ 0.48 for both x and y
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
Fixed two issues in PDF orientation detection:
1. Unit mismatch: Orientation detection was comparing content bboxes
(in pixels) against PDF page dimensions (in points). Now correctly
uses OCR dimensions (pixels) for detection.
2. Priority override: When orientation adjustment is needed, now also
updates original_page_sizes dict so per-page processing uses the
adjusted dimensions instead of the original PDF dimensions.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
Add orientation detection to handle cases where scanned documents have
content in a different orientation than the image dimensions suggest.
When PP-StructureV3 processes rotated documents, it may return bounding
boxes in the "corrected" orientation while the image remains in its
scanned orientation. This causes content to extend beyond page boundaries.
The fix:
- Add _detect_content_orientation() method to detect when content bbox
exceeds page dimensions significantly
- Automatically swap page dimensions when landscape content is detected
in portrait-oriented images (and vice versa)
- Apply orientation detection for both single-page and multi-page documents
Fixes issue where horizontal delivery slips scanned vertically were
generating PDFs with content cut off or incorrectly positioned.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
Based on pp_demo analysis, PPStructureV3 returns table_res_list containing
cell_box_list which was previously ignored. This commit:
- Extract table_res_list from PPStructureV3 result alongside parsing_res_list
- Add table_res_list parameter to _process_parsing_res_list()
- Prioritize cell_box_list from table_res_list over SLANeXt extraction
- Match tables by HTML content or use first available
Priority order for cell boxes:
1. table_res_list.cell_box_list (native, already absolute coords)
2. res_data['boxes'] (unlikely in PaddleX 3.x)
3. Direct SLANeXt model call (fallback)
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
Phase 1-3 implementation of extract-table-cell-boxes proposal:
- Add enable_table_cell_boxes_extraction config option
- Implement lazy-loaded SLANeXt model caching in PPStructureEnhanced
- Add _extract_cell_boxes_with_slanet() method for direct model invocation
- Supplement PPStructureV3 table processing with SLANeXt cell boxes
- Add _compute_table_grid_from_cell_boxes() for column width calculation
- Modify draw_table_region() to use cell_boxes for accurate layout
Key features:
- Auto-detect table type (wired/wireless) using PP-LCNet classifier
- Convert 8-point polygon bbox to 4-point rectangle
- Graceful fallback to equal distribution when cell_boxes unavailable
- Proper coordinate transformation with scaling support
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
- Archive unify-image-scaling proposal to archive/2025-11-28
- Create new extract-table-cell-boxes proposal for supplementing PPStructureV3
with direct SLANeXt model calls to extract table cell bounding boxes
- Add debug logging to pp_structure_enhanced.py for table cell boxes investigation
- Discovered that PPStructureV3 high-level API filters out cell bbox data,
but paddlex.create_model() can directly invoke underlying models
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
Changes:
- Remove PDF caching to ensure code changes take effect
- Add PDF rotation handling (90/270 degree swap)
- Add dict bbox format support for UnifiedDocument
- Use Paragraph objects for table cells to enable text auto-wrapping
- Align OCR track table rendering logic with Direct track (no fixed rowHeights)
Known issue: PP-StructureV3 does not provide cell bbox in output
(block_content only contains HTML string, no res['boxes'] like old PPStructure)
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
- Remove onAutoConfigReceived callback that caused state update loop
- Add document analysis to check if file needs OCR track
- Only show preprocessing options for OCR-eligible files (images, scanned PDFs)
- Show informative message for editable PDFs that use direct text extraction
- Display text coverage percentage for editable documents
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
- Create PreprocessingPreview component with debounced config updates
- Show original vs preprocessed images side-by-side
- Display image quality metrics (contrast, sharpness) with quality indicators
- Add zoom controls and fullscreen view for detailed inspection
- Show auto-detected configuration when in auto mode
- Integrate preview toggle with PreprocessingSettings component
- Add i18n translations for preview panel UI
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
Major improvements to preprocessing controls:
Backend:
- Add contrast_strength (0.5-3.0) and sharpen_strength (0.5-2.0) to PreprocessingConfig
- Auto-detection now calculates optimal strength based on image quality metrics:
- Lower contrast → Higher contrast_strength
- Lower edge strength → Higher sharpen_strength
- Disable binarization in auto mode (rarely beneficial)
- CLAHE clipLimit now scales with contrast_strength
- Sharpening uses unsharp mask with variable strength
Frontend:
- Add strength sliders for contrast and sharpen in manual mode
- Sliders show current value and strength level (輕微/正常/強/最強)
- Move binarize option to collapsible "進階選項" section
- Updated i18n translations for strength labels
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
The preprocessing_mode and preprocessing_config parameters were not
being passed from the start_task endpoint through to the OCR service:
- Add preprocessing_mode and preprocessing_config to process_task_ocr()
- Extract preprocessing options from ProcessingOptions in start_task()
- Convert string/dict to proper PreprocessingModeEnum/PreprocessingConfig
- Pass converted parameters to ocr_service.process() and process_image()
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
Updates add-layout-preprocessing proposal:
- Auto mode: analyze image quality, auto-select parameters
- Manual mode: user override with specific settings
- Preview API: compare original vs preprocessed before processing
- Frontend UI: mode selection, manual controls, preview button
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
Problem: PP-Structure misses tables with faint lines/borders
Solution: Preprocess images (contrast, sharpen) for layout detection
- Preprocessed image only used for layout detection
- Original image preserved for element extraction (quality)
Includes: proposal.md, design.md, tasks.md, spec delta
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
Add TableData.from_dict() and TableCell.from_dict() methods to convert
JSON table dicts to proper TableData objects during UnifiedDocument parsing.
Modified _json_to_document_element() to detect TABLE elements with dict
content containing 'cells' key and convert to TableData.
Note: This fix ensures table elements have proper to_html() method available
but the rendered output still needs investigation - tables may still render
incorrectly in OCR track PDFs.
Files changed:
- unified_document.py: Add from_dict() class methods
- pdf_generator_service.py: Convert table dicts during JSON parsing
- Add fix-ocr-track-table-rendering proposal
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
Table data format fixes (ocr_to_unified_converter.py):
- Fix ElementType string conversion using value-based lookup
- Add content-based HTML table detection (reclassify TEXT to TABLE)
- Use BeautifulSoup for robust HTML table parsing
- Generate TableData with fully populated cells arrays
Image cropping for OCR track (pp_structure_enhanced.py):
- Add _crop_and_save_image method for extracting image regions
- Pass source_image_path to _process_parsing_res_list
- Return relative filename (not full path) for saved_path
- Consistent with Direct Track image saving pattern
Also includes:
- Add beautifulsoup4 to requirements.txt
- Add architecture overview documentation
- Archive fix-ocr-track-table-data-format proposal (22/24 tasks)
Known issues: OCR track images are restored but still have quality issues
that will be addressed in a follow-up proposal.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
Archive incomplete proposal for later continuation.
OCR processing has known quality issues to be addressed in future work.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>