Translation status (multi-worker support):
- Add filesystem lock files (.translating) to track in-progress translations
- Check lock files in /status API when job_state not found in current worker
- Remove lock files on translation success or failure
OCR fallback fix:
- Fix empty pages when layout analysis fails but OCR succeeds
- Change 'enhanced_results' in ocr_results to ocr_results.get('enhanced_results')
- This ensures fallback to text_regions when enhanced_results is empty list
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Backup commit before executing remove-unused-code proposal.
This includes all pending changes and new features.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Table data format fixes (ocr_to_unified_converter.py):
- Fix ElementType string conversion using value-based lookup
- Add content-based HTML table detection (reclassify TEXT to TABLE)
- Use BeautifulSoup for robust HTML table parsing
- Generate TableData with fully populated cells arrays
Image cropping for OCR track (pp_structure_enhanced.py):
- Add _crop_and_save_image method for extracting image regions
- Pass source_image_path to _process_parsing_res_list
- Return relative filename (not full path) for saved_path
- Consistent with Direct Track image saving pattern
Also includes:
- Add beautifulsoup4 to requirements.txt
- Add architecture overview documentation
- Archive fix-ocr-track-table-data-format proposal (22/24 tasks)
Known issues: OCR track images are restored but still have quality issues
that will be addressed in a follow-up proposal.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
Multi-line text rendering (pdf_generator_service.py):
- Calculate font size by dividing bbox height by number of lines
- Start Y coordinate from bbox TOP instead of bottom
- Use non_empty_lines for proper line positioning
HTML table detection:
- pp_structure_enhanced.py: Detect HTML tables in 'text' type content
and reclassify to TABLE when <table tag found
- pdf_generator_service.py: Content-based reclassification from TEXT
to TABLE during UnifiedDocument parsing
- ocr_to_unified_converter.py: Fallback to check 'content' field for
HTML tables when 'html' field is empty
Known issue: OCR processing still has quality issues that need further
investigation and fixes.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
Implements the converter that transforms PP-StructureV3 OCR results into
the UnifiedDocument format, enabling consistent output for both OCR and
direct extraction tracks.
- Create OCRToUnifiedConverter class with full element type mapping
- Handle both enhanced (parsing_res_list) and standard markdown results
- Support 4-point and simple bbox formats for coordinates
- Establish element relationships (captions, lists, headers)
- Integrate converter into OCR service dual-track processing
- Update tasks.md marking section 3.3 complete
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>