OCR/services at fa9b542b06c6ffb6fb8902ef6560c965a507f00d - OCR

egg fa9b542b06 fix: improve OCR track multi-line text rendering and HTML table detection

Multi-line text rendering (pdf_generator_service.py):
- Calculate font size by dividing bbox height by number of lines
- Start Y coordinate from bbox TOP instead of bottom
- Use non_empty_lines for proper line positioning
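
A minimal sketch of the layout arithmetic described above, not the actual code in pdf_generator_service.py. It assumes PDF-style coordinates (y grows upward, so the bbox top is the larger y value). The function name `layout_multiline` and the bbox tuple order are hypothetical, chosen only for illustration.

```python
from typing import List, Tuple


def layout_multiline(
    text: str, bbox: Tuple[float, float, float, float]
) -> Tuple[float, List[Tuple[str, float, float]]]:
    """Return (font_size, [(line, x, baseline_y), ...]) for text inside bbox.

    bbox is assumed to be (x_left, y_bottom, x_right, y_top) in PDF-style
    coordinates; this ordering is an assumption, not taken from the repo.
    """
    x_left, y_bottom, _x_right, y_top = bbox

    # Only non-empty lines take up vertical space.
    non_empty_lines = [ln for ln in text.split("\n") if ln.strip()]
    if not non_empty_lines:
        return 0.0, []

    # Font size: bbox height divided by the number of lines.
    line_height = (y_top - y_bottom) / len(non_empty_lines)
    font_size = line_height

    # Start from the bbox TOP and step downward one line slot at a time.
    positions = []
    for i, line in enumerate(non_empty_lines):
        baseline_y = y_top - (i + 1) * line_height
        positions.append((line, x_left, baseline_y))
    return font_size, positions
```

With a bbox 40 units tall and two non-empty lines, each line gets a 20-unit slot, and the first line sits in the upper slot rather than the lower one.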

HTML table detection:
- pp_structure_enhanced.py: Detect HTML tables in 'text' type content
  and reclassify to TABLE when a <table tag is found
- pdf_generator_service.py: Content-based reclassification from TEXT
  to TABLE during UnifiedDocument parsing
- ocr_to_unified_converter.py: Fallback to check 'content' field for
  HTML tables when 'html' field is empty
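
The detection and fallback rules above could look roughly like the following. This is a sketch under assumptions: the element is modeled as a plain dict with 'type', 'content' and 'html' keys, and `reclassify_element` is a hypothetical helper, not a function from the repository.

```python
def reclassify_element(element: dict) -> dict:
    """Return a copy of element, reclassified to 'table' if it holds an HTML table.

    The dict layout ('type', 'content', 'html') is assumed for illustration.
    """
    html = element.get("html") or ""
    content = element.get("content") or ""

    # Fallback: when 'html' is empty, look for the table in 'content'.
    if not html and "<table" in content.lower():
        html = content

    result = dict(element)
    # Content-based reclassification: text that carries a <table tag is a table.
    if element.get("type") == "text" and "<table" in html.lower():
        result["type"] = "table"
        result["html"] = html
    return result
```

The check is a case-insensitive substring match on `<table`, which also matches tags with attributes such as `<table border="1">`. Elements without such a tag pass through unchanged.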

Known issue: OCR processing still has quality issues that need further
investigation and fixes.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

2025-11-26 16:09:31 +08:00

__init__.py

feat: add unified JSON export with standardized schema

2025-11-19 08:36:24 +08:00

admin_service.py

fix: migrate UI to V2 API and fix admin dashboard

2025-11-17 08:55:50 +08:00

audit_service.py

feat: complete external auth V2 migration with advanced features

2025-11-14 17:19:43 +08:00

direct_extraction_engine.py

fix: resolve Direct track PDF regression issues