egg/OCR - OCR

Author	SHA1	Message	Date
egg	3903bcf77d	fix: tighten covering detection thresholds to avoid false positives - Increase white threshold from 0.95 to 0.98 (pure white only) - Decrease black threshold from 0.05 to 0.02 (pure black only) - Remove "other solid" detection (caused false positives on gray backgrounds) This prevents light gray table cell backgrounds (RGB ~0.93) from being incorrectly detected as covering/redaction rectangles. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-12-04 07:36:07 +08:00
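The tightened thresholds in this commit can be sketched as a small classifier. This is a minimal illustration, not the repository's code: the function name `classify_cover_color` and the tuple-of-floats color format are assumptions; only the 0.98 and 0.02 cut-offs come from the commit message.

```python
from typing import Optional, Tuple

WHITE_MIN = 0.98  # from the commit: white threshold raised from 0.95 to 0.98
BLACK_MAX = 0.02  # from the commit: black threshold lowered from 0.05 to 0.02


def classify_cover_color(rgb: Tuple[float, float, float]) -> Optional[str]:
    """Classify a fill color (channels in [0, 1]) as a possible covering rectangle.

    Hypothetical helper: returns "white", "black", or None when the fill
    should not be treated as white-out or redaction.
    """
    if all(channel >= WHITE_MIN for channel in rgb):
        return "white"
    if all(channel <= BLACK_MAX for channel in rgb):
        return "black"
    # The "other solid" category was removed, so gray fills fall through.
    return None
```

With these cut-offs, a light gray table cell background of about 0.93 per channel is no longer classified as a covering rectangle.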
egg	bc66f72352	feat: extend covering detection to include black/redaction rectangles Expands whiteout detection to handle: - White rectangles (RGB >= 0.95) - correction tape / white-out - Black rectangles (RGB <= 0.05) - redaction / censoring - Other solid fills (very dark or very light) - potential covering Adds color_type to covered text results for better logging. Logs now show breakdown by cover type (white, black, other). 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-12-04 07:34:35 +08:00
egg	6a65c7617d	feat: add PDF preprocessing pipeline for Direct track Implement multi-stage preprocessing pipeline to improve extraction quality: Phase 1 - Object-level Cleaning: - Content stream sanitization via clean_contents(sanitize=True) - Hidden OCG layer detection - White-out detection with IoU 80% threshold Phase 2 - Layout Analysis: - Column-aware sorting (sort=True) - Page number pattern detection and filtering - Position-based element classification Phase 3 - Enhanced Extraction: - Garble rate detection (cid:xxxx, U+FFFD, PUA characters) - OCR fallback recommendation when garble >10% - Quality report generation interface Phase 4 - GS Distillation (Exception Handler): - Ghostscript PDF repair for severely damaged files - Auto-triggered on high garble or mupdf errors - Graceful fallback when GS unavailable 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-12-03 16:11:00 +08:00
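The garble-rate check from Phase 3 could look roughly like the sketch below. It assumes the rate is measured as the share of characters that are `(cid:NNNN)` placeholders, U+FFFD replacement characters, or Private Use Area code points; the function names and the exact `(cid:...)` pattern are illustrative assumptions, and only the 10% fallback threshold is taken from the commit message.

```python
import re

# Assumption: unmapped glyphs appear as "(cid:1234)" placeholders in the text.
CID_PATTERN = re.compile(r"\(cid:\d+\)")


def _is_private_use(ch: str) -> bool:
    """True for code points in the Unicode Private Use Areas."""
    cp = ord(ch)
    return (
        0xE000 <= cp <= 0xF8FF
        or 0xF0000 <= cp <= 0xFFFFD
        or 0x100000 <= cp <= 0x10FFFD
    )


def garble_rate(text: str) -> float:
    """Fraction of characters that look like extraction garbage."""
    if not text:
        return 0.0
    cid_chars = sum(len(m) for m in CID_PATTERN.findall(text))
    remainder = CID_PATTERN.sub("", text)
    bad_chars = sum(1 for ch in remainder if ch == "\ufffd" or _is_private_use(ch))
    return (cid_chars + bad_chars) / len(text)


def should_fallback_to_ocr(text: str, threshold: float = 0.10) -> bool:
    """Recommend OCR when the garble rate exceeds the threshold (10% in the commit)."""
    return garble_rate(text) > threshold
```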
egg	87dc97d951	fix: improve Office document processing with Direct track - Force Office documents (PPTX, DOCX, XLSX) to use Direct track after LibreOffice conversion, since converted PDFs always have extractable text - Fix PDF generator to not exclude text in image regions for Direct track, allowing text to render on top of background images (critical for PPT) - Increase file_type column from VARCHAR(50) to VARCHAR(100) to support long MIME types like PPTX - Remove reference to non-existent total_images metadata attribute This significantly improves processing time for Office documents (from ~170s OCR to ~10s Direct) while preserving text quality. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-11-30 16:22:04 +08:00
egg	79cffe6da0	fix: resolve Direct track PDF regression issues - Add _is_likely_chart() to detect charts misclassified as tables - High empty cell ratio (>70%) indicates chart grid - Axis label patterns (numbers, °C, %, Time, Temperature) - Multi-line cells with axis text - Add _build_rows_from_cells_dict() to handle JSON table content - Properly parse cells structure from Direct extraction - Avoid HTML round-trip conversion issues - Remove rowHeights parameter from Table() to fix content overlap - Let ReportLab auto-calculate row heights based on content - Use scaling to fit within bbox Fixes edit.pdf table overlap and chart misclassification issues. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-11-26 12:29:46 +08:00
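The empty-cell heuristic mentioned for `_is_likely_chart()` can be sketched as follows. This covers only the ">70% empty cells" signal; the axis-label and multi-line checks are omitted, and the helper names and the list-of-rows input format are assumptions rather than the repository's actual interface.

```python
from typing import List, Optional

EMPTY_RATIO_THRESHOLD = 0.70  # from the commit: >70% empty cells suggests a chart grid


def empty_cell_ratio(rows: List[List[Optional[str]]]) -> float:
    """Share of cells that are None or contain only whitespace."""
    cells = [cell for row in rows for cell in row]
    if not cells:
        return 0.0
    empty = sum(1 for cell in cells if cell is None or not cell.strip())
    return empty / len(cells)


def looks_like_chart_grid(rows: List[List[Optional[str]]]) -> bool:
    """Hypothetical check: a mostly empty 'table' is probably a chart's grid lines."""
    return empty_cell_ratio(rows) > EMPTY_RATIO_THRESHOLD
```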
egg	1afdb822c3	feat: implement hybrid image extraction and memory management Backend: - Add hybrid image extraction for Direct track (inline image blocks) - Add render_inline_image_regions() fallback when OCR doesn't find images - Add check_document_for_missing_images() for detecting missing images - Add memory management system (MemoryGuard, ModelManager, ServicePool) - Update pdf_generator_service to handle HYBRID processing track - Add ElementType.LOGO for logo extraction Frontend: - Fix PDF viewer re-rendering issues with memoization - Add TaskNotFound component and useTaskValidation hook - Disable StrictMode due to react-pdf incompatibility - Fix task detail and results page loading states 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-11-26 10:56:22 +08:00
egg	3358d97624	fix: resolve Direct track PDF table rendering overlap with canvas scaling This commit fixes the critical table overlap issue in Direct track PDF layout restoration where generated tables exceeded their bounding boxes and overlapped with surrounding text. Root Cause: ReportLab's Table component auto-calculates row heights based on content, often rendering tables larger than their specified bbox. The rowHeights parameter was ignored during actual rendering, and font size reduction didn't proportionally affect table height. Solution - Canvas Transform Scaling: Implemented a reliable canvas transform approach in _draw_table_element_direct(): 1. Wrap table with generous space to get natural rendered dimensions 2. Calculate scale factor: min(bbox_width/actual_width, bbox_height/actual_height, 1.0) 3. Apply canvas transform: saveState → translate → scale → drawOn → restoreState 4. Removed all buffers, using exact bbox positioning Key Changes: - backend/app/services/pdf_generator_service.py (_draw_table_element_direct): * Added canvas scaling logic (lines 2180-2208) * Removed buffer adjustments (previously 2pt→18pt attempts) * Use exact bbox position: pdf_y = page_height - bbox.y1 * Supports column widths from metadata to preserve original ratios - backend/app/services/direct_extraction_engine.py (_process_native_table): * Extract column widths from PyMuPDF table.cells data (lines 691-761) * Calculate and store original column width ratios (e.g., 40:60) * Store in element metadata for use during PDF generation * Prevents unnecessary text wrapping that increases table height Results: Test case showed perfect scaling: natural table 246.8×108.0pt → scaled to 246.8×89.6pt with factor 0.830, fitting exactly within bbox without overlap. Cleanup: - Removed test/debug scripts: check_tables.py, verify_chart_recognition.py - Removed demo files from demo_docs/ (basic/, layout/, mixed/, tables/) User Confirmed: "The result in FINAL_SCALING_FIX.pdf is acceptable. Congratulations on completing the Direct PDF fix." Next: Other document formats require layout verification and fixes. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-11-24 19:39:12 +08:00
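The scale-factor formula from step 2 of this commit can be checked in isolation. The sketch below is only the arithmetic, without ReportLab; the function name `fit_scale` is hypothetical, and the bbox size of 246.8×89.6pt used in the usage note is inferred from the figures quoted in the commit message.

```python
def fit_scale(natural_width: float, natural_height: float,
              bbox_width: float, bbox_height: float) -> float:
    """Uniform scale that shrinks a table to fit its bbox and never enlarges it.

    Mirrors the commit's formula:
    min(bbox_width / actual_width, bbox_height / actual_height, 1.0)
    """
    if natural_width <= 0 or natural_height <= 0:
        raise ValueError("natural dimensions must be positive")
    return min(bbox_width / natural_width, bbox_height / natural_height, 1.0)


# In the renderer this factor would be applied between saveState() and
# restoreState(), after translating the canvas to the bbox origin.
scale = fit_scale(246.8, 108.0, 246.8, 89.6)
```

For the test case in the commit, the height ratio 89.6 / 108.0 is the limiting term, which gives the reported factor of about 0.830.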
egg	108784a270	fix: resolve table/image overlap and missing images in Direct track PDF generation This commit fixes two critical rendering issues in Direct track PDF generation that were reported by the user after the span-based rendering fixes. ## Issue 1: Table Text Overlap (tables overlapping with text) Problem: Tables rendered with duplicate text appearing on top because DirectExtractionEngine extracts table content as both TABLE elements (with structure) and separate TEXT elements (individual text blocks), causing PDFGeneratorService to render both and create overlaps. Solution: Implemented overlap filtering mechanism with area-based detection Changes: - Added `_is_element_inside_regions()` method in PDFGeneratorService - Uses overlap ratio detection (50% threshold) instead of strict containment - Handles cases where text blocks are larger than detected regions - Algorithm: filters element if ≥50% of its area overlaps with table/image bbox - Modified `_generate_direct_track_pdf()` to: - Collect exclusion regions (tables + images) before rendering - Check each text/list element for overlap before drawing - Skip elements that significantly overlap with exclusion regions Evidence: - Test case: "PRODUCT DESCRIPTION" text block overlaps 74.5% with table - File size reduced by 545 bytes (-3.8%) from filtered elements - E2E tests passed: test_2_4_1_simple_tables, test_2_4_2_complex_tables - User confirmed: "The table problem looks resolved" ✓ ## Issue 2: Missing Images (images disappearing) Problem: Images not rendering in generated PDFs because `extract()` was called without `output_dir` parameter, causing images to not be saved to filesystem, resulting in missing `saved_path` in element content. Solution: Auto-create default output directory for image extraction Changes: - Modified `DirectExtractionEngine.extract()` to: - Auto-create `storage/results/{document_id}/` when output_dir not provided - Ensures images always saved when enable_image_extraction=True - Uses short UUID (8 chars) for cleaner directory names - Maintains backward compatibility (existing calls still work) Evidence: - Image extraction: 2/2 images saved to storage/results/ - Image files: 5,320 + 4,945 = 10,265 bytes total - PDF file size: 13,627 → 26,643 bytes (+13,016 bytes, +95.5%) - PyMuPDF verification: 2 images embedded in page 1 - E2E tests passed: test_1_3_2_direct_track_image_rendering, test_1_3_3_verify_image_paths ## Technical Details Overlap Filtering Algorithm: ``` For each text/list element: For each table/image region: Calculate overlap_area = intersection(element_bbox, region_bbox) Calculate overlap_ratio = overlap_area / element_area If overlap_ratio ≥ 0.5: SKIP element (inside region) ``` Key Advantages: - Area-based vs strict containment (handles larger text blocks) - Configurable threshold (default 50%, adjustable if needed) - Preserves reading order and layout - No breaking changes to existing code ## Test Results E2E Test Suite: 6/8 passed (2 OCR track timeouts unrelated to these fixes) - ✅ test_1_3_2_direct_track_image_rendering - ✅ test_1_3_3_verify_image_paths - ✅ test_2_4_1_simple_tables - ✅ test_2_4_2_complex_tables - ✅ test_4_4_1_compare_direct_with_original File Size Evidence: - Text-only (no images): 13,627 bytes - With images (both fixes): 26,643 bytes - Difference: +13,016 bytes (+95.5%) confirming image inclusion Visual Quality: - Tables render without text overlay ✓ - Images embedded correctly (2/2) ✓ - Text outside regions still renders ✓ - No duplicate rendering ✓ ## Files Changed - backend/app/services/pdf_generator_service.py - Added _is_element_inside_regions() (lines 592-642) - Modified _generate_direct_track_pdf() (lines 697-766) - backend/app/services/direct_extraction_engine.py - Modified extract() (lines 78-84) - backend/tests/e2e/TEST_RESULTS_FINAL_FIX.md - Comprehensive test documentation 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-11-24 16:31:28 +08:00
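The overlap filtering pseudocode in this commit translates fairly directly into Python. The sketch below assumes bounding boxes are `(x0, y0, x1, y1)` tuples; the function names are illustrative and are not claimed to match the real `_is_element_inside_regions()` signature.

```python
from typing import Iterable, Tuple

BBox = Tuple[float, float, float, float]  # assumed layout: (x0, y0, x1, y1)


def overlap_ratio(element: BBox, region: BBox) -> float:
    """Intersection area divided by the element's own area."""
    ex0, ey0, ex1, ey1 = element
    rx0, ry0, rx1, ry1 = region
    width = min(ex1, rx1) - max(ex0, rx0)
    height = min(ey1, ry1) - max(ey0, ry0)
    if width <= 0 or height <= 0:
        return 0.0  # boxes do not intersect
    element_area = (ex1 - ex0) * (ey1 - ey0)
    if element_area <= 0:
        return 0.0  # degenerate element box
    return (width * height) / element_area


def is_inside_any_region(element: BBox, regions: Iterable[BBox],
                         threshold: float = 0.5) -> bool:
    """Skip the element when at least `threshold` of its area lies in a region."""
    return any(overlap_ratio(element, region) >= threshold for region in regions)
```

Dividing by the element's area rather than the union is what lets a text block that is larger than the detected table region still be filtered once half of it falls inside.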
egg	6d4df26223	feat: add multi-column layout support for PDF extraction and generation - Enable PyMuPDF sort=True for correct reading order in multi-column PDFs - Add column detection utilities (_sort_elements_for_reading_order, _detect_columns) - Preserve extraction order in PDF generation instead of re-sorting by Y position - Fix StyleInfo field names (font_name, font_size, text_color instead of font, size, color) - Fix Page.dimensions access (was incorrectly accessing Page.width directly) - Implement row-by-row reading order (top-to-bottom, left-to-right within each row) This fixes the issue where multi-column PDFs (e.g., technical data sheets) had incorrect element ordering, with title appearing at position 12 instead of first. PyMuPDF's built-in sort=True parameter provides optimal reading order for most multi-column layouts without requiring custom column detection. Resolves: Multi-column layout reading order issue reported by user Affects: Direct track PDF extraction and generation (Task 8) 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-11-24 14:25:53 +08:00
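The row-by-row ordering described here (top-to-bottom, then left-to-right within each row) can be sketched without PyMuPDF. This is an assumption-laden illustration: the `row_tolerance` parameter, the `(x0, y0, x1, y1)` box format with y growing downward, and the function name are all hypothetical, and the commit itself relies mainly on PyMuPDF's built-in `sort=True`.

```python
from typing import List, Tuple

BBox = Tuple[float, float, float, float]  # assumed layout: (x0, y0, x1, y1)


def sort_reading_order(boxes: List[BBox], row_tolerance: float = 5.0) -> List[BBox]:
    """Group boxes into rows by their top edge, then order each row by x."""
    ordered = sorted(boxes, key=lambda b: (b[1], b[0]))
    rows: List[List[BBox]] = []
    for box in ordered:
        # A box joins the current row if its top is close to the row's first box.
        if rows and abs(box[1] - rows[-1][0][1]) <= row_tolerance:
            rows[-1].append(box)
        else:
            rows.append([box])
    return [box for row in rows for box in sorted(row, key=lambda b: b[0])]
```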
egg	75c194fe2a	feat: implement Task 7 span-level rendering for inline styling Added support for preserving and rendering inline style variations within text elements (e.g., bold/italic/color changes mid-line). Span Extraction (direct_extraction_engine.py): 1. Parse PyMuPDF span data with font, size, flags, color per span 2. Create DocumentElement children for each span with StyleInfo 3. Store spans in element.children for downstream rendering 4. Extract span-specific bbox from PyMuPDF (lines 434-453) Span Rendering (pdf_generator_service.py): 1. Implement _draw_text_with_spans() method (lines 1685-1734) - Iterate through span children - Apply per-span styling via _apply_text_style - Track X position and calculate widths - Return total rendered width 2. Integrate in _draw_text_element_direct() (lines 1822-1823, 1905-1914) - Check for element.children (has_spans flag) - Use span rendering for first line - Fall back to normal rendering for list items 3. Add span count to debug logging Features: - Inline font changes (Arial → Times → Courier) - Inline size changes (12pt → 14pt → 10pt) - Inline style changes (normal → bold → italic) - Inline color changes (black → red → blue) Limitations (future work): - Currently renders all spans on first line only - Multi-line span support requires line breaking logic - List items use single-style rendering (compatibility) Direct track only (OCR track has no span information). 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-11-24 11:44:05 +08:00
egg	5bcf3dfd42	fix: complete layout analysis features for DirectExtractionEngine Implements missing layout analysis capabilities: - Add footer detection based on page position (bottom 10%) - Build hierarchical section structure from font sizes - Create nested list structure from indentation levels All elements now have proper metadata for: - section_level, parent_section, child_sections (headers) - list_level, parent_item, children (list items) - is_page_header, is_page_footer flags Updates tasks.md to reflect accurate completion status. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-11-19 08:15:11 +08:00
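The position-based footer rule (bottom 10% of the page) amounts to a single comparison. The sketch below assumes a top-left origin with y increasing downward, as in PyMuPDF page coordinates; the function name and the choice to test the element's top edge are assumptions.

```python
def is_page_footer(element_top: float, page_height: float,
                   footer_fraction: float = 0.10) -> bool:
    """True when the element starts inside the bottom `footer_fraction` of the page."""
    return element_top >= page_height * (1.0 - footer_fraction)
```

On a 792pt-tall page, an element starting at y = 760 would be flagged as a footer, while one starting at y = 400 would not.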
egg	2d50c128f7	feat: implement core dual-track processing infrastructure Added foundation for dual-track document processing: 1. UnifiedDocument Model (backend/app/models/unified_document.py) - Common output format for both OCR and direct extraction - Comprehensive element types (23+ types from PP-StructureV3) - BoundingBox, StyleInfo, TableData structures - Backward compatibility with legacy format 2. DocumentTypeDetector Service (backend/app/services/document_type_detector.py) - Intelligent document type detection using python-magic - PDF editability analysis using PyMuPDF - Processing track recommendation with confidence scores - Support for PDF, images, Office docs, and text files 3. DirectExtractionEngine Service (backend/app/services/direct_extraction_engine.py) - Fast extraction from editable PDFs using PyMuPDF - Preserves fonts, colors, and exact positioning - Native and positional table detection - Image extraction with coordinates - Hyperlink and metadata extraction 4. Dependencies - Added PyMuPDF>=1.23.0 for PDF extraction - Added pdfplumber>=0.10.0 as fallback - Added python-magic-bin>=0.4.14 for file detection Next: Integrate with OCR service for complete dual-track processing 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-11-18 20:17:50 +08:00

12 Commits