OCR/openspec/changes/archive/2025-12-08-fix-ocr-cell-overdetection/tasks.md

Tasks: Fix OCR Track Cell Over-Detection

Root Cause Analysis Update

Original assumption: PP-Structure was over-detecting cells.

Actual root cause: cell_boxes from table_res_list were assigned to the wrong tables whenever HTML matching failed. The fallback picked the "first available" entry instead of matching by bbox, which caused:

  • Table A's cell_boxes assigned to Table B
  • False over-detection metrics (density 6.22 vs actual 1.65)
  • Incorrect reclassification as TEXT

Phase 1: Cell Validation Engine

  • 1.1 Create cell_validation_engine.py with metric-based validation
  • 1.2 Implement cell density calculation (cells per 10000px²)
  • 1.3 Implement average cell area calculation
  • 1.4 Implement cell height validation (table_height / cell_count)
  • 1.5 Add configurable thresholds with defaults:
    • max_cell_density: 3.0 cells/10000px²
    • min_avg_cell_area: 3000 px²
    • min_cell_height: 10px
  • 1.6 Unit tests for validation functions
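
The three metrics and their default thresholds can be sketched as below. This is a minimal illustration, not the contents of cell_validation_engine.py: the names CellValidationConfig, cell_metrics, and is_plausible_table are hypothetical, boxes are assumed to be (x0, y0, x1, y1) tuples in pixels, and averaging the individual cell box areas is an assumption about how task 1.3 is computed.

```python
from dataclasses import dataclass
from typing import Dict, Sequence, Tuple

BBox = Tuple[float, float, float, float]  # assumed layout: (x0, y0, x1, y1) in pixels


@dataclass(frozen=True)
class CellValidationConfig:
    # Defaults taken from task 1.5.
    max_cell_density: float = 3.0      # cells per 10,000 px²
    min_avg_cell_area: float = 3000.0  # px²
    min_cell_height: float = 10.0      # px


def cell_metrics(table_bbox: BBox, cell_boxes: Sequence[BBox]) -> Dict[str, float]:
    """Compute cell density, average cell area, and average cell height."""
    x0, y0, x1, y1 = table_bbox
    table_height = max(y1 - y0, 0.0)
    table_area = max(x1 - x0, 0.0) * table_height
    count = len(cell_boxes)
    if count == 0 or table_area == 0:
        return {"density": 0.0, "avg_cell_area": 0.0, "avg_cell_height": 0.0}
    total_cell_area = sum(
        max(c[2] - c[0], 0.0) * max(c[3] - c[1], 0.0) for c in cell_boxes
    )
    return {
        "density": count / table_area * 10_000,   # task 1.2
        "avg_cell_area": total_cell_area / count,  # task 1.3 (assumed: mean box area)
        "avg_cell_height": table_height / count,   # task 1.4: table_height / cell_count
    }


def is_plausible_table(
    table_bbox: BBox,
    cell_boxes: Sequence[BBox],
    cfg: CellValidationConfig = CellValidationConfig(),
) -> bool:
    """Return False when the metrics suggest over-detected cells."""
    if not cell_boxes:
        return True  # assumption: nothing to validate without cell boxes
    m = cell_metrics(table_bbox, cell_boxes)
    return (
        m["density"] <= cfg.max_cell_density
        and m["avg_cell_area"] >= cfg.min_avg_cell_area
        and m["avg_cell_height"] >= cfg.min_cell_height
    )
```

A table that fails this check would be a candidate for the reclassification in Phase 2.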

Phase 2: Table Reclassification

  • 2.1 Implement table-to-text reclassification logic
  • 2.2 Preserve original text content from HTML table
  • 2.3 Create TEXT element with proper bbox
  • 2.4 Recalculate reading order after reclassification
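
As a rough illustration of tasks 2.2 and 2.3, the sketch below pulls cell text out of a table's HTML with the standard-library html.parser and builds a TEXT element that reuses the table's bbox. The element keys ("type", "bbox", "html", "content") and the function name table_to_text_element are assumptions made for this example, not the project's actual schema. Reading-order recalculation (task 2.4) is not shown.

```python
from html.parser import HTMLParser
from typing import Dict, List, Optional


class _CellTextExtractor(HTMLParser):
    """Collect the text of each <td>/<th>, grouped by <tr>."""

    def __init__(self) -> None:
        super().__init__()
        self.rows: List[List[str]] = []
        self._row: Optional[List[str]] = None
        self._cell: Optional[List[str]] = None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th"):
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # Collapse internal whitespace so each cell becomes one token run.
            text = " ".join("".join(self._cell).split())
            if self._row is not None:
                self._row.append(text)
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def table_to_text_element(table_element: Dict) -> Dict:
    """Build a TEXT element that keeps the table's text and bbox."""
    parser = _CellTextExtractor()
    parser.feed(table_element.get("html", ""))
    parser.close()
    lines = [" ".join(cell for cell in row if cell) for row in parser.rows]
    return {
        "type": "text",
        "bbox": table_element["bbox"],  # reuse the original table bbox
        "content": "\n".join(line for line in lines if line),
    }
```

One line per table row keeps the original row grouping visible in the plain text, which is a design choice of this sketch rather than something the tasks specify.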

Phase 3: Integration

  • 3.1 Integrate validation into OCR service pipeline (after PP-Structure)
  • 3.2 Add validation before cell_boxes processing
  • 3.3 Add debug logging for filtered tables
  • 3.4 Update processing metadata with filter statistics

Phase 3.5: cell_boxes Matching Fix (NEW)

  • 3.5.1 Fix cell_boxes matching in pp_structure_enhanced.py to use bbox overlap instead of "first available"
  • 3.5.2 Calculate IoU between table_res cell_boxes bounding box and layout element bbox
  • 3.5.3 Match tables with >10% overlap, log match quality
  • 3.5.4 Update validate_cell_boxes to also check table bbox boundaries, not just page boundaries

Results:

  • OLD: cell_boxes mismatch caused false over-detection (density=6.22)
  • NEW: correct bbox matching (overlap=0.97-0.98), actual metrics (density=1.06-1.65)
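
The matching described in 3.5.1 to 3.5.3 might look roughly like the following. It is a sketch under assumptions: each table_res_list entry is taken to be a dict with a "cell_boxes" list of (x0, y0, x1, y1) boxes, and the helper names union_bbox, iou, and match_cell_boxes are hypothetical rather than the functions in pp_structure_enhanced.py.

```python
from typing import Dict, List, Optional, Sequence, Tuple

BBox = Tuple[float, float, float, float]  # assumed layout: (x0, y0, x1, y1)


def union_bbox(boxes: Sequence[BBox]) -> BBox:
    """Smallest box that encloses every cell box."""
    return (
        min(b[0] for b in boxes),
        min(b[1] for b in boxes),
        max(b[2] for b in boxes),
        max(b[3] for b in boxes),
    )


def iou(a: BBox, b: BBox) -> float:
    """Intersection over union of two axis-aligned boxes."""
    iw = min(a[2], b[2]) - max(a[0], b[0])
    ih = min(a[3], b[3]) - max(a[1], b[1])
    if iw <= 0 or ih <= 0:
        return 0.0
    inter = iw * ih
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def match_cell_boxes(
    layout_bbox: BBox,
    table_res_list: List[Dict],
    min_overlap: float = 0.10,  # threshold from task 3.5.3
) -> Tuple[Optional[int], float]:
    """Return (index of best-matching table_res entry, overlap score).

    The index is None when no entry exceeds min_overlap, so the caller
    never falls back to "first available".
    """
    best_idx, best_score = None, 0.0
    for idx, res in enumerate(table_res_list):
        boxes = res.get("cell_boxes") or []
        if not boxes:
            continue
        score = iou(union_bbox(boxes), layout_bbox)
        if score > best_score:
            best_idx, best_score = idx, score
    if best_score > min_overlap:
        return best_idx, best_score
    return None, best_score
```

Returning the score alongside the index makes it easy to log match quality, as 3.5.3 asks.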

Phase 4: Testing

  • 4.1 Test with edit.pdf (sample with over-detection)
  • 4.2 Verify Table 3 (51 cells) - now correctly matched with density=1.65 (within threshold)
  • 4.3 Verify Tables 1, 2, 4 remain as tables
  • 4.4 Compare PDF output quality before/after
  • 4.5 Regression test on other documents

Phase 5: cell_boxes Quality Check (NEW - 2025-12-07)

Problem: PP-Structure's cell_boxes do not always form a proper grid. In some tables 18-23% of cell pairs overlap, which produces messy, overlapping borders in the PDF.

Solution: Added cell overlap quality check in _draw_table_with_cell_boxes():

  • 5.1 Count overlapping cell pairs in cell_boxes
  • 5.2 Calculate overlap ratio (overlapping pairs / total pairs)
  • 5.3 If overlap ratio > 10%, skip cell_boxes rendering and use ReportLab Table fallback
  • 5.4 Text inside table regions filtered out to prevent duplicate rendering

Test Results (task_id: 5e04bd00-a7e4-4776-8964-0a56eaf608d8):

  • Table pp3_0_3 (13 cells): 10/78 pairs (12.8%) overlap → ReportLab fallback
  • Table pp3_0_6 (29 cells): 94/406 pairs (23.2%) overlap → ReportLab fallback
  • Table pp3_0_7 (12 cells): No overlap issue → Grid-based line drawing
  • Table pp3_0_16 (51 cells): 233/1275 pairs (18.3%) overlap → ReportLab fallback
  • 26 text regions inside tables filtered out to prevent duplicate rendering
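
The pair counting in 5.1 to 5.3 can be sketched as follows. The function names are hypothetical and the boxes are assumed to be (x0, y0, x1, y1) tuples; treating cells that merely share an edge as non-overlapping is also an assumption of this sketch.

```python
from itertools import combinations
from typing import Sequence, Tuple

BBox = Tuple[float, float, float, float]  # assumed layout: (x0, y0, x1, y1)


def boxes_overlap(a: BBox, b: BBox) -> bool:
    """True when the two boxes share interior area (touching edges do not count)."""
    return (
        min(a[2], b[2]) - max(a[0], b[0]) > 0
        and min(a[3], b[3]) - max(a[1], b[1]) > 0
    )


def overlap_stats(cell_boxes: Sequence[BBox]) -> Tuple[int, int, float]:
    """Return (overlapping pairs, total pairs, overlap ratio)."""
    n = len(cell_boxes)
    total_pairs = n * (n - 1) // 2
    if total_pairs == 0:
        return 0, 0, 0.0
    overlapping = sum(1 for a, b in combinations(cell_boxes, 2) if boxes_overlap(a, b))
    return overlapping, total_pairs, overlapping / total_pairs


def should_use_fallback(cell_boxes: Sequence[BBox], max_ratio: float = 0.10) -> bool:
    """Task 5.3: skip cell_boxes rendering when more than 10% of pairs overlap."""
    return overlap_stats(cell_boxes)[2] > max_ratio
```

The total pair count is n(n-1)/2, which matches the figures in the test results: 13 cells give 78 pairs, 29 give 406, and 51 give 1275.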

Phase 6: Fix Double Rendering of Text Inside Tables (2025-12-07)

Problem: Text inside table regions was rendered twice:

  1. Via layout/HTML table rendering
  2. Via raw OCR text_regions (because regions_to_avoid excluded tables)

Root Cause: In pdf_generator_service.py:1162-1169:

regions_to_avoid = [img for img in images_metadata if img.get('type') != 'table']

This filter deliberately left tables out of regions_to_avoid, so raw OCR text inside table bboxes was drawn on top of the rendered table.

Solution:

  • 6.1 Include tables in regions_to_avoid to filter text inside table bboxes
  • 6.2 Test PDF output with fix applied
  • 6.3 Verify no blank areas where tables should have content
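
A hedged sketch of the change in 6.1: tables stay in regions_to_avoid, and text regions that fall mostly inside an avoided bbox are dropped. The "bbox" key, the 50% containment threshold, and the helper names are assumptions for illustration. The actual filtering logic in pdf_generator_service.py is not shown in this document.

```python
from typing import Dict, List, Sequence


def build_regions_to_avoid(images_metadata: List[Dict]) -> List[Dict]:
    # Before the fix: [img for img in images_metadata if img.get('type') != 'table']
    # After the fix (6.1): tables are kept, so text inside them is filtered too.
    return list(images_metadata)


def inside_fraction(region_bbox: Sequence[float], avoid_bbox: Sequence[float]) -> float:
    """Fraction of region_bbox's area that lies inside avoid_bbox."""
    iw = min(region_bbox[2], avoid_bbox[2]) - max(region_bbox[0], avoid_bbox[0])
    ih = min(region_bbox[3], avoid_bbox[3]) - max(region_bbox[1], avoid_bbox[1])
    area = (region_bbox[2] - region_bbox[0]) * (region_bbox[3] - region_bbox[1])
    if iw <= 0 or ih <= 0 or area <= 0:
        return 0.0
    return (iw * ih) / area


def filter_text_regions(
    text_regions: List[Dict],
    regions_to_avoid: List[Dict],
    min_inside: float = 0.5,  # hypothetical threshold, not from the source
) -> List[Dict]:
    """Keep only text regions that are not mostly covered by an avoided region."""
    kept = []
    for region in text_regions:
        covered = any(
            inside_fraction(region["bbox"], avoid["bbox"]) >= min_inside
            for avoid in regions_to_avoid
        )
        if not covered:
            kept.append(region)
    return kept
```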

Test Results (task_id: 2d788fca-c824-492b-95cb-35f2fedf438d):

  • PDF size reduced 18% (59,793 → 48,772 bytes)
  • Text content reduced 66% (14,184 → 4,829 chars) - duplicate text eliminated
  • Before: "PRODUCT DESCRIPTION" appeared twice, table values duplicated
  • After: Content appears only once, clean layout
  • Table content preserved correctly via HTML table rendering

Phase 7: Smart Table Rendering Based on cell_boxes Quality (2025-12-07)

Problem: The Phase 6 fix left much of the content missing. All tables were excluded from raw text rendering, but tables with poor cell_boxes quality were rendered via the ReportLab Table fallback, which may not preserve text accurately.

Solution: Smart rendering based on cell_boxes quality:

  • Good quality cell_boxes (≤10% overlap) → Filter text, render via cell_boxes
  • Bad quality cell_boxes (>10% overlap) → Keep raw OCR text, draw table border only

Implementation:

  • 7.1 Add _check_cell_boxes_quality() to assess cell overlap ratio
  • 7.2 Add _draw_table_border_only() for border-only rendering
  • 7.3 Modify smart filtering in _generate_pdf_from_data():
    • Good quality tables → add to regions_to_avoid
    • Bad quality tables → mark with _use_border_only=True
  • 7.4 Add element_id to table_element in convert_unified_document_to_ocr_data() (was missing, causing _use_border_only flag mismatch)
  • 7.5 Modify draw_table_region() to check _use_border_only flag
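
The decision in 7.3 can be sketched as a single pass over the tables. This is an illustrative approximation, not the code in _generate_pdf_from_data(): plan_table_rendering and _overlap_ratio are hypothetical names, tables are assumed to be dicts with a "cell_boxes" list of (x0, y0, x1, y1) boxes, and treating a table without cell_boxes as good quality is an assumption.

```python
from itertools import combinations
from typing import Dict, List, Sequence


def _overlap_ratio(cell_boxes: Sequence[Sequence[float]]) -> float:
    """Share of cell pairs whose boxes share interior area."""
    n = len(cell_boxes)
    total_pairs = n * (n - 1) // 2
    if total_pairs == 0:
        return 0.0
    overlapping = sum(
        1
        for a, b in combinations(cell_boxes, 2)
        if min(a[2], b[2]) > max(a[0], b[0]) and min(a[3], b[3]) > max(a[1], b[1])
    )
    return overlapping / total_pairs


def plan_table_rendering(tables: List[Dict], max_overlap_ratio: float = 0.10) -> List[Dict]:
    """Split tables by cell_boxes quality.

    Good tables (<= 10% overlap) are returned so they can be added to
    regions_to_avoid. Bad tables are flagged with _use_border_only=True
    so their raw OCR text is kept and only a border is drawn.
    """
    regions_to_avoid = []
    for table in tables:
        ratio = _overlap_ratio(table.get("cell_boxes") or [])
        if ratio <= max_overlap_ratio:
            regions_to_avoid.append(table)
        else:
            table["_use_border_only"] = True
    return regions_to_avoid
```

Because the flag is stored on the table element itself, the drawing step has to find the same element later, which is why the missing element_id in 7.4 mattered.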

Test Results (task_id: 82c7269f-aff0-493b-adac-5a87248cd949, scan.pdf):

  • Tables pp3_0_3 and pp3_0_4 identified as bad quality → border-only rendering
  • Raw OCR text preserved and rendered at original positions
  • PDF output: 62,998 bytes with all text content visible
  • Logs confirm: [TABLE] pp3_0_3: Drew border only (bad cell_boxes quality)