chore: backup before code cleanup

Backup commit before executing remove-unused-code proposal.
This includes all pending changes and new features.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-11 11:55:39 +08:00
parent eff9b0bcd5
commit 940a406dce
58 changed files with 8226 additions and 175 deletions
--- a/openspec/changes/archive/2025-12-08-fix-ocr-cell-overdetection/proposal.md
+++ b/openspec/changes/archive/2025-12-08-fix-ocr-cell-overdetection/proposal.md
@@ -0,0 +1,73 @@
+# Change: Fix OCR Track Cell Over-Detection
+
+## Why
+
+PP-StructureV3 is over-detecting table cells in OCR Track processing, incorrectly identifying regular text content (key-value pairs, bullet points, form labels) as table cells. This results in:
+- 4 tables detected instead of 1 on sample document
+- 105 cells detected instead of 12 (expected)
+- Broken text layout and incorrect font sizing in PDF output
+- Poor document reconstruction quality compared to Direct Track
+
+Evidence from task comparison:
+- Direct Track (`cfd996d9`): 1 table, 12 cells - correct representation
+- OCR Track (`62de32e0`): 4 tables, 105 cells - severe over-detection
+
+## What Changes
+
+- Add post-detection cell validation pipeline to filter false-positive cells
+- Implement table structure validation using geometric patterns
+- Add text density analysis to distinguish tables from key-value text
+- Apply stricter confidence thresholds for cell detection
+- Add cell clustering algorithm to identify isolated false-positive cells
+
+## Root Cause Analysis
+
+PP-StructureV3's cell detection models over-detect cells in structured text regions. Analysis of page 1:
+
+| Table | Cells | Density (cells/10000px²) | Avg Cell Area | Status |
+|-------|-------|--------------------------|---------------|--------|
+| 1 | 13 | 0.87 | 11,550 px² | Normal |
+| 2 | 12 | 0.44 | 22,754 px² | Normal |
+| **3** | **51** | **6.22** | **1,609 px²** | **Over-detected** |
+| 4 | 29 | 0.94 | 10,629 px² | Normal |
+
+**Table 3 anomalies:**
+- Cell density 7-14x higher than normal tables
+- Average cell area only 7-14% of normal
+- 150px height with 51 cells = ~3px per cell (too small to hold a line of readable text)
+
+## Proposed Solution: Post-Detection Cell Validation
+
+Apply metric-based filtering after PP-Structure detection:
+
+### Filter 1: Cell Density Check
+- **Threshold**: Reject tables with density > 3.0 cells/10000px²
+- **Rationale**: Normal tables have 0.4-1.0 density; over-detected have 6+
+
+### Filter 2: Minimum Cell Area
+- **Threshold**: Reject tables with average cell area < 3,000 px²
+- **Rationale**: Normal cells are 10,000-25,000 px²; over-detected are ~1,600 px²
+
+### Filter 3: Cell Height Validation
+- **Threshold**: Reject if (table_height / cell_count) < 10px
+- **Rationale**: Each cell row needs minimum height for readable text
+
+### Filter 4: Reclassification
+- Tables failing validation are reclassified as TEXT elements
+- Original text content is preserved
+- Reading order is recalculated
+
+## Impact
+
+- Affected specs: `ocr-processing`
+- Affected code:
+  - `backend/app/services/ocr_service.py` - Add cell validation pipeline
+  - `backend/app/services/processing_orchestrator.py` - Integrate validation
+  - New file: `backend/app/services/cell_validation_engine.py`
+
+## Success Criteria
+
+1. OCR Track cell count matches Direct Track within 10% tolerance
+2. No false-positive tables detected from non-tabular content
+3. Table structure maintains logical row/column alignment
+4. PDF output quality comparable to Direct Track for documents with tables
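The three metric filters in the proposal can be sketched as a single check. This is a minimal illustration, not the contents of `cell_validation_engine.py`: the `Thresholds` class, the `over_detection_reason` function, and the `(x0, y0, x1, y1)` box layout are assumptions made for the example; only the threshold values come from the proposal.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple

# Assumed box layout: (x0, y0, x1, y1) in pixels.
Box = Tuple[float, float, float, float]


@dataclass(frozen=True)
class Thresholds:
    """Default values taken from the proposal; the class itself is hypothetical."""
    max_cell_density: float = 3.0       # cells per 10,000 px^2
    min_avg_cell_area: float = 3000.0   # px^2
    min_cell_height: float = 10.0       # px, computed as table_height / cell_count


def box_area(box: Box) -> float:
    x0, y0, x1, y1 = box
    return max(0.0, x1 - x0) * max(0.0, y1 - y0)


def over_detection_reason(
    table_bbox: Box,
    cell_boxes: Sequence[Box],
    thresholds: Thresholds = Thresholds(),
) -> Optional[str]:
    """Return the name of the first failed filter, or None if the table passes."""
    cell_count = len(cell_boxes)
    table_area = box_area(table_bbox)
    if cell_count == 0 or table_area == 0:
        return None  # nothing to validate

    # Filter 1: cell density in cells per 10,000 px^2 of table area.
    density = cell_count / table_area * 10_000
    if density > thresholds.max_cell_density:
        return "density"

    # Filter 2: average cell area.
    avg_area = sum(box_area(c) for c in cell_boxes) / cell_count
    if avg_area < thresholds.min_avg_cell_area:
        return "avg_cell_area"

    # Filter 3: table height divided by cell count.
    height_per_cell = (table_bbox[3] - table_bbox[1]) / cell_count
    if height_per_cell < thresholds.min_cell_height:
        return "cell_height"

    return None
```

A table that returns a non-`None` reason would then be handed to the reclassification step (Filter 4). The filters are checked in the order listed in the proposal, so a table failing several of them reports only the first.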
--- a/openspec/changes/archive/2025-12-08-fix-ocr-cell-overdetection/specs/ocr-processing/spec.md
+++ b/openspec/changes/archive/2025-12-08-fix-ocr-cell-overdetection/specs/ocr-processing/spec.md
@@ -0,0 +1,64 @@
+## ADDED Requirements
+
+### Requirement: Cell Over-Detection Filtering
+
+The system SHALL validate PP-StructureV3 table detections using metric-based heuristics to filter over-detected cells.
+
+#### Scenario: Cell density exceeds threshold
+- **GIVEN** a table detected by PP-StructureV3 with cell_boxes
+- **WHEN** cell density exceeds 3.0 cells per 10,000 px²
+- **THEN** the system SHALL flag the table as over-detected
+- **AND** reclassify the table as a TEXT element
+
+#### Scenario: Average cell area below threshold
+- **GIVEN** a table detected by PP-StructureV3
+- **WHEN** average cell area is less than 3,000 px²
+- **THEN** the system SHALL flag the table as over-detected
+- **AND** reclassify the table as a TEXT element
+
+#### Scenario: Cell height too small
+- **GIVEN** a table with height H and N cells
+- **WHEN** (H / N) is less than 10 pixels
+- **THEN** the system SHALL flag the table as over-detected
+- **AND** reclassify the table as a TEXT element
+
+#### Scenario: Valid tables are preserved
+- **GIVEN** a table with normal metrics (density ≤ 3.0, avg area ≥ 3000 px², height/N ≥ 10 px)
+- **WHEN** validation is applied
+- **THEN** the table SHALL be preserved unchanged
+- **AND** all cell_boxes SHALL be retained
+
+### Requirement: Table-to-Text Reclassification
+
+The system SHALL convert over-detected tables to TEXT elements while preserving content.
+
+#### Scenario: Table content is preserved
+- **GIVEN** a table flagged for reclassification
+- **WHEN** converting to TEXT element
+- **THEN** the system SHALL extract text content from table HTML
+- **AND** preserve the original bounding box
+- **AND** set element type to TEXT
+
+#### Scenario: Reading order is recalculated
+- **GIVEN** tables have been reclassified as TEXT
+- **WHEN** assembling the final page structure
+- **THEN** the system SHALL recalculate reading order
+- **AND** sort elements by y0 then x0 coordinates
+
+### Requirement: Validation Configuration
+
+The system SHALL provide configurable thresholds for cell validation.
+
+#### Scenario: Default thresholds are applied
+- **GIVEN** no custom configuration is provided
+- **WHEN** validating tables
+- **THEN** the system SHALL use default thresholds:
+  - max_cell_density: 3.0 cells/10000px²
+  - min_avg_cell_area: 3000 px²
+  - min_cell_height: 10 px
+
+#### Scenario: Custom thresholds can be configured
+- **GIVEN** custom validation thresholds in configuration
+- **WHEN** validating tables
+- **THEN** the system SHALL use the custom values
+- **AND** apply them consistently to all pages
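The reclassification and reading-order scenarios above could look roughly like the sketch below. The element shape (a dict with `type`, `bbox`, and `html` keys) and the function names are assumptions for illustration; the spec only requires that text is extracted from the table HTML, the bounding box is preserved, the type becomes TEXT, and elements are sorted by y0 then x0.

```python
from html.parser import HTMLParser
from typing import Any, Dict, List, Optional


class _CellTextParser(HTMLParser):
    """Collects the text of each <td>/<th>, grouped by <tr>."""

    def __init__(self) -> None:
        super().__init__()
        self.rows: List[List[str]] = []
        self._cell: Optional[List[str]] = None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th"):
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            text = " ".join("".join(self._cell).split())  # collapse whitespace
            if not self.rows:
                self.rows.append([])  # tolerate cells outside a <tr>
            self.rows[-1].append(text)
            self._cell = None


def table_html_to_text(html: str) -> str:
    """One output line per table row, cell texts joined by spaces."""
    parser = _CellTextParser()
    parser.feed(html)
    parser.close()
    lines = [" ".join(c for c in row if c) for row in parser.rows if any(row)]
    return "\n".join(lines)


def reclassify_as_text(table: Dict[str, Any]) -> Dict[str, Any]:
    """Return a new TEXT element; the input table element is not mutated."""
    return {
        "type": "TEXT",
        "bbox": list(table["bbox"]),  # original bounding box is preserved
        "content": table_html_to_text(table.get("html", "")),
    }


def sort_reading_order(elements: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Sort by y0 (top edge) first, then x0 (left edge)."""
    return sorted(elements, key=lambda e: (e["bbox"][1], e["bbox"][0]))
```

A plain y0-then-x0 sort is what the scenario specifies; multi-column layouts may need something more elaborate, which is outside the scope of this sketch.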
--- a/openspec/changes/archive/2025-12-08-fix-ocr-cell-overdetection/tasks.md
+++ b/openspec/changes/archive/2025-12-08-fix-ocr-cell-overdetection/tasks.md
@@ -0,0 +1,124 @@
+# Tasks: Fix OCR Track Cell Over-Detection
+
+## Root Cause Analysis Update
+
+**Original assumption:** PP-Structure was over-detecting cells.
+
+**Actual root cause:** cell_boxes from `table_res_list` were being assigned to WRONG tables when HTML matching failed. The fallback used "first available" instead of bbox matching, causing:
+- Table A's cell_boxes assigned to Table B
+- False over-detection metrics (density 6.22 vs actual 1.65)
+- Incorrect reclassification as TEXT
+
+## Phase 1: Cell Validation Engine
+
+- [x] 1.1 Create `cell_validation_engine.py` with metric-based validation
+- [x] 1.2 Implement cell density calculation (cells per 10000px²)
+- [x] 1.3 Implement average cell area calculation
+- [x] 1.4 Implement cell height validation (table_height / cell_count)
+- [x] 1.5 Add configurable thresholds with defaults:
+  - max_cell_density: 3.0 cells/10000px²
+  - min_avg_cell_area: 3000 px²
+  - min_cell_height: 10px
+- [ ] 1.6 Unit tests for validation functions
+
+## Phase 2: Table Reclassification
+
+- [x] 2.1 Implement table-to-text reclassification logic
+- [x] 2.2 Preserve original text content from HTML table
+- [x] 2.3 Create TEXT element with proper bbox
+- [x] 2.4 Recalculate reading order after reclassification
+
+## Phase 3: Integration
+
+- [x] 3.1 Integrate validation into OCR service pipeline (after PP-Structure)
+- [x] 3.2 Add validation before cell_boxes processing
+- [x] 3.3 Add debug logging for filtered tables
+- [ ] 3.4 Update processing metadata with filter statistics
+
+## Phase 3.5: cell_boxes Matching Fix (NEW)
+
+- [x] 3.5.1 Fix cell_boxes matching in pp_structure_enhanced.py to use bbox overlap instead of "first available"
+- [x] 3.5.2 Calculate IoU between table_res cell_boxes bounding box and layout element bbox
+- [x] 3.5.3 Match tables with >10% overlap, log match quality
+- [x] 3.5.4 Update validate_cell_boxes to also check table bbox boundaries, not just page boundaries
+
+**Results:**
+- OLD: cell_boxes mismatch caused false over-detection (density=6.22)
+- NEW: correct bbox matching (overlap=0.97-0.98), actual metrics (density=1.06-1.65)
+
+## Phase 4: Testing
+
+- [x] 4.1 Test with edit.pdf (sample with over-detection)
+- [x] 4.2 Verify Table 3 (51 cells) - now correctly matched with density=1.65 (within threshold)
+- [x] 4.3 Verify Tables 1, 2, 4 remain as tables
+- [x] 4.4 Compare PDF output quality before/after
+- [ ] 4.5 Regression test on other documents
+
+## Phase 5: cell_boxes Quality Check (NEW - 2025-12-07)
+
+**Problem:** PP-Structure's cell_boxes don't always form proper grids. Some tables have
+overlapping cells (18-23% of cell pairs overlap), causing messy overlapping borders in PDF.
+
+**Solution:** Added cell overlap quality check in `_draw_table_with_cell_boxes()`:
+
+- [x] 5.1 Count overlapping cell pairs in cell_boxes
+- [x] 5.2 Calculate overlap ratio (overlapping pairs / total pairs)
+- [x] 5.3 If overlap ratio > 10%, skip cell_boxes rendering and use ReportLab Table fallback
+- [x] 5.4 Text inside table regions filtered out to prevent duplicate rendering
+
+**Test Results (task_id: 5e04bd00-a7e4-4776-8964-0a56eaf608d8):**
+- Table pp3_0_3 (13 cells): 10/78 pairs (12.8%) overlap → ReportLab fallback
+- Table pp3_0_6 (29 cells): 94/406 pairs (23.2%) overlap → ReportLab fallback
+- Table pp3_0_7 (12 cells): No overlap issue → Grid-based line drawing
+- Table pp3_0_16 (51 cells): 233/1275 pairs (18.3%) overlap → ReportLab fallback
+- 26 text regions inside tables filtered out to prevent duplicate rendering
+
+## Phase 6: Fix Double Rendering of Text Inside Tables (2025-12-07)
+
+**Problem:** Text inside table regions was rendered twice:
+1. Via layout/HTML table rendering
+2. Via raw OCR text_regions (because `regions_to_avoid` excluded tables)
+
+**Root Cause:** In `pdf_generator_service.py:1162-1169`:
+```python
+regions_to_avoid = [img for img in images_metadata if img.get('type') != 'table']
+```
+This filter intentionally left tables out of `regions_to_avoid`, so raw OCR text inside table bboxes was drawn on top of the rendered table.
+
+**Solution:**
+- [x] 6.1 Include tables in `regions_to_avoid` to filter text inside table bboxes
+- [x] 6.2 Test PDF output with fix applied
+- [x] 6.3 Verify no blank areas where tables should have content
+
+**Test Results (task_id: 2d788fca-c824-492b-95cb-35f2fedf438d):**
+- PDF size reduced 18% (59,793 → 48,772 bytes)
+- Text content reduced 66% (14,184 → 4,829 chars) - duplicate text eliminated
+- Before: "PRODUCT DESCRIPTION" appeared twice, table values duplicated
+- After: Content appears only once, clean layout
+- Table content preserved correctly via HTML table rendering
+
+## Phase 7: Smart Table Rendering Based on cell_boxes Quality (2025-12-07)
+
+**Problem:** The Phase 6 fix left much of the content missing. It filtered raw OCR text out of
+every table region, but tables with bad cell_boxes quality were rendered via the ReportLab Table
+fallback, which might not preserve text accurately.
+
+**Solution:** Smart rendering based on cell_boxes quality:
+- Good quality cell_boxes (≤10% overlap) → Filter text, render via cell_boxes
+- Bad quality cell_boxes (>10% overlap) → Keep raw OCR text, draw table border only
+
+**Implementation:**
+- [x] 7.1 Add `_check_cell_boxes_quality()` to assess cell overlap ratio
+- [x] 7.2 Add `_draw_table_border_only()` for border-only rendering
+- [x] 7.3 Modify smart filtering in `_generate_pdf_from_data()`:
+  - Good quality tables → add to `regions_to_avoid`
+  - Bad quality tables → mark with `_use_border_only=True`
+- [x] 7.4 Add `element_id` to `table_element` in `convert_unified_document_to_ocr_data()`
+  (was missing, causing `_use_border_only` flag mismatch)
+- [x] 7.5 Modify `draw_table_region()` to check `_use_border_only` flag
+
+**Test Results (task_id: 82c7269f-aff0-493b-adac-5a87248cd949, scan.pdf):**
+- Tables pp3_0_3 and pp3_0_4 identified as bad quality → border-only rendering
+- Raw OCR text preserved and rendered at original positions
+- PDF output: 62,998 bytes with all text content visible
+- Logs confirm: `[TABLE] pp3_0_3: Drew border only (bad cell_boxes quality)`
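The overlap-based quality check from Phases 5 and 7 can be sketched as follows. This is an illustration under assumptions, not the body of `_check_cell_boxes_quality()`: the function names, the `(x0, y0, x1, y1)` box layout, and the rule that boxes sharing only an edge do not count as overlapping are all choices made for the example. The 10% threshold and the two rendering paths come from the task notes.

```python
from itertools import combinations
from typing import Sequence, Tuple

# Assumed box layout: (x0, y0, x1, y1) in pixels.
Box = Tuple[float, float, float, float]


def boxes_overlap(a: Box, b: Box) -> bool:
    """True if the boxes share interior area; touching edges do not count."""
    x_overlap = min(a[2], b[2]) - max(a[0], b[0])
    y_overlap = min(a[3], b[3]) - max(a[1], b[1])
    return x_overlap > 0 and y_overlap > 0


def overlap_ratio(cell_boxes: Sequence[Box]) -> float:
    """Overlapping pairs divided by total pairs, where total = n * (n - 1) / 2."""
    n = len(cell_boxes)
    total_pairs = n * (n - 1) // 2
    if total_pairs == 0:
        return 0.0
    overlapping = sum(
        1 for a, b in combinations(cell_boxes, 2) if boxes_overlap(a, b)
    )
    return overlapping / total_pairs


def choose_rendering(cell_boxes: Sequence[Box], max_overlap_ratio: float = 0.10) -> str:
    """Good quality (ratio <= threshold) renders via cell_boxes, otherwise border only."""
    if overlap_ratio(cell_boxes) <= max_overlap_ratio:
        return "cell_boxes"
    return "border_only"
```

The pair count grows quadratically: 13 cells give 78 pairs and 51 cells give 1,275, which matches the totals in the Phase 5 test results. At these table sizes the brute-force pairwise comparison is cheap.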