feat: simplify layout model selection and archive proposals

Changes: - Replace PP-Structure 7-slider parameter UI with simple 3-option layout model selector - Add layout model mapping: chinese (PP-DocLayout-S), default (PubLayNet), cdla - Add LayoutModelSelector component and zh-TW translations - Fix "default" model behavior with sentinel value for PubLayNet - Add gap filling service for OCR track coverage improvement - Add PP-Structure debug utilities - Archive completed/incomplete proposals: - add-ocr-track-gap-filling (complete) - fix-ocr-track-table-rendering (incomplete) - simplify-ppstructure-model-selection (22/25 tasks) - Add new layout model tests, archive old PP-Structure param tests - Update OpenSpec ocr-processing spec with layout model requirements 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-27 13:27:00 +08:00
parent c65df754cf
commit 59206a6ab8
35 changed files with 3621 additions and 658 deletions
--- a/openspec/changes/fix-ocr-track-table-rendering/proposal.md
+++ b/openspec/changes/fix-ocr-track-table-rendering/proposal.md
@@ -1,108 +0,0 @@
-# Fix OCR Track Table Rendering
-
-## Summary
-
-OCR track PDF generation produces tables with incorrect format and layout. Tables appear without proper structure - cell content is misaligned and the visual format differs significantly from the original document. Image placement is correct, but table rendering is broken.
-
-## Problem Statement
-
-When generating PDF from OCR track results (via `scan.pdf` processed by PP-StructureV3), the output tables have:
-1. **Wrong cell alignment** - content not positioned in proper cells
-2. **Missing table structure** - rows/columns don't match original document layout
-3. **Incorrect content distribution** - all content seems to flow linearly instead of maintaining grid structure
-
-Reference: `backend/storage/results/af7c9ee8-60a0-4291-9f22-ef98d27eed52/`
- Original: `af7c9ee8-60a0-4291-9f22-ef98d27eed52_scan_page_1.png`
- Generated: `scan_layout.pdf`
- Result JSON: `scan_result.json` - Tables have correct `{rows, cols, cells}` structure
-
-## Root Cause Analysis
-
-### Issue 1: Table Content Not Converted to TableData Object
-
-In `_json_to_document_element` (pdf_generator_service.py:1952):
-```python
-element = DocumentElement(
-    ...
-    content=elem_dict.get('content', ''),  # Raw dict, not TableData
-    ...
-)
-```
-
-Table elements have `content` as a dict `{rows: 5, cols: 4, cells: [...]}` but it's not converted to a `TableData` object.
-
-### Issue 2: OCR Track HTML Conversion Fails
-
-In `convert_unified_document_to_ocr_data` (pdf_generator_service.py:464-467):
-```python
-elif isinstance(element.content, dict):
-    html_content = element.content.get('html', str(element.content))
-```
-
-Since there's no 'html' key in the cells-based dict, it falls back to `str(element.content)` = `"{'rows': 5, 'cols': 4, ...}"` - invalid HTML.
-
-### Issue 3: Different Table Rendering Paths
-
- **Direct track** uses `_draw_table_element_direct` which properly handles dict with cells via `_build_rows_from_cells_dict`
- **OCR track** uses `draw_table_region` which expects HTML strings and fails with dict content
-
-## Proposed Solution
-
-### Option A: Convert dict to TableData during JSON loading (Recommended)
-
-In `_json_to_document_element`, when element type is TABLE and content is a dict with cells, convert it to a `TableData` object:
-
-```python
-# For TABLE elements, convert dict to TableData
-if elem_type == ElementType.TABLE and isinstance(content, dict) and 'cells' in content:
-    content = self._dict_to_table_data(content)
-```
-
-This ensures `element.content.to_html()` works correctly in `convert_unified_document_to_ocr_data`.
-
-### Option B: Fix conversion in convert_unified_document_to_ocr_data
-
-Handle dict with cells properly by converting to HTML:
-
-```python
-elif isinstance(element.content, dict):
-    if 'cells' in element.content:
-        # Convert cells-based dict to HTML
-        html_content = self._cells_dict_to_html(element.content)
-    elif 'html' in element.content:
-        html_content = element.content['html']
-    else:
-        html_content = str(element.content)
-```
-
-## Impact on Hybrid Mode
-
-Hybrid mode uses Direct track rendering (`_generate_direct_track_pdf`) which already handles dict content properly via `_build_rows_from_cells_dict`. The proposed fixes should not affect hybrid mode negatively.
-
-However, testing should verify:
-1. Hybrid mode continues to work with combined Direct + OCR elements
-2. Table rendering quality is consistent across all tracks
-
-## Success Criteria
-
-1. OCR track tables render with correct structure matching original document
-2. Cell content positioned in proper grid locations
-3. Table borders/grid lines visible
-4. No regression in Direct track or Hybrid mode table rendering
-5. All test files (scan.pdf, img1.png, img2.png, img3.png) produce correct output
-
-## Files to Modify
-
-1. `backend/app/services/pdf_generator_service.py`
-   - `_json_to_document_element`: Convert table dict to TableData
-   - `convert_unified_document_to_ocr_data`: Improve dict handling (if Option B)
-
-2. `backend/app/models/unified_document.py` (optional)
-   - Add `TableData.from_dict()` class method for cleaner conversion
-
-## Testing Plan
-
-1. Test scan.pdf with OCR track - verify table structure matches original
-2. Test img1.png, img2.png, img3.png with OCR track
-3. Test PDF files with Direct track - verify no regression
-4. Test Hybrid mode with files that trigger OCR fallback
--- a/openspec/changes/fix-ocr-track-table-rendering/specs/pdf-generation/spec.md
+++ b/openspec/changes/fix-ocr-track-table-rendering/specs/pdf-generation/spec.md
@@ -1,52 +0,0 @@
-# PDF Generation - OCR Track Table Rendering Fix
-
-## MODIFIED Requirements
-
-### Requirement: OCR Track Table Content Conversion
-
-The PDF generator MUST properly convert table content from JSON dict format to renderable structure when processing OCR track results.
-
-#### Scenario: Table dict with cells array converts to proper HTML
-
-Given an OCR track JSON with table element containing rows, cols, and cells array
-When the PDF generator processes this element
-Then the table content MUST be converted to a TableData object
-And TableData.to_html() MUST produce valid HTML with proper tr/td structure
-And the generated PDF table MUST have cells positioned in correct grid locations
-
-#### Scenario: Table with rowspan/colspan renders correctly
-
-Given a table element with cells having rowspan > 1 or colspan > 1
-When the PDF generator renders the table
-Then merged cells MUST span the correct number of rows/columns
-And content MUST appear in the merged cell position
-
-### Requirement: Table Visual Fidelity
-
-The PDF generator MUST render OCR track tables with visual structure matching the original document.
-
-#### Scenario: Table renders with grid lines
-
-Given an OCR track table element
-When rendered to PDF
-Then the table MUST have visible grid lines/borders
-And cell boundaries MUST be clearly defined
-
-#### Scenario: Table text alignment preserved
-
-Given an OCR track table with cell content
-When rendered to PDF
-Then text MUST be positioned within the correct cell boundaries
-And text MUST NOT overflow into adjacent cells
-
-### Requirement: Backward Compatibility with Hybrid Mode
-
-The table rendering fix MUST NOT break hybrid mode processing.
-
-#### Scenario: Hybrid mode tables render correctly
-
-Given a document processed with hybrid mode combining Direct and OCR tracks
-When PDF is generated
-Then Direct track tables MUST render with existing quality
-And OCR track tables MUST render with improved quality
-And no regression in table positioning or content
--- a/openspec/changes/fix-ocr-track-table-rendering/tasks.md
+++ b/openspec/changes/fix-ocr-track-table-rendering/tasks.md
@@ -1,55 +0,0 @@
-# Implementation Tasks
-
-## Phase 1: Core Fix - Table Content Conversion
-
-### 1.1 Add TableData.from_dict() class method
- [ ] In `unified_document.py`, add `from_dict()` method to `TableData` class
- [ ] Handle conversion of cells list (list of dicts) to `TableCell` objects
- [ ] Preserve rows, cols, headers, caption fields
-
-### 1.2 Fix _json_to_document_element for TABLE elements
- [ ] In `pdf_generator_service.py`, modify `_json_to_document_element`
- [ ] When `elem_type == ElementType.TABLE` and content is dict with 'cells', convert to `TableData`
- [ ] Use `TableData.from_dict()` for clean conversion
-
-### 1.3 Verify TableData.to_html() generates correct HTML
- [ ] Test that `to_html()` produces parseable HTML with proper row/cell structure
- [ ] Verify colspan/rowspan attributes are correctly generated
- [ ] Ensure empty cells are properly handled
-
-## Phase 2: OCR Track Rendering Consistency
-
-### 2.1 Review convert_unified_document_to_ocr_data
- [ ] Verify TableData objects are properly converted to HTML
- [ ] Add fallback handling for dict content with 'cells' key
- [ ] Log warning if content cannot be converted to HTML
-
-### 2.2 Review draw_table_region
- [ ] Verify HTMLTableParser correctly parses generated HTML
- [ ] Check that ReportLab Table is positioned at correct bbox
- [ ] Verify font and style application
-
-## Phase 3: Testing and Verification
-
-### 3.1 Test OCR Track
- [ ] Test scan.pdf - verify tables have correct structure
- [ ] Test img1.png, img2.png, img3.png
- [ ] Compare generated PDF with original documents
-
-### 3.2 Test Direct Track (Regression)
- [ ] Test PDF files with Direct track
- [ ] Verify table rendering unchanged
-
-### 3.3 Test Hybrid Mode
- [ ] Test files that trigger hybrid processing
- [ ] Verify mixed Direct + OCR elements render correctly
-
-## Phase 4: Code Quality
-
-### 4.1 Add logging
- [ ] Add debug logging for table content type detection
- [ ] Log conversion steps for troubleshooting
-
-### 4.2 Error handling
- [ ] Handle malformed cell data gracefully
- [ ] Log warnings for unexpected content formats