OCR/openspec/changes/fix-ocr-track-table-rendering/specs/ocr-processing/spec.md
egg 940a406dce chore: backup before code cleanup
Backup commit before executing remove-unused-code proposal.
This includes all pending changes and new features.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-11 11:55:39 +08:00


MODIFIED Requirements

Requirement: Enhanced OCR with Full PP-StructureV3

The system SHALL utilize the full capabilities of PP-StructureV3, extracting all element types from parsing_res_list, with proper handling of visual elements and table coordinates.

Scenario: Extract comprehensive document structure

  • WHEN a document is processed through the OCR track
  • THEN the system SHALL use page_result.json['parsing_res_list']
  • AND extract all element types including headers, lists, tables, figures
  • AND preserve layout_bbox coordinates for each element

Scenario: Maintain reading order

  • WHEN extracting elements from PP-StructureV3
  • THEN the system SHALL preserve the reading order from parsing_res_list
  • AND assign sequential indices to elements
  • AND support reordering for complex layouts
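
The two scenarios above can be sketched together as a single pass over parsing_res_list that keeps source order and assigns sequential indices. This is a minimal illustration, not the project's implementation: the function name extract_elements and the "type" and "content" keys are hypothetical; only parsing_res_list and layout_bbox come from this spec.

```python
from typing import Any


def extract_elements(page_json: dict[str, Any]) -> list[dict[str, Any]]:
    """Flatten parsing_res_list into indexed elements, keeping source order."""
    elements = []
    for index, item in enumerate(page_json.get("parsing_res_list", [])):
        elements.append(
            {
                "index": index,                       # sequential reading-order index
                "type": item.get("type", "unknown"),  # hypothetical key name
                "bbox": list(item.get("layout_bbox", [])),
                "content": item.get("content", ""),   # hypothetical key name
            }
        )
    return elements
```

Because the index is stored on each element, a later reordering step for complex layouts can sort or renumber elements without losing the order PP-StructureV3 originally reported.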

Scenario: Extract table structure with HTML content

  • WHEN PP-StructureV3 identifies a table
  • THEN the system SHALL extract cell content and boundaries from table_res_list
  • AND extract pred_html for table HTML content
  • AND validate cell_boxes coordinates against page boundaries
  • AND apply fallback detection for invalid coordinates
  • AND preserve the table HTML so that the table structure is retained
  • AND extract plain text for translation
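
Two of the steps above lend themselves to a short sketch: checking cell_boxes against the page boundaries and deriving plain text from pred_html. The code below is an assumption-laden illustration using only the Python standard library. It assumes boxes are [x0, y0, x1, y1] in page pixels, and the helper names are hypothetical. The fallback detection for invalid coordinates is not shown because the spec does not define it.

```python
from html.parser import HTMLParser


def split_boxes_by_validity(cell_boxes, page_width, page_height):
    """Partition boxes into (valid, invalid) by page bounds and ordering."""
    valid, invalid = [], []
    for box in cell_boxes:
        x0, y0, x1, y1 = box  # assumed layout: [x0, y0, x1, y1]
        inside = 0 <= x0 < x1 <= page_width and 0 <= y0 < y1 <= page_height
        (valid if inside else invalid).append(box)
    return valid, invalid


class _CellTextParser(HTMLParser):
    """Collects the text of each <td>/<th>, grouped by <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th"):
            if not self.rows:  # tolerate a cell that appears before any <tr>
                self.rows.append([])
            self.rows[-1].append("")
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.rows[-1][-1] += data


def table_html_to_text(pred_html: str) -> str:
    """Return one line per table row, with cells joined by ' | '."""
    parser = _CellTextParser()
    parser.feed(pred_html)
    parser.close()
    lines = [" | ".join(cell.strip() for cell in row) for row in parser.rows]
    return "\n".join(line for line in lines if line.strip(" |"))
```

Keeping the HTML untouched and deriving the text separately means the translation step can work on plain strings while the renderer still has the original structure.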

Scenario: Table matching via bbox overlap

  • GIVEN a table element from parsing_res_list without direct HTML content
  • WHEN matching against table_res_list using bbox overlap
  • AND overlap ratio exceeds 10%
  • THEN the system SHALL extract both cell_box_list and pred_html from the matched table_res
  • AND set element['html'] to the extracted pred_html
  • AND set element['extracted_text'] from the HTML content
  • AND log the successful extraction
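
The matching scenario above could look roughly like the following. The spec does not define how the overlap ratio is computed, so this sketch assumes intersection area divided by the element's own area. The "bbox" keys, the "cell_boxes" key, and the function names are hypothetical, and the tag-stripping regex is a deliberately crude stand-in for real HTML text extraction.

```python
import logging
import re

logger = logging.getLogger(__name__)

MIN_OVERLAP_RATIO = 0.10  # "overlap ratio exceeds 10%"


def overlap_ratio(a, b):
    """Intersection area of a and b divided by the area of a (assumed definition)."""
    inter_w = min(a[2], b[2]) - max(a[0], b[0])
    inter_h = min(a[3], b[3]) - max(a[1], b[1])
    if inter_w <= 0 or inter_h <= 0:
        return 0.0
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    return (inter_w * inter_h) / area_a if area_a > 0 else 0.0


def attach_table_result(element, table_res_list):
    """Copy cell boxes and HTML from the best-overlapping table_res, if any."""
    best, best_ratio = None, 0.0
    for table_res in table_res_list:
        ratio = overlap_ratio(element["bbox"], table_res["bbox"])  # hypothetical keys
        if ratio > best_ratio:
            best, best_ratio = table_res, ratio
    if best is None or best_ratio <= MIN_OVERLAP_RATIO:
        return False
    html = best.get("pred_html", "")
    element["cell_boxes"] = best.get("cell_box_list", [])
    element["html"] = html
    # Crude text extraction: replace tags with spaces, then collapse whitespace.
    element["extracted_text"] = " ".join(re.sub(r"<[^>]+>", " ", html).split())
    logger.info("Matched table via bbox overlap (ratio=%.2f)", best_ratio)
    return True
```

Picking the candidate with the highest ratio, rather than the first one above the threshold, avoids attaching the wrong table when two tables sit close together on a page.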

Scenario: Extract visual elements with paths

  • WHEN PP-StructureV3 identifies visual elements (IMAGE, FIGURE, CHART, DIAGRAM)
  • THEN the system SHALL preserve saved_path for each element
  • AND include image dimensions and format
  • AND enable image embedding in output PDF

ADDED Requirements

Requirement: OCR Track PDF Coordinate System

The system SHALL generate PDF output for OCR Track using the OCR coordinate system dimensions to ensure accurate text sizing and positioning.

Scenario: PDF page size matches OCR coordinate system

  • GIVEN an OCR track processing task
  • WHEN generating the output PDF
  • THEN the system SHALL use the OCR image dimensions as PDF page size
  • AND set scale factors to 1.0 (no scaling)
  • AND preserve original bbox coordinates without transformation

Scenario: Text font size calculation without scaling

  • GIVEN a text element with bbox height H in OCR coordinates
  • WHEN rendering text in PDF
  • THEN the system SHALL calculate font size based directly on bbox height
  • AND NOT apply additional scaling factors
  • AND ensure readable text output
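
As a rough illustration of the font size scenario above, the size can be derived from the bbox height alone, with no page scale factor involved. The 0.75 glyph-to-box ratio and the 4.0 minimum size are assumptions chosen for the example, not values taken from this spec, and font_size_from_bbox is a hypothetical name.

```python
def font_size_from_bbox(bbox, height_ratio=0.75, min_size=4.0):
    """Derive a font size from the bbox height H, with no page scaling applied.

    height_ratio and min_size are illustrative assumptions.
    """
    height = bbox[3] - bbox[1]  # assumed layout: [x0, y0, x1, y1]
    if height <= 0:
        raise ValueError("bbox height must be positive")
    return max(min_size, height * height_ratio)
```

The lower bound is one way to read "ensure readable text output": very thin boxes still produce a size that a PDF viewer can render legibly.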

Scenario: Direct Track PDF maintains original size

  • GIVEN a direct track processing task
  • WHEN generating the output PDF
  • THEN the system SHALL use the original PDF page dimensions
  • AND preserve existing coordinate transformation logic
  • AND NOT be affected by OCR Track coordinate changes
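
The page-size rules for the two tracks can be summarized in a small sketch. PageGeometry and page_geometry are hypothetical names. For the Direct Track the scale factors are left as None here, because the existing coordinate transformation logic that the spec says to preserve is not described in this document.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PageGeometry:
    width: float
    height: float
    scale_x: Optional[float]  # None: left to the existing Direct Track logic
    scale_y: Optional[float]


def page_geometry(track, ocr_size=None, source_pdf_size=None):
    """Choose the PDF page size (and scale factors, where defined) per track."""
    if track == "ocr":
        if ocr_size is None:
            raise ValueError("OCR track needs the OCR image dimensions")
        width, height = ocr_size
        # Page size equals the OCR image size, so bboxes need no transformation.
        return PageGeometry(width, height, 1.0, 1.0)
    if track == "direct":
        if source_pdf_size is None:
            raise ValueError("Direct track needs the original PDF page size")
        width, height = source_pdf_size
        return PageGeometry(width, height, None, None)
    raise ValueError(f"unknown track: {track!r}")
```

Branching on the track in one place keeps the OCR Track change from leaking into Direct Track output, which is the isolation the last scenario asks for.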

Requirement: Table Cell Quality Assessment

The system SHALL assess table cell_boxes quality with appropriate thresholds to avoid filtering valid tables.

Scenario: Cell density threshold

  • GIVEN a table with cell_boxes from PP-StructureV3
  • WHEN cell density exceeds 5.0 cells per 10,000 px²
  • THEN the system SHALL flag the table as potentially over-detected
  • AND log the specific density value for debugging

Scenario: Average cell area threshold

  • GIVEN a table with cell_boxes
  • WHEN average cell area is less than 2,000 px²
  • THEN the system SHALL flag the table as potentially over-detected
  • AND log the specific area value for debugging

Scenario: Valid tables with normal metrics

  • GIVEN a table with cell density ≤ 5.0 cells per 10,000 px² and average cell area ≥ 2,000 px²
  • WHEN quality assessment is applied
  • THEN the table SHALL be considered valid
  • AND cell_boxes SHALL be used for rendering
  • AND table content SHALL be displayed in PDF output
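
The three quality scenarios above can be combined into one check, sketched below. The spec does not say which area the density is measured against, so this example assumes the table's own bbox area. The function names and the shape of the returned dictionary are hypothetical.

```python
import logging

logger = logging.getLogger(__name__)

MAX_CELL_DENSITY = 5.0       # cells per 10,000 px²
MIN_AVG_CELL_AREA = 2000.0   # px²


def box_area(box):
    """Area of an [x0, y0, x1, y1] box; degenerate boxes count as zero."""
    return max(0, box[2] - box[0]) * max(0, box[3] - box[1])


def assess_cell_boxes(cell_boxes, table_bbox):
    """Flag a table as potentially over-detected based on density and cell area."""
    if not cell_boxes:
        raise ValueError("cell_boxes must not be empty")
    table_area = box_area(table_bbox)  # assumption: density is relative to the table bbox
    if table_area <= 0:
        raise ValueError("table_bbox must have a positive area")

    density = len(cell_boxes) * 10_000 / table_area
    avg_area = sum(box_area(b) for b in cell_boxes) / len(cell_boxes)

    reasons = []
    if density > MAX_CELL_DENSITY:
        reasons.append("density")
        logger.debug("Cell density %.2f exceeds %.1f", density, MAX_CELL_DENSITY)
    if avg_area < MIN_AVG_CELL_AREA:
        reasons.append("avg_area")
        logger.debug("Average cell area %.1f is below %.1f", avg_area, MIN_AVG_CELL_AREA)

    return {"valid": not reasons, "density": density, "avg_area": avg_area, "reasons": reasons}
```

In this sketch a value exactly at a threshold is not flagged, since the flagging scenarios only trigger when density exceeds 5.0 or the average area falls below 2,000 px².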