egg/OCR

egg 940a406dce chore: backup before code cleanup

Backup commit before executing remove-unused-code proposal.
This includes all pending changes and new features.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

2025-12-11 11:55:39 +08:00

document-processing Specification Delta

ADDED Requirements

Requirement: Table Cell Merging Detection

The system SHALL correctly detect and preserve merged cells (rowspan/colspan) when extracting tables from PDF documents.

Scenario: Detect merged cells in Direct Track

WHEN extracting tables from an editable PDF using Direct Track
THEN the system SHALL use PyMuPDF find_tables() API
AND correctly identify cells with rowspan > 1 or colspan > 1
AND preserve merge information in UnifiedDocument table structure
AND skip placeholder cells that are covered by merged cells
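The merge detection above can be sketched without any PDF library once the cell rectangles are known. This is a minimal sketch, assuming each cell arrives as an (x0, y0, x1, y1) tuple such as the bounding boxes a table finder reports, and that placeholder positions covered by a merge appear as None. The function names and the 1.0-point tolerance are hypothetical, chosen for illustration only.

```python
def grid_lines(values, tol=1.0):
    """Collapse nearly equal coordinates into a sorted list of grid lines."""
    lines = []
    for v in sorted(values):
        if not lines or v - lines[-1] > tol:
            lines.append(v)
    return lines


def line_index(lines, value, tol=1.0):
    """Return the index of the grid line that matches value within tol."""
    for i, line in enumerate(lines):
        if abs(line - value) <= tol:
            return i
    raise ValueError(f"no grid line near {value}")


def cells_with_spans(bboxes, tol=1.0):
    """Map cell rectangles to grid positions with rowspan and colspan."""
    # Assumption: None marks a placeholder covered by a merged cell.
    boxes = [b for b in bboxes if b is not None]
    xs = grid_lines([b[0] for b in boxes] + [b[2] for b in boxes], tol)
    ys = grid_lines([b[1] for b in boxes] + [b[3] for b in boxes], tol)
    cells = []
    for x0, y0, x1, y1 in boxes:
        row, col = line_index(ys, y0, tol), line_index(xs, x0, tol)
        cells.append({
            "row": row,
            "col": col,
            # A span is the number of grid intervals the rectangle crosses.
            "rowspan": line_index(ys, y1, tol) - row,
            "colspan": line_index(xs, x1, tol) - col,
            "bbox": (x0, y0, x1, y1),
        })
    return cells
```

A cell whose rectangle crosses two column intervals comes back with colspan 2, and the None placeholder next to it is dropped rather than emitted as an empty cell.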

Scenario: Handle complex table structures

WHEN processing a table with mixed merged and regular cells (e.g., edit3.pdf with 83 cells including 121 merges)
THEN the system SHALL NOT split merged cells into individual cells
AND the output cell count SHALL match the actual visual cell count
AND the rendered PDF SHALL display correct merged cell boundaries

Requirement: Visual Element Path Preservation

The system SHALL preserve image paths for all visual element types during OCR conversion.

Scenario: Preserve CHART element paths

WHEN converting PP-StructureV3 output containing CHART elements
THEN the system SHALL treat CHART as a visual element type
AND extract saved_path from the element data
AND include saved_path in the UnifiedDocument content field

Scenario: Support all visual element types

WHEN processing visual elements of types IMAGE, FIGURE, CHART, DIAGRAM, LOGO, or STAMP
THEN the system SHALL extract saved_path or img_path for each element
AND preserve path, width, height, and format in content dictionary
AND enable downstream PDF generation to embed these images

Scenario: Fallback path resolution

WHEN a visual element has multiple path fields (saved_path, img_path)
THEN the system SHALL prefer saved_path over img_path
AND fall back to img_path if saved_path is missing
AND log a warning if both paths are missing
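The fallback order above could look like the following sketch. It assumes the element is a plain dict carrying the saved_path and img_path keys named in the scenario. The function name resolve_image_path and the "type" key used in the log message are hypothetical.

```python
import logging

logger = logging.getLogger(__name__)


def resolve_image_path(element):
    """Return saved_path if set, else img_path, else None after a warning."""
    for key in ("saved_path", "img_path"):  # order encodes the preference
        path = element.get(key)
        if path:
            return path
    logger.warning(
        "visual element of type %r has neither saved_path nor img_path",
        element.get("type"),  # hypothetical key, used only for the message
    )
    return None
```

Empty strings are treated the same as missing keys, so an element with saved_path set to "" still falls through to img_path.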

Requirement: Cell Box Coordinate Validation

The system SHALL validate cell box coordinates from PP-StructureV3 and handle out-of-bounds cases.

Scenario: Detect out-of-bounds coordinates

WHEN processing cell_boxes from PP-StructureV3
THEN the system SHALL validate each coordinate against page boundaries (0, 0, page_width, page_height)
AND log tables with coordinates exceeding page bounds
AND mark affected cells for fallback processing
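The bounds check above can be illustrated with a short sketch. It assumes cell_boxes is a list of (x0, y0, x1, y1) tuples in the same coordinate space as the page. The function name is hypothetical, and treating inverted boxes (x1 < x0 or y1 < y0) as invalid is an added assumption, not something the scenario states.

```python
def out_of_bounds_cells(cell_boxes, page_width, page_height):
    """Return indices of boxes that do not fit inside the page rectangle."""
    bad = []
    for i, (x0, y0, x1, y1) in enumerate(cell_boxes):
        # Chained comparison also rejects inverted boxes (assumption).
        inside = 0 <= x0 <= x1 <= page_width and 0 <= y0 <= y1 <= page_height
        if not inside:
            bad.append(i)
    return bad
```

The returned indices can then be used to mark the affected cells for fallback processing.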

Scenario: Apply CV line detection fallback

WHEN cell_boxes coordinates are invalid (out of bounds)
THEN the system SHALL apply OpenCV line detection as fallback
AND reconstruct table structure from detected lines
AND include fallback_used flag in table metadata

Scenario: Coordinate normalization

WHEN coordinates are within page bounds but slightly outside table bbox
THEN the system SHALL clamp coordinates to table boundaries
AND preserve relative cell positions
AND ensure no cells overlap after normalization
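One way to read the normalization above is a per-coordinate clamp, sketched below. It assumes cell boxes and the table bbox are (x0, y0, x1, y1) tuples. The function name clamp_to_table is hypothetical.

```python
def clamp_to_table(cell_boxes, table_bbox):
    """Clamp every cell coordinate into the table bounding box."""
    tx0, ty0, tx1, ty1 = table_bbox
    clamped = []
    for x0, y0, x1, y1 in cell_boxes:
        clamped.append((
            min(max(x0, tx0), tx1),
            min(max(y0, ty0), ty1),
            min(max(x1, tx0), tx1),
            min(max(y1, ty0), ty1),
        ))
    return clamped
```

Because clamping never reverses the order of two coordinates, cells that did not overlap before still do not overlap afterwards, although a cell lying entirely outside the table collapses to zero width or height and may need separate handling.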

Requirement: Decoration Image Filtering

The system SHALL filter out minimal decoration images that do not contribute meaningful content.

Scenario: Filter tiny images by area

WHEN extracting images from a document
THEN the system SHALL calculate image area (width × height)
AND filter out images with area < 200 square pixels
AND log filtered image count for debugging

Scenario: Configurable filtering threshold

WHEN processing documents with intentionally small images
THEN the system SHALL support configuration of minimum image area threshold
AND default to 200 square pixels if not specified
AND allow threshold = 0 to disable filtering
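The area filter and its configurable threshold can be sketched as follows. This assumes each image is a dict with numeric "width" and "height" keys in pixels. The function name and the min_area parameter name are hypothetical, while the default of 200 comes from the scenario above.

```python
def filter_decoration_images(images, min_area=200):
    """Return (kept_images, filtered_count) using a minimum pixel area."""
    # Images with area < min_area are dropped; min_area=0 keeps everything
    # because no area is below zero.
    kept = [img for img in images if img["width"] * img["height"] >= min_area]
    return kept, len(images) - len(kept)
```

The second return value is the filtered image count, which the caller can log for debugging.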

Requirement: Covering Image Removal

The system SHALL remove covering/redaction images from the final output.

Scenario: Detect covering rectangles

WHEN preprocessing a PDF page
THEN the system SHALL detect black/white rectangles covering text regions
AND identify covering images by high IoU (> 0.8) with underlying content
AND mark covering images for exclusion
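The IoU test above can be made concrete with a short sketch. It assumes image and content regions are axis-aligned (x0, y0, x1, y1) rectangles, and that an image counts as covering when its IoU with any single content region exceeds the threshold. The function names are hypothetical, and the fill-color check (black or white) is left out.

```python
def iou(a, b):
    """Intersection over union of two (x0, y0, x1, y1) rectangles."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def is_covering(image_bbox, content_bboxes, threshold=0.8):
    """True if the image overlaps some content region with IoU > threshold."""
    return any(iou(image_bbox, c) > threshold for c in content_bboxes)
```

IoU is strict about size: a large rectangle hiding a small text span scores low, so a real implementation might need a different overlap measure for that case.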

Scenario: Exclude covering images from rendering

WHEN generating output PDF
THEN the system SHALL exclude images marked as covering
AND preserve the text content that was covered
AND include covering_images_removed count in metadata

Scenario: Handle both black and white covering

WHEN detecting covering rectangles
THEN the system SHALL detect both black fill (redaction style)
AND white fill (whiteout style)
AND low-contrast rectangles intended to hide content

MODIFIED Requirements

Requirement: Enhanced OCR with Full PP-StructureV3

The system SHALL utilize the full capabilities of PP-StructureV3, extracting all 23 element types from parsing_res_list, with proper handling of visual elements and table coordinates.

Scenario: Extract comprehensive document structure

WHEN processing through OCR track
THEN the system SHALL use page_result.json['parsing_res_list']
AND extract all element types including headers, lists, tables, figures
AND preserve layout_bbox coordinates for each element

Scenario: Maintain reading order

WHEN extracting elements from PP-StructureV3
THEN the system SHALL preserve the reading order from parsing_res_list
AND assign sequential indices to elements
AND support reordering for complex layouts

Scenario: Extract table structure

WHEN PP-StructureV3 identifies a table
THEN the system SHALL extract cell content and boundaries
AND validate cell_boxes coordinates against page boundaries
AND apply fallback detection for invalid coordinates
AND preserve table HTML for structure
AND extract plain text for translation

Scenario: Extract visual elements with paths

WHEN PP-StructureV3 identifies visual elements (IMAGE, FIGURE, CHART, DIAGRAM)
THEN the system SHALL preserve saved_path for each element
AND include image dimensions and format
AND enable image embedding in output PDF

ADDED Requirements

Requirement: Generate UnifiedDocument from direct extraction

The system SHALL convert PyMuPDF results to UnifiedDocument with correct table cell merging.

Scenario: Extract tables with cell merging

WHEN direct extraction encounters a table
THEN the system SHALL use PyMuPDF find_tables() API
AND extract cell content with correct rowspan/colspan
AND preserve merged cell boundaries
AND skip placeholder cells covered by merges

Scenario: Filter decoration images

WHEN extracting images from PDF
THEN the system SHALL filter images smaller than minimum area threshold
AND exclude covering/redaction images
AND preserve meaningful content images

Scenario: Preserve text styling with image handling

WHEN direct extraction completes
THEN the system SHALL convert PyMuPDF results to UnifiedDocument
AND preserve text styling, fonts, and exact positioning
AND extract tables with cell boundaries, content, and merge info
AND include only meaningful images in output
