egg/OCR

egg 940a406dce chore: backup before code cleanup

Backup commit before executing remove-unused-code proposal.
This includes all pending changes and new features.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

2025-12-11 11:55:39 +08:00

document-processing Specification

Purpose

TBD - created by archiving change dual-track-document-processing. Update Purpose after archive.

Requirements

Requirement: Dual-track Processing

The system SHALL support two distinct processing tracks for documents: OCR track for scanned/image documents and Direct extraction track for editable PDFs.

Scenario: Process scanned PDF through OCR track

WHEN a scanned PDF is uploaded
THEN the system SHALL detect it requires OCR
AND route it through PaddleOCR PP-StructureV3 pipeline
AND return results in UnifiedDocument format

Scenario: Process editable PDF through direct extraction

WHEN an editable PDF with extractable text is uploaded
THEN the system SHALL detect it can be directly extracted
AND route it through PyMuPDF extraction pipeline
AND return results in UnifiedDocument format without OCR

Scenario: Auto-detect processing track

WHEN a document is uploaded without explicit track specification
THEN the system SHALL analyze the document type and content
AND automatically select the optimal processing track
AND include the selected track in processing metadata

Requirement: Document Type Detection

The system SHALL provide intelligent document type detection to determine the optimal processing track.

Scenario: Detect editable PDF

WHEN analyzing a PDF document
THEN the system SHALL check for extractable text content
AND return a confidence score for editability
AND recommend "direct" track if text coverage > 90%

Scenario: Detect scanned document

WHEN analyzing an image or scanned PDF
THEN the system SHALL identify lack of extractable text
AND recommend "ocr" track for processing
AND configure appropriate OCR models

Scenario: Detect Office documents

WHEN analyzing .docx, .xlsx, .pptx files
THEN the system SHALL identify Office format
AND route to the OCR track in the initial implementation
AND preserve option for future direct Office extraction
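
The three detection scenarios above can be sketched as one recommendation function. This is a minimal illustration, not the project's detector: `TrackRecommendation`, `recommend_track`, the extension sets, and the way confidence is derived are all hypothetical, and `text_coverage` is assumed to be a 0.0-1.0 fraction computed elsewhere (for example, the share of pages with extractable text). Only the "direct"/"ocr" track names and the 90% threshold come from the scenarios.

```python
from dataclasses import dataclass
from pathlib import Path

# Extension sets are illustrative assumptions, not taken from the spec.
OFFICE_EXTENSIONS = {".docx", ".xlsx", ".pptx"}
IMAGE_EXTENSIONS = {".png", ".jpg", ".jpeg", ".tif", ".tiff", ".bmp"}
DIRECT_COVERAGE_THRESHOLD = 0.90  # "text coverage > 90%" from the scenario


@dataclass(frozen=True)
class TrackRecommendation:
    track: str         # "direct" or "ocr"
    confidence: float  # placeholder score in the range 0.0-1.0
    reason: str


def recommend_track(filename: str, text_coverage: float = 0.0) -> TrackRecommendation:
    """Recommend a processing track from the file type and text coverage."""
    ext = Path(filename).suffix.lower()
    if ext in OFFICE_EXTENSIONS:
        # Office files go through OCR in the initial implementation.
        return TrackRecommendation("ocr", 1.0, "office format")
    if ext in IMAGE_EXTENSIONS:
        return TrackRecommendation("ocr", 1.0, "image input has no extractable text")
    if ext == ".pdf":
        if text_coverage > DIRECT_COVERAGE_THRESHOLD:
            return TrackRecommendation("direct", text_coverage, "editable PDF")
        return TrackRecommendation("ocr", 1.0 - text_coverage, "scanned or mostly-image PDF")
    raise ValueError(f"unsupported file type: {filename!r}")
```

Keeping the Office branch separate from the image branch leaves room for a future direct Office extraction track without touching the PDF logic.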

Requirement: Unified Document Model

The system SHALL use a standardized UnifiedDocument model as the common output format for both processing tracks.

Scenario: Generate UnifiedDocument from OCR

WHEN OCR processing completes
THEN the system SHALL convert PP-StructureV3 results to UnifiedDocument
AND preserve all element types, coordinates, and confidence scores
AND maintain reading order and hierarchical structure

Scenario: Generate UnifiedDocument from direct extraction

WHEN direct extraction completes
THEN the system SHALL convert PyMuPDF results to UnifiedDocument
AND preserve text styling, fonts, and exact positioning
AND extract tables with cell boundaries and content

Scenario: Consistent output regardless of track

WHEN processing completes through either track
THEN the output SHALL conform to UnifiedDocument schema
AND include processing_track metadata field
AND support identical downstream operations (PDF generation, translation)
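
A possible shape for the shared model is sketched below. The spec names only `UnifiedDocument` and the `processing_track` field; the `DocumentElement` class and every other field or method name here are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class DocumentElement:
    index: int                                 # position in reading order
    element_type: str                          # e.g. "text", "table", "figure"
    bbox: tuple[float, float, float, float]    # (x0, y0, x1, y1)
    text: str = ""
    confidence: Optional[float] = None         # typically set by the OCR track only


@dataclass
class UnifiedDocument:
    processing_track: str                      # "ocr" or "direct"
    elements: list[DocumentElement] = field(default_factory=list)

    def __post_init__(self) -> None:
        if self.processing_track not in ("ocr", "direct"):
            raise ValueError(f"unknown track: {self.processing_track!r}")

    def plain_text(self) -> str:
        """Join element text in reading order, the same way for both tracks."""
        ordered = sorted(self.elements, key=lambda e: e.index)
        return "\n".join(e.text for e in ordered if e.text)
```

Because downstream code only reads `elements` and `processing_track`, a consumer such as `plain_text()` behaves the same whichever track produced the document.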

Requirement: Enhanced OCR with Full PP-StructureV3

The system SHALL utilize the full capabilities of PP-StructureV3, extracting all 23 element types from parsing_res_list, with proper handling of visual elements and table coordinates.

Scenario: Extract comprehensive document structure

WHEN processing through OCR track
THEN the system SHALL use page_result.json['parsing_res_list']
AND extract all element types including headers, lists, tables, figures
AND preserve layout_bbox coordinates for each element

Scenario: Maintain reading order

WHEN extracting elements from PP-StructureV3
THEN the system SHALL preserve the reading order from parsing_res_list
AND assign sequential indices to elements
AND support reordering for complex layouts
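
The reading-order scenario can be illustrated with plain dictionaries standing in for `parsing_res_list` entries. Only `parsing_res_list` and `layout_bbox` appear in this spec; the `label` and `content` keys and both function names are assumptions, and the real PP-StructureV3 output may use different keys.

```python
from typing import Any, Callable


def index_elements(parsing_res_list: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Keep the list order as reading order and attach sequential indices."""
    elements = []
    for index, entry in enumerate(parsing_res_list):
        elements.append({
            "index": index,
            "type": entry.get("label", "unknown"),   # assumed key
            "bbox": tuple(entry["layout_bbox"]),
            "text": entry.get("content", ""),        # assumed key
        })
    return elements


def reorder(elements: list[dict[str, Any]],
            key: Callable[[dict[str, Any]], Any]) -> list[dict[str, Any]]:
    """Re-sort elements with a caller-supplied key and renumber them."""
    reordered = sorted(elements, key=key)
    return [{**element, "index": i} for i, element in enumerate(reordered)]
```

For a complex layout, a caller might pass a key such as `lambda e: (e["bbox"][1], e["bbox"][0])` to sort top-to-bottom, then left-to-right; that particular rule is only an example, not a stated requirement.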

Scenario: Extract table structure

WHEN PP-StructureV3 identifies a table
THEN the system SHALL extract cell content and boundaries
AND validate cell_boxes coordinates against page boundaries
AND apply fallback detection for invalid coordinates
AND preserve table HTML for structure
AND extract plain text for translation
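
The coordinate check in this scenario might look like the sketch below. It assumes each entry of `cell_boxes` is an `(x0, y0, x1, y1)` box in the same coordinate space as the page size; the function names and the one-unit tolerance are hypothetical.

```python
Box = tuple[float, float, float, float]


def split_cell_boxes(cell_boxes: list[Box], page_width: float, page_height: float,
                     tolerance: float = 1.0) -> tuple[list[Box], list[Box]]:
    """Separate boxes that lie within the page from degenerate or out-of-page ones."""
    valid: list[Box] = []
    invalid: list[Box] = []
    for box in cell_boxes:
        x0, y0, x1, y1 = box
        inside = (
            -tolerance <= x0 < x1 <= page_width + tolerance
            and -tolerance <= y0 < y1 <= page_height + tolerance
        )
        (valid if inside else invalid).append(tuple(box))
    return valid, invalid


def needs_fallback(cell_boxes: list[Box], page_width: float, page_height: float) -> bool:
    """Signal that fallback cell detection should run if any box is invalid."""
    _, invalid = split_cell_boxes(cell_boxes, page_width, page_height)
    return bool(invalid)
```

Treating a single bad box as grounds for fallback is a conservative choice made for this example; a real implementation could instead tolerate a small share of invalid boxes.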

Scenario: Extract visual elements with paths

WHEN PP-StructureV3 identifies visual elements (IMAGE, FIGURE, CHART, DIAGRAM)
THEN the system SHALL preserve saved_path for each element
AND include image dimensions and format
AND enable image embedding in output PDF

Requirement: Structure-Preserving Translation Foundation

The system SHALL maintain document structure and layout information to support future translation features.

Scenario: Preserve coordinates for translation

WHEN processing any document
THEN the system SHALL retain bbox coordinates for all text elements
AND calculate space requirements for text expansion/contraction
AND maintain element relationships and groupings

Scenario: Extract translatable content

WHEN processing tables and lists
THEN the system SHALL extract plain text content
AND maintain mapping to original structure
AND preserve formatting markers for reconstruction

Scenario: Support layout adjustment

WHEN preparing for translation
THEN the system SHALL identify flexible vs fixed layout regions
AND calculate maximum text expansion ratios
AND preserve non-translatable elements (logos, signatures)
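
One rough way to estimate a maximum text expansion ratio is to compare how many characters a bounding box can hold with how many it currently holds. The sketch below is an assumption-laden approximation: the function name, the average character width factor (0.5 of the font size), and the line height factor (1.2) are all illustrative defaults, not values from this spec.

```python
import math


def max_expansion_ratio(text: str, bbox: tuple[float, float, float, float],
                        font_size: float,
                        char_width_factor: float = 0.5,
                        line_height_factor: float = 1.2) -> float:
    """Estimate how much longer translated text may be and still fit in bbox."""
    if not text:
        raise ValueError("text must not be empty")
    if font_size <= 0:
        raise ValueError("font_size must be positive")
    x0, y0, x1, y1 = bbox
    # Approximate capacity: characters per line times number of lines.
    chars_per_line = max(0, math.floor((x1 - x0) / (font_size * char_width_factor)))
    lines = max(0, math.floor((y1 - y0) / (font_size * line_height_factor)))
    return (chars_per_line * lines) / len(text)
```

A ratio below 1.0 would indicate that even the source text does not fit under these assumptions, which could mark the region as fixed rather than flexible.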

Requirement: Generate UnifiedDocument from direct extraction

The system SHALL convert PyMuPDF results to UnifiedDocument with correct table cell merging.

Scenario: Extract tables with cell merging

WHEN direct extraction encounters a table
THEN the system SHALL use PyMuPDF find_tables() API
AND extract cell content with correct rowspan/colspan
AND preserve merged cell boundaries
AND skip placeholder cells covered by merges
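
The merge handling can be sketched independently of PyMuPDF. The example assumes the table has already been read into a grid in which each cell is an `(x0, y0, x1, y1)` box and positions covered by a merge are `None`; how that grid is obtained from `find_tables()` is not shown here, and `cells_with_spans` and its helpers are hypothetical names.

```python
from typing import Iterable, Optional

BBox = tuple[float, float, float, float]


def _unique_edges(values: Iterable[float], tol: float) -> list[float]:
    """Collapse near-identical coordinates into one sorted list of edges."""
    edges: list[float] = []
    for value in sorted(values):
        if not edges or value - edges[-1] > tol:
            edges.append(value)
    return edges


def _span(start: float, end: float, edges: list[float], tol: float) -> int:
    """Count the grid edges that fall inside [start, end)."""
    return max(1, sum(1 for e in edges if start - tol <= e < end - tol))


def cells_with_spans(grid: list[list[Optional[BBox]]], tol: float = 1.0) -> list[dict]:
    """Return real cells with rowspan/colspan, skipping None placeholders."""
    boxes = [cell for row in grid for cell in row if cell is not None]
    col_edges = _unique_edges((b[0] for b in boxes), tol)
    row_edges = _unique_edges((b[1] for b in boxes), tol)
    cells = []
    for r, row in enumerate(grid):
        for c, box in enumerate(row):
            if box is None:
                continue  # placeholder covered by a merged neighbour
            x0, y0, x1, y1 = box
            cells.append({
                "row": r, "col": c, "bbox": box,
                "colspan": _span(x0, x1, col_edges, tol),
                "rowspan": _span(y0, y1, row_edges, tol),
            })
    return cells
```

The span of a cell is inferred from how many column or row start edges its box crosses, so the merged cell keeps its full boundary while the covered positions produce no output.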

Scenario: Filter decoration images

WHEN extracting images from PDF
THEN the system SHALL filter images smaller than minimum area threshold
AND exclude covering/redaction images
AND preserve meaningful content images
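
A minimal sketch of the image filter follows. The spec does not give the minimum area threshold or define how covering/redaction images are recognised, so `min_area=2500.0` and the `is_covering` flag (assumed to be set by an upstream check) are placeholders, as are the class and function names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ImageInfo:
    name: str
    bbox: tuple[float, float, float, float]  # (x0, y0, x1, y1)
    is_covering: bool = False                # hypothetical flag from an upstream check


def bbox_area(bbox: tuple[float, float, float, float]) -> float:
    """Area of a box; degenerate boxes count as zero."""
    x0, y0, x1, y1 = bbox
    return max(0.0, x1 - x0) * max(0.0, y1 - y0)


def filter_images(images: list[ImageInfo], min_area: float = 2500.0) -> list[ImageInfo]:
    """Drop small decoration images and covering/redaction images."""
    return [
        image for image in images
        if not image.is_covering and bbox_area(image.bbox) >= min_area
    ]
```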

Scenario: Preserve text styling with image handling

WHEN direct extraction completes
THEN the system SHALL convert PyMuPDF results to UnifiedDocument
AND preserve text styling, fonts, and exact positioning
AND extract tables with cell boundaries, content, and merge info
AND include only meaningful images in output
