
document-processing Specification

Purpose

Define a dual-track document-processing pipeline: an OCR track (PaddleOCR PP-StructureV3) for scanned and image documents, and a direct-extraction track (PyMuPDF) for editable PDFs. Both tracks emit a common UnifiedDocument model so that downstream operations (PDF generation, translation) behave identically regardless of how a document was processed.

Requirements

Requirement: Dual-track Processing

The system SHALL support two distinct processing tracks for documents: OCR track for scanned/image documents and Direct extraction track for editable PDFs.

Scenario: Process scanned PDF through OCR track

  • WHEN a scanned PDF is uploaded
  • THEN the system SHALL detect it requires OCR
  • AND route it through PaddleOCR PP-StructureV3 pipeline
  • AND return results in UnifiedDocument format

Scenario: Process editable PDF through direct extraction

  • WHEN an editable PDF with extractable text is uploaded
  • THEN the system SHALL detect it can be directly extracted
  • AND route it through PyMuPDF extraction pipeline
  • AND return results in UnifiedDocument format without OCR

Scenario: Auto-detect processing track

  • WHEN a document is uploaded without explicit track specification
  • THEN the system SHALL analyze the document type and content
  • AND automatically select the optimal processing track
  • AND include the selected track in processing metadata
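The auto-selection heuristic above can be sketched in a few lines. This is illustrative, not the shipped implementation: the function name is hypothetical, the 90% coverage threshold comes from the Document Type Detection requirement below, and the Office shortcut reflects the "route Office to OCR for initial implementation" scenario.

```python
def select_track(text_coverage: float, is_office: bool = False) -> str:
    """Pick a processing track from pre-computed document analysis.

    text_coverage: fraction of pages with extractable text (0.0-1.0).
    is_office: whether the upload is a .docx/.xlsx/.pptx file.
    """
    if is_office:
        return "ocr"  # initial implementation routes Office files via OCR
    # Editable PDFs with high text coverage skip OCR entirely
    return "direct" if text_coverage > 0.90 else "ocr"
```

The selected value would then be recorded in the `processing_track` metadata field required by the Unified Document Model section.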

Requirement: Document Type Detection

The system SHALL provide intelligent document type detection to determine the optimal processing track.

Scenario: Detect editable PDF

  • WHEN analyzing a PDF document
  • THEN the system SHALL check for extractable text content
  • AND return confidence score for editability
  • AND recommend "direct" track if text coverage > 90%
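One way to compute the editability confidence score, assuming per-page character counts have already been gathered (e.g. from PyMuPDF's `page.get_text()` per page). The `min_chars` cutoff is an illustrative assumption for what counts as "meaningful" extractable text.

```python
def editability_coverage(page_char_counts: list, min_chars: int = 32) -> float:
    """Fraction of pages carrying meaningful extractable text.

    page_char_counts: extracted character count per page.
    Returns 0.0 for empty documents.
    """
    if not page_char_counts:
        return 0.0
    covered = sum(1 for n in page_char_counts if n >= min_chars)
    return covered / len(page_char_counts)
```

A result above 0.90 would trigger the "direct" track recommendation described above.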

Scenario: Detect scanned document

  • WHEN analyzing an image or scanned PDF
  • THEN the system SHALL identify lack of extractable text
  • AND recommend "ocr" track for processing
  • AND configure appropriate OCR models

Scenario: Detect Office documents

  • WHEN analyzing .docx, .xlsx, .pptx files
  • THEN the system SHALL identify Office format
  • AND route to OCR track for initial implementation
  • AND preserve option for future direct Office extraction

Requirement: Unified Document Model

The system SHALL use a standardized UnifiedDocument model as the common output format for both processing tracks.

Scenario: Generate UnifiedDocument from OCR

  • WHEN OCR processing completes
  • THEN the system SHALL convert PP-StructureV3 results to UnifiedDocument
  • AND preserve all element types, coordinates, and confidence scores
  • AND maintain reading order and hierarchical structure

Scenario: Generate UnifiedDocument from direct extraction

  • WHEN direct extraction completes
  • THEN the system SHALL convert PyMuPDF results to UnifiedDocument
  • AND preserve text styling, fonts, and exact positioning
  • AND extract tables with cell boundaries and content

Scenario: Consistent output regardless of track

  • WHEN processing completes through either track
  • THEN the output SHALL conform to UnifiedDocument schema
  • AND include processing_track metadata field
  • AND support identical downstream operations (PDF generation, translation)
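A stripped-down sketch of what the shared model might look like. Only `processing_track` is named by this spec; the other field names and types here are illustrative, not the actual UnifiedDocument schema.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Element:
    type: str                           # e.g. "text", "table", "figure", "chart"
    bbox: tuple                         # (x0, y0, x1, y1) page coordinates
    content: str = ""
    confidence: Optional[float] = None  # populated by the OCR track only

@dataclass
class UnifiedDocument:
    processing_track: str               # "ocr" or "direct", per this requirement
    elements: list = field(default_factory=list)  # in reading order
```

Because both tracks produce this shape, PDF generation and translation code never needs to branch on the document's origin.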

Requirement: Enhanced OCR with Full PP-StructureV3

The system SHALL utilize the full capabilities of PP-StructureV3, extracting all 23 element types from parsing_res_list, with proper handling of visual elements and table coordinates.

Scenario: Extract comprehensive document structure

  • WHEN processing through OCR track
  • THEN the system SHALL use page_result.json['parsing_res_list']
  • AND extract all element types including headers, lists, tables, figures
  • AND preserve layout_bbox coordinates for each element
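Conversion could iterate `parsing_res_list` directly, as sketched below. The exact key names (`block_label`, `block_content`, `layout_bbox`) vary across PaddleOCR versions, so treat them as assumptions to verify against the actual `page_result.json` output.

```python
def elements_from_parsing_res(parsing_res_list: list) -> list:
    """Flatten PP-StructureV3 blocks into ordered element dicts,
    assigning sequential indices to preserve reading order."""
    elements = []
    for idx, block in enumerate(parsing_res_list):
        elements.append({
            "index": idx,                               # reading order
            "type": block.get("block_label", "text"),   # header, list, table, ...
            "content": block.get("block_content", ""),
            "bbox": block.get("layout_bbox"),           # may be None
        })
    return elements
```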

Scenario: Maintain reading order

  • WHEN extracting elements from PP-StructureV3
  • THEN the system SHALL preserve the reading order from parsing_res_list
  • AND assign sequential indices to elements
  • AND support reordering for complex layouts

Scenario: Extract table structure

  • WHEN PP-StructureV3 identifies a table
  • THEN the system SHALL extract cell content and boundaries
  • AND validate cell_boxes coordinates against page boundaries
  • AND apply fallback detection for invalid coordinates
  • AND preserve table HTML for structure
  • AND extract plain text for translation
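The coordinate-validation step above reduces to a bounds check per cell box; a minimal sketch, assuming boxes are `(x0, y0, x1, y1)` tuples in page coordinates:

```python
def valid_cell_box(box, page_width: float, page_height: float) -> bool:
    """Reject cell boxes that are degenerate (zero/negative extent)
    or fall outside the page. Callers would apply the fallback
    detection described above when this returns False."""
    x0, y0, x1, y1 = box
    return 0 <= x0 < x1 <= page_width and 0 <= y0 < y1 <= page_height
```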

Scenario: Extract visual elements with paths

  • WHEN PP-StructureV3 identifies visual elements (IMAGE, FIGURE, CHART, DIAGRAM)
  • THEN the system SHALL preserve saved_path for each element
  • AND include image dimensions and format
  • AND enable image embedding in output PDF

Requirement: Structure-Preserving Translation Foundation

The system SHALL maintain document structure and layout information to support future translation features.

Scenario: Preserve coordinates for translation

  • WHEN processing any document
  • THEN the system SHALL retain bbox coordinates for all text elements
  • AND calculate space requirements for text expansion/contraction
  • AND maintain element relationships and groupings
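The space-requirement calculation can be as simple as comparing box width against the measured width of the source text; a sketch, with the function name and single-line assumption being illustrative:

```python
def max_expansion_ratio(bbox, text_width: float) -> float:
    """How much wider translated text may grow before overflowing
    its bounding box (1.0 means no room to expand). Assumes a
    single-line element; multi-line layout needs height too."""
    x0, _, x1, _ = bbox
    if text_width <= 0:
        return float("inf")
    return (x1 - x0) / text_width
```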

Scenario: Extract translatable content

  • WHEN processing tables and lists
  • THEN the system SHALL extract plain text content
  • AND maintain mapping to original structure
  • AND preserve formatting markers for reconstruction

Scenario: Support layout adjustment

  • WHEN preparing for translation
  • THEN the system SHALL identify flexible vs fixed layout regions
  • AND calculate maximum text expansion ratios
  • AND preserve non-translatable elements (logos, signatures)

Requirement: Generate UnifiedDocument from direct extraction

The system SHALL convert PyMuPDF results to UnifiedDocument with correct table cell merging.

Scenario: Extract tables with cell merging

  • WHEN direct extraction encounters a table
  • THEN the system SHALL use PyMuPDF find_tables() API
  • AND extract cell content with correct rowspan/colspan
  • AND preserve merged cell boundaries
  • AND skip placeholder cells covered by merges
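The "skip placeholder cells" rule can be sketched independently of PyMuPDF: given anchor cells annotated with rowspan/colspan (however the extractor derives them), every other grid position covered by a merge should be omitted when emitting the table. The tuple layout here is illustrative.

```python
def merge_covered_positions(cells):
    """cells: iterable of (row, col, rowspan, colspan) anchor cells.
    Returns the grid positions swallowed by merges, i.e. the
    placeholder cells to skip during table output."""
    covered = set()
    for row, col, rowspan, colspan in cells:
        for r in range(row, row + rowspan):
            for c in range(col, col + colspan):
                if (r, c) != (row, col):    # keep the anchor itself
                    covered.add((r, c))
    return covered
```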

Scenario: Filter decoration images

  • WHEN extracting images from PDF
  • THEN the system SHALL filter images smaller than minimum area threshold
  • AND exclude covering/redaction images
  • AND preserve meaningful content images
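A sketch of the decoration filter; the thresholds are illustrative defaults, not the shipped values:

```python
def keep_image(bbox, page_area: float,
               min_area: float = 1024.0, max_page_cover: float = 0.95) -> bool:
    """Filter decoration images: drop tiny spacers/rules (below
    min_area) and near-full-page covering/redaction images
    (above max_page_cover of the page). Everything else is kept
    as meaningful content."""
    x0, y0, x1, y1 = bbox
    area = max(0.0, x1 - x0) * max(0.0, y1 - y0)
    if area < min_area:
        return False                      # spacer / rule / decoration
    if page_area > 0 and area / page_area >= max_page_cover:
        return False                      # likely covering/redaction layer
    return True
```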

Scenario: Preserve text styling with image handling

  • WHEN direct extraction completes
  • THEN the system SHALL convert PyMuPDF results to UnifiedDocument
  • AND preserve text styling, fonts, and exact positioning
  • AND extract tables with cell boundaries, content, and merge info
  • AND include only meaningful images in output

Requirement: Direct Track Background Image Rendering

The system SHALL render Direct Track PDF output using a full-page background image with an invisible text overlay to preserve visual fidelity while maintaining text extractability.

Scenario: Render Direct Track PDF with background image

  • WHEN generating Layout PDF for a Direct Track document
  • THEN the system SHALL render each source PDF page as a full-page background image at 2x resolution
  • AND overlay invisible text elements using PDF Text Rendering Mode 3
  • AND the invisible text SHALL be positioned at original coordinates for accurate selection
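"Text Rendering Mode 3" is the PDF operator-level mechanism behind this: the `Tr` operator with mode 3 neither fills nor strokes glyphs, so the text is invisible yet still selectable and extractable. A raw content-stream fragment illustrating the idea (the font resource name is illustrative; production code would go through a library such as PyMuPDF, which exposes the same mode via a render-mode argument on its text-insertion calls):

```python
def invisible_text_fragment(x: float, y: float, text: str,
                            font: str = "F0", size: float = 10.0) -> str:
    """Build a PDF content-stream snippet placing `text` at (x, y)
    with rendering mode 3: invisible but extractable glyphs.
    Parentheses and backslashes are escaped per PDF string syntax."""
    escaped = (text.replace("\\", "\\\\")
                   .replace("(", "\\(")
                   .replace(")", "\\)"))
    return f"BT /{font} {size} Tf 3 Tr {x} {y} Td ({escaped}) Tj ET"
```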

Scenario: Handle Office documents (PPT, DOC, XLS)

  • WHEN processing an Office document converted to PDF
  • THEN the system SHALL use the same background image + invisible text approach
  • AND preserve all visual elements including vector graphics, gradients, and complex layouts
  • AND the converted PDF in result directory SHALL be used as background source

Scenario: Handle native editable PDFs

  • WHEN processing a native PDF through Direct Track
  • THEN the system SHALL use the source PDF for background rendering
  • AND apply the same invisible text overlay approach
  • AND chart regions SHALL be excluded from the text layer

Requirement: Chart Region Text Exclusion

The system SHALL exclude text elements within chart regions from the invisible text layer to prevent duplicate content and unnecessary translation.

Scenario: Detect chart regions

  • WHEN classifying page elements for Direct Track
  • THEN the system SHALL identify elements with type CHART
  • AND add chart bounding boxes to regions_to_avoid list

Scenario: Exclude chart-internal text from invisible layer

  • WHEN rendering invisible text layer
  • THEN the system SHALL skip text elements whose bounding boxes overlap with chart regions
  • AND chart axis labels, legends, and data labels SHALL NOT be in the invisible text layer
  • AND these texts remain visible in the background image
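The exclusion test above is an axis-aligned bounding-box intersection check against `regions_to_avoid`; a minimal sketch, with the element dict shape being illustrative:

```python
def overlaps(a, b) -> bool:
    """Axis-aligned bbox intersection; boxes are (x0, y0, x1, y1)."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]

def invisible_layer_texts(text_elements, regions_to_avoid):
    """Keep only text elements whose boxes miss every avoided region
    (e.g. CHART bounding boxes). Excluded texts stay visible in the
    rendered background image, just not in the invisible layer."""
    return [t for t in text_elements
            if not any(overlaps(t["bbox"], region) for region in regions_to_avoid)]
```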

Scenario: Chart text not available for translation

  • WHEN extracting text for translation from a Direct Track document
  • THEN chart-internal text SHALL NOT be included in translatable elements
  • AND this is expected behavior as chart labels typically don't require translation