Document Processing Spec Delta

ADDED Requirements

Requirement: Dual-track Processing

The system SHALL support two distinct processing tracks: an OCR track for scanned and image-based documents, and a direct-extraction track for editable PDFs.

Scenario: Process scanned PDF through OCR track

  • WHEN a scanned PDF is uploaded
  • THEN the system SHALL detect it requires OCR
  • AND route it through PaddleOCR PP-StructureV3 pipeline
  • AND return results in UnifiedDocument format

Scenario: Process editable PDF through direct extraction

  • WHEN an editable PDF with extractable text is uploaded
  • THEN the system SHALL detect it can be directly extracted
  • AND route it through PyMuPDF extraction pipeline
  • AND return results in UnifiedDocument format without OCR

Scenario: Auto-detect processing track

  • WHEN a document is uploaded without explicit track specification
  • THEN the system SHALL analyze the document type and content
  • AND automatically select the optimal processing track
  • AND include the selected track in processing metadata
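Illustrative sketch (non-normative): one way the track selection and dispatch described above could look in Python. `ProcessingTrack`, `select_track`, `has_extractable_text`, `run_ocr_track`, and `run_direct_track` are hypothetical names, not part of this spec; the editability helper is sketched under Document Type Detection below.

```python
from enum import Enum
from pathlib import Path
from typing import Optional


class ProcessingTrack(str, Enum):
    OCR = "ocr"        # PaddleOCR PP-StructureV3 pipeline
    DIRECT = "direct"  # PyMuPDF extraction pipeline


def select_track(path: Path, requested: Optional[ProcessingTrack] = None) -> ProcessingTrack:
    """Honour an explicit track request, otherwise auto-detect."""
    if requested is not None:
        return requested
    suffix = path.suffix.lower()
    if suffix in {".docx", ".xlsx", ".pptx"}:
        return ProcessingTrack.OCR          # Office formats use the OCR track for now
    if suffix == ".pdf" and has_extractable_text(path):  # helper sketched below (assumption)
        return ProcessingTrack.DIRECT
    return ProcessingTrack.OCR              # images and scanned PDFs


def process(path: Path, requested: Optional[ProcessingTrack] = None):
    track = select_track(path, requested)
    doc = run_ocr_track(path) if track is ProcessingTrack.OCR else run_direct_track(path)
    doc.metadata["processing_track"] = track.value  # record the selected track
    return doc
```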

Requirement: Document Type Detection

The system SHALL provide intelligent document type detection to determine the optimal processing track.

Scenario: Detect editable PDF

  • WHEN analyzing a PDF document
  • THEN the system SHALL check for extractable text content
  • AND return confidence score for editability
  • AND recommend "direct" track if text coverage > 90%
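Illustrative sketch (non-normative): a possible implementation of the editability check with PyMuPDF. "Text coverage" is interpreted here as the fraction of pages that yield non-trivial extractable text, and the 50-characters-per-page threshold is an assumption, not a requirement.

```python
import fitz  # PyMuPDF


def text_coverage(pdf_path: str, min_chars_per_page: int = 50) -> float:
    """Fraction of pages with non-trivial extractable text (0.0 - 1.0)."""
    with fitz.open(pdf_path) as doc:
        if doc.page_count == 0:
            return 0.0
        pages_with_text = sum(
            1 for page in doc
            if len(page.get_text("text").strip()) >= min_chars_per_page
        )
        return pages_with_text / doc.page_count


def recommend_track(pdf_path: str) -> str:
    coverage = text_coverage(pdf_path)
    # Spec heuristic: direct extraction when text coverage exceeds 90%
    return "direct" if coverage > 0.9 else "ocr"
```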

Scenario: Detect scanned document

  • WHEN analyzing an image or scanned PDF
  • THEN the system SHALL identify lack of extractable text
  • AND recommend "ocr" track for processing
  • AND configure appropriate OCR models

Scenario: Detect Office documents

  • WHEN analyzing .docx, .xlsx, or .pptx files
  • THEN the system SHALL identify Office format
  • AND route to OCR track for initial implementation
  • AND preserve option for future direct Office extraction
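Illustrative sketch (non-normative): how a detection result carrying the recommended track and a confidence score might be reported. `DetectionResult` and the confidence values are hypothetical; `text_coverage` refers to the helper sketched above.

```python
from dataclasses import dataclass
from pathlib import Path

OFFICE_EXTENSIONS = {".docx", ".xlsx", ".pptx"}


@dataclass
class DetectionResult:
    recommended_track: str   # "direct" or "ocr"
    confidence: float        # 0.0 - 1.0 confidence in the recommendation
    reason: str


def detect(path: str) -> DetectionResult:
    suffix = Path(path).suffix.lower()
    if suffix in OFFICE_EXTENSIONS:
        return DetectionResult("ocr", 1.0, "Office formats use the OCR track for now")
    if suffix == ".pdf":
        coverage = text_coverage(path)  # helper sketched above
        if coverage > 0.9:
            return DetectionResult("direct", coverage, "extractable text on most pages")
        return DetectionResult("ocr", 1.0 - coverage, "little or no extractable text")
    return DetectionResult("ocr", 1.0, "image or unknown format")
```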

Requirement: Unified Document Model

The system SHALL use a standardized UnifiedDocument model as the common output format for both processing tracks.

Scenario: Generate UnifiedDocument from OCR

  • WHEN OCR processing completes
  • THEN the system SHALL convert PP-StructureV3 results to UnifiedDocument
  • AND preserve all element types, coordinates, and confidence scores
  • AND maintain reading order and hierarchical structure

Scenario: Generate UnifiedDocument from direct extraction

  • WHEN direct extraction completes
  • THEN the system SHALL convert PyMuPDF results to UnifiedDocument
  • AND preserve text styling, fonts, and exact positioning
  • AND extract tables with cell boundaries and content

Scenario: Consistent output regardless of track

  • WHEN processing completes through either track
  • THEN the output SHALL conform to UnifiedDocument schema
  • AND include processing_track metadata field
  • AND support identical downstream operations (PDF generation, translation)
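Illustrative sketch (non-normative): a minimal shape for the UnifiedDocument schema. Only the processing_track field, bbox coordinates, confidence scores, and reading order are named by this spec; the remaining field names are assumptions.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class DocumentElement:
    element_type: str                            # e.g. "text", "title", "table", "figure"
    text: str                                    # plain-text content (used for translation)
    bbox: tuple[float, float, float, float]      # layout coordinates on the page
    page: int
    reading_order: int
    confidence: Optional[float] = None           # OCR confidence; None for the direct track
    html: Optional[str] = None                   # preserved table HTML, if any


@dataclass
class UnifiedDocument:
    elements: list[DocumentElement]
    processing_track: str                        # "ocr" or "direct"
    metadata: dict = field(default_factory=dict)
```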

Requirement: Enhanced OCR with Full PP-StructureV3

The system SHALL utilize the full capabilities of PP-StructureV3, extracting all 23 element types from parsing_res_list.

Scenario: Extract comprehensive document structure

  • WHEN processing through OCR track
  • THEN the system SHALL use page_result.json['parsing_res_list']
  • AND extract all element types including headers, lists, tables, figures
  • AND preserve layout_bbox coordinates for each element
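Illustrative sketch (non-normative): mapping one page's parsing_res_list into unified elements. The key names inside each entry (`block_label`, `block_content`) vary across PaddleOCR releases and are assumptions to verify against the actual page_result.json; `DocumentElement` is the sketch from the Unified Document Model section.

```python
def elements_from_page_result(page_result: dict, page_number: int) -> list[DocumentElement]:
    """Map one page's parsing_res_list into DocumentElement records."""
    elements = []
    for order, block in enumerate(page_result.get("parsing_res_list", [])):
        elements.append(
            DocumentElement(
                element_type=block.get("block_label", "text"),        # key name assumed
                text=block.get("block_content", ""),                  # key name assumed
                bbox=tuple(block.get("layout_bbox", (0, 0, 0, 0))),   # coordinates named by this spec
                page=page_number,
                reading_order=order,   # preserve parsing_res_list order
            )
        )
    return elements
```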

Scenario: Maintain reading order

  • WHEN extracting elements from PP-StructureV3
  • THEN the system SHALL preserve the reading order from parsing_res_list
  • AND assign sequential indices to elements
  • AND support reordering for complex layouts

Scenario: Extract table structure

  • WHEN PP-StructureV3 identifies a table
  • THEN the system SHALL extract cell content and boundaries
  • AND preserve table HTML for structure
  • AND extract plain text for translation
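Illustrative sketch (non-normative): extracting plain cell text from the preserved table HTML with the standard-library parser, so the HTML keeps the structure while a flat text form is available for translation.

```python
from html.parser import HTMLParser


class _CellTextExtractor(HTMLParser):
    """Collect the text content of each <td>/<th> cell."""

    def __init__(self):
        super().__init__()
        self.cells: list[str] = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag in ("td", "th"):
            self._in_cell = True
            self.cells.append("")

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.cells[-1] += data


def table_cells_to_text(table_html: str) -> list[str]:
    parser = _CellTextExtractor()
    parser.feed(table_html)
    return [cell.strip() for cell in parser.cells]
```

Cell boundaries themselves would come from the table-recognition output; this only covers the plain-text side used for translation.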

Requirement: Structure-Preserving Translation Foundation

The system SHALL maintain document structure and layout information to support future translation features.

Scenario: Preserve coordinates for translation

  • WHEN processing any document
  • THEN the system SHALL retain bbox coordinates for all text elements
  • AND calculate space requirements for text expansion/contraction
  • AND maintain element relationships and groupings
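Illustrative sketch (non-normative): one simple way to estimate expansion room from the retained bbox coordinates, assuming (x0, y0, x1, y1) page coordinates and a top-down, single-column reading order; real layouts would need per-column handling.

```python
def expansion_room(elements: list[DocumentElement], page_height: float) -> dict[int, float]:
    """Vertical space (in page units) each element could grow into."""
    ordered = sorted(elements, key=lambda e: e.reading_order)
    room: dict[int, float] = {}
    for current, following in zip(ordered, ordered[1:]):
        # Gap between the bottom of this element and the top of the next one
        room[current.reading_order] = max(0.0, following.bbox[1] - current.bbox[3])
    if ordered:
        last = ordered[-1]
        room[last.reading_order] = max(0.0, page_height - last.bbox[3])
    return room
```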

Scenario: Extract translatable content

  • WHEN processing tables and lists
  • THEN the system SHALL extract plain text content
  • AND maintain mapping to original structure
  • AND preserve formatting markers for reconstruction

Scenario: Support layout adjustment

  • WHEN preparing for translation
  • THEN the system SHALL identify flexible vs fixed layout regions
  • AND calculate maximum text expansion ratios
  • AND preserve non-translatable elements (logos, signatures)
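Illustrative sketch (non-normative): bounding the allowed text growth per element while leaving fixed regions untouched. The set of non-translatable element types and the ratio formula are assumptions for illustration.

```python
NON_TRANSLATABLE_TYPES = {"figure", "image", "logo", "signature", "formula"}  # assumed labels


def max_expansion_ratio(element: DocumentElement, room_below: float) -> float:
    """Upper bound on how much taller translated text may become (1.0 = no growth)."""
    if element.element_type in NON_TRANSLATABLE_TYPES:
        return 1.0  # fixed region: preserve as-is
    height = element.bbox[3] - element.bbox[1]
    if height <= 0:
        return 1.0
    return (height + room_below) / height
```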