chore: project cleanup and prepare for dual-track processing refactor

- Removed all test files and directories
- Deleted outdated documentation (will be rewritten)
- Cleaned up temporary files, logs, and uploads
- Archived 5 completed OpenSpec proposals
- Created new dual-track-document-processing proposal with complete OpenSpec structure
  - Dual-track architecture: OCR track (PaddleOCR) + Direct track (PyMuPDF)
  - UnifiedDocument model for consistent output
  - Support for structure-preserving translation
- Updated .gitignore to prevent future test/temp files

This is a major cleanup preparing for the complete refactoring of the document processing pipeline.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-18 20:02:31 +08:00
parent 0edc56b03f
commit cd3cbea49d
64 changed files with 3573 additions and 8190 deletions
--- a/openspec/changes/dual-track-document-processing/specs/document-processing/spec.md
+++ b/openspec/changes/dual-track-document-processing/specs/document-processing/spec.md
@@ -0,0 +1,108 @@
+# Document Processing Spec Delta
+
+## ADDED Requirements
+
+### Requirement: Dual-track Processing
+The system SHALL support two distinct processing tracks for documents: OCR track for scanned/image documents and Direct extraction track for editable PDFs.
+
+#### Scenario: Process scanned PDF through OCR track
+- **WHEN** a scanned PDF is uploaded
+- **THEN** the system SHALL detect it requires OCR
+- **AND** route it through PaddleOCR PP-StructureV3 pipeline
+- **AND** return results in UnifiedDocument format
+
+#### Scenario: Process editable PDF through direct extraction
+- **WHEN** an editable PDF with extractable text is uploaded
+- **THEN** the system SHALL detect it can be directly extracted
+- **AND** route it through PyMuPDF extraction pipeline
+- **AND** return results in UnifiedDocument format without OCR
+
+#### Scenario: Auto-detect processing track
+- **WHEN** a document is uploaded without explicit track specification
+- **THEN** the system SHALL analyze the document type and content
+- **AND** automatically select the optimal processing track
+- **AND** include the selected track in processing metadata
+
+### Requirement: Document Type Detection
+The system SHALL provide intelligent document type detection to determine the optimal processing track.
+
+#### Scenario: Detect editable PDF
+- **WHEN** analyzing a PDF document
+- **THEN** the system SHALL check for extractable text content
+- **AND** return confidence score for editability
+- **AND** recommend "direct" track if text coverage > 90%
+
+#### Scenario: Detect scanned document
+- **WHEN** analyzing an image or scanned PDF
+- **THEN** the system SHALL identify lack of extractable text
+- **AND** recommend "ocr" track for processing
+- **AND** configure appropriate OCR models
+
+#### Scenario: Detect Office documents
+- **WHEN** analyzing .docx, .xlsx, .pptx files
+- **THEN** the system SHALL identify Office format
+- **AND** route to OCR track for initial implementation
+- **AND** preserve option for future direct Office extraction
+
+### Requirement: Unified Document Model
+The system SHALL use a standardized UnifiedDocument model as the common output format for both processing tracks.
+
+#### Scenario: Generate UnifiedDocument from OCR
+- **WHEN** OCR processing completes
+- **THEN** the system SHALL convert PP-StructureV3 results to UnifiedDocument
+- **AND** preserve all element types, coordinates, and confidence scores
+- **AND** maintain reading order and hierarchical structure
+
+#### Scenario: Generate UnifiedDocument from direct extraction
+- **WHEN** direct extraction completes
+- **THEN** the system SHALL convert PyMuPDF results to UnifiedDocument
+- **AND** preserve text styling, fonts, and exact positioning
+- **AND** extract tables with cell boundaries and content
+
+#### Scenario: Consistent output regardless of track
+- **WHEN** processing completes through either track
+- **THEN** the output SHALL conform to UnifiedDocument schema
+- **AND** include processing_track metadata field
+- **AND** support identical downstream operations (PDF generation, translation)
+
+### Requirement: Enhanced OCR with Full PP-StructureV3
+The system SHALL utilize the full capabilities of PP-StructureV3, extracting all 23 element types from parsing_res_list.
+
+#### Scenario: Extract comprehensive document structure
+- **WHEN** processing through OCR track
+- **THEN** the system SHALL use page_result.json['parsing_res_list']
+- **AND** extract all element types including headers, lists, tables, figures
+- **AND** preserve layout_bbox coordinates for each element
+
+#### Scenario: Maintain reading order
+- **WHEN** extracting elements from PP-StructureV3
+- **THEN** the system SHALL preserve the reading order from parsing_res_list
+- **AND** assign sequential indices to elements
+- **AND** support reordering for complex layouts
+
+#### Scenario: Extract table structure
+- **WHEN** PP-StructureV3 identifies a table
+- **THEN** the system SHALL extract cell content and boundaries
+- **AND** preserve table HTML for structure
+- **AND** extract plain text for translation
+
+### Requirement: Structure-Preserving Translation Foundation
+The system SHALL maintain document structure and layout information to support future translation features.
+
+#### Scenario: Preserve coordinates for translation
+- **WHEN** processing any document
+- **THEN** the system SHALL retain bbox coordinates for all text elements
+- **AND** calculate space requirements for text expansion/contraction
+- **AND** maintain element relationships and groupings
+
+#### Scenario: Extract translatable content
+- **WHEN** processing tables and lists
+- **THEN** the system SHALL extract plain text content
+- **AND** maintain mapping to original structure
+- **AND** preserve formatting markers for reconstruction
+
+#### Scenario: Support layout adjustment
+- **WHEN** preparing for translation
+- **THEN** the system SHALL identify flexible vs fixed layout regions
+- **AND** calculate maximum text expansion ratios
+- **AND** preserve non-translatable elements (logos, signatures)
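
The track-selection rules in the "Document Type Detection" requirement can be illustrated with a small sketch. This is not the project's implementation: `recommend_track`, `TrackRecommendation`, and the extension sets are hypothetical names, and `text_coverage` is assumed to be a value in [0, 1] computed elsewhere (the spec does not define how coverage is measured). Using the coverage value itself as the confidence score is also an assumption made only for illustration.

```python
from dataclasses import dataclass
from pathlib import Path

# Hypothetical constants; only the 90% threshold and the Office
# extensions come from the spec delta above.
OFFICE_EXTENSIONS = {".docx", ".xlsx", ".pptx"}
IMAGE_EXTENSIONS = {".png", ".jpg", ".jpeg", ".tif", ".tiff", ".bmp"}
DIRECT_TRACK_THRESHOLD = 0.90  # "direct" only if text coverage > 90%


@dataclass
class TrackRecommendation:
    track: str         # "ocr" or "direct"
    confidence: float  # assumed scale: 0.0 to 1.0
    reason: str


def recommend_track(filename: str, text_coverage: float = 0.0) -> TrackRecommendation:
    """Pick a processing track from the file type and a precomputed text coverage."""
    suffix = Path(filename).suffix.lower()

    if suffix in OFFICE_EXTENSIONS:
        # The spec routes Office files to OCR for the initial implementation.
        return TrackRecommendation("ocr", 1.0, "office format")

    if suffix in IMAGE_EXTENSIONS:
        # Images carry no extractable text layer.
        return TrackRecommendation("ocr", 1.0, "image input")

    if suffix == ".pdf":
        if not 0.0 <= text_coverage <= 1.0:
            raise ValueError("text_coverage must be between 0 and 1")
        if text_coverage > DIRECT_TRACK_THRESHOLD:
            return TrackRecommendation("direct", text_coverage, "extractable text")
        # Low coverage suggests a scanned PDF; confidence is the complement.
        return TrackRecommendation("ocr", 1.0 - text_coverage, "insufficient text")

    raise ValueError(f"unsupported file type: {suffix!r}")


if __name__ == "__main__":
    rec = recommend_track("report.pdf", text_coverage=0.97)
    # The selected track would be recorded in the processing metadata.
    print({"processing_track": rec.track, "confidence": rec.confidence})
```

Note that the comparison is strict, so a PDF with exactly 90% coverage still goes to the OCR track, matching the "> 90%" wording in the scenario.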