chore: project cleanup and prepare for dual-track processing refactor
- Removed all test files and directories
- Deleted outdated documentation (will be rewritten)
- Cleaned up temporary files, logs, and uploads
- Archived 5 completed OpenSpec proposals
- Created new dual-track-document-processing proposal with complete OpenSpec structure
- Dual-track architecture: OCR track (PaddleOCR) + Direct track (PyMuPDF)
- UnifiedDocument model for consistent output
- Support for structure-preserving translation
- Updated .gitignore to prevent future test/temp files

This is a major cleanup preparing for the complete refactoring of the document processing pipeline.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
openspec/changes/dual-track-document-processing/design.md (new file, 276 lines)
@@ -0,0 +1,276 @@
# Technical Design: Dual-track Document Processing

## Context

### Background

The current OCR tool processes all documents through PaddleOCR, even when dealing with editable PDFs that contain extractable text. This causes:

- Unnecessary processing overhead
- Potential quality degradation from re-OCRing already digital text
- Loss of precise formatting information
- Inefficient GPU usage on documents that don't need OCR

### Constraints

- RTX 4060 8GB GPU memory limitation
- Need to maintain backward compatibility with existing API
- Must support future translation features
- Should handle mixed documents (partially scanned, partially digital)

### Stakeholders

- API consumers expecting consistent JSON/PDF output
- Translation system requiring structure preservation
- Performance-sensitive deployments

## Goals / Non-Goals

### Goals

- Intelligently route documents to appropriate processing track
- Preserve document structure for translation
- Optimize GPU usage by avoiding unnecessary OCR
- Maintain unified output format across tracks
- Reduce processing time for editable PDFs by 70%+

### Non-Goals

- Implementing the actual translation engine (future phase)
- Supporting video or audio transcription
- Real-time collaborative editing
- OCR model training or fine-tuning

## Decisions

### Decision 1: Dual-track Architecture

**What**: Implement two separate processing pipelines - OCR track and Direct extraction track

**Why**:

- Editable PDFs don't need OCR, can be processed 10-100x faster
- Direct extraction preserves exact formatting and fonts
- OCR track remains optimal for scanned documents

**Alternatives considered**:

1. **Single enhanced OCR pipeline**: Would still waste resources on editable PDFs
2. **Hybrid approach per page**: Too complex, most documents are uniformly editable or scanned
3. **Multiple specialized pipelines**: Over-engineering for current requirements

### Decision 2: UnifiedDocument Model

**What**: Create a standardized intermediate representation for both tracks

**Why**:

- Provides consistent API interface regardless of processing track
- Simplifies downstream processing (PDF generation, translation)
- Enables track switching without breaking changes

**Structure**:
```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Dict, List, Literal, Optional, Union

# DocumentMetadata, Dimensions, BoundingBox, StyleInfo, and ElementType are
# companion models defined alongside UnifiedDocument.


@dataclass
class UnifiedDocument:
    document_id: str
    metadata: DocumentMetadata
    pages: List[Page]
    processing_track: Literal["ocr", "direct"]


@dataclass
class Page:
    page_number: int
    elements: List[DocumentElement]
    dimensions: Dimensions


@dataclass
class DocumentElement:
    element_id: str
    type: ElementType  # text, table, image, header, etc.
    content: Union[str, Dict, bytes]
    bbox: BoundingBox
    style: Optional[StyleInfo]
    confidence: Optional[float]  # Only for OCR track
```
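
The supporting types referenced above are not spelled out in this document. A minimal sketch of what they might look like follows; every field name beyond the types referenced above is an assumption, not the final model.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class ElementType(str, Enum):
    # Subset of the element categories the tracks are expected to emit.
    TEXT = "text"
    TITLE = "title"
    TABLE = "table"
    IMAGE = "image"
    HEADER = "header"
    FOOTER = "footer"
    LIST = "list"


@dataclass
class BoundingBox:
    # Page-space coordinates: (x0, y0) top-left, (x1, y1) bottom-right.
    x0: float
    y0: float
    x1: float
    y1: float


@dataclass
class Dimensions:
    width: float
    height: float
    unit: str = "pt"  # PDF points by default


@dataclass
class StyleInfo:
    font_name: Optional[str] = None
    font_size: Optional[float] = None
    bold: bool = False
    italic: bool = False
    color: Optional[str] = None  # e.g. "#000000"


@dataclass
class DocumentMetadata:
    filename: str
    page_count: int
    language: Optional[str] = None
    extra: dict = field(default_factory=dict)
```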

### Decision 3: PyMuPDF for Direct Extraction

**What**: Use PyMuPDF (fitz) library for editable PDF processing

**Why**:

- Mature, well-maintained library
- Excellent coordinate preservation
- Fast C++ backend
- Supports text, tables, and image extraction with positions

**Alternatives considered**:

1. **pdfplumber**: Good but slower, less precise coordinates
2. **PyPDF2**: Limited layout information
3. **PDFMiner**: Complex API, slower performance
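
For reference, a minimal sketch of the kind of positional data PyMuPDF exposes for the direct track; it walks span-level text via `page.get_text("dict")` and is an illustration, not the actual DirectExtractionEngine.

```python
import fitz  # PyMuPDF


def extract_text_spans(pdf_path: str):
    """Yield (text, bbox, font, size) for every text span in the PDF."""
    with fitz.open(pdf_path) as doc:
        for page in doc:
            layout = page.get_text("dict")  # blocks -> lines -> spans
            for block in layout["blocks"]:
                if block["type"] != 0:  # 0 = text block, 1 = image block
                    continue
                for line in block["lines"]:
                    for span in line["spans"]:
                        # bbox is (x0, y0, x1, y1) in points, origin top-left
                        yield span["text"], span["bbox"], span["font"], span["size"]


if __name__ == "__main__":
    for text, bbox, font, size in extract_text_spans("sample.pdf"):
        print(f"{bbox}: {font} {size:.1f}pt -> {text!r}")
```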

### Decision 4: Processing Track Auto-detection

**What**: Automatically determine optimal track based on document analysis

**Detection logic**:
```python
import magic  # python-magic / python-magic-bin
import fitz   # PyMuPDF
from pathlib import Path

# Assumed constant: MIME types of the Office formats routed to OCR for now.
OFFICE_MIMES = {
    "application/vnd.openxmlformats-officedocument.wordprocessingml.document",    # .docx
    "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",          # .xlsx
    "application/vnd.openxmlformats-officedocument.presentationml.presentation",  # .pptx
}


def detect_track(file_path: Path) -> str:
    file_type = magic.from_file(str(file_path), mime=True)

    if file_type.startswith("image/"):
        return "ocr"

    if file_type == "application/pdf":
        # Check if PDF has extractable text
        with fitz.open(file_path) as doc:
            for page in doc.pages(0, min(3, doc.page_count)):  # Sample first 3 pages
                text = page.get_text()
                if len(text.strip()) < 100:  # Minimal text -> treat as scanned
                    return "ocr"
        return "direct"

    if file_type in OFFICE_MIMES:
        return "ocr"  # For now, may add direct Office support later

    return "ocr"  # Default fallback
```

### Decision 5: GPU Memory Management

**What**: Implement dynamic batch sizing and model caching for RTX 4060 8GB

**Why**:

- Prevents OOM errors
- Maximizes throughput
- Enables concurrent request handling

**Strategy**:
```python
from functools import lru_cache

# calculate_batch_size, get_gpu_memory, load_model, and MODEL_MEMORY_REQUIREMENTS
# are project-level helpers/constants referenced here for illustration.

# Adaptive batch sizing based on available memory
batch_size = calculate_batch_size(
    available_memory=get_gpu_memory(),
    image_size=image.shape,
    model_size=MODEL_MEMORY_REQUIREMENTS,
)


# Model caching to avoid reload overhead
@lru_cache(maxsize=2)
def get_model(model_type: str):
    return load_model(model_type)
```
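
The document does not specify how `calculate_batch_size` works. One plausible heuristic is sketched below; the per-pixel memory estimate and safety margin are assumed values, not measurements.

```python
def calculate_batch_size(
    available_memory: int,
    image_size: tuple,
    model_size: int,
    bytes_per_pixel: int = 12,   # assumed: ~3 channels of float32 activations
    safety_margin: float = 0.2,  # keep 20% of VRAM free for fragmentation/spikes
) -> int:
    """Estimate how many images fit in GPU memory alongside the loaded model."""
    height, width = image_size[:2]
    per_image = height * width * bytes_per_pixel
    usable = int(available_memory * (1 - safety_margin)) - model_size
    if usable <= 0 or per_image <= 0:
        return 1  # always process at least one image
    return max(1, usable // per_image)


# Example: 8 GB card, 2 GB model, 1080p pages
print(calculate_batch_size(8 * 1024**3, (1080, 1920, 3), 2 * 1024**3))
```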

### Decision 6: Backward Compatibility

**What**: Maintain existing API while adding new capabilities

**How**:

- Existing endpoints continue working unchanged
- New `processing_track` parameter is optional
- Output format compatible with current consumers
- Gradual migration path for clients
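
A minimal sketch of how the optional parameter could be exposed without breaking existing clients, assuming a hypothetical FastAPI upload route; the real endpoint paths and the `process_document` service call are not specified by this document.

```python
from typing import Literal, Optional

from fastapi import FastAPI, File, Query, UploadFile

app = FastAPI()


@app.post("/api/v2/tasks")
async def create_task(
    file: UploadFile = File(...),
    # Optional: when omitted, the DocumentTypeDetector picks the track,
    # so existing clients keep working unchanged.
    processing_track: Optional[Literal["ocr", "direct"]] = Query(default=None),
):
    track = processing_track or "auto"
    # result = await process_document(file, track=track)  # hypothetical service call
    return {"status": "accepted", "requested_track": track}
```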

## Risks / Trade-offs

### Risk 1: Mixed Content Documents

**Risk**: Documents with both scanned and digital pages

**Mitigation**:

- Page-level track detection as fallback
- Confidence scoring to identify uncertain pages
- Manual override option via API

### Risk 2: Direct Extraction Quality

**Risk**: Some PDFs have poor internal structure

**Mitigation**:

- Fallback to OCR track if extraction quality is low
- Quality metrics: text density, structure coherence
- User-reportable quality issues
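
To make the fallback criterion concrete, here is a rough sketch of a text-density metric over the direct-track source; the 1% area threshold is an assumption, not a tuned value.

```python
import fitz  # PyMuPDF


def direct_extraction_quality(pdf_path: str) -> float:
    """Return the fraction of page area covered by extractable text (0.0-1.0)."""
    covered = 0.0
    total = 0.0
    with fitz.open(pdf_path) as doc:
        for page in doc:
            total += page.rect.width * page.rect.height
            for x0, y0, x1, y1, _text, _block_no, block_type in page.get_text("blocks"):
                if block_type != 0:  # skip image blocks
                    continue
                covered += max(0.0, x1 - x0) * max(0.0, y1 - y0)
    return covered / total if total else 0.0


def should_fall_back_to_ocr(pdf_path: str, min_text_density: float = 0.01) -> bool:
    # Very low text density suggests a scanned or poorly structured PDF.
    return direct_extraction_quality(pdf_path) < min_text_density
```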

### Risk 3: Memory Pressure

**Risk**: RTX 4060 8GB limitation with concurrent requests

**Mitigation**:

- Request queuing system
- Dynamic batch adjustment
- CPU fallback for overflow

### Trade-off 1: Processing Time vs Accuracy

- Direct extraction: fast but depends on PDF quality
- OCR: slower but consistent quality
- **Decision**: Prioritize speed for editable PDFs, accuracy for scanned documents

### Trade-off 2: Complexity vs Flexibility

- Two tracks increase system complexity
- But they enable optimal processing per document type
- **Decision**: Accept the complexity for 10x+ performance gains

## Migration Plan

### Phase 1: Infrastructure (Week 1-2)

1. Deploy UnifiedDocument model
2. Implement DocumentTypeDetector
3. Add DirectExtractionEngine
4. Update logging and monitoring

### Phase 2: Integration (Week 3)

1. Update OCR service with routing logic
2. Modify PDF generator for unified model
3. Add new API endpoints
4. Deploy to staging

### Phase 3: Validation (Week 4)

1. A/B testing with a subset of traffic
2. Performance benchmarking
3. Quality validation
4. Client integration testing

### Rollback Plan

1. Feature flag to disable dual-track
2. Fall back all requests to the OCR track
3. Maintain old code paths during transition
4. Keep database migrations reversible
## Open Questions

### Resolved

- Q: Should we support page-level track mixing?
  - A: No, adds complexity with minimal benefit. Document-level is sufficient.
- Q: How to handle Office documents?
  - A: OCR track initially, consider python-docx/openpyxl later if needed.

### Pending

- Q: What translation services to integrate with?
  - Needs stakeholder input on cost/quality trade-offs
- Q: Should we cache extracted text for repeated processing?
  - Depends on storage costs vs reprocessing frequency
- Q: How to handle password-protected PDFs?
  - May need API parameter for passwords

## Performance Targets

### Direct Extraction Track

- Latency: <500ms per page
- Throughput: 100+ pages/minute
- Memory: <500MB per document

### OCR Track (Optimized)

- Latency: 2-5s per page (GPU)
- Throughput: 20-30 pages/minute
- Memory: <2GB per batch

### API Response Times

- Document type detection: <100ms
- Processing initiation: <200ms
- Result retrieval: <100ms

## Technical Dependencies

### Python Packages

```text
# Direct extraction
PyMuPDF==1.23.x
pdfplumber==0.10.x       # Fallback/validation
python-magic-bin==0.4.x

# OCR enhancement
paddlepaddle-gpu==2.5.2
paddleocr==2.7.3

# Infrastructure
pydantic==2.x
fastapi==0.100+
redis==5.x               # For caching
```

### System Requirements

- CUDA 11.8+ for PaddlePaddle
- libmagic for file detection
- 16GB RAM minimum
- 50GB disk for models and cache
openspec/changes/dual-track-document-processing/proposal.md (new file, 35 lines)
@@ -0,0 +1,35 @@
# Change: Dual-track Document Processing with Structure-Preserving Translation

## Why

The current system processes all documents through PaddleOCR, causing unnecessary overhead for editable PDFs that already contain extractable text. Additionally, we're only using ~20% of PP-StructureV3's capabilities, missing out on comprehensive document structure extraction. The system needs to support structure-preserving document translation as a future goal.

## What Changes

- **ADDED** Dual-track processing architecture with intelligent routing
  - OCR track for scanned documents, images, and Office files using PaddleOCR
  - Direct extraction track for editable PDFs using PyMuPDF
- **ADDED** UnifiedDocument model as common output format for both tracks
- **ADDED** DocumentTypeDetector service for automatic track selection
- **MODIFIED** OCR service to use PP-StructureV3's parsing_res_list instead of markdown
  - Now extracts all 23 element types with bbox coordinates
  - Preserves reading order and hierarchical structure
- **MODIFIED** PDF generator to handle UnifiedDocument format
  - Enhanced overlap detection to prevent text/image/table collisions
  - Improved coordinate transformation for accurate layout
- **ADDED** Foundation for structure-preserving translation system
- **BREAKING** JSON output structure will include new fields (existing clients remain compatible via default values)

## Impact

- **Affected specs**:
  - `document-processing` (new capability)
  - `result-export` (enhanced with track metadata and structure data)
  - `task-management` (tracks processing route and history)
- **Affected code**:
  - `backend/app/services/ocr_service.py` - Major refactoring for dual-track
  - `backend/app/services/pdf_generator_service.py` - UnifiedDocument support
  - `backend/app/api/v2/tasks.py` - New endpoints for track detection
  - `frontend/src/pages/TaskDetailPage.tsx` - Display processing track info
- **Performance**: 5-10x faster for editable PDFs, same speed for scanned documents
- **Dependencies**: Adds PyMuPDF, pdfplumber, python-magic-bin
@@ -0,0 +1,108 @@
# Document Processing Spec Delta

## ADDED Requirements

### Requirement: Dual-track Processing

The system SHALL support two distinct processing tracks for documents: OCR track for scanned/image documents and Direct extraction track for editable PDFs.

#### Scenario: Process scanned PDF through OCR track

- **WHEN** a scanned PDF is uploaded
- **THEN** the system SHALL detect it requires OCR
- **AND** route it through PaddleOCR PP-StructureV3 pipeline
- **AND** return results in UnifiedDocument format

#### Scenario: Process editable PDF through direct extraction

- **WHEN** an editable PDF with extractable text is uploaded
- **THEN** the system SHALL detect it can be directly extracted
- **AND** route it through PyMuPDF extraction pipeline
- **AND** return results in UnifiedDocument format without OCR

#### Scenario: Auto-detect processing track

- **WHEN** a document is uploaded without explicit track specification
- **THEN** the system SHALL analyze the document type and content
- **AND** automatically select the optimal processing track
- **AND** include the selected track in processing metadata

### Requirement: Document Type Detection

The system SHALL provide intelligent document type detection to determine the optimal processing track.

#### Scenario: Detect editable PDF

- **WHEN** analyzing a PDF document
- **THEN** the system SHALL check for extractable text content
- **AND** return confidence score for editability
- **AND** recommend "direct" track if text coverage > 90%
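
A hedged illustration of the coverage check implied by this scenario; the per-page character threshold and the averaging scheme are assumptions about how DocumentTypeDetector might score editability.

```python
import fitz  # PyMuPDF


def editability_confidence(pdf_path: str, chars_per_page_full: int = 500) -> float:
    """Score 0.0-1.0: fraction of pages that look fully text-extractable."""
    with fitz.open(pdf_path) as doc:
        if doc.page_count == 0:
            return 0.0
        scores = []
        for page in doc:
            chars = len(page.get_text().strip())
            # A page with >= chars_per_page_full characters counts as fully editable.
            scores.append(min(1.0, chars / chars_per_page_full))
        return sum(scores) / len(scores)


def recommended_track(pdf_path: str) -> str:
    return "direct" if editability_confidence(pdf_path) > 0.9 else "ocr"
```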

#### Scenario: Detect scanned document

- **WHEN** analyzing an image or scanned PDF
- **THEN** the system SHALL identify lack of extractable text
- **AND** recommend "ocr" track for processing
- **AND** configure appropriate OCR models

#### Scenario: Detect Office documents

- **WHEN** analyzing .docx, .xlsx, .pptx files
- **THEN** the system SHALL identify Office format
- **AND** route to OCR track for initial implementation
- **AND** preserve option for future direct Office extraction

### Requirement: Unified Document Model

The system SHALL use a standardized UnifiedDocument model as the common output format for both processing tracks.

#### Scenario: Generate UnifiedDocument from OCR

- **WHEN** OCR processing completes
- **THEN** the system SHALL convert PP-StructureV3 results to UnifiedDocument
- **AND** preserve all element types, coordinates, and confidence scores
- **AND** maintain reading order and hierarchical structure

#### Scenario: Generate UnifiedDocument from direct extraction

- **WHEN** direct extraction completes
- **THEN** the system SHALL convert PyMuPDF results to UnifiedDocument
- **AND** preserve text styling, fonts, and exact positioning
- **AND** extract tables with cell boundaries and content

#### Scenario: Consistent output regardless of track

- **WHEN** processing completes through either track
- **THEN** the output SHALL conform to UnifiedDocument schema
- **AND** include processing_track metadata field
- **AND** support identical downstream operations (PDF generation, translation)

### Requirement: Enhanced OCR with Full PP-StructureV3

The system SHALL utilize the full capabilities of PP-StructureV3, extracting all 23 element types from parsing_res_list.

#### Scenario: Extract comprehensive document structure

- **WHEN** processing through OCR track
- **THEN** the system SHALL use page_result.json['parsing_res_list']
- **AND** extract all element types including headers, lists, tables, figures
- **AND** preserve layout_bbox coordinates for each element
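
A rough sketch of the conversion this scenario implies. The per-entry field names (`block_label`, `block_content`, `layout_bbox`, `score`) are assumptions based on the keys named in this spec and may need adjusting to the exact PP-StructureV3 result schema in use.

```python
from typing import Dict, List


def parsing_res_to_elements(parsing_res_list: List[Dict]) -> List[Dict]:
    """Convert PP-StructureV3 parsing_res_list entries into element dicts.

    Field names are assumptions drawn from this spec; adjust to the actual
    page_result.json layout produced by the installed PaddleOCR version.
    """
    elements = []
    for index, entry in enumerate(parsing_res_list):
        elements.append(
            {
                "element_id": f"ocr-{index:04d}",
                "type": entry.get("block_label", "text"),  # e.g. title, table, figure
                "content": entry.get("block_content", ""),
                "bbox": entry.get("layout_bbox"),          # [x0, y0, x1, y1]
                "reading_order": index,                    # preserve list order
                "confidence": entry.get("score"),          # may be absent
            }
        )
    return elements
```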

#### Scenario: Maintain reading order

- **WHEN** extracting elements from PP-StructureV3
- **THEN** the system SHALL preserve the reading order from parsing_res_list
- **AND** assign sequential indices to elements
- **AND** support reordering for complex layouts

#### Scenario: Extract table structure

- **WHEN** PP-StructureV3 identifies a table
- **THEN** the system SHALL extract cell content and boundaries
- **AND** preserve table HTML for structure
- **AND** extract plain text for translation

### Requirement: Structure-Preserving Translation Foundation

The system SHALL maintain document structure and layout information to support future translation features.

#### Scenario: Preserve coordinates for translation

- **WHEN** processing any document
- **THEN** the system SHALL retain bbox coordinates for all text elements
- **AND** calculate space requirements for text expansion/contraction
- **AND** maintain element relationships and groupings

#### Scenario: Extract translatable content

- **WHEN** processing tables and lists
- **THEN** the system SHALL extract plain text content
- **AND** maintain mapping to original structure
- **AND** preserve formatting markers for reconstruction

#### Scenario: Support layout adjustment

- **WHEN** preparing for translation
- **THEN** the system SHALL identify flexible vs fixed layout regions
- **AND** calculate maximum text expansion ratios
- **AND** preserve non-translatable elements (logos, signatures)
@@ -0,0 +1,74 @@
# Result Export Spec Delta

## MODIFIED Requirements

### Requirement: Export Interface

The Export page SHALL support downloading OCR results in multiple formats using V2 task APIs, with processing track information and enhanced structure data.

#### Scenario: Export page uses V2 download endpoints

- **WHEN** user selects a format and clicks export button
- **THEN** frontend SHALL call V2 endpoint `/api/v2/tasks/{task_id}/download/{format}`
- **AND** frontend SHALL NOT call V1 `/api/v2/export` endpoint (which returns 404)
- **AND** file SHALL download successfully

#### Scenario: Export supports multiple formats

- **WHEN** user exports a completed task
- **THEN** system SHALL support downloading as TXT, JSON, Excel, Markdown, and PDF
- **AND** each format SHALL use correct V2 download endpoint
- **AND** downloaded files SHALL contain task OCR results

#### Scenario: Export includes processing track metadata

- **WHEN** user exports a task processed through dual-track system
- **THEN** exported JSON SHALL include "processing_track" field indicating "ocr" or "direct"
- **AND** SHALL include "processing_metadata" with track-specific information
- **AND** SHALL maintain backward compatibility for clients not expecting these fields
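
For illustration only, the additive shape described here might look like the following (shown as a Python dict; apart from `processing_track` and `processing_metadata`, every field name and value is an assumption):

```python
# Existing export fields stay untouched; the new keys are additive, so older
# clients can simply ignore them.
export_payload = {
    "task_id": "example-task-id",       # existing field (illustrative value)
    "results": [...],                   # existing OCR/extraction results
    "processing_track": "direct",       # new: "ocr" or "direct"
    "processing_metadata": {            # new: track-specific details
        "detector_confidence": 0.97,    # assumed metric name
        "engine": "pymupdf",            # assumed metric name
        "pages": 12,
    },
}
```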

#### Scenario: Export UnifiedDocument format

- **WHEN** user requests JSON export with unified=true parameter
- **THEN** system SHALL return UnifiedDocument structure
- **AND** include complete element hierarchy with coordinates
- **AND** preserve all PP-StructureV3 element types for OCR track

## ADDED Requirements

### Requirement: Enhanced PDF Export with Layout Preservation

The PDF export SHALL accurately preserve document layout from both OCR and direct extraction tracks.

#### Scenario: Export PDF from direct extraction track

- **WHEN** exporting PDF from a direct-extraction processed document
- **THEN** the PDF SHALL maintain exact text positioning from source
- **AND** preserve original fonts and styles where possible
- **AND** include extracted images at correct positions

#### Scenario: Export PDF from OCR track with full structure

- **WHEN** exporting PDF from OCR-processed document
- **THEN** the PDF SHALL use all 23 PP-StructureV3 element types
- **AND** render tables with proper cell boundaries
- **AND** maintain reading order from parsing_res_list

#### Scenario: Handle coordinate transformations

- **WHEN** generating PDF from UnifiedDocument
- **THEN** system SHALL correctly transform bbox coordinates to PDF space
- **AND** handle page size variations
- **AND** prevent text overlap using enhanced overlap detection
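
A small sketch of the bbox transformation this scenario calls for, assuming the source bbox uses a top-left origin (as PyMuPDF and most layout models do) and the target PDF canvas uses a bottom-left origin in points:

```python
def bbox_to_pdf_space(
    bbox: tuple,
    src_page_height: float,
    scale_x: float = 1.0,
    scale_y: float = 1.0,
) -> tuple:
    """Map (x0, y0, x1, y1) from a top-left-origin page to bottom-left PDF space."""
    x0, y0, x1, y1 = bbox
    # Flip the y-axis: distance from the top becomes distance from the bottom.
    pdf_y0 = (src_page_height - y1) * scale_y
    pdf_y1 = (src_page_height - y0) * scale_y
    return (x0 * scale_x, pdf_y0, x1 * scale_x, pdf_y1)


# Example: an element 100pt from the top of an A4 page (842pt tall)
print(bbox_to_pdf_space((72, 100, 300, 120), src_page_height=842))
# -> (72.0, 722.0, 300.0, 742.0)
```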

### Requirement: Structure Data Export

The system SHALL provide export formats that preserve document structure for downstream processing.

#### Scenario: Export structured JSON with hierarchy

- **WHEN** user selects structured JSON format
- **THEN** export SHALL include element hierarchy and relationships
- **AND** preserve parent-child relationships (sections, lists)
- **AND** include style and formatting information

#### Scenario: Export for translation preparation

- **WHEN** user exports with translation_ready=true parameter
- **THEN** export SHALL include translatable text segments
- **AND** maintain coordinate mappings for each segment
- **AND** mark non-translatable regions

#### Scenario: Export with layout analysis

- **WHEN** user requests layout analysis export
- **THEN** system SHALL include reading order indices
- **AND** identify layout regions (header, body, footer, sidebar)
- **AND** provide confidence scores for layout detection
@@ -0,0 +1,105 @@
# Task Management Spec Delta

## MODIFIED Requirements

### Requirement: Task Result Generation

The OCR service SHALL generate both JSON and Markdown result files for completed tasks with actual content, including processing track information and enhanced structure data.

#### Scenario: Markdown file contains OCR results

- **WHEN** a task completes OCR processing successfully
- **THEN** the generated `.md` file SHALL contain the extracted text in markdown format
- **AND** the file size SHALL be greater than 0 bytes
- **AND** the markdown SHALL include headings, paragraphs, and formatting based on OCR layout detection

#### Scenario: Result files stored in task directory

- **WHEN** OCR processing completes for task ID `88c6c2d2-37e1-48fd-a50f-406142987bdf`
- **THEN** result files SHALL be stored in `storage/results/88c6c2d2-37e1-48fd-a50f-406142987bdf/`
- **AND** both `<filename>_result.json` and `<filename>_result.md` SHALL exist
- **AND** both files SHALL contain valid OCR output data

#### Scenario: Include processing track in results

- **WHEN** a task completes through dual-track processing
- **THEN** the JSON result SHALL include "processing_track" field
- **AND** SHALL indicate whether "ocr" or "direct" track was used
- **AND** SHALL include track-specific metadata (confidence for OCR, extraction quality for direct)

#### Scenario: Store UnifiedDocument format

- **WHEN** processing completes through either track
- **THEN** system SHALL save results in UnifiedDocument format
- **AND** maintain backward-compatible JSON structure
- **AND** include enhanced structure from PP-StructureV3 or PyMuPDF

### Requirement: Task Detail View

The frontend SHALL provide a dedicated page for viewing individual task details with processing track information and enhanced preview capabilities.

#### Scenario: Navigate to task detail page

- **WHEN** user clicks "View Details" button on task in Task History page
- **THEN** browser SHALL navigate to `/tasks/{task_id}`
- **AND** TaskDetailPage component SHALL render

#### Scenario: Display task information

- **WHEN** TaskDetailPage loads for a valid task ID
- **THEN** page SHALL display task metadata (filename, status, processing time, confidence)
- **AND** page SHALL show markdown preview of OCR results
- **AND** page SHALL provide download buttons for JSON, Markdown, and PDF formats

#### Scenario: Download from task detail page

- **WHEN** user clicks download button for a specific format
- **THEN** browser SHALL download the file using `/api/v2/tasks/{task_id}/download/{format}` endpoint
- **AND** downloaded file SHALL contain the task's OCR results in requested format

#### Scenario: Display processing track information

- **WHEN** viewing task processed through dual-track system
- **THEN** page SHALL display processing track used (OCR or Direct)
- **AND** show track-specific metrics (OCR confidence or extraction quality)
- **AND** provide option to reprocess with alternate track if applicable

#### Scenario: Preview document structure

- **WHEN** user enables structure view
- **THEN** page SHALL display document element hierarchy
- **AND** show bounding boxes overlay on preview
- **AND** highlight different element types (headers, tables, lists) with distinct colors

## ADDED Requirements

### Requirement: Processing Track Management

The task management system SHALL track and display processing track information for all tasks.

#### Scenario: Track processing route selection

- **WHEN** a task begins processing
- **THEN** system SHALL record the selected processing track
- **AND** log the reason for track selection
- **AND** store auto-detection confidence score

#### Scenario: Allow track override

- **WHEN** user views a completed task
- **THEN** system SHALL offer option to reprocess with different track
- **AND** maintain both results for comparison
- **AND** track which result user prefers

#### Scenario: Display processing metrics

- **WHEN** task completes processing
- **THEN** system SHALL record track-specific metrics
- **AND** OCR track SHALL show confidence scores and character count
- **AND** Direct track SHALL show extraction coverage and structure quality

### Requirement: Task Processing History

The system SHALL maintain detailed processing history for tasks including track changes and reprocessing.

#### Scenario: Record reprocessing attempts

- **WHEN** a task is reprocessed with different track
- **THEN** system SHALL maintain processing history
- **AND** store results from each attempt
- **AND** allow comparison between different processing attempts

#### Scenario: Track quality improvements

- **WHEN** viewing task history
- **THEN** system SHALL show quality metrics over time
- **AND** indicate if reprocessing improved results
- **AND** suggest optimal track based on document characteristics

#### Scenario: Export processing analytics

- **WHEN** exporting task data
- **THEN** system SHALL include processing history
- **AND** provide track selection statistics
- **AND** include performance metrics for each processing attempt
openspec/changes/dual-track-document-processing/tasks.md (new file, 170 lines)
@@ -0,0 +1,170 @@
# Implementation Tasks: Dual-track Document Processing

## 1. Core Infrastructure

- [ ] 1.1 Add PyMuPDF and other dependencies to requirements.txt
  - [ ] 1.1.1 Add PyMuPDF==1.23.x
  - [ ] 1.1.2 Add pdfplumber==0.10.x
  - [ ] 1.1.3 Add python-magic-bin==0.4.x
  - [ ] 1.1.4 Test dependency installation
- [ ] 1.2 Create UnifiedDocument model in backend/app/models/
  - [ ] 1.2.1 Define UnifiedDocument dataclass
  - [ ] 1.2.2 Add DocumentElement model
  - [ ] 1.2.3 Add DocumentMetadata model
  - [ ] 1.2.4 Create converters for both OCR and direct extraction outputs
- [ ] 1.3 Create DocumentTypeDetector service
  - [ ] 1.3.1 Implement file type detection using python-magic
  - [ ] 1.3.2 Add PDF editability checking logic
  - [ ] 1.3.3 Add Office document detection
  - [ ] 1.3.4 Create routing logic to determine processing track
  - [ ] 1.3.5 Add unit tests for detector

## 2. Direct Extraction Track

- [ ] 2.1 Create DirectExtractionEngine service
  - [ ] 2.1.1 Implement PyMuPDF-based text extraction
  - [ ] 2.1.2 Add structure preservation logic
  - [ ] 2.1.3 Extract tables with coordinates
  - [ ] 2.1.4 Extract images and their positions
  - [ ] 2.1.5 Maintain reading order
  - [ ] 2.1.6 Handle multi-column layouts
- [ ] 2.2 Implement layout analysis for editable PDFs
  - [ ] 2.2.1 Detect headers and footers
  - [ ] 2.2.2 Identify sections and subsections
  - [ ] 2.2.3 Parse lists and nested structures
  - [ ] 2.2.4 Extract font and style information
- [ ] 2.3 Create direct extraction to UnifiedDocument converter
  - [ ] 2.3.1 Map PyMuPDF structures to UnifiedDocument
  - [ ] 2.3.2 Preserve coordinate information
  - [ ] 2.3.3 Maintain element relationships

## 3. OCR Track Enhancement

- [ ] 3.1 Upgrade PP-StructureV3 configuration
  - [ ] 3.1.1 Update config for RTX 4060 8GB optimization
  - [ ] 3.1.2 Enable batch processing for GPU efficiency
  - [ ] 3.1.3 Configure memory management settings
  - [ ] 3.1.4 Set up model caching
- [ ] 3.2 Enhance OCR service to use parsing_res_list
  - [ ] 3.2.1 Replace markdown extraction with parsing_res_list
  - [ ] 3.2.2 Extract all 23 element types
  - [ ] 3.2.3 Preserve bbox coordinates from PP-StructureV3
  - [ ] 3.2.4 Maintain reading order information
- [ ] 3.3 Create OCR to UnifiedDocument converter
  - [ ] 3.3.1 Map PP-StructureV3 elements to UnifiedDocument
  - [ ] 3.3.2 Handle complex nested structures
  - [ ] 3.3.3 Preserve all metadata

## 4. Unified Processing Pipeline

- [ ] 4.1 Update main OCR service for dual-track processing
  - [ ] 4.1.1 Integrate DocumentTypeDetector
  - [ ] 4.1.2 Route to appropriate processing engine
  - [ ] 4.1.3 Return UnifiedDocument from both tracks
  - [ ] 4.1.4 Maintain backward compatibility
- [ ] 4.2 Create unified JSON export
  - [ ] 4.2.1 Define standardized JSON schema
  - [ ] 4.2.2 Include processing metadata
  - [ ] 4.2.3 Support both track outputs
- [ ] 4.3 Update PDF generator for UnifiedDocument
  - [ ] 4.3.1 Adapt PDF generation to use UnifiedDocument
  - [ ] 4.3.2 Preserve layout from both tracks
  - [ ] 4.3.3 Handle coordinate transformations

## 5. Translation System Foundation

- [ ] 5.1 Create TranslationEngine interface
  - [ ] 5.1.1 Define translation API contract
  - [ ] 5.1.2 Support element-level translation
  - [ ] 5.1.3 Preserve formatting markers
- [ ] 5.2 Implement structure-preserving translation
  - [ ] 5.2.1 Translate text while maintaining coordinates
  - [ ] 5.2.2 Handle table cell translations
  - [ ] 5.2.3 Preserve list structures
  - [ ] 5.2.4 Maintain header hierarchies
- [ ] 5.3 Create translated document renderer
  - [ ] 5.3.1 Generate PDF with translated text
  - [ ] 5.3.2 Adjust layouts for text expansion/contraction
  - [ ] 5.3.3 Handle font substitution for target languages

## 6. API Updates

- [ ] 6.1 Update OCR endpoints
  - [ ] 6.1.1 Add processing_track parameter
  - [ ] 6.1.2 Support track auto-detection
  - [ ] 6.1.3 Return processing metadata
- [ ] 6.2 Add document type detection endpoint
  - [ ] 6.2.1 Create /analyze endpoint
  - [ ] 6.2.2 Return recommended processing track
  - [ ] 6.2.3 Provide confidence scores
- [ ] 6.3 Update result export endpoints
  - [ ] 6.3.1 Support UnifiedDocument format
  - [ ] 6.3.2 Add format conversion options
  - [ ] 6.3.3 Include processing track information

## 7. Frontend Updates

- [ ] 7.1 Update task detail view
  - [ ] 7.1.1 Display processing track information
  - [ ] 7.1.2 Show track-specific metadata
  - [ ] 7.1.3 Add track selection UI (if manual override needed)
- [ ] 7.2 Update results preview
  - [ ] 7.2.1 Handle UnifiedDocument format
  - [ ] 7.2.2 Display enhanced structure information
  - [ ] 7.2.3 Show coordinate overlays (debug mode)
- [ ] 7.3 Add translation UI preparation
  - [ ] 7.3.1 Add translation toggle/button
  - [ ] 7.3.2 Language selection dropdown
  - [ ] 7.3.3 Translation progress indicator

## 8. Testing

- [ ] 8.1 Unit tests for DocumentTypeDetector
  - [ ] 8.1.1 Test various file types
  - [ ] 8.1.2 Test editability detection
  - [ ] 8.1.3 Test edge cases
- [ ] 8.2 Unit tests for DirectExtractionEngine
  - [ ] 8.2.1 Test text extraction accuracy
  - [ ] 8.2.2 Test structure preservation
  - [ ] 8.2.3 Test coordinate extraction
- [ ] 8.3 Integration tests for dual-track processing
  - [ ] 8.3.1 Test routing logic
  - [ ] 8.3.2 Test UnifiedDocument generation
  - [ ] 8.3.3 Test backward compatibility
- [ ] 8.4 End-to-end tests
  - [ ] 8.4.1 Test scanned PDF processing (OCR track)
  - [ ] 8.4.2 Test editable PDF processing (direct track)
  - [ ] 8.4.3 Test Office document processing
  - [ ] 8.4.4 Test image file processing
- [ ] 8.5 Performance testing
  - [ ] 8.5.1 Benchmark both processing tracks
  - [ ] 8.5.2 Test GPU memory usage
  - [ ] 8.5.3 Compare processing times

## 9. Documentation

- [ ] 9.1 Update API documentation
  - [ ] 9.1.1 Document new endpoints
  - [ ] 9.1.2 Update existing endpoint docs
  - [ ] 9.1.3 Add processing track information
- [ ] 9.2 Create architecture documentation
  - [ ] 9.2.1 Document dual-track flow
  - [ ] 9.2.2 Explain UnifiedDocument structure
  - [ ] 9.2.3 Add decision trees for track selection
- [ ] 9.3 Add deployment guide
  - [ ] 9.3.1 Document GPU requirements
  - [ ] 9.3.2 Add environment configuration
  - [ ] 9.3.3 Include troubleshooting guide

## 10. Deployment Preparation

- [ ] 10.1 Update Docker configuration
  - [ ] 10.1.1 Add new dependencies to Dockerfile
  - [ ] 10.1.2 Configure GPU support
  - [ ] 10.1.3 Update volume mappings
- [ ] 10.2 Update environment variables
  - [ ] 10.2.1 Add processing track settings
  - [ ] 10.2.2 Configure GPU memory limits
  - [ ] 10.2.3 Add feature flags
- [ ] 10.3 Create migration plan
  - [ ] 10.3.1 Plan for existing data migration
  - [ ] 10.3.2 Create rollback procedures
  - [ ] 10.3.3 Document breaking changes

## Completion Checklist

- [ ] All unit tests passing
- [ ] Integration tests passing
- [ ] Performance benchmarks acceptable
- [ ] Documentation complete
- [ ] Code reviewed
- [ ] Deployment tested in staging