egg cd3cbea49d chore: project cleanup and prepare for dual-track processing refactor

- Removed all test files and directories
- Deleted outdated documentation (will be rewritten)
- Cleaned up temporary files, logs, and uploads
- Archived 5 completed OpenSpec proposals
- Created new dual-track-document-processing proposal with complete OpenSpec structure
  - Dual-track architecture: OCR track (PaddleOCR) + Direct track (PyMuPDF)
  - UnifiedDocument model for consistent output
  - Support for structure-preserving translation
- Updated .gitignore to prevent future test/temp files

This is a major cleanup preparing for the complete refactoring of the document processing pipeline.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

2025-11-18 20:02:31 +08:00

Change: Dual-track Document Processing with Structure-Preserving Translation

Why

The current system processes all documents through PaddleOCR, which adds unnecessary overhead for editable PDFs that already contain extractable text. Additionally, we're only using ~20% of PP-StructureV3's capabilities, so comprehensive document structure extraction is being missed. The system should also lay the foundation for structure-preserving document translation as a future goal.

What Changes

- ADDED Dual-track processing architecture with intelligent routing
  - OCR track for scanned documents, images, and Office files using PaddleOCR
  - Direct extraction track for editable PDFs using PyMuPDF
- ADDED UnifiedDocument model as common output format for both tracks
- ADDED DocumentTypeDetector service for automatic track selection
- MODIFIED OCR service to use PP-StructureV3's parsing_res_list instead of markdown
  - Now extracts all 23 element types with bbox coordinates
  - Preserves reading order and hierarchical structure
- MODIFIED PDF generator to handle UnifiedDocument format
  - Enhanced overlap detection to prevent text/image/table collisions
  - Improved coordinate transformation for accurate layout
- ADDED Foundation for structure-preserving translation system
- BREAKING JSON output structure will include new fields (backward compatible with defaults)
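
The routing and common output format listed above could look roughly like the following. This is a minimal sketch, not the proposed implementation: `Track`, `Element`, `choose_track`, the field names, and the `MIN_CHARS_PER_PAGE` threshold are all hypothetical, and the per-page character counts are assumed to be measured elsewhere (for example from the PDF's text layer).

```python
from dataclasses import dataclass, field
from enum import Enum
from pathlib import Path
from typing import List, Tuple


class Track(str, Enum):
    OCR = "ocr"        # scanned documents, images, Office files
    DIRECT = "direct"  # editable PDFs with an extractable text layer


@dataclass
class Element:
    # Hypothetical element shape: a type label, a bounding box, and text.
    type: str
    bbox: Tuple[float, float, float, float]  # (x0, y0, x1, y1)
    content: str = ""


@dataclass
class UnifiedDocument:
    # Both tracks would emit this shape; elements are kept in reading order.
    track: Track
    elements: List[Element] = field(default_factory=list)


# Assumed threshold: average extractable characters per page at or above
# which a PDF is treated as editable rather than scanned.
MIN_CHARS_PER_PAGE = 50


def choose_track(filename: str, chars_per_page: List[int]) -> Track:
    """Pick a processing track from the file extension and text-layer size."""
    if Path(filename).suffix.lower() != ".pdf":
        return Track.OCR  # images and Office files go through OCR
    if not chars_per_page:
        return Track.OCR  # no text layer was measured
    average = sum(chars_per_page) / len(chars_per_page)
    return Track.DIRECT if average >= MIN_CHARS_PER_PAGE else Track.OCR
```

Giving `elements` and `content` default values is one way new fields can be added to the output while older consumers that ignore them keep working.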

Impact

- Affected specs:
  - document-processing (new capability)
  - result-export (enhanced with track metadata and structure data)
  - task-management (tracks processing route and history)
- Affected code:
  - backend/app/services/ocr_service.py - Major refactoring for dual-track
  - backend/app/services/pdf_generator_service.py - UnifiedDocument support
  - backend/app/api/v2/tasks.py - New endpoints for track detection
  - frontend/src/pages/TaskDetailPage.tsx - Display processing track info
- Performance: 5-10x faster for editable PDFs, same speed for scanned documents
- Dependencies: Adds PyMuPDF, pdfplumber, python-magic-bin
