chore: project cleanup and prepare for dual-track processing refactor

- Removed all test files and directories
- Deleted outdated documentation (will be rewritten)
- Cleaned up temporary files, logs, and uploads
- Archived 5 completed OpenSpec proposals
- Created new dual-track-document-processing proposal with complete OpenSpec structure
  - Dual-track architecture: OCR track (PaddleOCR) + Direct track (PyMuPDF)
  - UnifiedDocument model for consistent output
  - Support for structure-preserving translation
- Updated .gitignore to keep future test/temp files out of the repository

This is a major cleanup preparing for the complete refactoring of the document processing pipeline.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
egg
2025-11-18 20:02:31 +08:00
parent 0edc56b03f
commit cd3cbea49d
64 changed files with 3573 additions and 8190 deletions

@@ -0,0 +1,35 @@
# Change: Dual-track Document Processing with Structure-Preserving Translation
## Why
The current system processes all documents through PaddleOCR, which adds unnecessary overhead for editable PDFs that already contain extractable text. Additionally, we're only using ~20% of PP-StructureV3's capabilities, so most of the document structure it can extract is discarded. Structure-preserving document translation is a future goal, and the system needs a foundation that can support it.
## What Changes
- **ADDED** Dual-track processing architecture with intelligent routing
  - OCR track for scanned documents, images, and Office files using PaddleOCR
  - Direct extraction track for editable PDFs using PyMuPDF
- **ADDED** UnifiedDocument model as common output format for both tracks
- **ADDED** DocumentTypeDetector service for automatic track selection
- **MODIFIED** OCR service to use PP-StructureV3's parsing_res_list instead of markdown
  - Now extracts all 23 element types with bbox coordinates
  - Preserves reading order and hierarchical structure
- **MODIFIED** PDF generator to handle UnifiedDocument format
  - Enhanced overlap detection to prevent text/image/table collisions
  - Improved coordinate transformation for accurate layout
- **ADDED** Foundation for structure-preserving translation system
- **BREAKING** JSON output structure will include new fields (backward compatible with defaults)
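
The routing and common output format listed above could look roughly like the sketch below. It is an illustration only, not the proposed implementation: `Track`, `Element`, `select_track`, the field names, and the `MIN_CHARS_PER_PAGE` threshold are all hypothetical, and the per-page character counts are assumed to come from whatever text extraction the detector performs (for example, PyMuPDF).

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Tuple


class Track(Enum):
    OCR = "ocr"        # scanned documents, images, Office files
    DIRECT = "direct"  # editable PDFs with an extractable text layer


@dataclass
class Element:
    """One layout element; the field names here are assumptions."""
    type: str                                 # e.g. "text", "table", "image"
    bbox: Tuple[float, float, float, float]   # (x0, y0, x1, y1)
    content: str = ""


@dataclass
class UnifiedDocument:
    """Common output of both tracks; elements are kept in reading order."""
    track: Track
    elements: List[Element] = field(default_factory=list)


# Hypothetical threshold: average extractable characters per page
# above which a PDF is treated as editable.
MIN_CHARS_PER_PAGE = 50


def select_track(extension: str, chars_per_page: List[int]) -> Track:
    """Choose a track from the file extension and per-page text counts."""
    if extension.lower() != ".pdf" or not chars_per_page:
        return Track.OCR  # non-PDF inputs always take the OCR track
    average = sum(chars_per_page) / len(chars_per_page)
    return Track.DIRECT if average >= MIN_CHARS_PER_PAGE else Track.OCR
```

Passing precomputed character counts keeps the routing decision testable without opening a real PDF; a production detector would likely also consider mixed documents where only some pages are scanned.
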
## Impact
- **Affected specs**:
  - `document-processing` (new capability)
  - `result-export` (enhanced with track metadata and structure data)
  - `task-management` (tracks processing route and history)
- **Affected code**:
  - `backend/app/services/ocr_service.py` - Major refactoring for dual-track
  - `backend/app/services/pdf_generator_service.py` - UnifiedDocument support
  - `backend/app/api/v2/tasks.py` - New endpoints for track detection
  - `frontend/src/pages/TaskDetailPage.tsx` - Display processing track info
- **Performance**: Expected 5-10x faster for editable PDFs, no change for scanned documents
- **Dependencies**: Adds PyMuPDF, pdfplumber, python-magic-bin
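
The overlap detection and coordinate transformation planned for the PDF generator can be illustrated with a small geometry sketch. It assumes that element boxes are `(x0, y0, x1, y1)` with a top-left origin (as in image pixel coordinates) and that the output page uses a bottom-left origin, as PDF default user space does. The helper names `to_pdf_coords` and `overlaps` are hypothetical and not taken from `pdf_generator_service.py`.

```python
from typing import Tuple

BBox = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


def to_pdf_coords(bbox: BBox, page_height: float, scale: float = 1.0) -> BBox:
    """Scale a top-left-origin box and flip it into bottom-left-origin space.

    page_height is given in output units, i.e. after scaling.
    """
    x0, y0, x1, y1 = bbox
    return (
        x0 * scale,
        page_height - y1 * scale,  # the old bottom edge becomes the new lower y
        x1 * scale,
        page_height - y0 * scale,  # the old top edge becomes the new upper y
    )


def overlaps(a: BBox, b: BBox) -> bool:
    """True when two axis-aligned boxes share positive area.

    Boxes that only touch along an edge are not counted as overlapping.
    """
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]
```

A generator could run `overlaps` over the transformed text, image, and table boxes on each page and shrink or shift an element only when a collision is found.
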