chore: project cleanup and prepare for dual-track processing refactor

- Removed all test files and directories
- Deleted outdated documentation (will be rewritten)
- Cleaned up temporary files, logs, and uploads
- Archived 5 completed OpenSpec proposals
- Created new dual-track-document-processing proposal with complete OpenSpec structure
  - Dual-track architecture: OCR track (PaddleOCR) + Direct track (PyMuPDF)
  - UnifiedDocument model for consistent output
  - Support for structure-preserving translation
- Updated .gitignore to keep future test/temp files out of the repository

This is a major cleanup preparing for the complete refactoring of the document processing pipeline.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
egg
2025-11-18 20:02:31 +08:00
parent 0edc56b03f
commit cd3cbea49d
64 changed files with 3573 additions and 8190 deletions

@@ -0,0 +1,48 @@
# Change: Add OCR Batch Processing System with Structure Extraction
## Why
Users need a web-based solution to extract text, images, and structure from multiple document files efficiently. Current manual text extraction is time-consuming and error-prone. This system will automate the process with multi-language OCR support (Chinese, English, etc.) and intelligent layout analysis to understand document structure, and will provide flexible export options, including searchable PDF with embedded images. The extracted content preserves logical structure and reading order (not pixel-perfect visual layout). The system also reserves architecture for future document translation capabilities.
## What Changes
- Add core OCR processing capability using **PaddleOCR-VL** (vision-language model for document parsing)
- Implement **document structure analysis** with PP-StructureV3 to identify titles, paragraphs, tables, images, formulas
- Extract and **preserve document images** alongside text content
- Support unified input preprocessing (convert any format to images/PDF for OCR processing)
- Implement batch file upload and processing (PNG and JPG images, PDF files)
- Support multi-language text recognition (Traditional/Simplified Chinese, English, Japanese, Korean, and others; 109 languages via PaddleOCR-VL)
- Add **Markdown intermediate format** for structured document representation with embedded images
- Implement **searchable PDF generation** from Markdown with images (Pandoc + WeasyPrint)
- Generate PDFs that preserve logical structure and reading order (not exact visual layout)
- Add rule-based output formatting system for organizing extracted text
- Implement multiple export formats (TXT, JSON, Excel, **Markdown with images, searchable PDF**)
- Create web UI with drag-and-drop file upload
- Build RESTful API for OCR processing with progress tracking
- Add background task processing for long-running OCR jobs
- **Reserve translation module architecture** (UI placeholders + API endpoints for future implementation)
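
The Markdown-to-PDF step listed above could be driven by calling Pandoc with WeasyPrint as its PDF engine. The following is a minimal sketch under stated assumptions, not the project's implementation: the function names `build_pandoc_command` and `markdown_to_pdf` and all file paths are hypothetical. The Pandoc options used (`--from`, `--pdf-engine`, `--resource-path`, `--css`, `-o`) are existing CLI flags.

```python
import subprocess
from pathlib import Path
from typing import List, Optional


def build_pandoc_command(
    markdown_path: Path,
    pdf_path: Path,
    css_path: Optional[Path] = None,
) -> List[str]:
    """Build the Pandoc argument list for Markdown -> PDF via WeasyPrint.

    Hypothetical helper: the proposal only names Pandoc + WeasyPrint,
    not how they are invoked.
    """
    command = [
        "pandoc",
        str(markdown_path),
        "--from=markdown",
        "--pdf-engine=weasyprint",
        # Resolve relative image links against the Markdown file's directory.
        f"--resource-path={markdown_path.parent}",
        "-o",
        str(pdf_path),
    ]
    if css_path is not None:
        # Basic stylesheet for readable output, not for layout replication.
        command.append(f"--css={css_path}")
    return command


def markdown_to_pdf(
    markdown_path: Path,
    pdf_path: Path,
    css_path: Optional[Path] = None,
) -> None:
    """Run Pandoc; raises CalledProcessError if the conversion fails."""
    subprocess.run(
        build_pandoc_command(markdown_path, pdf_path, css_path),
        check=True,
    )
```

Keeping the command construction separate from the `subprocess.run` call makes the conversion step testable without Pandoc or WeasyPrint installed.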
## Impact
- **New capabilities**:
  - `ocr-processing`: Core OCR text and image extraction with structure analysis (PaddleOCR-VL + PP-StructureV3)
  - `file-management`: File upload, validation, and storage with format standardization
  - `export-results`: Multi-format export with custom rules, including searchable PDF with embedded images
  - `translation` (reserved): Architecture for future translation features
- **Affected code**:
  - New backend: `app/` (FastAPI application structure)
  - New frontend: `frontend/` (React + Vite application)
  - New database tables: `ocr_tasks`, `ocr_results`, `export_rules`, `translation_configs` (reserved)
- **Dependencies**:
  - Backend: fastapi, paddleocr (3.0+), paddlepaddle, pdf2image, pandas, pillow, weasyprint, markdown, pandoc (system)
  - Frontend: react, vite, tailwindcss, shadcn/ui, axios, react-query
  - Translation engines (reserved): argostranslate (offline) or API integration
- **Configuration**:
  - MySQL database connection (external server)
  - PaddleOCR-VL model storage (~900MB) and language packs
  - Pandoc installation for PDF generation
  - Basic CSS template for readable PDF output (not for visual layout replication)
  - Image storage directory for extracted images
  - File upload size limits and supported formats
  - Port configuration (12010 for backend, 12011 for frontend dev)
  - Translation service config (reserved for future)
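
The configuration items above might be gathered into a single settings object with a small upload check. This is an illustrative sketch only: the class `AppSettings`, the function `is_upload_allowed`, the 20 MB size limit, the `.jpeg` extension, and the storage path are assumptions, not values from the proposal. Only the ports (12010, 12011) and the PNG/JPG/PDF formats come from the text.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import FrozenSet


@dataclass(frozen=True)
class AppSettings:
    """Hypothetical settings container for the values listed above."""

    backend_port: int = 12010       # from the proposal
    frontend_dev_port: int = 12011  # from the proposal
    # Assumed limit; the proposal does not state a concrete size.
    max_upload_bytes: int = 20 * 1024 * 1024
    # PNG/JPG/PDF come from the proposal; ".jpeg" is an assumed alias.
    allowed_extensions: FrozenSet[str] = frozenset(
        {".png", ".jpg", ".jpeg", ".pdf"}
    )
    # Assumed location for extracted images.
    image_dir: Path = Path("storage/images")


def is_upload_allowed(
    filename: str, size_bytes: int, settings: AppSettings
) -> bool:
    """Accept a file only if its extension and size are within limits."""
    extension = Path(filename).suffix.lower()
    has_valid_type = extension in settings.allowed_extensions
    has_valid_size = 0 < size_bytes <= settings.max_upload_bytes
    return has_valid_type and has_valid_size
```

For example, `is_upload_allowed("scan.PDF", 1024, AppSettings())` accepts the file because the extension comparison is case-insensitive, while a `.docx` file or an oversized image is rejected.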