chore: project cleanup and prepare for dual-track processing refactor

- Removed all test files and directories
- Deleted outdated documentation (will be rewritten)
- Cleaned up temporary files, logs, and uploads
- Archived 5 completed OpenSpec proposals
- Created new dual-track-document-processing proposal with complete OpenSpec structure
  - Dual-track architecture: OCR track (PaddleOCR) + Direct track (PyMuPDF)
  - UnifiedDocument model for consistent output
  - Support for structure-preserving translation
- Updated .gitignore to exclude future test/temp files

This is a major cleanup preparing for the complete refactoring of the document processing pipeline.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
egg
2025-11-18 20:02:31 +08:00
parent 0edc56b03f
commit cd3cbea49d
64 changed files with 3573 additions and 8190 deletions

@@ -0,0 +1,125 @@
# OCR Processing Specification
## ADDED Requirements
### Requirement: Multi-Language Text Recognition with Structure Analysis
The system SHALL extract text and images from document files using PaddleOCR-VL with support for 109 languages including Chinese (traditional and simplified), English, Japanese, and Korean, while preserving document logical structure and reading order (not pixel-perfect visual layout).
#### Scenario: Single image OCR with Chinese text
- **WHEN** user uploads a PNG image containing Chinese text
- **THEN** the system extracts text with bounding boxes and confidence scores
- **AND** returns structured JSON with recognized text, coordinates, and language detected
- **AND** generates Markdown output preserving the logical structure and hierarchy of the text
#### Scenario: PDF document OCR with layout preservation
- **WHEN** user uploads a multi-page PDF file
- **THEN** the system processes each page with PaddleOCR-VL
- **AND** performs layout analysis to identify document elements (titles, paragraphs, tables, images, formulas)
- **AND** returns Markdown organized by page with preserved reading order
- **AND** provides JSON with detailed layout structure and bounding boxes
#### Scenario: Mixed language content
- **WHEN** user uploads an image with both Chinese and English text
- **THEN** the system detects and extracts text in both languages
- **AND** preserves the spatial relationship between text regions
- **AND** maintains proper reading order in output Markdown
#### Scenario: Complex document with tables and images
- **WHEN** user uploads a scanned document containing tables, images, and text
- **THEN** the system identifies layout elements (text blocks, tables, images, formulas)
- **AND** extracts table structure as Markdown tables
- **AND** extracts and saves document images as separate files
- **AND** embeds image references in Markdown (`![](path/to/image.jpg)`)
- **AND** preserves document hierarchy and reading order in Markdown output
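The Markdown-generation steps in the scenarios above can be sketched as follows. This is a minimal illustration, not the project's implementation: `LayoutElement`, its field names, `render_table`, and `to_markdown` are all hypothetical, and the layout elements are assumed to already carry a reading-order index produced by the OCR stage.

```python
from dataclasses import dataclass


@dataclass
class LayoutElement:
    """Hypothetical layout element; field names are illustrative only."""
    kind: str          # "title", "paragraph", "table", or "image"
    order: int         # position in the reading order
    text: str = ""     # recognized text (titles, paragraphs)
    rows: tuple = ()   # table rows; the first row is treated as the header
    path: str = ""     # path of the saved image file


def render_table(rows):
    """Render rows as a Markdown pipe table (empty input gives an empty string)."""
    if not rows:
        return ""
    header, *body = rows
    lines = ["| " + " | ".join(header) + " |",
             "| " + " | ".join("---" for _ in header) + " |"]
    lines += ["| " + " | ".join(row) + " |" for row in body]
    return "\n".join(lines)


def to_markdown(elements):
    """Emit one Markdown block per element, sorted by reading order."""
    blocks = []
    for el in sorted(elements, key=lambda e: e.order):
        if el.kind == "title":
            blocks.append("# " + el.text)
        elif el.kind == "table":
            blocks.append(render_table(el.rows))
        elif el.kind == "image":
            blocks.append(f"![]({el.path})")  # image reference, file saved separately
        else:
            blocks.append(el.text)
    return "\n\n".join(blocks)
```

Sorting by an explicit `order` field keeps the output independent of the order in which elements were detected, which is what the reading-order clauses ask for.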
### Requirement: Batch Processing
The system SHALL process multiple files concurrently with progress tracking and error handling.
#### Scenario: Batch upload success
- **WHEN** user uploads 10 image files simultaneously
- **THEN** the system creates a batch task with unique batch ID
- **AND** processes files in parallel (up to configured worker limit)
- **AND** returns real-time progress updates via WebSocket or polling
#### Scenario: Batch processing with partial failure
- **WHEN** a batch contains 5 valid images and 2 corrupted files
- **THEN** the system processes all valid files successfully
- **AND** logs errors for corrupted files with specific error messages
- **AND** marks the batch as "partially completed"
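A rough sketch of the batch behaviour described above, using only the standard library. The names `process_batch` and `ocr_fn` are hypothetical, the `"failed"` status for a batch with no successful files is an assumption not stated in the spec, and progress reporting over WebSocket or polling is left out.

```python
import uuid
from concurrent.futures import ThreadPoolExecutor, as_completed


def process_batch(paths, ocr_fn, max_workers=4):
    """Run ocr_fn over paths in parallel and summarize the batch."""
    batch_id = str(uuid.uuid4())  # unique batch ID
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(ocr_fn, p): p for p in paths}
        for future in as_completed(futures):
            path = futures[future]
            try:
                results[path] = future.result()
            except Exception as exc:  # record the failure, keep the batch going
                errors[path] = f"{type(exc).__name__}: {exc}"
    if not errors:
        status = "completed"
    elif results:
        status = "partially completed"
    else:
        status = "failed"  # assumption: the spec does not name this case
    return {"batch_id": batch_id, "status": status,
            "results": results, "errors": errors}
```

Catching the exception per file rather than per batch is what lets valid files finish while corrupted ones are logged with their own error message.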
### Requirement: Image Preprocessing
The system SHALL provide optional image preprocessing to improve OCR accuracy.
#### Scenario: Low contrast image enhancement
- **WHEN** user enables preprocessing for a low-contrast image
- **THEN** the system applies contrast adjustment and denoising
- **AND** performs OCR on the enhanced image
- **AND** returns results with higher recognition accuracy than OCR on the original image
#### Scenario: Skipped preprocessing
- **WHEN** user disables preprocessing option
- **THEN** the system performs OCR directly on original image
- **AND** completes processing faster than with preprocessing enabled
### Requirement: Confidence Threshold Filtering
The system SHALL filter OCR results based on configurable confidence threshold.
#### Scenario: High confidence filter
- **WHEN** user sets confidence threshold to 0.8
- **THEN** the system returns only text segments with confidence >= 0.8
- **AND** discards low-confidence results
#### Scenario: Include all results
- **WHEN** user sets confidence threshold to 0.0
- **THEN** the system returns all recognized text regardless of confidence
- **AND** includes confidence scores in output
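The threshold rule above amounts to a single comparison per segment. A minimal sketch, assuming each segment is a dict with a `confidence` key (the function name and the dict shape are illustrative, not taken from the codebase):

```python
def filter_by_confidence(segments, threshold=0.0):
    """Keep segments whose confidence is at or above the threshold."""
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold must be between 0.0 and 1.0")
    # ">=" matches the spec: a segment exactly at the threshold is kept.
    return [s for s in segments if s["confidence"] >= threshold]
```

With a threshold of 0.0 every segment passes, and because segments are returned unchanged their confidence scores stay in the output.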
### Requirement: OCR Result Structure
The system SHALL return OCR results in multiple formats (JSON, Markdown) with extracted text, images, and structure metadata.
#### Scenario: Successful OCR result with multiple formats
- **WHEN** OCR processing completes successfully
- **THEN** the system returns JSON containing:
  - File metadata (name, size, format, upload timestamp)
  - Detected text regions with bounding boxes (x, y, width, height)
  - Recognized text content for each region
  - Confidence scores (0.0 to 1.0)
  - Language detected
  - Layout element types (title, paragraph, table, image, formula)
  - Reading order sequence
  - List of extracted image files with paths
  - Processing time
  - Task status (completed/failed/partial)
- **AND** generates Markdown file with logical structure
- **AND** saves extracted images to storage directory
- **AND** provides methods to export as searchable PDF with images
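One possible shape for the JSON result listed above, sketched with dataclasses. The class and field names (`TextRegion`, `OcrResult`, `processing_time_s`, and so on) are assumptions chosen for illustration; the spec fixes only which pieces of information must be present, not their keys.

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass
class TextRegion:
    text: str
    bbox: tuple          # (x, y, width, height)
    confidence: float    # 0.0 to 1.0
    element_type: str    # title, paragraph, table, image, formula
    reading_order: int


@dataclass
class OcrResult:
    file_name: str
    file_size: int
    file_format: str
    uploaded_at: str
    language: str
    status: str                  # completed / failed / partial
    processing_time_s: float
    regions: list = field(default_factory=list)
    image_paths: list = field(default_factory=list)

    def to_json(self):
        # ensure_ascii=False keeps CJK text readable in the output
        return json.dumps(asdict(self), ensure_ascii=False, indent=2)
```

A short usage example:

```python
result = OcrResult(
    "scan.png", 1024, "png", "2025-01-01T00:00:00Z", "zh", "completed", 1.5,
    regions=[TextRegion("標題", (10, 20, 200, 40), 0.97, "title", 0)],
    image_paths=["images/fig1.jpg"],
)
print(result.to_json())
```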
#### Scenario: Searchable PDF generation with images
- **WHEN** user requests PDF export from OCR results
- **THEN** the system converts Markdown to HTML with basic CSS styling
- **AND** embeds extracted images in their logical positions (not exact original positions)
- **AND** generates PDF using Pandoc + WeasyPrint
- **AND** preserves document hierarchy, tables, and reading order
- **AND** applies appropriate fonts for Chinese characters
- **AND** produces searchable PDF (text is selectable and searchable)
### Requirement: Document Translation (Reserved Architecture)
The system SHALL provide architecture and UI placeholders for future document translation features.
#### Scenario: Translation option visibility (UI placeholder)
- **WHEN** user views OCR result page
- **THEN** the system displays a "Translate Document" button (disabled or labeled "Coming Soon")
- **AND** shows target language selection dropdown (disabled)
- **AND** provides tooltip: "Translation feature will be available in future release"
#### Scenario: Translation API endpoint (reserved)
- **WHEN** backend API is queried for translation endpoints
- **THEN** the system provides `/api/v1/translate/document` endpoint specification
- **AND** returns HTTP status 501 (Not Implemented) when called
- **AND** documents expected request/response format for future implementation
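The reserved endpoint can be illustrated without committing to a web framework. The sketch below is framework-agnostic and purely hypothetical: `handle_request` and the response body keys are assumptions, and a real implementation would register the route in whatever framework the backend uses.

```python
import json
from http import HTTPStatus

TRANSLATE_PATH = "/api/v1/translate/document"


def handle_request(path):
    """Return (status_code, json_body) for a request path."""
    if path == TRANSLATE_PATH:
        # Reserved route: it exists, but the feature is not built yet.
        status = HTTPStatus.NOT_IMPLEMENTED
        body = {"error": status.phrase,
                "detail": "Translation feature will be available in future release"}
    else:
        status = HTTPStatus.NOT_FOUND
        body = {"error": status.phrase}
    return int(status), json.dumps(body)
```

Returning 501 rather than 404 tells clients that the route is recognized and reserved, which distinguishes it from an unknown path.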
#### Scenario: Translation configuration storage (database schema)
- **WHEN** database schema is created
- **THEN** the system includes `translation_configs` table
- **AND** defines columns: id, user_id, source_lang, target_lang, engine_type, engine_config, created_at
- **AND** table remains empty until translation feature is implemented
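The reserved table can be sketched with SQLite from the standard library. Only the table name and the column names come from the spec; the column types, constraints, the choice of SQLite, and the `create_schema` helper are assumptions for illustration.

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS translation_configs (
    id            INTEGER PRIMARY KEY,
    user_id       INTEGER NOT NULL,
    source_lang   TEXT NOT NULL,
    target_lang   TEXT NOT NULL,
    engine_type   TEXT NOT NULL,
    engine_config TEXT,                          -- assumed: serialized JSON
    created_at    TEXT DEFAULT CURRENT_TIMESTAMP
);
"""


def create_schema(conn):
    """Create the reserved table; no rows are inserted."""
    conn.executescript(SCHEMA)


conn = sqlite3.connect(":memory:")
create_schema(conn)
columns = [row[1] for row in conn.execute("PRAGMA table_info(translation_configs)")]
row_count = conn.execute("SELECT COUNT(*) FROM translation_configs").fetchone()[0]
```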