chore: project cleanup and prepare for dual-track processing refactor

- Removed all test files and directories
- Deleted outdated documentation (will be rewritten)
- Cleaned up temporary files, logs, and uploads
- Archived 5 completed OpenSpec proposals
- Created new dual-track-document-processing proposal with complete OpenSpec structure
  - Dual-track architecture: OCR track (PaddleOCR) + Direct track (PyMuPDF)
  - UnifiedDocument model for consistent output
  - Support for structure-preserving translation
- Updated .gitignore to prevent future test/temp files

This is a major cleanup preparing for the complete refactoring of the document processing pipeline.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-18 20:02:31 +08:00
parent 0edc56b03f
commit cd3cbea49d
64 changed files with 3573 additions and 8190 deletions
--- a/openspec/changes/archive/2025-11-18-add-ocr-batch-processing/specs/ocr-processing/spec.md
+++ b/openspec/changes/archive/2025-11-18-add-ocr-batch-processing/specs/ocr-processing/spec.md
@@ -0,0 +1,125 @@
+# OCR Processing Specification
+
+## ADDED Requirements
+
+### Requirement: Multi-Language Text Recognition with Structure Analysis
+The system SHALL extract text and images from document files using PaddleOCR-VL with support for 109 languages including Chinese (traditional and simplified), English, Japanese, and Korean, while preserving document logical structure and reading order (not pixel-perfect visual layout).
+
+#### Scenario: Single image OCR with Chinese text
+- **WHEN** user uploads a PNG image containing Chinese text
+- **THEN** the system extracts text with bounding boxes and confidence scores
+- **AND** returns structured JSON with recognized text, coordinates, and language detected
+- **AND** generates Markdown output preserving text layout and hierarchy
+
+#### Scenario: PDF document OCR with layout preservation
+- **WHEN** user uploads a multi-page PDF file
+- **THEN** the system processes each page with PaddleOCR-VL
+- **AND** performs layout analysis to identify document elements (titles, paragraphs, tables, images, formulas)
+- **AND** returns Markdown organized by page with preserved reading order
+- **AND** provides JSON with detailed layout structure and bounding boxes
+
+#### Scenario: Mixed language content
+- **WHEN** user uploads an image with both Chinese and English text
+- **THEN** the system detects and extracts text in both languages
+- **AND** preserves the spatial relationship between text regions
+- **AND** maintains proper reading order in output Markdown
+
+#### Scenario: Complex document with tables and images
+- **WHEN** user uploads a scanned document containing tables, images, and text
+- **THEN** the system identifies layout elements (text blocks, tables, images, formulas)
+- **AND** extracts table structure as Markdown tables
+- **AND** extracts and saves document images as separate files
+- **AND** embeds image references in Markdown (![](path/to/image.jpg))
+- **AND** preserves document hierarchy and reading order in Markdown output
+
+### Requirement: Batch Processing
+The system SHALL process multiple files concurrently with progress tracking and error handling.
+
+#### Scenario: Batch upload success
+- **WHEN** user uploads 10 image files simultaneously
+- **THEN** the system creates a batch task with unique batch ID
+- **AND** processes files in parallel (up to configured worker limit)
+- **AND** returns real-time progress updates via WebSocket or polling
+
+#### Scenario: Batch processing with partial failure
+- **WHEN** a batch contains 5 valid images and 2 corrupted files
+- **THEN** the system processes all valid files successfully
+- **AND** logs errors for corrupted files with specific error messages
+- **AND** marks the batch as "partially completed"
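
A minimal sketch, not part of this spec, of how the partial-failure scenario above could behave. It assumes a thread pool as the worker mechanism; `run_batch`, `fake_ocr`, and the `"completed"` / `"failed"` status strings are hypothetical names chosen for illustration (only "partially completed" comes from the spec).

```python
import uuid
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable


def run_batch(paths: list[str], ocr_fn: Callable[[str], str], max_workers: int = 4) -> dict:
    """Process files in parallel and summarize the batch outcome."""
    results: dict[str, str] = {}
    errors: dict[str, str] = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(ocr_fn, path): path for path in paths}
        for future in as_completed(futures):
            path = futures[future]
            try:
                results[path] = future.result()
            except Exception as exc:
                # A corrupted file is logged per file and must not abort the batch.
                errors[path] = f"{type(exc).__name__}: {exc}"
    if not errors:
        status = "completed"
    elif results:
        status = "partially completed"
    else:
        status = "failed"
    return {
        "batch_id": uuid.uuid4().hex,  # unique batch ID
        "status": status,
        "results": results,
        "errors": errors,
    }


def fake_ocr(path: str) -> str:
    # Hypothetical stand-in for the real OCR call:
    # file names starting with "bad" are treated as corrupted.
    if path.startswith("bad"):
        raise ValueError("cannot decode image")
    return f"text from {path}"
```

Collecting errors per file rather than re-raising keeps the valid files' results available, which is what the "partially completed" status relies on.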
+
+### Requirement: Image Preprocessing
+The system SHALL provide optional image preprocessing to improve OCR accuracy.
+
+#### Scenario: Low contrast image enhancement
+- **WHEN** user enables preprocessing for a low-contrast image
+- **THEN** the system applies contrast adjustment and denoising
+- **AND** performs OCR on the enhanced image
+- **AND** returns higher recognition accuracy than OCR on the original image
+
+#### Scenario: Skipped preprocessing
+- **WHEN** user disables preprocessing option
+- **THEN** the system performs OCR directly on original image
+- **AND** completes processing faster than with preprocessing enabled
+
+### Requirement: Confidence Threshold Filtering
+The system SHALL filter OCR results based on configurable confidence threshold.
+
+#### Scenario: High confidence filter
+- **WHEN** user sets confidence threshold to 0.8
+- **THEN** the system returns only text segments with confidence >= 0.8
+- **AND** discards low-confidence results
+
+#### Scenario: Include all results
+- **WHEN** user sets confidence threshold to 0.0
+- **THEN** the system returns all recognized text regardless of confidence
+- **AND** includes confidence scores in output
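
The two threshold scenarios above can be sketched as a single inclusive filter. This is an illustration only; `TextSegment` and `filter_by_confidence` are hypothetical names, not identifiers from the codebase.

```python
from dataclasses import dataclass


@dataclass
class TextSegment:
    text: str
    confidence: float  # 0.0 to 1.0


def filter_by_confidence(segments: list[TextSegment], threshold: float = 0.0) -> list[TextSegment]:
    """Keep segments whose confidence is >= threshold (inclusive, as in the spec)."""
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold must be between 0.0 and 1.0")
    return [segment for segment in segments if segment.confidence >= threshold]
```

With a threshold of 0.0 every segment passes, and each returned segment still carries its confidence score.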
+
+### Requirement: OCR Result Structure
+The system SHALL return OCR results in multiple formats (JSON, Markdown) with extracted text, images, and structure metadata.
+
+#### Scenario: Successful OCR result with multiple formats
+- **WHEN** OCR processing completes successfully
+- **THEN** the system returns JSON containing:
+  - File metadata (name, size, format, upload timestamp)
+  - Detected text regions with bounding boxes (x, y, width, height)
+  - Recognized text content for each region
+  - Confidence scores (0.0 to 1.0)
+  - Language detected
+  - Layout element types (title, paragraph, table, image, formula)
+  - Reading order sequence
+  - List of extracted image files with paths
+  - Processing time
+  - Task status (completed/failed/partial)
+- **AND** generates Markdown file with logical structure
+- **AND** saves extracted images to storage directory
+- **AND** provides methods to export as searchable PDF with images
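
One possible shape for the JSON result listed above, sketched with dataclasses. The field names (`file_name`, `bbox`, `processing_time_s`, and so on) are assumptions for illustration; the spec names the information to include but not the keys.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class Region:
    text: str
    bbox: tuple[float, float, float, float]  # x, y, width, height
    confidence: float  # 0.0 to 1.0
    element_type: str  # title, paragraph, table, image, or formula
    reading_order: int


@dataclass
class OcrResult:
    file_name: str
    file_size: int
    file_format: str
    uploaded_at: str  # upload timestamp as an ISO 8601 string (assumed format)
    language: str
    processing_time_s: float
    status: str  # completed, failed, or partial
    regions: list[Region] = field(default_factory=list)
    image_paths: list[str] = field(default_factory=list)

    def to_json(self) -> str:
        # ensure_ascii=False keeps Chinese, Japanese, and Korean text readable.
        return json.dumps(asdict(self), ensure_ascii=False, indent=2)
```

Tuples serialize as JSON arrays, so each `bbox` appears as a four-element list in the output.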
+
+#### Scenario: Searchable PDF generation with images
+- **WHEN** user requests PDF export from OCR results
+- **THEN** the system converts Markdown to HTML with basic CSS styling
+- **AND** embeds extracted images in their logical positions (not exact original positions)
+- **AND** generates PDF using Pandoc + WeasyPrint
+- **AND** preserves document hierarchy, tables, and reading order
+- **AND** applies appropriate fonts for Chinese characters
+- **AND** produces searchable PDF (text is selectable and searchable)
+
+### Requirement: Document Translation (Reserved Architecture)
+The system SHALL provide architecture and UI placeholders for future document translation features.
+
+#### Scenario: Translation option visibility (UI placeholder)
+- **WHEN** user views OCR result page
+- **THEN** the system displays a "Translate Document" button (disabled or labeled "Coming Soon")
+- **AND** shows target language selection dropdown (disabled)
+- **AND** provides tooltip: "Translation feature will be available in future release"
+
+#### Scenario: Translation API endpoint (reserved)
+- **WHEN** backend API is queried for translation endpoints
+- **THEN** the system provides `/api/v1/translate/document` endpoint specification
+- **AND** returns "Not Implemented" (501) status when called
+- **AND** documents expected request/response format for future implementation
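
A framework-independent sketch of the reserved endpoint's behavior. The `handle_request` function and the response body keys are hypothetical; only the path and the 501 status come from the scenario above.

```python
import json
from http import HTTPStatus

TRANSLATE_PATH = "/api/v1/translate/document"


def handle_request(path: str) -> tuple[int, str]:
    """Return (status code, JSON body) for a request path."""
    if path == TRANSLATE_PATH:
        # The route exists but is reserved for a future release.
        status = HTTPStatus.NOT_IMPLEMENTED
        body = {"error": status.phrase, "detail": "Translation is not available yet"}
    else:
        status = HTTPStatus.NOT_FOUND
        body = {"error": status.phrase}
    return status.value, json.dumps(body)
```

Returning 501 rather than 404 lets clients distinguish a reserved route from a nonexistent one.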
+
+#### Scenario: Translation configuration storage (database schema)
+- **WHEN** database schema is created
+- **THEN** the system includes `translation_configs` table
+- **AND** defines columns: id, user_id, source_lang, target_lang, engine_type, engine_config, created_at
+- **AND** table remains empty until translation feature is implemented
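
A sketch of the `translation_configs` table using SQLite from the Python standard library. The column names are those listed above; the column types, constraints, and the choice of SQLite are assumptions, since the spec does not state them.

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS translation_configs (
    id            INTEGER PRIMARY KEY,
    user_id       INTEGER NOT NULL,
    source_lang   TEXT NOT NULL,
    target_lang   TEXT NOT NULL,
    engine_type   TEXT NOT NULL,
    engine_config TEXT,  -- assumed to hold JSON-encoded engine settings
    created_at    TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
);
"""


def create_schema(conn: sqlite3.Connection) -> list[str]:
    """Create the table and return its column names in declaration order."""
    conn.executescript(SCHEMA)
    rows = conn.execute("PRAGMA table_info(translation_configs)")
    return [row[1] for row in rows]  # index 1 of each row is the column name
```

Creating the table up front with no rows matches the scenario: the schema is reserved while the feature itself remains unimplemented.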