# OCR Processing Specification

## ADDED Requirements

### Requirement: Multi-Language Text Recognition with Structure Analysis
The system SHALL extract text and images from document files using PaddleOCR-VL with support for 109 languages including Chinese (traditional and simplified), English, Japanese, and Korean, while preserving document logical structure and reading order (not pixel-perfect visual layout).
#### Scenario: Single image OCR with Chinese text
- WHEN user uploads a PNG image containing Chinese text
- THEN the system extracts text with bounding boxes and confidence scores
- AND returns structured JSON with recognized text, coordinates, and language detected
- AND generates Markdown output preserving text layout and hierarchy
#### Scenario: PDF document OCR with layout preservation
- WHEN user uploads a multi-page PDF file
- THEN the system processes each page with PaddleOCR-VL
- AND performs layout analysis to identify document elements (titles, paragraphs, tables, images, formulas)
- AND returns Markdown organized by page with preserved reading order
- AND provides JSON with detailed layout structure and bounding boxes
#### Scenario: Mixed language content
- WHEN user uploads an image with both Chinese and English text
- THEN the system detects and extracts text in both languages
- AND preserves the spatial relationship between text regions
- AND maintains proper reading order in output Markdown
#### Scenario: Complex document with tables and images
- WHEN user uploads a scanned document containing tables, images, and text
- THEN the system identifies layout elements (text blocks, tables, images, formulas)
- AND extracts table structure as Markdown tables
- AND extracts and saves document images as separate files
- AND embeds image references in Markdown using standard `![alt](path)` syntax
- AND preserves document hierarchy and reading order in Markdown output
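
The scenarios above describe the OCR track's raw recognition step. Below is a minimal sketch of that step, assuming the classic `paddleocr` Python API (the PaddleOCR-VL pipeline may expose a different entry point); layout analysis and Markdown generation would happen downstream.

```python
# Illustrative sketch only: assumes the classic `paddleocr` API; the PaddleOCR-VL
# pipeline interface may differ. Layout analysis and Markdown export happen downstream.
from paddleocr import PaddleOCR

def recognize_image(image_path: str, lang: str = "ch") -> list[dict]:
    """Run OCR on one image and return text regions with boxes and confidences."""
    ocr = PaddleOCR(use_angle_cls=True, lang=lang)   # "ch" handles Chinese + English
    result = ocr.ocr(image_path, cls=True)

    regions = []
    for line in result[0] or []:          # result[0]: regions for the single input image
        box, (text, confidence) = line    # box is four corner points; convert to
        regions.append({                  # (x, y, width, height) downstream if needed
            "bbox": box,
            "text": text,
            "confidence": float(confidence),
        })
    return regions
```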
### Requirement: Batch Processing
The system SHALL process multiple files concurrently with progress tracking and error handling.
#### Scenario: Batch upload success
- WHEN user uploads 10 image files simultaneously
- THEN the system creates a batch task with unique batch ID
- AND processes files in parallel (up to configured worker limit)
- AND returns real-time progress updates via WebSocket or polling
#### Scenario: Batch processing with partial failure
- WHEN a batch contains 5 valid images and 2 corrupted files
- THEN the system processes all valid files successfully
- AND logs errors for corrupted files with specific error messages
- AND marks the batch as "partially completed"
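
A sketch of this batch behavior using `concurrent.futures` with a configurable worker limit; `process_file` is a placeholder for the real per-file OCR pipeline, and the status values mirror the scenarios above.

```python
# Illustrative batch runner; `process_file` stands in for the real OCR pipeline call.
import uuid
from concurrent.futures import ThreadPoolExecutor, as_completed

def run_batch(file_paths: list[str], process_file, max_workers: int = 4) -> dict:
    batch_id = str(uuid.uuid4())
    results, errors = {}, {}

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(process_file, path): path for path in file_paths}
        for future in as_completed(futures):
            path = futures[future]
            try:
                results[path] = future.result()
            except Exception as exc:              # corrupted file, OCR failure, etc.
                errors[path] = str(exc)

    status = ("completed" if not errors
              else "partially_completed" if results else "failed")
    return {"batch_id": batch_id, "status": status,
            "results": results, "errors": errors}
```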
### Requirement: Image Preprocessing
The system SHALL provide optional image preprocessing to improve OCR accuracy.
#### Scenario: Low contrast image enhancement
- WHEN user enables preprocessing for a low-contrast image
- THEN the system applies contrast adjustment and denoising
- AND performs OCR on the enhanced image
- AND achieves better accuracy than OCR on the original image
#### Scenario: Skipped preprocessing
- WHEN user disables preprocessing option
- THEN the system performs OCR directly on original image
- AND completes processing faster
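
One way the optional enhancement step could look, assuming OpenCV is available; CLAHE contrast adjustment and non-local-means denoising stand in for whatever enhancement the pipeline actually applies.

```python
# Assumed preprocessing sketch using OpenCV; not the definitive enhancement pipeline.
import cv2
import numpy as np

def preprocess_image(image: np.ndarray, enabled: bool = True) -> np.ndarray:
    """Optionally enhance an image before OCR; pass-through when disabled."""
    if not enabled:
        return image                                   # OCR runs on the original image

    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY) if image.ndim == 3 else image
    clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
    contrasted = clahe.apply(gray)                     # boost local contrast
    denoised = cv2.fastNlMeansDenoising(contrasted, None, 10)  # h=10 filter strength
    return denoised
```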
### Requirement: Confidence Threshold Filtering
The system SHALL filter OCR results based on configurable confidence threshold.
#### Scenario: High confidence filter
- WHEN user sets confidence threshold to 0.8
- THEN the system returns only text segments with confidence >= 0.8
- AND discards low-confidence results
#### Scenario: Include all results
- WHEN user sets confidence threshold to 0.0
- THEN the system returns all recognized text regardless of confidence
- AND includes confidence scores in output
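
The filter itself is a one-liner over the region structure sketched earlier; a threshold of 0.0 keeps every result, matching the second scenario.

```python
# Keep only regions at or above the configured confidence threshold (0.0 keeps all).
def filter_by_confidence(regions: list[dict], threshold: float = 0.0) -> list[dict]:
    return [r for r in regions if r["confidence"] >= threshold]
```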
### Requirement: OCR Result Structure
The system SHALL return OCR results in multiple formats (JSON, Markdown) with extracted text, images, and structure metadata.
#### Scenario: Successful OCR result with multiple formats
- WHEN OCR processing completes successfully
- THEN the system returns JSON containing:
  - File metadata (name, size, format, upload timestamp)
  - Detected text regions with bounding boxes (x, y, width, height)
  - Recognized text content for each region
  - Confidence scores (0.0 to 1.0)
  - Language detected
  - Layout element types (title, paragraph, table, image, formula)
  - Reading order sequence
  - List of extracted image files with paths
  - Processing time
  - Task status (completed/failed/partial)
- AND generates Markdown file with logical structure
- AND saves extracted images to storage directory
- AND provides methods to export as searchable PDF with images
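
A sketch of that result shape as dataclasses; the field names are assumptions consistent with the list above, not a finalized schema.

```python
# Assumed result model mirroring the fields listed above; names and types are illustrative.
from __future__ import annotations
from dataclasses import dataclass, field

@dataclass
class TextRegion:
    text: str
    bbox: tuple[float, float, float, float]   # x, y, width, height
    confidence: float                          # 0.0 to 1.0
    element_type: str                          # title | paragraph | table | image | formula
    reading_order: int

@dataclass
class OCRResult:
    file_name: str
    file_size: int
    file_format: str
    uploaded_at: str
    language: str
    regions: list[TextRegion] = field(default_factory=list)
    extracted_images: list[str] = field(default_factory=list)  # paths to saved image files
    markdown_path: str | None = None
    processing_time_s: float = 0.0
    status: str = "completed"                  # completed | failed | partial
```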
#### Scenario: Searchable PDF generation with images
- WHEN user requests PDF export from OCR results
- THEN the system converts Markdown to HTML with basic CSS styling
- AND embeds extracted images in their logical positions (not exact original positions)
- AND generates PDF using Pandoc + WeasyPrint
- AND preserves document hierarchy, tables, and reading order
- AND applies appropriate fonts for Chinese characters
- AND produces searchable PDF (text is selectable and searchable)
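
A minimal sketch of the Markdown → HTML → PDF step, assuming `pypandoc` and WeasyPrint are available; the stylesheet and CJK font name are placeholders.

```python
# Sketch of the export path (Markdown -> HTML via Pandoc, HTML -> PDF via WeasyPrint).
# The stylesheet and font family below are placeholder assumptions.
import pypandoc
from weasyprint import HTML

CSS = """
body  { font-family: "Noto Sans CJK SC", sans-serif; }
table { border-collapse: collapse; }
td, th { border: 1px solid #999; padding: 4px; }
"""

def markdown_to_pdf(markdown_text: str, output_path: str) -> None:
    html_body = pypandoc.convert_text(markdown_text, to="html", format="md")
    html = f"<html><head><style>{CSS}</style></head><body>{html_body}</body></html>"
    HTML(string=html).write_pdf(output_path)   # text remains selectable and searchable
```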
### Requirement: Document Translation (Reserved Architecture)
The system SHALL provide architecture and UI placeholders for future document translation features.
#### Scenario: Translation option visibility (UI placeholder)
- WHEN user views OCR result page
- THEN the system displays a "Translate Document" button (disabled or labeled "Coming Soon")
- AND shows target language selection dropdown (disabled)
- AND provides tooltip: "Translation feature will be available in future release"
#### Scenario: Translation API endpoint (reserved)
- WHEN backend API is queried for translation endpoints
- THEN the system provides a `/api/v1/translate/document` endpoint specification
- AND returns "Not Implemented" (501) status when called
- AND documents expected request/response format for future implementation
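
Assuming a FastAPI backend (the spec does not name the framework), the reserved endpoint could be stubbed as follows; the request model fields are illustrative.

```python
# Hypothetical FastAPI stub for the reserved endpoint; returns 501 until implemented.
from fastapi import APIRouter, HTTPException
from pydantic import BaseModel

router = APIRouter(prefix="/api/v1/translate")

class TranslateDocumentRequest(BaseModel):    # expected request shape (illustrative)
    document_id: str
    source_lang: str
    target_lang: str

@router.post("/document")
async def translate_document(request: TranslateDocumentRequest):
    raise HTTPException(status_code=501,
                        detail="Translation feature will be available in a future release")
```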
#### Scenario: Translation configuration storage (database schema)
- WHEN database schema is created
- THEN the system includes a `translation_configs` table
- AND defines columns: id, user_id, source_lang, target_lang, engine_type, engine_config, created_at
- AND table remains empty until translation feature is implemented
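
A sketch of the reserved table as a SQLAlchemy model (an assumed ORM choice; column types are guesses consistent with the columns listed above).

```python
# Assumed SQLAlchemy mapping for the reserved translation_configs table; types are illustrative.
from datetime import datetime
from sqlalchemy import JSON, Column, DateTime, Integer, String
from sqlalchemy.orm import declarative_base

Base = declarative_base()

class TranslationConfig(Base):
    __tablename__ = "translation_configs"

    id = Column(Integer, primary_key=True)
    user_id = Column(Integer, index=True)
    source_lang = Column(String(16))
    target_lang = Column(String(16))
    engine_type = Column(String(64))          # which translation engine to use
    engine_config = Column(JSON)              # engine-specific settings
    created_at = Column(DateTime, default=datetime.utcnow)
```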