# OCR Processing Specification

## ADDED Requirements

### Requirement: Multi-Language Text Recognition with Structure Analysis
The system SHALL extract text and images from document files using PaddleOCR-VL, which supports 109 languages including Chinese (Traditional and Simplified), English, Japanese, and Korean. Extraction SHALL preserve the document's logical structure and reading order, not a pixel-perfect visual layout.

#### Scenario: Single image OCR with Chinese text
- **WHEN** user uploads a PNG image containing Chinese text
- **THEN** the system extracts text with bounding boxes and confidence scores
- **AND** returns structured JSON with recognized text, coordinates, and the detected language
- **AND** generates Markdown output preserving the logical structure and hierarchy of the text

#### Scenario: PDF document OCR with layout preservation
- **WHEN** user uploads a multi-page PDF file
- **THEN** the system processes each page with PaddleOCR-VL
- **AND** performs layout analysis to identify document elements (titles, paragraphs, tables, images, formulas)
- **AND** returns Markdown organized by page with preserved reading order
- **AND** provides JSON with detailed layout structure and bounding boxes
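
The page-by-page ordering described above can be sketched as follows. This is a minimal illustration, not the PaddleOCR-VL output format: the `LayoutElement` type, its field names, and the HTML comment used as a page marker are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class LayoutElement:
    """Hypothetical layout element; field names are assumptions."""
    page: int      # 1-based page number
    order: int     # reading-order index within the page
    kind: str      # "title", "paragraph", "table", "image", or "formula"
    content: str   # recognized text (assumed to be Markdown already)


def render_pages(elements: list[LayoutElement]) -> str:
    """Return Markdown grouped by page, with elements in reading order."""
    lines: list[str] = []
    current_page = None
    for el in sorted(elements, key=lambda e: (e.page, e.order)):
        if el.page != current_page:
            # Emit a page marker whenever a new page starts.
            current_page = el.page
            lines.append(f"<!-- page {el.page} -->")
        prefix = "# " if el.kind == "title" else ""
        lines.append(f"{prefix}{el.content}")
    return "\n\n".join(lines)
```

Sorting on `(page, order)` keeps the output stable even if pages are processed out of order.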

#### Scenario: Mixed language content
- **WHEN** user uploads an image with both Chinese and English text
- **THEN** the system detects and extracts text in both languages
- **AND** preserves the spatial relationship between text regions
- **AND** maintains proper reading order in output Markdown

#### Scenario: Complex document with tables and images
- **WHEN** user uploads a scanned document containing tables, images, and text
- **THEN** the system identifies layout elements (text blocks, tables, images, formulas)
- **AND** extracts table structure as Markdown tables
- **AND** extracts and saves document images as separate files
- **AND** embeds image references in Markdown (`![](path/to/image.jpg)`)
- **AND** preserves document hierarchy and reading order in Markdown output
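
As a rough sketch of the table and image steps in this scenario, the helpers below turn an already-extracted table into a Markdown pipe table and build an image reference. The function names and the row-of-strings input shape are assumptions; escaping of `|` inside cells is deliberately omitted.

```python
def table_to_markdown(rows: list[list[str]]) -> str:
    """Render rows as a Markdown pipe table; the first row is the header."""
    header, *body = rows
    lines = [
        "| " + " | ".join(header) + " |",
        "| " + " | ".join("---" for _ in header) + " |",
    ]
    lines += ["| " + " | ".join(row) + " |" for row in body]
    return "\n".join(lines)


def image_reference(path: str, alt: str = "") -> str:
    """Build a Markdown image reference to a separately saved image file."""
    return f"![{alt}]({path})"
```

For example, `image_reference("path/to/image.jpg")` yields the reference form shown in the scenario.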

### Requirement: Batch Processing
The system SHALL process multiple files concurrently with progress tracking and error handling.

#### Scenario: Batch upload success
- **WHEN** user uploads 10 image files simultaneously
- **THEN** the system creates a batch task with a unique batch ID
- **AND** processes files in parallel (up to configured worker limit)
- **AND** returns real-time progress updates via WebSocket or polling
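
One way to picture this scenario is a bounded thread pool with a progress callback. The sketch below assumes a caller-supplied `ocr_fn` (a stand-in for the real OCR call) and a hypothetical `on_progress` hook that a WebSocket or polling layer could consume; neither name comes from this specification.

```python
import uuid
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_batch(paths, ocr_fn, max_workers=4, on_progress=None):
    """Process files in parallel and report progress after each one.

    Returns (batch_id, results, errors); results and errors are keyed by path.
    """
    batch_id = str(uuid.uuid4())  # unique batch ID
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(ocr_fn, p): p for p in paths}
        for done, future in enumerate(as_completed(futures), start=1):
            path = futures[future]
            try:
                results[path] = future.result()
            except Exception as exc:
                # A failing file is recorded and does not abort the batch.
                errors[path] = str(exc)
            if on_progress:
                on_progress(batch_id, done, len(futures))
    return batch_id, results, errors
```

`max_workers` plays the role of the configured worker limit. A process pool may suit CPU-bound OCR better; threads are used here only to keep the example short.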

#### Scenario: Batch processing with partial failure
- **WHEN** a batch contains 5 valid images and 2 corrupted files
- **THEN** the system processes all valid files successfully
- **AND** logs errors for corrupted files with specific error messages
- **AND** marks the batch as "partially completed"
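
The batch outcome can be derived from per-file counts. The sketch below uses the task status values listed under OCR Result Structure (completed/failed/partial); mapping "partially completed" to `"partial"`, and treating an empty batch as completed, are assumptions.

```python
def batch_status(succeeded: int, failed: int) -> str:
    """Derive an overall batch status from per-file outcome counts."""
    if failed == 0:
        return "completed"
    if succeeded == 0:
        return "failed"
    return "partial"
```

With the numbers from this scenario, `batch_status(5, 2)` evaluates to `"partial"`.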

### Requirement: Image Preprocessing
The system SHALL provide optional image preprocessing to improve OCR accuracy.

#### Scenario: Low contrast image enhancement
- **WHEN** user enables preprocessing for a low-contrast image
- **THEN** the system applies contrast adjustment and denoising
- **AND** performs OCR on the enhanced image
- **AND** returns results with higher recognition accuracy than OCR on the original image

#### Scenario: Skipped preprocessing
- **WHEN** user disables the preprocessing option
- **THEN** the system performs OCR directly on the original image
- **AND** completes faster than it would with preprocessing enabled
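
To illustrate the optional preprocessing switch, here is a minimal sketch that applies min-max contrast stretching to grayscale pixel values. This is one simple contrast adjustment chosen for the example, not necessarily the method the system uses. Denoising is left out, and a real implementation would likely rely on an imaging library.

```python
def stretch_contrast(pixels: list[int]) -> list[int]:
    """Linearly rescale 8-bit grayscale values to span the full 0..255 range."""
    lo, hi = min(pixels), max(pixels)
    if hi == lo:
        return list(pixels)  # flat image: nothing to stretch
    return [round((p - lo) * 255 / (hi - lo)) for p in pixels]


def prepare(pixels: list[int], preprocess: bool) -> list[int]:
    """Return the pixels to feed into OCR, honoring the preprocessing option."""
    return stretch_contrast(pixels) if preprocess else list(pixels)
```

When `preprocess` is false the input is passed through unchanged, which matches the skipped-preprocessing scenario.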

### Requirement: Confidence Threshold Filtering
The system SHALL filter OCR results based on a configurable confidence threshold.

#### Scenario: High confidence filter
- **WHEN** user sets confidence threshold to 0.8
- **THEN** the system returns only text segments with confidence >= 0.8
- **AND** discards low-confidence results

#### Scenario: Include all results
- **WHEN** user sets confidence threshold to 0.0
- **THEN** the system returns all recognized text regardless of confidence
- **AND** includes confidence scores in output
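
Both scenarios reduce to a single comparison. The sketch below assumes each text segment is a dict with a `confidence` key in the range 0.0 to 1.0; that shape is illustrative rather than prescribed.

```python
def filter_by_confidence(segments: list[dict], threshold: float = 0.0) -> list[dict]:
    """Keep segments whose confidence is greater than or equal to the threshold."""
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold must be between 0.0 and 1.0")
    return [s for s in segments if s["confidence"] >= threshold]
```

A threshold of 0.0 keeps every segment, and the confidence scores stay attached to whatever is returned.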

### Requirement: OCR Result Structure
The system SHALL return OCR results in multiple formats (JSON, Markdown) with extracted text, images, and structure metadata.

#### Scenario: Successful OCR result with multiple formats
- **WHEN** OCR processing completes successfully
- **THEN** the system returns JSON containing:
  - File metadata (name, size, format, upload timestamp)
  - Detected text regions with bounding boxes (x, y, width, height)
  - Recognized text content for each region
  - Confidence scores (0.0 to 1.0)
  - Detected language
  - Layout element types (title, paragraph, table, image, formula)
  - Reading order sequence
  - List of extracted image files with paths
  - Processing time
  - Task status (completed/failed/partial)
- **AND** generates Markdown file with logical structure
- **AND** saves extracted images to storage directory
- **AND** provides methods to export as searchable PDF with images
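
For orientation, one possible shape of the JSON result is sketched below. Every key name and every value is a placeholder invented for this example; the specification only fixes which pieces of information must be present.

```python
import json

# Placeholder values for illustration only; key names are assumptions.
result = {
    "file": {
        "name": "invoice.png",
        "size": 204800,
        "format": "png",
        "uploaded_at": "2025-01-01T00:00:00Z",
    },
    "status": "completed",            # completed / failed / partial
    "language": "zh",
    "processing_time_seconds": 1.42,
    "regions": [
        {
            "order": 0,               # reading order sequence
            "type": "title",          # layout element type
            "text": "发票",
            "confidence": 0.98,       # 0.0 to 1.0
            "bbox": {"x": 40, "y": 32, "width": 180, "height": 48},
        },
    ],
    "images": ["images/figure_1.jpg"],
}

# ensure_ascii=False keeps Chinese text readable in the serialized output.
payload = json.dumps(result, ensure_ascii=False, indent=2)
```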

#### Scenario: Searchable PDF generation with images
- **WHEN** user requests PDF export from OCR results
- **THEN** the system converts Markdown to HTML with basic CSS styling
- **AND** embeds extracted images in their logical positions (not exact original positions)
- **AND** generates PDF using Pandoc + WeasyPrint
- **AND** preserves document hierarchy, tables, and reading order
- **AND** applies appropriate fonts for Chinese characters
- **AND** produces searchable PDF (text is selectable and searchable)
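
One plausible reading of "Pandoc + WeasyPrint" is to let Pandoc convert the Markdown through HTML and hand it to WeasyPrint as the PDF engine. The sketch below only builds the command line and a stylesheet. The function name, file paths, and the Noto CJK font choice are assumptions, and the fonts must be installed on the host.

```python
# Basic stylesheet; the font family is an assumption and must be installed.
CJK_CSS = """\
body { font-family: "Noto Sans CJK SC", "Noto Sans CJK TC", sans-serif; }
table { border-collapse: collapse; }
td, th { border: 1px solid #999; padding: 4px; }
img { max-width: 100%; }
"""


def build_pdf_command(markdown_path: str, pdf_path: str, css_path: str) -> list[str]:
    """Build a pandoc invocation that renders Markdown to PDF via WeasyPrint."""
    return [
        "pandoc",
        markdown_path,
        "--from=markdown",
        "--pdf-engine=weasyprint",
        f"--css={css_path}",
        "-o",
        pdf_path,
    ]
```

The resulting list can be passed to `subprocess.run(..., check=True)` after `CJK_CSS` has been written to `css_path`. Because the text is rendered from HTML rather than rasterized, it remains selectable and searchable in the PDF.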

### Requirement: Document Translation (Reserved Architecture)
The system SHALL provide architecture and UI placeholders for future document translation features.

#### Scenario: Translation option visibility (UI placeholder)
- **WHEN** user views OCR result page
- **THEN** the system displays a "Translate Document" button (disabled or labeled "Coming Soon")
- **AND** shows target language selection dropdown (disabled)
- **AND** provides tooltip: "Translation feature will be available in future release"

#### Scenario: Translation API endpoint (reserved)
- **WHEN** backend API is queried for translation endpoints
- **THEN** the system provides `/api/v1/translate/document` endpoint specification
- **AND** returns HTTP 501 (Not Implemented) when called
- **AND** documents expected request/response format for future implementation
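
A framework-agnostic sketch of the reserved endpoint's behavior follows. The handler name and the response body fields are assumptions; only the path and the 501 status come from this scenario.

```python
import json
from http import HTTPStatus

TRANSLATE_DOCUMENT_PATH = "/api/v1/translate/document"


def translate_document_stub(request_body: dict) -> tuple[int, str]:
    """Placeholder handler: always answers 501 until translation is implemented."""
    status = HTTPStatus.NOT_IMPLEMENTED
    body = {
        "error": status.phrase,  # "Not Implemented"
        "detail": "Translation feature will be available in future release",
    }
    return status.value, json.dumps(body)
```

Whichever web framework is used, the route registered at `TRANSLATE_DOCUMENT_PATH` would return this status and body unchanged.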

#### Scenario: Translation configuration storage (database schema)
- **WHEN** database schema is created
- **THEN** the system includes `translation_configs` table
- **AND** defines columns: id, user_id, source_lang, target_lang, engine_type, engine_config, created_at
- **AND** table remains empty until translation feature is implemented
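
The reserved table could be created as sketched below, using SQLite purely for illustration. The column names come from this scenario; the column types, constraints, and the choice of database engine are assumptions.

```python
import sqlite3

# Column names follow the scenario; types and constraints are assumptions.
SCHEMA = """
CREATE TABLE IF NOT EXISTS translation_configs (
    id            INTEGER PRIMARY KEY,
    user_id       INTEGER NOT NULL,
    source_lang   TEXT NOT NULL,
    target_lang   TEXT NOT NULL,
    engine_type   TEXT NOT NULL,
    engine_config TEXT,
    created_at    TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
);
"""


def create_schema(conn: sqlite3.Connection) -> None:
    """Create the reserved table; no rows are inserted."""
    conn.executescript(SCHEMA)
```

Storing `engine_config` as text leaves room for a JSON-encoded blob once an engine is chosen.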