egg/OCR

egg cd3cbea49d chore: project cleanup and prepare for dual-track processing refactor

- Removed all test files and directories
- Deleted outdated documentation (will be rewritten)
- Cleaned up temporary files, logs, and uploads
- Archived 5 completed OpenSpec proposals
- Created new dual-track-document-processing proposal with complete OpenSpec structure
  - Dual-track architecture: OCR track (PaddleOCR) + Direct track (PyMuPDF)
  - UnifiedDocument model for consistent output
  - Support for structure-preserving translation
- Updated .gitignore to prevent future test/temp files

This is a major cleanup preparing for the complete refactoring of the document processing pipeline.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

2025-11-18 20:02:31 +08:00

3.5 KiB

Result Export Spec Delta

MODIFIED Requirements

Requirement: Export Interface

The Export page SHALL support downloading OCR results in multiple formats using V2 task APIs, with processing track information and enhanced structure data.

Scenario: Export page uses V2 download endpoints

WHEN user selects a format and clicks export button
THEN frontend SHALL call V2 endpoint /api/v2/tasks/{task_id}/download/{format}
AND frontend SHALL NOT call the legacy V1-style /api/v2/export endpoint (which returns 404)
AND file SHALL download successfully
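
A minimal sketch of how a client could assemble the V2 download path named in this scenario. The helper name `build_download_url`, the lower-casing of the format segment, and the percent-encoding step are assumptions for illustration; only the path template comes from the spec.

```python
from urllib.parse import quote

# Path prefix taken from the scenario: /api/v2/tasks/{task_id}/download/{format}
API_PREFIX = "/api/v2/tasks"


def build_download_url(task_id: str, fmt: str) -> str:
    """Return the V2 download path for one task and one export format.

    Hypothetical helper: lower-casing `fmt` is an assumption, not a spec rule.
    """
    if not task_id or not fmt:
        raise ValueError("task_id and fmt are both required")
    # Percent-encode both segments so an unusual ID cannot change the path shape.
    return f"{API_PREFIX}/{quote(task_id, safe='')}/download/{quote(fmt.lower(), safe='')}"
```

A caller would pass the result to whatever HTTP client the frontend or a script already uses; the legacy export path is never constructed here.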

Scenario: Export supports multiple formats

WHEN user exports a completed task
THEN system SHALL support downloading as TXT, JSON, Excel, Markdown, and PDF
AND each format SHALL use correct V2 download endpoint
AND downloaded files SHALL contain task OCR results
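
The five formats can be kept in one lookup table so that every export path goes through the same validation. This is a sketch under assumptions: the format keys (`txt`, `json`, `excel`, `markdown`, `pdf`), the choice of `.xlsx` for Excel, and the helper `export_filename` are not stated in the spec.

```python
# Assumed format keys mapped to (MIME type, file extension).
# The MIME types are the standard registered ones; the keys are hypothetical.
EXPORT_FORMATS = {
    "txt": ("text/plain", ".txt"),
    "json": ("application/json", ".json"),
    "excel": (
        "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
        ".xlsx",  # assumption: Excel export is written as .xlsx
    ),
    "markdown": ("text/markdown", ".md"),
    "pdf": ("application/pdf", ".pdf"),
}


def export_filename(task_id: str, fmt: str) -> str:
    """Build a download filename, rejecting formats outside the table."""
    key = fmt.lower()
    if key not in EXPORT_FORMATS:
        raise ValueError(f"unsupported export format: {fmt!r}")
    _, extension = EXPORT_FORMATS[key]
    return f"{task_id}{extension}"
```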

Scenario: Export includes processing track metadata

WHEN user exports a task processed through dual-track system
THEN exported JSON SHALL include "processing_track" field indicating "ocr" or "direct"
AND SHALL include "processing_metadata" with track-specific information
AND SHALL maintain backward compatibility for clients not expecting these fields
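
One way to read the backward-compatibility clause is that the new fields are purely additive, so a consumer can treat them as optional. The sketch below shows a tolerant reader; the function name `read_track` and the sample payload keys other than `processing_track` and `processing_metadata` are assumptions.

```python
import json
from typing import Optional, Tuple


def read_track(export_json: str) -> Tuple[Optional[str], dict]:
    """Return (processing_track, processing_metadata) from an exported JSON string.

    Exports produced before the dual-track change simply lack both fields,
    so missing keys yield (None, {}) rather than an error.
    """
    data = json.loads(export_json)
    track = data.get("processing_track")
    if track not in (None, "ocr", "direct"):
        raise ValueError(f"unexpected processing_track: {track!r}")
    return track, data.get("processing_metadata", {})
```

A client that ignores both fields keeps working unchanged, which is the behavior the scenario asks for.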

Scenario: Export UnifiedDocument format

WHEN user requests JSON export with unified=true parameter
THEN system SHALL return the UnifiedDocument structure
AND the response SHALL include the complete element hierarchy with coordinates
AND the response SHALL preserve all PP-StructureV3 element types for OCR-track documents
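
The spec names UnifiedDocument but does not define its fields here. The following is an illustrative shape only: every class and field name below (`Element`, `Page`, `element_type`, `bbox`, `children`) is hypothetical and chosen to show a nested hierarchy with coordinates, not to document the real model.

```python
from dataclasses import asdict, dataclass, field
from typing import List, Tuple


@dataclass
class Element:
    element_id: str
    element_type: str  # e.g. an element type label from the OCR track
    bbox: Tuple[float, float, float, float]  # (x0, y0, x1, y1), assumed layout
    text: str = ""
    children: List["Element"] = field(default_factory=list)


@dataclass
class Page:
    page_number: int
    width: float
    height: float
    elements: List[Element] = field(default_factory=list)


@dataclass
class UnifiedDocument:
    processing_track: str  # "ocr" or "direct"
    pages: List[Page] = field(default_factory=list)

    def to_dict(self) -> dict:
        """Serialize the whole hierarchy into plain dicts and lists."""
        return asdict(self)
```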

ADDED Requirements

Requirement: Enhanced PDF Export with Layout Preservation

The PDF export SHALL accurately preserve document layout from both OCR and direct extraction tracks.

Scenario: Export PDF from direct extraction track

WHEN exporting PDF from a direct-extraction processed document
THEN the PDF SHALL maintain exact text positioning from source
AND preserve original fonts and styles where possible
AND include extracted images at correct positions

Scenario: Export PDF from OCR track with full structure

WHEN exporting PDF from OCR-processed document
THEN the PDF SHALL render every one of the 23 PP-StructureV3 element types present in the result
AND render tables with proper cell boundaries
AND maintain reading order from parsing_res_list

Scenario: Handle coordinate transformations

WHEN generating PDF from UnifiedDocument
THEN system SHALL correctly transform bbox coordinates to PDF space
AND handle page size variations
AND prevent text overlap using enhanced overlap detection
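
A sketch of the coordinate step, assuming source boxes are in pixels with a top-left origin (typical for image-based OCR output) and the target is default PDF user space with a bottom-left origin. Both function names are hypothetical, and `boxes_overlap` is a plain axis-aligned rectangle test, not the "enhanced overlap detection" the spec refers to, whose rules are not given here.

```python
def bbox_to_pdf(bbox, src_size, page_size):
    """Map (x0, y0, x1, y1) from source space to PDF space.

    Assumes source origin is top-left and PDF origin is bottom-left.
    Scaling per axis handles pages whose sizes differ from the source image.
    """
    x0, y0, x1, y1 = bbox
    src_w, src_h = src_size
    page_w, page_h = page_size
    sx, sy = page_w / src_w, page_h / src_h
    # Flip the Y axis: the source bottom edge (y1) becomes the lower PDF edge.
    return (x0 * sx, page_h - y1 * sy, x1 * sx, page_h - y0 * sy)


def boxes_overlap(a, b, tolerance=0.0):
    """Return True if two boxes intersect by more than `tolerance` on both axes."""
    return (
        a[0] < b[2] - tolerance
        and b[0] < a[2] - tolerance
        and a[1] < b[3] - tolerance
        and b[1] < a[3] - tolerance
    )
```

For example, a box at (100, 200, 300, 400) on a 1000x2000 source maps to (50.0, 800.0, 150.0, 900.0) on a 500x1000 page: both axes scale by 0.5 and the vertical extent is mirrored.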

Requirement: Structure Data Export

The system SHALL provide export formats that preserve document structure for downstream processing.

Scenario: Export structured JSON with hierarchy

WHEN user selects structured JSON format
THEN export SHALL include element hierarchy and relationships
AND preserve parent-child relationships (sections, lists)
AND include style and formatting information
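
If elements are stored flat with a reference to their parent, the hierarchy in this scenario can be rebuilt at export time. The sketch assumes each element is a dict with hypothetical `id` and `parent_id` keys; the real storage layout is not described in the spec.

```python
def build_tree(elements):
    """Nest a flat element list into parent-child form.

    Assumes each element has a unique "id" and an optional "parent_id".
    Elements whose parent is missing or None become roots.
    """
    nodes = {el["id"]: {**el, "children": []} for el in elements}
    roots = []
    for el in elements:
        node = nodes[el["id"]]
        parent = nodes.get(el.get("parent_id"))
        if parent is not None:
            parent["children"].append(node)
        else:
            roots.append(node)
    return roots
```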

Scenario: Export for translation preparation

WHEN user exports with translation_ready=true parameter
THEN export SHALL include translatable text segments
AND maintain coordinate mappings for each segment
AND mark non-translatable regions
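
A possible shape for the translation-ready export, shown only to make the three clauses concrete. The segment keys, the function name, and the set of non-translatable element types are all assumptions; the spec does not say which regions count as non-translatable.

```python
# Hypothetical set of element types that carry no translatable text.
NON_TRANSLATABLE_TYPES = {"image", "formula"}


def to_translation_segments(elements):
    """Turn elements into segments that keep their coordinates.

    Every element yields a segment so coordinate mappings stay complete;
    regions without translatable text are flagged rather than dropped.
    """
    segments = []
    for index, el in enumerate(elements):
        text = el.get("text", "")
        translatable = el["type"] not in NON_TRANSLATABLE_TYPES and bool(text.strip())
        segments.append(
            {
                "segment_id": index,
                "text": text,
                "bbox": el["bbox"],
                "translatable": translatable,
            }
        )
    return segments
```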

Scenario: Export with layout analysis

WHEN user requests layout analysis export
THEN system SHALL include reading order indices
AND identify layout regions (header, body, footer, sidebar)
AND provide confidence scores for layout detection
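
To illustrate what a layout analysis record might carry, the sketch below assigns reading order by a naive top-to-bottom, left-to-right sort and labels regions by vertical position. This heuristic is an assumption and stands in for whatever layout model the pipeline uses; it does not detect sidebars, and the confidence value is passed through from the input rather than computed.

```python
def layout_export(elements, page_height, margin=0.1):
    """Build layout records with reading order, region, and confidence.

    Assumes top-left origin boxes as (x0, y0, x1, y1). An element whose
    vertical center lies in the top or bottom `margin` fraction of the page
    is labeled header or footer; everything else is body.
    """
    ordered = sorted(elements, key=lambda el: (el["bbox"][1], el["bbox"][0]))
    records = []
    for index, el in enumerate(ordered):
        y_center = (el["bbox"][1] + el["bbox"][3]) / 2
        if y_center < page_height * margin:
            region = "header"
        elif y_center > page_height * (1 - margin):
            region = "footer"
        else:
            region = "body"
        records.append(
            {
                "id": el["id"],
                "reading_order": index,
                "region": region,
                "confidence": el.get("confidence"),  # None if the source has none
            }
        )
    return records
```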
