2025-11-18 20:02:31 +08:00

Result Export - Delta Changes

ADDED Requirements

Requirement: Image Extraction and Persistence

The OCR system SHALL save extracted images to disk during layout analysis for later use in PDF generation.

Scenario: Images extracted by PP-StructureV3 are saved to disk

WHEN OCR processes a document containing images (charts, tables, figures)
THEN system SHALL extract image objects from markdown_images dictionary
AND system SHALL create imgs/ subdirectory in result folder
AND system SHALL save each image object to disk using PIL Image.save()
AND saved file paths SHALL match paths recorded in JSON images_metadata
AND system SHALL log warnings for failed image saves but continue processing
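
The save-and-continue behaviour in this scenario could look roughly like the sketch below. It assumes that markdown_images maps relative paths (for example "imgs/figure_1.png") to objects with a PIL-style save() method; the function name save_markdown_images and the result_dir parameter are hypothetical and not taken from the codebase.

```python
import logging
from pathlib import Path
from typing import Any, Dict, List

logger = logging.getLogger(__name__)


def save_markdown_images(markdown_images: Dict[str, Any], result_dir: Path) -> List[str]:
    """Save each image under result_dir and return the relative paths written.

    Assumption: keys are relative paths such as "imgs/figure_1.png" and
    values expose a PIL-style save(path) method.
    """
    # The spec asks for an imgs/ subdirectory in the result folder.
    (result_dir / "imgs").mkdir(parents=True, exist_ok=True)

    saved: List[str] = []
    for rel_path, image in markdown_images.items():
        target = result_dir / rel_path
        try:
            target.parent.mkdir(parents=True, exist_ok=True)
            image.save(target)
            saved.append(rel_path)
        except Exception as exc:
            # A failed save is logged as a warning; processing continues.
            logger.warning("Failed to save image %s: %s", rel_path, exc)
    return saved
```

Returning the list of paths actually written gives the caller something to compare against images_metadata, so entries for failed saves can be detected before PDF generation.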

Scenario: Multi-page documents with images on different pages

WHEN OCR processes multi-page PDF with images on multiple pages
THEN system SHALL save images from all pages to same imgs/ folder
AND image filenames SHALL include bbox coordinates for uniqueness
AND images SHALL be available for PDF generation after OCR completes
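
One possible way to derive a filename from bbox coordinates is sketched below. The naming pattern, the region_type parameter, and the assumption that a bbox is a flat [x0, y0, x1, y1] sequence are illustrative only; the spec does not fix any of them.

```python
from typing import Sequence


def image_filename(region_type: str, bbox: Sequence[float], ext: str = "png") -> str:
    """Build a filename that embeds the rounded bbox coordinates.

    Assumption: bbox is [x0, y0, x1, y1]; the pattern itself is hypothetical.
    """
    x0, y0, x1, y1 = (int(round(v)) for v in bbox)
    return f"img_in_{region_type}_box_{x0}_{y0}_{x1}_{y1}.{ext}"


print(image_filename("table", [10, 20.4, 300, 400.6]))
# img_in_table_box_10_20_300_401.png
```

Because all pages share one imgs/ folder, two images with identical boxes on different pages would map to the same name under this pattern. The spec does not say how that case is handled.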

Requirement: Layout-Preserving PDF Generation

The system SHALL generate PDF files that preserve the original document layout using OCR JSON data.

Scenario: PDF generated from JSON with accurate layout

WHEN user requests PDF download for a completed task
THEN system SHALL parse OCR JSON result file
AND system SHALL extract bounding box coordinates for each text region
AND system SHALL determine page dimensions from source file or bbox maximum values
AND system SHALL generate PDF with text positioned at precise coordinates
AND system SHALL use Chinese-compatible font (e.g., Noto Sans CJK)
AND system SHALL embed images from imgs/ folder using paths in images_metadata
AND generated PDF SHALL visually resemble original document layout with images
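
The page-dimension fallback in this scenario can be sketched as follows, assuming each bbox is a flat [x0, y0, x1, y1] sequence with a top-left origin. The helper names page_size and to_bottom_left are hypothetical. The second helper only matters if the PDF library in use places its origin at the bottom-left corner of the page.

```python
from typing import Iterable, Optional, Sequence, Tuple


def page_size(
    bboxes: Iterable[Sequence[float]],
    source_size: Optional[Tuple[float, float]] = None,
) -> Tuple[float, float]:
    """Return (width, height), preferring the source file's size when known."""
    if source_size is not None:
        return source_size
    boxes = list(bboxes)
    if not boxes:
        raise ValueError("no source size and no bounding boxes to infer from")
    # Fall back to the largest right and bottom edges seen on the page.
    width = max(box[2] for box in boxes)
    height = max(box[3] for box in boxes)
    return (width, height)


def to_bottom_left(bbox: Sequence[float], page_height: float) -> Tuple[float, float, float, float]:
    """Convert a top-left-origin bbox to bottom-left-origin coordinates."""
    x0, y0, x1, y1 = bbox
    return (x0, page_height - y1, x1, page_height - y0)
```

Inferring the size from bbox maxima can only underestimate the page, since margins beyond the last region are not visible in the OCR data. This is one reason to prefer the source file's dimensions when they are available.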

Scenario: PDF download works correctly

WHEN user clicks PDF download button
THEN system SHALL return cached PDF if already generated
OR system SHALL generate new PDF from JSON on first request
AND system SHALL NOT return 403 Forbidden error
AND downloaded PDF SHALL contain task OCR results with layout preserved

Scenario: Multi-page PDF generation

WHEN OCR JSON contains results for multiple pages
THEN generated PDF SHALL contain same number of pages
AND each page SHALL display text regions for that page only
AND page dimensions SHALL match original document pages
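
Keeping each page's text regions separate could be done with a simple grouping step like the one below. It assumes every region in the OCR JSON is a dict carrying a "page" index; that key name and the function name regions_by_page are assumptions, not the actual JSON schema.

```python
from collections import defaultdict
from typing import Any, Dict, Iterable, List


def regions_by_page(regions: Iterable[Dict[str, Any]]) -> Dict[int, List[Dict[str, Any]]]:
    """Group text regions by page index, preserving order within each page.

    Assumption: each region has an integer "page" key.
    """
    pages: Dict[int, List[Dict[str, Any]]] = defaultdict(list)
    for region in regions:
        pages[region["page"]].append(region)
    # Sort by page index so pages are emitted in document order.
    return dict(sorted(pages.items()))
```

Note that a page with no text regions does not appear in the result, so the generator would still need the page count from the OCR JSON to emit the same number of pages as the source.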

MODIFIED Requirements

Requirement: Export Interface

The Export page SHALL support downloading OCR results in multiple formats using V2 task APIs.

Scenario: PDF caching improves performance

WHEN user downloads same PDF multiple times
THEN system SHALL serve cached PDF file on subsequent requests
AND system SHALL NOT regenerate PDF unless JSON changes
AND response time for a cached download SHALL be shorter than initial generation time
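
A minimal sketch of the caching rule, assuming that "JSON changes" is detected by comparing file modification times. The function name get_pdf and the generate callback are hypothetical stand-ins for whatever actually renders the PDF.

```python
from pathlib import Path
from typing import Callable


def get_pdf(json_path: Path, pdf_path: Path, generate: Callable[[Path, Path], None]) -> Path:
    """Return pdf_path, regenerating only if it is missing or older than the JSON.

    Assumption: generate(json_path, pdf_path) writes the PDF to pdf_path.
    """
    stale = (
        not pdf_path.exists()
        or pdf_path.stat().st_mtime_ns < json_path.stat().st_mtime_ns
    )
    if stale:
        generate(json_path, pdf_path)
    return pdf_path
```

Modification times are cheap to check but can be misleading if files are copied or restored. Storing a hash of the JSON alongside the cached PDF would be a stricter alternative.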
