Result Export - Delta Changes

ADDED Requirements

Requirement: Image Extraction and Persistence

The OCR system SHALL save extracted images to disk during layout analysis for later use in PDF generation.

Scenario: Images extracted by PP-StructureV3 are saved to disk

  • WHEN OCR processes a document containing images (charts, tables, figures)
  • THEN system SHALL extract image objects from markdown_images dictionary
  • AND system SHALL create imgs/ subdirectory in result folder
  • AND system SHALL save each image object to disk using PIL Image.save()
  • AND saved file paths SHALL match paths recorded in JSON images_metadata
  • AND system SHALL log warnings for failed image saves but continue processing
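
The save step in this scenario could look roughly like the sketch below. It is an illustration under stated assumptions, not the project's implementation: the function name `save_markdown_images` is hypothetical, and `markdown_images` is assumed to map a relative path (such as `imgs/figure.jpg`) to a PIL-style object that exposes `.save(path)`.

```python
from __future__ import annotations

import logging
from pathlib import Path
from typing import Any, Mapping

logger = logging.getLogger(__name__)


def save_markdown_images(markdown_images: Mapping[str, Any], result_dir: Path) -> list[str]:
    """Save each image under result_dir and return the relative paths written.

    Assumption: keys are relative paths such as "imgs/figure.jpg" and values
    are PIL-style objects with a .save(path) method.
    """
    saved: list[str] = []
    for rel_path, image in markdown_images.items():
        target = result_dir / rel_path
        try:
            # Creates imgs/ (or whichever parent folder the key names) on demand.
            target.parent.mkdir(parents=True, exist_ok=True)
            image.save(str(target))
            saved.append(rel_path)
        except Exception as exc:
            # A failed save is logged as a warning; processing continues.
            logger.warning("Failed to save image %s: %s", rel_path, exc)
    return saved
```

Returning the list of paths that were actually written makes it straightforward to check them against the paths recorded in `images_metadata`.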

Scenario: Multi-page documents with images on different pages

  • WHEN OCR processes multi-page PDF with images on multiple pages
  • THEN system SHALL save images from all pages to same imgs/ folder
  • AND image filenames SHALL include bbox coordinates for uniqueness
  • AND images SHALL be available for PDF generation after OCR completes
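
One way to embed bbox coordinates in a filename is sketched below. The helper name `image_filename`, the `img` prefix, the `.jpg` extension, and the `[x0, y0, x1, y1]` bbox layout are all assumptions made for illustration; the naming scheme actually produced by PP-StructureV3 may differ.

```python
from typing import Sequence


def image_filename(bbox: Sequence[float], prefix: str = "img", ext: str = "jpg") -> str:
    """Build a filename that embeds the bounding box, e.g. img_10_20_110_220.jpg.

    Assumption: bbox is [x0, y0, x1, y1]; coordinates are rounded to integers.
    """
    x0, y0, x1, y1 = (int(round(v)) for v in bbox)
    return f"{prefix}_{x0}_{y0}_{x1}_{y1}.{ext}"
```

If images on different pages could ever share an identical bbox, passing a page-specific `prefix` would keep the names distinct within the shared imgs/ folder. The requirement itself only asks for the bbox coordinates.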

Requirement: Layout-Preserving PDF Generation

The system SHALL generate PDF files that preserve the original document layout using OCR JSON data.

Scenario: PDF generated from JSON with accurate layout

  • WHEN user requests PDF download for a completed task
  • THEN system SHALL parse OCR JSON result file
  • AND system SHALL extract bounding box coordinates for each text region
  • AND system SHALL determine page dimensions from the source file or from the maximum bbox coordinates
  • AND system SHALL generate PDF with text positioned at precise coordinates
  • AND system SHALL use Chinese-compatible font (e.g., Noto Sans CJK)
  • AND system SHALL embed images from imgs/ folder using paths in images_metadata
  • AND generated PDF SHALL visually resemble original document layout with images
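
Two pieces of the positioning logic can be sketched without committing to a PDF library: inferring a page size from bbox maxima, and converting a bbox to PDF coordinates. The sketch assumes each bbox is `[x0, y0, x1, y1]` in image coordinates with the origin at the top-left, and that drawing uses PDF's default user space, whose origin is at the bottom-left. The function names are hypothetical.

```python
from typing import Iterable, Sequence, Tuple

BBox = Sequence[float]  # assumed [x0, y0, x1, y1], origin at top-left


def page_size_from_bboxes(bboxes: Iterable[BBox]) -> Tuple[float, float]:
    """Infer (width, height) as the largest x1 and y1 across all regions."""
    boxes = list(bboxes)
    if not boxes:
        raise ValueError("cannot infer page size without any bounding boxes")
    width = max(b[2] for b in boxes)
    height = max(b[3] for b in boxes)
    return width, height


def to_pdf_rect(bbox: BBox, page_height: float) -> Tuple[float, float, float, float]:
    """Convert a top-left-origin bbox to (x, y, width, height) with a bottom-left origin."""
    x0, y0, x1, y1 = bbox
    # The lower edge of the box (y1) becomes the rectangle's y after flipping the axis.
    return x0, page_height - y1, x1 - x0, y1 - y0
```

A page size inferred this way is only a lower bound, because any margin beyond the last region is not visible in the bbox data.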

Scenario: PDF download works correctly

  • WHEN user clicks PDF download button
  • THEN system SHALL return cached PDF if already generated
  • OR system SHALL generate new PDF from JSON on first request
  • AND system SHALL NOT return 403 Forbidden error
  • AND downloaded PDF SHALL contain task OCR results with layout preserved

Scenario: Multi-page PDF generation

  • WHEN OCR JSON contains results for multiple pages
  • THEN generated PDF SHALL contain same number of pages
  • AND each page SHALL display text regions for that page only
  • AND page dimensions SHALL match original document pages
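
The per-page split could be sketched as below. The JSON schema is not given here, so the 1-based `page` field on each region and the separately supplied `page_count` are assumptions, as is the function name.

```python
from typing import Any, Dict, Iterable, List

Region = Dict[str, Any]


def regions_per_page(regions: Iterable[Region], page_count: int) -> List[List[Region]]:
    """Return one list of regions per page, in page order.

    Assumption: each region carries a 1-based "page" field.
    """
    pages: List[List[Region]] = [[] for _ in range(page_count)]
    for region in regions:
        index = region["page"] - 1
        if not 0 <= index < page_count:
            raise ValueError(f"region refers to page {region['page']}, outside 1..{page_count}")
        pages[index].append(region)
    return pages
```

Allocating a list for every page up front keeps pages with no text regions in the output, so the page count of the generated PDF can match the source.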

MODIFIED Requirements

Requirement: Export Interface

The Export page SHALL support downloading OCR results in multiple formats using V2 task APIs.

Scenario: PDF caching improves performance

  • WHEN user downloads same PDF multiple times
  • THEN system SHALL serve cached PDF file on subsequent requests
  • AND system SHALL NOT regenerate PDF unless JSON changes
  • AND response time for cached downloads SHALL be shorter than the initial generation time
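
A minimal sketch of the cache check follows. It treats "JSON changes" as "the JSON file's modification time is newer than the cached PDF's", which is only one possible interpretation; a content hash would be another. The function name and the `generate` callback are hypothetical.

```python
from pathlib import Path
from typing import Callable


def get_or_generate_pdf(
    json_path: Path,
    pdf_path: Path,
    generate: Callable[[Path, Path], None],
) -> Path:
    """Return pdf_path, regenerating only if it is missing or older than the JSON."""
    stale = (
        not pdf_path.exists()
        or pdf_path.stat().st_mtime < json_path.stat().st_mtime
    )
    if stale:
        # generate(json_path, pdf_path) is expected to write the PDF to pdf_path.
        generate(json_path, pdf_path)
    return pdf_path
```

Passing the generator in as a callback keeps the cache decision separate from the PDF rendering, so the same check serves both the first request and later downloads.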