OCR/openspec/changes/archive/2025-11-18-add-office-document-support/specs/file-processing/spec.md

File Processing Specification Delta

ADDED Requirements

Requirement: Office Document Support

The system SHALL support processing of Microsoft Office document formats including Word documents (.doc, .docx) and PowerPoint presentations (.ppt, .pptx).

Scenario: Upload and Process Word Document

Given a user has a Word document containing text and tables
When the user uploads the .docx file
Then the system converts it to PDF format
And extracts all text using OCR
And preserves table structure in the output

Scenario: Upload and Process PowerPoint

Given a user has a PowerPoint presentation with multiple slides
When the user uploads the .pptx file
Then the system converts each slide to an image
And performs OCR on each slide
And maintains slide order in the results
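
One way to satisfy the ordering clause in this scenario is to attach an explicit slide number to every OCR result. The following is a minimal sketch, not the project's implementation: `SlideResult`, `ocr_slides`, and the `run_ocr` callable are hypothetical names, and the slide images are assumed to arrive in presentation order from the conversion step.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class SlideResult:
    slide_number: int  # 1-based position in the presentation
    text: str          # text recognized on that slide


def ocr_slides(slide_images: Iterable[bytes],
               run_ocr: Callable[[bytes], str]) -> List[SlideResult]:
    """OCR each slide image and tag the result with its slide number.

    `run_ocr` is a placeholder for whatever OCR engine is used; it takes
    the raw image bytes of one slide and returns the recognized text.
    """
    return [
        SlideResult(slide_number=number, text=run_ocr(image))
        for number, image in enumerate(slide_images, start=1)
    ]
```

Because each result carries its own `slide_number`, later stages can sort or merge results without relying on list position.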

Requirement: Document Conversion Pipeline

The system SHALL implement a multi-stage conversion pipeline for Office documents using LibreOffice or equivalent tools.

Scenario: Conversion Error Handling

Given an Office document with unsupported features
When the conversion process encounters an error
Then the system logs the specific error details
And returns a user-friendly error message
And marks the file as failed with reason
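
The conversion step and its error handling could be sketched as follows. This is an illustration under stated assumptions rather than the actual pipeline: it assumes the LibreOffice `soffice` binary is on `PATH` and uses its `--headless`, `--convert-to`, and `--outdir` options. `ConversionResult`, `convert_to_pdf`, `USER_MESSAGE`, and the injectable `runner` parameter are hypothetical names introduced for the example.

```python
import logging
import subprocess
from dataclasses import dataclass
from pathlib import Path
from typing import Any, Callable, Optional

logger = logging.getLogger("office_conversion")

# Generic message shown to the user; technical details go to the log only.
USER_MESSAGE = "The document could not be converted. Please check the file and try again."


@dataclass
class ConversionResult:
    source: Path
    status: str                    # "converted" or "failed"
    output: Optional[Path] = None  # path of the generated PDF on success
    reason: Optional[str] = None   # user-facing failure reason


def convert_to_pdf(source: Path, out_dir: Path,
                   runner: Callable[..., Any] = subprocess.run,
                   timeout: int = 120) -> ConversionResult:
    """Convert an Office file to PDF via LibreOffice (assumes `soffice` is on PATH)."""
    cmd = ["soffice", "--headless", "--convert-to", "pdf",
           "--outdir", str(out_dir), str(source)]
    try:
        proc = runner(cmd, capture_output=True, text=True, timeout=timeout)
    except (OSError, subprocess.TimeoutExpired) as exc:
        # Missing binary, permission problem, or a hung conversion.
        logger.error("Conversion of %s did not complete: %r", source, exc)
        return ConversionResult(source, "failed", reason=USER_MESSAGE)

    target = Path(out_dir) / (Path(source).stem + ".pdf")
    # Check for the output file as well as the exit code, as a defensive measure.
    if proc.returncode != 0 or not target.exists():
        logger.error("Conversion of %s failed (exit %s): %s",
                     source, proc.returncode, proc.stderr)
        return ConversionResult(source, "failed", reason=USER_MESSAGE)

    return ConversionResult(source, "converted", output=target)
```

The specific error goes to the log, while the caller receives only a generic `reason`, which keeps internal details out of user-facing responses. Passing a fake `runner` lets the failure paths be exercised without LibreOffice installed.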

MODIFIED Requirements

Requirement: File Validation

The file validation module SHALL accept Office document formats (.doc, .docx, .ppt, and .pptx) in addition to the existing image and PDF formats.

Scenario: Validate Office File Upload

Given a user attempts to upload a file
When the file extension is .docx or .pptx
Then the system accepts the file for processing
And validates the MIME type matches the extension
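
The extension and MIME check described here might look like the sketch below. `ALLOWED_OFFICE_TYPES` and `validate_office_upload` are hypothetical names. The MIME strings are the standard registered types for these four formats. The example compares the MIME type declared with the upload, and inspecting the file content itself would be a separate, stronger check.

```python
from pathlib import Path

# Extension -> expected MIME type for each accepted Office format.
ALLOWED_OFFICE_TYPES = {
    ".doc": "application/msword",
    ".docx": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    ".ppt": "application/vnd.ms-powerpoint",
    ".pptx": "application/vnd.openxmlformats-officedocument.presentationml.presentation",
}


def validate_office_upload(filename: str, declared_mime: str) -> bool:
    """Return True if the extension is allowed and the MIME type matches it."""
    extension = Path(filename).suffix.lower()  # case-insensitive match
    expected = ALLOWED_OFFICE_TYPES.get(extension)
    if expected is None:
        return False  # not an Office extension handled by this check
    return declared_mime.strip().lower() == expected
```

For example, `validate_office_upload("deck.pptx", "application/pdf")` is rejected because the declared type does not match the `.pptx` extension.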

Requirement: JWT Token Validity

The JWT token validity period SHALL be extended from 30 minutes to 1440 minutes (24 hours) to improve user experience.

Scenario: Extended Token Usage

Given a user authenticates successfully
When they receive a JWT token
Then the token remains valid for 24 hours
And allows continuous API access without re-authentication
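
The 1440-minute validity period translates into the token's expiry claim as sketched below. This is a minimal illustration, not the project's authentication code: `ACCESS_TOKEN_EXPIRE_MINUTES`, `build_claims`, and `is_expired` are hypothetical names, and signing the token is left to whatever JWT library the system uses. The `sub`, `iat`, and `exp` claims are the registered JWT claim names, with times expressed as seconds since the Unix epoch.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

ACCESS_TOKEN_EXPIRE_MINUTES = 1440  # 24 hours, previously 30 minutes


def build_claims(subject: str, now: Optional[datetime] = None) -> dict:
    """Build the time-related claims for a newly issued token."""
    issued_at = now or datetime.now(timezone.utc)
    expires_at = issued_at + timedelta(minutes=ACCESS_TOKEN_EXPIRE_MINUTES)
    return {
        "sub": subject,
        "iat": int(issued_at.timestamp()),
        "exp": int(expires_at.timestamp()),
    }


def is_expired(claims: dict, now: Optional[datetime] = None) -> bool:
    """A token must be rejected at or after its `exp` time."""
    current = now or datetime.now(timezone.utc)
    return int(current.timestamp()) >= claims["exp"]
```

With these settings, `exp - iat` is 86400 seconds, so a token issued at noon is accepted until just before noon the next day.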