egg/OCR - OCR

egg/OCR

Fork 0

Commit Graph

Author	SHA1	Message	Date
egg	9f449e8a19	docs: add GPU memory management section to design.md - Document cleanup_gpu_memory() and check_gpu_memory() methods - Explain strategic cleanup points throughout OCR pipeline - Detail optional torch dependency and PaddlePaddle primary usage - List benefits and performance impact - Reference code locations with line numbers 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-11-20 16:42:23 +08:00
egg	0974fc3a54	fix: resolve E2E test failures and add Office direct extraction design - Fix MySQL connection timeout by creating fresh DB session after OCR - Fix /analyze endpoint attribute errors (detect vs analyze, metadata) - Add processing_track field extraction to TaskDetailResponse - Update E2E tests to use POST for /analyze endpoint - Increase Office document timeout to 300s - Add Section 2.4 tasks for Office document direct extraction - Document Office → PDF → Direct track strategy in design.md 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-11-20 12:13:18 +08:00
egg	cd3cbea49d	chore: project cleanup and prepare for dual-track processing refactor - Removed all test files and directories - Deleted outdated documentation (will be rewritten) - Cleaned up temporary files, logs, and uploads - Archived 5 completed OpenSpec proposals - Created new dual-track-document-processing proposal with complete OpenSpec structure - Dual-track architecture: OCR track (PaddleOCR) + Direct track (PyMuPDF) - UnifiedDocument model for consistent output - Support for structure-preserving translation - Updated .gitignore to prevent future test/temp files This is a major cleanup preparing for the complete refactoring of the document processing pipeline. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-11-18 20:02:31 +08:00

Author

SHA1

Message

Date

egg

9f449e8a19

docs: add GPU memory management section to design.md

- Document cleanup_gpu_memory() and check_gpu_memory() methods
- Explain strategic cleanup points throughout OCR pipeline
- Detail optional torch dependency and PaddlePaddle primary usage
- List benefits and performance impact
- Reference code locations with line numbers

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

2025-11-20 16:42:23 +08:00

egg

0974fc3a54

fix: resolve E2E test failures and add Office direct extraction design

- Fix MySQL connection timeout by creating fresh DB session after OCR
- Fix /analyze endpoint attribute errors (detect vs analyze, metadata)
- Add processing_track field extraction to TaskDetailResponse
- Update E2E tests to use POST for /analyze endpoint
- Increase Office document timeout to 300s
- Add Section 2.4 tasks for Office document direct extraction
- Document Office → PDF → Direct track strategy in design.md

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

2025-11-20 12:13:18 +08:00

egg

cd3cbea49d

chore: project cleanup and prepare for dual-track processing refactor

- Removed all test files and directories
- Deleted outdated documentation (will be rewritten)
- Cleaned up temporary files, logs, and uploads
- Archived 5 completed OpenSpec proposals
- Created new dual-track-document-processing proposal with complete OpenSpec structure
  - Dual-track architecture: OCR track (PaddleOCR) + Direct track (PyMuPDF)
  - UnifiedDocument model for consistent output
  - Support for structure-preserving translation
- Updated .gitignore to prevent future test/temp files

This is a major cleanup preparing for the complete refactoring of the document processing pipeline.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

2025-11-18 20:02:31 +08:00

3 Commits