egg/OCR - OCR

egg/OCR

Author	SHA1	Message	Date
egg	8333182879	fix: correct Y-axis positioning and implement span-based rendering CRITICAL BUG FIXES (Based on expert analysis): Bug A - Y-axis Starting Position Error: - Previous code used bbox.y1 (bottom) as starting point for multi-line text - Caused first line to render at last line position, text overflowing downward - FIX: Span-based rendering now uses `page_height - span.bbox.y1 + (font_size * 0.2)` to approximate baseline position for each span individually - FIX: Block-level fallback starts from bbox.y0 (top), draws lines downward: `pdf_y_top = page_height - bbox.y0`, then `line_y = pdf_y_top - ((i + 1) * line_height)` Bug B - Spans Compressed to First Line: - Previous code forced all spans to render only on first line (if i == 0 check) - Destroyed multi-line and multi-column layouts by compressing paragraphs - FIX: Prioritize span-based rendering - each span uses its own precise bbox - FIX: Removed line iteration for spans - they already have correct coordinates - FIX: Return immediately after drawing spans to prevent block text overlap Implementation Changes: 1. Span-Based Rendering (Priority Path): - Iterate through element.children (spans) with precise bbox from PyMuPDF - Each span positioned independently using its own coordinates - Apply per-span StyleInfo (font_name, font_size, font_weight, font_style) - Transform coordinates: span_pdf_y = page_height - s_bbox.y1 + (font_size * 0.2) - Used for 84% of text elements (16/19 elements in test) 2. Block-Level Fallback (Corrected Y-Axis): - Used when no spans available (filtered/modified text) - Start from TOP: pdf_y_top = page_height - bbox.y0 - Draw lines downward: line_y = pdf_y_top - ((i + 1) * line_height) - Maintains proper line spacing and paragraph flow 3. Testing: - Added comprehensive E2E test suite (test_pdf_layout_restoration.py) - Quick visual verification test (quick_visual_test.py) - Test results documented in TEST_RESULTS_SPAN_FIX.md Test Results: ✅ PDF generation: 14,172 bytes, 3 pages with content ✅ Span rendering: 84% of elements (16/19) using precise bbox ✅ Font sizes: Correct 10pt (not 35pt from bbox_height) ✅ Line count: 152 lines (proper spacing, no compression) ✅ Reading order: Correct left-right, top-bottom pattern ✅ First line: "Technical Data Sheet" (verified correct) Files Changed: - backend/app/services/pdf_generator_service.py: Complete rewrite of _draw_text_element_direct() method (lines 1796-2024) - backend/tests/e2e/test_pdf_layout_restoration.py: New E2E test suite - backend/tests/e2e/TEST_RESULTS_SPAN_FIX.md: Comprehensive test results References: - Expert analysis identified Y-axis and span compression bugs - Solution prioritizes PyMuPDF's precise span-level bbox data - Maintains backward compatibility with block-level fallback 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-11-24 14:57:27 +08:00
egg	ef335cf3af	feat: implement Office document direct extraction (Section 2.4) - Update DocumentTypeDetector._analyze_office to convert Office to PDF first - Analyze converted PDF for text extractability before routing - Route text-based Office documents to direct track (10x faster) - Update OCR service to convert Office files for DirectExtractionEngine - Add unit tests for Office → PDF → Direct extraction flow - Handle conversion failures with fallback to OCR track This optimization reduces Office document processing from >300s to ~2-5s for text-based documents by avoiding unnecessary OCR processing. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-11-20 12:20:50 +08:00
egg	0974fc3a54	fix: resolve E2E test failures and add Office direct extraction design - Fix MySQL connection timeout by creating fresh DB session after OCR - Fix /analyze endpoint attribute errors (detect vs analyze, metadata) - Add processing_track field extraction to TaskDetailResponse - Update E2E tests to use POST for /analyze endpoint - Increase Office document timeout to 300s - Add Section 2.4 tasks for Office document direct extraction - Document Office → PDF → Direct track strategy in design.md 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-11-20 12:13:18 +08:00
egg	c50a5e9d2b	test: add unit and integration tests for dual-track processing Add comprehensive test suite for DirectExtractionEngine and dual-track integration. All 65 tests pass covering text extraction, structure preservation, routing logic, and backward compatibility. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-11-19 12:50:44 +08:00
egg	0fcb2492c9	test: add unit tests for DocumentTypeDetector - Create test directory structure for backend - Add pytest fixtures for test files (PDF, images, Office docs) - Add 20 unit tests covering: - PDF type detection (editable, scanned, mixed) - Image file detection (PNG, JPG) - Office document detection (DOCX) - Text file detection - Edge cases (file not found, unknown types) - Batch processing and statistics - Mark tasks 1.1.4 and 1.3.5 as completed in tasks.md 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-11-19 12:16:49 +08:00
egg	cd3cbea49d	chore: project cleanup and prepare for dual-track processing refactor - Removed all test files and directories - Deleted outdated documentation (will be rewritten) - Cleaned up temporary files, logs, and uploads - Archived 5 completed OpenSpec proposals - Created new dual-track-document-processing proposal with complete OpenSpec structure - Dual-track architecture: OCR track (PaddleOCR) + Direct track (PyMuPDF) - UnifiedDocument model for consistent output - Support for structure-preserving translation - Updated .gitignore to prevent future test/temp files This is a major cleanup preparing for the complete refactoring of the document processing pipeline. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-11-18 20:02:31 +08:00
egg	6bb5b7691f	test: fix all failing tests - achieve 100% pass rate (18/18) Root Cause Fixed: - Tests were connecting to production MySQL database instead of test database - Solution: Monkey patch database module before importing app to use SQLite :memory: Changes: 1. conftest.py - Critical Fix: - Added database module monkey patch BEFORE app import - Prevents connection to production database (db_A060) - All tests now use isolated SQLite :memory: database - Fixed fixture dependency order (test_task depends on test_user) 2. test_tasks.py: - Fixed test_delete_task: Accept 204 No Content (correct HTTP status) 3. test_admin.py: - Fixed test_get_system_stats: Update assertions to match nested API response structure - API returns {users: {total}, tasks: {total}} not flat structure 4. test_integration.py: - Fixed mock structure: Use Pydantic models (AuthResponse, UserInfo) instead of dicts - Fixed test_complete_auth_and_task_flow: Accept 204 for DELETE Test Results: ✅ test_auth.py: 5/5 passing (100%) ✅ test_tasks.py: 6/6 passing (100%) ✅ test_admin.py: 4/4 passing (100%) ✅ test_integration.py: 3/3 passing (100%) Total: 18/18 tests passing (100%) ⬆️ from 11/18 (61%) Security Note: - Tests no longer access production database - All test data is isolated in :memory: SQLite 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-11-16 18:39:10 +08:00
egg	90fca5002b	test: run and fix V2 API tests - 11/18 passing Changes: - Fixed UserResponse schema datetime serialization bug - Fixed test_auth.py mock structure for external auth service - Updated conftest.py to create fresh database per test - Ran full test suite and verified results Test Results: ✅ test_auth.py: 5/5 passing (100%) ✅ test_tasks.py: 4/6 passing (67%) ✅ test_admin.py: 2/4 passing (50%) ❌ test_integration.py: 0/3 passing (0%) Total: 11/18 tests passing (61%) Known Issues: 1. Fixture isolation: test_user sometimes gets admin email 2. Admin API response structure doesn't match test expectations 3. Integration tests need mock fixes Production Bug Fixed: - UserResponse schema now properly serializes datetime fields to ISO format strings 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-11-16 18:16:47 +08:00
egg	8f94191914	feat: add admin dashboard, audit logs, token expiry check and test suite Frontend Features: - Add ProtectedRoute component with token expiry validation - Create AdminDashboardPage with system statistics and user management - Create AuditLogsPage with filtering and pagination - Add admin-only navigation (Shield icon) for ymirliu@panjit.com.tw - Add admin API methods to apiV2 service - Add admin type definitions (SystemStats, AuditLog, etc.) Token Management: - Auto-redirect to login on token expiry - Check authentication on route change - Show loading state during auth check - Admin privilege verification Backend Testing: - Add pytest configuration (pytest.ini) - Create test fixtures (conftest.py) - Add unit tests for auth, tasks, and admin endpoints - Add integration tests for complete workflows - Test user isolation and admin access control Documentation: - Add TESTING.md with comprehensive testing guide - Include test running instructions - Document fixtures and best practices Routes: - /admin - Admin dashboard (admin only) - /admin/audit-logs - Audit logs viewer (admin only) 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-11-16 18:01:50 +08:00
beabigegg	da700721fa	first	2025-11-12 22:53:17 +08:00

10 Commits