OCR/tasks.md at 1ec186f680829558e5d5aede1885911233c80d19

egg 1ec186f680 fix: properly implement list formatting with sequential numbering and grouping

Fix critical issues in Task 6 list formatting implementation:

**Issue 1: LIST_ITEM Elements Not Rendered**
- Problem: LIST_ITEM type not included in is_text property
- Fix: Separate list_elements from text_elements (lines 626, 636-637)
- Impact: List items were completely ignored in rendering
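
A minimal sketch of the separation described above, assuming a simplified element model. The `ElementType`, `Element`, and `split_elements` names are hypothetical and are not taken from `pdf_generator_service.py`; only the idea that `LIST_ITEM` falls outside `is_text` comes from this commit.

```python
from dataclasses import dataclass
from enum import Enum


class ElementType(Enum):
    TEXT = "text"
    TITLE = "title"
    LIST_ITEM = "list_item"
    IMAGE = "image"


# Assumption: LIST_ITEM is deliberately absent, mirroring the reported problem.
TEXT_TYPES = {ElementType.TEXT, ElementType.TITLE}


@dataclass
class Element:
    type: ElementType
    text: str = ""

    @property
    def is_text(self) -> bool:
        return self.type in TEXT_TYPES


def split_elements(elements):
    """Route list items to their own collection so they are not dropped."""
    text_elements, list_elements = [], []
    for el in elements:
        if el.type is ElementType.LIST_ITEM:
            list_elements.append(el)
        elif el.is_text:
            text_elements.append(el)
    return text_elements, list_elements
```

Checking for `LIST_ITEM` before `is_text` keeps the two collections disjoint even if `is_text` is later widened to include list items.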

**Issue 2: Missing Sequential Numbering**
- Problem: Each list item independently parsed its own number, so numbering was not sequential across a list
- Fix: Implement _draw_list_elements_direct method (lines 1523-1610)
  - Groups list items by proximity (max_gap=30pt) and level
  - Maintains list_counter across items for sequential numbering
  - Starts from the original number in the first item
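
The numbering rule can be sketched as follows. This is an illustration under assumptions, not the code in `_draw_list_elements_direct`: the `number_items` helper and the `NUM_RE` pattern are hypothetical, and the fallback to 1 when the first item carries no number is an assumption.

```python
import re

# Assumption: ordered markers look like "3." or "3)" at the start of the text.
NUM_RE = re.compile(r"^\s*(\d+)[.)]\s*")


def number_items(texts):
    """Return one marker per item, counting up from the first item's number."""
    if not texts:
        return []
    match = NUM_RE.match(texts[0])
    list_counter = int(match.group(1)) if match else 1
    markers = []
    for _ in texts:
        # Later items' own numbers are ignored; only the shared counter is used.
        markers.append(f"{list_counter}.")
        list_counter += 1
    return markers
```

With this approach a group whose items read "3.", "7.", and an unnumbered line would be rendered as 3, 4, 5.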

**Issue 3: Unreliable List Type Detection**
- Problem: List type was detected by regex per item rather than once per list
- Fix: Detect type from first item in group, apply to all items
  - Store computed marker in metadata (_list_marker, _list_type)
  - Ensures consistency across entire list
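
A hedged sketch of per-group type detection. The metadata keys `_list_marker` and `_list_type` come from this commit; the dict-based item shape, the `annotate_group` helper, the `"ordered"`/`"unordered"` labels, and the bullet glyph are assumptions made for illustration.

```python
import re

# Assumption: an ordered list starts with a marker such as "3." or "3)".
ORDERED_RE = re.compile(r"^\s*(\d+)[.)]\s+")


def annotate_group(group):
    """Decide the list type once, from the first item, and stamp every item."""
    if not group:
        return
    match = ORDERED_RE.match(group[0]["text"])
    list_type = "ordered" if match else "unordered"
    start = int(match.group(1)) if match else 1
    for offset, item in enumerate(group):
        meta = item.setdefault("metadata", {})
        meta["_list_type"] = list_type
        meta["_list_marker"] = f"{start + offset}." if match else "•"
```

Because the decision is made once per group, an item whose own text happens to look like a different list type still receives the group's marker.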

**Issue 4: Insufficient List Spacing Control**
- Problem: No grouping logic; rendering relied solely on bbox positions
- Fix: Proximity-based grouping with 30pt max gap threshold
  - Groups consecutive items into lists
  - Separates lists when gap exceeds threshold or level changes
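
The grouping rule might look like the sketch below. The `ListItem` fields and `group_list_items` are hypothetical names, and the gap definition (top of the current item minus bottom of the previous one, with y increasing down the page) is an assumption; the commit only states the 30pt threshold, the level check, and the (y0, x0) sort.

```python
from dataclasses import dataclass


@dataclass
class ListItem:
    text: str
    x0: float
    y0: float  # top edge (assumed: y grows downward)
    y1: float  # bottom edge
    level: int = 0


def group_list_items(items, max_gap=30.0):
    """Split items into lists wherever the gap or the nesting level changes."""
    ordered = sorted(items, key=lambda it: (it.y0, it.x0))
    groups = []
    for item in ordered:
        if groups:
            prev = groups[-1][-1]
            gap = item.y0 - prev.y1
            if gap <= max_gap and item.level == prev.level:
                groups[-1].append(item)
                continue
        # First item, gap too large, or level changed: start a new list.
        groups.append([item])
    return groups
```

Comparing each item only against the last member of the current group keeps the pass linear after sorting.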

**Technical Implementation**

New method: _draw_list_elements_direct (lines 1523-1610)
- Sort items by position (y0, x0)
- Group by proximity and level
- Detect list type from first item
- Assign sequential markers
- Store in metadata for _draw_text_element_direct

Updated: _draw_text_element_direct (lines 1662-1677)
- Use pre-computed _list_marker from metadata
- Simplified marker removal (just clean original markers)
- No longer needs to maintain counter per-item
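
A sketch of how the text renderer could consume the pre-computed marker, under assumptions: `render_list_text` and `ORIGINAL_MARKER_RE` are hypothetical, and the set of original markers being stripped (digits followed by `.` or `)`, or a single bullet character) is a guess at what "clean original markers" covers.

```python
import re

# Assumption: original markers are "12.", "12)", or one bullet-like character.
ORIGINAL_MARKER_RE = re.compile(r"^\s*(?:\d+[.)]|[•\-\*·])\s+")


def render_list_text(text, metadata):
    """Swap the item's original marker for the one stored in metadata."""
    marker = metadata.get("_list_marker")
    if marker is None:
        # Not a list item: leave the text untouched.
        return text
    body = ORIGINAL_MARKER_RE.sub("", text, count=1)
    return f"{marker} {body}"
```

No counter is needed here, since the numbering decision has already been made at the group level.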

Updated: _generate_direct_track_pdf (lines 622-663)
- Separate list_elements collection
- Call _draw_list_elements_direct before text rendering
- Updated logging to show list item count

**Modified Files**
- backend/app/services/pdf_generator_service.py
  - Lines 626, 636-637: Separate list_elements
  - Lines 644-646: Updated logging
  - Lines 658-659: Add list rendering layer
  - Lines 1523-1610: New _draw_list_elements_direct method
  - Lines 1662-1677: Simplified list detection in _draw_text_element_direct
- openspec/changes/pdf-layout-restoration/tasks.md
  - Updated Task 6.1 subtasks with accurate implementation details
  - Updated Task 6.2 subtasks with grouping and numbering logic

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

Implementation Tasks: PDF Layout Restoration

Phase 1: Critical Fixes (P0 - Immediate)

1. Fix Image Handling

2. Fix Table Rendering

Phase 2: Basic Style Preservation (P1 - Week 1)

3. Implement Style Application System

4. Track-Specific Rendering

Phase 3: Advanced Layout (P2 - Week 2)

5. Enhanced Text Rendering

6. List Formatting (Direct track only)

7. Span-Level Rendering (Advanced)

Phase 4: Testing and Optimization (P2 - Week 3)

8. Comprehensive Testing

9. Performance Optimization

10. Documentation and Deployment

Success Criteria

Must Have (Phase 1)

Should Have (Phase 2)

Nice to Have (Phase 3-4)

Timeline
