## ADDED Requirements

### Requirement: PP-StructureV3 Configuration

The system SHALL configure PP-StructureV3 with the following settings:

**Preprocessing (Stage 1):**
- Document orientation classification MUST be enabled (`use_doc_orientation_classify=True`)
- Document unwarping MUST be enabled (`use_doc_unwarping=True`)
- Textline orientation detection MUST be enabled (`use_textline_orientation=True`)

**Layout Detection (Stage 3):**
- The `chinese` layout model option SHALL use PP-DocLayout_plus-L (83.2% mAP)
- The `default` layout model option SHALL use PubLayNet for English documents
- The `cdla` layout model option SHALL use picodet_lcnet_x1_0_fgd_layout_cdla

**Element Recognition (Stage 4):**
- Table structure recognition SHALL use the SLANeXt_wired and SLANeXt_wireless models (69.65% combined accuracy)
- Formula recognition SHALL use PP-FormulaNet_plus-L (BLEU: 92.22% English, 90.64% Chinese)
- Chart parsing SHALL use PP-Chart2Table
- Seal recognition SHALL use PP-OCRv4_seal

#### Scenario: Processing rotated scanned document
- **WHEN** a PDF document with rotated pages is processed using the OCR track
- **THEN** the system SHALL automatically detect and correct the page orientation before OCR processing

#### Scenario: Processing complex Chinese document with tables
- **WHEN** a Chinese document containing tables, images, and formulas is processed
- **AND** the user selects the `chinese` layout model
- **THEN** the system SHALL use PP-DocLayout_plus-L for layout detection (83.2% mAP)
- **AND** the system SHALL correctly identify table regions

#### Scenario: Table structure recognition with wired tables
- **WHEN** a document contains wired (bordered) tables
- **THEN** the system SHALL use the SLANeXt_wired model for structure recognition
- **AND** the system SHALL output a correct HTML table structure with proper row and column spans

#### Scenario: Table structure recognition with wireless tables
- **WHEN** a document contains wireless (borderless) tables
- **THEN** the system SHALL use the SLANeXt_wireless model for structure recognition

#### Scenario: Chinese formula recognition
- **WHEN** a document contains mathematical formulas with Chinese characters
- **THEN** the system SHALL use PP-FormulaNet_plus-L for recognition
- **AND** the system SHALL output LaTeX code that represents the Chinese characters correctly

## ADDED Requirements

### Requirement: Model Cache Cleanup

The system SHALL provide documentation for cleaning up unused model caches to reclaim storage space.

#### Scenario: User wants to free disk space after model upgrade
- **WHEN** the user has upgraded from older models (PP-DocLayout-S, SLANet) to newer models
- **THEN** the documentation SHALL explain how to delete unused cached models from `~/.paddlex/official_models/`
- **AND** the documentation SHALL list which model directories can be safely removed
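
The cleanup documentation could include a small helper along the following lines. This is a minimal sketch, not a definitive implementation: it assumes each cached model lives in a directory under `~/.paddlex/official_models/` named after the model, and the function names (`find_removable`, `remove_models`) are hypothetical. It defaults to a dry run so that nothing is deleted until the user has reviewed the list.

```python
import shutil
from pathlib import Path

# Cache location named in the requirement above.
DEFAULT_CACHE_ROOT = Path.home() / ".paddlex" / "official_models"

# Older models named in the scenario. That the cache directory names
# match these model names exactly is an assumption.
OBSOLETE_MODELS = ("PP-DocLayout-S", "SLANet")


def dir_size_bytes(path: Path) -> int:
    """Sum the sizes of all regular files under `path`."""
    return sum(f.stat().st_size for f in path.rglob("*") if f.is_file())


def find_removable(cache_root: Path = DEFAULT_CACHE_ROOT,
                   obsolete=OBSOLETE_MODELS) -> list[Path]:
    """Return cached model directories whose names are in `obsolete`."""
    if not cache_root.is_dir():
        return []
    return sorted(p for p in cache_root.iterdir()
                  if p.is_dir() and p.name in obsolete)


def remove_models(paths, dry_run: bool = True) -> int:
    """Delete the given directories unless `dry_run` is set.

    Returns the number of bytes freed (or that would be freed).
    """
    freed = 0
    for path in paths:
        freed += dir_size_bytes(path)
        if not dry_run:
            shutil.rmtree(path)
    return freed


if __name__ == "__main__":
    targets = find_removable()
    for target in targets:
        print(f"would remove: {target}")
    reclaimable = remove_models(targets, dry_run=True)
    print(f"reclaimable: {reclaimable / 1e6:.1f} MB")
```

Matching on an explicit list of names, rather than deleting everything not currently in use, keeps the script from touching models that other tools may still depend on. Passing `dry_run=False` performs the deletion.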