Step 1 — Text Extraction
FlexOrch reads raw text from the document using the method best suited to the file type:

| File type | Extraction method |
|---|---|
| PDF (text-based) | Direct text extraction |
| PDF (scanned / image-heavy) | OCR applied automatically when the text layer is insufficient |
| DOCX / TXT / HTML | Direct parse |
| XLSX | Cell values extracted row by row |
| EML / MSG | Body + headers, with HTML stripped |
| XML e-invoices | Structured fields parsed natively (FatturaPA, UBL/Peppol, GİB, XRechnung, ZUGFeRD) |
| Images (JPG, PNG, TIFF) | OCR always applied |

An `ocr_confidence` score between 0 and 1 is recorded. Values below 0.7 automatically cap the quality grade at C.
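
The table above can be read as a dispatch on file type. The following is a minimal sketch of that idea, not FlexOrch's actual implementation: the function name `choose_extraction_method`, the `has_text_layer` flag, and the returned labels are all hypothetical, and a single boolean stands in for whatever check decides that a PDF text layer is insufficient.

```python
from pathlib import Path

# Hypothetical extension groups mirroring the rows of the table above.
DIRECT_PARSE_TYPES = {".docx", ".txt", ".html"}
IMAGE_TYPES = {".jpg", ".jpeg", ".png", ".tiff"}
EMAIL_TYPES = {".eml", ".msg"}


def choose_extraction_method(filename: str, has_text_layer: bool = True) -> str:
    """Return an illustrative label for the extraction method of a file."""
    ext = Path(filename).suffix.lower()
    if ext == ".pdf":
        # Scanned or image-heavy PDFs fall back to OCR.
        return "direct_text" if has_text_layer else "ocr"
    if ext in IMAGE_TYPES:
        return "ocr"  # images always go through OCR
    if ext in DIRECT_PARSE_TYPES:
        return "direct_parse"
    if ext == ".xlsx":
        return "cell_values"
    if ext in EMAIL_TYPES:
        return "email_body_and_headers"
    if ext == ".xml":
        return "structured_xml"
    raise ValueError(f"Unsupported file type: {ext or filename}")
```

For example, `choose_extraction_method("scan.pdf", has_text_layer=False)` selects OCR, while the same file with a usable text layer selects direct text extraction.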
Step 2 — Document Classification
FlexOrch identifies the document type using keyword-based classification. Results map to one of eleven types:

| Type | Examples |
|---|---|
| `invoice` | Sales invoices, e-invoices, purchase invoices |
| `expense_report` | Travel expenses, reimbursement forms |
| `purchase_order` | POs, procurement documents |
| `sales_proposal` | Quotes, proposals, offers |
| `bank_statement` | Account statements, transaction lists |
| `payroll` | Payslips, salary summaries |
| `budget` | Budget plans, financial forecasts |
| `delivery_note` | Shipping documents, delivery confirmations |
| `tax_declaration` | Tax forms, declarations |
| `contract` | Agreements, statements of work |
| `general` | Documents that don't match a specific type |

The `classification_method` field in the results shows whether classification was deterministic (rule-based) or required LLM assistance.
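
To illustrate what keyword-based classification can look like, here is a minimal sketch. The keyword lists, the scoring rule (count of keyword occurrences), and the function name `classify` are assumptions made for the example; FlexOrch's real rules are not described here. Only four of the eleven types are covered to keep it short.

```python
# Hypothetical keyword lists; the real rule set is not documented in this section.
KEYWORDS = {
    "invoice": ("invoice", "amount due"),
    "purchase_order": ("purchase order", "po number"),
    "bank_statement": ("account statement", "closing balance"),
    "payroll": ("payslip", "net pay"),
}


def classify(text: str) -> str:
    """Pick the type whose keywords occur most often; default to 'general'."""
    lowered = text.lower()
    scores = {
        doc_type: sum(lowered.count(keyword) for keyword in keywords)
        for doc_type, keywords in KEYWORDS.items()
    }
    best_type, best_score = max(scores.items(), key=lambda item: item[1])
    # No keyword matched at all: treat the document as 'general'.
    return best_type if best_score > 0 else "general"
```

In this sketch a document with no keyword hits simply becomes `general`; a production system could instead hand such cases to an LLM, which is the situation the `classification_method` field reports.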
Step 3 — Field Extraction
FlexOrch extracts structured fields based on the detected document type. Deterministic pattern matching runs first; an LLM fallback handles fields that patterns cannot capture. Example fields for an invoice:

| Field | Example value |
|---|---|
| `vendor` | Acme Ltd. |
| `document_date` | 2024-01-15 |
| `due_date` | 2024-02-15 |
| `total_amount` | 12500.00 |
| `currency` | EUR |
| `iban` | DE89370400440532013000 |
| `document_no` | INV-2024-00421 |
| `line_items` | Array of {description, quantity, unit_price, total} |

The `extraction_method_per_field` summary in the results shows which fields were filled deterministically and which were filled by the LLM.
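
The "patterns first, LLM second" flow can be sketched with regular expressions. Everything specific below is an assumption for illustration: the patterns, the `Date:` and `Total:` labels they expect, the function name `extract_fields`, and the method labels `deterministic` and `llm_fallback_needed`. The sketch only marks which fields would need the fallback; it does not call an LLM.

```python
import re

# Hypothetical patterns for a few invoice fields from the table above.
PATTERNS = {
    "document_no": re.compile(r"\bINV-\d{4}-\d+\b"),
    "document_date": re.compile(r"Date:\s*(\d{4}-\d{2}-\d{2})"),
    "total_amount": re.compile(r"Total:\s*([\d.]+)"),
    "iban": re.compile(r"\b[A-Z]{2}\d{2}[A-Z0-9]{11,30}\b"),
}


def extract_fields(text: str):
    """Return (fields, methods); unmatched fields are flagged for fallback."""
    fields, methods = {}, {}
    for name, pattern in PATTERNS.items():
        match = pattern.search(text)
        if match:
            # Use the capture group when the pattern has one, else the whole match.
            fields[name] = match.group(match.lastindex or 0)
            methods[name] = "deterministic"
        else:
            fields[name] = None
            methods[name] = "llm_fallback_needed"
    return fields, methods
```

The second return value plays the role of a per-field method summary: it records, for each field, whether a pattern filled it or whether a fallback would still have to.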
Step 4 — PII Detection
FlexOrch scans the full document text for personal and sensitive data across 46 PII types. Each finding is:

- Counted: stored as `pii_findings_count`
- Categorized: broken down by type in `pii_type_summary`
- Optionally masked: replaced in output when `privacy_applied` is `true`

The `masked_text` version is used for dataset export by default.
See PII Detection & Privacy for the full type catalog and masking strategies.
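
As a rough illustration of how counting, categorizing, and masking fit together, the sketch below handles two PII types with regular expressions. The patterns, the `[EMAIL]` and `[IBAN]` placeholders, and the function name `scan_pii` are assumptions; only the result keys come from the field names mentioned in this step. The actual catalog covers 46 types and its masking strategies may differ.

```python
import re
from collections import Counter

# Hypothetical detectors for two PII types.
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "iban": re.compile(r"\b[A-Z]{2}\d{2}[A-Z0-9]{11,30}\b"),
}


def scan_pii(text: str, privacy_applied: bool = True) -> dict:
    """Count findings per type and optionally mask them in the output text."""
    summary = Counter()
    masked = text
    for pii_type, pattern in PII_PATTERNS.items():
        found = pattern.findall(text)
        if found:
            summary[pii_type] = len(found)
        if privacy_applied:
            # Replace every match with a type placeholder such as [EMAIL].
            masked = pattern.sub(f"[{pii_type.upper()}]", masked)
    return {
        "pii_findings_count": sum(summary.values()),
        "pii_type_summary": dict(summary),
        "masked_text": masked,
    }
```

With `privacy_applied=False` the sketch still counts and categorizes findings but returns the text unchanged.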
Step 5 — Quality Scoring
A quality score (0–100) and grade (A–D) are computed from three signals:

- Field fill rate: how many of the fields expected for the document type were successfully extracted
- Noise ratio: the proportion of the document that is non-informative content (e.g., repeated headers, page numbers)
- OCR confidence: for scanned documents, an `ocr_confidence` below 0.7 caps the grade at C

| Grade | Score | What it means |
|---|---|---|
| A | 85–100 | High quality — all key fields extracted |
| B | 65–84 | Good — minor gaps |
| C | 45–64 | Moderate — notable extraction issues |
| D | 0–44 | Low — significant problems; review before use |
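
The grade bands and the OCR cap can be expressed in a few lines. This is a sketch under assumptions: it takes an already computed score as input, because the weighting of the three signals is not specified here, and the function name `grade_for` is hypothetical. The thresholds come from the table above and the 0.7 cap from Step 1.

```python
from typing import Optional


def grade_for(score: float, ocr_confidence: Optional[float] = None) -> str:
    """Map a 0-100 quality score to a grade, applying the OCR cap at C."""
    if not 0 <= score <= 100:
        raise ValueError("score must be between 0 and 100")
    if score >= 85:
        grade = "A"
    elif score >= 65:
        grade = "B"
    elif score >= 45:
        grade = "C"
    else:
        grade = "D"
    # Low OCR confidence caps the grade at C; it never raises a D.
    if ocr_confidence is not None and ocr_confidence < 0.7 and grade in ("A", "B"):
        grade = "C"
    return grade
```

For example, a score of 90 yields grade A for a text-based document, but the same score with an `ocr_confidence` of 0.6 yields grade C.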