How the FlexOrch Document Pipeline Works

Every document you upload passes through the same six-step automated pipeline. Understanding each step helps you interpret results, tune quality thresholds, and diagnose unexpected output before you build a dataset.

Upload → Extract text → Classify → Extract fields → Detect PII → Score quality → Deliver

Step 1 — Text Extraction

FlexOrch reads raw text from the document using the method best suited to the file type:

File type	Extraction method
PDF (text-based)	Direct text extraction
PDF (scanned / image-heavy)	OCR applied automatically when the text layer is insufficient
DOCX / TXT / HTML	Direct parse
XLSX	Cell values extracted row by row
EML / MSG	Body + headers, with HTML stripped
XML e-invoices	Structured fields parsed natively (FatturaPA, UBL/Peppol, GİB, XRechnung, ZUGFeRD)
Images (JPG, PNG, TIFF)	OCR always applied

When OCR runs, an ocr_confidence score between 0 and 1 is recorded. Values below 0.7 automatically cap the quality grade at C.
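The extraction table above amounts to a dispatch on file type. The Python sketch below illustrates that idea only. The function name, the returned labels, and the `has_text_layer` flag are assumptions made for this example and are not part of the FlexOrch API.

```python
from pathlib import Path

# Hypothetical groupings that mirror the extraction table; not FlexOrch internals.
DIRECT_PARSE = {".docx", ".txt", ".html"}
EMAIL = {".eml", ".msg"}
IMAGES = {".jpg", ".png", ".tiff"}


def extraction_method(filename: str, has_text_layer: bool = True) -> str:
    """Return an illustrative label for how a file would be read."""
    ext = Path(filename).suffix.lower()
    if ext == ".pdf":
        # Scanned or image-heavy PDFs fall back to OCR.
        return "direct_text" if has_text_layer else "ocr"
    if ext in IMAGES:
        return "ocr"  # images are always OCR'd
    if ext in DIRECT_PARSE:
        return "direct_parse"
    if ext == ".xlsx":
        return "cell_values"
    if ext in EMAIL:
        return "email_body_and_headers"
    if ext == ".xml":
        return "structured_xml"
    return "unsupported"


print(extraction_method("invoice.pdf", has_text_layer=False))  # ocr
```

A helper like this can be useful on the client side to predict which uploads will carry an ocr_confidence value and may therefore be capped at grade C.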

Step 2 — Document Classification

FlexOrch identifies the document type with keyword-based rules first and uses LLM assistance only when the rules alone cannot decide. Results map to one of eleven types:

Type	Examples
`invoice`	Sales invoices, e-invoices, purchase invoices
`expense_report`	Travel expenses, reimbursement forms
`purchase_order`	POs, procurement documents
`sales_proposal`	Quotes, proposals, offers
`bank_statement`	Account statements, transaction lists
`payroll`	Payslips, salary summaries
`budget`	Budget plans, financial forecasts
`delivery_note`	Shipping documents, delivery confirmations
`tax_declaration`	Tax forms, declarations
`contract`	Agreements, statements of work
`general`	Documents that don’t match a specific type

The classification_method field in results shows whether classification was deterministic (rule-based) or required LLM assistance.

Step 3 — Field Extraction

FlexOrch extracts structured fields based on the detected document type. Deterministic pattern matching runs first; an LLM fallback handles fields that patterns cannot capture. Example fields for an invoice:

Field	Example value
`vendor`	Acme Ltd.
`document_date`	2024-01-15
`due_date`	2024-02-15
`total_amount`	12500.00
`currency`	EUR
`iban`	DE89370400440532013000
`document_no`	INV-2024-00421
`line_items`	Array of `{description, quantity, unit_price, total}`
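Extracted fields can be sanity-checked before they enter a dataset. The sketch below is one possible client-side check and not something FlexOrch is documented to run: it applies the standard IBAN mod-97 checksum and compares the two dates, assuming dates arrive as ISO 8601 strings as in the table above. The function names are hypothetical.

```python
from datetime import date


def iban_checksum_ok(iban: str) -> bool:
    """Standard IBAN mod-97 check; a generic rule, not a FlexOrch API."""
    s = iban.replace(" ", "").upper()
    if len(s) < 5 or not (s.isascii() and s.isalnum()):
        return False
    rearranged = s[4:] + s[:4]  # move country code and check digits to the end
    digits = "".join(str(int(ch, 36)) for ch in rearranged)  # A=10 ... Z=35
    return int(digits) % 97 == 1


def sanity_check(fields: dict) -> list:
    """Return a list of problems; an empty list means nothing was flagged."""
    problems = []
    if "iban" in fields and not iban_checksum_ok(fields["iban"]):
        problems.append("iban fails the mod-97 checksum")
    if "document_date" in fields and "due_date" in fields:
        issued = date.fromisoformat(fields["document_date"])
        due = date.fromisoformat(fields["due_date"])
        if due < issued:
            problems.append("due_date is earlier than document_date")
    return problems


example = {
    "vendor": "Acme Ltd.",
    "document_date": "2024-01-15",
    "due_date": "2024-02-15",
    "total_amount": 12500.00,
    "currency": "EUR",
    "iban": "DE89370400440532013000",
    "document_no": "INV-2024-00421",
}
print(sanity_check(example))  # []
```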

The extraction_method_per_field summary in results shows which fields were filled deterministically versus by LLM.
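One use of this summary is to route LLM-filled fields to a manual review queue. The exact shape of extraction_method_per_field is not shown on this page, so the sketch below assumes a flat mapping from field name to a method label, with the hypothetical labels "deterministic" and "llm". Adjust both to whatever your results actually contain.

```python
def fields_filled_by_llm(extraction_method_per_field: dict) -> list:
    """List field names whose method label equals the assumed "llm" marker."""
    return sorted(
        name
        for name, method in extraction_method_per_field.items()
        if method == "llm"
    )


# Assumed shape for illustration only.
summary = {
    "vendor": "deterministic",
    "total_amount": "deterministic",
    "due_date": "llm",
}
print(fields_filled_by_llm(summary))  # ['due_date']
```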

Step 4 — PII Detection

FlexOrch scans the full document text for personal and sensitive data across 46 PII types. Each finding is:

Counted — stored as pii_findings_count
Categorized — broken down by type in pii_type_summary
Optionally masked — replaced in output when privacy_applied is true

If PII is detected, the masked_text version is used for dataset export by default. See PII Detection & Privacy for the full type catalog and masking strategies.
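The export default described above can be mirrored in client code. This is a minimal sketch, assuming a result dictionary that carries the unmasked text under a hypothetical `text` key next to `masked_text` and `pii_findings_count`. Only the latter two names appear on this page.

```python
def text_for_export(result: dict) -> str:
    """Prefer the masked text whenever at least one PII finding was recorded."""
    has_pii = result.get("pii_findings_count", 0) > 0
    if has_pii and result.get("masked_text"):
        return result["masked_text"]
    return result["text"]  # "text" is an assumed key name


# Illustrative values, not real pipeline output.
result = {
    "text": "Contact: jane@example.com",
    "masked_text": "Contact: [EMAIL]",
    "pii_findings_count": 1,
}
print(text_for_export(result))  # Contact: [EMAIL]
```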

Step 5 — Quality Scoring

A quality score (0–100) and grade (A–D) are computed from three signals:

Field fill rate: how many of the fields expected for the document type were successfully extracted
Noise ratio: the proportion of the document that is non-informative content (e.g., repeated headers, page numbers)
OCR confidence: for documents that went through OCR, an ocr_confidence below 0.7 caps the grade at C

Grade	Score	What it means
A	85–100	High quality — all key fields extracted
B	65–84	Good — minor gaps
C	45–64	Moderate — notable extraction issues
D	0–44	Low — significant problems; review before use
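The grade bands and the OCR cap can be expressed as a small function. This sketch covers only the score-to-grade mapping from the table and the 0.7 cap from Step 1. How FlexOrch weighs the three signals into the numeric score is not documented here, and the function name is hypothetical.

```python
from typing import Optional


def grade_for(score: float, ocr_confidence: Optional[float] = None) -> str:
    """Map a 0-100 score to a grade, capping at C when OCR confidence is below 0.7."""
    if not 0 <= score <= 100:
        raise ValueError("score must be between 0 and 100")
    if score >= 85:
        grade = "A"
    elif score >= 65:
        grade = "B"
    elif score >= 45:
        grade = "C"
    else:
        grade = "D"
    # ocr_confidence is None when no OCR ran, so the cap does not apply.
    if ocr_confidence is not None and ocr_confidence < 0.7 and grade in ("A", "B"):
        grade = "C"
    return grade


print(grade_for(91))        # A
print(grade_for(91, 0.65))  # C
```

The cap only lowers a grade. A document that already scores in the D band stays at D regardless of its OCR confidence.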

See Quality Scores for filtering strategies and feedback options.

Step 6 — Delivery

Results are written to the pipeline execution record and returned in the job response:

{
  "data": {
    "status": "completed",
    "detected_language": "tr",
    "quality": {
      "score": 91,
      "grade": "A"
    },
    "pii_findings_count": 3,
    "privacy_applied": true,
    "processing_summary": {
      "fields": {
        "vendor": "Acme Ltd.",
        "total_amount": 12500.00
      }
    }
  }
}
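A consumer would typically gate on status and grade before reading the fields. The sketch below parses the sample response shown above with Python's standard json module. The `meets_grade` helper and the minimum grade of B are illustrative choices, not FlexOrch defaults.

```python
import json

# The sample job response from this page, embedded so the example is self-contained.
RESPONSE_BODY = """
{
  "data": {
    "status": "completed",
    "detected_language": "tr",
    "quality": {"score": 91, "grade": "A"},
    "pii_findings_count": 3,
    "privacy_applied": true,
    "processing_summary": {
      "fields": {"vendor": "Acme Ltd.", "total_amount": 12500.00}
    }
  }
}
"""

GRADE_ORDER = "ABCD"  # best to worst


def meets_grade(data: dict, minimum: str = "B") -> bool:
    """True when the document's grade is at least as good as `minimum`."""
    return GRADE_ORDER.index(data["quality"]["grade"]) <= GRADE_ORDER.index(minimum)


data = json.loads(RESPONSE_BODY)["data"]
if data["status"] == "completed" and meets_grade(data):
    fields = data["processing_summary"]["fields"]
    print(fields["vendor"], fields["total_amount"])  # Acme Ltd. 12500.0
```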

Reprocessing a Document

You can reprocess any document at any time — for example, after a pipeline improvement or a classification correction. Reprocessing creates a new job and execution while preserving the previous results:

curl -X POST https://api.flexorch.com/v1/documents/{document_id}/reprocess \
  -H "X-API-KEY: dfx_your_key_here"
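The same call can be made from Python with the standard library. The sketch below builds the request from the endpoint and header in the curl example. The function name is hypothetical, and because the response body for this endpoint is not shown on this page, the example stops at constructing the request.

```python
import urllib.parse
import urllib.request

API_BASE = "https://api.flexorch.com/v1"


def build_reprocess_request(document_id: str, api_key: str) -> urllib.request.Request:
    """Build the POST request that triggers reprocessing of one document."""
    # Percent-encode the ID so unexpected characters cannot alter the path.
    safe_id = urllib.parse.quote(document_id, safe="")
    url = f"{API_BASE}/documents/{safe_id}/reprocess"
    return urllib.request.Request(url, method="POST", headers={"X-API-KEY": api_key})


req = build_reprocess_request("doc_123", "dfx_your_key_here")
print(req.get_method(), req.full_url)
# To send it: urllib.request.urlopen(req)
```

Here `doc_123` is a placeholder ID. Because reprocessing creates a new job and execution, the response should be treated as a fresh job rather than an update to the earlier one.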
