Every document you upload passes through the same six-step automated pipeline. Understanding each step helps you interpret results, tune quality thresholds, and diagnose unexpected output before you build a dataset.
Upload → Extract text → Classify → Extract fields → Detect PII → Score quality → Deliver

Step 1 — Text Extraction

FlexOrch reads raw text from the document using the method best suited to the file type:
File type | Extraction method
PDF (text-based) | Direct text extraction
PDF (scanned / image-heavy) | OCR applied automatically when the text layer is insufficient
DOCX / TXT / HTML | Direct parse
XLSX | Cell values extracted row by row
EML / MSG | Body + headers, with HTML stripped
XML e-invoices | Structured fields parsed natively (FatturaPA, UBL/Peppol, GİB, XRechnung, ZUGFeRD)
Images (JPG, PNG, TIFF) | OCR always applied
When OCR runs, an ocr_confidence score between 0 and 1 is recorded. Values below 0.7 automatically cap the quality grade at C.
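The routing in the table above can be pictured as a lookup on the file extension. The following is only a sketch: the function name, the `has_usable_text_layer` flag, and the returned labels are hypothetical and are not part of the FlexOrch API.

```python
from pathlib import Path

# Extension groups copied from the file-type table; the method labels are
# illustrative names, not values returned by FlexOrch.
DIRECT_PARSE = {".docx", ".txt", ".html"}
EMAIL = {".eml", ".msg"}
IMAGE = {".jpg", ".png", ".tiff"}


def choose_extraction_method(filename: str, has_usable_text_layer: bool = True) -> str:
    """Return a label for the extraction method a file would likely get."""
    ext = Path(filename).suffix.lower()
    if ext == ".pdf":
        # Scanned or image-heavy PDFs fall back to OCR.
        return "direct_text" if has_usable_text_layer else "ocr"
    if ext in DIRECT_PARSE:
        return "direct_parse"
    if ext == ".xlsx":
        return "cell_values"
    if ext in EMAIL:
        return "email_body_and_headers"
    if ext == ".xml":
        return "structured_xml"
    if ext in IMAGE:
        return "ocr"
    raise ValueError(f"unsupported file type: {ext or filename}")
```

Calling `choose_extraction_method("scan.pdf", has_usable_text_layer=False)` returns the OCR label, which is the case where an ocr_confidence score would be recorded.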

Step 2 — Document Classification

FlexOrch identifies the document type using keyword-based classification and assigns each document one of eleven types:
Type | Examples
invoice | Sales invoices, e-invoices, purchase invoices
expense_report | Travel expenses, reimbursement forms
purchase_order | POs, procurement documents
sales_proposal | Quotes, proposals, offers
bank_statement | Account statements, transaction lists
payroll | Payslips, salary summaries
budget | Budget plans, financial forecasts
delivery_note | Shipping documents, delivery confirmations
tax_declaration | Tax forms, declarations
contract | Agreements, statements of work
general | Documents that don’t match a specific type
The classification_method field in results shows whether classification was deterministic (rule-based) or required LLM assistance.
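To make keyword-based classification concrete, here is a minimal sketch. The keyword lists and the `classify` function are invented for illustration; FlexOrch's actual rules are not documented in this section, and only four of the eleven types are covered.

```python
# Hypothetical keyword lists; the real rule set is not published here.
KEYWORDS = {
    "invoice": ("invoice", "amount due", "bill to"),
    "purchase_order": ("purchase order", "po number"),
    "payroll": ("payslip", "gross pay", "net pay"),
    "contract": ("agreement", "statement of work"),
}


def classify(text: str) -> str:
    """Pick the type whose keywords occur most often; default to 'general'."""
    lowered = text.lower()
    scores = {
        doc_type: sum(lowered.count(keyword) for keyword in keywords)
        for doc_type, keywords in KEYWORDS.items()
    }
    best_type, best_score = max(scores.items(), key=lambda item: item[1])
    # No keyword hit at all: fall back to the catch-all type.
    return best_type if best_score > 0 else "general"
```

A document with no keyword hits lands in general, which mirrors the catch-all row in the table above.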

Step 3 — Field Extraction

FlexOrch extracts structured fields based on the detected document type. Deterministic pattern matching runs first; an LLM fallback handles fields that patterns cannot capture. Example fields for an invoice:
Field | Example value
vendor | Acme Ltd.
document_date | 2024-01-15
due_date | 2024-02-15
total_amount | 12500.00
currency | EUR
iban | DE89370400440532013000
document_no | INV-2024-00421
line_items | Array of {description, quantity, unit_price, total}
The extraction_method_per_field summary in results shows which fields were filled deterministically versus by LLM.
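The "patterns first, fallback second" idea can be sketched with regular expressions. Everything here is an assumption made for illustration: the patterns, the `extract_fields` function, and the `"deterministic"` / `"needs_fallback"` labels are not FlexOrch's actual rules or output values.

```python
import re

# Illustrative patterns only; real invoices need far more robust rules.
PATTERNS = {
    "document_no": re.compile(r"\bINV-\d{4}-\d+\b"),
    "document_date": re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),
    "due_date": re.compile(r"Due:\s*(\d{4}-\d{2}-\d{2})"),
    "iban": re.compile(r"\b[A-Z]{2}\d{2}[A-Z0-9]{11,30}\b"),
    "total_amount": re.compile(r"Total:?\s*([\d.,]+)", re.IGNORECASE),
}


def extract_fields(text: str):
    """Return (fields, methods); unmatched fields are flagged for a fallback."""
    fields, methods = {}, {}
    for name, pattern in PATTERNS.items():
        match = pattern.search(text)
        if match:
            # Use the capture group when the pattern has one, else the whole match.
            fields[name] = match.group(match.lastindex or 0)
            methods[name] = "deterministic"
        else:
            fields[name] = None
            methods[name] = "needs_fallback"
    return fields, methods
```

In this sketch, a field marked `"needs_fallback"` is where a second stage (an LLM in FlexOrch's case) would take over.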

Step 4 — PII Detection

FlexOrch scans the full document text for personal and sensitive data across 46 PII types. Each finding is:
  • Counted — stored as pii_findings_count
  • Categorized — broken down by type in pii_type_summary
  • Optionally masked — replaced in output when privacy_applied is true
If PII is detected, the masked_text version is used for dataset export by default. See PII Detection & Privacy for the full type catalog and masking strategies.
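As a rough illustration of how counting, categorizing, and masking fit together, the sketch below scans for two PII types with regular expressions. The two patterns, the `[EMAIL]`-style placeholders, and the `scan_pii` function are assumptions; only the result keys (pii_findings_count, pii_type_summary, privacy_applied, masked_text) come from this page.

```python
import re
from collections import Counter

# Two example detectors; the real catalog covers 46 types.
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "iban": re.compile(r"\b[A-Z]{2}\d{2}[A-Z0-9]{11,30}\b"),
}


def scan_pii(text: str, mask: bool = True) -> dict:
    """Count findings per type and optionally replace them with placeholders."""
    summary = Counter()
    masked = text
    for pii_type, pattern in PII_PATTERNS.items():
        found = pattern.findall(text)
        if found:
            summary[pii_type] = len(found)
        if mask:
            masked = pattern.sub(f"[{pii_type.upper()}]", masked)
    return {
        "pii_findings_count": sum(summary.values()),
        "pii_type_summary": dict(summary),
        "privacy_applied": mask,
        "masked_text": masked if mask else None,
    }
```

With masking enabled, the returned masked_text is the variant that would feed a dataset export.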

Step 5 — Quality Scoring

A quality score (0–100) and grade (A–D) are computed from three signals:
  • Field fill rate — how many expected fields for the document type were successfully extracted
  • Noise ratio — proportion of the document that is non-informative content (e.g., repeated headers, page numbers)
  • OCR confidence — a low score here caps the grade at C for scanned documents
Grade | Score | What it means
A | 85–100 | High quality: all key fields extracted
B | 65–84 | Good: minor gaps
C | 45–64 | Moderate: notable extraction issues
D | 0–44 | Low: significant problems; review before use
See Quality Scores for filtering strategies and feedback options.
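The grade bands and the OCR cap can be expressed in a few lines. This sketch covers only the score-to-grade mapping; how the three signals are weighted into the 0–100 score is not stated on this page, so it is left out. The function name `grade_for` is hypothetical.

```python
from typing import Optional


def grade_for(score: float, ocr_confidence: Optional[float] = None) -> str:
    """Map a 0-100 quality score to a grade, applying the OCR cap."""
    if not 0 <= score <= 100:
        raise ValueError("score must be between 0 and 100")
    if score >= 85:
        grade = "A"
    elif score >= 65:
        grade = "B"
    elif score >= 45:
        grade = "C"
    else:
        grade = "D"
    # When OCR ran and confidence is below 0.7, the grade cannot exceed C.
    if ocr_confidence is not None and ocr_confidence < 0.7 and grade in ("A", "B"):
        grade = "C"
    return grade
```

For example, a document scoring 91 is graded A when no OCR ran, but the same score with an ocr_confidence of 0.65 is capped at C.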

Step 6 — Delivery

Results are written to the pipeline execution record and returned in the job response:
{
  "data": {
    "status": "completed",
    "detected_language": "tr",
    "quality": {
      "score": 91,
      "grade": "A"
    },
    "pii_findings_count": 3,
    "privacy_applied": true,
    "processing_summary": {
      "fields": {
        "vendor": "Acme Ltd.",
        "total_amount": 12500.00
      }
    }
  }
}
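A client might read such a response as follows. This is a sketch that relies only on the keys visible in the sample above; the `summarize_job` helper and its `min_grade` threshold are hypothetical conveniences, not part of the API.

```python
import json

GRADE_ORDER = "ABCD"  # best to worst


def summarize_job(payload: str, min_grade: str = "B") -> dict:
    """Pull the status, grade, and extracted fields out of a job response."""
    data = json.loads(payload)["data"]
    grade = data["quality"]["grade"]
    return {
        "completed": data["status"] == "completed",
        "grade": grade,
        # True when the grade is at least as good as min_grade.
        "meets_threshold": GRADE_ORDER.index(grade) <= GRADE_ORDER.index(min_grade),
        "fields": data["processing_summary"]["fields"],
    }
```

Filtering on `meets_threshold` is one simple way to keep low-grade documents out of a dataset until they have been reviewed.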

Reprocessing a Document

You can reprocess any document at any time — for example, after a pipeline improvement or a classification correction. Reprocessing creates a new job and execution while preserving the previous results:
curl -X POST https://api.flexorch.com/v1/documents/{document_id}/reprocess \
  -H "X-API-KEY: dfx_your_key_here"
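The same call can be prepared from Python with the standard library. This sketch only builds the request object; the endpoint and header name are taken from the curl example above, while the `build_reprocess_request` helper is hypothetical and the shape of the response is not documented here.

```python
import urllib.parse
import urllib.request

API_BASE = "https://api.flexorch.com/v1"


def build_reprocess_request(document_id: str, api_key: str) -> urllib.request.Request:
    """Build (but do not send) the POST request that triggers reprocessing."""
    safe_id = urllib.parse.quote(document_id, safe="")
    url = f"{API_BASE}/documents/{safe_id}/reprocess"
    return urllib.request.Request(url, method="POST", headers={"X-API-KEY": api_key})


# To send it (needs a real document ID and API key):
#   with urllib.request.urlopen(build_reprocess_request(doc_id, key)) as resp:
#       body = resp.read()
```

Separating request construction from sending keeps the helper easy to test without network access.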