Build and Export Datasets from Processed Documents

A dataset in FlexOrch is a curated collection of pipeline execution results — structured fields, extracted text, quality scores, and PII-masked content — packaged and ready for LLM training, RAG pipelines, or business analytics. Building and exporting datasets does not consume processing credits.

Build a Dataset

Once your documents are processed and jobs show a completed status, group any subset of executions into a named dataset:

curl -X POST https://api.flexorch.com/v1/datasets \
  -H "X-API-KEY: dfx_your_key_here" \
  -H "Content-Type: application/json" \
  -d '{"name": "invoices-q1", "execution_ids": ["exec_001", "exec_002"]}'

You can include executions from multiple document types in a single dataset.
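The same request can be prepared from a script. The following is a minimal Python sketch, assuming only the endpoint, headers, and body fields shown in the curl example; the helper name `build_dataset_request` and the check for an empty ID list are illustrative assumptions, not part of the FlexOrch API.

```python
import json
import urllib.request

API_BASE = "https://api.flexorch.com/v1"  # base URL taken from the curl example


def build_dataset_request(api_key, name, execution_ids):
    """Prepare (without sending) the POST request that creates a dataset.

    Hypothetical helper: only the URL, headers, and JSON fields come from
    the documented curl call.
    """
    ids = list(execution_ids)
    if not ids:
        # Assumption: an empty dataset is not useful, so fail early.
        raise ValueError("pass at least one execution ID")
    payload = {"name": name, "execution_ids": ids}
    return urllib.request.Request(
        f"{API_BASE}/datasets",
        data=json.dumps(payload).encode("utf-8"),
        headers={"X-API-KEY": api_key, "Content-Type": "application/json"},
        method="POST",
    )


# To send it with a real key:
#   with urllib.request.urlopen(build_dataset_request(key, name, ids)) as resp:
#       print(json.load(resp))
```

Keeping request construction separate from sending makes the payload easy to inspect or log before any call is made. The response body is not described on this page, so the commented usage prints it as-is rather than reading specific fields.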

Export Formats

FlexOrch supports the following export formats to fit your downstream workflow:

Format	Extension	Best For
JSONL	`.jsonl`	LLM fine-tuning (OpenAI, Anthropic)
CSV	`.csv`	Spreadsheet analysis
Parquet	`.parquet`	Data pipelines, analytics
Markdown	`.md`	RAG — LlamaIndex, LangChain
XML	`.xml`	Enterprise integrations
XLSX	`.xlsx`	Excel-based workflows
HuggingFace Arrow	`.arrow`	`datasets.load_from_disk()`
RAG chunks	`.json`	Semantic chunking with metadata

Export a Dataset

Download a dataset in your chosen format:

curl "https://api.flexorch.com/v1/datasets/{id}/export?format=jsonl" \
  -H "X-API-KEY: dfx_your_key_here" \
  -o output.jsonl

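To export in a different format, replace the value of the format query parameter with an extension from the table above, written without the leading dot as in the jsonl example, and change the output filename to match.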
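When exporting several formats from a script, it helps to derive the URL and the local filename together. This is a minimal Python sketch, assuming the `format` value is the table's extension without its dot (as with `jsonl` in the curl example); the helper `export_target` and the dataset ID `ds_123` in the comment are hypothetical.

```python
from urllib.parse import quote, urlencode

API_BASE = "https://api.flexorch.com/v1"  # base URL taken from the curl example

# Extensions listed in the export formats table, without the leading dot.
# Assumption: each doubles as a valid `format` query value.
EXTENSIONS = ("jsonl", "csv", "parquet", "md", "xml", "xlsx", "arrow", "json")


def export_target(dataset_id, fmt):
    """Return (export_url, local_filename) for one dataset and format."""
    fmt = fmt.lower().lstrip(".")
    if fmt not in EXTENSIONS:
        raise ValueError(f"unknown export format: {fmt!r}")
    query = urlencode({"format": fmt})
    url = f"{API_BASE}/datasets/{quote(dataset_id, safe='')}/export?{query}"
    return url, f"{dataset_id}.{fmt}"


# Example with a hypothetical dataset ID:
#   export_target("ds_123", "parquet")
```

Validating the format locally catches typos before a request is made. The resulting URL still needs the `X-API-KEY` header when it is fetched.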

Dataset Profile

Get aggregate statistics about a dataset before you export it:

curl "https://api.flexorch.com/v1/datasets/{id}/profile" \
  -H "X-API-KEY: dfx_your_key_here"

The profile returns quality grade distribution, average quality score, PII type summary, detected jurisdictions, and available export formats. See PII & Privacy for details on the compliance fields included in the profile.
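A profile can act as a gate before export. The sketch below prepares the profile request and checks an average score against a threshold. The JSON key `average_quality_score` is a hypothetical name: this page lists which statistics the profile contains but not its field names or score scale, so adjust both to the actual response.

```python
import urllib.request
from urllib.parse import quote

API_BASE = "https://api.flexorch.com/v1"  # base URL taken from the curl example


def build_profile_request(api_key, dataset_id):
    """Prepare (without sending) the GET request for a dataset profile."""
    return urllib.request.Request(
        f"{API_BASE}/datasets/{quote(dataset_id, safe='')}/profile",
        headers={"X-API-KEY": api_key},
    )


def passes_quality_gate(profile, min_score, score_key="average_quality_score"):
    """Return True when the profile's average score meets the threshold.

    `score_key` is an assumed field name, not taken from the FlexOrch docs.
    """
    if score_key not in profile:
        raise KeyError(f"profile has no {score_key!r} field")
    return profile[score_key] >= min_score
```

A script could fetch the profile, call `passes_quality_gate` on the parsed JSON, and only then request the export.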

Semantic Indexing

Semantic indexing is available on Pro and Enterprise plans.

Index a dataset to enable natural-language search across all documents:

curl -X POST "https://api.flexorch.com/v1/datasets/{id}/index" \
  -H "X-API-KEY: dfx_your_key_here"

Once indexed, run a semantic search against the dataset:

curl -X POST "https://api.flexorch.com/v1/search" \
  -H "X-API-KEY: dfx_your_key_here" \
  -H "Content-Type: application/json" \
  -d '{"query": "invoices over 10000 EUR from Germany", "top_k": 5}'

The response returns the top matching documents ranked by semantic relevance.
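The two calls above can be prepared in Python as follows. This is a sketch under the assumption that the endpoints accept exactly what the curl examples send; the helper names are illustrative. The documented search body carries no dataset identifier, so none is added here.

```python
import json
import urllib.request
from urllib.parse import quote

API_BASE = "https://api.flexorch.com/v1"  # base URL taken from the curl examples


def build_index_request(api_key, dataset_id):
    """Prepare the POST request that indexes a dataset (no request body)."""
    return urllib.request.Request(
        f"{API_BASE}/datasets/{quote(dataset_id, safe='')}/index",
        headers={"X-API-KEY": api_key},
        method="POST",
    )


def build_search_request(api_key, query, top_k=5):
    """Prepare the POST request for a semantic search."""
    if not isinstance(top_k, int) or top_k < 1:
        # Assumption: top_k is meant to be a positive integer.
        raise ValueError("top_k must be a positive integer")
    body = json.dumps({"query": query, "top_k": top_k}).encode("utf-8")
    return urllib.request.Request(
        f"{API_BASE}/search",
        data=body,
        headers={"X-API-KEY": api_key, "Content-Type": "application/json"},
        method="POST",
    )
```

Either request can be sent with `urllib.request.urlopen(request)`. Whether indexing completes synchronously is not stated on this page, so a script may need to wait before searching.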

Dataset Retention

Datasets are stored for a period determined by your plan:

Plan	Retention
Trial	7 days
Starter	30 days
Pro	90 days
Enterprise	Configurable

Export critical datasets to your own storage before retention expires. Use the dataset.ready webhook with auto_export to automate this — see Webhooks.
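To plan exports, the retention table can be turned into a deadline calculation. This minimal Python sketch assumes the retention period is counted from the dataset's creation time, which this page does not state; the function name and the `enterprise_days` parameter are illustrative.

```python
from datetime import datetime, timedelta, timezone

# Retention periods from the table above. Enterprise is configurable,
# so it has no fixed entry here.
RETENTION_DAYS = {"trial": 7, "starter": 30, "pro": 90}


def retention_deadline(created_at, plan, enterprise_days=None):
    """Return the datetime after which the dataset may no longer be stored.

    Assumption: the retention clock starts at `created_at`.
    """
    key = plan.lower()
    if key == "enterprise":
        if enterprise_days is None:
            raise ValueError("Enterprise retention is configurable; pass enterprise_days")
        days = enterprise_days
    elif key in RETENTION_DAYS:
        days = RETENTION_DAYS[key]
    else:
        raise ValueError(f"unknown plan: {plan!r}")
    return created_at + timedelta(days=days)


# Example: a Starter dataset created on 1 January 2025 (UTC)
created = datetime(2025, 1, 1, tzinfo=timezone.utc)
print(retention_deadline(created, "Starter").date())  # 2025-01-31
```

Scheduling the export a few days before the computed deadline leaves room for retries.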
