A dataset in FlexOrch is a curated collection of pipeline execution results — structured fields, extracted text, quality scores, and PII-masked content — packaged and ready for LLM training, RAG pipelines, or business analytics. Building and exporting datasets does not consume processing credits.

Build a Dataset

Once your documents are processed and jobs show a completed status, group any subset of executions into a named dataset:
curl -X POST https://api.flexorch.com/v1/datasets \
  -H "X-API-KEY: dfx_your_key_here" \
  -H "Content-Type: application/json" \
  -d '{"name": "invoices-q1", "execution_ids": ["exec_001", "exec_002"]}'
You can include executions from multiple document types in a single dataset.
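
The same request can also be assembled from Python. The following is a minimal sketch using only the standard library. The helper name `build_create_dataset_request` is illustrative and not part of any official SDK, and the shape of the response is not assumed here.

```python
import json
import urllib.request

API_BASE = "https://api.flexorch.com/v1"


def build_create_dataset_request(api_key, name, execution_ids):
    """Build (but do not send) the POST /datasets request shown above.

    Hypothetical helper for illustration; not part of an official SDK.
    """
    body = json.dumps({"name": name, "execution_ids": list(execution_ids)})
    return urllib.request.Request(
        f"{API_BASE}/datasets",
        data=body.encode("utf-8"),
        headers={"X-API-KEY": api_key, "Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    req = build_create_dataset_request(
        "dfx_your_key_here", "invoices-q1", ["exec_001", "exec_002"]
    )
    # Sending is left to the caller, e.g. urllib.request.urlopen(req).
    print(req.get_method(), req.full_url)
```

Keeping request construction separate from sending makes the payload easy to inspect or unit-test before any network call is made.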

Export Formats

FlexOrch supports the following export formats to fit your downstream workflow:
| Format | Extension | Best For |
| --- | --- | --- |
| JSONL | .jsonl | LLM fine-tuning (OpenAI, Anthropic) |
| CSV | .csv | Spreadsheet analysis |
| Parquet | .parquet | Data pipelines, analytics |
| Markdown | .md | RAG (LlamaIndex, LangChain) |
| XML | .xml | Enterprise integrations |
| XLSX | .xlsx | Excel-based workflows |
| HuggingFace Arrow | .arrow | datasets.load_from_disk() |
| RAG chunks | .json | Semantic chunking with metadata |

Export a Dataset

Download a dataset in your chosen format:
curl "https://api.flexorch.com/v1/datasets/{id}/export?format=jsonl" \
  -H "X-API-KEY: dfx_your_key_here" \
  -o output.jsonl
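To export in a different format, replace the value of the format query parameter (jsonl in this example) with the extension of any format listed in the table above, and change the output filename to match.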
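
When exporting several formats in a script, it can help to derive the URL and the local filename from one place. The sketch below assumes that every extension in the table (without the leading dot) is accepted as the `format` value; only `jsonl` appears in the documented example. The helper name `export_target` and the dataset ID `ds_123` are hypothetical.

```python
from urllib.parse import quote, urlencode

API_BASE = "https://api.flexorch.com/v1"

# Extensions from the export format table. Assumption: each one (without the
# leading dot) is a valid `format` value; only "jsonl" is shown in the docs.
EXPORT_EXTENSIONS = ("jsonl", "csv", "parquet", "md", "xml", "xlsx", "arrow", "json")


def export_target(dataset_id, fmt):
    """Return (url, local_filename) for exporting a dataset in one format."""
    if fmt not in EXPORT_EXTENSIONS:
        raise ValueError(f"unknown export format: {fmt!r}")
    path = f"{API_BASE}/datasets/{quote(dataset_id, safe='')}/export"
    return f"{path}?{urlencode({'format': fmt})}", f"{dataset_id}.{fmt}"


if __name__ == "__main__":
    # "ds_123" is a made-up dataset ID used only for illustration.
    for fmt in ("jsonl", "parquet"):
        print(export_target("ds_123", fmt))
```

Rejecting unknown formats locally avoids a round trip to the API for an obvious typo.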

Dataset Profile

Get aggregate statistics about a dataset before you export it:
curl "https://api.flexorch.com/v1/datasets/{id}/profile" \
  -H "X-API-KEY: dfx_your_key_here"
The profile returns quality grade distribution, average quality score, PII type summary, detected jurisdictions, and available export formats. See PII & Privacy for details on the compliance fields included in the profile.
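
One way to use the profile is as a gate before exporting. The sketch below is illustrative only: the JSON key names `average_quality_score` and `available_formats` are assumptions based on the field descriptions above, not confirmed field names, and the sample values are made up. Check them against a real profile response before relying on this.

```python
def ready_for_export(profile, wanted_format, min_avg_score):
    """Decide whether a dataset profile passes a simple pre-export check.

    The keys "average_quality_score" and "available_formats" are
    hypothetical; verify them against an actual profile response.
    """
    score = profile.get("average_quality_score")
    formats = profile.get("available_formats", [])
    if score is None:
        # Without a score we cannot judge quality, so fail closed.
        return False
    return score >= min_avg_score and wanted_format in formats


if __name__ == "__main__":
    # Made-up sample profile, for illustration only.
    sample = {"average_quality_score": 0.91, "available_formats": ["jsonl", "csv"]}
    print(ready_for_export(sample, "jsonl", min_avg_score=0.8))
```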

Semantic Indexing

Semantic indexing is available on Pro and Enterprise plans.
Index a dataset to enable natural-language search across all documents:
curl -X POST "https://api.flexorch.com/v1/datasets/{id}/index" \
  -H "X-API-KEY: dfx_your_key_here"
Once indexed, run a semantic search against the dataset:
curl -X POST "https://api.flexorch.com/v1/search" \
  -H "X-API-KEY: dfx_your_key_here" \
  -H "Content-Type: application/json" \
  -d '{"query": "invoices over 10000 EUR from Germany", "top_k": 5}'
The response returns the top matching documents ranked by semantic relevance.
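
The index and search calls above can be sketched in Python as follows. This uses only the standard library and mirrors the documented requests exactly; the helper names are illustrative, and the response structure is not assumed.

```python
import json
import urllib.request
from urllib.parse import quote

API_BASE = "https://api.flexorch.com/v1"


def build_index_request(api_key, dataset_id):
    """Build the POST /datasets/{id}/index request (no request body)."""
    return urllib.request.Request(
        f"{API_BASE}/datasets/{quote(dataset_id, safe='')}/index",
        headers={"X-API-KEY": api_key},
        method="POST",
    )


def build_search_request(api_key, query, top_k=5):
    """Build the POST /search request with the documented body fields."""
    body = json.dumps({"query": query, "top_k": top_k})
    return urllib.request.Request(
        f"{API_BASE}/search",
        data=body.encode("utf-8"),
        headers={"X-API-KEY": api_key, "Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    req = build_search_request(
        "dfx_your_key_here", "invoices over 10000 EUR from Germany"
    )
    # Send with urllib.request.urlopen(req) once the dataset is indexed.
    print(req.full_url, req.data.decode("utf-8"))
```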

Dataset Retention

Datasets are stored for a period determined by your plan:
| Plan | Retention |
| --- | --- |
| Trial | 7 days |
| Starter | 30 days |
| Pro | 90 days |
| Enterprise | Configurable |
Export critical datasets to your own storage before retention expires. Use the dataset.ready webhook with auto_export to automate this — see Webhooks.
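
To plan exports around the retention window, the deadline can be estimated from the table above. The sketch below assumes the retention period is counted from the dataset's creation time, which the text does not state explicitly; the function name and the `enterprise_days` parameter are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Retention periods from the table above. Enterprise is configurable, so it
# has no fixed entry here.
RETENTION_DAYS = {"trial": 7, "starter": 30, "pro": 90}


def retention_deadline(created_at, plan, enterprise_days=None):
    """Estimate when a dataset leaves retention.

    Assumption: the period starts at the dataset's creation time.
    """
    plan = plan.lower()
    if plan == "enterprise":
        if enterprise_days is None:
            raise ValueError("Enterprise retention is configurable; pass enterprise_days")
        days = enterprise_days
    else:
        days = RETENTION_DAYS[plan]
    return created_at + timedelta(days=days)


if __name__ == "__main__":
    created = datetime(2024, 1, 1, tzinfo=timezone.utc)
    print(retention_deadline(created, "Starter").isoformat())
```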