A dataset in FlexOrch is a curated collection of pipeline execution results — structured fields, extracted text, quality scores, and PII-masked content — packaged and ready for LLM training, RAG pipelines, or business analytics. Building and exporting datasets does not consume processing credits.
Build a Dataset
Once your documents are processed and jobs show a completed status, group any subset of executions into a named dataset:
curl -X POST https://api.flexorch.com/v1/datasets \
-H "X-API-KEY: dfx_your_key_here" \
-H "Content-Type: application/json" \
-d '{"name": "invoices-q1", "execution_ids": ["exec_001", "exec_002"]}'
You can include executions from multiple document types in a single dataset.
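When execution IDs come from a script rather than being typed by hand, the request body can be assembled programmatically. The following is a minimal sketch, not an official client: `dataset_payload` is a hypothetical helper name, and it assumes the dataset name and IDs contain no characters that need JSON escaping.

```shell
# Hypothetical helper: print the JSON body for POST /v1/datasets.
# Assumes the name and IDs contain no quotes or backslashes (no JSON escaping is done).
dataset_payload() (
  name="$1"
  shift
  ids=""
  for id in "$@"; do
    # Append each ID as a quoted JSON string, comma-separated after the first.
    ids="${ids:+$ids, }\"$id\""
  done
  printf '{"name": "%s", "execution_ids": [%s]}\n' "$name" "$ids"
)

# Usage (sends the request):
#   curl -X POST https://api.flexorch.com/v1/datasets \
#     -H "X-API-KEY: dfx_your_key_here" \
#     -H "Content-Type: application/json" \
#     -d "$(dataset_payload invoices-q1 exec_001 exec_002)"
```

The function body runs in a subshell, so its working variables do not leak into the calling script.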
FlexOrch supports the following export formats to fit your downstream workflow:
| Format | Extension | Best For |
|---|---|---|
| JSONL | .jsonl | LLM fine-tuning (OpenAI, Anthropic) |
| CSV | .csv | Spreadsheet analysis |
| Parquet | .parquet | Data pipelines, analytics |
| Markdown | .md | RAG — LlamaIndex, LangChain |
| XML | .xml | Enterprise integrations |
| XLSX | .xlsx | Excel-based workflows |
| HuggingFace Arrow | .arrow | datasets.load_from_disk() |
| RAG chunks | .json | Semantic chunking with metadata |
Export a Dataset
Download a dataset in your chosen format:
curl "https://api.flexorch.com/v1/datasets/{id}/export?format=jsonl" \
-H "X-API-KEY: dfx_your_key_here" \
-o output.jsonl
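To export in a different format, set the format query parameter to the extension of any format in the table above, without the leading dot (for example, format=csv), and change the -o filename to match.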
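If you need the same dataset in more than one format, the export call can be wrapped in a small function and looped. This is a sketch under stated assumptions: `export_dataset` is a hypothetical helper, `FLEXORCH_API_KEY` is an assumed environment variable holding your key, and format values other than `jsonl` are inferred from the extension column rather than confirmed.

```shell
# Hypothetical helper: download one dataset in one format.
# FLEXORCH_API_KEY is an assumed environment variable holding your API key.
export_dataset() (
  id="$1"
  fmt="$2"
  # -f makes curl fail on HTTP errors instead of saving the error body to the file.
  curl -fsS "https://api.flexorch.com/v1/datasets/${id}/export?format=${fmt}" \
    -H "X-API-KEY: ${FLEXORCH_API_KEY}" \
    -o "${id}.${fmt}"
)

# Usage: export the same dataset in several formats.
# csv and parquet are assumed values; only jsonl appears in the example request.
#   for fmt in jsonl csv parquet; do export_dataset "your_dataset_id" "$fmt"; done
```

Each file is named after the dataset ID and format, so repeated runs overwrite the previous export of the same format rather than accumulating copies.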
Dataset Profile
Get aggregate statistics about a dataset before you export it:
curl "https://api.flexorch.com/v1/datasets/{id}/profile" \
-H "X-API-KEY: dfx_your_key_here"
The profile returns quality grade distribution, average quality score, PII type summary, detected jurisdictions, and available export formats. See PII & Privacy for details on the compliance fields included in the profile.
Semantic Indexing
Semantic indexing is available on Pro and Enterprise plans.
Index a dataset to enable natural-language search across all documents:
curl -X POST "https://api.flexorch.com/v1/datasets/{id}/index" \
-H "X-API-KEY: dfx_your_key_here"
Once indexed, run a semantic search against the dataset:
curl -X POST "https://api.flexorch.com/v1/search" \
-H "X-API-KEY: dfx_your_key_here" \
-H "Content-Type: application/json" \
-d '{"query": "invoices over 10000 EUR from Germany", "top_k": 5}'
The response returns the top matching documents ranked by semantic relevance.
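For repeated queries, the search request can be wrapped in a function that takes the query text and result count as arguments. This is only a sketch: `semantic_search` is a hypothetical helper, `FLEXORCH_API_KEY` is an assumed environment variable, and the query is interpolated without JSON escaping, so it must not contain double quotes or backslashes.

```shell
# Hypothetical helper: POST a natural-language query to /v1/search.
# FLEXORCH_API_KEY is an assumed environment variable holding your API key.
# The query must not contain double quotes or backslashes (no JSON escaping is done).
semantic_search() (
  query="$1"
  top_k="${2:-5}"   # defaults to 5 results when no count is given
  curl -fsS -X POST "https://api.flexorch.com/v1/search" \
    -H "X-API-KEY: ${FLEXORCH_API_KEY}" \
    -H "Content-Type: application/json" \
    -d "{\"query\": \"${query}\", \"top_k\": ${top_k}}"
)

# Usage:
#   semantic_search "invoices over 10000 EUR from Germany" 5
```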
Dataset Retention
Datasets are stored for a period determined by your plan:
| Plan | Retention |
|---|---|
| Trial | 7 days |
| Starter | 30 days |
| Pro | 90 days |
| Enterprise | Configurable |
Export critical datasets to your own storage before retention expires. Use the dataset.ready webhook with auto_export to automate this — see Webhooks.
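To schedule an export ahead of the deadline, the retention table can be turned into a small calculation. This sketch assumes retention is counted from the dataset's creation time, which the text above does not state. `retention_days` and `expiry_epoch` are hypothetical helper names, the lowercase plan names are an illustrative convention, and Enterprise is treated as unknown because its retention is configurable.

```shell
# Hypothetical helper: map a plan name to its retention period in days.
retention_days() {
  case "$1" in
    trial)   echo 7 ;;
    starter) echo 30 ;;
    pro)     echo 90 ;;
    *)       return 1 ;;  # Enterprise is configurable, so there is no fixed value
  esac
}

# Hypothetical helper: print the expiry time as Unix epoch seconds.
# $1 = plan name, $2 = dataset creation time in epoch seconds (assumed start of retention).
expiry_epoch() (
  days=$(retention_days "$1") || exit 1
  echo $(( $2 + days * 86400 ))
)

# Usage:
#   expiry_epoch starter "$(date +%s)"
```

Comparing the result with the current time (`date +%s`) tells a scheduled job how many seconds remain before the dataset is removed.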