Build and Export Datasets with the Python SDK

After processing your documents, FlexOrch lets you group completed jobs into a named dataset and export the structured output in the format your downstream pipeline expects. This page covers building, exporting, profiling, and deleting datasets.

Build a dataset

Pass a list of completed job IDs and a human-readable name to client.datasets.build(). The method returns a Dataset object once the dataset is assembled.

import os
from flexorch import FlexOrch

client = FlexOrch(api_key=os.environ["FLEXORCH_API_KEY"])

dataset = client.datasets.build(
    job_ids=["job_abc123", "job_def456", "job_ghi789"],
    name="q1-invoices",
)

print(f"Dataset ID:   {dataset.id}")
print(f"Name:         {dataset.name}")
print(f"Status:       {dataset.status}")
print(f"Job count:    {dataset.job_count}")

build() includes only jobs in the completed state. By default, jobs with status queued, running, or failed are silently skipped. Set strict=True to raise IncompleteJobError instead.
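
Because skipped jobs produce no warning by default, it can help to check statuses yourself before calling build(). The sketch below is a hypothetical helper (split_by_completion is not part of the SDK). It assumes only that each job object exposes the id and status attributes used elsewhere on this page.

```python
from typing import Iterable, List, Tuple


def split_by_completion(jobs: Iterable) -> Tuple[List[str], List[Tuple[str, str]]]:
    """Separate completed job IDs from jobs that build() would skip.

    Hypothetical helper, not part of the SDK. Assumes each job has
    `id` and `status` attributes.
    """
    ready: List[str] = []
    skipped: List[Tuple[str, str]] = []
    for job in jobs:
        if job.status == "completed":
            ready.append(job.id)
        else:
            # Keep the status so the caller can report why the job was left out.
            skipped.append((job.id, job.status))
    return ready, skipped
```

One way to use it is to pass the first list as job_ids and log the second, so that any job left out of the dataset is visible.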

Build from filtered jobs

A common pattern is to filter jobs by quality grade before building:

all_jobs = client.jobs.list(limit=200)

high_quality_ids = [
    j.id for j in all_jobs
    if j.quality_grade in ("A", "B") and j.status == "completed"
]

dataset = client.datasets.build(
    job_ids=high_quality_ids,
    name="high-quality-contracts",
)
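
A filter like the one above can match nothing, and this page does not say how build() treats an empty job_ids list. The following sketch wraps the same filter in a hypothetical helper (select_job_ids is an illustrative name, not an SDK function) that fails loudly in that case. It assumes only the id, status, and quality_grade attributes shown above.

```python
from typing import Collection, Iterable, List


def select_job_ids(jobs: Iterable, grades: Collection[str] = ("A", "B")) -> List[str]:
    """Return IDs of completed jobs whose quality grade is in `grades`.

    Raises ValueError when nothing matches, so an empty list is never
    handed to build() by accident.
    """
    ids = [
        job.id
        for job in jobs
        if job.status == "completed" and job.quality_grade in grades
    ]
    if not ids:
        raise ValueError(f"no completed jobs with grade in {sorted(grades)}")
    return ids
```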

Export a dataset

client.datasets.export() returns the dataset contents as raw bytes. Write them to disk with standard Python file I/O.

data = client.datasets.export(dataset.id, format="jsonl")

with open("q1-invoices.jsonl", "wb") as f:
    f.write(data)

print("Export saved.")
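
Since export() returns bytes, a JSON Lines export can also be inspected in memory before it is written anywhere. This is a minimal sketch that assumes the payload is UTF-8 with one JSON object per line. The record schema is not described on this page, so the helper (iter_jsonl_records, a hypothetical name) makes no assumption about field names.

```python
import json
from typing import Any, Dict, Iterator


def iter_jsonl_records(data: bytes) -> Iterator[Dict[str, Any]]:
    """Yield one parsed object per non-empty line of a JSON Lines payload.

    Assumes UTF-8 bytes and one JSON object per line.
    """
    # bytes.splitlines() splits only on \n, \r, and \r\n, so Unicode line
    # separators inside JSON strings do not break records apart.
    for line in data.splitlines():
        if line.strip():
            yield json.loads(line)
```

Iterating lazily means a quick spot check, such as reading the first few records, does not require parsing the whole export.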

Supported export formats

Format	`format` value	Best for
JSON Lines	`"jsonl"`	LLM fine-tuning, streaming ingestion
CSV	`"csv"`	Spreadsheet tools, quick inspection
Parquet	`"parquet"`	Columnar analytics, Spark / DuckDB pipelines
Markdown	`"markdown"`	Human review, RAG document stores
Arrow	`"arrow"`	High-performance in-memory data exchange

# Export the same dataset in multiple formats
for fmt in ["jsonl", "parquet", "csv"]:
    data = client.datasets.export(dataset.id, format=fmt)
    with open(f"q1-invoices.{fmt}", "wb") as f:
        f.write(data)
    print(f"Saved q1-invoices.{fmt}")
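
The loop above reuses the format value as the file extension, which would give a ".markdown" suffix for Markdown exports. If you prefer conventional extensions and an early check against the table of supported formats, a small lookup can do both. The extension choices below (for example ".md") and the helper name export_filename are assumptions for illustration, not something the SDK prescribes.

```python
# Format values come from the table of supported export formats;
# the extensions are conventional choices, not defined by the SDK.
EXPORT_EXTENSIONS = {
    "jsonl": ".jsonl",
    "csv": ".csv",
    "parquet": ".parquet",
    "markdown": ".md",
    "arrow": ".arrow",
}


def export_filename(stem: str, fmt: str) -> str:
    """Build an output filename for a supported export format."""
    try:
        return stem + EXPORT_EXTENSIONS[fmt]
    except KeyError:
        supported = ", ".join(sorted(EXPORT_EXTENSIONS))
        raise ValueError(f"unsupported format {fmt!r}; expected one of: {supported}") from None
```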

Profile a dataset

client.datasets.profile() returns statistics about your dataset — token counts, field coverage, PII distribution, and grade breakdown — useful for quality checks before fine-tuning.

profile = client.datasets.profile(dataset.id)

print(f"Total records:        {profile.record_count}")
print(f"Total tokens:         {profile.total_tokens}")
print(f"Grade A records:      {profile.grade_counts['A']}")
print(f"Unique PII types:     {profile.pii_type_count}")
print(f"Avg quality score:    {profile.avg_quality_score:.2f}")

Run profile() after building and before exporting to catch low-quality datasets early — especially when assembling training data for fine-tuning.
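
One way to turn that advice into a check is a small gate that runs on the profile before any export. The sketch below is hypothetical: passes_quality_gate is not an SDK function, the thresholds are supplied by the caller because the scale of avg_quality_score is not documented here, and grade_counts is assumed to behave like a dict.

```python
def passes_quality_gate(profile, min_avg_score: float, min_grade_a_share: float) -> bool:
    """Return True when the profile meets both caller-supplied thresholds.

    Assumes `profile` has `record_count`, `avg_quality_score`, and a
    dict-like `grade_counts`, as in the profile example above.
    """
    if profile.record_count == 0:
        # An empty dataset cannot meet any quality bar.
        return False
    grade_a = profile.grade_counts.get("A", 0)
    grade_a_share = grade_a / profile.record_count
    return (
        profile.avg_quality_score >= min_avg_score
        and grade_a_share >= min_grade_a_share
    )
```

Using .get() for the grade lookup avoids a KeyError when a dataset happens to contain no grade A records.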

Delete a dataset

Deleting a dataset removes the assembled export artifact. The underlying jobs and their extracted data are not deleted.

client.datasets.delete(dataset.id)
print(f"Dataset {dataset.id} deleted.")

Deletion is immediate and irreversible. If you need the data again, you must call client.datasets.build() to reassemble it from the original jobs.
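
If rebuilding would be costly, you may want a local copy before deleting. The sketch below chains the export() and delete() calls shown on this page. backup_then_delete is a hypothetical helper name, and the size check is a simple sanity test rather than a full integrity guarantee.

```python
from pathlib import Path


def backup_then_delete(client, dataset_id: str, backup_path: str, fmt: str = "jsonl") -> int:
    """Export a dataset to disk, then delete it. Returns the bytes written.

    Hypothetical helper: relies only on client.datasets.export() and
    client.datasets.delete() as used elsewhere on this page.
    """
    data = client.datasets.export(dataset_id, format=fmt)
    path = Path(backup_path)
    path.write_bytes(data)
    # Delete only after the backup on disk has the expected size.
    if path.stat().st_size != len(data):
        raise OSError("backup size mismatch; dataset was not deleted")
    client.datasets.delete(dataset_id)
    return len(data)
```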

Next steps

Jobs

API Reference