Transform Unstructured Files into Structured Data
Note
Roset is currently in public beta (v0.1.1). The API is stable, but expect new features and providers regularly. Breaking changes will be communicated in advance.
Roset is the transformation engine for unstructured data. Upload any file -- PDF, image, audio, document -- and Roset routes it to the right extraction provider, generates embeddings, and gives you back four structured variants: markdown, embeddings, metadata, and a searchable index.
Quick Start
python
import time

from roset import Client

client = Client(api_key="rsk_...")

# Upload a file -- Roset routes it to the right extraction provider
file = client.files.upload(
    filename="invoice.pdf",
    content_type="application/pdf",
    size_bytes=45678,
)

# Poll until processing completes or fails
status = file["status"]
while status not in ("completed", "failed"):
    time.sleep(2)
    current = client.files.get(file["id"])
    status = current["status"]

# List the variants produced for the file (markdown, embeddings, metadata)
variants = client.files.list_variants(file["id"])
print(variants)

Choose Your Path
I have files to upload
Upload documents via the SDK and get structured variants back -- markdown, embeddings, metadata, and more.
I have files in S3/GCS/Azure
Connect your existing bucket, sync metadata, and auto-process files already in cloud storage.
Get started faster with AI
Roset (roset.dev) is a developer API that transforms unstructured files (PDFs, images, audio, DOCX) into structured variants: markdown, vector embeddings, metadata, and a searchable index. Python SDK: `pip install roset`. Init: `from roset import Client; client = Client(api_key='rsk_...')`. Key methods: `client.files.upload(filename, content_type, size_bytes)`, `client.files.get(file_id)` to poll status, `client.files.list_variants(file_id)` to retrieve outputs. Help me install the SDK, upload a file, poll until status is 'completed', then retrieve the markdown and embeddings variants.
What Can You Build?
- Transform Any File -- Upload a file and get markdown + embeddings + metadata in one call. The "hello world" story.
- Build a Knowledge Base -- Upload many files, search across them, and Q&A with citations. End-to-end RAG pipeline.
- Multi-Tenant Spaces -- Scope files per customer with spaces and portal tokens. B2B SaaS isolation pattern.
- Sync Cloud Storage -- Connect an S3/GCS/Azure bucket, sync files, auto-process, and get webhook notifications.
How It Works
- Upload a file via multipart form data or a signed URL. Roset stores metadata only -- file bytes go directly to your storage.
- Route to the right extraction provider. Roset selects Reducto for documents, Gemini for images, or Whisper for audio based on content type.
- Extract structured content. The provider returns markdown, and Roset stores the result as a variant on the file.
- Embed the extracted content. Vector embeddings are generated automatically via OpenAI as a second variant.
- Retrieve the original file metadata, extracted markdown, or embeddings through a single unified API.
Upload --> Route (Reducto / Gemini / Whisper) --> Extract --> Embed (OpenAI) --> Variants
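The routing step can be pictured as a lookup from content type to provider. The sketch below is only an illustration of that idea, not Roset's actual routing logic: the `Provider` enum, the `route_provider` function, and the rule that anything other than image or audio counts as a document are all assumptions made for the example.

```python
from enum import Enum


class Provider(str, Enum):
    """Extraction providers named in the steps above (hypothetical enum)."""

    REDUCTO = "reducto"  # documents
    GEMINI = "gemini"    # images
    WHISPER = "whisper"  # audio


def route_provider(content_type: str) -> Provider:
    """Pick a provider from a MIME type. Illustrative rule only."""
    # Drop parameters such as "; charset=utf-8" and normalise case.
    mime = content_type.split(";", 1)[0].strip().lower()
    if mime.startswith("image/"):
        return Provider.GEMINI
    if mime.startswith("audio/"):
        return Provider.WHISPER
    # Assumption: everything else (PDF, DOCX, plain text) is a document.
    return Provider.REDUCTO


if __name__ == "__main__":
    for mime in ("application/pdf", "image/png", "audio/mpeg"):
        print(mime, "->", route_provider(mime).value)
```

In practice you never call anything like this yourself; the content type you pass at upload is what Roset uses to make the choice.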
Core Concepts
- Files are documents tracked by Roset's metadata store. Each file has a processing status and zero or more variants.
- Variants are the extraction outputs linked to a parent file -- extracted markdown, vector embeddings, or structured metadata. See Variant Types for details.
- Jobs represent the processing pipeline for a file. Each job moves through a state machine: queued -> processing -> completed or failed.
- Connections are linked storage buckets (S3, GCS, Azure Blob Storage, MinIO, Cloudflare R2, Supabase Storage).
- Spaces provide optional namespace isolation for multi-tenant applications. Defaults to "default".
- Provider Keys are optional BYOK credentials for extraction providers. Roset uses managed keys by default.
- Webhooks deliver HTTP callbacks when processing events occur.
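The job state machine described above can be modelled client-side to decide when to stop polling. This is a minimal sketch under stated assumptions: `JobStatus`, `ALLOWED`, `can_transition`, `is_terminal`, and `wait_until_terminal` are hypothetical names, not part of the Roset SDK, and the transition table assumes a job can only fail from the processing state, which the documentation does not spell out.

```python
import time
from enum import Enum
from typing import Callable


class JobStatus(str, Enum):
    QUEUED = "queued"
    PROCESSING = "processing"
    COMPLETED = "completed"
    FAILED = "failed"


# Assumed transition table: queued -> processing -> completed or failed.
ALLOWED = {
    JobStatus.QUEUED: {JobStatus.PROCESSING},
    JobStatus.PROCESSING: {JobStatus.COMPLETED, JobStatus.FAILED},
    JobStatus.COMPLETED: set(),
    JobStatus.FAILED: set(),
}


def can_transition(current: str, new: str) -> bool:
    """True if the assumed state machine allows current -> new."""
    return JobStatus(new) in ALLOWED[JobStatus(current)]


def is_terminal(status: str) -> bool:
    """A state with no outgoing transitions is terminal."""
    return not ALLOWED[JobStatus(status)]


def wait_until_terminal(
    fetch_status: Callable[[], str],
    interval: float = 2.0,
    timeout: float = 300.0,
    sleep: Callable[[float], None] = time.sleep,
) -> JobStatus:
    """Call fetch_status until it reports a terminal state or time runs out."""
    deadline = time.monotonic() + timeout
    while True:
        status = JobStatus(fetch_status())
        if is_terminal(status):
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"still {status.value} after {timeout} seconds")
        sleep(interval)
```

Assuming a file's status field reports these same states, the helper could wrap the polling call from the Quick Start, for example `wait_until_terminal(lambda: client.files.get(file["id"])["status"])`. The timeout guards against looping forever if a job never reaches a terminal state.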
Next Steps
- Quickstart -- upload your first file in under 5 minutes.
- Installation -- set up the Python or TypeScript SDK.
- Python SDK -- full Python client reference.
- TypeScript SDK -- full TypeScript client reference.
- API Reference -- complete endpoint documentation.