Roset Documentation
Roset is the transformation engine for unstructured data. Upload any document, and Roset routes it to the right extraction provider -- Reducto for PDFs, Gemini for images, Whisper for audio -- then generates vector embeddings via OpenAI. You get back five structured outputs: markdown, embeddings, metadata, thumbnails, and a searchable index.
Roset is not an extraction service. It orchestrates extraction services, managing the queues, retries, provider routing, variant tracking, and space isolation so you do not have to.
How it works: You upload files via signed URLs (bytes go directly to storage, never through Roset). Roset coordinates extraction providers to transform your files and tracks every variant with lineage back to the source.
How It Works
- Upload a file via multipart form data or a signed URL. Roset stores metadata only -- file bytes go directly to your storage.
- Route to the right extraction provider. Roset selects Reducto for documents, Gemini for images, or Whisper for audio based on content type.
- Extract structured content. The provider returns markdown, and Roset stores the result as a variant on the file.
- Embed the extracted content. If an OpenAI key is configured, vector embeddings are generated automatically as a second variant.
- Retrieve the original file metadata, extracted markdown, or embeddings through a single unified API.
Upload --> Route (Reducto / Gemini / Whisper) --> Extract --> Embed (OpenAI) --> Variants
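The routing and embedding steps above can be sketched in a few lines of Python. This is an illustrative model only, not Roset's implementation: the function names, the provider strings, and the exact content types that count as "documents" (here only `application/pdf`) are assumptions made for the example.

```python
from typing import List, Optional


def pick_provider(content_type: str) -> Optional[str]:
    """Map a MIME type to a provider name (hypothetical mapping).

    Assumption: only application/pdf is treated as a "document" here;
    the real set of routed content types is not specified in this page.
    """
    # Drop any parameters such as "; charset=binary" and normalize case.
    main_type = content_type.split(";")[0].strip().lower()
    if main_type == "application/pdf":
        return "reducto"
    if main_type.startswith("image/"):
        return "gemini"
    if main_type.startswith("audio/"):
        return "whisper"
    return None  # No route known for this content type in the sketch.


def plan_pipeline(content_type: str, openai_key_configured: bool) -> List[str]:
    """List the pipeline stages a file would pass through in this sketch."""
    provider = pick_provider(content_type)
    if provider is None:
        return []
    steps = ["upload", f"extract:{provider}"]
    # Embedding is only added when an OpenAI key is configured.
    if openai_key_configured:
        steps.append("embed:openai")
    return steps
```

For example, `plan_pipeline("audio/mpeg", True)` lists an upload, a Whisper extraction, and an OpenAI embedding stage, while passing `False` omits the embedding stage.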
Endpoints
| Endpoint | Description |
|---|---|
| POST /v1/upload | Upload a file for processing |
| GET /v1/files | List files |
| GET /v1/files/:id | Get a file and its variants |
| DELETE /v1/files/:id | Delete a file and its variants |
| GET /v1/jobs | List processing jobs |
| GET /v1/jobs/:id | Get job details |
| POST /v1/jobs/:id/cancel | Cancel a queued or in-progress job |
| POST /v1/jobs/:id/retry | Retry a failed job |
| GET /v1/files/:id/variants | List extraction outputs for a file |
| GET /v1/files/:id/variants/:type | Get a specific variant (markdown, embeddings) |
| GET /v1/spaces | List spaces |
| GET /v1/spaces/:name/stats | Get space statistics |
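
As a rough sketch of how the endpoint paths above translate into HTTP requests, the following builds (but does not send) requests with Python's standard library. The base URL `https://api.roset.example` is a placeholder, the helper names are hypothetical, and the bearer-token header is an assumption; see the Authentication page for the actual scheme.

```python
import urllib.request
from urllib.parse import quote

# Placeholder host for illustration; substitute the real API base URL.
BASE_URL = "https://api.roset.example"


def build_request(method: str, path: str, api_key: str) -> urllib.request.Request:
    """Build an unsent request. Bearer auth is assumed, not confirmed here."""
    return urllib.request.Request(
        url=BASE_URL + path,
        method=method,
        headers={"Authorization": f"Bearer {api_key}"},
    )


def get_variant_request(file_id: str, variant_type: str, api_key: str) -> urllib.request.Request:
    """GET /v1/files/:id/variants/:type, with path segments URL-encoded."""
    path = f"/v1/files/{quote(file_id, safe='')}/variants/{quote(variant_type, safe='')}"
    return build_request("GET", path, api_key)


def retry_job_request(job_id: str, api_key: str) -> urllib.request.Request:
    """POST /v1/jobs/:id/retry for a failed job."""
    return build_request("POST", f"/v1/jobs/{quote(job_id, safe='')}/retry", api_key)
```

Passing a built request to `urllib.request.urlopen` would send it; the response format is not described on this page, so parsing is left out of the sketch.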
Core Concepts
- Files are documents tracked by Roset's metadata store. Each file has a processing status and zero or more variants.
- Jobs represent the processing pipeline for a file. Each job moves through a state machine: queued -> processing -> completed or failed. Jobs are created automatically on upload.
- Variants are the extraction outputs linked to a parent file -- extracted markdown, vector embeddings, thumbnails, or structured metadata. Each variant has a type and is produced by a specific provider.
- Connections are linked storage buckets (S3, GCS, Azure Blob Storage, MinIO, Cloudflare R2, Supabase Storage). Roset reads file metadata from your bucket without copying bytes.
- Spaces provide optional namespace isolation for multi-space applications. If you are not building a B2B SaaS product, you can ignore spaces entirely -- files default to a "default" space.
- Provider Keys are the API credentials for extraction providers (Reducto, OpenAI, Gemini, Whisper) managed through Roset. All tiers include managed extraction keys. BYOK is available on Growth+ plans for a 40% discount on overage rates.
- Webhooks deliver HTTP callbacks when processing events occur (file completed, variant ready, job failed).
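The job lifecycle described above, together with the cancel and retry endpoints, can be modeled as a small state machine. This is a minimal sketch for reasoning about client code, not Roset's internal implementation: the `JobStatus` enum and helper names are hypothetical, and the status a job holds after cancellation is not stated on this page, so it is not modeled.

```python
from enum import Enum


class JobStatus(str, Enum):
    QUEUED = "queued"
    PROCESSING = "processing"
    COMPLETED = "completed"
    FAILED = "failed"


# Allowed forward transitions: queued -> processing -> completed or failed.
TRANSITIONS = {
    JobStatus.QUEUED: {JobStatus.PROCESSING},
    JobStatus.PROCESSING: {JobStatus.COMPLETED, JobStatus.FAILED},
    JobStatus.COMPLETED: set(),
    JobStatus.FAILED: set(),
}


def advance(current: JobStatus, target: JobStatus) -> JobStatus:
    """Return the new status, or raise if the transition is not allowed."""
    if target not in TRANSITIONS[current]:
        raise ValueError(f"invalid transition: {current.value} -> {target.value}")
    return target


def can_cancel(status: JobStatus) -> bool:
    """Cancel applies to queued or in-progress jobs."""
    return status in (JobStatus.QUEUED, JobStatus.PROCESSING)


def can_retry(status: JobStatus) -> bool:
    """Retry applies only to failed jobs."""
    return status is JobStatus.FAILED
```

A client could use checks like `can_retry` to decide whether to offer a retry action before calling the corresponding endpoint.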
Next Steps
- Quickstart -- upload your first file in under 5 minutes.
- Installation -- set up the TypeScript or Python SDK.
- Authentication -- configure API keys and bearer tokens.
- API Reference -- full endpoint documentation.