Transform Unstructured Files into Structured Data

Note

Roset is currently in public beta (v0.1.1). The API is stable, but expect new features and providers regularly. Breaking changes will be communicated in advance.

Roset is the transformation engine for unstructured data. Upload any file -- PDF, image, audio, document -- and Roset routes it to the right extraction provider, generates embeddings, and gives you back four structured variants: markdown, embeddings, metadata, and a searchable index.

Quick Start

python
import time

from roset import Client

client = Client(api_key="rsk_...")

# Upload a file -- Roset routes it to the right extraction provider
file = client.files.upload(
    filename="invoice.pdf",
    content_type="application/pdf",
    size_bytes=45678,
)

# Poll until processing reaches a terminal status
status = file["status"]
while status not in ("completed", "failed"):
    time.sleep(2)  # wait before re-checking so the loop exits without an extra sleep
    status = client.files.get(file["id"])["status"]

if status == "failed":
    raise RuntimeError(f"Processing failed for file {file['id']}")

# List the file's variants (extracted markdown, embeddings, metadata)
variants = client.files.list_variants(file["id"])
print(variants)
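
The polling loop above runs forever if a file never reaches a terminal status. A minimal sketch of a bounded version follows. The helper name `wait_for_file` and its `interval`, `timeout`, and `sleep` parameters are hypothetical and not part of the Roset SDK; it only assumes that the getter returns a dict with a `"status"` key, as in the Quick Start.

```python
import time
from typing import Any, Callable, Dict

# Statuses the Quick Start treats as final.
TERMINAL_STATUSES = ("completed", "failed")


def wait_for_file(
    get_file: Callable[[str], Dict[str, Any]],
    file_id: str,
    interval: float = 2.0,
    timeout: float = 300.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, Any]:
    """Poll get_file(file_id) until the status is terminal or the timeout elapses.

    Hypothetical helper, not part of the SDK. `get_file` is any callable that
    returns a dict with a "status" key, for example `client.files.get`.
    """
    deadline = time.monotonic() + timeout
    while True:
        current = get_file(file_id)
        if current["status"] in TERMINAL_STATUSES:
            return current
        if time.monotonic() >= deadline:
            raise TimeoutError(
                f"file {file_id} still {current['status']!r} after {timeout}s"
            )
        sleep(interval)
```

With the client from the Quick Start, this would be called as `wait_for_file(client.files.get, file["id"])`. Passing the getter in as a callable keeps the helper testable without network access.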

Choose Your Path

Get started faster with AI
Roset (roset.dev) is a developer API that transforms unstructured files (PDFs, images, audio, DOCX) into structured variants: markdown, vector embeddings, metadata, and a searchable index. Python SDK: `pip install roset`. Init: `from roset import Client; client = Client(api_key='rsk_...')`. Key methods: `client.files.upload(filename, content_type, size_bytes)`, `client.files.get(file_id)` to poll status, `client.files.list_variants(file_id)` to retrieve outputs. Help me install the SDK, upload a file, poll until status is 'completed', then retrieve the markdown and embeddings variants.

What Can You Build?

  • Transform Any File -- Upload a file and get markdown + embeddings + metadata in one call. The "hello world" story.
  • Build a Knowledge Base -- Upload many files, search across them, and answer questions with citations. End-to-end RAG pipeline.
  • Multi-Tenant Spaces -- Scope files per customer with spaces and portal tokens. B2B SaaS isolation pattern.
  • Sync Cloud Storage -- Connect an S3/GCS/Azure bucket, sync files, auto-process, and get webhook notifications.

How It Works

  1. Upload a file via multipart form data or a signed URL. Roset stores metadata only -- file bytes go directly to your storage.
  2. Route to the right extraction provider. Roset selects Reducto for documents, Gemini for images, or Whisper for audio based on content type.
  3. Extract structured content. The provider returns markdown, and Roset stores the result as a variant on the file.
  4. Embed the extracted content. Vector embeddings are generated automatically via OpenAI as a second variant.
  5. Retrieve the original file metadata, extracted markdown, or embeddings through a single unified API.

Upload --> Route (Reducto / Gemini / Whisper) --> Extract --> Embed (OpenAI) --> Variants
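
As a mental model for the routing step, the sketch below maps a content type to one of the three providers named above. The function `pick_provider` and its fallback rule are assumptions for illustration only; Roset's actual routing logic runs server-side and is not documented here.

```python
def pick_provider(content_type: str) -> str:
    """Return an extraction provider name for a MIME type.

    Illustrative only: the mapping mirrors the description above
    (documents -> Reducto, images -> Gemini, audio -> Whisper).
    """
    main_type = content_type.split("/", 1)[0].strip().lower()
    if main_type == "image":
        return "gemini"
    if main_type == "audio":
        return "whisper"
    # Assumption: anything else (PDF, DOCX, plain text) is treated as a document.
    return "reducto"


for mime in ("application/pdf", "image/png", "audio/mpeg"):
    print(mime, "->", pick_provider(mime))
```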

Core Concepts

  • Files are documents tracked by Roset's metadata store. Each file has a processing status and zero or more variants.
  • Variants are the extraction outputs linked to a parent file -- extracted markdown, vector embeddings, structured metadata, or a searchable index. See Variant Types for details.
  • Jobs represent the processing pipeline for a file. Each job moves through a state machine: queued -> processing -> completed or failed.
  • Connections are linked storage buckets (S3, GCS, Azure Blob Storage, MinIO, Cloudflare R2, Supabase Storage).
  • Spaces provide optional namespace isolation for multi-tenant applications. The space defaults to "default" when none is specified.
  • Provider Keys are optional BYOK credentials for extraction providers. Roset uses managed keys by default.
  • Webhooks deliver HTTP callbacks when processing events occur.
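
The job state machine described above can be sketched as a small transition table. The names `JobState`, `ALLOWED`, `advance`, and `is_terminal` are hypothetical and not part of the SDK. The sketch also assumes a job can fail only from `processing`, which the source does not state explicitly.

```python
from enum import Enum


class JobState(str, Enum):
    QUEUED = "queued"
    PROCESSING = "processing"
    COMPLETED = "completed"
    FAILED = "failed"


# Allowed transitions: queued -> processing -> completed or failed.
# Assumption: failure is reachable only from "processing".
ALLOWED = {
    JobState.QUEUED: {JobState.PROCESSING},
    JobState.PROCESSING: {JobState.COMPLETED, JobState.FAILED},
    JobState.COMPLETED: set(),
    JobState.FAILED: set(),
}


def advance(current: JobState, target: JobState) -> JobState:
    """Return the target state if the transition is allowed, else raise ValueError."""
    if target not in ALLOWED[current]:
        raise ValueError(f"illegal transition: {current.value} -> {target.value}")
    return target


def is_terminal(state: JobState) -> bool:
    """A state is terminal when it has no outgoing transitions."""
    return not ALLOWED[state]
```

Client code that polls a file's status can use `is_terminal(JobState(status))` as its stop condition instead of hard-coding the two terminal strings.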

Next Steps