Variant Types
Variants are the structured outputs of Roset's transformation pipeline. Each variant is a different representation of the same source file, linked to it by lineage. When you upload a file, Roset automatically produces up to four variant types.
Overview
| Type | What it contains | Produced by | Use case |
|---|---|---|---|
| markdown | Extracted text in markdown format | Reducto, Gemini, Whisper | Display, downstream processing |
| embeddings | Vector embeddings for each chunk | OpenAI | Semantic search, RAG |
| metadata | Page count, language, confidence | Extraction provider | Filtering, quality checks |
| searchable-index | Full-text search index | Roset | Keyword search |
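
The type names in the first column are also the strings passed to the API calls shown later on this page. A minimal sketch of a client-side check for a requested list of types, assuming these four names are the complete set; `VARIANT_TYPES` and `validate_variant_types` are hypothetical helpers, not part of the Roset SDK.

```python
# The four variant type names listed in the overview table.
# Assumption: this is the full set accepted by the API.
VARIANT_TYPES = ("markdown", "embeddings", "metadata", "searchable-index")


def validate_variant_types(requested):
    """Return the requested types de-duplicated in order, or raise on unknown names."""
    unknown = [name for name in requested if name not in VARIANT_TYPES]
    if unknown:
        raise ValueError(f"Unknown variant types: {', '.join(unknown)}")
    # dict.fromkeys keeps first-seen order while dropping duplicates.
    return list(dict.fromkeys(requested))
```

Running a check like this before an upload surfaces a misspelled type name locally instead of in an API error.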
Markdown
The primary extraction output. Contains the full text content of the file converted to markdown format.
Fields:
| Field | Type | Description |
|---|---|---|
| content | string | The extracted text in markdown |
| pageCount | number | Number of pages extracted |
| wordCount | number | Total word count |
| characterCount | number | Total character count |
When it's produced: For every file that has textual content. Documents (PDF, DOCX) are extracted by Reducto, images by Gemini (OCR), and audio by Whisper (transcription).
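
The routing described above can be sketched as a small lookup. This is only an illustration: the real routing happens inside Roset, and both the helper name `expected_extraction_provider` and the specific MIME types are assumptions made for the example.

```python
from typing import Optional

# Hypothetical table of document MIME types (PDF and DOCX); an illustrative
# assumption, not a list published by Roset.
DOCUMENT_TYPES = {
    "application/pdf",
    "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
}


def expected_extraction_provider(content_type: str) -> Optional[str]:
    """Guess which provider would produce the markdown variant for a MIME type."""
    # Drop parameters such as "; charset=utf-8" and normalize case.
    normalized = content_type.split(";")[0].strip().lower()
    if normalized in DOCUMENT_TYPES:
        return "Reducto"   # documents
    if normalized.startswith("image/"):
        return "Gemini"    # OCR
    if normalized.startswith("audio/"):
        return "Whisper"   # transcription
    return None            # no textual content expected
```

A helper like this can be handy when reading the `provider` value reported for a variant, to confirm it matches what the file type suggests.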
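```python
# Assumes `client` and `file_id` are already defined.
markdown = client.files.get_variant(file_id, "markdown")
print(f"Pages: {markdown.get('pageCount')}")
print(f"Words: {markdown.get('wordCount')}")
print(markdown["content"][:500])  # preview the first 500 characters
```

Embeddings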
Vector embeddings generated from the extracted text, chunked for semantic search and RAG.
Fields:
| Field | Type | Description |
|---|---|---|
| chunks | array | Array of chunk objects with text and vector |
| model | string | Embedding model used (e.g., text-embedding-3-small) |
| dimensions | number | Vector dimensions (e.g., 1536) |
| totalChunks | number | Number of chunks generated |
When it's produced: After markdown extraction completes, if an OpenAI API key is available (a managed key is provided by default).
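
To show how the `chunks` array supports semantic search, here is a minimal sketch that ranks chunks by cosine similarity to a query vector. It assumes each chunk object exposes its fields under the keys `text` and `vector`; those key names, the helper `top_chunks`, and the toy 3-dimensional sample payload are all assumptions for illustration. In practice the query vector must come from the same embedding model reported in `model`.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as unrelated to everything
    return dot / (norm_a * norm_b)


def top_chunks(embeddings, query_vector, k=3):
    """Return the k (score, text) pairs most similar to query_vector."""
    scored = [
        (cosine_similarity(chunk["vector"], query_vector), chunk["text"])
        for chunk in embeddings["chunks"]
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Toy payload shaped like the fields table above (real vectors are much longer).
sample = {
    "model": "text-embedding-3-small",
    "dimensions": 3,
    "totalChunks": 3,
    "chunks": [
        {"text": "Revenue grew in Q3.", "vector": [1.0, 0.0, 0.0]},
        {"text": "Headcount was flat.", "vector": [0.0, 1.0, 0.0]},
        {"text": "Q3 revenue details.", "vector": [0.9, 0.1, 0.0]},
    ],
}

best = top_chunks(sample, [1.0, 0.0, 0.0], k=2)
```

A linear scan like this is fine for a single file; for larger collections a vector index is the usual choice.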
```python
# Assumes `client` and `file_id` are already defined.
embeddings = client.files.get_variant(file_id, "embeddings")
print(f"Model: {embeddings.get('model')}")
print(f"Chunks: {embeddings.get('totalChunks')}")
print(f"Dimensions: {embeddings.get('dimensions')}")
```

Metadata
Extraction metadata including page count, detected language, and quality signals.
Fields:
| Field | Type | Description |
|---|---|---|
| pageCount | number | Number of pages in the source file |
| language | string | Detected language (ISO 639-1) |
| extractionConfidence | number | Confidence score (0 to 1) |
| qualityWarnings | string[] | Any quality issues detected |
When it's produced: Alongside the markdown variant during extraction.
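
The filtering and quality-check use case can be sketched as a simple gate over these fields. The helper name `passes_quality_gate` and the 0.8 threshold are assumptions chosen for illustration; pick a threshold that suits your own data.

```python
def passes_quality_gate(metadata, min_confidence=0.8, allowed_languages=None):
    """Return True if a metadata variant looks safe for downstream use.

    min_confidence is an arbitrary example threshold, not a Roset default.
    """
    confidence = metadata.get("extractionConfidence")
    # A missing or low confidence score fails the gate.
    if confidence is None or confidence < min_confidence:
        return False
    # Any reported quality warning fails the gate.
    if metadata.get("qualityWarnings"):
        return False
    # Optionally restrict to a set of ISO 639-1 language codes.
    if allowed_languages is not None and metadata.get("language") not in allowed_languages:
        return False
    return True
```

For example, `passes_quality_gate(metadata, allowed_languages={"en", "de"})` would keep only confident, warning-free extractions in English or German.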
```python
# Assumes `client` and `file_id` are already defined.
metadata = client.files.get_variant(file_id, "metadata")
print(f"Language: {metadata.get('language')}")
print(f"Confidence: {metadata.get('extractionConfidence')}")
if metadata.get("qualityWarnings"):
    print(f"Warnings: {', '.join(metadata['qualityWarnings'])}")
```

Searchable Index
A full-text search index built from the extracted content. Powers text and hybrid search modes.
Fields:
| Field | Type | Description |
|---|---|---|
| indexedAt | string | When the index was last built |
| termCount | number | Number of unique terms indexed |
| segmentCount | number | Number of text segments |
When it's produced: After markdown extraction completes. The search API uses this variant internally.
```python
# Assumes `client` and `file_id` are already defined.
index = client.files.get_variant(file_id, "searchable-index")
print(f"Terms: {index.get('termCount')}")
print(f"Segments: {index.get('segmentCount')}")
print(f"Indexed at: {index.get('indexedAt')}")
```

List All Variants for a File
```python
# List every variant produced for the file, with its size and provider.
result = client.files.list_variants(file_id)
for v in result["variants"]:
    print(f" {v['type']}: {v['size_bytes']} bytes (provider: {v.get('provider', 'roset')})")
```

Selective Variant Generation
You can request only specific variant types when uploading:
```python
# Only generate markdown and embeddings
file = client.files.upload(
    filename="report.pdf",
    content_type="application/pdf",
    size_bytes=45678,
    variants=["markdown", "embeddings"],
)
```

Next Steps
- Transform Any File: the complete transformation workflow.
- Search: how variants power search.
- API Reference: full variant endpoint documentation.