
Variant Types

Variants are the structured outputs of Roset's transformation pipeline. Each variant is a different representation of the same source file, and all variants of a file are linked by lineage. When you upload a file, Roset automatically produces up to four variant types.

Overview

| Type | What it contains | Produced by | Use case |
| --- | --- | --- | --- |
| markdown | Extracted text in markdown format | Reducto, Gemini, Whisper | Display, downstream processing |
| embeddings | Vector embeddings for each chunk | OpenAI | Semantic search, RAG |
| metadata | Page count, language, confidence | Extraction provider | Filtering, quality checks |
| searchable-index | Full-text search index | Roset | Keyword search |

Markdown

The primary extraction output. Contains the full text content of the file converted to markdown format.

Fields:

| Field | Type | Description |
| --- | --- | --- |
| content | string | The extracted text in markdown |
| pageCount | number | Number of pages extracted |
| wordCount | number | Total word count |
| characterCount | number | Total character count |

When it's produced: For every file that has textual content. Documents (PDF, DOCX) are extracted via Reducto, images via Gemini (OCR), and audio via Whisper (transcription).

python
markdown = client.files.get_variant(file_id, "markdown")
print(f"Pages: {markdown.get('pageCount')}")
print(f"Words: {markdown.get('wordCount')}")
print(markdown["content"][:500])
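
The fields above can be folded into a small summary for display or logging. The following is a minimal sketch, assuming the variant comes back as a plain dict with the documented field names; `summarize_markdown_variant` is a hypothetical helper, not part of the Roset SDK, and its whitespace-based fallback word count may not match the `wordCount` that Roset reports.

```python
from typing import Any


def summarize_markdown_variant(variant: dict[str, Any], preview_chars: int = 500) -> dict[str, Any]:
    """Build a small summary from a markdown variant payload.

    Assumes the payload is a plain dict with the documented fields.
    Missing word and character counts are recomputed from `content`.
    """
    content = variant.get("content") or ""
    return {
        "pages": variant.get("pageCount"),  # stays None when the field is absent
        "words": variant.get("wordCount", len(content.split())),
        "characters": variant.get("characterCount", len(content)),
        "preview": content[:preview_chars],
    }


# Hypothetical payload shaped like the field table above.
sample = {"content": "# Report\n\nQuarterly revenue grew.", "pageCount": 1}
summary = summarize_markdown_variant(sample, preview_chars=8)
print(summary)
```

Using `.get()` with a computed fallback keeps the summary usable even when a provider omits one of the count fields.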

Embeddings

Vector embeddings generated from the extracted text, chunked for semantic search and RAG.

Fields:

| Field | Type | Description |
| --- | --- | --- |
| chunks | array | Array of chunk objects with text and vector |
| model | string | Embedding model used (e.g., text-embedding-3-small) |
| dimensions | number | Vector dimensions (e.g., 1536) |
| totalChunks | number | Number of chunks generated |

When it's produced: After markdown extraction completes, if an OpenAI key is available (managed by default).

python
embeddings = client.files.get_variant(file_id, "embeddings")
print(f"Model: {embeddings.get('model')}")
print(f"Chunks: {embeddings.get('totalChunks')}")
print(f"Dimensions: {embeddings.get('dimensions')}")
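
To illustrate how the chunk vectors support semantic search, the sketch below ranks chunks by cosine similarity to a query vector. It assumes each chunk object is a dict with `text` and `vector` keys (the exact key names are an assumption) and that the query vector was produced by the same embedding model. `cosine_similarity` and `top_chunks` are hypothetical helpers written for this example.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimensions")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_chunks(embeddings: dict, query_vector: list[float], k: int = 3) -> list[tuple[float, str]]:
    """Return the k chunks most similar to query_vector as (score, text), best first.

    Assumes each chunk is a dict with "text" and "vector" keys (hypothetical names).
    """
    scored = [
        (cosine_similarity(chunk["vector"], query_vector), chunk["text"])
        for chunk in embeddings.get("chunks", [])
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Toy 3-dimensional vectors; real vectors would have `dimensions` entries.
sample = {
    "chunks": [
        {"text": "pricing", "vector": [1.0, 0.0, 0.0]},
        {"text": "security", "vector": [0.0, 1.0, 0.0]},
        {"text": "billing", "vector": [0.8, 0.6, 0.0]},
    ]
}
results = top_chunks(sample, [1.0, 0.0, 0.0], k=2)
print([text for _, text in results])
```

A linear scan like this is fine for a single file's chunks; larger collections would normally go through a vector index or the search API instead.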

Metadata

Extraction metadata including page count, detected language, and quality signals.

Fields:

| Field | Type | Description |
| --- | --- | --- |
| pageCount | number | Number of pages in the source file |
| language | string | Detected language (ISO 639-1) |
| extractionConfidence | number | Confidence score (0 to 1) |
| qualityWarnings | string[] | Any quality issues detected |

When it's produced: Alongside the markdown variant during extraction.

python
metadata = client.files.get_variant(file_id, "metadata")
print(f"Language: {metadata.get('language')}")
print(f"Confidence: {metadata.get('extractionConfidence')}")
if metadata.get("qualityWarnings"):
    print(f"Warnings: {', '.join(metadata['qualityWarnings'])}")
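
The filtering and quality-check use case can be sketched as a simple gate over these fields. This is a hedged example rather than a recommended policy: `passes_quality_gate` is a hypothetical helper, and the default threshold of 0.8 is an arbitrary value chosen for illustration.

```python
from typing import Optional


def passes_quality_gate(
    metadata: dict,
    min_confidence: float = 0.8,  # arbitrary example threshold, not a Roset default
    allowed_languages: Optional[set[str]] = None,
) -> bool:
    """Return True when a metadata variant looks safe for downstream use."""
    confidence = metadata.get("extractionConfidence")
    # A missing confidence score is treated as a failure, not as a pass.
    if confidence is None or confidence < min_confidence:
        return False
    # Any reported quality warning rejects the file.
    if metadata.get("qualityWarnings"):
        return False
    # Optional language filter using ISO 639-1 codes such as "en".
    if allowed_languages is not None and metadata.get("language") not in allowed_languages:
        return False
    return True


# Hypothetical payloads shaped like the field table above.
good = {"language": "en", "extractionConfidence": 0.93, "qualityWarnings": []}
blurry = {"language": "en", "extractionConfidence": 0.41, "qualityWarnings": ["low resolution"]}
print(passes_quality_gate(good), passes_quality_gate(blurry))
```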

Searchable Index

A full-text search index built from the extracted content. Powers text and hybrid search modes.

Fields:

| Field | Type | Description |
| --- | --- | --- |
| indexedAt | string | When the index was last built |
| termCount | number | Number of unique terms indexed |
| segmentCount | number | Number of text segments |

When it's produced: After markdown extraction completes. Used internally by the search API.

python
index = client.files.get_variant(file_id, "searchable-index")
print(f"Terms: {index.get('termCount')}")
print(f"Segments: {index.get('segmentCount')}")
print(f"Indexed at: {index.get('indexedAt')}")
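
One possible use of `indexedAt` is checking whether the index predates a later change to the source. The sketch below assumes `indexedAt` is an ISO 8601 timestamp string, which the field table does not confirm; `parse_timestamp` and `index_is_stale` are hypothetical helpers.

```python
from datetime import datetime, timezone


def parse_timestamp(value: str) -> datetime:
    """Parse an ISO 8601 timestamp, treating a trailing 'Z' as UTC."""
    if value.endswith("Z"):
        # datetime.fromisoformat only accepts 'Z' from Python 3.11 onward.
        value = value[:-1] + "+00:00"
    parsed = datetime.fromisoformat(value)
    # Treat naive timestamps as UTC so comparisons never mix naive and aware values.
    return parsed if parsed.tzinfo else parsed.replace(tzinfo=timezone.utc)


def index_is_stale(index: dict, source_modified_at: str) -> bool:
    """Return True when the index is missing a build time or predates the source change."""
    indexed_at = index.get("indexedAt")
    if not indexed_at:
        return True
    return parse_timestamp(indexed_at) < parse_timestamp(source_modified_at)


# Hypothetical payload; the timestamp format is an assumption.
sample_index = {"indexedAt": "2024-05-01T12:00:00Z", "termCount": 1200, "segmentCount": 45}
print(index_is_stale(sample_index, "2024-05-02T08:30:00Z"))
```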

List All Variants for a File

python
result = client.files.list_variants(file_id)
for v in result["variants"]:
    print(f"  {v['type']}: {v['size_bytes']} bytes (provider: {v.get('provider', 'roset')})")
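
Because Roset produces up to four variant types, a listing can also be used to see which ones are absent for a given file. This is a small sketch that assumes the response shape used above (a `variants` list whose entries carry a `type` key); `missing_variant_types` is a hypothetical helper.

```python
# The four variant types described on this page.
EXPECTED_TYPES = ("markdown", "embeddings", "metadata", "searchable-index")


def missing_variant_types(result: dict) -> list[str]:
    """Return the expected variant types that do not appear in a list_variants result."""
    present = {variant["type"] for variant in result.get("variants", [])}
    return [variant_type for variant_type in EXPECTED_TYPES if variant_type not in present]


# Hypothetical response for a file that has only been partially processed.
sample_result = {
    "variants": [
        {"type": "markdown", "size_bytes": 20480, "provider": "reducto"},
        {"type": "metadata", "size_bytes": 312},
    ]
}
print(missing_variant_types(sample_result))
```

A missing type is not necessarily an error: the file may still be processing, or the upload may have requested only a subset of variants.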

Selective Variant Generation

You can request only specific variant types when uploading:

python
# Only generate markdown and embeddings
file = client.files.upload(
    filename="report.pdf",
    content_type="application/pdf",
    size_bytes=45678,
    variants=["markdown", "embeddings"],
)
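
A typo in the `variants` list is easy to make, so it can help to check the names locally before uploading. The following sketch validates a requested list against the four types documented on this page; `validate_variants` is a hypothetical helper, and how the API itself responds to an unknown type is not covered here.

```python
# The four variant types described on this page.
KNOWN_VARIANT_TYPES = frozenset({"markdown", "embeddings", "metadata", "searchable-index"})


def validate_variants(requested: list[str]) -> list[str]:
    """Reject unknown variant names and drop duplicates, preserving request order."""
    unknown = sorted(set(requested) - KNOWN_VARIANT_TYPES)
    if unknown:
        raise ValueError(f"unknown variant types: {', '.join(unknown)}")
    # dict.fromkeys keeps the first occurrence of each name in order.
    return list(dict.fromkeys(requested))


variants = validate_variants(["markdown", "embeddings", "markdown"])
print(variants)
```

The returned list can then be passed as the `variants` argument of the upload call shown above.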

Next Steps