
Build a Knowledge Base

Upload a collection of files, let Roset transform them into structured data, then search across all of them and ask questions that return answers with citations. This workflow builds a complete RAG (Retrieval-Augmented Generation) pipeline with no infrastructure to manage.

The Workflow

Upload Files --> Transform --> Search Index + Embeddings --> Search / Q&A
  1. Upload your files (contracts, reports, manuals -- any mix)
  2. Roset transforms each file into markdown + embeddings + searchable index
  3. Search across all files with text, vector, or hybrid search
  4. Ask questions and get answers with source citations

Step 1: Upload Your Files

python
import os
from roset import Client
 
client = Client(api_key=os.getenv("ROSET_API_KEY"))
 
# Upload a batch of documents
files = [
    {"filename": "employee-handbook.pdf", "content_type": "application/pdf", "size_bytes": 245000},
    {"filename": "benefits-guide.pdf", "content_type": "application/pdf", "size_bytes": 180000},
    {"filename": "it-policies.pdf", "content_type": "application/pdf", "size_bytes": 95000},
    {"filename": "onboarding-checklist.docx", "content_type": "application/vnd.openxmlformats-officedocument.wordprocessingml.document", "size_bytes": 42000},
]
 
batch = client.files.upload_batch(files)
print(f"Uploaded {len(batch['files'])} files")
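The metadata dicts above are written by hand. When the documents live on local disk, the same three fields can be derived from the files themselves. The following is a minimal sketch using only the Python standard library; `describe_files` is a hypothetical helper, not part of the Roset SDK, and the fallback content type for unknown extensions is an assumption.

```python
import mimetypes
import os


def describe_files(paths):
    """Build upload metadata dicts (filename, content_type, size_bytes) from local paths.

    Hypothetical helper for illustration; not part of the Roset SDK.
    """
    described = []
    for path in paths:
        # guess_type returns (type, encoding); type is None for unknown extensions
        content_type, _ = mimetypes.guess_type(path)
        described.append({
            "filename": os.path.basename(path),
            # Assumption: fall back to a generic binary type when the extension is unknown
            "content_type": content_type or "application/octet-stream",
            "size_bytes": os.path.getsize(path),
        })
    return described
```

The returned list has the same shape as `files` above, so it could be passed to `client.files.upload_batch(...)` in its place.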

Step 2: Wait for Transformation

Use webhooks in production (recommended), or poll until every file reaches a terminal status (completed or failed).

python
import time
 
# Poll until all files are done
file_ids = [f["id"] for f in batch["files"]]
while True:
    statuses = [client.files.get(fid)["status"] for fid in file_ids]
    done = all(s in ("completed", "failed") for s in statuses)
    print(f"Statuses: {statuses}")
    if done:
        break
    time.sleep(3)
 
completed = sum(1 for s in statuses if s == "completed")
print(f"{completed}/{len(file_ids)} files transformed successfully")
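The loop above runs forever if a file never reaches a terminal status, and it re-fetches files that have already finished. A bounded variant might look like the sketch below. `wait_for_files` and its `timeout` and `interval` parameters are hypothetical names chosen for illustration; the only assumption about Roset is that the status strings `completed` and `failed` shown above are terminal.

```python
import time

TERMINAL_STATUSES = ("completed", "failed")


def wait_for_files(get_status, file_ids, timeout=300.0, interval=3.0):
    """Poll until every file reaches a terminal status or the timeout expires.

    get_status is any callable that maps a file ID to its status string.
    Returns a dict of file ID -> last observed status.
    """
    deadline = time.monotonic() + timeout
    statuses = {}
    pending = list(file_ids)
    while pending:
        # Only re-check files that have not finished yet
        for fid in pending:
            statuses[fid] = get_status(fid)
        pending = [fid for fid in pending if statuses[fid] not in TERMINAL_STATUSES]
        if not pending:
            break
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Still processing after {timeout}s: {pending}")
        time.sleep(interval)
    return statuses
```

With the client from Step 1, this could be called as `wait_for_files(lambda fid: client.files.get(fid)["status"], file_ids)`. Passing the status lookup as a callable keeps the helper independent of the SDK and easy to exercise with a stub.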

Step 3: Search Your Knowledge Base

Once files are transformed, search across all of them using text, vector, or hybrid search.

python
# Hybrid search (default) -- combines text + vector for best results
results = client.search.query(query="vacation policy")
 
for r in results["results"]:
    print(f"{r['fileId']} (score: {r['score']})")
    if r.get("snippet"):
        print(f"  {r['snippet'][:200]}")
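If the same file can appear more than once in the results (for example, one hit per matching passage, which this page does not confirm), it can help to collapse the list to the best hit per file before displaying it. The sketch below relies only on the `fileId` and `score` fields used above; `best_hit_per_file` is a hypothetical helper.

```python
def best_hit_per_file(results):
    """Collapse a list of search hits to the highest-scoring hit for each fileId."""
    best = {}
    for hit in results:
        current = best.get(hit["fileId"])
        # Keep the first hit seen for a file, or replace it with a strictly better one
        if current is None or hit["score"] > current["score"]:
            best[hit["fileId"]] = hit
    # Return hits ordered from highest to lowest score
    return sorted(best.values(), key=lambda h: h["score"], reverse=True)
```

It would be applied to the list inside the response, as in `best_hit_per_file(results["results"])`.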

Step 4: Ask Questions with Citations

Use Q&A to ask natural language questions. Roset finds relevant documents via vector search, then generates an answer with source citations.

python
# Ask a question -- Roset searches, retrieves context, and generates an answer
result = client.qa.ask(question="How many vacation days do new employees get?")
 
print(result["answer"])
print()
for source in result["sources"]:
    print(f"  Source: {source['filename']} (score: {source['score']})")

Note

Q&A uses OpenAI for both embedding and answer generation. Roset provides a managed key by default -- no configuration needed.
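To show a Q&A result to end users, the answer and its sources can be rendered as a single block of text with numbered citations. This is a minimal sketch based on the `answer`, `sources`, `filename`, and `score` fields printed above; `format_answer` and its `min_score` cutoff are hypothetical, and it assumes `score` is numeric.

```python
def format_answer(result, min_score=0.0):
    """Render a Q&A result as text followed by a numbered source list."""
    lines = [result["answer"], "", "Sources:"]
    # Assumption: score is a number; drop sources below the caller's cutoff
    sources = [s for s in result.get("sources", []) if s["score"] >= min_score]
    if not sources:
        lines.append("  (none)")
    for i, source in enumerate(sources, start=1):
        lines.append(f"  [{i}] {source['filename']} (score: {source['score']:.2f})")
    return "\n".join(lines)
```

For example, `print(format_answer(result, min_score=0.5))` would print the answer and list only the sources scoring at least 0.5.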

Putting It Together

Here's the complete end-to-end flow in a single script:

python
import os
import time
from roset import Client
 
client = Client(api_key=os.getenv("ROSET_API_KEY"))
 
# Upload
file = client.files.upload(
    filename="company-handbook.pdf",
    content_type="application/pdf",
    size_bytes=245000,
)
 
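# Wait until the file reaches a terminal status
status = client.files.get(file["id"])["status"]
while status not in ("completed", "failed"):
    time.sleep(2)
    status = client.files.get(file["id"])["status"]
if status == "failed":
    # Nothing was indexed, so searching would be misleading
    raise SystemExit("Transformation failed")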
 
# Search
results = client.search.query(query="remote work policy")
print(f"Found {results['total']} results")
 
# Q&A
answer = client.qa.ask(question="What is the remote work policy?")
print(answer["answer"])

Build this with AI
Roset (roset.dev) is a developer API that transforms unstructured files into structured data (markdown, embeddings, searchable index). Python SDK: `pip install roset`. Init: `from roset import Client; client = Client(api_key='rsk_...')`. Upload: `client.files.upload(filename, content_type, size_bytes)`. Search: `client.search.query(query='...', mode='hybrid')` returns ranked results with snippets. Q&A: `client.qa.ask(question='...')` returns an answer with source citations. Write me a complete Python script that: 1) uploads 3 PDF files, 2) waits for all to finish processing, 3) runs a hybrid search query, 4) asks a Q&A question and prints the answer with citations.

Next Steps

  • Search -- deep dive on text, vector, and hybrid search modes.
  • Q&A -- streaming responses, parameters, and configuration.
  • Multi-Tenant Spaces -- scope knowledge bases per customer.
  • Webhooks -- replace polling with real-time notifications.