Build a Knowledge Base
Upload a collection of files, let Roset transform them into structured data, then search across all of them and ask questions that return answers with source citations. This workflow builds a complete RAG (Retrieval-Augmented Generation) pipeline with no infrastructure to manage.
The Workflow
Upload Files --> Transform --> Search Index + Embeddings --> Search / Q&A
- Upload your files (contracts, reports, manuals -- any mix)
- Roset transforms each file into markdown + embeddings + searchable index
- Search across all files with text, vector, or hybrid search
- Ask questions and get answers with source citations
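The upload calls on this page describe each file with `filename`, `content_type`, and `size_bytes`. As a minimal sketch, that metadata could be derived from a local folder instead of typed by hand. The helper name `describe_files` and its `extensions` filter are hypothetical and not part of the Roset SDK; the MIME type comes from Python's standard `mimetypes` module, with a generic fallback when the type cannot be guessed.

```python
import mimetypes
from pathlib import Path


def describe_files(folder, extensions=(".pdf", ".docx")):
    """Return upload metadata dicts for matching files in a local folder.

    Hypothetical helper: the dict keys mirror the metadata used by the
    upload calls on this page; nothing here talks to the Roset API.
    """
    entries = []
    for path in sorted(Path(folder).iterdir()):
        # Skip subdirectories and files with other extensions
        if not path.is_file() or path.suffix.lower() not in extensions:
            continue
        guessed, _ = mimetypes.guess_type(path.name)
        entries.append({
            "filename": path.name,
            # Fall back to a generic type when the extension is unknown
            "content_type": guessed or "application/octet-stream",
            "size_bytes": path.stat().st_size,
        })
    return entries
```

The returned list has the same shape as the `files` list passed to `client.files.upload_batch` in Step 1.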
Step 1: Upload Your Files
python
import os
from roset import Client
client = Client(api_key=os.getenv("ROSET_API_KEY"))
# Upload a batch of documents
files = [
    {"filename": "employee-handbook.pdf", "content_type": "application/pdf", "size_bytes": 245000},
    {"filename": "benefits-guide.pdf", "content_type": "application/pdf", "size_bytes": 180000},
    {"filename": "it-policies.pdf", "content_type": "application/pdf", "size_bytes": 95000},
    {"filename": "onboarding-checklist.docx", "content_type": "application/vnd.openxmlformats-officedocument.wordprocessingml.document", "size_bytes": 42000},
]
batch = client.files.upload_batch(files)
print(f"Uploaded {len(batch['files'])} files")
Step 2: Wait for Transformation
In production, use webhooks (recommended). Otherwise, poll until every file reaches a terminal status (`completed` or `failed`).
python
import time
# Poll until all files are done
file_ids = [f["id"] for f in batch["files"]]
while True:
    statuses = [client.files.get(fid)["status"] for fid in file_ids]
    done = all(s in ("completed", "failed") for s in statuses)
    print(f"Statuses: {statuses}")
    if done:
        break
    time.sleep(3)
completed = sum(1 for s in statuses if s == "completed")
print(f"{completed}/{len(file_ids)} files transformed successfully")
Step 3: Search Your Knowledge Base
Once files are transformed, search across all of them using text, vector, or hybrid search.
python
# Hybrid search (default) -- combines text + vector for best results
results = client.search.query(query="vacation policy")
for r in results["results"]:
    print(f"{r['fileId']} (score: {r['score']})")
    if r.get("snippet"):
        print(f"  {r['snippet'][:200]}")
Step 4: Ask Questions with Citations
Use Q&A to ask natural-language questions. Roset finds relevant documents via vector search, then generates an answer with source citations.
python
# Ask a question -- Roset searches, retrieves context, and generates an answer
result = client.qa.ask(question="How many vacation days do new employees get?")
print(result["answer"])
print()
for source in result["sources"]:
    print(f"  Source: {source['filename']} (score: {source['score']})")
Note
Q&A uses OpenAI for both embedding and answer generation. Roset provides a managed key by default -- no configuration needed.
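The `result` returned by `client.qa.ask` in Step 4 carries an `answer` string and a list of `sources`, each with a `filename` and a `score`. A minimal sketch of turning that into display text with numbered citations follows. The function name `format_answer` and the `min_score` threshold are hypothetical, and the sketch assumes scores are numeric with higher meaning more relevant, which this page does not state explicitly.

```python
def format_answer(result, min_score=0.0):
    """Render a Q&A result as text with numbered source citations.

    Assumes the result shape shown in Step 4: {"answer": str,
    "sources": [{"filename": str, "score": number}, ...]}.
    """
    # Keep sources at or above the (hypothetical) threshold, best first
    sources = [s for s in result.get("sources", []) if s["score"] >= min_score]
    sources.sort(key=lambda s: s["score"], reverse=True)

    lines = [result["answer"], ""]
    if sources:
        lines.append("Sources:")
        for i, s in enumerate(sources, start=1):
            lines.append(f"  [{i}] {s['filename']} (score: {s['score']:.2f})")
    else:
        lines.append("No sources met the score threshold.")
    return "\n".join(lines)
```

Calling `print(format_answer(result, min_score=0.5))` would then show the answer followed by only the sources scoring at least 0.5. Showing an explicit message when no source qualifies makes it visible that an answer lacks supporting citations.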
Putting It Together
Here's the end-to-end flow for a single file in one script:
python
import os
import time
from roset import Client
client = Client(api_key=os.getenv("ROSET_API_KEY"))
# Upload
file = client.files.upload(
    filename="company-handbook.pdf",
    content_type="application/pdf",
    size_bytes=245000,
)
# Wait
while client.files.get(file["id"])["status"] not in ("completed", "failed"):
    time.sleep(2)
# Search
results = client.search.query(query="remote work policy")
print(f"Found {results['total']} results")
# Q&A
answer = client.qa.ask(question="What is the remote work policy?")
print(answer["answer"])
Build this with AI
Roset (roset.dev) is a developer API that transforms unstructured files into structured data (markdown, embeddings, searchable index). Python SDK: `pip install roset`. Init: `from roset import Client; client = Client(api_key='rsk_...')`. Upload: `client.files.upload(filename, content_type, size_bytes)`. Search: `client.search.query(query='...', mode='hybrid')` returns ranked results with snippets. Q&A: `client.qa.ask(question='...')` returns an answer with source citations. Write me a complete Python script that: 1) uploads 3 PDF files, 2) waits for all to finish processing, 3) runs a hybrid search query, 4) asks a Q&A question and prints the answer with citations.
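For comparison with whatever the prompt produces, here is a minimal sketch of the kind of script it asks for. It uses only the calls shown on this page (`client.files.upload`, `client.files.get`, `client.search.query`, `client.qa.ask`) and assumes the response shapes used in the earlier steps. The helper names `wait_for_files` and `build_knowledge_base`, the timeout and interval defaults, and the choice to raise `TimeoutError` are illustrative and not part of the Roset SDK.

```python
import time


def wait_for_files(client, file_ids, timeout_s=300.0, interval_s=3.0):
    """Poll until every file is completed or failed; raise after timeout_s."""
    deadline = time.monotonic() + timeout_s
    while True:
        statuses = {fid: client.files.get(fid)["status"] for fid in file_ids}
        if all(s in ("completed", "failed") for s in statuses.values()):
            return statuses
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Files still processing after {timeout_s}s: {statuses}")
        time.sleep(interval_s)


def build_knowledge_base(client, files, search_query, question,
                         timeout_s=300.0, interval_s=3.0):
    """Upload files, wait for processing, run a hybrid search, then ask a question."""
    # 1) Upload each file using the metadata-based call shown on this page
    uploaded = [
        client.files.upload(
            filename=f["filename"],
            content_type=f["content_type"],
            size_bytes=f["size_bytes"],
        )
        for f in files
    ]

    # 2) Wait with an upper bound, then report any failures
    statuses = wait_for_files(
        client, [u["id"] for u in uploaded], timeout_s=timeout_s, interval_s=interval_s
    )
    failed = [fid for fid, s in statuses.items() if s == "failed"]
    if failed:
        print(f"Warning: {len(failed)} file(s) failed to transform: {failed}")

    # 3) Hybrid search
    results = client.search.query(query=search_query, mode="hybrid")
    for r in results["results"]:
        print(f"{r['fileId']} (score: {r['score']})")

    # 4) Q&A with citations
    answer = client.qa.ask(question=question)
    print(answer["answer"])
    for source in answer["sources"]:
        print(f"  Source: {source['filename']} (score: {source['score']})")
    return results, answer
```

To run it, create a client as in Step 1 and call `build_knowledge_base(client, files, "vacation policy", "How many vacation days do new employees get?")`, where `files` is a list of three metadata dicts like those in Step 1. Bounding the wait avoids a loop that never exits if a file stays in a non-terminal status.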
Next Steps
- Search -- deep dive on text, vector, and hybrid search modes.
- Q&A -- streaming responses, parameters, and configuration.
- Multi-Tenant Spaces -- scope knowledge bases per customer.
- Webhooks -- replace polling with real-time notifications.