How Documents Are Processed
After upload, Thallus processes your document in the background — extracting content, splitting it into searchable chunks, generating vector embeddings, and creating a structured synopsis. No action is needed from you; documents become searchable automatically once processing completes.
Processing pipeline
- Upload — File saved and validation passed (see Uploading Documents)
- Extract content — Text and structure pulled from the file based on its type
- Chunk — Content split into overlapping passages optimized for search
- Embed — Each chunk converted to a vector embedding for semantic search
- Synopsis — An LLM generates structured metadata about the document
- Searchable — Document appears in search results across all active conversations
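The stages above run strictly in order, each building on the previous one's output. As a rough illustration only, the sketch below models that ordering with toy stand-in handlers. The stage names, the `run_pipeline` function, and the dictionary layout are all assumptions made for this example and do not reflect Thallus internals.

```python
from typing import Callable

# Hypothetical stage names mirroring the list above.
STAGES = ["upload", "extract", "chunk", "embed", "synopsis", "searchable"]


def run_pipeline(doc: dict, handlers: dict[str, Callable[[dict], None]]) -> dict:
    """Run every stage in order and record which stages finished."""
    doc["done"] = []
    for stage in STAGES:
        handlers[stage](doc)  # each handler mutates the document record
        doc["done"].append(stage)
    return doc


# Toy handlers: real extraction, embedding, and synopsis steps are far richer.
handlers = {
    "upload": lambda d: None,  # file is assumed to be saved and validated already
    "extract": lambda d: d.update(text=d["raw"].strip()),
    "chunk": lambda d: d.update(chunks=d["text"].split("\n\n")),
    "embed": lambda d: d.update(vectors=[[float(len(c))] for c in d["chunks"]]),
    "synopsis": lambda d: d.update(synopsis={"summary": d["chunks"][0]}),
    "searchable": lambda d: d.update(searchable=True),
}

doc = run_pipeline({"raw": " First paragraph.\n\nSecond paragraph. "}, handlers)
print(doc["done"])
```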
Content extraction by file type
PDF
Pages are extracted individually. Tables within pages are detected and rendered as markdown. Long pages are split at paragraph boundaries to keep chunks at a searchable size.
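To make "rendered as markdown" concrete, here is a minimal sketch of turning a detected table into a markdown table. It assumes the table has already been extracted as a list of rows with the header first. The `table_to_markdown` helper is hypothetical and only illustrates the output format.

```python
def table_to_markdown(rows: list[list[str]]) -> str:
    """Render rows (header first) as a markdown pipe table."""

    def fmt(cells: list[str]) -> str:
        # Escape literal pipes so they do not break the column layout.
        return "| " + " | ".join(c.replace("|", "\\|") for c in cells) + " |"

    header, *body = rows
    separator = "|" + "|".join("---" for _ in header) + "|"
    return "\n".join([fmt(header), separator, *(fmt(r) for r in body)])


print(table_to_markdown([["Status", "Meaning"], ["Pending", "Queued"]]))
```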
Word documents
Structure-aware extraction splits at heading boundaries with a target chunk size optimized for search. Tables are converted to markdown and included in the surrounding chunk. This preserves the document's logical structure rather than splitting at arbitrary character counts.
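The idea of splitting at heading boundaries can be sketched as follows. This example assumes the document has already been reduced to a list of `(kind, text)` blocks, where `kind` is `"heading"` or `"body"`. That representation, the `split_at_headings` name, and the default `target_chars` value are assumptions for illustration, not Thallus's actual settings.

```python
def split_at_headings(
    blocks: list[tuple[str, str]], target_chars: int = 1000
) -> list[str]:
    """Start a new chunk at each heading, or when the chunk would exceed the target."""
    chunks: list[str] = []
    current: list[str] = []
    size = 0
    for kind, text in blocks:
        starts_section = kind == "heading"
        too_big = size + len(text) > target_chars
        if current and (starts_section or too_big):
            chunks.append("\n".join(current))
            current, size = [], 0
        current.append(text)
        size += len(text)
    if current:
        chunks.append("\n".join(current))
    return chunks


blocks = [
    ("heading", "Intro"),
    ("body", "Hello."),
    ("heading", "Details"),
    ("body", "More."),
]
print(split_at_headings(blocks))
```

Because a new chunk starts at every heading, each chunk stays within one section of the document, which is the property the paragraph above describes.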
Text and Markdown
Plain text files are split at paragraph boundaries with an optimized target size and overlap between consecutive chunks. Markdown files preserve heading structure, using headers as natural split points.
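A minimal sketch of paragraph-boundary chunking with overlap is shown below. It assumes paragraphs are separated by blank lines and that overlap is measured in whole paragraphs carried into the next chunk. The `chunk_paragraphs` function and its default values are illustrative assumptions. The actual target size and overlap Thallus uses are not documented here.

```python
def chunk_paragraphs(
    text: str, target_chars: int = 800, overlap: int = 1
) -> list[str]:
    """Group paragraphs into chunks, repeating the last `overlap` paragraphs."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: list[str] = []
    current: list[str] = []
    for para in paragraphs:
        candidate_len = sum(len(p) for p in current) + len(para)
        if current and candidate_len > target_chars:
            chunks.append("\n\n".join(current))
            # Carry the tail of the finished chunk into the next one.
            current = current[-overlap:] if overlap else []
        current.append(para)
    if current:
        chunks.append("\n\n".join(current))
    return chunks


sample = "A" * 10 + "\n\n" + "B" * 10 + "\n\n" + "C" * 10
for chunk in chunk_paragraphs(sample, target_chars=20, overlap=1):
    print(repr(chunk))
```

The repeated paragraph means a sentence near a chunk boundary appears in both neighbors, so a search that matches it can still return useful surrounding context.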
Spreadsheets (CSV, XLSX, XLS)
Thallus analyzes each table's shape, columns, and content, then automatically chooses a chunking strategy suited to that structure.
All spreadsheets also get automatic schema, statistics, and sample data chunks so agents understand the structure of your data even before searching specific rows.
Embedding
Each chunk is converted to a high-dimensional vector that captures its semantic meaning. Embeddings are processed in batches for efficiency.
The embedding model can be configured per organization if you're using Bring Your Own Key.
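Batching means many chunks are sent to the embedding model per request rather than one at a time. The sketch below shows the general pattern. The `embed_all` function, the `batch_size` default, and the `fake_embed` stand-in are assumptions for illustration. In practice `embed_batch` would call whichever embedding model your organization has configured.

```python
from typing import Callable, Iterator


def batched(items: list[str], size: int) -> Iterator[list[str]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def embed_all(
    chunks: list[str],
    embed_batch: Callable[[list[str]], list[list[float]]],
    batch_size: int = 64,
) -> list[list[float]]:
    """Embed every chunk, one model call per batch, preserving order."""
    vectors: list[list[float]] = []
    for batch in batched(chunks, batch_size):
        vectors.extend(embed_batch(batch))
    return vectors


def fake_embed(batch: list[str]) -> list[list[float]]:
    # Stand-in for a real model: one tiny vector per chunk.
    return [[float(len(text))] for text in batch]


print(len(embed_all(["alpha", "beta", "gamma"], fake_embed, batch_size=2)))
```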
Synopsis generation
After chunking, an LLM reads the document and produces a structured synopsis that powers fast document discovery. The synopsis records the document's type, a summary, its topics, and the entities it mentions. This metadata is also indexed for keyword search alongside vector similarity.
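One way to picture the synopsis is as a small record with the four kinds of metadata named above, plus a way to derive keyword terms from it. The `Synopsis` class, its field names, and the tokenization in `keyword_terms` are hypothetical and shown only to illustrate the shape of the data. The actual schema may differ.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Synopsis:
    doc_type: str
    summary: str
    topics: list[str] = field(default_factory=list)
    entities: list[str] = field(default_factory=list)

    def keyword_terms(self) -> set[str]:
        """Lowercased word tokens from all fields, suitable for a keyword index."""
        text = " ".join([self.doc_type, self.summary, *self.topics, *self.entities])
        return set(re.findall(r"[a-z0-9]+", text.lower()))


synopsis = Synopsis(
    doc_type="invoice",
    summary="Q3 invoice for Acme",
    topics=["billing"],
    entities=["Acme Corp"],
)
print(sorted(synopsis.keyword_terms()))
```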
Processing status
You can track each document's progress from the Documents page:
| Status | Meaning |
|---|---|
| Pending | Queued, waiting to start |
| Processing | Extracting, chunking, embedding |
| Completed | Fully processed, searchable |
| Completed (partial) | Truncated due to plan limits |
| Failed | Error occurred (retried up to 3 times) |
If processing fails, Thallus retries automatically up to 3 times. Persistent failures typically indicate a corrupted or password-protected file.
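The retry behavior can be sketched as a simple loop. This example assumes "retries up to 3 times" means one initial attempt followed by at most three retries. That interpretation, the `process_with_retries` function, and the returned status strings are assumptions for illustration.

```python
from typing import Callable

MAX_RETRIES = 3  # matches the retry count described above


def process_with_retries(
    process: Callable[[], None], max_retries: int = MAX_RETRIES
) -> str:
    """Return "Completed" on the first success, or "Failed" once retries run out."""
    for _attempt in range(1 + max_retries):  # initial attempt plus retries
        try:
            process()
            return "Completed"
        except Exception:
            continue  # a real system would likely log and back off here
    return "Failed"


def corrupted_file() -> None:
    raise ValueError("cannot read file")


print(process_with_retries(corrupted_file))
```

A file that is corrupted or password-protected fails the same way on every attempt, which is why such documents end in the Failed status regardless of retries.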
Cache refresh
When processing completes, conversation caches refresh automatically. New documents become searchable in all active conversations — there's no need to start a new conversation to access recently processed files.
Related pages
- Uploading Documents — File types, size limits, and upload flow
- Document Search Pipeline — How agents find and rank content from processed documents
- Document Collections — Organizing documents and controlling research scope