How Documents Are Processed

After upload, Thallus processes your document in the background: it extracts content, splits it into searchable chunks, generates vector embeddings, and creates a structured synopsis. No action is needed from you; documents become searchable automatically once processing completes.

Processing pipeline

Upload → Extract content → Chunk → Embed → Synopsis → Searchable

Upload — File saved and validation passed (see Uploading Documents)
Extract content — Text and structure pulled from the file based on its type
Chunk — Content split into overlapping passages optimized for search
Embed — Each chunk converted to a vector embedding for semantic search
Synopsis — An LLM generates structured metadata about the document
Searchable — Document appears in search results across all active conversations

Content extraction by file type

PDF

Pages are extracted individually. Tables within pages are detected and rendered as markdown. Long pages are split at paragraph boundaries to keep chunks at a searchable size.

Word documents

Structure-aware extraction splits at heading boundaries with a target chunk size optimized for search. Tables are converted to markdown and included in the surrounding chunk. This preserves the document's logical structure rather than splitting at arbitrary character counts.

Text and Markdown

Plain text files are split at paragraph boundaries with an optimized target size and overlap between consecutive chunks. Markdown files preserve heading structure, using headers as natural split points.
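As a rough illustration of paragraph-based chunking with overlap and heading-based splitting, here is a minimal Python sketch. It is not Thallus's implementation: the function names, the character-based target size, and the one-paragraph overlap are all assumptions chosen for the example.

```python
import re


def chunk_paragraphs(text: str, target_chars: int = 250,
                     overlap_paragraphs: int = 1) -> list[str]:
    """Group paragraphs into chunks of roughly target_chars characters.

    Hypothetical helper: the trailing paragraph(s) of each chunk are repeated
    at the start of the next one so consecutive chunks overlap.
    """
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: list[str] = []
    current: list[str] = []
    for para in paragraphs:
        size_if_added = sum(len(p) for p in current) + len(para)
        if current and size_if_added > target_chars:
            chunks.append("\n\n".join(current))
            # Carry the last paragraph(s) forward as overlap.
            current = current[-overlap_paragraphs:] if overlap_paragraphs else []
        current.append(para)
    if current:
        chunks.append("\n\n".join(current))
    return chunks


def split_markdown_by_heading(markdown: str) -> list[str]:
    """Split Markdown so each section starts at a heading line (# to ######)."""
    sections = re.split(r"(?m)^(?=#{1,6} )", markdown)
    return [s.strip() for s in sections if s.strip()]
```

With three 100-character paragraphs and a 250-character target, the first chunk holds paragraphs one and two and the second holds paragraphs two and three, so the middle paragraph is searchable from either side.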

Spreadsheets (CSV, XLSX, XLS)

Spreadsheets are split into searchable chunks using a strategy chosen automatically: the system analyzes your table's shape, columns, and content and picks the approach that best fits your data.

All spreadsheets also get automatic schema, statistics, and sample data chunks so agents understand the structure of your data even before searching specific rows.
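To make the idea of schema, statistics, and sample chunks concrete, the sketch below builds all three from a CSV string using Python's standard `csv` module. The function name, the chunk wording, and the choice of row count plus min/max for numeric columns are assumptions for illustration; the actual chunks Thallus produces may look different.

```python
import csv
import io


def _as_number(value):
    """Return value as a float, or None if it is not numeric."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return None


def summary_chunks(csv_text: str, sample_rows: int = 2) -> dict[str, str]:
    """Build hypothetical schema, statistics, and sample chunks for a CSV."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = list(reader)
    columns = list(reader.fieldnames or [])

    schema = "Columns: " + ", ".join(columns)

    stats_lines = [f"Rows: {len(rows)}"]
    for col in columns:
        numbers = [n for n in (_as_number(r[col]) for r in rows) if n is not None]
        # Only report min/max when every value in the column is numeric.
        if rows and len(numbers) == len(rows):
            stats_lines.append(f"{col}: min={min(numbers)}, max={max(numbers)}")

    sample = "\n".join(
        ", ".join(f"{c}={r[c]}" for c in columns) for r in rows[:sample_rows]
    )
    return {"schema": schema, "statistics": "\n".join(stats_lines), "sample": sample}
```

Chunks like these let a search for "which columns does this file have" succeed without matching any individual row.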

Embedding

Each chunk is converted to a high-dimensional vector that captures its semantic meaning. Embeddings are processed in batches for efficiency.
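Batching can be pictured as in the sketch below. The batch size of 32, the helper names, and `fake_embed_batch` are assumptions: the stand-in returns toy two-number vectors and is not a real embedding model.

```python
from collections.abc import Callable, Iterator, Sequence


def batched(items: Sequence[str], size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of items with at most size elements each."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def embed_in_batches(
    chunks: Sequence[str],
    embed_batch: Callable[[Sequence[str]], list[list[float]]],
    batch_size: int = 32,
) -> list[list[float]]:
    """Embed chunks one batch at a time, keeping vectors in chunk order."""
    vectors: list[list[float]] = []
    for batch in batched(chunks, batch_size):
        vectors.extend(embed_batch(batch))
    return vectors


def fake_embed_batch(texts: Sequence[str]) -> list[list[float]]:
    """Toy stand-in for a model call: [character count, space count]."""
    return [[float(len(t)), float(t.count(" "))] for t in texts]
```

Sending several chunks per call reduces the number of round trips to the embedding model, which is the efficiency gain batching is meant to provide.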

The embedding model can be configured per organization if you're using Bring Your Own Key.

Synopsis generation

After chunking, an LLM reads the document and produces structured metadata that powers document discovery:

Generated Synopsis

Document type: report
Entities: Acme Corp, Q4 2025, West Region
Topics: sales performance, quota attainment
Scope: Sales department

The synopsis captures key metadata about your document — its type, summary, topics, and entities mentioned — to power fast discovery. This metadata is also indexed for keyword search alongside vector similarity.
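One way to picture synopsis metadata and its keyword index is the sketch below, populated with the example values shown above. The `Synopsis` class, its field names, and the all-words matching rule are hypothetical; they illustrate the idea rather than Thallus's actual schema or ranking.

```python
from dataclasses import dataclass, field


@dataclass
class Synopsis:
    """Hypothetical container for the metadata an LLM might return."""
    document_type: str
    entities: list[str] = field(default_factory=list)
    topics: list[str] = field(default_factory=list)
    scope: str = ""

    def keywords(self) -> set[str]:
        """Lowercased words from every field, usable as a keyword index."""
        terms = [self.document_type, self.scope, *self.entities, *self.topics]
        return {word.lower() for term in terms for word in term.split()}


def matches(synopsis: Synopsis, query: str) -> bool:
    """True when every word in the query appears in the synopsis keywords."""
    keywords = synopsis.keywords()
    return all(word.lower() in keywords for word in query.split())


example = Synopsis(
    document_type="report",
    entities=["Acme Corp", "Q4 2025", "West Region"],
    topics=["sales performance", "quota attainment"],
    scope="Sales department",
)
```

Here `matches(example, "acme quota")` is true because both words appear in the metadata, while `matches(example, "marketing")` is false.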

Processing status

You can track each document's progress from the Documents page:

Status	Meaning
Pending	Queued, waiting to start
Processing	Extracting, chunking, embedding
Completed	Fully processed, searchable
Completed (partial)	Truncated due to plan limits
Failed	Error occurred (retried up to 3 times)

For example:

quarterly-report.pdf: Pending
employee-handbook.docx: Processing
sales-data.xlsx: Completed
corrupted-file.pdf: Failed

If processing fails, Thallus retries automatically up to 3 times. Persistent failures typically indicate a corrupted or password-protected file.
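The retry behavior can be sketched as a small loop. This is an illustration only: it assumes "up to 3 times" means one initial attempt followed by at most three retries, and the `Status` enum and `process_with_retries` function are hypothetical names.

```python
from collections.abc import Callable
from enum import Enum


class Status(Enum):
    PENDING = "Pending"
    PROCESSING = "Processing"
    COMPLETED = "Completed"
    FAILED = "Failed"


def process_with_retries(process: Callable[[], None],
                         max_retries: int = 3) -> tuple[Status, int]:
    """Run process(), retrying on any exception up to max_retries times.

    Assumption: one initial attempt plus at most max_retries retries.
    Returns the final status and the number of attempts made.
    """
    attempts = 0
    while True:
        attempts += 1
        try:
            process()
            return Status.COMPLETED, attempts
        except Exception:
            if attempts > max_retries:
                return Status.FAILED, attempts
```

A file that fails on every attempt, such as a password-protected PDF, ends as Failed after four attempts under this assumption, while a transient error that clears on a later attempt still ends as Completed.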

Cache refresh

When processing completes, conversation caches refresh automatically. New documents become searchable in all active conversations — there's no need to start a new conversation to access recently processed files.

Uploading Documents — File types, size limits, and upload flow
Document Search Pipeline — How agents find and rank content from processed documents
Document Collections — Organizing documents and controlling research scope