How Documents Are Processed

After upload, Thallus processes your document in the background — extracting content, splitting it into searchable chunks, generating vector embeddings, and creating a structured synopsis. No action is needed from you; documents become searchable automatically once processing completes.

Processing pipeline

  1. Upload — File saved and validation passed (see Uploading Documents)
  2. Extract content — Text and structure pulled from the file based on its type
  3. Chunk — Content split into overlapping passages optimized for search
  4. Embed — Each chunk converted to a vector embedding for semantic search
  5. Synopsis — An LLM generates structured metadata about the document
  6. Searchable — Document appears in search results across all active conversations
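The stage order above can be pictured as a simple sequential runner. This is only an illustrative sketch: the stage names, the `run_pipeline` function, and the dictionary-based document are hypothetical and are not Thallus identifiers or its actual implementation.

```python
from typing import Callable

# Hypothetical stage names mirroring the list above (not Thallus identifiers).
STAGES = ["upload", "extract", "chunk", "embed", "synopsis", "searchable"]

Handler = Callable[[dict], dict]


def run_pipeline(doc: dict, handlers: dict[str, Handler]) -> dict:
    """Run each stage in order, recording the last stage that finished."""
    for stage in STAGES:
        handler = handlers.get(stage)
        if handler is not None:  # stages without a handler are passed through
            doc = handler(doc)
        doc["stage"] = stage
    return doc


# Usage: only the chunk stage does real work in this toy example.
handlers = {"chunk": lambda d: {**d, "chunks": d["text"].split()}}
result = run_pipeline({"text": "a b"}, handlers)
```

Each stage receives the output of the previous one, which is why a document only becomes searchable after every earlier step has finished.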

Content extraction by file type

PDF

Pages are extracted individually. Tables within pages are detected and rendered as markdown. Long pages are split at paragraph boundaries to keep chunks at a searchable size.
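To illustrate what "rendered as markdown" means for a detected table, here is a minimal sketch. It assumes the table has already been extracted as a list of rows with the header first; the function name `table_to_markdown` is hypothetical, and the real extractor may format tables differently.

```python
def table_to_markdown(rows: list[list[str]]) -> str:
    """Render rows (first row is the header) as a pipe-delimited markdown table."""
    if not rows:
        return ""
    header, *body = rows

    def fmt(cells: list[str]) -> str:
        # Escape pipes so cell text cannot break the column layout.
        return "| " + " | ".join(str(c).replace("|", "\\|") for c in cells) + " |"

    lines = [fmt(header), "| " + " | ".join("---" for _ in header) + " |"]
    lines += [fmt(row) for row in body]
    return "\n".join(lines)


markdown = table_to_markdown([["Region", "Sales"], ["West", "120"]])
```

Keeping the table as markdown text means it can be chunked and embedded like any other passage while retaining its row and column structure.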

Word documents

Structure-aware extraction splits at heading boundaries with a target chunk size optimized for search. Tables are converted to markdown and included in the surrounding chunk. This preserves the document's logical structure rather than splitting at arbitrary character counts.
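Splitting at heading boundaries could look roughly like the sketch below. It assumes paragraphs have already been extracted as `(is_heading, text)` pairs, which is a hypothetical intermediate format, and it ignores the target chunk size for brevity.

```python
def split_at_headings(paragraphs: list[tuple[bool, str]]) -> list[str]:
    """Group paragraphs into sections, starting a new section at each heading."""
    sections: list[list[str]] = []
    for is_heading, text in paragraphs:
        # A heading opens a new section; so does the very first paragraph.
        if is_heading or not sections:
            sections.append([])
        sections[-1].append(text)
    return ["\n".join(section) for section in sections]


sections = split_at_headings([
    (True, "Intro"),
    (False, "Hello"),
    (True, "Usage"),
    (False, "Run it"),
    (False, "Done"),
])
```

Each resulting section keeps its heading together with the text beneath it, so a chunk carries its own context.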

Text and Markdown

Plain text files are split at paragraph boundaries with an optimized target size and overlap between consecutive chunks. Markdown files preserve heading structure, using headers as natural split points.
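As a rough sketch of paragraph-boundary chunking with overlap, the function below packs paragraphs up to a target size and repeats the last paragraph of each chunk at the start of the next. The name `chunk_paragraphs`, the character-based target, and the default values are assumptions for illustration; the actual target size and overlap used by Thallus are not stated here.

```python
def chunk_paragraphs(text: str, target_chars: int = 1000, overlap_paras: int = 1) -> list[str]:
    """Pack paragraphs into chunks of roughly target_chars, repeating the last
    overlap_paras paragraphs of each chunk at the start of the next one."""
    paras = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: list[str] = []
    current: list[str] = []
    for para in paras:
        candidate = current + [para]
        if current and len("\n\n".join(candidate)) > target_chars:
            chunks.append("\n\n".join(current))
            # Carry the tail of the finished chunk forward as overlap.
            tail = current[-overlap_paras:] if overlap_paras > 0 else []
            current = tail + [para]
        else:
            current = candidate
    if current:
        chunks.append("\n\n".join(current))
    return chunks


chunks = chunk_paragraphs("aaaa\n\nbbbb\n\ncccc\n\ndddd", target_chars=10)
```

The overlap means a sentence near a chunk boundary appears in both neighbouring chunks, which reduces the chance that a search misses it. In this simple version a chunk can exceed the target when the carried-over paragraph plus the new one are together too long.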

Spreadsheets (CSV, XLSX, XLS)

Thallus analyzes each table's shape, columns, and content, then automatically chooses a chunking strategy suited to that structure.

All spreadsheets also get automatic schema, statistics, and sample data chunks so agents understand the structure of your data even before searching specific rows.
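The following sketch shows one way schema, statistics, and sample-data summaries might be derived from a CSV file using only the standard library. The function name `describe_csv`, the output wording, and the choice of statistics (min/max for numeric columns, distinct counts otherwise) are assumptions, not the format Thallus actually produces.

```python
import csv
import io


def describe_csv(csv_text: str, sample_rows: int = 2) -> dict[str, str]:
    """Build schema, statistics, and sample-data summaries for a CSV table."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = list(reader)
    columns = list(reader.fieldnames or [])
    stats = []
    for col in columns:
        values = [row[col] or "" for row in rows]  # missing cells become ""
        try:
            nums = [float(v) for v in values]
            stats.append(f"{col}: min={min(nums)}, max={max(nums)}")
        except ValueError:
            # Non-numeric (or empty) column: report distinct values instead.
            stats.append(f"{col}: {len(set(values))} distinct values")
    sample = "\n".join(", ".join(row[c] or "" for c in columns) for row in rows[:sample_rows])
    return {
        "schema": f"{len(rows)} rows; columns: " + ", ".join(columns),
        "statistics": "\n".join(stats),
        "sample": sample,
    }


summary = describe_csv("region,sales\nWest,120\nEast,80\n")
```

Summaries like these give an agent enough context to know which columns exist and what ranges they cover before it searches individual rows.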


Embedding

Each chunk is converted to a high-dimensional vector that captures its semantic meaning. Embeddings are processed in batches for efficiency.
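Batching can be sketched as below. The `embed_batch` callable stands in for whatever embedding model is configured; it, the function names, and the default batch size of 64 are hypothetical and chosen only for illustration.

```python
from typing import Callable, Iterator, Sequence

Vector = list[float]


def batched(items: Sequence[str], size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most `size` items."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def embed_all(
    chunks: Sequence[str],
    embed_batch: Callable[[Sequence[str]], list[Vector]],
    batch_size: int = 64,
) -> list[Vector]:
    """Embed chunks batch by batch, preserving the original chunk order."""
    vectors: list[Vector] = []
    for batch in batched(chunks, batch_size):
        vectors.extend(embed_batch(batch))
    return vectors


# Usage with a stand-in embedder that maps each text to its length.
calls: list[int] = []

def fake_embed(batch: Sequence[str]) -> list[Vector]:
    calls.append(len(batch))
    return [[float(len(text))] for text in batch]

vectors = embed_all(["a", "bb", "ccc", "dddd", "eeeee"], fake_embed, batch_size=2)
```

Sending several chunks per request reduces the number of round trips to the embedding model compared with embedding one chunk at a time.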

The embedding model can be configured per organization if you're using Bring Your Own Key.


Synopsis generation

After chunking, an LLM reads the document and produces structured metadata that powers document discovery:

Generated Synopsis
  Document type: report
  Entities: Acme Corp, Q4 2025, West Region
  Topics: sales performance, quota attainment
  Scope: Sales department

The synopsis captures key metadata about your document — its type, summary, topics, and entities mentioned — to power fast discovery. This metadata is also indexed for keyword search alongside vector similarity.
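One way to picture synopsis metadata being used for keyword lookup is the sketch below. The `Synopsis` class, its field names, and the all-words matching rule are assumptions made for illustration; they are not the schema or ranking logic Thallus uses, and the vector-similarity side is omitted.

```python
from dataclasses import dataclass, field


@dataclass
class Synopsis:
    document_type: str
    entities: list[str] = field(default_factory=list)
    topics: list[str] = field(default_factory=list)
    scope: str = ""

    def keyword_terms(self) -> set[str]:
        """Lowercased words from every field, for simple keyword lookup."""
        text = " ".join([self.document_type, self.scope, *self.entities, *self.topics])
        return set(text.lower().split())


def matches(synopsis: Synopsis, query: str) -> bool:
    """True if every word of the query appears among the synopsis terms."""
    return set(query.lower().split()) <= synopsis.keyword_terms()


synopsis = Synopsis(
    document_type="report",
    entities=["Acme Corp", "Q4 2025", "West Region"],
    topics=["sales performance", "quota attainment"],
    scope="Sales department",
)
found = matches(synopsis, "acme quota")
```

A keyword match on metadata can surface a document by an exact name such as a company or quarter, which complements similarity search over chunk embeddings.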


Processing status

You can track each document's progress from the Documents page:

Status               Meaning
Pending              Queued, waiting to start
Processing           Extracting, chunking, embedding
Completed            Fully processed, searchable
Completed (partial)  Truncated due to plan limits
Failed               Error occurred (retried up to 3 times)

Example document list:
  quarterly-report.pdf: Pending
  employee-handbook.docx: Processing
  sales-data.xlsx: Completed
  corrupted-file.pdf: Failed

If processing fails, Thallus retries automatically up to 3 times. Persistent failures typically indicate a corrupted or password-protected file.
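The retry behavior can be sketched as a bounded loop. This version assumes "retried up to 3 times" means one initial attempt plus up to three retries; the function name and the returned status strings are illustrative, and the real scheduler may add delays or distinguish error types.

```python
from typing import Callable

MAX_RETRIES = 3  # from the text: failures are retried up to 3 times


def process_with_retries(process: Callable[[], None], max_retries: int = MAX_RETRIES) -> str:
    """Run process once, then retry on failure up to max_retries more times."""
    for _attempt in range(1 + max_retries):
        try:
            process()
            return "Completed"
        except Exception:
            continue  # a real system would log the error before retrying
    return "Failed"


# Usage: a job that fails twice before succeeding.
attempts = {"count": 0}

def flaky() -> None:
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RuntimeError("transient error")

status = process_with_retries(flaky)
```

A file that is corrupted or password-protected fails on every attempt, so it ends in the Failed state once the retries are exhausted.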


Cache refresh

When processing completes, conversation caches refresh automatically. New documents become searchable in all active conversations — there's no need to start a new conversation to access recently processed files.