Document Processing

How George AI converts documents into searchable, AI-ready content

What is Document Processing?

Document processing is the automated pipeline that transforms raw files (PDFs, Word docs, images) into text content and vector embeddings that can be searched semantically and used by AI assistants.

Every file uploaded or crawled into a Library goes through this processing pipeline automatically, managed by a background queue system.

Extraction

Converts documents to markdown text using format-specific parsers and OCR for images

Embedding

Splits text into chunks and generates vector embeddings for semantic search

Processing Pipeline

When a file is uploaded or crawled, a processing task is automatically created and queued:

1. Task Creation

A content processing task is created and added to the queue with status pending

Status: pending
Processing started: (waiting)

2. Extraction Phase

Text and images are extracted from the document based on its format

Extraction Methods:

  • PDF: Text extraction + OCR for images
  • Office (Word, Excel, PPT): Native parsers
  • Images: Vision model OCR
  • Archives: Extract and process contents
  • HTML/Markdown: Direct conversion
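
The format-to-method mapping above can be pictured as a small lookup table. The sketch below is illustrative only: the function name `pick_extraction_method`, the extension groups, and the method labels are assumptions made for this example, not George AI's actual code or its full list of supported formats.

```python
from pathlib import Path

# Hypothetical extension groups; the real set of supported formats may differ.
EXTRACTION_METHODS = {
    "pdf_text_plus_ocr": {".pdf"},
    "native_office_parser": {".docx", ".xlsx", ".pptx"},
    "vision_model_ocr": {".png", ".jpg", ".jpeg"},
    "unpack_and_process_contents": {".zip"},
    "direct_conversion": {".html", ".htm", ".md"},
}


def pick_extraction_method(filename: str) -> str:
    """Return the label of the extraction method for a file, based on its extension."""
    suffix = Path(filename).suffix.lower()
    for method, extensions in EXTRACTION_METHODS.items():
        if suffix in extensions:
            return method
    return "unsupported"
```

Matching on the lower-cased extension keeps the lookup simple. A production system would more likely inspect the MIME type or file signature as well.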

Configuration Options:

  • Enable/disable text extraction
  • Enable/disable image processing
  • OCR model selection
  • OCR prompt customization
  • OCR image scale
  • OCR timeout

Settings configured at Library level
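
As a rough mental model, the Library-level options listed above could be grouped into a single settings object. Every field name, default value, and unit in this sketch is an assumption chosen for illustration. The real settings are edited in the Library settings UI and may be named and stored differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExtractionSettings:
    """Hypothetical container mirroring the Library-level extraction options."""

    text_extraction_enabled: bool = True
    image_processing_enabled: bool = True
    ocr_model: str = "qwen2.5vl:latest"  # example model name from the troubleshooting section
    ocr_prompt: str = "Transcribe this page to markdown."  # placeholder prompt
    ocr_image_scale: float = 1.0  # assumed default
    ocr_timeout_seconds: int = 120  # assumed default and unit

    def __post_init__(self) -> None:
        # Reject values that could never produce a usable OCR run.
        if self.ocr_image_scale <= 0:
            raise ValueError("ocr_image_scale must be positive")
        if self.ocr_timeout_seconds <= 0:
            raise ValueError("ocr_timeout_seconds must be positive")
```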

Status: extracting
Extraction started: 2025-01-15 10:30:15
Output: document.md (extracted markdown)

Extraction can time out if the file is very large or complex. The timeout is configurable per Library.

3. Embedding Phase

Extracted text is chunked and converted to vector embeddings for semantic search

Step | Description
1. Chunking | Text is split into smaller chunks (paragraphs or sections) for efficient processing
2. Embedding | Each chunk is converted to a vector embedding using the configured AI model
3. Storage | Embeddings are stored in the Typesense vector database for fast semantic search

Status: embedding
Embedding started: 2025-01-15 10:30:45
Chunks created: 127
Embedding model: nomic-embed-text
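
To illustrate why embeddings make semantic search possible, the sketch below ranks chunks by cosine similarity to a query vector. It uses toy three-dimensional vectors rather than output from a real embedding model, and plain Python rather than Typesense, which handles storage and search in George AI. The function names are hypothetical.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_chunks(query_vector: list[float], chunk_vectors: dict[str, list[float]]) -> list[str]:
    """Return chunk ids ordered from most to least similar to the query."""
    return sorted(
        chunk_vectors,
        key=lambda chunk_id: cosine_similarity(query_vector, chunk_vectors[chunk_id]),
        reverse=True,
    )
```

Chunks whose vectors point in nearly the same direction as the query rank first, even when they share no exact keywords with it.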

Embedding Configuration (Library-level):

  • Embedding Model: Which AI model to use (e.g., nomic-embed-text, mxbai-embed-large)
  • Embedding Timeout: Maximum time allowed for embedding generation
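
The chunking step of the Embedding Phase can be sketched as a paragraph-based splitter. This is a minimal illustration under stated assumptions: the function name `chunk_text`, the blank-line paragraph boundary, and the `max_chars` limit are all hypothetical, and George AI's actual chunking strategy may differ.

```python
def chunk_text(text: str, max_chars: int = 1000) -> list[str]:
    """Group paragraphs into chunks of at most max_chars characters where possible."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: list[str] = []
    current = ""
    for paragraph in paragraphs:
        candidate = f"{current}\n\n{paragraph}" if current else paragraph
        if len(candidate) <= max_chars:
            current = candidate
        else:
            if current:
                chunks.append(current)
            # A single oversized paragraph becomes its own chunk, unsplit.
            current = paragraph
    if current:
        chunks.append(current)
    return chunks
```

Keeping paragraphs intact means each chunk carries a coherent unit of meaning, which tends to produce more useful embeddings than cutting at fixed character offsets.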

4. Processing Complete

Task is marked as completed. File is now searchable and available for AI assistants.

Status: completed
Processing finished: 2025-01-15 10:31:02
Total processing time: 47,000 ms
File is now searchable!
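
The total processing time in the example is consistent with the timestamps shown: 10:30:15 to 10:31:02 is 47 seconds, or 47,000 ms. The arithmetic can be checked with a short helper. The helper name and the timestamp format string are assumptions for this example.

```python
from datetime import datetime

TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M:%S"


def elapsed_ms(started: str, finished: str) -> int:
    """Milliseconds between two timestamps in the format used in the status examples."""
    delta = datetime.strptime(finished, TIMESTAMP_FORMAT) - datetime.strptime(started, TIMESTAMP_FORMAT)
    return int(delta.total_seconds() * 1000)


print(elapsed_ms("2025-01-15 10:30:15", "2025-01-15 10:31:02"))  # prints 47000
```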

Processing Queue System

George AI uses a background queue system to manage processing tasks efficiently:

Content Processing Queue

Handles text extraction and embedding generation for all files

Enrichment Queue

Handles AI-powered data extraction for List enrichment fields

Queue Worker Behavior

Running

Worker continuously picks up pending tasks and processes them

Stopped

Worker is paused. No new tasks are processed, but tasks already in progress continue
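
The Running and Stopped behavior can be modeled with a small in-memory worker. This is a simplified sketch: the class and method names are hypothetical, and the real queue is a background system rather than the in-process lists used here.

```python
from collections import deque
from typing import Optional


class QueueWorker:
    """Hypothetical worker that only picks up new tasks while running."""

    def __init__(self) -> None:
        self.running = False
        self.pending: deque[str] = deque()
        self.in_progress: list[str] = []
        self.completed: list[str] = []

    def enqueue(self, task_id: str) -> None:
        """Add a task to the end of the pending queue."""
        self.pending.append(task_id)

    def pick_up_next(self) -> Optional[str]:
        """Move one pending task to in-progress, unless the worker is stopped."""
        if not self.running or not self.pending:
            return None
        task_id = self.pending.popleft()
        self.in_progress.append(task_id)
        return task_id

    def finish(self, task_id: str) -> None:
        """In-progress tasks can still finish after the worker is stopped."""
        self.in_progress.remove(task_id)
        self.completed.append(task_id)
```

Stopping the worker only gates `pick_up_next`. A task that was already picked up can still reach `finish`, which mirrors the description of the Stopped state.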

Automatic Processing

Files are processed automatically when added to a Library. This behavior can be configured in Library settings using the "Auto-process crawled files" option.

Processing Task States

State | Description | Next Step
none | No processing has been initiated | Wait for task creation or trigger manually
pending | Task is queued and waiting for a worker | Worker will pick it up automatically
validating | File format and integrity are being checked | Moves to extracting or validationFailed
extracting | Text and images are being extracted | Moves to embedding or extractionFailed
embedding | Vector embeddings are being generated | Moves to completed or embeddingFailed
completed | Processing finished successfully | File is searchable and ready
failed | Processing failed at some stage | Retry via file menu or queue management
timedOut | Processing exceeded configured timeout | Retry with adjusted timeout or check file
cancelled | Task was manually cancelled | Create new task if needed
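
The transitions named in the table can be written down as a lookup. The sketch below copies the state names as they appear in the table, including the stage-specific failure names. The transitions from none and pending are assumptions, since the table does not state them explicitly, and the function names are hypothetical.

```python
# Transitions stated in the table; failure names are copied as written there.
NEXT_STATES = {
    "validating": ["extracting", "validationFailed"],
    "extracting": ["embedding", "extractionFailed"],
    "embedding": ["completed", "embeddingFailed"],
}

# Assumption: a task is queued first, then enters validation when a worker picks it up.
ASSUMED_NEXT_STATES = {
    "none": ["pending"],
    "pending": ["validating"],
}


def can_transition(current: str, target: str) -> bool:
    """True if the table (or the labeled assumptions) allows current -> target."""
    allowed = NEXT_STATES.get(current, []) + ASSUMED_NEXT_STATES.get(current, [])
    return target in allowed


def is_retry_candidate(state: str) -> bool:
    """States for which the table suggests a retry."""
    return state in {"failed", "timedOut"}
```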

Monitoring Processing

You can monitor and manage processing tasks through the Admin Panel:

Processing Queue Dashboard

Admin Panel → Processing Queue

View Task Statistics:

  • Pending tasks count
  • Currently processing tasks
  • Failed tasks
  • Completed tasks
  • Last processed timestamp
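
The counts in the statistics list can be derived by grouping tasks by state. This sketch assumes that "currently processing" means any task in the validating, extracting, or embedding state. The function name and the shape of the result are hypothetical and do not describe the dashboard's actual implementation.

```python
from collections import Counter

# Assumption: these three states together count as "currently processing".
ACTIVE_STATES = ("validating", "extracting", "embedding")


def summarize_tasks(states: list[str]) -> dict[str, int]:
    """Count tasks per dashboard category from a list of task states."""
    counts = Counter(states)
    return {
        "pending": counts["pending"],
        "processing": sum(counts[state] for state in ACTIVE_STATES),
        "failed": counts["failed"],
        "completed": counts["completed"],
    }
```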

Management Actions:

  • Start/Stop queue workers
  • Retry failed tasks
  • Clear failed tasks
  • Clear pending tasks
  • Cancel specific tasks

Troubleshooting Processing Issues

Files stuck in "pending" state

Possible Causes:

  • Queue worker is stopped
  • Too many tasks overwhelming the queue
  • System resources exhausted

Solutions:

  • Check Admin Panel → Processing Queue to verify worker is running
  • Start the worker if stopped
  • Monitor system CPU/memory usage
  • Consider adding more AI Service servers for parallel processing

Extraction fails or times out

Possible Causes:

  • File is corrupted or in an unsupported format
  • File is extremely large (100+ pages)
  • OCR timeout too low for complex images
  • AI model not available

Solutions:

  • Verify file can be opened in its native application
  • Increase extraction timeout in Library settings
  • Increase OCR timeout for image-heavy documents
  • Check AI Services status
  • Try re-processing the file via file menu

Embedding fails or times out

Possible Causes:

  • Embedding model not loaded on AI Services
  • Extracted text is extremely large
  • Embedding timeout too low
  • Typesense vector database connectivity issues

Solutions:

  • Verify embedding model is available (check Library settings)
  • Increase embedding timeout in Library settings
  • Check Typesense service status
  • Verify AI Services can connect to Typesense
  • Retry embedding via file menu

Poor OCR quality for images/scans

Possible Causes:

  • Low-resolution images
  • Poor scan quality
  • OCR prompt not optimized for document type
  • OCR image scale too low

Solutions:

  • Increase OCR image scale in Library settings (try 1.5 or 2.0)
  • Customize OCR prompt to describe document structure
  • Use higher-resolution source images if possible
  • Try a different OCR model (e.g., qwen2.5vl:latest)
  • Re-process file after adjusting settings
