Document Processing
How George AI converts documents into searchable, AI-ready content
What is Document Processing?
Document processing is the automated pipeline that transforms raw files (PDFs, Word docs, images) into text content and vector embeddings that can be searched semantically and used by AI assistants.
Every file uploaded or crawled into a Library goes through this processing pipeline automatically, managed by a background queue system.
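At a high level, each file moves through a short sequence of states as it is processed. The sketch below is illustrative TypeScript, not George AI's actual code; the type and function names are placeholders for the phases described in the rest of this page.

```typescript
// Illustrative sketch only: the phase sequence described on this page, not George AI's actual code.
type ProcessingStatus =
  | "pending" | "validating" | "extracting" | "embedding"
  | "completed" | "failed" | "timedOut" | "cancelled";

interface ProcessingTask {
  fileId: string;
  libraryId: string;
  status: ProcessingStatus;
  createdAt: Date;
}

// A task advances through the two main phases in order.
async function processTask(task: ProcessingTask): Promise<void> {
  task.status = "extracting";
  const markdown = await extractToMarkdown(task.fileId);        // format-specific parsers + OCR
  task.status = "embedding";
  await embedAndStore(task.libraryId, task.fileId, markdown);   // chunk, embed, store vectors
  task.status = "completed";
}

// Placeholders for the two phases detailed below.
declare function extractToMarkdown(fileId: string): Promise<string>;
declare function embedAndStore(libraryId: string, fileId: string, text: string): Promise<void>;
```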
Extraction
Converts documents to markdown text using format-specific parsers and OCR for images
Embedding
Splits text into chunks and generates vector embeddings for semantic search
Processing Pipeline
When a file is uploaded or crawled, a processing task is automatically created and queued:
Task Creation
A content processing task is created and added to the queue with status pending
Status: pending
Processing started: (waiting)
Extraction Phase
Text and images are extracted from the document based on its format
Extraction Methods:
- PDF: Text extraction + OCR for images
- Office (Word, Excel, PPT): Native parsers
- Images: Vision model OCR
- Archives: Extract and process contents
- HTML/Markdown: Direct conversion
Configuration Options:
- Enable/disable text extraction
- Enable/disable image processing
- OCR model selection
- OCR prompt customization
- OCR image scale
- OCR timeout
Settings are configured at the Library level (see the sketch below).
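Conceptually, the Library-level options feed into whichever extractor handles the file format. The following is a hedged sketch with hypothetical names; the settings object and extractor functions shown are not George AI's actual schema:

```typescript
import path from "node:path";

// Hypothetical shape of the Library-level extraction settings listed above.
interface ExtractionSettings {
  extractText: boolean;       // enable/disable text extraction
  processImages: boolean;     // enable/disable image processing
  ocrModel: string;           // vision model used for OCR
  ocrPrompt: string;          // prompt sent alongside each image
  ocrImageScale: number;      // upscale factor applied before OCR
  ocrTimeoutMs: number;       // per-image OCR timeout
}

// Hypothetical dispatch: route each file to a format-specific extractor,
// passing the Library settings through to the ones that need them.
async function extractToMarkdown(filePath: string, settings: ExtractionSettings): Promise<string> {
  const ext = path.extname(filePath).toLowerCase();
  switch (ext) {
    case ".pdf":
      return extractPdf(filePath, settings);       // text layer + OCR for embedded images
    case ".docx":
    case ".xlsx":
    case ".pptx":
      return extractOffice(filePath, settings);    // native Office parsers
    case ".png":
    case ".jpg":
    case ".jpeg":
      return ocrImage(filePath, settings);         // vision model OCR
    case ".zip":
      return extractArchive(filePath, settings);   // unpack and process contents
    case ".html":
    case ".md":
      return convertMarkup(filePath);              // direct conversion to markdown
    default:
      throw new Error(`Unsupported format: ${ext}`);
  }
}

// Format-specific extractors are assumed to exist elsewhere.
declare function extractPdf(p: string, s: ExtractionSettings): Promise<string>;
declare function extractOffice(p: string, s: ExtractionSettings): Promise<string>;
declare function ocrImage(p: string, s: ExtractionSettings): Promise<string>;
declare function extractArchive(p: string, s: ExtractionSettings): Promise<string>;
declare function convertMarkup(p: string): Promise<string>;
```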
Status: extracting
Extraction started: 2025-01-15 10:30:15
Output: document.md (extracted markdown)
Embedding Phase
Extracted text is chunked and converted to vector embeddings for semantic search
| Step | Description |
|---|---|
| 1. Chunking | Text is split into smaller chunks (paragraphs or sections) for efficient processing |
| 2. Embedding | Each chunk is converted to a vector embedding using the configured AI model |
| 3. Storage | Embeddings are stored in Typesense vector database for fast semantic search |
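A rough sketch of those three steps, assuming a naive paragraph-based chunker and the standard Typesense JavaScript client; the collection name, document shape, and embed() helper are illustrative rather than George AI's actual schema:

```typescript
import Typesense from "typesense";

// Illustrative only: how chunking, embedding, and storage could fit together.
const typesense = new Typesense.Client({
  nodes: [{ host: "localhost", port: 8108, protocol: "http" }],
  apiKey: "xyz",
});

// 1. Chunking: split extracted markdown into paragraph-sized pieces.
function chunk(markdown: string, maxChars = 2000): string[] {
  const paragraphs = markdown.split(/\n{2,}/);
  const chunks: string[] = [];
  let current = "";
  for (const p of paragraphs) {
    if (current && (current + "\n\n" + p).length > maxChars) {
      chunks.push(current);
      current = p;
    } else {
      current = current ? current + "\n\n" + p : p;
    }
  }
  if (current) chunks.push(current);
  return chunks;
}

// 2. Embedding: convert each chunk to a vector with the Library's configured model.
//    embed() stands in for a call to the AI Service (e.g. nomic-embed-text).
declare function embed(model: string, text: string): Promise<number[]>;

// 3. Storage: write chunk text + vector into a Typesense collection.
async function embedAndStore(fileId: string, markdown: string, model: string): Promise<void> {
  const pieces = chunk(markdown);
  const docs = await Promise.all(
    pieces.map(async (text, i) => ({
      id: `${fileId}-${i}`,
      file_id: fileId,
      text,
      embedding: await embed(model, text),   // vector field used for semantic search
    })),
  );
  await typesense.collections("chunks").documents().import(docs, { action: "upsert" });
}
```

With a layout like this, a semantic search becomes a vector query against the embedding field, filtered by file or Library; the exact collection schema depends on the deployment.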
Status: embedding
Embedding started: 2025-01-15 10:30:45
Chunks created: 127
Embedding model: nomic-embed-text
Embedding Configuration (Library-level):
- Embedding Model: Which AI model to use (e.g., nomic-embed-text, mxbai-embed-large)
- Embedding Timeout: Maximum time allowed for embedding generation
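These two options roughly translate to a model name plus a deadline around each embedding call. A minimal sketch with hypothetical field names:

```typescript
// Illustrative Library-level embedding settings and a timeout wrapper around embed calls.
interface EmbeddingSettings {
  embeddingModel: string;     // e.g. "nomic-embed-text" or "mxbai-embed-large"
  embeddingTimeoutMs: number; // abort embedding generation past this deadline
}

declare function embed(model: string, text: string): Promise<number[]>;

async function embedWithTimeout(text: string, settings: EmbeddingSettings): Promise<number[]> {
  const timeout = new Promise<never>((_, reject) =>
    setTimeout(() => reject(new Error("embedding timed out")), settings.embeddingTimeoutMs),
  );
  return Promise.race([embed(settings.embeddingModel, text), timeout]);
}
```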
Processing Complete
Task is marked as completed. File is now searchable and available for AI assistants.
Status: completed
Processing finished: 2025-01-15 10:31:02
Total processing time: 47,000 ms
File is now searchable!
Processing Queue System
George AI uses a background queue system to manage processing tasks efficiently:
Content Processing Queue
Handles text extraction and embedding generation for all files
Enrichment Queue
Handles AI-powered data extraction for List enrichment fields
Queue Worker Behavior
Running: The worker continuously picks up pending tasks and processes them.
Stopped: The worker is paused. No new tasks are processed, but tasks already in progress continue.
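A simple way to picture this behavior is a polling loop that only claims new work while the worker is running. The sketch below is illustrative; the queue primitives it calls are hypothetical:

```typescript
// Hypothetical polling worker: claims new tasks only while running.
interface ProcessingTask { id: string; status: string }

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

let workerRunning = true;   // toggled from Admin Panel → Processing Queue

async function runWorker(): Promise<void> {
  while (true) {
    if (!workerRunning) {
      // Stopped: no new tasks are claimed; tasks already in progress finish on their own.
      await sleep(5_000);
      continue;
    }
    const task = await claimNextPendingTask();   // atomically take the oldest pending task
    if (!task) {
      await sleep(2_000);                        // queue is empty, back off briefly
      continue;
    }
    try {
      await processTask(task);                   // extraction + embedding phases
    } catch (err) {
      await markFailed(task, err);               // shows up as failed in the dashboard
    }
  }
}

// Queue primitives are assumed to exist elsewhere.
declare function claimNextPendingTask(): Promise<ProcessingTask | null>;
declare function processTask(task: ProcessingTask): Promise<void>;
declare function markFailed(task: ProcessingTask, err: unknown): Promise<void>;
```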
Automatic Processing
Files are processed automatically when added to a Library. This behavior can be configured in Library settings using the "Auto-process crawled files" option.
Processing Task States
| State | Description | Next Step |
|---|---|---|
| none | No processing has been initiated | Wait for task creation or trigger manually |
| pending | Task is queued and waiting for a worker | Worker will pick it up automatically |
| validating | File format and integrity are being checked | Moves to extracting or validationFailed |
| extracting | Text and images are being extracted | Moves to embedding or extractionFailed |
| embedding | Vector embeddings are being generated | Moves to completed or embeddingFailed |
| completed | Processing finished successfully | File is searchable and ready |
| failed | Processing failed at some stage | Retry via file menu or queue management |
| timedOut | Processing exceeded configured timeout | Retry with adjusted timeout or check file |
| cancelled | Task was manually cancelled | Create new task if needed |
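The Next Step column amounts to a small transition map. One way to encode it is shown below (simplified: the validationFailed, extractionFailed, and embeddingFailed outcomes are collapsed into failed):

```typescript
// Illustrative encoding of the table above as an allowed-transition map.
type TaskState =
  | "none" | "pending" | "validating" | "extracting" | "embedding"
  | "completed" | "failed" | "timedOut" | "cancelled";

const nextStates: Record<TaskState, TaskState[]> = {
  none:       ["pending"],                          // task created or triggered manually
  pending:    ["validating", "cancelled"],          // picked up by a worker, or cancelled
  validating: ["extracting", "failed"],
  extracting: ["embedding", "failed", "timedOut"],
  embedding:  ["completed", "failed", "timedOut"],
  completed:  [],                                   // terminal: file is searchable
  failed:     ["pending"],                          // retried via file menu or queue management
  timedOut:   ["pending"],                          // retried after adjusting timeouts
  cancelled:  ["pending"],                          // a new task can be created if needed
};

function canTransition(from: TaskState, to: TaskState): boolean {
  return nextStates[from].includes(to);
}
```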
Monitoring Processing
You can monitor and manage processing tasks through the Admin Panel:
Processing Queue Dashboard
Admin Panel → Processing Queue
View Task Statistics:
- Pending tasks count
- Currently processing tasks
- Failed tasks
- Completed tasks
- Last processed timestamp
Management Actions:
- Start/Stop queue workers
- Retry failed tasks
- Clear failed tasks
- Clear pending tasks
- Cancel specific tasks
Troubleshooting Processing Issues
Processing is slow or tasks stay pending
Possible Causes:
- Queue worker is stopped
- Too many tasks overwhelming the queue
- System resources exhausted
Solutions:
- Check Admin Panel → Processing Queue to verify worker is running
- Start the worker if stopped
- Monitor system CPU/memory usage
- Consider adding more AI Service servers for parallel processing
Extraction fails or times out
Possible Causes:
- File is corrupted or unsupported format
- File is extremely large (100+ pages)
- OCR timeout too low for complex images
- AI model not available
Solutions:
- Verify file can be opened in its native application
- Increase extraction timeout in Library settings
- Increase OCR timeout for image-heavy documents
- Check AI Services status
- Try re-processing the file via file menu
Embedding fails
Possible Causes:
- Embedding model not loaded on AI Services
- Document extracted to extremely large text
- Embedding timeout too low
- Typesense vector database connectivity issues
Solutions:
- Verify embedding model is available (check Library settings)
- Increase embedding timeout in Library settings
- Check Typesense service status
- Verify AI Services can connect to Typesense
- Retry embedding via file menu
Poor OCR quality
Possible Causes:
- Low-resolution images
- Poor scan quality
- OCR prompt not optimized for document type
- OCR image scale too low
Solutions:
- Increase OCR image scale in Library settings (try 1.5 or 2.0)
- Customize OCR prompt to describe document structure
- Use higher-resolution source images if possible
- Try a different OCR model (e.g., qwen2.5vl:latest)
- Re-process file after adjusting settings
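As a concrete illustration, adjusting the OCR-related Library settings for a batch of low-quality scans might look like the snippet below; the field names are hypothetical, and the actual options live in the Library settings UI:

```typescript
// Hypothetical example: Library OCR settings tuned for low-quality scans.
const ocrSettings = {
  ocrModel: "qwen2.5vl:latest",   // try a different vision model
  ocrImageScale: 2.0,             // upscale images before OCR
  ocrTimeoutMs: 120_000,          // give complex pages more time
  ocrPrompt:
    "Transcribe all text in this scanned document, preserving headings and table rows.",
};
```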