Libraries
Organize your data into isolated collections with independent processing settings and access control
What are Libraries?
A Library in George AI is an isolated workspace that contains files, crawlers, and processing configurations. Each Library has its own:
- Files and documents (from crawlers or manual uploads)
- Embedding model configuration for AI-powered search
- File processing settings (text extraction, OCR)
- Access control (owner and participants)
- API keys for programmatic access
Creating a Library
- Navigate to Libraries
- Create New Library
- Configure Settings
- Add Crawlers or Upload Files
Minimal Setup Required
You only need to provide a Library name to get started. All other settings have sensible defaults and can be configured later.
Library Settings
| Setting | Type / Default | Description |
|---|---|---|
| Name | Required | A descriptive name for your Library (e.g., "Pharmaceutical Packaging", "Customer Support Emails"). |
| Description | Optional | Detailed description of what this Library contains and its purpose. |
| Embedding Model | AI Feature | Which AI model to use for creating vector embeddings of your documents. This enables semantic search. Available models depend on your AI service configuration. |
| Embedding Timeout | Default: 180000 ms (3 min) | Maximum time to wait for embedding generation per document before timing out. |
| Auto-process Hash-skipped Files | Advanced | Automatically creates processing tasks for files that were skipped during crawling because their content hash hasn't changed. When enabled: files with unchanged content are still reprocessed if they have no successful or pending processing task. Note: uploaded files and new or updated crawled files are always processed automatically, regardless of this setting. |
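The interaction between the Auto-process Hash-skipped Files setting and existing processing tasks can be easier to follow as code. The following is a minimal sketch, not George AI's actual implementation: the `LibrarySettings` class, the `should_create_processing_task` function, the task status strings, and the assumption that the setting is off by default are all hypothetical. Only the 180000 ms embedding timeout default and the decision rules come from the table above.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class LibrarySettings:
    """Hypothetical in-memory model of the settings described above."""
    name: str                                 # the only required setting
    description: Optional[str] = None         # optional
    embedding_model: Optional[str] = None     # depends on the AI service configuration
    embedding_timeout_ms: int = 180_000       # documented default (3 min)
    auto_process_hash_skipped: bool = False   # assumption: disabled by default


def should_create_processing_task(
    settings: LibrarySettings,
    *,
    hash_skipped: bool,
    task_statuses: Iterable[str],
) -> bool:
    """Decide whether a file should get a new processing task.

    `task_statuses` holds the statuses of the file's existing tasks; the
    strings "success" and "pending" are illustrative names only.
    """
    # Uploads and new or updated crawled files are always processed.
    if not hash_skipped:
        return True
    # Hash-skipped files are only considered when the setting is enabled.
    if not settings.auto_process_hash_skipped:
        return False
    # Reprocess only if no successful or pending task exists yet.
    return not any(status in ("success", "pending") for status in task_statuses)
```

With the setting enabled, a hash-skipped file whose only task failed would be queued again, while one that already has a successful task would be left alone.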
Per-Library Configuration
Each Library can have different processing settings. For example, one Library might use OCR for scanned documents while another uses only text extraction.
| Option | Description |
|---|---|
| Enable Text Extraction | Extracts text directly from PDFs and other documents that contain selectable text. Fast and accurate for digital documents. |
| Enable Image OCR Processing | Uses AI vision models to extract text from images and scanned documents. Enables processing of screenshots, photos, and scanned PDFs. |
OCR Settings (when Image OCR is enabled)
| Setting | Description |
|---|---|
| OCR Prompt | Instructions for the AI vision model on how to extract content from images. |
| OCR Model | AI vision model to use for OCR. Default: |
| OCR Timeout | Maximum time in seconds to wait for OCR processing per image. Default: 120 seconds. |
| OCR Image Scale | Scale factor applied to images before OCR processing (higher = more detail, slower). Default: 1.5. |
| Max Consecutive Repeats | Stop OCR if the same line repeats this many times (prevents hallucination). Default: 5. |
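The Max Consecutive Repeats guard can be pictured as a simple run-length check over the lines an OCR model produces. The sketch below is illustrative only and is not taken from George AI's code: the function name `stop_on_repeats` and the choice to keep the lines seen before the limit was hit are assumptions. The default of 5 matches the table above.

```python
from typing import Iterable, List, Optional, Tuple


def stop_on_repeats(
    lines: Iterable[str],
    max_consecutive_repeats: int = 5,
) -> Tuple[List[str], bool]:
    """Collect OCR output lines until one line repeats too many times in a row.

    Returns the lines kept so far and a flag telling whether the guard fired.
    """
    kept: List[str] = []
    previous: Optional[str] = None
    run_length = 0
    for line in lines:
        if line == previous:
            run_length += 1
        else:
            previous = line
            run_length = 1
        # The same line has now appeared `max_consecutive_repeats` times
        # in a row: treat it as runaway output and stop reading.
        if run_length >= max_consecutive_repeats:
            return kept, True
        kept.append(line)
    return kept, False
```

For example, `stop_on_repeats(["Total", "N/A", "N/A", "N/A"], max_consecutive_repeats=3)` stops at the third `"N/A"` and reports that the guard fired.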
| Role | Description |
|---|---|
| Owner | The user who created the Library. Owners have full control over Library settings and can delete it. |
| Participants | Users who have access to view and use files in this Library. Participants can search files and use assistants, but cannot modify Library settings. |
Generate API keys to access this Library programmatically via the GraphQL or REST APIs.
Keep API Keys Secret
API keys provide full access to this Library. Store them securely and never commit them to version control.
Use API keys in n8n workflows, custom scripts, or integrations to upload files, trigger processing, or query data.
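One common way to keep a key out of version control is to read it from an environment variable at runtime. The sketch below assumes a hypothetical variable name, `GEORGE_AI_API_KEY`, and assumes the key is sent as a `Bearer` token in the `Authorization` header; neither detail is confirmed by this page, so check the API documentation for the actual header format.

```python
import os
from typing import Dict, Mapping, Optional

# Hypothetical environment variable name, chosen for this example only.
API_KEY_ENV_VAR = "GEORGE_AI_API_KEY"


def build_auth_headers(env: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Build request headers from a key stored outside the source tree."""
    env = os.environ if env is None else env
    api_key = env.get(API_KEY_ENV_VAR, "").strip()
    if not api_key:
        # Fail early instead of sending unauthenticated requests.
        raise RuntimeError(f"Set {API_KEY_ENV_VAR} before calling the API")
    # Assumption: the API accepts the key as a Bearer token.
    return {"Authorization": f"Bearer {api_key}"}
```

A script or integration would pass the returned headers to its HTTP client, so the key itself never appears in the repository.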
Real-World Library Configurations
Pharmaceutical Packaging PDFs
Files: 30,000+ packaging specification PDFs
Text Extraction: ✓ Enabled
Image OCR: ✓ Enabled (for diagrams and technical drawings)
Embedding Model: nomic-embed-text
Why: The PDFs contain a mix of text and visual specifications, so both extraction methods are needed
Email Project Tracking
Files: Emails from shared mailboxes
Text Extraction: ✓ Enabled
Image OCR: ✗ Disabled (plain text emails)
Auto-process Hash-skipped: ✓ Enabled (re-analyze if metadata missing)
Why: Emails are text-only, so OCR is unnecessary. Auto-processing ensures enrichment runs.
Historical Scanned Documents
Files: Scanned archives (images only)
Text Extraction: ✗ Disabled (no selectable text)
Image OCR: ✓ Enabled
OCR Timeout: 180s (complex documents)
Why: Old scanned documents have no selectable text, so only OCR is needed
Hospital Intranet Knowledge Base
Files: Crawled HTML pages from intranet
Text Extraction: ✓ Enabled
Image OCR: ✓ Enabled (for embedded images/charts)
Embedding Timeout: 300000ms (large documents)
Why: Web pages have text and images that both need processing
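The four configurations above can be summarized as overrides on top of the documented defaults. This is a sketch for comparison only: the key names and the `effective_settings` helper are hypothetical, and only values stated on this page are included. Note the units: the embedding timeout is given in milliseconds, while the OCR timeout is given in seconds.

```python
from typing import Any, Dict

# Defaults stated in the settings tables on this page.
DOCUMENTED_DEFAULTS: Dict[str, Any] = {
    "embedding_timeout_ms": 180_000,   # milliseconds
    "ocr_timeout_s": 120,              # seconds
    "ocr_image_scale": 1.5,
    "max_consecutive_repeats": 5,
}

# Per-Library overrides, copied from the examples above.
EXAMPLE_LIBRARIES: Dict[str, Dict[str, Any]] = {
    "Pharmaceutical Packaging PDFs": {
        "text_extraction": True,
        "image_ocr": True,
        "embedding_model": "nomic-embed-text",
    },
    "Email Project Tracking": {
        "text_extraction": True,
        "image_ocr": False,
        "auto_process_hash_skipped": True,
    },
    "Historical Scanned Documents": {
        "text_extraction": False,
        "image_ocr": True,
        "ocr_timeout_s": 180,
    },
    "Hospital Intranet Knowledge Base": {
        "text_extraction": True,
        "image_ocr": True,
        "embedding_timeout_ms": 300_000,
    },
}


def effective_settings(library_name: str) -> Dict[str, Any]:
    """Merge a Library's overrides over the documented defaults."""
    return {**DOCUMENTED_DEFAULTS, **EXAMPLE_LIBRARIES[library_name]}
```

Calling `effective_settings("Historical Scanned Documents")` yields the 180-second OCR timeout from the example while keeping the default embedding timeout.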
Best Practices
Separate by Processing Needs
Create different Libraries for different document types. For example, keep scanned PDFs (needing OCR) separate from digital documents (text extraction only) to optimize processing.
Start with Defaults
The default processing settings work well for most use cases. Only adjust OCR settings if you have specific requirements or quality issues.
Monitor Processing Performance
If processing is slow or timing out, check the Processing Queue to identify bottlenecks. You may need to adjust timeouts or disable unnecessary features.
Next Steps
Now that you understand Library configuration, the next step is to populate your Libraries with data by adding crawlers or uploading files.