Core Concept

Libraries

Organize your data into isolated collections with independent processing settings and access control

What are Libraries?

A Library in George AI is an isolated workspace that contains files, crawlers, and processing configurations. Each Library has its own:

Files and documents (from crawlers or manual uploads)
Embedding model configuration for AI-powered search
File processing settings (text extraction, OCR)
Access control (owner and participants)
API keys for programmatic access

Creating a Library

1. Navigate to Libraries
2. Create New Library
3. Configure Settings
4. Add Crawlers or Upload Files

Minimal Setup Required

You only need to provide a Library name to get started. All other settings have sensible defaults and can be configured later.
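As a rough sketch of what "only a name is required" means, the settings described on this page can be modeled as a record in which every field except the name has a default. The class and field names below are hypothetical and are not George AI's actual schema; the numeric and model defaults are the ones stated on this page. Where this page does not state a default (for example, whether text extraction is on), the field is left as `None`.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LibrarySettings:
    """Hypothetical model of a Library's settings (field names are illustrative)."""

    name: str  # the only required value
    description: str = ""
    embedding_model: Optional[str] = None  # depends on the AI service configuration
    embedding_timeout_ms: int = 180_000  # documented default (3 min)
    auto_process_hash_skipped: Optional[bool] = None  # default not stated on this page
    text_extraction: Optional[bool] = None  # default not stated on this page
    image_ocr: Optional[bool] = None  # default not stated on this page
    ocr_model: str = "qwen2.5vl:latest"  # documented default
    ocr_timeout_s: int = 120  # documented default
    ocr_image_scale: float = 1.5  # documented default
    ocr_max_consecutive_repeats: int = 5  # documented default


# Only the name is supplied; everything else falls back to a default.
library = LibrarySettings(name="Customer Support Emails")
```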

Library Settings

Basic Information

Name	Required. A descriptive name for your Library (e.g., "Pharmaceutical Packaging", "Customer Support Emails")
Description	Optional. A detailed description of what this Library contains and its purpose

Library Processing Options

Embedding Model

AI Feature

Which AI model to use for creating vector embeddings of your documents. This enables semantic search.

Available models depend on your AI service configuration

Embedding Timeout

Default: 180000ms (3 min)

Maximum time to wait for embedding generation per document before timing out
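A minimal sketch of how a per-document timeout like this could be enforced, assuming the embedding call is asynchronous. `embed_document` is a hypothetical stand-in and not a George AI function. Note that the setting is in milliseconds, while `asyncio.wait_for` expects seconds.

```python
import asyncio

EMBEDDING_TIMEOUT_MS = 180_000  # documented default (3 min)


async def embed_document(text: str, delay_s: float = 0.0) -> list[float]:
    """Hypothetical stand-in for a call to an embedding model."""
    await asyncio.sleep(delay_s)  # simulates model latency
    return [float(len(text))]


async def embed_with_timeout(
    text: str, timeout_ms: int = EMBEDDING_TIMEOUT_MS, delay_s: float = 0.0
):
    """Return the embedding, or None if generation exceeds the timeout."""
    try:
        # Convert milliseconds to the seconds asyncio.wait_for expects.
        return await asyncio.wait_for(
            embed_document(text, delay_s), timeout=timeout_ms / 1000
        )
    except asyncio.TimeoutError:
        return None
```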

Auto-process Hash-skipped Files

Advanced

Automatically create processing tasks for files that were skipped during crawling because their content hash hasn't changed

When enabled: Files with unchanged content will still be reprocessed if they have no successful or pending processing task

Note: Uploaded files and new/updated crawled files always get processed automatically regardless of this setting
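The rules above can be read as a small decision function. This is a sketch of the described behavior only; the function and parameter names are hypothetical, and a newly crawled file is treated here as having a changed content hash.

```python
def should_create_processing_task(
    *,
    uploaded: bool,
    content_hash_changed: bool,
    has_successful_or_pending_task: bool,
    auto_process_hash_skipped: bool,
) -> bool:
    """Decide whether a file gets a processing task, per the rules on this page."""
    # Uploaded files and new/updated crawled files are always processed.
    if uploaded or content_hash_changed:
        return True
    # Hash-skipped file: reprocess only if the setting is enabled and the
    # file has no successful or pending processing task.
    return auto_process_hash_skipped and not has_successful_or_pending_task
```

With the setting disabled, an unchanged crawled file is never re-queued, even if its earlier processing failed.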

File Processing Options

Per-Library Configuration

Each Library can have different processing settings. For example, one Library might use OCR for scanned documents while another uses only text extraction.

Option	Description
Enable Text Extraction	Extract text directly from PDFs and documents that contain selectable text. Fast and accurate for digital documents.
Enable Image OCR Processing	Use AI vision models to extract text from images and scanned documents. Enables processing of screenshots, photos, and scanned PDFs.

OCR Settings
When Image OCR is enabled

OCR Prompt	Instructions for the AI vision model on how to extract content from images. Default:
"Please give me the content of this image as markdown structured as follows:
Short summary what you see in the image
List all visual blocks with a headline and its content
Return plain and well structured Markdown."
OCR Model	AI vision model to use for OCR. Default: `qwen2.5vl:latest`
OCR Timeout	Maximum time in seconds to wait for OCR processing per image. Default: 120 seconds
OCR Image Scale	Scale factor for images before OCR processing (higher = more detail, slower). Default: 1.5
Max Consecutive Repeats	Stop OCR if the same line repeats this many times (prevents hallucination). Default: 5
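To illustrate the idea behind Max Consecutive Repeats, here is a minimal sketch of a repeat guard over OCR output lines. It is not George AI's implementation; the function name is hypothetical, and it assumes "repeats N times" means N identical lines in a row, with output cut off just before the Nth copy.

```python
def truncate_on_repeats(lines, max_consecutive_repeats=5):
    """Return lines up to the point where one line repeats too many times in a row."""
    kept = []
    previous = None
    run = 0  # length of the current run of identical lines
    for line in lines:
        run = run + 1 if line == previous else 1
        previous = line
        if run >= max_consecutive_repeats:
            break  # likely a hallucination loop: stop here
        kept.append(line)
    return kept
```

A vision model stuck in a loop tends to emit the same line over and over, so cutting off at a small threshold keeps the useful output and discards the runaway tail.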

Access Control

Owner

The user who created the Library. Owners have full control over Library settings and can delete it.

Participants

Users who have access to view and use files in this Library

Participants can search files and use assistants, but cannot modify Library settings
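The two roles can be summarized as a permission table. This is an illustrative sketch; the role and action identifiers are hypothetical and only mirror the actions named on this page.

```python
from enum import Enum


class Role(Enum):
    OWNER = "owner"
    PARTICIPANT = "participant"


# Actions named on this page; the identifiers themselves are illustrative.
PERMISSIONS = {
    Role.OWNER: {"search_files", "use_assistants", "modify_settings", "delete_library"},
    Role.PARTICIPANT: {"search_files", "use_assistants"},
}


def is_allowed(role: Role, action: str) -> bool:
    """Return True if the given role may perform the action."""
    return action in PERMISSIONS[role]
```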

API Keys

For Developers

Generate API keys to access this Library programmatically via GraphQL or REST APIs

Keep API Keys Secret

API keys provide full access to this Library. Store them securely and never commit them to version control.

Use API keys in n8n workflows, custom scripts, or integrations to upload files, trigger processing, or query data.
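A hedged sketch of keeping a key out of source code in a custom script: the key is read from an environment variable and attached to a GraphQL request. Everything specific here is an assumption: the variable name `GEORGE_AI_API_KEY`, the bearer-token header, and the endpoint URL you pass in are placeholders, so check the API reference for the real values. The function only builds the request and does not send it.

```python
import json
import os
import urllib.request

API_KEY_ENV_VAR = "GEORGE_AI_API_KEY"  # hypothetical variable name


def build_graphql_request(endpoint: str, query: str) -> urllib.request.Request:
    """Build (but do not send) a GraphQL POST request authenticated with an API key."""
    api_key = os.environ.get(API_KEY_ENV_VAR)
    if not api_key:
        raise RuntimeError(f"Set {API_KEY_ENV_VAR} in the environment first")
    body = json.dumps({"query": query}).encode("utf-8")
    return urllib.request.Request(
        endpoint,
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            # Assumption: bearer-token auth; the real scheme may differ.
            "Authorization": f"Bearer {api_key}",
        },
    )
```

Because the key lives only in the environment, the script itself can be committed to version control without exposing it.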

Real-World Library Configurations

Pharmaceutical Packaging PDFs

Files: 30,000+ packaging specification PDFs

Text Extraction: ✓ Enabled

Image OCR: ✓ Enabled (for diagrams and technical drawings)

Embedding Model: nomic-embed-text

Why: PDFs contain a mix of text and visual specifications that need both extraction methods

Email Project Tracking

Files: Emails from shared mailboxes

Text Extraction: ✓ Enabled

Image OCR: ✗ Disabled (plain text emails)

Auto-process Hash-skipped: ✓ Enabled (re-analyze if metadata missing)

Why: Emails are text-only, so OCR is unnecessary. Auto-processing ensures enrichment runs.

Historical Scanned Documents

Files: Scanned archives (images only)

Text Extraction: ✗ Disabled (no selectable text)

Image OCR: ✓ Enabled

OCR Timeout: 180s (complex documents)

Why: Old scanned documents contain no selectable text, so only OCR is needed

Hospital Intranet Knowledge Base

Files: Crawled HTML pages from intranet

Text Extraction: ✓ Enabled

Image OCR: ✓ Enabled (for embedded images/charts)

Embedding Timeout: 300000ms (large documents)

Why: Web pages have text and images that both need processing
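The four configurations above can be expressed as overrides on top of the documented defaults, which makes the differences between them easy to see. The dictionary keys are hypothetical names chosen for this sketch; the values come from the examples above.

```python
# Documented defaults (keys are illustrative, not George AI's schema).
DEFAULTS = {
    "embedding_timeout_ms": 180_000,
    "ocr_timeout_s": 120,
}

# Per-Library overrides taken from the four examples above.
LIBRARIES = {
    "Pharmaceutical Packaging PDFs": {
        "text_extraction": True,
        "image_ocr": True,
        "embedding_model": "nomic-embed-text",
    },
    "Email Project Tracking": {
        "text_extraction": True,
        "image_ocr": False,
        "auto_process_hash_skipped": True,
    },
    "Historical Scanned Documents": {
        "text_extraction": False,
        "image_ocr": True,
        "ocr_timeout_s": 180,
    },
    "Hospital Intranet Knowledge Base": {
        "text_extraction": True,
        "image_ocr": True,
        "embedding_timeout_ms": 300_000,
    },
}


def effective_settings(name: str) -> dict:
    """Merge a Library's overrides over the defaults."""
    return {**DEFAULTS, **LIBRARIES[name]}
```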

Best Practices

Separate by Processing Needs

Create different Libraries for different document types. For example, keep scanned PDFs (needing OCR) separate from digital documents (text extraction only) to optimize processing.

Start with Defaults

The default processing settings work well for most use cases. Only adjust OCR settings if you have specific requirements or quality issues.

Monitor Processing Performance

If processing is slow or timing out, check the Processing Queue to identify bottlenecks. You may need to adjust timeouts or disable unnecessary features.

Next Steps

Now that you understand Library configuration, learn how to populate your Libraries with data:

Set Up Crawlers
Manage Files
Learn About Processing
