Core Concept

Libraries

Organize your data into isolated collections with independent processing settings and access control

What are Libraries?

A Library in George AI is an isolated workspace that contains files, crawlers, and processing configurations. Each Library has its own:

  • Files and documents (from crawlers or manual uploads)
  • Embedding model configuration for AI-powered search
  • File processing settings (text extraction, OCR)
  • Access control (owner and participants)
  • API keys for programmatic access

Creating a Library

  • Navigate to Libraries
  • Create New Library
  • Configure Settings
  • Add Crawlers or Upload Files

Minimal Setup Required

You only need to provide a Library name to get started. All other settings have sensible defaults and can be configured later.
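As a rough illustration of "name only, everything else defaulted", the sketch below models a Library's settings as a Python dataclass. The class and field names are hypothetical and not part of George AI. The default values are the ones documented in the settings reference on this page, and settings whose defaults are not documented are left out.

```python
from dataclasses import dataclass


@dataclass
class LibrarySettings:
    """Hypothetical model of Library settings; names are illustrative only."""

    name: str                                 # the only required value
    description: str = ""                     # optional
    embedding_timeout_ms: int = 180_000       # documented default: 3 min
    ocr_model: str = "qwen2.5vl:latest"       # documented default
    ocr_timeout_s: int = 120                  # documented default, in seconds
    ocr_image_scale: float = 1.5              # documented default
    ocr_max_consecutive_repeats: int = 5      # documented default

    def __post_init__(self) -> None:
        # Reject empty or whitespace-only names, since the name is required.
        if not self.name.strip():
            raise ValueError("A Library needs a non-empty name")


# Only the name is supplied; every other field falls back to its default.
settings = LibrarySettings(name="Customer Support Emails")
```

Note the unit difference that the defaults carry: the embedding timeout is expressed in milliseconds, while the OCR timeout is expressed in seconds.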

Library Settings

Basic Information
Name
Required

A descriptive name for your Library (e.g., "Pharmaceutical Packaging", "Customer Support Emails")

Description
Optional

Detailed description of what this Library contains and its purpose

Library Processing Options
Embedding Model
AI Feature

Which AI model to use for creating vector embeddings of your documents. This enables semantic search.

Available models depend on your AI service configuration

Embedding Timeout
Default: 180000ms (3 min)

Maximum time to wait for embedding generation per document before timing out

Auto-process Hash-skipped Files
Advanced

Automatically create processing tasks for files that were skipped during crawling because their content hash hasn't changed

When enabled: Files with unchanged content will still be reprocessed if they have no successful or pending processing task

Note: Uploaded files and new/updated crawled files always get processed automatically regardless of this setting
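The rule described above can be read as a small decision function. The sketch below is one interpretation of that description, not George AI's actual implementation. The function name, the parameters, and the `TaskStatus` enum are assumptions made for illustration.

```python
from enum import Enum
from typing import Iterable


class TaskStatus(Enum):
    """Hypothetical processing task states."""

    PENDING = "pending"
    SUCCESS = "success"
    FAILED = "failed"


def needs_processing_task(
    hash_skipped: bool,
    auto_process_hash_skipped: bool,
    existing_tasks: Iterable[TaskStatus],
) -> bool:
    """Decide whether a file should get a new processing task."""
    if not hash_skipped:
        # Uploaded files and new/updated crawled files are always processed.
        return True
    if not auto_process_hash_skipped:
        # Setting disabled: unchanged files are left alone.
        return False
    # Setting enabled: process only if no successful or pending task exists.
    covered = any(
        status in (TaskStatus.SUCCESS, TaskStatus.PENDING)
        for status in existing_tasks
    )
    return not covered
```

Under this reading, an unchanged file whose only earlier task failed would be picked up again once the setting is enabled, while an unchanged file with a successful task would be skipped.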

File Processing Options

Per-Library Configuration

Each Library can have different processing settings. For example, one Library might use OCR for scanned documents while another uses only text extraction.

Enable Text Extraction

Extract text directly from PDFs and documents that contain selectable text

Fast and accurate for digital documents

Enable Image OCR Processing

Use AI vision models to extract text from images and scanned documents

Enables processing of screenshots, photos, and scanned PDFs

OCR Settings
When Image OCR is enabled

OCR Prompt

Instructions for the AI vision model on how to extract content from images

Default: "Please give me the content of this image as
markdown structured as follows:
Short summary what you see in the image
List all visual blocks with a headline and its content
Return plain and well structured Markdown."

OCR Model

AI vision model to use for OCR

Default: qwen2.5vl:latest

OCR Timeout

Maximum time in seconds to wait for OCR processing per image

Default: 120 seconds

OCR Image Scale

Scale factor for images before OCR processing (higher = more detail, slower)

Default: 1.5
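To see why a higher scale means slower processing, the sketch below works through the arithmetic. It assumes the scale factor is applied uniformly to both width and height, which this page does not state explicitly. The function names are illustrative.

```python
def scaled_size(width: int, height: int, scale: float = 1.5) -> tuple:
    """Return the image dimensions after scaling both sides by `scale`."""
    return round(width * scale), round(height * scale)


def pixel_growth(scale: float = 1.5) -> float:
    """Return how many times more pixels the scaled image contains."""
    # Both dimensions grow by `scale`, so the pixel count grows by scale squared.
    return scale * scale


new_width, new_height = scaled_size(1000, 800)  # (1500, 1200) at the default scale
growth = pixel_growth()                         # 2.25 at the default scale
```

Under this assumption, the default of 1.5 more than doubles the number of pixels the vision model has to handle, so raising the scale further is worth it only when small text is being missed.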

Max Consecutive Repeats

Stop OCR if the same line repeats this many times in a row (guards against the vision model getting stuck in a hallucinated loop)

Default: 5
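A guard of this kind can be sketched as a simple run-length check over the OCR output lines. This is an illustrative approximation, not George AI's implementation. The function name is hypothetical, and whether the repeated lines themselves are kept in the result is an assumption.

```python
from typing import Iterable, List, Tuple


def truncate_on_repeats(
    lines: Iterable[str], max_consecutive_repeats: int = 5
) -> Tuple[List[str], bool]:
    """Collect lines until one line occurs `max_consecutive_repeats` times in a row.

    Returns the collected lines and a flag that is True if the limit was hit.
    """
    if max_consecutive_repeats < 1:
        raise ValueError("max_consecutive_repeats must be at least 1")

    kept: List[str] = []
    previous = None
    run_length = 0
    for line in lines:
        current = line.strip()  # ignore surrounding whitespace when comparing
        if current == previous:
            run_length += 1
        else:
            previous = current
            run_length = 1
        kept.append(line)
        if run_length >= max_consecutive_repeats:
            return kept, True   # limit reached: stop consuming output
    return kept, False


kept, stopped = truncate_on_repeats(["a", "b", "b", "b", "c"], max_consecutive_repeats=3)
```

In this example the third consecutive "b" reaches the limit, so the trailing "c" is never consumed and `stopped` is True.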

Access Control
Owner

The user who created the Library. Owners have full control over Library settings and can delete it.

Participants

Users who have access to view and use files in this Library

Participants can search files and use assistants, but cannot modify Library settings
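The two roles can be summarized as a permission check. The sketch below is a simplified model of the rules described above. The `Action` names and the function signature are hypothetical, and the real permission set may be more fine-grained.

```python
from enum import Enum, auto
from typing import Collection


class Action(Enum):
    """Hypothetical actions a user might attempt on a Library."""

    VIEW_FILES = auto()
    SEARCH_FILES = auto()
    USE_ASSISTANTS = auto()
    MODIFY_SETTINGS = auto()
    DELETE_LIBRARY = auto()


# Participants can view and use files, but cannot change or delete the Library.
PARTICIPANT_ACTIONS = frozenset(
    {Action.VIEW_FILES, Action.SEARCH_FILES, Action.USE_ASSISTANTS}
)


def is_allowed(
    user_id: str, owner_id: str, participant_ids: Collection[str], action: Action
) -> bool:
    """Return True if the user may perform the action on the Library."""
    if user_id == owner_id:
        return True  # owners have full control
    if user_id in participant_ids:
        return action in PARTICIPANT_ACTIONS
    return False     # everyone else has no access
```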

API Keys
For Developers

Generate API keys to access this Library programmatically via GraphQL or REST APIs

Keep API Keys Secret

API keys provide full access to this Library. Store them securely and never commit them to version control.

Use API keys in n8n workflows, custom scripts, or integrations to upload files, trigger processing, or query data.
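For custom scripts, one common way to keep a key out of version control is to read it from an environment variable at runtime. The sketch below shows that pattern only. The variable name `GEORGE_AI_API_KEY` and the Bearer-style `Authorization` header are assumptions, so check your George AI API reference for the actual authentication scheme and endpoints.

```python
import os


def load_api_key(env_var: str = "GEORGE_AI_API_KEY") -> str:
    """Read the Library API key from the environment (variable name is hypothetical)."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(
            f"Set the {env_var} environment variable instead of hard-coding the key"
        )
    return key


def auth_headers(api_key: str) -> dict:
    """Build request headers; the Bearer scheme is an assumption, not confirmed."""
    return {"Authorization": f"Bearer {api_key}"}
```

Because the key never appears in the script itself, the script can be committed safely while the key lives in a secrets manager or a local, untracked environment file.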

Real-World Library Configurations

Pharmaceutical Packaging PDFs

Files: 30,000+ packaging specification PDFs

Text Extraction: ✓ Enabled

Image OCR: ✓ Enabled (for diagrams and technical drawings)

Embedding Model: nomic-embed-text

Why: PDFs contain a mix of text and visual specifications that need both extraction methods

Email Project Tracking

Files: Emails from shared mailboxes

Text Extraction: ✓ Enabled

Image OCR: ✗ Disabled (plain text emails)

Auto-process Hash-skipped: ✓ Enabled (re-analyze if metadata missing)

Why: Emails are text-only, so OCR is unnecessary. Auto-processing ensures enrichment runs.

Historical Scanned Documents

Files: Scanned archives (images only)

Text Extraction: ✗ Disabled (no selectable text)

Image OCR: ✓ Enabled

OCR Timeout: 180s (complex documents)

Why: Old scanned documents contain no selectable text, so only OCR is needed

Hospital Intranet Knowledge Base

Files: Crawled HTML pages from intranet

Text Extraction: ✓ Enabled

Image OCR: ✓ Enabled (for embedded images/charts)

Embedding Timeout: 300000ms (large documents)

Why: Web pages have text and images that both need processing
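The four configurations above can be written as overrides on top of the documented defaults, which makes it easy to see what each Library actually changes. The dictionary keys below are hypothetical names chosen for illustration. The values are taken from the examples and defaults on this page.

```python
# Defaults documented in the settings reference (key names are illustrative).
DOCUMENTED_DEFAULTS = {
    "embedding_timeout_ms": 180_000,
    "ocr_model": "qwen2.5vl:latest",
    "ocr_timeout_s": 120,
    "ocr_image_scale": 1.5,
    "ocr_max_consecutive_repeats": 5,
}

# Only the values each example sets explicitly.
EXAMPLE_OVERRIDES = {
    "Pharmaceutical Packaging PDFs": {
        "text_extraction": True,
        "image_ocr": True,
        "embedding_model": "nomic-embed-text",
    },
    "Email Project Tracking": {
        "text_extraction": True,
        "image_ocr": False,
        "auto_process_hash_skipped": True,
    },
    "Historical Scanned Documents": {
        "text_extraction": False,
        "image_ocr": True,
        "ocr_timeout_s": 180,
    },
    "Hospital Intranet Knowledge Base": {
        "text_extraction": True,
        "image_ocr": True,
        "embedding_timeout_ms": 300_000,
    },
}


def effective_settings(library_name: str) -> dict:
    """Merge an example's overrides over the documented defaults."""
    return {**DOCUMENTED_DEFAULTS, **EXAMPLE_OVERRIDES[library_name]}
```

For instance, the scanned-archive Library raises only the OCR timeout to 180 seconds and keeps the default 180000 ms embedding timeout, while the intranet Library does the opposite.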

Best Practices

Separate by Processing Needs

Create different Libraries for different document types. For example, keep scanned PDFs (needing OCR) separate from digital documents (text extraction only) to optimize processing.

Start with Defaults

The default processing settings work well for most use cases. Only adjust OCR settings if you have specific requirements or quality issues.

Monitor Processing Performance

If processing is slow or timing out, check the Processing Queue to identify bottlenecks. You may need to adjust timeouts or disable unnecessary features.

Next Steps

Now that you understand Library configuration, learn how to populate your Libraries with data.
