Embeddings & Vector Search

Transform documents into searchable vector embeddings for semantic similarity and AI-powered search

What are Embeddings?

Embeddings are numerical vector representations of text that capture semantic meaning. Unlike traditional keyword search, embeddings enable similarity-based search where documents with similar meanings are found even if they use different words.

Semantic Search

Find documents by meaning, not just keywords

Distance Metrics

Measure semantic similarity with vector distance

Fast Retrieval

Lightning-fast vector search powered by Typesense

The Embedding Pipeline

George AI converts your documents into searchable vectors through a multi-stage pipeline:

1. Document Processing

Documents are converted to a common intermediate format

Common Markdown Format

All documents (PDFs, Word, Excel, images, HTML) are extracted to Markdown (.md) format. This standardized format ensures consistent processing regardless of the original file type.
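A minimal sketch of this normalization step, assuming a simple dispatch-by-extension design (the converter names and the fallback to OCR are illustrative only; George AI's actual converters are internal):

```python
from pathlib import Path

# Hypothetical dispatch table mapping file extensions to converters.
CONVERTERS = {
    ".pdf": "pdf-to-markdown",
    ".docx": "docx-to-markdown",
    ".xlsx": "xlsx-to-markdown",
    ".pptx": "pptx-to-markdown",
    ".html": "html-to-markdown",
    ".md": "passthrough",
    ".txt": "passthrough",
}

def converter_for(filename: str) -> str:
    """Pick the converter that normalizes a file to Markdown.

    Unknown extensions (e.g. images) fall back to OCR in this sketch.
    """
    return CONVERTERS.get(Path(filename).suffix.lower(), "ocr-to-markdown")
```

Whatever the input, the output of this stage is always Markdown, so every later stage can assume one format.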

Supported Input Formats:

PDF
Word (.docx)
Excel (.xlsx)
PowerPoint (.pptx)
Images (OCR)
HTML
Markdown
Plain Text
2. Text Chunking

Markdown content is split into manageable chunks for embedding

Chunking Strategy:

  • Preserves document structure (headings, sections)
  • Creates overlapping chunks for context
  • Maintains semantic coherence
  • Tracks heading paths for navigation

Chunk Metadata:

  • chunkIndex - Position in file
  • subChunkIndex - Sub-section index
  • headingPath - Hierarchical location
  • section - Content text
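
The strategy and metadata above can be sketched as a toy chunker. This is illustrative only (George AI's real chunker is internal); the chunk size, overlap, and splitting rule are assumptions, but the output dictionaries use the metadata fields listed above:

```python
def chunk_section(text: str, heading_path: str, chunk_index: int,
                  max_chars: int = 200, overlap: int = 50) -> list[dict]:
    """Split one section into overlapping sub-chunks.

    Consecutive sub-chunks share `overlap` characters of context, and
    each carries the heading path so it can be located in the document.
    """
    step = max_chars - overlap
    sub_chunks = []
    for sub_index, start in enumerate(range(0, max(len(text) - overlap, 1), step)):
        sub_chunks.append({
            "chunkIndex": chunk_index,
            "subChunkIndex": sub_index,
            "headingPath": heading_path,
            "section": text[start:start + max_chars],
        })
    return sub_chunks
```

The overlap means a sentence near a chunk boundary appears in both neighboring chunks, so a query matching it is not lost to an unlucky split.
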
3. Vector Embedding

Each chunk is converted to a high-dimensional vector using an embedding model

Chunk Text: "George AI enables semantic search..."
                    ↓
Embedding Model (configured per Library)
                    ↓
Vector: [0.123, -0.456, 0.789, ..., 0.234]
         (typically 384-1536 dimensions)

Library-Specific Configuration

Each Library has its own embeddingModel setting. All files in that Library use the same model for consistent vector space.
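
To make the step concrete, here is a pure-Python stand-in for an embedding model: it hashes character trigrams into a fixed-size, L2-normalized vector. This is a toy (real embeddings come from the model configured on the Library), but it shows the contract of the stage: text in, fixed-dimension unit vector out.

```python
import hashlib
import math

def toy_embed(text: str, dims: int = 384) -> list[float]:
    """Hash character trigrams into a `dims`-sized vector, then
    L2-normalize so vectors are comparable by cosine distance."""
    vec = [0.0] * dims
    for i in range(len(text) - 2):
        h = int(hashlib.md5(text[i:i + 3].encode()).hexdigest(), 16)
        vec[h % dims] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]
```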

4. Typesense Storage

Vectors are indexed in Typesense for fast similarity search

Storage Structure

  • One collection per Library
  • Documents grouped by File ID
  • Each chunk stored with metadata
  • Vectors indexed for fast retrieval

Indexed Fields

  • libraryId - Library identifier
  • fileId - Source file ID
  • fileName - Original filename
  • originUri - Source location
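
A plausible per-Library collection schema, combining the indexed fields above with Typesense's `float[]` vector field type. The exact schema George AI creates is internal; the collection naming and extra chunk fields here are assumptions:

```python
def library_collection_schema(library_id: str, dims: int) -> dict:
    """Sketch of a per-Library Typesense collection schema.

    The `embedding` field uses Typesense's `float[]` type with `num_dim`
    set to the Library's embedding model dimension.
    """
    return {
        "name": f"library_{library_id}",
        "fields": [
            {"name": "libraryId", "type": "string", "facet": True},
            {"name": "fileId", "type": "string", "facet": True},
            {"name": "fileName", "type": "string"},
            {"name": "originUri", "type": "string"},
            {"name": "chunkIndex", "type": "int32"},
            {"name": "headingPath", "type": "string"},
            {"name": "section", "type": "string"},
            {"name": "embedding", "type": "float[]", "num_dim": dims},
        ],
    }
```

One collection per Library keeps every vector in the collection in the same vector space, since all files in a Library share one embedding model.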

Similarity Search & Distance Metrics

One of the most powerful features of embeddings is the ability to find similar content using distance metrics:

How Distance Works

Vectors are points in high-dimensional space. The distance between two vectors indicates how similar their semantic meanings are:

Distance      Interpretation
0.0 - 0.3     Very Similar
0.3 - 0.6     Moderately Similar
0.6 - 0.9     Somewhat Related
0.9 - 1.0+    Not Related
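
The distance values above follow cosine distance, where 0.0 means identical direction and 1.0 means orthogonal (unrelated). A minimal implementation (assuming cosine is the metric in use; the actual metric depends on the index configuration):

```python
import math

def cosine_distance(a: list[float], b: list[float]) -> float:
    """Cosine distance between two vectors: 1 minus cosine similarity."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)
```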

Finding Distance Gaps

Large distance gaps can reveal topic boundaries or missing information:

Chunk 1: distance 0.15 ← very similar
Chunk 2: distance 0.18 ← very similar
Chunk 3: distance 0.21 ← very similar
Chunk 4: distance 0.87 ← GAP! Topic change
Chunk 5: distance 0.91 ← new topic
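
Detecting such a gap programmatically is a one-liner over the sorted result distances. The 0.4 threshold here is an illustrative choice, not a product default:

```python
def find_topic_gaps(distances: list[float], threshold: float = 0.4) -> list[int]:
    """Return indices where the distance jumps by more than `threshold`
    relative to the previous chunk, signaling a likely topic boundary."""
    return [
        i for i in range(1, len(distances))
        if distances[i] - distances[i - 1] > threshold
    ]
```

Applied to the example above, the jump from 0.21 to 0.87 is flagged at index 3.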

Similarity Search Query

Find chunks similar to a specific file or search term:

query {
  aiSimilarFileChunks(
    fileId: "file-id-here"
    term: "optional search term"
    hits: 20
  ) {
    id
    text
    distance        # ← Similarity score
    chunkIndex
    headingPath
    fileName
    fileId
  }
}

Use Case: Content Recommendations

Use distance values to recommend related documents or identify content clusters. Large gaps help detect topic changes or missing documentation.

Browsing File Chunks

You can retrieve and examine all chunks for a specific file:

Query File Chunks

query {
  aiFileChunks(
    fileId: "file-id-here"
    skip: 0
    take: 20
  ) {
    fileId
    fileName
    libraryId
    count          # Total chunks
    chunks {
      id
      chunkIndex
      subChunkIndex
      section      # Chunk text content
      headingPath  # Document structure
      text         # Full chunk text
    }
  }
}

Understanding Chunk Structure

Field          Description                     Example
chunkIndex     Sequential chunk number         0, 1, 2, ...
subChunkIndex  Sub-section within chunk        0, 1, 2, ...
headingPath    Hierarchical document path      "Introduction > Setup"
section        Main content text               "George AI provides..."
distance       Similarity score (in search)    0.234

Library Embedding Configuration

Each Library has its own embedding configuration that applies to all files within it:

Key Settings

Embedding Model

libraryEmbeddingModel

The AI model used to generate vector embeddings. All files in the Library share the same model for consistent vector space.

Vector Store

useVectorStore

Enable/disable vector storage in Typesense. When enabled, chunks are embedded and indexed for similarity search.

Important: Model Consistency

Changing the embedding model after files are processed requires re-embedding all files. Different models produce incompatible vector spaces.
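
Why incompatible: models generally differ in dimension count, and even at equal dimensions the axes carry different meanings, so cross-model distances are meaningless. A minimal sanity check (dimensions here are illustrative examples):

```python
# Vectors from two different embedding models.
model_a_vec = [0.1] * 384   # e.g. a 384-dimension model
model_b_vec = [0.1] * 1536  # e.g. a 1536-dimension model

def comparable(a: list[float], b: list[float]) -> bool:
    """Dimension check: a necessary (not sufficient) condition for
    computing a meaningful distance between two vectors."""
    return len(a) == len(b)
```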

Practical Use Cases

Semantic Search

Users can search by concept instead of exact keywords. "How do I reset my password?" finds relevant docs even if they use "credential recovery".

AI Question Answering

Retrieve the most relevant chunks for a user's question, providing context to LLMs for accurate, grounded answers.

Content Recommendations

Suggest related documents based on similarity scores. "Users viewing this document also found these helpful."

Gap Analysis

Use distance gaps to identify missing documentation, detect topic boundaries, or find duplicate content.

Monitoring Embedding Status

Track the embedding status of files in your library:

Embedding Status Values

Status     Description                                     Next Step
none       File has not been processed yet                 Trigger processing task
pending    Queued for embedding                            Wait for worker to process
running    Currently being embedded                        Processing in progress
completed  Successfully embedded and indexed               Ready for search
failed     Embedding failed (timeout, error)               Retry or check logs
skipped    Intentionally skipped (vector store disabled)   Enable vector store if needed

Query a file's current and most recent embedding status:

query {
  aiLibraryFile(fileId: "file-id") {
    id
    title
    embeddingStatus
    lastEmbedding {
      embeddingStatus
      embeddingModelName
      chunksCount
      embeddingTimeMs
    }
  }
}

Next Steps

George-Cloud