Files

Understanding the file lifecycle from upload to searchable content

What are Files?

Files are the core content in George AI. Each file belongs to a Library and goes through automated processing to extract text, generate embeddings, and make content searchable.

Files can be added manually via upload or automatically through Crawlers that collect documents from external sources like SharePoint, file shares, or email.

Manual Upload

Upload files directly through the web interface into a Library

Automated Crawling

Configure Crawlers to automatically collect files from external systems

File Lifecycle

Every file in George AI goes through a processing pipeline to make it searchable and usable for AI assistants:

  • Upload or Crawl

    File is added to the Library (manually uploaded or collected by a Crawler)



  • Validation

    File format and integrity are checked



  • Extraction

    Text and images are extracted from the document (supports PDF, Office docs, images with OCR, etc.)



  • Embedding

    Text is split into chunks and converted to vector embeddings for semantic search



  • Completed

    File is now searchable and available for AI assistants
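
The Embedding step above splits text into chunks before converting it to vectors. This page does not describe how George AI chunks text, so the following is only a minimal sketch of one common approach: fixed-size character windows with overlap. The function name `split_into_chunks` and the default sizes are assumptions for illustration.

```python
from typing import List


def split_into_chunks(text: str, chunk_size: int = 200, overlap: int = 50) -> List[str]:
    """Split text into character windows of `chunk_size` that overlap by `overlap`.

    Hypothetical helper: the real chunking strategy and sizes are not documented here.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")

    step = chunk_size - overlap
    chunks: List[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks
```

Overlap keeps a sentence that straddles a chunk boundary visible in both neighbouring chunks, at the cost of generating more chunks per file.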

Processing Can Fail

Files can fail at Validation (unsupported format), Extraction (corrupted file), or Embedding (timeout). You can retry processing via the file menu.
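
The lifecycle and its failure points can be sketched as a small staged pipeline. This is an illustrative model only, not George AI's implementation: the `Stage` enum, `run_pipeline`, and the idea of resuming a retry at the failed stage are all assumptions.

```python
from enum import Enum
from typing import Callable, Dict, Optional, Tuple


class Stage(str, Enum):
    # Values mirror the in-progress Processing Status names on this page.
    VALIDATION = "validating"
    EXTRACTION = "extracting"
    EMBEDDING = "embedding"


def run_pipeline(
    handlers: Dict[Stage, Callable[[], None]],
    start_at: Stage = Stage.VALIDATION,
) -> Tuple[str, Optional[Stage]]:
    """Run stages in order from `start_at`.

    Returns ("completed", None) on success, or ("failed", stage) for the
    first stage whose handler raises.
    """
    stages = list(Stage)
    for stage in stages[stages.index(start_at):]:
        try:
            handlers[stage]()
        except Exception:
            return "failed", stage
    return "completed", None


# Usage: embedding times out once, then succeeds on retry.
calls = {"embed": 0}


def embed() -> None:
    calls["embed"] += 1
    if calls["embed"] == 1:
        raise TimeoutError("embedding timed out")


handlers = {
    Stage.VALIDATION: lambda: None,
    Stage.EXTRACTION: lambda: None,
    Stage.EMBEDDING: embed,
}

status, failed_at = run_pipeline(handlers)
if failed_at is not None:
    # Resume at the failed stage rather than starting over (an assumption).
    status, failed_at = run_pipeline(handlers, start_at=failed_at)
```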

File Processing Status

Files have three status indicators that track their progress:

Status Type | Values | Description
Processing Status | none, pending, validating, extracting, embedding, completed, failed | Overall processing state through the entire pipeline
Extraction Status | none, pending, running, completed, failed | Text and image extraction stage
Embedding Status | none, pending, running, completed, failed | Vector embedding generation stage
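
To show how the three indicators relate, here is a hedged sketch that models them as typed values and infers where a failed file stopped. The class and function names are hypothetical, and attributing a failure to Validation when neither stage indicator reports `failed` is an inference from this page, not documented behavior.

```python
from dataclasses import dataclass
from typing import Literal, Optional

ProcessingStatus = Literal[
    "none", "pending", "validating", "extracting", "embedding", "completed", "failed"
]
StageStatus = Literal["none", "pending", "running", "completed", "failed"]


@dataclass
class FileStatus:
    processing: ProcessingStatus = "none"
    extraction: StageStatus = "none"
    embedding: StageStatus = "none"


def is_searchable(status: FileStatus) -> bool:
    """A file is searchable once the overall pipeline has completed."""
    return status.processing == "completed"


def failed_stage(status: FileStatus) -> Optional[str]:
    """Return the stage that failed, or None if the file has not failed."""
    if status.processing != "failed":
        return None
    if status.extraction == "failed":
        return "extraction"
    if status.embedding == "failed":
        return "embedding"
    # Assumption: a failure with no failed stage indicator happened at Validation.
    return "validation"
```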

Status Badges in the UI

Extraction 2025-01-15
Embedding 2025-01-15
Unsupported Format
Legacy File

These badges appear in the file list and indicate processing completion times or errors.

File Metadata

Each file stores metadata that can be used for filtering, sorting, and enrichment:

Property | Description | Source
name | File name with extension | From upload or crawler
mimeType | File type (e.g., application/pdf, image/png) | Detected automatically
size | File size in bytes | Actual file size
originUri | Original location (file path, SharePoint URL, etc.) | From upload or crawler
originModificationDate | When the file was last modified at its source | From file system or crawler
uploadedAt | When the file was added to George AI | Set at creation time
createdAt | When the file record was created in the database | Set at creation time
archivedAt | When the file was archived (if applicable) | Set when file is archived
taskCount | Number of processing tasks associated with this file | Counted from processing queue
chunksCount | Number of vector embedding chunks generated | From embedding process
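
As an illustration of filtering and sorting on these properties, the sketch below models a file record with the property names from the table. The Python types and the helper `active_pdfs_by_size` are assumptions; this page does not specify how the values are represented or queried.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Iterable, List, Optional


@dataclass
class FileMetadata:
    # Field names follow the table above; the types are assumed.
    name: str
    mimeType: str
    size: int  # bytes
    originUri: str
    uploadedAt: datetime
    createdAt: datetime
    originModificationDate: Optional[datetime] = None
    archivedAt: Optional[datetime] = None
    taskCount: int = 0
    chunksCount: int = 0


def active_pdfs_by_size(files: Iterable[FileMetadata]) -> List[FileMetadata]:
    """Return non-archived PDF files, largest first."""
    pdfs = (
        f for f in files
        if f.mimeType == "application/pdf" and f.archivedAt is None
    )
    return sorted(pdfs, key=lambda f: f.size, reverse=True)
```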

Using Metadata in Lists

You can create List fields with sourceType: file_property to display file metadata (name, size, modified date, source) without AI processing.
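
Conceptually, a `file_property` field reads its value straight from the file record, with no model call. The sketch below shows that idea. Only `sourceType: file_property` comes from this page: the `fileProperty` key, the field and record shapes, and `resolve_list_field` are hypothetical.

```python
from typing import Any, Dict


def resolve_list_field(field: Dict[str, Any], file: Dict[str, Any]) -> Any:
    """Look up a List field value directly from file metadata."""
    if field.get("sourceType") != "file_property":
        raise ValueError("only file_property fields are handled in this sketch")
    # `fileProperty` is an assumed key naming which metadata property to show.
    return file.get(field["fileProperty"])


size_field = {"name": "Size (bytes)", "sourceType": "file_property", "fileProperty": "size"}
file_record = {"name": "report.pdf", "size": 48213, "originUri": "/shares/finance/report.pdf"}

value = resolve_list_field(size_field, file_record)
```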

File Actions

You can perform several actions on files through the file menu:

Reprocess (Re-extract)

Triggers a new extraction task to re-extract text and images from the file

Use when:

  • Extraction failed or timed out
  • Library extraction settings changed (e.g., updated OCR prompt)
  • File content was updated at the source

Re-embed

Triggers a new embedding task to regenerate vector embeddings

Use when:

  • Embedding failed or timed out
  • Library embedding model changed
  • Extraction was re-run with new content
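
The "Use when" rules for Reprocess and Re-embed can be summarized as a small decision helper. This is a sketch under assumptions: the function and parameter names are hypothetical, a timeout is treated as a `failed` status, and it assumes Re-embed has to be triggered separately after a re-extraction, which this page does not confirm.

```python
from typing import List


def suggest_actions(
    extraction_status: str,
    embedding_status: str,
    extraction_settings_changed: bool = False,
    source_content_changed: bool = False,
    embedding_model_changed: bool = False,
) -> List[str]:
    """Return the file-menu actions to run, in order."""
    needs_extraction = (
        extraction_status == "failed"
        or extraction_settings_changed
        or source_content_changed
    )
    # New extraction output means the existing embeddings are stale.
    needs_embedding = (
        needs_extraction
        or embedding_status == "failed"
        or embedding_model_changed
    )

    actions: List[str] = []
    if needs_extraction:
        actions.append("Reprocess")
    if needs_embedding:
        actions.append("Re-embed")
    return actions
```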

View Info

Shows detailed file metadata and processing information

Displays:

  • File size and format
  • Processing status
  • Number of chunks generated
  • Number of processing tasks
  • Crawler source (if applicable)
  • Origin modification date

View Extraction

Shows the extracted markdown content from the file

Use for:

  • Verifying extraction quality
  • Debugging enrichment issues
  • Understanding what content AI assistants see

Supported File Types

George AI supports a wide range of file formats for automatic text extraction:

Documents

  • PDF (.pdf)
  • Word (.docx, .doc)
  • PowerPoint (.pptx, .ppt)
  • Excel (.xlsx, .xls)
  • Text (.txt, .md, .csv)
  • HTML (.html, .htm)

Images (with OCR)

  • JPEG (.jpg, .jpeg)
  • PNG (.png)
  • TIFF (.tiff, .tif)
  • BMP (.bmp)
  • GIF (.gif)

Videos

  • MP4 (.mp4)
  • WebM (.webm)
  • AVI (.avi)
  • MOV (.mov)
  • MKV (.mkv)

Videos are processed with audio transcription and visual content extraction.

Archives

  • ZIP (.zip)
  • 7-Zip (.7z)
  • TAR (.tar, .tar.gz)

Archives are extracted and files inside are processed individually
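
The "extract, then process each file individually" behavior can be pictured with Python's standard `zipfile` module. This is only a sketch for ZIP archives held in memory. `iter_archive_files` is a hypothetical helper, and how George AI unpacks archives internally is not described on this page.

```python
import io
import zipfile
from typing import Iterator, Tuple


def iter_archive_files(archive_bytes: bytes) -> Iterator[Tuple[str, bytes]]:
    """Yield (path inside archive, file content) for every file in a ZIP."""
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as zf:
        for info in zf.infolist():
            if info.is_dir():
                continue  # Directory entries carry no content to process.
            yield info.filename, zf.read(info)


# Usage: build a small ZIP in memory, then walk its entries one by one.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("a.txt", "hello")
    zf.writestr("docs/b.md", "# hi")

entries = list(iter_archive_files(buffer.getvalue()))
```

Each yielded entry could then be fed through the same pipeline as a directly uploaded file.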

Unsupported Formats

If a file format is not supported, the file is marked with an "Unsupported Format" badge and no extraction is performed. The file metadata is still stored and searchable.
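
The supported-format lists above can be expressed as a lookup by file extension. Note the assumption: according to the metadata table, George AI detects the MIME type automatically, so matching on extensions here is a simplification for illustration, and `classify` and `badge_for` are hypothetical names.

```python
from typing import Dict, Optional, Set

# Extensions copied from the lists on this page.
SUPPORTED: Dict[str, Set[str]] = {
    "document": {".pdf", ".docx", ".doc", ".pptx", ".ppt", ".xlsx", ".xls",
                 ".txt", ".md", ".csv", ".html", ".htm"},
    "image": {".jpg", ".jpeg", ".png", ".tiff", ".tif", ".bmp", ".gif"},
    "video": {".mp4", ".webm", ".avi", ".mov", ".mkv"},
    "archive": {".zip", ".7z", ".tar", ".tar.gz"},
}


def classify(filename: str) -> Optional[str]:
    """Return the format category for a file name, or None if unsupported."""
    name = filename.lower()
    for category, extensions in SUPPORTED.items():
        # endswith (rather than a single-suffix split) handles ".tar.gz".
        if any(name.endswith(ext) for ext in extensions):
            return category
    return None


def badge_for(filename: str) -> Optional[str]:
    """Return the "Unsupported Format" badge text when no category matches."""
    return None if classify(filename) else "Unsupported Format"
```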

Need Additional Format Support?

For paying customers, we can add support for any file format. Alternatively, you can build your own automation workflows to transform files and ingest them into George AI. Contact us to discuss your specific requirements.
