Crawlers

Automatically collect files from external sources into your Libraries

What are Crawlers?

Crawlers are automated collectors that continuously gather files from external sources like SharePoint, file shares, websites, and cloud storage, bringing them into your George AI Libraries.

Instead of manually uploading files, you configure a crawler once, and it automatically discovers and imports files—keeping your Library synchronized with the source.

One-Time Setup

Configure a crawler once with the source URL and credentials

Automatic Updates

Schedule crawlers to run daily or weekly, or trigger runs manually

Supported Sources

SharePoint Online

Crawl SharePoint document libraries and OneDrive folders

URI Format: https://yourcompany.sharepoint.com/sites/SiteName

Authentication: Browser cookies (FedAuth, rtFa)

SMB / Windows File Share

Access network file shares and Windows servers

URI Format: smb://server/share/folder

Authentication: Username + Password

HTTP/HTTPS Websites

Crawl public or internal websites and download files

URI Format: https://docs.example.com/files

Authentication: None (public sites only)

Box.com

Access Box.com enterprise cloud storage folders

URI Format: https://app.box.com/folder/123456789

Authentication: Customer ID + Developer Token
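
Taken together, the four source types and the credentials each one needs can be summarized in a small sketch. The type and field names below are illustrative assumptions, not the actual George AI configuration schema.

```typescript
// Illustrative summary of the four supported source types and the
// credentials each one needs. Field names are assumptions, not the
// real George AI configuration schema.
type CrawlerSource =
  | { kind: 'sharepoint'; uri: string; cookies: string }            // FedAuth + rtFa cookie header
  | { kind: 'smb'; uri: string; username: string; password: string }
  | { kind: 'http'; uri: string }                                   // public sites only, no auth
  | { kind: 'box'; uri: string; customerId: string; developerToken: string };

const examples: CrawlerSource[] = [
  { kind: 'sharepoint', uri: 'https://yourcompany.sharepoint.com/sites/SiteName', cookies: 'FedAuth=...; rtFa=...' },
  { kind: 'smb', uri: 'smb://server/share/folder', username: 'DOMAIN\\username', password: '...' },
  { kind: 'http', uri: 'https://docs.example.com/files' },
  { kind: 'box', uri: 'https://app.box.com/folder/123456789', customerId: '...', developerToken: '...' },
];
```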

Creating a Crawler

  • Select Source Type
    SharePoint, SMB, HTTP, or Box
  • Configure URI & Limits
    URL, depth, max pages
  • Set File Filters
    Optional: patterns, size, MIME types
  • Add Credentials
    Authentication details
  • Schedule Runs
    Optional: daily/weekly automation
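
Putting the five steps together, a complete crawler definition might look roughly like the sketch below. All property names are illustrative assumptions that mirror the fields described in the sections that follow; they are not the exact George AI schema.

```typescript
// Hypothetical crawler definition combining all five setup steps.
// Property names mirror the fields described on this page and are
// assumptions, not the exact George AI schema.
const crawler = {
  sourceType: 'sharepoint',                              // Step 1: source type
  uri: 'https://company.sharepoint.com/sites/Docs',      // Step 2: URI & limits
  maxDepth: 2,
  maxPages: 10,
  filters: {                                             // Step 3: optional file filters
    include: ['\\.pdf$', '\\.docx?$'],
    exclude: ['archive', '\\.tmp$'],
    minFileSizeMb: 0.1,
    maxFileSizeMb: 50,
    allowedMimeTypes: ['application/pdf'],
  },
  credentials: { cookies: 'FedAuth=...; rtFa=...' },     // Step 4: authentication
  cronJob: {                                             // Step 5: optional schedule
    active: true,
    hour: 3,
    minute: 0,
    days: ['Mon', 'Tue', 'Wed', 'Thu', 'Fri'],
  },
};
```
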
1. Basic Configuration
URI

Required

The source location to crawl (URL, network path, etc.)

  • https://company.sharepoint.com/sites/Docs
  • smb://fileserver/shared/documents
  • https://docs.example.com/files
  • https://app.box.com/folder/123456789
Max Depth

Required • Default: 2

How many folder levels deep to crawl (0 = only root folder)

Example: Depth 2 crawls /docs, /docs/2024, and /docs/2024/Q1

Max Pages

Required • Default: 10

Maximum number of files to collect per crawler run

Prevents overwhelming the system with too many files at once
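
To illustrate how the two limits interact, here is a simplified model of a depth-limited crawl that stops once maxPages files have been collected. It is a sketch only, not the actual crawler implementation.

```typescript
// Simplified model of how maxDepth and maxPages bound a crawl.
// Not the actual crawler implementation.
interface Node { path: string; isFile: boolean; children?: Node[] }

function crawl(root: Node, maxDepth: number, maxPages: number): string[] {
  const collected: string[] = [];
  const queue: Array<{ node: Node; depth: number }> = [{ node: root, depth: 0 }];

  while (queue.length > 0 && collected.length < maxPages) {
    const { node, depth } = queue.shift()!;
    for (const child of node.children ?? []) {
      if (child.isFile) {
        if (collected.length < maxPages) collected.push(child.path);  // stop at the maxPages cap
      } else if (depth < maxDepth) {
        // Depth 2 from /docs reaches /docs/2024 and /docs/2024/Q1, but no deeper.
        queue.push({ node: child, depth: depth + 1 });
      }
    }
  }
  return collected;
}
```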

2. File Filters (Optional)
Include Patterns

Regex patterns to include specific files

\.pdf$, \.docx?$, \.txt$
Only collect PDF, DOC, DOCX, and TXT files
Exclude Patterns

Regex patterns to exclude files/folders

archive, _old, backup, temp, \.tmp$
Skip folders named "archive", "_old", etc.
Min File Size

Minimum file size in MB (e.g., 0.1 = 100 KB)

Use to skip tiny files that likely contain no useful content

Max File Size

Maximum file size in MB (e.g., 50 = 50 MB)

Use to skip extremely large files that may timeout during processing

Allowed MIME Types

Comma-separated list of allowed file types

application/pdf, text/plain, application/msword
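
Conceptually, a discovered file is collected only if it matches an include pattern (when any are set), matches no exclude pattern, and satisfies the size and MIME-type constraints. A minimal sketch of that check, with illustrative field names:

```typescript
// Illustrative filter check; mirrors the filter fields above but is
// not the actual George AI implementation.
interface FileFilters {
  include?: string[];          // regex patterns; empty/undefined = allow all
  exclude?: string[];          // regex patterns applied to the file path
  minFileSizeMb?: number;
  maxFileSizeMb?: number;
  allowedMimeTypes?: string[];
}

function passesFilters(path: string, sizeMb: number, mime: string, f: FileFilters): boolean {
  if (f.include && f.include.length > 0 && !f.include.some(p => new RegExp(p).test(path))) return false;
  if (f.exclude && f.exclude.some(p => new RegExp(p).test(path))) return false;
  if (f.minFileSizeMb !== undefined && sizeMb < f.minFileSizeMb) return false;
  if (f.maxFileSizeMb !== undefined && sizeMb > f.maxFileSizeMb) return false;
  if (f.allowedMimeTypes && f.allowedMimeTypes.length > 0 && !f.allowedMimeTypes.includes(mime)) return false;
  return true;
}

// Example: a 2 MB PDF under /docs passes; anything under /archive would not.
const ok = passesFilters('/docs/report.pdf', 2, 'application/pdf', {
  include: ['\\.pdf$', '\\.docx?$'],
  exclude: ['archive', '\\.tmp$'],
  maxFileSizeMb: 50,
}); // ok === true
```
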
3. Authentication

SharePoint Online

Requires browser authentication cookies:

  1. Open your SharePoint site in a browser and log in
  2. Open Developer Tools (F12) → Network tab
  3. Refresh the page or navigate to a document library
  4. Find any request to your SharePoint site
  5. Copy the complete "Cookie" header value (must include FedAuth and rtFa cookies)
  6. Paste into the "SharePoint Authentication Cookies" field

Note: Cookies are session-based and expire. You may need to refresh them periodically if crawler runs start failing.
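
Behind the scenes, the crawler effectively sends the pasted value as a Cookie header on its requests to SharePoint. The sketch below is a generic illustration of that idea, not the crawler's actual code; the REST endpoint and headers are only examples.

```typescript
// Generic illustration of how the pasted Cookie header is used.
// Not the crawler's actual code; it only shows why the FedAuth and
// rtFa cookies must be present on every request.
async function probeSharePoint(cookieHeader: string): Promise<void> {
  const response = await fetch(
    'https://yourcompany.sharepoint.com/sites/SiteName/_api/web/lists', // example endpoint
    { headers: { Cookie: cookieHeader, Accept: 'application/json;odata=verbose' } },
  );
  if (response.status === 401 || response.status === 403) {
    // Typical symptom of expired cookies: every request fails with an auth error.
    console.warn('SharePoint cookies appear to have expired; refresh them in the crawler settings.');
  }
}

// The value pasted into the "SharePoint Authentication Cookies" field:
void probeSharePoint('FedAuth=...; rtFa=...');
```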

SMB / Windows File Share

Requires network credentials:

  • Username: Domain username (e.g., DOMAIN\username or username@domain.com)
  • Password: User's password

Box.com

Requires Box API credentials:

  • Customer ID: Your Box enterprise customer ID (10+ characters)
  • Developer Token: Box API developer token (20+ characters)

Contact your Box administrator to obtain these credentials.

HTTP/HTTPS Websites

No authentication required. Only public websites are supported.

4. Scheduling (Optional)

Schedule crawlers to run automatically on a recurring basis:

Cron Schedule Configuration

  • Active: Enable/disable scheduled runs
  • Time: Hour (0-23) and Minute (0-59) in 24-hour format
  • Days: Select which days of the week to run (Monday - Sunday)
Examples:

  • Every weekday at 3:00 AM: Hour 3, Minute 0, Days Mon-Fri
  • Sundays at 11:30 PM: Hour 23, Minute 30, Days Sunday
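
Expressed as configuration, the two examples above might look like the following sketch. Only the active flag is referenced elsewhere on this page (see Troubleshooting); the remaining field names are illustrative assumptions.

```typescript
// The two schedule examples above as illustrative cronJob objects.
// Field names other than `active` are assumptions.
const weekdaysAt3am = { active: true, hour: 3, minute: 0, days: ['Mon', 'Tue', 'Wed', 'Thu', 'Fri'] };
const sundaysAt2330 = { active: true, hour: 23, minute: 30, days: ['Sun'] };
```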

Manual Runs

You can always trigger a crawler run manually from the Library crawler list, regardless of schedule settings.

Crawler Runs & Monitoring

Each time a crawler executes, it creates a "crawler run" with detailed statistics:

Run Information

Run Metadata:

  • Start time and end time
  • Duration
  • Triggered by user (manual) or scheduler
  • Success or failure status
  • Error message (if failed)

Discovered Updates:

  • New files found
  • Modified files detected
  • Deleted files identified
  • Skipped files (hash unchanged)
  • Error files (couldn't access)

In the Library crawler list, each crawler shows summary cards, for example:

  • Running Status: the crawler is currently running (e.g., "Started 5 minutes ago")
  • Last Run: outcome of the most recent run (e.g., "Completed, 127 files discovered")
  • Total Runs: number of runs since the crawler was created (e.g., 48)
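
The run metadata and discovered-update statistics above can be thought of as one run record per execution, roughly shaped like the sketch below. Field names are illustrative assumptions, not the actual API.

```typescript
// Rough shape of a crawler run record, based on the statistics listed
// above. Field names are illustrative, not the actual George AI API.
interface CrawlerRun {
  startedAt: Date;
  finishedAt?: Date;
  durationMs?: number;
  triggeredBy: 'user' | 'scheduler';
  status: 'running' | 'success' | 'failure';
  errorMessage?: string;
  stats: {
    newFiles: number;
    modifiedFiles: number;
    deletedFiles: number;
    skippedFiles: number;   // content hash unchanged
    errorFiles: number;     // could not be accessed
  };
}
```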

How Crawling Works

1. Discovery

The crawler navigates the source (folders, links, etc.) up to the configured max depth, discovering files that match the filters

2. Change Detection

For each discovered file, the crawler checks whether it already exists in the Library by comparing the content hash and modification date
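
A minimal sketch of that check, assuming the Library stores a content hash and modification date for each imported file (illustrative only, not the actual implementation):

```typescript
import { createHash } from 'node:crypto';

// Minimal sketch of change detection, assuming the Library stores a
// content hash and modification date for each imported file.
interface LibraryEntry {
  contentHash: string;   // e.g., SHA-256 of the file content at last import
  modifiedAt: Date;      // modification date recorded at last import
}

function needsImport(
  existing: LibraryEntry | undefined,
  fileBytes: Uint8Array,
  sourceModifiedAt: Date,
): boolean {
  if (!existing) return true;                                                   // new file: import it
  if (sourceModifiedAt.getTime() <= existing.modifiedAt.getTime()) return false; // not newer at the source: skip
  const hash = createHash('sha256').update(fileBytes).digest('hex');
  return hash !== existing.contentHash;                                          // re-import only if content changed
}
```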

3. File Import

New or modified files are downloaded and added to the Library. Unchanged files are skipped to save processing time.

4. Automatic Processing

If "Auto-process crawled files" is enabled in Library settings, new files are automatically queued for extraction and embedding

Best Practices

Start with Small Max Pages

Begin with maxPages=10 or 50 to test your crawler configuration. Once confirmed working, increase to larger numbers.

Use File Filters Wisely

Exclude unnecessary folders (archive, temp, backup) and limit file types to what you actually need. This speeds up crawling and reduces noise.

Schedule During Off-Hours

Run scheduled crawlers during low-traffic times (e.g., 3 AM) to avoid impacting system performance during business hours.

Monitor SharePoint Cookie Expiration

SharePoint authentication cookies typically expire after a few hours or days. If crawler runs start failing, refresh the cookies.

Troubleshooting

  • Crawler run fails immediately
    Possible cause: Invalid credentials or expired cookies
    Solution: Verify credentials and refresh SharePoint cookies if applicable
  • No files discovered
    Possible cause: Filters too restrictive or maxDepth too low
    Solution: Review include/exclude patterns and increase maxDepth
  • Crawler times out
    Possible cause: Too many files or a very slow network
    Solution: Reduce maxPages or increase the timeout in the crawler configuration
  • Files not processing after crawl
    Possible cause: "Auto-process crawled files" is disabled
    Solution: Enable it in Library settings or manually trigger processing
  • Scheduled runs not executing
    Possible cause: Cron job inactive or system scheduler stopped
    Solution: Verify that cronJob.active is true and check system status

Related Topics

Learn more about file management and processing:

  • George-Cloud