Crawlers
Automatically collect files from external sources into your Libraries
What are Crawlers?
Crawlers are automated collectors that continuously gather files from external sources like SharePoint, file shares, websites, and cloud storage, bringing them into your George AI Libraries.
Instead of manually uploading files, you configure a crawler once, and it automatically discovers and imports files—keeping your Library synchronized with the source.
One-Time Setup
Configure crawler once with source URL and credentials
Automatic Updates
Schedule crawlers to run daily, weekly, or manually trigger runs
Supported Sources
SharePoint Online
Crawl SharePoint document libraries and OneDrive folders
URI Format: https://yourcompany.sharepoint.com/sites/SiteName
Authentication: Browser cookies (FedAuth, rtFa)
SMB / Windows File Share
Access network file shares and Windows servers
URI Format: smb://server/share/folder
Authentication: Username + Password
HTTP/HTTPS Websites
Crawl public or internal websites and download files
URI Format: https://docs.example.com/files
Authentication: None (public sites only)
Box.com
Access Box.com enterprise cloud storage folders
URI Format: https://app.box.com/folder/123456789
Authentication: Customer ID + Developer Token
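Each source type pairs a URI with the credentials listed above. As a rough sketch (the type and field names below are illustrative assumptions, not George AI's actual configuration schema), the combinations look like this:

```typescript
// Illustrative sketch only: type and field names are assumptions,
// not the product's actual configuration schema.
type CrawlerSource =
  | { kind: "sharepoint"; uri: string; cookies: string }            // FedAuth + rtFa cookie header
  | { kind: "smb"; uri: string; username: string; password: string }
  | { kind: "http"; uri: string }                                   // public sites only, no credentials
  | { kind: "box"; uri: string; customerId: string; developerToken: string };

const examples: CrawlerSource[] = [
  { kind: "sharepoint", uri: "https://yourcompany.sharepoint.com/sites/SiteName", cookies: "FedAuth=...; rtFa=..." },
  { kind: "smb", uri: "smb://server/share/folder", username: "DOMAIN\\username", password: "secret" },
  { kind: "http", uri: "https://docs.example.com/files" },
  { kind: "box", uri: "https://app.box.com/folder/123456789", customerId: "abc1234567", developerToken: "devtoken0123456789abcdef" },
];
```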
Creating a Crawler
- Select Source Type: SharePoint, SMB, HTTP, or Box
- Configure URI & Limits: URL, depth, max pages
- Set File Filters: optional patterns, size, and MIME types
- Add Credentials: authentication details
- Schedule Runs: optional daily/weekly automation
Basic Configuration
| Setting | Description |
|---|---|
| URI | Required. The source location to crawl (URL, network path, etc.) |
| Max Depth | Required, default 2. How many folder levels deep to crawl (0 = only the root folder). Example: depth 2 crawls the root folder plus two levels of subfolders. |
| Max Pages | Required, default 10. Maximum number of files to collect per crawler run; prevents overwhelming the system with too many files at once. |
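Taken together, the required settings can be pictured as a small configuration object. The sketch below uses assumed field names, not the product's exact schema:

```typescript
// Hypothetical crawler settings object; field names mirror the
// options above but are assumptions, not the actual schema.
interface CrawlerSettings {
  uri: string;      // source location to crawl
  maxDepth: number; // folder levels below the root (0 = root only)
  maxPages: number; // cap on files collected per run
}

const settings: CrawlerSettings = {
  uri: "https://docs.example.com/files",
  maxDepth: 2,   // root folder plus two levels of subfolders
  maxPages: 10,  // small cap while testing, raise once confirmed working
};
```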
File Filters (Optional)
| Filter | Description |
|---|---|
| Include Patterns | Regex patterns to include specific files |
| Exclude Patterns | Regex patterns to exclude files or folders |
| Min File Size | Minimum file size in MB (e.g., 0.1 = 100 KB). Use to skip tiny files that likely contain no useful content. |
| Max File Size | Maximum file size in MB (e.g., 50 = 50 MB). Use to skip extremely large files that may time out during processing. |
| Allowed MIME Types | Comma-separated list of allowed file types |
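As an example, a filter set that keeps only PDF and Word files between 100 KB and 50 MB while skipping archive, temp, and backup folders could look like the following sketch (the regexes and MIME strings are standard; the surrounding field names are assumptions):

```typescript
// Illustrative filter values; field names are assumptions.
const fileFilters = {
  includePatterns: ["\\.pdf$", "\\.docx?$"],            // only PDFs and Word documents
  excludePatterns: ["/archive/", "/temp/", "/backup/"], // skip noisy folders
  minFileSizeMb: 0.1,                                   // skip files under 100 KB
  maxFileSizeMb: 50,                                    // skip files over 50 MB
  allowedMimeTypes:
    "application/pdf, application/msword, application/vnd.openxmlformats-officedocument.wordprocessingml.document",
};
```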
Authentication
SharePoint Online
Requires browser authentication cookies:
- Open your SharePoint site in a browser and log in
- Open Developer Tools (F12) → Network tab
- Refresh the page or navigate to a document library
- Find any request to your SharePoint site
- Copy the complete "Cookie" header value (must include FedAuth and rtFa cookies)
- Paste into the "SharePoint Authentication Cookies" field
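Before saving, you can sanity-check the pasted value by confirming that both required cookies are present. The helper below is only an illustration, not part of George AI:

```typescript
// Checks that a pasted Cookie header contains both cookies the crawler
// needs (FedAuth and rtFa). Purely illustrative.
function hasSharePointCookies(cookieHeader: string): boolean {
  const names = cookieHeader
    .split(";")
    .map((part) => part.trim().split("=")[0]);
  return names.includes("FedAuth") && names.includes("rtFa");
}

console.log(hasSharePointCookies("FedAuth=abc...; rtFa=def...")); // true
console.log(hasSharePointCookies("rtFa=def..."));                 // false -> re-copy the header
```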
SMB / Windows File Share
Requires network credentials:
- Username: Domain username (e.g., DOMAIN\username or username@domain.com)
- Password: User's password
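If you keep these credentials in a script or config file, note that the backslash in the DOMAIN\username form must be escaped in most string literals; a minimal sketch:

```typescript
// Hypothetical SMB credential object; note the escaped backslash
// in the DOMAIN\username form.
const smbCredentials = {
  uri: "smb://server/share/folder",
  username: "DOMAIN\\username",             // or "username@domain.com"
  password: process.env.SMB_PASSWORD ?? "", // avoid hard-coding passwords
};
```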
Box.com
Requires Box API credentials:
- Customer ID: Your Box enterprise customer ID (10+ characters)
- Developer Token: Box API developer token (20+ characters)
Contact your Box administrator to obtain these credentials.
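If you want to confirm the developer token works before handing it to the crawler, a request against Box's public Content API (shown below as an illustration; this is not part of George AI) gives a quick answer:

```typescript
// Verifies a Box developer token by fetching the root folder ("0")
// from the Box Content API. Illustration only; requires Node 18+ or a browser.
async function checkBoxToken(developerToken: string): Promise<boolean> {
  const response = await fetch("https://api.box.com/2.0/folders/0", {
    headers: { Authorization: `Bearer ${developerToken}` },
  });
  return response.ok; // a 401 response means the token is invalid or expired
}

checkBoxToken("devtoken0123456789abcdef").then((ok) =>
  console.log(ok ? "Token accepted" : "Token rejected"),
);
```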
HTTP/HTTPS Websites
No authentication required. Only public websites are supported.
Scheduling
Schedule crawlers to run automatically on a recurring basis:
Cron Schedule Configuration
- Active: Enable/disable scheduled runs
- Time: Hour (0-23) and Minute (0-59) in 24-hour format
- Days: Select which days of the week to run (Monday - Sunday)
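These three settings map directly onto a conventional five-field cron expression; the sketch below uses assumed field names to show the mapping:

```typescript
// Maps the schedule fields described above onto a standard five-field
// cron expression (minute hour day-of-month month day-of-week).
// Field names here are assumptions, not the product's schema.
interface CronSchedule {
  active: boolean; // enable/disable scheduled runs (not part of the expression itself)
  hour: number;    // 0-23
  minute: number;  // 0-59
  days: number[];  // 0 = Sunday ... 6 = Saturday
}

function toCronExpression(s: CronSchedule): string {
  return `${s.minute} ${s.hour} * * ${s.days.join(",")}`;
}

// Every weekday at 3:00 AM -> "0 3 * * 1,2,3,4,5"
console.log(toCronExpression({ active: true, hour: 3, minute: 0, days: [1, 2, 3, 4, 5] }));
// Sundays at 11:30 PM -> "30 23 * * 0"
console.log(toCronExpression({ active: true, hour: 23, minute: 30, days: [0] }));
```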
Example schedules:
- Every weekday at 3:00 AM: Hour 3, Minute 0, Days Mon-Fri
- Sundays at 11:30 PM: Hour 23, Minute 30, Days Sunday
Manual Runs
You can always trigger a crawler run manually from the Library crawler list, regardless of schedule settings.
Crawler Runs & Monitoring
Each time a crawler executes, it creates a "crawler run" with detailed statistics:
Run Information
Run Metadata:
- Start time and end time
- Duration
- Triggered by user (manual) or scheduler
- Success or failure status
- Error message (if failed)
Discovered Updates:
- New files found
- Modified files detected
- Deleted files identified
- Skipped files (hash unchanged)
- Error files (couldn't access)
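Conceptually, each run record bundles this metadata with the update counters. Its shape might look like the following sketch (field names are illustrative, not the actual schema):

```typescript
// Illustrative shape of a crawler run record, combining the metadata
// and discovered-update counters listed above. Not the actual schema.
interface CrawlerRun {
  startedAt: Date;
  finishedAt: Date;
  durationMs: number;
  triggeredBy: "user" | "scheduler";
  status: "success" | "failure";
  errorMessage?: string; // only present on failure
  filesNew: number;
  filesModified: number;
  filesDeleted: number;
  filesSkipped: number;  // content hash unchanged
  filesErrored: number;  // could not be accessed
}
```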
How Crawling Works
1. Discovery
Crawler navigates the source (folders, links, etc.) up to the configured max depth, discovering files that match filters
2. Change Detection
For each discovered file, the crawler checks whether it already exists in the Library by comparing the content hash and modification date (a sketch of this check follows this list)
3. File Import
New or modified files are downloaded and added to the Library. Unchanged files are skipped to save processing time.
4. Automatic Processing
If "Auto-process crawled files" is enabled in Library settings, new files are automatically queued for extraction and embedding
Best Practices
Start with Small Max Pages
Begin with maxPages set to 10 or 50 to test your crawler configuration. Once the crawler is confirmed to work, raise the limit.
Use File Filters Wisely
Exclude unnecessary folders (archive, temp, backup) and limit file types to what you actually need. This speeds up crawling and reduces noise.
Schedule During Off-Hours
Run scheduled crawlers during low-traffic times (e.g., 3 AM) to avoid impacting system performance during business hours.
Monitor SharePoint Cookie Expiration
SharePoint authentication cookies typically expire after a few hours or days. If crawler runs start failing, refresh the cookies.
Troubleshooting
| Issue | Possible Cause | Solution |
|---|---|---|
| Crawler run fails immediately | Invalid credentials or expired cookies | Verify credentials and refresh SharePoint cookies if applicable |
| No files discovered | Filters too restrictive or maxDepth too low | Review include/exclude patterns and increase maxDepth |
| Crawler times out | Too many files or very slow network | Reduce maxPages or increase timeout in crawler configuration |
| Files not processing after crawl | "Auto-process crawled files" disabled | Enable in Library settings or manually trigger processing |
| Scheduled runs not executing | Cron job inactive or system scheduler stopped | Verify cronJob.active is true and check system status |