AI Services & Providers
Configure workspace-scoped AI providers (Ollama, OpenAI) and manage connection settings
Overview
AI providers are configured per workspace, giving you complete flexibility in how each team or project uses AI models.
Each workspace can have its own Ollama and OpenAI configurations, allowing you to mix providers based on privacy requirements, performance needs, or budget constraints.
Workspace Scoping
Provider configurations are isolated per workspace. Switching workspaces shows different available models based on each workspace's configured providers.
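Conceptually, each provider entry belongs to exactly one workspace. The sketch below illustrates that scoping; the field names are illustrative assumptions, not George AI's actual schema.

```typescript
// Illustrative sketch of a workspace-scoped provider configuration.
// Field names are assumptions, not George AI's actual schema.
type ProviderKind = "ollama" | "openai";

interface ProviderConfig {
  workspaceId: string; // owning workspace; configs are never shared across workspaces
  kind: ProviderKind;
  name: string;        // e.g. "Production Ollama"
  baseUrl: string;     // e.g. "http://ollama:11434" or "https://api.openai.com/v1"
  apiKey?: string;     // stored encrypted; optional for Ollama
  vramGb?: number;     // Ollama only: total GPU memory, used for load balancing
}

// Model lookups always start from the current workspace's own configs:
function configsForWorkspace(all: ProviderConfig[], workspaceId: string): ProviderConfig[] {
  return all.filter((c) => c.workspaceId === workspaceId);
}
```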
Accessing AI Services Admin
Navigation
- Switch to the target workspace using the workspace switcher in the top navigation
- Open the Settings menu → Admin → AI Services (URL: /admin/ai-services)
- Configure providers for this workspace; changes only affect the current workspace
Configuring AI Providers
Ollama (Local)
Self-hosted AI models for privacy and offline use
- Complete data privacy
- No API costs
- Offline operation
- GPU required for performance
OpenAI (Cloud)
Cloud-based AI models from OpenAI
- Latest GPT models
- High performance
- No infrastructure needed
- Pay per use (API costs)
Required Settings (Ollama):
- Name: Descriptive name (e.g., "Production Ollama")
- Base URL: Ollama server address (e.g., http://ollama:11434)
- API Key: Optional (only needed if Ollama authentication is enabled)
- VRAM (GB): Total GPU memory available (helps with load balancing)
Connection Testing:
Click "Test Connection" to verify the configuration before saving. This will:
- Check network connectivity to Ollama server
- Verify API authentication (if enabled)
- List available models on the server
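The check can be pictured as a plain HTTP call to Ollama's model-listing endpoint (`/api/tags`); the snippet below is a minimal sketch of such a test, not the exact code George AI runs.

```typescript
// Minimal sketch of an Ollama connection test: reach the server,
// authenticate if a key is configured, and list the available models.
async function testOllama(baseUrl: string, apiKey?: string): Promise<string[]> {
  const res = await fetch(`${baseUrl}/api/tags`, {
    headers: apiKey ? { Authorization: `Bearer ${apiKey}` } : {},
  });
  if (!res.ok) throw new Error(`Ollama responded with HTTP ${res.status}`);
  const body = (await res.json()) as { models: { name: string }[] };
  return body.models.map((m) => m.name); // e.g. ["llama3:8b", "nomic-embed-text"]
}

// testOllama("http://ollama:11434").then(console.log);
```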
Required Settings (OpenAI):
- Name: Descriptive name (e.g., "OpenAI Production")
- Base URL: API endpoint (default: https://api.openai.com/v1)
- API Key: Your OpenAI API key (starts with sk-...)
Security:
- API keys are encrypted in the database
- Keys are never exposed to the frontend (shown masked as sk-...xy)
- Connections can be tested without re-entering the key
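The masking can be thought of as keeping only the key's prefix and its last two characters; a minimal illustrative sketch (not George AI's actual code):

```typescript
// Illustrative masking: keep only the "sk-" prefix and the last two
// characters; the full key is never sent to the browser.
function maskApiKey(key: string): string {
  if (key.length <= 5) return "sk-...";
  return `${key.slice(0, 3)}...${key.slice(-2)}`;
}

// maskApiKey("sk-abc123xy") === "sk-...xy"
```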
Connection Testing:
Click "Test Connection" to verify your API key and check available models.
Provider Performance
George AI caches workspace provider configurations so they do not have to be re-read on every request.
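Such a cache can be pictured as a per-workspace map with a short time-to-live; the sketch below (with an assumed 60-second TTL and the illustrative ProviderConfig type from above) only shows the idea, not the actual implementation.

```typescript
// Illustrative per-workspace cache with a short TTL, so repeated requests
// do not re-read the provider configuration every time.
const TTL_MS = 60_000; // assumed value, for illustration only
const cache = new Map<string, { configs: ProviderConfig[]; expires: number }>();

async function getConfigs(
  workspaceId: string,
  load: (id: string) => Promise<ProviderConfig[]>,
): Promise<ProviderConfig[]> {
  const hit = cache.get(workspaceId);
  if (hit && hit.expires > Date.now()) return hit.configs;
  const configs = await load(workspaceId);
  cache.set(workspaceId, { configs, expires: Date.now() + TTL_MS });
  return configs;
}
```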
Advanced: Multi-Instance Ollama Load Balancing
Advanced Configuration
This section covers Ollama clustering with multiple GPU servers. Most users can skip this.
George AI can distribute AI processing across multiple Ollama servers, automatically balancing load based on each server's capabilities.
This keeps AI processing reliable and fast even under heavy workloads by making full use of all available GPU resources.
How It Works
Intelligent Routing
George AI monitors each Ollama server and routes requests based on:
- Available GPU memory
- Current load (requests in progress)
- GPU processing speed
- Which models are loaded on each server
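A simplified version of this kind of scoring-based router is sketched below; the field names and weights are assumptions chosen for illustration, not George AI's actual heuristics.

```typescript
// Illustrative routing sketch: consider only servers that are online,
// have the model loaded, and still have request headroom, then pick the
// best one from free VRAM, current load, and speed. Weights are examples.
interface OllamaServer {
  name: string;
  vramGb: number;          // total GPU memory
  usedVramGb: number;      // memory currently in use
  activeRequests: number;  // requests in progress
  maxConcurrent: number;
  relativeSpeed: number;   // 1.0 = baseline, 2.0 = twice as fast
  loadedModels: string[];
  online: boolean;
}

function pickServer(servers: OllamaServer[], model: string): OllamaServer | undefined {
  const candidates = servers.filter(
    (s) =>
      s.online &&                       // automatic failover: skip offline servers
      s.loadedModels.includes(model) && // model-aware distribution
      s.activeRequests < s.maxConcurrent,
  );
  const score = (s: OllamaServer) =>
    (s.vramGb - s.usedVramGb) +                      // free GPU memory
    (1 - s.activeRequests / s.maxConcurrent) * 10 +  // load headroom
    s.relativeSpeed * 5;                             // speed rating
  return candidates.sort((a, b) => score(b) - score(a))[0];
}
```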
Automatic Failover
If a server goes offline or becomes unresponsive, requests are automatically routed to available servers
Model-Aware Distribution
Each server can run different models. George AI only sends requests to servers that have the required model loaded
Adding an Ollama Server
Navigate to AI Services
Admin Panel → AI Services → Add Server
Enter Server Details
Name: Descriptive name (e.g., "GPU Server 1 - NVIDIA A100")
URL: Server address (e.g., "http://ollama-server-1:11434")
API Key: If authentication is enabled
Configure Capabilities
GPU Memory: Total GPU VRAM (e.g., "80GB")
Relative Speed: Performance multiplier (1.0 = baseline, 2.0 = twice as fast)
Max Concurrent: Maximum simultaneous requests (default: 4)
Test Connection
George AI will verify connectivity and detect available models
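Filled in, a registered server with its declared capabilities might look like this (example values only; field names are illustrative):

```typescript
// Example only: a registered server with its declared capabilities.
const gpuServer1 = {
  name: "GPU Server 1 - NVIDIA A100",
  url: "http://ollama-server-1:11434",
  vramGb: 80,         // "GPU Memory: 80GB"
  relativeSpeed: 2.0, // roughly twice the baseline server
  maxConcurrent: 4,   // default
};
```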
Monitoring Server Health
The AI Services dashboard shows real-time status for each configured server.
Load Balancing Strategies
- Distributes requests evenly across all available servers. Best for: balanced workloads with similar server capabilities
- Sends requests to the server with the lowest current load. Best for: mixed workloads with varying request complexity
- Sends proportionally more requests to faster servers based on their speed rating. Best for: clusters with different GPU generations (e.g., mixing A100 and V100)
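Speed-proportional distribution can be pictured as drawing servers with probability proportional to their speed rating; the snippet below is a rough illustration, not the actual scheduler.

```typescript
// Illustrative speed-weighted pick: a server rated 2.0 receives roughly
// twice as many requests as one rated 1.0.
function weightedPick<T extends { relativeSpeed: number }>(servers: T[]): T {
  if (servers.length === 0) throw new Error("no servers available");
  const total = servers.reduce((sum, s) => sum + s.relativeSpeed, 0);
  let r = Math.random() * total;
  for (const s of servers) {
    r -= s.relativeSpeed;
    if (r <= 0) return s;
  }
  return servers[servers.length - 1];
}
```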
Best Practices
Start Small, Scale Up
Begin with 2-3 servers and add more as needed based on usage patterns
Keep Models Consistent
Load the same models on all servers for best distribution. Different models = fewer routing options
Monitor GPU Memory
If servers frequently hit 100% GPU memory, reduce max_concurrent or add more servers
Troubleshooting
| Issue | Possible Cause | Solution |
|---|---|---|
| Server shows "Offline" | Network connectivity or Ollama not running | Check server URL, verify Ollama service is running |
| Slow processing | All servers at max capacity | Add more servers or increase max_concurrent carefully |
| Requests failing | Model not available on any server | Pull required model on at least one server |
| Uneven distribution | Server speed ratings incorrect | Adjust speed multipliers based on actual performance |