Overview

Auto Caching is a performance optimisation that stores AI model responses for reuse, eliminating redundant API calls to upstream providers. Caching operates transparently across all 7 supported AI providers, delivering consistent performance improvements regardless of the underlying model or provider.

Automatic Response Caching

Zero-configuration caching for all non-streaming responses.

Automatic Cache Key Generation

Intelligent cache key generation based on request content.

Configurable TTL

Flexible time-to-live settings with a 60-second default.

Provider Agnostic

Works seamlessly across all 7 AI providers.

Performance Metrics

Built-in cache hit/miss tracking and latency measurements.

Memory-Efficient Storage

Optimised in-memory cache with automatic expiration.

What Gets Cached

  • Successful Responses Only: Only 2xx responses are cached
  • Non-Streaming Content: Standard JSON responses (streaming excluded)
  • Complete Request Context: Cache keys include full request body and path
  • Provider-Specific Metadata: Model, provider, and latency information preserved

How Auto Caching Works

Cache Key Generation

The caching system generates unique cache keys by combining the request path with the validated request body, ensuring that identical requests receive cached responses while maintaining isolation between different queries.
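
A minimal sketch of this approach, assuming a SHA-256 digest over the request path plus a canonicalised JSON body (the function names and key format are illustrative, not the gateway's actual internals):

import { createHash } from "node:crypto";

// Recursively sort object keys so semantically identical bodies serialise identically.
function canonicalise(value: unknown): unknown {
  if (Array.isArray(value)) return value.map(canonicalise);
  if (value && typeof value === "object") {
    return Object.fromEntries(
      Object.keys(value).sort().map((k) => [k, canonicalise((value as Record<string, unknown>)[k])])
    );
  }
  return value;
}

// Combine the request path with the validated body and hash the result into a fixed-length key.
function cacheKey(path: string, body: Record<string, unknown>): string {
  return createHash("sha256")
    .update(`${path}:${JSON.stringify(canonicalise(body))}`)
    .digest("hex");
}

Under this scheme, two requests that differ only in JSON key order map to the same entry, while any change to the prompt, model, or parameters produces a new key and therefore a separate cache entry.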

Cache Lifecycle Management

Write Path

  1. Request Reception: Incoming request processed by middleware stack
  2. Cache Lookup: Check for existing valid cache entry
  3. Cache Miss: Forward request to AI provider
  4. Response Storage: Store successful responses with TTL
  5. Client Return: Send response to client
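
Sketched below is what the storage step might look like, assuming an in-memory map keyed by the hash above and the documented 60-second default TTL (all names are illustrative):

interface CacheEntry {
  status: number;     // upstream HTTP status code
  body: string;       // serialised JSON response from the provider
  expiresAt: number;  // absolute expiry time in epoch milliseconds
}

const cache = new Map<string, CacheEntry>();
const DEFAULT_TTL_MS = 60_000; // documented 60-second default

// Store only successful (2xx), non-streaming responses, stamped with a TTL.
function storeResponse(key: string, status: number, body: string, ttlMs: number = DEFAULT_TTL_MS): void {
  if (status < 200 || status >= 300) return; // successful responses only
  cache.set(key, { status, body, expiresAt: Date.now() + ttlMs });
}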

Read Path

  1. Request Reception: Incoming request enters pipeline
  2. Key Generation: Create cache key from validated request
  3. Cache Hit: Retrieve stored response if valid
  4. Instant Return: Serve cached response with minimal latency
  5. Metrics Update: Record cache hit and time saved
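
Continuing the same illustrative module, the read path becomes a lookup that serves valid entries and records hit metrics; the provider-latency estimate used for the time-saved counter is an assumption, not a documented value:

let cacheHits = 0;
let cacheMisses = 0;
let timeSavedMs = 0;

// Return a cached response if present and still valid; otherwise report a miss
// so the request is forwarded to the provider. Expired entries are removed lazily.
function lookupResponse(key: string, estProviderLatencyMs = 800): CacheEntry | undefined {
  const entry = cache.get(key);
  if (!entry || Date.now() >= entry.expiresAt) {
    if (entry) cache.delete(key); // lazy cleanup of an expired entry
    cacheMisses += 1;
    return undefined;
  }
  cacheHits += 1;
  timeSavedMs += estProviderLatencyMs; // rough "time saved" versus a live provider call
  return entry;
}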

Expiration Handling

  • TTL-Based Expiration: Entries expire after the configured duration
  • Lazy Cleanup: Expired entries are removed on the next access attempt
  • Memory Protection: Prevents unbounded cache growth
  • Graceful Degradation: Expired entries trigger fresh requests
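
The bound on memory growth could be enforced with a simple size cap, sketched below as eviction of the oldest entry once the cache is full (the cap and the FIFO-style eviction are assumptions; the gateway's actual pruning policy may differ):

const MAX_ENTRIES = 10_000; // illustrative cap, not a documented limit

// Evict the oldest entry before inserting when the cache is at capacity,
// keeping memory consumption predictable under sustained unique traffic.
function storeWithCap(key: string, entry: CacheEntry): void {
  if (cache.size >= MAX_ENTRIES && !cache.has(key)) {
    const oldestKey = cache.keys().next().value; // Map iteration follows insertion order
    if (oldestKey !== undefined) cache.delete(oldestKey);
  }
  cache.set(key, entry);
}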

Business Benefits

Cost Optimisation

Direct Cost Savings

  • 70-95% Reduction in API Calls: Eliminate redundant requests to AI providers
  • Token Usage Optimisation: Reuse responses without consuming additional tokens
  • Bandwidth Savings: Reduce network transfer costs
  • Provider Cost Reduction: Lower monthly bills from AI service providers

Example Cost Impact

Without Caching:
- 10,000 identical requests/day
- $0.02 per request (GPT-4)
- Daily cost: $200
- Monthly cost: $6,000

With 80% Cache Hit Rate:
- 2,000 actual API calls/day
- Daily cost: $40
- Monthly cost: $1,200
- Savings: $4,800/month (80% reduction)

Indirect Cost Benefits

  • Reduced Infrastructure Needs: Lower compute requirements
  • Decreased Operational Overhead: Fewer rate limit issues
  • Improved Resource Utilisation: Better throughput per dollar spent

Performance Enhancement

Latency Reduction

  • 95% Faster Response Times: Cache hits return in under 5ms versus 500-2000ms for a provider call
  • Consistent Performance: Eliminate provider variability
  • Predictable SLAs: Meet strict latency requirements
  • Enhanced User Experience: Near-instant responses for common queries

Performance Metrics

Typical Latency Comparison:
- Provider API Call: 500-2000ms
- Cache Hit: 1-5ms
- Performance Gain: 100-2000x faster
- Time Saved per Hit: 495-1995ms

Throughput Improvements

  • 10x Higher Request Capacity: Handle more concurrent users
  • Reduced Provider Dependencies: Less reliance on external services
  • Smoother Traffic Patterns: Level out usage spikes
  • Better Resource Allocation: CPU cycles for business logic

Operational Excellence

System Reliability

  • Provider Outage Protection: Serve cached responses during downtime
  • Rate Limit Mitigation: Fewer requests count against provider rate limits
  • Graceful Degradation: Fall back to the cache when providers slow down
  • Improved Availability: Higher overall system uptime

Development Productivity

  • Faster Testing Cycles: Instant responses during development
  • Reduced Debugging Time: Consistent responses for testing
  • Cost-Effective Development: No API costs for repeated tests
  • Improved CI/CD Performance: Faster pipeline execution

Compliance and Governance

  • Response Consistency: Identical responses for identical requests
  • Audit Trail: Cache metrics for compliance reporting
  • Data Residency: Responses stay within your infrastructure
  • Security: No additional external data transmission

Scalability and Reliability

Horizontal Scaling

  • No Shared Cache State: Each instance maintains its own independent cache
  • Linear Performance: Add nodes for more cache capacity
  • Geographic Distribution: Deploy caches close to users
  • Load Balancing: Distribute cache hits across instances

Vertical Scaling

  • Memory Optimisation: Efficient storage per cache entry
  • Configurable Limits: Control maximum cache size
  • Automatic Pruning: Remove least recently used entries
  • Resource Management: Predictable memory consumption

High Availability

  • No Single Point of Failure: The cache operates independently on each instance
  • Automatic Failover: Seamless fallback to providers
  • Self-Healing: Automatic cache rebuilding
  • No Coordination Overhead: No distributed cache complexity

Use Cases

Enterprise Applications

Customer Support Automation

  • Scenario: Chatbot handling repetitive customer queries
  • Cache Benefit: 90% cache hit rate for FAQs
  • Impact: 10x faster response times, 90% cost reduction
  • Configuration: 5-minute TTL for support content

Documentation Assistant

  • Scenario: AI-powered documentation search
  • Cache Benefit: Consistent answers for documentation queries
  • Impact: Instant responses for common questions
  • Configuration: 30-minute TTL for stable content

Code Generation Platform

  • Scenario: IDE plugin generating boilerplate code
  • Cache Benefit: Reuse common code patterns
  • Impact: Sub-second code suggestions
  • Configuration: 1-hour TTL for code templates

Analytics Dashboard

  • Scenario: AI-generated insights and summaries
  • Cache Benefit: Cache computed analytics
  • Impact: Instant dashboard loading
  • Configuration: 5-minute TTL for near real-time data

Development Scenarios

API Testing

  • Scenario: Automated testing of AI integrations
  • Cache Benefit: Consistent test responses
  • Impact: 100x faster test execution
  • Configuration: Long TTL for deterministic testing

Load Testing

  • Scenario: Performance testing with high request volumes
  • Cache Benefit: Test infrastructure without provider limits
  • Impact: Accurate performance baselines
  • Configuration: Pre-warm cache with test data

Development Environment

  • Scenario: Local development with AI features
  • Cache Benefit: No API costs during development
  • Impact: Faster iteration cycles
  • Configuration: Extended TTL for development

Demo Environments

  • Scenario: Product demonstrations and POCs
  • Cache Benefit: Reliable, fast demos
  • Impact: Impressive performance showcase
  • Configuration: Pre-cached demo scenarios

Conclusion

The Auto Caching feature of the AI Gateway delivers immediate and measurable benefits through intelligent response caching. By automatically storing and serving repeated AI model responses, organisations can achieve dramatic improvements in performance, cost efficiency, and system reliability. With up to 95% reduction in response latency and 70-95% cost savings for cached requests, Auto Caching transforms the economics and performance characteristics of AI-powered applications. The transparent, provider-agnostic implementation ensures that these benefits are realised across all supported AI providers without any code changes or complex configuration. Whether optimising customer-facing applications for speed, reducing development costs, or ensuring consistent performance at scale, the Auto Caching feature provides the foundation for efficient, cost-effective AI integration in enterprise environments.