Overview

The Multi-Model feature provides access to 50+ AI models across 7 major providers. The library spans a range of model sizes, capabilities, and price points, enabling organisations to select the optimal model for each use case through a single, consistent API.

Key Capabilities

50+ Production Models

From lightweight to state-of-the-art models across all AI providers.

Automatic Model Validation

Built-in controls ensure that only supported models are used, maintaining consistent AI output quality.
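
A minimal sketch of the idea (the gateway enforces this server-side; the allow-list below is an illustrative subset, not the actual registry):

SUPPORTED_MODELS = {"gpt-4o", "gpt-4o-mini", "claude-3-5-sonnet-20241022"}  # illustrative subset

def validate_model(model: str) -> None:
    """Reject requests for models outside the supported set."""
    if model not in SUPPORTED_MODELS:
        raise ValueError(f"Unsupported model '{model}'; supported: {sorted(SUPPORTED_MODELS)}")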

Model-Specific Optimisations

Automatic parameter adjustments based on model capabilities to ensure optimal performance.
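
A sketch of what such adjustments might look like; the gateway applies them server-side, and the rules below are illustrative (for example, OpenAI's o1-series models take max_completion_tokens rather than max_tokens and do not accept a custom temperature):

def adjust_params(model: str, params: dict) -> dict:
    """Adapt request parameters to a model's capabilities (illustrative rules)."""
    adjusted = dict(params)
    if model.startswith("o1"):
        # o1-series reasoning models: fixed temperature, different token parameter
        adjusted.pop("temperature", None)
        if "max_tokens" in adjusted:
            adjusted["max_completion_tokens"] = adjusted.pop("max_tokens")
    return adjusted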

Transparent Pricing

Real-time cost calculation and monitoring for all supported models in AI Gateway.
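
The gateway reports these costs directly; as a client-side approximation, per-request cost can also be derived from a response's token usage, as in this sketch (prices are indicative; see the MODEL_COSTS table later in this guide):

def request_cost(prompt_tokens: int, completion_tokens: int,
                 price_in_per_1k: float, price_out_per_1k: float) -> float:
    """Approximate USD cost of one request from token usage and per-1K prices."""
    return (prompt_tokens / 1000) * price_in_per_1k + (completion_tokens / 1000) * price_out_per_1k

# e.g., gpt-4o-mini at $0.00015 input / $0.00060 output per 1K tokens (indicative)
cost = request_cost(1200, 300, 0.00015, 0.00060)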

Business Benefits

1. Best-of-Breed Model Selection

  • Task-Optimised Performance
    Choose the ideal model for each specific use case — GPT-4o for complex reasoning, Claude for long-context analysis, Gemini for multimodal tasks, and so on.
  • Cost-Performance Optimisation
    Select cost-effective models for simple tasks (e.g., GPT-4o-mini, Claude Haiku) and premium models for complex operations.
  • Competitive Advantage
    Leverage unique capabilities of different models to outperform competitors using single-model approaches.
  • Innovation Velocity
    Immediately access new models as they are released without infrastructure changes.

2. Risk Mitigation & Reliability

  • Model Diversification
    Avoid dependency on a single model’s availability, performance, or pricing changes.
  • Automatic Failover
    Seamlessly switch to alternative models during outages or degraded performance.
  • Compliance Flexibility
    Use region-specific or compliance-certified models (e.g., Azure AI, AWS Bedrock, Google AI) for regulated workloads.
  • Quality Assurance
    A/B test different models to ensure consistent quality across providers.

3. Cost Management & Optimisation

  • Dynamic Cost Control
    Route requests to cheaper models based on complexity and budget constraints.
  • Volume Discounts
    Leverage pricing tiers across multiple providers simultaneously.
  • Budget Allocation
    Set model-specific budgets and automatically switch when limits are reached.
  • ROI Maximisation
    Use premium models only where their advanced capabilities justify the cost.

4. Enterprise Scalability

  • Load Distribution
    Distribute high-volume workloads across multiple models to avoid rate limits (a round-robin sketch follows this list).
  • Geographic Optimisation
    Use region-specific models for lower latency and data residency compliance.
  • Capacity Management
    Access combined capacity of all providers during peak demand.
  • Performance Benchmarking
    Compare model performance in production with real workloads.
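
A minimal load-distribution sketch, assuming the pooled models are interchangeable for the workload (the model pool below is illustrative):

from itertools import cycle

class ModelRoundRobin:
    """Rotate requests across interchangeable models to spread rate-limit pressure."""

    def __init__(self, models: list[str]):
        self._models = cycle(models)

    def next_model(self) -> str:
        return next(self._models)

pool = ModelRoundRobin(["gpt-4o-mini", "claude-3-5-haiku-20241022", "gemini-1.5-flash"])
model = pool.next_model()  # rotate on every request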

Supported Models by Provider

OpenAI Models (17 Models)

| Model Name | Version | Context | Strengths | Use Cases |
| --- | --- | --- | --- | --- |
| GPT-4o | gpt-4o | 128K | Multimodal understanding, complex reasoning, code generation | Complex analysis, creative tasks, multimodal applications |
| | gpt-4o-2024-11-20 | 128K | Latest November 2024 version | Production workloads |
| | gpt-4o-2024-08-06 | 128K | August 2024 version | Stable deployments |
| | gpt-4o-2024-05-13 | 128K | May 2024 version | Legacy compatibility |
| GPT-4o-mini | gpt-4o-mini | 128K | Fast responses, cost-effective, good for simple tasks | Chatbots, simple queries, high-volume processing |
| | gpt-4o-mini-2024-07-18 | 128K | July 2024 version | Cost-optimised workloads |
| GPT-4 Turbo | gpt-4-turbo | 128K | Enhanced GPT-4 with vision | Document analysis, complex reasoning with vision |
| | gpt-4-turbo-2024-04-09 | 128K | April 2024 version | Stable vision applications |
| | gpt-4-turbo-preview | 128K | Preview version | Early access features |
| GPT-4 Classic | gpt-4 | 8K | Original GPT-4 model | Proven reliability for production workloads |
| | gpt-4-0613 | 8K | June 2023 stable version | Stable production deployments |
| | gpt-4-0314 | 8K | March 2023 version | Legacy compatibility |
| GPT-3.5 Turbo | gpt-3.5-turbo | 16K | Fast, cost-effective model | High-speed responses, cost-sensitive applications |
| | gpt-3.5-turbo-0125 | 16K | January 2024 version | Latest GPT-3.5 features |
| | gpt-3.5-turbo-1106 | 16K | November 2023 version | Stable GPT-3.5 deployment |
| O1 Series | o1-preview | Standard | Complex problem-solving, mathematical reasoning | Scientific research, complex analysis |
| | o1-mini | Standard | Faster reasoning at lower cost | Code debugging, logical problems |

Anthropic Claude Models (6 Models)

| Model Name | Version | Context | Strengths | Use Cases |
| --- | --- | --- | --- | --- |
| Claude 3.5 Sonnet | claude-3-5-sonnet-20241022 | 200K | Best balance of intelligence and speed, excellent coding | Code generation, complex analysis, creative writing |
| | claude-3-5-sonnet-20240620 | 200K | June 2024 version | Stable Claude 3.5 deployment |
| Claude 3.5 Haiku | claude-3-5-haiku-20241022 | 200K | Lightning-fast responses, cost-effective | Real-time applications, high-volume processing |
| Claude 3 Opus | claude-3-opus-20240229 | 200K | Complex reasoning, nuanced understanding | Research, complex document analysis |
| Claude 3 Sonnet | claude-3-sonnet-20240229 | 200K | Balanced Claude 3 model | General purpose, good price-performance |
| Claude 3 Haiku | claude-3-haiku-20240307 | 200K | Fastest Claude 3 model | High-speed, cost-sensitive applications |

Amazon Bedrock Models (16 Models)

| Model Name | Version | Context | Strengths | Use Cases |
| --- | --- | --- | --- | --- |
| Anthropic Claude on Bedrock | anthropic.claude-3-5-sonnet-20241022-v2:0 | 200K | AWS integration, enterprise security, compliance | Enterprise Claude deployments |
| | anthropic.claude-3-5-sonnet-20240620-v1:0 | 200K | AWS-native Claude 3.5 | AWS-integrated applications |
| | anthropic.claude-3-5-haiku-20241022-v1:0 | 200K | Fast Claude with AWS benefits | High-speed AWS applications |
| | anthropic.claude-3-opus-20240229-v1:0 | 200K | Most capable Claude with AWS | Complex AWS workloads |
| | anthropic.claude-3-sonnet-20240229-v1:0 | 200K | Balanced Claude with AWS | General AWS applications |
| | anthropic.claude-3-haiku-20240307-v1:0 | 200K | Fast Claude with AWS | Cost-effective AWS applications |
| Meta Llama 3.1 | meta.llama3-1-70b-instruct-v1:0 | 8K | Large Llama 3.1 model, open-source heritage | Custom deployments, fine-tuning base |
| | meta.llama3-1-8b-instruct-v1:0 | 8K | Efficient Llama 3.1 model | Lightweight applications |
| Meta Llama 3 | meta.llama3-70b-instruct-v1:0 | 8K | Large Llama 3 model | High-performance open-source needs |
| | meta.llama3-8b-instruct-v1:0 | 8K | Small Llama 3 model | Cost-effective open-source |
| Amazon Titan | amazon.titan-text-premier-v1:0 | 8K | Premium Titan model, AWS-native | AWS-integrated applications, Amazon-specific tasks |
| | amazon.titan-text-express-v1 | 8K | Fast Titan model | High-speed AWS applications |
| Cohere Command | cohere.command-r-plus-v1:0 | Standard | Advanced Command model, retrieval-augmented generation | RAG applications, document search |
| | cohere.command-r-v1:0 | Standard | Standard Command model | Enterprise search applications |
| Mistral | mistral.mistral-large-2402-v1:0 | 32K | Large Mistral model from a European provider | Multilingual applications, European compliance |
| | mistral.mixtral-8x7b-instruct-v0:1 | 32K | MoE architecture, efficient inference | Efficient multilingual processing |

Azure OpenAI Models

| Feature | Description | Use Cases |
| --- | --- | --- |
| Model Availability | Same models as OpenAI with Azure enterprise capabilities | Enterprise applications, Microsoft ecosystem integration |
| Regional Deployments | Deploy models in specific Azure regions | Data residency compliance, low-latency applications |
| Private Endpoints | Multi-authentication support for secure enterprise access | Secure enterprise deployments |
| Enterprise SLAs | Guaranteed uptime and support | Mission-critical applications |
| Content Filtering | Built-in content moderation | Compliance and safety requirements |

Azure AI Inference Models

| Model Type | Description | Use Cases |
| --- | --- | --- |
| Custom Model Deployments | Deploy any model from the Azure AI catalogue | Custom ML pipelines, specialised applications |
| Fine-tuned Models | Deploy your custom fine-tuned models | Domain-specific applications |
| Open-Source Models | Llama, Mistral, Falcon, and more | Open-source AI development |
| Specialised Models | Domain-specific models for healthcare, finance, etc. | Industry-specific applications |

Google AI Models (6+ Models)

| Model Name | Version | Context | Strengths | Use Cases |
| --- | --- | --- | --- | --- |
| Gemini 1.5 Pro | gemini-1.5-pro | 2M | Most capable Gemini model: massive context, multimodal, video understanding | Long document analysis, video processing |
| Gemini 1.5 Flash | gemini-1.5-flash | 1M | Fast, efficient, cost-effective, multimodal | Real-time applications, high-volume processing |
| Gemini 1.0 Pro | gemini-1.0-pro | Standard | Previous-generation Pro model | Stable, proven performance |
| Gemini 1.0 Pro Vision | gemini-1.0-pro-vision | Standard | Vision-enabled version | Image analysis applications |
| Gemini Experimental | gemini-exp-1121 | Standard | November 2024 experimental | Testing cutting-edge capabilities |
| | gemini-exp-1114 | Standard | November 2024 experimental | Early access to new features |

Google Vertex AI Models

| Feature | Description | Use Cases |
| --- | --- | --- |
| Model Availability | Same Gemini models as Google AI with enterprise features | GCP-native applications, enterprise deployments |
| Private Endpoints | VPC Service Controls | Secure enterprise deployments |
| Regional Deployments | Data residency compliance | Compliance requirements |
| Model Garden | Access to 100+ open-source models | Open-source AI development |
| AutoML Integration | Custom model training | Custom ML model development |

Model Selection Guide

By Use Case

Complex Reasoning & Analysis

models = {
    "premium": ["gpt-4o", "claude-3-5-sonnet-20241022", "gemini-1.5-pro"],
    "balanced": ["gpt-4-turbo", "claude-3-sonnet-20240229"],
    "reasoning": ["o1-preview", "o1-mini"]
}

High Volume Processing

models = {
    "fastest": ["claude-3-5-haiku-20241022", "gpt-4o-mini", "gemini-1.5-flash"],
    "cost_optimised": ["gpt-3.5-turbo", "claude-3-haiku-20240307"],
    "open_source": ["meta.llama3-8b-instruct-v1:0"]
}

Long Context Applications

models = {
    "maximum_context": ["gemini-1.5-pro"],  # 2M tokens
    "large_context": ["gemini-1.5-flash"],  # 1M tokens
    "standard_large": ["claude-3-5-sonnet-20241022", "gpt-4o"]  # 200K/128K
}

Multimodal Applications

models = {
    "vision": ["gpt-4o", "gemini-1.5-pro", "gpt-4-turbo"],
    "video": ["gemini-1.5-pro", "gemini-1.5-flash"],
    "images": ["gpt-4o", "gemini-1.0-pro-vision"]
}

Dynamic Model Selection (Example Script)

class ModelSelector:
    """Intelligent model selection based on requirements"""
    
    # Model capabilities and costs
    MODEL_PROFILES = {
        "gpt-4o": {
            "cost": "high",
            "speed": "medium",
            "capability": "highest",
            "context": 128000
        },
        "gpt-4o-mini": {
            "cost": "low",
            "speed": "fast",
            "capability": "good",
            "context": 128000
        },
        "claude-3-5-sonnet-20241022": {
            "cost": "medium",
            "speed": "fast",
            "capability": "highest",
            "context": 200000
        },
        "claude-3-5-haiku-20241022": {
            "cost": "very_low",
            "speed": "fastest",
            "capability": "good",
            "context": 200000
        },
        "gemini-1.5-pro": {
            "cost": "medium",
            "speed": "medium",
            "capability": "highest",
            "context": 2000000
        },
        "gemini-1.5-flash": {
            "cost": "low",
            "speed": "fast",
            "capability": "good",
            "context": 1000000
        }
    }
    
    def select_model(self, 
                     task_complexity: str,
                     context_size: int,
                     budget: str,
                     speed_requirement: str) -> str:
        """Select optimal model based on requirements"""
        
        suitable_models = []
        
        for model, profile in self.MODEL_PROFILES.items():
            # Check context size
            if context_size > profile["context"]:
                continue
                
            # Check budget constraints
            if budget == "low" and profile["cost"] in ["high", "medium"]:
                continue
                
            # Check speed requirements
            if speed_requirement == "real-time" and profile["speed"] == "slow":
                continue
                
            # Check capability requirements
            if task_complexity == "complex" and profile["capability"] != "highest":
                continue
                
            suitable_models.append(model)
        
        # Return best match or default
        return suitable_models[0] if suitable_models else "gpt-4o-mini"

# Usage example
selector = ModelSelector()

# Select model for different scenarios
model_for_chat = selector.select_model(
    task_complexity="simple",
    context_size=1000,
    budget="low",
    speed_requirement="real-time"
)  # Returns: gpt-4o-mini (first match; claude-3-5-haiku-20241022 also qualifies)

model_for_analysis = selector.select_model(
    task_complexity="complex",
    context_size=150000,
    budget="high",
    speed_requirement="normal"
)  # Returns: claude-3-5-sonnet-20241022 (gemini-1.5-pro also qualifies)

Cost-Optimised Model Routing (Example Script)

from typing import Dict

class CostOptimisedRouter:
    """Route requests to most cost-effective model"""
    
    # Cost per 1K tokens (input/output estimated)
    MODEL_COSTS = {
        "gpt-4o": {"input": 0.00250, "output": 0.01000},
        "gpt-4o-mini": {"input": 0.00015, "output": 0.00060},
        "claude-3-5-sonnet-20241022": {"input": 0.00300, "output": 0.01500},
        "claude-3-5-haiku-20241022": {"input": 0.00025, "output": 0.00125},
        "gemini-1.5-flash": {"input": 0.00010, "output": 0.00040},
        "gpt-3.5-turbo": {"input": 0.00050, "output": 0.00150}
    }
    
    def estimate_cost(self, model: str, input_tokens: int, output_tokens: int) -> float:
        """Estimate cost for a request"""
        costs = self.MODEL_COSTS.get(model, {"input": 0, "output": 0})
        input_cost = (input_tokens / 1000) * costs["input"]
        output_cost = (output_tokens / 1000) * costs["output"]
        return input_cost + output_cost
    
    def select_cheapest_capable_model(self, 
                                      task_type: str,
                                      estimated_tokens: int) -> str:
        """Select cheapest model capable of the task"""
        
        # Define capable models by task type
        capable_models = {
            "simple": ["gpt-4o-mini", "claude-3-5-haiku-20241022", "gpt-3.5-turbo"],
            "moderate": ["gpt-4o-mini", "claude-3-5-haiku-20241022", "gemini-1.5-flash"],
            "complex": ["gpt-4o", "claude-3-5-sonnet-20241022", "gemini-1.5-pro"]
        }
        
        models = capable_models.get(task_type, capable_models["simple"])
        
        # Calculate costs and select cheapest
        cheapest = min(models, key=lambda m: self.estimate_cost(
            m, estimated_tokens, estimated_tokens // 2
        ))
        
        return cheapest
    
    def route_request(self, prompt: str, complexity: str = "auto") -> Dict:
        """Route request to optimal model"""
        
        # Auto-detect complexity if needed
        if complexity == "auto":
            prompt_length = len(prompt)
            if prompt_length < 100:
                complexity = "simple"
            elif prompt_length < 500:
                complexity = "moderate"
            else:
                complexity = "complex"
        
        # Estimate tokens (rough estimate)
        estimated_tokens = len(prompt) // 4
        
        # Select model
        model = self.select_cheapest_capable_model(complexity, estimated_tokens)
        
        # Get provider for model
        provider_map = {
            "gpt-4o": "openai",
            "gpt-4o-mini": "openai",
            "gpt-3.5-turbo": "openai",
            "claude-3-5-sonnet-20241022": "anthropic",
            "claude-3-5-haiku-20241022": "anthropic",
            "gemini-1.5-pro": "google",
            "gemini-1.5-flash": "google"
        }
        
        return {
            "model": model,
            "provider": provider_map[model],
            "estimated_cost": self.estimate_cost(model, estimated_tokens, estimated_tokens // 2),
            "reasoning": f"Selected {model} as cheapest option for {complexity} task"
        }

# Usage
router = CostOptimisedRouter()

# Simple query - routes to cheapest model
result = router.route_request("What is 2+2?")
print(f"Model: {result['model']}, Cost: ${result['estimated_cost']:.6f}")

# Complex query - routes to capable but cost-effective model
result = router.route_request(
    "Analyse this 10-page legal document and identify key risks...",
    complexity="complex"
)
print(f"Model: {result['model']}, Cost: ${result['estimated_cost']:.4f}")

A/B Testing Different Models (Example Script)

import asyncio
import time
from typing import List, Dict
from openai import AsyncOpenAI

class ModelABTester:
    """A/B test different models for quality and performance"""
    
    def __init__(self, gateway_url: str):
        self.gateway_url = gateway_url
        self.results = []
    
    def create_client(self, provider: str, credentials: dict) -> AsyncOpenAI:
        """Create client for specific provider"""
        headers = {"x-provider-name": provider}
        headers.update(credentials)
        
        # Async client so the A/B tests below actually run concurrently
        return AsyncOpenAI(
            api_key="dummy",
            base_url=self.gateway_url,
            default_headers=headers
        )
    
    async def test_model(self, 
                         model: str, 
                         provider: str,
                         credentials: dict,
                         prompt: str) -> Dict:
        """Test a single model"""
        
        client = self.create_client(provider, credentials)
        
        start_time = time.time()
        try:
            response = await client.chat.completions.create(
                model=model,
                messages=[{"role": "user", "content": prompt}],
                max_tokens=500
            )
            
            end_time = time.time()
            
            return {
                "model": model,
                "provider": provider,
                "success": True,
                "latency": end_time - start_time,
                "response": response.choices[0].message.content,
                "tokens_used": response.usage.total_tokens if response.usage else 0
            }
        except Exception as e:
            return {
                "model": model,
                "provider": provider,
                "success": False,
                "error": str(e)
            }
    
    async def run_ab_test(self, 
                         test_configs: List[Dict],
                         prompt: str) -> List[Dict]:
        """Run A/B test across multiple models"""
        
        tasks = []
        for config in test_configs:
            task = self.test_model(
                config["model"],
                config["provider"],
                config["credentials"],
                prompt
            )
            tasks.append(task)
        
        results = await asyncio.gather(*tasks)
        return results
    
    def analyse_results(self, results: List[Dict]) -> Dict:
        """Analyse A/B test results"""
        
        successful = [r for r in results if r.get("success")]
        
        if not successful:
            return {"error": "All models failed"}
        
        # Find best by latency
        fastest = min(successful, key=lambda x: x["latency"])
        
        # Calculate averages
        avg_latency = sum(r["latency"] for r in successful) / len(successful)
        
        return {
            "models_tested": len(results),
            "successful": len(successful),
            "fastest_model": fastest["model"],
            "fastest_latency": fastest["latency"],
            "average_latency": avg_latency,
            "results": results
        }

# Usage example
async def main():
    tester = ModelABTester("https://gateway.altrum.ai/v1")
    
    # Configure models to test
    test_configs = [
        {
            "model": "gpt-4o-mini",
            "provider": "openai",
            "credentials": {"Authorization": "Bearer key"}
        },
        {
            "model": "claude-3-5-haiku-20241022",
            "provider": "anthropic",
            "credentials": {"x-api-key": "key"}
        },
        {
            "model": "gemini-1.5-flash",
            "provider": "google",
            "credentials": {"x-goog-api-key": "key"}
        }
    ]
    
    # Run test
    prompt = "Write a haiku about cloud computing"
    results = await tester.run_ab_test(test_configs, prompt)
    
    # Analyse
    analysis = tester.analyse_results(results)
    print(f"Fastest model: {analysis['fastest_model']}")
    print(f"Latency: {analysis['fastest_latency']:.2f}s")

# Run the test
asyncio.run(main())

Model Comparison Matrix

| Provider | Model | Context | Speed | Cost | Best For |
| --- | --- | --- | --- | --- | --- |
| OpenAI | gpt-4o | 128K | Medium | High | Complex reasoning, multimodal |
| | gpt-4o-mini | 128K | Fast | Low | Simple tasks, high volume |
| | gpt-4-turbo | 128K | Medium | High | Vision tasks, analysis |
| | gpt-3.5-turbo | 16K | Fast | Low | Quick responses, chatbots |
| | o1-preview | Standard | Slow | High | Complex reasoning |
| Anthropic | claude-3-5-sonnet | 200K | Fast | Medium | Coding, analysis |
| | claude-3-5-haiku | 200K | Fastest | Very Low | Real-time apps |
| | claude-3-opus | 200K | Slow | Very High | Complex research |
| Google | gemini-1.5-pro | 2M | Medium | Medium | Massive documents |
| | gemini-1.5-flash | 1M | Fast | Low | High-speed processing |
| Bedrock | llama3-1-70b | 8K | Medium | Low | Open-source needs |
| | titan-premier | 8K | Fast | Low | AWS integration |
| | mistral-large | 32K | Medium | Medium | European compliance |

Best Practices

1. Model Selection Strategy

def select_model_strategy(requirements):
    """Strategic model selection based on requirements"""
    
    strategies = {
        "quality_first": [
            "gpt-4o",
            "claude-3-5-sonnet-20241022",
            "gemini-1.5-pro"
        ],
        "speed_first": [
            "claude-3-5-haiku-20241022",
            "gpt-4o-mini",
            "gemini-1.5-flash"
        ],
        "cost_first": [
            "gpt-3.5-turbo",
            "claude-3-haiku-20240307",
            "meta.llama3-8b-instruct-v1:0"
        ],
        "context_first": [
            "gemini-1.5-pro",  # 2M tokens
            "gemini-1.5-flash",  # 1M tokens
            "claude-3-5-sonnet-20241022"  # 200K tokens
        ]
    }
    
    return strategies.get(requirements["priority"], strategies["quality_first"])

2. Fallback Chains

class ModelFallbackChain:
    """Implement fallback chains for reliability"""
    
    def __init__(self):
        self.fallback_chains = {
            "premium": [
                "gpt-4o",
                "claude-3-5-sonnet-20241022",
                "gemini-1.5-pro",
                "gpt-4-turbo"
            ],
            "efficient": [
                "gpt-4o-mini",
                "claude-3-5-haiku-20241022",
                "gemini-1.5-flash",
                "gpt-3.5-turbo"
            ]
        }
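
    # Illustrative stand-in, assumed rather than shown in the original sketch:
    def call_model(self, model, prompt):
        """Issue the request via your gateway client of choice."""
        raise NotImplementedError("wire this to your AI Gateway client")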
    
    def execute_with_fallback(self, chain_type, prompt):
        """Execute with automatic fallback"""
        
        chain = self.fallback_chains[chain_type]
        
        for model in chain:
            try:
                return self.call_model(model, prompt)
            except Exception as e:
                print(f"Model {model} failed: {e}")
                continue
        
        raise Exception("All models in fallback chain failed")

3. Cost Monitoring

class CostMonitor:
    """Monitor and control model costs"""
    
    def __init__(self, monthly_budget: float):
        self.monthly_budget = monthly_budget
        self.current_spend = 0.0
        self.model_usage = {}
    
    def track_usage(self, model: str, tokens_in: int, tokens_out: int):
        """Track model usage and costs"""
        
        cost = self.calculate_cost(model, tokens_in, tokens_out)
        self.current_spend += cost
        
        if model not in self.model_usage:
            self.model_usage[model] = {"calls": 0, "cost": 0}
        
        self.model_usage[model]["calls"] += 1
        self.model_usage[model]["cost"] += cost
        
        # Alert if approaching budget
        if self.current_spend > self.monthly_budget * 0.8:
            self.send_budget_alert()
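
    # Illustrative stand-ins, assumed rather than shown in the original sketch:
    def calculate_cost(self, model: str, tokens_in: int, tokens_out: int) -> float:
        """Simplified pricing lookup; replace with gateway-reported costs."""
        prices = {"gpt-4o-mini": (0.00015, 0.00060)}  # USD per 1K tokens (input, output), indicative
        price_in, price_out = prices.get(model, (0.0, 0.0))
        return (tokens_in / 1000) * price_in + (tokens_out / 1000) * price_out

    def send_budget_alert(self):
        """Alerting hook; wire up email/Slack in production."""
        print(f"Budget alert: ${self.current_spend:.2f} of ${self.monthly_budget:.2f} spent")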
    
    def get_usage_report(self):
        """Generate usage report"""
        
        return {
            "total_spend": self.current_spend,
            "budget_remaining": self.monthly_budget - self.current_spend,
            "model_breakdown": self.model_usage
        }

Migration Guide

From Single Model to Multi-Model

# Before: Single model deployment
client = OpenAI(api_key="key")
response = client.chat.completions.create(
    model="gpt-4",
    messages=[...]
)

# After: Multi-model with intelligent selection
class MultiModelClient:
    def __init__(self):
        self.gateway = "https://gateway.altrum.ai/v1"
    
    def create_completion(self, messages, requirements=None):
        # Select model based on requirements (see ModelSelector above)
        model = self.select_optimal_model(requirements)
        
        # Create a gateway client for the selected model
        client = self.create_client_for_model(model)
        
        # Execute with automatic fallback (see ModelFallbackChain above)
        return self.execute_with_fallback(client, model, messages)
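
A usage sketch, assuming the helper methods above are implemented along the lines of ModelSelector and ModelFallbackChain from the Best Practices section:

client = MultiModelClient()
response = client.create_completion(
    messages=[{"role": "user", "content": "Summarise the key risks in this contract."}],
    requirements={"priority": "quality_first"}
)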

Conclusion

The Multi-Model Support feature transforms AI deployment from a single-model dependency to a flexible, optimised multi-model strategy in your Production AI Stack. With access to 50+ models across 7 providers, organisations can select the right model for each use case, optimise costs, ensure reliability through redundancy, and stay at the forefront of AI innovation.