
Overview

The Multi-Model feature provides access to 55+ AI models across 7 major providers. The model library spans a wide range of sizes, capabilities, and price points, enabling organisations to select the optimal model for each use case while working through a single, consistent API.

Key Capabilities

55+ Production Models

From lightweight to state-of-the-art models across all AI providers.

Automatic Model Validation

Built-in controls ensure that only supported models are used, maintaining consistent AI output quality.
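The validation itself happens inside the gateway, but the idea can be sketched as a simple allow-list check. The model names below are assumed examples, not the gateway's actual configuration:

```python
# Illustrative sketch only: the gateway performs this check server-side;
# the allow-list here uses a few assumed model names.
SUPPORTED_MODELS = {
    "gpt-4o",
    "gpt-4o-mini",
    "claude-3-5-haiku-20241022",
}

def validate_model(model: str) -> str:
    """Reject requests that name a model the gateway does not support."""
    if model not in SUPPORTED_MODELS:
        raise ValueError(f"Unsupported model: {model!r}")
    return model
```

Rejecting unknown model names up front gives a clear client-side error instead of an opaque provider failure.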

Model-Specific Optimisations

Automatic parameter adjustments based on model capabilities to ensure optimal performance.
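As an illustration of the kind of adjustment involved, the sketch below rewrites request parameters for OpenAI's o-series reasoning models, which take max_completion_tokens rather than max_tokens and do not accept sampling parameters such as temperature. This is a simplified assumption about the gateway's behaviour, not its actual code:

```python
def adjust_params(model: str, params: dict) -> dict:
    """Hypothetical sketch: adapt request parameters to the target
    model's constraints before forwarding the request."""
    adjusted = dict(params)
    if model.startswith(("o1", "o3", "o4")):
        # OpenAI reasoning models use max_completion_tokens and
        # do not accept sampling controls such as temperature
        if "max_tokens" in adjusted:
            adjusted["max_completion_tokens"] = adjusted.pop("max_tokens")
        adjusted.pop("temperature", None)
    return adjusted
```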

Transparent Pricing

Real-time cost calculation and monitoring for all supported models in AI Gateway.

Business Benefits

1. Best of Breed Model Selection

  • Task-Optimised Performance
    Choose the ideal model for each specific use case: GPT-4o for complex reasoning, Claude for long-context analysis, Gemini for multimodal tasks, and so on.
  • Cost-Performance Optimisation
    Select cost-effective models for simple tasks (e.g., GPT-4o Mini, Claude Haiku) and premium models for complex operations.
  • Competitive Advantage
    Leverage unique capabilities of different models to outperform competitors using single-model approaches.
  • Innovation Velocity
    Immediately access new models as they are released without infrastructure changes.

2. Risk Mitigation & Reliability

  • Model Diversification
    Avoid dependency on a single model’s availability, performance or pricing changes.
  • Automatic Failover
    Seamlessly switch to alternative models during outages or degraded performance.
  • Compliance Flexibility
    Use region-specific or compliance-certified models (Azure AI, AWS Bedrock, Google AI, etc.) for regulated workloads.
  • Quality Assurance
    A/B test different models to ensure consistent quality across providers.
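The failover pattern above can also be driven from the client side. The sketch below tries each model in preference order and moves on when a call fails; the fallback chain is an illustrative assumption, and `client` stands for any OpenAI-compatible client (for example, one pointed at the gateway):

```python
# Hedged sketch of client-side failover across models/providers.
FALLBACK_CHAIN = [
    "gpt-4o",
    "claude-3-5-sonnet-20241022",
    "gemini-1.5-flash",
]

def complete_with_failover(client, prompt: str) -> str:
    """Try each model in order until one succeeds."""
    last_error = None
    for model in FALLBACK_CHAIN:
        try:
            response = client.chat.completions.create(
                model=model,
                messages=[{"role": "user", "content": prompt}],
            )
            return response.choices[0].message.content
        except Exception as exc:  # outage, rate limit, invalid model, ...
            last_error = exc
    raise RuntimeError("All fallback models failed") from last_error
```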

3. Cost Management & Optimisation

  • Dynamic Cost Control
    Route requests to cheaper models based on complexity and budget constraints.
  • Volume Discounts
    Leverage pricing tiers across multiple providers simultaneously.
  • Budget Allocation
    Set model-specific budgets and automatically switch models when limits are reached.
  • ROI Maximisation
    Use premium models only where their advanced capabilities justify the cost.
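The budget-allocation idea above can be sketched as a small per-model tracker that downgrades to a fallback once a limit is exhausted. The budgets and model names are illustrative assumptions, not gateway settings:

```python
# Hypothetical sketch of per-model budget tracking with automatic fallback.
class BudgetAllocator:
    def __init__(self, budgets: dict, fallback: str):
        self.budgets = dict(budgets)  # model -> remaining budget in USD
        self.fallback = fallback

    def record_spend(self, model: str, cost: float) -> None:
        """Deduct the cost of a completed request from the model's budget."""
        if model in self.budgets:
            self.budgets[model] -= cost

    def choose(self, preferred: str) -> str:
        """Use the preferred model while its budget lasts, then fall back."""
        if self.budgets.get(preferred, 0.0) > 0.0:
            return preferred
        return self.fallback
```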

4. Enterprise Scalability

  • Load Distribution
    Distribute high-volume workloads across multiple models to avoid rate limits.
  • Geographic Optimisation
    Use region-specific models for lower latency and data residency compliance.
  • Capacity Management
    Access combined capacity of all providers during peak demand.
  • Performance Benchmarking
    Compare model performance in production with real workloads.
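Load distribution can be as simple as rotating requests through a pool of interchangeable models so no single provider's rate limit is hit. A minimal round-robin sketch, with an assumed example pool:

```python
import itertools

# Hedged sketch: rotate high-volume traffic across equivalent models.
class RoundRobinPool:
    def __init__(self, models):
        self._cycle = itertools.cycle(models)

    def next_model(self) -> str:
        """Return the next model in the rotation."""
        return next(self._cycle)

pool = RoundRobinPool(
    ["gpt-4o-mini", "claude-3-5-haiku-20241022", "gemini-1.5-flash"]
)
```

A production version would typically also weight models by remaining quota or observed latency rather than rotating uniformly.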

Supported Models by Provider

OpenAI Models (20 Models)

| Model | Description | Release Date | Modalities | Context Window | Max Output Tokens | Knowledge Cut-Off |
|---|---|---|---|---|---|---|
| GPT-5 Chat (gpt-5-chat-latest) | Snapshot used in ChatGPT. Recommended for testing latest improvements in chat use cases. | Latest | Text, Image | 128,000 | 16,384 | Sep 30, 2024 |
| GPT-5 (gpt-5-2025-08-07) | Flagship model for coding, reasoning, and agentic tasks across domains. | 2025-08-07 | Text, Image | 400,000 | 128,000 | Sep 30, 2024 |
| GPT-5 Mini (gpt-5-mini-2025-08-07) | Faster, more cost-efficient GPT-5 variant for well-defined tasks and precise prompts. | 2025-08-07 | Text, Image | 400,000 | 128,000 | May 31, 2024 |
| GPT-5 Nano (gpt-5-nano-2025-08-07) | Cheapest, fastest GPT-5 variant. Ideal for summarization and classification. | 2025-08-07 | Text, Image | 400,000 | 128,000 | May 31, 2024 |
| GPT-4.1 (gpt-4.1-2025-04-14) | Excels at instruction following and tool use. Supports 1M token context with low latency. | 2025-04-14 | Text, Image | 1,047,576 | 32,768 | Jun 01, 2024 |
| GPT-4.1 Mini (gpt-4.1-mini-2025-04-14) | Smaller, faster GPT-4.1 variant. Maintains broad capabilities with 1M token context. | 2025-04-14 | Text, Image | 1,047,576 | 32,768 | Jun 01, 2024 |
| GPT-4.1 Nano (gpt-4.1-nano-2025-04-14) | Ultra-light GPT-4.1 variant for efficiency with 1M token context. | 2025-04-14 | Text, Image | 1,047,576 | 32,768 | Jun 01, 2024 |
| GPT-4 Preview (gpt-4-0125-preview) | Research preview of GPT-4 Turbo, an older high-intelligence model. | 2024-01-25 | Text | 128,000 | 4,096 | Dec 01, 2023 |
| GPT-4 Legacy (gpt-4-0613) | Older GPT-4 model, still available for compatibility. | 2023-06-13 | Text | 8,192 | 8,192 | Dec 01, 2023 |
| GPT-4 Turbo (gpt-4-turbo-2024-04-09) | Cheaper, faster variant of GPT-4. Superseded by GPT-4o. | 2024-04-09 | Text, Image | 128,000 | 4,096 | Dec 01, 2023 |
| GPT-4o (gpt-4o-2024-05-13) | Versatile, high-intelligence flagship model. Multimodal (text + image). | 2024-05-13 | Text, Image | 128,000 | 4,096 | Oct 01, 2023 |
| GPT-4o (gpt-4o-2024-08-06) | Updated GPT-4o snapshot. | 2024-08-06 | Text, Image | 128,000 | 16,384 | Oct 01, 2023 |
| GPT-4o (gpt-4o-2024-11-20) | Updated GPT-4o snapshot. | 2024-11-20 | Text, Image | 128,000 | 16,384 | Oct 01, 2023 |
| GPT-4o Latest (chatgpt-4o-latest) | Points to the GPT-4o snapshot used in ChatGPT. | Rolling | Text, Image | 128,000 | 16,384 | Oct 01, 2023 |
| GPT-4o Mini (gpt-4o-mini-2024-07-18) | Lightweight GPT-4o variant. Fast, affordable, and fine-tuning friendly. | 2024-07-18 | Text, Image | 128,000 | 16,384 | Oct 01, 2023 |
| O1 (o1-2024-12-17) | RL-trained reasoning model. Thinks step-by-step before answering. | 2024-12-17 | Text, Image | 200,000 | 100,000 | Oct 01, 2023 |
| O3 (o3-2025-04-16) | High-performance reasoning model for math, science, coding, and multimodal analysis. | 2025-04-16 | Text, Image | 200,000 | 100,000 | Jun 01, 2024 |
| O3 Mini (o3-mini-2025-01-31) | Small reasoning model. Supports structured outputs, function calling, and batch API. | 2025-01-31 | Text | 200,000 | 100,000 | Oct 01, 2023 |
| O4 Mini (o4-mini-2025-04-16) | Latest small o-series model. Optimized for fast reasoning, coding, and visual tasks. | 2025-04-16 | Text, Image | 200,000 | 100,000 | Jun 01, 2024 |
| GPT-3.5 Turbo (gpt-3.5-turbo-0125) | Legacy GPT-3.5 model. Still supported, but GPT-4o Mini is recommended instead. | 2024-01-25 | Text | 16,385 | 4,096 | Sep 01, 2021 |

Anthropic Claude Models (7 Models)

| Model | Description | Release Date | Modalities | Context Window | Max Output Tokens | Knowledge Cut-Off |
|---|---|---|---|---|---|---|
| Claude Sonnet 4.5 (claude-sonnet-4-5-20250929) | Best model for complex agents and coding. | 2025-09-29 | Text, Image (Vision), Multilingual | 200K / 1M (beta) | 64,000 | Reliable: Jan 2025 · Training data: Jul 2025 |
| Claude Sonnet 4 (claude-sonnet-4-20250514) | High-performance model. | 2025-05-14 | Text, Image (Vision), Multilingual | 200K / 1M (beta) | 64,000 | Reliable: Jan 2025 · Training data: Mar 2025 |
| Claude Sonnet 3.7 (claude-3-7-sonnet-20250219, alias: claude-3-7-sonnet-latest) | High-performance model with early extended thinking. | 2025-02-19 | Text, Image (Vision), Multilingual | 200K | 64,000 | Reliable: Oct 2024 · Training data: Nov 2024 |
| Claude Opus 4.1 (claude-opus-4-1-20250805) | Exceptional model for specialized complex tasks. | 2025-08-05 | Text, Image (Vision), Multilingual | 200K | 32,000 | Reliable: Jan 2025 · Training data: Mar 2025 |
| Claude Opus 4 (claude-opus-4-20250514) | Previous flagship model. | 2025-05-14 | Text, Image (Vision), Multilingual | 200K | 32,000 | Reliable: Jan 2025 · Training data: Mar 2025 |
| Claude Haiku 3.5 (claude-3-5-haiku-20241022, alias: claude-3-5-haiku-latest) | Fastest Claude model. | 2024-10-22 | Text, Image (Vision), Multilingual | 200K | 8,192 | Reliable: Jul 2024 · Training data: Jul 2024 |
| Claude Haiku 3 (claude-3-haiku-20240307) | Compact model for near-instant responsiveness. | 2024-03-07 | Text, Image (Vision), Multilingual | 200K | 4,096 | Reliable: 2023 · Training data: Aug 2023 |

Amazon Bedrock Models (24 Models)

| Model | Description | Release Date | Modalities | Context Window |
|---|---|---|---|---|
| Claude Sonnet 4.5 (anthropic.claude-sonnet-4-5-20250929-v1:0) | Latest Claude Sonnet reasoning/chat model. | 2025-09-29 | Text, Image | 200K |
| Claude Sonnet 4 (anthropic.claude-sonnet-4-20250514-v1:0) | Advanced Claude Sonnet v4. | 2025-05-14 | Text, Image | 200K |
| Claude Sonnet 3.7 (anthropic.claude-3-7-sonnet-20250219-v1:0) | Claude 3.7 Sonnet generation model. | 2025-02-19 | Text, Image | 200K |
| Claude Sonnet 3.5 v2 (anthropic.claude-3-5-sonnet-20241022-v2:0) | Updated Claude 3.5 Sonnet. | 2024-10-22 | Text, Image | 200K |
| Claude Sonnet 3.5 (anthropic.claude-3-5-sonnet-20240620-v1:0) | Standard Claude 3.5 Sonnet. | 2024-06-20 | Text, Image | 200K |
| Claude Haiku 3 (anthropic.claude-3-haiku-20240307-v1:0) | Lightweight Claude model optimized for speed/cost. | 2024-03-07 | Text | 48K |
| Claude Sonnet 3 (anthropic.claude-3-sonnet-20240229-v1:0) | Claude 3 Sonnet general-purpose model. | 2024-02-29 | Text, Image | 28K |
| Nova Lite (amazon.nova-lite-v1:0) | Amazon Nova lightweight model. | 2025 | Text | 300K |
| Nova Micro (amazon.nova-micro-v1:0) | Amazon Nova smallest variant. | 2025 | Text | 128K |
| Nova Pro (amazon.nova-pro-v1:0) | Amazon Nova flagship model. | 2025 | Text | 300K |
| Titan Text G1 – Express (amazon.titan-text-express-v1) | Balanced Titan LLM for text generation. | 2023 | Text | 8K |
| Titan Text G1 – Lite (amazon.titan-text-lite-v1) | Lightweight Titan model. | 2023 | Text | 4K |
| IBM Granite 3.2 Instruct 8B (ibm-granite-3-2-8b-instruct) | General-purpose instruct model. | 2025 | Text | |
| IBM Granite 3.0 Instruct 8B (granite-3-0-8b-instruct) | Earlier instruct model (8B params). | 2024 | Text | |
| IBM Granite 20B Code Instruct (ibm-granite-20b-code-instruct-8k) | Code-focused model (20B params). | 2024 | Text (Code) | 8K |
| IBM Granite 8B Code Instruct (ibm-granite-8b-code-instruct-128k) | Code instruct model with extended context. | 2024 | Text (Code) | 128K |
| IBM Granite 34B Code Instruct (ibm-granite-34b-code-instruct-8k) | Large code instruct model (34B params). | 2024 | Text (Code) | 8K |
| Llama 3 8B Instruct (meta.llama3-8b-instruct-v1:0) | Meta Llama 3 instruct-tuned model. | 2024 | Text | 8K |
| Llama 3 70B Instruct (meta.llama3-70b-instruct-v1:0) | Larger Meta Llama 3 instruct model. | 2024 | Text | 8K |
| DeepSeek-R1 (deepseek-llm-r1) | DeepSeek foundation model. | 2025 | Text | |
| DeepSeek V3.1 (deepseek.v3-v1:0) | Latest DeepSeek v3.1 model. | 2025 | Text | 163,840 |
| Mistral 7B Instruct (mistral.mistral-7b-instruct-v0:2) | Instruction-tuned Mistral 7B. | 2024-03-01 | Text, Code, Classification | 32K |
| Mistral Large 24.02 (mistral.mistral-large-2402-v1:0) | Large Mistral model for reasoning, text, code, RAG, and agents. | 2024-04-02 | Text, Code, RAG, Agents | 32K |
| Mixtral 8x7B Instruct (mistral.mixtral-8x7b-instruct-v0:1) | Mixture-of-experts instruct model. | 2024-03-01 | Text, Code, Reasoning | 32K |

Azure OpenAI Models

| Model | Description | Release Date | Modalities | Context Window | Max Output Tokens | Knowledge Cut-Off |
|---|---|---|---|---|---|---|
| GPT-5 (gpt-5-2025-08-07) | Flagship GPT-5 with reasoning, structured outputs, text + image processing, functions & tools. | 2025-08-07 | Text, Image | 400,000 (272K in / 128K out) | 128,000 | Sep 30, 2024 |
| GPT-5 Mini (gpt-5-mini-2025-08-07) | Smaller, faster GPT-5 variant. | 2025-08-07 | Text, Image | 400,000 (272K in / 128K out) | 128,000 | May 31, 2024 |
| GPT-5 Nano (gpt-5-nano-2025-08-07) | Optimized GPT-5 variant with smaller footprint. | 2025-08-07 | Text, Image | 400,000 (272K in / 128K out) | 128,000 | May 31, 2024 |
| GPT-5 Chat Preview (gpt-5-chat-2025-08-07) | Chat-optimized GPT-5 (preview). | 2025-08-07 | Text, Image | 128,000 | 16,384 | Sep 30, 2024 |
| GPT-5 Chat Preview (gpt-5-chat-2025-10-03) | Updated chat-optimized GPT-5 (preview). | 2025-10-03 | Text, Image | 128,000 | 16,384 | Sep 30, 2024 |
| GPT-5 Codex (gpt-5-codex-2025-09-11) | GPT-5 optimized for coding and structured outputs. | 2025-09-11 | Text, Image | 400,000 (272K in / 128K out) | 128,000 | |
| GPT-5 Pro (gpt-5-pro-2025-10-06) | GPT-5 Pro with advanced reasoning, structured outputs, functions & tools. | 2025-10-06 | Text, Image | 400,000 (272K in / 128K out) | 128,000 | Sep 30, 2024 |
| GPT-OSS 120B (gpt-oss-120b) Preview | Open-source style reasoning model. | 2025 | Text | 131,072 | 131,072 | May 31, 2024 |
| GPT-OSS 20B (gpt-oss-20b) Preview | Smaller GPT-OSS variant. | 2025 | Text | 131,072 | 131,072 | May 31, 2024 |
| GPT-4.1 (gpt-4.1-2025-04-14) | Multimodal model with streaming, function calling, and structured outputs. | 2025-04-14 | Text, Image | 1,047,576 · 128K (managed) · 300K (batch) | 32,768 | May 31, 2024 |
| GPT-4.1 Nano (gpt-4.1-nano-2025-04-14) | Lightweight GPT-4.1 variant. | 2025-04-14 | Text, Image | 1,047,576 · 128K (managed) · 300K (batch) | 32,768 | May 31, 2024 |
| GPT-4.1 Mini (gpt-4.1-mini-2025-04-14) | Smaller GPT-4.1 variant. | 2025-04-14 | Text, Image | 1,047,576 · 128K (managed) · 300K (batch) | 32,768 | May 31, 2024 |
| Codex Mini (codex-mini-2025-05-16) | Fine-tuned o4-mini optimized for code. | 2025-05-16 | Text, Image | 200K in / 100K out | 100,000 | May 31, 2024 |
| O3 Pro (o3-pro-2025-06-10) | Advanced reasoning model with enhanced capabilities. | 2025-06-10 | Text, Image | 200K in / 100K out | 100,000 | May 31, 2024 |
| O4 Mini (o4-mini-2025-04-16) | Reasoning model with efficient performance. | 2025-04-16 | Text, Image | 200K in / 100K out | 100,000 | May 31, 2024 |
| O3 (o3-2025-04-16) | Reasoning model with tool use. | 2025-04-16 | Text, Image | 200K in / 100K out | 100,000 | May 31, 2024 |
| O3 Mini (o3-mini-2025-01-31) | Text-only reasoning model. | 2025-01-31 | Text | 200K in / 100K out | 100,000 | Oct 2023 |
| O1 (o1-2024-12-17) | Reasoning model with structured outputs. | 2024-12-17 | Text, Image | 200K in / 100K out | 100,000 | Oct 2023 |
| O1 Preview (o1-preview-2024-09-12) | Early preview release of O1. | 2024-09-12 | Text | 128K in / 32,768 out | 32,768 | Oct 2023 |
| O1 Mini (o1-mini-2024-09-12) | Cost-efficient O1 variant. | 2024-09-12 | Text | 128K in / 65,536 out | 65,536 | Oct 2023 |
| GPT-4o (gpt-4o-2024-11-20) | Multimodal GPT-4o with JSON mode, function calling, and strong vision support. | 2024-11-20 | Text, Image | 128,000 | 16,384 | Oct 2023 |
| GPT-4o (gpt-4o-2024-08-06) | Updated GPT-4o release. | 2024-08-06 | Text, Image | 128,000 | 16,384 | Oct 2023 |
| GPT-4o (gpt-4o-2024-05-13) | Early GPT-4o release (Turbo Vision parity). | 2024-05-13 | Text, Image | 128,000 | 4,096 | Oct 2023 |
| GPT-4o Mini (gpt-4o-mini-2024-07-18) | Smaller, fast GPT-4o variant. | 2024-07-18 | Text, Image | 128,000 | 16,384 | Oct 2023 |
| GPT-4 Turbo (gpt-4-turbo-2024-04-09) | Multimodal GPT-4 Turbo, successor to preview models. | 2024-04-09 | Text, Image | 128,000 | 4,096 | Dec 2023 |
| GPT-3.5 Turbo (gpt-35-turbo-0125) | JSON mode, function calling, reproducible outputs. | 2024-01-25 | Text | 16,385 in / 4,096 out | 4,096 | Sep 2021 |
| GPT-3.5 Turbo (gpt-35-turbo-1106) | Earlier GPT-3.5 Turbo variant. | 2023-11-06 | Text | 16,385 in / 4,096 out | 4,096 | Sep 2021 |
| GPT-3.5 Turbo Instruct (gpt-35-turbo-instruct-0914) | Replacement for legacy Completions models. | 2023-09-14 | Text | 4,097 | 4,097 | Sep 2021 |

Azure AI Inference Models

| Model | Description | Release Date | Modalities | Context Window | Max Output Tokens |
|---|---|---|---|---|---|
| AI21 Jamba 1.5 Mini (AI21-Jamba-1.5-Mini) | Tool calling: Yes; supports text, JSON, structured outputs. | | Text | 262,144 | 4,096 |
| AI21 Jamba 1.5 Large (AI21-Jamba-1.5-Large) | Tool calling: Yes; supports text, JSON, structured outputs. | | Text | 262,144 | 4,096 |
| O3 Mini (o3-mini) | OpenAI O-series; tool calling: Yes; structured outputs. | | Text, Image | 200,000 | 100,000 |
| O1 (o1) | OpenAI O-series; tool calling: Yes; structured outputs. | | Text, Image | 200,000 | 100,000 |
| O1 Preview (o1-preview) | Early O1 preview; tool calling: Yes. | | Text | 128,000 | 32,768 |
| O1 Mini (o1-mini) | Cost-efficient O1 variant; tool calling: No. | | Text | 128,000 | 65,536 |
| GPT-4o (gpt-4o) | Multimodal GPT-4o; tool calling: Yes; supports structured outputs. | | Text, Image, Audio | 131,072 | 16,384 |
| GPT-4o Mini (gpt-4o-mini) | Smaller GPT-4o variant; tool calling: Yes. | | Text, Image, Audio | 131,072 | 16,384 |
| Cohere Command A (Cohere-command-A) | Cohere instruct model; tool calling: Yes. | | Text | 256,000 | 8,000 |
| Cohere Command R+ (Cohere-command-r-plus-08-2024) | Optimized for reasoning and retrieval; tool calling: Yes. | 2024-08 | Text | 131,072 | 4,096 |
| Cohere Command R (Cohere-command-r-08-2024) | Earlier R-series model; tool calling: Yes. | 2024-08 | Text | 131,072 | 4,096 |
| JAIS 30B (jais-30b-chat) | Multilingual model; tool calling: Yes. | | Text | 8,192 | 4,096 |
| DeepSeek V3 (DeepSeek-V3-0324) | Latest DeepSeek v3; tool calling: No. | 2024-03 | Text | 131,072 | 131,072 |
| DeepSeek V3 (Legacy) (DeepSeek-V3-Legacy) | Earlier DeepSeek v3. | | Text | 131,072 | 131,072 |
| DeepSeek R1 (DeepSeek-R1) | Reasoning-focused model. | | Text | 163,840 | 163,840 |
| Llama 4 Scout (Llama-4-Scout-17B-16E-Instruct) | Meta Llama 4 variant; tool calling: Yes. | | Text, Image | 128,000 | 8,192 |
| Llama 4 Maverick (Llama-4-Maverick-17B-128E-Instruct-FP8) | Meta Llama 4 Maverick; tool calling: Yes. | | Text, Image | 128,000 | 8,192 |
| Llama 3.3 70B (Llama-3.3-70B-Instruct) | Meta Llama 3.3 large model. | | Text | 128,000 | 8,192 |
| Llama 3.2 Vision (Llama-3.2-90B-Vision-Instruct) | Meta Llama 3.2 multimodal vision model. | | Text, Image | 128,000 | 8,192 |
| Llama 3.2 Vision (Llama-3.2-11B-Vision-Instruct) | Smaller Meta Llama 3.2 vision variant. | | Text, Image | 128,000 | 8,192 |
| Llama 3.1 8B (Meta-Llama-3.1-8B-Instruct) | Meta Llama 3.1 instruct variant. | | Text | 131,072 | 8,192 |
| Llama 3.1 405B (Meta-Llama-3.1-405B-Instruct) | Largest Meta Llama 3.1 instruct variant. | | Text | 131,072 | 8,192 |
| MAI DS R1 (MAI-DS-R1) | Reasoning model. | | Text | 163,840 | 163,840 |
| Phi-4 (Phi-4) | Microsoft Phi-4 general-purpose. | | Text | 16,384 | 16,384 |
| Phi-4 Mini (Phi-4-mini-instruct) | Small Phi-4 variant. | | Text | 131,072 | 4,096 |
| Phi-4 Multimodal (Phi-4-multimodal-instruct) | Multimodal Phi-4 (text, image, audio). | | Text, Image, Audio | 131,072 | 4,096 |
| Phi-4 Reasoning (Phi-4-reasoning) | Phi-4 reasoning-focused model. | | Text | 32,768 | 32,768 |
| Phi-4 Mini Reasoning (Phi-4-mini-reasoning) | Lightweight reasoning variant. | | Text | 128,000 | 128,000 |
| Phi-3.5 Mini (Phi-3.5-mini-instruct) | Phi-3.5 small instruct model. | | Text | 131,072 | 4,096 |
| Phi-3.5 MoE (Phi-3.5-MoE-instruct) | Phi-3.5 mixture-of-experts variant. | | Text | 131,072 | 4,096 |
| Phi-3.5 Vision (Phi-3.5-vision-instruct) | Phi-3.5 multimodal variant. | | Text, Image | 131,072 | 4,096 |
| Phi-3 Mini 128K (Phi-3-mini-128k-instruct) | Compact Phi-3 variant with 128K context. | | Text | 131,072 | 4,096 |
| Phi-3 Mini 4K (Phi-3-mini-4k-instruct) | Compact Phi-3 with 4K context. | | Text | 4,096 | 4,096 |
| Phi-3 Small 128K (Phi-3-small-128k-instruct) | Small Phi-3 with 128K context. | | Text | 131,072 | 4,096 |
| Phi-3 Small 8K (Phi-3-small-8k-instruct) | Small Phi-3 with 8K context. | | Text | 131,072 | 4,096 |
| Phi-3 Medium 128K (Phi-3-medium-128k-instruct) | Medium Phi-3 with 128K context. | | Text | 131,072 | 4,096 |
| Phi-3 Medium 4K (Phi-3-medium-4k-instruct) | Medium Phi-3 with 4K context. | | Text | 4,096 | 4,096 |
| Codestral 2501 (Codestral-2501) | Mistral Codestral code-focused model. | | Text | 262,144 | 4,096 |
| Ministral 3B (Ministral-3B) | Lightweight Mistral model; tool calling: Yes. | | Text | 131,072 | 4,096 |
| Mistral Nemo (Mistral-Nemo) | Mistral Nemo model; tool calling: Yes. | | Text | 131,072 | 4,096 |
| Mistral Large 24.11 (Mistral-Large-2411) | Latest Mistral large model; tool calling: Yes. | | Text | 128,000 | 4,096 |
| Mistral Medium 25.05 (Mistral-medium-2505) | Balanced medium model; tool calling: No. | | Text, Image | 128,000 | 128,000 |
| Mistral Small 25.03 (Mistral-small-2503) | Newer small Mistral; tool calling: Yes. | | Text, Image | 131,072 | 4,096 |
| Mistral Small (Mistral-small) | Earlier small Mistral variant. | | Text | 32,768 | 4,096 |
| Tsuzumi 7B (tsuzumi-7b) | Lightweight Tsuzumi 7B model. | | Text | 8,192 | 8,192 |

Google AI Models (7 Models)

| Model | Description | Release Date | Modalities | Max Output Tokens | Knowledge Cut-Off |
|---|---|---|---|---|---|
| Gemini 2.5 Pro (gemini-2.5-pro) | Most advanced model for complex reasoning and multimodal tasks. | 2025 | Text, Image, Audio, Video | 65,536 | Jan 2025 |
| Gemini 2.5 Flash (gemini-2.5-flash) | Balanced model optimized for speed and general use. | 2025 | Text, Image, Audio, Video | 65,536 | Jan 2025 |
| Gemini 2.5 Flash (Preview) (gemini-2.5-flash-preview-09-2025) | Preview release of Gemini 2.5 Flash. | 2025-09 | Text, Image, Audio, Video | 65,536 | Jan 2025 |
| Gemini 2.5 Flash-Lite (gemini-2.5-flash-lite) | Lightweight, cost-efficient variant. | 2025 | Text, Image, Audio, Video | 65,536 | Jan 2025 |
| Gemini 2.5 Flash-Lite (Preview) (gemini-2.5-flash-lite-preview-09-2025) | Preview release of Gemini 2.5 Flash-Lite. | 2025-09 | Text, Image, Audio, Video | 65,536 | Jan 2025 |
| Gemini 2.0 Flash (gemini-2.0-flash) | Earlier generation Flash model. | 2024 | Text, Image, Audio, Video | 8,192 | Aug 2024 |
| Gemini 2.0 Flash-Lite (gemini-2.0-flash-lite) | Lightweight 2.0 Flash variant. | 2024 / 2025 | Text, Image, Audio | 8,192 | Aug 2024 |

Google Vertex AI Models

| Model | Description | Release Date | Modalities | Context Window | Max Output Tokens | Knowledge Cut-Off |
|---|---|---|---|---|---|---|
| Gemini 2.5 Flash (Preview) (gemini-2.5-flash) | Balanced model optimized for speed. | 2025 | Text, Image, Audio, Video | 1M | 65,536 | Jan 2025 |
| Gemini 2.5 Pro (Preview) (gemini-2.5-pro) | Most advanced Gemini model. | 2025 | Text, Image, Audio, Video | 1M | 65,536 | Jan 2025 |
| Gemini 2.0 Flash (gemini-2.0-flash) | Previous Flash generation. | 2024 | Text, Image, Audio, Video | | 8,192 | Aug 2024 |
| Gemini 2.0 Flash-Lite (gemini-2.0-flash-lite) | Lightweight Flash variant. | 2024 | Text, Image, Audio | | 8,192 | Aug 2024 |
| Claude Opus 4.1 (claude-opus-4-1) | Exceptional reasoning model. | 2025 | Text, Image | 200K | 32,000 | Jan 2025 |
| Claude Opus 4 (claude-opus-4) | Previous flagship Claude model. | 2025 | Text, Image | 200K | 32,000 | Jan 2025 |
| Claude Sonnet 4.5 (claude-sonnet-4-5) | Best for complex agents and coding. | 2025 | Text, Image | 200K / 1M (beta) | 64,000 | Jan 2025 |
| Claude Sonnet 4 (claude-sonnet-4) | High-performance Claude Sonnet model. | 2025 | Text, Image | 200K / 1M (beta) | 64,000 | Jan 2025 |
| Claude 3.7 Sonnet (claude-3-7-sonnet) | High-performance with extended thinking. | 2025 | Text, Image | 200K | 64,000 | Oct 2024 |
| Claude 3.5 Sonnet v2 (claude-3-5-sonnet-v2) | Updated Claude 3.5 Sonnet. | 2024 | Text, Image | 200K | 64,000 | 2024 |
| Claude 3.5 Haiku (claude-3-5-haiku) | Fastest Claude model. | 2024 | Text, Image | 200K | 8,192 | Jul 2024 |
| Claude 3 Haiku (claude-3-haiku) | Compact and fast Claude model. | 2024 | Text | 200K | 4,096 | Aug 2023 |
| Claude 3.5 Sonnet (claude-3-5-sonnet) | Standard Claude 3.5 Sonnet. | 2024 | Text, Image | 200K | 64,000 | 2024 |
| Jamba 1.5 Large (Preview) (jamba-1-5-large) | Advanced AI21 Jamba model. | 2025 | Text | | | |
| Jamba 1.5 Mini (Preview) (jamba-1-5-mini) | Smaller AI21 Jamba 1.5 variant. | 2025 | Text | | | |
| Mistral Medium 3 (mistral-medium-3) | Medium-sized Mistral model. | 2025 | Text | | | |
| Mistral Small 3.1 (mistral-small-3-1-25-03) | Smaller, faster Mistral. | 2025-03 | Text | | | |
| Mistral Large (mistral-large-24-11) | Large Mistral model. | 2024-11 | Text | | | |
| Mistral 7B (mistral-7b) | Base 7B model. | 2023 | Text | | | |
| Mixtral (mixtral) | Mixture-of-experts Mistral model. | 2024 | Text | | | |
| Llama 4 Maverick (llama-4-maverick-17b-128e) | Meta Llama 4 Maverick. | 2025 | Text | | | |
| Llama 4 Scout (llama-4-scout-17b-16e) | Meta Llama 4 Scout. | 2025 | Text | | | |
| Llama 4 (llama-4) | Core large Llama 4 model. | 2025 | Text | | | |
| Llama 3.3 (llama-3-3) | Successor to Llama 3.2. | 2025 | Text | | | |
| Llama 3.2 (Preview) (llama-3-2-preview) | Preview release of Llama 3.2. | 2024 | Text | | | |
| Llama 3.2 (llama-3-2) | Stable release of Llama 3.2. | 2024 | Text | | | |
| Llama 3.2 Vision (llama-3-2-vision) | Multimodal Llama 3.2. | 2024 | Text, Image | | | |
| Llama 3.1 (llama-3-1) | Part of Llama 3 family. | 2024 | Text | | | |
| Llama 3 (llama-3) | Base Llama 3 model. | 2023 | Text | | | |
| Qwen3-Next 80B Thinking (qwen3-next-80b-thinking) | Reasoning-focused Qwen3 variant. | 2025 | Text | | | |
| Qwen3-Next 80B Instruct (qwen3-next-80b-instruct) | Instruction-tuned Qwen3 variant. | 2025 | Text | | | |
| Qwen3 Coder (qwen3-coder) | Qwen3 code-focused model. | 2025 | Text (Code) | | | |
| Qwen3 235B (qwen3-235b) | Very large Qwen3 model. | 2025 | Text | | | |
| Qwen2 (qwen2) | Earlier Qwen release. | 2024 | Text | | | |
| DeepSeek V3.1 (deepseek-v3-1) | Advanced DeepSeek model. | 2025 | Text | | | |
| DeepSeek R1 (deepseek-r1-0528) | Reasoning-focused DeepSeek model. | 2025-05-28 | Text | | | |
| GPT-OSS 120B (gpt-oss-120b) | Open-weight GPT-OSS model. | 2025 | Text | | | |
| GPT-OSS 20B (gpt-oss-20b) | Smaller open-weight GPT-OSS model. | 2025 | Text | | | |
| Phi-3 (phi-3) | Microsoft Phi-3 model. | 2024 | Text | | | |
| Gemma 3n (gemma-3n) | Google Gemma series model. | 2025 | Text | | | |
| Gemma 3 (gemma-3) | Member of Gemma family. | 2025 | Text | | | |
| Gemma 2 (gemma-2) | Earlier Gemma generation. | 2024 | Text | | | |
| Gemma (gemma) | First Gemma release. | 2023 | Text | | | |

Model Selection Guide

By Use Case

Complex Reasoning & Analysis

models = {
    "premium": ["gpt-5-2025-08-07", "gpt-4.1-2025-04-14", "claude-3-5-sonnet-20241022", "gemini-1.5-pro"],
    "balanced": ["gpt-4o", "gpt-4.1-mini-2025-04-14", "claude-3-sonnet-20240229"],
    "reasoning": ["o3-2025-04-16", "o1-2024-12-17", "o3-mini-2025-01-31"]
}

High Volume Processing

models = {
    "fastest": ["gpt-5-nano-2025-08-07", "gpt-4.1-nano-2025-04-14", "claude-3-5-haiku-20241022", "gemini-1.5-flash"],
    "cost_optimised": ["gpt-5-mini-2025-08-07", "gpt-4o-mini", "claude-3-haiku-20240307"],
    "open_source": ["meta.llama3-8b-instruct-v1:0"]
}

Long Context Applications

models = {
    "maximum_context": ["gemini-1.5-pro"],  # 2M tokens
    "large_context": ["gpt-4.1-2025-04-14", "gemini-1.5-flash"],  # 1M tokens
    "standard_large": ["gpt-5-2025-08-07", "claude-3-5-sonnet-20241022", "gpt-4o"]  # 400K/200K/128K
}

Multimodal Applications

models = {
    "vision": ["gpt-5-2025-08-07", "gpt-4o", "gemini-1.5-pro", "o3-2025-04-16"],
    "video": ["gemini-1.5-pro", "gemini-1.5-flash"],
    "images": ["gpt-5-2025-08-07", "gpt-4o", "o4-mini-2025-04-16", "gemini-1.0-pro-vision"]
}
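The use-case maps above are plain dictionaries, so a small helper (hypothetical, not part of any gateway SDK) can turn them into a selection policy with tier fallback. The map below reuses names from the "High Volume Processing" example:

```python
# Hedged sketch: pick the first model from a preferred tier, falling
# back to another tier when the preferred one is empty.
def pick_model(models: dict, preferred_tier: str, fallback_tier: str) -> str:
    for tier in (preferred_tier, fallback_tier):
        if models.get(tier):
            return models[tier][0]
    raise ValueError("No models configured for the requested tiers")

models = {
    "fastest": ["gpt-5-nano-2025-08-07", "gpt-4.1-nano-2025-04-14"],
    "cost_optimised": ["gpt-5-mini-2025-08-07", "gpt-4o-mini"],
}
```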

Dynamic Model Selection (Example Script)

from openai import OpenAI
from typing import Dict, List

class ModelSelector:
    """Intelligent model selection based on requirements"""
    
    # Model capabilities and costs
    MODEL_PROFILES = {
        "gpt-5-2025-08-07": {
            "cost": "highest",
            "speed": "medium",
            "capability": "highest",
            "context": 400000
        },
        "gpt-5-mini-2025-08-07": {
            "cost": "medium",
            "speed": "fast",
            "capability": "highest",
            "context": 400000
        },
        "gpt-5-nano-2025-08-07": {
            "cost": "low",
            "speed": "fastest",
            "capability": "good",
            "context": 400000
        },
        "gpt-4.1-2025-04-14": {
            "cost": "high",
            "speed": "medium",
            "capability": "highest",
            "context": 1047576
        },
        "gpt-4o": {
            "cost": "high",
            "speed": "medium",
            "capability": "highest",
            "context": 128000
        },
        "gpt-4o-mini": {
            "cost": "low",
            "speed": "fast",
            "capability": "good",
            "context": 128000
        },
        "o3-2025-04-16": {
            "cost": "highest",
            "speed": "slow",
            "capability": "highest",
            "context": 200000
        },
        "claude-3-5-sonnet-20241022": {
            "cost": "medium",
            "speed": "fast",
            "capability": "highest",
            "context": 200000
        },
        "claude-3-5-haiku-20241022": {
            "cost": "very_low",
            "speed": "fastest",
            "capability": "good",
            "context": 200000
        },
        "gemini-1.5-pro": {
            "cost": "medium",
            "speed": "medium",
            "capability": "highest",
            "context": 2000000
        },
        "gemini-1.5-flash": {
            "cost": "low",
            "speed": "fast",
            "capability": "good",
            "context": 1000000
        }
    }
    
    def select_model(self, 
                     task_complexity: str,
                     context_size: int,
                     budget: str,
                     speed_requirement: str) -> str:
        """Select optimal model based on requirements"""
        
        suitable_models = []
        
        for model, profile in self.MODEL_PROFILES.items():
            # Check context size
            if context_size > profile["context"]:
                continue
                
            # Check budget constraints
            if budget == "low" and profile["cost"] in ["highest", "high", "medium"]:
                continue
                
            # Check speed requirements
            if speed_requirement == "real-time" and profile["speed"] == "slow":
                continue
                
            # Check capability requirements
            if task_complexity == "complex" and profile["capability"] != "highest":
                continue
                
            suitable_models.append(model)
        
        # Return best match or default
        return suitable_models[0] if suitable_models else "gpt-4o-mini"

# Usage example
selector = ModelSelector()

# Select model for different scenarios
model_for_chat = selector.select_model(
    task_complexity="simple",
    context_size=1000,
    budget="low",
    speed_requirement="real-time"
)  # Returns the first model that satisfies all constraints

model_for_analysis = selector.select_model(
    task_complexity="complex",
    context_size=150000,
    budget="high",
    speed_requirement="normal"
)  # Returns the first high-capability model with sufficient context

Cost-Optimised Model Routing (Example Script)

import json
from typing import Dict, Optional
from openai import OpenAI

class CostOptimisedRouter:
    """Route requests to most cost-effective model"""
    
    # Cost per 1K tokens (input/output estimated)
    MODEL_COSTS = {
        "gpt-4o": {"input": 0.00250, "output": 0.01000},
        "gpt-4o-mini": {"input": 0.00015, "output": 0.00060},
        "claude-3-5-sonnet-20241022": {"input": 0.00300, "output": 0.01500},
        "claude-3-5-haiku-20241022": {"input": 0.00025, "output": 0.00125},
        "gemini-1.5-flash": {"input": 0.00010, "output": 0.00040},
        "gpt-3.5-turbo": {"input": 0.00050, "output": 0.00150}
    }
    
    def estimate_cost(self, model: str, input_tokens: int, output_tokens: int) -> float:
        """Estimate cost for a request"""
        costs = self.MODEL_COSTS.get(model)
        if costs is None:
            # Treat unknown pricing as infinitely expensive so an
            # unpriced model can never win the cheapest-model comparison
            return float("inf")
        input_cost = (input_tokens / 1000) * costs["input"]
        output_cost = (output_tokens / 1000) * costs["output"]
        return input_cost + output_cost
    
    def select_cheapest_capable_model(self, 
                                      task_type: str,
                                      estimated_tokens: int) -> str:
        """Select cheapest model capable of the task"""
        
        # Define capable models by task type
        capable_models = {
            "simple": ["gpt-4o-mini", "claude-3-5-haiku-20241022", "gpt-3.5-turbo"],
            "moderate": ["gpt-4o-mini", "claude-3-5-haiku-20241022", "gemini-1.5-flash"],
            "complex": ["gpt-4o", "claude-3-5-sonnet-20241022", "gemini-1.5-pro"]
        }
        
        models = capable_models.get(task_type, capable_models["simple"])
        
        # Calculate costs and select cheapest
        cheapest = min(models, key=lambda m: self.estimate_cost(
            m, estimated_tokens, estimated_tokens // 2
        ))
        
        return cheapest
    
    def route_request(self, prompt: str, complexity: str = "auto") -> Dict:
        """Route request to optimal model"""
        
        # Auto-detect complexity if needed
        if complexity == "auto":
            prompt_length = len(prompt)
            if prompt_length < 100:
                complexity = "simple"
            elif prompt_length < 500:
                complexity = "moderate"
            else:
                complexity = "complex"
        
        # Estimate tokens (rough estimate)
        estimated_tokens = len(prompt) // 4
        
        # Select model
        model = self.select_cheapest_capable_model(complexity, estimated_tokens)
        
        # Get provider for model
        provider_map = {
            "gpt-4o": "openai",
            "gpt-4o-mini": "openai",
            "gpt-3.5-turbo": "openai",
            "claude-3-5-sonnet-20241022": "anthropic",
            "claude-3-5-haiku-20241022": "anthropic",
            "gemini-1.5-pro": "google",
            "gemini-1.5-flash": "google"
        }
        
        return {
            "model": model,
            "provider": provider_map[model],
            "estimated_cost": self.estimate_cost(model, estimated_tokens, estimated_tokens // 2),
            "reasoning": f"Selected {model} as cheapest option for {complexity} task"
        }

# Usage
router = CostOptimisedRouter()

# Simple query - routes to cheapest model
result = router.route_request("What is 2+2?")
print(f"Model: {result['model']}, Cost: ${result['estimated_cost']:.6f}")

# Complex query - routes to capable but cost-effective model
result = router.route_request(
    "Analyse this 10-page legal document and identify key risks...",
    complexity="complex"
)
print(f"Model: {result['model']}, Cost: ${result['estimated_cost']:.4f}")

A/B Testing Different Models (Example Script)

import asyncio
import time
from typing import List, Dict
from openai import OpenAI

class ModelABTester:
    """A/B test different models for quality and performance"""
    
    def __init__(self, gateway_url: str):
        self.gateway_url = gateway_url
        self.results = []
    
    def create_client(self, provider: str, credentials: dict) -> OpenAI:
        """Create client for specific provider"""
        headers = {"x-provider-name": provider}
        headers.update(credentials)
        
        return OpenAI(
            api_key="dummy",
            base_url=self.gateway_url,
            default_headers=headers
        )
    
    async def test_model(self, 
                         model: str, 
                         provider: str,
                         credentials: dict,
                         prompt: str) -> Dict:
        """Test a single model"""
        
        client = self.create_client(provider, credentials)
        
        start_time = time.time()
        try:
            # Run the blocking SDK call in a worker thread so that
            # asyncio.gather actually runs the model tests concurrently
            response = await asyncio.to_thread(
                client.chat.completions.create,
                model=model,
                messages=[{"role": "user", "content": prompt}],
                max_tokens=500
            )

            end_time = time.time()
            
            return {
                "model": model,
                "provider": provider,
                "success": True,
                "latency": end_time - start_time,
                "response": response.choices[0].message.content,
                "tokens_used": response.usage.total_tokens if response.usage else 0
            }
        except Exception as e:
            return {
                "model": model,
                "provider": provider,
                "success": False,
                "error": str(e)
            }
    
    async def run_ab_test(self, 
                         test_configs: List[Dict],
                         prompt: str) -> List[Dict]:
        """Run A/B test across multiple models"""
        
        tasks = []
        for config in test_configs:
            task = self.test_model(
                config["model"],
                config["provider"],
                config["credentials"],
                prompt
            )
            tasks.append(task)
        
        results = await asyncio.gather(*tasks)
        return results
    
    def analyse_results(self, results: List[Dict]) -> Dict:
        """Analyse A/B test results"""
        
        successful = [r for r in results if r.get("success")]
        
        if not successful:
            return {"error": "All models failed"}
        
        # Find best by latency
        fastest = min(successful, key=lambda x: x["latency"])
        
        # Calculate averages
        avg_latency = sum(r["latency"] for r in successful) / len(successful)
        
        return {
            "models_tested": len(results),
            "successful": len(successful),
            "fastest_model": fastest["model"],
            "fastest_latency": fastest["latency"],
            "average_latency": avg_latency,
            "results": results
        }

# Usage example
async def main():
    tester = ModelABTester("https://gateway.altrum.ai/v1")
    
    # Configure models to test
    test_configs = [
        {
            "model": "gpt-4o-mini",
            "provider": "openai",
            "credentials": {"Authorization": "Bearer key"}
        },
        {
            "model": "claude-3-5-haiku-20241022",
            "provider": "anthropic",
            "credentials": {"x-api-key": "key"}
        },
        {
            "model": "gemini-1.5-flash",
            "provider": "google",
            "credentials": {"x-goog-api-key": "key"}
        }
    ]
    
    # Run test
    prompt = "Write a haiku about cloud computing"
    results = await tester.run_ab_test(test_configs, prompt)
    
    # Analyse
    analysis = tester.analyse_results(results)
    print(f"Fastest model: {analysis['fastest_model']}")
    print(f"Latency: {analysis['fastest_latency']:.2f}s")

# Run the test
asyncio.run(main())

Model Comparison Matrix

| Provider | Model | Context | Speed | Cost | Best For |
|-----------|-------|---------|-------|------|----------|
| OpenAI | gpt-5-2025-08-07 | 400K | Medium | Highest | Flagship coding, reasoning, agentic tasks |
| OpenAI | gpt-5-mini-2025-08-07 | 400K | Fast | Medium | Cost-efficient GPT-5 for defined tasks |
| OpenAI | gpt-5-nano-2025-08-07 | 400K | Fastest | Low | Fastest, cheapest GPT-5 variant |
| OpenAI | gpt-4.1-2025-04-14 | 1M | Medium | High | Instruction following, tool use |
| OpenAI | gpt-4o | 128K | Medium | High | Complex reasoning, multimodal |
| OpenAI | gpt-4o-mini | 128K | Fast | Low | Simple tasks, high volume |
| OpenAI | o3-2025-04-16 | 200K | Slow | Highest | Math, science, coding, multimodal analysis |
| OpenAI | o1-2024-12-17 | 200K | Slow | High | Step-by-step reasoning |
| Anthropic | claude-3-5-sonnet | 200K | Fast | Medium | Coding, analysis |
| Anthropic | claude-3-5-haiku | 200K | Fastest | Very Low | Real-time apps |
| Anthropic | claude-3-opus | 200K | Slow | Very High | Complex research |
| Google | gemini-1.5-pro | 2M | Medium | Medium | Massive documents |
| Google | gemini-1.5-flash | 1M | Fast | Low | High-speed processing |
| Bedrock | llama3-1-70b | 8K | Medium | Low | Open-source needs |
| Bedrock | titan-premier | 8K | Fast | Low | AWS integration |
| Bedrock | mistral-large | 32K | Medium | Medium | European compliance |
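
The matrix can also be kept as data so routing code can filter on it programmatically. A minimal sketch, with a handful of entries copied from the table above; treating the cost column as ordinal tiers is an assumption made for illustration, not a gateway feature:

```python
# Ordinal cost tiers (an illustrative encoding of the table's cost column).
COST_TIERS = {"Very Low": 0, "Low": 1, "Medium": 2, "High": 3, "Very High": 4, "Highest": 5}

# Context windows in tokens and cost tiers, taken from the matrix above.
MODELS = {
    "gpt-4o-mini":       {"provider": "openai",    "context": 128_000,   "cost": "Low"},
    "gpt-4o":            {"provider": "openai",    "context": 128_000,   "cost": "High"},
    "claude-3-5-haiku":  {"provider": "anthropic", "context": 200_000,   "cost": "Very Low"},
    "gemini-1.5-pro":    {"provider": "google",    "context": 2_000_000, "cost": "Medium"},
}

def cheapest_with_context(min_context: int) -> str:
    """Return the lowest-cost model whose context window covers min_context tokens."""
    candidates = {m: v for m, v in MODELS.items() if v["context"] >= min_context}
    return min(candidates, key=lambda m: COST_TIERS[candidates[m]["cost"]])
```

For example, a 150K-token document rules out the 128K-context models, so the helper falls through to the cheapest remaining option.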

Best Practices

1. Model Selection Strategy

def select_model_strategy(requirements):
    """Strategic model selection based on requirements"""
    
    strategies = {
        "quality_first": [
            "gpt-5-2025-08-07",
            "gpt-4.1-2025-04-14",
            "claude-3-5-sonnet-20241022",
            "gemini-1.5-pro"
        ],
        "speed_first": [
            "gpt-5-nano-2025-08-07",
            "claude-3-5-haiku-20241022",
            "gpt-4o-mini",
            "gemini-1.5-flash"
        ],
        "cost_first": [
            "gpt-5-nano-2025-08-07",
            "gpt-4o-mini",
            "claude-3-haiku-20240307",
            "meta.llama3-8b-instruct-v1:0"
        ],
        "context_first": [
            "gemini-1.5-pro",  # 2M tokens
            "gpt-4.1-2025-04-14",  # 1M tokens
            "gpt-5-2025-08-07",  # 400K tokens
            "claude-3-5-sonnet-20241022"  # 200K tokens
        ]
    }
    
    return strategies.get(requirements.get("priority"), strategies["quality_first"])

2. Fallback Chains

class ModelFallbackChain:
    """Implement fallback chains for reliability"""
    
    def __init__(self):
        self.fallback_chains = {
            "premium": [
                "gpt-5-2025-08-07",
                "gpt-4.1-2025-04-14",
                "claude-3-5-sonnet-20241022",
                "gemini-1.5-pro",
                "gpt-4o"
            ],
            "efficient": [
                "gpt-5-nano-2025-08-07",
                "gpt-5-mini-2025-08-07",
                "claude-3-5-haiku-20241022",
                "gpt-4o-mini",
                "gemini-1.5-flash"
            ]
        }
    
    def execute_with_fallback(self, chain_type, prompt):
        """Execute with automatic fallback"""
        
        chain = self.fallback_chains[chain_type]
        
        for model in chain:
            try:
                # call_model is expected to issue the gateway request for the model
                return self.call_model(model, prompt)
            except Exception as e:
                print(f"Model {model} failed: {e}")
                continue
        
        raise RuntimeError("All models in fallback chain failed")
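
The failover behaviour can be seen in a self-contained form with a stubbed `call_model` that simulates an outage; the stub is hypothetical and stands in for the real gateway request:

```python
class StubFallbackChain:
    """Demonstrates fallback ordering with a simulated provider outage."""

    def __init__(self, down: set):
        self.down = down  # models currently failing
        self.chain = ["gpt-5-2025-08-07", "claude-3-5-sonnet-20241022", "gpt-4o"]

    def call_model(self, model: str, prompt: str) -> str:
        # Hypothetical stub: a real implementation would call the gateway here.
        if model in self.down:
            raise ConnectionError(f"{model} unavailable")
        return f"{model}: response to {prompt!r}"

    def execute_with_fallback(self, prompt: str) -> str:
        for model in self.chain:
            try:
                return self.call_model(model, prompt)
            except ConnectionError:
                continue  # try the next model in the chain
        raise RuntimeError("All models in fallback chain failed")
```

With the first model marked down, the chain transparently serves the request from the second model; callers never see the outage.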

3. Cost Monitoring

class CostMonitor:
    """Monitor and control model costs"""
    
    def __init__(self, monthly_budget: float):
        self.monthly_budget = monthly_budget
        self.current_spend = 0.0
        self.model_usage = {}
    
    def track_usage(self, model: str, tokens_in: int, tokens_out: int):
        """Track model usage and costs"""
        
        cost = self.calculate_cost(model, tokens_in, tokens_out)
        self.current_spend += cost
        
        if model not in self.model_usage:
            self.model_usage[model] = {"calls": 0, "cost": 0}
        
        self.model_usage[model]["calls"] += 1
        self.model_usage[model]["cost"] += cost
        
        # Alert if approaching budget
        if self.current_spend > self.monthly_budget * 0.8:
            self.send_budget_alert()
    
    def get_usage_report(self):
        """Generate usage report"""
        
        return {
            "total_spend": self.current_spend,
            "budget_remaining": self.monthly_budget - self.current_spend,
            "model_breakdown": self.model_usage
        }
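
`CostMonitor` assumes a `calculate_cost` method. A minimal sketch using a static price table; the prices below are illustrative only, and real gateway pricing should be substituted:

```python
# Illustrative USD prices per million tokens -- not official pricing.
PRICING = {
    "gpt-4o":      {"input": 2.50, "output": 10.00},
    "gpt-4o-mini": {"input": 0.15, "output": 0.60},
}

def calculate_cost(model: str, tokens_in: int, tokens_out: int) -> float:
    """Compute request cost from token counts; unknown models default to zero."""
    price = PRICING.get(model)
    if price is None:
        return 0.0
    return (tokens_in * price["input"] + tokens_out * price["output"]) / 1_000_000
```

Defaulting unknown models to zero keeps tracking from crashing, at the cost of silently under-counting; raising instead is equally reasonable.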

Migration Guide

From Single Model to Multi-Model

# Before: Single model deployment
client = OpenAI(api_key="key")
response = client.chat.completions.create(
    model="gpt-4",
    messages=[...]
)

# After: Multi-model with intelligent selection
class MultiModelClient:
    def __init__(self):
        self.gateway = "https://gateway.altrum.ai/v1"
    
    def create_completion(self, messages, requirements=None):
        # Select model based on requirements
        model = self.select_optimal_model(requirements)
        
        # Create client for selected model
        client = self.create_client_for_model(model)
        
        # Execute with automatic fallback
        return self.execute_with_fallback(client, model, messages)
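
The `select_optimal_model` step above is left abstract; one minimal, self-contained interpretation maps a requirements dict to a model name. The mapping below is illustrative, not prescriptive:

```python
def select_optimal_model(requirements: dict = None) -> str:
    """Pick a model from illustrative requirement flags (hypothetical mapping)."""
    requirements = requirements or {}
    if requirements.get("context_tokens", 0) > 400_000:
        return "gemini-1.5-pro"         # largest context window
    if requirements.get("latency_sensitive"):
        return "gpt-5-nano-2025-08-07"  # fastest, cheapest GPT-5 variant
    if requirements.get("budget") == "low":
        return "gpt-4o-mini"
    return "gpt-5-2025-08-07"           # quality-first default
```

Callers that pass no requirements get the quality-first default, so existing single-model call sites migrate without behaviour changes.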

Conclusion

The Multi-Model Support feature transforms AI deployment from a single-model dependency into a flexible, optimised multi-model strategy within your Production AI Stack. With access to 55+ models across 7 providers, organisations can select the best model for each use case, optimise costs, ensure reliability through redundancy, and stay at the forefront of AI innovation.