Understanding Foundation Models in AWS Bedrock
Difficulty: beginner
Estimated time: 25 minutes
“You know AWS Bedrock gives you access to multiple AI models, but which one should you choose? Let’s cut through the confusion and find the right tool for your job.”
The Problem
Scenario: You’re implementing AI capabilities for your enterprise application using AWS Bedrock. As you start exploring, you realize there are numerous foundation models available from different providers: Anthropic, Amazon, Meta, Cohere, AI21 Labs, and Stability AI.
Each model has different capabilities, pricing structures, token limits, and performance characteristics. You need to select the right model for your specific use cases, but you’re overwhelmed by the options.
Your specific challenges include:
- Understanding the real-world strengths and weaknesses of each model family
- Determining which models offer the best price-performance ratio for your tasks
- Identifying which models support advanced features like streaming or structured outputs
- Finding the right balance between model quality and cost for production use
Key Concepts Explained
Before diving into specific models, let’s clarify what makes foundation models different from one another.
What Differentiates Foundation Models
Think of foundation models like different professional tools: a hammer, a screwdriver, and a wrench each have their purpose. You could use a hammer for everything, but many jobs are better served by a different tool.
Foundation models differ in several key dimensions:
- Training Data: The content they were trained on shapes their knowledge and abilities
- Architecture: Different internal structures affect reasoning capabilities and efficiency
- Size: Larger models generally perform better but cost more and run slower
- Specialization: Some models excel at specific tasks like code generation or reasoning
- Context Window: How much text they can process and “remember” at once
- Instruction Following: How well they adhere to specific directions
- Output Control: Ability to generate structured outputs or follow formatting rules
Understanding Model Versions
Model capabilities evolve over time with new versions. For example, Anthropic’s Claude has progressed from Claude 1 to Claude 2 to Claude 3, with each generation offering improved capabilities.
Within a given model family, you’ll often see variations like:
- Opus/Large: The most capable (and expensive) version
- Sonnet/Medium: A balanced option for most use cases
- Haiku/Small: Faster and less expensive, but with reduced capabilities
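If you want to see which model versions and variants are currently enabled for your AWS account, the Bedrock control-plane API can list them. A minimal sketch (the provider filter value is just an example):

import boto3

# The 'bedrock' client is the control plane; 'bedrock-runtime' is used for inference.
bedrock = boto3.client('bedrock')

# List foundation models, optionally filtered by provider (e.g., Anthropic).
response = bedrock.list_foundation_models(byProvider='Anthropic')

for model in response['modelSummaries']:
    print(model['modelId'], '-', model['modelName'])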
Model Families in AWS Bedrock
Let’s explore the major model families available in AWS Bedrock:
Anthropic Claude Models
Claude models excel at:
- Following complex instructions precisely
- Nuanced reasoning and analysis
- Safety and reducing harmful outputs
- Long context processing (up to 200K tokens in Claude 3 Opus)
Best For:
- Enterprise applications requiring reliability and safety
- Complex reasoning tasks and content generation
- Applications needing long context windows
- Situations requiring nuanced understanding and responses
Available Models:
- Claude 3 Opus: The most capable Claude model
- Claude 3 Sonnet: Balanced performance and cost
- Claude 3 Haiku: Fastest and most cost-effective
- Claude 2 and Claude Instant (older generations)
Code Example:
import boto3
import json

bedrock = boto3.client('bedrock-runtime')

claude_prompt = {
    "anthropic_version": "bedrock-2023-05-31",
    "max_tokens": 1000,
    "messages": [
        {"role": "user", "content": "Analyze the following contract clause and explain any potential risks: 'The party shall make best efforts to deliver services in a timely manner.'"}
    ]
}

response = bedrock.invoke_model(
    modelId='anthropic.claude-3-sonnet-20240229-v1:0',
    body=json.dumps(claude_prompt)
)

response_body = json.loads(response['body'].read())
print(response_body['content'][0]['text'])
Amazon Titan Models
Titan models offer:
- Tight integration with AWS services
- Good balance of performance and cost
- Text and image generation capabilities
- Embeddings for retrieval and classification
Best For:
- Cost-sensitive applications
- Applications requiring deep AWS integration
- Baseline text generation and summarization
- Embedding generation for vector databases
Available Models:
- Titan Text (Express and Lite)
- Titan Embeddings
- Titan Image Generator
Code Example:
import boto3
import json

bedrock = boto3.client('bedrock-runtime')

titan_prompt = {
    "inputText": "Summarize the key features of AWS Bedrock in 3 bullet points.",
    "textGenerationConfig": {
        "maxTokenCount": 500,
        "temperature": 0.7,
        "topP": 0.9
    }
}

response = bedrock.invoke_model(
    modelId='amazon.titan-text-express-v1',
    body=json.dumps(titan_prompt)
)

response_body = json.loads(response['body'].read())
print(response_body['results'][0]['outputText'])
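Since the Titan family also covers embeddings, here is a hedged sketch of generating an embedding vector for use with a vector database. The model ID shown (amazon.titan-embed-text-v1) is the first-generation Titan text embedding model; exact response fields may differ for newer versions:

import boto3
import json

bedrock = boto3.client('bedrock-runtime')

# Titan Embeddings takes a single inputText field and returns a vector.
embedding_request = {
    "inputText": "AWS Bedrock provides access to multiple foundation models."
}

response = bedrock.invoke_model(
    modelId='amazon.titan-embed-text-v1',
    body=json.dumps(embedding_request)
)

response_body = json.loads(response['body'].read())
embedding = response_body['embedding']  # list of floats
print(f"Embedding dimension: {len(embedding)}")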
Meta Llama 2 Models
Llama 2 models provide:
- Strong performance at lower cost than some alternatives
- Open weights (outside of Bedrock) for local deployment
- Good code generation capabilities
- Effective for conversational applications
Best For:
- Applications needing a good balance of cost and performance
- Code generation and technical content
- Chatbots and conversational agents
- Projects that might later require local deployment
Available Models:
- Llama 2 (13B and 70B parameters)
- Llama 2 Chat (tuned for conversations)
Code Example:
import boto3
import json

bedrock = boto3.client('bedrock-runtime')

llama_prompt = {
    "prompt": "<s>[INST] Write a Python function to calculate the Fibonacci sequence up to n [/INST]",
    "max_gen_len": 512,
    "temperature": 0.7,
    "top_p": 0.9
}

response = bedrock.invoke_model(
    modelId='meta.llama2-13b-chat-v1',
    body=json.dumps(llama_prompt)
)

response_body = json.loads(response['body'].read())
print(response_body['generation'])
Cohere Command Models
Cohere models excel at:
- Multilingual capabilities across many languages (Cohere's multilingual embedding models cover 100+ languages)
- Search and retrieval tasks
- Text classification and analysis
- High-quality summarization
Best For:
- International applications requiring multiple languages
- Search functionality and semantic matching
- Content summarization and classification
- Retrieval pipelines built on multilingual embeddings
Available Models:
- Command (text generation)
- Command Light (faster, more efficient)
- Command R (optimized for retrieval-augmented generation and longer-context tasks)
- Embed (multilingual embeddings)
Code Example:
import boto3
import json

bedrock = boto3.client('bedrock-runtime')

cohere_prompt = {
    "prompt": "Translate the following English text to French, Spanish, and German: 'Welcome to our global platform.'",
    "max_tokens": 500,
    "temperature": 0.7
}

response = bedrock.invoke_model(
    modelId='cohere.command-text-v14',
    body=json.dumps(cohere_prompt)
)

response_body = json.loads(response['body'].read())
print(response_body['generations'][0]['text'])
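The Embed model listed above uses a different request shape than Command. A hedged sketch, assuming the multilingual v3 embedding model is enabled in your account (the field names follow the Cohere Embed format on Bedrock, but check the current model documentation):

import boto3
import json

bedrock = boto3.client('bedrock-runtime')

# Cohere Embed takes a list of texts plus an input_type hint.
embed_request = {
    "texts": ["Bienvenue sur notre plateforme mondiale.", "Welcome to our global platform."],
    "input_type": "search_document"
}

response = bedrock.invoke_model(
    modelId='cohere.embed-multilingual-v3',
    body=json.dumps(embed_request)
)

response_body = json.loads(response['body'].read())
for vector in response_body['embeddings']:
    print(f"Vector length: {len(vector)}")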
AI21 Jurassic Models
Jurassic models are strong in:
- Structured text generation
- Numerical reasoning and analysis
- Specialized document tasks
- Handling specific formats reliably
Best For:
- Financial and data-heavy applications
- Applications requiring structured outputs
- Document summarization and analysis
- Mathematical and numerical reasoning
Available Models:
- Jurassic-2 (various sizes)
Code Example:
import boto3
import json

bedrock = boto3.client('bedrock-runtime')

jurassic_prompt = {
    "prompt": "Calculate the compound interest on a loan of $10,000 with an annual interest rate of 5% over 3 years.",
    "maxTokens": 500,
    "temperature": 0.7
}

response = bedrock.invoke_model(
    modelId='ai21.j2-mid-v1',
    body=json.dumps(jurassic_prompt)
)

response_body = json.loads(response['body'].read())
print(response_body['completions'][0]['data']['text'])
Stability AI Models
Stability AI provides:
- High-quality image generation from text prompts
- Style control and artistic variations
- Support for various image dimensions
- Fast generation times
Best For:
- Creating marketing visuals
- Product design and visualization
- Creative and artistic applications
- Generating custom imagery for content
Available Models:
- Stable Diffusion XL
- Stable Diffusion 3
Code Example:
import boto3
import json
import base64
import io
from PIL import Image

bedrock = boto3.client('bedrock-runtime')

stability_prompt = {
    "text_prompts": [
        {
            "text": "A futuristic city with flying cars and tall buildings, digital art style",
            "weight": 1.0
        }
    ],
    "cfg_scale": 7,
    "steps": 30,
    "seed": 42,
    "width": 1024,
    "height": 1024
}

response = bedrock.invoke_model(
    modelId='stability.stable-diffusion-xl-v1',
    body=json.dumps(stability_prompt)
)

response_body = json.loads(response['body'].read())
image_bytes = base64.b64decode(response_body['artifacts'][0]['base64'])

# Save or display the image
image = Image.open(io.BytesIO(image_bytes))
image.save("generated_image.png")
Choosing the Right Model for Your Task
Here’s a practical framework for selecting the most appropriate model:
Step 1: Define Your Requirements
Start by clearly defining what you need:
- Task type: Text generation, conversation, summarization, code, image generation
- Quality threshold: How critical is perfect output quality?
- Speed requirements: Is response time critical?
- Budget constraints: What are your cost limitations?
- Context needs: How much input context do you need to process?
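One lightweight way to make these requirements concrete is to capture them in a simple structure that later feeds your model-selection logic. A minimal sketch; the field names are illustrative and not part of any Bedrock API:

# Illustrative requirements profile; the keys are arbitrary and only used by your own selection code.
requirements = {
    "task_type": "summarization",       # e.g., generation, conversation, code, image
    "quality_threshold": "high",        # how critical is perfect output quality?
    "max_latency_seconds": 2.0,         # is response time critical?
    "max_cost_per_request_usd": 0.01,   # budget constraint
    "max_input_tokens": 8000            # how much context must fit?
}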
Step 2: Match Your Task Type to Model Strengths
Here’s a quick reference guide:
| Task | Recommended Models | Reasoning |
|---|---|---|
| General Content Creation | Claude 3 Sonnet, Titan Text | Good balance of quality and cost |
| Customer Support Chatbot | Claude 3 Haiku, Llama 2 Chat | Fast responses, good conversation handling |
| Legal/Financial Analysis | Claude 3 Opus, Jurassic-2 | Strong reasoning, handles complex documents |
| Code Generation | Llama 2, Claude 3 | Strong technical capabilities |
| Multilingual Applications | Cohere Command | Superior multilingual support |
| Image Generation | Stable Diffusion XL | High-quality image creation |
| Knowledge Base QA | Claude with long context | Can process lengthy documents |
Step 3: Consider Pricing
Price comparisons (approximate, please check current AWS pricing):
| Model | Input Tokens (per 1M) | Output Tokens (per 1M) | Relative Cost |
|---|---|---|---|
| Claude 3 Opus | $15 | $75 | Highest |
| Claude 3 Sonnet | $3 | $15 | High |
| Claude 3 Haiku | $0.25 | $1.25 | Medium |
| Titan Text Express | $0.20 | $0.30 | Low |
| Llama 2 70B | $0.80 | $1.10 | Medium |
| Cohere Command | $1 | $2 | Medium |
| Jurassic-2 Mid | $0.50 | $1.50 | Medium |
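As a quick worked example using the approximate figures above: a Claude 3 Sonnet request with 2,000 input tokens and 500 output tokens would cost roughly (2,000 / 1,000,000) × $3 + (500 / 1,000,000) × $15 ≈ $0.006 + $0.0075 ≈ $0.0135, while the same request on Claude 3 Haiku would come to roughly $0.0011.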
Step 4: Evaluate and Iterate
Don’t rely solely on specifications – test models with your actual use cases:
- Create a representative set of example inputs
- Run them through your candidate models
- Evaluate the outputs against your requirements
- Consider both quality and cost metrics
- Iterate based on real-world performance
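A representative test set can be as simple as a list of prompts paired with the criteria you will score against; the invocation harness itself is built out in the challenge later in this section. A minimal sketch (the prompts and criteria are illustrative):

# Illustrative evaluation set; pair each prompt with the criteria you care about.
evaluation_set = [
    {
        "prompt": "Summarize our Q3 incident report in three bullet points.",
        "criteria": ["factual accuracy", "brevity", "follows bullet format"]
    },
    {
        "prompt": "Draft a polite response to a customer asking for a refund.",
        "criteria": ["tone", "policy compliance"]
    },
]

candidate_models = [
    "anthropic.claude-3-haiku-20240307-v1:0",
    "amazon.titan-text-express-v1",
]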
Feature Comparison
Here’s a feature comparison to help with your decision:
| Feature | Claude 3 | Titan | Llama 2 | Cohere | Jurassic-2 |
|---|---|---|---|---|---|
| Max Context Window | Up to 200K tokens | 8K tokens | 4K tokens | 8K tokens | 8K tokens |
| Streaming Support | Yes | Yes | Yes | Yes | Yes |
| Structured Output | Yes | Limited | Limited | Yes | Yes |
| Multilingual Support | Good | Basic | Basic | Excellent | Good |
| Fine-tuning Support | Yes | Yes | Yes | Yes | Yes |
| Safety Guardrails | Strong | Moderate | Moderate | Moderate | Moderate |
| Code Generation | Good | Basic | Strong | Basic | Basic |
| Image Understanding | Yes (Claude 3) | No | No | No | No |
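The streaming row in the table maps to the Bedrock InvokeModelWithResponseStream API, which returns output incrementally instead of waiting for the full completion. A hedged sketch using the Claude Messages format shown earlier; chunk parsing details differ between model families:

import boto3
import json

bedrock = boto3.client('bedrock-runtime')

body = json.dumps({
    "anthropic_version": "bedrock-2023-05-31",
    "max_tokens": 500,
    "messages": [{"role": "user", "content": "List three benefits of streaming responses."}]
})

response = bedrock.invoke_model_with_response_stream(
    modelId='anthropic.claude-3-haiku-20240307-v1:0',
    body=body
)

# Each event carries a JSON chunk; for Claude, text arrives in content_block_delta events.
for event in response['body']:
    chunk = json.loads(event['chunk']['bytes'])
    if chunk.get('type') == 'content_block_delta':
        print(chunk['delta'].get('text', ''), end='', flush=True)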
Common Pitfalls and Troubleshooting
Pitfall #1: Model Overspending
Problem: Using expensive models for simple tasks that don’t require their capabilities.
Solution: Start with smaller models and only upgrade when necessary. For example, use Claude 3 Haiku for most tasks and only upgrade to Sonnet or Opus when you need more sophisticated reasoning or larger context windows.
Pitfall #2: Ignoring Context Windows
Problem: Trying to process documents larger than a model’s context window.
Solution: Implement chunking strategies or use models with larger context windows like Claude 3 Opus. Be aware of token limits when designing your application.
def process_large_document(document, chunk_size=8000, overlap=500):
    """Process a document that exceeds the context window by chunking."""
    tokens = tokenize_text(document)  # Implement appropriate tokenization
    chunks = []
    for i in range(0, len(tokens), chunk_size - overlap):
        chunk = tokens[i:i + chunk_size]
        chunks.append(detokenize(chunk))  # Convert back to text

    results = []
    for chunk in chunks:
        # Process each chunk with the model
        result = process_with_model(chunk)
        results.append(result)

    # Combine results appropriately for your use case
    return combine_results(results)
Pitfall #3: Not Accounting for Token Costs
Problem: Unexpected high costs due to not estimating token usage.
Solution: Implement token counting and budget management:
def estimate_cost(input_text, expected_output_length, model="claude-3-sonnet"):
    """Estimate cost for a model request."""
    # Rough token estimation (implement proper tokenization for accuracy)
    input_tokens = len(input_text.split()) * 1.3
    output_tokens = expected_output_length

    # Example per-token pricing in USD (would need to be updated)
    pricing = {
        "claude-3-opus": {"input": 0.000015, "output": 0.000075},
        "claude-3-sonnet": {"input": 0.000003, "output": 0.000015},
        "claude-3-haiku": {"input": 0.00000025, "output": 0.00000125},
        "titan-text-express": {"input": 0.0000002, "output": 0.0000003}
    }

    cost = (input_tokens * pricing[model]["input"]) + (output_tokens * pricing[model]["output"])
    return cost
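Used before every request, such an estimator lets you enforce a per-request budget; the threshold below is illustrative:

prompt = "Summarize the attached architecture review in five bullet points."
estimated = estimate_cost(prompt, expected_output_length=400, model="claude-3-haiku")

# Illustrative guardrail: fall back to a cheaper model or refuse if over budget.
if estimated > 0.01:
    print(f"Estimated cost ${estimated:.4f} exceeds the per-request budget")
else:
    print(f"Proceeding, estimated cost ${estimated:.4f}")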
Try It Yourself Challenge
Now it’s your turn to gain hands-on experience with different foundation models:
Challenge: Model Comparison Test
- Create a simple test to compare 2-3 different models on the same task
- Implement code that:
  - Sends the same prompt to multiple models
  - Captures response quality, token usage, and latency
  - Provides a simple scoring mechanism for comparison
Starting Code:
import boto3
import json
import time

def model_comparison_test(prompt, models):
    """
    Compare different models' performance on the same prompt.

    Args:
        prompt: The text prompt to send to each model
        models: List of dictionaries with model information

    Returns:
        List of per-model result dictionaries
    """
    bedrock = boto3.client('bedrock-runtime')
    results = []

    for model in models:
        model_id = model["id"]
        formatted_prompt = format_prompt_for_model(prompt, model_id)

        start_time = time.time()
        try:
            response = bedrock.invoke_model(
                modelId=model_id,
                body=json.dumps(formatted_prompt)
            )
            end_time = time.time()
            latency = end_time - start_time

            response_body = json.loads(response['body'].read())
            output = extract_output(response_body, model_id)

            # Add your metrics and evaluation here
            result = {
                "model_id": model_id,
                "latency_seconds": latency,
                "output": output,
                "success": True
            }
        except Exception as e:
            result = {
                "model_id": model_id,
                "error": str(e),
                "success": False
            }

        results.append(result)

    return results

def format_prompt_for_model(prompt, model_id):
    """Format the prompt appropriately for each model."""
    # Implement formatting logic for different model types
    pass

def extract_output(response_body, model_id):
    """Extract the output text from the model-specific response format."""
    # Implement extraction logic for different model types
    pass

# TODO: Complete the implementation of the helper functions
# and run the comparison test with your own prompt
Expected Outcome: A working script that provides quantitative and qualitative comparison data for different foundation models on the same task.
Beyond the Basics
Once you’ve selected your foundation models, consider these advanced strategies:
1. Model Ensembles
Combine multiple models to improve reliability and quality:
def ensemble_generation(prompt, models, combination_strategy="voting"):
    """Generate text using an ensemble of models."""
    # invoke_model, majority_vote, highest_confidence, and average_results are
    # placeholders for your own invocation and aggregation helpers.
    results = []

    # Get responses from all models
    for model in models:
        result = invoke_model(model, prompt)
        results.append(result)

    if combination_strategy == "voting":
        # Implement a voting mechanism for classification tasks
        return majority_vote(results)
    elif combination_strategy == "confidence":
        # Return the result with the highest confidence score
        return highest_confidence(results)
    elif combination_strategy == "average":
        # For numeric predictions, average the results
        return average_results(results)
    else:
        # Default: return all results for manual evaluation
        return results
2. Automatic Model Selection
Dynamically select models based on input characteristics:
def auto_select_model(input_text, max_cost=None):
    """Automatically select an appropriate model based on the input."""
    input_length = len(input_text.split())
    contains_code = any(marker in input_text for marker in ["def ", "class ", "function", "```"])
    is_multilingual = detect_non_english(input_text)  # Placeholder: implement language detection (e.g., with langdetect)

    if contains_code:
        return "meta.llama2-13b-chat-v1"  # Good for code
    elif is_multilingual:
        return "cohere.command-text-v14"  # Strong multilingual support
    elif input_length > 6000:
        return "anthropic.claude-3-opus-20240229-v1:0"  # Large context window
    elif max_cost and max_cost < 0.001:
        return "amazon.titan-text-express-v1"  # Most affordable
    else:
        return "anthropic.claude-3-haiku-20240307-v1:0"  # Good balance
3. Hybrid Approaches
Use different models for different parts of your application workflow:
- Classifier Model: Use an efficient model to classify user input
- Content Generator: Use a high-quality model for main content generation
- Refinement Model: Use a specialized model to check and refine outputs
This approach optimizes for both cost and quality throughout the processing pipeline; a minimal sketch of such a routing pipeline follows.
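The sketch below strings the three roles together. The classification prompt, model choices, and helper name are illustrative assumptions, not a prescribed Bedrock pattern:

import boto3
import json

bedrock = boto3.client('bedrock-runtime')

def invoke_claude(model_id, user_text, max_tokens=500):
    """Helper for Claude-format requests; other model families need their own formats."""
    body = json.dumps({
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": max_tokens,
        "messages": [{"role": "user", "content": user_text}]
    })
    response = bedrock.invoke_model(modelId=model_id, body=body)
    return json.loads(response['body'].read())['content'][0]['text']

def hybrid_pipeline(user_input):
    # 1. Classifier: a cheap, fast model labels the request (illustrative prompt).
    label = invoke_claude(
        'anthropic.claude-3-haiku-20240307-v1:0',
        f"Classify this request as SIMPLE or COMPLEX. Reply with one word.\n\n{user_input}",
        max_tokens=5
    ).strip().upper()

    # 2. Content generator: route complex requests to a stronger model.
    generator = ('anthropic.claude-3-sonnet-20240229-v1:0'
                 if 'COMPLEX' in label
                 else 'anthropic.claude-3-haiku-20240307-v1:0')
    draft = invoke_claude(generator, user_input)

    # 3. Refinement: a final cheap pass to check tone and formatting.
    return invoke_claude(
        'anthropic.claude-3-haiku-20240307-v1:0',
        f"Review the following draft for tone and formatting and return an improved version:\n\n{draft}"
    )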
Key Takeaways
- Each foundation model family in AWS Bedrock has distinct strengths and use cases
- Model selection should be based on task requirements, quality needs, and budget constraints
- Test models with representative examples rather than relying solely on specifications
- Consider context window limitations when designing your applications
- Implement token counting and budget management to control costs
- More expensive models aren't always better for every task; match capabilities to requirements
- Advanced strategies like ensembles and hybrid approaches can optimize performance
Next Steps: Now that you understand the different foundation models, learn about implementing synchronous inference with AWS Bedrock.
© 2025 Scott Friedman. Licensed under CC BY-NC-ND 4.0