Understanding and Managing AWS Bedrock Quota Limits

AWS Bedrock imposes various quota limits to ensure fair usage across all customers. This guide explores how to discover, monitor, and optimize your usage within these limits.

Types of Quota Limits in AWS Bedrock

1. Tokens Per Minute (TPM)

Limits the total number of tokens (both input and output) processed per minute
Set individually for each foundation model
Usually the most critical limit for high-throughput applications

2. Requests Per Minute (RPM)

Restricts how many API calls you can make to a specific model per minute
Independent of token count in each request
Important for applications making many small requests

3. Concurrent Requests

Limits the number of simultaneous requests to a model
Particularly relevant for streaming and asynchronous operations

4. Maximum Input/Output Tokens

Caps the maximum size of individual requests/responses
Varies by model (e.g., Claude models have larger context windows than Llama 2)

Discovering Quota Limits

Using AWS CLI

To view your current quotas for all Bedrock services:

aws service-quotas list-service-quotas \
  --service-code bedrock \
  --profile aws

To get details about a specific quota:

aws service-quotas get-service-quota \
  --service-code bedrock \
  --quota-code L-12345678 \
  --profile aws

To view the default quotas for services:

aws service-quotas list-aws-default-service-quotas \
  --service-code bedrock \
  --profile aws

Using AWS SDK (Python)

import boto3
from utils.profile_manager import get_profile

# Use the configured profile (defaults to 'aws' for local testing)
profile_name = get_profile()
session = boto3.Session(profile_name=profile_name)
client = session.client('service-quotas')

# List all quotas for Bedrock
response = client.list_service_quotas(ServiceCode='bedrock')

# Print quota information
for quota in response['Quotas']:
    print(f"Name: {quota['QuotaName']}")
    print(f"Code: {quota['QuotaCode']}")
    print(f"Value: {quota['Value']}")
    print(f"Adjustable: {quota['Adjustable']}")
    print("---")

Checking Model Throughput (TPM Capacity)

To view your provisioned throughput capacity for specific models:

aws bedrock list-provisioned-model-throughputs --profile aws

To check specific model throughput:

aws bedrock get-foundation-model-throughput-capacity \
  --model-id anthropic.claude-3-sonnet-20240229-v1:0 \
  --profile aws

Monitoring Quota Usage

CloudWatch Metrics

AWS Bedrock publishes usage metrics to CloudWatch:

aws cloudwatch get-metric-statistics \
  --namespace AWS/Bedrock \
  --metric-name InvokeModelClientErrors \
  --statistics Sum \
  --period 3600 \
  --start-time 2023-05-01T00:00:00Z \
  --end-time 2023-05-02T00:00:00Z \
  --dimensions Name=ModelId,Value=anthropic.claude-3-sonnet-20240229-v1:0 \
  --profile aws

Key metrics to monitor:

InvokeModel (successful invocations)
InvokeModelClientErrors (client errors)
InvokeModelUserErrors (user errors)
InvokeModelThrottled (requests throttled due to quota limits)

Using Python to Monitor Usage

import boto3
import datetime
from utils.profile_manager import get_profile

def monitor_bedrock_usage(model_id, hours=24):
    profile_name = get_profile()
    session = boto3.Session(profile_name=profile_name)
    cloudwatch = session.client('cloudwatch')
    
    end_time = datetime.datetime.utcnow()
    start_time = end_time - datetime.timedelta(hours=hours)
    
    # Get invocation counts
    response = cloudwatch.get_metric_statistics(
        Namespace='AWS/Bedrock',
        MetricName='InvokeModel',
        Dimensions=[
            {'Name': 'ModelId', 'Value': model_id}
        ],
        StartTime=start_time,
        EndTime=end_time,
        Period=3600,  # 1 hour periods
        Statistics=['Sum']
    )
    
    # Get throttling events
    throttled = cloudwatch.get_metric_statistics(
        Namespace='AWS/Bedrock',
        MetricName='InvokeModelThrottled',
        Dimensions=[
            {'Name': 'ModelId', 'Value': model_id}
        ],
        StartTime=start_time,
        EndTime=end_time,
        Period=3600,  # 1 hour periods
        Statistics=['Sum']
    )
    
    print(f"Usage statistics for {model_id} over the past {hours} hours:")
    print("Timestamp\t\tInvocations\tThrottled")
    print("-" * 50)
    
    # Process and display results
    datapoints = {point['Timestamp']: {'invocations': point['Sum']} 
                 for point in response['Datapoints']}
    
    for point in throttled['Datapoints']:
        if point['Timestamp'] in datapoints:
            datapoints[point['Timestamp']]['throttled'] = point['Sum']
        else:
            datapoints[point['Timestamp']] = {'invocations': 0, 'throttled': point['Sum']}
    
    # Sort by timestamp
    for timestamp in sorted(datapoints.keys()):
        data = datapoints[timestamp]
        invocations = data.get('invocations', 0)
        throttled_count = data.get('throttled', 0)
        print(f"{timestamp}\t{invocations:.0f}\t\t{throttled_count:.0f}")

# Example usage
monitor_bedrock_usage('anthropic.claude-3-sonnet-20240229-v1:0')

Requesting Quota Increases

Using AWS CLI

To request a quota increase:

aws service-quotas request-service-quota-increase \
  --service-code bedrock \
  --quota-code L-12345678 \
  --desired-value 100 \
  --profile aws

Using AWS Management Console

Open the Service Quotas console
Search for “Amazon Bedrock”
Select the quota you want to increase
Click “Request quota increase”
Enter the desired value and justification

When you hit a quota limit, you’ll receive error responses:

{
  "message": "Rate exceeded",
  "code": "ThrottlingException"
}

Or for token limits:

{
  "message": "Token limit exceeded",
  "code": "ValidationException"
}

Model-Specific Quota Considerations

Different foundation models have varying quota limits and requirements:

Anthropic Claude Models

Generally have higher token-per-minute quotas
Support large context windows (up to 100K tokens)
Quota limits are set per model version (Claude 3 Opus vs. Sonnet vs. Haiku)

Meta Llama 2 Models

Typically have lower default quotas than Claude
Smaller context windows (4K-8K tokens)
Lower token costs make them economical for certain workloads

Amazon Titan Models

Often have higher default quotas since they’re Amazon’s own models
Great for users who need higher throughput without quota increase requests

Image Generation Models

Have separate quotas from text models
Often limited by images per minute rather than tokens

Optimizing Within Quota Limits

Strategies for Managing TPM Limits

Prompt Engineering
- Reduce input token length through concise prompts
- Use efficient summarization techniques for long inputs
Batching
- Combine multiple smaller requests into batches
- Distribute workloads over time to stay under per-minute limits
Request Scheduling
- Implement token bucket algorithms to pace requests
- Use exponential backoff for retry mechanisms
Model Selection
- Choose smaller models for simpler tasks (Claude Haiku vs. Opus)
- Distribute workloads across multiple model families
Streaming
- Use streaming APIs to get results faster while still using the same token quota
- Improves user experience while working within the same limits

Visualizing Quota Usage

import matplotlib.pyplot as plt
import pandas as pd
import boto3
import datetime
from utils.profile_manager import get_profile
from utils.visualization_config import SVG_CONFIG

def visualize_quota_usage(model_id, hours=24, output_file="quota_usage.svg"):
    profile_name = get_profile()
    session = boto3.Session(profile_name=profile_name)
    cloudwatch = session.client('cloudwatch')
    
    end_time = datetime.datetime.utcnow()
    start_time = end_time - datetime.timedelta(hours=hours)
    
    # Get invocation counts
    response = cloudwatch.get_metric_statistics(
        Namespace='AWS/Bedrock',
        MetricName='InvokeModel',
        Dimensions=[
            {'Name': 'ModelId', 'Value': model_id}
        ],
        StartTime=start_time,
        EndTime=end_time,
        Period=3600,  # 1 hour periods
        Statistics=['Sum']
    )
    
    # Get throttling events
    throttled = cloudwatch.get_metric_statistics(
        Namespace='AWS/Bedrock',
        MetricName='InvokeModelThrottled',
        Dimensions=[
            {'Name': 'ModelId', 'Value': model_id}
        ],
        StartTime=start_time,
        EndTime=end_time,
        Period=3600,  # 1 hour periods
        Statistics=['Sum']
    )
    
    # Process data for visualization
    data = []
    for point in response['Datapoints']:
        data.append({
            'timestamp': point['Timestamp'],
            'invocations': point['Sum'],
            'type': 'Successful'
        })
    
    for point in throttled['Datapoints']:
        data.append({
            'timestamp': point['Timestamp'],
            'invocations': point['Sum'],
            'type': 'Throttled'
        })
    
    df = pd.DataFrame(data)
    
    # Create visualization
    plt.figure(figsize=(12, 6))
    
    if not df.empty:
        # Pivot data for stacked bar chart
        pivot_df = df.pivot_table(
            index='timestamp', 
            columns='type', 
            values='invocations',
            fill_value=0
        )
        
        # Sort by timestamp
        pivot_df = pivot_df.sort_index()
        
        # Create stacked bar chart
        pivot_df.plot(kind='bar', stacked=True, ax=plt.gca(),
                    color=['#2ca02c', '#d62728'])  # Green for success, red for throttled
        
        plt.title(f'API Usage for {model_id}', fontsize=16)
        plt.xlabel('Time', fontsize=12)
        plt.ylabel('Request Count', fontsize=12)
        plt.xticks(rotation=45)
        plt.tight_layout()
        
        # Save as SVG
        plt.savefig(output_file, **SVG_CONFIG)
        plt.close()
        
        return output_file
    else:
        print("No data available for the specified time period")
        return None

# Example usage
visualize_quota_usage('anthropic.claude-3-sonnet-20240229-v1:0', 
                    output_file='docs/images/claude_quota_usage.svg')

Next Steps

Implement a quota monitoring dashboard
Set up CloudWatch alarms for approaching limits
Create automated scaling mechanisms to distribute load
Optimize prompts to reduce token usage
Explore more advanced parallelization strategies

These techniques will help you maximize the value from AWS Bedrock while working within the established quota limits.

Understanding and Managing AWS Bedrock Quota Limits

Understanding and Managing AWS Bedrock Quota Limits

Types of Quota Limits in AWS Bedrock

1. Tokens Per Minute (TPM)

2. Requests Per Minute (RPM)

3. Concurrent Requests

4. Maximum Input/Output Tokens

Discovering Quota Limits

Using AWS CLI

Using AWS SDK (Python)

Checking Model Throughput (TPM Capacity)

Monitoring Quota Usage

CloudWatch Metrics

Using Python to Monitor Usage

Requesting Quota Increases

Using AWS CLI

Using AWS Management Console

Common Quota-Related Error Responses

Model-Specific Quota Considerations

Anthropic Claude Models

Meta Llama 2 Models

Amazon Titan Models

Image Generation Models

Optimizing Within Quota Limits

Strategies for Managing TPM Limits

Visualizing Quota Usage

Next Steps