AWS Bedrock Inference Methods
This document provides a comprehensive overview of the various inference methods available in AWS Bedrock, their use cases, advantages, and implementation details.
Overview of Inference Methods
AWS Bedrock offers multiple ways to interact with foundation models:
- Synchronous Inference (InvokeModel)
- Streaming Inference (InvokeModelWithResponseStream)
- Asynchronous Processing (CreateModelInvocationJob)
- Conversational AI (Converse API)
- Structured Outputs (Construct API)
Each method has different characteristics and is suited for specific use cases. Understanding these differences is crucial for optimizing throughput, managing quota limits, and providing the best user experience.
Synchronous Inference (InvokeModel)
Overview
Synchronous inference is the simplest way to interact with foundation models. It follows a request-response pattern where you send a prompt and wait for the complete response before proceeding.
When to Use
- Single, independent requests that don’t require real-time feedback
- Batch processing where you can wait for the full result
- Simple integrations where streaming adds unnecessary complexity
- Cases where you need the full response before taking any action
AWS SDK Implementation
import boto3
import json
from utils.profile_manager import get_profile

def invoke_model_sync(model_id, prompt_data):
    """
    Perform synchronous inference using AWS Bedrock.

    Args:
        model_id: The model identifier
        prompt_data: Dictionary with the prompt payload

    Returns:
        The model's response
    """
    # Use the configured profile (defaults to 'aws' for local testing)
    profile_name = get_profile()
    session = boto3.Session(profile_name=profile_name)
    bedrock_runtime = session.client('bedrock-runtime')

    # Convert the prompt data to JSON string
    body = json.dumps(prompt_data)

    # Invoke the model
    response = bedrock_runtime.invoke_model(
        modelId=model_id,
        body=body
    )

    # Parse and return the response
    response_body = json.loads(response['body'].read())
    return response_body
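For example, a minimal call using the Anthropic Messages payload (the same request shape as the CLI example below). The response-parsing path under "content" is specific to Anthropic models and will differ for other providers:

# Example usage with an Anthropic Claude model
prompt_data = {
    "anthropic_version": "bedrock-2023-05-31",
    "max_tokens": 1000,
    "messages": [
        {"role": "user", "content": "What is quantum computing?"}
    ]
}

result = invoke_model_sync("anthropic.claude-3-sonnet-20240229-v1:0", prompt_data)

# Anthropic models return generated text under the "content" key
print(result["content"][0]["text"])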
AWS CLI Example
aws bedrock-runtime invoke-model \
  --model-id anthropic.claude-3-sonnet-20240229-v1:0 \
  --body '{"anthropic_version":"bedrock-2023-05-31","max_tokens":1000,"messages":[{"role":"user","content":"What is quantum computing?"}]}' \
  --cli-binary-format raw-in-base64-out \
  --profile aws \
  output.json
Quota Considerations
- Subject to both TPM (tokens per minute) and RPM (requests per minute) quotas
- Each request counts as 1 against RPM quota
- Both input and output tokens count against TPM quota
- If a request completes very quickly, you may still be limited by RPM quota
Best Practices
- Batch related requests - Process multiple items in a single request when possible
- Optimize prompt size - Use concise prompts to reduce token usage
- Implement retries with backoff - Handle throttling errors gracefully (see the sketch after this list)
- Monitor token usage - Track both input and output tokens to predict quota consumption
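A minimal sketch of the retry pattern, reusing the invoke_model_sync helper above; the retry count and delays are illustrative defaults, not tuned recommendations:

import random
import time

from botocore.exceptions import ClientError

def invoke_with_backoff(model_id, prompt_data, max_retries=5):
    """Retry throttled requests with exponential backoff and jitter."""
    for attempt in range(max_retries):
        try:
            return invoke_model_sync(model_id, prompt_data)
        except ClientError as error:
            if error.response["Error"]["Code"] != "ThrottlingException":
                raise  # Only retry throttling errors
            # Exponential backoff with jitter: 1s, 2s, 4s, ... plus up to 1s of noise
            time.sleep((2 ** attempt) + random.random())
    raise RuntimeError("Request still throttled after maximum retries")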
Streaming Inference (InvokeModelWithResponseStream)
Overview
Streaming inference allows you to receive the model’s response incrementally as it’s being generated, rather than waiting for the complete response.
When to Use
- Interactive applications where showing incremental results improves user experience
- Long-form content generation where you want to display progress
- Applications where perceived latency is more important than total processing time
- Chat interfaces where typing indicators or progressive responses are expected
AWS SDK Implementation
import boto3
import json
from utils.profile_manager import get_profile

def invoke_model_stream(model_id, prompt_data):
    """
    Perform streaming inference using AWS Bedrock.

    Args:
        model_id: The model identifier
        prompt_data: Dictionary with the prompt payload

    Returns:
        Generator yielding response chunks as they arrive
    """
    # Use the configured profile (defaults to 'aws' for local testing)
    profile_name = get_profile()
    session = boto3.Session(profile_name=profile_name)
    bedrock_runtime = session.client('bedrock-runtime')

    # Convert the prompt data to JSON string
    body = json.dumps(prompt_data)

    # Invoke the model with streaming
    response = bedrock_runtime.invoke_model_with_response_stream(
        modelId=model_id,
        body=body
    )

    # Process the streaming response
    stream = response.get('body')
    if stream:
        for event in stream:
            chunk = event.get('chunk')
            if chunk:
                chunk_data = json.loads(chunk.get('bytes').decode())
                yield chunk_data
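For example, text can be printed as it arrives. The chunk fields used here ("content_block_delta", "delta", "text") follow the Anthropic Messages streaming format; other model families emit differently shaped chunks:

# Example usage: print text deltas as they stream in
prompt_data = {
    "anthropic_version": "bedrock-2023-05-31",
    "max_tokens": 1000,
    "messages": [{"role": "user", "content": "Write a short story about robots."}]
}

for chunk in invoke_model_stream("anthropic.claude-3-sonnet-20240229-v1:0", prompt_data):
    # Text arrives in content_block_delta events; other event types carry metadata
    if chunk.get("type") == "content_block_delta":
        print(chunk["delta"]["text"], end="", flush=True)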
AWS CLI Example
aws bedrock-runtime invoke-model-with-response-stream \
  --model-id anthropic.claude-3-sonnet-20240229-v1:0 \
  --body '{"anthropic_version":"bedrock-2023-05-31","max_tokens":1000,"messages":[{"role":"user","content":"Write a short story about robots."}]}' \
  --cli-binary-format raw-in-base64-out \
  --profile aws \
  output_stream.json
Quota Considerations
- Subject to the same TPM and RPM quotas as synchronous inference
- Can provide better perceived performance despite the same total processing time
- May improve throughput for large outputs by starting to process results before completion
- Token usage is identical to synchronous requests
Best Practices
- Update UI incrementally - Display content chunks as they arrive for better UX
- Implement robust error handling - Handle stream interruptions gracefully
- Consider connection timeout limits - For very long responses, be aware of client read timeouts and idle-connection limits
- Manage incomplete responses - Design your application to handle partial results if the stream is interrupted (see the sketch after this list)
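One way to keep partial output usable is to accumulate text as it streams and return whatever arrived if the connection drops. A minimal sketch, reusing invoke_model_stream and the Anthropic chunk format shown earlier:

def stream_with_partial_results(model_id, prompt_data):
    """Collect streamed text, returning whatever arrived if the stream is cut off."""
    collected = []
    try:
        for chunk in invoke_model_stream(model_id, prompt_data):
            if chunk.get("type") == "content_block_delta":
                collected.append(chunk["delta"]["text"])
    except Exception as error:  # e.g. connection reset or read timeout
        print(f"Stream interrupted after partial output: {error}")
    return "".join(collected)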
Asynchronous Processing (CreateModelInvocationJob)
Overview
Asynchronous processing allows you to submit long-running inference jobs without maintaining an open connection. Inputs are supplied as a JSONL file in S3 and results are written back to S3: you submit a job, receive a job ARN, and check for results later.
When to Use
- Long-running inference tasks that may exceed typical connection timeouts
- Batch processing of multiple requests
- Background processing where you don’t need immediate results
- Heavy workloads where you need to manage throughput without overwhelming the system
AWS SDK Implementation
import boto3
import time
from utils.profile_manager import get_profile

def create_async_inference_job(model_id, input_s3_uri, output_s3_uri, role_arn):
    """
    Create an asynchronous (batch) inference job.

    Args:
        model_id: The model identifier
        input_s3_uri: S3 URI of the JSONL file containing the input records
        output_s3_uri: S3 URI where results should be stored
        role_arn: IAM role that grants Bedrock access to the S3 locations

    Returns:
        Job ARN for tracking the job
    """
    # Use the configured profile (defaults to 'aws' for local testing)
    profile_name = get_profile()
    session = boto3.Session(profile_name=profile_name)
    bedrock = session.client('bedrock')

    # Create the job; batch jobs read input from and write output to S3
    response = bedrock.create_model_invocation_job(
        modelId=model_id,
        jobName=f"inference-job-{int(time.time())}",
        roleArn=role_arn,
        inputDataConfig={
            's3InputDataConfig': {
                's3Uri': input_s3_uri
            }
        },
        outputDataConfig={
            's3OutputDataConfig': {
                's3Uri': output_s3_uri
            }
        }
    )

    return response['jobArn']
def check_job_status(job_arn):
    """
    Check the status of an asynchronous inference job.

    Args:
        job_arn: The ARN of the job to check

    Returns:
        Dictionary with job status information
    """
    profile_name = get_profile()
    session = boto3.Session(profile_name=profile_name)
    bedrock = session.client('bedrock')

    response = bedrock.get_model_invocation_job(
        jobIdentifier=job_arn
    )

    return response
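A simple polling loop can be layered on check_job_status; the polling interval and the terminal status names ("Completed", "Failed", "Stopped") are assumptions to verify against the statuses returned in your account:

import time

def wait_for_job(job_arn, poll_interval=60):
    """Poll a batch inference job until it reaches a terminal state."""
    while True:
        job = check_job_status(job_arn)
        status = job["status"]
        print(f"Job status: {status}")
        if status in ("Completed", "Failed", "Stopped"):
            return job
        time.sleep(poll_interval)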
AWS CLI Example
# Submit an asynchronous (batch) job; input and output are S3 locations
aws bedrock create-model-invocation-job \
  --model-id anthropic.claude-3-sonnet-20240229-v1:0 \
  --job-name "my-batch-job-1" \
  --role-arn "arn:aws:iam::123456789012:role/BedrockBatchInferenceRole" \
  --input-data-config '{"s3InputDataConfig": {"s3Uri": "s3://my-bucket/inputs/records.jsonl"}}' \
  --output-data-config '{"s3OutputDataConfig": {"s3Uri": "s3://my-bucket/results/"}}' \
  --profile aws

# Check job status
aws bedrock get-model-invocation-job \
  --job-identifier "arn:aws:bedrock:us-west-2:123456789012:model-invocation-job/abcde12345" \
  --profile aws
Quota Considerations
- Different quota limits than real-time inference
- Often allows for higher throughput for large batch operations
- May have limits on concurrent jobs rather than RPM
- Consider storage costs for inputs and outputs in S3
Best Practices
- Implement job polling - Check job status at appropriate intervals
- Manage job lifecycles - Clean up completed jobs and outputs
- Use job batching - Group related requests into a single job when possible (see the sketch after this list)
- Implement notification mechanisms - Use SNS or other services to notify when jobs complete
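Batch jobs read their input from a JSONL file in S3, one record per line. A minimal sketch of preparing and uploading that file; the recordId/modelInput record shape follows the batch inference input format, and the bucket and key values are placeholders:

import json

import boto3

def upload_batch_input(records, bucket, key, profile_name="aws"):
    """Write one JSON record per line and upload it as the batch job's input file.

    Each record pairs an identifier with a model request body:
    {"recordId": "...", "modelInput": {...}}
    """
    lines = [
        json.dumps({"recordId": record_id, "modelInput": model_input})
        for record_id, model_input in records
    ]
    body = "\n".join(lines).encode("utf-8")

    session = boto3.Session(profile_name=profile_name)
    session.client("s3").put_object(Bucket=bucket, Key=key, Body=body)
    return f"s3://{bucket}/{key}"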
Conversational AI (Converse API)
Overview
The Converse API is purpose-built for multi-turn conversations: you send the accumulated message history (and an optional system prompt) on each call, and the API provides a consistent request and response format across model providers.
When to Use
- Chat applications with conversation history
- Conversational interfaces requiring context management
- Applications where maintaining conversation state is important
- Cases where you need to manage conversation memory efficiently
AWS SDK Implementation
import boto3
from utils.profile_manager import get_profile

def converse(model_id, messages, system_prompt=None):
    """
    Use the Converse API for a multi-turn conversation.

    Args:
        model_id: The model identifier
        messages: List of conversation messages
        system_prompt: Optional system prompt text

    Returns:
        The model's response
    """
    # Use the configured profile (defaults to 'aws' for local testing)
    profile_name = get_profile()
    session = boto3.Session(profile_name=profile_name)
    bedrock = session.client('bedrock-runtime')

    # Prepare request parameters; the system prompt is passed as a list of text blocks
    request = {
        "modelId": model_id,
        "messages": messages
    }
    if system_prompt:
        request["system"] = [{"text": system_prompt}]

    # Invoke the Converse API
    response = bedrock.converse(**request)

    return response
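For example, conversation history is carried forward by appending each assistant reply before the next turn. This sketch assumes the Converse API message shape, where "content" is a list of {"text": ...} blocks:

messages = [
    {"role": "user", "content": [{"text": "What is quantum computing?"}]}
]

response = converse("anthropic.claude-3-sonnet-20240229-v1:0", messages,
                    system_prompt="You are a concise physics tutor.")

# The assistant's reply is returned under output.message; append it to keep context
messages.append(response["output"]["message"])

# The next turn reuses the accumulated history
messages.append({"role": "user", "content": [{"text": "Explain qubits in one sentence."}]})
response = converse("anthropic.claude-3-sonnet-20240229-v1:0", messages)
print(response["output"]["message"]["content"][0]["text"])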
Quota Considerations
- May have separate quota limits from standard inference
- Optimization focuses on efficient conversation history management
- Token usage includes conversation history, which grows with conversation length
Best Practices
- Summarize conversation history - Periodically summarize long conversations
- Prune irrelevant messages - Remove unimportant turns to save tokens
- Use system prompts effectively - Set context with system prompts instead of user messages
- Implement conversation memory strategies - Consider sliding windows or hierarchical summarization (a sliding-window sketch follows)
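A minimal sliding-window sketch: keep only the most recent turns before each call. The turn count here is an arbitrary illustration; real limits should be driven by token budgets:

def trim_history(messages, max_turns=10):
    """Keep only the most recent messages, starting the window on a user turn."""
    if len(messages) <= max_turns:
        return messages
    trimmed = messages[-max_turns:]
    # Drop leading assistant messages so the history starts with a user turn
    while trimmed and trimmed[0]["role"] != "user":
        trimmed = trimmed[1:]
    return trimmed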
Structured Outputs (Construct API)
Overview
The Construct API provides a way to generate structured outputs in specific formats like JSON or XML, with schema validation.
When to Use
- When you need consistent, structured data from the model
- Applications requiring JSON or XML outputs
- Integration with databases or APIs expecting specific formats
- Cases where output validation is critical
AWS SDK Implementation
import boto3
import json
from utils.profile_manager import get_profile

def construct_structured_output(model_id, prompt, schema):
    """
    Use the Construct API to generate structured output.

    Args:
        model_id: The model identifier
        prompt: The prompt text
        schema: JSON schema defining the expected output structure

    Returns:
        Structured output matching the schema
    """
    # Use the configured profile (defaults to 'aws' for local testing)
    profile_name = get_profile()
    session = boto3.Session(profile_name=profile_name)
    bedrock = session.client('bedrock-runtime')

    # Invoke the Construct API
    response = bedrock.construct(
        modelId=model_id,
        prompt=prompt,
        schema=schema
    )

    return response
Quota Considerations
- May have specific quota limits separate from standard inference
- Optimize by designing minimal schemas that capture only required fields
- Consider response size when designing schemas
Best Practices
- Design precise schemas - Clearly define expected output format
- Include examples in prompts - Providing examples helps models generate correct formats
- Implement validation - Always validate outputs against schemas (see the sketch after this list)
- Create fallback mechanisms - Handle cases where structured generation fails
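However the structured output is produced, validating it before use is cheap. A minimal sketch using the third-party jsonschema package (an assumption; any JSON Schema validator works), with a hook for fallback handling:

import json

from jsonschema import ValidationError, validate

def parse_and_validate(model_output_text, schema):
    """Parse model output as JSON and check it against the expected schema."""
    try:
        data = json.loads(model_output_text)
        validate(instance=data, schema=schema)
        return data
    except (json.JSONDecodeError, ValidationError) as error:
        # Fallback hook: re-prompt the model, repair the output, or surface the error
        raise ValueError(f"Structured output failed validation: {error}") from error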
Comparative Analysis
Performance Characteristics
| Inference Method | Latency | Throughput | User Experience | Quota Efficiency |
|---|---|---|---|---|
| Synchronous | Higher | Standard | Wait for full response | Standard |
| Streaming | Lower perceived | Standard | Progressive display | Standard |
| Asynchronous | Highest | Highest | Background processing | Most efficient for batch |
| Converse API | Similar to sync | Standard | Maintains context | Depends on conversation length |
| Construct API | Similar to sync | Standard | Structured data | May require more tokens |
Use Case Matrix
| Use Case | Recommended Method | Why |
|---|---|---|
| Chat interfaces | Streaming or Converse | Better user experience, context management |
| Data extraction | Construct | Ensures consistent, validated output |
| Batch document processing | Asynchronous | Handle large volumes efficiently |
| Simple Q&A | Synchronous | Straightforward implementation |
| Long-form content | Streaming | Show progress during generation |
Implementation Pattern: Multi-Method Inference Manager
A robust application might use different inference methods based on the specific requirements:
import boto3
import json
from utils.profile_manager import get_profile

class InferenceManager:
    def __init__(self, model_id, profile_name=None):
        self.model_id = model_id
        self.profile_name = profile_name or get_profile()
        self.session = boto3.Session(profile_name=self.profile_name)
        self.bedrock_runtime = self.session.client('bedrock-runtime')
        self.bedrock = self.session.client('bedrock')

    def infer(self, prompt_data, method="sync", **kwargs):
        """
        Perform inference using the specified method.

        Args:
            prompt_data: The prompt payload
            method: One of "sync", "stream", "async", "converse", "construct"
            **kwargs: Additional method-specific parameters

        Returns:
            Inference results in the appropriate format
        """
        if method == "sync":
            return self._sync_inference(prompt_data)
        elif method == "stream":
            return self._streaming_inference(prompt_data)
        elif method == "async":
            return self._async_inference(prompt_data, kwargs.get("output_s3_uri"))
        elif method == "converse":
            return self._converse(prompt_data, kwargs.get("system_prompt"))
        elif method == "construct":
            return self._construct(prompt_data, kwargs.get("schema"))
        else:
            raise ValueError(f"Unknown inference method: {method}")

    def _sync_inference(self, prompt_data):
        """Synchronous inference implementation"""
        response = self.bedrock_runtime.invoke_model(
            modelId=self.model_id,
            body=json.dumps(prompt_data)
        )
        return json.loads(response['body'].read())

    def _streaming_inference(self, prompt_data):
        """Streaming inference implementation"""
        response = self.bedrock_runtime.invoke_model_with_response_stream(
            modelId=self.model_id,
            body=json.dumps(prompt_data)
        )
        stream = response.get('body')
        if stream:
            for event in stream:
                chunk = event.get('chunk')
                if chunk:
                    yield json.loads(chunk.get('bytes').decode())

    def _async_inference(self, prompt_data, output_s3_uri):
        """Asynchronous inference implementation"""
        # Implementation details...
        pass

    def _converse(self, messages, system_prompt=None):
        """Converse API implementation"""
        # Implementation details...
        pass

    def _construct(self, prompt, schema):
        """Construct API implementation"""
        # Implementation details...
        pass
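Example usage, again assuming the Anthropic Messages payload format:

manager = InferenceManager("anthropic.claude-3-sonnet-20240229-v1:0")

prompt_data = {
    "anthropic_version": "bedrock-2023-05-31",
    "max_tokens": 500,
    "messages": [{"role": "user", "content": "Summarize the theory of relativity."}]
}

# Synchronous call returns the full parsed response body
result = manager.infer(prompt_data, method="sync")

# Streaming call returns a generator of parsed chunks
for chunk in manager.infer(prompt_data, method="stream"):
    print(chunk)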
Conclusion
Each inference method in AWS Bedrock offers unique advantages for specific use cases. By understanding these differences and implementing the appropriate method for each scenario, you can optimize for throughput, user experience, and quota efficiency.
The next sections will explore each method in greater depth, with specific code examples, optimization techniques, and real-world use cases.