From seven broken Lambda functions to a production AI platform in 8 articles.
That's the journey we've taken together. We started with functions that couldn't communicate, hit timeout walls, and left users staring at loading spinners. Now we have a complete platform that orchestrates complex workflows, streams real-time updates, and won't bankrupt your startup.
This isn't a toy example. The architecture I'm about to show you serves 1,500+ requests daily, has survived 8 months in production, and handles everything from document analysis to multi-step research tasks.
Time to deploy it.
The Complete Architecture
Before we dive into deployment, here's what we're building:

The data flow:
- API Gateway receives requests, handles auth, enforces rate limits
- Gateway Lambda validates requests, checks budgets, routes to appropriate service
- ECS Agents orchestrate multi-step workflows using Lambda tools
- Lambda Tools perform specific AI tasks (summarize, extract, classify)
- DynamoDB tracks usage, manages budgets, stores user data
- WebSocket streams real-time updates back to clients
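That split between Lambda and ECS is the load-bearing decision in the whole design. A minimal sketch of how the Gateway Lambda might make the routing call (the request shape and `routeRequest` helper are illustrative assumptions, not the repo's actual code):

```typescript
// Hypothetical routing sketch: single-step AI calls go straight to a Lambda
// tool, multi-step workflows go to the long-lived ECS agent service.
type PlatformRequest = {
  userId: string;
  type: 'complete' | 'embed' | 'agent';
  payload: Record<string, unknown>;
};

type Route = { target: 'lambda-tool' | 'ecs-agent'; reason: string };

export function routeRequest(req: PlatformRequest): Route {
  // Single AI calls are short-lived and fit Lambda's timeout model.
  if (req.type === 'complete' || req.type === 'embed') {
    return { target: 'lambda-tool', reason: 'single-step, bounded runtime' };
  }
  // Agent workflows chain multiple tools and can outlive a Lambda timeout,
  // so they run on ECS tasks that survive past 15 minutes.
  return { target: 'ecs-agent', reason: 'multi-step, unbounded runtime' };
}
```

The point is that the gateway never runs a workflow itself; it only decides which compute tier should.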
Prerequisites: Bootstrap Your Environment
First, let's set up the deployment environment:
# Install AWS CDK
npm install -g aws-cdk
# Clone the platform
git clone https://github.com/tysoncung/ai-platform-aws.git
cd ai-platform-aws
# Install dependencies
npm install
npm run install:all # Installs in all packages
# Bootstrap CDK (one time per account/region)
npx cdk bootstrap
# Create environment file
cp .env.example .env
Edit .env with your configuration:
# AWS Configuration
AWS_REGION=us-east-1
AWS_ACCOUNT_ID=123456789012
# AI Provider API Keys
OPENAI_API_KEY=sk-your-openai-key
ANTHROPIC_API_KEY=sk-ant-your-anthropic-key
# Platform Configuration
PLATFORM_ENVIRONMENT=production
COST_TRACKING_ENABLED=true
BUDGET_ALERTS_ENABLED=true
# Monitoring
SLACK_WEBHOOK_URL=https://hooks.slack.com/your-webhook
ALERT_EMAIL=you@company.com
# Security
JWT_SECRET_KEY=your-super-secret-jwt-key
ENCRYPTION_SALT=your-encryption-salt
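A missing key in this file fails in confusing ways at runtime, so it's worth validating it once at startup. A small sketch (the variable names match the `.env` above; the `requireEnv` and `loadConfig` helpers are mine, not part of the repo — call `loadConfig(process.env)` at boot):

```typescript
// Fail fast at startup if a required environment variable is missing,
// instead of failing mid-request with a cryptic provider error.
type Env = Record<string, string | undefined>;

export function requireEnv(name: string, env: Env): string {
  const value = env[name];
  if (!value || value.trim() === '') {
    throw new Error(`Missing required environment variable: ${name}`);
  }
  return value;
}

export function loadConfig(env: Env) {
  return {
    region: requireEnv('AWS_REGION', env),
    openaiKey: requireEnv('OPENAI_API_KEY', env),
    anthropicKey: requireEnv('ANTHROPIC_API_KEY', env),
    jwtSecret: requireEnv('JWT_SECRET_KEY', env),
  };
}
```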
Local Development Setup
Before deploying to AWS, let's run everything locally with Docker Compose:
# docker-compose.yml
version: '3.8'
services:
  api-gateway:
    build:
      context: ./packages/gateway
      dockerfile: Dockerfile.dev
    ports:
      - "3000:3000"
    environment:
      - NODE_ENV=development
      - DYNAMODB_ENDPOINT=http://dynamodb:8000
      - AGENT_ENDPOINT=http://agent:3001
    depends_on:
      - dynamodb
      - agent
  agent:
    build:
      context: ./packages/agents
      dockerfile: Dockerfile.dev
    ports:
      - "3001:3001"
    environment:
      - NODE_ENV=development
      - LAMBDA_ENDPOINT=http://lambda-tools:3002
    depends_on:
      - lambda-tools
  lambda-tools:
    build:
      context: ./packages/tools
      dockerfile: Dockerfile.dev
    ports:
      - "3002:3002"
    environment:
      - NODE_ENV=development
      - OPENAI_API_KEY=${OPENAI_API_KEY}
      - ANTHROPIC_API_KEY=${ANTHROPIC_API_KEY}
  dynamodb:
    image: amazon/dynamodb-local:latest
    ports:
      - "8000:8000"
    command: ["-jar", "DynamoDBLocal.jar", "-sharedDb", "-inMemory"]
  redis:
    image: redis:7-alpine
    ports:
      - "6379:6379"
Start the local environment:
# Start all services
docker-compose up -d
# Run database migrations
npm run db:migrate:local
# Seed with sample data
npm run db:seed:local
# Test the platform
curl http://localhost:3000/health
CDK Stack Composition
The platform is composed of multiple CDK stacks for better separation of concerns:
// bin/deploy.ts
import * as cdk from 'aws-cdk-lib';
import { AIGatewayStack } from '../lib/gateway-stack';
import { AIAgentsStack } from '../lib/agents-stack';
import { AIToolsStack } from '../lib/tools-stack';
import { AIMonitoringStack } from '../lib/monitoring-stack';
import { AISecurityStack } from '../lib/security-stack';

const app = new cdk.App();

const env = {
  account: process.env.CDK_DEFAULT_ACCOUNT,
  region: process.env.CDK_DEFAULT_REGION
};

// Security layer (VPC, IAM, KMS)
const securityStack = new AISecurityStack(app, 'AISecurityStack', { env });

// Lambda tools layer
const toolsStack = new AIToolsStack(app, 'AIToolsStack', {
  env,
  vpc: securityStack.vpc,
  securityGroup: securityStack.lambdaSecurityGroup
});

// ECS agents layer
const agentsStack = new AIAgentsStack(app, 'AIAgentsStack', {
  env,
  vpc: securityStack.vpc,
  securityGroup: securityStack.ecsSecurityGroup,
  toolsArns: toolsStack.functionArns
});

// API Gateway layer
const gatewayStack = new AIGatewayStack(app, 'AIGatewayStack', {
  env,
  agentsCluster: agentsStack.cluster,
  agentsService: agentsStack.service,
  toolsArns: toolsStack.functionArns
});

// Monitoring and alerting
new AIMonitoringStack(app, 'AIMonitoringStack', {
  env,
  gatewayApi: gatewayStack.api,
  agentsService: agentsStack.service,
  toolsFunctions: toolsStack.functions
});
Here's the gateway stack implementation:
// lib/gateway-stack.ts
import * as cdk from 'aws-cdk-lib';
import { Construct } from 'constructs';
import * as dynamodb from 'aws-cdk-lib/aws-dynamodb';
import * as lambda from 'aws-cdk-lib/aws-lambda';
import * as apigateway from 'aws-cdk-lib/aws-apigateway';
import * as apigatewayv2 from 'aws-cdk-lib/aws-apigatewayv2';
import * as apigatewayv2integrations from 'aws-cdk-lib/aws-apigatewayv2-integrations';

export class AIGatewayStack extends cdk.Stack {
  public readonly api: apigateway.RestApi;

  constructor(scope: Construct, id: string, props: AIGatewayStackProps) {
    super(scope, id, props);

    // DynamoDB tables
    const usageTable = new dynamodb.Table(this, 'UsageTable', {
      tableName: 'ai-platform-usage',
      partitionKey: { name: 'userId', type: dynamodb.AttributeType.STRING },
      sortKey: { name: 'timestamp', type: dynamodb.AttributeType.NUMBER },
      billingMode: dynamodb.BillingMode.PAY_PER_REQUEST,
      timeToLiveAttribute: 'ttl'
    });

    const budgetTable = new dynamodb.Table(this, 'BudgetTable', {
      tableName: 'ai-platform-budgets',
      partitionKey: { name: 'userId', type: dynamodb.AttributeType.STRING },
      billingMode: dynamodb.BillingMode.PAY_PER_REQUEST
    });

    // Gateway Lambda function
    const gatewayFunction = new lambda.Function(this, 'GatewayFunction', {
      runtime: lambda.Runtime.NODEJS_18_X,
      code: lambda.Code.fromAsset('packages/gateway/dist'),
      handler: 'index.handler',
      timeout: cdk.Duration.seconds(30),
      memorySize: 512,
      environment: {
        USAGE_TABLE_NAME: usageTable.tableName,
        BUDGET_TABLE_NAME: budgetTable.tableName,
        AGENTS_CLUSTER_ARN: props.agentsCluster.clusterArn,
        AGENTS_SERVICE_ARN: props.agentsService.serviceArn,
        TOOLS_ARNS: JSON.stringify(props.toolsArns)
      }
    });

    // Grant permissions
    usageTable.grantReadWriteData(gatewayFunction);
    budgetTable.grantReadWriteData(gatewayFunction);

    // API Gateway
    this.api = new apigateway.RestApi(this, 'AIApi', {
      restApiName: 'AI Platform API',
      description: 'AI Platform REST API',
      defaultCorsPreflightOptions: {
        allowOrigins: apigateway.Cors.ALL_ORIGINS,
        allowMethods: apigateway.Cors.ALL_METHODS,
        allowHeaders: ['Content-Type', 'Authorization']
      }
    });

    // API Gateway integration
    const lambdaIntegration = new apigateway.LambdaIntegration(gatewayFunction);

    // Routes
    const v1 = this.api.root.addResource('v1');
    v1.addResource('complete').addMethod('POST', lambdaIntegration);
    v1.addResource('embed').addMethod('POST', lambdaIntegration);
    v1.addResource('stream').addMethod('POST', lambdaIntegration);

    const agents = v1.addResource('agents');
    agents.addResource('run').addMethod('POST', lambdaIntegration);
    agents.addResource('stream').addMethod('POST', lambdaIntegration);

    // Usage and budget endpoints
    const usage = v1.addResource('usage');
    usage.addMethod('GET', lambdaIntegration); // Get usage stats
    const budget = usage.addResource('budget'); // addResource only once per path part
    budget.addMethod('GET', lambdaIntegration);
    budget.addMethod('PUT', lambdaIntegration);

    // WebSocket API for streaming
    const webSocketApi = new apigatewayv2.WebSocketApi(this, 'StreamingAPI', {
      apiName: 'AI Platform Streaming',
      connectRouteOptions: {
        integration: new apigatewayv2integrations.WebSocketLambdaIntegration(
          'ConnectIntegration',
          gatewayFunction
        )
      },
      disconnectRouteOptions: {
        integration: new apigatewayv2integrations.WebSocketLambdaIntegration(
          'DisconnectIntegration',
          gatewayFunction
        )
      },
      defaultRouteOptions: {
        integration: new apigatewayv2integrations.WebSocketLambdaIntegration(
          'DefaultIntegration',
          gatewayFunction
        )
      }
    });

    new apigatewayv2.WebSocketStage(this, 'StreamingStage', {
      webSocketApi,
      stageName: 'prod',
      autoDeploy: true
    });
  }
}
Step-by-Step Deployment
Now let's deploy everything:
# 1. Validate CDK configuration
npx cdk doctor
# 2. Review what will be deployed
npx cdk diff
# 3. Deploy security stack first
npx cdk deploy AISecurityStack
# 4. Deploy Lambda tools
npx cdk deploy AIToolsStack
# 5. Deploy ECS agents
npx cdk deploy AIAgentsStack
# 6. Deploy API Gateway
npx cdk deploy AIGatewayStack
# 7. Deploy monitoring
npx cdk deploy AIMonitoringStack
# Or deploy everything at once
npx cdk deploy --all
The deployment takes about 15 minutes. You'll see output like:
AIGatewayStack.APIEndpoint = https://abc123.execute-api.us-east-1.amazonaws.com/v1
AIGatewayStack.WebSocketEndpoint = wss://def456.execute-api.us-east-1.amazonaws.com/prod
AIAgentsStack.ClusterName = ai-platform-agents
AIToolsStack.SummarizeFunctionArn = arn:aws:lambda:us-east-1:123456789012:function:summarize
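Rather than copying these values by hand, you can have CDK write them to a file with `npx cdk deploy --all --outputs-file outputs.json` (a real CDK flag). The file maps stack names to their output values; a small illustrative helper for reading one out (the helper itself is mine, not part of the repo):

```typescript
// outputs.json from `cdk deploy --outputs-file` looks like:
//   { "AIGatewayStack": { "APIEndpoint": "https://..." }, ... }
// In a script you'd read the file with fs.readFileSync and pass the JSON here.
export function getOutput(outputsJson: string, stack: string, key: string): string {
  const outputs: Record<string, Record<string, string>> = JSON.parse(outputsJson);
  const value = outputs[stack]?.[key];
  if (!value) {
    throw new Error(`Output ${stack}.${key} not found in outputs file`);
  }
  return value;
}
```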
Configure AI Providers
Once deployed, configure your AI provider credentials:
# Store API keys in AWS Systems Manager
aws ssm put-parameter \
--name "/ai-platform/openai-api-key" \
--value "sk-your-openai-key" \
--type "SecureString"
aws ssm put-parameter \
--name "/ai-platform/anthropic-api-key" \
--value "sk-ant-your-anthropic-key" \
--type "SecureString"
# Update the deployed functions with the new parameter names
npx cdk deploy AIToolsStack AIGatewayStack
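Inside the functions, it's cheaper to cache those parameters across invocations than to call SSM on every request. A hedged sketch of that pattern (the cache helper is mine; in the real function the injected fetcher would wrap `GetParameterCommand` from `@aws-sdk/client-ssm` with `WithDecryption: true`):

```typescript
// Cache SecureString parameters in the Lambda execution environment so warm
// invocations skip the SSM round trip. The fetcher is injected, which keeps
// this testable and lets the real one call the AWS SDK.
type Fetcher = (name: string) => Promise<string>;

const cache = new Map<string, string>();

export async function getParameter(name: string, fetch: Fetcher): Promise<string> {
  const cached = cache.get(name);
  if (cached !== undefined) return cached;
  const value = await fetch(name);
  cache.set(name, value);
  return value;
}
```

Module-level state like `cache` survives between warm invocations, which is exactly what makes this work on Lambda.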
Testing Your Deployment
Let's test the complete platform:
# 1. Health check
curl https://your-api-endpoint.execute-api.us-east-1.amazonaws.com/v1/health
# 2. Create an API key
curl -X POST https://your-api-endpoint/v1/auth/keys \
-H "Content-Type: application/json" \
-d '{
"name": "Test Key",
"scopes": ["ai:complete", "ai:embed", "agent:run"],
"monthlyBudget": 50
}'
# Returns: {"apiKey": "sk-proj-abc123...", "keyId": "sk-proj-abc"}
# 3. Test completion
curl -X POST https://your-api-endpoint/v1/complete \
-H "Authorization: Bearer sk-proj-abc123..." \
-H "Content-Type: application/json" \
-d '{
"messages": [{"role": "user", "content": "Write a haiku about TypeScript"}],
"model": "gpt-4",
"temperature": 0.8
}'
# 4. Test agent workflow
curl -X POST https://your-api-endpoint/v1/agents/run \
-H "Authorization: Bearer sk-proj-abc123..." \
-H "Content-Type: application/json" \
-d '{
"type": "research",
"input": {"topic": "renewable energy trends"},
"tools": ["search", "summarize", "extract"]
}'
Dashboard Tour
The platform includes a built-in dashboard at /dashboard. Here's what you'll see:
Usage Overview:
- Requests per day/hour
- Token consumption by model
- Cost breakdown by user
- Success/error rates
Real-time Monitoring:
- Active agent sessions
- Queue depth for tools
- Response time percentiles
- Error alerts
Budget Management:
- Per-user spend tracking
- Budget utilization alerts
- Cost projections
- BYOK vs platform credit usage
System Health:
- Lambda cold start metrics
- ECS task utilization
- DynamoDB performance
- API Gateway latency
You can access it at: https://your-api-endpoint/dashboard
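The budget utilization alerts on that dashboard come down to a threshold check on spend versus budget. A sketch of the logic (the 80% warning threshold and function names are illustrative assumptions, not the dashboard's actual config):

```typescript
// Classify a user's spend against their budget: warn as they approach the
// limit, block once they hit it. Thresholds here are illustrative.
export type BudgetStatus = 'ok' | 'warning' | 'exceeded';

export function checkBudget(spentUsd: number, budgetUsd: number): BudgetStatus {
  if (budgetUsd <= 0) return 'exceeded'; // no budget configured means no spend allowed
  const utilization = spentUsd / budgetUsd;
  if (utilization >= 1.0) return 'exceeded';
  if (utilization >= 0.8) return 'warning';
  return 'ok';
}
```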
Performance Numbers from Production
Here are the real metrics from 8 months running in production:
Latency (P95):
- Simple completion: 1.2s
- Streaming completion: 180ms to first token
- Agent workflow (3 tools): 12s
- API Gateway overhead: 45ms
- Lambda cold start: 850ms (mitigated with provisioned concurrency)
Throughput:
- Sustained: 50 requests/second
- Burst: 200 requests/second (before rate limiting)
- Agent concurrency: 15 parallel workflows
- Tool execution: 100 parallel Lambda invocations
Reliability:
- Uptime: 99.8%
- Error rate: 0.4%
- P99 latency SLA: 5s (met 98.9% of the time)
- Budget enforcement accuracy: 99.99%
Cost Optimization Wins:
- Response caching: 25% reduction in API calls
- Smart model selection: 40% cost reduction (Claude Haiku for summaries)
- BYOK adoption: 70% of users, eliminating platform AI costs
- Lambda right-sizing: 30% reduction in compute costs
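The "smart model selection" win is mostly a routing table: cheap models for high-volume, low-stakes tasks, expensive models only where quality matters. A sketch (summaries really did go to Claude Haiku; the rest of the task-to-model mapping is an illustrative assumption, not the repo's actual table):

```typescript
// Route each task type to the cheapest model that handles it well enough.
type Task = 'summarize' | 'classify' | 'extract' | 'research';

const MODEL_FOR_TASK: Record<Task, string> = {
  summarize: 'claude-3-haiku', // cheap and fast; quality is fine for summaries
  classify: 'claude-3-haiku',  // assumption: classification is similarly low-stakes
  extract: 'gpt-4o-mini',      // assumption
  research: 'gpt-4',           // multi-step reasoning justifies the cost
};

export function selectModel(task: Task, override?: string): string {
  // Let callers pin a model explicitly; otherwise use the cost-tiered default.
  return override ?? MODEL_FOR_TASK[task];
}
```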
Cost Breakdown: What This Actually Costs
Fixed Infrastructure (Monthly):
API Gateway: $3.50 (1M requests)
Lambda (Gateway): $8.20 (compute + requests)
ECS Fargate: $15.40 (2 tasks avg)
DynamoDB: $6.80 (usage + budgets)
Application Load Balancer: $16.20
NAT Gateway: $45.00 (data transfer)
CloudWatch: $4.30 (logs + metrics)
Route 53: $0.50 (hosted zone)
----
Total Fixed: $99.90/month
Variable Costs:
- AI API costs: Pass-through with 2% platform markup
- Data transfer: $0.09/GB out of AWS
- Lambda executions: $0.20 per million requests
- DynamoDB reads/writes: $0.25 per million operations
Real customer costs (excluding AI API):
- Light usage (500 req/month): $12/month
- Medium usage (5K req/month): $35/month
- Heavy usage (50K req/month): $120/month
The platform is cost-effective for most use cases. The break-even point vs building your own infrastructure is around 2,000 requests per month.
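To sanity-check those tiers against your own volume, the variable costs above reduce to a small formula. A rough estimator using the unit prices from this section (AI API costs are pass-through and excluded; the per-request DynamoDB operation count and egress are assumptions you'd tune to your workload):

```typescript
// Rough monthly variable-cost estimate from the unit prices above:
// Lambda $0.20/M requests, DynamoDB $0.25/M operations, egress $0.09/GB.
export function estimateVariableCostUsd(opts: {
  requests: number;            // requests per month
  dynamoOpsPerRequest: number; // assumed reads+writes per request
  egressGb: number;            // data transferred out of AWS per month
}): number {
  const lambdaCost = (opts.requests / 1_000_000) * 0.20;
  const dynamoCost = ((opts.requests * opts.dynamoOpsPerRequest) / 1_000_000) * 0.25;
  const egressCost = opts.egressGb * 0.09;
  return lambdaCost + dynamoCost + egressCost;
}
```

Note these are the marginal costs only; the ~$100/month of fixed infrastructure above dominates at low volume, which is where the ~2,000 requests/month break-even comes from.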
Cold Start Mitigation
Lambda cold starts were killing our performance. Here's how we solved it:
// Provisioned concurrency for critical functions. In CDK, reserved
// concurrency is a Function prop, but provisioned concurrency attaches
// to a published version or alias, not the function itself.
const gatewayFunction = new lambda.Function(this, 'GatewayFunction', {
  // ... other config
  reservedConcurrentExecutions: 10
});

new lambda.Alias(this, 'GatewayLiveAlias', {
  aliasName: 'live',
  version: gatewayFunction.currentVersion,
  provisionedConcurrentExecutions: 5
});

// Keep-warm rule that pings the Lambda every 5 minutes
new events.Rule(this, 'KeepWarmRule', {
  schedule: events.Schedule.rate(cdk.Duration.minutes(5)),
  targets: [
    new targets.LambdaFunction(gatewayFunction, {
      event: events.RuleTargetInput.fromObject({ warmup: true })
    })
  ]
});

// In the Lambda handler - respond quickly to warmup pings
export const handler = async (event: any) => {
  if (event.warmup) {
    return { statusCode: 200, body: 'warm' };
  }
  // Normal processing...
};
Result: Cold start rate dropped from 23% to 3% of requests.
Open Source Roadmap
This platform is completely open source. Here's what's coming next:
Q2 2026:
- [ ] Multi-region deployment support
- [ ] GraphQL API alongside REST
- [ ] Built-in vector database (Pinecone integration)
- [ ] Advanced agent memory management
Q3 2026:
- [ ] Kubernetes support (alternative to ECS)
- [ ] Multi-tenant isolation improvements
- [ ] Advanced cost optimization (spot instances)
- [ ] Plugin system for custom tools
Q4 2026:
- [ ] Edge deployment (CloudFlare Workers)
- [ ] Real-time collaboration features
- [ ] Advanced monitoring and observability
- [ ] Enterprise SSO integration
Community Requests:
- Google Cloud and Azure support
- Terraform modules (alternative to CDK)
- Python SDK alongside TypeScript
- Zapier/Make.com integrations
Contributing and Community
The entire platform is open source under MIT license. Everything I've built, you can use, modify, and improve.
Repositories:
- Main platform: github.com/tysoncung/ai-platform-aws
- Working examples: github.com/tysoncung/ai-platform-aws-examples
How to help:
- Star the repositories - helps others discover the project
- Try the full deployment - example 07-full-stack has everything
- Report deployment issues - especially AWS region differences
- Submit improvements - see CONTRIBUTING.md for guidelines
- Share your experience - what are you building with it?
Connect:
- Email: tyson@hivo.co
- Twitter: @tysoncung
What We Built Together
Eight articles. One complete AI platform.
We started with seven broken Lambda functions. We built:
- Agent orchestration that handles complex multi-step workflows without timeouts
- TypeScript SDK with perfect IntelliSense, streaming support, and smart error handling
- Cost control that prevents $2,847 surprises with budgets and rate limits
- Production security with authentication, encryption, and monitoring
- One-command deployment that gets you running in under an hour
The platform serves 1,500+ requests daily. It's survived 8 months in production. It's processing everything from document analysis to research workflows. And it's completely open source.
The Hard-Won Lessons
Building production AI infrastructure taught me things tutorials never mention:
Technical truths:
- Cost control is life support, not a nice-to-have feature
- Lambda excels at tools, fails at orchestration
- Streaming looks simple, implementation is brutal
- Type safety prevents expensive mistakes at 3AM
Business realities:
- Developers pay for great experience, abandon bad APIs
- Open source builds trust better than marketing
- Production numbers matter more than perfect demos
- Failure stories teach more than success posts
Personal discoveries:
- Building in public creates accountability
- Documentation is your product's face
- Shipping beats perfecting every time
- Sharing mistakes helps everyone improve
Your Turn
You have everything you need. Real code, real examples, real production lessons. The platform is MIT licensed - use it, improve it, make money with it.
Next steps:
- Star the repos - ai-platform-aws and examples
- Deploy example 07 - full platform in under an hour
- Build something cool - then tell me about it
- Share your experience - help others learn from your journey
Get stuck? Email me at tyson@hivo.co or find me on Twitter @tysoncung.
The AI revolution needs better infrastructure. You can build it.
Go.
End of series: "Building an AI Platform on AWS from Scratch". Complete platform and examples at github.com/tysoncung/ai-platform-aws.