Reference: EKS Cluster Module
⚠️ Disclaimer: This documentation was AI-generated (“vibe coded”) and has not been fully verified yet. Please review carefully and report any inaccuracies.
The eks-cluster module deploys a complete Kubernetes-based workshop environment on AWS, providing participants with browser-based VSCode instances and a full workshop portal.
📦 What Gets Deployed
The eks-cluster module creates a complete workshop environment on AWS EKS:
Core Infrastructure
- EKS Cluster - Managed Kubernetes with auto-scaling nodes
- Networking - VPC, subnets, security groups, DNS (Route53)
- Storage - EFS for persistent user workspaces
- Security - SSL/TLS certificates (Let’s Encrypt), IAM roles
User Environment (Per Participant)
- VSCode Online - Browser-based IDE with pre-installed extensions
- MongoDB Database - Isolated database per user
- PostgreSQL Database - Optional, isolated per user
- Workspace Files - Workshop repo cloned and ready
Workshop Portal
- Frontend - Next.js application with leaderboard and instructions
- Backend API - Python Flask server for user management
- Documentation - Jekyll-built workshop instructions
- Results Processor - Automated exercise validation
Optional Components
- LiteLLM Proxy - Unified LLM API (OpenAI/Anthropic) with caching
- Redis Cache - LLM response caching to reduce API costs
⚙️ Configuration
All configuration is centralized in config.yaml at the customer folder level:
AWS Settings
aws:
  region: "us-east-2"
  profile: "Solution-Architects.User-979559056307"
Workshop Scenario Configuration
scenario:
  repository: "https://github.com/simonegaiera/mongodb-airbnb-workshop"
  branch: "main"
  database:
    mongodb: true
    postgresql: false
  llm:
    enabled: true
    provider: "openai" # or "anthropic"
    model: "gpt-5-chat" # or "claude-3-haiku"
    proxy_enabled: true
  leaderboard:
    type: "timed" # or "score"
  prizes:
    enabled: true
    where: "Happy Hour"
    when: "4:30 PM"
  instructions:
    sections:
      - title: "Welcome & Setup"
        content: ["/guided/", "/guided/vscode/"]
      - title: "CRUD Operations"
        content: ["/crud/", "/crud/1/"]
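To catch misconfiguration before terragrunt runs, a wrapper script can sanity-check the scenario section. A minimal Python sketch, assuming the key names shown above (the validation rules themselves are illustrative, not the module's actual checks):

```python
# Illustrative pre-flight validation of the scenario section of config.yaml.
# The required keys mirror the example above; the rules are assumptions,
# not the module's real validation logic.
def validate_scenario(scenario: dict) -> list[str]:
    errors = []
    for key in ("repository", "branch", "database", "llm", "leaderboard"):
        if key not in scenario:
            errors.append(f"missing key: {key}")
    llm = scenario.get("llm", {})
    if llm.get("enabled") and llm.get("provider") not in ("openai", "anthropic"):
        errors.append("llm.provider must be 'openai' or 'anthropic'")
    if scenario.get("leaderboard", {}).get("type") not in ("timed", "score"):
        errors.append("leaderboard.type must be 'timed' or 'score'")
    return errors

scenario = {
    "repository": "https://github.com/simonegaiera/mongodb-airbnb-workshop",
    "branch": "main",
    "database": {"mongodb": True, "postgresql": False},
    "llm": {"enabled": True, "provider": "openai", "model": "gpt-5-chat"},
    "leaderboard": {"type": "timed"},
}
print(validate_scenario(scenario))  # an empty list means the config passed
```

An empty result means the config is structurally sound; each string describes one problem to fix before deploying.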
⏱️ Deployment Time
- EKS cluster + infrastructure: 30-40 minutes
- User environments: 5-7 minutes (for ~10 users)
- Total: ~45-75 minutes depending on configuration
💡 Key Features
- 🌐 Domain & DNS - Automatic Route53 configuration with per-user subdomains and A+ rated SSL
- 🔒 Security - TLS encryption, IAM roles, namespace isolation, security groups
- 👤 Per-User Isolation - Dedicated VSCode instance, database, and workspace per participant
- 📊 Progress Tracking - Real-time leaderboard with automated exercise validation
- 🤖 AI Integration - Optional LLM support (OpenAI/Anthropic) with response caching
- ⚡ Scalability - Auto-scaling nodes, on-demand resource provisioning
📁 Module Structure & Components
The eks-cluster module is organized by folders, where each folder represents a deployable component with its own Helm chart. Each component has a corresponding .tf file that orchestrates its deployment.
🏗️ Folder-Based Architecture
utils/eks-cluster/
├── Core Infrastructure (.tf files)
│ ├── main.tf # Providers & backend
│ ├── variables.tf # Input variables
│ ├── infra.tf # VPC & networking
│ ├── eks.tf # EKS cluster
│ ├── efs.tf # Persistent storage
│ └── route53.tf # DNS & SSL
│
├── Component Folders (Helm Charts)
│ ├── mdb-openvscode/ → openvscode.tf
│ ├── mdb-nginx/ → nginx.tf
│ ├── portal-server/ → arena-portal.tf
│ ├── portal-nginx/ → arena-portal.tf
│ ├── docs-nginx/ → docs-nginx.tf
│ ├── scenario-definition/→ scenario-definition.tf
│ ├── litellm/ → litellm.tf (optional)
│ ├── redis/ → redis.tf (optional)
│ └── results-processor/ (mounted in VSCode)
│
└── Supporting Folders
├── aws_policies/ # IAM policy documents
├── nginx-conf-files/ # Nginx templates
├── nginx-html-files/ # Static HTML pages
└── mongodb-arena-portal/ # Portal source code
📦 How It Works: Folder = Deployable Component
Key Concept: Each folder with a Helm chart represents one deployable service. The corresponding .tf file orchestrates its deployment with environment-specific configuration.
Example:
- The mdb-openvscode/ folder contains the complete Helm chart for VSCode
- The openvscode.tf file deploys it with dynamic configuration (user list, MongoDB URIs, etc.)
Benefits:
- ✅ Self-contained: Each component has all its manifests, scripts, and config in one place
- ✅ Reusable: Helm charts can be versioned and reused across different customers
- ✅ Clear ownership: Easy to identify what each folder deploys
- ✅ Modular: Add/remove components by adding/removing folders
📦 Component Folder Structure
Each component folder contains a complete Helm chart:
component-name/
├── Chart.yaml # Helm chart metadata
├── values.yaml # Default configuration values
├── templates/ # Kubernetes manifests
│ ├── deployment.yaml
│ ├── service.yaml
│ ├── configmap.yaml
│ └── ...
└── files/ # Scripts and assets
├── startup.sh
└── ...
Core Infrastructure Files
main.tf - Foundation & Providers
The backbone of the module, configuring:
- Terraform Backend: S3 backend for state management
- Providers: AWS (~6.0), ACME (~2.35), PostgreSQL (~1.22), Kubernetes (~2.37), Helm (~3.0)
- Local Variables: Atlas connection strings, user lists, cluster naming conventions
- Expiration Management: Automatic 7-day (168h) expiration timestamp for resources
- Domain Configuration: Customer-specific domain names (e.g., customer.mongoarena.com)
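The 168h expiration tag mentioned above is simple date arithmetic; a Python sketch of the equivalent of Terraform's timeadd(timestamp(), "168h") (the tag's date format here is an assumption):

```python
from datetime import datetime, timedelta, timezone

# Mirror of Terraform's timeadd(timestamp(), "168h"): tag each resource
# with an expire-on date 7 days out so cleanup tooling can find it.
# The YYYY-MM-DD format is an assumption for illustration.
def expire_on(created_at: datetime, hours: int = 168) -> str:
    return (created_at + timedelta(hours=hours)).strftime("%Y-%m-%d")

created = datetime(2024, 6, 1, 12, 0, tzinfo=timezone.utc)
print(expire_on(created))  # → 2024-06-08
```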
variables.tf - Configuration Inputs
Defines all input variables with validation:
- aws_profile - SA AWS profile for authentication
- aws_region - EKS cluster location (default: us-east-2)
- customer_name - Workshop customer identifier
- aws_route53_hosted_zone - DNS zone (mongoarena.com)
- domain_email - Email for Let's Encrypt certificates
- atlas_standard_srv - MongoDB connection string from the atlas-cluster module
- atlas_user_list - Participant usernames (validated non-empty)
- atlas_user_password - Shared password for participants
- atlas_admin_user / atlas_admin_password - Admin credentials
- scenario_config - Complete workshop configuration from config.yaml
- anthropic_api_key / azure_openai_api_key - LLM API keys (optional)
Network & Compute Files
infra.tf - VPC & Network Infrastructure
Creates the networking foundation:
- VPC: 10.0.0.0/16 CIDR block with DNS support enabled
- Subnets: 2 public subnets (10.0.0.0/22, 10.0.4.0/22) across availability zones
- Internet Gateway: Enables outbound internet access
- Route Tables: Routes all traffic (0.0.0.0/0) through IGW
- Security Groups:
- EKS security group with DNS (53/tcp, 53/udp) and egress rules
- Kubernetes ELB role tags for load balancer provisioning
- Availability Zone Selection: Dynamic selection with us-west-2 filtering
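The two /22 subnets are carved out of the /16 VPC block; Terraform does this with cidrsubnet, and the arithmetic can be verified with Python's ipaddress module:

```python
import ipaddress

# Carve /22 subnets out of the 10.0.0.0/16 VPC block. Terraform would use
# cidrsubnet("10.0.0.0/16", 6, index); this just demonstrates that
# 10.0.0.0/22 and 10.0.4.0/22 are the first two such subnets.
vpc = ipaddress.ip_network("10.0.0.0/16")
subnets = list(vpc.subnets(new_prefix=22))

print([str(s) for s in subnets[:2]])  # → ['10.0.0.0/22', '10.0.4.0/22']
print(subnets[0].num_addresses)       # → 1024 addresses per subnet
```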
eks.tf - EKS Cluster & IAM
The heart of the Kubernetes infrastructure:
IAM Roles & Policies:
- Node Role (aws_iam_role.node):
- AmazonEKSWorkerNodeMinimalPolicy - Core worker node permissions
- AmazonEC2ContainerRegistryPullOnly - Pull Docker images
- AmazonElasticFileSystemClientReadWriteAccess - Access EFS volumes
- Custom EFS CSI policy for persistent storage
- Bedrock policy (conditional, if LLM enabled)
- S3 read policy for the mongodb-arena bucket
- Secrets Manager policy for API keys
- Cluster Role (aws_iam_role.cluster):
- AmazonEKSClusterPolicy - Manage EKS resources
- AmazonEKSComputePolicy - Compute operations
- AmazonEKSBlockStoragePolicy - EBS volumes
- AmazonEKSLoadBalancingPolicy - Load balancers
- AmazonEKSNetworkingPolicy - VPC networking
- Custom EKS Auto Mode policy for EC2 tagging
EKS Cluster Configuration:
- Compute Config: Auto Mode with “general-purpose” and “system” node pools
- Network Config: Elastic Load Balancing enabled
- Storage Config: Block storage enabled
- Access Config: API + ConfigMap authentication mode
- VPC Config: Public and private API endpoints
- Logging: API, audit, authenticator, controller manager, scheduler logs to CloudWatch
- Tags: Name, expire-on, owner, purpose for cost tracking
EKS Add-ons:
- vpc-cni - VPC networking
- metrics-server - Resource metrics for HPA
- amazon-cloudwatch-observability - Centralized logging
- aws-efs-csi-driver - EFS persistent volumes
Storage Classes:
- efs-sc - EFS storage class for user workspace persistence
- Provisioning mode: efs-ap (access points)
- Reclaim policy: Delete
- Volume binding: Immediate
efs.tf - Elastic File System
Persistent shared storage for user workspaces:
- EFS File System: Encrypted at rest, tagged with expiration
- Security Group: Allows NFS traffic (port 2049) from all sources
- Mount Targets: One per subnet for high availability
- Purpose: Stores user workspace data, ensures persistence across pod restarts
- Access: Each user gets isolated access through EFS access points
- Output: EFS ID and DNS name for volume mounting
SSL/TLS & DNS Files
route53.tf - DNS & SSL Certificates
Manages domain names and HTTPS certificates:
SSL Certificate Generation (ACME/Let’s Encrypt):
- ACME Account: Registered with Let’s Encrypt production server
- Private Keys: RSA 2048-bit keys for account and certificate
- Certificate Request: Wildcard cert for *.customer.mongoarena.com and the apex domain
- DNS Challenge: Route53 DNS-01 challenge with automatic validation
- Recursive Nameservers: Google (8.8.8.8) and Cloudflare (1.1.1.1) for fast propagation
- Result: A+ rated SSL certificate (validated with SSL Labs)
Route53 DNS Records:
- Main domain (customer.mongoarena.com) → Portal frontend
- WWW subdomain (www.customer.mongoarena.com) → Portal frontend
- Wildcard (*.customer.mongoarena.com) → User VSCode instances
- Instructions (instructions.customer.mongoarena.com) → Workshop documentation
- Participants (participants.customer.mongoarena.com) → Participant list page
- Portal (portal.customer.mongoarena.com) → Backend API server
All records use Route53 Alias records pointing to AWS Load Balancers with health checks.
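The record set above follows a fixed naming pattern per customer; a Python sketch that enumerates it (dns_records is a hypothetical helper, not part of the module):

```python
# Enumerate the Route53 record set for one customer domain, following
# the naming pattern documented above. dns_records is illustrative only.
def dns_records(customer: str, zone: str = "mongoarena.com") -> dict[str, str]:
    base = f"{customer}.{zone}"
    return {
        base: "portal frontend",
        f"www.{base}": "portal frontend",
        f"*.{base}": "user VSCode instances",
        f"instructions.{base}": "workshop documentation",
        f"participants.{base}": "participant list page",
        f"portal.{base}": "backend API server",
    }

records = dns_records("acme")
print(records["*.acme.mongoarena.com"])  # → user VSCode instances
```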
Application Components
Each application component is deployed from its own folder containing a complete Helm chart. The corresponding .tf file orchestrates the deployment with dynamic configuration.
1️⃣ mdb-openvscode/ → openvscode.tf
Per-User VSCode Instances - Deploys isolated browser-based IDEs for each participant:
Helm Chart Deployment (one per user):
- Chart: ./mdb-openvscode v0.1.3
- Namespace: Default (each user gets a unique deployment name)
- Timeout: 10 minutes to handle slow image pulls
Environment Variables (passed to each VSCode instance):
- MONGODB_URI - User-specific connection string with credentials
- ENVIRONMENT - Set to "prod"
- SERVICE_NAME - Backend server URL (http://localhost:5000)
- SCENARIO_PATH - Results processor location
- SIGNAL_FILE_PATH - Server restart signal location
- LOG_PATH - Exercise results directory
- WORKSHOP_USER - User home directory (/app)
- LEADERBOARD - Leaderboard type (timed/score)
- BACKEND_URL - User's backend API URL (https://username.customer.mongoarena.com/backend)
- LLM_MODEL - LLM model name if enabled
- DATABASE_NAME - User's MongoDB database name
- LLM_PROXY_ENABLED - MCP proxy toggle
- LLM_PROXY_TYPE, LLM_PROXY_SERVICE, LLM_PROXY_PORT - Proxy configuration
- MDB_MCP_CONNECTION_STRING - MongoDB MCP server connection
- PGSQL_MCP_CONNECTION_STRING - PostgreSQL MCP server connection (if enabled)
Volume Mounts:
- /home/workspace - Persistent EFS volume (user's files)
- /home/workspace/utils - Results processor scripts
- /home/workspace/.openvscode-server/data/Machine - VSCode settings
- /home/workspace/.openvscode-server/data/User/globalStorage/saoudrizwan.claude-dev/settings - Cline extension settings
- /home/workspace/scenario-config - Workshop scenario configuration
Persistence:
- Each user gets a dedicated PVC (PersistentVolumeClaim) backed by EFS
- Storage class: efs-sc (from eks.tf)
- Data persists across pod restarts
Services:
- Each deployment exposes a service: vscode-{username}-svc
- Services are discovered by Nginx for routing
2️⃣ mdb-nginx/ → nginx.tf
User VSCode Reverse Proxy - Routes traffic to individual VSCode instances:
Dynamic Configuration Generation:
- Base Config: Common Nginx settings (gzip, caching, SSL)
- Per-User Configs: For each Atlas user, generates:
- Server block listening on port 443 (HTTPS)
- Server name: {username}.customer.mongoarena.com
- Proxy pass to the user's VSCode service
- WebSocket upgrade headers for VSCode
- SSL certificate from nginx-tls-secret
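Terraform renders these server blocks with templatefile(); the same idea in Python, using a deliberately simplified template (the real one lives in nginx-conf-files/mdb-nginx-openvscode.conf.tpl and has more directives; the upstream port 3000 is an assumption):

```python
from string import Template

# Simplified stand-in for the templatefile() call that renders one HTTPS
# server block per Atlas user. The upstream port and directive set are
# illustrative assumptions, not the module's actual template.
SERVER_BLOCK = Template("""\
server {
    listen 443 ssl;
    server_name $user.$domain;
    location / {
        proxy_pass http://vscode-$user-svc:3000;
        proxy_set_header Upgrade $$http_upgrade;   # WebSocket support
        proxy_set_header Connection "upgrade";
    }
}
""")

def render_user_configs(users: list[str], domain: str) -> dict[str, str]:
    # One config file per user, keyed by username.
    return {u: SERVER_BLOCK.substitute(user=u, domain=domain) for u in users}

configs = render_user_configs(["alice", "bob"], "acme.mongoarena.com")
print(configs["alice"].splitlines()[2].strip())  # → server_name alice.acme.mongoarena.com;
```

Note the `$$http_upgrade` escape: string.Template uses `$` for its own placeholders, so Nginx's runtime variables must be doubled, exactly as `$$` is used in Terraform template files.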
Nginx Deployment (helm_release.airbnb_arena_nginx):
- Chart: ./mdb-nginx v0.1.2
- Volume Mounts:
- /etc/nginx/conf.d - Dynamic user configurations
- /etc/nginx/nginx.conf - Base Nginx config
- /usr/share/nginx/html - Static HTML (landing page)
- /mnt/vscode-{user} - Each user's persistent volume (for file serving)
- /etc/nginx/ssl - TLS certificates
- Service Type: LoadBalancer (AWS ELB automatically provisioned)
- Output: Load balancer hostname and zone ID for Route53
Security:
- TLS 1.2/1.3 only
- Strong cipher suites
- HSTS headers
- A+ SSL Labs rating
3️⃣ portal-server/ + portal-nginx/ → arena-portal.tf
Workshop Portal - Deploys the main workshop interface (frontend + backend):
Portal Backend Server (helm_release.portal_server):
- Technology: Python Flask application
- Chart: ./portal-server v0.1.0
- Purpose: REST API for leaderboard, user management, and progress tracking
- Environment Variables:
- MONGODB_URI - Admin connection to the shared database
- DB_NAME - "arena_shared" (shared across all users)
- PARTICIPANTS - Collection name for participant data
- USER_DETAILS - Collection name for user details
- LEADERBOARD - Leaderboard type configuration
- Volumes:
- Scenario config (read-only)
- Startup script for initialization
- Service: ClusterIP service on port 5000
- Database Collections:
- participants - User registration and status
- user_details - Extended user information
- leaderboard - Exercise completion times/scores
Portal Frontend Nginx (helm_release.portal_nginx):
- Technology: Next.js static export served by Nginx
- Chart: ./portal-nginx v0.1.5
- Build Process:
- Startup script clones the mongodb-arena-portal repo
- Runs npm install && npm run build
- Exports static files to /usr/share/nginx/html/portal
- Environment Variables (injected into Next.js build):
- NEXT_PUBLIC_API_URL - Backend API URL
- NEXT_PUBLIC_REPO_NAME - Workshop repository name
- NEXT_PUBLIC_SERVER_PATH - Backend server path
- NEXT_PUBLIC_PRIZES_ENABLED, PRIZES_WHERE, PRIZES_WHEN - Prize information
- Routes:
- / - Main portal landing page
- /leaderboard - Real-time leaderboard
- /backend/* - Proxied to portal-server
- Nginx Config:
- Listens on ports 80 (HTTP) and 443 (HTTPS)
- SSL termination with wildcard certificate
- Proxy to backend server for API calls
- Static file serving for frontend
- Service: LoadBalancer (separate from VSCode Nginx)
- DNS: Points to customer.mongoarena.com and portal.customer.mongoarena.com
4️⃣ docs-nginx/ → docs-nginx.tf
Workshop Instructions - Serves workshop documentation:
Documentation Builder:
- Technology: Jekyll static site generator
- Chart: ./docs-nginx v0.1.4
- Build Process:
- Startup script clones the workshop repository
- Navigates to the docs/ directory
- Runs bundle install && bundle exec jekyll build
- Copies the _site/ output to /usr/share/nginx/html/instructions
- Source: The workshop repository's /docs/ folder (Jekyll site)
- Result: Static HTML/CSS/JS documentation
Nginx Server Blocks:
- Participants Page (participants.customer.mongoarena.com):
- Root: /usr/share/nginx/html/
- Shows the participant list with VSCode links
- Instructions Page (instructions.customer.mongoarena.com):
- Root: /usr/share/nginx/html/instructions
- Serves the Jekyll-built documentation
Configuration:
- SSL enabled with wildcard certificate
- Gzip compression for faster page loads
- Custom 404 and 50x error pages
- Service: LoadBalancer (shared by both subdomains)
5️⃣ scenario-definition/ → scenario-definition.tf
Workshop Configuration - Centralizes workshop scenario settings:
Kubernetes Job (helm_release.scenario_definition):
- Chart: ./scenario-definition v0.1.7
- Purpose: Uploads the scenario config to MongoDB for runtime access
- Execution: Runs once at deployment, completes when upload succeeds
- Timeout: 5 minutes
ConfigMaps Created:
- scenario-definition-config - Raw YAML scenario config
- scenario-definition-enhanced-config - Enhanced with runtime values:
- aws_route53_record_name - Full domain name
- atlas_standard_srv - MongoDB connection string
- atlas_user_password - User credentials
Python Script (define-scenario.py):
- Reads /etc/scenario-config/scenario-config.json
- Connects to MongoDB (arena_shared database)
- Upserts a document into the scenario_config collection
- Makes the config accessible to all services
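The shape of that upsert can be sketched in Python. Only the filter/update construction is shown so the example stays self-contained; the _id value is an assumption, and the actual script would pass these documents to pymongo's update_one(..., upsert=True):

```python
import json

# Sketch of what define-scenario.py does: read the mounted JSON config and
# build the filter/update documents for an upsert into the scenario_config
# collection of arena_shared. With pymongo this would be:
#     db.scenario_config.update_one(flt, update, upsert=True)
# The "_id": "scenario" key is an illustrative assumption.
def build_upsert(raw_json: str) -> tuple[dict, dict]:
    config = json.loads(raw_json)
    flt = {"_id": "scenario"}
    update = {"$set": config}
    return flt, update

raw = ('{"repository": "https://github.com/simonegaiera/mongodb-airbnb-workshop",'
       ' "leaderboard": {"type": "timed"}}')
flt, update = build_upsert(raw)
print(flt)  # → {'_id': 'scenario'}
```

Because it is an upsert, re-running the job simply refreshes the stored config instead of creating duplicate documents.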
Configuration Elements:
- Workshop repository URL and branch
- Database configuration (MongoDB/PostgreSQL)
- LLM settings (provider, model, proxy)
- Leaderboard type (timed/score)
- Prize information
- Workshop instructions structure
Usage: Other services mount scenario-definition-config to read workshop settings.
Optional Components
6️⃣ litellm/ → litellm.tf (Optional)
LLM Proxy - Provides unified API for multiple LLM providers:
Conditional Deployment:
- Only deploys if scenario_config.llm.enabled == true and scenario_config.llm.proxy.enabled == true
LiteLLM Service (helm_release.litellm):
- Chart: ./litellm v0.1.14
- Purpose: Proxy for OpenAI and Anthropic APIs with caching
- Service: ClusterIP on port 4000
API Key Management:
- Primary Source: AWS Secrets Manager (arena/secrets)
- Fallback: Terraform variables (anthropic_api_key, azure_openai_api_key)
- Security: Stored as Kubernetes secrets, mounted as environment variables
Model Configuration:
- If OpenAI Provider:
- gpt-5-mini - Azure OpenAI endpoint
- gpt-5-chat - Azure OpenAI endpoint
- API base: https://solutionsconsultingopenai.openai.azure.com
- API version: 2025-04-01-preview
- If Anthropic Provider:
- claude-3-haiku - Anthropic API
- claude-4-sonnet - Anthropic API
Caching (if Redis enabled):
- Type: Redis-backed semantic cache
- TTL: 1800 seconds (30 minutes)
- Namespace: litellm.cline.cache
- Purpose: Reduce API costs by caching similar prompts
- Supported Call Types: completion, acompletion, embedding, aembedding
Environment Variables:
- PORT - 4000
- LITELLM_LOG - INFO-level logging
- REDIS_HOST, REDIS_PORT - If Redis enabled
Integration: VSCode Cline extension connects to http://litellm-service:4000
7️⃣ redis/ → redis.tf (Optional)
Redis Cache - Provides caching for LiteLLM:
Conditional Deployment:
- Only deploys if scenario_config.llm.proxy.redis.enabled == true
Redis Instance (helm_release.redis):
- Chart: ./redis v0.1.1
- Version: Redis 7.x
- Persistence: None (in-memory cache, acceptable for workshop duration)
- Service: ClusterIP on port 6379 (default Redis port)
Kubernetes Secret (redis_credentials):
- REDIS_HOST - Service name (e.g., "redis-service")
- REDIS_PORT - Port number (6379)
- REDIS_URL - Full connection string (redis://redis-service:6379)
Purpose:
- Cache LLM responses for identical prompts
- Reduce API costs during workshops
- Faster response times for repeated queries
Connection String: redis://redis-service:6379 (internal cluster DNS)
8️⃣ aurora.tf (Optional, no folder)
PostgreSQL Database - Provides PostgreSQL for multi-database workshops:
Conditional Deployment:
- Only deploys if scenario_config.database.postgres == true
Aurora Serverless v2 Cluster (aws_rds_cluster.aurora_cluster):
- Engine: aurora-postgresql 17.5
- Scaling: 0.5 to 1.0 ACU (Aurora Capacity Units)
- Database: sample_airbnb
- Master User: postgres
- Master Password: Same as Atlas user password
- Backup: 1 day retention
- Security:
- Accessible from EKS cluster security group
- Accessible from VPC (10.0.0.0/16)
- Accessible from specific subnet (104.30.164.0/28)
- Public Access: Enabled (for external tools)
- SSL: Required (sslmode=require)
pgvector Extension (null_resource.enable_pgvector):
- Installs PostgreSQL client locally (macOS/Linux)
- Connects to Aurora cluster
- Runs: CREATE EXTENSION IF NOT EXISTS vector;
- Enables vector similarity search in PostgreSQL
Per-User Resources: For each Atlas user:
- PostgreSQL Role: Login user with CREATE DATABASE privilege
- PostgreSQL Database: Owned by respective user
- Connection String: postgresql://{user}:{password}@{endpoint}:5432/{user}
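Building those per-user strings is simple string assembly; a Python sketch (the endpoint and password are placeholders, and appending sslmode=require mirrors the SSL requirement noted above):

```python
from urllib.parse import quote

# Build per-user PostgreSQL connection strings following the pattern
# documented above: each user owns a database named after them.
# Endpoint and password values below are placeholders.
def pg_connection_string(user: str, password: str, endpoint: str) -> str:
    return (f"postgresql://{quote(user)}:{quote(password)}"
            f"@{endpoint}:5432/{quote(user)}?sslmode=require")

uri = pg_connection_string(
    "alice", "s3cret", "aurora.cluster-xyz.us-east-2.rds.amazonaws.com"
)
print(uri)
```

URL-quoting the username and password guards against credentials containing reserved characters such as `@` or `:`.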
Data Restoration (kubernetes_job_v1.restore_backup_to_users):
- Purpose: Pre-populate user databases with Airbnb dataset
- Execution: Kubernetes job (runs once at deployment)
- Process:
- Downloads the backup from S3 (s3://mongodb-arena/postgres-backups/airbnb-backup.sql.gz)
- Decompresses the SQL file
- For each user, restores the backup to their database
- Verifies restoration by checking the table count
- Skips users already restored
- Image: postgres:17
- Resources: 1Gi memory, 500m CPU (bursts to 2Gi/1000m)
- Timeout: 10 minutes
- Backoff Limit: 3 retries
Output Variables:
- aurora_cluster_endpoint - Writer endpoint
- aurora_cluster_reader_endpoint - Reader endpoint
- aurora_database_name - "sample_airbnb"
- aurora_master_username - "postgres"
- aurora_connection_string - Full PostgreSQL connection string
- user_connection_strings - Per-user connection strings
- user_jdbc_urls - JDBC URLs for each user
Integration: VSCode instances receive PGSQL_MCP_CONNECTION_STRING environment variable for MCP server.
Supporting Directories
These folders provide configuration, templates, and utilities used by the component folders:
aws_policies/
IAM Policy Documents - JSON policy files for granular permissions:
- node_policy.json - Node role trust policy
- cluster_policy.json - Cluster role trust policy
- efs_csi_node_policy.json - EFS CSI driver permissions
- bedrock.json - AWS Bedrock LLM access
- s3.json - S3 bucket read access (mongodb-arena)
- secrets.json - Secrets Manager read access
- eks_auto_mode_policy.json - EC2 instance tagging for EKS Auto Mode
Used by: eks.tf
nginx-conf-files/
Nginx Configuration Templates - Template files for dynamic Nginx configs:
- nginx-base-config.conf.tpl - Common settings (gzip, caching, timeouts)
- mdb-nginx-openvscode.conf.tpl - Per-user VSCode server block
- doc-nginx-main.conf.tpl - Documentation server block
- portal-nginx-server.conf.tpl - Portal server block
- nginx.conf - Main Nginx configuration file
Used by: nginx.tf, docs-nginx.tf, arena-portal.tf
nginx-html-files/
Static HTML Templates - Templated HTML pages:
- index.html.tpl - Landing page with user links
- 404.html.tpl - Custom 404 error page
- 50x.html.tpl - Server error page
- favicon.ico - Website icon
Used by: nginx.tf, docs-nginx.tf
mongodb-arena-portal/
Portal Application Source - Next.js + Flask application source code:
- frontend/ - Next.js application:
- src/app/ - Next.js pages (App Router)
- src/components/ - React components
- public/ - Static assets
- next.config.mjs - Next.js configuration
- tailwind.config.js - Tailwind CSS styling
- server/ - Python Flask backend:
- app.py - Flask application
- requirements.txt - Python dependencies
Used by: arena-portal.tf (cloned and built during deployment)
results-processor/
Exercise Validation - Java application for validating workshop exercises:
- Technology: Java 21 + MongoDB Java Driver
- Purpose: Validates student exercise solutions
- Structure:
- src/main/java/com/mongodb/ - Java source code
- pom.xml - Maven dependencies
- target/ - Compiled JAR file
- run.sh - Execution script
- logs/ - Validation results
- Integration: Mounted into VSCode containers at /home/workspace/utils
Used by: openvscode.tf (mounted as volume)
🔄 Deployment Flow
Understanding how these components deploy in sequence (aligned with the 5-tier architecture):
Tier 1: Network Foundation
Files: main.tf, infra.tf
- VPC creation with public subnets
- Internet Gateway and route tables
- Security groups for EKS
- Availability zone configuration
- Time: ~2-3 minutes
Tier 2: Storage & Kubernetes
Files: efs.tf, eks.tf
- EFS Storage: File system creation, mount targets, security group
- EKS Cluster: Cluster creation, IAM roles, node pools, add-ons
- Storage Classes: EFS CSI driver configuration
- Time: ~30-40 minutes (EKS cluster is the longest component)
Tier 3: Supporting Services
Files: route53.tf, scenario-definition/, aurora.tf
- SSL Certificates: Let’s Encrypt certificate via ACME, DNS validation
- Scenario Config: Kubernetes job uploads workshop config to MongoDB
- PostgreSQL (optional): Aurora Serverless v2, pgvector, per-user databases
- Time: ~5-10 minutes (15-20 if Aurora enabled)
Tier 4: User Workspaces
Folders: mdb-openvscode/, redis/
- VSCode Instances: Per-user deployments with PVCs, ConfigMaps, services
- Redis Cache (optional): In-memory cache for LiteLLM
- Deployments happen in parallel per user
- Time: ~3-5 minutes for 10 users
Tier 5: User-Facing Services
Folders: mdb-nginx/, portal-server/, portal-nginx/, docs-nginx/, litellm/
- VSCode Nginx: Reverse proxy with per-user configs, load balancer
- Portal Backend: Flask API server for leaderboard and user management
- Portal Frontend: Next.js build and deployment
- Documentation: Jekyll site build and deployment
- LiteLLM Proxy (optional): Unified LLM API with caching
- Time: ~5-10 minutes
Total Deployment Time
- Minimum (no optional components): ~45-55 minutes
- Full deployment (with PostgreSQL + LLM): ~60-75 minutes
- Note: Most time spent waiting for EKS cluster (Tier 2)
🎭 Component Dependencies
This diagram shows the deployment dependencies. Components are deployed in order based on what they depend on:
┌─────────────────────┐
│ 1. infra.tf │
│ VPC + Subnets │
└──────────┬──────────┘
│
┌──────────┴──────────┐
│ │
▼ ▼
┌──────────────────┐ ┌─────────────────┐
│ 2. efs.tf │ │ 2. eks.tf │
│ EFS Storage │ │ EKS Cluster │
└──────────────────┘ └────────┬────────┘
│
┌────────────────────┼────────────────────┐
│ │ │
▼ ▼ ▼
┌──────────────────┐ ┌────────────────┐ ┌──────────────────┐
│ 3. route53.tf │ │ 3. scenario- │ │ 3. aurora.tf │
│ SSL/TLS Cert │ │ definition/ │ │ PostgreSQL │
└────────┬─────────┘ └───────┬────────┘ │ (optional) │
│ │ └──────────────────┘
│ │
┌───────┴────────────────────┴───────┐
│ │
▼ ▼
┌─────────────────┐ ┌─────────────────┐
│ 4. redis/ │ │ 4. mdb- │
│ Redis Cache │ │ openvscode/ │
│ (optional) │ │ VSCode Users │
└────────┬────────┘ └────────┬────────┘
│ │
│ ┌───────────────────────┼──────────────┐
│ │ │ │
▼ ▼ ▼ ▼
┌─────────────┐ ┌──────────┐ ┌─────────────┐ ┌──────────────┐
│ 5. litellm/ │ │ 5. mdb- │ │ 5. portal- │ │ 5. docs- │
│ LLM Proxy │ │ nginx/ │ │ server/ + │ │ nginx/ │
│ (optional) │ │ VSCode │ │ portal- │ │ Workshop │
└─────────────┘ │ Proxy │ │ nginx/ │ │ Docs │
└──────────┘ │ Portal │ └──────────────┘
└─────────────┘
Legend:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Tier 1: infra.tf → Network foundation
Tier 2: efs.tf, eks.tf → Storage + Kubernetes
Tier 3: SSL, Config, DB → Supporting services
Tier 4: User workspaces → Per-user resources
Tier 5: Proxies & Portal → User-facing services
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Folder-to-Tier Mapping:
├── Tier 1-3: Core .tf files (no folders)
├── Tier 4-5: Component folders
│ ├── mdb-openvscode/ → openvscode.tf
│ ├── mdb-nginx/ → nginx.tf
│ ├── portal-server/ → arena-portal.tf
│ ├── portal-nginx/ → arena-portal.tf
│ ├── docs-nginx/ → docs-nginx.tf
│ ├── scenario-definition/→ scenario-definition.tf
│ ├── redis/ → redis.tf (optional)
│ └── litellm/ → litellm.tf (optional)
Key Dependencies:
- SSL Certificate (route53.tf) → Required by all Nginx components for HTTPS
- Scenario Config (scenario-definition/) → Required by VSCode and Portal for workshop settings
- EFS Storage (efs.tf) → Required by VSCode instances for persistent workspaces
- Redis (redis/) → Required by LiteLLM for caching (if enabled)
- VSCode Instances (mdb-openvscode/) → Must exist before VSCode Nginx can proxy to them
- Aurora (aurora.tf) → Independent, only needs VPC (if PostgreSQL enabled)
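These dependencies form a DAG, so a topological sort reproduces the tier ordering. A Python sketch using the standard library (the edge list is a simplification of the diagram above):

```python
from graphlib import TopologicalSorter

# Simplified dependency graph from the diagram above: each component maps
# to the set of components it depends on (graphlib's predecessor format).
deps = {
    "infra": set(),
    "efs": {"infra"},
    "eks": {"infra"},
    "route53": {"eks"},
    "scenario-definition": {"eks"},
    "aurora": {"infra"},                            # optional
    "mdb-openvscode": {"efs", "scenario-definition"},
    "redis": {"eks"},                               # optional
    "litellm": {"redis"},                           # optional
    "mdb-nginx": {"route53", "mdb-openvscode"},
    "arena-portal": {"route53", "scenario-definition"},
    "docs-nginx": {"route53"},
}
order = list(TopologicalSorter(deps).static_order())
print(order.index("infra") < order.index("eks") < order.index("mdb-nginx"))  # → True
```

Terraform derives the same ordering implicitly from resource references; the sketch just makes the tiers inspectable.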
🔍 Key Design Patterns
Per-User Isolation
- Each user gets dedicated Kubernetes resources (namespace via naming)
- Separate PVC for file persistence
- Individual ConfigMaps for personalized settings
- Isolated MongoDB databases
- Isolated PostgreSQL databases (if enabled)
Dynamic Configuration
- Terraform templates generate per-user Nginx configs
- User list from Atlas drives VSCode instance creation
- ConfigMaps populated from scenario_config variable
- DNS records automatically created for each user
Conditional Deployment
- PostgreSQL only if database.postgres == true
- LiteLLM only if llm.enabled == true and llm.proxy.enabled == true
- Redis only if llm.proxy.redis.enabled == true
- Bedrock policies only if llm.enabled == true
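That flag-to-component mapping is easy to express directly; a Python sketch (the nested key paths follow the conditions above, assuming proxy and redis are nested objects within scenario_config; the component labels are illustrative):

```python
# Map scenario_config flags to the optional components that get deployed,
# following the conditions listed above. Component names are illustrative.
def optional_components(cfg: dict) -> set[str]:
    enabled = set()
    llm = cfg.get("llm", {})
    proxy = llm.get("proxy", {})
    if cfg.get("database", {}).get("postgres"):
        enabled.add("aurora")
    if llm.get("enabled") and proxy.get("enabled"):
        enabled.add("litellm")
    if proxy.get("redis", {}).get("enabled"):
        enabled.add("redis")
    if llm.get("enabled"):
        enabled.add("bedrock-policies")
    return enabled

cfg = {
    "database": {"postgres": False},
    "llm": {"enabled": True, "proxy": {"enabled": True, "redis": {"enabled": True}}},
}
print(sorted(optional_components(cfg)))  # → ['bedrock-policies', 'litellm', 'redis']
```

In Terraform the same conditions appear as count/for_each expressions on each optional resource.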
High Availability
- Multi-AZ subnets for resilience
- EFS mount targets in each AZ
- Load balancers with health checks
- Auto-scaling node pools (EKS Auto Mode)
Security Best Practices
- Least privilege IAM roles
- TLS encryption everywhere (A+ rating)
- Secrets stored in AWS Secrets Manager
- Network security groups
- Private subnets for sensitive resources
Cost Optimization
- EKS Auto Mode for efficient scaling
- Aurora Serverless v2 (scales down to 0.5 ACU when idle)
- Redis in-memory only (no persistence)
- 7-day auto-expiration tags
- Shared infrastructure (one EKS cluster for all users)
🎯 Deployment Options
Fully Managed (with EKS)
- Complete browser-based environment
- VSCode Online for all participants
- Centralized management and monitoring
- Best for: Corporate workshops, large groups
Hybrid (without EKS)
- MongoDB Atlas cluster only
- Participants use local development environment
- Reduced infrastructure costs
- Best for: Technical audiences, smaller groups
💡 Tip: To deploy hybrid mode, remove the eks-cluster folder from your customer directory before running terragrunt apply --all
⚠️ Important Notes
Cluster Expiration
- EKS clusters expire after 1 week by default
- Plan workshop timing accordingly
- Can be extended if needed
Resource Cleanup
- Always destroy EKS resources after workshop completion
- Download leaderboard data before destruction
- Verify all resources removed to avoid AWS charges
Prerequisites
- Requires successful deployment of the atlas-cluster module
- MongoDB connection string passed automatically to EKS
- User credentials synced from Atlas
💡 Note: The EKS cluster module is optional. If you only need MongoDB Atlas for a hybrid workshop where participants use their own IDE, you can skip this module and deploy only the atlas-cluster.