Reference: EKS Cluster Module
⚠️ Disclaimer: This documentation was AI-generated (“vibe coded”) and has not been fully verified yet. Please review carefully and report any inaccuracies.
The eks-cluster module deploys a complete Kubernetes-based workshop environment on AWS, providing participants with browser-based VSCode instances and a full workshop portal.
📦 What Gets Deployed
The eks-cluster module creates a complete workshop environment on AWS EKS:
Core Infrastructure
- EKS Cluster - Managed Kubernetes with auto-scaling nodes
- Networking - VPC, subnets, security groups, DNS (Route53)
- Storage - EFS for persistent user workspaces
- Security - SSL/TLS certificates (Let’s Encrypt), IAM roles
User Environment (Per Participant)
- VSCode Online - Browser-based IDE with pre-installed extensions
- MongoDB Database - Isolated database per user
- PostgreSQL Database - Optional, isolated per user
- Workspace Files - Workshop repo cloned and ready
Workshop Portal
- Frontend - Next.js application with leaderboard and instructions
- Backend API - Python Flask server for user management
- Documentation - Jekyll-built workshop instructions
- Results Processor - Automated exercise validation
Optional Components
- LiteLLM Proxy - Unified LLM API (OpenAI/Anthropic) with caching
- Redis Cache - LLM response caching to reduce API costs
⚙️ Configuration
All configuration is centralized in config.yaml at the customer folder level:
AWS Settings
aws:
  region: "us-east-2"
  profile: "Solution-Architects.User-979559056307"
Workshop Scenario Configuration
scenario:
  repository: "https://github.com/simonegaiera/mongodb-airbnb-workshop"
  branch: "main"
  database:
    mongodb: true
    postgresql: false
  llm:
    enabled: true
    provider: "openai" # or "anthropic"
    model: "gpt-5-chat" # or "claude-3-haiku"
    proxy_enabled: true
  leaderboard:
    type: "timed" # or "score"
  prizes:
    enabled: true
    where: "Happy Hour"
    when: "4:30 PM"
  instructions:
    sections:
      - title: "Welcome & Setup"
        content: ["/guided/", "/guided/vscode/"]
      - title: "CRUD Operations"
        content: ["/crud/", "/crud/1/"]
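To catch misconfiguration before terragrunt runs, a wrapper script can sanity-check the scenario section. A minimal Python sketch, assuming the key names shown above (the validation rules themselves are illustrative, not the module's actual checks):

```python
# Illustrative pre-flight validation of the scenario section of config.yaml.
# The required keys mirror the example above; the rules are assumptions,
# not the module's real validation logic.
def validate_scenario(scenario: dict) -> list[str]:
    errors = []
    for key in ("repository", "branch", "database", "llm", "leaderboard"):
        if key not in scenario:
            errors.append(f"missing key: {key}")
    llm = scenario.get("llm", {})
    if llm.get("enabled") and llm.get("provider") not in ("openai", "anthropic"):
        errors.append("llm.provider must be 'openai' or 'anthropic'")
    if scenario.get("leaderboard", {}).get("type") not in ("timed", "score"):
        errors.append("leaderboard.type must be 'timed' or 'score'")
    return errors

scenario = {
    "repository": "https://github.com/simonegaiera/mongodb-airbnb-workshop",
    "branch": "main",
    "database": {"mongodb": True, "postgresql": False},
    "llm": {"enabled": True, "provider": "openai", "model": "gpt-5-chat"},
    "leaderboard": {"type": "timed"},
}
print(validate_scenario(scenario))  # an empty list means the config passed
```

An empty result means the config is structurally sound; each string describes one problem to fix before deploying.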
⏱️ Deployment Time
- EKS cluster + infrastructure: 30-40 minutes
- User environments: 5-7 minutes (for ~10 users)
- Total: ~45-75 minutes depending on configuration
💡 Key Features
- 🌐 Domain & DNS - Automatic Route53 configuration with per-user subdomains and A+ rated SSL
- 🔒 Security - TLS encryption, IAM roles, namespace isolation, security groups
- 👤 Per-User Isolation - Dedicated VSCode instance, database, and workspace per participant
- 📊 Progress Tracking - Real-time leaderboard with automated exercise validation
- 🤖 AI Integration - Optional LLM support (OpenAI/Anthropic) with response caching
- ⚡ Scalability - Auto-scaling nodes, on-demand resource provisioning
📁 Module Structure & Components
The eks-cluster module is organized by folders, where each folder represents a deployable component with its own Helm chart. Each component has a corresponding .tf file that orchestrates its deployment.
🏗️ Folder-Based Architecture
utils/eks-cluster/
├── Core Infrastructure (.tf files)
│ ├── main.tf # Providers & backend
│ ├── variables.tf # Input variables
│ ├── infra.tf # VPC & networking
│ ├── eks.tf # EKS cluster
│ ├── efs.tf # Persistent storage
│ └── route53.tf # DNS & SSL
│
├── Component Folders (Helm Charts)
│ ├── mdb-openvscode/ → openvscode.tf
│ ├── mdb-nginx/ → nginx.tf
│ ├── portal-server/ → arena-portal.tf
│ ├── portal-nginx/ → arena-portal.tf
│ ├── docs-nginx/ → docs-nginx.tf
│ ├── scenario-definition/→ scenario-definition.tf
│ ├── litellm/ → litellm.tf (optional)
│ ├── redis/ → redis.tf (optional)
│ └── results-processor/ (mounted in VSCode)
│
└── Supporting Folders
├── aws_policies/ # IAM policy documents
├── nginx-conf-files/ # Nginx templates
├── nginx-html-files/ # Static HTML pages
└── mongodb-arena-portal/ # Portal source code
📦 How It Works: Folder = Deployable Component
Key Concept: Each folder with a Helm chart represents one deployable service. The corresponding .tf file orchestrates its deployment with environment-specific configuration.
Example:
- The mdb-openvscode/ folder contains the complete Helm chart for VSCode
- The openvscode.tf file deploys it with dynamic configuration (user list, MongoDB URIs, etc.)
Benefits:
- ✅ Self-contained: Each component has all its manifests, scripts, and config in one place
- ✅ Reusable: Helm charts can be versioned and reused across different customers
- ✅ Clear ownership: Easy to identify what each folder deploys
- ✅ Modular: Add/remove components by adding/removing folders
📦 Component Folder Structure
Each component folder contains a complete Helm chart:
component-name/
├── Chart.yaml # Helm chart metadata
├── values.yaml # Default configuration values
├── templates/ # Kubernetes manifests
│ ├── deployment.yaml
│ ├── service.yaml
│ ├── configmap.yaml
│ └── ...
└── files/ # Scripts and assets
├── startup.sh
└── ...
Core Infrastructure Files
main.tf - Foundation & Providers
The backbone of the module, configuring:
- Terraform Backend: S3 backend for state management
- Providers: AWS (~6.0), ACME (~2.35), PostgreSQL (~1.22), Kubernetes (~2.37), Helm (~3.0)
- Local Variables: Atlas connection strings, user lists, cluster naming conventions
- Expiration Management: Automatic 7-day (168h) expiration timestamp for resources
- Domain Configuration: Customer-specific domain names (e.g., customer.mongoarena.com)
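The 168h expiration tag mentioned above is simple date arithmetic; a Python sketch of the equivalent of Terraform's timeadd(timestamp(), "168h") (the tag's date format here is an assumption):

```python
from datetime import datetime, timedelta, timezone

# Mirror of Terraform's timeadd(timestamp(), "168h"): tag each resource
# with an expire-on date 7 days out so cleanup tooling can find it.
# The YYYY-MM-DD format is an assumption for illustration.
def expire_on(created_at: datetime, hours: int = 168) -> str:
    return (created_at + timedelta(hours=hours)).strftime("%Y-%m-%d")

created = datetime(2024, 6, 1, 12, 0, tzinfo=timezone.utc)
print(expire_on(created))  # → 2024-06-08
```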
variables.tf - Configuration Inputs
Defines all input variables with validation:
- aws_profile - SA AWS profile for authentication
- aws_region - EKS cluster location (default: us-east-2)
- customer_name - Workshop customer identifier
- aws_route53_hosted_zone - DNS zone (mongoarena.com)
- domain_email - Email for Let's Encrypt certificates
- atlas_standard_srv - MongoDB connection string from the atlas-cluster module
- atlas_user_list - Participant usernames (validated non-empty)
- atlas_user_password - Shared password for participants
- atlas_admin_user / atlas_admin_password - Admin credentials
- scenario_config - Complete workshop configuration from config.yaml
- anthropic_api_key / azure_openai_api_key - LLM API keys (optional)
Network & Compute Files
infra.tf - VPC & Network Infrastructure
Creates the networking foundation:
- VPC: 10.0.0.0/16 CIDR block with DNS support enabled
- Subnets: 2 public subnets (10.0.0.0/22, 10.0.4.0/22) across availability zones
- Internet Gateway: Enables outbound internet access
- Route Tables: Routes all traffic (0.0.0.0/0) through IGW
- Security Groups:
- EKS security group with DNS (53/tcp, 53/udp) and egress rules
- Kubernetes ELB role tags for load balancer provisioning
- Availability Zone Selection: Dynamic selection with us-west-2 filtering
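The two /22 subnets are carved out of the /16 VPC block; Terraform does this with cidrsubnet, and the arithmetic can be verified with Python's ipaddress module:

```python
import ipaddress

# Carve /22 subnets out of the 10.0.0.0/16 VPC block. Terraform would use
# cidrsubnet("10.0.0.0/16", 6, index); this just demonstrates that
# 10.0.0.0/22 and 10.0.4.0/22 are the first two such subnets.
vpc = ipaddress.ip_network("10.0.0.0/16")
subnets = list(vpc.subnets(new_prefix=22))

print([str(s) for s in subnets[:2]])  # → ['10.0.0.0/22', '10.0.4.0/22']
print(subnets[0].num_addresses)       # → 1024 addresses per subnet
```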
eks.tf - EKS Cluster & IAM
The heart of the Kubernetes infrastructure:
IAM Roles & Policies:
- Node Role (aws_iam_role.node):
- AmazonEKSWorkerNodeMinimalPolicy - Core worker node permissions
- AmazonEC2ContainerRegistryPullOnly - Pull Docker images
- AmazonElasticFileSystemClientReadWriteAccess - Access EFS volumes
- Custom EFS CSI policy for persistent storage
- Bedrock policy (conditional, if LLM enabled)
- S3 read policy for the mongodb-arena bucket
- Secrets Manager policy for API keys
- Cluster Role (aws_iam_role.cluster):
- AmazonEKSClusterPolicy - Manage EKS resources
- AmazonEKSComputePolicy - Compute operations
- AmazonEKSBlockStoragePolicy - EBS volumes
- AmazonEKSLoadBalancingPolicy - Load balancers
- AmazonEKSNetworkingPolicy - VPC networking
- Custom EKS Auto Mode policy for EC2 tagging
EKS Cluster Configuration:
- Compute Config: Auto Mode with “general-purpose” and “system” node pools
- Network Config: Elastic Load Balancing enabled
- Storage Config: Block storage enabled
- Access Config: API + ConfigMap authentication mode
- VPC Config: Public and private API endpoints
- Logging: API, audit, authenticator, controller manager, scheduler logs to CloudWatch
- Tags: Name, expire-on, owner, purpose for cost tracking
EKS Add-ons:
- vpc-cni - VPC networking
- metrics-server - Resource metrics for HPA
- amazon-cloudwatch-observability - Centralized logging
- aws-efs-csi-driver - EFS persistent volumes
Storage Classes:
- efs-sc - EFS storage class for user workspace persistence
- Provisioning mode: efs-ap (access points)
- Reclaim policy: Delete
- Volume binding: Immediate
efs.tf - Elastic File System
Persistent shared storage for user workspaces:
- EFS File System: Encrypted at rest, tagged with expiration
- Security Group: Allows NFS traffic (port 2049) from all sources
- Mount Targets: One per subnet for high availability
- Purpose: Stores user workspace data, ensures persistence across pod restarts
- Access: Each user gets isolated access through EFS access points
- Output: EFS ID and DNS name for volume mounting
SSL/TLS & DNS Files
route53.tf - DNS & SSL Certificates
Manages domain names and HTTPS certificates:
SSL Certificate Generation (ACME/Let’s Encrypt):
- ACME Account: Registered with Let’s Encrypt production server
- Private Keys: RSA 2048-bit keys for account and certificate
- Certificate Request: Wildcard cert for *.customer.mongoarena.com and the apex domain
- DNS Challenge: Route53 DNS-01 challenge with automatic validation
- Recursive Nameservers: Google (8.8.8.8) and Cloudflare (1.1.1.1) for fast propagation
- Result: A+ rated SSL certificate (validated with SSL Labs)
Route53 DNS Records:
- Main domain (customer.mongoarena.com) → Portal frontend
- WWW subdomain (www.customer.mongoarena.com) → Portal frontend
- Wildcard (*.customer.mongoarena.com) → User VSCode instances
- Instructions (instructions.customer.mongoarena.com) → Workshop documentation
- Participants (participants.customer.mongoarena.com) → Participant list page
- Portal (portal.customer.mongoarena.com) → Backend API server
All records use Route53 Alias records pointing to AWS Load Balancers with health checks.
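The record set above follows a fixed naming pattern per customer; a Python sketch that enumerates it (dns_records is a hypothetical helper, not part of the module):

```python
# Enumerate the Route53 record set for one customer domain, following
# the naming pattern documented above. dns_records is illustrative only.
def dns_records(customer: str, zone: str = "mongoarena.com") -> dict[str, str]:
    base = f"{customer}.{zone}"
    return {
        base: "portal frontend",
        f"www.{base}": "portal frontend",
        f"*.{base}": "user VSCode instances",
        f"instructions.{base}": "workshop documentation",
        f"participants.{base}": "participant list page",
        f"portal.{base}": "backend API server",
    }

records = dns_records("acme")
print(records["*.acme.mongoarena.com"])  # → user VSCode instances
```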
Application Components
Each application component is deployed from its own folder containing a complete Helm chart. The corresponding .tf file orchestrates the deployment with dynamic configuration.
1️⃣ mdb-openvscode/ → openvscode.tf
Per-User VSCode Instances - Deploys isolated browser-based IDEs for each participant:
Helm Chart Deployment (one per user):
- Chart: ./mdb-openvscode v0.1.3
- Namespace: Default (each user gets a unique deployment name)
- Timeout: 10 minutes to handle slow image pulls
Environment Variables (passed to each VSCode instance):
- MONGODB_URI - User-specific connection string with credentials
- ENVIRONMENT - Set to "prod"
- SERVICE_NAME - Backend server URL (http://localhost:5000)
- SCENARIO_PATH - Results processor location
- SIGNAL_FILE_PATH - Server restart signal location
- LOG_PATH - Exercise results directory
- WORKSHOP_USER - User home directory (/app)
- LEADERBOARD - Leaderboard type (timed/score)
- BACKEND_URL - User's backend API URL (https://username.customer.mongoarena.com/backend)
- LLM_MODEL - LLM model name if enabled
- DATABASE_NAME - User's MongoDB database name
- LLM_PROXY_ENABLED - MCP proxy toggle
- LLM_PROXY_TYPE, LLM_PROXY_SERVICE, LLM_PROXY_PORT - Proxy configuration
- MDB_MCP_CONNECTION_STRING - MongoDB MCP server connection
- PGSQL_MCP_CONNECTION_STRING - PostgreSQL MCP server connection (if enabled)
Volume Mounts:
- /home/workspace - Persistent EFS volume (user's files)
- /home/workspace/utils - Results processor scripts
- /home/workspace/.openvscode-server/data/Machine - VSCode settings
- /home/workspace/.openvscode-server/data/User/globalStorage/saoudrizwan.claude-dev/settings - Cline extension settings
- /home/workspace/scenario-config - Workshop scenario configuration
Persistence:
- Each user gets a dedicated PVC (PersistentVolumeClaim) backed by EFS
- Storage class: efs-sc (from eks.tf)
- Data persists across pod restarts
Services:
- Each deployment exposes a service: vscode-{username}-svc
- Services are discovered by Nginx for routing
2️⃣ mdb-nginx/ → nginx.tf
User VSCode Reverse Proxy - Routes traffic to individual VSCode instances:
Dynamic Configuration Generation:
- Base Config: Common Nginx settings (gzip, caching, SSL)
- Per-User Configs: For each Atlas user, generates:
- Server block listening on port 443 (HTTPS)
- Server name: {username}.customer.mongoarena.com
- Proxy pass to the user's VSCode service
- WebSocket upgrade headers for VSCode
- SSL certificate from nginx-tls-secret
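Terraform renders these server blocks with templatefile(); the same idea in Python, using a deliberately simplified template (the real one lives in nginx-conf-files/mdb-nginx-openvscode.conf.tpl and has more directives; the upstream port 3000 is an assumption):

```python
from string import Template

# Simplified stand-in for the templatefile() call that renders one HTTPS
# server block per Atlas user. The upstream port and directive set are
# illustrative assumptions, not the module's actual template.
SERVER_BLOCK = Template("""\
server {
    listen 443 ssl;
    server_name $user.$domain;
    location / {
        proxy_pass http://vscode-$user-svc:3000;
        proxy_set_header Upgrade $$http_upgrade;   # WebSocket support
        proxy_set_header Connection "upgrade";
    }
}
""")

def render_user_configs(users: list[str], domain: str) -> dict[str, str]:
    # One config file per user, keyed by username.
    return {u: SERVER_BLOCK.substitute(user=u, domain=domain) for u in users}

configs = render_user_configs(["alice", "bob"], "acme.mongoarena.com")
print(configs["alice"].splitlines()[2].strip())  # → server_name alice.acme.mongoarena.com;
```

Note the `$$http_upgrade` escape: string.Template uses `$` for its own placeholders, so Nginx's runtime variables must be doubled, exactly as `$$` is used in Terraform template files.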
Nginx Deployment (helm_release.airbnb_arena_nginx):
- Chart: ./mdb-nginx v0.1.2
- Volume Mounts:
- /etc/nginx/conf.d - Dynamic user configurations
- /etc/nginx/nginx.conf - Base Nginx config
- /usr/share/nginx/html - Static HTML (landing page)
- /mnt/vscode-{user} - Each user's persistent volume (for file serving)
- /etc/nginx/ssl - TLS certificates
- Service Type: LoadBalancer (AWS ELB automatically provisioned)
- Output: Load balancer hostname and zone ID for Route53
Security:
- TLS 1.2/1.3 only
- Strong cipher suites
- HSTS headers
- A+ SSL Labs rating
3️⃣ portal-server/ + portal-nginx/ → arena-portal.tf
Workshop Portal - Deploys the main workshop interface (frontend + backend):
Portal Backend Server (helm_release.portal_server):
- Technology: Python Flask application
- Chart: ./portal-server v0.1.0
- Purpose: REST API for leaderboard, user management, and progress tracking
- Environment Variables:
- MONGODB_URI - Admin connection to the shared database
- DB_NAME - "arena_shared" (shared across all users)
- PARTICIPANTS - Collection name for participant data
- USER_DETAILS - Collection name for user details
- LEADERBOARD - Leaderboard type configuration
- Volumes:
- Scenario config (read-only)
- Startup script for initialization
- Service: ClusterIP service on port 5000
- Database Collections:
- participants - User registration and status
- user_details - Extended user information
- leaderboard - Exercise completion times/scores
Portal Frontend Nginx (helm_release.portal_nginx):
- Technology: Next.js static export served by Nginx
- Chart: ./portal-nginx v0.1.5
- Build Process:
- Startup script clones the mongodb-arena-portal repo
- Runs npm install && npm run build
- Exports static files to /usr/share/nginx/html/portal
- Environment Variables (injected into Next.js build):
- NEXT_PUBLIC_API_URL - Backend API URL
- NEXT_PUBLIC_REPO_NAME - Workshop repository name
- NEXT_PUBLIC_SERVER_PATH - Backend server path
- NEXT_PUBLIC_PRIZES_ENABLED, PRIZES_WHERE, PRIZES_WHEN - Prize information
- Routes:
- / - Main portal landing page
- /leaderboard - Real-time leaderboard
- /backend/* - Proxied to portal-server
- Nginx Config:
- Listens on ports 80 (HTTP) and 443 (HTTPS)
- SSL termination with wildcard certificate
- Proxy to backend server for API calls
- Static file serving for frontend
- Service: LoadBalancer (separate from VSCode Nginx)
- DNS: Points to customer.mongoarena.com and portal.customer.mongoarena.com
4️⃣ docs-nginx/ → docs-nginx.tf
Workshop Instructions - Serves workshop documentation:
Documentation Builder:
- Technology: Jekyll static site generator
- Chart: ./docs-nginx v0.1.4
- Build Process:
- Startup script clones the workshop repository
- Navigates to the docs/ directory
- Runs bundle install && bundle exec jekyll build
- Copies the _site/ output to /usr/share/nginx/html/instructions
- Source: The workshop repository's /docs/ folder (Jekyll site)
- Result: Static HTML/CSS/JS documentation
Nginx Server Blocks:
- Participants Page (participants.customer.mongoarena.com):
- Root: /usr/share/nginx/html/
- Shows the participant list with VSCode links
- Instructions Page (instructions.customer.mongoarena.com):
- Root: /usr/share/nginx/html/instructions
- Serves the Jekyll-built documentation
Configuration:
- SSL enabled with wildcard certificate
- Gzip compression for faster page loads
- Custom 404 and 50x error pages
- Service: LoadBalancer (shared by both subdomains)
5️⃣ scenario-definition/ → scenario-definition.tf
Workshop Configuration - Centralizes workshop scenario settings:
Kubernetes Job (helm_release.scenario_definition):
- Chart: ./scenario-definition v0.1.7
- Purpose: Uploads the scenario config to MongoDB for runtime access
- Execution: Runs once at deployment, completes when upload succeeds
- Timeout: 5 minutes
ConfigMaps Created:
- scenario-definition-config - Raw YAML scenario config
- scenario-definition-enhanced-config - Enhanced with runtime values:
- aws_route53_record_name - Full domain name
- atlas_standard_srv - MongoDB connection string
- atlas_user_password - User credentials
Python Script (define-scenario.py):
- Reads /etc/scenario-config/scenario-config.json
- Connects to MongoDB (arena_shared database)
- Upserts a document into the scenario_config collection
- Makes the config accessible to all services
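The shape of that upsert can be sketched in Python. Only the filter/update construction is shown so the example stays self-contained; the _id value is an assumption, and the actual script would pass these documents to pymongo's update_one(..., upsert=True):

```python
import json

# Sketch of what define-scenario.py does: read the mounted JSON config and
# build the filter/update documents for an upsert into the scenario_config
# collection of arena_shared. With pymongo this would be:
#     db.scenario_config.update_one(flt, update, upsert=True)
# The "_id": "scenario" key is an illustrative assumption.
def build_upsert(raw_json: str) -> tuple[dict, dict]:
    config = json.loads(raw_json)
    flt = {"_id": "scenario"}
    update = {"$set": config}
    return flt, update

raw = ('{"repository": "https://github.com/simonegaiera/mongodb-airbnb-workshop",'
       ' "leaderboard": {"type": "timed"}}')
flt, update = build_upsert(raw)
print(flt)  # → {'_id': 'scenario'}
```

Because it is an upsert, re-running the job simply refreshes the stored config instead of creating duplicate documents.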
Configuration Elements:
- Workshop repository URL and branch
- Database configuration (MongoDB/PostgreSQL)
- LLM settings (provider, model, proxy)
- Leaderboard type (timed/score)
- Prize information
- Workshop instructions structure
Usage: Other services mount scenario-definition-config to read workshop settings.
Optional Components
6️⃣ litellm/ → litellm.tf (Optional)
LLM Proxy - Provides unified API for multiple LLM providers:
Conditional Deployment:
- Only deploys if scenario_config.llm.enabled == true and scenario_config.llm.proxy.enabled == true
LiteLLM Service (helm_release.litellm):
- Chart: ./litellm v0.1.14
- Purpose: Proxy for OpenAI and Anthropic APIs with caching
- Service: ClusterIP on port 4000
API Key Management:
- Primary Source: AWS Secrets Manager (arena/secrets)
- Fallback: Terraform variables (anthropic_api_key, azure_openai_api_key)
- Security: Stored as Kubernetes secrets, mounted as environment variables
Model Configuration:
- If OpenAI Provider:
- gpt-5-mini - Azure OpenAI endpoint
- gpt-5-chat - Azure OpenAI endpoint
- API base: https://solutionsconsultingopenai.openai.azure.com
- API version: 2025-04-01-preview
- If Anthropic Provider:
- claude-3-haiku - Anthropic API
- claude-4-sonnet - Anthropic API
Caching (if Redis enabled):
- Type: Redis-backed semantic cache
- TTL: 1800 seconds (30 minutes)
- Namespace: litellm.cline.cache
- Purpose: Reduce API costs by caching similar prompts
- Supported Call Types: completion, acompletion, embedding, aembedding
Environment Variables:
- PORT - 4000
- LITELLM_LOG - INFO-level logging
- REDIS_HOST, REDIS_PORT - If Redis enabled
Integration: VSCode Cline extension connects to http://litellm-service:4000
7️⃣ redis/ → redis.tf (Optional)
Redis Cache - Provides caching for LiteLLM:
Conditional Deployment:
- Only deploys if scenario_config.llm.proxy.redis.enabled == true
Redis Instance (helm_release.redis):
- Chart: ./redis v0.1.1
- Version: Redis 7.x
- Persistence: None (in-memory cache, acceptable for workshop duration)
- Service: ClusterIP on port 6379 (default Redis port)
Kubernetes Secret (redis_credentials):
- REDIS_HOST - Service name (e.g., "redis-service")
- REDIS_PORT - Port number (6379)
- REDIS_URL - Full connection string (redis://redis-service:6379)
Purpose:
- Cache LLM responses for identical prompts
- Reduce API costs during workshops
- Faster response times for repeated queries
Connection String: redis://redis-service:6379 (internal cluster DNS)
8️⃣ aurora.tf (Optional, no folder)
PostgreSQL Database - Provides PostgreSQL for multi-database workshops:
Conditional Deployment:
- Only deploys if scenario_config.database.postgres == true
Aurora Serverless v2 Cluster (aws_rds_cluster.aurora_cluster):
- Engine: aurora-postgresql 17.5
- Scaling: 0.5 to 1.0 ACU (Aurora Capacity Units)
- Database: sample_airbnb
- Master User: postgres
- Master Password: Same as Atlas user password
- Backup: 1 day retention
- Security:
- Accessible from EKS cluster security group
- Accessible from VPC (10.0.0.0/16)
- Accessible from specific subnet (104.30.164.0/28)
- Public Access: Enabled (for external tools)
- SSL: Required (sslmode=require)
pgvector Extension (null_resource.enable_pgvector):
- Installs PostgreSQL client locally (macOS/Linux)
- Connects to Aurora cluster
- Runs: CREATE EXTENSION IF NOT EXISTS vector;
- Enables vector similarity search in PostgreSQL
Per-User Resources: For each Atlas user:
- PostgreSQL Role: Login user with CREATE DATABASE privilege
- PostgreSQL Database: Owned by respective user
- Connection String: postgresql://{user}:{password}@{endpoint}:5432/{user}
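Building those per-user strings is simple string assembly; a Python sketch (the endpoint and password are placeholders, and appending sslmode=require mirrors the SSL requirement noted above):

```python
from urllib.parse import quote

# Build per-user PostgreSQL connection strings following the pattern
# documented above: each user owns a database named after them.
# Endpoint and password values below are placeholders.
def pg_connection_string(user: str, password: str, endpoint: str) -> str:
    return (f"postgresql://{quote(user)}:{quote(password)}"
            f"@{endpoint}:5432/{quote(user)}?sslmode=require")

uri = pg_connection_string(
    "alice", "s3cret", "aurora.cluster-xyz.us-east-2.rds.amazonaws.com"
)
print(uri)
```

URL-quoting the username and password guards against credentials containing reserved characters such as `@` or `:`.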
Data Restoration (kubernetes_job_v1.restore_backup_to_users):
- Purpose: Pre-populate user databases with Airbnb dataset
- Execution: Kubernetes job (runs once at deployment)
- Process:
- Downloads the backup from S3 (s3://mongodb-arena/postgres-backups/airbnb-backup.sql.gz)
- Decompresses the SQL file
- For each user, restores the backup to their database
- Verifies restoration by checking the table count
- Skips users already restored
- Image: postgres:17
- Resources: 1Gi memory, 500m CPU (bursts to 2Gi/1000m)
- Timeout: 10 minutes
- Backoff Limit: 3 retries
Output Variables:
- aurora_cluster_endpoint - Writer endpoint
- aurora_cluster_reader_endpoint - Reader endpoint
- aurora_database_name - "sample_airbnb"
- aurora_master_username - "postgres"
- aurora_connection_string - Full PostgreSQL connection string
- user_connection_strings - Per-user connection strings
- user_jdbc_urls - JDBC URLs for each user
Integration: VSCode instances receive PGSQL_MCP_CONNECTION_STRING environment variable for MCP server.
Supporting Directories
These folders provide configuration, templates, and utilities used by the component folders:
aws_policies/
IAM Policy Documents - JSON policy files for granular permissions:
- node_policy.json - Node role trust policy
- cluster_policy.json - Cluster role trust policy
- efs_csi_node_policy.json - EFS CSI driver permissions
- bedrock.json - AWS Bedrock LLM access
- s3.json - S3 bucket read access (mongodb-arena)
- secrets.json - Secrets Manager read access
- eks_auto_mode_policy.json - EC2 instance tagging for EKS Auto Mode
Used by: eks.tf
nginx-conf-files/
Nginx Configuration Templates - Template files for dynamic Nginx configs:
- nginx-base-config.conf.tpl - Common settings (gzip, caching, timeouts)
- mdb-nginx-openvscode.conf.tpl - Per-user VSCode server block
- doc-nginx-main.conf.tpl - Documentation server block
- portal-nginx-server.conf.tpl - Portal server block
- nginx.conf - Main Nginx configuration file
Used by: nginx.tf, docs-nginx.tf, arena-portal.tf
nginx-html-files/
Static HTML Templates - Templated HTML pages:
- index.html.tpl - Landing page with user links
- 404.html.tpl - Custom 404 error page
- 50x.html.tpl - Server error page
- favicon.ico - Website icon
Used by: nginx.tf, docs-nginx.tf
mongodb-arena-portal/
Portal Application Source - Next.js + Flask application source code:
- frontend/ - Next.js application:
- src/app/ - Next.js pages (App Router)
- src/components/ - React components
- public/ - Static assets
- next.config.mjs - Next.js configuration
- tailwind.config.js - Tailwind CSS styling
- server/ - Python Flask backend:
- app.py - Flask application
- requirements.txt - Python dependencies
Used by: arena-portal.tf (cloned and built during deployment)
results-processor/
Exercise Validation - Java application for validating workshop exercises:
- Technology: Java 21 + MongoDB Java Driver
- Purpose: Validates student exercise solutions
- Structure:
- src/main/java/com/mongodb/ - Java source code
- pom.xml - Maven dependencies
- target/ - Compiled JAR file
- run.sh - Execution script
- logs/ - Validation results
- Integration: Mounted into VSCode containers at /home/workspace/utils
Used by: openvscode.tf (mounted as volume)
🔄 Deployment Flow
Understanding how these components deploy in sequence (aligned with the 5-tier architecture):
Tier 1: Network Foundation
Files: main.tf, infra.tf
- VPC creation with public subnets
- Internet Gateway and route tables
- Security groups for EKS
- Availability zone configuration
- Time: ~2-3 minutes
Tier 2: Storage & Kubernetes
Files: efs.tf, eks.tf
- EFS Storage: File system creation, mount targets, security group
- EKS Cluster: Cluster creation, IAM roles, node pools, add-ons
- Storage Classes: EFS CSI driver configuration
- Time: ~30-40 minutes (EKS cluster is the longest component)
Tier 3: Supporting Services
Files: route53.tf, scenario-definition/, aurora.tf
- SSL Certificates: Let’s Encrypt certificate via ACME, DNS validation
- Scenario Config: Kubernetes job uploads workshop config to MongoDB
- PostgreSQL (optional): Aurora Serverless v2, pgvector, per-user databases
- Time: ~5-10 minutes (15-20 if Aurora enabled)
Tier 4: User Workspaces
Folders: mdb-openvscode/, redis/
- VSCode Instances: Per-user deployments with PVCs, ConfigMaps, services
- Redis Cache (optional): In-memory cache for LiteLLM
- Deployments happen in parallel per user
- Time: ~3-5 minutes for 10 users
Tier 5: User-Facing Services
Folders: mdb-nginx/, portal-server/, portal-nginx/, docs-nginx/, litellm/
- VSCode Nginx: Reverse proxy with per-user configs, load balancer
- Portal Backend: Flask API server for leaderboard and user management
- Portal Frontend: Next.js build and deployment
- Documentation: Jekyll site build and deployment
- LiteLLM Proxy (optional): Unified LLM API with caching
- Time: ~5-10 minutes
Total Deployment Time
- Minimum (no optional components): ~45-55 minutes
- Full deployment (with PostgreSQL + LLM): ~60-75 minutes
- Note: Most time spent waiting for EKS cluster (Tier 2)
🎭 Component Dependencies
This diagram shows the deployment dependencies. Components are deployed in order based on what they depend on:
┌─────────────────────┐
│ 1. infra.tf │
│ VPC + Subnets │
└──────────┬──────────┘
│
┌──────────┴──────────┐
│ │
▼ ▼
┌──────────────────┐ ┌─────────────────┐
│ 2. efs.tf │ │ 2. eks.tf │
│ EFS Storage │ │ EKS Cluster │
└──────────────────┘ └────────┬────────┘
│
┌────────────────────┼────────────────────┐
│ │ │
▼ ▼ ▼
┌──────────────────┐ ┌────────────────┐ ┌──────────────────┐
│ 3. route53.tf │ │ 3. scenario- │ │ 3. aurora.tf │
│ SSL/TLS Cert │ │ definition/ │ │ PostgreSQL │
└────────┬─────────┘ └───────┬────────┘ │ (optional) │
│ │ └──────────────────┘
│ │
┌───────┴────────────────────┴───────┐
│ │
▼ ▼
┌─────────────────┐ ┌─────────────────┐
│ 4. redis/ │ │ 4. mdb- │
│ Redis Cache │ │ openvscode/ │
│ (optional) │ │ VSCode Users │
└────────┬────────┘ └────────┬────────┘
│ │
│ ┌───────────────────────┼──────────────┐
│ │ │ │
▼ ▼ ▼ ▼
┌─────────────┐ ┌──────────┐ ┌─────────────┐ ┌──────────────┐
│ 5. litellm/ │ │ 5. mdb- │ │ 5. portal- │ │ 5. docs- │
│ LLM Proxy │ │ nginx/ │ │ server/ + │ │ nginx/ │
│ (optional) │ │ VSCode │ │ portal- │ │ Workshop │
└─────────────┘ │ Proxy │ │ nginx/ │ │ Docs │
└──────────┘ │ Portal │ └──────────────┘
└─────────────┘
Legend:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Tier 1: infra.tf → Network foundation
Tier 2: efs.tf, eks.tf → Storage + Kubernetes
Tier 3: SSL, Config, DB → Supporting services
Tier 4: User workspaces → Per-user resources
Tier 5: Proxies & Portal → User-facing services
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Folder-to-Tier Mapping:
├── Tier 1-3: Core .tf files (no folders)
├── Tier 4-5: Component folders
│ ├── mdb-openvscode/ → openvscode.tf
│ ├── mdb-nginx/ → nginx.tf
│ ├── portal-server/ → arena-portal.tf
│ ├── portal-nginx/ → arena-portal.tf
│ ├── docs-nginx/ → docs-nginx.tf
│ ├── scenario-definition/→ scenario-definition.tf
│ ├── redis/ → redis.tf (optional)
│ └── litellm/ → litellm.tf (optional)
Key Dependencies:
- SSL Certificate (route53.tf) → Required by all Nginx components for HTTPS
- Scenario Config (scenario-definition/) → Required by VSCode and Portal for workshop settings
- EFS Storage (efs.tf) → Required by VSCode instances for persistent workspaces
- Redis (redis/) → Required by LiteLLM for caching (if enabled)
- VSCode Instances (mdb-openvscode/) → Must exist before VSCode Nginx can proxy to them
- Aurora (aurora.tf) → Independent, only needs VPC (if PostgreSQL enabled)
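These dependencies form a DAG, so a topological sort reproduces the tier ordering. A Python sketch using the standard library (the edge list is a simplification of the diagram above):

```python
from graphlib import TopologicalSorter

# Simplified dependency graph from the diagram above: each component maps
# to the set of components it depends on (graphlib's predecessor format).
deps = {
    "infra": set(),
    "efs": {"infra"},
    "eks": {"infra"},
    "route53": {"eks"},
    "scenario-definition": {"eks"},
    "aurora": {"infra"},                            # optional
    "mdb-openvscode": {"efs", "scenario-definition"},
    "redis": {"eks"},                               # optional
    "litellm": {"redis"},                           # optional
    "mdb-nginx": {"route53", "mdb-openvscode"},
    "arena-portal": {"route53", "scenario-definition"},
    "docs-nginx": {"route53"},
}
order = list(TopologicalSorter(deps).static_order())
print(order.index("infra") < order.index("eks") < order.index("mdb-nginx"))  # → True
```

Terraform derives the same ordering implicitly from resource references; the sketch just makes the tiers inspectable.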
🔍 Key Design Patterns
Per-User Isolation
- Each user gets dedicated Kubernetes resources (namespace via naming)
- Separate PVC for file persistence
- Individual ConfigMaps for personalized settings
- Isolated MongoDB databases
- Isolated PostgreSQL databases (if enabled)
Dynamic Configuration
- Terraform templates generate per-user Nginx configs
- User list from Atlas drives VSCode instance creation
- ConfigMaps populated from scenario_config variable
- DNS records automatically created for each user
Conditional Deployment
- PostgreSQL only if database.postgres == true
- LiteLLM only if llm.enabled == true and llm.proxy.enabled == true
- Redis only if llm.proxy.redis.enabled == true
- Bedrock policies only if llm.enabled == true
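That flag-to-component mapping is easy to express directly; a Python sketch (the nested key paths follow the conditions above, assuming proxy and redis are nested objects within scenario_config; the component labels are illustrative):

```python
# Map scenario_config flags to the optional components that get deployed,
# following the conditions listed above. Component names are illustrative.
def optional_components(cfg: dict) -> set[str]:
    enabled = set()
    llm = cfg.get("llm", {})
    proxy = llm.get("proxy", {})
    if cfg.get("database", {}).get("postgres"):
        enabled.add("aurora")
    if llm.get("enabled") and proxy.get("enabled"):
        enabled.add("litellm")
    if proxy.get("redis", {}).get("enabled"):
        enabled.add("redis")
    if llm.get("enabled"):
        enabled.add("bedrock-policies")
    return enabled

cfg = {
    "database": {"postgres": False},
    "llm": {"enabled": True, "proxy": {"enabled": True, "redis": {"enabled": True}}},
}
print(sorted(optional_components(cfg)))  # → ['bedrock-policies', 'litellm', 'redis']
```

In Terraform the same conditions appear as count/for_each expressions on each optional resource.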
High Availability
- Multi-AZ subnets for resilience
- EFS mount targets in each AZ
- Load balancers with health checks
- Auto-scaling node pools (EKS Auto Mode)
Security Best Practices
- Least privilege IAM roles
- TLS encryption everywhere (A+ rating)
- Secrets stored in AWS Secrets Manager
- Network security groups
- Private subnets for sensitive resources
Cost Optimization
- EKS Auto Mode for efficient scaling
- Aurora Serverless v2 (scales down to 0.5 ACU when idle)
- Redis in-memory only (no persistence)
- 7-day auto-expiration tags
- Shared infrastructure (one EKS cluster for all users)
🎯 Deployment Options
Fully Managed (with EKS)
- Complete browser-based environment
- VSCode Online for all participants
- Centralized management and monitoring
- Best for: Corporate workshops, large groups
Hybrid (without EKS)
- MongoDB Atlas cluster only
- Participants use local development environment
- Reduced infrastructure costs
- Best for: Technical audiences, smaller groups
💡 Tip: To deploy hybrid mode, remove the eks-cluster folder from your customer directory before running terragrunt apply --all
⚠️ Important Notes
Cluster Expiration
- EKS clusters expire after 1 week by default
- Plan workshop timing accordingly
- Can be extended if needed
Resource Cleanup
- Always destroy EKS resources after workshop completion
- Download leaderboard data before destruction
- Verify all resources removed to avoid AWS charges
Prerequisites
- Requires successful deployment of the atlas-cluster module
- MongoDB connection string passed automatically to EKS
- User credentials synced from Atlas
💡 Note: The EKS cluster module is optional. If you only need MongoDB Atlas for a hybrid workshop where participants use their own IDE, you can skip this module and deploy only the atlas-cluster.