Lambda Cloud Tutorial¶
This tutorial demonstrates how to use Clustrix with Lambda Cloud for high-performance GPU computing and distributed machine learning.
Overview¶
Lambda Cloud specializes in GPU cloud computing and integrates well with Clustrix for ML workloads:
GPU-Optimized Instances: High-performance NVIDIA GPUs (A100, H100, RTX)
Cost-Effective: Competitive pricing for GPU computing
Simple Management: Easy instance launching and management
Pre-configured Environments: ML-ready software stacks
High-Speed Networking: InfiniBand for multi-GPU communications
Persistent Storage: Fast NVMe and network storage options
SSH Access: Direct access for Clustrix integration
On-Demand and Reserved: Flexible pricing models
Prerequisites¶
Lambda Cloud account with GPU credits
SSH key pair for instance access
Lambda Cloud API key (optional)
Basic understanding of GPU computing
Installation and Setup¶
Install Clustrix with Lambda Cloud dependencies:
[ ]:
# Install Clustrix with GPU and Lambda Cloud support
!pip install clustrix torch torchvision transformers datasets accelerate
# Import required libraries
import clustrix
from clustrix import cluster, configure
import torch
import numpy as np
import time
import json
import requests
import os
Lambda Cloud Authentication and Setup¶
Option 1: Web Console Setup¶
Lambda Cloud Web Console Setup¶
Create Account:
Sign up and verify your account
Add billing information and credits
Add SSH Key:
Click “Add SSH Key”
Paste your public key (cat ~/.ssh/id_rsa.pub)
Give it a descriptive name
Launch Instance:
Click “Launch instance”
Select instance type (A100, H100, RTX 6000 Ada, etc.)
Choose region (closest to you for best performance)
Select your SSH key
Launch the instance
Instance Types Available:
RTX 6000 Ada: 48GB VRAM, ~$0.75/hour
A10: 24GB VRAM, ~$0.60/hour
A100 (40GB): 40GB VRAM, ~$1.10/hour
A100 (80GB): 80GB VRAM, ~$1.40/hour
H100: 80GB VRAM, ~$2.50/hour (when available)
Access Instance:
Wait for instance to be “Running”
Note the public IP address
SSH: ssh ubuntu@
Follow the steps above to set up your Lambda Cloud account and launch your first GPU instance.
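The approximate rates in the instance list above can be turned into a quick selection helper. This is a minimal sketch: the `INSTANCE_CATALOG` table simply copies the VRAM sizes and approximate hourly prices from that list (actual prices change over time), and `cheapest_instance` is a hypothetical helper name, not part of Clustrix or the Lambda Cloud API.

```python
# Approximate figures copied from the instance list in this tutorial.
# Check current Lambda Cloud pricing before relying on them.
INSTANCE_CATALOG = {
    "RTX 6000 Ada": {"vram_gb": 48, "usd_per_hour": 0.75},
    "A10": {"vram_gb": 24, "usd_per_hour": 0.60},
    "A100 (40GB)": {"vram_gb": 40, "usd_per_hour": 1.10},
    "A100 (80GB)": {"vram_gb": 80, "usd_per_hour": 1.40},
    "H100": {"vram_gb": 80, "usd_per_hour": 2.50},
}


def cheapest_instance(min_vram_gb, hours, catalog=INSTANCE_CATALOG):
    """Return (name, estimated_cost_usd) for the cheapest type with enough VRAM."""
    candidates = [
        (spec["usd_per_hour"], name)
        for name, spec in catalog.items()
        if spec["vram_gb"] >= min_vram_gb
    ]
    if not candidates:
        raise ValueError(f"No instance type offers {min_vram_gb} GB of VRAM")
    rate, name = min(candidates)  # lowest hourly rate wins
    return name, round(rate * hours, 2)
```

For example, `cheapest_instance(30, hours=10)` picks the cheapest listed type with at least 30 GB of VRAM and estimates the cost of a ten-hour run.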
Option 2: API-Based Setup¶
[ ]:
import json
import os

import requests


class LambdaCloudAPI:
    """Minimal client for the Lambda Cloud REST API."""

    def __init__(self, api_key=None):
        self.api_key = api_key or os.getenv('LAMBDA_API_KEY')
        if not self.api_key:
            raise ValueError('Set LAMBDA_API_KEY or pass api_key explicitly')
        self.base_url = 'https://cloud.lambdalabs.com/api/v1'
        self.headers = {
            'Authorization': f'Bearer {self.api_key}',
            'Content-Type': 'application/json'
        }

    def list_instance_types(self):
        """List available instance types."""
        response = requests.get(f'{self.base_url}/instance-types',
                                headers=self.headers, timeout=30)
        response.raise_for_status()  # Surface HTTP errors instead of returning an error body
        return response.json()

    def list_instances(self):
        """List running instances."""
        response = requests.get(f'{self.base_url}/instances',
                                headers=self.headers, timeout=30)
        response.raise_for_status()
        return response.json()

    def launch_instance(self, instance_type, ssh_key_name, region='us-east-1', name=None):
        """Launch a new instance."""
        data = {
            'instance_type_name': instance_type,
            'ssh_key_names': [ssh_key_name],
            'region_name': region
        }
        if name:
            data['name'] = name
        response = requests.post(f'{self.base_url}/instance-operations/launch',
                                 headers=self.headers, json=data, timeout=30)
        response.raise_for_status()
        return response.json()

    def terminate_instance(self, instance_id):
        """Terminate an instance."""
        data = {'instance_ids': [instance_id]}
        response = requests.post(f'{self.base_url}/instance-operations/terminate',
                                 headers=self.headers, json=data, timeout=30)
        response.raise_for_status()
        return response.json()

    def get_instance_details(self, instance_id):
        """Get detailed information about an instance, or None if it is not listed."""
        instances = self.list_instances()
        for instance in instances.get('data', []):
            if instance['id'] == instance_id:
                return instance
        return None


# Example usage:
# api = LambdaCloudAPI()
# instance_types = api.list_instance_types()
# print(json.dumps(instance_types, indent=2))
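After launching an instance through the API, you usually need to wait until it is reachable before pointing Clustrix at it. The sketch below polls a lookup function such as `get_instance_details` from the class above. It assumes each instance record carries a `status` field that becomes `"active"` and an `ip` field; those field names and the helper name `wait_until_active` are assumptions for illustration, so check them against the actual API response.

```python
import time


def wait_until_active(fetch_instance, instance_id, timeout_s=600, poll_s=10,
                      sleep=time.sleep):
    """Poll fetch_instance(instance_id) until the instance is active; return its IP.

    fetch_instance is any callable returning an instance dict or None,
    for example LambdaCloudAPI().get_instance_details.
    """
    waited = 0
    while waited <= timeout_s:
        instance = fetch_instance(instance_id)
        # Assumed field names: 'status' and 'ip'
        if instance and instance.get("status") == "active":
            return instance.get("ip")
        sleep(poll_s)
        waited += poll_s
    raise TimeoutError(f"Instance {instance_id} not active after {timeout_s} s")
```

The `sleep` parameter exists only so the helper can be exercised without real delays; in normal use, leave it at its default.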
Lambda Cloud API Setup Guide¶
CLI Setup Steps¶
Get API Key:
Generate a new API key
Set as environment variable:
export LAMBDA_API_KEY="your-key"
Install Lambda Cloud CLI:
pip install lambda-cloud
lambda-cloud configure  # Enter your API key
Basic CLI Commands:
# List available instance types
lambda-cloud instance-types list
# List available regions
lambda-cloud regions list
# Launch instance
lambda-cloud instance launch \
    --instance-type a100 \
    --ssh-key-name your-key-name \
    --region us-east-1
# List running instances
lambda-cloud instance list
# Terminate instance
lambda-cloud instance terminate <INSTANCE_ID>
Python API Client¶
Configure Clustrix for Lambda Cloud¶
[ ]:
# Configure Clustrix to use your Lambda Cloud instance
configure(
    cluster_type="ssh",
    cluster_host="your-lambda-instance-ip",  # Replace with actual IP
    username="ubuntu",                       # Default Lambda Cloud user
    key_file="~/.ssh/id_rsa",                # Your private SSH key
    remote_work_dir="/tmp/clustrix",
    package_manager="auto",                  # Will use uv if available
    default_cores=8,                         # Lambda instances typically have 8+ cores
    default_memory="32GB",                   # Generous memory allocation
    default_time="02:00:00",                 # Longer timeout for GPU tasks
    environment_variables={
        "CUDA_VISIBLE_DEVICES": "0",         # Use first GPU
        "NVIDIA_VISIBLE_DEVICES": "all"
    }
)
Replace your-lambda-instance-ip with the actual IP address of your Lambda Cloud instance.
GPU Verification and Setup¶
[ ]:
@cluster(cores=2, memory="8GB")
def verify_lambda_gpu_setup():
    """Verify GPU availability and setup on Lambda Cloud instance."""
    import torch
    import subprocess
    import platform

    # System information
    system_info = {
        'platform': platform.platform(),
        'python_version': platform.python_version(),
        'architecture': platform.architecture()[0]
    }

    # PyTorch and CUDA info
    torch_info = {
        'pytorch_version': torch.__version__,
        'cuda_available': torch.cuda.is_available(),
        'cuda_version': torch.version.cuda if torch.cuda.is_available() else None,
        'cudnn_version': torch.backends.cudnn.version() if torch.cuda.is_available() else None,
        'device_count': torch.cuda.device_count() if torch.cuda.is_available() else 0
    }

    # GPU details
    gpu_info = []
    if torch.cuda.is_available():
        for i in range(torch.cuda.device_count()):
            props = torch.cuda.get_device_properties(i)
            gpu_info.append({
                'device_id': i,
                'name': props.name,
                'total_memory_gb': props.total_memory / (1024**3),
                'major': props.major,
                'minor': props.minor,
                'multiprocessor_count': props.multi_processor_count
            })

    # NVIDIA-SMI output
    nvidia_smi = None
    try:
        result = subprocess.run(['nvidia-smi'], capture_output=True, text=True)
        if result.returncode == 0:
            nvidia_smi = result.stdout
    except FileNotFoundError:
        nvidia_smi = "nvidia-smi not found"

    # Test GPU computation
    gpu_test_result = None
    if torch.cuda.is_available():
        try:
            # Simple GPU computation test
            device = torch.device('cuda:0')
            x = torch.randn(1000, 1000, device=device)
            y = torch.randn(1000, 1000, device=device)
            start_time = torch.cuda.Event(enable_timing=True)
            end_time = torch.cuda.Event(enable_timing=True)
            start_time.record()
            z = torch.mm(x, y)
            end_time.record()
            # Wait for both events to complete before reading the elapsed time
            torch.cuda.synchronize()
            gpu_test_result = {
                'test_passed': True,
                'computation_time_ms': start_time.elapsed_time(end_time),
                'result_shape': list(z.shape),  # Plain list so the result serializes cleanly
                'memory_allocated_mb': torch.cuda.memory_allocated() / (1024**2),
                'memory_reserved_mb': torch.cuda.memory_reserved() / (1024**2)
            }
        except Exception as e:
            gpu_test_result = {
                'test_passed': False,
                'error': str(e)
            }

    return {
        'system_info': system_info,
        'torch_info': torch_info,
        'gpu_info': gpu_info,
        'nvidia_smi': nvidia_smi,
        'gpu_test': gpu_test_result
    }


# Run GPU verification
# gpu_status = verify_lambda_gpu_setup()
# print(json.dumps(gpu_status, indent=2, default=str))
print("GPU verification function defined. Uncomment the lines above to run on Lambda Cloud.")
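Once the verification function has run, the dictionary it returns can be checked programmatically before starting a long job. The helper below is a sketch that relies only on the keys built in the cell above (`torch_info`, `gpu_info`, `gpu_test`); the name `check_gpu_status` and the `min_vram_gb` threshold are illustrative assumptions.

```python
def check_gpu_status(gpu_status, min_vram_gb=0.0):
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems = []
    if not gpu_status["torch_info"]["cuda_available"]:
        problems.append("CUDA is not available to PyTorch")
    for gpu in gpu_status["gpu_info"]:
        if gpu["total_memory_gb"] < min_vram_gb:
            problems.append(
                f"GPU {gpu['device_id']} ({gpu['name']}) has only "
                f"{gpu['total_memory_gb']:.1f} GB of memory"
            )
    test = gpu_status.get("gpu_test")
    if test is not None and not test["test_passed"]:
        problems.append(f"GPU test failed: {test.get('error')}")
    return problems
```

A typical use would be `problems = check_gpu_status(gpu_status, min_vram_gb=40)` followed by aborting the run if the list is not empty.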
Example 1: Distributed Deep Learning Training¶
[ ]:
@cluster(cores=8, memory="16GB", time="01:30:00")
def lambda_deep_learning_training(model_config, training_config):
    """Train a deep learning model on Lambda Cloud GPU."""
    import torch
    import torch.nn as nn
    import torch.optim as optim
    from torch.utils.data import DataLoader, TensorDataset
    import numpy as np
    import time

    # Set device
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    print(f"Training on device: {device}")

    # Create synthetic dataset
    n_samples = training_config['n_samples']
    n_features = training_config['n_features']
    n_classes = training_config['n_classes']

    # Generate random data
    X = torch.randn(n_samples, n_features)
    y = torch.randint(0, n_classes, (n_samples,))

    # Create dataset and dataloader
    dataset = TensorDataset(X, y)
    dataloader = DataLoader(
        dataset,
        batch_size=training_config['batch_size'],
        shuffle=True
    )

    # Define model architecture
    class DeepNet(nn.Module):
        def __init__(self, input_size, hidden_sizes, output_size, dropout=0.2):
            super(DeepNet, self).__init__()
            layers = []
            prev_size = input_size
            for hidden_size in hidden_sizes:
                layers.extend([
                    nn.Linear(prev_size, hidden_size),
                    nn.ReLU(),
                    nn.BatchNorm1d(hidden_size),
                    nn.Dropout(dropout)
                ])
                prev_size = hidden_size
            layers.append(nn.Linear(prev_size, output_size))
            self.network = nn.Sequential(*layers)

        def forward(self, x):
            return self.network(x)

    # Create model
    model = DeepNet(
        input_size=n_features,
        hidden_sizes=model_config['hidden_sizes'],
        output_size=n_classes,
        dropout=model_config.get('dropout', 0.2)
    ).to(device)

    # Loss and optimizer
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.Adam(
        model.parameters(),
        lr=training_config['learning_rate'],
        weight_decay=training_config.get('weight_decay', 1e-4)
    )

    # Training loop
    model.train()
    training_start = time.time()
    epoch_losses = []
    epoch_accuracies = []
    for epoch in range(training_config['epochs']):
        epoch_loss = 0.0
        correct = 0
        total = 0
        for batch_idx, (data, target) in enumerate(dataloader):
            data, target = data.to(device), target.to(device)
            optimizer.zero_grad()
            output = model(data)
            loss = criterion(output, target)
            loss.backward()
            optimizer.step()
            epoch_loss += loss.item()
            _, predicted = torch.max(output.data, 1)
            total += target.size(0)
            correct += (predicted == target).sum().item()
        avg_loss = epoch_loss / len(dataloader)
        accuracy = 100.0 * correct / total
        epoch_losses.append(avg_loss)
        epoch_accuracies.append(accuracy)
        if epoch % 10 == 0 or epoch == training_config['epochs'] - 1:
            print(f'Epoch {epoch+1}/{training_config["epochs"]}: '
                  f'Loss: {avg_loss:.4f}, Accuracy: {accuracy:.2f}%')
    training_time = time.time() - training_start

    # Model evaluation
    model.eval()
    with torch.no_grad():
        test_data = torch.randn(1000, n_features).to(device)
        test_output = model(test_data)
        test_predictions = torch.max(test_output, 1)[1]

    # Memory usage
    memory_info = {}
    if torch.cuda.is_available():
        memory_info = {
            'allocated_mb': torch.cuda.memory_allocated() / (1024**2),
            'reserved_mb': torch.cuda.memory_reserved() / (1024**2),
            'max_allocated_mb': torch.cuda.max_memory_allocated() / (1024**2)
        }

    return {
        'training_completed': True,
        'device_used': str(device),
        'model_parameters': sum(p.numel() for p in model.parameters()),
        'trainable_parameters': sum(p.numel() for p in model.parameters() if p.requires_grad),
        'training_time': training_time,
        'final_loss': epoch_losses[-1],
        'final_accuracy': epoch_accuracies[-1],
        'best_accuracy': max(epoch_accuracies),
        'epoch_losses': epoch_losses,
        'epoch_accuracies': epoch_accuracies,
        'memory_info': memory_info,
        'model_architecture': str(model)
    }


# Example configuration
model_config = {
    'hidden_sizes': [512, 256, 128, 64],
    'dropout': 0.3
}
training_config = {
    'n_samples': 10000,
    'n_features': 100,
    'n_classes': 10,
    'batch_size': 64,
    'epochs': 50,
    'learning_rate': 0.001,
    'weight_decay': 1e-4
}

# Run training
# result = lambda_deep_learning_training(model_config, training_config)
# print(f"Training completed! Final accuracy: {result['final_accuracy']:.2f}%")
# print(f"Training time: {result['training_time']:.2f} seconds")
# print(f"GPU memory used: {result['memory_info'].get('max_allocated_mb', 0):.1f} MB")
print("Deep learning training function defined. Uncomment the lines above to run on Lambda Cloud.")
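The `model_parameters` value returned by the training function can be sanity-checked by hand. The sketch below counts parameters for the same layer pattern as `DeepNet` (Linear, ReLU, BatchNorm1d, Dropout per hidden layer, then a final Linear) without needing PyTorch. It counts weights and biases for each Linear layer plus the learnable scale and shift of each BatchNorm1d layer; `mlp_parameter_count` is a hypothetical helper name.

```python
def mlp_parameter_count(input_size, hidden_sizes, output_size):
    """Count learnable parameters of a Linear + BatchNorm1d stack like DeepNet."""
    total = 0
    prev = input_size
    for hidden in hidden_sizes:
        total += prev * hidden + hidden  # Linear weights and biases
        total += 2 * hidden              # BatchNorm1d scale and shift
        prev = hidden
    total += prev * output_size + output_size  # Final Linear layer
    return total
```

With the example configuration (100 features, hidden sizes 512, 256, 128, 64, and 10 classes), the model is small, so GPU memory use is dominated by activations and optimizer state rather than by the weights themselves.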
Example 2: Transformer Model Fine-tuning¶
[ ]:
@cluster(cores=8, memory="32GB", time="02:00:00")
def lambda_transformer_finetuning(model_name, training_params):
    """Fine-tune a transformer model on Lambda Cloud GPU."""
    import torch
    from transformers import (
        AutoTokenizer, AutoModelForSequenceClassification,
        TrainingArguments, Trainer, DataCollatorWithPadding
    )
    from datasets import Dataset
    import numpy as np
    import time

    # Set device
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    print(f"Fine-tuning on device: {device}")

    # Load tokenizer and model
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(
        model_name,
        num_labels=training_params['num_labels']
    )
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token

    # Create synthetic dataset
    def generate_synthetic_text_data(n_samples, num_labels):
        """Generate synthetic text classification data."""
        # Simple text templates for different classes
        templates = {
            0: ["This is a positive example about {}", "Great work on {}", "Excellent {}"],
            1: ["This is a negative example about {}", "Poor {}", "Terrible {}"],
            2: ["This is a neutral example about {}", "Average {}", "Okay {}"] if num_labels > 2 else []
        }
        topics = ["technology", "sports", "food", "movies", "music", "books", "travel", "science"]
        texts = []
        labels = []
        for _ in range(n_samples):
            label = np.random.randint(0, num_labels)
            template = np.random.choice(templates[label])
            topic = np.random.choice(topics)
            text = template.format(topic)
            texts.append(text)
            labels.append(label)
        return texts, labels

    # Generate data
    train_texts, train_labels = generate_synthetic_text_data(
        training_params['train_samples'], training_params['num_labels']
    )
    eval_texts, eval_labels = generate_synthetic_text_data(
        training_params['eval_samples'], training_params['num_labels']
    )

    # Tokenize data
    def tokenize_function(examples):
        return tokenizer(
            examples['text'],
            truncation=True,
            padding=True,
            max_length=training_params.get('max_length', 512)
        )

    # Create datasets
    train_dataset = Dataset.from_dict({'text': train_texts, 'labels': train_labels})
    eval_dataset = Dataset.from_dict({'text': eval_texts, 'labels': eval_labels})
    # Drop the raw 'text' column after tokenizing: the padding collator
    # cannot turn strings into tensors
    train_dataset = train_dataset.map(tokenize_function, batched=True, remove_columns=['text'])
    eval_dataset = eval_dataset.map(tokenize_function, batched=True, remove_columns=['text'])

    # Data collator
    data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

    # Training arguments
    training_args = TrainingArguments(
        output_dir='/tmp/results',
        num_train_epochs=training_params.get('epochs', 3),
        per_device_train_batch_size=training_params.get('batch_size', 8),
        per_device_eval_batch_size=training_params.get('eval_batch_size', 8),
        warmup_steps=training_params.get('warmup_steps', 100),
        weight_decay=training_params.get('weight_decay', 0.01),
        learning_rate=training_params.get('learning_rate', 2e-5),
        logging_dir='/tmp/logs',
        logging_steps=10,
        eval_strategy="epoch",  # Named evaluation_strategy in transformers releases before 4.41
        save_strategy="epoch",
        load_best_model_at_end=True,
        metric_for_best_model="eval_loss",
        greater_is_better=False,
        fp16=torch.cuda.is_available(),  # Use mixed precision if GPU available
        dataloader_pin_memory=torch.cuda.is_available(),
        remove_unused_columns=False
    )

    # Define compute metrics
    def compute_metrics(eval_pred):
        predictions, labels = eval_pred
        predictions = np.argmax(predictions, axis=1)
        accuracy = (predictions == labels).mean()
        return {'accuracy': accuracy}

    # Create trainer
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
        tokenizer=tokenizer,
        data_collator=data_collator,
        compute_metrics=compute_metrics
    )

    # Training
    start_time = time.time()
    train_result = trainer.train()
    training_time = time.time() - start_time

    # Final evaluation
    eval_result = trainer.evaluate()

    # Memory usage
    memory_info = {}
    if torch.cuda.is_available():
        memory_info = {
            'allocated_mb': torch.cuda.memory_allocated() / (1024**2),
            'reserved_mb': torch.cuda.memory_reserved() / (1024**2),
            'max_allocated_mb': torch.cuda.max_memory_allocated() / (1024**2)
        }

    # Model info
    total_params = sum(p.numel() for p in model.parameters())
    trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)

    return {
        'model_name': model_name,
        'device_used': str(device),
        'training_completed': True,
        'training_time': training_time,
        'total_parameters': total_params,
        'trainable_parameters': trainable_params,
        'train_loss': train_result.training_loss,
        'eval_loss': eval_result['eval_loss'],
        'eval_accuracy': eval_result['eval_accuracy'],
        'train_steps': train_result.global_step,
        'memory_info': memory_info,
        'training_params': training_params
    }


# Example configuration
training_params = {
    'num_labels': 3,
    'train_samples': 1000,
    'eval_samples': 200,
    'epochs': 3,
    'batch_size': 16,
    'eval_batch_size': 32,
    'learning_rate': 2e-5,
    'weight_decay': 0.01,
    'warmup_steps': 100,
    'max_length': 256
}

# Run fine-tuning
# result = lambda_transformer_finetuning('distilbert-base-uncased', training_params)
# print(f"Fine-tuning completed! Final accuracy: {result['eval_accuracy']:.4f}")
# print(f"Training time: {result['training_time']:.2f} seconds")
# print(f"Model parameters: {result['total_parameters']:,}")
# print(f"GPU memory used: {result['memory_info'].get('max_allocated_mb', 0):.1f} MB")
print("Transformer fine-tuning function defined. Uncomment the lines above to run on Lambda Cloud.")
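When choosing `warmup_steps`, it helps to know how many optimizer steps the run will take in total. The sketch below does that arithmetic for the simple case assumed here: a single GPU and no gradient accumulation, so one batch equals one optimizer step. `training_schedule` is a hypothetical helper, not part of transformers.

```python
import math


def training_schedule(train_samples, batch_size, epochs, warmup_steps):
    """Estimate optimizer steps for a single-device run without gradient accumulation."""
    steps_per_epoch = math.ceil(train_samples / batch_size)  # Last partial batch still counts
    total_steps = steps_per_epoch * epochs
    return {
        "steps_per_epoch": steps_per_epoch,
        "total_steps": total_steps,
        # Share of the run spent warming up the learning rate
        "warmup_fraction": min(1.0, warmup_steps / total_steps),
    }
```

Applied to the example configuration above (`train_samples=1000`, `batch_size=16`, `epochs=3`, `warmup_steps=100`), this shows that warmup covers more than half of the run, so a smaller `warmup_steps` may be more appropriate for a dataset this small.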
Example 3: Computer Vision with Large Datasets¶
[ ]:
@cluster(cores=8, memory="32GB", time="01:30:00")
def lambda_computer_vision_training(model_config, data_config):
    """Train a computer vision model on Lambda Cloud GPU."""
    import torch
    import torch.nn as nn
    import torch.optim as optim
    import torchvision
    import torchvision.transforms as transforms
    from torch.utils.data import DataLoader, TensorDataset
    import numpy as np
    import time

    # Set device
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    print(f"Training computer vision model on device: {device}")

    # Data augmentation and preprocessing
    transform_train = transforms.Compose([
        transforms.ToPILImage(),
        transforms.RandomResizedCrop(data_config['image_size']),
        transforms.RandomHorizontalFlip(p=0.5),
        transforms.RandomRotation(degrees=15),
        transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2, hue=0.1),
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
    ])
    transform_val = transforms.Compose([
        transforms.ToPILImage(),
        transforms.Resize((data_config['image_size'], data_config['image_size'])),
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
    ])

    # Generate synthetic image data
    def create_synthetic_images(n_samples, image_size, n_channels, n_classes):
        """Create synthetic image dataset."""
        images = np.random.randint(0, 256, (n_samples, image_size, image_size, n_channels), dtype=np.uint8)
        labels = np.random.randint(0, n_classes, n_samples)
        return images, labels

    # Create datasets
    train_images, train_labels = create_synthetic_images(
        data_config['train_samples'],
        data_config['image_size'],
        data_config['n_channels'],
        data_config['n_classes']
    )
    val_images, val_labels = create_synthetic_images(
        data_config['val_samples'],
        data_config['image_size'],
        data_config['n_channels'],
        data_config['n_classes']
    )

    # Custom dataset class
    class SyntheticImageDataset(torch.utils.data.Dataset):
        def __init__(self, images, labels, transform=None):
            self.images = images
            self.labels = labels
            self.transform = transform

        def __len__(self):
            return len(self.images)

        def __getitem__(self, idx):
            image = self.images[idx]
            label = self.labels[idx]
            if self.transform:
                image = self.transform(image)
            else:
                image = torch.from_numpy(image).permute(2, 0, 1).float() / 255.0
            return image, label

    # Create data loaders
    train_dataset = SyntheticImageDataset(train_images, train_labels, transform_train)
    val_dataset = SyntheticImageDataset(val_images, val_labels, transform_val)
    train_loader = DataLoader(
        train_dataset,
        batch_size=data_config['batch_size'],
        shuffle=True,
        num_workers=4,
        pin_memory=torch.cuda.is_available()
    )
    val_loader = DataLoader(
        val_dataset,
        batch_size=data_config['batch_size'],
        shuffle=False,
        num_workers=4,
        pin_memory=torch.cuda.is_available()
    )

    # Model definition
    if model_config['model_type'] == 'resnet':
        if model_config['pretrained']:
            model = torchvision.models.resnet50(pretrained=True)
            model.fc = nn.Linear(model.fc.in_features, data_config['n_classes'])
        else:
            model = torchvision.models.resnet50(pretrained=False, num_classes=data_config['n_classes'])
    elif model_config['model_type'] == 'efficientnet':
        if model_config['pretrained']:
            model = torchvision.models.efficientnet_b0(pretrained=True)
            model.classifier[1] = nn.Linear(model.classifier[1].in_features, data_config['n_classes'])
        else:
            model = torchvision.models.efficientnet_b0(pretrained=False, num_classes=data_config['n_classes'])
    else:
        raise ValueError(f"Unsupported model type: {model_config['model_type']}")
    model = model.to(device)

    # Loss and optimizer
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.AdamW(
        model.parameters(),
        lr=model_config['learning_rate'],
        weight_decay=model_config['weight_decay']
    )

    # Learning rate scheduler
    scheduler = optim.lr_scheduler.CosineAnnealingLR(
        optimizer, T_max=model_config['epochs']
    )

    # Training loop
    start_time = time.time()
    train_losses = []
    val_accuracies = []
    for epoch in range(model_config['epochs']):
        # Training phase
        model.train()
        running_loss = 0.0
        for batch_idx, (data, target) in enumerate(train_loader):
            data, target = data.to(device), target.to(device)
            optimizer.zero_grad()
            output = model(data)
            loss = criterion(output, target)
            loss.backward()
            optimizer.step()
            running_loss += loss.item()
        avg_train_loss = running_loss / len(train_loader)
        train_losses.append(avg_train_loss)

        # Validation phase
        model.eval()
        correct = 0
        total = 0
        val_loss = 0.0
        with torch.no_grad():
            for data, target in val_loader:
                data, target = data.to(device), target.to(device)
                output = model(data)
                val_loss += criterion(output, target).item()
                _, predicted = torch.max(output.data, 1)
                total += target.size(0)
                correct += (predicted == target).sum().item()
        val_accuracy = 100.0 * correct / total
        val_accuracies.append(val_accuracy)
        scheduler.step()
        if epoch % 5 == 0 or epoch == model_config['epochs'] - 1:
            print(f'Epoch {epoch+1}/{model_config["epochs"]}: '
                  f'Train Loss: {avg_train_loss:.4f}, '
                  f'Val Accuracy: {val_accuracy:.2f}%, '
                  f'LR: {scheduler.get_last_lr()[0]:.6f}')
    training_time = time.time() - start_time

    # Memory usage
    memory_info = {}
    if torch.cuda.is_available():
        memory_info = {
            'allocated_mb': torch.cuda.memory_allocated() / (1024**2),
            'reserved_mb': torch.cuda.memory_reserved() / (1024**2),
            'max_allocated_mb': torch.cuda.max_memory_allocated() / (1024**2)
        }

    return {
        'training_completed': True,
        'device_used': str(device),
        'model_type': model_config['model_type'],
        'model_parameters': sum(p.numel() for p in model.parameters()),
        'training_time': training_time,
        'final_train_loss': train_losses[-1],
        'final_val_accuracy': val_accuracies[-1],
        'best_val_accuracy': max(val_accuracies),
        'train_losses': train_losses,
        'val_accuracies': val_accuracies,
        'memory_info': memory_info,
        'data_config': data_config,
        'model_config': model_config
    }


# Example configuration
model_config = {
    'model_type': 'resnet',  # or 'efficientnet'
    'pretrained': True,
    'epochs': 20,
    'learning_rate': 0.001,
    'weight_decay': 1e-4
}
data_config = {
    'train_samples': 5000,
    'val_samples': 1000,
    'image_size': 224,
    'n_channels': 3,
    'n_classes': 10,
    'batch_size': 32
}

# Run training
# result = lambda_computer_vision_training(model_config, data_config)
# print(f"CV training completed! Best accuracy: {result['best_val_accuracy']:.2f}%")
# print(f"Training time: {result['training_time']:.2f} seconds")
# print(f"Model parameters: {result['model_parameters']:,}")
# print(f"GPU memory used: {result['memory_info'].get('max_allocated_mb', 0):.1f} MB")
print("Computer vision training function defined. Uncomment the lines above to run on Lambda Cloud.")
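Because the example above holds the whole synthetic image set in host memory as a uint8 array, it is worth estimating that footprint before raising `train_samples` or `image_size`. The sketch below is plain arithmetic under that assumption (one byte per pixel value); `synthetic_dataset_bytes` is a hypothetical helper name.

```python
def synthetic_dataset_bytes(n_samples, image_size, n_channels, bytes_per_value=1):
    """Host memory needed for n_samples square images stored as a dense array."""
    return n_samples * image_size * image_size * n_channels * bytes_per_value


def to_mib(n_bytes):
    """Convert bytes to mebibytes."""
    return n_bytes / (1024 ** 2)
```

For the example `data_config`, `to_mib(synthetic_dataset_bytes(5000, 224, 3))` comes to roughly 718 MiB for the training set alone, which fits comfortably within the 32GB requested from Clustrix. Batches become float32 tensors after `ToTensor()`, so each value in a batch takes four bytes instead of one.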
Multi-GPU Training on Lambda Cloud¶
[ ]:
@cluster(cores=16, memory="128GB", time="04:00:00")
def lambda_multi_gpu_training(model_config, training_config):
    """Multi-GPU training example using PyTorch DDP.

    This is an outline rather than a complete program: create_model and
    create_distributed_dataloader are placeholders you must supply.
    """
    import torch
    import torch.nn as nn
    import torch.multiprocessing as mp
    from torch.nn.parallel import DistributedDataParallel as DDP
    from torch.distributed import init_process_group, destroy_process_group
    import os

    def setup_ddp(rank, world_size):
        """Setup distributed data parallel."""
        os.environ['MASTER_ADDR'] = 'localhost'
        os.environ['MASTER_PORT'] = '12355'
        init_process_group(backend="nccl", rank=rank, world_size=world_size)
        torch.cuda.set_device(rank)

    def cleanup_ddp():
        """Clean up distributed training."""
        destroy_process_group()

    # NOTE: mp.spawn pickles its target function, and a function nested inside
    # another function cannot be pickled by the standard pickler. In practice,
    # train_on_gpu and its helpers need to live in an importable module on the instance.
    def train_on_gpu(rank, world_size, model_config, training_config):
        """Training function for each GPU."""
        setup_ddp(rank, world_size)

        # Create model and move to GPU (create_model is a placeholder)
        model = create_model(model_config).to(rank)
        model = DDP(model, device_ids=[rank])

        # Create data loader with DistributedSampler (placeholder helper)
        train_loader = create_distributed_dataloader(training_config, rank, world_size)

        # Training loop
        optimizer = torch.optim.AdamW(model.parameters(), lr=training_config['lr'])
        for epoch in range(training_config['epochs']):
            train_loader.sampler.set_epoch(epoch)  # Important for proper shuffling
            for batch_idx, (data, target) in enumerate(train_loader):
                data, target = data.to(rank), target.to(rank)
                optimizer.zero_grad()
                output = model(data)
                loss = nn.CrossEntropyLoss()(output, target)
                loss.backward()
                optimizer.step()
                if rank == 0 and batch_idx % 100 == 0:
                    print(f'Epoch {epoch}, Batch {batch_idx}, Loss: {loss.item():.4f}')
        cleanup_ddp()

    # Launch multi-GPU training: mp.spawn passes the process rank as the first argument
    world_size = torch.cuda.device_count()
    print(f"Starting multi-GPU training on {world_size} GPUs")
    mp.spawn(
        train_on_gpu,
        args=(world_size, model_config, training_config),
        nprocs=world_size,
        join=True
    )
    return {"training_completed": True, "gpus_used": world_size}
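The DDP outline above relies on a `DistributedSampler` so that each GPU sees a different slice of the data. The pure-Python sketch below illustrates the idea behind that sharding in the non-shuffled, `drop_last=False` case: the index list is padded by wrapping around until it divides evenly, then each rank takes every `world_size`-th index. It is an illustration of the scheme, not PyTorch's implementation, and `shard_indices` is a hypothetical name.

```python
import math


def shard_indices(dataset_len, rank, world_size):
    """Return the dataset indices one rank would process (no shuffling)."""
    if dataset_len <= 0:
        raise ValueError("dataset_len must be positive")
    per_rank = math.ceil(dataset_len / world_size)
    total = per_rank * world_size
    # Pad by wrapping around so every rank gets the same number of samples
    padded = [i % dataset_len for i in range(total)]
    # Strided slice: rank 0 takes 0, world_size, 2 * world_size, ...
    return padded[rank:total:world_size]
```

When shuffling is enabled, the sampler permutes the indices with a seed that depends on the epoch before sharding, which is why the training loop calls `set_epoch(epoch)`: without it, every epoch would reuse the same ordering.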
HuggingFace Accelerate Example¶
Alternative approach using HuggingFace Accelerate for easier multi-GPU setup:
[ ]:
@cluster(cores=16, memory="128GB", time="04:00:00")
def lambda_accelerate_training(model_config, training_config):
    """Multi-GPU training using HuggingFace Accelerate.

    create_model and create_dataloader are placeholders you must supply.
    """
    from accelerate import Accelerator
    import torch
    import torch.nn as nn

    # Initialize accelerator
    accelerator = Accelerator()
    device = accelerator.device

    # Create model and optimizer
    model = create_model(model_config)
    optimizer = torch.optim.AdamW(model.parameters(), lr=training_config['lr'])
    train_loader = create_dataloader(training_config)

    # Prepare for distributed training
    model, optimizer, train_loader = accelerator.prepare(
        model, optimizer, train_loader
    )

    # Training loop
    model.train()
    for epoch in range(training_config['epochs']):
        for batch_idx, (data, target) in enumerate(train_loader):
            with accelerator.accumulate(model):
                output = model(data)
                loss = nn.CrossEntropyLoss()(output, target)
                accelerator.backward(loss)
                optimizer.step()
                optimizer.zero_grad()
            if accelerator.is_main_process and batch_idx % 100 == 0:
                print(f'Epoch {epoch}, Batch {batch_idx}, Loss: {loss.item():.4f}')

    return {
        "training_completed": True,
        "num_processes": accelerator.num_processes,
        "device": str(device)
    }
Multi-GPU Training on Lambda Cloud¶
Available Multi-GPU Instances¶
2x A100 (40GB): ~$2.20/hour
4x A100 (40GB): ~$4.40/hour
8x A100 (40GB): ~$8.80/hour
2x A100 (80GB): ~$2.80/hour
4x A100 (80GB): ~$5.60/hour
8x A100 (80GB): ~$11.20/hour
8x H100: ~$20.00/hour (when available)
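The multi-GPU prices listed above scale linearly with GPU count, so whether a larger instance costs more for a given job depends mainly on how well training scales. The sketch below makes that trade-off explicit. The `efficiency` parameter (the fraction of ideal linear speedup actually achieved) is an assumption you would need to measure for your own workload, and `estimate_job` is a hypothetical helper name.

```python
def estimate_job(single_gpu_hours, n_gpus, hourly_rate, efficiency=1.0):
    """Estimate wall-clock hours and cost of a job on an n_gpus instance.

    single_gpu_hours: how long the job would take on one GPU
    efficiency: achieved fraction of ideal linear speedup, in (0, 1]
    """
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in (0, 1]")
    wall_hours = single_gpu_hours / (n_gpus * efficiency)
    return {"wall_hours": wall_hours, "cost_usd": wall_hours * hourly_rate}
```

For instance, a job that takes 16 hours on a single A100 (40GB) could be compared with `estimate_job(16, 8, 8.80, efficiency=0.8)` on the 8x instance: the run finishes much sooner, and the extra cost comes entirely from the scaling efficiency being below 1.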
Setup Requirements¶
Launch multi-GPU instance via Lambda Cloud console
Install additional packages for distributed training:
pip install accelerate deepspeed
Configure Clustrix for multi-GPU environment
Use appropriate parallelization strategy
Parallelization Strategies¶
Data Parallel (DP): Simple, works for most models
Distributed Data Parallel (DDP): Better performance, recommended
Model Parallel: For very large models that don’t fit on single GPU
Pipeline Parallel: For extremely large models
DeepSpeed ZeRO: For memory-efficient training of large models
PyTorch DDP Example¶
Cost Optimization Strategies¶
[ ]:
# Import Clustrix cost monitoring functionality
from clustrix import cost_tracking_decorator, get_cost_monitor, generate_cost_report


# Example 1: Using the cost tracking decorator
@cost_tracking_decorator('lambda', 'a100_40gb')
@cluster(cores=8, memory="32GB")
def lambda_training_with_cost_tracking():
    """Example training function with automatic cost tracking."""
    import time
    import numpy as np

    # Simulate training workload
    print("Starting training...")
    time.sleep(2)  # Simulate 2 seconds of work

    # Simulate some compute
    data = np.random.randn(1000, 1000)
    result = np.dot(data, data.T)
    print("Training completed!")
    return {
        'model_accuracy': 0.95,
        'training_samples': 10000,
        'final_loss': 0.032
    }


# Example 2: Manual cost monitoring
def manual_cost_monitoring_example():
    """Example of manual cost monitoring."""
    import time

    # Start cost monitoring
    monitor = get_cost_monitor('lambda')
    if monitor:
        monitor.start_monitoring()

        # Your computation here
        time.sleep(1)

        # Stop monitoring and get report
        cost_report = monitor.stop_monitoring()
        if cost_report:
            print(f"Computation completed in {cost_report.duration_seconds:.2f} seconds")
            print(f"Estimated cost: ${cost_report.cost_estimate.estimated_cost:.4f}")
            print(f"GPUs monitored: {len(cost_report.resource_usage.gpu_stats or [])}")
            if cost_report.recommendations:
                print("Cost optimization recommendations:")
                for rec in cost_report.recommendations:
                    print(f"  - {rec}")


# Example 3: Generate real-time cost report
def get_current_cost_status():
    """Get current cost and resource status."""
    report = generate_cost_report('lambda', 'a100_40gb')
    if report:
        usage = report['resource_usage']
        print("Current Lambda Cloud Status:")
        print(f"  CPU Usage: {usage['cpu_percent']:.1f}%")
        print(f"  Memory Usage: {usage['memory_percent']:.1f}%")
        if usage['gpu_stats']:
            avg_gpu = sum(gpu['utilization_percent'] for gpu in usage['gpu_stats']) / len(usage['gpu_stats'])
            print(f"  GPU Usage: {avg_gpu:.1f}%")
        print(f"  Hourly Rate: ${report['cost_estimate']['hourly_rate']:.2f}")


# Example 4: Compare pricing across instance types
def compare_lambda_pricing():
    """Compare pricing for different Lambda Cloud instance types."""
    from clustrix import get_pricing_info

    pricing = get_pricing_info('lambda')
    if pricing:
        print("Lambda Cloud Instance Pricing (USD/hour):")
        # Group by category
        single_gpu = {k: v for k, v in pricing.items()
                      if not k.startswith(('2x', '4x', '8x')) and k != 'default'}
        multi_gpu = {k: v for k, v in pricing.items() if k.startswith(('2x', '4x', '8x'))}
        print("\nSingle GPU Instances:")
        for instance, price in sorted(single_gpu.items(), key=lambda x: x[1]):
            print(f"  {instance:<15}: ${price:.2f}/hour")
        print("\nMulti-GPU Instances:")
        for instance, price in sorted(multi_gpu.items(), key=lambda x: x[1]):
            print(f"  {instance:<15}: ${price:.2f}/hour")


# Run examples (uncomment to test)
# print("1. Cost tracking decorator example:")
# result = lambda_training_with_cost_tracking()
# print(f"Training result: {result}")
# print("\n2. Manual cost monitoring example:")
# manual_cost_monitoring_example()
# print("\n3. Current cost status:")
# get_current_cost_status()
print("4. Lambda Cloud pricing comparison:")
compare_lambda_pricing()
print("\n✅ Lambda Cloud cost monitoring examples ready!")
print("💡 Use @cost_tracking_decorator('lambda', 'instance_type') for automatic cost tracking")
Lambda Cloud Cost Optimization¶
Cost Monitoring and Tracking¶
Monitor GPU utilization and track costs effectively:
💰 Instance Selection¶
RTX 6000 Ada: Best value for most ML workloads (~$0.75/hour)
A10: Good balance of performance and cost (~$0.60/hour)
A100 40GB: For large models requiring more VRAM (~$1.10/hour)
A100 80GB: Only when 40GB is insufficient (~$1.40/hour)
H100: Premium option for cutting-edge research (~$2.50/hour)
⏰ Usage Patterns¶
Use “persistent” instances for ongoing development
Terminate instances immediately after training completion
Schedule training jobs during off-peak hours if possible
Use local development for debugging, GPU for final training
🔧 Optimization Techniques¶
Mixed precision training (fp16) to reduce memory usage
Gradient accumulation for effective larger batch sizes
Model checkpointing to resume interrupted training
Efficient data loading with multiple workers
Early stopping to avoid overtraining
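Of the techniques above, early stopping is the simplest to add to an existing loop and directly cuts billed GPU hours. A minimal, framework-independent sketch follows; the `EarlyStopping` class and its `patience` and `min_delta` parameters are illustrative names, not a Clustrix or PyTorch API.

```python
class EarlyStopping:
    """Stop training when the validation loss stops improving."""

    def __init__(self, patience=3, min_delta=0.0):
        self.patience = patience      # epochs to wait without improvement
        self.min_delta = min_delta    # smallest decrease that counts as improvement
        self.best_loss = float("inf")
        self.bad_epochs = 0

    def should_stop(self, val_loss):
        """Record one epoch's validation loss; return True when training should stop."""
        if val_loss < self.best_loss - self.min_delta:
            self.best_loss = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


if __name__ == "__main__":
    stopper = EarlyStopping(patience=2)
    for epoch, loss in enumerate([0.90, 0.70, 0.65, 0.66, 0.67, 0.60]):
        if stopper.should_stop(loss):
            print(f"Stopping after epoch {epoch}")
            break
```

Pairing this with regular checkpoints means the best model is already on disk when the loop exits, so the instance can be terminated right away.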
📊 Monitoring and Management¶
Monitor GPU utilization with nvidia-smi
Track training progress with logging
Set training time limits to prevent runaway costs
Use Clustrix timeouts as safety nets
Regular cost reviews and budget alerts
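A training time limit can also be enforced inside the training loop itself, independently of any scheduler-level timeout. The sketch below assumes a fixed hourly rate and checks elapsed time and estimated spend; `BudgetGuard` is a hypothetical helper written for this example and is not provided by Clustrix.

```python
import time


class BudgetGuard:
    """Track elapsed time and estimated spend for a training run."""

    def __init__(self, hourly_rate, max_cost, max_hours, clock=time.monotonic):
        self.hourly_rate = hourly_rate
        self.max_cost = max_cost
        self.max_hours = max_hours
        self._clock = clock          # injectable so the guard can be tested
        self._start = clock()

    def elapsed_hours(self):
        return (self._clock() - self._start) / 3600.0

    def estimated_cost(self):
        return self.elapsed_hours() * self.hourly_rate

    def exceeded(self):
        """Return True once either the time limit or the cost limit is reached."""
        return (
            self.elapsed_hours() >= self.max_hours
            or self.estimated_cost() >= self.max_cost
        )


if __name__ == "__main__":
    guard = BudgetGuard(hourly_rate=1.10, max_cost=20.0, max_hours=12)
    for step in range(3):
        if guard.exceeded():
            print("Limit reached: save a checkpoint and stop.")
            break
        # ... one training step would run here ...
    print(f"Estimated spend so far: ${guard.estimated_cost():.4f}")
```

Checking the guard once per epoch or every few hundred steps is usually enough; the goal is a clean checkpoint-and-exit rather than precise accounting.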
🚀 Clustrix-Specific Optimizations¶
Use Clustrix auto-cleanup features
Implement job queuing for multiple experiments
Leverage Clustrix’s timeout mechanisms
Use remote environment caching
Best Practices and Troubleshooting¶
[ ]:
# Example usage of monitoring functions
import os


def create_monitoring_script():
    """Create and save the GPU monitoring script."""
    script_content = '''#!/bin/bash
# Lambda Cloud monitoring script
echo "Lambda Cloud Training Monitor"
echo "============================"
echo "Start time: $(date)"
echo ""
# System information
echo "System Information:"
echo "------------------"
nvidia-smi --query-gpu=gpu_name,memory.total,power.draw --format=csv
echo ""
# Monitor GPU usage every 30 seconds
while true; do
    echo "GPU Status at $(date):"
    nvidia-smi --query-gpu=utilization.gpu,memory.used,memory.total,temperature.gpu --format=csv,noheader
    echo ""
    # Check if training process is still running
    if ! pgrep -f python > /dev/null; then
        echo "No Python processes found. Training may have completed."
        break
    fi
    sleep 30
done
echo "Monitoring completed at $(date)"
'''
    with open('monitor_training.sh', 'w') as f:
        f.write(script_content)
    # Make executable
    os.chmod('monitor_training.sh', 0o755)
    return "Monitoring script created: monitor_training.sh"


# Uncomment to create the monitoring script:
# result = create_monitoring_script()
# print(result)
Lambda Cloud Best Practices¶
GPU Monitoring Script¶
Use this monitoring script to track GPU usage during training. Save as monitor_training.sh and run with: bash monitor_training.sh
#!/bin/bash
# Lambda Cloud monitoring script
echo "Lambda Cloud Training Monitor"
echo "============================"
echo "Start time: $(date)"
echo ""
# System information
echo "System Information:"
echo "------------------"
nvidia-smi --query-gpu=gpu_name,memory.total,power.draw --format=csv
echo ""
# Monitor GPU usage every 30 seconds
while true; do
    echo "GPU Status at $(date):"
    nvidia-smi --query-gpu=utilization.gpu,memory.used,memory.total,temperature.gpu --format=csv,noheader
    echo ""
    # Check if training process is still running
    if ! pgrep -f python > /dev/null; then
        echo "No Python processes found. Training may have completed."
        break
    fi
    sleep 30
done
echo "Monitoring completed at $(date)"
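The status lines printed by the script's monitoring loop can also be consumed from Python, for example to log readings or flag an idle GPU. This sketch assumes the four-field `csv,noheader` layout requested in the script (utilization, memory used, memory total, temperature), with unit suffixes such as `%` and `MiB`; `parse_gpu_status` and `is_underutilized` are hypothetical helper names.

```python
def parse_gpu_status(line):
    """Parse one 'csv,noheader' status line into a dict of numeric readings.

    Assumes every field is numeric with an optional unit suffix; a value
    such as '[N/A]' will raise ValueError.
    """
    fields = [field.strip() for field in line.split(",")]
    if len(fields) != 4:
        raise ValueError(f"expected 4 fields, got {len(fields)}: {line!r}")
    util, mem_used, mem_total, temp = fields
    return {
        "utilization_pct": float(util.split()[0]),
        "memory_used_mib": float(mem_used.split()[0]),
        "memory_total_mib": float(mem_total.split()[0]),
        "temperature_c": float(temp.split()[0]),
    }


def is_underutilized(status, threshold_pct=50.0):
    """Return True when GPU utilization is below the given threshold."""
    return status["utilization_pct"] < threshold_pct


if __name__ == "__main__":
    sample = "87 %, 30211 MiB, 40960 MiB, 64"
    status = parse_gpu_status(sample)
    print(status)
    print("Underutilized:", is_underutilized(status))
```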
Lambda Cloud + Clustrix Best Practices¶
🚀 Performance Optimization¶
Always use mixed precision (fp16) when possible
Optimize data loading with multiple workers and pin_memory
Use appropriate batch sizes to maximize GPU utilization
Enable tensor cores for compatible operations
Pre-allocate GPU memory to avoid fragmentation
💾 Data Management¶
Store datasets on fast NVMe storage when available
Use data streaming for very large datasets
Implement efficient data preprocessing pipelines
Cache frequently used data in memory
Use appropriate data formats (e.g., HDF5, Parquet)
🔧 Environment Setup¶
Use conda environments for reproducible setups
Pin package versions in requirements.txt
Install packages from conda-forge when possible
Use uv package manager for faster installs
Set up proper CUDA environment variables
🛠️ Development Workflow¶
Develop and debug locally, train on Lambda Cloud
Use small datasets for initial testing
Implement proper logging and monitoring
Save model checkpoints regularly
Use version control for experiment tracking
🔒 Security¶
Use SSH keys instead of passwords
Keep SSH keys secure and rotate regularly
Don’t store credentials in code or notebooks
Use environment variables for configuration
Monitor instance access logs
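Keeping credentials out of notebooks can be as simple as reading them from the environment at startup. In this sketch the variable names `LAMBDA_API_KEY`, `LAMBDA_INSTANCE_IP`, and `LAMBDA_SSH_KEY` are assumptions made for the example, not names that Clustrix or Lambda Cloud define; `load_lambda_settings` and `redact` are likewise hypothetical helpers.

```python
import os


def load_lambda_settings(environ=None):
    """Read connection settings from environment variables (names are illustrative)."""
    environ = os.environ if environ is None else environ
    api_key = environ.get("LAMBDA_API_KEY")
    if not api_key:
        raise RuntimeError("Set LAMBDA_API_KEY in the environment; do not hard-code it.")
    return {
        "api_key": api_key,
        "host": environ.get("LAMBDA_INSTANCE_IP", ""),
        "key_file": os.path.expanduser(environ.get("LAMBDA_SSH_KEY", "~/.ssh/id_rsa")),
    }


def redact(secret, visible=4):
    """Mask a secret for logging, keeping only the last few characters."""
    if len(secret) <= visible:
        return "*" * len(secret)
    return "*" * (len(secret) - visible) + secret[-visible:]


if __name__ == "__main__":
    try:
        settings = load_lambda_settings()
        print("API key:", redact(settings["api_key"]))
    except RuntimeError as err:
        print(err)
```

The resulting `host` and `key_file` values could then be passed to `configure(...)` in place of literal strings.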
Common Issues and Solutions¶
❌ CUDA out of memory errors¶
✅ Solutions:
Reduce batch size
Enable gradient checkpointing
Use mixed precision training
Clear GPU cache with torch.cuda.empty_cache()
Consider model parallelism for large models
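The first remedy, reducing the batch size, can be automated with a retry loop. The sketch below relies on PyTorch reporting CUDA out-of-memory conditions as a `RuntimeError` whose message contains "out of memory"; `run_with_batch_backoff` and the `train_step` callable are hypothetical names, and a real loop would typically also call `torch.cuda.empty_cache()` before each retry.

```python
def run_with_batch_backoff(train_step, batch_size, min_batch_size=1):
    """Call train_step(batch_size), halving the batch size after each OOM error.

    Returns (batch_size_used, result). Errors unrelated to memory are re-raised.
    """
    while batch_size >= min_batch_size:
        try:
            return batch_size, train_step(batch_size)
        except RuntimeError as err:
            if "out of memory" not in str(err).lower():
                raise  # unrelated error: do not hide it
            # In a PyTorch loop, torch.cuda.empty_cache() would go here.
            batch_size //= 2
    raise RuntimeError("out of memory even at the minimum batch size")


if __name__ == "__main__":
    def fake_step(batch_size):
        # Stand-in for a real training step: pretend batches above 16 do not fit.
        if batch_size > 16:
            raise RuntimeError("CUDA out of memory. Tried to allocate 2.00 GiB")
        return "ok"

    print(run_with_batch_backoff(fake_step, batch_size=128))
```

When the batch size shrinks, gradient accumulation can keep the effective batch size unchanged.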
❌ Slow data loading¶
✅ Solutions:
Increase num_workers in DataLoader
Enable pin_memory for GPU transfers
Use faster storage (NVMe over network storage)
Implement data prefetching
Optimize data preprocessing
❌ SSH connection timeouts¶
✅ Solutions:
Configure SSH keep-alive settings
Use screen or tmux for long-running jobs
Implement proper error handling in Clustrix
Set appropriate timeout values
Monitor network connectivity
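The keep-alive and tmux suggestions can be combined when building the SSH invocation. This sketch uses the standard OpenSSH client options `ServerAliveInterval`, `ServerAliveCountMax`, and `ConnectTimeout`; the helper names, the default values, and the `train` session name are illustrative choices, not Clustrix settings.

```python
import shlex


def build_ssh_command(host, user="ubuntu", key_file="~/.ssh/id_rsa",
                      alive_interval=60, alive_count_max=3, connect_timeout=15):
    """Build an ssh argument list with keep-alive options enabled."""
    return [
        "ssh",
        "-i", key_file,
        "-o", f"ServerAliveInterval={alive_interval}",
        "-o", f"ServerAliveCountMax={alive_count_max}",
        "-o", f"ConnectTimeout={connect_timeout}",
        f"{user}@{host}",
    ]


def tmux_wrap(command, session="train"):
    """Wrap a remote command in a detached tmux session so it survives disconnects."""
    return f"tmux new-session -d -s {shlex.quote(session)} {shlex.quote(command)}"


if __name__ == "__main__":
    # 203.0.113.10 is a documentation-only example address.
    cmd = build_ssh_command("203.0.113.10") + [tmux_wrap("python train.py")]
    print(" ".join(shlex.quote(part) for part in cmd))
```

The argument list can be passed to `subprocess.run` directly; the same options can instead be set once per host in `~/.ssh/config`.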
❌ Low GPU utilization¶
✅ Solutions:
Increase batch size if memory allows
Optimize data loading pipeline
Use asynchronous data transfers
Profile code to identify bottlenecks
Consider multi-GPU training
❌ Package installation failures¶
✅ Solutions:
Use conda for system-level packages
Check CUDA compatibility versions
Clear pip cache if needed
Use the --no-cache-dir flag for pip
Install packages in correct order
Instance Management and Cleanup¶
Lambda Cloud Instance Management¶
🔍 Check Running Instances¶
Via CLI:
lambda-cloud instance list
Via Web Console: visit https://cloud.lambdalabs.com/instances
⏹️ Terminate Instances¶
Terminate specific instance:
lambda-cloud instance terminate <INSTANCE_ID>
Terminate all instances (DANGEROUS!):
lambda-cloud instance list --format=csv | grep -v "instance_id" | cut -d',' -f1 | xargs -I {} lambda-cloud instance terminate {}
💾 Save Work Before Termination¶
Save models to persistent storage:
rsync -avz ubuntu@<INSTANCE_IP>:/path/to/models/ ./local_models/
Save logs and results:
scp -r ubuntu@<INSTANCE_IP>:/tmp/clustrix/ ./results/
📊 Cost Monitoring¶
Check current usage:
lambda-cloud instance list --format=table
Estimate costs:
lambda-cloud instance list --format=csv | awk -F',' 'NR>1 {print $2, $3}' | while read type status; do
    if [ "$status" = "active" ]; then
        echo "Active instance: $type"
    fi
done
Automated Cleanup Script¶
Save this as lambda_cleanup.sh for automated instance management:
#!/bin/bash
# Automated cleanup script for Lambda Cloud
# Save as lambda_cleanup.sh
set -e
echo "Lambda Cloud Automated Cleanup"
echo "=============================="
# Check if lambda-cloud CLI is installed
if ! command -v lambda-cloud &> /dev/null; then
    echo "Error: lambda-cloud CLI not found. Please install it first."
    exit 1
fi
# List current instances
echo "Current instances:"
lambda-cloud instance list
echo ""
# Ask for confirmation
read -p "Do you want to terminate ALL instances? (y/N): " -n 1 -r
echo ""
if [[ ! $REPLY =~ ^[Yy]$ ]]; then
    echo "Cleanup cancelled."
    exit 0
fi
# Get instance IDs
INSTANCE_IDS=$(lambda-cloud instance list --format=csv | grep -v "instance_id" | cut -d',' -f1)
if [ -z "$INSTANCE_IDS" ]; then
    echo "No instances to terminate."
    exit 0
fi
# Terminate instances
echo "Terminating instances..."
for instance_id in $INSTANCE_IDS; do
    echo "Terminating instance: $instance_id"
    lambda-cloud instance terminate "$instance_id"
done
echo "All instances terminated."
echo "Please verify termination in the web console: https://cloud.lambdalabs.com/instances"
Clustrix Integration Manager¶
[ ]:
# Integrate cleanup with Clustrix workflows
from clustrix import configure
import subprocess
import time


class LambdaCloudManager:
    """Manager for Lambda Cloud instances with Clustrix integration."""

    def __init__(self):
        self.active_instances = []

    def launch_instance_for_clustrix(self, instance_type, ssh_key_name):
        """Launch instance and configure Clustrix."""
        # Launch instance
        result = subprocess.run([
            'lambda-cloud', 'instance', 'launch',
            '--instance-type', instance_type,
            '--ssh-key-name', ssh_key_name
        ], capture_output=True, text=True)
        if result.returncode != 0:
            raise RuntimeError(f"Failed to launch instance: {result.stderr}")
        # Parse instance ID and IP (simplified placeholders; replace before use)
        instance_id = "extracted_from_output"  # Parse from result.stdout
        instance_ip = "extracted_from_output"  # Parse from result.stdout
        # Wait for instance to be ready
        time.sleep(60)  # Wait for startup
        # Configure Clustrix
        configure(
            cluster_type="ssh",
            cluster_host=instance_ip,
            username="ubuntu",
            key_file="~/.ssh/id_rsa",
            remote_work_dir="/tmp/clustrix",
            package_manager="auto",
            cleanup_on_success=True
        )
        self.active_instances.append({
            'id': instance_id,
            'ip': instance_ip,
            'type': instance_type,
            'launch_time': time.time()
        })
        return instance_id, instance_ip

    def cleanup_all_instances(self):
        """Clean up all managed instances."""
        for instance in self.active_instances:
            try:
                subprocess.run([
                    'lambda-cloud', 'instance', 'terminate', instance['id']
                ], check=True)
                print(f"Terminated instance {instance['id']}")
            except subprocess.CalledProcessError as e:
                print(f"Failed to terminate {instance['id']}: {e}")
        self.active_instances.clear()

    def __del__(self):
        """Ensure cleanup on object destruction."""
        if self.active_instances:
            print("Warning: Active instances detected. Cleaning up...")
            self.cleanup_all_instances()


# Usage example:
# manager = LambdaCloudManager()
# try:
#     instance_id, ip = manager.launch_instance_for_clustrix('a100', 'my-ssh-key')
#     # Run your Clustrix computations
#     result = my_clustrix_function()
# finally:
#     manager.cleanup_all_instances()
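The try/finally pattern in the usage example can be packaged as a context manager, which gives deterministic cleanup; Python does not guarantee that `__del__` runs at interpreter exit, so it is best treated as a last resort. In this sketch `managed_instance` is a hypothetical helper, and `launch` and `terminate` are caller-supplied callables (for instance, thin wrappers around the manager's methods).

```python
from contextlib import contextmanager


@contextmanager
def managed_instance(launch, terminate):
    """Launch an instance, yield it, and always terminate it afterwards."""
    instance = launch()
    try:
        yield instance
    finally:
        # Runs on normal exit and when the body raises.
        terminate(instance)


if __name__ == "__main__":
    log = []

    def fake_launch():
        log.append("launched")
        return "instance-123"  # placeholder ID for the demo

    def fake_terminate(instance):
        log.append(f"terminated {instance}")

    with managed_instance(fake_launch, fake_terminate) as instance:
        log.append(f"training on {instance}")
    print(log)
```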
Summary¶
This tutorial covered:
Setup: Lambda Cloud account creation and instance management
GPU Computing: High-performance GPU instances for ML workloads
Deep Learning: PyTorch training with GPU acceleration
Transformer Models: Fine-tuning with HuggingFace Transformers
Computer Vision: CNN training with data augmentation
Multi-GPU Training: Distributed training across multiple GPUs
Cost Optimization: Strategies to minimize GPU computing costs
Best Practices: Performance optimization and troubleshooting
Instance Management: Automated cleanup and monitoring
Key Advantages of Lambda Cloud + Clustrix¶
GPU Focus: Specialized in high-performance GPU computing
Cost Effective: Competitive pricing for GPU instances
Simple Management: Easy instance launching and termination
High Performance: Latest NVIDIA GPUs (A100, H100, RTX)
Fast Networking: InfiniBand for multi-GPU communication
ML Optimized: Pre-configured environments for machine learning
Flexible Scaling: From single GPU to large multi-GPU clusters
Lambda Cloud Pricing Advantages¶
RTX 6000 Ada: Excellent price/performance for most ML workloads
A100 40GB/80GB: Industry standard for large-scale training
H100: Cutting-edge performance for the most demanding workloads
Multi-GPU: Cost-effective scaling for distributed training
No Hidden Fees: Simple per-hour pricing
Next Steps¶
Create your Lambda Cloud account and add SSH keys
Start with a single GPU instance for testing
Configure Clustrix for your Lambda Cloud instance
Run the provided examples to verify setup
Scale to multi-GPU instances for larger workloads
Implement cost monitoring and automated cleanup
Use Cases¶
Deep Learning Research: Train large neural networks efficiently
Computer Vision: Process large image datasets with CNNs
NLP: Fine-tune transformer models on custom datasets
Scientific Computing: GPU-accelerated simulations and modeling
Prototyping: Rapid experimentation with different architectures
Production Training: Scale up successful experiments
Resources¶
Remember: Lambda Cloud excels at GPU computing! Always terminate instances when not in use to control costs, and leverage Clustrix’s distributed computing capabilities to scale your ML workloads efficiently.