Lambda Cloud Tutorial

This tutorial demonstrates how to use Clustrix with Lambda Cloud for high-performance GPU computing and distributed machine learning.

Overview

Lambda Cloud specializes in GPU cloud computing, and because its instances are reachable over SSH, Clustrix can dispatch ML workloads to them directly. Key features:

  • GPU-Optimized Instances: High-performance NVIDIA GPUs (A100, H100, RTX)

  • Cost-Effective: Competitive pricing for GPU computing

  • Simple Management: Easy instance launching and management

  • Pre-configured Environments: ML-ready software stacks

  • High-Speed Networking: InfiniBand for multi-GPU communications

  • Persistent Storage: Fast NVMe and network storage options

  • SSH Access: Direct access for Clustrix integration

  • On-Demand and Reserved: Flexible pricing models

Prerequisites

  1. Lambda Cloud account with GPU credits

  2. SSH key pair for instance access

  3. Lambda Cloud API key (optional)

  4. Basic understanding of GPU computing

Installation and Setup

Install Clustrix with Lambda Cloud dependencies:

[ ]:
# Install Clustrix plus the PyTorch and Hugging Face libraries used in this tutorial
!pip install clustrix torch torchvision transformers datasets accelerate

# Import required libraries
import clustrix
from clustrix import cluster, configure
import torch
import numpy as np
import time
import json
import requests
import os

Lambda Cloud Authentication and Setup

Option 1: Web Console Setup

  1. Create Account:

  2. Add SSH Key:

  3. Launch Instance:

    • Go to https://cloud.lambdalabs.com/instances

    • Click “Launch instance”

    • Select instance type (A100, H100, RTX 6000 Ada, etc.)

    • Choose region (closest to you for best performance)

    • Select your SSH key

    • Launch the instance

  4. Instance Types Available:

    • RTX 6000 Ada: 48GB VRAM, ~$0.75/hour

    • A10: 24GB VRAM, ~$0.60/hour

    • A100 (40GB): 40GB VRAM, ~$1.10/hour

    • A100 (80GB): 80GB VRAM, ~$1.40/hour

    • H100: 80GB VRAM, ~$2.50/hour (when available)

  5. Access Instance:

    • Wait for instance to be “Running”

    • Note the public IP address

    • SSH: ssh ubuntu@

The steps above set up your Lambda Cloud account and launch your first GPU instance.
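The approximate hourly rates listed above can be turned into a quick selection helper. The following is a minimal sketch, not part of Clustrix or the Lambda Cloud API: the `INSTANCE_TYPES` table simply copies the figures from the list above (which are approximate and change over time), and `cheapest_instance` is a hypothetical helper name.

```python
# Approximate figures copied from the instance list above; check current pricing.
INSTANCE_TYPES = {
    "A10": {"vram_gb": 24, "usd_per_hour": 0.60},
    "RTX 6000 Ada": {"vram_gb": 48, "usd_per_hour": 0.75},
    "A100 (40GB)": {"vram_gb": 40, "usd_per_hour": 1.10},
    "A100 (80GB)": {"vram_gb": 80, "usd_per_hour": 1.40},
    "H100": {"vram_gb": 80, "usd_per_hour": 2.50},
}


def cheapest_instance(min_vram_gb, hours=1.0):
    """Return (name, estimated_cost_usd) of the cheapest type with enough VRAM."""
    candidates = [
        (spec["usd_per_hour"], name)
        for name, spec in INSTANCE_TYPES.items()
        if spec["vram_gb"] >= min_vram_gb
    ]
    if not candidates:
        raise ValueError(f"No listed instance type has {min_vram_gb} GB of VRAM")
    rate, name = min(candidates)
    return name, round(rate * hours, 2)


# Cheapest listed option with at least 40 GB of VRAM for an 8-hour job
print(cheapest_instance(min_vram_gb=40, hours=8))
```

Price is only one criterion; a newer GPU may finish the same job in fewer hours, so treat the result as a starting point.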

Option 2: API-Based Setup

[ ]:
import requests
import os

class LambdaCloudAPI:
    """Thin wrapper around the Lambda Cloud REST API."""

    def __init__(self, api_key=None, timeout=30):
        self.api_key = api_key or os.getenv('LAMBDA_API_KEY')
        if not self.api_key:
            raise ValueError('Provide api_key or set the LAMBDA_API_KEY environment variable')
        self.base_url = 'https://cloud.lambdalabs.com/api/v1'
        self.timeout = timeout  # seconds; avoids hanging on network problems
        self.headers = {
            'Authorization': f'Bearer {self.api_key}',
            'Content-Type': 'application/json'
        }

    def _request(self, method, path, **kwargs):
        """Send a request and raise on HTTP errors instead of hiding them."""
        response = requests.request(method, f'{self.base_url}{path}',
                                    headers=self.headers, timeout=self.timeout, **kwargs)
        response.raise_for_status()
        return response.json()

    def list_instance_types(self):
        """List available instance types."""
        return self._request('GET', '/instance-types')

    def list_instances(self):
        """List running instances."""
        return self._request('GET', '/instances')

    def launch_instance(self, instance_type, ssh_key_name, region='us-east-1', name=None):
        """Launch a new instance."""
        data = {
            'instance_type_name': instance_type,
            'ssh_key_names': [ssh_key_name],
            'region_name': region
        }
        if name:
            data['name'] = name
        return self._request('POST', '/instance-operations/launch', json=data)

    def terminate_instance(self, instance_id):
        """Terminate an instance."""
        return self._request('POST', '/instance-operations/terminate',
                             json={'instance_ids': [instance_id]})

    def get_instance_details(self, instance_id):
        """Return the record for one instance, or None if it is not listed."""
        for instance in self.list_instances().get('data', []):
            if instance['id'] == instance_id:
                return instance
        return None

# Example usage:
# api = LambdaCloudAPI()
# instance_types = api.list_instance_types()
# print(json.dumps(instance_types, indent=2))
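After launching, an instance needs some time to boot before SSH (and therefore Clustrix) can reach it. A minimal polling sketch follows, assuming the instance record returned by `get_instance_details` carries `status` and `ip` fields and that a ready instance reports the status `active`; verify both against the current Lambda Cloud API reference. `wait_for_instance` is a hypothetical helper, not part of Clustrix.

```python
import time


def wait_for_instance(get_details, instance_id, timeout=600, poll_interval=10,
                      sleep=time.sleep):
    """Poll until the instance is active and has an IP, then return the IP.

    get_details: a callable such as LambdaCloudAPI().get_instance_details.
    The 'status' and 'ip' keys are assumptions about the response format.
    """
    deadline = time.monotonic() + timeout
    while True:
        details = get_details(instance_id) or {}
        if details.get("status") == "active" and details.get("ip"):
            return details["ip"]
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Instance {instance_id} not ready after {timeout}s")
        sleep(poll_interval)


# Example usage:
# api = LambdaCloudAPI()
# ip = wait_for_instance(api.get_instance_details, "<INSTANCE_ID>")
```

Passing the lookup function in as an argument keeps the helper independent of the client class and easy to exercise with a stub.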

Lambda Cloud API Setup Guide

CLI Setup Steps

  1. Get API Key:

  2. Install Lambda Cloud CLI:

    pip install lambda-cloud
    lambda-cloud configure  # Enter your API key
    
  3. Basic CLI Commands:

    # List available instance types
    lambda-cloud instance-types list
    
    # List available regions
    lambda-cloud regions list
    
    # Launch instance
    lambda-cloud instance launch \
      --instance-type a100 \
      --ssh-key-name your-key-name \
      --region us-east-1
    
    # List running instances
    lambda-cloud instance list
    
    # Terminate instance
    lambda-cloud instance terminate <INSTANCE_ID>
    

Configure Clustrix for Lambda Cloud

[ ]:
# Configure Clustrix to use your Lambda Cloud instance
configure(
    cluster_type="ssh",
    cluster_host="your-lambda-instance-ip",  # Replace with actual IP
    username="ubuntu",  # Default Lambda Cloud user
    key_file="~/.ssh/id_rsa",  # Your private SSH key
    remote_work_dir="/tmp/clustrix",
    package_manager="auto",  # Will use uv if available
    default_cores=8,  # Lambda instances typically have 8+ cores
    default_memory="32GB",  # Generous memory allocation
    default_time="02:00:00",  # Longer timeout for GPU tasks
    environment_variables={
        "CUDA_VISIBLE_DEVICES": "0",  # Use first GPU
        "NVIDIA_VISIBLE_DEVICES": "all"
    }
)

Replace your-lambda-instance-ip with the public IP address of your Lambda Cloud instance.
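Because a mistyped address or key path only surfaces later as an SSH failure, it can help to validate the values before calling `configure`. The sketch below is a hypothetical helper (`lambda_ssh_settings` is not part of Clustrix) that checks the IP address and key file and returns keyword arguments matching the connection-related parameters of the `configure` call above.

```python
import ipaddress
import os


def lambda_ssh_settings(host_ip, key_file="~/.ssh/id_rsa", check_key=True):
    """Validate connection details and return kwargs for clustrix.configure."""
    ipaddress.ip_address(host_ip)  # raises ValueError for a malformed address
    key_path = os.path.expanduser(key_file)
    if check_key and not os.path.isfile(key_path):
        raise FileNotFoundError(f"SSH private key not found: {key_path}")
    return {
        "cluster_type": "ssh",
        "cluster_host": host_ip,
        "username": "ubuntu",  # default Lambda Cloud user, as in the cell above
        "key_file": key_path,
        "remote_work_dir": "/tmp/clustrix",
    }


# Example usage (203.0.113.7 is a documentation-only placeholder address):
# configure(**lambda_ssh_settings("203.0.113.7"), default_cores=8, default_memory="32GB")
```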

GPU Verification and Setup

[ ]:
@cluster(cores=2, memory="8GB")
def verify_lambda_gpu_setup():
    """Verify GPU availability and setup on Lambda Cloud instance."""
    import torch
    import subprocess
    import platform

    # System information
    system_info = {
        'platform': platform.platform(),
        'python_version': platform.python_version(),
        'architecture': platform.architecture()[0]
    }

    # PyTorch and CUDA info
    torch_info = {
        'pytorch_version': torch.__version__,
        'cuda_available': torch.cuda.is_available(),
        'cuda_version': torch.version.cuda if torch.cuda.is_available() else None,
        'cudnn_version': torch.backends.cudnn.version() if torch.cuda.is_available() else None,
        'device_count': torch.cuda.device_count() if torch.cuda.is_available() else 0
    }

    # GPU details
    gpu_info = []
    if torch.cuda.is_available():
        for i in range(torch.cuda.device_count()):
            props = torch.cuda.get_device_properties(i)
            gpu_info.append({
                'device_id': i,
                'name': props.name,
                'total_memory_gb': props.total_memory / (1024**3),
                'major': props.major,
                'minor': props.minor,
                'multiprocessor_count': props.multi_processor_count
            })

    # NVIDIA-SMI output
    nvidia_smi = None
    try:
        result = subprocess.run(['nvidia-smi'], capture_output=True, text=True)
        if result.returncode == 0:
            nvidia_smi = result.stdout
    except FileNotFoundError:
        nvidia_smi = "nvidia-smi not found"

    # Test GPU computation
    gpu_test_result = None
    if torch.cuda.is_available():
        try:
            # Simple GPU computation test
            device = torch.device('cuda:0')
            x = torch.randn(1000, 1000, device=device)
            y = torch.randn(1000, 1000, device=device)

            start_time = torch.cuda.Event(enable_timing=True)
            end_time = torch.cuda.Event(enable_timing=True)

            start_time.record()
            z = torch.mm(x, y)
            end_time.record()
            # Block until both events have completed so elapsed_time() is valid
            torch.cuda.synchronize()

            gpu_test_result = {
                'test_passed': True,
                'computation_time_ms': start_time.elapsed_time(end_time),
                'result_shape': z.shape,
                'memory_allocated_mb': torch.cuda.memory_allocated() / (1024**2),
                'memory_reserved_mb': torch.cuda.memory_reserved() / (1024**2)
            }
        except Exception as e:
            gpu_test_result = {
                'test_passed': False,
                'error': str(e)
            }

    return {
        'system_info': system_info,
        'torch_info': torch_info,
        'gpu_info': gpu_info,
        'nvidia_smi': nvidia_smi,
        'gpu_test': gpu_test_result
    }

# Run GPU verification
# gpu_status = verify_lambda_gpu_setup()
# print(json.dumps(gpu_status, indent=2, default=str))
print("GPU verification function defined. Uncomment the lines above to run on Lambda Cloud.")
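The raw `nvidia-smi` table captured above is awkward to process programmatically. As a sketch, the tool's query mode (`--query-gpu=... --format=csv,noheader,nounits`) yields one comma-separated line per GPU that is easy to parse. `parse_gpu_query` and `query_gpus` are hypothetical helper names, and the three fields chosen here are only an example.

```python
import subprocess

GPU_QUERY_FIELDS = ["name", "memory.total", "memory.used"]


def parse_gpu_query(csv_text):
    """Parse `nvidia-smi --query-gpu=name,memory.total,memory.used` CSV output."""
    gpus = []
    for line in csv_text.strip().splitlines():
        name, total_mib, used_mib = [field.strip() for field in line.split(",")]
        gpus.append({
            "name": name,
            "memory_total_mib": int(total_mib),
            "memory_used_mib": int(used_mib),
        })
    return gpus


def query_gpus():
    """Run nvidia-smi in query mode; returns [] if the tool is unavailable."""
    cmd = [
        "nvidia-smi",
        f"--query-gpu={','.join(GPU_QUERY_FIELDS)}",
        "--format=csv,noheader,nounits",
    ]
    try:
        result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    except (FileNotFoundError, subprocess.CalledProcessError):
        return []
    return parse_gpu_query(result.stdout)
```

Inside a `@cluster` function, `query_gpus()` could replace the plain `nvidia-smi` call when structured per-GPU memory figures are more useful than the formatted table.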

Example 1: Distributed Deep Learning Training

[ ]:
@cluster(cores=8, memory="16GB", time="01:30:00")
def lambda_deep_learning_training(model_config, training_config):
    """Train a deep learning model on Lambda Cloud GPU."""
    import torch
    import torch.nn as nn
    import torch.optim as optim
    from torch.utils.data import DataLoader, TensorDataset
    import numpy as np
    import time

    # Set device
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    print(f"Training on device: {device}")

    # Create synthetic dataset
    n_samples = training_config['n_samples']
    n_features = training_config['n_features']
    n_classes = training_config['n_classes']

    # Generate random data
    X = torch.randn(n_samples, n_features)
    y = torch.randint(0, n_classes, (n_samples,))

    # Create dataset and dataloader
    dataset = TensorDataset(X, y)
    dataloader = DataLoader(
        dataset,
        batch_size=training_config['batch_size'],
        shuffle=True
    )

    # Define model architecture
    class DeepNet(nn.Module):
        def __init__(self, input_size, hidden_sizes, output_size, dropout=0.2):
            super(DeepNet, self).__init__()

            layers = []
            prev_size = input_size

            for hidden_size in hidden_sizes:
                layers.extend([
                    nn.Linear(prev_size, hidden_size),
                    nn.ReLU(),
                    nn.BatchNorm1d(hidden_size),
                    nn.Dropout(dropout)
                ])
                prev_size = hidden_size

            layers.append(nn.Linear(prev_size, output_size))
            self.network = nn.Sequential(*layers)

        def forward(self, x):
            return self.network(x)

    # Create model
    model = DeepNet(
        input_size=n_features,
        hidden_sizes=model_config['hidden_sizes'],
        output_size=n_classes,
        dropout=model_config.get('dropout', 0.2)
    ).to(device)

    # Loss and optimizer
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.Adam(
        model.parameters(),
        lr=training_config['learning_rate'],
        weight_decay=training_config.get('weight_decay', 1e-4)
    )

    # Training loop
    model.train()
    training_start = time.time()

    epoch_losses = []
    epoch_accuracies = []

    for epoch in range(training_config['epochs']):
        epoch_loss = 0.0
        correct = 0
        total = 0

        for batch_idx, (data, target) in enumerate(dataloader):
            data, target = data.to(device), target.to(device)

            optimizer.zero_grad()
            output = model(data)
            loss = criterion(output, target)
            loss.backward()
            optimizer.step()

            epoch_loss += loss.item()
            _, predicted = torch.max(output.data, 1)
            total += target.size(0)
            correct += (predicted == target).sum().item()

        avg_loss = epoch_loss / len(dataloader)
        accuracy = 100.0 * correct / total

        epoch_losses.append(avg_loss)
        epoch_accuracies.append(accuracy)

        if epoch % 10 == 0 or epoch == training_config['epochs'] - 1:
            print(f'Epoch {epoch+1}/{training_config["epochs"]}: '
                  f'Loss: {avg_loss:.4f}, Accuracy: {accuracy:.2f}%')

    training_time = time.time() - training_start

    # Model evaluation
    model.eval()
    with torch.no_grad():
        test_data = torch.randn(1000, n_features).to(device)
        test_output = model(test_data)
        test_predictions = torch.max(test_output, 1)[1]

    # Memory usage
    memory_info = {}
    if torch.cuda.is_available():
        memory_info = {
            'allocated_mb': torch.cuda.memory_allocated() / (1024**2),
            'reserved_mb': torch.cuda.memory_reserved() / (1024**2),
            'max_allocated_mb': torch.cuda.max_memory_allocated() / (1024**2)
        }

    return {
        'training_completed': True,
        'device_used': str(device),
        'model_parameters': sum(p.numel() for p in model.parameters()),
        'trainable_parameters': sum(p.numel() for p in model.parameters() if p.requires_grad),
        'training_time': training_time,
        'final_loss': epoch_losses[-1],
        'final_accuracy': epoch_accuracies[-1],
        'best_accuracy': max(epoch_accuracies),
        'epoch_losses': epoch_losses,
        'epoch_accuracies': epoch_accuracies,
        'memory_info': memory_info,
        'model_architecture': str(model)
    }

# Example configuration
model_config = {
    'hidden_sizes': [512, 256, 128, 64],
    'dropout': 0.3
}

training_config = {
    'n_samples': 10000,
    'n_features': 100,
    'n_classes': 10,
    'batch_size': 64,
    'epochs': 50,
    'learning_rate': 0.001,
    'weight_decay': 1e-4
}

# Run training
# result = lambda_deep_learning_training(model_config, training_config)
# print(f"Training completed! Final accuracy: {result['final_accuracy']:.2f}%")
# print(f"Training time: {result['training_time']:.2f} seconds")
# print(f"GPU memory used: {result['memory_info'].get('max_allocated_mb', 0):.1f} MB")

print("Deep learning training function defined. Uncomment the lines above to run on Lambda Cloud.")
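The `model_parameters` value returned above can be checked by hand, which is a useful sanity test before paying for GPU time. A sketch of the arithmetic, assuming the `DeepNet` layout shown above (each hidden block is a `Linear` layer followed by `BatchNorm1d`, whose learnable weight and bias add two parameters per feature); `count_deepnet_parameters` is a hypothetical helper:

```python
def count_deepnet_parameters(input_size, hidden_sizes, output_size):
    """Count learnable parameters of the DeepNet architecture defined above."""
    total = 0
    prev = input_size
    for hidden in hidden_sizes:
        total += prev * hidden + hidden  # Linear: weight matrix + bias
        total += 2 * hidden              # BatchNorm1d: scale (weight) + shift (bias)
        prev = hidden
    total += prev * output_size + output_size  # final Linear classifier
    return total


# Example configuration from this section: 100 features, 10 classes
print(count_deepnet_parameters(100, [512, 256, 128, 64], 10))  # 226762
```

ReLU and Dropout have no parameters, and the running statistics kept by `BatchNorm1d` are buffers rather than parameters, so neither appears in the count.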

Example 2: Transformer Model Fine-tuning

[ ]:
@cluster(cores=8, memory="32GB", time="02:00:00")
def lambda_transformer_finetuning(model_name, training_params):
    """Fine-tune a transformer model on Lambda Cloud GPU."""
    import torch
    from transformers import (
        AutoTokenizer, AutoModelForSequenceClassification,
        TrainingArguments, Trainer, DataCollatorWithPadding
    )
    from datasets import Dataset
    import numpy as np
    import time

    # Set device
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    print(f"Fine-tuning on device: {device}")

    # Load tokenizer and model
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(
        model_name,
        num_labels=training_params['num_labels']
    )

    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token

    # Create synthetic dataset
    def generate_synthetic_text_data(n_samples, num_labels):
        """Generate synthetic text classification data."""

        # Simple text templates for each class (at most three classes are defined)
        templates = {
            0: ["This is a positive example about {}", "Great work on {}", "Excellent {}"],
            1: ["This is a negative example about {}", "Poor {}", "Terrible {}"],
            2: ["This is a neutral example about {}", "Average {}", "Okay {}"]
        }
        if num_labels > len(templates):
            raise ValueError(f"Only {len(templates)} label templates are defined, got num_labels={num_labels}")

        topics = ["technology", "sports", "food", "movies", "music", "books", "travel", "science"]

        texts = []
        labels = []

        for _ in range(n_samples):
            label = np.random.randint(0, num_labels)
            template = np.random.choice(templates[label])
            topic = np.random.choice(topics)
            text = template.format(topic)

            texts.append(text)
            labels.append(label)

        return texts, labels

    # Generate data
    train_texts, train_labels = generate_synthetic_text_data(
        training_params['train_samples'], training_params['num_labels']
    )
    eval_texts, eval_labels = generate_synthetic_text_data(
        training_params['eval_samples'], training_params['num_labels']
    )

    # Tokenize data
    def tokenize_function(examples):
        return tokenizer(
            examples['text'],
            truncation=True,
            padding=True,
            max_length=training_params.get('max_length', 512)
        )

    # Create datasets
    train_dataset = Dataset.from_dict({'text': train_texts, 'labels': train_labels})
    eval_dataset = Dataset.from_dict({'text': eval_texts, 'labels': eval_labels})

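    # Drop the raw 'text' column: with remove_unused_columns=False below, the
    # collator would otherwise receive strings it cannot convert to tensors
    train_dataset = train_dataset.map(tokenize_function, batched=True, remove_columns=['text'])
    eval_dataset = eval_dataset.map(tokenize_function, batched=True, remove_columns=['text'])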

    # Data collator
    data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

    # Training arguments
    training_args = TrainingArguments(
        output_dir='/tmp/results',
        num_train_epochs=training_params.get('epochs', 3),
        per_device_train_batch_size=training_params.get('batch_size', 8),
        per_device_eval_batch_size=training_params.get('eval_batch_size', 8),
        warmup_steps=training_params.get('warmup_steps', 100),
        weight_decay=training_params.get('weight_decay', 0.01),
        learning_rate=training_params.get('learning_rate', 2e-5),
        logging_dir='/tmp/logs',
        logging_steps=10,
        evaluation_strategy="epoch",  # newer transformers releases name this argument eval_strategy
        save_strategy="epoch",
        load_best_model_at_end=True,
        metric_for_best_model="eval_loss",
        greater_is_better=False,
        fp16=torch.cuda.is_available(),  # Use mixed precision if GPU available
        dataloader_pin_memory=torch.cuda.is_available(),
        remove_unused_columns=False
    )

    # Define compute metrics
    def compute_metrics(eval_pred):
        predictions, labels = eval_pred
        predictions = np.argmax(predictions, axis=1)
        accuracy = (predictions == labels).mean()
        return {'accuracy': accuracy}

    # Create trainer
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
        tokenizer=tokenizer,
        data_collator=data_collator,
        compute_metrics=compute_metrics
    )

    # Training
    start_time = time.time()
    train_result = trainer.train()
    training_time = time.time() - start_time

    # Final evaluation
    eval_result = trainer.evaluate()

    # Memory usage
    memory_info = {}
    if torch.cuda.is_available():
        memory_info = {
            'allocated_mb': torch.cuda.memory_allocated() / (1024**2),
            'reserved_mb': torch.cuda.memory_reserved() / (1024**2),
            'max_allocated_mb': torch.cuda.max_memory_allocated() / (1024**2)
        }

    # Model info
    total_params = sum(p.numel() for p in model.parameters())
    trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)

    return {
        'model_name': model_name,
        'device_used': str(device),
        'training_completed': True,
        'training_time': training_time,
        'total_parameters': total_params,
        'trainable_parameters': trainable_params,
        'train_loss': train_result.training_loss,
        'eval_loss': eval_result['eval_loss'],
        'eval_accuracy': eval_result['eval_accuracy'],
        'train_steps': train_result.global_step,
        'memory_info': memory_info,
        'training_params': training_params
    }

# Example configuration
training_params = {
    'num_labels': 3,
    'train_samples': 1000,
    'eval_samples': 200,
    'epochs': 3,
    'batch_size': 16,
    'eval_batch_size': 32,
    'learning_rate': 2e-5,
    'weight_decay': 0.01,
    'warmup_steps': 100,
    'max_length': 256
}

# Run fine-tuning
# result = lambda_transformer_finetuning('distilbert-base-uncased', training_params)
# print(f"Fine-tuning completed! Final accuracy: {result['eval_accuracy']:.4f}")
# print(f"Training time: {result['training_time']:.2f} seconds")
# print(f"Model parameters: {result['total_parameters']:,}")
# print(f"GPU memory used: {result['memory_info'].get('max_allocated_mb', 0):.1f} MB")

print("Transformer fine-tuning function defined. Uncomment the lines above to run on Lambda Cloud.")
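With only 1,000 training samples, it is worth checking how many optimizer steps the run performs relative to `warmup_steps`. A small sketch of the arithmetic, assuming a single GPU, no gradient accumulation, and a final partial batch that is kept rather than dropped; `total_training_steps` is a hypothetical helper:

```python
import math


def total_training_steps(train_samples, batch_size, epochs):
    """Estimate optimizer steps: batches per epoch (rounded up) times epochs."""
    return math.ceil(train_samples / batch_size) * epochs


# Values from the example configuration above
steps = total_training_steps(train_samples=1000, batch_size=16, epochs=3)
warmup_steps = 100
print(f"{steps} optimizer steps, {warmup_steps / steps:.0%} of them in warmup")
```

For the example configuration this gives 189 steps, so a 100-step warmup covers more than half of training; a smaller `warmup_steps` value may suit short runs like this one better.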

Example 3: Computer Vision with Large Datasets

[ ]:
@cluster(cores=8, memory="32GB", time="01:30:00")
def lambda_computer_vision_training(model_config, data_config):
    """Train a computer vision model on Lambda Cloud GPU."""
    import torch
    import torch.nn as nn
    import torch.optim as optim
    import torchvision
    import torchvision.transforms as transforms
    from torch.utils.data import DataLoader, TensorDataset
    import numpy as np
    import time

    # Set device
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    print(f"Training computer vision model on device: {device}")

    # Data augmentation and preprocessing
    transform_train = transforms.Compose([
        transforms.ToPILImage(),
        transforms.RandomResizedCrop(data_config['image_size']),
        transforms.RandomHorizontalFlip(p=0.5),
        transforms.RandomRotation(degrees=15),
        transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2, hue=0.1),
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
    ])

    transform_val = transforms.Compose([
        transforms.ToPILImage(),
        transforms.Resize((data_config['image_size'], data_config['image_size'])),
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
    ])

    # Generate synthetic image data
    def create_synthetic_images(n_samples, image_size, n_channels, n_classes):
        """Create synthetic image dataset."""
        images = np.random.randint(0, 256, (n_samples, image_size, image_size, n_channels), dtype=np.uint8)
        labels = np.random.randint(0, n_classes, n_samples)
        return images, labels

    # Create datasets
    train_images, train_labels = create_synthetic_images(
        data_config['train_samples'],
        data_config['image_size'],
        data_config['n_channels'],
        data_config['n_classes']
    )

    val_images, val_labels = create_synthetic_images(
        data_config['val_samples'],
        data_config['image_size'],
        data_config['n_channels'],
        data_config['n_classes']
    )

    # Custom dataset class
    class SyntheticImageDataset(torch.utils.data.Dataset):
        def __init__(self, images, labels, transform=None):
            self.images = images
            self.labels = labels
            self.transform = transform

        def __len__(self):
            return len(self.images)

        def __getitem__(self, idx):
            image = self.images[idx]
            label = self.labels[idx]

            if self.transform:
                image = self.transform(image)
            else:
                image = torch.from_numpy(image).permute(2, 0, 1).float() / 255.0

            return image, label

    # Create data loaders
    train_dataset = SyntheticImageDataset(train_images, train_labels, transform_train)
    val_dataset = SyntheticImageDataset(val_images, val_labels, transform_val)

    train_loader = DataLoader(
        train_dataset,
        batch_size=data_config['batch_size'],
        shuffle=True,
        num_workers=4,
        pin_memory=torch.cuda.is_available()
    )

    val_loader = DataLoader(
        val_dataset,
        batch_size=data_config['batch_size'],
        shuffle=False,
        num_workers=4,
        pin_memory=torch.cuda.is_available()
    )

    # Model definition (the `weights` argument requires torchvision >= 0.13,
    # where it replaced the deprecated `pretrained` flag)
    models = torchvision.models
    n_classes = data_config['n_classes']
    if model_config['model_type'] == 'resnet':
        if model_config['pretrained']:
            model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
            model.fc = nn.Linear(model.fc.in_features, n_classes)
        else:
            model = models.resnet50(weights=None, num_classes=n_classes)
    elif model_config['model_type'] == 'efficientnet':
        if model_config['pretrained']:
            model = models.efficientnet_b0(weights=models.EfficientNet_B0_Weights.IMAGENET1K_V1)
            model.classifier[1] = nn.Linear(model.classifier[1].in_features, n_classes)
        else:
            model = models.efficientnet_b0(weights=None, num_classes=n_classes)
    else:
        raise ValueError(f"Unsupported model type: {model_config['model_type']}")

    model = model.to(device)

    # Loss and optimizer
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.AdamW(
        model.parameters(),
        lr=model_config['learning_rate'],
        weight_decay=model_config['weight_decay']
    )

    # Learning rate scheduler
    scheduler = optim.lr_scheduler.CosineAnnealingLR(
        optimizer, T_max=model_config['epochs']
    )

    # Training loop
    start_time = time.time()
    train_losses = []
    val_accuracies = []

    for epoch in range(model_config['epochs']):
        # Training phase
        model.train()
        running_loss = 0.0

        for batch_idx, (data, target) in enumerate(train_loader):
            data, target = data.to(device), target.to(device)

            optimizer.zero_grad()
            output = model(data)
            loss = criterion(output, target)
            loss.backward()
            optimizer.step()

            running_loss += loss.item()

        avg_train_loss = running_loss / len(train_loader)
        train_losses.append(avg_train_loss)

        # Validation phase
        model.eval()
        correct = 0
        total = 0
        val_loss = 0.0

        with torch.no_grad():
            for data, target in val_loader:
                data, target = data.to(device), target.to(device)
                output = model(data)
                val_loss += criterion(output, target).item()

                _, predicted = torch.max(output.data, 1)
                total += target.size(0)
                correct += (predicted == target).sum().item()

        val_accuracy = 100.0 * correct / total
        val_accuracies.append(val_accuracy)

        scheduler.step()

        if epoch % 5 == 0 or epoch == model_config['epochs'] - 1:
            print(f'Epoch {epoch+1}/{model_config["epochs"]}: '
                  f'Train Loss: {avg_train_loss:.4f}, '
                  f'Val Accuracy: {val_accuracy:.2f}%, '
                  f'LR: {scheduler.get_last_lr()[0]:.6f}')

    training_time = time.time() - start_time

    # Memory usage
    memory_info = {}
    if torch.cuda.is_available():
        memory_info = {
            'allocated_mb': torch.cuda.memory_allocated() / (1024**2),
            'reserved_mb': torch.cuda.memory_reserved() / (1024**2),
            'max_allocated_mb': torch.cuda.max_memory_allocated() / (1024**2)
        }

    return {
        'training_completed': True,
        'device_used': str(device),
        'model_type': model_config['model_type'],
        'model_parameters': sum(p.numel() for p in model.parameters()),
        'training_time': training_time,
        'final_train_loss': train_losses[-1],
        'final_val_accuracy': val_accuracies[-1],
        'best_val_accuracy': max(val_accuracies),
        'train_losses': train_losses,
        'val_accuracies': val_accuracies,
        'memory_info': memory_info,
        'data_config': data_config,
        'model_config': model_config
    }

# Example configuration
model_config = {
    'model_type': 'resnet',  # or 'efficientnet'
    'pretrained': True,
    'epochs': 20,
    'learning_rate': 0.001,
    'weight_decay': 1e-4
}

data_config = {
    'train_samples': 5000,
    'val_samples': 1000,
    'image_size': 224,
    'n_channels': 3,
    'n_classes': 10,
    'batch_size': 32
}

# Run training
# result = lambda_computer_vision_training(model_config, data_config)
# print(f"CV training completed! Best accuracy: {result['best_val_accuracy']:.2f}%")
# print(f"Training time: {result['training_time']:.2f} seconds")
# print(f"Model parameters: {result['model_parameters']:,}")
# print(f"GPU memory used: {result['memory_info'].get('max_allocated_mb', 0):.1f} MB")

print("Computer vision training function defined. Uncomment the lines above to run on Lambda Cloud.")
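The `CosineAnnealingLR` scheduler used above decays the learning rate along a half cosine from its initial value to `eta_min` (0 by default) over `T_max` epochs. The closed-form schedule can be sketched in plain Python to preview the learning rate at any epoch; `cosine_annealing_lr` is an illustrative helper, not a PyTorch function.

```python
import math


def cosine_annealing_lr(base_lr, epoch, t_max, eta_min=0.0):
    """Closed-form cosine annealing: learning rate after `epoch` scheduler steps."""
    return eta_min + 0.5 * (base_lr - eta_min) * (1 + math.cos(math.pi * epoch / t_max))


# Preview the schedule for the example configuration (lr=0.001, 20 epochs)
for epoch in (0, 5, 10, 15, 20):
    print(f"epoch {epoch:2d}: lr = {cosine_annealing_lr(0.001, epoch, 20):.6f}")
```

The rate starts at the configured value, passes through half of it at the midpoint, and reaches `eta_min` at `T_max`, which is why the last epochs of the run above train with a very small learning rate.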

Multi-GPU Training on Lambda Cloud

[ ]:
@cluster(cores=16, memory="128GB", time="04:00:00")
def lambda_multi_gpu_training(model_config, training_config):
    """Multi-GPU training example using PyTorch DDP."""
    import torch
    import torch.nn as nn
    import torch.multiprocessing as mp
    from torch.nn.parallel import DistributedDataParallel as DDP
    from torch.distributed import init_process_group, destroy_process_group
    import os

    def setup_ddp(rank, world_size):
        """Setup distributed data parallel."""
        os.environ['MASTER_ADDR'] = 'localhost'
        os.environ['MASTER_PORT'] = '12355'
        init_process_group(backend="nccl", rank=rank, world_size=world_size)
        torch.cuda.set_device(rank)

    def cleanup_ddp():
        """Clean up distributed training."""
        destroy_process_group()

    def train_on_gpu(rank, world_size, model_config, training_config):
        """Training function for each GPU."""
        setup_ddp(rank, world_size)

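        # Create model and move to GPU.
        # create_model and create_distributed_dataloader are placeholders that
        # you must define for your own model and dataset.
        model = create_model(model_config).to(rank)
        model = DDP(model, device_ids=[rank])

        # Create data loader with DistributedSampler
        train_loader = create_distributed_dataloader(training_config, rank, world_size)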

        # Training loop
        optimizer = torch.optim.AdamW(model.parameters(), lr=training_config['lr'])

        for epoch in range(training_config['epochs']):
            train_loader.sampler.set_epoch(epoch)  # Important for proper shuffling

            for batch_idx, (data, target) in enumerate(train_loader):
                data, target = data.to(rank), target.to(rank)

                optimizer.zero_grad()
                output = model(data)
                loss = nn.CrossEntropyLoss()(output, target)
                loss.backward()
                optimizer.step()

                if rank == 0 and batch_idx % 100 == 0:
                    print(f'Epoch {epoch}, Batch {batch_idx}, Loss: {loss.item():.4f}')

        cleanup_ddp()

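    # Launch one training process per GPU.
    # Note: mp.spawn pickles its target function, so in a standalone script
    # train_on_gpu and its helpers need to live at module level; they are
    # nested here to keep the example compact.
    world_size = torch.cuda.device_count()
    if world_size < 1:
        raise RuntimeError("Multi-GPU training requires at least one CUDA device")
    print(f"Starting multi-GPU training on {world_size} GPUs")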

    mp.spawn(
        train_on_gpu,
        args=(world_size, model_config, training_config),
        nprocs=world_size,
        join=True
    )

    return {"training_completed": True, "gpus_used": world_size}
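The training loop above relies on a `DistributedSampler`, which gives each process its own slice of the dataset. The following plain-Python sketch illustrates the idea for the unshuffled case: pad the index list so it divides evenly, then take every `world_size`-th index starting at `rank`. It illustrates the partitioning scheme only and is not PyTorch's implementation; `shard_indices` is a hypothetical name, and the sketch assumes the dataset has at least `world_size` samples.

```python
import math


def shard_indices(dataset_len, rank, world_size):
    """Return the sample indices one rank would process (no shuffling)."""
    per_rank = math.ceil(dataset_len / world_size)
    total_size = per_rank * world_size
    indices = list(range(dataset_len))
    # Repeat leading indices so every rank receives the same number of samples
    indices += indices[: total_size - dataset_len]
    return indices[rank:total_size:world_size]


for rank in range(4):
    print(rank, shard_indices(10, rank, world_size=4))
```

With 10 samples and 4 ranks, each rank receives three indices and rank 2 gets `[2, 6, 0]`: two samples are repeated so that all ranks run the same number of batches, which DDP needs to keep gradient synchronization in step.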

HuggingFace Accelerate Example

Alternative approach using HuggingFace Accelerate for easier multi-GPU setup:

[ ]:
@cluster(cores=16, memory="128GB", time="04:00:00")
def lambda_accelerate_training(model_config, training_config):
    """Multi-GPU training using HuggingFace Accelerate."""
    from accelerate import Accelerator
    import torch
    import torch.nn as nn

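    # Initialize accelerator. It only spans multiple GPUs when the code is
    # started through `accelerate launch` or `notebook_launcher`; otherwise
    # it runs as a single process.
    accelerator = Accelerator()
    device = accelerator.device

    # Create model and optimizer.
    # create_model and create_dataloader are placeholders that you must
    # define for your own model and dataset.
    model = create_model(model_config)
    optimizer = torch.optim.AdamW(model.parameters(), lr=training_config['lr'])
    train_loader = create_dataloader(training_config)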

    # Prepare for distributed training
    model, optimizer, train_loader = accelerator.prepare(
        model, optimizer, train_loader
    )

    # Training loop
    criterion = nn.CrossEntropyLoss()  # build the loss once, not per batch
    model.train()
    for epoch in range(training_config['epochs']):
        for batch_idx, (data, target) in enumerate(train_loader):
            # accumulate() handles gradient accumulation and gradient syncing
            with accelerator.accumulate(model):
                output = model(data)
                loss = criterion(output, target)

                accelerator.backward(loss)
                optimizer.step()
                optimizer.zero_grad()

            # Log from the main process only to avoid duplicate output
            if accelerator.is_main_process and batch_idx % 100 == 0:
                print(f'Epoch {epoch}, Batch {batch_idx}, Loss: {loss.item():.4f}')

    return {
        "training_completed": True,
        "num_processes": accelerator.num_processes,
        "device": str(device)
    }

Multi-GPU Training on Lambda Cloud

Available Multi-GPU Instances

  • 2x A100 (40GB): ~$2.20/hour

  • 4x A100 (40GB): ~$4.40/hour

  • 8x A100 (40GB): ~$8.80/hour

  • 2x A100 (80GB): ~$2.80/hour

  • 4x A100 (80GB): ~$5.60/hour

  • 8x A100 (80GB): ~$11.20/hour

  • 8x H100: ~$20.00/hour (when available)
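
The hourly rates above translate directly into a cost per training run. The following is a minimal sketch, assuming the approximate rates listed in this tutorial rather than live pricing. The dictionary keys and the helper `estimate_run_cost` are hypothetical names chosen for illustration and are not part of Clustrix or Lambda Cloud.

```python
# Approximate hourly rates (USD) copied from the list above; not live pricing.
# The key names are illustrative, not official instance type identifiers.
MULTI_GPU_RATES = {
    "2x_a100_40gb": 2.20,
    "4x_a100_40gb": 4.40,
    "8x_a100_40gb": 8.80,
    "2x_a100_80gb": 2.80,
    "4x_a100_80gb": 5.60,
    "8x_a100_80gb": 11.20,
    "8x_h100": 20.00,
}


def estimate_run_cost(instance_type: str, hours: float) -> float:
    """Return the estimated cost in USD of running one instance for `hours`."""
    if hours < 0:
        raise ValueError("hours must be non-negative")
    return round(MULTI_GPU_RATES[instance_type] * hours, 2)


if __name__ == "__main__":
    for name in ("2x_a100_40gb", "8x_a100_40gb"):
        print(f"{name}: ${estimate_run_cost(name, 6):.2f} for 6 hours")
```

At these rates the price per GPU-hour is the same across the 2x, 4x, and 8x sizes of a given GPU, so a larger instance mainly pays off when the training job scales well enough to finish proportionally sooner.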

Setup Requirements

  1. Launch a multi-GPU instance via the Lambda Cloud console

  2. Install additional packages for distributed training:

    pip install accelerate deepspeed
    
  3. Configure Clustrix for the multi-GPU environment

  4. Choose an appropriate parallelization strategy

Parallelization Strategies

  • Data Parallel (DP): Simple single-process approach; works for most models but scales worse than DDP

  • Distributed Data Parallel (DDP): One process per GPU; better performance, recommended

  • Model Parallel: For very large models that don’t fit on a single GPU

  • Pipeline Parallel: For extremely large models

  • DeepSpeed ZeRO: For memory-efficient training of large models
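
A rough memory estimate can help choose between these strategies. The sketch below assumes mixed-precision training with Adam, where weights, gradients, and optimizer state take about 16 bytes per parameter; it ignores activations and framework overhead, so real usage will be higher. The function names and the returned labels are hypothetical and only illustrate the reasoning.

```python
# Mixed-precision Adam: 2 bytes (fp16 weights) + 2 bytes (fp16 gradients)
# + 12 bytes (fp32 master weights, momentum, variance) per parameter.
BYTES_PER_PARAM_MIXED_ADAM = 16


def model_state_gb(num_params: int) -> float:
    """Estimate memory for weights, gradients, and Adam state in GB (1e9 bytes)."""
    return num_params * BYTES_PER_PARAM_MIXED_ADAM / 1e9


def suggest_strategy(num_params: int, gpu_mem_gb: float, num_gpus: int) -> str:
    """Suggest a parallelization strategy from the rough estimate above."""
    needed = model_state_gb(num_params)
    if needed <= gpu_mem_gb:
        # Model state fits on one GPU: replicate it and split the data.
        return "DDP" if num_gpus > 1 else "single GPU"
    if needed <= gpu_mem_gb * num_gpus:
        # Fits only when the state is sharded or split across GPUs.
        return "DeepSpeed ZeRO or model parallel"
    return "does not fit: use more GPUs or offloading"


if __name__ == "__main__":
    print(suggest_strategy(1_000_000_000, gpu_mem_gb=40, num_gpus=8))
    print(suggest_strategy(7_000_000_000, gpu_mem_gb=80, num_gpus=8))
```

Because activations are excluded, treat a result close to the GPU's capacity as "does not fit" in practice.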

PyTorch DDP Example

Cost Optimization Strategies

[ ]:
# Import Clustrix cost monitoring functionality
from clustrix import cost_tracking_decorator, get_cost_monitor, generate_cost_report

# Example 1: Using the cost tracking decorator
@cost_tracking_decorator('lambda', 'a100_40gb')
@cluster(cores=8, memory="32GB")
def lambda_training_with_cost_tracking():
    """Example training function with automatic cost tracking."""
    import time
    import numpy as np

    # Simulate training workload
    print("Starting training...")
    time.sleep(2)  # Simulate 2 seconds of work

    # Simulate some compute
    data = np.random.randn(1000, 1000)
    result = np.dot(data, data.T)

    print("Training completed!")
    return {
        'model_accuracy': 0.95,
        'training_samples': 10000,
        'final_loss': 0.032
    }

# Example 2: Manual cost monitoring
def manual_cost_monitoring_example():
    """Example of manual cost monitoring."""
    # Start cost monitoring
    monitor = get_cost_monitor('lambda')
    if monitor:
        monitor.start_monitoring()

        # Your computation here
        import time
        time.sleep(1)

        # Stop monitoring and get report
        cost_report = monitor.stop_monitoring()
        if cost_report:
            print(f"Computation completed in {cost_report.duration_seconds:.2f} seconds")
            print(f"Estimated cost: ${cost_report.cost_estimate.estimated_cost:.4f}")
            print(f"GPUs monitored: {len(cost_report.resource_usage.gpu_stats or [])}")

            if cost_report.recommendations:
                print("Cost optimization recommendations:")
                for rec in cost_report.recommendations:
                    print(f"  - {rec}")

# Example 3: Generate real-time cost report
def get_current_cost_status():
    """Get current cost and resource status."""
    report = generate_cost_report('lambda', 'a100_40gb')
    if report:
        print("Current Lambda Cloud Status:")
        print(f"  CPU Usage: {report['resource_usage']['cpu_percent']:.1f}%")
        print(f"  Memory Usage: {report['resource_usage']['memory_percent']:.1f}%")
        gpu_stats = report['resource_usage']['gpu_stats']
        if gpu_stats:
            # Average utilization across all reported GPUs
            avg_gpu = sum(gpu['utilization_percent'] for gpu in gpu_stats) / len(gpu_stats)
            print(f"  GPU Usage: {avg_gpu:.1f}%")
        print(f"  Hourly Rate: ${report['cost_estimate']['hourly_rate']:.2f}")

# Example 4: Compare pricing across instance types
def compare_lambda_pricing():
    """Compare pricing for different Lambda Cloud instance types."""
    from clustrix import get_pricing_info

    pricing = get_pricing_info('lambda')
    if pricing:
        print("Lambda Cloud Instance Pricing (USD/hour):")

        # Group by category
        single_gpu = {k: v for k, v in pricing.items() if not k.startswith(('2x', '4x', '8x')) and k != 'default'}
        multi_gpu = {k: v for k, v in pricing.items() if k.startswith(('2x', '4x', '8x'))}

        print("\nSingle GPU Instances:")
        for instance, price in sorted(single_gpu.items(), key=lambda x: x[1]):
            print(f"  {instance:<15}: ${price:.2f}/hour")

        print("\nMulti-GPU Instances:")
        for instance, price in sorted(multi_gpu.items(), key=lambda x: x[1]):
            print(f"  {instance:<15}: ${price:.2f}/hour")

# Run examples (uncomment to test)
# print("1. Cost tracking decorator example:")
# result = lambda_training_with_cost_tracking()
# print(f"Training result: {result}")

# print("\n2. Manual cost monitoring example:")
# manual_cost_monitoring_example()

# print("\n3. Current cost status:")
# get_current_cost_status()

print("4. Lambda Cloud pricing comparison:")
compare_lambda_pricing()

print("\n✅ Lambda Cloud cost monitoring examples ready!")
print("💡 Use @cost_tracking_decorator('lambda', 'instance_type') for automatic cost tracking")

Lambda Cloud Cost Optimization

Cost Monitoring and Tracking

Monitor GPU utilization and track costs effectively:

💰 Instance Selection

  • RTX 6000 Ada: Best value for most ML workloads (~$0.75/hour)

  • A10: Good balance of performance and cost (~$0.60/hour)

  • A100 40GB: For large models requiring more VRAM (~$1.10/hour)

  • A100 80GB: Only when 40GB is insufficient (~$1.40/hour)

  • H100: Premium option for cutting-edge research (~$2.50/hour)

⏰ Usage Patterns

  • Use “persistent” instances for ongoing development

  • Terminate instances immediately after training completion

  • Schedule training jobs during off-peak hours if possible

  • Use local development for debugging, GPU for final training

🔧 Optimization Techniques

  • Mixed precision training (fp16) to reduce memory usage

  • Gradient accumulation for effective larger batch sizes

  • Model checkpointing to resume interrupted training

  • Efficient data loading with multiple workers

  • Early stopping to avoid overtraining
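
Early stopping is straightforward to add to any training loop. The following is a minimal, framework-independent sketch; the `EarlyStopping` class and its parameters are hypothetical names for illustration, not part of Clustrix or PyTorch.

```python
class EarlyStopping:
    """Stop training when the validation loss stops improving."""

    def __init__(self, patience: int = 3, min_delta: float = 0.0):
        self.patience = patience      # epochs to wait without improvement
        self.min_delta = min_delta    # smallest decrease that counts as improvement
        self.best_loss = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss: float) -> bool:
        """Record one epoch's validation loss; return True if training should stop."""
        if val_loss < self.best_loss - self.min_delta:
            self.best_loss = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


if __name__ == "__main__":
    stopper = EarlyStopping(patience=2)
    for epoch, val_loss in enumerate([1.0, 0.8, 0.9, 0.85]):
        if stopper.step(val_loss):
            print(f"Stopping after epoch {epoch}; best loss {stopper.best_loss}")
            break
```

Stopping a run a few epochs early on a per-hour GPU instance saves the cost of those epochs directly, so a small `patience` is usually a reasonable default for exploratory runs.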

📊 Monitoring and Management

  • Monitor GPU utilization with nvidia-smi

  • Track training progress with logging

  • Set training time limits to prevent runaway costs

  • Use Clustrix timeouts as safety nets

  • Regular cost reviews and budget alerts
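
A time limit can be derived from a budget rather than guessed. The sketch below converts a budget and an hourly rate into an `HH:MM:SS` string of the kind passed as `time=` to `@cluster` elsewhere in this tutorial. The helper `budget_to_time_limit` is a hypothetical name, and the rate must be supplied by you from current pricing.

```python
def budget_to_time_limit(budget_usd: float, hourly_rate: float) -> str:
    """Convert a budget into the longest affordable runtime as HH:MM:SS."""
    if budget_usd <= 0 or hourly_rate <= 0:
        raise ValueError("budget_usd and hourly_rate must be positive")
    # Round down so the limit never exceeds the budget.
    total_seconds = int(budget_usd * 3600 / hourly_rate)
    hours, remainder = divmod(total_seconds, 3600)
    minutes, seconds = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}"


if __name__ == "__main__":
    # A $10 budget on an instance billed at $2.50/hour allows four hours.
    print(budget_to_time_limit(10.0, 2.50))
```

The resulting string could then be used as, for example, `@cluster(cores=8, memory="32GB", time=budget_to_time_limit(10.0, 2.50))`, assuming the `time` argument behaves as in the earlier examples.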

🚀 Clustrix-Specific Optimizations

  • Use Clustrix auto-cleanup features

  • Implement job queuing for multiple experiments

  • Leverage Clustrix’s timeout mechanisms

  • Use remote environment caching

Best Practices and Troubleshooting

[ ]:
# Example usage of monitoring functions
def create_monitoring_script():
    """Create and save the GPU monitoring script."""
    script_content = '''#!/bin/bash
# Lambda Cloud monitoring script

echo "Lambda Cloud Training Monitor"
echo "============================"
echo "Start time: $(date)"
echo ""

# System information
echo "System Information:"
echo "------------------"
nvidia-smi --query-gpu=gpu_name,memory.total,power.draw --format=csv
echo ""

# Monitor GPU usage every 30 seconds
while true; do
    echo "GPU Status at $(date):"
    nvidia-smi --query-gpu=utilization.gpu,memory.used,memory.total,temperature.gpu --format=csv,noheader
    echo ""

    # Check if training process is still running
    if ! pgrep -f python > /dev/null; then
        echo "No Python processes found. Training may have completed."
        break
    fi

    sleep 30
done

echo "Monitoring completed at $(date)"
'''

    with open('monitor_training.sh', 'w') as f:
        f.write(script_content)

    # Make executable
    import os
    os.chmod('monitor_training.sh', 0o755)

    return "Monitoring script created: monitor_training.sh"

# Uncomment to create the monitoring script:
# result = create_monitoring_script()
# print(result)

Lambda Cloud Best Practices

GPU Monitoring Script

Use this monitoring script to track GPU usage during training. Save as monitor_training.sh and run with: bash monitor_training.sh

#!/bin/bash
# Lambda Cloud monitoring script

echo "Lambda Cloud Training Monitor"
echo "============================"
echo "Start time: $(date)"
echo ""

# System information
echo "System Information:"
echo "------------------"
nvidia-smi --query-gpu=gpu_name,memory.total,power.draw --format=csv
echo ""

# Monitor GPU usage every 30 seconds
while true; do
    echo "GPU Status at $(date):"
    nvidia-smi --query-gpu=utilization.gpu,memory.used,memory.total,temperature.gpu --format=csv,noheader
    echo ""

    # Check if training process is still running
    if ! pgrep -f python > /dev/null; then
        echo "No Python processes found. Training may have completed."
        break
    fi

    sleep 30
done

echo "Monitoring completed at $(date)"
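
The status lines printed by the loop above can also be parsed for logging or alerting. The following sketch assumes each line has the four comma-separated fields requested in the script (utilization, memory used, memory total, temperature), with unit suffixes such as `%` and `MiB`; the function name `parse_gpu_status` and the dictionary keys are hypothetical.

```python
def parse_gpu_status(line: str) -> dict:
    """Parse one status line: utilization, memory used, memory total, temperature."""
    fields = [field.strip() for field in line.split(",")]
    if len(fields) != 4:
        raise ValueError(f"expected 4 fields, got {len(fields)}: {line!r}")
    util, mem_used, mem_total, temp = fields
    # Each field is "<number>" or "<number> <unit>"; keep only the number.
    return {
        "utilization_percent": float(util.split()[0]),
        "memory_used_mib": float(mem_used.split()[0]),
        "memory_total_mib": float(mem_total.split()[0]),
        "temperature_c": float(temp.split()[0]),
    }


if __name__ == "__main__":
    # Sample line in the assumed format, not captured from a real instance.
    status = parse_gpu_status("45 %, 10240 MiB, 40960 MiB, 60")
    print(f"GPU at {status['utilization_percent']:.0f}% utilization")
```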

Lambda Cloud + Clustrix Best Practices

🚀 Performance Optimization

  • Use mixed precision (fp16, or bf16 on A100/H100) when possible

  • Optimize data loading with multiple workers and pin_memory

  • Use appropriate batch sizes to maximize GPU utilization

  • Enable tensor cores for compatible operations

  • Pre-allocate GPU memory to avoid fragmentation

💾 Data Management

  • Store datasets on fast NVMe storage when available

  • Use data streaming for very large datasets

  • Implement efficient data preprocessing pipelines

  • Cache frequently used data in memory

  • Use appropriate data formats (e.g., HDF5, Parquet)

🔧 Environment Setup

  • Use conda environments for reproducible setups

  • Pin package versions in requirements.txt

  • Install packages from conda-forge when possible

  • Use uv package manager for faster installs

  • Set up proper CUDA environment variables

🛠️ Development Workflow

  • Develop and debug locally, train on Lambda Cloud

  • Use small datasets for initial testing

  • Implement proper logging and monitoring

  • Save model checkpoints regularly

  • Use version control for experiment tracking

🔒 Security

  • Use SSH keys instead of passwords

  • Keep SSH keys secure and rotate regularly

  • Don’t store credentials in code or notebooks

  • Use environment variables for configuration

  • Monitor instance access logs
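
Reading credentials from the environment keeps them out of notebooks and version control. A minimal sketch, assuming the API key is exported under the variable name `LAMBDA_API_KEY`; that name and the helper `load_api_key` are hypothetical choices for illustration.

```python
import os


def load_api_key(env_var: str = "LAMBDA_API_KEY") -> str:
    """Read the API key from the environment instead of hard-coding it."""
    api_key = os.environ.get(env_var, "").strip()
    if not api_key:
        raise RuntimeError(
            f"{env_var} is not set; export it in your shell before running."
        )
    return api_key
```

Failing loudly when the variable is missing avoids sending requests with an empty key and makes misconfiguration obvious at startup.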

Common Issues and Solutions

❌ CUDA out of memory errors

Solutions:

  • Reduce batch size

  • Enable gradient checkpointing

  • Use mixed precision training

  • Clear GPU cache with torch.cuda.empty_cache()

  • Consider model parallelism for large models
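
Reducing the batch size can be automated. The sketch below retries a training step with half the batch size whenever an out-of-memory error occurs. It relies on PyTorch reporting CUDA out-of-memory conditions as a `RuntimeError` whose message contains "out of memory"; `run_with_batch_backoff` and `train_step` are hypothetical names, and `train_step` stands in for your own function.

```python
def run_with_batch_backoff(train_step, batch_size: int, min_batch_size: int = 1):
    """Call train_step(batch_size), halving the batch size after each OOM error.

    Returns a tuple (batch_size_that_worked, result_of_train_step).
    """
    while batch_size >= min_batch_size:
        try:
            return batch_size, train_step(batch_size)
        except RuntimeError as err:
            # Re-raise anything that is not an out-of-memory error.
            if "out of memory" not in str(err).lower():
                raise
            print(f"OOM at batch size {batch_size}; retrying with {batch_size // 2}")
            batch_size //= 2
    raise RuntimeError("Out of memory even at the minimum batch size")


if __name__ == "__main__":
    def fake_step(bs):
        # Stand-in for a real training step: pretend anything above 16 overflows.
        if bs > 16:
            raise RuntimeError("CUDA out of memory.")
        return "ok"

    print(run_with_batch_backoff(fake_step, 64))
```

In a real training loop it is usually worth calling `torch.cuda.empty_cache()` before each retry so that cached blocks from the failed attempt are released.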

❌ Slow data loading

Solutions:

  • Increase num_workers in DataLoader

  • Enable pin_memory for GPU transfers

  • Use faster storage (NVMe over network storage)

  • Implement data prefetching

  • Optimize data preprocessing
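
The DataLoader settings above can be collected in one place. The following sketch builds keyword arguments for `torch.utils.data.DataLoader`; the helper name `dataloader_kwargs` and the default of at most 8 workers are assumptions to tune for your instance.

```python
import os


def dataloader_kwargs(batch_size: int, use_gpu: bool = True, max_workers: int = 8) -> dict:
    """Build keyword arguments for torch.utils.data.DataLoader."""
    num_workers = min(max_workers, os.cpu_count() or 1)
    kwargs = {
        "batch_size": batch_size,
        "num_workers": num_workers,
        # Page-locked host memory speeds up host-to-GPU copies.
        "pin_memory": use_gpu,
    }
    if num_workers > 0:
        # These two options are only valid when worker processes are used.
        kwargs["prefetch_factor"] = 2
        kwargs["persistent_workers"] = True
    return kwargs


if __name__ == "__main__":
    print(dataloader_kwargs(batch_size=64))
```

The result would be used as `DataLoader(dataset, shuffle=True, **dataloader_kwargs(64))`. If GPU utilization stays low, raising `max_workers` toward the instance's CPU core count is a common first adjustment.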

❌ SSH connection timeouts

Solutions:

  • Configure SSH keep-alive settings

  • Use screen or tmux for long-running jobs

  • Implement proper error handling in Clustrix

  • Set appropriate timeout values

  • Monitor network connectivity
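
SSH keep-alive settings live in `~/.ssh/config`. The sketch below renders an entry using the OpenSSH client options `ServerAliveInterval` and `ServerAliveCountMax`; the helper name, the host alias, and the default values are assumptions, and the IP address in the example is a documentation placeholder.

```python
def ssh_keepalive_config(host_alias: str, host_ip: str, user: str = "ubuntu",
                         interval: int = 60, max_missed: int = 5) -> str:
    """Render an ~/.ssh/config entry that sends keep-alive probes.

    The client sends a probe every `interval` seconds and disconnects
    after `max_missed` consecutive probes go unanswered.
    """
    return (
        f"Host {host_alias}\n"
        f"    HostName {host_ip}\n"
        f"    User {user}\n"
        f"    ServerAliveInterval {interval}\n"
        f"    ServerAliveCountMax {max_missed}\n"
    )


if __name__ == "__main__":
    # 203.0.113.10 is a reserved documentation address; use your instance IP.
    print(ssh_keepalive_config("lambda-gpu", "203.0.113.10"))
```

Append the printed block to `~/.ssh/config`, after which `ssh lambda-gpu` picks up the keep-alive settings automatically.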

❌ Low GPU utilization

Solutions:

  • Increase batch size if memory allows

  • Optimize data loading pipeline

  • Use asynchronous data transfers

  • Profile code to identify bottlenecks

  • Consider multi-GPU training

❌ Package installation failures

Solutions:

  • Use conda for system-level packages

  • Check CUDA compatibility versions

  • Clear pip cache if needed

  • Use the --no-cache-dir flag for pip

  • Install packages in correct order

Instance Management and Cleanup

Lambda Cloud Instance Management

🔍 Check Running Instances

Via CLI:

lambda-cloud instance list

Via Web Console: visit https://cloud.lambdalabs.com/instances

⏹️ Terminate Instances

Terminate specific instance:

lambda-cloud instance terminate <INSTANCE_ID>

Terminate all instances (DANGEROUS!):

lambda-cloud instance list --format=csv | grep -v "instance_id" | cut -d',' -f1 | xargs -I {} lambda-cloud instance terminate {}

💾 Save Work Before Termination

Copy models from the instance to your local machine:

rsync -avz ubuntu@<INSTANCE_IP>:/path/to/models/ ./local_models/

Save logs and results:

scp -r ubuntu@<INSTANCE_IP>:/tmp/clustrix/ ./results/

📊 Cost Monitoring

Check current usage:

lambda-cloud instance list --format=table

List active instance types (multiply each by its hourly rate to estimate costs):

lambda-cloud instance list --format=csv | awk -F',' 'NR>1 {print $2, $3}' | while read -r type status; do
    if [ "$status" = "active" ]; then
        echo "Active instance: $type"
    fi
done

Automated Cleanup Script

Save this as lambda_cleanup.sh for automated instance management:

#!/bin/bash
# Automated cleanup script for Lambda Cloud
# Save as lambda_cleanup.sh

set -e

echo "Lambda Cloud Automated Cleanup"
echo "=============================="

# Check if lambda-cloud CLI is installed
if ! command -v lambda-cloud &> /dev/null; then
    echo "Error: lambda-cloud CLI not found. Please install it first."
    exit 1
fi

# List current instances
echo "Current instances:"
lambda-cloud instance list
echo ""

# Ask for confirmation
read -p "Do you want to terminate ALL instances? (y/N): " -n 1 -r
echo ""
if [[ ! $REPLY =~ ^[Yy]$ ]]; then
    echo "Cleanup cancelled."
    exit 0
fi

# Get instance IDs
INSTANCE_IDS=$(lambda-cloud instance list --format=csv | grep -v "instance_id" | cut -d',' -f1)

if [ -z "$INSTANCE_IDS" ]; then
    echo "No instances to terminate."
    exit 0
fi

# Terminate instances
echo "Terminating instances..."
for instance_id in $INSTANCE_IDS; do
    echo "Terminating instance: $instance_id"
    lambda-cloud instance terminate "$instance_id"
done

echo "All instances terminated."
echo "Please verify termination in the web console: https://cloud.lambdalabs.com/instances"

Clustrix Integration Manager

[ ]:
# Integrate cleanup with Clustrix workflows

from clustrix import configure
import subprocess
import time

class LambdaCloudManager:
    """Manager for Lambda Cloud instances with Clustrix integration."""

    def __init__(self):
        self.active_instances = []

    def launch_instance_for_clustrix(self, instance_type, ssh_key_name):
        """Launch instance and configure Clustrix."""
        # Launch instance
        result = subprocess.run([
            'lambda-cloud', 'instance', 'launch',
            '--instance-type', instance_type,
            '--ssh-key-name', ssh_key_name
        ], capture_output=True, text=True)

        if result.returncode != 0:
            raise RuntimeError(f"Failed to launch instance: {result.stderr}")

        # Placeholders: replace with values parsed from result.stdout
        instance_id = "extracted_from_output"
        instance_ip = "extracted_from_output"

        # Wait for instance to be ready
        time.sleep(60)  # Wait for startup

        # Configure Clustrix
        configure(
            cluster_type="ssh",
            cluster_host=instance_ip,
            username="ubuntu",
            key_file="~/.ssh/id_rsa",
            remote_work_dir="/tmp/clustrix",
            package_manager="auto",
            cleanup_on_success=True
        )

        self.active_instances.append({
            'id': instance_id,
            'ip': instance_ip,
            'type': instance_type,
            'launch_time': time.time()
        })

        return instance_id, instance_ip

    def cleanup_all_instances(self):
        """Terminate all managed instances, keeping any that fail for a retry."""
        remaining = []
        for instance in self.active_instances:
            try:
                subprocess.run([
                    'lambda-cloud', 'instance', 'terminate', instance['id']
                ], check=True)
                print(f"Terminated instance {instance['id']}")
            except (subprocess.CalledProcessError, FileNotFoundError) as e:
                print(f"Failed to terminate {instance['id']}: {e}")
                remaining.append(instance)

        # Keep failed terminations so they are not silently forgotten
        self.active_instances = remaining

    def __del__(self):
        """Best-effort cleanup; __del__ is not guaranteed to run at interpreter exit."""
        if self.active_instances:
            print("Warning: Active instances detected. Cleaning up...")
            self.cleanup_all_instances()

# Usage example:
# manager = LambdaCloudManager()
# try:
#     instance_id, ip = manager.launch_instance_for_clustrix('a100', 'my-ssh-key')
#     # Run your Clustrix computations
#     result = my_clustrix_function()
# finally:
#     manager.cleanup_all_instances()
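
A fixed 60-second sleep may be too short or needlessly long. As an alternative, the following sketch polls the instance's SSH port until it accepts a TCP connection. The helper `wait_for_ssh` and its default timeouts are assumptions, and an open port only shows that the SSH daemon is listening, not that the instance has finished every startup task.

```python
import socket
import time


def wait_for_ssh(host: str, port: int = 22, timeout: float = 300.0,
                 poll_interval: float = 5.0) -> bool:
    """Return True once a TCP connection to host:port succeeds, else False."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            with socket.create_connection((host, port), timeout=poll_interval):
                return True
        except OSError:
            # Not reachable yet (refused, timed out, or no route): retry.
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                return False
            time.sleep(min(poll_interval, remaining))
```

Inside `launch_instance_for_clustrix`, a call such as `wait_for_ssh(instance_ip)` could replace `time.sleep(60)`, raising an error if it returns `False`.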

Summary

This tutorial covered:

  1. Setup: Lambda Cloud account creation and instance management

  2. GPU Computing: High-performance GPU instances for ML workloads

  3. Deep Learning: PyTorch training with GPU acceleration

  4. Transformer Models: Fine-tuning with HuggingFace Transformers

  5. Computer Vision: CNN training with data augmentation

  6. Multi-GPU Training: Distributed training across multiple GPUs

  7. Cost Optimization: Strategies to minimize GPU computing costs

  8. Best Practices: Performance optimization and troubleshooting

  9. Instance Management: Automated cleanup and monitoring

Key Advantages of Lambda Cloud + Clustrix

  • GPU Focus: Specialized in high-performance GPU computing

  • Cost Effective: Competitive pricing for GPU instances

  • Simple Management: Easy instance launching and termination

  • High Performance: Latest NVIDIA GPUs (A100, H100, RTX)

  • Fast Networking: InfiniBand for multi-GPU communication

  • ML Optimized: Pre-configured environments for machine learning

  • Flexible Scaling: From single GPU to large multi-GPU clusters

Lambda Cloud Pricing Advantages

  • RTX 6000 Ada: Excellent price/performance for most ML workloads

  • A100 40GB/80GB: Industry-standard for large-scale training

  • H100: Cutting-edge performance for the most demanding workloads

  • Multi-GPU: Cost-effective scaling for distributed training

  • No Hidden Fees: Simple per-hour pricing

Next Steps

  1. Create your Lambda Cloud account and add SSH keys

  2. Start with a single GPU instance for testing

  3. Configure Clustrix for your Lambda Cloud instance

  4. Run the provided examples to verify setup

  5. Scale to multi-GPU instances for larger workloads

  6. Implement cost monitoring and automated cleanup

Use Cases

  • Deep Learning Research: Train large neural networks efficiently

  • Computer Vision: Process large image datasets with CNNs

  • NLP: Fine-tune transformer models on custom datasets

  • Scientific Computing: GPU-accelerated simulations and modeling

  • Prototyping: Rapid experimentation with different architectures

  • Production Training: Scale up successful experiments

Resources

Remember: Lambda Cloud excels at GPU computing! Always terminate instances when not in use to control costs, and leverage Clustrix’s distributed computing capabilities to scale your ML workloads efficiently.