{ "cells": [ { "cell_type": "markdown", "id": "lambda-title", "metadata": {}, "source": "# Lambda Cloud Tutorial\n\nThis tutorial demonstrates how to use Clustrix with Lambda Cloud for high-performance GPU computing and distributed machine learning.\n\n[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/ContextLab/clustrix/blob/master/docs/source/notebooks/lambda_cloud_tutorial.ipynb)\n\n## Overview\n\nLambda Cloud specializes in GPU cloud computing and integrates well with Clustrix for ML workloads:\n\n- **GPU-Optimized Instances**: High-performance NVIDIA GPUs (A100, H100, RTX)\n- **Cost-Effective**: Competitive pricing for GPU computing\n- **Simple Management**: Easy instance launching and management\n- **Pre-configured Environments**: ML-ready software stacks\n- **High-Speed Networking**: InfiniBand for multi-GPU communications\n- **Persistent Storage**: Fast NVMe and network storage options\n- **SSH Access**: Direct access for Clustrix integration\n- **On-Demand and Reserved**: Flexible pricing models\n\n## Prerequisites\n\n1. Lambda Cloud account with GPU credits\n2. SSH key pair for instance access\n3. Lambda Cloud API key (optional)\n4. 
Basic understanding of GPU computing", "outputs": [] }, { "cell_type": "markdown", "id": "installation", "metadata": {}, "source": [ "## Installation and Setup\n", "\n", "Install Clustrix with Lambda Cloud dependencies:" ] }, { "cell_type": "code", "execution_count": null, "id": "install", "metadata": {}, "outputs": [], "source": [ "# Install Clustrix with GPU and Lambda Cloud support\n", "!pip install clustrix torch torchvision transformers datasets accelerate\n", "\n", "# Import required libraries\n", "import clustrix\n", "from clustrix import cluster, configure\n", "import torch\n", "import numpy as np\n", "import time\n", "import json\n", "import requests\n", "import os" ] }, { "cell_type": "markdown", "id": "lambda-setup", "metadata": {}, "source": [ "## Lambda Cloud Authentication and Setup\n", "\n", "### Option 1: Web Console Setup" ] }, { "cell_type": "markdown", "id": "web-setup", "metadata": {}, "outputs": [], "source": "### Lambda Cloud Web Console Setup\n\n1. **Create Account:**\n - Visit https://lambdalabs.com/service/gpu-cloud\n - Sign up and verify your account\n - Add billing information and credits\n\n2. **Add SSH Key:**\n - Go to https://cloud.lambdalabs.com/ssh-keys\n - Click \"Add SSH Key\"\n - Paste your public key (cat ~/.ssh/id_rsa.pub)\n - Give it a descriptive name\n\n3. **Launch Instance:**\n - Go to https://cloud.lambdalabs.com/instances\n - Click \"Launch instance\"\n - Select instance type (A100, H100, RTX 6000 Ada, etc.)\n - Choose region (closest to you for best performance)\n - Select your SSH key\n - Launch the instance\n\n4. **Instance Types Available:**\n - RTX 6000 Ada: 48GB VRAM, ~$0.75/hour\n - A10: 24GB VRAM, ~$0.60/hour \n - A100 (40GB): 40GB VRAM, ~$1.10/hour\n - A100 (80GB): 80GB VRAM, ~$1.40/hour\n - H100: 80GB VRAM, ~$2.50/hour (when available)\n\n5. 
**Access Instance:**\n   - Wait for the instance status to show \"Running\"\n   - Note the public IP address\n   - SSH: ssh ubuntu@" }, { "cell_type": "markdown", "id": "a5k1lpava3n", "source": "**Follow this guide to set up your Lambda Cloud account and launch your first GPU instance.**", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "id": "lambda-api", "metadata": {}, "source": [ "### Option 2: API-Based Setup" ] }, { "cell_type": "code", "id": "api-setup", "metadata": {}, "outputs": [], "source": "import json\nimport os\n\nimport requests\n\n\nclass LambdaCloudAPI:\n    \"\"\"Minimal client for the Lambda Cloud REST API.\"\"\"\n\n    def __init__(self, api_key=None):\n        # Fall back to the LAMBDA_API_KEY environment variable if no key is passed\n        self.api_key = api_key or os.getenv('LAMBDA_API_KEY')\n        self.base_url = 'https://cloud.lambdalabs.com/api/v1'\n        self.headers = {\n            'Authorization': f'Bearer {self.api_key}',\n            'Content-Type': 'application/json'\n        }\n\n    def list_instance_types(self):\n        \"\"\"List available instance types.\"\"\"\n        response = requests.get(f'{self.base_url}/instance-types', headers=self.headers)\n        return response.json()\n\n    def list_instances(self):\n        \"\"\"List running instances.\"\"\"\n        response = requests.get(f'{self.base_url}/instances', headers=self.headers)\n        return response.json()\n\n    def launch_instance(self, instance_type, ssh_key_name, region='us-east-1', name=None):\n        \"\"\"Launch a new instance.\"\"\"\n        data = {\n            'instance_type_name': instance_type,\n            'ssh_key_names': [ssh_key_name],\n            'region_name': region\n        }\n        if name:\n            data['name'] = name\n\n        response = requests.post(f'{self.base_url}/instance-operations/launch',\n                                 headers=self.headers, json=data)\n        return response.json()\n\n    def terminate_instance(self, instance_id):\n        \"\"\"Terminate an instance.\"\"\"\n        data = {'instance_ids': [instance_id]}\n        response = requests.post(f'{self.base_url}/instance-operations/terminate',\n                                 headers=self.headers, json=data)\n        return response.json()\n\n    def get_instance_details(self, instance_id):\n        \"\"\"Get detailed information about an instance.\"\"\"\n        # Scan the instance list for a matching ID\n        instances = self.list_instances()\n        for 
instance in instances.get('data', []):\n            if instance['id'] == instance_id:\n                return instance\n        return None\n\n# Example usage:\n# api = LambdaCloudAPI()\n# instance_types = api.list_instance_types()\n# print(json.dumps(instance_types, indent=2))", "execution_count": null }, { "cell_type": "markdown", "id": "1fgfjnypmvp", "source": "### Lambda Cloud API Setup Guide\n\n#### CLI Setup Steps\n\n1. **Get API Key:**\n   - Go to https://cloud.lambdalabs.com/api-keys\n   - Generate a new API key\n   - Set it as an environment variable: `export LAMBDA_API_KEY=\"your-key\"`\n\n2. **Install Lambda Cloud CLI:**\n   ```bash\n   pip install lambda-cloud\n   lambda-cloud configure  # Enter your API key\n   ```\n\n3. **Basic CLI Commands:**\n   ```bash\n   # List available instance types\n   lambda-cloud instance-types list\n   \n   # List available regions\n   lambda-cloud regions list\n   \n   # Launch instance\n   lambda-cloud instance launch \\\n     --instance-type a100 \\\n     --ssh-key-name your-key-name \\\n     --region us-east-1\n   \n   # List running instances\n   lambda-cloud instance list\n   \n   # Terminate instance\n   lambda-cloud instance terminate \n   ```\n\n#### Python API Client\n\nThe `LambdaCloudAPI` class defined in the code cell above wraps the same operations (list, launch, terminate) in Python.", "metadata": {} }, { "cell_type": "markdown", "id": "clustrix-lambda-config", "metadata": {}, "source": [ "## Configure Clustrix for Lambda Cloud" ] }, { "cell_type": "code", "id": "config-lambda", "metadata": {}, "outputs": [], "source": "# Configure Clustrix to use your Lambda Cloud instance\nconfigure(\n    cluster_type=\"ssh\",\n    cluster_host=\"your-lambda-instance-ip\",  # Replace with actual IP\n    username=\"ubuntu\",  # Default Lambda Cloud user\n    key_file=\"~/.ssh/id_rsa\",  # Your private SSH key\n    remote_work_dir=\"/tmp/clustrix\",\n    package_manager=\"auto\",  # Will use uv if available\n    default_cores=8,  # Lambda instances typically have 8+ cores\n    default_memory=\"32GB\",  # Generous memory allocation\n    default_time=\"02:00:00\",  # Longer timeout for GPU tasks\n    environment_variables={\n        \"CUDA_VISIBLE_DEVICES\": \"0\",  # Use first GPU\n        
\"NVIDIA_VISIBLE_DEVICES\": \"all\"\n }\n)", "execution_count": null }, { "cell_type": "markdown", "id": "mm5s72ijgws", "source": "**Replace `your-lambda-instance-ip` with the actual IP address from your Lambda Cloud instance.**", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "id": "gpu-verification", "metadata": {}, "source": [ "### GPU Verification and Setup" ] }, { "cell_type": "code", "execution_count": null, "id": "verify-gpu", "metadata": {}, "outputs": [], "source": [ "@cluster(cores=2, memory=\"8GB\")\n", "def verify_lambda_gpu_setup():\n", " \"\"\"Verify GPU availability and setup on Lambda Cloud instance.\"\"\"\n", " import torch\n", " import subprocess\n", " import platform\n", " \n", " # System information\n", " system_info = {\n", " 'platform': platform.platform(),\n", " 'python_version': platform.python_version(),\n", " 'architecture': platform.architecture()[0]\n", " }\n", " \n", " # PyTorch and CUDA info\n", " torch_info = {\n", " 'pytorch_version': torch.__version__,\n", " 'cuda_available': torch.cuda.is_available(),\n", " 'cuda_version': torch.version.cuda if torch.cuda.is_available() else None,\n", " 'cudnn_version': torch.backends.cudnn.version() if torch.cuda.is_available() else None,\n", " 'device_count': torch.cuda.device_count() if torch.cuda.is_available() else 0\n", " }\n", " \n", " # GPU details\n", " gpu_info = []\n", " if torch.cuda.is_available():\n", " for i in range(torch.cuda.device_count()):\n", " props = torch.cuda.get_device_properties(i)\n", " gpu_info.append({\n", " 'device_id': i,\n", " 'name': props.name,\n", " 'total_memory_gb': props.total_memory / (1024**3),\n", " 'major': props.major,\n", " 'minor': props.minor,\n", " 'multiprocessor_count': props.multi_processor_count\n", " })\n", " \n", " # NVIDIA-SMI output\n", " nvidia_smi = None\n", " try:\n", " result = subprocess.run(['nvidia-smi'], capture_output=True, text=True)\n", " if result.returncode == 0:\n", " nvidia_smi = result.stdout\n", " except 
FileNotFoundError:\n", " nvidia_smi = \"nvidia-smi not found\"\n", " \n", " # Test GPU computation\n", " gpu_test_result = None\n", " if torch.cuda.is_available():\n", " try:\n", " # Simple GPU computation test\n", " device = torch.device('cuda:0')\n", " x = torch.randn(1000, 1000, device=device)\n", " y = torch.randn(1000, 1000, device=device)\n", " \n", " start_time = torch.cuda.Event(enable_timing=True)\n", " end_time = torch.cuda.Event(enable_timing=True)\n", " \n", " start_time.record()\n", " z = torch.mm(x, y)\n", " torch.cuda.synchronize()\n", " end_time.record()\n", " torch.cuda.synchronize()\n", " \n", " gpu_test_result = {\n", " 'test_passed': True,\n", " 'computation_time_ms': start_time.elapsed_time(end_time),\n", " 'result_shape': z.shape,\n", " 'memory_allocated_mb': torch.cuda.memory_allocated() / (1024**2),\n", " 'memory_reserved_mb': torch.cuda.memory_reserved() / (1024**2)\n", " }\n", " except Exception as e:\n", " gpu_test_result = {\n", " 'test_passed': False,\n", " 'error': str(e)\n", " }\n", " \n", " return {\n", " 'system_info': system_info,\n", " 'torch_info': torch_info,\n", " 'gpu_info': gpu_info,\n", " 'nvidia_smi': nvidia_smi,\n", " 'gpu_test': gpu_test_result\n", " }\n", "\n", "# Run GPU verification\n", "# gpu_status = verify_lambda_gpu_setup()\n", "# print(json.dumps(gpu_status, indent=2, default=str))\n", "print(\"GPU verification function defined. 
Uncomment the lines above to run on Lambda Cloud.\")" ] }, { "cell_type": "markdown", "id": "ml-training-example", "metadata": {}, "source": [ "## Example 1: Distributed Deep Learning Training" ] }, { "cell_type": "code", "execution_count": null, "id": "dl-training", "metadata": {}, "outputs": [], "source": [ "@cluster(cores=8, memory=\"16GB\", time=\"01:30:00\")\n", "def lambda_deep_learning_training(model_config, training_config):\n", " \"\"\"Train a deep learning model on Lambda Cloud GPU.\"\"\"\n", " import torch\n", " import torch.nn as nn\n", " import torch.optim as optim\n", " from torch.utils.data import DataLoader, TensorDataset\n", " import numpy as np\n", " import time\n", " \n", " # Set device\n", " device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')\n", " print(f\"Training on device: {device}\")\n", " \n", " # Create synthetic dataset\n", " n_samples = training_config['n_samples']\n", " n_features = training_config['n_features']\n", " n_classes = training_config['n_classes']\n", " \n", " # Generate random data\n", " X = torch.randn(n_samples, n_features)\n", " y = torch.randint(0, n_classes, (n_samples,))\n", " \n", " # Create dataset and dataloader\n", " dataset = TensorDataset(X, y)\n", " dataloader = DataLoader(\n", " dataset, \n", " batch_size=training_config['batch_size'], \n", " shuffle=True\n", " )\n", " \n", " # Define model architecture\n", " class DeepNet(nn.Module):\n", " def __init__(self, input_size, hidden_sizes, output_size, dropout=0.2):\n", " super(DeepNet, self).__init__()\n", " \n", " layers = []\n", " prev_size = input_size\n", " \n", " for hidden_size in hidden_sizes:\n", " layers.extend([\n", " nn.Linear(prev_size, hidden_size),\n", " nn.ReLU(),\n", " nn.BatchNorm1d(hidden_size),\n", " nn.Dropout(dropout)\n", " ])\n", " prev_size = hidden_size\n", " \n", " layers.append(nn.Linear(prev_size, output_size))\n", " self.network = nn.Sequential(*layers)\n", " \n", " def forward(self, x):\n", " return 
self.network(x)\n", " \n", " # Create model\n", " model = DeepNet(\n", " input_size=n_features,\n", " hidden_sizes=model_config['hidden_sizes'],\n", " output_size=n_classes,\n", " dropout=model_config.get('dropout', 0.2)\n", " ).to(device)\n", " \n", " # Loss and optimizer\n", " criterion = nn.CrossEntropyLoss()\n", " optimizer = optim.Adam(\n", " model.parameters(), \n", " lr=training_config['learning_rate'],\n", " weight_decay=training_config.get('weight_decay', 1e-4)\n", " )\n", " \n", " # Training loop\n", " model.train()\n", " training_start = time.time()\n", " \n", " epoch_losses = []\n", " epoch_accuracies = []\n", " \n", " for epoch in range(training_config['epochs']):\n", " epoch_loss = 0.0\n", " correct = 0\n", " total = 0\n", " \n", " for batch_idx, (data, target) in enumerate(dataloader):\n", " data, target = data.to(device), target.to(device)\n", " \n", " optimizer.zero_grad()\n", " output = model(data)\n", " loss = criterion(output, target)\n", " loss.backward()\n", " optimizer.step()\n", " \n", " epoch_loss += loss.item()\n", " _, predicted = torch.max(output.data, 1)\n", " total += target.size(0)\n", " correct += (predicted == target).sum().item()\n", " \n", " avg_loss = epoch_loss / len(dataloader)\n", " accuracy = 100.0 * correct / total\n", " \n", " epoch_losses.append(avg_loss)\n", " epoch_accuracies.append(accuracy)\n", " \n", " if epoch % 10 == 0 or epoch == training_config['epochs'] - 1:\n", " print(f'Epoch {epoch+1}/{training_config[\"epochs\"]}: '\n", " f'Loss: {avg_loss:.4f}, Accuracy: {accuracy:.2f}%')\n", " \n", " training_time = time.time() - training_start\n", " \n", " # Model evaluation\n", " model.eval()\n", " with torch.no_grad():\n", " test_data = torch.randn(1000, n_features).to(device)\n", " test_output = model(test_data)\n", " test_predictions = torch.max(test_output, 1)[1]\n", " \n", " # Memory usage\n", " memory_info = {}\n", " if torch.cuda.is_available():\n", " memory_info = {\n", " 'allocated_mb': 
torch.cuda.memory_allocated() / (1024**2),\n", " 'reserved_mb': torch.cuda.memory_reserved() / (1024**2),\n", " 'max_allocated_mb': torch.cuda.max_memory_allocated() / (1024**2)\n", " }\n", " \n", " return {\n", " 'training_completed': True,\n", " 'device_used': str(device),\n", " 'model_parameters': sum(p.numel() for p in model.parameters()),\n", " 'trainable_parameters': sum(p.numel() for p in model.parameters() if p.requires_grad),\n", " 'training_time': training_time,\n", " 'final_loss': epoch_losses[-1],\n", " 'final_accuracy': epoch_accuracies[-1],\n", " 'best_accuracy': max(epoch_accuracies),\n", " 'epoch_losses': epoch_losses,\n", " 'epoch_accuracies': epoch_accuracies,\n", " 'memory_info': memory_info,\n", " 'model_architecture': str(model)\n", " }\n", "\n", "# Example configuration\n", "model_config = {\n", " 'hidden_sizes': [512, 256, 128, 64],\n", " 'dropout': 0.3\n", "}\n", "\n", "training_config = {\n", " 'n_samples': 10000,\n", " 'n_features': 100,\n", " 'n_classes': 10,\n", " 'batch_size': 64,\n", " 'epochs': 50,\n", " 'learning_rate': 0.001,\n", " 'weight_decay': 1e-4\n", "}\n", "\n", "# Run training\n", "# result = lambda_deep_learning_training(model_config, training_config)\n", "# print(f\"Training completed! Final accuracy: {result['final_accuracy']:.2f}%\")\n", "# print(f\"Training time: {result['training_time']:.2f} seconds\")\n", "# print(f\"GPU memory used: {result['memory_info'].get('max_allocated_mb', 0):.1f} MB\")\n", "\n", "print(\"Deep learning training function defined. 
Uncomment the lines above to run on Lambda Cloud.\")" ] }, { "cell_type": "markdown", "id": "transformer-example", "metadata": {}, "source": [ "## Example 2: Transformer Model Fine-tuning" ] }, { "cell_type": "code", "execution_count": null, "id": "transformer-finetuning", "metadata": {}, "outputs": [], "source": [ "@cluster(cores=8, memory=\"32GB\", time=\"02:00:00\")\n", "def lambda_transformer_finetuning(model_name, training_params):\n", " \"\"\"Fine-tune a transformer model on Lambda Cloud GPU.\"\"\"\n", " import torch\n", " from transformers import (\n", " AutoTokenizer, AutoModelForSequenceClassification,\n", " TrainingArguments, Trainer, DataCollatorWithPadding\n", " )\n", " from datasets import Dataset\n", " import numpy as np\n", " import time\n", " \n", " # Set device\n", " device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')\n", " print(f\"Fine-tuning on device: {device}\")\n", " \n", " # Load tokenizer and model\n", " tokenizer = AutoTokenizer.from_pretrained(model_name)\n", " model = AutoModelForSequenceClassification.from_pretrained(\n", " model_name,\n", " num_labels=training_params['num_labels']\n", " )\n", " \n", " if tokenizer.pad_token is None:\n", " tokenizer.pad_token = tokenizer.eos_token\n", " \n", " # Create synthetic dataset\n", " def generate_synthetic_text_data(n_samples, num_labels):\n", " \"\"\"Generate synthetic text classification data.\"\"\"\n", " \n", " # Simple text templates for different classes\n", " templates = {\n", " 0: [\"This is a positive example about {}\", \"Great work on {}\", \"Excellent {}\"],\n", " 1: [\"This is a negative example about {}\", \"Poor {}\", \"Terrible {}\"],\n", " 2: [\"This is a neutral example about {}\", \"Average {}\", \"Okay {}\"] if num_labels > 2 else []\n", " }\n", " \n", " topics = [\"technology\", \"sports\", \"food\", \"movies\", \"music\", \"books\", \"travel\", \"science\"]\n", " \n", " texts = []\n", " labels = []\n", " \n", " for _ in range(n_samples):\n", " label = 
np.random.randint(0, num_labels)\n", " template = np.random.choice(templates[label])\n", " topic = np.random.choice(topics)\n", " text = template.format(topic)\n", " \n", " texts.append(text)\n", " labels.append(label)\n", " \n", " return texts, labels\n", " \n", " # Generate data\n", " train_texts, train_labels = generate_synthetic_text_data(\n", " training_params['train_samples'], training_params['num_labels']\n", " )\n", " eval_texts, eval_labels = generate_synthetic_text_data(\n", " training_params['eval_samples'], training_params['num_labels']\n", " )\n", " \n", " # Tokenize data\n", " def tokenize_function(examples):\n", " return tokenizer(\n", " examples['text'],\n", " truncation=True,\n", " padding=True,\n", " max_length=training_params.get('max_length', 512)\n", " )\n", " \n", " # Create datasets\n", " train_dataset = Dataset.from_dict({'text': train_texts, 'labels': train_labels})\n", " eval_dataset = Dataset.from_dict({'text': eval_texts, 'labels': eval_labels})\n", " \n", " train_dataset = train_dataset.map(tokenize_function, batched=True)\n", " eval_dataset = eval_dataset.map(tokenize_function, batched=True)\n", " \n", " # Data collator\n", " data_collator = DataCollatorWithPadding(tokenizer=tokenizer)\n", " \n", " # Training arguments\n", " training_args = TrainingArguments(\n", " output_dir='/tmp/results',\n", " num_train_epochs=training_params.get('epochs', 3),\n", " per_device_train_batch_size=training_params.get('batch_size', 8),\n", " per_device_eval_batch_size=training_params.get('eval_batch_size', 8),\n", " warmup_steps=training_params.get('warmup_steps', 100),\n", " weight_decay=training_params.get('weight_decay', 0.01),\n", " learning_rate=training_params.get('learning_rate', 2e-5),\n", " logging_dir='/tmp/logs',\n", " logging_steps=10,\n", " evaluation_strategy=\"epoch\",\n", " save_strategy=\"epoch\",\n", " load_best_model_at_end=True,\n", " metric_for_best_model=\"eval_loss\",\n", " greater_is_better=False,\n", " 
fp16=torch.cuda.is_available(),  # Use mixed precision if GPU available\n", "        dataloader_pin_memory=torch.cuda.is_available(),\n", "        # Let the Trainer drop the raw 'text' column; the padding collator cannot turn strings into tensors\n", "        remove_unused_columns=True\n", "    )\n", "    \n", "    # Define compute metrics\n", "    def compute_metrics(eval_pred):\n", "        predictions, labels = eval_pred\n", "        predictions = np.argmax(predictions, axis=1)\n", "        accuracy = (predictions == labels).mean()\n", "        return {'accuracy': accuracy}\n", "    \n", "    # Create trainer\n", "    trainer = Trainer(\n", "        model=model,\n", "        args=training_args,\n", "        train_dataset=train_dataset,\n", "        eval_dataset=eval_dataset,\n", "        tokenizer=tokenizer,\n", "        data_collator=data_collator,\n", "        compute_metrics=compute_metrics\n", "    )\n", "    \n", "    # Training\n", "    start_time = time.time()\n", "    train_result = trainer.train()\n", "    training_time = time.time() - start_time\n", "    \n", "    # Final evaluation\n", "    eval_result = trainer.evaluate()\n", "    \n", "    # Memory usage\n", "    memory_info = {}\n", "    if torch.cuda.is_available():\n", "        memory_info = {\n", "            'allocated_mb': torch.cuda.memory_allocated() / (1024**2),\n", "            'reserved_mb': torch.cuda.memory_reserved() / (1024**2),\n", "            'max_allocated_mb': torch.cuda.max_memory_allocated() / (1024**2)\n", "        }\n", "    \n", "    # Model info\n", "    total_params = sum(p.numel() for p in model.parameters())\n", "    trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)\n", "    \n", "    return {\n", "        'model_name': model_name,\n", "        'device_used': str(device),\n", "        'training_completed': True,\n", "        'training_time': training_time,\n", "        'total_parameters': total_params,\n", "        'trainable_parameters': trainable_params,\n", "        'train_loss': train_result.training_loss,\n", "        'eval_loss': eval_result['eval_loss'],\n", "        'eval_accuracy': eval_result['eval_accuracy'],\n", "        'train_steps': train_result.global_step,\n", "        'memory_info': memory_info,\n", "        'training_params': training_params\n", "    }\n", "\n", "# Example configuration\n", "training_params = {\n", " 
'num_labels': 3,\n", " 'train_samples': 1000,\n", " 'eval_samples': 200,\n", " 'epochs': 3,\n", " 'batch_size': 16,\n", " 'eval_batch_size': 32,\n", " 'learning_rate': 2e-5,\n", " 'weight_decay': 0.01,\n", " 'warmup_steps': 100,\n", " 'max_length': 256\n", "}\n", "\n", "# Run fine-tuning\n", "# result = lambda_transformer_finetuning('distilbert-base-uncased', training_params)\n", "# print(f\"Fine-tuning completed! Final accuracy: {result['eval_accuracy']:.4f}\")\n", "# print(f\"Training time: {result['training_time']:.2f} seconds\")\n", "# print(f\"Model parameters: {result['total_parameters']:,}\")\n", "# print(f\"GPU memory used: {result['memory_info'].get('max_allocated_mb', 0):.1f} MB\")\n", "\n", "print(\"Transformer fine-tuning function defined. Uncomment the lines above to run on Lambda Cloud.\")" ] }, { "cell_type": "markdown", "id": "computer-vision", "metadata": {}, "source": [ "## Example 3: Computer Vision with Large Datasets" ] }, { "cell_type": "code", "execution_count": null, "id": "cv-training", "metadata": {}, "outputs": [], "source": [ "@cluster(cores=8, memory=\"32GB\", time=\"01:30:00\")\n", "def lambda_computer_vision_training(model_config, data_config):\n", " \"\"\"Train a computer vision model on Lambda Cloud GPU.\"\"\"\n", " import torch\n", " import torch.nn as nn\n", " import torch.optim as optim\n", " import torchvision\n", " import torchvision.transforms as transforms\n", " from torch.utils.data import DataLoader, TensorDataset\n", " import numpy as np\n", " import time\n", " \n", " # Set device\n", " device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')\n", " print(f\"Training computer vision model on device: {device}\")\n", " \n", " # Data augmentation and preprocessing\n", " transform_train = transforms.Compose([\n", " transforms.ToPILImage(),\n", " transforms.RandomResizedCrop(data_config['image_size']),\n", " transforms.RandomHorizontalFlip(p=0.5),\n", " transforms.RandomRotation(degrees=15),\n", " 
transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2, hue=0.1),\n", " transforms.ToTensor(),\n", " transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])\n", " ])\n", " \n", " transform_val = transforms.Compose([\n", " transforms.ToPILImage(),\n", " transforms.Resize((data_config['image_size'], data_config['image_size'])),\n", " transforms.ToTensor(),\n", " transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])\n", " ])\n", " \n", " # Generate synthetic image data\n", " def create_synthetic_images(n_samples, image_size, n_channels, n_classes):\n", " \"\"\"Create synthetic image dataset.\"\"\"\n", " images = np.random.randint(0, 256, (n_samples, image_size, image_size, n_channels), dtype=np.uint8)\n", " labels = np.random.randint(0, n_classes, n_samples)\n", " return images, labels\n", " \n", " # Create datasets\n", " train_images, train_labels = create_synthetic_images(\n", " data_config['train_samples'],\n", " data_config['image_size'],\n", " data_config['n_channels'],\n", " data_config['n_classes']\n", " )\n", " \n", " val_images, val_labels = create_synthetic_images(\n", " data_config['val_samples'],\n", " data_config['image_size'],\n", " data_config['n_channels'],\n", " data_config['n_classes']\n", " )\n", " \n", " # Custom dataset class\n", " class SyntheticImageDataset(torch.utils.data.Dataset):\n", " def __init__(self, images, labels, transform=None):\n", " self.images = images\n", " self.labels = labels\n", " self.transform = transform\n", " \n", " def __len__(self):\n", " return len(self.images)\n", " \n", " def __getitem__(self, idx):\n", " image = self.images[idx]\n", " label = self.labels[idx]\n", " \n", " if self.transform:\n", " image = self.transform(image)\n", " else:\n", " image = torch.from_numpy(image).permute(2, 0, 1).float() / 255.0\n", " \n", " return image, label\n", " \n", " # Create data loaders\n", " train_dataset = SyntheticImageDataset(train_images, train_labels, 
transform_train)\n", " val_dataset = SyntheticImageDataset(val_images, val_labels, transform_val)\n", " \n", " train_loader = DataLoader(\n", " train_dataset,\n", " batch_size=data_config['batch_size'],\n", " shuffle=True,\n", " num_workers=4,\n", " pin_memory=True if torch.cuda.is_available() else False\n", " )\n", " \n", " val_loader = DataLoader(\n", " val_dataset,\n", " batch_size=data_config['batch_size'],\n", " shuffle=False,\n", " num_workers=4,\n", " pin_memory=True if torch.cuda.is_available() else False\n", " )\n", " \n", " # Model definition\n", " if model_config['model_type'] == 'resnet':\n", " if model_config['pretrained']:\n", " model = torchvision.models.resnet50(pretrained=True)\n", " model.fc = nn.Linear(model.fc.in_features, data_config['n_classes'])\n", " else:\n", " model = torchvision.models.resnet50(pretrained=False, num_classes=data_config['n_classes'])\n", " elif model_config['model_type'] == 'efficientnet':\n", " if model_config['pretrained']:\n", " model = torchvision.models.efficientnet_b0(pretrained=True)\n", " model.classifier[1] = nn.Linear(model.classifier[1].in_features, data_config['n_classes'])\n", " else:\n", " model = torchvision.models.efficientnet_b0(pretrained=False, num_classes=data_config['n_classes'])\n", " else:\n", " raise ValueError(f\"Unsupported model type: {model_config['model_type']}\")\n", " \n", " model = model.to(device)\n", " \n", " # Loss and optimizer\n", " criterion = nn.CrossEntropyLoss()\n", " optimizer = optim.AdamW(\n", " model.parameters(),\n", " lr=model_config['learning_rate'],\n", " weight_decay=model_config['weight_decay']\n", " )\n", " \n", " # Learning rate scheduler\n", " scheduler = optim.lr_scheduler.CosineAnnealingLR(\n", " optimizer, T_max=model_config['epochs']\n", " )\n", " \n", " # Training loop\n", " start_time = time.time()\n", " train_losses = []\n", " val_accuracies = []\n", " \n", " for epoch in range(model_config['epochs']):\n", " # Training phase\n", " model.train()\n", " running_loss 
= 0.0\n", " \n", " for batch_idx, (data, target) in enumerate(train_loader):\n", " data, target = data.to(device), target.to(device)\n", " \n", " optimizer.zero_grad()\n", " output = model(data)\n", " loss = criterion(output, target)\n", " loss.backward()\n", " optimizer.step()\n", " \n", " running_loss += loss.item()\n", " \n", " avg_train_loss = running_loss / len(train_loader)\n", " train_losses.append(avg_train_loss)\n", " \n", " # Validation phase\n", " model.eval()\n", " correct = 0\n", " total = 0\n", " val_loss = 0.0\n", " \n", " with torch.no_grad():\n", " for data, target in val_loader:\n", " data, target = data.to(device), target.to(device)\n", " output = model(data)\n", " val_loss += criterion(output, target).item()\n", " \n", " _, predicted = torch.max(output.data, 1)\n", " total += target.size(0)\n", " correct += (predicted == target).sum().item()\n", " \n", " val_accuracy = 100.0 * correct / total\n", " val_accuracies.append(val_accuracy)\n", " \n", " scheduler.step()\n", " \n", " if epoch % 5 == 0 or epoch == model_config['epochs'] - 1:\n", " print(f'Epoch {epoch+1}/{model_config[\"epochs\"]}: '\n", " f'Train Loss: {avg_train_loss:.4f}, '\n", " f'Val Accuracy: {val_accuracy:.2f}%, '\n", " f'LR: {scheduler.get_last_lr()[0]:.6f}')\n", " \n", " training_time = time.time() - start_time\n", " \n", " # Memory usage\n", " memory_info = {}\n", " if torch.cuda.is_available():\n", " memory_info = {\n", " 'allocated_mb': torch.cuda.memory_allocated() / (1024**2),\n", " 'reserved_mb': torch.cuda.memory_reserved() / (1024**2),\n", " 'max_allocated_mb': torch.cuda.max_memory_allocated() / (1024**2)\n", " }\n", " \n", " return {\n", " 'training_completed': True,\n", " 'device_used': str(device),\n", " 'model_type': model_config['model_type'],\n", " 'model_parameters': sum(p.numel() for p in model.parameters()),\n", " 'training_time': training_time,\n", " 'final_train_loss': train_losses[-1],\n", " 'final_val_accuracy': val_accuracies[-1],\n", " 
'best_val_accuracy': max(val_accuracies),\n", " 'train_losses': train_losses,\n", " 'val_accuracies': val_accuracies,\n", " 'memory_info': memory_info,\n", " 'data_config': data_config,\n", " 'model_config': model_config\n", " }\n", "\n", "# Example configuration\n", "model_config = {\n", " 'model_type': 'resnet', # or 'efficientnet'\n", " 'pretrained': True,\n", " 'epochs': 20,\n", " 'learning_rate': 0.001,\n", " 'weight_decay': 1e-4\n", "}\n", "\n", "data_config = {\n", " 'train_samples': 5000,\n", " 'val_samples': 1000,\n", " 'image_size': 224,\n", " 'n_channels': 3,\n", " 'n_classes': 10,\n", " 'batch_size': 32\n", "}\n", "\n", "# Run training\n", "# result = lambda_computer_vision_training(model_config, data_config)\n", "# print(f\"CV training completed! Best accuracy: {result['best_val_accuracy']:.2f}%\")\n", "# print(f\"Training time: {result['training_time']:.2f} seconds\")\n", "# print(f\"Model parameters: {result['model_parameters']:,}\")\n", "# print(f\"GPU memory used: {result['memory_info'].get('max_allocated_mb', 0):.1f} MB\")\n", "\n", "print(\"Computer vision training function defined. 
Uncomment the lines above to run on Lambda Cloud.\")" ] }, { "cell_type": "markdown", "id": "multi-gpu", "metadata": {}, "source": [ "## Multi-GPU Training on Lambda Cloud" ] }, { "cell_type": "code", "id": "multi-gpu-setup", "metadata": {}, "outputs": [], "source": "@cluster(cores=16, memory=\"128GB\", time=\"04:00:00\")\ndef lambda_multi_gpu_training(model_config, training_config):\n    \"\"\"Multi-GPU training example using PyTorch DDP.\n\n    Sketch only: create_model() and create_distributed_dataloader() are placeholders\n    that you must define for your own model and data.\n    \"\"\"\n    import torch\n    import torch.nn as nn\n    import torch.multiprocessing as mp\n    from torch.nn.parallel import DistributedDataParallel as DDP\n    from torch.distributed import init_process_group, destroy_process_group\n    import os\n    \n    def setup_ddp(rank, world_size):\n        \"\"\"Set up distributed data parallel for one process.\"\"\"\n        os.environ['MASTER_ADDR'] = 'localhost'\n        os.environ['MASTER_PORT'] = '12355'\n        init_process_group(backend=\"nccl\", rank=rank, world_size=world_size)\n        torch.cuda.set_device(rank)\n    \n    def cleanup_ddp():\n        \"\"\"Clean up distributed training.\"\"\"\n        destroy_process_group()\n    \n    # NOTE: mp.spawn pickles its target function, and a nested function like this one\n    # generally cannot be pickled. For real runs, move train_on_gpu and its helpers\n    # into an importable module.\n    def train_on_gpu(rank, world_size, model_config, training_config):\n        \"\"\"Training function for each GPU.\"\"\"\n        setup_ddp(rank, world_size)\n        \n        # Create model and move to GPU (create_model is a placeholder)\n        model = create_model(model_config).to(rank)\n        model = DDP(model, device_ids=[rank])\n        \n        # Create data loader with DistributedSampler (placeholder helper)\n        train_loader = create_distributed_dataloader(training_config, rank, world_size)\n        \n        # Training loop\n        optimizer = torch.optim.AdamW(model.parameters(), lr=training_config['lr'])\n        \n        for epoch in range(training_config['epochs']):\n            train_loader.sampler.set_epoch(epoch)  # Important for proper shuffling\n            \n            for batch_idx, (data, target) in enumerate(train_loader):\n                data, target = data.to(rank), target.to(rank)\n                \n                optimizer.zero_grad()\n                output = model(data)\n                loss = nn.CrossEntropyLoss()(output, target)\n                loss.backward()\n                optimizer.step()\n                \n                # Log from the first process only to avoid duplicate output\n                if rank == 0 and batch_idx % 100 == 0:\n                    print(f'Epoch {epoch}, Batch {batch_idx}, Loss: 
{loss.item():.4f}')\n \n cleanup_ddp()\n \n # Launch multi-GPU training\n world_size = torch.cuda.device_count()\n print(f\"Starting multi-GPU training on {world_size} GPUs\")\n \n mp.spawn(\n train_on_gpu,\n args=(world_size, model_config, training_config),\n nprocs=world_size,\n join=True\n )\n \n return {\"training_completed\": True, \"gpus_used\": world_size}", "execution_count": null }, { "cell_type": "markdown", "id": "3b6jiq9gqq2", "source": "### HuggingFace Accelerate Example\n\nAlternative approach using HuggingFace Accelerate for easier multi-GPU setup:", "metadata": {} }, { "cell_type": "code", "id": "ayqhp0nnsnb", "source": "@cluster(cores=16, memory=\"128GB\", time=\"04:00:00\")\ndef lambda_accelerate_training(model_config, training_config):\n \"\"\"Multi-GPU training using HuggingFace Accelerate.\"\"\"\n from accelerate import Accelerator\n import torch\n import torch.nn as nn\n \n # Initialize accelerator\n accelerator = Accelerator()\n device = accelerator.device\n \n # Create model and optimizer\n model = create_model(model_config)\n optimizer = torch.optim.AdamW(model.parameters(), lr=training_config['lr'])\n train_loader = create_dataloader(training_config)\n \n # Prepare for distributed training\n model, optimizer, train_loader = accelerator.prepare(\n model, optimizer, train_loader\n )\n \n # Training loop\n model.train()\n for epoch in range(training_config['epochs']):\n for batch_idx, (data, target) in enumerate(train_loader):\n with accelerator.accumulate(model):\n output = model(data)\n loss = nn.CrossEntropyLoss()(output, target)\n \n accelerator.backward(loss)\n optimizer.step()\n optimizer.zero_grad()\n \n if accelerator.is_main_process and batch_idx % 100 == 0:\n print(f'Epoch {epoch}, Batch {batch_idx}, Loss: {loss.item():.4f}')\n \n return {\n \"training_completed\": True,\n \"num_processes\": accelerator.num_processes,\n \"device\": str(device)\n }", "metadata": {}, "outputs": [], "execution_count": null }, { "cell_type": "markdown", 
"id": "1kzkcvp11ms", "source": "## Multi-GPU Training on Lambda Cloud\n\n### Available Multi-GPU Instances\n\n- **2x A100 (40GB)**: ~$2.20/hour\n- **4x A100 (40GB)**: ~$4.40/hour \n- **8x A100 (40GB)**: ~$8.80/hour\n- **2x A100 (80GB)**: ~$2.80/hour\n- **4x A100 (80GB)**: ~$5.60/hour\n- **8x A100 (80GB)**: ~$11.20/hour\n- **8x H100**: ~$20.00/hour (when available)\n\n### Setup Requirements\n\n1. **Launch multi-GPU instance** via Lambda Cloud console\n2. **Install additional packages** for distributed training:\n ```bash\n pip install accelerate deepspeed\n ```\n3. **Configure Clustrix** for multi-GPU environment\n4. **Use appropriate parallelization strategy**\n\n### Parallelization Strategies\n\n- **Data Parallel (DP)**: Simple, works for most models\n- **Distributed Data Parallel (DDP)**: Better performance, recommended\n- **Model Parallel**: For very large models that don't fit on single GPU\n- **Pipeline Parallel**: For extremely large models\n- **DeepSpeed ZeRO**: For memory-efficient training of large models\n\n### PyTorch DDP Example", "metadata": {} }, { "cell_type": "markdown", "id": "cost-optimization", "metadata": {}, "source": [ "## Cost Optimization Strategies" ] }, { "cell_type": "code", "id": "cost-optimization-lambda", "metadata": {}, "outputs": [], "source": "# Import Clustrix cost monitoring functionality\nfrom clustrix import cost_tracking_decorator, get_cost_monitor, generate_cost_report\n\n# Example 1: Using the cost tracking decorator\n@cost_tracking_decorator('lambda', 'a100_40gb')\n@cluster(cores=8, memory=\"32GB\")\ndef lambda_training_with_cost_tracking():\n \"\"\"Example training function with automatic cost tracking.\"\"\"\n import time\n import numpy as np\n \n # Simulate training workload\n print(\"Starting training...\")\n time.sleep(2) # Simulate 2 seconds of work\n \n # Simulate some compute\n data = np.random.randn(1000, 1000)\n result = np.dot(data, data.T)\n \n print(\"Training completed!\")\n return {\n 'model_accuracy': 0.95,\n 
'training_samples': 10000,\n 'final_loss': 0.032\n }\n\n# Example 2: Manual cost monitoring\ndef manual_cost_monitoring_example():\n \"\"\"Example of manual cost monitoring.\"\"\"\n # Start cost monitoring\n monitor = get_cost_monitor('lambda')\n if monitor:\n monitor.start_monitoring()\n \n # Your computation here\n import time\n time.sleep(1)\n \n # Stop monitoring and get report\n cost_report = monitor.stop_monitoring()\n if cost_report:\n print(f\"Computation completed in {cost_report.duration_seconds:.2f} seconds\")\n print(f\"Estimated cost: ${cost_report.cost_estimate.estimated_cost:.4f}\")\n print(f\"GPU utilization: {len(cost_report.resource_usage.gpu_stats or [])} GPUs\")\n \n if cost_report.recommendations:\n print(\"Cost optimization recommendations:\")\n for rec in cost_report.recommendations:\n print(f\" - {rec}\")\n\n# Example 3: Generate real-time cost report\ndef get_current_cost_status():\n \"\"\"Get current cost and resource status.\"\"\"\n report = generate_cost_report('lambda', 'a100_40gb')\n if report:\n print(\"Current Lambda Cloud Status:\")\n print(f\" CPU Usage: {report['resource_usage']['cpu_percent']:.1f}%\")\n print(f\" Memory Usage: {report['resource_usage']['memory_percent']:.1f}%\")\n if report['resource_usage']['gpu_stats']:\n avg_gpu = sum(gpu['utilization_percent'] for gpu in report['resource_usage']['gpu_stats']) / len(report['resource_usage']['gpu_stats'])\n print(f\" GPU Usage: {avg_gpu:.1f}%\")\n print(f\" Hourly Rate: ${report['cost_estimate']['hourly_rate']:.2f}\")\n\n# Example 4: Compare pricing across instance types\ndef compare_lambda_pricing():\n \"\"\"Compare pricing for different Lambda Cloud instance types.\"\"\"\n from clustrix import get_pricing_info\n \n pricing = get_pricing_info('lambda')\n if pricing:\n print(\"Lambda Cloud Instance Pricing (USD/hour):\")\n \n # Group by category\n single_gpu = {k: v for k, v in pricing.items() if not k.startswith(('2x', '4x', '8x')) and k != 'default'}\n multi_gpu = {k: v for 
k, v in pricing.items() if k.startswith(('2x', '4x', '8x'))}\n \n print(\"\\nSingle GPU Instances:\")\n for instance, price in sorted(single_gpu.items(), key=lambda x: x[1]):\n print(f\" {instance:<15}: ${price:.2f}/hour\")\n \n print(\"\\nMulti-GPU Instances:\")\n for instance, price in sorted(multi_gpu.items(), key=lambda x: x[1]):\n print(f\" {instance:<15}: ${price:.2f}/hour\")\n\n# Run examples (uncomment to test)\n# print(\"1. Cost tracking decorator example:\")\n# result = lambda_training_with_cost_tracking()\n# print(f\"Training result: {result}\")\n\n# print(\"\\n2. Manual cost monitoring example:\")\n# manual_cost_monitoring_example()\n\n# print(\"\\n3. Current cost status:\")\n# get_current_cost_status()\n\nprint(\"4. Lambda Cloud pricing comparison:\")\ncompare_lambda_pricing()\n\nprint(\"\\n\u2705 Lambda Cloud cost monitoring examples ready!\")\nprint(\"\ud83d\udca1 Use @cost_tracking_decorator('lambda', 'instance_type') for automatic cost tracking\")", "execution_count": null }, { "cell_type": "markdown", "id": "iyk1bplps9", "source": "## Lambda Cloud Cost Optimization\n\n### Cost Monitoring and Tracking\n\nMonitor GPU utilization and track costs effectively:", "metadata": {} }, { "cell_type": "markdown", "id": "h19wkr887g", "source": "### Lambda Cloud Cost Optimization\n\n#### \ud83d\udcb0 Instance Selection\n- **RTX 6000 Ada**: Best value for most ML workloads (~$0.75/hour)\n- **A10**: Good balance of performance and cost (~$0.60/hour)\n- **A100 40GB**: For large models requiring more VRAM (~$1.10/hour)\n- **A100 80GB**: Only when 40GB is insufficient (~$1.40/hour)\n- **H100**: Premium option for cutting-edge research (~$2.50/hour)\n\n#### \u23f0 Usage Patterns\n- Use \"persistent\" instances for ongoing development\n- Terminate instances immediately after training completion\n- Schedule training jobs during off-peak hours if possible\n- Use local development for debugging, GPU for final training\n\n#### \ud83d\udd27 Optimization Techniques\n- Mixed 
precision training (fp16) to reduce memory usage\n- Gradient accumulation for larger effective batch sizes\n- Model checkpointing to resume interrupted training\n- Efficient data loading with multiple workers\n- Early stopping to avoid overtraining\n\n#### \ud83d\udcca Monitoring and Management\n- Monitor GPU utilization with nvidia-smi\n- Track training progress with logging\n- Set training time limits to prevent runaway costs\n- Use Clustrix timeouts as safety nets\n- Regular cost reviews and budget alerts\n\n#### \ud83d\ude80 Clustrix-Specific Optimizations\n- Use Clustrix auto-cleanup features\n- Implement job queuing for multiple experiments\n- Leverage Clustrix's timeout mechanisms\n- Use remote environment caching", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "id": "best-practices", "metadata": {}, "source": [ "## Best Practices and Troubleshooting" ] }, { "cell_type": "code", "id": "best-practices-lambda", "metadata": {}, "outputs": [], "source": "# Write the GPU monitoring script to disk\ndef create_monitoring_script():\n    \"\"\"Create and save the GPU monitoring script.\"\"\"\n    script_content = '''#!/bin/bash\n# Lambda Cloud monitoring script\n\necho \"Lambda Cloud Training Monitor\"\necho \"============================\"\necho \"Start time: $(date)\"\necho \"\"\n\n# System information\necho \"System Information:\"\necho \"------------------\"\nnvidia-smi --query-gpu=gpu_name,memory.total,power.draw --format=csv\necho \"\"\n\n# Monitor GPU usage every 30 seconds\nwhile true; do\n    echo \"GPU Status at $(date):\"\n    nvidia-smi --query-gpu=utilization.gpu,memory.used,memory.total,temperature.gpu --format=csv,noheader\n    echo \"\"\n\n    # Check if training process is still running\n    if ! pgrep -f python > /dev/null; then\n        echo \"No Python processes found. Training may have completed.\"\n        break\n    fi\n\n    sleep 30\ndone\n\necho \"Monitoring completed at $(date)\"\n'''\n\n    with open('monitor_training.sh', 'w') as f:\n        f.write(script_content)\n\n    # Make executable\n    import os\n    os.chmod('monitor_training.sh', 0o755)\n\n    return \"Monitoring script created: monitor_training.sh\"\n\n# Uncomment to create the monitoring script:\n# result = create_monitoring_script()\n# print(result)", "execution_count": null }, { "cell_type": "markdown", "id": "9e011y2ptia", "source": "## Lambda Cloud Best Practices\n\n### GPU Monitoring Script\n\nUse this monitoring script to track GPU usage during training. Save as `monitor_training.sh` and run with: `bash monitor_training.sh`\n\n```bash\n#!/bin/bash\n# Lambda Cloud monitoring script\n\necho \"Lambda Cloud Training Monitor\"\necho \"============================\"\necho \"Start time: $(date)\"\necho \"\"\n\n# System information\necho \"System Information:\"\necho \"------------------\"\nnvidia-smi --query-gpu=gpu_name,memory.total,power.draw --format=csv\necho \"\"\n\n# Monitor GPU usage every 30 seconds\nwhile true; do\n echo \"GPU Status at $(date):\"\n nvidia-smi --query-gpu=utilization.gpu,memory.used,memory.total,temperature.gpu --format=csv,noheader\n echo \"\"\n \n # Check if training process is still running\n if ! pgrep -f python > /dev/null; then\n echo \"No Python processes found.
Training may have completed.\"\n break\n fi\n \n sleep 30\ndone\n\necho \"Monitoring completed at $(date)\"\n```", "metadata": {} }, { "cell_type": "markdown", "id": "2d105yt16xr", "source": "### Lambda Cloud + Clustrix Best Practices\n\n#### \ud83d\ude80 Performance Optimization\n- Always use mixed precision (fp16) when possible\n- Optimize data loading with multiple workers and pin_memory\n- Use appropriate batch sizes to maximize GPU utilization\n- Enable tensor cores for compatible operations\n- Pre-allocate GPU memory to avoid fragmentation\n\n#### \ud83d\udcbe Data Management\n- Store datasets on fast NVMe storage when available\n- Use data streaming for very large datasets\n- Implement efficient data preprocessing pipelines\n- Cache frequently used data in memory\n- Use appropriate data formats (e.g., HDF5, Parquet)\n\n#### \ud83d\udd27 Environment Setup\n- Use conda environments for reproducible setups\n- Pin package versions in requirements.txt\n- Install packages from conda-forge when possible\n- Use uv package manager for faster installs\n- Set up proper CUDA environment variables\n\n#### \ud83d\udee0\ufe0f Development Workflow\n- Develop and debug locally, train on Lambda Cloud\n- Use small datasets for initial testing\n- Implement proper logging and monitoring\n- Save model checkpoints regularly\n- Use version control for experiment tracking\n\n#### \ud83d\udd12 Security\n- Use SSH keys instead of passwords\n- Keep SSH keys secure and rotate regularly\n- Don't store credentials in code or notebooks\n- Use environment variables for configuration\n- Monitor instance access logs", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "id": "0lrqgzg1xis", "source": "### Common Issues and Solutions\n\n#### \u274c CUDA out of memory errors\n\u2705 **Solutions:**\n- Reduce batch size\n- Enable gradient checkpointing\n- Use mixed precision training\n- Clear GPU cache with torch.cuda.empty_cache()\n- Consider model parallelism for large models\n\n#### 
\u274c Slow data loading\n\u2705 **Solutions:**\n- Increase num_workers in DataLoader\n- Enable pin_memory for GPU transfers\n- Use faster storage (NVMe over network storage)\n- Implement data prefetching\n- Optimize data preprocessing\n\n#### \u274c SSH connection timeouts\n\u2705 **Solutions:**\n- Configure SSH keep-alive settings\n- Use screen or tmux for long-running jobs\n- Implement proper error handling in Clustrix\n- Set appropriate timeout values\n- Monitor network connectivity\n\n#### \u274c Low GPU utilization\n\u2705 **Solutions:**\n- Increase batch size if memory allows\n- Optimize data loading pipeline\n- Use asynchronous data transfers\n- Profile code to identify bottlenecks\n- Consider multi-GPU training\n\n#### \u274c Package installation failures\n\u2705 **Solutions:**\n- Use conda for system-level packages\n- Check CUDA compatibility versions\n- Clear pip cache if needed\n- Use --no-cache-dir flag for pip\n- Install packages in correct order", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "id": "cleanup-lambda", "metadata": {}, "source": [ "## Instance Management and Cleanup" ] }, { "cell_type": "markdown", "id": "cleanup-instances", "metadata": {}, "outputs": [], "source": "### Lambda Cloud Instance Management\n\n#### \ud83d\udd0d Check Running Instances\n\n**Via CLI:**\n```bash\nlambda-cloud instance list\n```\n\n**Via Web Console:**\nVisit: https://cloud.lambdalabs.com/instances\n\n#### \u23f9\ufe0f Terminate Instances\n\n**Terminate specific instance:**\n```bash\nlambda-cloud instance terminate \n```\n\n**Terminate all instances (DANGEROUS!):**\n```bash\nlambda-cloud instance list --format=csv | grep -v \"instance_id\" | cut -d',' -f1 | xargs -I {} lambda-cloud instance terminate {}\n```\n\n#### \ud83d\udcbe Save Work Before Termination\n\n**Save models to persistent storage:**\n```bash\nrsync -avz ubuntu@:/path/to/models/ ./local_models/\n```\n\n**Save logs and results:**\n```bash\nscp -r ubuntu@:/tmp/clustrix/ 
./results/\n```\n\n#### \ud83d\udcca Cost Monitoring\n\n**Check current usage:**\n```bash\nlambda-cloud instance list --format=table\n```\n\n**Estimate costs:**\n```bash\nlambda-cloud instance list --format=csv | awk -F',' 'NR>1 {print $2, $3}' | while read type status; do\n if [ \"$status\" = \"active\" ]; then\n echo \"Active instance: $type\"\n fi\ndone\n```\n\n### Automated Cleanup Script\n\nSave this as `lambda_cleanup.sh` for automated instance management:\n\n```bash\n#!/bin/bash\n# Automated cleanup script for Lambda Cloud\n# Save as lambda_cleanup.sh\n\nset -e\n\necho \"Lambda Cloud Automated Cleanup\"\necho \"==============================\"\n\n# Check if lambda-cloud CLI is installed\nif ! command -v lambda-cloud &> /dev/null; then\n echo \"Error: lambda-cloud CLI not found. Please install it first.\"\n exit 1\nfi\n\n# List current instances\necho \"Current instances:\"\nlambda-cloud instance list\necho \"\"\n\n# Ask for confirmation\nread -p \"Do you want to terminate ALL instances? (y/N): \" -n 1 -r\necho \"\"\nif [[ ! 
$REPLY =~ ^[Yy]$ ]]; then\n echo \"Cleanup cancelled.\"\n exit 0\nfi\n\n# Get instance IDs\nINSTANCE_IDS=$(lambda-cloud instance list --format=csv | grep -v \"instance_id\" | cut -d',' -f1)\n\nif [ -z \"$INSTANCE_IDS\" ]; then\n echo \"No instances to terminate.\"\n exit 0\nfi\n\n# Terminate instances\necho \"Terminating instances...\"\nfor instance_id in $INSTANCE_IDS; do\n echo \"Terminating instance: $instance_id\"\n lambda-cloud instance terminate \"$instance_id\"\ndone\n\necho \"All instances terminated.\"\necho \"Please verify termination in the web console: https://cloud.lambdalabs.com/instances\"\n```\n\n### Clustrix Integration Manager" }, { "cell_type": "code", "id": "5f4n1hu8qdb", "source": "# Integrate cleanup with Clustrix workflows\n\nfrom clustrix import configure\nimport subprocess\nimport time\n\nclass LambdaCloudManager:\n    \"\"\"Manager for Lambda Cloud instances with Clustrix integration.\"\"\"\n\n    def __init__(self):\n        self.active_instances = []\n\n    def launch_instance_for_clustrix(self, instance_type, ssh_key_name):\n        \"\"\"Launch instance and configure Clustrix.\"\"\"\n        # Launch instance\n        result = subprocess.run([\n            'lambda-cloud', 'instance', 'launch',\n            '--instance-type', instance_type,\n            '--ssh-key-name', ssh_key_name\n        ], capture_output=True, text=True)\n\n        if result.returncode != 0:\n            raise RuntimeError(f\"Failed to launch instance: {result.stderr}\")\n\n        # Parse instance ID and IP (simplified)\n        instance_id = \"extracted_from_output\"  # Parse from result.stdout\n        instance_ip = \"extracted_from_output\"  # Parse from result.stdout\n\n        # Wait for instance to be ready\n        time.sleep(60)  # Wait for startup\n\n        # Configure Clustrix\n        configure(\n            cluster_type=\"ssh\",\n            cluster_host=instance_ip,\n            username=\"ubuntu\",\n            key_file=\"~/.ssh/id_rsa\",\n            remote_work_dir=\"/tmp/clustrix\",\n            package_manager=\"auto\",\n            cleanup_on_success=True\n        )\n\n        self.active_instances.append({\n            'id': instance_id,\n            'ip': instance_ip,\n            'type': instance_type,\n            'launch_time': time.time()\n        })\n\n        return instance_id, instance_ip\n\n    def cleanup_all_instances(self):\n        \"\"\"Clean up all managed instances.\"\"\"\n        for instance in self.active_instances:\n            try:\n                subprocess.run([\n                    'lambda-cloud', 'instance', 'terminate', instance['id']\n                ], check=True)\n                print(f\"Terminated instance {instance['id']}\")\n            except subprocess.CalledProcessError as e:\n                print(f\"Failed to terminate {instance['id']}: {e}\")\n\n        self.active_instances.clear()\n\n    def __del__(self):\n        \"\"\"Ensure cleanup on object destruction.\"\"\"\n        if self.active_instances:\n            print(\"Warning: Active instances detected. Cleaning up...\")\n            self.cleanup_all_instances()\n\n# Usage example:\n# manager = LambdaCloudManager()\n# try:\n#     instance_id, ip = manager.launch_instance_for_clustrix('a100', 'my-ssh-key')\n#     # Run your Clustrix computations\n#     result = my_clustrix_function()\n# finally:\n#     manager.cleanup_all_instances()", "metadata": {}, "outputs": [], "execution_count": null }, { "cell_type": "markdown", "id": "lambda-summary", "metadata": {}, "source": [ "## Summary\n", "\n", "This tutorial covered:\n", "\n", "1. **Setup**: Lambda Cloud account creation and instance management\n", "2. **GPU Computing**: High-performance GPU instances for ML workloads\n", "3. **Deep Learning**: PyTorch training with GPU acceleration\n", "4. **Transformer Models**: Fine-tuning with HuggingFace Transformers\n", "5. **Computer Vision**: CNN training with data augmentation\n", "6. **Multi-GPU Training**: Distributed training across multiple GPUs\n", "7. **Cost Optimization**: Strategies to minimize GPU computing costs\n", "8. **Best Practices**: Performance optimization and troubleshooting\n", "9.
**Instance Management**: Automated cleanup and monitoring\n", "\n", "### Key Advantages of Lambda Cloud + Clustrix\n", "\n", "- **GPU Focus**: Specialized in high-performance GPU computing\n", "- **Cost Effective**: Competitive pricing for GPU instances\n", "- **Simple Management**: Easy instance launching and termination\n", "- **High Performance**: Latest NVIDIA GPUs (A100, H100, RTX)\n", "- **Fast Networking**: InfiniBand for multi-GPU communication\n", "- **ML Optimized**: Pre-configured environments for machine learning\n", "- **Flexible Scaling**: From single GPU to large multi-GPU clusters\n", "\n", "### Lambda Cloud Pricing Advantages\n", "\n", "- **RTX 6000 Ada**: Excellent price/performance for most ML workloads\n", "- **A100 40GB/80GB**: Industry-standard for large-scale training\n", "- **H100**: Cutting-edge performance for the most demanding workloads\n", "- **Multi-GPU**: Cost-effective scaling for distributed training\n", "- **No Hidden Fees**: Simple per-hour pricing\n", "\n", "### Next Steps\n", "\n", "1. Create your Lambda Cloud account and add SSH keys\n", "2. Start with a single GPU instance for testing\n", "3. Configure Clustrix for your Lambda Cloud instance\n", "4. Run the provided examples to verify setup\n", "5. Scale to multi-GPU instances for larger workloads\n", "6. 
Implement cost monitoring and automated cleanup\n", "\n", "### Use Cases\n", "\n", "- **Deep Learning Research**: Train large neural networks efficiently\n", "- **Computer Vision**: Process large image datasets with CNNs\n", "- **NLP**: Fine-tune transformer models on custom datasets\n", "- **Scientific Computing**: GPU-accelerated simulations and modeling\n", "- **Prototyping**: Rapid experimentation with different architectures\n", "- **Production Training**: Scale up successful experiments\n", "\n", "### Resources\n", "\n", "- [Lambda Cloud Console](https://cloud.lambdalabs.com/)\n", "- [Lambda Cloud Documentation](https://lambdalabs.com/service/gpu-cloud/documentation)\n", "- [Lambda Cloud CLI](https://github.com/LambdaLabsML/lambda-cloud-cli)\n", "- [PyTorch Documentation](https://pytorch.org/docs/)\n", "- [HuggingFace Transformers](https://huggingface.co/transformers/)\n", "- [Clustrix Documentation](https://clustrix.readthedocs.io/)\n", "\n", "**Remember**: Lambda Cloud excels at GPU computing! Always terminate instances when not in use to control costs, and leverage Clustrix's distributed computing capabilities to scale your ML workloads efficiently." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.5" } }, "nbformat": 4, "nbformat_minor": 5 }