{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "lambda-title",
   "metadata": {},
   "source": "# Lambda Cloud Tutorial\n\nThis tutorial demonstrates how to use Clustrix with Lambda Cloud for high-performance GPU computing and distributed machine learning.\n\n[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/ContextLab/clustrix/blob/master/docs/source/notebooks/lambda_cloud_tutorial.ipynb)\n\n## Overview\n\nLambda Cloud specializes in GPU cloud computing and integrates well with Clustrix for ML workloads:\n\n- **GPU-Optimized Instances**: High-performance NVIDIA GPUs (A100, H100, RTX)\n- **Cost-Effective**: Competitive pricing for GPU computing\n- **Simple Management**: Easy instance launching and management\n- **Pre-configured Environments**: ML-ready software stacks\n- **High-Speed Networking**: InfiniBand for multi-GPU communication\n- **Persistent Storage**: Fast NVMe and network storage options\n- **SSH Access**: Direct access for Clustrix integration\n- **On-Demand and Reserved**: Flexible pricing models\n\n## Prerequisites\n\n1. Lambda Cloud account with GPU credits\n2. SSH key pair for instance access\n3. Lambda Cloud API key (optional; only needed for the API-based setup)\n4. Basic understanding of GPU computing"
  },
  {
   "cell_type": "markdown",
   "id": "installation",
   "metadata": {},
   "source": [
    "## Installation and Setup\n",
    "\n",
    "Install Clustrix with Lambda Cloud dependencies:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "install",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Install Clustrix with GPU and Lambda Cloud support\n",
    "!pip install clustrix torch torchvision transformers datasets accelerate\n",
    "\n",
    "# Import required libraries\n",
    "import clustrix\n",
    "from clustrix import cluster, configure\n",
    "import torch\n",
    "import numpy as np\n",
    "import time\n",
    "import json\n",
    "import requests\n",
    "import os"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "lambda-setup",
   "metadata": {},
   "source": [
    "## Lambda Cloud Authentication and Setup\n",
    "\n",
    "### Option 1: Web Console Setup"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "web-setup",
   "metadata": {},
   "source": "1. **Create Account:**\n   - Visit https://lambdalabs.com/service/gpu-cloud\n   - Sign up and verify your account\n   - Add billing information and credits\n\n2. **Add SSH Key:**\n   - Go to https://cloud.lambdalabs.com/ssh-keys\n   - Click \"Add SSH Key\"\n   - Paste your public key (`cat ~/.ssh/id_rsa.pub`)\n   - Give it a descriptive name\n\n3. **Launch Instance:**\n   - Go to https://cloud.lambdalabs.com/instances\n   - Click \"Launch instance\"\n   - Select instance type (A100, H100, RTX 6000 Ada, etc.)\n   - Choose region (closest to you for best performance)\n   - Select your SSH key\n   - Launch the instance\n\n4. **Instance Types Available** (prices are approximate; check the Lambda Cloud pricing page for current rates):\n   - RTX 6000 Ada: 48GB VRAM, ~\\$0.75/hour\n   - A10: 24GB VRAM, ~\\$0.60/hour\n   - A100 (40GB): 40GB VRAM, ~\\$1.10/hour\n   - A100 (80GB): 80GB VRAM, ~\\$1.40/hour\n   - H100: 80GB VRAM, ~\\$2.50/hour (when available)\n\n5. **Access Instance:**\n   - Wait for the instance status to show \"Running\"\n   - Note the public IP address\n   - Connect over SSH: `ssh ubuntu@<PUBLIC_IP>`"
  },
  {
   "cell_type": "markdown",
   "id": "a5k1lpava3n",
   "source": "**Follow the steps above to set up your Lambda Cloud account and launch your first GPU instance.**",
   "metadata": {}
  },
  {
   "cell_type": "markdown",
   "id": "lambda-api",
   "metadata": {},
   "source": [
    "### Option 2: API-Based Setup"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1fgfjnypmvp",
   "source": "#### CLI Setup Steps\n\n1. **Get API Key:**\n   - Go to https://cloud.lambdalabs.com/api-keys\n   - Generate a new API key\n   - Set it as an environment variable: `export LAMBDA_API_KEY=\"your-key\"`\n\n2. **Install Lambda Cloud CLI** (the commands in this section assume the `lambda-cloud` package; confirm the package name and subcommands against its documentation):\n   ```bash\n   pip install lambda-cloud\n   lambda-cloud configure  # Enter your API key\n   ```\n\n3. **Basic CLI Commands:**\n   ```bash\n   # List available instance types\n   lambda-cloud instance-types list\n   \n   # List available regions\n   lambda-cloud regions list\n   \n   # Launch instance\n   lambda-cloud instance launch \\\n     --instance-type a100 \\\n     --ssh-key-name your-key-name \\\n     --region us-east-1\n   \n   # List running instances\n   lambda-cloud instance list\n   \n   # Terminate instance\n   lambda-cloud instance terminate <INSTANCE_ID>\n   ```\n\n#### Python API Client\n\nThe `LambdaCloudAPI` class below is a thin wrapper around the REST endpoints. It reads the key from `LAMBDA_API_KEY` when none is passed explicitly.",
   "metadata": {}
  },
  {
   "cell_type": "code",
   "id": "api-setup",
   "metadata": {},
   "outputs": [],
   "source": "import json\nimport os\n\nimport requests\n\n\nclass LambdaCloudAPI:\n    \"\"\"Thin wrapper around the Lambda Cloud REST API.\"\"\"\n\n    def __init__(self, api_key=None, timeout=30):\n        self.api_key = api_key or os.getenv('LAMBDA_API_KEY')\n        if not self.api_key:\n            raise ValueError('Set LAMBDA_API_KEY or pass api_key explicitly')\n        self.base_url = 'https://cloud.lambdalabs.com/api/v1'\n        self.timeout = timeout\n        self.headers = {\n            'Authorization': f'Bearer {self.api_key}',\n            'Content-Type': 'application/json'\n        }\n\n    def _request(self, method, path, **kwargs):\n        \"\"\"Send a request and return the decoded JSON body, raising on HTTP errors.\"\"\"\n        response = requests.request(method, f'{self.base_url}{path}',\n                                    headers=self.headers, timeout=self.timeout, **kwargs)\n        response.raise_for_status()\n        return response.json()\n\n    def list_instance_types(self):\n        \"\"\"List available instance types.\"\"\"\n        return self._request('GET', '/instance-types')\n\n    def list_instances(self):\n        \"\"\"List running instances.\"\"\"\n        return self._request('GET', '/instances')\n\n    def launch_instance(self, instance_type, ssh_key_name, region='us-east-1', name=None):\n        \"\"\"Launch a new instance; instance_type is a name returned by list_instance_types().\"\"\"\n        data = {\n            'instance_type_name': instance_type,\n            'ssh_key_names': [ssh_key_name],\n            'region_name': region\n        }\n        if name:\n            data['name'] = name\n        return self._request('POST', '/instance-operations/launch', json=data)\n\n    def terminate_instance(self, instance_id):\n        \"\"\"Terminate an instance.\"\"\"\n        data = {'instance_ids': [instance_id]}\n        return self._request('POST', '/instance-operations/terminate', json=data)\n\n    def get_instance_details(self, instance_id):\n        \"\"\"Return the entry for instance_id from the instance list, or None if absent.\"\"\"\n        for instance in self.list_instances().get('data', []):\n            if instance['id'] == instance_id:\n                return instance\n        return None\n\n# Example usage:\n# api = LambdaCloudAPI()\n# instance_types = api.list_instance_types()\n# print(json.dumps(instance_types, indent=2))",
   "execution_count": null
  },
  {
   "cell_type": "markdown",
   "id": "clustrix-lambda-config",
   "metadata": {},
   "source": [
    "## Configure Clustrix for Lambda Cloud"
   ]
  },
  {
   "cell_type": "code",
   "id": "config-lambda",
   "metadata": {},
   "outputs": [],
   "source": "# Configure Clustrix to use your Lambda Cloud instance\nconfigure(\n    cluster_type=\"ssh\",\n    cluster_host=\"your-lambda-instance-ip\",  # Replace with actual IP\n    username=\"ubuntu\",  # Default Lambda Cloud user\n    key_file=\"~/.ssh/id_rsa\",  # Your private SSH key\n    remote_work_dir=\"/tmp/clustrix\",\n    package_manager=\"auto\",  # Will use uv if available\n    default_cores=8,  # Lambda instances typically have 8+ cores\n    default_memory=\"32GB\",  # Generous memory allocation\n    default_time=\"02:00:00\",  # Longer timeout for GPU tasks\n    environment_variables={\n        \"CUDA_VISIBLE_DEVICES\": \"0\",  # Use first GPU\n        \"NVIDIA_VISIBLE_DEVICES\": \"all\"\n    }\n)",
   "execution_count": null
  },
  {
   "cell_type": "markdown",
   "id": "mm5s72ijgws",
   "source": "**Replace `your-lambda-instance-ip` with the public IP address of your Lambda Cloud instance.**",
   "metadata": {}
  },
  {
   "cell_type": "markdown",
   "id": "gpu-verification",
   "metadata": {},
   "source": [
    "### GPU Verification and Setup"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "verify-gpu",
   "metadata": {},
   "outputs": [],
   "source": [
    "@cluster(cores=2, memory=\"8GB\")\n",
    "def verify_lambda_gpu_setup():\n",
    "    \"\"\"Verify GPU availability and setup on Lambda Cloud instance.\"\"\"\n",
    "    import torch\n",
    "    import subprocess\n",
    "    import platform\n",
    "    \n",
    "    # System information\n",
    "    system_info = {\n",
    "        'platform': platform.platform(),\n",
    "        'python_version': platform.python_version(),\n",
    "        'architecture': platform.architecture()[0]\n",
    "    }\n",
    "    \n",
    "    # PyTorch and CUDA info\n",
    "    torch_info = {\n",
    "        'pytorch_version': torch.__version__,\n",
    "        'cuda_available': torch.cuda.is_available(),\n",
    "        'cuda_version': torch.version.cuda if torch.cuda.is_available() else None,\n",
    "        'cudnn_version': torch.backends.cudnn.version() if torch.cuda.is_available() else None,\n",
    "        'device_count': torch.cuda.device_count() if torch.cuda.is_available() else 0\n",
    "    }\n",
    "    \n",
    "    # GPU details\n",
    "    gpu_info = []\n",
    "    if torch.cuda.is_available():\n",
    "        for i in range(torch.cuda.device_count()):\n",
    "            props = torch.cuda.get_device_properties(i)\n",
    "            gpu_info.append({\n",
    "                'device_id': i,\n",
    "                'name': props.name,\n",
    "                'total_memory_gb': props.total_memory / (1024**3),\n",
    "                'major': props.major,\n",
    "                'minor': props.minor,\n",
    "                'multiprocessor_count': props.multi_processor_count\n",
    "            })\n",
    "    \n",
    "    # NVIDIA-SMI output\n",
    "    nvidia_smi = None\n",
    "    try:\n",
    "        result = subprocess.run(['nvidia-smi'], capture_output=True, text=True)\n",
    "        if result.returncode == 0:\n",
    "            nvidia_smi = result.stdout\n",
    "    except FileNotFoundError:\n",
    "        nvidia_smi = \"nvidia-smi not found\"\n",
    "    \n",
    "    # Test GPU computation\n",
    "    gpu_test_result = None\n",
    "    if torch.cuda.is_available():\n",
    "        try:\n",
    "            # Simple GPU computation test\n",
    "            device = torch.device('cuda:0')\n",
    "            x = torch.randn(1000, 1000, device=device)\n",
    "            y = torch.randn(1000, 1000, device=device)\n",
    "            \n",
    "            start_time = torch.cuda.Event(enable_timing=True)\n",
    "            end_time = torch.cuda.Event(enable_timing=True)\n",
    "            \n",
    "            start_time.record()\n",
    "            z = torch.mm(x, y)\n",
    "            torch.cuda.synchronize()\n",
    "            end_time.record()\n",
    "            torch.cuda.synchronize()\n",
    "            \n",
    "            gpu_test_result = {\n",
    "                'test_passed': True,\n",
    "                'computation_time_ms': start_time.elapsed_time(end_time),\n",
    "                'result_shape': z.shape,\n",
    "                'memory_allocated_mb': torch.cuda.memory_allocated() / (1024**2),\n",
    "                'memory_reserved_mb': torch.cuda.memory_reserved() / (1024**2)\n",
    "            }\n",
    "        except Exception as e:\n",
    "            gpu_test_result = {\n",
    "                'test_passed': False,\n",
    "                'error': str(e)\n",
    "            }\n",
    "    \n",
    "    return {\n",
    "        'system_info': system_info,\n",
    "        'torch_info': torch_info,\n",
    "        'gpu_info': gpu_info,\n",
    "        'nvidia_smi': nvidia_smi,\n",
    "        'gpu_test': gpu_test_result\n",
    "    }\n",
    "\n",
    "# Run GPU verification\n",
    "# gpu_status = verify_lambda_gpu_setup()\n",
    "# print(json.dumps(gpu_status, indent=2, default=str))\n",
    "print(\"GPU verification function defined. Uncomment the lines above to run on Lambda Cloud.\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ml-training-example",
   "metadata": {},
   "source": [
    "## Example 1: Distributed Deep Learning Training"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "dl-training",
   "metadata": {},
   "outputs": [],
   "source": [
    "@cluster(cores=8, memory=\"16GB\", time=\"01:30:00\")\n",
    "def lambda_deep_learning_training(model_config, training_config):\n",
    "    \"\"\"Train a deep learning model on Lambda Cloud GPU.\"\"\"\n",
    "    import torch\n",
    "    import torch.nn as nn\n",
    "    import torch.optim as optim\n",
    "    from torch.utils.data import DataLoader, TensorDataset\n",
    "    import numpy as np\n",
    "    import time\n",
    "    \n",
    "    # Set device\n",
    "    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')\n",
    "    print(f\"Training on device: {device}\")\n",
    "    \n",
    "    # Create synthetic dataset\n",
    "    n_samples = training_config['n_samples']\n",
    "    n_features = training_config['n_features']\n",
    "    n_classes = training_config['n_classes']\n",
    "    \n",
    "    # Generate random data\n",
    "    X = torch.randn(n_samples, n_features)\n",
    "    y = torch.randint(0, n_classes, (n_samples,))\n",
    "    \n",
    "    # Create dataset and dataloader\n",
    "    dataset = TensorDataset(X, y)\n",
    "    dataloader = DataLoader(\n",
    "        dataset, \n",
    "        batch_size=training_config['batch_size'], \n",
    "        shuffle=True\n",
    "    )\n",
    "    \n",
    "    # Define model architecture\n",
    "    class DeepNet(nn.Module):\n",
    "        def __init__(self, input_size, hidden_sizes, output_size, dropout=0.2):\n",
    "            super(DeepNet, self).__init__()\n",
    "            \n",
    "            layers = []\n",
    "            prev_size = input_size\n",
    "            \n",
    "            for hidden_size in hidden_sizes:\n",
    "                layers.extend([\n",
    "                    nn.Linear(prev_size, hidden_size),\n",
    "                    nn.ReLU(),\n",
    "                    nn.BatchNorm1d(hidden_size),\n",
    "                    nn.Dropout(dropout)\n",
    "                ])\n",
    "                prev_size = hidden_size\n",
    "            \n",
    "            layers.append(nn.Linear(prev_size, output_size))\n",
    "            self.network = nn.Sequential(*layers)\n",
    "        \n",
    "        def forward(self, x):\n",
    "            return self.network(x)\n",
    "    \n",
    "    # Create model\n",
    "    model = DeepNet(\n",
    "        input_size=n_features,\n",
    "        hidden_sizes=model_config['hidden_sizes'],\n",
    "        output_size=n_classes,\n",
    "        dropout=model_config.get('dropout', 0.2)\n",
    "    ).to(device)\n",
    "    \n",
    "    # Loss and optimizer\n",
    "    criterion = nn.CrossEntropyLoss()\n",
    "    optimizer = optim.Adam(\n",
    "        model.parameters(), \n",
    "        lr=training_config['learning_rate'],\n",
    "        weight_decay=training_config.get('weight_decay', 1e-4)\n",
    "    )\n",
    "    \n",
    "    # Training loop\n",
    "    model.train()\n",
    "    training_start = time.time()\n",
    "    \n",
    "    epoch_losses = []\n",
    "    epoch_accuracies = []\n",
    "    \n",
    "    for epoch in range(training_config['epochs']):\n",
    "        epoch_loss = 0.0\n",
    "        correct = 0\n",
    "        total = 0\n",
    "        \n",
    "        for batch_idx, (data, target) in enumerate(dataloader):\n",
    "            data, target = data.to(device), target.to(device)\n",
    "            \n",
    "            optimizer.zero_grad()\n",
    "            output = model(data)\n",
    "            loss = criterion(output, target)\n",
    "            loss.backward()\n",
    "            optimizer.step()\n",
    "            \n",
    "            epoch_loss += loss.item()\n",
    "            _, predicted = torch.max(output.data, 1)\n",
    "            total += target.size(0)\n",
    "            correct += (predicted == target).sum().item()\n",
    "        \n",
    "        avg_loss = epoch_loss / len(dataloader)\n",
    "        accuracy = 100.0 * correct / total\n",
    "        \n",
    "        epoch_losses.append(avg_loss)\n",
    "        epoch_accuracies.append(accuracy)\n",
    "        \n",
    "        if epoch % 10 == 0 or epoch == training_config['epochs'] - 1:\n",
    "            print(f'Epoch {epoch+1}/{training_config[\"epochs\"]}: '\n",
    "                  f'Loss: {avg_loss:.4f}, Accuracy: {accuracy:.2f}%')\n",
    "    \n",
    "    training_time = time.time() - training_start\n",
    "    \n",
    "    # Model evaluation\n",
    "    model.eval()\n",
    "    with torch.no_grad():\n",
    "        test_data = torch.randn(1000, n_features).to(device)\n",
    "        test_output = model(test_data)\n",
    "        test_predictions = torch.max(test_output, 1)[1]\n",
    "    \n",
    "    # Memory usage\n",
    "    memory_info = {}\n",
    "    if torch.cuda.is_available():\n",
    "        memory_info = {\n",
    "            'allocated_mb': torch.cuda.memory_allocated() / (1024**2),\n",
    "            'reserved_mb': torch.cuda.memory_reserved() / (1024**2),\n",
    "            'max_allocated_mb': torch.cuda.max_memory_allocated() / (1024**2)\n",
    "        }\n",
    "    \n",
    "    return {\n",
    "        'training_completed': True,\n",
    "        'device_used': str(device),\n",
    "        'model_parameters': sum(p.numel() for p in model.parameters()),\n",
    "        'trainable_parameters': sum(p.numel() for p in model.parameters() if p.requires_grad),\n",
    "        'training_time': training_time,\n",
    "        'final_loss': epoch_losses[-1],\n",
    "        'final_accuracy': epoch_accuracies[-1],\n",
    "        'best_accuracy': max(epoch_accuracies),\n",
    "        'epoch_losses': epoch_losses,\n",
    "        'epoch_accuracies': epoch_accuracies,\n",
    "        'memory_info': memory_info,\n",
    "        'model_architecture': str(model)\n",
    "    }\n",
    "\n",
    "# Example configuration\n",
    "model_config = {\n",
    "    'hidden_sizes': [512, 256, 128, 64],\n",
    "    'dropout': 0.3\n",
    "}\n",
    "\n",
    "training_config = {\n",
    "    'n_samples': 10000,\n",
    "    'n_features': 100,\n",
    "    'n_classes': 10,\n",
    "    'batch_size': 64,\n",
    "    'epochs': 50,\n",
    "    'learning_rate': 0.001,\n",
    "    'weight_decay': 1e-4\n",
    "}\n",
    "\n",
    "# Run training\n",
    "# result = lambda_deep_learning_training(model_config, training_config)\n",
    "# print(f\"Training completed! Final accuracy: {result['final_accuracy']:.2f}%\")\n",
    "# print(f\"Training time: {result['training_time']:.2f} seconds\")\n",
    "# print(f\"GPU memory used: {result['memory_info'].get('max_allocated_mb', 0):.1f} MB\")\n",
    "\n",
    "print(\"Deep learning training function defined. Uncomment the lines above to run on Lambda Cloud.\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "transformer-example",
   "metadata": {},
   "source": [
    "## Example 2: Transformer Model Fine-tuning"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "transformer-finetuning",
   "metadata": {},
   "outputs": [],
   "source": [
    "@cluster(cores=8, memory=\"32GB\", time=\"02:00:00\")\n",
    "def lambda_transformer_finetuning(model_name, training_params):\n",
    "    \"\"\"Fine-tune a transformer model on Lambda Cloud GPU.\"\"\"\n",
    "    import torch\n",
    "    from transformers import (\n",
    "        AutoTokenizer, AutoModelForSequenceClassification,\n",
    "        TrainingArguments, Trainer, DataCollatorWithPadding\n",
    "    )\n",
    "    from datasets import Dataset\n",
    "    import numpy as np\n",
    "    import time\n",
    "    \n",
    "    # Set device\n",
    "    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')\n",
    "    print(f\"Fine-tuning on device: {device}\")\n",
    "    \n",
    "    # Load tokenizer and model\n",
    "    tokenizer = AutoTokenizer.from_pretrained(model_name)\n",
    "    model = AutoModelForSequenceClassification.from_pretrained(\n",
    "        model_name,\n",
    "        num_labels=training_params['num_labels']\n",
    "    )\n",
    "    \n",
    "    if tokenizer.pad_token is None:\n",
    "        tokenizer.pad_token = tokenizer.eos_token\n",
    "    \n",
    "    # Create synthetic dataset\n",
    "    def generate_synthetic_text_data(n_samples, num_labels):\n",
    "        \"\"\"Generate synthetic text classification data.\"\"\"\n",
    "        \n",
    "        # Simple text templates for different classes\n",
    "        templates = {\n",
    "            0: [\"This is a positive example about {}\", \"Great work on {}\", \"Excellent {}\"],\n",
    "            1: [\"This is a negative example about {}\", \"Poor {}\", \"Terrible {}\"],\n",
    "            2: [\"This is a neutral example about {}\", \"Average {}\", \"Okay {}\"] if num_labels > 2 else []\n",
    "        }\n",
    "        \n",
    "        topics = [\"technology\", \"sports\", \"food\", \"movies\", \"music\", \"books\", \"travel\", \"science\"]\n",
    "        \n",
    "        texts = []\n",
    "        labels = []\n",
    "        \n",
    "        for _ in range(n_samples):\n",
    "            label = np.random.randint(0, num_labels)\n",
    "            template = np.random.choice(templates[label])\n",
    "            topic = np.random.choice(topics)\n",
    "            text = template.format(topic)\n",
    "            \n",
    "            texts.append(text)\n",
    "            labels.append(label)\n",
    "        \n",
    "        return texts, labels\n",
    "    \n",
    "    # Generate data\n",
    "    train_texts, train_labels = generate_synthetic_text_data(\n",
    "        training_params['train_samples'], training_params['num_labels']\n",
    "    )\n",
    "    eval_texts, eval_labels = generate_synthetic_text_data(\n",
    "        training_params['eval_samples'], training_params['num_labels']\n",
    "    )\n",
    "    \n",
    "    # Tokenize data\n",
    "    def tokenize_function(examples):\n",
    "        return tokenizer(\n",
    "            examples['text'],\n",
    "            truncation=True,\n",
    "            padding=True,\n",
    "            max_length=training_params.get('max_length', 512)\n",
    "        )\n",
    "    \n",
    "    # Create datasets\n",
    "    train_dataset = Dataset.from_dict({'text': train_texts, 'labels': train_labels})\n",
    "    eval_dataset = Dataset.from_dict({'text': eval_texts, 'labels': eval_labels})\n",
    "    \n",
    "    train_dataset = train_dataset.map(tokenize_function, batched=True)\n",
    "    eval_dataset = eval_dataset.map(tokenize_function, batched=True)\n",
    "    \n",
    "    # Data collator\n",
    "    data_collator = DataCollatorWithPadding(tokenizer=tokenizer)\n",
    "    \n",
    "    # Training arguments\n",
    "    training_args = TrainingArguments(\n",
    "        output_dir='/tmp/results',\n",
    "        num_train_epochs=training_params.get('epochs', 3),\n",
    "        per_device_train_batch_size=training_params.get('batch_size', 8),\n",
    "        per_device_eval_batch_size=training_params.get('eval_batch_size', 8),\n",
    "        warmup_steps=training_params.get('warmup_steps', 100),\n",
    "        weight_decay=training_params.get('weight_decay', 0.01),\n",
    "        learning_rate=training_params.get('learning_rate', 2e-5),\n",
    "        logging_dir='/tmp/logs',\n",
    "        logging_steps=10,\n",
    "        evaluation_strategy=\"epoch\",\n",
    "        save_strategy=\"epoch\",\n",
    "        load_best_model_at_end=True,\n",
    "        metric_for_best_model=\"eval_loss\",\n",
    "        greater_is_better=False,\n",
    "        fp16=torch.cuda.is_available(),  # Use mixed precision if GPU available\n",
    "        dataloader_pin_memory=torch.cuda.is_available(),\n",
    "        remove_unused_columns=False\n",
    "    )\n",
    "    \n",
    "    # Define compute metrics\n",
    "    def compute_metrics(eval_pred):\n",
    "        predictions, labels = eval_pred\n",
    "        predictions = np.argmax(predictions, axis=1)\n",
    "        accuracy = (predictions == labels).mean()\n",
    "        return {'accuracy': accuracy}\n",
    "    \n",
    "    # Create trainer\n",
    "    trainer = Trainer(\n",
    "        model=model,\n",
    "        args=training_args,\n",
    "        train_dataset=train_dataset,\n",
    "        eval_dataset=eval_dataset,\n",
    "        tokenizer=tokenizer,\n",
    "        data_collator=data_collator,\n",
    "        compute_metrics=compute_metrics\n",
    "    )\n",
    "    \n",
    "    # Training\n",
    "    start_time = time.time()\n",
    "    train_result = trainer.train()\n",
    "    training_time = time.time() - start_time\n",
    "    \n",
    "    # Final evaluation\n",
    "    eval_result = trainer.evaluate()\n",
    "    \n",
    "    # Memory usage\n",
    "    memory_info = {}\n",
    "    if torch.cuda.is_available():\n",
    "        memory_info = {\n",
    "            'allocated_mb': torch.cuda.memory_allocated() / (1024**2),\n",
    "            'reserved_mb': torch.cuda.memory_reserved() / (1024**2),\n",
    "            'max_allocated_mb': torch.cuda.max_memory_allocated() / (1024**2)\n",
    "        }\n",
    "    \n",
    "    # Model info\n",
    "    total_params = sum(p.numel() for p in model.parameters())\n",
    "    trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)\n",
    "    \n",
    "    return {\n",
    "        'model_name': model_name,\n",
    "        'device_used': str(device),\n",
    "        'training_completed': True,\n",
    "        'training_time': training_time,\n",
    "        'total_parameters': total_params,\n",
    "        'trainable_parameters': trainable_params,\n",
    "        'train_loss': train_result.training_loss,\n",
    "        'eval_loss': eval_result['eval_loss'],\n",
    "        'eval_accuracy': eval_result['eval_accuracy'],\n",
    "        'train_steps': train_result.global_step,\n",
    "        'memory_info': memory_info,\n",
    "        'training_params': training_params\n",
    "    }\n",
    "\n",
    "# Example configuration\n",
    "training_params = {\n",
    "    'num_labels': 3,\n",
    "    'train_samples': 1000,\n",
    "    'eval_samples': 200,\n",
    "    'epochs': 3,\n",
    "    'batch_size': 16,\n",
    "    'eval_batch_size': 32,\n",
    "    'learning_rate': 2e-5,\n",
    "    'weight_decay': 0.01,\n",
    "    'warmup_steps': 100,\n",
    "    'max_length': 256\n",
    "}\n",
    "\n",
    "# Run fine-tuning\n",
    "# result = lambda_transformer_finetuning('distilbert-base-uncased', training_params)\n",
    "# print(f\"Fine-tuning completed! Final accuracy: {result['eval_accuracy']:.4f}\")\n",
    "# print(f\"Training time: {result['training_time']:.2f} seconds\")\n",
    "# print(f\"Model parameters: {result['total_parameters']:,}\")\n",
    "# print(f\"GPU memory used: {result['memory_info'].get('max_allocated_mb', 0):.1f} MB\")\n",
    "\n",
    "print(\"Transformer fine-tuning function defined. Uncomment the lines above to run on Lambda Cloud.\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "computer-vision",
   "metadata": {},
   "source": [
    "## Example 3: Computer Vision with Large Datasets"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cv-training",
   "metadata": {},
   "outputs": [],
   "source": [
    "@cluster(cores=8, memory=\"32GB\", time=\"01:30:00\")\n",
    "def lambda_computer_vision_training(model_config, data_config):\n",
    "    \"\"\"Train a computer vision model on Lambda Cloud GPU.\"\"\"\n",
    "    import torch\n",
    "    import torch.nn as nn\n",
    "    import torch.optim as optim\n",
    "    import torchvision\n",
    "    import torchvision.transforms as transforms\n",
    "    from torch.utils.data import DataLoader, TensorDataset\n",
    "    import numpy as np\n",
    "    import time\n",
    "    \n",
    "    # Set device\n",
    "    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')\n",
    "    print(f\"Training computer vision model on device: {device}\")\n",
    "    \n",
    "    # Data augmentation and preprocessing\n",
    "    transform_train = transforms.Compose([\n",
    "        transforms.ToPILImage(),\n",
    "        transforms.RandomResizedCrop(data_config['image_size']),\n",
    "        transforms.RandomHorizontalFlip(p=0.5),\n",
    "        transforms.RandomRotation(degrees=15),\n",
    "        transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2, hue=0.1),\n",
    "        transforms.ToTensor(),\n",
    "        transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])\n",
    "    ])\n",
    "    \n",
    "    transform_val = transforms.Compose([\n",
    "        transforms.ToPILImage(),\n",
    "        transforms.Resize((data_config['image_size'], data_config['image_size'])),\n",
    "        transforms.ToTensor(),\n",
    "        transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])\n",
    "    ])\n",
    "    \n",
    "    # Generate synthetic image data\n",
    "    def create_synthetic_images(n_samples, image_size, n_channels, n_classes):\n",
    "        \"\"\"Create synthetic image dataset.\"\"\"\n",
    "        images = np.random.randint(0, 256, (n_samples, image_size, image_size, n_channels), dtype=np.uint8)\n",
    "        labels = np.random.randint(0, n_classes, n_samples)\n",
    "        return images, labels\n",
    "    \n",
    "    # Create datasets\n",
    "    train_images, train_labels = create_synthetic_images(\n",
    "        data_config['train_samples'],\n",
    "        data_config['image_size'],\n",
    "        data_config['n_channels'],\n",
    "        data_config['n_classes']\n",
    "    )\n",
    "    \n",
    "    val_images, val_labels = create_synthetic_images(\n",
    "        data_config['val_samples'],\n",
    "        data_config['image_size'],\n",
    "        data_config['n_channels'],\n",
    "        data_config['n_classes']\n",
    "    )\n",
    "    \n",
    "    # Custom dataset class\n",
    "    class SyntheticImageDataset(torch.utils.data.Dataset):\n",
    "        def __init__(self, images, labels, transform=None):\n",
    "            self.images = images\n",
    "            self.labels = labels\n",
    "            self.transform = transform\n",
    "        \n",
    "        def __len__(self):\n",
    "            return len(self.images)\n",
    "        \n",
    "        def __getitem__(self, idx):\n",
    "            image = self.images[idx]\n",
    "            label = self.labels[idx]\n",
    "            \n",
    "            if self.transform:\n",
    "                image = self.transform(image)\n",
    "            else:\n",
    "                image = torch.from_numpy(image).permute(2, 0, 1).float() / 255.0\n",
    "            \n",
    "            return image, label\n",
    "    \n",
    "    # Create data loaders\n",
    "    train_dataset = SyntheticImageDataset(train_images, train_labels, transform_train)\n",
    "    val_dataset = SyntheticImageDataset(val_images, val_labels, transform_val)\n",
    "    \n",
    "    train_loader = DataLoader(\n",
    "        train_dataset,\n",
    "        batch_size=data_config['batch_size'],\n",
    "        shuffle=True,\n",
    "        num_workers=4,\n",
    "        pin_memory=True if torch.cuda.is_available() else False\n",
    "    )\n",
    "    \n",
    "    val_loader = DataLoader(\n",
    "        val_dataset,\n",
    "        batch_size=data_config['batch_size'],\n",
    "        shuffle=False,\n",
    "        num_workers=4,\n",
    "        pin_memory=True if torch.cuda.is_available() else False\n",
    "    )\n",
    "    \n",
    "    # Model definition\n",
    "    if model_config['model_type'] == 'resnet':\n",
    "        if model_config['pretrained']:\n",
    "            model = torchvision.models.resnet50(pretrained=True)\n",
    "            model.fc = nn.Linear(model.fc.in_features, data_config['n_classes'])\n",
    "        else:\n",
    "            model = torchvision.models.resnet50(pretrained=False, num_classes=data_config['n_classes'])\n",
    "    elif model_config['model_type'] == 'efficientnet':\n",
    "        if model_config['pretrained']:\n",
    "            model = torchvision.models.efficientnet_b0(pretrained=True)\n",
    "            model.classifier[1] = nn.Linear(model.classifier[1].in_features, data_config['n_classes'])\n",
    "        else:\n",
    "            model = torchvision.models.efficientnet_b0(pretrained=False, num_classes=data_config['n_classes'])\n",
    "    else:\n",
    "        raise ValueError(f\"Unsupported model type: {model_config['model_type']}\")\n",
    "    \n",
    "    model = model.to(device)\n",
    "    \n",
    "    # Loss and optimizer\n",
    "    criterion = nn.CrossEntropyLoss()\n",
    "    optimizer = optim.AdamW(\n",
    "        model.parameters(),\n",
    "        lr=model_config['learning_rate'],\n",
    "        weight_decay=model_config['weight_decay']\n",
    "    )\n",
    "    \n",
    "    # Learning rate scheduler\n",
    "    scheduler = optim.lr_scheduler.CosineAnnealingLR(\n",
    "        optimizer, T_max=model_config['epochs']\n",
    "    )\n",
    "    \n",
    "    # Training loop\n",
    "    start_time = time.time()\n",
    "    train_losses = []\n",
    "    val_accuracies = []\n",
    "    \n",
    "    for epoch in range(model_config['epochs']):\n",
    "        # Training phase\n",
    "        model.train()\n",
    "        running_loss = 0.0\n",
    "        \n",
    "        for batch_idx, (data, target) in enumerate(train_loader):\n",
    "            data, target = data.to(device), target.to(device)\n",
    "            \n",
    "            optimizer.zero_grad()\n",
    "            output = model(data)\n",
    "            loss = criterion(output, target)\n",
    "            loss.backward()\n",
    "            optimizer.step()\n",
    "            \n",
    "            running_loss += loss.item()\n",
    "        \n",
    "        avg_train_loss = running_loss / len(train_loader)\n",
    "        train_losses.append(avg_train_loss)\n",
    "        \n",
    "        # Validation phase\n",
    "        model.eval()\n",
    "        correct = 0\n",
    "        total = 0\n",
    "        val_loss = 0.0\n",
    "        \n",
    "        with torch.no_grad():\n",
    "            for data, target in val_loader:\n",
    "                data, target = data.to(device), target.to(device)\n",
    "                output = model(data)\n",
    "                val_loss += criterion(output, target).item()\n",
    "                \n",
    "                _, predicted = torch.max(output.data, 1)\n",
    "                total += target.size(0)\n",
    "                correct += (predicted == target).sum().item()\n",
    "        \n",
    "        val_accuracy = 100.0 * correct / total\n",
    "        val_accuracies.append(val_accuracy)\n",
    "        \n",
    "        scheduler.step()\n",
    "        \n",
    "        if epoch % 5 == 0 or epoch == model_config['epochs'] - 1:\n",
    "            print(f'Epoch {epoch+1}/{model_config[\"epochs\"]}: '\n",
    "                  f'Train Loss: {avg_train_loss:.4f}, '\n",
    "                  f'Val Accuracy: {val_accuracy:.2f}%, '\n",
    "                  f'LR: {scheduler.get_last_lr()[0]:.6f}')\n",
    "    \n",
    "    training_time = time.time() - start_time\n",
    "    \n",
    "    # Memory usage\n",
    "    memory_info = {}\n",
    "    if torch.cuda.is_available():\n",
    "        memory_info = {\n",
    "            'allocated_mb': torch.cuda.memory_allocated() / (1024**2),\n",
    "            'reserved_mb': torch.cuda.memory_reserved() / (1024**2),\n",
    "            'max_allocated_mb': torch.cuda.max_memory_allocated() / (1024**2)\n",
    "        }\n",
    "    \n",
    "    return {\n",
    "        'training_completed': True,\n",
    "        'device_used': str(device),\n",
    "        'model_type': model_config['model_type'],\n",
    "        'model_parameters': sum(p.numel() for p in model.parameters()),\n",
    "        'training_time': training_time,\n",
    "        'final_train_loss': train_losses[-1],\n",
    "        'final_val_accuracy': val_accuracies[-1],\n",
    "        'best_val_accuracy': max(val_accuracies),\n",
    "        'train_losses': train_losses,\n",
    "        'val_accuracies': val_accuracies,\n",
    "        'memory_info': memory_info,\n",
    "        'data_config': data_config,\n",
    "        'model_config': model_config\n",
    "    }\n",
    "\n",
    "# Example configuration\n",
    "model_config = {\n",
    "    'model_type': 'resnet',  # or 'efficientnet'\n",
    "    'pretrained': True,\n",
    "    'epochs': 20,\n",
    "    'learning_rate': 0.001,\n",
    "    'weight_decay': 1e-4\n",
    "}\n",
    "\n",
    "data_config = {\n",
    "    'train_samples': 5000,\n",
    "    'val_samples': 1000,\n",
    "    'image_size': 224,\n",
    "    'n_channels': 3,\n",
    "    'n_classes': 10,\n",
    "    'batch_size': 32\n",
    "}\n",
    "\n",
    "# Run training\n",
    "# result = lambda_computer_vision_training(model_config, data_config)\n",
    "# print(f\"CV training completed! Best accuracy: {result['best_val_accuracy']:.2f}%\")\n",
    "# print(f\"Training time: {result['training_time']:.2f} seconds\")\n",
    "# print(f\"Model parameters: {result['model_parameters']:,}\")\n",
    "# print(f\"GPU memory used: {result['memory_info'].get('max_allocated_mb', 0):.1f} MB\")\n",
    "\n",
    "print(\"Computer vision training function defined. Uncomment the lines above to run on Lambda Cloud.\")"
   ]
  },
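  {
   "cell_type": "markdown",
   "id": "early-stopping-sketch",
   "metadata": {},
   "source": [
    "### Early Stopping Sketch\n",
    "\n",
    "The training loop above always runs for `model_config['epochs']` epochs. One common way to cut GPU time is to stop once validation accuracy stops improving. The sketch below is a minimal, framework-independent version of that idea; the `EarlyStopping` class, its `patience` and `min_delta` parameters, and the accuracy values are illustrative assumptions, not part of Clustrix or PyTorch.\n",
    "\n",
    "```python\n",
    "class EarlyStopping:\n",
    "    '''Patience-based early stopping on a metric where higher is better.'''\n",
    "\n",
    "    def __init__(self, patience=3, min_delta=0.0):\n",
    "        self.patience = patience    # epochs to wait without improvement\n",
    "        self.min_delta = min_delta  # smallest gain that counts as improvement\n",
    "        self.best = None\n",
    "        self.bad_epochs = 0\n",
    "\n",
    "    def step(self, metric):\n",
    "        '''Record one epoch's metric; return True when training should stop.'''\n",
    "        if self.best is None or metric > self.best + self.min_delta:\n",
    "            self.best = metric\n",
    "            self.bad_epochs = 0\n",
    "        else:\n",
    "            self.bad_epochs += 1\n",
    "        return self.bad_epochs >= self.patience\n",
    "\n",
    "\n",
    "# Usage with made-up validation accuracies (illustrative numbers only)\n",
    "stopper = EarlyStopping(patience=2)\n",
    "for epoch, val_accuracy in enumerate([71.0, 74.5, 74.2, 74.4, 80.0]):\n",
    "    if stopper.step(val_accuracy):\n",
    "        print(f'Stopping after epoch {epoch + 1}; best accuracy {stopper.best:.1f}%')\n",
    "        break\n",
    "```\n",
    "\n",
    "Inside the loop above, calling `stopper.step(val_accuracy)` after the validation phase and breaking when it returns `True` would be one way to wire this in."
   ]
  },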
  {
   "cell_type": "markdown",
   "id": "multi-gpu",
   "metadata": {},
   "source": [
    "## Multi-GPU Training on Lambda Cloud"
   ]
  },
  {
   "cell_type": "code",
   "id": "multi-gpu-setup",
   "metadata": {},
   "outputs": [],
   "source": "@cluster(cores=16, memory=\"128GB\", time=\"04:00:00\")\ndef lambda_multi_gpu_training(model_config, training_config):\n    \"\"\"Multi-GPU training example using PyTorch DDP.\"\"\"\n    import torch\n    import torch.nn as nn\n    import torch.multiprocessing as mp\n    from torch.nn.parallel import DistributedDataParallel as DDP\n    from torch.distributed import init_process_group, destroy_process_group\n    import os\n    \n    def setup_ddp(rank, world_size):\n        \"\"\"Setup distributed data parallel.\"\"\"\n        os.environ['MASTER_ADDR'] = 'localhost'\n        os.environ['MASTER_PORT'] = '12355'\n        init_process_group(backend=\"nccl\", rank=rank, world_size=world_size)\n        torch.cuda.set_device(rank)\n    \n    def cleanup_ddp():\n        \"\"\"Clean up distributed training.\"\"\"\n        destroy_process_group()\n    \n    def train_on_gpu(rank, world_size, model_config, training_config):\n        \"\"\"Training function for each GPU.\"\"\"\n        setup_ddp(rank, world_size)\n        \n        # Create model and move to GPU\n        model = create_model(model_config).to(rank)\n        model = DDP(model, device_ids=[rank])\n        \n        # Create data loader with DistributedSampler\n        train_loader = create_distributed_dataloader(training_config, rank, world_size)\n        \n        # Training loop\n        optimizer = torch.optim.AdamW(model.parameters(), lr=training_config['lr'])\n        \n        for epoch in range(training_config['epochs']):\n            train_loader.sampler.set_epoch(epoch)  # Important for proper shuffling\n            \n            for batch_idx, (data, target) in enumerate(train_loader):\n                data, target = data.to(rank), target.to(rank)\n                \n                optimizer.zero_grad()\n                output = model(data)\n                loss = nn.CrossEntropyLoss()(output, target)\n                loss.backward()\n                optimizer.step()\n                \n    
            if rank == 0 and batch_idx % 100 == 0:\n                    print(f'Epoch {epoch}, Batch {batch_idx}, Loss: {loss.item():.4f}')\n        \n        cleanup_ddp()\n    \n    # Launch multi-GPU training\n    world_size = torch.cuda.device_count()\n    print(f\"Starting multi-GPU training on {world_size} GPUs\")\n    \n    mp.spawn(\n        train_on_gpu,\n        args=(world_size, model_config, training_config),\n        nprocs=world_size,\n        join=True\n    )\n    \n    return {\"training_completed\": True, \"gpus_used\": world_size}",
   "execution_count": null
  },
  {
   "cell_type": "markdown",
   "id": "3b6jiq9gqq2",
   "source": "### HuggingFace Accelerate Example\n\nAlternative approach using HuggingFace Accelerate for easier multi-GPU setup:",
   "metadata": {}
  },
  {
   "cell_type": "code",
   "id": "ayqhp0nnsnb",
   "source": "@cluster(cores=16, memory=\"128GB\", time=\"04:00:00\")\ndef lambda_accelerate_training(model_config, training_config):\n    \"\"\"Multi-GPU training using HuggingFace Accelerate.\"\"\"\n    from accelerate import Accelerator\n    import torch\n    import torch.nn as nn\n    \n    # Initialize accelerator\n    accelerator = Accelerator()\n    device = accelerator.device\n    \n    # Create model and optimizer\n    model = create_model(model_config)\n    optimizer = torch.optim.AdamW(model.parameters(), lr=training_config['lr'])\n    train_loader = create_dataloader(training_config)\n    \n    # Prepare for distributed training\n    model, optimizer, train_loader = accelerator.prepare(\n        model, optimizer, train_loader\n    )\n    \n    # Training loop\n    model.train()\n    for epoch in range(training_config['epochs']):\n        for batch_idx, (data, target) in enumerate(train_loader):\n            with accelerator.accumulate(model):\n                output = model(data)\n                loss = nn.CrossEntropyLoss()(output, target)\n                \n                accelerator.backward(loss)\n                optimizer.step()\n                optimizer.zero_grad()\n            \n            if accelerator.is_main_process and batch_idx % 100 == 0:\n                print(f'Epoch {epoch}, Batch {batch_idx}, Loss: {loss.item():.4f}')\n    \n    return {\n        \"training_completed\": True,\n        \"num_processes\": accelerator.num_processes,\n        \"device\": str(device)\n    }",
   "metadata": {},
   "outputs": [],
   "execution_count": null
  },
  {
   "cell_type": "markdown",
   "id": "1kzkcvp11ms",
   "source": "## Multi-GPU Training on Lambda Cloud\n\n### Available Multi-GPU Instances\n\n- **2x A100 (40GB)**: ~$2.20/hour\n- **4x A100 (40GB)**: ~$4.40/hour  \n- **8x A100 (40GB)**: ~$8.80/hour\n- **2x A100 (80GB)**: ~$2.80/hour\n- **4x A100 (80GB)**: ~$5.60/hour\n- **8x A100 (80GB)**: ~$11.20/hour\n- **8x H100**: ~$20.00/hour (when available)\n\n### Setup Requirements\n\n1. **Launch multi-GPU instance** via Lambda Cloud console\n2. **Install additional packages** for distributed training:\n   ```bash\n   pip install accelerate deepspeed\n   ```\n3. **Configure Clustrix** for multi-GPU environment\n4. **Use appropriate parallelization strategy**\n\n### Parallelization Strategies\n\n- **Data Parallel (DP)**: Simple, works for most models\n- **Distributed Data Parallel (DDP)**: Better performance, recommended\n- **Model Parallel**: For very large models that don't fit on single GPU\n- **Pipeline Parallel**: For extremely large models\n- **DeepSpeed ZeRO**: For memory-efficient training of large models\n\n### PyTorch DDP Example",
   "metadata": {}
  },
  {
   "cell_type": "markdown",
   "id": "cost-optimization",
   "metadata": {},
   "source": [
    "## Cost Optimization Strategies"
   ]
  },
  {
   "cell_type": "code",
   "id": "cost-optimization-lambda",
   "metadata": {},
   "outputs": [],
   "source": "# Import Clustrix cost monitoring functionality\nfrom clustrix import cost_tracking_decorator, get_cost_monitor, generate_cost_report\n\n# Example 1: Using the cost tracking decorator\n@cost_tracking_decorator('lambda', 'a100_40gb')\n@cluster(cores=8, memory=\"32GB\")\ndef lambda_training_with_cost_tracking():\n    \"\"\"Example training function with automatic cost tracking.\"\"\"\n    import time\n    import numpy as np\n    \n    # Simulate training workload\n    print(\"Starting training...\")\n    time.sleep(2)  # Simulate 2 seconds of work\n    \n    # Simulate some compute\n    data = np.random.randn(1000, 1000)\n    result = np.dot(data, data.T)\n    \n    print(\"Training completed!\")\n    return {\n        'model_accuracy': 0.95,\n        'training_samples': 10000,\n        'final_loss': 0.032\n    }\n\n# Example 2: Manual cost monitoring\ndef manual_cost_monitoring_example():\n    \"\"\"Example of manual cost monitoring.\"\"\"\n    # Start cost monitoring\n    monitor = get_cost_monitor('lambda')\n    if monitor:\n        monitor.start_monitoring()\n        \n        # Your computation here\n        import time\n        time.sleep(1)\n        \n        # Stop monitoring and get report\n        cost_report = monitor.stop_monitoring()\n        if cost_report:\n            print(f\"Computation completed in {cost_report.duration_seconds:.2f} seconds\")\n            print(f\"Estimated cost: ${cost_report.cost_estimate.estimated_cost:.4f}\")\n            print(f\"GPU utilization: {len(cost_report.resource_usage.gpu_stats or [])} GPUs\")\n            \n            if cost_report.recommendations:\n                print(\"Cost optimization recommendations:\")\n                for rec in cost_report.recommendations:\n                    print(f\"  - {rec}\")\n\n# Example 3: Generate real-time cost report\ndef get_current_cost_status():\n    \"\"\"Get current cost and resource status.\"\"\"\n    report = generate_cost_report('lambda', 'a100_40gb')\n  
  if report:\n        print(\"Current Lambda Cloud Status:\")\n        print(f\"  CPU Usage: {report['resource_usage']['cpu_percent']:.1f}%\")\n        print(f\"  Memory Usage: {report['resource_usage']['memory_percent']:.1f}%\")\n        if report['resource_usage']['gpu_stats']:\n            avg_gpu = sum(gpu['utilization_percent'] for gpu in report['resource_usage']['gpu_stats']) / len(report['resource_usage']['gpu_stats'])\n            print(f\"  GPU Usage: {avg_gpu:.1f}%\")\n        print(f\"  Hourly Rate: ${report['cost_estimate']['hourly_rate']:.2f}\")\n\n# Example 4: Compare pricing across instance types\ndef compare_lambda_pricing():\n    \"\"\"Compare pricing for different Lambda Cloud instance types.\"\"\"\n    from clustrix import get_pricing_info\n    \n    pricing = get_pricing_info('lambda')\n    if pricing:\n        print(\"Lambda Cloud Instance Pricing (USD/hour):\")\n        \n        # Group by category\n        single_gpu = {k: v for k, v in pricing.items() if not k.startswith(('2x', '4x', '8x')) and k != 'default'}\n        multi_gpu = {k: v for k, v in pricing.items() if k.startswith(('2x', '4x', '8x'))}\n        \n        print(\"\\nSingle GPU Instances:\")\n        for instance, price in sorted(single_gpu.items(), key=lambda x: x[1]):\n            print(f\"  {instance:<15}: ${price:.2f}/hour\")\n        \n        print(\"\\nMulti-GPU Instances:\")\n        for instance, price in sorted(multi_gpu.items(), key=lambda x: x[1]):\n            print(f\"  {instance:<15}: ${price:.2f}/hour\")\n\n# Run examples (uncomment to test)\n# print(\"1. Cost tracking decorator example:\")\n# result = lambda_training_with_cost_tracking()\n# print(f\"Training result: {result}\")\n\n# print(\"\\n2. Manual cost monitoring example:\")\n# manual_cost_monitoring_example()\n\n# print(\"\\n3. Current cost status:\")\n# get_current_cost_status()\n\nprint(\"4. 
Lambda Cloud pricing comparison:\")\ncompare_lambda_pricing()\n\nprint(\"\\n\u2705 Lambda Cloud cost monitoring examples ready!\")\nprint(\"\ud83d\udca1 Use @cost_tracking_decorator('lambda', 'instance_type') for automatic cost tracking\")",
   "execution_count": null
  },
  {
   "cell_type": "markdown",
   "id": "iyk1bplps9",
   "source": "## Lambda Cloud Cost Optimization\n\n### Cost Monitoring and Tracking\n\nMonitor GPU utilization and track costs effectively:",
   "metadata": {}
  },
  {
   "cell_type": "markdown",
   "id": "h19wkr887g",
   "source": "### Lambda Cloud Cost Optimization\n\n#### \ud83d\udcb0 Instance Selection\n- **RTX 6000 Ada**: Best value for most ML workloads (~$0.75/hour)\n- **A10**: Good balance of performance and cost (~$0.60/hour)\n- **A100 40GB**: For large models requiring more VRAM (~$1.10/hour)\n- **A100 80GB**: Only when 40GB is insufficient (~$1.40/hour)\n- **H100**: Premium option for cutting-edge research (~$2.50/hour)\n\n#### \u23f0 Usage Patterns\n- Use \"persistent\" instances for ongoing development\n- Terminate instances immediately after training completion\n- Schedule training jobs during off-peak hours if possible\n- Use local development for debugging, GPU for final training\n\n#### \ud83d\udd27 Optimization Techniques\n- Mixed precision training (fp16) to reduce memory usage\n- Gradient accumulation for effective larger batch sizes\n- Model checkpointing to resume interrupted training\n- Efficient data loading with multiple workers\n- Early stopping to avoid overtraining\n\n#### \ud83d\udcca Monitoring and Management\n- Monitor GPU utilization with nvidia-smi\n- Track training progress with logging\n- Set training time limits to prevent runaway costs\n- Use Clustrix timeouts as safety nets\n- Regular cost reviews and budget alerts\n\n#### \ud83d\ude80 Clustrix-Specific Optimizations\n- Use Clustrix auto-cleanup features\n- Implement job queuing for multiple experiments\n- Leverage Clustrix's timeout mechanisms\n- Use remote environment caching",
   "metadata": {},
   "outputs": []
  },
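  {
   "cell_type": "markdown",
   "id": "cost-estimate-sketch",
   "metadata": {},
   "source": [
    "#### Worked Example: Hourly Rate vs. Total Job Cost\n",
    "\n",
    "The cheapest hourly rate is not always the cheapest way to finish a job, because a faster GPU can shorten the run. The sketch below works through that arithmetic (cost = hourly rate times runtime), assuming cost scales linearly with runtime. The rates are the approximate figures listed above; the runtimes, the dictionary keys, and the `estimate_job_cost` helper are hypothetical and chosen only for illustration, not measurements.\n",
    "\n",
    "```python\n",
    "def estimate_job_cost(hourly_rate, runtime_hours):\n",
    "    '''Return the estimated cost in USD, assuming cost grows linearly with runtime.'''\n",
    "    if hourly_rate < 0 or runtime_hours < 0:\n",
    "        raise ValueError('hourly_rate and runtime_hours must be non-negative')\n",
    "    return hourly_rate * runtime_hours\n",
    "\n",
    "\n",
    "# Approximate rates quoted in this tutorial (USD/hour); check current pricing.\n",
    "rates = {'a10': 0.60, 'rtx6000_ada': 0.75, 'a100_40gb': 1.10}\n",
    "\n",
    "# Hypothetical runtimes for the same job on each instance type (hours).\n",
    "runtimes = {'a10': 10.0, 'rtx6000_ada': 7.0, 'a100_40gb': 4.0}\n",
    "\n",
    "costs = {name: estimate_job_cost(rates[name], runtimes[name]) for name in rates}\n",
    "cheapest = min(costs, key=costs.get)\n",
    "\n",
    "for name, cost in sorted(costs.items(), key=lambda item: item[1]):\n",
    "    print(f'{name:<12} ${cost:.2f}')\n",
    "print(f'Cheapest for this job: {cheapest}')\n",
    "```\n",
    "\n",
    "With these made-up runtimes the instance with the highest hourly rate finishes the job for the least money, which is why it can help to benchmark a short run on each candidate before committing to a long one."
   ]
  },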
  {
   "cell_type": "markdown",
   "id": "best-practices",
   "metadata": {},
   "source": [
    "## Best Practices and Troubleshooting"
   ]
  },
  {
   "cell_type": "code",
   "id": "best-practices-lambda",
   "metadata": {},
   "outputs": [],
   "source": "# Example usage of monitoring functions\ndef create_monitoring_script():\n    \"\"\"Create and save the GPU monitoring script.\"\"\"\n    script_content = '''#!/bin/bash\n# Lambda Cloud monitoring script\n\necho \"Lambda Cloud Training Monitor\"\necho \"============================\"\necho \"Start time: $(date)\"\necho \"\"\n\n# System information\necho \"System Information:\"\necho \"------------------\"\nnvidia-smi --query-gpu=gpu_name,memory.total,power.draw --format=csv\necho \"\"\n\n# Monitor GPU usage every 30 seconds\nwhile true; do\n    echo \"GPU Status at $(date):\"\n    nvidia-smi --query-gpu=utilization.gpu,memory.used,memory.total,temperature.gpu --format=csv,noheader\n    echo \"\"\n    \n    # Check if training process is still running\n    if ! pgrep -f python > /dev/null; then\n        echo \"No Python processes found. Training may have completed.\"\n        break\n    fi\n    \n    sleep 30\ndone\n\necho \"Monitoring completed at $(date)\"\n'''\n    \n    with open('monitor_training.sh', 'w') as f:\n        f.write(script_content)\n    \n    # Make executable\n    import os\n    os.chmod('monitor_training.sh', 0o755)\n    \n    return \"Monitoring script created: monitor_training.sh\"\n\n# Uncomment to create the monitoring script:\n# result = create_monitoring_script()\n# print(result)",
   "execution_count": null
  },
  {
   "cell_type": "markdown",
   "id": "9e011y2ptia",
   "source": "## Lambda Cloud Best Practices\n\n### GPU Monitoring Script\n\nUse this monitoring script to track GPU usage during training. Save as `monitor_training.sh` and run with: `bash monitor_training.sh`\n\n```bash\n#!/bin/bash\n# Lambda Cloud monitoring script\n\necho \"Lambda Cloud Training Monitor\"\necho \"============================\"\necho \"Start time: $(date)\"\necho \"\"\n\n# System information\necho \"System Information:\"\necho \"------------------\"\nnvidia-smi --query-gpu=gpu_name,memory.total,power.draw --format=csv\necho \"\"\n\n# Monitor GPU usage every 30 seconds\nwhile true; do\n    echo \"GPU Status at $(date):\"\n    nvidia-smi --query-gpu=utilization.gpu,memory.used,memory.total,temperature.gpu --format=csv,noheader\n    echo \"\"\n    \n    # Check if training process is still running\n    if ! pgrep -f python > /dev/null; then\n        echo \"No Python processes found. Training may have completed.\"\n        break\n    fi\n    \n    sleep 30\ndone\n\necho \"Monitoring completed at $(date)\"\n```",
   "metadata": {}
  },
  {
   "cell_type": "markdown",
   "id": "2d105yt16xr",
   "source": "### Lambda Cloud + Clustrix Best Practices\n\n#### \ud83d\ude80 Performance Optimization\n- Always use mixed precision (fp16) when possible\n- Optimize data loading with multiple workers and pin_memory\n- Use appropriate batch sizes to maximize GPU utilization\n- Enable tensor cores for compatible operations\n- Pre-allocate GPU memory to avoid fragmentation\n\n#### \ud83d\udcbe Data Management\n- Store datasets on fast NVMe storage when available\n- Use data streaming for very large datasets\n- Implement efficient data preprocessing pipelines\n- Cache frequently used data in memory\n- Use appropriate data formats (e.g., HDF5, Parquet)\n\n#### \ud83d\udd27 Environment Setup\n- Use conda environments for reproducible setups\n- Pin package versions in requirements.txt\n- Install packages from conda-forge when possible\n- Use uv package manager for faster installs\n- Set up proper CUDA environment variables\n\n#### \ud83d\udee0\ufe0f Development Workflow\n- Develop and debug locally, train on Lambda Cloud\n- Use small datasets for initial testing\n- Implement proper logging and monitoring\n- Save model checkpoints regularly\n- Use version control for experiment tracking\n\n#### \ud83d\udd12 Security\n- Use SSH keys instead of passwords\n- Keep SSH keys secure and rotate regularly\n- Don't store credentials in code or notebooks\n- Use environment variables for configuration\n- Monitor instance access logs",
   "metadata": {},
   "outputs": []
  },
  {
   "cell_type": "markdown",
   "id": "0lrqgzg1xis",
   "source": "### Common Issues and Solutions\n\n#### \u274c CUDA out of memory errors\n\u2705 **Solutions:**\n- Reduce batch size\n- Enable gradient checkpointing\n- Use mixed precision training\n- Clear GPU cache with torch.cuda.empty_cache()\n- Consider model parallelism for large models\n\n#### \u274c Slow data loading\n\u2705 **Solutions:**\n- Increase num_workers in DataLoader\n- Enable pin_memory for GPU transfers\n- Use faster storage (NVMe over network storage)\n- Implement data prefetching\n- Optimize data preprocessing\n\n#### \u274c SSH connection timeouts\n\u2705 **Solutions:**\n- Configure SSH keep-alive settings\n- Use screen or tmux for long-running jobs\n- Implement proper error handling in Clustrix\n- Set appropriate timeout values\n- Monitor network connectivity\n\n#### \u274c Low GPU utilization\n\u2705 **Solutions:**\n- Increase batch size if memory allows\n- Optimize data loading pipeline\n- Use asynchronous data transfers\n- Profile code to identify bottlenecks\n- Consider multi-GPU training\n\n#### \u274c Package installation failures\n\u2705 **Solutions:**\n- Use conda for system-level packages\n- Check CUDA compatibility versions\n- Clear pip cache if needed\n- Use --no-cache-dir flag for pip\n- Install packages in correct order",
   "metadata": {},
   "outputs": []
  },
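  {
   "cell_type": "markdown",
   "id": "batch-backoff-sketch",
   "metadata": {},
   "source": [
    "#### Sketch: Backing Off the Batch Size After an Out-of-Memory Error\n",
    "\n",
    "Reducing the batch size is usually the first fix for out-of-memory errors. The sketch below automates that by halving the batch size until a step succeeds. It is framework-independent: `run_with_batch_backoff`, `fake_train_step`, and the 16-sample limit are illustrative assumptions, and `MemoryError` stands in for whatever exception your framework raises (recent PyTorch versions raise `torch.cuda.OutOfMemoryError`).\n",
    "\n",
    "```python\n",
    "def run_with_batch_backoff(train_step, batch_size, min_batch_size=1, oom_error=MemoryError):\n",
    "    '''Call train_step(batch_size), halving batch_size each time oom_error is raised.'''\n",
    "    while batch_size >= min_batch_size:\n",
    "        try:\n",
    "            return batch_size, train_step(batch_size)\n",
    "        except oom_error:\n",
    "            batch_size //= 2  # retry with half the batch\n",
    "    raise RuntimeError('No batch size at or above min_batch_size fit in memory')\n",
    "\n",
    "\n",
    "# Stand-in for a real training step: pretend anything above 16 samples runs out of memory.\n",
    "def fake_train_step(batch_size):\n",
    "    if batch_size > 16:\n",
    "        raise MemoryError('simulated out-of-memory')\n",
    "    return f'trained with batch_size={batch_size}'\n",
    "\n",
    "\n",
    "print(run_with_batch_backoff(fake_train_step, batch_size=128))\n",
    "```\n",
    "\n",
    "In a real PyTorch job you would likely also call `torch.cuda.empty_cache()` before each retry and combine the smaller batch with gradient accumulation to keep the effective batch size unchanged."
   ]
  },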
  {
   "cell_type": "markdown",
   "id": "cleanup-lambda",
   "metadata": {},
   "source": [
    "## Instance Management and Cleanup"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cleanup-instances",
   "metadata": {},
   "outputs": [],
   "source": "### Lambda Cloud Instance Management\n\n#### \ud83d\udd0d Check Running Instances\n\n**Via CLI:**\n```bash\nlambda-cloud instance list\n```\n\n**Via Web Console:**\nVisit: https://cloud.lambdalabs.com/instances\n\n#### \u23f9\ufe0f Terminate Instances\n\n**Terminate specific instance:**\n```bash\nlambda-cloud instance terminate <INSTANCE_ID>\n```\n\n**Terminate all instances (DANGEROUS!):**\n```bash\nlambda-cloud instance list --format=csv | grep -v \"instance_id\" | cut -d',' -f1 | xargs -I {} lambda-cloud instance terminate {}\n```\n\n#### \ud83d\udcbe Save Work Before Termination\n\n**Save models to persistent storage:**\n```bash\nrsync -avz ubuntu@<INSTANCE_IP>:/path/to/models/ ./local_models/\n```\n\n**Save logs and results:**\n```bash\nscp -r ubuntu@<INSTANCE_IP>:/tmp/clustrix/ ./results/\n```\n\n#### \ud83d\udcca Cost Monitoring\n\n**Check current usage:**\n```bash\nlambda-cloud instance list --format=table\n```\n\n**Estimate costs:**\n```bash\nlambda-cloud instance list --format=csv | awk -F',' 'NR>1 {print $2, $3}' | while read type status; do\n    if [ \"$status\" = \"active\" ]; then\n        echo \"Active instance: $type\"\n    fi\ndone\n```\n\n### Automated Cleanup Script\n\nSave this as `lambda_cleanup.sh` for automated instance management:\n\n```bash\n#!/bin/bash\n# Automated cleanup script for Lambda Cloud\n# Save as lambda_cleanup.sh\n\nset -e\n\necho \"Lambda Cloud Automated Cleanup\"\necho \"==============================\"\n\n# Check if lambda-cloud CLI is installed\nif ! command -v lambda-cloud &> /dev/null; then\n    echo \"Error: lambda-cloud CLI not found. Please install it first.\"\n    exit 1\nfi\n\n# List current instances\necho \"Current instances:\"\nlambda-cloud instance list\necho \"\"\n\n# Ask for confirmation\nread -p \"Do you want to terminate ALL instances? (y/N): \" -n 1 -r\necho \"\"\nif [[ ! 
$REPLY =~ ^[Yy]$ ]]; then\n    echo \"Cleanup cancelled.\"\n    exit 0\nfi\n\n# Get instance IDs\nINSTANCE_IDS=$(lambda-cloud instance list --format=csv | grep -v \"instance_id\" | cut -d',' -f1)\n\nif [ -z \"$INSTANCE_IDS\" ]; then\n    echo \"No instances to terminate.\"\n    exit 0\nfi\n\n# Terminate instances\necho \"Terminating instances...\"\nfor instance_id in $INSTANCE_IDS; do\n    echo \"Terminating instance: $instance_id\"\n    lambda-cloud instance terminate $instance_id\ndone\n\necho \"All instances terminated.\"\necho \"Please verify termination in the web console: https://cloud.lambdalabs.com/instances\"\n```\n\n### Clustrix Integration Manager"
  },
  {
   "cell_type": "code",
   "id": "5f4n1hu8qdb",
   "source": "# Integrate cleanup with Clustrix workflows\n\nfrom clustrix import configure\nimport subprocess\nimport time\n\nclass LambdaCloudManager:\n    \"\"\"Manager for Lambda Cloud instances with Clustrix integration.\"\"\"\n    \n    def __init__(self):\n        self.active_instances = []\n    \n    def launch_instance_for_clustrix(self, instance_type, ssh_key_name):\n        \"\"\"Launch instance and configure Clustrix.\"\"\"\n        # Launch instance\n        result = subprocess.run([\n            'lambda-cloud', 'instance', 'launch',\n            '--instance-type', instance_type,\n            '--ssh-key-name', ssh_key_name\n        ], capture_output=True, text=True)\n        \n        if result.returncode != 0:\n            raise Exception(f\"Failed to launch instance: {result.stderr}\")\n        \n        # Parse instance ID and IP (simplified)\n        instance_id = \"extracted_from_output\"  # Parse from result.stdout\n        instance_ip = \"extracted_from_output\"   # Parse from result.stdout\n        \n        # Wait for instance to be ready\n        time.sleep(60)  # Wait for startup\n        \n        # Configure Clustrix\n        configure(\n            cluster_type=\"ssh\",\n            cluster_host=instance_ip,\n            username=\"ubuntu\",\n            key_file=\"~/.ssh/id_rsa\",\n            remote_work_dir=\"/tmp/clustrix\",\n            package_manager=\"auto\",\n            cleanup_on_success=True\n        )\n        \n        self.active_instances.append({\n            'id': instance_id,\n            'ip': instance_ip,\n            'type': instance_type,\n            'launch_time': time.time()\n        })\n        \n        return instance_id, instance_ip\n    \n    def cleanup_all_instances(self):\n        \"\"\"Clean up all managed instances.\"\"\"\n        for instance in self.active_instances:\n            try:\n                subprocess.run([\n                    'lambda-cloud', 'instance', 'terminate', instance['id']\n       
         ], check=True)\n                print(f\"Terminated instance {instance['id']}\")\n            except subprocess.CalledProcessError as e:\n                print(f\"Failed to terminate {instance['id']}: {e}\")\n        \n        self.active_instances.clear()\n    \n    def __del__(self):\n        \"\"\"Ensure cleanup on object destruction.\"\"\"\n        if self.active_instances:\n            print(\"Warning: Active instances detected. Cleaning up...\")\n            self.cleanup_all_instances()\n\n# Usage example:\n# manager = LambdaCloudManager()\n# try:\n#     instance_id, ip = manager.launch_instance_for_clustrix('a100', 'my-ssh-key')\n#     # Run your Clustrix computations\n#     result = my_clustrix_function()\n# finally:\n#     manager.cleanup_all_instances()",
   "metadata": {},
   "outputs": [],
   "execution_count": null
  },
  {
   "cell_type": "markdown",
   "id": "lambda-summary",
   "metadata": {},
   "source": [
    "## Summary\n",
    "\n",
    "This tutorial covered:\n",
    "\n",
    "1. **Setup**: Lambda Cloud account creation and instance management\n",
    "2. **GPU Computing**: High-performance GPU instances for ML workloads\n",
    "3. **Deep Learning**: PyTorch training with GPU acceleration\n",
    "4. **Transformer Models**: Fine-tuning with HuggingFace Transformers\n",
    "5. **Computer Vision**: CNN training with data augmentation\n",
    "6. **Multi-GPU Training**: Distributed training across multiple GPUs\n",
    "7. **Cost Optimization**: Strategies to minimize GPU computing costs\n",
    "8. **Best Practices**: Performance optimization and troubleshooting\n",
    "9. **Instance Management**: Automated cleanup and monitoring\n",
    "\n",
    "### Key Advantages of Lambda Cloud + Clustrix\n",
    "\n",
    "- **GPU Focus**: Specialized in high-performance GPU computing\n",
    "- **Cost Effective**: Competitive pricing for GPU instances\n",
    "- **Simple Management**: Easy instance launching and termination\n",
    "- **High Performance**: Latest NVIDIA GPUs (A100, H100, RTX)\n",
    "- **Fast Networking**: InfiniBand for multi-GPU communication\n",
    "- **ML Optimized**: Pre-configured environments for machine learning\n",
    "- **Flexible Scaling**: From single GPU to large multi-GPU clusters\n",
    "\n",
    "### Lambda Cloud Pricing Advantages\n",
    "\n",
    "- **RTX 6000 Ada**: Excellent price/performance for most ML workloads\n",
    "- **A100 40GB/80GB**: Industry-standard for large-scale training\n",
    "- **H100**: Cutting-edge performance for the most demanding workloads\n",
    "- **Multi-GPU**: Cost-effective scaling for distributed training\n",
    "- **No Hidden Fees**: Simple per-hour pricing\n",
    "\n",
    "### Next Steps\n",
    "\n",
    "1. Create your Lambda Cloud account and add SSH keys\n",
    "2. Start with a single GPU instance for testing\n",
    "3. Configure Clustrix for your Lambda Cloud instance\n",
    "4. Run the provided examples to verify setup\n",
    "5. Scale to multi-GPU instances for larger workloads\n",
    "6. Implement cost monitoring and automated cleanup\n",
    "\n",
    "### Use Cases\n",
    "\n",
    "- **Deep Learning Research**: Train large neural networks efficiently\n",
    "- **Computer Vision**: Process large image datasets with CNNs\n",
    "- **NLP**: Fine-tune transformer models on custom datasets\n",
    "- **Scientific Computing**: GPU-accelerated simulations and modeling\n",
    "- **Prototyping**: Rapid experimentation with different architectures\n",
    "- **Production Training**: Scale up successful experiments\n",
    "\n",
    "### Resources\n",
    "\n",
    "- [Lambda Cloud Console](https://cloud.lambdalabs.com/)\n",
    "- [Lambda Cloud Documentation](https://lambdalabs.com/service/gpu-cloud/documentation)\n",
    "- [Lambda Cloud CLI](https://github.com/LambdaLabsML/lambda-cloud-cli)\n",
    "- [PyTorch Documentation](https://pytorch.org/docs/)\n",
    "- [HuggingFace Transformers](https://huggingface.co/transformers/)\n",
    "- [Clustrix Documentation](https://clustrix.readthedocs.io/)\n",
    "\n",
    "**Remember**: Lambda Cloud excels at GPU computing! Always terminate instances when not in use to control costs, and leverage Clustrix's distributed computing capabilities to scale your ML workloads efficiently."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}