{
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "# SLURM Cluster Tutorial\n",
        "\n",
        "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/ContextLab/clustrix/blob/master/docs/notebooks/slurm_tutorial.ipynb)\n",
        "\n",
        "This tutorial demonstrates how to use Clustrix with SLURM (Simple Linux Utility for Resource Management) clusters. SLURM is one of the most popular workload managers for HPC clusters.\n",
        "\n",
        "## Prerequisites\n",
        "\n",
        "- Access to a SLURM cluster\n",
        "- SSH key configured for the cluster\n",
        "- Clustrix installed: `pip install clustrix`"
      ],
      "id": "cell-0"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Installation and Setup\n",
        "\n",
        "First, install Clustrix if you haven't already:"
      ],
      "id": "cell-1"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "# Install Clustrix (uncomment if needed)\n",
        "# !pip install clustrix\n",
        "\n",
        "import clustrix\n",
        "from clustrix import cluster, configure\n",
        "import numpy as np\n",
        "import time"
      ],
      "id": "cell-2"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Basic SLURM Configuration\n",
        "\n",
        "Configure Clustrix to connect to your SLURM cluster:"
      ],
      "id": "cell-3"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "# Configure for SLURM cluster\n",
        "configure(\n",
        "    cluster_type=\"slurm\",\n",
        "    cluster_host=\"your-slurm-cluster.edu\",  # Replace with your cluster hostname\n",
        "    username=\"your-username\",               # Replace with your username\n",
        "    key_file=\"~/.ssh/id_rsa\",              # Path to your SSH key\n",
        "    \n",
        "    # Default resource requirements\n",
        "    default_cores=4,\n",
        "    default_memory=\"8GB\",\n",
        "    default_time=\"01:00:00\",\n",
        "    default_partition=\"normal\",             # Replace with your default partition\n",
        "    \n",
        "    # Remote work directory\n",
        "    remote_work_dir=\"/scratch/your-username/clustrix\",  # Adjust for your cluster\n",
        "    \n",
        "    # Optional: Load modules on the cluster\n",
        "    module_loads=[\"python/3.9\", \"gcc/9.3.0\"],\n",
        "    \n",
        "    # Cleanup settings\n",
        "    cleanup_on_success=True,\n",
        "    max_parallel_jobs=20\n",
        ")\n",
        "\n",
        "print(\"SLURM cluster configured successfully!\")"
      ],
      "id": "cell-4"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Example 1: Simple Mathematical Computation\n",
        "\n",
        "Let's start with a basic example that performs a mathematical computation on the cluster:"
      ],
      "id": "cell-5"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "@cluster(cores=2, memory=\"4GB\", time=\"00:10:00\")\n",
        "def calculate_pi_monte_carlo(n_samples=1000000):\n",
        "    \"\"\"\n",
        "    Calculate pi using Monte Carlo method.\n",
        "    This will run on the SLURM cluster.\n",
        "    \"\"\"\n",
        "    import numpy as np\n",
        "    \n",
        "    # Generate random points\n",
        "    x = np.random.uniform(-1, 1, n_samples)\n",
        "    y = np.random.uniform(-1, 1, n_samples)\n",
        "    \n",
        "    # Check if points are inside unit circle\n",
        "    inside_circle = (x**2 + y**2) <= 1\n",
        "    \n",
        "    # Estimate pi\n",
        "    pi_estimate = 4 * np.sum(inside_circle) / n_samples\n",
        "    \n",
        "    return {\n",
        "        'pi_estimate': pi_estimate,\n",
        "        'n_samples': n_samples,\n",
        "        'error': abs(pi_estimate - np.pi)\n",
        "    }\n",
        "\n",
        "# Execute on cluster (this will submit a SLURM job)\n",
        "result = calculate_pi_monte_carlo(5000000)\n",
        "print(f\"Pi estimate: {result['pi_estimate']:.6f}\")\n",
        "print(f\"Error: {result['error']:.6f}\")\n",
        "print(f\"Samples used: {result['n_samples']:,}\")"
      ],
      "id": "cell-6"
    },
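    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "The error of a Monte Carlo estimate shrinks only as `1 / sqrt(n)`, which is worth keeping in mind when choosing `n_samples`. The sketch below runs locally and does not use the cluster; `pi_standard_error` is a hypothetical helper name, not part of Clustrix. It treats each sample as a Bernoulli trial that lands inside the circle with probability `pi / 4`.\n",
        "\n",
        "```python\n",
        "import math\n",
        "\n",
        "def pi_standard_error(n_samples):\n",
        "    \"\"\"Standard error of the estimator 4 * (hits / n_samples).\"\"\"\n",
        "    p = math.pi / 4  # chance that a uniform point in the square falls inside the circle\n",
        "    return 4 * math.sqrt(p * (1 - p) / n_samples)\n",
        "\n",
        "for n in (10_000, 1_000_000, 5_000_000):\n",
        "    print(f'n = {n:>9,}: expected error ~ {pi_standard_error(n):.1e}')\n",
        "```\n",
        "\n",
        "Multiplying the sample count by 100 reduces the expected error by a factor of about 10."
      ],
      "id": "cell-6a"
    },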
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Example 2: Machine Learning Model Training\n",
        "\n",
        "Train a machine learning model with specific resource requirements:"
      ],
      "id": "cell-7"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "@cluster(\n",
        "    cores=8, \n",
        "    memory=\"32GB\", \n",
        "    time=\"02:00:00\",\n",
        "    partition=\"gpu\",  # Use GPU partition if available\n",
        "    gres=\"gpu:1\"      # Request 1 GPU (SLURM-specific)\n",
        ")\n",
        "def train_random_forest(n_samples=100000, n_features=50, n_estimators=200):\n",
        "    \"\"\"\n",
        "    Train a Random Forest model on synthetic data.\n",
        "    \"\"\"\n",
        "    from sklearn.ensemble import RandomForestClassifier\n",
        "    from sklearn.datasets import make_classification\n",
        "    from sklearn.model_selection import train_test_split, cross_val_score\n",
        "    from sklearn.metrics import accuracy_score\n",
        "    import numpy as np\n",
        "    \n",
        "    print(f\"Generating dataset with {n_samples:,} samples and {n_features} features...\")\n",
        "    \n",
        "    # Generate synthetic dataset\n",
        "    X, y = make_classification(\n",
        "        n_samples=n_samples,\n",
        "        n_features=n_features,\n",
        "        n_informative=int(n_features * 0.7),\n",
        "        n_redundant=int(n_features * 0.2),\n",
        "        n_clusters_per_class=2,\n",
        "        random_state=42\n",
        "    )\n",
        "    \n",
        "    # Split the data\n",
        "    X_train, X_test, y_train, y_test = train_test_split(\n",
        "        X, y, test_size=0.2, random_state=42\n",
        "    )\n",
        "    \n",
        "    print(f\"Training Random Forest with {n_estimators} estimators...\")\n",
        "    \n",
        "    # Train model\n",
        "    model = RandomForestClassifier(\n",
        "        n_estimators=n_estimators,\n",
        "        max_depth=20,\n",
        "        min_samples_split=5,\n",
        "        n_jobs=-1,  # Use all available cores\n",
        "        random_state=42\n",
        "    )\n",
        "    \n",
        "    model.fit(X_train, y_train)\n",
        "    \n",
        "    # Evaluate model\n",
        "    train_accuracy = accuracy_score(y_train, model.predict(X_train))\n",
        "    test_accuracy = accuracy_score(y_test, model.predict(X_test))\n",
        "    \n",
        "    # Cross-validation\n",
        "    cv_scores = cross_val_score(model, X, y, cv=5, n_jobs=-1)\n",
        "    \n",
        "    return {\n",
        "        'train_accuracy': train_accuracy,\n",
        "        'test_accuracy': test_accuracy,\n",
        "        'cv_mean': np.mean(cv_scores),\n",
        "        'cv_std': np.std(cv_scores),\n",
        "        'feature_importance': model.feature_importances_.tolist(),\n",
        "        'n_samples': n_samples,\n",
        "        'n_features': n_features,\n",
        "        'n_estimators': n_estimators\n",
        "    }\n",
        "\n",
        "# Train model on cluster\n",
        "ml_result = train_random_forest(n_samples=50000, n_features=30, n_estimators=100)\n",
        "\n",
        "print(f\"Training Accuracy: {ml_result['train_accuracy']:.4f}\")\n",
        "print(f\"Test Accuracy: {ml_result['test_accuracy']:.4f}\")\n",
        "print(f\"Cross-validation: {ml_result['cv_mean']:.4f} ± {ml_result['cv_std']:.4f}\")"
      ],
      "id": "cell-8"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Example 3: Parallel Data Processing with Automatic Loop Distribution\n",
        "\n",
        "Process multiple data chunks in parallel using Clustrix's automatic loop parallelization:"
      ],
      "id": "cell-9"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "@cluster(\n",
        "    cores=16, \n",
        "    memory=\"64GB\", \n",
        "    time=\"01:30:00\",\n",
        "    parallel=True  # Enable automatic loop parallelization\n",
        ")\n",
        "def process_data_chunks(chunk_size=10000, num_chunks=20):\n",
        "    \"\"\"\n",
        "    Process multiple data chunks in parallel.\n",
        "    The for loop will be automatically distributed across cores.\n",
        "    \"\"\"\n",
        "    import numpy as np\n",
        "    from scipy import stats\n",
        "    \n",
        "    results = []\n",
        "    \n",
        "    # This loop will be automatically parallelized by Clustrix\n",
        "    for chunk_id in range(num_chunks):\n",
        "        # Generate chunk data with different random seed\n",
        "        np.random.seed(chunk_id * 42)\n",
        "        data = np.random.exponential(scale=2.0, size=chunk_size)\n",
        "        \n",
        "        # Perform statistical analysis on chunk\n",
        "        chunk_stats = {\n",
        "            'chunk_id': chunk_id,\n",
        "            'mean': np.mean(data),\n",
        "            'std': np.std(data),\n",
        "            'median': np.median(data),\n",
        "            'skewness': stats.skew(data),\n",
        "            'kurtosis': stats.kurtosis(data),\n",
        "            'min': np.min(data),\n",
        "            'max': np.max(data),\n",
        "            'percentile_95': np.percentile(data, 95)\n",
        "        }\n",
        "        \n",
        "        results.append(chunk_stats)\n",
        "    \n",
        "    # Aggregate results\n",
        "    overall_stats = {\n",
        "        'num_chunks': len(results),\n",
        "        'total_samples': num_chunks * chunk_size,\n",
        "        'mean_of_means': np.mean([r['mean'] for r in results]),\n",
        "        'std_of_means': np.std([r['mean'] for r in results]),\n",
        "        'chunk_results': results\n",
        "    }\n",
        "    \n",
        "    return overall_stats\n",
        "\n",
        "# Process data chunks in parallel\n",
        "parallel_result = process_data_chunks(chunk_size=5000, num_chunks=10)\n",
        "\n",
        "print(f\"Processed {parallel_result['num_chunks']} chunks\")\n",
        "print(f\"Total samples: {parallel_result['total_samples']:,}\")\n",
        "print(f\"Mean of chunk means: {parallel_result['mean_of_means']:.4f}\")\n",
        "print(f\"Std of chunk means: {parallel_result['std_of_means']:.4f}\")\n",
        "\n",
        "# Display first few chunk results\n",
        "print(\"\\nFirst 3 chunk results:\")\n",
        "for i, chunk in enumerate(parallel_result['chunk_results'][:3]):\n",
        "    print(f\"  Chunk {chunk['chunk_id']}: mean={chunk['mean']:.3f}, std={chunk['std']:.3f}\")"
      ],
      "id": "cell-10"
    },
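    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "When loop iterations may run in a different order or on different workers, giving each chunk its own independent random stream keeps the results reproducible. The sketch below swaps the `np.random.seed(chunk_id * 42)` pattern for NumPy's `SeedSequence.spawn`, which is a different technique from the one used above. `chunk_means` and `root_seed` are illustrative names, and the code runs locally without Clustrix.\n",
        "\n",
        "```python\n",
        "import numpy as np\n",
        "\n",
        "def chunk_means(num_chunks, chunk_size, root_seed=0):\n",
        "    \"\"\"Mean of each chunk, drawn from an independent per-chunk generator.\"\"\"\n",
        "    # One child seed per chunk, all derived from a single root seed\n",
        "    child_seeds = np.random.SeedSequence(root_seed).spawn(num_chunks)\n",
        "    means = []\n",
        "    for seed in child_seeds:\n",
        "        rng = np.random.default_rng(seed)\n",
        "        data = rng.exponential(scale=2.0, size=chunk_size)\n",
        "        means.append(float(data.mean()))\n",
        "    return means\n",
        "\n",
        "print(chunk_means(num_chunks=3, chunk_size=5000))\n",
        "```\n",
        "\n",
        "Because each generator depends only on the root seed and the chunk index, the same chunk produces the same data regardless of which worker processes it."
      ],
      "id": "cell-10a"
    },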
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Example 4: Scientific Computing - Numerical Integration\n",
        "\n",
        "Perform numerical integration using high-performance computing resources:"
      ],
      "id": "cell-11"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "@cluster(\n",
        "    cores=32,\n",
        "    memory=\"128GB\",\n",
        "    time=\"03:00:00\",\n",
        "    partition=\"bigmem\"  # Use high-memory partition\n",
        ")\n",
        "def numerical_integration_adaptive(function_type=\"gaussian\", intervals=1000000, precision_target=1e-8):\n",
        "    \"\"\"\n",
        "    Perform high-precision numerical integration using adaptive methods.\n",
        "    \"\"\"\n",
        "    import numpy as np\n",
        "    from scipy import integrate\n",
        "    import math\n",
        "    \n",
        "    def gaussian_function(x):\n",
        "        \"\"\"Standard Gaussian function\"\"\"\n",
        "        return np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)\n",
        "    \n",
        "    def oscillatory_function(x):\n",
        "        \"\"\"Highly oscillatory function\"\"\"\n",
        "        return np.sin(100 * x) * np.exp(-x**2)\n",
        "    \n",
        "    def polynomial_function(x):\n",
        "        \"\"\"High-degree polynomial\"\"\"\n",
        "        return x**10 * np.exp(-x)\n",
        "    \n",
        "    # Select function based on type\n",
        "    functions = {\n",
        "        \"gaussian\": (gaussian_function, -5, 5, math.erf(5/np.sqrt(2)) - math.erf(-5/np.sqrt(2))),\n",
        "        \"oscillatory\": (oscillatory_function, -2, 2, None),  # No analytical solution\n",
        "        \"polynomial\": (polynomial_function, 0, 10, math.gamma(11))  # Analytical: 10!\n",
        "    }\n",
        "    \n",
        "    if function_type not in functions:\n",
        "        raise ValueError(f\"Unknown function type: {function_type}\")\n",
        "    \n",
        "    func, a, b, analytical = functions[function_type]\n",
        "    \n",
        "    print(f\"Integrating {function_type} function from {a} to {b}...\")\n",
        "    print(f\"Target precision: {precision_target}\")\n",
        "    \n",
        "    # High-precision adaptive integration\n",
        "    result, error = integrate.quad(\n",
        "        func, a, b, \n",
        "        epsabs=precision_target,\n",
        "        epsrel=precision_target,\n",
        "        limit=intervals\n",
        "    )\n",
        "    \n",
        "    # Monte Carlo integration for comparison\n",
        "    n_mc = 10000000  # 10 million samples\n",
        "    x_mc = np.random.uniform(a, b, n_mc)\n",
        "    y_mc = func(x_mc)\n",
        "    mc_result = (b - a) * np.mean(y_mc)\n",
        "    mc_error = (b - a) * np.std(y_mc) / np.sqrt(n_mc)\n",
        "    \n",
        "    integration_result = {\n",
        "        'function_type': function_type,\n",
        "        'integration_bounds': [a, b],\n",
        "        'adaptive_result': result,\n",
        "        'adaptive_error': error,\n",
        "        'monte_carlo_result': mc_result,\n",
        "        'monte_carlo_error': mc_error,\n",
        "        'precision_target': precision_target,\n",
        "        'mc_samples': n_mc\n",
        "    }\n",
        "    \n",
        "    if analytical is not None:\n",
        "        integration_result['analytical_result'] = analytical\n",
        "        integration_result['adaptive_vs_analytical'] = abs(result - analytical)\n",
        "        integration_result['mc_vs_analytical'] = abs(mc_result - analytical)\n",
        "    \n",
        "    return integration_result\n",
        "\n",
        "# Perform numerical integration\n",
        "integration_results = []\n",
        "\n",
        "for func_type in [\"gaussian\", \"polynomial\", \"oscillatory\"]:\n",
        "    result = numerical_integration_adaptive(func_type, precision_target=1e-10)\n",
        "    integration_results.append(result)\n",
        "    \n",
        "    print(f\"\\n{func_type.upper()} FUNCTION INTEGRATION:\")\n",
        "    print(f\"Adaptive result: {result['adaptive_result']:.10f} ± {result['adaptive_error']:.2e}\")\n",
        "    print(f\"Monte Carlo result: {result['monte_carlo_result']:.10f} ± {result['monte_carlo_error']:.2e}\")\n",
        "    \n",
        "    if 'analytical_result' in result:\n",
        "        print(f\"Analytical result: {result['analytical_result']:.10f}\")\n",
        "        print(f\"Adaptive error vs analytical: {result['adaptive_vs_analytical']:.2e}\")\n",
        "        print(f\"MC error vs analytical: {result['mc_vs_analytical']:.2e}\")"
      ],
      "id": "cell-12"
    },
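    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Reference values for integrals on finite bounds are easy to get subtly wrong, so it can help to check them locally before treating them as ground truth. The sketch below assumes SciPy is installed on your local machine and does not use the cluster. The standard normal density on [-5, 5] integrates to `erf(5 / sqrt(2))`, and `x**10 * exp(-x)` on [0, 10] is a lower incomplete gamma function, which `scipy.special.gammainc` returns in regularized form.\n",
        "\n",
        "```python\n",
        "import math\n",
        "from scipy import integrate, special\n",
        "\n",
        "# Standard normal density on [-5, 5]\n",
        "gauss_exact = math.erf(5 / math.sqrt(2))\n",
        "gauss_quad, _ = integrate.quad(\n",
        "    lambda x: math.exp(-x**2 / 2) / math.sqrt(2 * math.pi), -5, 5\n",
        ")\n",
        "\n",
        "# x**10 * exp(-x) on [0, 10]; gammainc(a, x) is regularized, so multiply by gamma(a)\n",
        "poly_exact = math.gamma(11) * special.gammainc(11, 10)\n",
        "poly_quad, _ = integrate.quad(lambda x: x**10 * math.exp(-x), 0, 10)\n",
        "\n",
        "print(f'gaussian:   exact={gauss_exact:.10f}  quad={gauss_quad:.10f}')\n",
        "print(f'polynomial: exact={poly_exact:.4f}  quad={poly_quad:.4f}')\n",
        "```\n",
        "\n",
        "The polynomial value is noticeably smaller than `10!`, which is the integral over [0, infinity) rather than [0, 10]."
      ],
      "id": "cell-12a"
    },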
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Example 5: Bioinformatics - Sequence Analysis\n",
        "\n",
        "Analyze biological sequences using cluster computing:"
      ],
      "id": "cell-13"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "@cluster(\n",
        "    cores=24,\n",
        "    memory=\"96GB\", \n",
        "    time=\"04:00:00\",\n",
        "    partition=\"bioqueue\"  # Specialized bioinformatics partition\n",
        ")\n",
        "def analyze_genome_sequences(num_sequences=1000, sequence_length=10000):\n",
        "    \"\"\"\n",
        "    Analyze synthetic genome sequences for various biological properties.\n",
        "    \"\"\"\n",
        "    import numpy as np\n",
        "    import random\n",
        "    from collections import Counter\n",
        "    import re\n",
        "    \n",
        "    # DNA bases\n",
        "    bases = ['A', 'T', 'G', 'C']\n",
        "    \n",
        "    # Common biological motifs\n",
        "    motifs = {\n",
        "        'CpG_sites': 'CG',\n",
        "        'TATA_box': 'TATAAA',\n",
        "        'start_codon': 'ATG',\n",
        "        'stop_codons': ['TAA', 'TAG', 'TGA'],\n",
        "        'poly_A': 'AAAAAAA',  # 7 consecutive A's\n",
        "        'GC_rich': 'GCGCGC'\n",
        "    }\n",
        "    \n",
        "    def generate_sequence(length, gc_content=0.5):\n",
        "        \"\"\"Generate a random DNA sequence with specified GC content\"\"\"\n",
        "        # Adjust probabilities for GC content\n",
        "        gc_prob = gc_content / 2  # Equal prob for G and C\n",
        "        at_prob = (1 - gc_content) / 2  # Equal prob for A and T\n",
        "        \n",
        "        probs = [at_prob, at_prob, gc_prob, gc_prob]  # A, T, G, C\n",
        "        return ''.join(np.random.choice(bases, size=length, p=probs))\n",
        "    \n",
        "    def analyze_sequence(sequence):\n",
        "        \"\"\"Analyze a single sequence for biological properties\"\"\"\n",
        "        # Basic composition\n",
        "        composition = Counter(sequence)\n",
        "        total_bases = len(sequence)\n",
        "        \n",
        "        gc_content = (composition['G'] + composition['C']) / total_bases\n",
        "        at_content = (composition['A'] + composition['T']) / total_bases\n",
        "        \n",
        "        # Motif analysis\n",
        "        motif_counts = {}\n",
        "        motif_counts['CpG_sites'] = len(re.findall(motifs['CpG_sites'], sequence))\n",
        "        motif_counts['TATA_boxes'] = len(re.findall(motifs['TATA_box'], sequence))\n",
        "        motif_counts['start_codons'] = len(re.findall(motifs['start_codon'], sequence))\n",
        "        motif_counts['poly_A_signals'] = len(re.findall(motifs['poly_A'], sequence))\n",
        "        motif_counts['GC_rich_regions'] = len(re.findall(motifs['GC_rich'], sequence))\n",
        "        \n",
        "        # Stop codons (any of the three)\n",
        "        stop_codon_count = sum(len(re.findall(codon, sequence)) for codon in motifs['stop_codons'])\n",
        "        motif_counts['stop_codons'] = stop_codon_count\n",
        "        \n",
        "        # Calculate complexity (entropy)\n",
        "        entropy = -sum((count/total_bases) * np.log2(count/total_bases) \n",
        "                      for count in composition.values() if count > 0)\n",
        "        \n",
        "        # Find longest homopolymer runs\n",
        "        max_runs = {}\n",
        "        for base in bases:\n",
        "            runs = re.findall(f'{base}+', sequence)\n",
        "            max_runs[f'max_{base}_run'] = max(len(run) for run in runs) if runs else 0\n",
        "        \n",
        "        return {\n",
        "            'length': total_bases,\n",
        "            'gc_content': gc_content,\n",
        "            'at_content': at_content,\n",
        "            'base_composition': dict(composition),\n",
        "            'entropy': entropy,\n",
        "            'motif_counts': motif_counts,\n",
        "            'max_homopolymer_runs': max_runs\n",
        "        }\n",
        "    \n",
        "    print(f\"Generating and analyzing {num_sequences:,} sequences of length {sequence_length:,}...\")\n",
        "    \n",
        "    # Generate sequences with varying GC content\n",
        "    gc_contents = np.random.uniform(0.3, 0.7, num_sequences)  # Realistic range\n",
        "    \n",
        "    sequence_analyses = []\n",
        "    \n",
        "    for i, gc_content in enumerate(gc_contents):\n",
        "        if i % 100 == 0:\n",
        "            print(f\"Analyzing sequence {i+1}/{num_sequences}...\")\n",
        "        \n",
        "        sequence = generate_sequence(sequence_length, gc_content)\n",
        "        analysis = analyze_sequence(sequence)\n",
        "        analysis['target_gc_content'] = gc_content\n",
        "        analysis['sequence_id'] = i\n",
        "        sequence_analyses.append(analysis)\n",
        "    \n",
        "    # Aggregate statistics\n",
        "    gc_contents_actual = [s['gc_content'] for s in sequence_analyses]\n",
        "    entropies = [s['entropy'] for s in sequence_analyses]\n",
        "    \n",
        "    # Motif statistics\n",
        "    all_motif_counts = {motif: [s['motif_counts'][motif] for s in sequence_analyses] \n",
        "                       for motif in sequence_analyses[0]['motif_counts'].keys()}\n",
        "    \n",
        "    aggregate_results = {\n",
        "        'num_sequences_analyzed': len(sequence_analyses),\n",
        "        'total_bases_analyzed': len(sequence_analyses) * sequence_length,\n",
        "        'gc_content_stats': {\n",
        "            'mean': np.mean(gc_contents_actual),\n",
        "            'std': np.std(gc_contents_actual),\n",
        "            'min': np.min(gc_contents_actual),\n",
        "            'max': np.max(gc_contents_actual)\n",
        "        },\n",
        "        'entropy_stats': {\n",
        "            'mean': np.mean(entropies),\n",
        "            'std': np.std(entropies),\n",
        "            'min': np.min(entropies),\n",
        "            'max': np.max(entropies)\n",
        "        },\n",
        "        'motif_statistics': {\n",
        "            motif: {\n",
        "                'total_found': sum(counts),\n",
        "                'mean_per_sequence': np.mean(counts),\n",
        "                'std_per_sequence': np.std(counts),\n",
        "                'sequences_with_motif': sum(1 for c in counts if c > 0)\n",
        "            } for motif, counts in all_motif_counts.items()\n",
        "        },\n",
        "        'individual_analyses': sequence_analyses[:10]  # Return first 10 for inspection\n",
        "    }\n",
        "    \n",
        "    return aggregate_results\n",
        "\n",
        "# Analyze genome sequences\n",
        "genome_results = analyze_genome_sequences(num_sequences=500, sequence_length=5000)\n",
        "\n",
        "print(f\"\\nGENOME SEQUENCE ANALYSIS COMPLETE\")\n",
        "print(f\"Sequences analyzed: {genome_results['num_sequences_analyzed']:,}\")\n",
        "print(f\"Total bases: {genome_results['total_bases_analyzed']:,}\")\n",
        "\n",
        "print(\"\\nGC Content Statistics:\")\n",
        "gc_stats = genome_results['gc_content_stats']\n",
        "print(f\"  Mean: {gc_stats['mean']:.3f} ± {gc_stats['std']:.3f}\")\n",
        "print(f\"  Range: {gc_stats['min']:.3f} - {gc_stats['max']:.3f}\")\n",
        "\n",
        "print(\"\\nSequence Complexity (Entropy):\")\n",
        "entropy_stats = genome_results['entropy_stats']\n",
        "print(f\"  Mean: {entropy_stats['mean']:.3f} ± {entropy_stats['std']:.3f}\")\n",
        "print(f\"  Range: {entropy_stats['min']:.3f} - {entropy_stats['max']:.3f}\")\n",
        "\n",
        "print(\"\\nMotif Analysis:\")\n",
        "for motif, stats in genome_results['motif_statistics'].items():\n",
        "    print(f\"  {motif}: {stats['total_found']} total, \"\n",
        "          f\"{stats['mean_per_sequence']:.1f}±{stats['std_per_sequence']:.1f} per sequence, \"\n",
        "          f\"{stats['sequences_with_motif']} sequences contain motif\")"
      ],
      "id": "cell-14"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Advanced SLURM Features\n",
        "\n",
        "### Job Arrays for Parameter Sweeps\n",
        "\n",
        "Use SLURM job arrays to efficiently run parameter sweeps:"
      ],
      "id": "cell-15"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "@cluster(\n",
        "    cores=4,\n",
        "    memory=\"16GB\",\n",
        "    time=\"00:30:00\",\n",
        "    array=\"1-10\"  # SLURM job array with 10 tasks\n",
        ")\n",
        "def parameter_sweep_simulation(base_params):\n",
        "    \"\"\"\n",
        "    Run simulation with parameter variations using SLURM job arrays.\n",
        "    Each array task will run with different parameters.\n",
        "    \"\"\"\n",
        "    import os\n",
        "    import numpy as np\n",
        "    \n",
        "    # Get SLURM array task ID\n",
        "    task_id = int(os.environ.get('SLURM_ARRAY_TASK_ID', '1'))\n",
        "    \n",
        "    # Define parameter variations\n",
        "    learning_rates = np.logspace(-4, -1, 10)  # 10 different learning rates\n",
        "    learning_rate = learning_rates[task_id - 1]  # SLURM arrays start from 1\n",
        "    \n",
        "    # Update parameters\n",
        "    params = base_params.copy()\n",
        "    params['learning_rate'] = learning_rate\n",
        "    params['task_id'] = task_id\n",
        "    \n",
        "    print(f\"Task {task_id}: Running with learning_rate = {learning_rate:.6f}\")\n",
        "    \n",
        "    # Simulate training process\n",
        "    np.random.seed(task_id * 42)  # Reproducible but different per task\n",
        "    \n",
        "    losses = []\n",
        "    current_loss = 10.0  # Starting loss\n",
        "    \n",
        "    for epoch in range(params['epochs']):\n",
        "        # Simulate gradient descent\n",
        "        gradient = np.random.normal(0, 0.1) + 0.1 * current_loss\n",
        "        current_loss -= learning_rate * gradient\n",
        "        current_loss = max(0.01, current_loss)  # Prevent negative loss\n",
        "        losses.append(current_loss)\n",
        "    \n",
        "    final_loss = losses[-1]\n",
        "    convergence_epoch = next((i for i, loss in enumerate(losses) if loss < 0.1), len(losses))\n",
        "    \n",
        "    return {\n",
        "        'task_id': task_id,\n",
        "        'learning_rate': learning_rate,\n",
        "        'final_loss': final_loss,\n",
        "        'convergence_epoch': convergence_epoch,\n",
        "        'loss_history': losses[::10],  # Every 10th loss for brevity\n",
        "        'converged': final_loss < 0.1\n",
        "    }\n",
        "\n",
        "# Run parameter sweep\n",
        "base_parameters = {\n",
        "    'epochs': 1000,\n",
        "    'batch_size': 32,\n",
        "    'model_size': 'medium'\n",
        "}\n",
        "\n",
        "# This will submit a SLURM job array with 10 tasks\n",
        "sweep_results = parameter_sweep_simulation(base_parameters)\n",
        "\n",
        "print(f\"Parameter sweep completed for task {sweep_results['task_id']}\")\n",
        "print(f\"Learning rate: {sweep_results['learning_rate']:.6f}\")\n",
        "print(f\"Final loss: {sweep_results['final_loss']:.4f}\")\n",
        "print(f\"Converged: {sweep_results['converged']}\")\n",
        "if sweep_results['converged']:\n",
        "    print(f\"Convergence epoch: {sweep_results['convergence_epoch']}\")"
      ],
      "id": "cell-16"
    },
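    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "The mapping from array task ID to parameter value can be tested locally by setting `SLURM_ARRAY_TASK_ID` by hand. This is a minimal sketch: `learning_rate_for_task` is a hypothetical helper, and the bounds check is an addition that is not in the function above.\n",
        "\n",
        "```python\n",
        "import os\n",
        "import numpy as np\n",
        "\n",
        "def learning_rate_for_task(n_tasks=10):\n",
        "    \"\"\"Pick the learning rate for the current (1-based) SLURM array task.\"\"\"\n",
        "    task_id = int(os.environ.get('SLURM_ARRAY_TASK_ID', '1'))\n",
        "    if not 1 <= task_id <= n_tasks:\n",
        "        raise ValueError(f'task_id {task_id} is outside 1-{n_tasks}')\n",
        "    learning_rates = np.logspace(-4, -1, n_tasks)\n",
        "    return task_id, float(learning_rates[task_id - 1])\n",
        "\n",
        "# Simulate what SLURM would set for the third array task\n",
        "os.environ['SLURM_ARRAY_TASK_ID'] = '3'\n",
        "print(learning_rate_for_task())\n",
        "\n",
        "# Clean up so later cells are not affected\n",
        "os.environ.pop('SLURM_ARRAY_TASK_ID')\n",
        "```\n",
        "\n",
        "Checking the index range up front turns a mismatch between the `array` specification and the parameter grid into a clear error instead of an `IndexError` or a silently wrapped negative index."
      ],
      "id": "cell-16a"
    },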
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Monitoring and Debugging\n",
        "\n",
        "Use Clustrix's built-in monitoring capabilities:"
      ],
      "id": "cell-17"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "from clustrix import ClusterExecutor\n",
        "\n",
        "# Get the configured executor\n",
        "config = clustrix.get_config()\n",
        "executor = ClusterExecutor(config)\n",
        "\n",
        "# Check cluster connectivity\n",
        "try:\n",
        "    executor.connect()\n",
        "    print(\"✓ Successfully connected to SLURM cluster\")\n",
        "    \n",
        "    # Test basic command execution\n",
        "    stdout, stderr = executor._execute_command(\"sinfo --version\")\n",
        "    print(f\"✓ SLURM version: {stdout.strip()}\")\n",
        "    \n",
        "    # Check available partitions\n",
        "    stdout, stderr = executor._execute_command(\"sinfo -h -o '%P %A %l'\")\n",
        "    print(\"\\nAvailable partitions:\")\n",
        "    for line in stdout.strip().split('\\n')[:5]:  # Show first 5 partitions\n",
        "        parts = line.split()\n",
        "        if len(parts) >= 3:\n",
        "            partition, avail, timelimit = parts[0], parts[1], parts[2]\n",
        "            print(f\"  {partition}: {avail} nodes available, time limit: {timelimit}\")\n",
        "    \n",
        "    executor.disconnect()\n",
        "    print(\"\\n✓ Connection test completed successfully\")\n",
        "    \n",
        "except Exception as e:\n",
        "    print(f\"✗ Connection failed: {e}\")\n",
        "    print(\"Please check your cluster configuration and SSH setup\")"
      ],
      "id": "cell-18"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Configuration Best Practices\n",
        "\n",
        "### 1. Environment-Specific Configuration\n",
        "\n",
        "Create different configurations for different environments:"
      ],
      "id": "cell-19"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "# Development configuration (smaller resources)\n",
        "dev_config = {\n",
        "    'cluster_type': 'slurm',\n",
        "    'cluster_host': 'dev-cluster.university.edu',\n",
        "    'username': 'your-username',\n",
        "    'default_cores': 2,\n",
        "    'default_memory': '4GB',\n",
        "    'default_time': '00:15:00',\n",
        "    'default_partition': 'debug',\n",
        "    'max_parallel_jobs': 5\n",
        "}\n",
        "\n",
        "# Production configuration (larger resources)\n",
        "prod_config = {\n",
        "    'cluster_type': 'slurm',\n",
        "    'cluster_host': 'hpc-cluster.university.edu',\n",
        "    'username': 'your-username',\n",
        "    'default_cores': 16,\n",
        "    'default_memory': '64GB',\n",
        "    'default_time': '04:00:00',\n",
        "    'default_partition': 'normal',\n",
        "    'max_parallel_jobs': 50\n",
        "}\n",
        "\n",
        "# Choose configuration based on environment\n",
        "import os\n",
        "environment = os.environ.get('CLUSTRIX_ENV', 'development')\n",
        "\n",
        "if environment == 'production':\n",
        "    clustrix.configure(**prod_config)\n",
        "    print(\"Configured for production environment\")\n",
        "else:\n",
        "    clustrix.configure(**dev_config)\n",
        "    print(\"Configured for development environment\")"
      ],
      "id": "cell-20"
    },
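    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "The two dictionaries above repeat several keys. One way to reduce that duplication is to layer per-environment overrides on a shared base, as in the minimal sketch below. `BASE_CONFIG`, `ENV_OVERRIDES`, and `build_config` are hypothetical names used for illustration and are not part of Clustrix; the values are copied from the cell above. Unlike that cell, the sketch raises an error for an unrecognized environment name instead of falling back to development."
      ],
      "id": "cell-20a"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "# Sketch only: BASE_CONFIG, ENV_OVERRIDES, and build_config are\n",
        "# hypothetical names for illustration, not part of the Clustrix API.\n",
        "BASE_CONFIG = {\n",
        "    'cluster_type': 'slurm',\n",
        "    'username': 'your-username',\n",
        "}\n",
        "\n",
        "ENV_OVERRIDES = {\n",
        "    'development': {\n",
        "        'cluster_host': 'dev-cluster.university.edu',\n",
        "        'default_cores': 2,\n",
        "        'default_memory': '4GB',\n",
        "        'default_time': '00:15:00',\n",
        "        'default_partition': 'debug',\n",
        "        'max_parallel_jobs': 5,\n",
        "    },\n",
        "    'production': {\n",
        "        'cluster_host': 'hpc-cluster.university.edu',\n",
        "        'default_cores': 16,\n",
        "        'default_memory': '64GB',\n",
        "        'default_time': '04:00:00',\n",
        "        'default_partition': 'normal',\n",
        "        'max_parallel_jobs': 50,\n",
        "    },\n",
        "}\n",
        "\n",
        "def build_config(environment):\n",
        "    # Reject unknown names rather than silently picking a default\n",
        "    if environment not in ENV_OVERRIDES:\n",
        "        raise ValueError(f'Unknown environment: {environment}')\n",
        "    # Later keys win, so overrides replace base values on conflict\n",
        "    return {**BASE_CONFIG, **ENV_OVERRIDES[environment]}\n",
        "\n",
        "merged = build_config('production')\n",
        "print(merged['cluster_host'], merged['default_cores'])\n",
        "# The merged dict can then be unpacked as before: clustrix.configure(**merged)"
      ],
      "id": "cell-20b"
    },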
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### 2. Resource Estimation Guidelines\n",
        "\n",
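        "The helper below turns a task type, input size, and complexity level into rough starting values for cores, memory, and wall time. The thresholds are heuristics, so calibrate them against what your jobs actually use:"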
      ],
      "id": "cell-21"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "def estimate_resources(task_type, data_size_mb, complexity='medium'):\n",
        "    \"\"\"\n",
        "    Estimate computational resources needed for different task types.\n",
        "    \"\"\"\n",
        "    \n",
        "    base_configs = {\n",
        "        'data_processing': {\n",
        "            'cores': max(2, min(16, data_size_mb // 100)),\n",
        "            'memory_gb': max(4, min(64, data_size_mb // 10)),\n",
        "            'time_hours': max(0.5, min(8, data_size_mb / 1000))\n",
        "        },\n",
        "        'machine_learning': {\n",
        "            'cores': max(4, min(32, data_size_mb // 50)),\n",
        "            'memory_gb': max(8, min(128, data_size_mb // 5)),\n",
        "            'time_hours': max(1, min(12, data_size_mb / 500))\n",
        "        },\n",
        "        'simulation': {\n",
        "            'cores': max(8, min(64, data_size_mb // 25)),\n",
        "            'memory_gb': max(16, min(256, data_size_mb // 2)),\n",
        "            'time_hours': max(2, min(24, data_size_mb / 100))\n",
        "        },\n",
        "        'bioinformatics': {\n",
        "            'cores': max(4, min(24, data_size_mb // 20)),\n",
        "            'memory_gb': max(16, min(128, data_size_mb // 2)),\n",
        "            'time_hours': max(1, min(16, data_size_mb / 200))\n",
        "        }\n",
        "    }\n",
        "    \n",
        "    if task_type not in base_configs:\n",
        "        raise ValueError(f\"Unknown task type: {task_type}\")\n",
        "    \n",
        "    config = base_configs[task_type].copy()\n",
        "    \n",
        "    # Adjust for complexity\n",
        "    complexity_multipliers = {\n",
        "        'low': 0.7,\n",
        "        'medium': 1.0,\n",
        "        'high': 1.5,\n",
        "        'very_high': 2.0\n",
        "    }\n",
        "    \n",
        "    multiplier = complexity_multipliers.get(complexity, 1.0)\n",
        "    \n",
        "    config['cores'] = int(config['cores'] * multiplier)\n",
        "    config['memory_gb'] = int(config['memory_gb'] * multiplier)\n",
        "    config['time_hours'] = config['time_hours'] * multiplier\n",
        "    \n",
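        "    # Format time as HH:MM:SS. Round to whole minutes first so that\n",
        "    # floating-point error cannot drop a minute (e.g. 17.99... -> 17).\n",
        "    total_minutes = round(config['time_hours'] * 60)\n",
        "    hours, minutes = divmod(total_minutes, 60)\n",
        "    config['time_formatted'] = f\"{hours:02d}:{minutes:02d}:00\"\n",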
        "    \n",
        "    return config\n",
        "\n",
        "# Example usage\n",
        "examples = [\n",
        "    ('machine_learning', 1000, 'high'),\n",
        "    ('data_processing', 5000, 'medium'),\n",
        "    ('simulation', 100, 'very_high'),\n",
        "    ('bioinformatics', 2000, 'high')\n",
        "]\n",
        "\n",
        "print(\"Resource Estimation Examples:\")\n",
        "print(\"=\" * 80)\n",
        "\n",
        "for task_type, data_size, complexity in examples:\n",
        "    resources = estimate_resources(task_type, data_size, complexity)\n",
        "    print(f\"\\n{task_type.replace('_', ' ').title()} ({data_size} MB, {complexity} complexity):\")\n",
        "    print(f\"  Cores: {resources['cores']}\")\n",
        "    print(f\"  Memory: {resources['memory_gb']} GB\")\n",
        "    print(f\"  Time: {resources['time_formatted']} ({resources['time_hours']:.1f} hours)\")"
      ],
      "id": "cell-22"
    },
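    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "An estimate still has to be turned into an actual resource request. The sketch below converts an estimate dictionary into keyword arguments and pads the time limit with a safety buffer. `to_job_kwargs`, `buffer_fraction`, and the sample estimate are hypothetical; the `cores`, `memory`, and `time` key names are assumed to match the `@cluster` decorator's arguments, so check them against your Clustrix version before relying on them."
      ],
      "id": "cell-22a"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "import math\n",
        "\n",
        "def to_job_kwargs(estimate, buffer_fraction=0.25):\n",
        "    \"\"\"\n",
        "    Convert an estimate dict into job keyword arguments (hypothetical helper).\n",
        "\n",
        "    `estimate` needs 'cores', 'memory_gb', and 'time_hours' keys, like the\n",
        "    dict returned by estimate_resources(). The time limit is padded by\n",
        "    `buffer_fraction` and rounded up to a whole minute.\n",
        "    \"\"\"\n",
        "    padded_minutes = math.ceil(estimate['time_hours'] * (1 + buffer_fraction) * 60)\n",
        "    hours, minutes = divmod(padded_minutes, 60)\n",
        "    return {\n",
        "        'cores': estimate['cores'],\n",
        "        'memory': str(estimate['memory_gb']) + 'GB',\n",
        "        'time': f'{hours:02d}:{minutes:02d}:00',\n",
        "    }\n",
        "\n",
        "# Made-up estimate for illustration, not measured on a real cluster\n",
        "sample_estimate = {'cores': 6, 'memory_gb': 48, 'time_hours': 2.0}\n",
        "job_kwargs = to_job_kwargs(sample_estimate)\n",
        "print(job_kwargs)"
      ],
      "id": "cell-22b"
    },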
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Summary\n",
        "\n",
        "This tutorial covered:\n",
        "\n",
        "1. **Basic SLURM Configuration** - Setting up Clustrix for SLURM clusters\n",
        "2. **Simple Computations** - Monte Carlo methods and mathematical functions\n",
        "3. **Machine Learning** - Training models with GPU support\n",
        "4. **Parallel Processing** - Automatic loop distribution across cores\n",
        "5. **Scientific Computing** - High-precision numerical integration\n",
        "6. **Bioinformatics** - Genome sequence analysis\n",
        "7. **Advanced Features** - Job arrays and parameter sweeps\n",
        "8. **Monitoring** - Connection testing and debugging\n",
        "9. **Best Practices** - Resource estimation and configuration management\n",
        "\n",
        "### Key Takeaways:\n",
        "\n",
        "- **Resource Planning**: Always estimate resources based on your data size and complexity\n",
        "- **Partition Selection**: Choose appropriate SLURM partitions for your workload\n",
        "- **Time Limits**: Set realistic time limits with a buffer, since SLURM terminates jobs that exceed their limit\n",
        "- **Memory Management**: Monitor actual memory usage and adjust your requests accordingly\n",
        "- **Parallel Efficiency**: Use automatic parallelization for loop-heavy computations\n",
        "- **Error Handling**: Always test connectivity and handle failures gracefully\n",
        "\n",
        "### Next Steps:\n",
        "\n",
        "- Check out the [PBS Tutorial](pbs_tutorial.ipynb) for Torque/PBS clusters\n",
        "- Explore the [Kubernetes Tutorial](kubernetes_tutorial.ipynb) for containerized computing\n",
        "- Review the [SSH Setup Guide](../ssh_setup.rst) for secure authentication\n",
        "- Read the [API Documentation](../api/decorator.rst) for advanced decorator options\n",
        "\n",
        "For more information, visit the [Clustrix Documentation](https://clustrix.readthedocs.io)."
      ],
      "id": "cell-23"
    }
  ],
  "metadata": {
    "kernelspec": {
      "display_name": "Python 3",
      "language": "python",
      "name": "python3"
    },
    "language_info": {
      "codemirror_mode": {
        "name": "ipython",
        "version": 3
      },
      "file_extension": ".py",
      "mimetype": "text/x-python",
      "name": "python",
      "nbconvert_exporter": "python",
      "pygments_lexer": "ipython3",
      "version": "3.9.0"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 4
}