{
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "# PBS/Torque Cluster Tutorial\n",
        "\n",
        "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/ContextLab/clustrix/blob/master/docs/notebooks/pbs_tutorial.ipynb)\n",
        "\n",
        "This tutorial demonstrates how to use Clustrix with PBS (Portable Batch System) and Torque clusters. PBS is widely used in academic and research computing environments.\n",
        "\n",
        "## Prerequisites\n",
        "\n",
        "- Access to a PBS/Torque cluster\n",
        "- SSH key configured for the cluster\n",
        "- Clustrix installed: `pip install clustrix`"
      ],
      "id": "cell-0"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Installation and Setup"
      ],
      "id": "cell-1"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "# Install Clustrix (uncomment if needed)\n",
        "# !pip install clustrix\n",
        "\n",
        "import clustrix\n",
        "from clustrix import cluster, configure\n",
        "import numpy as np\n",
        "import pandas as pd"
      ],
      "id": "cell-2"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## PBS Cluster Configuration\n",
        "\n",
        "Configure Clustrix for your PBS/Torque cluster:"
      ],
      "id": "cell-3"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "# Configure for PBS cluster\n",
        "configure(\n",
        "    cluster_type=\"pbs\",\n",
        "    cluster_host=\"pbs-cluster.university.edu\",  # Replace with your cluster\n",
        "    username=\"your-username\",                   # Replace with your username\n",
        "    key_file=\"~/.ssh/id_rsa\",                  # Path to SSH key\n",
        "    \n",
        "    # Default PBS resource requirements\n",
        "    default_cores=4,\n",
        "    default_memory=\"16GB\",\n",
        "    default_time=\"02:00:00\",\n",
        "    default_queue=\"normal\",                     # PBS queue name\n",
        "    \n",
        "    # PBS-specific options\n",
        "    remote_work_dir=\"/home/your-username/clustrix\",  # Adjust for your cluster\n",
        "    \n",
        "    # Environment setup\n",
        "    module_loads=[\"python/3.9\", \"openmpi/4.0\"],  # Common PBS modules\n",
        "    \n",
        "    # Job management\n",
        "    cleanup_on_success=True,\n",
        "    max_parallel_jobs=25\n",
        ")\n",
        "\n",
        "print(\"PBS cluster configured successfully!\")"
      ],
      "id": "cell-4"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Example 1: Bioinformatics - DNA Sequence Analysis\n",
        "\n",
        "PBS clusters are popular in bioinformatics. Let's analyze DNA sequences:"
      ],
      "id": "cell-5"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "@cluster(\n",
        "    cores=8, \n",
        "    memory=\"32GB\", \n",
        "    time=\"03:00:00\", \n",
        "    queue=\"bioqueue\",  # Specialized bioinformatics queue\n",
        "    walltime=\"03:00:00\"  # PBS uses 'walltime' parameter\n",
        ")\n",
        "def analyze_dna_sequences(sequences, analysis_type=\"comprehensive\"):\n",
        "    \"\"\"\n",
        "    Comprehensive DNA sequence analysis for bioinformatics research.\n",
        "    \"\"\"\n",
        "    import numpy as np\n",
        "    import random\n",
        "    from collections import Counter, defaultdict\n",
        "    import re\n",
        "    import math\n",
        "    \n",
        "    def calculate_gc_content(sequence):\n",
        "        \"\"\"Calculate GC content percentage\"\"\"\n",
        "        gc_count = sequence.count('G') + sequence.count('C')\n",
        "        return (gc_count / len(sequence)) * 100 if sequence else 0\n",
        "    \n",
        "    def find_orfs(sequence, min_length=100):\n",
        "        \"\"\"Find Open Reading Frames (ORFs)\"\"\"\n",
        "        start_codon = 'ATG'\n",
        "        stop_codons = ['TAA', 'TAG', 'TGA']\n",
        "        orfs = []\n",
        "        \n",
        "        for frame in range(3):  # Check all 3 reading frames\n",
        "            for i in range(frame, len(sequence) - 2, 3):\n",
        "                codon = sequence[i:i+3]\n",
        "                if codon == start_codon:\n",
        "                    # Look for stop codon\n",
        "                    for j in range(i+3, len(sequence) - 2, 3):\n",
        "                        stop_codon = sequence[j:j+3]\n",
        "                        if stop_codon in stop_codons:\n",
        "                            orf_length = j - i + 3\n",
        "                            if orf_length >= min_length:\n",
        "                                orfs.append({\n",
        "                                    'start': i,\n",
        "                                    'end': j + 3,\n",
        "                                    'length': orf_length,\n",
        "                                    'frame': frame + 1,\n",
        "                                    'sequence': sequence[i:j+3]\n",
        "                                })\n",
        "                            break\n",
        "        return orfs\n",
        "    \n",
        "    def analyze_codon_usage(sequence):\n",
        "        \"\"\"Analyze codon usage patterns\"\"\"\n",
        "        codons = [sequence[i:i+3] for i in range(0, len(sequence)-2, 3) \n",
        "                 if len(sequence[i:i+3]) == 3]\n",
        "        codon_counts = Counter(codons)\n",
        "        \n",
        "        # Standard genetic code mapping\n",
        "        genetic_code = {\n",
        "            'TTT': 'F', 'TTC': 'F', 'TTA': 'L', 'TTG': 'L',\n",
        "            'TCT': 'S', 'TCC': 'S', 'TCA': 'S', 'TCG': 'S',\n",
        "            'TAT': 'Y', 'TAC': 'Y', 'TAA': '*', 'TAG': '*',\n",
        "            'TGT': 'C', 'TGC': 'C', 'TGA': '*', 'TGG': 'W',\n",
        "            'CTT': 'L', 'CTC': 'L', 'CTA': 'L', 'CTG': 'L',\n",
        "            'CCT': 'P', 'CCC': 'P', 'CCA': 'P', 'CCG': 'P',\n",
        "            'CAT': 'H', 'CAC': 'H', 'CAA': 'Q', 'CAG': 'Q',\n",
        "            'CGT': 'R', 'CGC': 'R', 'CGA': 'R', 'CGG': 'R',\n",
        "            'ATT': 'I', 'ATC': 'I', 'ATA': 'I', 'ATG': 'M',\n",
        "            'ACT': 'T', 'ACC': 'T', 'ACA': 'T', 'ACG': 'T',\n",
        "            'AAT': 'N', 'AAC': 'N', 'AAA': 'K', 'AAG': 'K',\n",
        "            'AGT': 'S', 'AGC': 'S', 'AGA': 'R', 'AGG': 'R',\n",
        "            'GTT': 'V', 'GTC': 'V', 'GTA': 'V', 'GTG': 'V',\n",
        "            'GCT': 'A', 'GCC': 'A', 'GCA': 'A', 'GCG': 'A',\n",
        "            'GAT': 'D', 'GAC': 'D', 'GAA': 'E', 'GAG': 'E',\n",
        "            'GGT': 'G', 'GGC': 'G', 'GGA': 'G', 'GGG': 'G'\n",
        "        }\n",
        "        \n",
        "        amino_acid_counts = defaultdict(int)\n",
        "        for codon, count in codon_counts.items():\n",
        "            if codon in genetic_code:\n",
        "                amino_acid_counts[genetic_code[codon]] += count\n",
        "        \n",
        "        return dict(codon_counts), dict(amino_acid_counts)\n",
        "    \n",
        "    def find_tandem_repeats(sequence, min_repeat_length=3, max_repeat_length=20):\n",
        "        \"\"\"Find tandem repeats in DNA sequence\"\"\"\n",
        "        repeats = []\n",
        "        \n",
        "        for repeat_len in range(min_repeat_length, max_repeat_length + 1):\n",
        "            for i in range(len(sequence) - repeat_len * 2 + 1):\n",
        "                motif = sequence[i:i + repeat_len]\n",
        "                count = 1\n",
        "                j = i + repeat_len\n",
        "                \n",
        "                while j + repeat_len <= len(sequence) and sequence[j:j + repeat_len] == motif:\n",
        "                    count += 1\n",
        "                    j += repeat_len\n",
        "                \n",
        "                if count >= 3:  # At least 3 repeats\n",
        "                    repeats.append({\n",
        "                        'motif': motif,\n",
        "                        'start': i,\n",
        "                        'end': j,\n",
        "                        'repeat_count': count,\n",
        "                        'total_length': j - i\n",
        "                    })\n",
        "        \n",
        "        return repeats\n",
        "    \n",
        "    # Main analysis loop\n",
        "    results = []\n",
        "    \n",
        "    for seq_idx, sequence in enumerate(sequences):\n",
        "        print(f\"Analyzing sequence {seq_idx + 1}/{len(sequences)} (length: {len(sequence)})...\")\n",
        "        \n",
        "        # Basic composition analysis\n",
        "        base_composition = Counter(sequence)\n",
        "        gc_content = calculate_gc_content(sequence)\n",
        "        \n",
        "        # Advanced analyses\n",
        "        orfs = find_orfs(sequence, min_length=150)\n",
        "        codon_usage, amino_acid_freq = analyze_codon_usage(sequence)\n",
        "        tandem_repeats = find_tandem_repeats(sequence)\n",
        "        \n",
        "        # CpG island detection (simplified)\n",
        "        cpg_sites = len(re.findall('CG', sequence))\n",
        "        cpg_density = (cpg_sites / (len(sequence) - 1)) * 100 if len(sequence) > 1 else 0\n",
        "        \n",
        "        # Complexity analysis\n",
        "        def calculate_complexity(seq, window_size=50):\n",
        "            complexities = []\n",
        "            for i in range(0, len(seq) - window_size + 1, window_size):\n",
        "                window = seq[i:i + window_size]\n",
        "                counter = Counter(window)\n",
        "                entropy = -sum((count/window_size) * math.log2(count/window_size) \n",
        "                              for count in counter.values() if count > 0)\n",
        "                complexities.append(entropy)\n",
        "            return np.mean(complexities) if complexities else 0\n",
        "        \n",
        "        complexity = calculate_complexity(sequence)\n",
        "        \n",
        "        sequence_result = {\n",
        "            'sequence_id': seq_idx,\n",
        "            'length': len(sequence),\n",
        "            'base_composition': dict(base_composition),\n",
        "            'gc_content': gc_content,\n",
        "            'complexity': complexity,\n",
        "            'orfs_found': len(orfs),\n",
        "            'longest_orf': max(orfs, key=lambda x: x['length'])['length'] if orfs else 0,\n",
        "            'cpg_sites': cpg_sites,\n",
        "            'cpg_density': cpg_density,\n",
        "            'tandem_repeats': len(tandem_repeats),\n",
        "            'repeat_details': tandem_repeats[:5],  # Keep first 5 for analysis\n",
        "            'codon_diversity': len(codon_usage),\n",
        "            'amino_acid_diversity': len(amino_acid_freq),\n",
        "            'most_common_amino_acid': max(amino_acid_freq.items(), key=lambda x: x[1])[0] if amino_acid_freq else 'N/A'\n",
        "        }\n",
        "        \n",
        "        results.append(sequence_result)\n",
        "    \n",
        "    # Aggregate statistics\n",
        "    aggregate_stats = {\n",
        "        'total_sequences': len(results),\n",
        "        'total_base_pairs': sum(r['length'] for r in results),\n",
        "        'average_gc_content': np.mean([r['gc_content'] for r in results]),\n",
        "        'gc_content_std': np.std([r['gc_content'] for r in results]),\n",
        "        'average_complexity': np.mean([r['complexity'] for r in results]),\n",
        "        'total_orfs_found': sum(r['orfs_found'] for r in results),\n",
        "        'total_cpg_sites': sum(r['cpg_sites'] for r in results),\n",
        "        'sequences_with_repeats': sum(1 for r in results if r['tandem_repeats'] > 0),\n",
        "        'individual_results': results\n",
        "    }\n",
        "    \n",
        "    return aggregate_stats\n",
        "\n",
        "# Generate sample DNA sequences for analysis\n",
        "def generate_realistic_dna(length, gc_content=0.5):\n",
        "    \"\"\"Generate realistic DNA sequences with specific GC content\"\"\"\n",
        "    bases = ['A', 'T', 'G', 'C']\n",
        "    gc_prob = gc_content / 2\n",
        "    at_prob = (1 - gc_content) / 2\n",
        "    probs = [at_prob, at_prob, gc_prob, gc_prob]\n",
        "    \n",
        "    return ''.join(np.random.choice(bases, size=length, p=probs))\n",
        "\n",
        "# Create test sequences\n",
        "test_sequences = [\n",
        "    generate_realistic_dna(5000, 0.4),   # AT-rich\n",
        "    generate_realistic_dna(8000, 0.6),   # GC-rich\n",
        "    generate_realistic_dna(3000, 0.5),   # Balanced\n",
        "    generate_realistic_dna(12000, 0.45), # Large AT-rich\n",
        "    generate_realistic_dna(6000, 0.55)   # Medium GC-rich\n",
        "]\n",
        "\n",
        "# Run analysis on PBS cluster\n",
        "bio_results = analyze_dna_sequences(test_sequences, analysis_type=\"comprehensive\")\n",
        "\n",
        "print(f\"\\nBIOINFORMATICS ANALYSIS COMPLETE\")\n",
        "print(f\"Sequences analyzed: {bio_results['total_sequences']}\")\n",
        "print(f\"Total base pairs: {bio_results['total_base_pairs']:,}\")\n",
        "print(f\"Average GC content: {bio_results['average_gc_content']:.2f}% ± {bio_results['gc_content_std']:.2f}%\")\n",
        "print(f\"Total ORFs found: {bio_results['total_orfs_found']}\")\n",
        "print(f\"Total CpG sites: {bio_results['total_cpg_sites']}\")\n",
        "print(f\"Sequences with tandem repeats: {bio_results['sequences_with_repeats']}/{bio_results['total_sequences']}\")"
      ],
      "id": "cell-6"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Example 2: Materials Science - Molecular Dynamics Simulation\n",
        "\n",
        "Simulate molecular systems commonly done on PBS clusters:"
      ],
      "id": "cell-7"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "@cluster(\n",
        "    cores=16,\n",
        "    memory=\"64GB\",\n",
        "    time=\"06:00:00\",\n",
        "    queue=\"physics\",\n",
        "    features=\"infiniband\"  # PBS feature for high-speed networking\n",
        ")\n",
        "def molecular_dynamics_simulation(n_particles=10000, n_steps=100000, temperature=300.0):\n",
        "    \"\"\"\n",
        "    Simplified molecular dynamics simulation for materials science.\n",
        "    \"\"\"\n",
        "    import numpy as np\n",
        "    import math\n",
        "    \n",
        "    # Physical constants\n",
        "    kb = 1.380649e-23  # Boltzmann constant (J/K)\n",
        "    mass = 1.66054e-27  # Approximate atomic mass (kg)\n",
        "    dt = 1e-15  # Time step (s)\n",
        "    sigma = 3.4e-10  # Lennard-Jones parameter (m)\n",
        "    epsilon = 1.65e-21  # Lennard-Jones parameter (J)\n",
        "    \n",
        "    print(f\"Starting MD simulation with {n_particles:,} particles for {n_steps:,} steps...\")\n",
        "    print(f\"Temperature: {temperature} K\")\n",
        "    \n",
        "    # Initialize system\n",
        "    box_size = (n_particles / 0.8) ** (1/3) * sigma  # Density ~0.8\n",
        "    \n",
        "    # Random initial positions\n",
        "    positions = np.random.uniform(0, box_size, (n_particles, 3))\n",
        "    \n",
        "    # Maxwell-Boltzmann velocity distribution\n",
        "    velocity_scale = math.sqrt(kb * temperature / mass)\n",
        "    velocities = np.random.normal(0, velocity_scale, (n_particles, 3))\n",
        "    \n",
        "    # Remove center of mass motion\n",
        "    velocities -= np.mean(velocities, axis=0)\n",
        "    \n",
        "    # Storage for analysis\n",
        "    energies = []\n",
        "    temperatures = []\n",
        "    pressures = []\n",
        "    radial_distribution = []\n",
        "    \n",
        "    def lennard_jones_force(r):\n",
        "        \"\"\"Calculate Lennard-Jones force\"\"\"\n",
        "        if r < 1e-12:  # Avoid division by zero\n",
        "            return 0\n",
        "        sr6 = (sigma / r) ** 6\n",
        "        sr12 = sr6 ** 2\n",
        "        return 24 * epsilon * (2 * sr12 - sr6) / r\n",
        "    \n",
        "    def calculate_forces(pos):\n",
        "        \"\"\"Calculate forces on all particles\"\"\"\n",
        "        forces = np.zeros_like(pos)\n",
        "        potential_energy = 0\n",
        "        \n",
        "        for i in range(n_particles):\n",
        "            for j in range(i + 1, n_particles):\n",
        "                # Distance vector with periodic boundary conditions\n",
        "                dr = pos[j] - pos[i]\n",
        "                dr = dr - box_size * np.round(dr / box_size)\n",
        "                r = np.linalg.norm(dr)\n",
        "                \n",
        "                if r < 2.5 * sigma:  # Cutoff distance\n",
        "                    force_magnitude = lennard_jones_force(r)\n",
        "                    force_vector = force_magnitude * dr / r\n",
        "                    \n",
        "                    forces[i] += force_vector\n",
        "                    forces[j] -= force_vector\n",
        "                    \n",
        "                    # Potential energy\n",
        "                    sr6 = (sigma / r) ** 6\n",
        "                    sr12 = sr6 ** 2\n",
        "                    potential_energy += 4 * epsilon * (sr12 - sr6)\n",
        "        \n",
        "        return forces, potential_energy\n",
        "    \n",
        "    def calculate_temperature(vel):\n",
        "        \"\"\"Calculate instantaneous temperature\"\"\"\n",
        "        kinetic_energy = 0.5 * mass * np.sum(vel ** 2)\n",
        "        return 2 * kinetic_energy / (3 * n_particles * kb)\n",
        "    \n",
        "    def calculate_pressure(pos, forces):\n",
        "        \"\"\"Calculate pressure using virial theorem\"\"\"\n",
        "        kinetic_term = n_particles * kb * calculate_temperature(velocities)\n",
        "        virial = np.sum(positions * forces)\n",
        "        volume = box_size ** 3\n",
        "        return (kinetic_term + virial/3) / volume\n",
        "    \n",
        "    # Main simulation loop\n",
        "    for step in range(n_steps):\n",
        "        if step % (n_steps // 10) == 0:\n",
        "            print(f\"Step {step:,}/{n_steps:,} ({100*step/n_steps:.1f}%)\")\n",
        "        \n",
        "        # Calculate forces\n",
        "        forces, potential_energy = calculate_forces(positions)\n",
        "        \n",
        "        # Velocity Verlet integration\n",
        "        # Update positions\n",
        "        positions += velocities * dt + 0.5 * forces / mass * dt ** 2\n",
        "        \n",
        "        # Apply periodic boundary conditions\n",
        "        positions = positions % box_size\n",
        "        \n",
        "        # Update velocities\n",
        "        new_forces, _ = calculate_forces(positions)\n",
        "        velocities += 0.5 * (forces + new_forces) / mass * dt\n",
        "        \n",
        "        # Calculate thermodynamic properties\n",
        "        if step % 1000 == 0:  # Sample every 1000 steps\n",
        "            kinetic_energy = 0.5 * mass * np.sum(velocities ** 2)\n",
        "            total_energy = kinetic_energy + potential_energy\n",
        "            temp = calculate_temperature(velocities)\n",
        "            pressure = calculate_pressure(positions, new_forces)\n",
        "            \n",
        "            energies.append({\n",
        "                'step': step,\n",
        "                'kinetic': kinetic_energy,\n",
        "                'potential': potential_energy,\n",
        "                'total': total_energy\n",
        "            })\n",
        "            temperatures.append(temp)\n",
        "            pressures.append(pressure)\n",
        "        \n",
        "        # Simple thermostat (velocity rescaling)\n",
        "        if step % 100 == 0:  # Apply every 100 steps\n",
        "            current_temp = calculate_temperature(velocities)\n",
        "            if current_temp > 0:\n",
        "                scaling_factor = math.sqrt(temperature / current_temp)\n",
        "                velocities *= scaling_factor\n",
        "    \n",
        "    # Calculate radial distribution function (simplified)\n",
        "    def calculate_rdf(pos, n_bins=100, max_r=None):\n",
        "        if max_r is None:\n",
        "            max_r = box_size / 2\n",
        "        \n",
        "        bin_width = max_r / n_bins\n",
        "        rdf = np.zeros(n_bins)\n",
        "        \n",
        "        for i in range(min(1000, n_particles)):  # Sample subset for efficiency\n",
        "            for j in range(i + 1, min(1000, n_particles)):\n",
        "                dr = pos[j] - pos[i]\n",
        "                dr = dr - box_size * np.round(dr / box_size)\n",
        "                r = np.linalg.norm(dr)\n",
        "                \n",
        "                if r < max_r:\n",
        "                    bin_index = int(r / bin_width)\n",
        "                    if bin_index < n_bins:\n",
        "                        rdf[bin_index] += 1\n",
        "        \n",
        "        # Normalize\n",
        "        for i in range(n_bins):\n",
        "            r = (i + 0.5) * bin_width\n",
        "            volume = 4 * math.pi * r ** 2 * bin_width\n",
        "            density = n_particles / box_size ** 3\n",
        "            rdf[i] /= (volume * density * 1000)  # 1000 particles sampled\n",
        "        \n",
        "        return rdf, np.arange(0.5 * bin_width, max_r, bin_width)\n",
        "    \n",
        "    rdf_values, rdf_distances = calculate_rdf(positions)\n",
        "    \n",
        "    # Final analysis\n",
        "    avg_temperature = np.mean(temperatures[-50:])  # Last 50 samples\n",
        "    avg_pressure = np.mean(pressures[-50:])\n",
        "    final_energy = energies[-1]['total'] if energies else 0\n",
        "    \n",
        "    simulation_results = {\n",
        "        'n_particles': n_particles,\n",
        "        'n_steps': n_steps,\n",
        "        'target_temperature': temperature,\n",
        "        'average_temperature': avg_temperature,\n",
        "        'temperature_stability': np.std(temperatures[-50:]),\n",
        "        'average_pressure': avg_pressure,\n",
        "        'final_energy': final_energy,\n",
        "        'box_size': box_size,\n",
        "        'density': n_particles / box_size ** 3,\n",
        "        'energy_trajectory': energies[::10],  # Every 10th point\n",
        "        'temperature_trajectory': temperatures[::10],\n",
        "        'pressure_trajectory': pressures[::10],\n",
        "        'radial_distribution': {\n",
        "            'distances': rdf_distances.tolist(),\n",
        "            'values': rdf_values.tolist()\n",
        "        },\n",
        "        'simulation_time_ns': n_steps * dt * 1e9  # Convert to nanoseconds\n",
        "    }\n",
        "    \n",
        "    return simulation_results\n",
        "\n",
        "# Run molecular dynamics simulation\n",
        "md_results = molecular_dynamics_simulation(\n",
        "    n_particles=5000, \n",
        "    n_steps=50000, \n",
        "    temperature=298.15  # Room temperature\n",
        ")\n",
        "\n",
        "print(f\"\\nMOLECULAR DYNAMICS SIMULATION COMPLETE\")\n",
        "print(f\"Particles: {md_results['n_particles']:,}\")\n",
        "print(f\"Steps: {md_results['n_steps']:,}\")\n",
        "print(f\"Simulation time: {md_results['simulation_time_ns']:.2f} ns\")\n",
        "print(f\"Target temperature: {md_results['target_temperature']:.1f} K\")\n",
        "print(f\"Average temperature: {md_results['average_temperature']:.1f} K\")\n",
        "print(f\"Temperature stability: ±{md_results['temperature_stability']:.1f} K\")\n",
        "print(f\"Average pressure: {md_results['average_pressure']:.2e} Pa\")\n",
        "print(f\"System density: {md_results['density']:.2e} particles/m³\")"
      ],
      "id": "cell-8"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Example 3: Environmental Science - Climate Data Analysis\n",
        "\n",
        "Analyze large climate datasets commonly processed on research clusters:"
      ],
      "id": "cell-9"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "@cluster(\n",
        "    cores=12,\n",
        "    memory=\"48GB\",\n",
        "    time=\"04:00:00\",\n",
        "    queue=\"climate\",\n",
        "    parallel=True  # Enable automatic parallelization\n",
        ")\n",
        "def analyze_climate_data(years_to_analyze=50, stations_per_year=1000):\n",
        "    \"\"\"\n",
        "    Comprehensive climate data analysis for environmental research.\n",
        "    \"\"\"\n",
        "    import numpy as np\n",
        "    import pandas as pd\n",
        "    from datetime import datetime, timedelta\n",
        "    import random\n",
        "    from scipy import stats\n",
        "    import math\n",
        "    \n",
        "    def generate_realistic_climate_data(year, station_id, latitude, longitude):\n",
        "        \"\"\"Generate realistic climate data for a station\"\"\"\n",
        "        np.random.seed(year * 1000 + station_id)  # Reproducible but varied\n",
        "        \n",
        "        # Base temperature influenced by latitude\n",
        "        base_temp = 25 - abs(latitude) * 0.6  # Cooler at higher latitudes\n",
        "        \n",
        "        # Generate daily data for the year\n",
        "        start_date = datetime(year, 1, 1)\n",
        "        days_in_year = 366 if year % 4 == 0 else 365\n",
        "        \n",
        "        daily_data = []\n",
        "        \n",
        "        for day in range(days_in_year):\n",
        "            date = start_date + timedelta(days=day)\n",
        "            day_of_year = day + 1\n",
        "            \n",
        "            # Seasonal temperature variation\n",
        "            seasonal_temp = base_temp + 15 * math.cos(2 * math.pi * (day_of_year - 172) / 365)\n",
        "            \n",
        "            # Add random variation and trends\n",
        "            climate_trend = 0.01 * (year - 1970)  # 0.01°C/year warming\n",
        "            daily_temp = seasonal_temp + climate_trend + np.random.normal(0, 3)\n",
        "            \n",
        "            # Precipitation (higher in tropics and certain seasons)\n",
        "            base_precip = max(0, 10 - abs(latitude) * 0.3)\n",
        "            seasonal_precip_factor = 1 + 0.5 * math.cos(2 * math.pi * (day_of_year - 30) / 365)\n",
        "            daily_precip = max(0, np.random.exponential(base_precip * seasonal_precip_factor))\n",
        "            \n",
        "            # Humidity (correlated with temperature and precipitation)\n",
        "            base_humidity = 60 - abs(latitude) * 0.5\n",
        "            humidity = base_humidity + daily_precip * 0.5 - (daily_temp - base_temp) * 0.3\n",
        "            humidity = max(10, min(100, humidity + np.random.normal(0, 5)))\n",
        "            \n",
        "            # Wind speed (more variable at higher latitudes)\n",
        "            base_wind = 5 + abs(latitude) * 0.1\n",
        "            wind_speed = max(0, np.random.gamma(2, base_wind / 2))\n",
        "            \n",
        "            # Atmospheric pressure (altitude and weather dependent)\n",
        "            base_pressure = 1013.25  # Sea level\n",
        "            pressure = base_pressure + np.random.normal(0, 10)\n",
        "            \n",
        "            daily_data.append({\n",
        "                'date': date,\n",
        "                'temperature': daily_temp,\n",
        "                'precipitation': daily_precip,\n",
        "                'humidity': humidity,\n",
        "                'wind_speed': wind_speed,\n",
        "                'pressure': pressure\n",
        "            })\n",
        "        \n",
        "        return daily_data\n",
        "    \n",
        "    def analyze_station_trends(station_data):\n",
        "        \"\"\"Analyze trends for a single weather station\"\"\"\n",
        "        df = pd.DataFrame(station_data)\n",
        "        \n",
        "        # Calculate annual statistics\n",
        "        annual_stats = {\n",
        "            'mean_temperature': df['temperature'].mean(),\n",
        "            'temperature_range': df['temperature'].max() - df['temperature'].min(),\n",
        "            'total_precipitation': df['precipitation'].sum(),\n",
        "            'mean_humidity': df['humidity'].mean(),\n",
        "            'mean_wind_speed': df['wind_speed'].mean(),\n",
        "            'mean_pressure': df['pressure'].mean(),\n",
        "            'temperature_std': df['temperature'].std(),\n",
        "            'precipitation_days': (df['precipitation'] > 1.0).sum(),\n",
        "            'extreme_heat_days': (df['temperature'] > df['temperature'].quantile(0.95)).sum(),\n",
        "            'extreme_cold_days': (df['temperature'] < df['temperature'].quantile(0.05)).sum()\n",
        "        }\n",
        "        \n",
        "        # Seasonal analysis\n",
        "        df['month'] = df['date'].dt.month\n",
        "        seasonal_temps = df.groupby(df['month'])['temperature'].mean()\n",
        "        seasonal_precip = df.groupby(df['month'])['precipitation'].sum()\n",
        "        \n",
        "        annual_stats['seasonal_temperature_variation'] = seasonal_temps.std()\n",
        "        annual_stats['wettest_month'] = seasonal_precip.idxmax()\n",
        "        annual_stats['driest_month'] = seasonal_precip.idxmin()\n",
        "        \n",
        "        return annual_stats\n",
        "    \n",
        "    print(f\"Analyzing climate data for {years_to_analyze} years, {stations_per_year} stations per year...\")\n",
        "    print(f\"Total data points: {years_to_analyze * stations_per_year * 365:,}\")\n",
        "    \n",
        "    all_station_results = []\n",
        "    \n",
        "    # This loop will be automatically parallelized by Clustrix\n",
        "    for year in range(1970, 1970 + years_to_analyze):\n",
        "        print(f\"Processing year {year}...\")\n",
        "        \n",
        "        year_results = []\n",
        "        \n",
        "        for station_id in range(stations_per_year):\n",
        "            # Generate random station location\n",
        "            latitude = np.random.uniform(-60, 75)  # Inhabitable latitudes\n",
        "            longitude = np.random.uniform(-180, 180)\n",
        "            \n",
        "            # Generate climate data for this station and year\n",
        "            station_data = generate_realistic_climate_data(year, station_id, latitude, longitude)\n",
        "            \n",
        "            # Analyze the station data\n",
        "            station_analysis = analyze_station_trends(station_data)\n",
        "            station_analysis['year'] = year\n",
        "            station_analysis['station_id'] = station_id\n",
        "            station_analysis['latitude'] = latitude\n",
        "            station_analysis['longitude'] = longitude\n",
        "            \n",
        "            year_results.append(station_analysis)\n",
        "        \n",
        "        all_station_results.extend(year_results)\n",
        "    \n",
        "    # Convert to DataFrame for analysis\n",
        "    results_df = pd.DataFrame(all_station_results)\n",
        "    \n",
        "    # Global trend analysis\n",
        "    yearly_global_temps = results_df.groupby('year')['mean_temperature'].mean()\n",
        "    yearly_global_precip = results_df.groupby('year')['total_precipitation'].mean()\n",
        "    \n",
        "    # Calculate trends\n",
        "    years = yearly_global_temps.index\n",
        "    temp_trend, temp_intercept, temp_r_value, temp_p_value, temp_std_err = stats.linregress(years, yearly_global_temps)\n",
        "    precip_trend, precip_intercept, precip_r_value, precip_p_value, precip_std_err = stats.linregress(years, yearly_global_precip)\n",
        "    \n",
        "    # Regional analysis\n",
        "    def classify_climate_zone(lat):\n",
        "        if abs(lat) < 23.5:\n",
        "            return \"Tropical\"\n",
        "        elif abs(lat) < 35:\n",
        "            return \"Subtropical\"\n",
        "        elif abs(lat) < 50:\n",
        "            return \"Temperate\"\n",
        "        else:\n",
        "            return \"Polar\"\n",
        "    \n",
        "    results_df['climate_zone'] = results_df['latitude'].apply(classify_climate_zone)\n",
        "    zone_analysis = results_df.groupby('climate_zone').agg({\n",
        "        'mean_temperature': ['mean', 'std'],\n",
        "        'total_precipitation': ['mean', 'std'],\n",
        "        'temperature_range': 'mean',\n",
        "        'extreme_heat_days': 'mean',\n",
        "        'extreme_cold_days': 'mean'\n",
        "    }).round(2)\n",
        "    \n",
        "    # Extreme events analysis\n",
        "    extreme_heat_threshold = results_df['mean_temperature'].quantile(0.95)\n",
        "    extreme_cold_threshold = results_df['mean_temperature'].quantile(0.05)\n",
        "    drought_threshold = results_df['total_precipitation'].quantile(0.1)\n",
        "    flood_threshold = results_df['total_precipitation'].quantile(0.9)\n",
        "    \n",
        "    extreme_events = {\n",
        "        'extreme_heat_stations': (results_df['mean_temperature'] > extreme_heat_threshold).sum(),\n",
        "        'extreme_cold_stations': (results_df['mean_temperature'] < extreme_cold_threshold).sum(),\n",
        "        'drought_affected_stations': (results_df['total_precipitation'] < drought_threshold).sum(),\n",
        "        'flood_risk_stations': (results_df['total_precipitation'] > flood_threshold).sum()\n",
        "    }\n",
        "    \n",
        "    # Compile final results\n",
        "    climate_analysis = {\n",
        "        'analysis_summary': {\n",
        "            'years_analyzed': years_to_analyze,\n",
        "            'stations_per_year': stations_per_year,\n",
        "            'total_station_years': len(results_df),\n",
        "            'data_points_analyzed': len(results_df) * 365\n",
        "        },\n",
        "        'global_trends': {\n",
        "            'temperature_trend_per_decade': temp_trend * 10,\n",
        "            'temperature_trend_significance': temp_p_value,\n",
        "            'temperature_correlation': temp_r_value ** 2,\n",
        "            'precipitation_trend_per_decade': precip_trend * 10,\n",
        "            'precipitation_trend_significance': precip_p_value,\n",
        "            'precipitation_correlation': precip_r_value ** 2\n",
        "        },\n",
        "        'current_climate_state': {\n",
        "            'global_mean_temperature': yearly_global_temps.iloc[-1],\n",
        "            'global_mean_precipitation': yearly_global_precip.iloc[-1],\n",
        "            'temperature_warming_since_start': yearly_global_temps.iloc[-1] - yearly_global_temps.iloc[0],\n",
        "            'precipitation_change_since_start': yearly_global_precip.iloc[-1] - yearly_global_precip.iloc[0]\n",
        "        },\n",
        "        'regional_analysis': zone_analysis.to_dict(),\n",
        "        'extreme_events': extreme_events,\n",
        "        'statistical_summary': {\n",
        "            'mean_global_temperature': results_df['mean_temperature'].mean(),\n",
        "            'temperature_standard_deviation': results_df['mean_temperature'].std(),\n",
        "            'mean_global_precipitation': results_df['total_precipitation'].mean(),\n",
        "            'precipitation_standard_deviation': results_df['total_precipitation'].std(),\n",
        "            'warmest_station_temp': results_df['mean_temperature'].max(),\n",
        "            'coldest_station_temp': results_df['mean_temperature'].min(),\n",
        "            'wettest_station_precip': results_df['total_precipitation'].max(),\n",
        "            'driest_station_precip': results_df['total_precipitation'].min()\n",
        "        }\n",
        "    }\n",
        "    \n",
        "    return climate_analysis\n",
        "\n",
        "# Run climate analysis\n",
        "climate_results = analyze_climate_data(years_to_analyze=30, stations_per_year=200)\n",
        "\n",
        "print(f\"\\nCLIMATE DATA ANALYSIS COMPLETE\")\n",
        "print(f\"Years analyzed: {climate_results['analysis_summary']['years_analyzed']}\")\n",
        "print(f\"Total station-years: {climate_results['analysis_summary']['total_station_years']:,}\")\n",
        "print(f\"Data points: {climate_results['analysis_summary']['data_points_analyzed']:,}\")\n",
        "\n",
        "print(\"\\nGlobal Trends:\")\n",
        "trends = climate_results['global_trends']\n",
        "print(f\"  Temperature trend: {trends['temperature_trend_per_decade']:.3f}°C per decade (p={trends['temperature_trend_significance']:.4f})\")\n",
        "print(f\"  Precipitation trend: {trends['precipitation_trend_per_decade']:.1f} mm per decade (p={trends['precipitation_trend_significance']:.4f})\")\n",
        "\n",
        "print(\"\\nCurrent Climate State:\")\n",
        "current = climate_results['current_climate_state']\n",
        "print(f\"  Global mean temperature: {current['global_mean_temperature']:.2f}°C\")\n",
        "print(f\"  Temperature change since start: {current['temperature_warming_since_start']:.2f}°C\")\n",
        "print(f\"  Global mean precipitation: {current['global_mean_precipitation']:.1f} mm/year\")\n",
        "\n",
        "print(\"\\nExtreme Events:\")\n",
        "extremes = climate_results['extreme_events']\n",
        "total_stations = climate_results['analysis_summary']['total_station_years']\n",
        "print(f\"  Extreme heat affected: {extremes['extreme_heat_stations']} stations ({100*extremes['extreme_heat_stations']/total_stations:.1f}%)\")\n",
        "print(f\"  Drought affected: {extremes['drought_affected_stations']} stations ({100*extremes['drought_affected_stations']/total_stations:.1f}%)\")\n",
        "print(f\"  Flood risk: {extremes['flood_risk_stations']} stations ({100*extremes['flood_risk_stations']/total_stations:.1f}%)\")"
      ],
      "id": "cell-10"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## PBS Queue Management and Resource Selection\n",
        "\n",
        "Understanding how to choose appropriate PBS queues and resources:"
      ],
      "id": "cell-11"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "def select_pbs_resources(workload_type, data_size_mb, urgency=\"normal\"):\n",
        "    \"\"\"\n",
        "    Intelligent PBS resource selection based on workload characteristics.\n",
        "    \"\"\"\n",
        "    \n",
        "    # Base resource templates\n",
        "    resource_templates = {\n",
        "        \"bioinformatics\": {\n",
        "            \"small\": {\"cores\": 4, \"memory\": \"16GB\", \"time\": \"02:00:00\", \"queue\": \"bioqueue\"},\n",
        "            \"medium\": {\"cores\": 8, \"memory\": \"32GB\", \"time\": \"06:00:00\", \"queue\": \"bioqueue\"},\n",
        "            \"large\": {\"cores\": 16, \"memory\": \"64GB\", \"time\": \"12:00:00\", \"queue\": \"bioqueue_long\"}\n",
        "        },\n",
        "        \"physics\": {\n",
        "            \"small\": {\"cores\": 8, \"memory\": \"32GB\", \"time\": \"04:00:00\", \"queue\": \"physics\"},\n",
        "            \"medium\": {\"cores\": 16, \"memory\": \"64GB\", \"time\": \"12:00:00\", \"queue\": \"physics\"},\n",
        "            \"large\": {\"cores\": 32, \"memory\": \"128GB\", \"time\": \"24:00:00\", \"queue\": \"physics_long\"}\n",
        "        },\n",
        "        \"climate\": {\n",
        "            \"small\": {\"cores\": 6, \"memory\": \"24GB\", \"time\": \"03:00:00\", \"queue\": \"climate\"},\n",
        "            \"medium\": {\"cores\": 12, \"memory\": \"48GB\", \"time\": \"08:00:00\", \"queue\": \"climate\"},\n",
        "            \"large\": {\"cores\": 24, \"memory\": \"96GB\", \"time\": \"16:00:00\", \"queue\": \"climate_long\"}\n",
        "        },\n",
        "        \"ml\": {\n",
        "            \"small\": {\"cores\": 4, \"memory\": \"16GB\", \"time\": \"01:00:00\", \"queue\": \"gpu\", \"gres\": \"gpu:1\"},\n",
        "            \"medium\": {\"cores\": 8, \"memory\": \"32GB\", \"time\": \"04:00:00\", \"queue\": \"gpu\", \"gres\": \"gpu:2\"},\n",
        "            \"large\": {\"cores\": 16, \"memory\": \"64GB\", \"time\": \"12:00:00\", \"queue\": \"gpu_long\", \"gres\": \"gpu:4\"}\n",
        "        }\n",
        "    }\n",
        "    \n",
        "    # Determine size category based on data\n",
        "    if data_size_mb < 100:\n",
        "        size_category = \"small\"\n",
        "    elif data_size_mb < 1000:\n",
        "        size_category = \"medium\"\n",
        "    else:\n",
        "        size_category = \"large\"\n",
        "    \n",
        "    # Get base configuration\n",
        "    if workload_type not in resource_templates:\n",
        "        workload_type = \"physics\"  # Default fallback\n",
        "    \n",
        "    config = resource_templates[workload_type][size_category].copy()\n",
        "    \n",
        "    # Adjust for urgency\n",
        "    if urgency == \"urgent\":\n",
        "        # Use express queue with reduced resources\n",
        "        config[\"queue\"] = \"express\"\n",
        "        config[\"time\"] = \"00:30:00\"\n",
        "        config[\"cores\"] = min(4, config[\"cores\"])\n",
        "    elif urgency == \"low\":\n",
        "        # Use long queue with more resources\n",
        "        config[\"queue\"] = config[\"queue\"].replace(\"queue\", \"queue_long\")\n",
        "        config[\"cores\"] = int(config[\"cores\"] * 1.5)\n",
        "        # Increase time limit\n",
        "        time_parts = config[\"time\"].split(\":\")\n",
        "        hours = int(time_parts[0]) * 2\n",
        "        config[\"time\"] = f\"{hours:02d}:{time_parts[1]}:{time_parts[2]}\"\n",
        "    \n",
        "    return config\n",
        "\n",
        "# Example resource selections\n",
        "example_workloads = [\n",
        "    (\"bioinformatics\", 500, \"normal\"),\n",
        "    (\"physics\", 2000, \"low\"),\n",
        "    (\"climate\", 150, \"urgent\"),\n",
        "    (\"ml\", 800, \"normal\")\n",
        "]\n",
        "\n",
        "print(\"PBS Resource Selection Examples:\")\n",
        "print(\"=\" * 70)\n",
        "\n",
        "for workload, data_size, urgency in example_workloads:\n",
        "    resources = select_pbs_resources(workload, data_size, urgency)\n",
        "    print(f\"\\n{workload.upper()} ({data_size} MB, {urgency} priority):\")\n",
        "    for key, value in resources.items():\n",
        "        print(f\"  {key}: {value}\")"
      ],
      "id": "cell-12"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## PBS Job Arrays for Parameter Studies\n",
        "\n",
        "Use PBS job arrays for efficient parameter sweeps:"
      ],
      "id": "cell-13"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "@cluster(\n",
        "    cores=4,\n",
        "    memory=\"16GB\",\n",
        "    time=\"01:00:00\",\n",
        "    queue=\"normal\",\n",
        "    pbs_array=\"1-20\"  # PBS job array with 20 tasks\n",
        ")\n",
        "def drug_discovery_parameter_sweep(base_config):\n",
        "    \"\"\"\n",
        "    Pharmaceutical research parameter sweep using PBS job arrays.\n",
        "    Each array task tests different molecular parameters.\n",
        "    \"\"\"\n",
        "    import os\n",
        "    import numpy as np\n",
        "    import random\n",
        "    from math import exp, log\n",
        "    \n",
        "    # Get PBS array index\n",
        "    array_index = int(os.environ.get('PBS_ARRAYID', '1'))\n",
        "    \n",
        "    # Define parameter space for drug discovery\n",
        "    molecular_weights = np.linspace(150, 500, 20)  # Typical drug MW range\n",
        "    logp_values = np.linspace(-1, 5, 20)           # Lipophilicity\n",
        "    hbd_counts = list(range(0, 6))                 # Hydrogen bond donors\n",
        "    hba_counts = list(range(0, 11))                # Hydrogen bond acceptors\n",
        "    \n",
        "    # Select parameters for this array task\n",
        "    mw = molecular_weights[array_index - 1]\n",
        "    logp = logp_values[array_index - 1]\n",
        "    \n",
        "    # Random selection for other parameters\n",
        "    np.random.seed(array_index * 42)\n",
        "    hbd = random.choice(hbd_counts)\n",
        "    hba = random.choice(hba_counts)\n",
        "    \n",
        "    print(f\"Array task {array_index}: MW={mw:.1f}, LogP={logp:.2f}, HBD={hbd}, HBA={hba}\")\n",
        "    \n",
        "    def calculate_drug_likeness(mw, logp, hbd, hba):\n",
        "        \"\"\"Calculate drug-likeness using Lipinski's Rule of Five\"\"\"\n",
        "        violations = 0\n",
        "        \n",
        "        if mw > 500:\n",
        "            violations += 1\n",
        "        if logp > 5:\n",
        "            violations += 1\n",
        "        if hbd > 5:\n",
        "            violations += 1\n",
        "        if hba > 10:\n",
        "            violations += 1\n",
        "        \n",
        "        drug_likeness = max(0, 1.0 - violations * 0.25)\n",
        "        return drug_likeness, violations\n",
        "    \n",
        "    def simulate_binding_affinity(mw, logp, hbd, hba):\n",
        "        \"\"\"Simulate binding affinity to target protein\"\"\"\n",
        "        # Simplified model based on molecular properties\n",
        "        optimal_mw = 350\n",
        "        optimal_logp = 2.5\n",
        "        optimal_hbd = 2\n",
        "        optimal_hba = 6\n",
        "        \n",
        "        mw_score = exp(-((mw - optimal_mw) / 100) ** 2)\n",
        "        logp_score = exp(-((logp - optimal_logp) / 1.5) ** 2)\n",
        "        hbd_score = exp(-((hbd - optimal_hbd) / 1.5) ** 2)\n",
        "        hba_score = exp(-((hba - optimal_hba) / 2.5) ** 2)\n",
        "        \n",
        "        # Combine scores with some randomness\n",
        "        base_affinity = (mw_score * logp_score * hbd_score * hba_score) ** 0.5\n",
        "        random_factor = np.random.uniform(0.7, 1.3)\n",
        "        \n",
        "        binding_affinity = base_affinity * random_factor\n",
        "        ic50 = 10 ** (-6 - 3 * binding_affinity)  # Convert to IC50 (M)\n",
        "        \n",
        "        return binding_affinity, ic50\n",
        "    \n",
        "    def simulate_admet_properties(mw, logp, hbd, hba):\n",
        "        \"\"\"Simulate ADMET (Absorption, Distribution, Metabolism, Excretion, Toxicity)\"\"\"\n",
        "        # Absorption (permeability)\n",
        "        permeability = 1 / (1 + exp(-(logp - 1.5)))\n",
        "        permeability *= np.random.uniform(0.8, 1.2)\n",
        "        \n",
        "        # Distribution (plasma protein binding)\n",
        "        ppb = min(99, max(10, 20 + logp * 15 + np.random.normal(0, 10)))\n",
        "        \n",
        "        # Metabolism (hepatic clearance)\n",
        "        clearance = 0.5 + 0.3 * (1 / (1 + exp(-(mw - 300) / 50)))\n",
        "        clearance *= np.random.uniform(0.7, 1.3)\n",
        "        \n",
        "        # Excretion (renal clearance)\n",
        "        renal_clearance = max(0.1, 0.8 - logp * 0.1 + np.random.normal(0, 0.1))\n",
        "        \n",
        "        # Toxicity (simplified hERG channel binding)\n",
        "        herg_risk = 1 / (1 + exp(-(logp - 3.5)))\n",
        "        if mw > 400:\n",
        "            herg_risk *= 1.5\n",
        "        \n",
        "        return {\n",
        "            'permeability': permeability,\n",
        "            'plasma_protein_binding': ppb,\n",
        "            'hepatic_clearance': clearance,\n",
        "            'renal_clearance': renal_clearance,\n",
        "            'herg_risk': herg_risk\n",
        "        }\n",
        "    \n",
        "    def calculate_developability_score(drug_likeness, binding_affinity, admet):\n",
        "        \"\"\"Calculate overall drug developability score\"\"\"\n",
        "        # Weight different factors\n",
        "        likeness_weight = 0.2\n",
        "        affinity_weight = 0.4\n",
        "        admet_weight = 0.4\n",
        "        \n",
        "        # ADMET composite score\n",
        "        admet_score = (\n",
        "            admet['permeability'] * 0.3 +\n",
        "            (1 - admet['herg_risk']) * 0.3 +\n",
        "            (1 - admet['hepatic_clearance']) * 0.2 +\n",
        "            admet['renal_clearance'] * 0.2\n",
        "        )\n",
        "        \n",
        "        total_score = (\n",
        "            drug_likeness * likeness_weight +\n",
        "            binding_affinity * affinity_weight +\n",
        "            admet_score * admet_weight\n",
        "        )\n",
        "        \n",
        "        return total_score, admet_score\n",
        "    \n",
        "    # Run simulations\n",
        "    drug_likeness, ro5_violations = calculate_drug_likeness(mw, logp, hbd, hba)\n",
        "    binding_affinity, ic50 = simulate_binding_affinity(mw, logp, hbd, hba)\n",
        "    admet_props = simulate_admet_properties(mw, logp, hbd, hba)\n",
        "    developability_score, admet_score = calculate_developability_score(\n",
        "        drug_likeness, binding_affinity, admet_props\n",
        "    )\n",
        "    \n",
        "    # Compile results\n",
        "    compound_results = {\n",
        "        'array_task_id': array_index,\n",
        "        'molecular_properties': {\n",
        "            'molecular_weight': mw,\n",
        "            'logp': logp,\n",
        "            'hbd_count': hbd,\n",
        "            'hba_count': hba\n",
        "        },\n",
        "        'drug_likeness': {\n",
        "            'score': drug_likeness,\n",
        "            'ro5_violations': ro5_violations,\n",
        "            'passes_ro5': ro5_violations <= 1\n",
        "        },\n",
        "        'target_binding': {\n",
        "            'affinity_score': binding_affinity,\n",
        "            'ic50_M': ic50,\n",
        "            'pic50': -log(ic50, 10) if ic50 > 0 else 0\n",
        "        },\n",
        "        'admet_properties': admet_props,\n",
        "        'overall_assessment': {\n",
        "            'developability_score': developability_score,\n",
        "            'admet_score': admet_score,\n",
        "            'promising_candidate': developability_score > 0.6 and binding_affinity > 0.5\n",
        "        }\n",
        "    }\n",
        "    \n",
        "    return compound_results\n",
        "\n",
        "# Run drug discovery parameter sweep\n",
        "drug_config = {\n",
        "    'target_name': 'EGFR',\n",
        "    'assay_type': 'binding',\n",
        "    'screening_library': 'chembl'\n",
        "}\n",
        "\n",
        "# This will run as one task of the PBS array\n",
        "drug_result = drug_discovery_parameter_sweep(drug_config)\n",
        "\n",
        "print(f\"\\nDRUG DISCOVERY ANALYSIS - Task {drug_result['array_task_id']}\")\n",
        "print(\"=\" * 60)\n",
        "\n",
        "mol_props = drug_result['molecular_properties']\n",
        "print(f\"Molecular Weight: {mol_props['molecular_weight']:.1f} Da\")\n",
        "print(f\"LogP: {mol_props['logp']:.2f}\")\n",
        "print(f\"H-bond donors: {mol_props['hbd_count']}\")\n",
        "print(f\"H-bond acceptors: {mol_props['hba_count']}\")\n",
        "\n",
        "drug_like = drug_result['drug_likeness']\n",
        "print(f\"\\nDrug-likeness score: {drug_like['score']:.3f}\")\n",
        "print(f\"Rule of 5 violations: {drug_like['ro5_violations']}\")\n",
        "print(f\"Passes Lipinski's Rule: {drug_like['passes_ro5']}\")\n",
        "\n",
        "binding = drug_result['target_binding']\n",
        "print(f\"\\nBinding affinity score: {binding['affinity_score']:.3f}\")\n",
        "print(f\"IC50: {binding['ic50_M']:.2e} M\")\n",
        "print(f\"pIC50: {binding['pic50']:.2f}\")\n",
        "\n",
        "assessment = drug_result['overall_assessment']\n",
        "print(f\"\\nDevelopability score: {assessment['developability_score']:.3f}\")\n",
        "print(f\"ADMET score: {assessment['admet_score']:.3f}\")\n",
        "print(f\"Promising candidate: {assessment['promising_candidate']}\")"
      ],
      "id": "cell-14"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Monitoring PBS Jobs\n",
        "\n",
        "Monitor and manage PBS jobs using Clustrix:"
      ],
      "id": "cell-15"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "from clustrix import ClusterExecutor\n",
        "\n",
        "# Get the configured executor\n",
        "config = clustrix.get_config()\n",
        "executor = ClusterExecutor(config)\n",
        "\n",
        "try:\n",
        "    executor.connect()\n",
        "    print(\"✓ Successfully connected to PBS cluster\")\n",
        "    \n",
        "    # Check PBS version\n",
        "    stdout, stderr = executor._execute_command(\"qstat --version\")\n",
        "    if stdout:\n",
        "        print(f\"✓ PBS version: {stdout.strip()}\")\n",
        "    \n",
        "    # Check available queues\n",
        "    stdout, stderr = executor._execute_command(\"qstat -Q\")\n",
        "    if stdout:\n",
        "        print(\"\\nAvailable queues:\")\n",
        "        lines = stdout.strip().split('\\n')\n",
        "        for line in lines[2:7]:  # Skip header, show first 5 queues\n",
        "            parts = line.split()\n",
        "            if len(parts) >= 3:\n",
        "                queue_name = parts[0]\n",
        "                max_jobs = parts[1] if parts[1] != '--' else 'unlimited'\n",
        "                total_jobs = parts[2]\n",
        "                print(f\"  {queue_name}: {total_jobs} jobs, max: {max_jobs}\")\n",
        "    \n",
        "    # Check node status\n",
        "    stdout, stderr = executor._execute_command(\"pbsnodes -a | grep -E '^(\\w+|\\s+state)' | head -20\")\n",
        "    if stdout:\n",
        "        print(\"\\nNode status (sample):\")\n",
        "        lines = stdout.strip().split('\\n')\n",
        "        current_node = None\n",
        "        for line in lines[:10]:  # Show first few nodes\n",
        "            if not line.startswith(' '):\n",
        "                current_node = line.strip()\n",
        "            elif 'state' in line:\n",
        "                state = line.split('=')[1].strip() if '=' in line else 'unknown'\n",
        "                print(f\"  {current_node}: {state}\")\n",
        "    \n",
        "    # Check user's job status\n",
        "    username = config.username\n",
        "    stdout, stderr = executor._execute_command(f\"qstat -u {username}\")\n",
        "    if stdout and len(stdout.strip().split('\\n')) > 2:\n",
        "        print(f\"\\nYour current jobs:\")\n",
        "        lines = stdout.strip().split('\\n')\n",
        "        for line in lines[2:]:  # Skip headers\n",
        "            print(f\"  {line}\")\n",
        "    else:\n",
        "        print(f\"\\n✓ No jobs currently running for user {username}\")\n",
        "    \n",
        "    executor.disconnect()\n",
        "    print(\"\\n✓ PBS cluster monitoring completed successfully\")\n",
        "    \n",
        "except Exception as e:\n",
        "    print(f\"✗ Connection or monitoring failed: {e}\")\n",
        "    print(\"Please check your PBS cluster configuration and connectivity\")"
      ],
      "id": "cell-16"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## PBS Configuration Best Practices\n",
        "\n",
        "### Environment-Specific Configuration Files\n",
        "\n",
        "Create different configurations for different PBS environments:"
      ],
      "id": "cell-17"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "# Create PBS configuration for different research domains\n",
        "\n",
        "bioinformatics_config = {\n",
        "    'cluster_type': 'pbs',\n",
        "    'cluster_host': 'bio-cluster.university.edu',\n",
        "    'username': 'researcher',\n",
        "    'default_queue': 'bioqueue',\n",
        "    'default_cores': 8,\n",
        "    'default_memory': '32GB',\n",
        "    'default_time': '06:00:00',\n",
        "    'module_loads': ['python/3.9', 'blast/2.12', 'hmmer/3.3'],\n",
        "    'remote_work_dir': '/scratch/bio/clustrix',\n",
        "    'max_parallel_jobs': 20\n",
        "}\n",
        "\n",
        "physics_config = {\n",
        "    'cluster_type': 'pbs',\n",
        "    'cluster_host': 'physics-hpc.university.edu',\n",
        "    'username': 'physicist',\n",
        "    'default_queue': 'physics',\n",
        "    'default_cores': 16,\n",
        "    'default_memory': '64GB',\n",
        "    'default_time': '12:00:00',\n",
        "    'module_loads': ['python/3.9', 'openmpi/4.1', 'fftw/3.3'],\n",
        "    'remote_work_dir': '/home/physicist/clustrix',\n",
        "    'features': 'infiniband',  # Request high-speed interconnect\n",
        "    'max_parallel_jobs': 10\n",
        "}\n",
        "\n",
        "climate_config = {\n",
        "    'cluster_type': 'pbs',\n",
        "    'cluster_host': 'climate-compute.noaa.gov',\n",
        "    'username': 'climatologist',\n",
        "    'default_queue': 'climate',\n",
        "    'default_cores': 12,\n",
        "    'default_memory': '48GB',\n",
        "    'default_time': '08:00:00',\n",
        "    'module_loads': ['python/3.9', 'netcdf/4.8', 'gdal/3.4'],\n",
        "    'remote_work_dir': '/data/climate/clustrix',\n",
        "    'max_parallel_jobs': 15\n",
        "}\n",
        "\n",
        "# Example of selecting configuration based on research domain\n",
        "def configure_for_domain(domain):\n",
        "    configs = {\n",
        "        'bioinformatics': bioinformatics_config,\n",
        "        'physics': physics_config,\n",
        "        'climate': climate_config\n",
        "    }\n",
        "    \n",
        "    if domain in configs:\n",
        "        clustrix.configure(**configs[domain])\n",
        "        print(f\"Configured Clustrix for {domain} research\")\n",
        "        return configs[domain]\n",
        "    else:\n",
        "        print(f\"Unknown domain: {domain}. Available: {list(configs.keys())}\")\n",
        "        return None\n",
        "\n",
        "# Configure for bioinformatics research\n",
        "selected_config = configure_for_domain('bioinformatics')\n",
        "if selected_config:\n",
        "    print(\"\\nConfiguration details:\")\n",
        "    for key, value in selected_config.items():\n",
        "        print(f\"  {key}: {value}\")"
      ],
      "id": "cell-18"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Summary\n",
        "\n",
        "This tutorial covered PBS/Torque cluster usage with Clustrix:\n",
        "\n",
        "1. **PBS Configuration** - Setting up Clustrix for PBS clusters\n",
        "2. **Bioinformatics Applications** - DNA sequence analysis and genomics\n",
        "3. **Materials Science** - Molecular dynamics simulations\n",
        "4. **Climate Research** - Large-scale environmental data analysis\n",
        "5. **Drug Discovery** - Pharmaceutical parameter sweeps with job arrays\n",
        "6. **Resource Management** - Intelligent queue and resource selection\n",
        "7. **Job Monitoring** - PBS cluster status and job management\n",
        "8. **Best Practices** - Domain-specific configurations\n",
        "\n",
        "### Key PBS Features:\n",
        "\n",
        "- **Queue Management**: Choose appropriate queues for different workload types\n",
        "- **Resource Specification**: Use PBS directives for cores, memory, and time\n",
        "- **Job Arrays**: Efficient parameter sweeps with `pbs_array` parameter\n",
        "- **Feature Requests**: Specify hardware features like InfiniBand\n",
        "- **Module Loading**: Automatic environment setup with required software\n",
        "- **Walltime Management**: Realistic time estimates for job completion\n",
        "\n",
        "### Next Steps:\n",
        "\n",
        "- Explore [SLURM Tutorial](slurm_tutorial.ipynb) for SLURM-specific features\n",
        "- Try [Kubernetes Tutorial](kubernetes_tutorial.ipynb) for containerized computing\n",
        "- Review [SGE Tutorial](sge_tutorial.ipynb) for Sun Grid Engine clusters\n",
        "- Check the [SSH Setup Guide](../ssh_setup.rst) for secure authentication\n",
        "\n",
        "For more information, visit the [Clustrix Documentation](https://clustrix.readthedocs.io)."
      ],
      "id": "cell-19"
    }
  ],
  "metadata": {
    "kernelspec": {
      "display_name": "Python 3",
      "language": "python",
      "name": "python3"
    },
    "language_info": {
      "codemirror_mode": {
        "name": "ipython",
        "version": 3
      },
      "file_extension": ".py",
      "mimetype": "text/x-python",
      "name": "python",
      "nbconvert_exporter": "python",
      "pygments_lexer": "ipython3",
      "version": "3.9.0"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 4
}