{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# PBS/Torque Cluster Tutorial\n", "\n", "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/ContextLab/clustrix/blob/master/docs/notebooks/pbs_tutorial.ipynb)\n", "\n", "This tutorial demonstrates how to use Clustrix with PBS (Portable Batch System) and Torque clusters. PBS is widely used in academic and research computing environments.\n", "\n", "## Prerequisites\n", "\n", "- Access to a PBS/Torque cluster\n", "- SSH key configured for the cluster\n", "- Clustrix installed: `pip install clustrix`" ], "id": "cell-0" }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Installation and Setup" ], "id": "cell-1" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Install Clustrix (uncomment if needed)\n", "# !pip install clustrix\n", "\n", "import clustrix\n", "from clustrix import cluster, configure\n", "import numpy as np\n", "import pandas as pd" ], "id": "cell-2" }, { "cell_type": "markdown", "metadata": {}, "source": [ "## PBS Cluster Configuration\n", "\n", "Configure Clustrix for your PBS/Torque cluster:" ], "id": "cell-3" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Configure for PBS cluster\n", "configure(\n", " cluster_type=\"pbs\",\n", " cluster_host=\"pbs-cluster.university.edu\", # Replace with your cluster\n", " username=\"your-username\", # Replace with your username\n", " key_file=\"~/.ssh/id_rsa\", # Path to SSH key\n", " \n", " # Default PBS resource requirements\n", " default_cores=4,\n", " default_memory=\"16GB\",\n", " default_time=\"02:00:00\",\n", " default_queue=\"normal\", # PBS queue name\n", " \n", " # PBS-specific options\n", " remote_work_dir=\"/home/your-username/clustrix\", # Adjust for your cluster\n", " \n", " # Environment setup\n", " module_loads=[\"python/3.9\", \"openmpi/4.0\"], # Common PBS 
modules\n", " \n", " # Job management\n", " cleanup_on_success=True,\n", " max_parallel_jobs=25\n", ")\n", "\n", "print(\"PBS cluster configured successfully!\")" ], "id": "cell-4" }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Example 1: Bioinformatics - DNA Sequence Analysis\n", "\n", "PBS clusters are popular in bioinformatics. Let's analyze DNA sequences:" ], "id": "cell-5" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "@cluster(\n", " cores=8, \n", " memory=\"32GB\", \n", " time=\"03:00:00\", \n", " queue=\"bioqueue\", # Specialized bioinformatics queue\n", " walltime=\"03:00:00\" # PBS uses 'walltime' parameter\n", ")\n", "def analyze_dna_sequences(sequences, analysis_type=\"comprehensive\"):\n", " \"\"\"\n", " Comprehensive DNA sequence analysis for bioinformatics research.\n", " \"\"\"\n", " import numpy as np\n", " import random\n", " from collections import Counter, defaultdict\n", " import re\n", " import math\n", " \n", " def calculate_gc_content(sequence):\n", " \"\"\"Calculate GC content percentage\"\"\"\n", " gc_count = sequence.count('G') + sequence.count('C')\n", " return (gc_count / len(sequence)) * 100 if sequence else 0\n", " \n", " def find_orfs(sequence, min_length=100):\n", " \"\"\"Find Open Reading Frames (ORFs)\"\"\"\n", " start_codon = 'ATG'\n", " stop_codons = ['TAA', 'TAG', 'TGA']\n", " orfs = []\n", " \n", " for frame in range(3): # Check all 3 reading frames\n", " for i in range(frame, len(sequence) - 2, 3):\n", " codon = sequence[i:i+3]\n", " if codon == start_codon:\n", " # Look for stop codon\n", " for j in range(i+3, len(sequence) - 2, 3):\n", " stop_codon = sequence[j:j+3]\n", " if stop_codon in stop_codons:\n", " orf_length = j - i + 3\n", " if orf_length >= min_length:\n", " orfs.append({\n", " 'start': i,\n", " 'end': j + 3,\n", " 'length': orf_length,\n", " 'frame': frame + 1,\n", " 'sequence': sequence[i:j+3]\n", " })\n", " break\n", " return orfs\n", " \n", " def 
analyze_codon_usage(sequence):\n", " \"\"\"Analyze codon usage patterns\"\"\"\n", " codons = [sequence[i:i+3] for i in range(0, len(sequence)-2, 3) \n", " if len(sequence[i:i+3]) == 3]\n", " codon_counts = Counter(codons)\n", " \n", " # Standard genetic code mapping\n", " genetic_code = {\n", " 'TTT': 'F', 'TTC': 'F', 'TTA': 'L', 'TTG': 'L',\n", " 'TCT': 'S', 'TCC': 'S', 'TCA': 'S', 'TCG': 'S',\n", " 'TAT': 'Y', 'TAC': 'Y', 'TAA': '*', 'TAG': '*',\n", " 'TGT': 'C', 'TGC': 'C', 'TGA': '*', 'TGG': 'W',\n", " 'CTT': 'L', 'CTC': 'L', 'CTA': 'L', 'CTG': 'L',\n", " 'CCT': 'P', 'CCC': 'P', 'CCA': 'P', 'CCG': 'P',\n", " 'CAT': 'H', 'CAC': 'H', 'CAA': 'Q', 'CAG': 'Q',\n", " 'CGT': 'R', 'CGC': 'R', 'CGA': 'R', 'CGG': 'R',\n", " 'ATT': 'I', 'ATC': 'I', 'ATA': 'I', 'ATG': 'M',\n", " 'ACT': 'T', 'ACC': 'T', 'ACA': 'T', 'ACG': 'T',\n", " 'AAT': 'N', 'AAC': 'N', 'AAA': 'K', 'AAG': 'K',\n", " 'AGT': 'S', 'AGC': 'S', 'AGA': 'R', 'AGG': 'R',\n", " 'GTT': 'V', 'GTC': 'V', 'GTA': 'V', 'GTG': 'V',\n", " 'GCT': 'A', 'GCC': 'A', 'GCA': 'A', 'GCG': 'A',\n", " 'GAT': 'D', 'GAC': 'D', 'GAA': 'E', 'GAG': 'E',\n", " 'GGT': 'G', 'GGC': 'G', 'GGA': 'G', 'GGG': 'G'\n", " }\n", " \n", " amino_acid_counts = defaultdict(int)\n", " for codon, count in codon_counts.items():\n", " if codon in genetic_code:\n", " amino_acid_counts[genetic_code[codon]] += count\n", " \n", " return dict(codon_counts), dict(amino_acid_counts)\n", " \n", " def find_tandem_repeats(sequence, min_repeat_length=3, max_repeat_length=20):\n", " \"\"\"Find tandem repeats in DNA sequence\"\"\"\n", " repeats = []\n", " \n", " for repeat_len in range(min_repeat_length, max_repeat_length + 1):\n", " for i in range(len(sequence) - repeat_len * 2 + 1):\n", " motif = sequence[i:i + repeat_len]\n", " count = 1\n", " j = i + repeat_len\n", " \n", " while j + repeat_len <= len(sequence) and sequence[j:j + repeat_len] == motif:\n", " count += 1\n", " j += repeat_len\n", " \n", " if count >= 3: # At least 3 repeats\n", " repeats.append({\n", 
" 'motif': motif,\n", " 'start': i,\n", " 'end': j,\n", " 'repeat_count': count,\n", " 'total_length': j - i\n", " })\n", " \n", " return repeats\n", " \n", " # Main analysis loop\n", " results = []\n", " \n", " for seq_idx, sequence in enumerate(sequences):\n", " print(f\"Analyzing sequence {seq_idx + 1}/{len(sequences)} (length: {len(sequence)})...\")\n", " \n", " # Basic composition analysis\n", " base_composition = Counter(sequence)\n", " gc_content = calculate_gc_content(sequence)\n", " \n", " # Advanced analyses\n", " orfs = find_orfs(sequence, min_length=150)\n", " codon_usage, amino_acid_freq = analyze_codon_usage(sequence)\n", " tandem_repeats = find_tandem_repeats(sequence)\n", " \n", " # CpG island detection (simplified)\n", " cpg_sites = len(re.findall('CG', sequence))\n", " cpg_density = (cpg_sites / (len(sequence) - 1)) * 100 if len(sequence) > 1 else 0\n", " \n", " # Complexity analysis\n", " def calculate_complexity(seq, window_size=50):\n", " complexities = []\n", " for i in range(0, len(seq) - window_size + 1, window_size):\n", " window = seq[i:i + window_size]\n", " counter = Counter(window)\n", " entropy = -sum((count/window_size) * math.log2(count/window_size) \n", " for count in counter.values() if count > 0)\n", " complexities.append(entropy)\n", " return np.mean(complexities) if complexities else 0\n", " \n", " complexity = calculate_complexity(sequence)\n", " \n", " sequence_result = {\n", " 'sequence_id': seq_idx,\n", " 'length': len(sequence),\n", " 'base_composition': dict(base_composition),\n", " 'gc_content': gc_content,\n", " 'complexity': complexity,\n", " 'orfs_found': len(orfs),\n", " 'longest_orf': max(orfs, key=lambda x: x['length'])['length'] if orfs else 0,\n", " 'cpg_sites': cpg_sites,\n", " 'cpg_density': cpg_density,\n", " 'tandem_repeats': len(tandem_repeats),\n", " 'repeat_details': tandem_repeats[:5], # Keep first 5 for analysis\n", " 'codon_diversity': len(codon_usage),\n", " 'amino_acid_diversity': 
len(amino_acid_freq),\n", " 'most_common_amino_acid': max(amino_acid_freq.items(), key=lambda x: x[1])[0] if amino_acid_freq else 'N/A'\n", " }\n", " \n", " results.append(sequence_result)\n", " \n", " # Aggregate statistics\n", " aggregate_stats = {\n", " 'total_sequences': len(results),\n", " 'total_base_pairs': sum(r['length'] for r in results),\n", " 'average_gc_content': np.mean([r['gc_content'] for r in results]),\n", " 'gc_content_std': np.std([r['gc_content'] for r in results]),\n", " 'average_complexity': np.mean([r['complexity'] for r in results]),\n", " 'total_orfs_found': sum(r['orfs_found'] for r in results),\n", " 'total_cpg_sites': sum(r['cpg_sites'] for r in results),\n", " 'sequences_with_repeats': sum(1 for r in results if r['tandem_repeats'] > 0),\n", " 'individual_results': results\n", " }\n", " \n", " return aggregate_stats\n", "\n", "# Generate sample DNA sequences for analysis\n", "def generate_realistic_dna(length, gc_content=0.5):\n", " \"\"\"Generate realistic DNA sequences with specific GC content\"\"\"\n", " bases = ['A', 'T', 'G', 'C']\n", " gc_prob = gc_content / 2\n", " at_prob = (1 - gc_content) / 2\n", " probs = [at_prob, at_prob, gc_prob, gc_prob]\n", " \n", " return ''.join(np.random.choice(bases, size=length, p=probs))\n", "\n", "# Create test sequences\n", "test_sequences = [\n", " generate_realistic_dna(5000, 0.4), # AT-rich\n", " generate_realistic_dna(8000, 0.6), # GC-rich\n", " generate_realistic_dna(3000, 0.5), # Balanced\n", " generate_realistic_dna(12000, 0.45), # Large AT-rich\n", " generate_realistic_dna(6000, 0.55) # Medium GC-rich\n", "]\n", "\n", "# Run analysis on PBS cluster\n", "bio_results = analyze_dna_sequences(test_sequences, analysis_type=\"comprehensive\")\n", "\n", "print(f\"\\nBIOINFORMATICS ANALYSIS COMPLETE\")\n", "print(f\"Sequences analyzed: {bio_results['total_sequences']}\")\n", "print(f\"Total base pairs: {bio_results['total_base_pairs']:,}\")\n", "print(f\"Average GC content: 
{bio_results['average_gc_content']:.2f}% ± {bio_results['gc_content_std']:.2f}%\")\n", "print(f\"Total ORFs found: {bio_results['total_orfs_found']}\")\n", "print(f\"Total CpG sites: {bio_results['total_cpg_sites']}\")\n", "print(f\"Sequences with tandem repeats: {bio_results['sequences_with_repeats']}/{bio_results['total_sequences']}\")" ], "id": "cell-6" }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Example 2: Materials Science - Molecular Dynamics Simulation\n", "\n", "Simulate a molecular system, a workload commonly run on PBS clusters:" ], "id": "cell-7" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "@cluster(\n", " cores=16,\n", " memory=\"64GB\",\n", " time=\"06:00:00\",\n", " queue=\"physics\",\n", " features=\"infiniband\" # PBS feature for high-speed networking\n", ")\n", "def molecular_dynamics_simulation(n_particles=10000, n_steps=100000, temperature=300.0):\n", " \"\"\"\n", " Simplified molecular dynamics simulation for materials science.\n", " \"\"\"\n", " import numpy as np\n", " import math\n", " \n", " # Physical constants\n", " kb = 1.380649e-23 # Boltzmann constant (J/K)\n", " mass = 1.66054e-27 # Particle mass (kg); this value is 1 atomic mass unit\n", " dt = 1e-15 # Time step (s)\n", " sigma = 3.4e-10 # Lennard-Jones length parameter (m)\n", " epsilon = 1.65e-21 # Lennard-Jones well depth (J)\n", " \n", " print(f\"Starting MD simulation with {n_particles:,} particles for {n_steps:,} steps...\")\n", " print(f\"Temperature: {temperature} K\")\n", " \n", " # Initialize system\n", " box_size = (n_particles / 0.8) ** (1/3) * sigma # Reduced density ~0.8 particles per sigma^3\n", " \n", " # Random initial positions\n", " positions = np.random.uniform(0, box_size, (n_particles, 3))\n", " \n", " # Maxwell-Boltzmann velocity distribution\n", " velocity_scale = math.sqrt(kb * temperature / mass)\n", " velocities = np.random.normal(0, velocity_scale, (n_particles, 3))\n", " \n", " # Remove center of mass motion\n", " velocities -= np.mean(velocities, axis=0)\n", " 
\n", " # Storage for analysis\n", " energies = []\n", " temperatures = []\n", " pressures = []\n", " radial_distribution = []\n", " \n", " def lennard_jones_force(r):\n", " \"\"\"Calculate Lennard-Jones force magnitude (positive = repulsive)\"\"\"\n", " if r < 1e-12: # Avoid division by zero\n", " return 0\n", " sr6 = (sigma / r) ** 6\n", " sr12 = sr6 ** 2\n", " return 24 * epsilon * (2 * sr12 - sr6) / r\n", " \n", " def calculate_forces(pos):\n", " \"\"\"Calculate forces on all particles\"\"\"\n", " forces = np.zeros_like(pos)\n", " potential_energy = 0\n", " \n", " for i in range(n_particles):\n", " for j in range(i + 1, n_particles):\n", " # Distance vector with periodic boundary conditions\n", " dr = pos[j] - pos[i]\n", " dr = dr - box_size * np.round(dr / box_size)\n", " r = np.linalg.norm(dr)\n", " \n", " if 1e-12 < r < 2.5 * sigma: # Cutoff distance; skip coincident particles\n", " force_magnitude = lennard_jones_force(r)\n", " force_vector = force_magnitude * dr / r\n", " \n", " # dr points from i to j, so a repulsive force pushes i along -dr and j along +dr\n", " forces[i] -= force_vector\n", " forces[j] += force_vector\n", " \n", " # Potential energy\n", " sr6 = (sigma / r) ** 6\n", " sr12 = sr6 ** 2\n", " potential_energy += 4 * epsilon * (sr12 - sr6)\n", " \n", " return forces, potential_energy\n", " \n", " def calculate_temperature(vel):\n", " \"\"\"Calculate instantaneous temperature\"\"\"\n", " kinetic_energy = 0.5 * mass * np.sum(vel ** 2)\n", " return 2 * kinetic_energy / (3 * n_particles * kb)\n", " \n", " def calculate_pressure(pos, forces):\n", " \"\"\"Calculate pressure using virial theorem\"\"\"\n", " kinetic_term = n_particles * kb * calculate_temperature(velocities)\n", " virial = np.sum(pos * forces)\n", " volume = box_size ** 3\n", " return (kinetic_term + virial/3) / volume\n", " \n", " # Main simulation loop\n", " for step in range(n_steps):\n", " if step % max(1, n_steps // 10) == 0: # max() avoids modulo by zero when n_steps < 10\n", " print(f\"Step {step:,}/{n_steps:,} ({100*step/n_steps:.1f}%)\")\n", " \n", " # Calculate forces\n", " forces, potential_energy = calculate_forces(positions)\n", " \n", " # Velocity Verlet 
integration\n", " # Update positions\n", " positions += velocities * dt + 0.5 * forces / mass * dt ** 2\n", " \n", " # Apply periodic boundary conditions\n", " positions = positions % box_size\n", " \n", " # Update velocities\n", " new_forces, _ = calculate_forces(positions)\n", " velocities += 0.5 * (forces + new_forces) / mass * dt\n", " \n", " # Calculate thermodynamic properties\n", " if step % 1000 == 0: # Sample every 1000 steps\n", " kinetic_energy = 0.5 * mass * np.sum(velocities ** 2)\n", " total_energy = kinetic_energy + potential_energy\n", " temp = calculate_temperature(velocities)\n", " pressure = calculate_pressure(positions, new_forces)\n", " \n", " energies.append({\n", " 'step': step,\n", " 'kinetic': kinetic_energy,\n", " 'potential': potential_energy,\n", " 'total': total_energy\n", " })\n", " temperatures.append(temp)\n", " pressures.append(pressure)\n", " \n", " # Simple thermostat (velocity rescaling)\n", " if step % 100 == 0: # Apply every 100 steps\n", " current_temp = calculate_temperature(velocities)\n", " if current_temp > 0:\n", " scaling_factor = math.sqrt(temperature / current_temp)\n", " velocities *= scaling_factor\n", " \n", " # Calculate radial distribution function (simplified)\n", " def calculate_rdf(pos, n_bins=100, max_r=None):\n", " if max_r is None:\n", " max_r = box_size / 2\n", " \n", " bin_width = max_r / n_bins\n", " rdf = np.zeros(n_bins)\n", " \n", " for i in range(min(1000, n_particles)): # Sample subset for efficiency\n", " for j in range(i + 1, min(1000, n_particles)):\n", " dr = pos[j] - pos[i]\n", " dr = dr - box_size * np.round(dr / box_size)\n", " r = np.linalg.norm(dr)\n", " \n", " if r < max_r:\n", " bin_index = int(r / bin_width)\n", " if bin_index < n_bins:\n", " rdf[bin_index] += 1\n", " \n", " # Normalize\n", " for i in range(n_bins):\n", " r = (i + 0.5) * bin_width\n", " volume = 4 * math.pi * r ** 2 * bin_width\n", " density = n_particles / box_size ** 3\n", " rdf[i] /= (volume * density * 1000) # 1000 
particles sampled\n", " \n", " return rdf, np.arange(0.5 * bin_width, max_r, bin_width)\n", " \n", " rdf_values, rdf_distances = calculate_rdf(positions)\n", " \n", " # Final analysis\n", " avg_temperature = np.mean(temperatures[-50:]) # Last 50 samples\n", " avg_pressure = np.mean(pressures[-50:])\n", " final_energy = energies[-1]['total'] if energies else 0\n", " \n", " simulation_results = {\n", " 'n_particles': n_particles,\n", " 'n_steps': n_steps,\n", " 'target_temperature': temperature,\n", " 'average_temperature': avg_temperature,\n", " 'temperature_stability': np.std(temperatures[-50:]),\n", " 'average_pressure': avg_pressure,\n", " 'final_energy': final_energy,\n", " 'box_size': box_size,\n", " 'density': n_particles / box_size ** 3,\n", " 'energy_trajectory': energies[::10], # Every 10th point\n", " 'temperature_trajectory': temperatures[::10],\n", " 'pressure_trajectory': pressures[::10],\n", " 'radial_distribution': {\n", " 'distances': rdf_distances.tolist(),\n", " 'values': rdf_values.tolist()\n", " },\n", " 'simulation_time_ns': n_steps * dt * 1e9 # Convert to nanoseconds\n", " }\n", " \n", " return simulation_results\n", "\n", "# Run molecular dynamics simulation\n", "md_results = molecular_dynamics_simulation(\n", " n_particles=5000, \n", " n_steps=50000, \n", " temperature=298.15 # Room temperature\n", ")\n", "\n", "print(f\"\\nMOLECULAR DYNAMICS SIMULATION COMPLETE\")\n", "print(f\"Particles: {md_results['n_particles']:,}\")\n", "print(f\"Steps: {md_results['n_steps']:,}\")\n", "print(f\"Simulation time: {md_results['simulation_time_ns']:.2f} ns\")\n", "print(f\"Target temperature: {md_results['target_temperature']:.1f} K\")\n", "print(f\"Average temperature: {md_results['average_temperature']:.1f} K\")\n", "print(f\"Temperature stability: ±{md_results['temperature_stability']:.1f} K\")\n", "print(f\"Average pressure: {md_results['average_pressure']:.2e} Pa\")\n", "print(f\"System density: {md_results['density']:.2e} particles/m³\")" ], "id": 
"cell-8" }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Example 3: Environmental Science - Climate Data Analysis\n", "\n", "Analyze large climate datasets commonly processed on research clusters:" ], "id": "cell-9" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "@cluster(\n", " cores=12,\n", " memory=\"48GB\",\n", " time=\"04:00:00\",\n", " queue=\"climate\",\n", " parallel=True # Enable automatic parallelization\n", ")\n", "def analyze_climate_data(years_to_analyze=50, stations_per_year=1000):\n", " \"\"\"\n", " Comprehensive climate data analysis for environmental research.\n", " \"\"\"\n", " import numpy as np\n", " import pandas as pd\n", " from datetime import datetime, timedelta\n", " import random\n", " from scipy import stats\n", " import math\n", " \n", " def generate_realistic_climate_data(year, station_id, latitude, longitude):\n", " \"\"\"Generate realistic climate data for a station\"\"\"\n", " np.random.seed(year * 1000 + station_id) # Reproducible but varied\n", " \n", " # Base temperature influenced by latitude\n", " base_temp = 25 - abs(latitude) * 0.6 # Cooler at higher latitudes\n", " \n", " # Generate daily data for the year\n", " start_date = datetime(year, 1, 1)\n", " days_in_year = 366 if year % 4 == 0 else 365\n", " \n", " daily_data = []\n", " \n", " for day in range(days_in_year):\n", " date = start_date + timedelta(days=day)\n", " day_of_year = day + 1\n", " \n", " # Seasonal temperature variation\n", " seasonal_temp = base_temp + 15 * math.cos(2 * math.pi * (day_of_year - 172) / 365)\n", " \n", " # Add random variation and trends\n", " climate_trend = 0.01 * (year - 1970) # 0.01°C/year warming\n", " daily_temp = seasonal_temp + climate_trend + np.random.normal(0, 3)\n", " \n", " # Precipitation (higher in tropics and certain seasons)\n", " base_precip = max(0, 10 - abs(latitude) * 0.3)\n", " seasonal_precip_factor = 1 + 0.5 * math.cos(2 * math.pi * (day_of_year - 30) / 
365)\n", " daily_precip = max(0, np.random.exponential(base_precip * seasonal_precip_factor))\n", " \n", " # Humidity (correlated with temperature and precipitation)\n", " base_humidity = 60 - abs(latitude) * 0.5\n", " humidity = base_humidity + daily_precip * 0.5 - (daily_temp - base_temp) * 0.3\n", " humidity = max(10, min(100, humidity + np.random.normal(0, 5)))\n", " \n", " # Wind speed (more variable at higher latitudes)\n", " base_wind = 5 + abs(latitude) * 0.1\n", " wind_speed = max(0, np.random.gamma(2, base_wind / 2))\n", " \n", " # Atmospheric pressure (altitude and weather dependent)\n", " base_pressure = 1013.25 # Sea level\n", " pressure = base_pressure + np.random.normal(0, 10)\n", " \n", " daily_data.append({\n", " 'date': date,\n", " 'temperature': daily_temp,\n", " 'precipitation': daily_precip,\n", " 'humidity': humidity,\n", " 'wind_speed': wind_speed,\n", " 'pressure': pressure\n", " })\n", " \n", " return daily_data\n", " \n", " def analyze_station_trends(station_data):\n", " \"\"\"Analyze trends for a single weather station\"\"\"\n", " df = pd.DataFrame(station_data)\n", " \n", " # Calculate annual statistics\n", " annual_stats = {\n", " 'mean_temperature': df['temperature'].mean(),\n", " 'temperature_range': df['temperature'].max() - df['temperature'].min(),\n", " 'total_precipitation': df['precipitation'].sum(),\n", " 'mean_humidity': df['humidity'].mean(),\n", " 'mean_wind_speed': df['wind_speed'].mean(),\n", " 'mean_pressure': df['pressure'].mean(),\n", " 'temperature_std': df['temperature'].std(),\n", " 'precipitation_days': (df['precipitation'] > 1.0).sum(),\n", " 'extreme_heat_days': (df['temperature'] > df['temperature'].quantile(0.95)).sum(),\n", " 'extreme_cold_days': (df['temperature'] < df['temperature'].quantile(0.05)).sum()\n", " }\n", " \n", " # Seasonal analysis\n", " df['month'] = df['date'].dt.month\n", " seasonal_temps = df.groupby(df['month'])['temperature'].mean()\n", " seasonal_precip = 
df.groupby(df['month'])['precipitation'].sum()\n", " \n", " annual_stats['seasonal_temperature_variation'] = seasonal_temps.std()\n", " annual_stats['wettest_month'] = seasonal_precip.idxmax()\n", " annual_stats['driest_month'] = seasonal_precip.idxmin()\n", " \n", " return annual_stats\n", " \n", " print(f\"Analyzing climate data for {years_to_analyze} years, {stations_per_year} stations per year...\")\n", " print(f\"Total data points: {years_to_analyze * stations_per_year * 365:,}\")\n", " \n", " all_station_results = []\n", " \n", " # This loop will be automatically parallelized by Clustrix\n", " for year in range(1970, 1970 + years_to_analyze):\n", " print(f\"Processing year {year}...\")\n", " \n", " year_results = []\n", " \n", " for station_id in range(stations_per_year):\n", " # Generate random station location\n", " latitude = np.random.uniform(-60, 75) # Inhabitable latitudes\n", " longitude = np.random.uniform(-180, 180)\n", " \n", " # Generate climate data for this station and year\n", " station_data = generate_realistic_climate_data(year, station_id, latitude, longitude)\n", " \n", " # Analyze the station data\n", " station_analysis = analyze_station_trends(station_data)\n", " station_analysis['year'] = year\n", " station_analysis['station_id'] = station_id\n", " station_analysis['latitude'] = latitude\n", " station_analysis['longitude'] = longitude\n", " \n", " year_results.append(station_analysis)\n", " \n", " all_station_results.extend(year_results)\n", " \n", " # Convert to DataFrame for analysis\n", " results_df = pd.DataFrame(all_station_results)\n", " \n", " # Global trend analysis\n", " yearly_global_temps = results_df.groupby('year')['mean_temperature'].mean()\n", " yearly_global_precip = results_df.groupby('year')['total_precipitation'].mean()\n", " \n", " # Calculate trends\n", " years = yearly_global_temps.index\n", " temp_trend, temp_intercept, temp_r_value, temp_p_value, temp_std_err = stats.linregress(years, yearly_global_temps)\n", " 
precip_trend, precip_intercept, precip_r_value, precip_p_value, precip_std_err = stats.linregress(years, yearly_global_precip)\n", " \n", " # Regional analysis\n", " def classify_climate_zone(lat):\n", " if abs(lat) < 23.5:\n", " return \"Tropical\"\n", " elif abs(lat) < 35:\n", " return \"Subtropical\"\n", " elif abs(lat) < 50:\n", " return \"Temperate\"\n", " else:\n", " return \"Polar\"\n", " \n", " results_df['climate_zone'] = results_df['latitude'].apply(classify_climate_zone)\n", " zone_analysis = results_df.groupby('climate_zone').agg({\n", " 'mean_temperature': ['mean', 'std'],\n", " 'total_precipitation': ['mean', 'std'],\n", " 'temperature_range': 'mean',\n", " 'extreme_heat_days': 'mean',\n", " 'extreme_cold_days': 'mean'\n", " }).round(2)\n", " \n", " # Extreme events analysis\n", " extreme_heat_threshold = results_df['mean_temperature'].quantile(0.95)\n", " extreme_cold_threshold = results_df['mean_temperature'].quantile(0.05)\n", " drought_threshold = results_df['total_precipitation'].quantile(0.1)\n", " flood_threshold = results_df['total_precipitation'].quantile(0.9)\n", " \n", " extreme_events = {\n", " 'extreme_heat_stations': (results_df['mean_temperature'] > extreme_heat_threshold).sum(),\n", " 'extreme_cold_stations': (results_df['mean_temperature'] < extreme_cold_threshold).sum(),\n", " 'drought_affected_stations': (results_df['total_precipitation'] < drought_threshold).sum(),\n", " 'flood_risk_stations': (results_df['total_precipitation'] > flood_threshold).sum()\n", " }\n", " \n", " # Compile final results\n", " climate_analysis = {\n", " 'analysis_summary': {\n", " 'years_analyzed': years_to_analyze,\n", " 'stations_per_year': stations_per_year,\n", " 'total_station_years': len(results_df),\n", " 'data_points_analyzed': len(results_df) * 365\n", " },\n", " 'global_trends': {\n", " 'temperature_trend_per_decade': temp_trend * 10,\n", " 'temperature_trend_significance': temp_p_value,\n", " 'temperature_correlation': temp_r_value ** 2,\n", " 
'precipitation_trend_per_decade': precip_trend * 10,\n", " 'precipitation_trend_significance': precip_p_value,\n", " 'precipitation_correlation': precip_r_value ** 2\n", " },\n", " 'current_climate_state': {\n", " 'global_mean_temperature': yearly_global_temps.iloc[-1],\n", " 'global_mean_precipitation': yearly_global_precip.iloc[-1],\n", " 'temperature_warming_since_start': yearly_global_temps.iloc[-1] - yearly_global_temps.iloc[0],\n", " 'precipitation_change_since_start': yearly_global_precip.iloc[-1] - yearly_global_precip.iloc[0]\n", " },\n", " 'regional_analysis': zone_analysis.to_dict(),\n", " 'extreme_events': extreme_events,\n", " 'statistical_summary': {\n", " 'mean_global_temperature': results_df['mean_temperature'].mean(),\n", " 'temperature_standard_deviation': results_df['mean_temperature'].std(),\n", " 'mean_global_precipitation': results_df['total_precipitation'].mean(),\n", " 'precipitation_standard_deviation': results_df['total_precipitation'].std(),\n", " 'warmest_station_temp': results_df['mean_temperature'].max(),\n", " 'coldest_station_temp': results_df['mean_temperature'].min(),\n", " 'wettest_station_precip': results_df['total_precipitation'].max(),\n", " 'driest_station_precip': results_df['total_precipitation'].min()\n", " }\n", " }\n", " \n", " return climate_analysis\n", "\n", "# Run climate analysis\n", "climate_results = analyze_climate_data(years_to_analyze=30, stations_per_year=200)\n", "\n", "print(f\"\\nCLIMATE DATA ANALYSIS COMPLETE\")\n", "print(f\"Years analyzed: {climate_results['analysis_summary']['years_analyzed']}\")\n", "print(f\"Total station-years: {climate_results['analysis_summary']['total_station_years']:,}\")\n", "print(f\"Data points: {climate_results['analysis_summary']['data_points_analyzed']:,}\")\n", "\n", "print(\"\\nGlobal Trends:\")\n", "trends = climate_results['global_trends']\n", "print(f\" Temperature trend: {trends['temperature_trend_per_decade']:.3f}°C per decade 
(p={trends['temperature_trend_significance']:.4f})\")\n", "print(f\" Precipitation trend: {trends['precipitation_trend_per_decade']:.1f} mm per decade (p={trends['precipitation_trend_significance']:.4f})\")\n", "\n", "print(\"\\nCurrent Climate State:\")\n", "current = climate_results['current_climate_state']\n", "print(f\" Global mean temperature: {current['global_mean_temperature']:.2f}°C\")\n", "print(f\" Temperature change since start: {current['temperature_warming_since_start']:.2f}°C\")\n", "print(f\" Global mean precipitation: {current['global_mean_precipitation']:.1f} mm/year\")\n", "\n", "print(\"\\nExtreme Events:\")\n", "extremes = climate_results['extreme_events']\n", "total_stations = climate_results['analysis_summary']['total_station_years']\n", "print(f\" Extreme heat affected: {extremes['extreme_heat_stations']} stations ({100*extremes['extreme_heat_stations']/total_stations:.1f}%)\")\n", "print(f\" Drought affected: {extremes['drought_affected_stations']} stations ({100*extremes['drought_affected_stations']/total_stations:.1f}%)\")\n", "print(f\" Flood risk: {extremes['flood_risk_stations']} stations ({100*extremes['flood_risk_stations']/total_stations:.1f}%)\")" ], "id": "cell-10" }, { "cell_type": "markdown", "metadata": {}, "source": [ "## PBS Queue Management and Resource Selection\n", "\n", "Understanding how to choose appropriate PBS queues and resources:" ], "id": "cell-11" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def select_pbs_resources(workload_type, data_size_mb, urgency=\"normal\"):\n", " \"\"\"\n", " Intelligent PBS resource selection based on workload characteristics.\n", " \"\"\"\n", " \n", " # Base resource templates\n", " resource_templates = {\n", " \"bioinformatics\": {\n", " \"small\": {\"cores\": 4, \"memory\": \"16GB\", \"time\": \"02:00:00\", \"queue\": \"bioqueue\"},\n", " \"medium\": {\"cores\": 8, \"memory\": \"32GB\", \"time\": \"06:00:00\", \"queue\": 
\"bioqueue\"},\n", " \"large\": {\"cores\": 16, \"memory\": \"64GB\", \"time\": \"12:00:00\", \"queue\": \"bioqueue_long\"}\n", " },\n", " \"physics\": {\n", " \"small\": {\"cores\": 8, \"memory\": \"32GB\", \"time\": \"04:00:00\", \"queue\": \"physics\"},\n", " \"medium\": {\"cores\": 16, \"memory\": \"64GB\", \"time\": \"12:00:00\", \"queue\": \"physics\"},\n", " \"large\": {\"cores\": 32, \"memory\": \"128GB\", \"time\": \"24:00:00\", \"queue\": \"physics_long\"}\n", " },\n", " \"climate\": {\n", " \"small\": {\"cores\": 6, \"memory\": \"24GB\", \"time\": \"03:00:00\", \"queue\": \"climate\"},\n", " \"medium\": {\"cores\": 12, \"memory\": \"48GB\", \"time\": \"08:00:00\", \"queue\": \"climate\"},\n", " \"large\": {\"cores\": 24, \"memory\": \"96GB\", \"time\": \"16:00:00\", \"queue\": \"climate_long\"}\n", " },\n", " \"ml\": {\n", " \"small\": {\"cores\": 4, \"memory\": \"16GB\", \"time\": \"01:00:00\", \"queue\": \"gpu\", \"gres\": \"gpu:1\"},\n", " \"medium\": {\"cores\": 8, \"memory\": \"32GB\", \"time\": \"04:00:00\", \"queue\": \"gpu\", \"gres\": \"gpu:2\"},\n", " \"large\": {\"cores\": 16, \"memory\": \"64GB\", \"time\": \"12:00:00\", \"queue\": \"gpu_long\", \"gres\": \"gpu:4\"}\n", " }\n", " }\n", " \n", " # Determine size category based on data\n", " if data_size_mb < 100:\n", " size_category = \"small\"\n", " elif data_size_mb < 1000:\n", " size_category = \"medium\"\n", " else:\n", " size_category = \"large\"\n", " \n", " # Get base configuration\n", " if workload_type not in resource_templates:\n", " workload_type = \"physics\" # Default fallback\n", " \n", " config = resource_templates[workload_type][size_category].copy()\n", " \n", " # Adjust for urgency\n", " if urgency == \"urgent\":\n", " # Use express queue with reduced resources\n", " config[\"queue\"] = \"express\"\n", " config[\"time\"] = \"00:30:00\"\n", " config[\"cores\"] = min(4, config[\"cores\"])\n", " elif urgency == \"low\":\n", " # Use long queue with more resources\n", " 
config[\"queue\"] = config[\"queue\"] if config[\"queue\"].endswith(\"_long\") else config[\"queue\"] + \"_long\"\n", " config[\"cores\"] = int(config[\"cores\"] * 1.5)\n", " # Increase time limit\n", " time_parts = config[\"time\"].split(\":\")\n", " hours = int(time_parts[0]) * 2\n", " config[\"time\"] = f\"{hours:02d}:{time_parts[1]}:{time_parts[2]}\"\n", " \n", " return config\n", "\n", "# Example resource selections\n", "example_workloads = [\n", " (\"bioinformatics\", 500, \"normal\"),\n", " (\"physics\", 2000, \"low\"),\n", " (\"climate\", 150, \"urgent\"),\n", " (\"ml\", 800, \"normal\")\n", "]\n", "\n", "print(\"PBS Resource Selection Examples:\")\n", "print(\"=\" * 70)\n", "\n", "for workload, data_size, urgency in example_workloads:\n", " resources = select_pbs_resources(workload, data_size, urgency)\n", " print(f\"\\n{workload.upper()} ({data_size} MB, {urgency} priority):\")\n", " for key, value in resources.items():\n", " print(f\" {key}: {value}\")" ], "id": "cell-12" }, { "cell_type": "markdown", "metadata": {}, "source": [ "## PBS Job Arrays for Parameter Studies\n", "\n", "Use PBS job arrays for efficient parameter sweeps:" ], "id": "cell-13" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "@cluster(\n", " cores=4,\n", " memory=\"16GB\",\n", " time=\"01:00:00\",\n", " queue=\"normal\",\n", " pbs_array=\"1-20\" # PBS job array with 20 tasks\n", ")\n", "def drug_discovery_parameter_sweep(base_config):\n", " \"\"\"\n", " Pharmaceutical research parameter sweep using PBS job arrays.\n", " Each array task tests different molecular parameters.\n", " \"\"\"\n", " import os\n", " import numpy as np\n", " import random\n", " from math import exp, log\n", " \n", " # Get PBS array index (Torque sets PBS_ARRAYID; PBS Professional sets PBS_ARRAY_INDEX)\n", " array_index = int(os.environ.get('PBS_ARRAYID', os.environ.get('PBS_ARRAY_INDEX', '1')))\n", " \n", " # Define parameter space for drug discovery\n", " molecular_weights = np.linspace(150, 500, 20) # Typical drug MW range\n", " logp_values = np.linspace(-1, 5, 20) # Lipophilicity\n", " hbd_counts = 
list(range(0, 6)) # Hydrogen bond donors\n", " hba_counts = list(range(0, 11)) # Hydrogen bond acceptors\n", " \n", " # Select parameters for this array task\n", " mw = molecular_weights[array_index - 1]\n", " logp = logp_values[array_index - 1]\n", " \n", " # Reproducible random selection: seed both NumPy and the stdlib RNG used by random.choice\n", " np.random.seed(array_index * 42)\n", " random.seed(array_index * 42)\n", " hbd = random.choice(hbd_counts)\n", " hba = random.choice(hba_counts)\n", " \n", " print(f\"Array task {array_index}: MW={mw:.1f}, LogP={logp:.2f}, HBD={hbd}, HBA={hba}\")\n", " \n", " def calculate_drug_likeness(mw, logp, hbd, hba):\n", " \"\"\"Calculate drug-likeness using Lipinski's Rule of Five\"\"\"\n", " violations = 0\n", " \n", " if mw > 500:\n", " violations += 1\n", " if logp > 5:\n", " violations += 1\n", " if hbd > 5:\n", " violations += 1\n", " if hba > 10:\n", " violations += 1\n", " \n", " drug_likeness = max(0, 1.0 - violations * 0.25)\n", " return drug_likeness, violations\n", " \n", " def simulate_binding_affinity(mw, logp, hbd, hba):\n", " \"\"\"Simulate binding affinity to target protein\"\"\"\n", " # Simplified model based on molecular properties\n", " optimal_mw = 350\n", " optimal_logp = 2.5\n", " optimal_hbd = 2\n", " optimal_hba = 6\n", " \n", " mw_score = exp(-((mw - optimal_mw) / 100) ** 2)\n", " logp_score = exp(-((logp - optimal_logp) / 1.5) ** 2)\n", " hbd_score = exp(-((hbd - optimal_hbd) / 1.5) ** 2)\n", " hba_score = exp(-((hba - optimal_hba) / 2.5) ** 2)\n", " \n", " # Combine scores with some randomness\n", " base_affinity = (mw_score * logp_score * hbd_score * hba_score) ** 0.5\n", " random_factor = np.random.uniform(0.7, 1.3)\n", " \n", " binding_affinity = base_affinity * random_factor\n", " ic50 = 10 ** (-6 - 3 * binding_affinity) # Convert to IC50 (M)\n", " \n", " return binding_affinity, ic50\n", " \n", " def simulate_admet_properties(mw, logp, hbd, hba):\n", " \"\"\"Simulate ADMET (Absorption, Distribution, Metabolism, Excretion, Toxicity)\"\"\"\n", " # Absorption (permeability)\n", " 
permeability = 1 / (1 + exp(-(logp - 1.5)))\n", " permeability *= np.random.uniform(0.8, 1.2)\n", " \n", " # Distribution (plasma protein binding)\n", " ppb = min(99, max(10, 20 + logp * 15 + np.random.normal(0, 10)))\n", " \n", " # Metabolism (hepatic clearance)\n", " clearance = 0.5 + 0.3 * (1 / (1 + exp(-(mw - 300) / 50)))\n", " clearance *= np.random.uniform(0.7, 1.3)\n", " \n", " # Excretion (renal clearance)\n", " renal_clearance = max(0.1, 0.8 - logp * 0.1 + np.random.normal(0, 0.1))\n", " \n", " # Toxicity (simplified hERG channel binding)\n", " herg_risk = 1 / (1 + exp(-(logp - 3.5)))\n", " if mw > 400:\n", " # Cap at 1.0 so that (1 - herg_risk) in the ADMET score stays non-negative\n", " herg_risk = min(1.0, herg_risk * 1.5)\n", " \n", " return {\n", " 'permeability': permeability,\n", " 'plasma_protein_binding': ppb,\n", " 'hepatic_clearance': clearance,\n", " 'renal_clearance': renal_clearance,\n", " 'herg_risk': herg_risk\n", " }\n", " \n", " def calculate_developability_score(drug_likeness, binding_affinity, admet):\n", " \"\"\"Calculate overall drug developability score\"\"\"\n", " # Weight different factors\n", " likeness_weight = 0.2\n", " affinity_weight = 0.4\n", " admet_weight = 0.4\n", " \n", " # ADMET composite score\n", " admet_score = (\n", " admet['permeability'] * 0.3 +\n", " (1 - admet['herg_risk']) * 0.3 +\n", " (1 - admet['hepatic_clearance']) * 0.2 +\n", " admet['renal_clearance'] * 0.2\n", " )\n", " \n", " total_score = (\n", " drug_likeness * likeness_weight +\n", " binding_affinity * affinity_weight +\n", " admet_score * admet_weight\n", " )\n", " \n", " return total_score, admet_score\n", " \n", " # Run simulations\n", " drug_likeness, ro5_violations = calculate_drug_likeness(mw, logp, hbd, hba)\n", " binding_affinity, ic50 = simulate_binding_affinity(mw, logp, hbd, hba)\n", " admet_props = simulate_admet_properties(mw, logp, hbd, hba)\n", " developability_score, admet_score = calculate_developability_score(\n", " drug_likeness, binding_affinity, admet_props\n", " )\n", " \n", " # Compile results\n", " compound_results = {\n", " 
'array_task_id': array_index,\n", " 'molecular_properties': {\n", " 'molecular_weight': mw,\n", " 'logp': logp,\n", " 'hbd_count': hbd,\n", " 'hba_count': hba\n", " },\n", " 'drug_likeness': {\n", " 'score': drug_likeness,\n", " 'ro5_violations': ro5_violations,\n", " 'passes_ro5': ro5_violations <= 1\n", " },\n", " 'target_binding': {\n", " 'affinity_score': binding_affinity,\n", " 'ic50_M': ic50,\n", " 'pic50': -log(ic50, 10) if ic50 > 0 else 0\n", " },\n", " 'admet_properties': admet_props,\n", " 'overall_assessment': {\n", " 'developability_score': developability_score,\n", " 'admet_score': admet_score,\n", " 'promising_candidate': developability_score > 0.6 and binding_affinity > 0.5\n", " }\n", " }\n", " \n", " return compound_results\n", "\n", "# Run drug discovery parameter sweep\n", "drug_config = {\n", " 'target_name': 'EGFR',\n", " 'assay_type': 'binding',\n", " 'screening_library': 'chembl'\n", "}\n", "\n", "# This will run as one task of the PBS array\n", "drug_result = drug_discovery_parameter_sweep(drug_config)\n", "\n", "print(f\"\\nDRUG DISCOVERY ANALYSIS - Task {drug_result['array_task_id']}\")\n", "print(\"=\" * 60)\n", "\n", "mol_props = drug_result['molecular_properties']\n", "print(f\"Molecular Weight: {mol_props['molecular_weight']:.1f} Da\")\n", "print(f\"LogP: {mol_props['logp']:.2f}\")\n", "print(f\"H-bond donors: {mol_props['hbd_count']}\")\n", "print(f\"H-bond acceptors: {mol_props['hba_count']}\")\n", "\n", "drug_like = drug_result['drug_likeness']\n", "print(f\"\\nDrug-likeness score: {drug_like['score']:.3f}\")\n", "print(f\"Rule of 5 violations: {drug_like['ro5_violations']}\")\n", "print(f\"Passes Lipinski's Rule: {drug_like['passes_ro5']}\")\n", "\n", "binding = drug_result['target_binding']\n", "print(f\"\\nBinding affinity score: {binding['affinity_score']:.3f}\")\n", "print(f\"IC50: {binding['ic50_M']:.2e} M\")\n", "print(f\"pIC50: {binding['pic50']:.2f}\")\n", "\n", "assessment = drug_result['overall_assessment']\n", 
"print(f\"\\nDevelopability score: {assessment['developability_score']:.3f}\")\n", "print(f\"ADMET score: {assessment['admet_score']:.3f}\")\n", "print(f\"Promising candidate: {assessment['promising_candidate']}\")" ], "id": "cell-14" }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Monitoring PBS Jobs\n", "\n", "Monitor and manage PBS jobs using Clustrix:" ], "id": "cell-15" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from clustrix import ClusterExecutor\n", "\n", "# Get the configured executor\n", "config = clustrix.get_config()\n", "executor = ClusterExecutor(config)\n", "\n", "try:\n", " executor.connect()\n", " print(\"✓ Successfully connected to PBS cluster\")\n", " \n", " # Check PBS version\n", " stdout, stderr = executor._execute_command(\"qstat --version\")\n", " if stdout:\n", " print(f\"✓ PBS version: {stdout.strip()}\")\n", " \n", " # Check available queues\n", " stdout, stderr = executor._execute_command(\"qstat -Q\")\n", " if stdout:\n", " print(\"\\nAvailable queues:\")\n", " lines = stdout.strip().split('\\n')\n", " for line in lines[2:7]: # Skip header, show first 5 queues\n", " parts = line.split()\n", " if len(parts) >= 3:\n", " queue_name = parts[0]\n", " max_jobs = parts[1] if parts[1] != '--' else 'unlimited'\n", " total_jobs = parts[2]\n", " print(f\" {queue_name}: {total_jobs} jobs, max: {max_jobs}\")\n", " \n", " # Check node status (raw string keeps the regex backslashes intact for grep)\n", " stdout, stderr = executor._execute_command(r\"pbsnodes -a | grep -E '^(\\w+|\\s+state)' | head -20\")\n", " if stdout:\n", " print(\"\\nNode status (sample):\")\n", " lines = stdout.strip().split('\\n')\n", " current_node = None\n", " for line in lines[:10]: # Show first few nodes\n", " if not line.startswith(' '):\n", " current_node = line.strip()\n", " elif 'state' in line:\n", " state = line.split('=')[1].strip() if '=' in line else 'unknown'\n", " print(f\" {current_node}: {state}\")\n", " \n", " # Check user's job status\n", " username = 
config.username\n", " stdout, stderr = executor._execute_command(f\"qstat -u {username}\")\n", " if stdout and len(stdout.strip().split('\\n')) > 2:\n", " print(f\"\\nYour current jobs:\")\n", " lines = stdout.strip().split('\\n')\n", " for line in lines[2:]: # Skip headers\n", " print(f\" {line}\")\n", " else:\n", " print(f\"\\n✓ No jobs currently running for user {username}\")\n", " \n", " executor.disconnect()\n", " print(\"\\n✓ PBS cluster monitoring completed successfully\")\n", " \n", "except Exception as e:\n", " print(f\"✗ Connection or monitoring failed: {e}\")\n", " print(\"Please check your PBS cluster configuration and connectivity\")" ], "id": "cell-16" }, { "cell_type": "markdown", "metadata": {}, "source": [ "## PBS Configuration Best Practices\n", "\n", "### Environment-Specific Configuration Files\n", "\n", "Create different configurations for different PBS environments:" ], "id": "cell-17" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Create PBS configuration for different research domains\n", "\n", "bioinformatics_config = {\n", " 'cluster_type': 'pbs',\n", " 'cluster_host': 'bio-cluster.university.edu',\n", " 'username': 'researcher',\n", " 'default_queue': 'bioqueue',\n", " 'default_cores': 8,\n", " 'default_memory': '32GB',\n", " 'default_time': '06:00:00',\n", " 'module_loads': ['python/3.9', 'blast/2.12', 'hmmer/3.3'],\n", " 'remote_work_dir': '/scratch/bio/clustrix',\n", " 'max_parallel_jobs': 20\n", "}\n", "\n", "physics_config = {\n", " 'cluster_type': 'pbs',\n", " 'cluster_host': 'physics-hpc.university.edu',\n", " 'username': 'physicist',\n", " 'default_queue': 'physics',\n", " 'default_cores': 16,\n", " 'default_memory': '64GB',\n", " 'default_time': '12:00:00',\n", " 'module_loads': ['python/3.9', 'openmpi/4.1', 'fftw/3.3'],\n", " 'remote_work_dir': '/home/physicist/clustrix',\n", " 'features': 'infiniband', # Request high-speed interconnect\n", " 'max_parallel_jobs': 10\n", "}\n", 
"\n", "climate_config = {\n", " 'cluster_type': 'pbs',\n", " 'cluster_host': 'climate-compute.noaa.gov',\n", " 'username': 'climatologist',\n", " 'default_queue': 'climate',\n", " 'default_cores': 12,\n", " 'default_memory': '48GB',\n", " 'default_time': '08:00:00',\n", " 'module_loads': ['python/3.9', 'netcdf/4.8', 'gdal/3.4'],\n", " 'remote_work_dir': '/data/climate/clustrix',\n", " 'max_parallel_jobs': 15\n", "}\n", "\n", "# Example of selecting configuration based on research domain\n", "def configure_for_domain(domain):\n", " configs = {\n", " 'bioinformatics': bioinformatics_config,\n", " 'physics': physics_config,\n", " 'climate': climate_config\n", " }\n", " \n", " if domain in configs:\n", " clustrix.configure(**configs[domain])\n", " print(f\"Configured Clustrix for {domain} research\")\n", " return configs[domain]\n", " else:\n", " print(f\"Unknown domain: {domain}. Available: {list(configs.keys())}\")\n", " return None\n", "\n", "# Configure for bioinformatics research\n", "selected_config = configure_for_domain('bioinformatics')\n", "if selected_config:\n", " print(\"\\nConfiguration details:\")\n", " for key, value in selected_config.items():\n", " print(f\" {key}: {value}\")" ], "id": "cell-18" }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Summary\n", "\n", "This tutorial covered PBS/Torque cluster usage with Clustrix:\n", "\n", "1. **PBS Configuration** - Setting up Clustrix for PBS clusters\n", "2. **Bioinformatics Applications** - DNA sequence analysis and genomics\n", "3. **Materials Science** - Molecular dynamics simulations\n", "4. **Climate Research** - Large-scale environmental data analysis\n", "5. **Drug Discovery** - Pharmaceutical parameter sweeps with job arrays\n", "6. **Resource Management** - Intelligent queue and resource selection\n", "7. **Job Monitoring** - PBS cluster status and job management\n", "8. 
**Best Practices** - Domain-specific configurations\n", "\n", "### Key PBS Features:\n", "\n", "- **Queue Management**: Choose appropriate queues for different workload types\n", "- **Resource Specification**: Use PBS directives for cores, memory, and time\n", "- **Job Arrays**: Efficient parameter sweeps with `pbs_array` parameter\n", "- **Feature Requests**: Specify hardware features like InfiniBand\n", "- **Module Loading**: Automatic environment setup with required software\n", "- **Walltime Management**: Realistic time estimates for job completion\n", "\n", "### Next Steps:\n", "\n", "- Explore [SLURM Tutorial](slurm_tutorial.ipynb) for SLURM-specific features\n", "- Try [Kubernetes Tutorial](kubernetes_tutorial.ipynb) for containerized computing\n", "- Review [SGE Tutorial](sge_tutorial.ipynb) for Sun Grid Engine clusters\n", "- Check the [SSH Setup Guide](../ssh_setup.rst) for secure authentication\n", "\n", "For more information, visit the [Clustrix Documentation](https://clustrix.readthedocs.io)." ], "id": "cell-19" } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.0" } }, "nbformat": 4, "nbformat_minor": 4 }