{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": "# Clustrix Basic Usage Tutorial\n\n[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/ContextLab/clustrix/blob/master/docs/notebooks/basic_usage.ipynb)\n\nThis notebook demonstrates the basic usage of Clustrix for distributed computing.\n\n## Installation\n\nFirst, let's install Clustrix:", "outputs": [], "id": "cell-0" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Install Clustrix (uncomment if running in Colab)\n", "# !pip install clustrix" ], "id": "cell-1" }, { "cell_type": "markdown", "metadata": {}, "source": "## Configuration Options\n\n### Interactive Widget Configuration (Recommended for Jupyter)\n\nClustrix provides an interactive widget for easy configuration management in Jupyter notebooks:\n\n```python\n%%clusterfy\n# This creates an interactive widget with:\n# - Pre-built cluster templates (AWS, GCP, Azure, SLURM, etc.)\n# - Forms to create and edit configurations\n# - One-click configuration application\n# - Save/load configurations to files\n```\n\n**Widget Features:**\n- **Default Templates**: Pre-configured setups for major cloud providers\n- **Interactive Forms**: GUI elements for all configuration options \n- **Configuration Management**: Create, edit, delete, and apply configurations\n- **File I/O**: Save/load configurations as YAML or JSON files\n\n### Programmatic Configuration\n\nFor programmatic setup, use the `configure()` function:", "id": "cell-2", "outputs": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import clustrix\n", "import numpy as np\n", "import time\n", "\n", "# Configure for local execution\n", "clustrix.configure(\n", " cluster_host=None, # Use local execution\n", " default_cores=4,\n", " auto_parallel=True\n", ")\n", "\n", "# Get current configuration\n", "config = clustrix.get_config()\n", "\n", "print(\"Current configuration:\")\n", "print(f\" Cluster type: {config.cluster_type}\")\n", "print(f\" Cluster host: {config.cluster_host}\")\n", "print(f\" Default cores: {config.default_cores}\")\n", "print(f\" Default memory: {config.default_memory}\")\n", "print(f\" Auto parallel: {config.auto_parallel}\")\n", "print(f\" Max parallel jobs: {config.max_parallel_jobs}\")" ], "id": "cell-3" }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Simple Function Decoration\n", "\n", "The simplest way to use Clustrix is with the `@cluster` decorator:" ], "id": "cell-4" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "@clustrix.cluster(cores=2)\n", "def simple_computation(x, y):\n", " \"\"\"A simple function that adds two numbers.\"\"\"\n", " result = x + y\n", " print(f\"Computing {x} + {y} = {result}\")\n", " return result\n", "\n", "# Execute the function\n", "result = simple_computation(10, 20)\n", "print(f\"Result: {result}\")" ], "id": "cell-5" }, { "cell_type": "markdown", "metadata": {}, "source": [ "## CPU-Intensive Computation\n", "\n", "Let's try a more computational task that benefits from parallelization:" ], "id": "cell-6" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "@clustrix.cluster(cores=4, parallel=True)\n", "def monte_carlo_pi(n_samples):\n", " \"\"\"Estimate π using Monte Carlo method.\"\"\"\n", " import random\n", " \n", " count_inside = 0\n", " \n", " # This loop could be parallelized automatically\n", " for i in range(n_samples):\n", " x = random.random()\n", " y = random.random()\n", " \n", " if x*x + y*y <= 1:\n", " count_inside += 1\n", " \n", " pi_estimate = 4.0 * count_inside / n_samples\n", " return pi_estimate\n", "\n", "# Run with different sample sizes\n", "for n in [1000, 10000, 100000]:\n", " start_time = time.time()\n", " pi_est = monte_carlo_pi(n)\n", " elapsed = time.time() - start_time\n", " \n", " print(f\"n={n:6d}: π ≈ {pi_est:.6f} (error: {abs(pi_est - np.pi):.6f}, time: {elapsed:.3f}s)\")" ], "id": "cell-7" }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Array Processing\n", "\n", "Clustrix works well with NumPy arrays and scientific computing:" ], "id": "cell-8" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "@clustrix.cluster(cores=4, memory=\"2GB\")\n", "def matrix_computation(size):\n", " \"\"\"Perform matrix operations.\"\"\"\n", " import numpy as np\n", " \n", " # Create random matrices\n", " A = np.random.random((size, size))\n", " B = np.random.random((size, size))\n", " \n", " # Matrix multiplication\n", " C = np.dot(A, B)\n", " \n", " # Some statistics\n", " return {\n", " 'shape': C.shape,\n", " 'mean': np.mean(C),\n", " 'std': np.std(C),\n", " 'max': np.max(C),\n", " 'min': np.min(C)\n", " }\n", "\n", "# Test with different matrix sizes\n", "sizes = [100, 200, 300]\n", "\n", "for size in sizes:\n", " start_time = time.time()\n", " stats = matrix_computation(size)\n", " elapsed = time.time() - start_time\n", " \n", " print(f\"Size {size}x{size}: mean={stats['mean']:.4f}, std={stats['std']:.4f}, time={elapsed:.3f}s\")" ], "id": "cell-9" }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Data Processing Pipeline\n", "\n", "Let's create a more realistic data processing example:" ], "id": "cell-10" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "@clustrix.cluster(cores=4, parallel=True)\n", "def process_dataset(data, operations):\n", " \"\"\"Process a dataset with multiple operations.\"\"\"\n", " import numpy as np\n", " \n", " results = []\n", " \n", " # This loop could be parallelized\n", " for item in data:\n", " processed = item\n", " \n", " # Apply operations\n", " for op in operations:\n", " if op == 'square':\n", " processed = processed ** 2\n", " elif op == 'sqrt':\n", " processed = np.sqrt(abs(processed))\n", " elif op == 'log':\n", " processed = np.log(abs(processed) + 1)\n", " elif op == 'normalize':\n", " processed = processed / (1 + abs(processed))\n", " \n", " results.append(processed)\n", " \n", " return results\n", "\n", "# Create test data\n", "test_data = np.random.randn(1000) * 10\n", "operations = ['square', 'sqrt', 'normalize']\n", "\n", "# Process the data\n", "start_time = time.time()\n", "processed_data = process_dataset(test_data, operations)\n", "elapsed = time.time() - start_time\n", "\n", "print(f\"Processed {len(test_data)} items in {elapsed:.3f} seconds\")\n", "print(f\"Input range: [{np.min(test_data):.2f}, {np.max(test_data):.2f}]\")\n", "print(f\"Output range: [{np.min(processed_data):.2f}, {np.max(processed_data):.2f}]\")" ], "id": "cell-11" }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Performance Comparison\n", "\n", "Let's compare parallel vs sequential execution:" ], "id": "cell-12" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def cpu_intensive_task(n):\n", " \"\"\"A CPU-intensive task for benchmarking.\"\"\"\n", " result = 0\n", " for i in range(n):\n", " result += i ** 0.5\n", " return result\n", "\n", "# Sequential version\n", "@clustrix.cluster(parallel=False)\n", "def sequential_processing(data):\n", " results = []\n", " for item in data:\n", " results.append(cpu_intensive_task(item))\n", " return results\n", "\n", "# Parallel version\n", "@clustrix.cluster(cores=4, parallel=True)\n", "def parallel_processing(data):\n", " results = []\n", " for item in data:\n", " results.append(cpu_intensive_task(item))\n", " return results\n", "\n", "# Test data\n", "test_sizes = [10000] * 8 # 8 tasks of 10k iterations each\n", "\n", "# Time sequential execution\n", "start_time = time.time()\n", "seq_results = sequential_processing(test_sizes)\n", "seq_time = time.time() - start_time\n", "\n", "# Time parallel execution\n", "start_time = time.time()\n", "par_results = parallel_processing(test_sizes)\n", "par_time = time.time() - start_time\n", "\n", "print(f\"Sequential execution: {seq_time:.3f} seconds\")\n", "print(f\"Parallel execution: {par_time:.3f} seconds\")\n", "print(f\"Speedup: {seq_time/par_time:.2f}x\")\n", "print(f\"Results match: {seq_results == par_results}\")" ], "id": "cell-13" }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Configuration Options\n", "\n", "Clustrix provides many configuration options:" ], "id": "cell-14" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Get current configuration\n", "config = clustrix.get_config()\n", "\n", "print(\"Current configuration:\")\n", "print(f\" Cluster type: {config.cluster_type}\")\n", "print(f\" Cluster host: {config.cluster_host}\")\n", "print(f\" Default cores: {config.default_cores}\")\n", "print(f\" Default memory: {config.default_memory}\")\n", "print(f\" Auto parallel: {config.auto_parallel}\")\n", "print(f\" Max parallel jobs: {config.max_parallel_jobs}\")" ], "id": "cell-15" }, { "cell_type": "markdown", "metadata": {}, "source": "## Cost Monitoring\n\nClustrix includes built-in cost monitoring for cloud providers:\n\n```python\nfrom clustrix import cost_tracking_decorator\n\n# Automatic cost tracking\n@cost_tracking_decorator('aws', 'p3.2xlarge')\n@clustrix.cluster(cores=8, memory='60GB')\ndef expensive_training():\n # Your training code here\n pass\n\n# Execution includes cost reporting\nresult = expensive_training()\nprint(f\"Training cost: ${result['cost_report']['cost_estimate']['estimated_cost']:.2f}\")\n```\n\n## Next Steps\n\nThis tutorial covered the basics of Clustrix usage. For more advanced topics, check out:\n\n- **Interactive Widget**: Use `%%clusterfy` for GUI-based configuration management\n- **Cost Monitoring**: Track expenses with built-in cost monitoring for AWS, GCP, Azure, Lambda Cloud\n- **Remote Cluster Configuration**: Setting up SLURM, PBS, or SSH clusters\n- **Advanced Parallelization**: Custom loop detection and optimization\n- **Machine Learning Workflows**: Using Clustrix with scikit-learn, TensorFlow, or PyTorch\n- **Scientific Computing**: Integration with SciPy, pandas, and other scientific libraries\n\nVisit the [Clustrix documentation](https://clustrix.readthedocs.io) for detailed guides and API reference.", "id": "cell-16", "outputs": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.0" } }, "nbformat": 4, "nbformat_minor": 4 }