Clustrix Basic Usage Tutorial

Open In Colab

This notebook demonstrates the basic usage of Clustrix for distributed computing.

Installation

First, let’s install Clustrix:

[ ]:
# Install Clustrix (uncomment if running in Colab)
# !pip install clustrix

Configuration Options

Interactive Widget Configuration (Recommended for Jupyter)

Clustrix provides an interactive widget for easy configuration management in Jupyter notebooks:

%%clusterfy
# This creates an interactive widget with:
# - Pre-built cluster templates (AWS, GCP, Azure, SLURM, etc.)
# - Forms to create and edit configurations
# - One-click configuration application
# - Save/load configurations to files

Widget Features:

  • Default Templates: Pre-configured setups for major cloud providers

  • Interactive Forms: GUI elements for all configuration options

  • Configuration Management: Create, edit, delete, and apply configurations

  • File I/O: Save/load configurations as YAML or JSON files

Programmatic Configuration

For programmatic setup, use the configure() function:

[ ]:
import clustrix
import numpy as np
import time

# Configure for local execution
clustrix.configure(
    cluster_host=None,  # Use local execution
    default_cores=4,
    auto_parallel=True
)

# Get current configuration
config = clustrix.get_config()

print("Current configuration:")
print(f"  Cluster type: {config.cluster_type}")
print(f"  Cluster host: {config.cluster_host}")
print(f"  Default cores: {config.default_cores}")
print(f"  Default memory: {config.default_memory}")
print(f"  Auto parallel: {config.auto_parallel}")
print(f"  Max parallel jobs: {config.max_parallel_jobs}")

Simple Function Decoration

The simplest way to use Clustrix is with the @cluster decorator:

[ ]:
@clustrix.cluster(cores=2)
def simple_computation(x, y):
    """A simple function that adds two numbers."""
    result = x + y
    print(f"Computing {x} + {y} = {result}")
    return result

# Execute the function
result = simple_computation(10, 20)
print(f"Result: {result}")

CPU-Intensive Computation

Let’s try a more computational task that benefits from parallelization:

[ ]:
@clustrix.cluster(cores=4, parallel=True)
def monte_carlo_pi(n_samples):
    """Estimate π using Monte Carlo method."""
    import random

    count_inside = 0

    # This loop could be parallelized automatically
    for i in range(n_samples):
        x = random.random()
        y = random.random()

        if x*x + y*y <= 1:
            count_inside += 1

    pi_estimate = 4.0 * count_inside / n_samples
    return pi_estimate

# Run with different sample sizes
for n in [1000, 10000, 100000]:
    start_time = time.time()
    pi_est = monte_carlo_pi(n)
    elapsed = time.time() - start_time

    print(f"n={n:6d}: π ≈ {pi_est:.6f} (error: {abs(pi_est - np.pi):.6f}, time: {elapsed:.3f}s)")

Array Processing

Clustrix works well with NumPy arrays and scientific computing:

[ ]:
@clustrix.cluster(cores=4, memory="2GB")
def matrix_computation(size):
    """Perform matrix operations."""
    import numpy as np

    # Create random matrices
    A = np.random.random((size, size))
    B = np.random.random((size, size))

    # Matrix multiplication
    C = np.dot(A, B)

    # Some statistics
    return {
        'shape': C.shape,
        'mean': np.mean(C),
        'std': np.std(C),
        'max': np.max(C),
        'min': np.min(C)
    }

# Test with different matrix sizes
sizes = [100, 200, 300]

for size in sizes:
    start_time = time.time()
    stats = matrix_computation(size)
    elapsed = time.time() - start_time

    print(f"Size {size}x{size}: mean={stats['mean']:.4f}, std={stats['std']:.4f}, time={elapsed:.3f}s")

Data Processing Pipeline

Let’s create a more realistic data processing example:

[ ]:
@clustrix.cluster(cores=4, parallel=True)
def process_dataset(data, operations):
    """Process a dataset with multiple operations."""
    import numpy as np

    results = []

    # This loop could be parallelized
    for item in data:
        processed = item

        # Apply operations
        for op in operations:
            if op == 'square':
                processed = processed ** 2
            elif op == 'sqrt':
                processed = np.sqrt(abs(processed))
            elif op == 'log':
                processed = np.log(abs(processed) + 1)
            elif op == 'normalize':
                processed = processed / (1 + abs(processed))

        results.append(processed)

    return results

# Create test data
test_data = np.random.randn(1000) * 10
operations = ['square', 'sqrt', 'normalize']

# Process the data
start_time = time.time()
processed_data = process_dataset(test_data, operations)
elapsed = time.time() - start_time

print(f"Processed {len(test_data)} items in {elapsed:.3f} seconds")
print(f"Input range: [{np.min(test_data):.2f}, {np.max(test_data):.2f}]")
print(f"Output range: [{np.min(processed_data):.2f}, {np.max(processed_data):.2f}]")

Performance Comparison

Let’s compare parallel vs sequential execution:

[ ]:
def cpu_intensive_task(n):
    """A CPU-intensive task for benchmarking."""
    result = 0
    for i in range(n):
        result += i ** 0.5
    return result

# Sequential version
@clustrix.cluster(parallel=False)
def sequential_processing(data):
    results = []
    for item in data:
        results.append(cpu_intensive_task(item))
    return results

# Parallel version
@clustrix.cluster(cores=4, parallel=True)
def parallel_processing(data):
    results = []
    for item in data:
        results.append(cpu_intensive_task(item))
    return results

# Test data
test_sizes = [10000] * 8  # 8 tasks of 10k iterations each

# Time sequential execution
start_time = time.time()
seq_results = sequential_processing(test_sizes)
seq_time = time.time() - start_time

# Time parallel execution
start_time = time.time()
par_results = parallel_processing(test_sizes)
par_time = time.time() - start_time

print(f"Sequential execution: {seq_time:.3f} seconds")
print(f"Parallel execution: {par_time:.3f} seconds")
print(f"Speedup: {seq_time/par_time:.2f}x")
print(f"Results match: {seq_results == par_results}")

Configuration Options

Clustrix provides many configuration options:

[ ]:
# Get current configuration
config = clustrix.get_config()

print("Current configuration:")
print(f"  Cluster type: {config.cluster_type}")
print(f"  Cluster host: {config.cluster_host}")
print(f"  Default cores: {config.default_cores}")
print(f"  Default memory: {config.default_memory}")
print(f"  Auto parallel: {config.auto_parallel}")
print(f"  Max parallel jobs: {config.max_parallel_jobs}")

Cost Monitoring

Clustrix includes built-in cost monitoring for cloud providers:

from clustrix import cost_tracking_decorator

# Automatic cost tracking
@cost_tracking_decorator('aws', 'p3.2xlarge')
@clustrix.cluster(cores=8, memory='60GB')
def expensive_training():
    # Your training code here
    pass

# Execution includes cost reporting
result = expensive_training()
print(f"Training cost: ${result['cost_report']['cost_estimate']['estimated_cost']:.2f}")

Next Steps

This tutorial covered the basics of Clustrix usage. For more advanced topics, check out:

  • Interactive Widget: Use %%clusterfy for GUI-based configuration management

  • Cost Monitoring: Track expenses with built-in cost monitoring for AWS, GCP, Azure, Lambda Cloud

  • Remote Cluster Configuration: Setting up SLURM, PBS, or SSH clusters

  • Advanced Parallelization: Custom loop detection and optimization

  • Machine Learning Workflows: Using Clustrix with scikit-learn, TensorFlow, or PyTorch

  • Scientific Computing: Integration with SciPy, pandas, and other scientific libraries

Visit the Clustrix documentation for detailed guides and API reference.