Clustrix Basic Usage Tutorial¶
This notebook demonstrates the basic usage of Clustrix for distributed computing.
Installation¶
First, let’s install Clustrix:
[ ]:
# Install Clustrix (uncomment if running in Colab)
# !pip install clustrix
Configuration Options¶
Interactive Widget Configuration (Recommended for Jupyter)¶
Clustrix provides an interactive widget for easy configuration management in Jupyter notebooks:
%%clusterfy
# This creates an interactive widget with:
# - Pre-built cluster templates (AWS, GCP, Azure, SLURM, etc.)
# - Forms to create and edit configurations
# - One-click configuration application
# - Save/load configurations to files
Widget Features:
Default Templates: Pre-configured setups for major cloud providers
Interactive Forms: GUI elements for all configuration options
Configuration Management: Create, edit, delete, and apply configurations
File I/O: Save/load configurations as YAML or JSON files
Programmatic Configuration¶
For programmatic setup, use the configure() function:
[ ]:
import clustrix
import numpy as np
import time
# Configure for local execution
clustrix.configure(
cluster_host=None, # Use local execution
default_cores=4,
auto_parallel=True
)
# Get current configuration
config = clustrix.get_config()
print("Current configuration:")
print(f" Cluster type: {config.cluster_type}")
print(f" Cluster host: {config.cluster_host}")
print(f" Default cores: {config.default_cores}")
print(f" Default memory: {config.default_memory}")
print(f" Auto parallel: {config.auto_parallel}")
print(f" Max parallel jobs: {config.max_parallel_jobs}")
Simple Function Decoration¶
The simplest way to use Clustrix is with the @cluster decorator:
[ ]:
@clustrix.cluster(cores=2)
def simple_computation(x, y):
"""A simple function that adds two numbers."""
result = x + y
print(f"Computing {x} + {y} = {result}")
return result
# Execute the function
result = simple_computation(10, 20)
print(f"Result: {result}")
CPU-Intensive Computation¶
Let’s try a more computational task that benefits from parallelization:
[ ]:
@clustrix.cluster(cores=4, parallel=True)
def monte_carlo_pi(n_samples):
"""Estimate π using Monte Carlo method."""
import random
count_inside = 0
# This loop could be parallelized automatically
for i in range(n_samples):
x = random.random()
y = random.random()
if x*x + y*y <= 1:
count_inside += 1
pi_estimate = 4.0 * count_inside / n_samples
return pi_estimate
# Run with different sample sizes
for n in [1000, 10000, 100000]:
start_time = time.time()
pi_est = monte_carlo_pi(n)
elapsed = time.time() - start_time
print(f"n={n:6d}: π ≈ {pi_est:.6f} (error: {abs(pi_est - np.pi):.6f}, time: {elapsed:.3f}s)")
Array Processing¶
Clustrix works well with NumPy arrays and scientific computing:
[ ]:
@clustrix.cluster(cores=4, memory="2GB")
def matrix_computation(size):
"""Perform matrix operations."""
import numpy as np
# Create random matrices
A = np.random.random((size, size))
B = np.random.random((size, size))
# Matrix multiplication
C = np.dot(A, B)
# Some statistics
return {
'shape': C.shape,
'mean': np.mean(C),
'std': np.std(C),
'max': np.max(C),
'min': np.min(C)
}
# Test with different matrix sizes
sizes = [100, 200, 300]
for size in sizes:
start_time = time.time()
stats = matrix_computation(size)
elapsed = time.time() - start_time
print(f"Size {size}x{size}: mean={stats['mean']:.4f}, std={stats['std']:.4f}, time={elapsed:.3f}s")
Data Processing Pipeline¶
Let’s create a more realistic data processing example:
[ ]:
@clustrix.cluster(cores=4, parallel=True)
def process_dataset(data, operations):
"""Process a dataset with multiple operations."""
import numpy as np
results = []
# This loop could be parallelized
for item in data:
processed = item
# Apply operations
for op in operations:
if op == 'square':
processed = processed ** 2
elif op == 'sqrt':
processed = np.sqrt(abs(processed))
elif op == 'log':
processed = np.log(abs(processed) + 1)
elif op == 'normalize':
processed = processed / (1 + abs(processed))
results.append(processed)
return results
# Create test data
test_data = np.random.randn(1000) * 10
operations = ['square', 'sqrt', 'normalize']
# Process the data
start_time = time.time()
processed_data = process_dataset(test_data, operations)
elapsed = time.time() - start_time
print(f"Processed {len(test_data)} items in {elapsed:.3f} seconds")
print(f"Input range: [{np.min(test_data):.2f}, {np.max(test_data):.2f}]")
print(f"Output range: [{np.min(processed_data):.2f}, {np.max(processed_data):.2f}]")
Performance Comparison¶
Let’s compare parallel vs sequential execution:
[ ]:
def cpu_intensive_task(n):
"""A CPU-intensive task for benchmarking."""
result = 0
for i in range(n):
result += i ** 0.5
return result
# Sequential version
@clustrix.cluster(parallel=False)
def sequential_processing(data):
results = []
for item in data:
results.append(cpu_intensive_task(item))
return results
# Parallel version
@clustrix.cluster(cores=4, parallel=True)
def parallel_processing(data):
results = []
for item in data:
results.append(cpu_intensive_task(item))
return results
# Test data
test_sizes = [10000] * 8 # 8 tasks of 10k iterations each
# Time sequential execution
start_time = time.time()
seq_results = sequential_processing(test_sizes)
seq_time = time.time() - start_time
# Time parallel execution
start_time = time.time()
par_results = parallel_processing(test_sizes)
par_time = time.time() - start_time
print(f"Sequential execution: {seq_time:.3f} seconds")
print(f"Parallel execution: {par_time:.3f} seconds")
print(f"Speedup: {seq_time/par_time:.2f}x")
print(f"Results match: {seq_results == par_results}")
Configuration Options¶
Clustrix provides many configuration options:
[ ]:
# Get current configuration
config = clustrix.get_config()
print("Current configuration:")
print(f" Cluster type: {config.cluster_type}")
print(f" Cluster host: {config.cluster_host}")
print(f" Default cores: {config.default_cores}")
print(f" Default memory: {config.default_memory}")
print(f" Auto parallel: {config.auto_parallel}")
print(f" Max parallel jobs: {config.max_parallel_jobs}")
Cost Monitoring¶
Clustrix includes built-in cost monitoring for cloud providers:
from clustrix import cost_tracking_decorator
# Automatic cost tracking
@cost_tracking_decorator('aws', 'p3.2xlarge')
@clustrix.cluster(cores=8, memory='60GB')
def expensive_training():
# Your training code here
pass
# Execution includes cost reporting
result = expensive_training()
print(f"Training cost: ${result['cost_report']['cost_estimate']['estimated_cost']:.2f}")
Next Steps¶
This tutorial covered the basics of Clustrix usage. For more advanced topics, check out:
Interactive Widget: Use
%%clusterfyfor GUI-based configuration managementCost Monitoring: Track expenses with built-in cost monitoring for AWS, GCP, Azure, Lambda Cloud
Remote Cluster Configuration: Setting up SLURM, PBS, or SSH clusters
Advanced Parallelization: Custom loop detection and optimization
Machine Learning Workflows: Using Clustrix with scikit-learn, TensorFlow, or PyTorch
Scientific Computing: Integration with SciPy, pandas, and other scientific libraries
Visit the Clustrix documentation for detailed guides and API reference.