Google Cloud Platform (GCP) Tutorial¶

This tutorial demonstrates how to use Clustrix with Google Cloud Platform (GCP) infrastructure for scalable distributed computing.

Overview¶

GCP provides several services that integrate well with Clustrix:

Compute Engine: Virtual machines for compute clusters
Google Kubernetes Engine (GKE): Managed Kubernetes clusters
Batch: Managed job scheduling service
Cloud Run: Serverless container platform
Vertex AI: Machine learning platform
Cloud Storage: Object storage for data and results
VPC: Network isolation and security
Preemptible (Spot) VMs: Lower-cost compute instances that can be reclaimed at any time

Complete Setup Guide from Scratch¶

Step 1: Google Cloud Account Setup¶

Create Google Cloud Account:
- Go to Google Cloud Console
- Sign up with your Google account or create a new one
- Accept the terms of service
Enable Billing:
- Navigate to Billing in the Google Cloud Console
- Create a billing account and add a payment method
- Important: New users have typically been offered $300 in free credits; check the current offer when you sign up
- Set up billing alerts to avoid unexpected charges
Create a New Project:
- Go to the Project Selector in the console
- Click “New Project”
- Choose a unique project ID (e.g., my-clustrix-project-123)
- Enable billing for this project

Step 2: Install Google Cloud SDK (gcloud CLI)¶

On macOS:

# Using Homebrew (recommended)
brew install google-cloud-sdk

# Or download installer
curl https://sdk.cloud.google.com | bash
exec -l $SHELL

On Linux:

# Download and install
curl https://sdk.cloud.google.com | bash
exec -l $SHELL

# Or use the package manager (Ubuntu/Debian); requires the Google Cloud apt repository to be configured first
sudo apt-get update && sudo apt-get install google-cloud-sdk

On Windows:

Download the installer from Google Cloud SDK page
Run the installer and follow instructions

Step 3: Enable Required APIs¶

Enable the necessary Google Cloud APIs for this tutorial:

# Set your project ID
export PROJECT_ID="your-project-id-here"
gcloud config set project $PROJECT_ID

# Enable required APIs
gcloud services enable compute.googleapis.com
gcloud services enable container.googleapis.com
gcloud services enable batch.googleapis.com
gcloud services enable aiplatform.googleapis.com
gcloud services enable storage.googleapis.com
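
`gcloud services enable` accepts several service names in one invocation, so the five calls above can be collapsed into one. The sketch below only builds that command as an argument list; the helper name `build_enable_command` is hypothetical, and nothing runs until you pass the list to `subprocess.run` yourself.

```python
import shlex

# The five APIs this tutorial enables
REQUIRED_APIS = [
    "compute.googleapis.com",
    "container.googleapis.com",
    "batch.googleapis.com",
    "aiplatform.googleapis.com",
    "storage.googleapis.com",
]

def build_enable_command(project_id, apis=REQUIRED_APIS):
    """Return the argv list for a single `gcloud services enable` call."""
    if not project_id:
        raise ValueError("project_id must be a non-empty string")
    return ["gcloud", "services", "enable", *apis, "--project", project_id]

cmd = build_enable_command("my-clustrix-project-123")
print(shlex.join(cmd))  # copy-pasteable shell form of the same command
```

Building the command as a list avoids shell quoting problems if you later run it with `subprocess.run(cmd, check=True)`.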

Prerequisites Checklist¶

Before proceeding, ensure you have:

☐ Google Cloud account with billing enabled
☐ Google Cloud project created
☐ Google Cloud SDK (gcloud) installed locally
☐ Required APIs enabled (compute, container, batch, storage, aiplatform)
☐ SSH key pair for VM access (we’ll create this below)
☐ Basic understanding of command line usage
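
The items in this checklist that live on your machine can be checked from Python. The following is a minimal sketch, not part of Clustrix: `check_prerequisites` is a hypothetical helper, and the key path is the one this tutorial uses in Step 4, so the SSH entries will report as missing until that step is done.

```python
import os
import shutil

def check_prerequisites(key_path="~/.ssh/gcp_key", which=shutil.which):
    """Return a dict of local prerequisite checks (True means satisfied)."""
    key_path = os.path.expanduser(key_path)
    return {
        "gcloud_installed": which("gcloud") is not None,
        "ssh_private_key": os.path.isfile(key_path),
        "ssh_public_key": os.path.isfile(key_path + ".pub"),
    }

for name, ok in check_prerequisites().items():
    print(f"{'OK     ' if ok else 'MISSING'}  {name}")
```

Billing, project creation, and API enablement are server-side state and are not covered by this check.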

Installation and Setup¶

Set up SSH access first (Step 4 below), then install Clustrix with its GCP dependencies in the notebook cell that follows.

Step 4: SSH Key Setup¶

Create SSH keys for secure access to your GCP instances:

# Generate SSH key pair (if you don't have one)
ssh-keygen -t rsa -b 4096 -C "your-email@example.com" -f ~/.ssh/gcp_key

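# Add the public key to GCP (OS Login).
# $HOME is used instead of ~ because bash does not expand a tilde that follows "=" inside an option.
gcloud compute os-login ssh-keys add --key-file="$HOME/.ssh/gcp_key.pub"

# Or add to project metadata (alternative method).
# Metadata entries use the form USERNAME:PUBLIC_KEY, so prefix the key with the login name first.
echo "clustrix:$(cat "$HOME/.ssh/gcp_key.pub")" > "$HOME/.ssh/gcp_key.metadata"
gcloud compute project-info add-metadata --metadata-from-file ssh-keys="$HOME/.ssh/gcp_key.metadata"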

Note: If you’re using Google Cloud Shell, SSH keys are automatically managed.
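
OpenSSH refuses to use a private key that is readable by group or other users, which is a common cause of failed connections after copying keys between machines. The sketch below checks the permission bits on a POSIX system; `private_key_permissions_ok` is a hypothetical helper name, and the demonstration uses a throwaway file rather than a real key.

```python
import os
import stat
import tempfile

def private_key_permissions_ok(path):
    """True if the file grants no access to group or others (e.g. mode 0600)."""
    mode = stat.S_IMODE(os.stat(os.path.expanduser(path)).st_mode)
    return (mode & (stat.S_IRWXG | stat.S_IRWXO)) == 0

# Demonstration on a temporary file (POSIX permission semantics assumed)
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    demo_key = tmp.name
os.chmod(demo_key, 0o644)
too_open = private_key_permissions_ok(demo_key)
os.chmod(demo_key, 0o600)
locked_down = private_key_permissions_ok(demo_key)
os.remove(demo_key)
print(too_open, locked_down)
```

For the key created above, the call would be `private_key_permissions_ok("~/.ssh/gcp_key")`; if it returns False, `chmod 600 ~/.ssh/gcp_key` fixes it.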

[ ]:

# Install Clustrix with GCP support
!pip install clustrix google-cloud-compute google-cloud-storage google-auth google-auth-oauthlib

# Import required libraries
import clustrix
from clustrix import cluster, configure
from google.cloud import compute_v1
from google.cloud import storage
from google.auth import default
import os
import numpy as np
import time
import json

GCP Authentication Setup¶

Configure your GCP credentials. Choose the method that best fits your environment:

Option 1: gcloud CLI Authentication (Recommended for Local Development)¶

This method uses your personal Google account credentials:

[ ]:

# Initial authentication and project setup
!gcloud auth login
!gcloud auth application-default login

# Set your project ID (replace with your actual project ID)
PROJECT_ID = "your-project-id-here"  # Replace this!
!gcloud config set project {PROJECT_ID}

# Verify authentication and project setup
!gcloud auth list
!gcloud config get-value project
!gcloud projects describe {PROJECT_ID}

Option 2: Service Account Authentication (Recommended for Production)¶

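For production environments, use a service account with narrowly scoped permissions; the commands for creating one appear under "Service Account Setup" below. Whichever option you choose, the following cell checks that your credentials resolve and that the API clients can be constructed: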

[ ]:

# Test GCP connection
try:
    credentials, project_id = default()
    print(f"✓ Successfully authenticated with project: {project_id}")

    # Create a Compute Engine client (no API request is made at this point)
    compute_client = compute_v1.InstancesClient()
    print("✓ Compute Engine client created")

    # Create a Cloud Storage client (no API request is made at this point)
    storage_client = storage.Client()
    print("✓ Cloud Storage client created")

except Exception as e:
    print(f"❌ GCP authentication failed: {e}")
    print("Please check your authentication setup and try again.")

Service Account Setup (Production Environments)

For production use, create a service account with specific permissions:

# Create service account
gcloud iam service-accounts create clustrix-service-account \
  --description="Service account for Clustrix operations" \
  --display-name="Clustrix Service Account"

# Grant necessary permissions
gcloud projects add-iam-policy-binding YOUR_PROJECT_ID \
  --member="serviceAccount:clustrix-service-account@YOUR_PROJECT_ID.iam.gserviceaccount.com" \
  --role="roles/compute.admin"

gcloud projects add-iam-policy-binding YOUR_PROJECT_ID \
  --member="serviceAccount:clustrix-service-account@YOUR_PROJECT_ID.iam.gserviceaccount.com" \
  --role="roles/storage.admin"

# Create and download service account key
gcloud iam service-accounts keys create ~/clustrix-service-account-key.json \
  --iam-account=clustrix-service-account@YOUR_PROJECT_ID.iam.gserviceaccount.com

# Point Application Default Credentials at the key file created above
export GOOGLE_APPLICATION_CREDENTIALS="$HOME/clustrix-service-account-key.json"

Important: Make sure you have completed authentication setup and enabled all required APIs before proceeding.

If authentication fails, double-check that:

Your project ID is correct
Billing is enabled for your project
Required APIs are enabled
Your credentials are properly configured
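
When service account authentication fails, one quick thing to rule out is a malformed or wrong key file. The sketch below checks that a file parses as JSON and carries the fields a service account key normally contains; `validate_service_account_key` is a hypothetical helper, and the demonstration writes a dummy file with placeholder values rather than a real key.

```python
import json
import os
import tempfile

REQUIRED_FIELDS = {"type", "project_id", "private_key", "client_email"}

def validate_service_account_key(path):
    """Return project and account info if the file looks like a service account key."""
    with open(path, "r", encoding="utf-8") as f:
        info = json.load(f)
    missing = REQUIRED_FIELDS - info.keys()
    if missing:
        raise ValueError(f"key file is missing fields: {sorted(missing)}")
    if info["type"] != "service_account":
        raise ValueError(f"unexpected credential type: {info['type']!r}")
    return {"project_id": info["project_id"], "client_email": info["client_email"]}

# Demonstration with placeholder values (not a usable credential)
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "key.json")
    with open(demo_path, "w", encoding="utf-8") as f:
        json.dump({
            "type": "service_account",
            "project_id": "my-clustrix-project-123",
            "private_key": "PLACEHOLDER",
            "client_email": "clustrix-service-account@my-clustrix-project-123.iam.gserviceaccount.com",
        }, f)
    summary = validate_service_account_key(demo_path)
print(summary)
```

This only inspects the file's structure. It does not contact Google Cloud, so a revoked or expired key will still pass.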

Method 1: Google Compute Engine Configuration¶

Create Compute Engine Instance for Clustrix¶

[ ]:

def create_clustrix_compute_instance(project_id, zone='us-central1-a', machine_type='e2-standard-4'):
    """
    Create a GCP Compute Engine instance configured for Clustrix.

    Args:
        project_id: GCP project ID
        zone: GCP zone for the instance
        machine_type: Machine type (CPU/memory configuration)

    Returns:
        Instance configuration and gcloud commands
    """

    # Startup script for instance initialization
    startup_script = '''#!/bin/bash

# Update system
apt-get update
apt-get install -y python3 python3-pip git htop curl

# Install clustrix and common packages
pip3 install clustrix numpy scipy pandas scikit-learn matplotlib

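# Install uv for faster package management
curl -LsSf https://astral.sh/uv/install.sh | sh
# The install location has differed between uv releases, so add both common defaults to PATH
export PATH="$HOME/.local/bin:$HOME/.cargo/bin:$PATH"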

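# Create clustrix user (startup scripts run on every boot, so keep these steps idempotent)
id clustrix >/dev/null 2>&1 || useradd -m -s /bin/bash clustrix
usermod -aG sudo clustrix
echo "clustrix ALL=(ALL) NOPASSWD:ALL" > /etc/sudoers.d/clustrix
chmod 440 /etc/sudoers.d/clustrix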

# Setup SSH for clustrix user
mkdir -p /home/clustrix/.ssh
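# Copy SSH keys from the default user, if one can be determined.
# Startup scripts run as root without a login session, so logname may return nothing;
# in that case add your public key to /home/clustrix/.ssh/authorized_keys manually.
DEFAULT_USER="$(logname 2>/dev/null || true)"
if [ -n "$DEFAULT_USER" ] && [ -d "/home/$DEFAULT_USER/.ssh" ]; then
    cp -r "/home/$DEFAULT_USER/.ssh/." /home/clustrix/.ssh/
    chmod 600 /home/clustrix/.ssh/authorized_keys 2>/dev/null || true
fi
chown -R clustrix:clustrix /home/clustrix/.ssh
chmod 700 /home/clustrix/.ssh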

# Create working directory
mkdir -p /tmp/clustrix
chown clustrix:clustrix /tmp/clustrix

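# Install Google Cloud SDK non-interactively, unless gcloud is already present on the image.
# (The original `exec -l $SHELL` replaced the script's shell, so the log line below never ran.)
command -v gcloud >/dev/null 2>&1 || curl -sSL https://sdk.cloud.google.com | bash -s -- --disable-prompts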

# Log completion
echo "Clustrix setup completed at $(date)" >> /var/log/clustrix-setup.log
'''

    # gcloud commands for instance creation
    gcloud_commands = f"""
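# Create firewall rule for SSH (if not exists).
# 0.0.0.0/0 opens port 22 to the whole internet; for anything beyond a quick test, use YOUR_IP/32 instead.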
gcloud compute firewall-rules create allow-ssh \
  --allow tcp:22 \
  --source-ranges 0.0.0.0/0 \
  --description "Allow SSH access" \
  --project {project_id} || echo "SSH rule already exists"

# Create the instance
gcloud compute instances create clustrix-instance \
  --project={project_id} \
  --zone={zone} \
  --machine-type={machine_type} \
  --network-interface=network-tier=PREMIUM,subnet=default \
  --maintenance-policy=MIGRATE \
  --provisioning-model=STANDARD \
  --service-account=default \
  --scopes=https://www.googleapis.com/auth/cloud-platform \
  --tags=clustrix,http-server,https-server \
  --create-disk=auto-delete=yes,boot=yes,device-name=clustrix-instance,image=projects/ubuntu-os-cloud/global/images/family/ubuntu-2204-lts,mode=rw,size=50,type=projects/{project_id}/zones/{zone}/diskTypes/pd-balanced \
  --no-shielded-secure-boot \
  --shielded-vtpm \
  --shielded-integrity-monitoring \
  --labels=purpose=clustrix,environment=tutorial \
  --reservation-affinity=any \
  --metadata-from-file startup-script=startup-script.sh

# Get the external IP
gcloud compute instances describe clustrix-instance \
  --project={project_id} \
  --zone={zone} \
  --format='get(networkInterfaces[0].accessConfigs[0].natIP)'

# SSH to the instance (after startup script completes)
gcloud compute ssh clustrix-instance \
  --project={project_id} \
  --zone={zone}
"""

    return {
        'project_id': project_id,
        'zone': zone,
        'machine_type': machine_type,
        'instance_name': 'clustrix-instance',
        'gcloud_commands': gcloud_commands,
        'startup_script': startup_script
    }

# Example usage - replace with your actual project ID
instance_config = create_clustrix_compute_instance(
    project_id=PROJECT_ID,  # Using the PROJECT_ID variable from above
    zone='us-central1-a',
    machine_type='e2-standard-4'  # 4 vCPUs, 16 GB RAM
)

# Display the configuration results
print("=== GCP Compute Engine Instance Configuration ===")
print(f"Project ID: {instance_config['project_id']}")
print(f"Zone: {instance_config['zone']}")
print(f"Machine Type: {instance_config['machine_type']}")
print(f"Instance Name: {instance_config['instance_name']}")
print("\n=== gcloud Commands ===")
print(instance_config['gcloud_commands'])
print("\n=== Next Steps ===")
print("1. Save the startup script to 'startup-script.sh'")
print("2. Execute the gcloud commands shown above")
print("3. Wait 3-5 minutes for instance initialization")
print("4. Get the external IP and configure Clustrix")

GCP Compute Engine Instance Creation¶

The above code defines a function that generates the configuration for a GCP Compute Engine instance suited to Clustrix workloads. It does not create any cloud resources itself; it returns:

gcloud commands: Complete CLI commands to create the instance
startup script: Automated setup script that configures the instance

The configuration includes:

Ubuntu 22.04 LTS base image
Pre-installed Python packages and Clustrix
Clustrix user account with sudo privileges
SSH key setup and working directories
50GB balanced persistent disk
Appropriate firewall rules and metadata

Next Steps:

Save the startup script to a file named startup-script.sh in your current directory
Execute the gcloud commands shown above to create your instance
Wait for the instance to fully initialize (startup script takes 3-5 minutes)
Get the external IP using the describe command shown above
Test SSH access to ensure the instance is ready for Clustrix
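
The first of these steps can be done from the notebook. The following sketch writes the `startup_script` entry of the returned dictionary to disk; `save_startup_script` is a hypothetical helper, and the demonstration uses a stand-in dictionary and a temporary directory so it runs on its own. In the notebook you would pass `instance_config` and the default path instead.

```python
import os
import stat
import tempfile

def save_startup_script(config, path="startup-script.sh"):
    """Write config['startup_script'] to `path`, mark it executable, return the path."""
    # Strip leading whitespace so that the shebang ends up on the first line
    script = config["startup_script"].lstrip()
    with open(path, "w", encoding="utf-8", newline="\n") as f:
        f.write(script)
    os.chmod(path, os.stat(path).st_mode | stat.S_IXUSR)
    return path

# Stand-in for the dictionary returned by create_clustrix_compute_instance
demo_config = {
    "startup_script": "\n#!/bin/bash\necho hello\n",
    "gcloud_commands": "gcloud --version",
}
demo_path = save_startup_script(
    demo_config, os.path.join(tempfile.mkdtemp(), "startup-script.sh")
)
with open(demo_path, encoding="utf-8") as f:
    first_line = f.readline().rstrip("\n")
print(first_line)
print(demo_config["gcloud_commands"])
```

The file must sit in the directory where you run `gcloud compute instances create`, because the command refers to it by the relative name `startup-script.sh`.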

Configure Clustrix for Compute Engine¶

[ ]:

# Get the external IP of your created instance
# Replace with the actual external IP from your instance
INSTANCE_EXTERNAL_IP = "YOUR_INSTANCE_EXTERNAL_IP"  # Replace this!

# Configure Clustrix to use your Compute Engine instance
configure(
    cluster_type="ssh",
    cluster_host=INSTANCE_EXTERNAL_IP,
    username="clustrix",  # or your default user
    key_file="~/.ssh/gcp_key",  # path to your SSH private key
    remote_work_dir="/tmp/clustrix",
    package_manager="auto",  # Will use uv if available, pip otherwise
    default_cores=4,
    default_memory="8GB",
    default_time="01:00:00"
)

# Verify configuration
if INSTANCE_EXTERNAL_IP != "YOUR_INSTANCE_EXTERNAL_IP":
    print(f"✓ Clustrix configured for GCP Compute Engine")
    print(f"  Host: {INSTANCE_EXTERNAL_IP}")
    print(f"  SSH Key: ~/.ssh/gcp_key")
    print(f"  Remote Work Dir: /tmp/clustrix")
else:
    print("⚠️  Please replace INSTANCE_EXTERNAL_IP with your actual IP address")

Important Configuration Notes:

Replace YOUR_INSTANCE_EXTERNAL_IP with the actual external IP address from your Compute Engine instance
Use the SSH key path that corresponds to your setup (either ~/.ssh/gcp_key if you created one following this tutorial, or ~/.ssh/google_compute_engine for gcloud-generated keys)
The clustrix user was created by the startup script with appropriate permissions
If you encounter connection issues, ensure your firewall rules allow SSH access from your IP address
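
Before calling `configure`, it can help to confirm that the address is a real IP and that port 22 answers. This is a minimal sketch using only the standard library; both function names are hypothetical, and `203.0.113.10` is a documentation-range address used purely as an example.

```python
import ipaddress
import socket

def is_valid_ip(value):
    """True if `value` parses as an IPv4 or IPv6 address."""
    try:
        ipaddress.ip_address(value)
        return True
    except ValueError:
        return False

def ssh_port_open(host, port=22, timeout=5.0):
    """True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

placeholder_ok = is_valid_ip("YOUR_INSTANCE_EXTERNAL_IP")
example_ok = is_valid_ip("203.0.113.10")
print(placeholder_ok, example_ok)
```

A False result from `ssh_port_open` usually points to a firewall rule or to an instance that is still booting, while a True result only shows that the port accepts connections, not that your key is authorized.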

Example: Remote Computation on Compute Engine¶

[ ]:

# Example: GCP Data Analysis
@cluster(cores=2, memory="4GB")
def gcp_data_analysis(dataset_size=10000, analysis_type='regression'):
    """Perform data analysis on GCP Compute Engine."""
    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
    from sklearn.metrics import mean_squared_error, accuracy_score
    from sklearn.datasets import make_regression, make_classification
    import time

    start_time = time.time()

    # Generate synthetic dataset
    if analysis_type == 'regression':
        X, y = make_regression(
            n_samples=dataset_size,
            n_features=20,
            noise=0.1,
            random_state=42
        )
        model = RandomForestRegressor(n_estimators=100, random_state=42, n_jobs=-1)
        metric_name = 'rmse'
    else:
        X, y = make_classification(
            n_samples=dataset_size,
            n_features=20,
            n_informative=5,  # must satisfy n_classes * n_clusters_per_class <= 2**n_informative
            n_classes=3,
            random_state=42
        )
        model = RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1)
        metric_name = 'accuracy'

    # Split data
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42
    )

    # Train model
    training_start = time.time()
    model.fit(X_train, y_train)
    training_time = time.time() - training_start

    # Evaluate
    y_pred = model.predict(X_test)

    if analysis_type == 'regression':
        metric_value = np.sqrt(mean_squared_error(y_test, y_pred))
    else:
        metric_value = accuracy_score(y_test, y_pred)

    total_time = time.time() - start_time

    return {
        'analysis_type': analysis_type,
        'dataset_size': dataset_size,
        'training_time': training_time,
        'total_time': total_time,
        metric_name: metric_value,
        'feature_importance': model.feature_importances_[:5].tolist(),  # First 5 features (not ranked)
        'training_samples': len(X_train),
        'test_samples': len(X_test)
    }

# Example: Parallel Computation
@cluster(cores=4, memory="8GB")
def gcp_parallel_computation(n_iterations=1000):
    """Basic parallel computation example."""
    import numpy as np
    import time

    start_time = time.time()

    # Simulate CPU-intensive work
    results = []
    for i in range(n_iterations):
        # Monte Carlo pi estimation
        points = np.random.random((1000, 2))
        inside_circle = np.sum((points**2).sum(axis=1) <= 1)
        pi_estimate = 4 * inside_circle / 1000
        results.append(pi_estimate)

    computation_time = time.time() - start_time
    final_pi_estimate = np.mean(results)

    return {
        'iterations': n_iterations,
        'pi_estimate': final_pi_estimate,
        'computation_time': computation_time,
        'accuracy': abs(final_pi_estimate - np.pi)  # absolute error; lower is better
    }

print("✓ GCP computation examples defined")
print("\n📝 Example usage:")
print("# Data analysis:")
print("# result = gcp_data_analysis(dataset_size=50000, analysis_type='classification')")
print("# print(f'Accuracy: {result[\"accuracy\"]:.4f}')")
print("#")
print("# Parallel computation:")
print("# result = gcp_parallel_computation(n_iterations=5000)")
print("# print(f'Pi estimate: {result[\"pi_estimate\"]:.6f}')")

# Example execution (commented out - uncomment after setup):
# result = gcp_data_analysis(dataset_size=5000, analysis_type='classification')
# print(f"✓ Analysis completed: {result['accuracy']:.4f} accuracy")
# print(f"⏱️  Training time: {result['training_time']:.2f} seconds")
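
The Monte Carlo step in `gcp_parallel_computation` relies on the fact that a uniformly random point in the unit square lands inside the quarter circle with probability π/4. The sketch below reproduces that estimator locally with only the standard library, which is handy for sanity-checking results before sending work to a remote instance. `estimate_pi` is a hypothetical name, and the fixed seed is an assumption added for reproducibility.

```python
import math
import random

def estimate_pi(n_points, seed=0):
    """Estimate pi from the fraction of random points inside the quarter circle."""
    rng = random.Random(seed)
    inside = 0
    for _ in range(n_points):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            inside += 1
    return 4.0 * inside / n_points

estimate = estimate_pi(100_000)
print(f"estimate={estimate:.4f}, abs error={abs(estimate - math.pi):.4f}")
```

The standard error of this estimator is roughly 1.64 / sqrt(N), so each extra decimal digit of accuracy costs about 100 times more samples.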

Method 2: Google Kubernetes Engine (GKE) Configuration¶

GKE provides managed Kubernetes clusters ideal for containerized Clustrix workloads:

[ ]:

def setup_gke_cluster_for_clustrix(project_id, cluster_name='clustrix-cluster', zone='us-central1-a'):
    """
    Setup GKE cluster optimized for Clustrix workloads.
    """

    gke_commands = f"""
# Enable required APIs
gcloud services enable container.googleapis.com \
  --project {project_id}

# Create GKE cluster with auto-scaling
gcloud container clusters create {cluster_name} \
  --project {project_id} \
  --zone {zone} \
  --machine-type e2-standard-4 \
  --num-nodes 1 \
  --enable-autoscaling \
  --min-nodes 0 \
  --max-nodes 10 \
  --enable-autorepair \
  --enable-autoupgrade \
  --disk-size 50GB \
  --disk-type pd-ssd \
  --enable-network-policy \
  --enable-ip-alias \
  --labels purpose=clustrix,environment=tutorial

# Get cluster credentials
gcloud container clusters get-credentials {cluster_name} \
  --project {project_id} \
  --zone {zone}

# Verify cluster access
kubectl get nodes

# Create clustrix namespace
kubectl create namespace clustrix

# Set as default namespace
kubectl config set-context --current --namespace=clustrix
"""

    # Clustrix job template for Kubernetes
    k8s_job_template = """
apiVersion: batch/v1
kind: Job
metadata:
  name: clustrix-job-${JOB_ID}
  namespace: clustrix
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: clustrix-worker
        image: python:3.11-slim
        command: ["bash", "-c"]
        args:
        - |
          pip install clustrix numpy scipy pandas scikit-learn
          python -c "
          import pickle
          import sys

          # Load and execute function
          with open('/data/function_data.pkl', 'rb') as f:
              data = pickle.load(f)

          func = pickle.loads(data['function'])
          args = pickle.loads(data['args'])
          kwargs = pickle.loads(data['kwargs'])

          try:
              result = func(*args, **kwargs)
              with open('/data/result.pkl', 'wb') as f:
                  pickle.dump(result, f)
          except Exception as e:
              with open('/data/error.pkl', 'wb') as f:
                  pickle.dump({'error': str(e)}, f)
              raise
          "
        resources:
          requests:
            memory: "2Gi"
            cpu: "1"
          limits:
            memory: "4Gi"
            cpu: "2"
        volumeMounts:
        - name: job-data
          mountPath: /data
      volumes:
      - name: job-data
        persistentVolumeClaim:
          claimName: clustrix-pvc
  backoffLimit: 3
"""

    return {
        'cluster_name': cluster_name,
        'project_id': project_id,
        'zone': zone,
        'setup_commands': gke_commands,
        'job_template': k8s_job_template
    }

def configure_clustrix_for_gke(cluster_endpoint, cluster_name):
    """Configure Clustrix to use GKE cluster."""
    configure(
        cluster_type="kubernetes",
        cluster_host=cluster_endpoint,
        # For GKE, authentication is handled via kubectl config
        remote_work_dir="/tmp/clustrix",
        package_manager="pip",  # Container-based, pip is fine
        default_cores=2,
        default_memory="4GB",
        default_time="01:00:00"
    )
    print(f"✓ Configured Clustrix for GKE cluster: {cluster_name}")

# Create GKE configuration
gke_config = setup_gke_cluster_for_clustrix(
    project_id=PROJECT_ID,
    cluster_name='clustrix-cluster'
)

print("=== GKE Cluster Setup Commands ===")
print(gke_config['setup_commands'])
print("\n=== Kubernetes Job Template ===")
print(gke_config['job_template'])
print("\n📝 Note: GKE integration requires additional implementation in Clustrix.")
print("Current Clustrix supports basic Kubernetes, but GKE-specific features need custom setup.")
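
The job template uses a `${JOB_ID}` placeholder, which happens to match the syntax of Python's `string.Template`. The sketch below fills it in and applies a conservative name check (lowercase DNS label, at most 63 characters). `render_job_manifest` is a hypothetical helper, and the short `demo_template` stands in for the full `k8s_job_template` string.

```python
import re
from string import Template

# Conservative check: lowercase alphanumerics and hyphens, no leading/trailing hyphen
DNS_LABEL = re.compile(r"^[a-z0-9]([-a-z0-9]*[a-z0-9])?$")

def render_job_manifest(template_text, job_id):
    """Substitute ${JOB_ID} and reject IDs that would give an invalid job name."""
    name = f"clustrix-job-{job_id}"
    if len(name) > 63 or not DNS_LABEL.match(name):
        raise ValueError(f"job name {name!r} is not a valid DNS label")
    return Template(template_text).substitute(JOB_ID=job_id)

demo_template = "metadata:\n  name: clustrix-job-${JOB_ID}\n  namespace: clustrix\n"
manifest = render_job_manifest(demo_template, "a1b2c3")
print(manifest)
```

The rendered text can then be written to a file and applied with `kubectl apply -f`. Note that the template mounts a claim named `clustrix-pvc`, which must already exist in the `clustrix` namespace.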

Method 3: Google Cloud Batch¶

Google Cloud Batch provides managed job scheduling for large-scale workloads:

[ ]:

def setup_gcp_batch_environment(project_id, region='us-central1'):
    """
    Setup Google Cloud Batch for Clustrix workloads.
    """

    batch_setup_commands = f"""
# Enable Batch API
gcloud services enable batch.googleapis.com \
  --project {project_id}

# Create a service account for Batch jobs
gcloud iam service-accounts create clustrix-batch-sa \
  --project {project_id} \
  --description="Service account for Clustrix Batch jobs" \
  --display-name="Clustrix Batch Service Account"

# Grant necessary permissions
gcloud projects add-iam-policy-binding {project_id} \
  --member="serviceAccount:clustrix-batch-sa@{project_id}.iam.gserviceaccount.com" \
  --role="roles/batch.jobsEditor"

gcloud projects add-iam-policy-binding {project_id} \
  --member="serviceAccount:clustrix-batch-sa@{project_id}.iam.gserviceaccount.com" \
  --role="roles/storage.objectAdmin"

# Create Cloud Storage bucket for job data
gsutil mb -p {project_id} -l {region} gs://{project_id}-clustrix-batch
"""

    # Batch job configuration template
    batch_job_config = {
        "taskGroups": [
            {
                "taskSpec": {
                    "runnables": [
                        {
                            "script": {
                                "text": f"""#!/bin/bash
set -e

# Install required packages
pip3 install clustrix numpy scipy pandas scikit-learn

# Download job data from Cloud Storage
gsutil cp gs://{project_id}-clustrix-batch/jobs/${{BATCH_JOB_ID}}/function_data.pkl .

# Execute the function
python3 -c "
import pickle
import traceback

try:
    with open('function_data.pkl', 'rb') as f:
        data = pickle.load(f)

    func = pickle.loads(data['function'])
    args = pickle.loads(data['args'])
    kwargs = pickle.loads(data['kwargs'])

    result = func(*args, **kwargs)

    with open('result.pkl', 'wb') as f:
        pickle.dump(result, f)

except Exception as e:
    with open('error.pkl', 'wb') as f:
        pickle.dump({{
            'error': str(e),
            'traceback': traceback.format_exc()
        }}, f)
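    raise
" || status=$?

# Upload the result if the function succeeded, otherwise the error report.
# The Python exit status is kept so that a failed task is still reported as failed.
if [ -f result.pkl ]; then
    gsutil cp result.pkl gs://{project_id}-clustrix-batch/jobs/${{BATCH_JOB_ID}}/result.pkl
else
    gsutil cp error.pkl gs://{project_id}-clustrix-batch/jobs/${{BATCH_JOB_ID}}/error.pkl
fi
exit ${{status:-0}}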
"""
                            }
                        }
                    ],
                    "computeResource": {
                        "cpuMilli": 2000,  # 2 CPUs
                        "memoryMib": 4096  # 4 GB RAM
                    },
                    "maxRetryCount": 2,
                    "maxRunDuration": "3600s"  # 1 hour
                },
                "taskCount": 1
            }
        ],
        "allocationPolicy": {
            "instances": [
                {
                    "policy": {  # inline instance policy; "instanceTemplate" would instead take the name of an existing template
                        "machineType": "e2-standard-2",
                        "provisioningModel": "STANDARD"
                    }
                }
            ]
        },
        "labels": {
            "purpose": "clustrix",
            "environment": "tutorial"
        },
        "logsPolicy": {
            "destination": "CLOUD_LOGGING"
        }
    }

    return {
        'project_id': project_id,
        'region': region,
        'bucket_name': f'{project_id}-clustrix-batch',
        'service_account': f'clustrix-batch-sa@{project_id}.iam.gserviceaccount.com',
        'job_config': batch_job_config,
        'setup_commands': batch_setup_commands
    }

# Create Batch configuration
batch_config = setup_gcp_batch_environment(PROJECT_ID)

print("=== Google Cloud Batch Setup Commands ===")
print(batch_config['setup_commands'])
print("\n=== Batch Job Configuration ===")
print(json.dumps(batch_config['job_config'], indent=2))
print("\n💡 Google Cloud Batch provides excellent integration for large-scale Clustrix workloads.")
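
To submit a job, the configuration needs to be on disk as JSON and passed to `gcloud batch jobs submit`. The following is a sketch of that hand-off; `prepare_batch_submission` is a hypothetical helper, the job name is arbitrary, and the small `demo_job_config` stands in for `batch_config['job_config']`. The command is only built here, not executed.

```python
import json
import os
import tempfile

def prepare_batch_submission(job_config, job_name, region, project_id, directory="."):
    """Write the job config as JSON and return (config_path, gcloud argv list)."""
    config_path = os.path.join(directory, f"{job_name}.json")
    with open(config_path, "w", encoding="utf-8") as f:
        json.dump(job_config, f, indent=2)
    command = [
        "gcloud", "batch", "jobs", "submit", job_name,
        "--location", region,
        "--project", project_id,
        "--config", config_path,
    ]
    return config_path, command

demo_job_config = {"taskGroups": [{"taskCount": 1}]}  # stand-in, not a complete job
demo_dir = tempfile.mkdtemp()
config_path, command = prepare_batch_submission(
    demo_job_config, "clustrix-demo-job", "us-central1", "my-clustrix-project-123", demo_dir
)
print(" ".join(command))
```

Passing `command` to `subprocess.run(command, check=True)` would perform the actual submission once the setup commands above have been run.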

Data Management with Google Cloud Storage¶

[ ]:

@cluster(cores=2, memory="4GB")
def process_gcs_data(bucket_name, input_blob, output_blob, project_id=None):
    """Process data from Google Cloud Storage and save results back."""
    from google.cloud import storage
    import numpy as np
    import pickle
    import io
    import time

    # Initialize Cloud Storage client
    storage_client = storage.Client(project=project_id)
    bucket = storage_client.bucket(bucket_name)

    # Download data from Cloud Storage
    input_blob_obj = bucket.blob(input_blob)
    data_bytes = input_blob_obj.download_as_bytes()
    data = pickle.loads(data_bytes)

    # Process the data
    processed_data = {
        'original_shape': data.shape if hasattr(data, 'shape') else len(data) if hasattr(data, '__len__') else 'scalar',
        'mean': float(np.mean(data)) if hasattr(data, '__iter__') else float(data),
        'std': float(np.std(data)) if hasattr(data, '__iter__') else 0.0,
        'max': float(np.max(data)) if hasattr(data, '__iter__') else float(data),
        'min': float(np.min(data)) if hasattr(data, '__iter__') else float(data),
        'processing_timestamp': time.time(),
        'processed_on': 'gcp-compute-engine',
        'data_type': str(type(data).__name__)
    }

    # Advanced processing based on data type
    if hasattr(data, 'shape') and len(data.shape) == 2:
        # Matrix operations (the Frobenius norm below is defined for 2-D arrays only)
        processed_data.update({
            'matrix_rank': int(np.linalg.matrix_rank(data)) if data.shape[0] == data.shape[1] else 'non_square',
            'frobenius_norm': float(np.linalg.norm(data, 'fro')),
            'condition_number': float(np.linalg.cond(data)) if data.shape[0] == data.shape[1] else None
        })

    # Upload results to Cloud Storage
    output_bytes = pickle.dumps(processed_data)
    output_blob_obj = bucket.blob(output_blob)
    output_blob_obj.upload_from_string(output_bytes)

    return f"Processed data saved to gs://{bucket_name}/{output_blob}"

# Utility functions for Google Cloud Storage
import pickle  # needed by the helpers below; it is not imported in the setup cell
def upload_to_gcs(data, bucket_name, blob_name, project_id=None):
    """Upload data to Google Cloud Storage."""
    storage_client = storage.Client(project=project_id)
    bucket = storage_client.bucket(bucket_name)
    blob = bucket.blob(blob_name)

    data_bytes = pickle.dumps(data)
    blob.upload_from_string(data_bytes)
    return f"gs://{bucket_name}/{blob_name}"

def download_from_gcs(bucket_name, blob_name, project_id=None):
    """Download data from Google Cloud Storage."""
    storage_client = storage.Client(project=project_id)
    bucket = storage_client.bucket(bucket_name)
    blob = bucket.blob(blob_name)

    data_bytes = blob.download_as_bytes()
    return pickle.loads(data_bytes)

def create_gcs_bucket_for_clustrix(project_id, bucket_name, location='us-central1'):
    """Create a Cloud Storage bucket for Clustrix data."""
    gcs_commands = f"""
# Create bucket with appropriate settings
gsutil mb -p {project_id} -l {location} gs://{bucket_name}

# Set lifecycle policy to delete temporary files after 7 days
echo '{{
  "lifecycle": {{
    "rule": [
      {{
        "action": {{"type": "Delete"}},
        "condition": {{
          "age": 7,
          "matchesPrefix": ["temp/"]
        }}
      }}
    ]
  }}
}}' > lifecycle.json

gsutil lifecycle set lifecycle.json gs://{bucket_name}

# Set up proper permissions (if using service account)
gsutil iam ch serviceAccount:clustrix-batch-sa@{project_id}.iam.gserviceaccount.com:objectAdmin gs://{bucket_name}
"""

    return gcs_commands

# Create bucket configuration
BUCKET_NAME = f"{PROJECT_ID}-clustrix-data"
bucket_commands = create_gcs_bucket_for_clustrix(PROJECT_ID, BUCKET_NAME)

print("=== Commands to create Cloud Storage bucket ===")
print(bucket_commands)

# Example usage (commented out - uncomment after creating bucket):
# sample_data = np.random.rand(1000, 100)
# upload_location = upload_to_gcs(sample_data, BUCKET_NAME, 'input/sample_data.pkl', PROJECT_ID)
# print(f"✓ Data uploaded to {upload_location}")
#
# result = process_gcs_data(BUCKET_NAME, 'input/sample_data.pkl', 'output/results.pkl', PROJECT_ID)
# print(f"✓ Processing completed: {result}")

print("\n✓ Google Cloud Storage integration functions defined.")
print("Execute the bucket creation commands above, then uncomment the example usage.")
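
Because `BUCKET_NAME` is derived from the project ID, it is worth checking that the result is an acceptable bucket name before running the `gsutil mb` command. The sketch below covers a subset of the Cloud Storage naming rules (3 to 63 characters; lowercase letters, digits, dashes, underscores, and dots; starts and ends with a letter or digit; no `goog` prefix or `google` substring). `is_plausible_bucket_name` is a hypothetical helper, and the longer dotted-name form is deliberately not handled.

```python
import re

# 3-63 characters, starting and ending with a lowercase letter or digit
_BUCKET_RE = re.compile(r"^[a-z0-9][a-z0-9._-]{1,61}[a-z0-9]$")

def is_plausible_bucket_name(name):
    """Check a bucket name against a subset of the Cloud Storage naming rules."""
    if not _BUCKET_RE.match(name):
        return False
    if name.startswith("goog") or "google" in name:
        return False
    return True

good = is_plausible_bucket_name("my-clustrix-project-123-clustrix-data")
bad = is_plausible_bucket_name("My_Bucket")
print(good, bad)
```

Bucket names are also globally unique, which no local check can verify, so the creation command can still fail if the name is already taken.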

Vertex AI Integration¶

[ ]:

def setup_vertex_ai_for_clustrix(project_id, region='us-central1'):
    """
    Setup Vertex AI for ML workloads with Clustrix.
    """

    vertex_commands = f"""
# Enable Vertex AI API
gcloud services enable aiplatform.googleapis.com \
  --project {project_id}

# Create Vertex AI custom training job (expects training_job_config.yaml, shown below, in the current directory)
gcloud ai custom-jobs create \
  --region={region} \
  --display-name=clustrix-training-job \
  --config=training_job_config.yaml

# Create Vertex AI endpoints for model serving
gcloud ai endpoints create \
  --region={region} \
  --display-name=clustrix-model-endpoint
"""

    # Vertex AI training job configuration
    training_config = f"""
# training_job_config.yaml
workerPoolSpecs:
- machineSpec:
    machineType: e2-standard-4
  replicaCount: 1
  containerSpec:
    imageUri: gcr.io/cloud-aiplatform/training/tf-cpu.2-8:latest
    command:
    - python3
    - -c
    args:
    - |
      import subprocess
      import sys

      # Install clustrix
      subprocess.check_call([sys.executable, '-m', 'pip', 'install', 'clustrix', 'numpy', 'pandas', 'scikit-learn'])

      # Your training code here
      print("Clustrix training job completed on Vertex AI")
    env:
    - name: GOOGLE_CLOUD_PROJECT
      value: {project_id}
    - name: AIP_MODEL_DIR
      value: gs://{project_id}-vertex-models
"""

    return {
        'project_id': project_id,
        'region': region,
        'setup_commands': vertex_commands,
        'training_config': training_config
    }

@cluster(cores=4, memory="8GB")
def vertex_ai_ml_pipeline(dataset_config, model_config, project_id, bucket_name):
    """ML pipeline that could run on Vertex AI with Clustrix.

    Note: model_config is accepted but not read yet; the hyperparameter
    grid below is hard-coded.
    """
    import numpy as np
    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.model_selection import cross_val_score, GridSearchCV
    from sklearn.datasets import make_classification
    from sklearn.metrics import classification_report
    from google.cloud import storage
    import pickle
    import time

    start_time = time.time()

    # Generate or load dataset
    X, y = make_classification(
        n_samples=dataset_config['n_samples'],
        n_features=dataset_config['n_features'],
        n_classes=dataset_config['n_classes'],
        n_informative=dataset_config.get('n_informative', dataset_config['n_features'] // 2),
        random_state=42
    )

    # Hyperparameter tuning
    param_grid = {
        'n_estimators': [50, 100, 200],
        'max_depth': [3, 5, 7],
        'learning_rate': [0.01, 0.1, 0.2]
    }

    # Grid search with cross-validation
    model = GradientBoostingClassifier(random_state=42)
    grid_search = GridSearchCV(
        model, param_grid, cv=5, scoring='accuracy', n_jobs=-1
    )

    grid_search.fit(X, y)

    # Get best model
    best_model = grid_search.best_estimator_

    # Evaluate with cross-validation
    cv_scores = cross_val_score(best_model, X, y, cv=5, scoring='accuracy')

    # Save model to Cloud Storage
    storage_client = storage.Client(project=project_id)
    bucket = storage_client.bucket(bucket_name)

    model_blob = bucket.blob('models/clustrix_model.pkl')
    model_bytes = pickle.dumps(best_model)
    model_blob.upload_from_string(model_bytes)

    total_time = time.time() - start_time

    return {
        'best_params': grid_search.best_params_,
        'best_score': grid_search.best_score_,
        'cv_mean_score': cv_scores.mean(),
        'cv_std_score': cv_scores.std(),
        'training_time': total_time,
        'model_location': f'gs://{bucket_name}/models/clustrix_model.pkl',
        'feature_importance': best_model.feature_importances_[:10].tolist(),  # First 10 features (unsorted)
        'dataset_size': len(X)
    }

# Setup Vertex AI configuration
vertex_config = setup_vertex_ai_for_clustrix(PROJECT_ID)

print("=== Vertex AI Setup Commands ===")
print(vertex_config['setup_commands'])
print("\n=== Training Job Configuration ===")
print(vertex_config['training_config'])

# Example usage (commented out):
# dataset_params = {'n_samples': 10000, 'n_features': 20, 'n_classes': 3}
# model_params = {}
# result = vertex_ai_ml_pipeline(dataset_params, model_params, PROJECT_ID, BUCKET_NAME)
# print(f"✓ Best model score: {result['best_score']:.4f}")
# print(f"✓ Model saved to: {result['model_location']}")

print("\n✓ Vertex AI integration examples defined.")
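
The grid above looks small, but the number of model fits grows multiplicatively, which matters when choosing `cores` and a time budget. The sketch below uses only the standard library to count the fits performed by `GridSearchCV` plus the follow-up `cross_val_score` call for this grid. The helper name `count_grid_search_fits` is illustrative and not part of Clustrix or scikit-learn, and it assumes the scikit-learn default of refitting the best model once on the full data.

```python
import math

def count_grid_search_fits(param_grid, cv_folds, refit=True, extra_cv_folds=0):
    """Count model fits for an exhaustive grid search (illustrative helper)."""
    # Every combination of hyperparameter values is one candidate
    n_candidates = math.prod(len(values) for values in param_grid.values())
    # Each candidate is trained once per cross-validation fold
    search_fits = n_candidates * cv_folds
    # Assumption: the best candidate is refit once on the full dataset
    refit_fits = 1 if refit else 0
    return {
        'candidates': n_candidates,
        'search_fits': search_fits,
        'total_fits': search_fits + refit_fits + extra_cv_folds,
    }

# Same grid as in vertex_ai_ml_pipeline
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [3, 5, 7],
    'learning_rate': [0.01, 0.1, 0.2],
}

# cv=5 for the search, plus 5 more fits from the later cross_val_score call
fit_counts = count_grid_search_fits(param_grid, cv_folds=5, extra_cv_folds=5)
print(fit_counts)
```

For this grid the count is 27 candidates and 135 cross-validation fits, so adding a fourth value to any one hyperparameter adds 45 fits to the search.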

Security Best Practices¶

def setup_gcp_security_for_clustrix(project_id):
    """
    Security configuration for GCP + Clustrix deployment.
    """

    security_commands = f"""
# Create VPC with private subnets
gcloud compute networks create clustrix-vpc \
  --project {project_id} \
  --subnet-mode custom

gcloud compute networks subnets create clustrix-subnet \
  --project {project_id} \
  --network clustrix-vpc \
  --range 10.1.0.0/24 \
  --region us-central1 \
  --enable-private-ip-google-access

# Create firewall rules (restrictive)
gcloud compute firewall-rules create clustrix-allow-ssh \
  --project {project_id} \
  --network clustrix-vpc \
  --allow tcp:22 \
  --source-ranges YOUR_IP/32 \
  --target-tags clustrix

gcloud compute firewall-rules create clustrix-internal \
  --project {project_id} \
  --network clustrix-vpc \
  --allow tcp,udp,icmp \
  --source-ranges 10.1.0.0/24 \
  --target-tags clustrix

# Create service account with minimal permissions
gcloud iam service-accounts create clustrix-compute \
  --project {project_id} \
  --description="Service account for Clustrix compute instances" \
  --display-name="Clustrix Compute Service Account"

# Grant only necessary permissions
gcloud projects add-iam-policy-binding {project_id} \
  --member="serviceAccount:clustrix-compute@{project_id}.iam.gserviceaccount.com" \
  --role="roles/storage.objectAdmin"

gcloud projects add-iam-policy-binding {project_id} \
  --member="serviceAccount:clustrix-compute@{project_id}.iam.gserviceaccount.com" \
  --role="roles/logging.logWriter"

# Enable OS Login for better SSH key management
gcloud compute project-info add-metadata \
  --project {project_id} \
  --metadata enable-oslogin=TRUE

# Create Cloud KMS key for encryption
gcloud kms keyrings create clustrix-keyring \
  --project {project_id} \
  --location global

gcloud kms keys create clustrix-key \
  --project {project_id} \
  --keyring clustrix-keyring \
  --location global \
  --purpose encryption
"""

    return {
        'project_id': project_id,
        'vpc_name': 'clustrix-vpc',
        'subnet_name': 'clustrix-subnet',
        'service_account': f'clustrix-compute@{project_id}.iam.gserviceaccount.com',
        'security_commands': security_commands
    }

# Generate security configuration
security_config = setup_gcp_security_for_clustrix(PROJECT_ID)

print("=== GCP Security Setup Commands ===")
print(security_config['security_commands'])
print(f"\n✓ Security configuration templates generated for project: {PROJECT_ID}")
print(f"✓ VPC: {security_config['vpc_name']}")
print(f"✓ Service Account: {security_config['service_account']}")
print("\n⚠️  Remember to replace 'YOUR_IP' with your actual IP address in the firewall rules!")
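
The generated firewall command is not usable until `YOUR_IP` is replaced. As a minimal sketch, the helper below validates an IPv4 address with the standard `ipaddress` module before substituting it, so a typo fails early instead of producing a broken `gcloud` command. The function name `fill_in_source_ip` is hypothetical and not part of Clustrix; `203.0.113.7` is a documentation-only example address.

```python
import ipaddress

def fill_in_source_ip(commands, ip, placeholder="YOUR_IP"):
    """Replace the placeholder with a validated single-host IPv4 address."""
    # IPv4Address raises a ValueError subclass on malformed input
    address = ipaddress.IPv4Address(ip)
    if address.is_unspecified:
        raise ValueError("0.0.0.0 would not restrict SSH access to one host")
    if placeholder not in commands:
        raise ValueError(f"placeholder {placeholder!r} not found in commands")
    return commands.replace(placeholder, str(address))

example_rule = "--allow tcp:22 --source-ranges YOUR_IP/32"
print(fill_in_source_ip(example_rule, "203.0.113.7"))
```

The same call can be applied to `security_config['security_commands']` before printing or saving the commands.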

GCP Security Checklist for Clustrix¶

✓ Authentication and Access

Use IAM service accounts with minimal permissions
Enable OS Login for centralized SSH key management
Create custom VPC with private subnets
Restrict firewall rules to specific IP ranges

✓ Infrastructure Security

Enable private Google access for instances without external IPs
Use Cloud KMS for encryption at rest
Enable audit logging and Security Command Center
Use Binary Authorization for container security

✓ Network Security

Implement VPC Service Controls for data perimeter
Enable DDoS protection and Cloud Armor
Use Secret Manager for sensitive configuration
Enable vulnerability scanning for container images

✓ Governance and Compliance

Set up budget alerts and billing account security
Use organization policies for governance
Regular security reviews and access audits
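
One checklist item, restricting firewall rules to specific IP ranges, is easy to audit automatically. The sketch below scans firewall rules that have already been parsed from JSON and reports any ingress rule open to the whole internet. It assumes the `name` and `sourceRanges` fields of the Compute Engine firewall resource, as returned by `gcloud compute firewall-rules list --format=json`. The sample data and the function name `find_open_firewall_rules` are made up for illustration.

```python
import json

def find_open_firewall_rules(rules):
    """Return names of rules whose source ranges include the whole internet."""
    open_ranges = {"0.0.0.0/0", "::/0"}
    return [
        rule["name"]
        for rule in rules
        # Rules without sourceRanges (for example egress rules) are skipped
        if open_ranges & set(rule.get("sourceRanges", []))
    ]

# Hypothetical sample; real input would come from the gcloud JSON output
sample_json = """
[
  {"name": "clustrix-allow-ssh", "sourceRanges": ["203.0.113.7/32"]},
  {"name": "clustrix-internal", "sourceRanges": ["10.1.0.0/24"]},
  {"name": "legacy-allow-all", "sourceRanges": ["0.0.0.0/0"]}
]
"""

print(find_open_firewall_rules(json.loads(sample_json)))
```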

Resource Cleanup¶

def cleanup_gcp_resources(project_id, zone='us-central1-a', region='us-central1'):
    """
    Clean up GCP resources to avoid ongoing charges.

    Args:
        project_id: GCP project ID
        zone: Zone where resources were created
        region: Region where resources were created
    """

    cleanup_commands = f"""
# List all compute instances
gcloud compute instances list --project {project_id}

# Delete specific instances
gcloud compute instances delete clustrix-instance \
  --project {project_id} \
  --zone {zone} \
  --quiet

# Delete managed instance groups
gcloud compute instance-groups managed delete clustrix-preemptible-group \
  --project {project_id} \
  --zone {zone} \
  --quiet

# Delete instance templates
gcloud compute instance-templates delete clustrix-preemptible-template \
  --project {project_id} \
  --quiet

# Delete GKE clusters
gcloud container clusters delete clustrix-cluster \
  --project {project_id} \
  --zone {zone} \
  --quiet

# Delete Cloud Storage buckets (BE CAREFUL - THIS DELETES ALL DATA)
gsutil -m rm -r gs://{project_id}-clustrix-batch
gsutil -m rm -r gs://{project_id}-vertex-models
gsutil -m rm -r gs://{project_id}-clustrix-data

# Delete firewall rules
gcloud compute firewall-rules delete clustrix-allow-ssh clustrix-internal \
  --project {project_id} \
  --quiet

# Delete VPC network
gcloud compute networks subnets delete clustrix-subnet \
  --project {project_id} \
  --region {region} \
  --quiet

gcloud compute networks delete clustrix-vpc \
  --project {project_id} \
  --quiet

# Delete service accounts
gcloud iam service-accounts delete clustrix-compute@{project_id}.iam.gserviceaccount.com \
  --project {project_id} \
  --quiet

gcloud iam service-accounts delete clustrix-batch-sa@{project_id}.iam.gserviceaccount.com \
  --project {project_id} \
  --quiet

# List remaining billable resources
echo "=== Remaining billable resources ==="
gcloud compute instances list --project {project_id}
gcloud compute disks list --project {project_id}
gcloud compute addresses list --project {project_id}
gcloud container clusters list --project {project_id}
"""

    return {
        'project_id': project_id,
        'zone': zone,
        'region': region,
        'cleanup_commands': cleanup_commands
    }

# Generate cleanup commands
cleanup_info = cleanup_gcp_resources(PROJECT_ID)

print(f"=== GCP Resource Cleanup Commands for Project: {PROJECT_ID} ===")
print(cleanup_info['cleanup_commands'])
print("\n⚠️  WARNING: Some commands will permanently delete resources and data!")
print("Review each resource before deleting and ensure you have backups if needed.")
print("\n💡 TIP: Use 'gcloud compute instances stop' instead of 'delete' to keep an instance while pausing its vCPU and memory charges; attached disks and reserved static IPs are still billed.")
print("\n✓ Cleanup commands generated. Always verify resources before deletion!")
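
Because the cleanup text mixes read-only `list` commands with irreversible deletions, it can help to separate the two before running anything. The sketch below splits a script into individual commands and flags the destructive ones. The helper names `split_commands` and `is_destructive` are illustrative, not part of Clustrix. Note that in a regular (non-raw) Python string a backslash followed by a newline is consumed, so the printed commands above already appear on one line each; the helper also joins explicit continuations in case the text comes from a file.

```python
def split_commands(script):
    """Return the shell commands in `script`, one string per command."""
    commands = []
    pending = ""
    for raw_line in script.splitlines():
        line = raw_line.strip()
        # Skip blank lines and comments unless we are inside a continuation
        if not pending and (not line or line.startswith("#")):
            continue
        if line.endswith("\\"):
            # Join backslash-continued lines into a single command
            pending += line[:-1].strip() + " "
            continue
        commands.append(" ".join((pending + line).split()))
        pending = ""
    if pending:
        commands.append(" ".join(pending.split()))
    return commands

def is_destructive(command):
    """Heuristic: gcloud '... delete ...' and 'gsutil ... rm ...' remove resources."""
    tokens = command.split()
    return "delete" in tokens or (tokens[:1] == ["gsutil"] and "rm" in tokens)

sample_script = """
# List all compute instances
gcloud compute instances list --project demo

gcloud compute instances delete clustrix-instance \\
  --project demo \\
  --quiet
gsutil -m rm -r gs://demo-clustrix-data
"""

for command in split_commands(sample_script):
    label = "DESTRUCTIVE" if is_destructive(command) else "read-only"
    print(f"[{label}] {command}")
```

Passing `cleanup_info['cleanup_commands']` instead of `sample_script` gives a reviewable list for the real project.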

Advanced Example: Distributed Scientific Computing¶

# Advanced Scientific Computing
@cluster(cores=4, memory="8GB", time="01:00:00")
def gcp_scientific_simulation(simulation_params, storage_config=None):
    """
    Distributed scientific simulation using GCP infrastructure.
    """
    import numpy as np
    from scipy.integrate import odeint
    import pickle
    import time
    import matplotlib
    matplotlib.use('Agg')  # Use non-interactive backend
    import matplotlib.pyplot as plt
    import io

    # Only import GCP storage if config provided
    if storage_config:
        from google.cloud import storage

    def lorenz_system(state, t, sigma, rho, beta):
        """Lorenz attractor differential equations."""
        x, y, z = state
        return [
            sigma * (y - x),
            x * (rho - z) - y,
            x * y - beta * z
        ]

    def simulate_lorenz(params, time_points):
        """Simulate Lorenz system with given parameters."""
        initial_state = [1.0, 1.0, 1.0]
        solution = odeint(
            lorenz_system, initial_state, time_points,
            args=(params['sigma'], params['rho'], params['beta'])
        )
        return solution

    start_time = time.time()

    # Parameter sweep
    parameter_sets = simulation_params['parameter_sets']
    time_points = np.linspace(0, simulation_params['max_time'], simulation_params['num_points'])

    results = []

    for i, params in enumerate(parameter_sets):
        # Run simulation
        solution = simulate_lorenz(params, time_points)

        # Analyze results
        x, y, z = solution[:, 0], solution[:, 1], solution[:, 2]

        analysis = {
            'params': params,
            'max_x': float(np.max(x)),
            'min_x': float(np.min(x)),
            'max_y': float(np.max(y)),
            'min_y': float(np.min(y)),
            'max_z': float(np.max(z)),
            'min_z': float(np.min(z)),
            'mean_energy': float(np.mean(x**2 + y**2 + z**2)),
            'final_state': [float(x[-1]), float(y[-1]), float(z[-1])],
            'std_x': float(np.std(x)),
            'std_y': float(np.std(y)),
            'std_z': float(np.std(z))
        }

        results.append(analysis)

        # Create visualization for first few parameter sets
        if i < 3:
            fig = plt.figure(figsize=(12, 4))

            # Time series
            plt.subplot(1, 3, 1)
            plt.plot(time_points, x, label='X', alpha=0.8)
            plt.plot(time_points, y, label='Y', alpha=0.8)
            plt.plot(time_points, z, label='Z', alpha=0.8)
            plt.xlabel('Time')
            plt.ylabel('State')
            plt.title(f'Lorenz System (σ={params["sigma"]}, ρ={params["rho"]}, β={params["beta"]})')
            plt.legend()
            plt.grid(True, alpha=0.3)

            # Phase space (X-Y)
            plt.subplot(1, 3, 2)
            plt.plot(x, y, alpha=0.7, linewidth=0.8)
            plt.xlabel('X')
            plt.ylabel('Y')
            plt.title('X-Y Phase Space')
            plt.grid(True, alpha=0.3)

            # Phase space (X-Z)
            plt.subplot(1, 3, 3)
            plt.plot(x, z, alpha=0.7, linewidth=0.8)
            plt.xlabel('X')
            plt.ylabel('Z')
            plt.title('X-Z Phase Space')
            plt.grid(True, alpha=0.3)

            plt.tight_layout()

            # Save plot to Cloud Storage if configured
            if storage_config:
                try:
                    img_buffer = io.BytesIO()
                    plt.savefig(img_buffer, format='png', dpi=150, bbox_inches='tight')
                    img_buffer.seek(0)

                    storage_client = storage.Client(project=storage_config['project_id'])
                    bucket = storage_client.bucket(storage_config['bucket_name'])

                    plot_blob = bucket.blob(f"plots/lorenz_simulation_{i}.png")
                    plot_blob.upload_from_string(img_buffer.getvalue(), content_type='image/png')
                except Exception as e:
                    print(f"Warning: Could not save plot to GCS: {e}")

            plt.close()

    computation_time = time.time() - start_time

    # Calculate summary statistics
    energies = [r['mean_energy'] for r in results]
    summary_stats = {
        'total_simulations': len(parameter_sets),
        'computation_time': computation_time,
        'average_energy': float(np.mean(energies)),
        'max_energy': max(energies),
        'min_energy': min(energies),
        'energy_std': float(np.std(energies)),
        'time_per_simulation': computation_time / len(parameter_sets)
    }

    # Save detailed results to Cloud Storage if configured
    if storage_config:
        try:
            storage_client = storage.Client(project=storage_config['project_id'])
            bucket = storage_client.bucket(storage_config['bucket_name'])

            results_blob = bucket.blob("results/simulation_results.pkl")
            results_data = {
                'simulation_params': simulation_params,
                'results': results,
                'summary_stats': summary_stats,
                'timestamp': time.time()
            }
            results_bytes = pickle.dumps(results_data)
            results_blob.upload_from_string(results_bytes)
        except Exception as e:
            print(f"Warning: Could not save results to GCS: {e}")

    return {
        'num_simulations': len(parameter_sets),
        'computation_time': computation_time,
        'summary_stats': summary_stats,
        'results_preview': results[:2],  # First 2 for brevity
        'storage_location': f"gs://{storage_config['bucket_name']}/results/" if storage_config else None,
        'plots_saved': (min(3, len(parameter_sets)) if storage_config else 0)  # Plots are uploaded only when storage is configured
    }

# Monte Carlo simulation example
@cluster(cores=2, memory="4GB")
def gcp_monte_carlo_simulation(n_samples=1000000):
    """Monte Carlo simulation for option pricing."""
    import numpy as np
    import time

    start_time = time.time()

    # Black-Scholes parameters
    S0 = 100    # Initial stock price
    K = 105     # Strike price
    T = 1.0     # Time to expiration
    r = 0.05    # Risk-free rate
    sigma = 0.2 # Volatility

    # Generate random samples
    np.random.seed(42)
    Z = np.random.standard_normal(n_samples)

    # Simulate stock prices at expiration
    ST = S0 * np.exp((r - 0.5 * sigma**2) * T + sigma * np.sqrt(T) * Z)

    # Calculate option payoffs
    call_payoffs = np.maximum(ST - K, 0)
    put_payoffs = np.maximum(K - ST, 0)

    # Discount to present value
    call_price = np.exp(-r * T) * np.mean(call_payoffs)
    put_price = np.exp(-r * T) * np.mean(put_payoffs)

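    # Standard errors of the discounted price estimates (used for the confidence intervals)
    discount = np.exp(-r * T)
    call_std = discount * np.std(call_payoffs) / np.sqrt(n_samples)
    put_std = discount * np.std(put_payoffs) / np.sqrt(n_samples)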

    computation_time = time.time() - start_time

    return {
        'n_samples': n_samples,
        'computation_time': computation_time,
        'call_price': call_price,
        'put_price': put_price,
        'call_confidence_interval': [call_price - 1.96*call_std, call_price + 1.96*call_std],
        'put_confidence_interval': [put_price - 1.96*put_std, put_price + 1.96*put_std],
        'parameters': {'S0': S0, 'K': K, 'T': T, 'r': r, 'sigma': sigma}
    }

print("✓ Advanced scientific computing examples defined")

# Example simulation parameters
example_lorenz_params = {
    'parameter_sets': [
        {'sigma': 10.0, 'rho': 28.0, 'beta': 8.0/3.0},    # Classic chaotic
        {'sigma': 10.0, 'rho': 24.74, 'beta': 8.0/3.0},   # Near onset
        {'sigma': 10.0, 'rho': 99.65, 'beta': 8.0/3.0},   # High rho
        {'sigma': 16.0, 'rho': 45.92, 'beta': 4.0},       # Different params
    ],
    'max_time': 25.0,
    'num_points': 5000
}

print("\n📝 Example usage:")
print("# Lorenz simulation:")
print("# result = gcp_scientific_simulation(example_lorenz_params)")
print("# print(f'Completed {result[\"num_simulations\"]} simulations')")
print("# print(f'Computation time: {result[\"computation_time\"]:.2f} seconds')")
print("#")
print("# Monte Carlo simulation:")
print("# mc_result = gcp_monte_carlo_simulation(n_samples=5000000)")
print("# print(f'Call option price: ${mc_result[\"call_price\"]:.2f}')")

print("\n🧪 These examples demonstrate GCP's computational capabilities:")
print("  • Parameter sweeps over differential equation systems")
print("  • Statistical simulations with confidence intervals")
print("  • Cloud Storage integration for results")
print("  • Visualization generation and storage")
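
A quick way to sanity-check the Monte Carlo estimate is to compare it with the closed-form Black-Scholes price, which applies here because the simulation draws terminal prices from the same lognormal model for a European option. The sketch below uses only the standard library; the function names `normal_cdf` and `black_scholes_prices` are illustrative and not part of the tutorial's code.

```python
import math

def normal_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def black_scholes_prices(S0, K, T, r, sigma):
    """Closed-form European call and put prices under Black-Scholes."""
    d1 = (math.log(S0 / K) + (r + 0.5 * sigma**2) * T) / (sigma * math.sqrt(T))
    d2 = d1 - sigma * math.sqrt(T)
    discounted_strike = K * math.exp(-r * T)
    call = S0 * normal_cdf(d1) - discounted_strike * normal_cdf(d2)
    put = discounted_strike * normal_cdf(-d2) - S0 * normal_cdf(-d1)
    return call, put

# Same parameters as gcp_monte_carlo_simulation
call, put = black_scholes_prices(S0=100, K=105, T=1.0, r=0.05, sigma=0.2)
print(f"call={call:.4f}, put={put:.4f}")
```

If the simulation is working, the analytic call and put prices should fall inside, or very close to, the reported 95% confidence intervals for a large `n_samples`.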

Summary¶

This tutorial covered:

Setup: GCP authentication and Clustrix installation
Compute Engine: Direct VM configuration and management
GKE Integration: Kubernetes clusters for containerized workloads
Cloud Batch: Managed job scheduling for large-scale processing
Cloud Storage: Data management and result storage
Vertex AI: Machine learning platform integration
Security: Best practices for secure deployment
Resource Management: Proper cleanup procedures

Cost Monitoring¶

For comprehensive cost monitoring, optimization strategies, and multi-cloud cost comparisons, see the dedicated Cost Monitoring Tutorial.

Next Steps¶

Set up your GCP credentials and test the basic configuration
Start with a simple Compute Engine instance for initial testing
Consider GKE for containerized workloads and auto-scaling
Explore Cloud Batch for large-scale batch processing
Implement proper monitoring and access controls
Review the Cost Monitoring Tutorial for expense tracking
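
For the first step, a small preflight check can confirm the local environment before any job is submitted. The sketch below is one possible check under stated assumptions: it looks for the `gcloud` executable on `PATH` and for a service account key referenced by the `GOOGLE_APPLICATION_CREDENTIALS` environment variable. Credentials created with `gcloud auth application-default login` do not set that variable, so a missing variable does not necessarily mean authentication will fail. The function name `check_gcp_prerequisites` is hypothetical.

```python
import os
import shutil

def check_gcp_prerequisites(environ=None):
    """Report which local GCP prerequisites appear to be in place."""
    environ = os.environ if environ is None else environ
    key_path = environ.get("GOOGLE_APPLICATION_CREDENTIALS")
    return {
        # True when the gcloud CLI can be found on PATH
        "gcloud_on_path": shutil.which("gcloud") is not None,
        # True when a service account key path is configured
        "credentials_env_set": bool(key_path),
        # True when that path points to an existing file
        "credentials_file_exists": bool(key_path) and os.path.isfile(key_path),
    }

for check, passed in check_gcp_prerequisites().items():
    print(f"{'✓' if passed else '✗'} {check}")
```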

GCP-Specific Advantages¶

Preemptible/Spot VMs: Exceptional cost savings (up to 80%)
Google Kubernetes Engine: Industry-leading managed Kubernetes
Vertex AI: Comprehensive ML platform with AutoML capabilities
Global Network: Superior network performance and global reach
BigQuery Integration: Seamless data analytics integration
Sustained Use Discounts: Automatic discounts on eligible Compute Engine resources that run for a large share of the billing month

Resources¶

Remember: Always monitor your cloud costs and clean up resources when not in use!