HuggingFace Spaces Tutorial

This tutorial demonstrates how to use Clustrix with HuggingFace Spaces for ML model deployment and distributed computing.

Overview

HuggingFace Spaces hosts ML applications and offers several features that pair well with Clustrix:

Gradio Apps: Interactive web interfaces for ML models
Streamlit Apps: Data science web applications
Static Spaces: HTML/JS applications
Docker Spaces: Custom containerized applications
GPU Support: Hardware acceleration for compute-intensive tasks
Persistent Storage: Data storage across sessions
Secrets Management: Secure credential storage
Community Hub: Easy sharing and collaboration

Prerequisites

HuggingFace account (free)
HuggingFace Hub token for authentication
Basic understanding of Gradio or Streamlit
Git for version control

Installation and Setup

Install Clustrix with HuggingFace dependencies:

[ ]:

# Install Clustrix with HuggingFace support
!pip install clustrix huggingface_hub gradio streamlit transformers datasets

# Import required libraries
import clustrix
from clustrix import cluster, configure
# Repository is not imported: it is unused below and deprecated in recent huggingface_hub releases
from huggingface_hub import HfApi, login, upload_file
import gradio as gr
import streamlit as st
import os
import numpy as np
import time
import json
import requests
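
Before running the cells that follow, it can help to confirm which of the packages installed above are importable in the current environment. The sketch below uses only the standard library. The helper name `missing_packages` is hypothetical and not part of Clustrix or `huggingface_hub`, and the module list simply mirrors the `pip install` line above.

```python
from importlib.util import find_spec

# Import names for the packages installed above (assumed to match their pip names).
REQUIRED_MODULES = [
    "clustrix",
    "huggingface_hub",
    "gradio",
    "streamlit",
    "transformers",
    "datasets",
]


def missing_packages(module_names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in module_names if find_spec(name) is None]


missing = missing_packages(REQUIRED_MODULES)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are importable.")
```

Checking with `find_spec` avoids importing heavy libraries such as `transformers` just to see whether they are present.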

HuggingFace Authentication Setup

Option 1: Interactive Login

[ ]:

# Login to HuggingFace (will prompt for token)
# login()

# Or set token as environment variable
# os.environ['HUGGINGFACE_HUB_TOKEN'] = 'your-token-here'

# Test authentication
try:
    api = HfApi()
    user_info = api.whoami()
    print(f"Successfully authenticated as: {user_info['name']}")
except Exception as e:
    print(f"Authentication failed: {e}")

Get your token from https://huggingface.co/settings/tokens
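
When the token is supplied through the environment rather than an interactive prompt, a small lookup helper keeps the logic in one place. This is a minimal sketch under stated assumptions: `resolve_hf_token` is a hypothetical helper, `HF_TOKEN` is the variable name documented by current `huggingface_hub` releases, and `HUGGINGFACE_HUB_TOKEN` is the older name used in the cell above.

```python
import os


def resolve_hf_token(env=None):
    """Return the first non-empty token found in the environment, or None."""
    env = os.environ if env is None else env
    # Check the current variable name first, then the legacy one.
    for var_name in ("HF_TOKEN", "HUGGINGFACE_HUB_TOKEN"):
        token = env.get(var_name, "").strip()
        if token:
            return token
    return None


token = resolve_hf_token()
if token is None:
    print("No token found in the environment; fall back to login().")
else:
    # Never print the token itself; report only that one was found.
    print("Token found in the environment.")
```

The returned value can then be passed explicitly, for example `login(token=token)`, instead of relying on the interactive prompt.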

Method 1: Gradio Space with Clustrix Backend

Create a Gradio App with Distributed Computing

[ ]:

def create_gradio_clustrix_app():
    """
    Create a Gradio app that uses Clustrix for backend computations.
    """

    # This would typically be configured to point to your cluster
    # For demo purposes, we'll use local execution
    configure(
        cluster_host=None,  # Local execution for demo
        package_manager="auto"
    )

    @cluster(cores=2, memory="4GB")
    def distributed_model_training(dataset_size, model_type, n_estimators):
        """Train ML model using distributed computing."""
        import numpy as np
        from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
        from sklearn.datasets import make_classification
        from sklearn.model_selection import train_test_split, cross_val_score
        from sklearn.metrics import accuracy_score, classification_report
        import time

        start_time = time.time()

        # Generate synthetic dataset
        X, y = make_classification(
            n_samples=int(dataset_size),
            n_features=20,
            n_classes=3,
            n_informative=15,
            random_state=42
        )

        X_train, X_test, y_train, y_test = train_test_split(
            X, y, test_size=0.2, random_state=42
        )

        # Select model
        if model_type == "Random Forest":
            model = RandomForestClassifier(
                n_estimators=int(n_estimators),
                random_state=42,
                n_jobs=-1
            )
        else:  # Gradient Boosting
            model = GradientBoostingClassifier(
                n_estimators=int(n_estimators),
                random_state=42
            )

        # Train model
        model.fit(X_train, y_train)

        # Evaluate
        y_pred = model.predict(X_test)
        accuracy = accuracy_score(y_test, y_pred)

        # Cross-validation
        cv_scores = cross_val_score(model, X_train, y_train, cv=5)

        training_time = time.time() - start_time

        return {
            'accuracy': accuracy,
            'cv_mean': cv_scores.mean(),
            'cv_std': cv_scores.std(),
            'training_time': training_time,
            'model_type': model_type,
            'n_estimators': n_estimators,
            'dataset_size': dataset_size,
            'feature_importance': model.feature_importances_[:5].tolist()
        }

    def train_model_interface(dataset_size, model_type, n_estimators):
        """Gradio interface function."""
        try:
            # Run distributed training
            result = distributed_model_training(dataset_size, model_type, n_estimators)

            # Format results for display
            output = f"""
**Training Results:**

- **Model Type:** {result['model_type']}
- **Dataset Size:** {result['dataset_size']:,} samples
- **Number of Estimators:** {result['n_estimators']}
- **Test Accuracy:** {result['accuracy']:.4f}
- **CV Mean Score:** {result['cv_mean']:.4f} ± {result['cv_std']:.4f}
- **Training Time:** {result['training_time']:.2f} seconds

**Top 5 Feature Importances:**
{', '.join([f'{imp:.4f}' for imp in result['feature_importance']])}

*Computation completed using Clustrix distributed computing.*
"""
            return output

        except Exception as e:
            return f"Error during training: {str(e)}"

    # Create Gradio interface
    interface = gr.Interface(
        fn=train_model_interface,
        inputs=[
            gr.Slider(
                minimum=1000,
                maximum=50000,
                value=10000,
                step=1000,
                label="Dataset Size"
            ),
            gr.Radio(
                choices=["Random Forest", "Gradient Boosting"],
                value="Random Forest",
                label="Model Type"
            ),
            gr.Slider(
                minimum=10,
                maximum=200,
                value=100,
                step=10,
                label="Number of Estimators"
            )
        ],
        outputs=gr.Markdown(label="Training Results"),
        title="Clustrix Distributed ML Training",
        description="Train machine learning models using Clustrix distributed computing backend.",
        article="""
        ### About This Demo

        This Gradio app demonstrates how to integrate Clustrix with HuggingFace Spaces
        for distributed machine learning. The backend uses Clustrix to:

        - Distribute model training across multiple cores
        - Perform cross-validation in parallel
        - Handle large datasets efficiently

        **Note:** In a production deployment, Clustrix would be configured to use
        remote compute clusters (AWS, Azure, GCP, etc.) for true distributed computing.
        """,
        theme="default",
        examples=[
            [5000, "Random Forest", 50],
            [20000, "Gradient Boosting", 100],
            [10000, "Random Forest", 150]
        ]
    )

    return interface

# Create the Gradio app
app = create_gradio_clustrix_app()

Use ``app.launch()`` to run the Gradio app locally.

Create Space Files Structure

[ ]:

def create_huggingface_space_files():
    """
    Create the necessary files for a HuggingFace Space.
    """

    # app.py - Main Gradio application
    app_py_content = '''
import gradio as gr
import numpy as np
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import accuracy_score
import time
import os

# Import clustrix if available, otherwise use local computation
try:
    from clustrix import cluster, configure
    CLUSTRIX_AVAILABLE = True

    # Configure clustrix (would normally point to remote cluster)
    configure(
        cluster_host=None,  # Local execution in HF Spaces
        package_manager="pip"
    )

    @cluster(cores=2, memory="4GB")
    def train_model_distributed(dataset_size, model_type, n_estimators):
        return train_model_local(dataset_size, model_type, n_estimators)

except ImportError:
    CLUSTRIX_AVAILABLE = False
    def train_model_distributed(dataset_size, model_type, n_estimators):
        return train_model_local(dataset_size, model_type, n_estimators)

def train_model_local(dataset_size, model_type, n_estimators):
    """Local model training function."""
    start_time = time.time()

    # Generate synthetic dataset
    X, y = make_classification(
        n_samples=int(dataset_size),
        n_features=20,
        n_classes=3,
        n_informative=15,
        random_state=42
    )

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42
    )

    # Select model
    if model_type == "Random Forest":
        model = RandomForestClassifier(
            n_estimators=int(n_estimators),
            random_state=42,
            n_jobs=-1
        )
    else:  # Gradient Boosting
        model = GradientBoostingClassifier(
            n_estimators=int(n_estimators),
            random_state=42
        )

    # Train model
    model.fit(X_train, y_train)

    # Evaluate
    y_pred = model.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)

    # Cross-validation (simplified for HF Spaces)
    cv_scores = cross_val_score(model, X_train, y_train, cv=3)  # Reduced CV folds

    training_time = time.time() - start_time

    return {
        'accuracy': accuracy,
        'cv_mean': cv_scores.mean(),
        'cv_std': cv_scores.std(),
        'training_time': training_time,
        'model_type': model_type,
        'n_estimators': n_estimators,
        'dataset_size': dataset_size,
        'feature_importance': model.feature_importances_[:5].tolist()
    }

def train_model_interface(dataset_size, model_type, n_estimators):
    """Gradio interface function."""
    try:
        # Run training (distributed if clustrix available, local otherwise)
        result = train_model_distributed(dataset_size, model_type, n_estimators)

        # Format results for display
        backend_info = "Clustrix Distributed" if CLUSTRIX_AVAILABLE else "Local Computation"

        output = f"""
**Training Results** ({backend_info}):

- **Model Type:** {result['model_type']}
- **Dataset Size:** {result['dataset_size']:,} samples
- **Number of Estimators:** {result['n_estimators']}
- **Test Accuracy:** {result['accuracy']:.4f}
- **CV Mean Score:** {result['cv_mean']:.4f} ± {result['cv_std']:.4f}
- **Training Time:** {result['training_time']:.2f} seconds

**Top 5 Feature Importances:**
{', '.join([f'{imp:.4f}' for imp in result['feature_importance']])}

*Backend: {backend_info}*
"""
        return output

    except Exception as e:
        return f"Error during training: {str(e)}"

# Create Gradio interface
demo = gr.Interface(
    fn=train_model_interface,
    inputs=[
        gr.Slider(
            minimum=1000,
            maximum=20000,  # Reduced for HF Spaces limits
            value=5000,
            step=1000,
            label="Dataset Size"
        ),
        gr.Radio(
            choices=["Random Forest", "Gradient Boosting"],
            value="Random Forest",
            label="Model Type"
        ),
        gr.Slider(
            minimum=10,
            maximum=100,  # Reduced for HF Spaces
            value=50,
            step=10,
            label="Number of Estimators"
        )
    ],
    outputs=gr.Markdown(label="Training Results"),
    title="Clustrix Distributed ML Training",
    description="Train machine learning models with optional Clustrix distributed computing backend.",
    article="""
    ### About This Demo

    This HuggingFace Space demonstrates integration between Clustrix and Gradio.

    **Features:**
    - Interactive ML model training
    - Automatic fallback to local computation
    - Real-time results and performance metrics

    **Clustrix Integration:**
    When properly configured, Clustrix can distribute computations across:
    - AWS EC2, Batch, or ParallelCluster
    - Azure VMs, Batch, or CycleCloud
    - Google Cloud Compute Engine, GKE, or Batch
    - On-premise SLURM, PBS, or SGE clusters

    Visit [Clustrix Documentation](https://clustrix.readthedocs.io/) for setup instructions.
    """,
    examples=[
        [3000, "Random Forest", 30],
        [8000, "Gradient Boosting", 50],
        [5000, "Random Forest", 70]
    ]
)

if __name__ == "__main__":
    demo.launch()
'''

    # requirements.txt
    requirements_content = '''
gradio==4.44.0
numpy==1.24.3
scikit-learn==1.3.0
clustrix>=0.1.1
'''

    # README.md
    readme_content = '''
---
title: Clustrix Distributed ML Training
emoji: 🚀
colorFrom: blue
colorTo: green
sdk: gradio
sdk_version: 4.44.0
app_file: app.py
pinned: false
license: mit
tags:
- machine-learning
- distributed-computing
- clustrix
- scikit-learn
---

# Clustrix Distributed ML Training

This HuggingFace Space demonstrates how to integrate Clustrix distributed computing
with Gradio for interactive machine learning applications.

## Features

- **Interactive Training**: Train ML models through a web interface
- **Multiple Algorithms**: Support for Random Forest and Gradient Boosting
- **Real-time Results**: See training progress and results immediately
- **Distributed Backend**: Optional Clustrix integration for scaling

## How It Works

1. **Data Generation**: Creates synthetic classification datasets
2. **Model Training**: Trains selected algorithm with specified parameters
3. **Evaluation**: Performs cross-validation and test set evaluation
4. **Results Display**: Shows metrics and feature importance

## Clustrix Integration

When Clustrix is properly configured, this app can distribute computations across:

- **Cloud Platforms**: AWS, Azure, Google Cloud
- **HPC Clusters**: SLURM, PBS/Torque, SGE
- **Container Orchestration**: Kubernetes, Docker Swarm
- **SSH Clusters**: Any SSH-accessible compute nodes

## Usage

1. Adjust the dataset size (1,000 - 20,000 samples)
2. Select the model type (Random Forest or Gradient Boosting)
3. Set the number of estimators (10 - 100)
4. Click "Submit" to start training
5. View results including accuracy, cross-validation scores, and timing

## Local Development

To run this app locally:

```bash
pip install -r requirements.txt
python app.py
```

## Learn More

- [Clustrix Documentation](https://clustrix.readthedocs.io/)
- [Gradio Documentation](https://gradio.app/docs/)
- [HuggingFace Spaces](https://huggingface.co/docs/hub/spaces)
'''

    files = {
        'app.py': app_py_content.strip(),
        'requirements.txt': requirements_content.strip(),
        'README.md': readme_content.strip()
    }

    print("HuggingFace Space Files:")
    print("========================")

    for filename, content in files.items():
        print(f"\n--- {filename} ---")
        print(content[:500] + "..." if len(content) > 500 else content)

    return files

space_files = create_huggingface_space_files()
print("\nSpace files created. Upload these to create your HuggingFace Space.")
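
The dictionary returned above holds file contents in memory only. Before pushing with Git, the files need to exist on disk. The following is a minimal sketch of that step; `write_space_files` is a hypothetical helper, and the demo dictionary is a stand-in for `space_files`.

```python
import tempfile
from pathlib import Path


def write_space_files(files, out_dir):
    """Write a {filename: content} mapping into out_dir and return the written paths."""
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)
    written = []
    for filename, content in files.items():
        target = out_path / filename
        # UTF-8 is set explicitly because the README front matter contains an emoji.
        target.write_text(content + "\n", encoding="utf-8")
        written.append(target)
    return written


# Demo with a stand-in dictionary and a throwaway directory.
demo_files = {"app.py": "print('hello')", "requirements.txt": "gradio"}
with tempfile.TemporaryDirectory() as tmp_dir:
    paths = write_space_files(demo_files, tmp_dir)
    print(sorted(path.name for path in paths))
```

In practice `out_dir` would be the cloned Space repository, so that `git add .` picks up the three files.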

Deploy to HuggingFace Spaces

[ ]:

def deploy_clustrix_space(username, space_name, space_files):
    """
    Deploy Clustrix app to HuggingFace Spaces.

    Args:
        username: Your HuggingFace username
        space_name: Name for the new space
        space_files: Dictionary of files to upload
    """

    # Commands to create and deploy the space
    deployment_commands = f"""
# Method 1: Using the web interface and Git (Recommended)

# Create space via web interface first:
# 1. Go to https://huggingface.co/new-space
# 2. Choose username: {username}
# 3. Space name: {space_name}
# 4. License: MIT
# 5. SDK: Gradio
# 6. Hardware: CPU basic (free) or upgrade as needed

# Then clone and upload files:
git clone https://huggingface.co/spaces/{username}/{space_name}
cd {space_name}

# Copy your files (app.py, requirements.txt, README.md) to this directory

git add .
git commit -m "Initial commit: Clustrix distributed ML training app"
git push

# Method 2: Using Python API
# (Run this in Python after authentication)
"""

    python_deployment = f'''
from huggingface_hub import HfApi, upload_file
import tempfile
import os

# Initialize API
api = HfApi()

# Create space
api.create_repo(
    repo_id="{username}/{space_name}",
    repo_type="space",
    space_sdk="gradio",
    private=False
)

# Upload files
space_files = {space_files}

for filename, content in space_files.items():
    with tempfile.NamedTemporaryFile(mode='w', suffix=f'_{{filename}}', delete=False, encoding='utf-8') as f:
        f.write(content)
        temp_path = f.name

    upload_file(
        path_or_fileobj=temp_path,
        path_in_repo=filename,
        repo_id="{username}/{space_name}",
        repo_type="space",
        commit_message=f"Add {{filename}}"
    )

    os.unlink(temp_path)

print(f"Space deployed: https://huggingface.co/spaces/{username}/{space_name}")
'''

    print("HuggingFace Space Deployment:")
    print("==============================")
    print(deployment_commands)
    print("\nPython Deployment Code:")
    print(python_deployment)

    return {
        'space_url': f'https://huggingface.co/spaces/{username}/{space_name}',
        'deployment_commands': deployment_commands,
        'python_code': python_deployment
    }

# Example deployment
deployment_info = deploy_clustrix_space(
    username='your-username',  # Replace with your HF username
    space_name='clustrix-ml-training',
    space_files=space_files
)

print("\nDeployment instructions generated.")
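
The generated script above goes through temporary files. As an alternative sketch, the same upload can be done from memory, assuming the installed `huggingface_hub` accepts `exist_ok` in `create_repo` and raw bytes for `path_or_fileobj` in `upload_file`. The function `upload_space_files` and the `DryRunApi` stand-in are hypothetical and exist only so the example runs without network access; in real use, pass an authenticated `HfApi()` instance instead.

```python
def upload_space_files(api, repo_id, files, sdk="gradio"):
    """Create the Space if needed and upload each file from memory.

    `api` is expected to behave like huggingface_hub.HfApi.
    """
    api.create_repo(repo_id=repo_id, repo_type="space", space_sdk=sdk, exist_ok=True)
    for filename, content in files.items():
        api.upload_file(
            path_or_fileobj=content.encode("utf-8"),  # bytes, so no temp file is needed
            path_in_repo=filename,
            repo_id=repo_id,
            repo_type="space",
            commit_message=f"Add {filename}",
        )
    return f"https://huggingface.co/spaces/{repo_id}"


class DryRunApi:
    """Stand-in that records calls instead of contacting the Hub (illustration only)."""

    def __init__(self):
        self.calls = []

    def create_repo(self, **kwargs):
        self.calls.append(("create_repo", kwargs))

    def upload_file(self, **kwargs):
        self.calls.append(("upload_file", kwargs))


dry_run = DryRunApi()
url = upload_space_files(dry_run, "your-username/clustrix-ml-training", {"app.py": "print('hi')"})
print(url)
print([name for name, _ in dry_run.calls])
```

Passing the API object in as a parameter keeps the function easy to exercise with a stand-in before pointing it at a real account.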

Method 2: Streamlit Space with Clustrix

[ ]:

def create_streamlit_clustrix_app():
    """
    Create a Streamlit app template for HuggingFace Spaces.
    """

    streamlit_app_content = '''
import streamlit as st
import numpy as np
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix
import time

# Import clustrix if available
try:
    from clustrix import cluster, configure
    CLUSTRIX_AVAILABLE = True
    configure(cluster_host=None, package_manager="pip")
except ImportError:
    CLUSTRIX_AVAILABLE = False

st.set_page_config(
    page_title="Clustrix ML Dashboard",
    page_icon="🚀",
    layout="wide",
    initial_sidebar_state="expanded"
)

st.title("🚀 Clustrix Distributed ML Dashboard")
st.markdown("""
This dashboard demonstrates machine learning with Clustrix distributed computing backend.
""")

# Sidebar controls
st.sidebar.header("Configuration")

dataset_size = st.sidebar.slider(
    "Dataset Size",
    min_value=1000,
    max_value=20000,
    value=5000,
    step=1000
)

n_features = st.sidebar.slider(
    "Number of Features",
    min_value=5,
    max_value=50,
    value=20,
    step=5
)

n_estimators = st.sidebar.slider(
    "Number of Estimators",
    min_value=10,
    max_value=200,
    value=100,
    step=10
)

max_depth = st.sidebar.slider(
    "Max Depth",
    min_value=3,
    max_value=20,
    value=10
)

# Backend selection
backend = st.sidebar.radio(
    "Computation Backend",
    ["Local", "Clustrix (if available)"]
)

if CLUSTRIX_AVAILABLE and backend == "Clustrix (if available)":
    @cluster(cores=2, memory="4GB")
    def train_model_clustrix(dataset_size, n_features, n_estimators, max_depth):
        return train_model_local(dataset_size, n_features, n_estimators, max_depth)

    train_function = train_model_clustrix
    backend_status = "🚀 Clustrix Distributed"
else:
    train_function = lambda *args: train_model_local(*args)
    backend_status = "💻 Local Computation"

def train_model_local(dataset_size, n_features, n_estimators, max_depth):
    """Train model locally."""
    # Generate dataset
    X, y = make_classification(
        n_samples=dataset_size,
        n_features=n_features,
        n_classes=3,
        n_informative=max(3, n_features // 2),
        random_state=42
    )

    # Split data
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42
    )

    # Train model
    start_time = time.time()
    model = RandomForestClassifier(
        n_estimators=n_estimators,
        max_depth=max_depth,
        random_state=42,
        n_jobs=-1
    )
    model.fit(X_train, y_train)
    training_time = time.time() - start_time

    # Evaluate
    y_pred = model.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)

    return {
        'model': model,
        'X_test': X_test,
        'y_test': y_test,
        'y_pred': y_pred,
        'accuracy': accuracy,
        'training_time': training_time,
        'feature_importance': model.feature_importances_
    }

# Main content
col1, col2 = st.columns([2, 1])

with col2:
    st.markdown(f"**Backend:** {backend_status}")
    st.markdown(f"**Clustrix Available:** {'✅' if CLUSTRIX_AVAILABLE else '❌'}")

if st.button("🚀 Train Model", type="primary"):
    with st.spinner("Training model..."):
        # Train model
        result = train_function(dataset_size, n_features, n_estimators, max_depth)

        # Display results
        col1, col2, col3 = st.columns(3)

        with col1:
            st.metric("Accuracy", f"{result['accuracy']:.4f}")

        with col2:
            st.metric("Training Time", f"{result['training_time']:.2f}s")

        with col3:
            st.metric("Test Samples", len(result['y_test']))

        # Feature importance plot
        st.subheader("Feature Importance")
        importance_df = pd.DataFrame({
            'Feature': [f'Feature {i}' for i in range(len(result['feature_importance']))],
            'Importance': result['feature_importance']
        }).sort_values('Importance', ascending=True)

        fig_importance = px.bar(
            importance_df.tail(10),
            x='Importance',
            y='Feature',
            title="Top 10 Feature Importances",
            orientation='h'
        )
        st.plotly_chart(fig_importance, use_container_width=True)

        # Confusion matrix
        st.subheader("Confusion Matrix")
        cm = confusion_matrix(result['y_test'], result['y_pred'])

        fig_cm = px.imshow(
            cm,
            text_auto=True,
            aspect="auto",
            title="Confusion Matrix",
            labels=dict(x="Predicted", y="Actual")
        )
        st.plotly_chart(fig_cm, use_container_width=True)

# Information section
st.markdown("---")
st.subheader("About Clustrix Integration")

col1, col2 = st.columns(2)

with col1:
    st.markdown("""
    **Clustrix Features:**
    - 🌐 Distributed computing across clusters
    - ☁️ Cloud platform integration (AWS, Azure, GCP)
    - 🐳 Container and Kubernetes support
    - 📊 Automatic workload distribution
    - 🔧 Simple decorator-based API
    """)

with col2:
    st.markdown("""
    **Supported Platforms:**
    - AWS EC2, Batch, ParallelCluster
    - Azure VMs, Batch, CycleCloud
    - Google Compute Engine, GKE, Batch
    - SLURM, PBS/Torque, SGE clusters
    - SSH-accessible compute nodes
    """)

st.markdown("""
**Learn More:**
- [Clustrix Documentation](https://clustrix.readthedocs.io/)
- [GitHub Repository](https://github.com/ContextLab/clustrix)
- [PyPI Package](https://pypi.org/project/clustrix/)
""")
'''

    streamlit_requirements = '''
streamlit==1.28.0
numpy==1.24.3
pandas==2.0.3
scikit-learn==1.3.0
plotly==5.15.0
clustrix>=0.1.1
'''

    streamlit_readme = '''
---
title: Clustrix ML Dashboard
emoji: 📊
colorFrom: purple
colorTo: pink
sdk: streamlit
sdk_version: 1.28.0
app_file: app.py
pinned: false
license: mit
tags:
- machine-learning
- distributed-computing
- clustrix
- dashboard
---

# Clustrix ML Dashboard

An interactive Streamlit dashboard demonstrating Clustrix distributed computing
for machine learning workflows.

## Features

- 📊 **Interactive Dashboard**: Real-time model training and visualization
- 🚀 **Distributed Computing**: Optional Clustrix backend for scaling
- 📈 **Rich Visualizations**: Feature importance and confusion matrix plots
- ⚙️ **Configurable Parameters**: Adjust dataset size, model parameters
- 🔄 **Backend Selection**: Choose between local and distributed computation

## Usage

1. Configure dataset and model parameters in the sidebar
2. Select computation backend (local or Clustrix)
3. Click "Train Model" to start training
4. View results, metrics, and visualizations

## Clustrix Integration

When Clustrix is available and configured, this dashboard can distribute
ML computations across various platforms for improved performance and scalability.
'''

    return {
        'app.py': streamlit_app_content.strip(),
        'requirements.txt': streamlit_requirements.strip(),
        'README.md': streamlit_readme.strip()
    }

streamlit_files = create_streamlit_clustrix_app()
print("Streamlit app files created for HuggingFace Spaces deployment.")
print("\nKey features:")
print("- Interactive dashboard with real-time training")
print("- Rich visualizations with Plotly")
print("- Configurable parameters and backend selection")
print("- Automatic fallback to local computation")
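
The automatic fallback used in these templates can be isolated into one small helper, so that each function does not need its own `try`/`except` or conditional decorator. This is a sketch, not part of the Clustrix API: `maybe_decorate` and `square` are hypothetical names, and the only Clustrix call assumed is `cluster(cores=..., memory=...)` as used elsewhere in this tutorial.

```python
# clustrix is optional: fall back to None when it is not installed.
try:
    from clustrix import cluster
except ImportError:
    cluster = None


def maybe_decorate(decorator_factory, **resources):
    """Return decorator_factory(**resources), or a no-op decorator if the factory is None."""
    if decorator_factory is None:
        # No backend available: leave the function unchanged.
        return lambda func: func
    return decorator_factory(**resources)


# Hypothetical usage mirroring the templates above.
@maybe_decorate(cluster, cores=2, memory="4GB")
def square(x):
    return x * x


print("Clustrix available:", cluster is not None)
```

Because the helper receives the factory as an argument, the local path can be tested by passing `None` explicitly, regardless of what is installed.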

Method 3: GPU-Accelerated Spaces

[ ]:

def create_gpu_clustrix_space():
    """
    Create a GPU-accelerated HuggingFace Space with Clustrix.
    """

    gpu_app_content = '''
import gradio as gr
import torch
import numpy as np
from transformers import pipeline, AutoTokenizer, AutoModelForSequenceClassification
import time
import json

# Import clustrix if available
try:
    from clustrix import cluster, configure
    CLUSTRIX_AVAILABLE = True

    # Configure for GPU-enabled remote clusters
    configure(
        cluster_host=None,  # Local for HF Spaces
        package_manager="pip",
        default_cores=1,  # GPU tasks typically use 1 core
        default_memory="8GB"
    )
except ImportError:
    CLUSTRIX_AVAILABLE = False

# Check GPU availability
CUDA_AVAILABLE = torch.cuda.is_available()
device = "cuda" if CUDA_AVAILABLE else "cpu"

print(f"Device: {device}")
print(f"Clustrix available: {CLUSTRIX_AVAILABLE}")

# Load a pre-trained model for demonstration
@cluster(cores=1, memory="8GB") if CLUSTRIX_AVAILABLE else (lambda f: f)
def load_sentiment_model():
    """Load sentiment analysis model."""
    model_name = "cardiffnlp/twitter-roberta-base-sentiment-latest"
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(model_name)

    if CUDA_AVAILABLE:
        model = model.to(device)

    return pipeline(
        "sentiment-analysis",
        model=model,
        tokenizer=tokenizer,
        device=0 if CUDA_AVAILABLE else -1
    )

# Initialize model
sentiment_pipeline = load_sentiment_model()

@cluster(cores=1, memory="4GB") if CLUSTRIX_AVAILABLE else (lambda f: f)
def batch_sentiment_analysis(texts, use_gpu=True):
    """Perform batch sentiment analysis."""
    start_time = time.time()

    # Process texts in batches
    batch_size = 16 if use_gpu and CUDA_AVAILABLE else 8
    results = []

    for i in range(0, len(texts), batch_size):
        batch = texts[i:i+batch_size]
        batch_results = sentiment_pipeline(batch)
        results.extend(batch_results)

    processing_time = time.time() - start_time

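    # Aggregate results. Label names depend on the checkpoint: some emit
    # LABEL_0/1/2, others negative/neutral/positive, so accept both forms.
    labels = [r['label'].lower() for r in results]
    positive_count = sum(1 for label in labels if label in ('positive', 'label_2'))
    negative_count = sum(1 for label in labels if label in ('negative', 'label_0'))
    neutral_count = sum(1 for label in labels if label in ('neutral', 'label_1'))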

    avg_confidence = np.mean([r['score'] for r in results])

    return {
        'results': results,
        'summary': {
            'total_texts': len(texts),
            'positive': positive_count,
            'negative': negative_count,
            'neutral': neutral_count,
            'avg_confidence': avg_confidence,
            'processing_time': processing_time,
            'texts_per_second': len(texts) / processing_time,
            'device_used': device,
            'clustrix_enabled': CLUSTRIX_AVAILABLE
        }
    }

def process_text_input(text_input, sample_size):
    """Process text input for sentiment analysis."""
    try:
        # Split text into individual texts
        texts = [t.strip() for t in text_input.split('\\n') if t.strip()]

        # Limit sample size for demo
        if len(texts) > sample_size:
            texts = texts[:sample_size]

        if not texts:
            return "Please provide some text to analyze."

        # Run batch analysis
        result = batch_sentiment_analysis(texts)
        summary = result['summary']

        # Format output
        output = f"""
**Batch Sentiment Analysis Results**

📊 **Summary Statistics:**
- Total texts analyzed: {summary['total_texts']}
- Positive sentiment: {summary['positive']} ({summary['positive']/summary['total_texts']*100:.1f}%)
- Negative sentiment: {summary['negative']} ({summary['negative']/summary['total_texts']*100:.1f}%)
- Neutral sentiment: {summary['neutral']} ({summary['neutral']/summary['total_texts']*100:.1f}%)
- Average confidence: {summary['avg_confidence']:.3f}

⚡ **Performance:**
- Processing time: {summary['processing_time']:.2f} seconds
- Throughput: {summary['texts_per_second']:.1f} texts/second
- Device: {summary['device_used'].upper()}
- Backend: {'Clustrix Distributed' if summary['clustrix_enabled'] else 'Local Processing'}

📝 **Individual Results:**
"""

        # Show first few individual results
        label_names = {'label_0': 'Negative', 'label_1': 'Neutral', 'label_2': 'Positive'}
        for i, (text, result_item) in enumerate(zip(texts[:5], result['results'][:5])):
            raw_label = result_item['label'].lower()
            sentiment = label_names.get(raw_label, raw_label.capitalize())
            confidence = result_item['score']
            output += f"\\n{i+1}. \\"{text[:50]}{'...' if len(text) > 50 else ''}\\" → {sentiment} ({confidence:.3f})"

        if len(texts) > 5:
            output += f"\\n... and {len(texts) - 5} more texts"

        return output

    except Exception as e:
        return f"Error during analysis: {str(e)}"

# Create Gradio interface
demo = gr.Interface(
    fn=process_text_input,
    inputs=[
        gr.Textbox(
            lines=10,
            placeholder="Enter texts to analyze (one per line)\\nExample:\\nI love this product!\\nThis is terrible.\\nIt's okay, nothing special.",
            label="Text Input"
        ),
        gr.Slider(
            minimum=1,
            maximum=100,
            value=20,
            step=1,
            label="Max Texts to Process"
        )
    ],
    outputs=gr.Markdown(label="Analysis Results"),
    title="🚀 Clustrix GPU-Accelerated Sentiment Analysis",
    description=f"""
    Batch sentiment analysis using transformer models with optional Clustrix distributed computing.

    **Current Setup:**
    - Device: {device.upper()}
    - Clustrix: {'✅ Available' if CLUSTRIX_AVAILABLE else '❌ Not Available'}
    - GPU Acceleration: {'✅ Enabled' if CUDA_AVAILABLE else '❌ CPU Only'}
    """,
    article="""
    ### About This Demo

    This HuggingFace Space demonstrates GPU-accelerated NLP processing with Clustrix:

    **Features:**
    - Batch processing of multiple texts
    - GPU acceleration when available
    - Comprehensive performance metrics
    - Optional distributed computing backend

    **Clustrix Integration:**
    In production, Clustrix can distribute GPU workloads across:
    - Cloud GPU instances (AWS P3/P4, Azure NC/ND, GCP A100)
    - Multi-GPU clusters with SLURM/PBS scheduling
    - Kubernetes GPU nodes
    - On-premise GPU clusters

    **Model:** `cardiffnlp/twitter-roberta-base-sentiment-latest`
    """,
    examples=[
        [
            "I absolutely love this new feature!\\nThis is the worst experience ever.\\nIt's pretty good, could be better.\\nAmazing work by the team!\\nNot impressed at all.",
            5
        ],
        [
            "Great product, highly recommend!\\nTerrible customer service.\\nAverage quality for the price.\\nOutstanding performance!\\nWaste of money.",
            5
        ]
    ]
)

if __name__ == "__main__":
    demo.launch()
'''

    gpu_requirements = '''
gradio==4.44.0
torch==2.1.0
transformers==4.35.0
numpy==1.24.3
clustrix>=0.1.1
'''

    gpu_readme = '''
---
title: Clustrix GPU Sentiment Analysis
emoji: ⚡
colorFrom: yellow
colorTo: orange
sdk: gradio
sdk_version: 4.44.0
app_file: app.py
pinned: false
license: mit
tags:
- nlp
- sentiment-analysis
- gpu
- distributed-computing
- clustrix
suggested_hardware: t4-small
---

# Clustrix GPU-Accelerated Sentiment Analysis

A high-performance sentiment analysis demo showcasing GPU acceleration
and Clustrix distributed computing integration.

## Features

- ⚡ **GPU Acceleration**: Utilizes GPU for faster inference
- 📊 **Batch Processing**: Efficiently processes multiple texts
- 🚀 **Clustrix Integration**: Optional distributed computing backend
- 📈 **Performance Metrics**: Real-time throughput and timing
- 🤖 **Transformer Models**: Uses state-of-the-art RoBERTa model

## Usage

1. Enter multiple texts (one per line) in the input box
2. Set the maximum number of texts to process
3. Click "Submit" to run batch sentiment analysis
4. View results including sentiment distribution and performance metrics

## Model

This demo uses `cardiffnlp/twitter-roberta-base-sentiment-latest`,
a RoBERTa model fine-tuned for sentiment analysis on Twitter data.

## Clustrix Scaling

In production environments, Clustrix can distribute GPU workloads across:
- Multi-GPU cloud instances
- GPU clusters with job schedulers
- Kubernetes GPU nodes
- Hybrid cloud-edge deployments
'''

    return {
        'app.py': gpu_app_content.strip(),
        'requirements.txt': gpu_requirements.strip(),
        'README.md': gpu_readme.strip()
    }

gpu_files = create_gpu_clustrix_space()
print("GPU-accelerated HuggingFace Space files created.")
print("\nKey features:")
print("- GPU acceleration for transformer models")
print("- Batch processing for improved throughput")
print("- Real-time performance metrics")
print("- Clustrix integration for distributed GPU computing")
print("\nNote: Requires GPU hardware tier on HuggingFace Spaces.")
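
The dictionary returned above only holds file contents in memory. A minimal sketch of writing such a mapping to a local folder follows; the helper name write_space_files and the stand-in file contents are illustrative assumptions, not part of Clustrix or huggingface_hub.

```python
import tempfile
from pathlib import Path


def write_space_files(files, out_dir):
    """Write a {filename: content} mapping into out_dir and return the paths."""
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)
    written = []
    for name, content in files.items():
        target = out_path / name
        # UTF-8 keeps the emoji in the README front matter intact
        target.write_text(content + "\n", encoding="utf-8")
        written.append(target)
    return written


# Stand-in for the dictionary returned by create_gpu_clustrix_space()
example_files = {
    "app.py": "print('hello from the Space')",
    "requirements.txt": "gradio==4.44.0",
    "README.md": "---\ntitle: Demo\nsdk: gradio\n---",
}

space_dir = tempfile.mkdtemp(prefix="gpu_space_")
paths = write_space_files(example_files, space_dir)
print(f"Wrote {len(paths)} files to {space_dir}")
```

The resulting folder can then be pushed to a Space with git, or uploaded with HfApi.upload_folder(folder_path=..., repo_id=..., repo_type="space") once you are authenticated.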

Secrets and Configuration Management¶

[ ]:

import os
import base64
import tempfile
from clustrix import configure

def setup_clustrix_from_secrets():
    """Configure Clustrix using HuggingFace Spaces secrets."""

    # Get cluster configuration from secrets
    cluster_host = os.getenv('CLUSTER_HOST')
    cluster_username = os.getenv('CLUSTER_USERNAME', 'clustrix')
    ssh_key_b64 = os.getenv('CLUSTER_SSH_KEY')

    if not cluster_host:
        print("No cluster host configured, using local execution")
        configure(cluster_host=None)
        return False

    # Handle SSH key
    key_file_path = None
    if ssh_key_b64:
        try:
            # Decode base64 SSH key
            ssh_key = base64.b64decode(ssh_key_b64).decode('utf-8')

            # Write to temporary file
            with tempfile.NamedTemporaryFile(mode='w', delete=False, suffix='.pem') as f:
                f.write(ssh_key)
                key_file_path = f.name

            # Set correct permissions
            os.chmod(key_file_path, 0o600)

        except Exception as e:
            print(f"Error processing SSH key: {e}")
            return False

    # Configure Clustrix
    try:
        configure(
            cluster_type="ssh",
            cluster_host=cluster_host,
            username=cluster_username,
            key_file=key_file_path,
            remote_work_dir="/tmp/clustrix",
            package_manager="auto",
            default_cores=2,
            default_memory="4GB",
            default_time="01:00:00"
        )

        print(f"✅ Clustrix configured for remote execution on {cluster_host}")
        return True

    except Exception as e:
        print(f"❌ Failed to configure Clustrix: {e}")
        configure(cluster_host=None)  # Fallback to local
        return False

def setup_cloud_credentials():
    """Setup cloud credentials from secrets."""

    # AWS credentials
    # Spaces secrets are already exposed as environment variables, so SDKs
    # such as boto3 can read them directly; we only check that they are set.
    aws_key = os.getenv('AWS_ACCESS_KEY_ID')
    aws_secret = os.getenv('AWS_SECRET_ACCESS_KEY')
    if aws_key and aws_secret:
        print("✅ AWS credentials configured")

    # Azure credentials (all three values are needed for a service principal)
    azure_client_id = os.getenv('AZURE_CLIENT_ID')
    azure_client_secret = os.getenv('AZURE_CLIENT_SECRET')
    azure_tenant_id = os.getenv('AZURE_TENANT_ID')
    if azure_client_id and azure_client_secret and azure_tenant_id:
        print("✅ Azure credentials configured")

    # Google Cloud credentials
    gcp_key = os.getenv('GCP_SERVICE_ACCOUNT_KEY')
    if gcp_key:
        with tempfile.NamedTemporaryFile(mode='w', delete=False, suffix='.json') as f:
            f.write(gcp_key)
            os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = f.name
        print("✅ Google Cloud credentials configured")

# Example usage in your Space app:
# setup_cloud_credentials()
# clustrix_enabled = setup_clustrix_from_secrets()
# print(f"Clustrix distributed computing: {'Enabled' if clustrix_enabled else 'Disabled (local mode)'}")

HuggingFace Spaces Secrets Management for Clustrix¶

1. Access Secrets in Space Settings¶

Go to your Space settings page
Navigate to the “Repository secrets” section
Add secrets as key-value pairs

2. Common Clustrix Secrets¶

CLUSTER_HOST: Hostname or IP address of your compute cluster
CLUSTER_USERNAME: SSH username for cluster access
CLUSTER_SSH_KEY: Private SSH key (base64 encoded)
AWS_ACCESS_KEY_ID: AWS credentials for cloud clusters
AWS_SECRET_ACCESS_KEY: AWS secret key
AZURE_CLIENT_ID: Azure service principal ID
AZURE_CLIENT_SECRET: Azure service principal secret
AZURE_TENANT_ID: Azure tenant ID (required alongside the client ID and secret)
GCP_SERVICE_ACCOUNT_KEY: Google Cloud service account JSON
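
Because CLUSTER_SSH_KEY is expected in base64 form, the key has to be encoded before it is pasted into the secret. The following is a small sketch of that round trip using only the standard library; the function names and the placeholder key text are assumptions made for illustration.

```python
import base64


def encode_key_for_secret(key_text):
    """Return a single-line base64 string suitable for a Spaces secret."""
    return base64.b64encode(key_text.encode("utf-8")).decode("ascii")


def decode_key_from_secret(secret_value):
    """Reverse of encode_key_for_secret, as done inside the Space."""
    return base64.b64decode(secret_value).decode("utf-8")


# Placeholder text only; never hard-code a real private key in source code
fake_key = (
    "-----BEGIN OPENSSH PRIVATE KEY-----\n"
    "not-a-real-key\n"
    "-----END OPENSSH PRIVATE KEY-----\n"
)

encoded = encode_key_for_secret(fake_key)
print(f"Encoded length: {len(encoded)} characters")
```

Encoding removes the newlines from the key, which avoids formatting problems when the value is stored as a single secret string.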

3. Security Best Practices¶

Use service accounts instead of personal credentials
Rotate secrets regularly
Apply principle of least privilege
Monitor secret usage and access logs

4. Environment Variables in Code¶

Secrets are exposed to the running Space as environment variables, so application code can read them at runtime with os.getenv()

Configuration Code Example¶
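
As a minimal sketch, the settings can be collected from environment variables in one place and checked before Clustrix is configured. The function load_cluster_settings and the shape of the returned dictionary are assumptions for illustration; the variable names match the secrets listed above.

```python
import os


def load_cluster_settings(env=None):
    """Collect cluster settings from environment variables.

    Returns None when no cluster host is set, which signals local execution.
    """
    env = os.environ if env is None else env
    host = env.get("CLUSTER_HOST")
    if not host:
        return None
    return {
        "cluster_host": host,
        "username": env.get("CLUSTER_USERNAME", "clustrix"),
        # Record only whether a key exists; never print the key itself
        "has_ssh_key": bool(env.get("CLUSTER_SSH_KEY")),
    }


settings = load_cluster_settings()
if settings is None:
    print("No CLUSTER_HOST set, staying in local mode")
else:
    print(f"Cluster host configured for user {settings['username']}")
```

Passing a plain dictionary as env makes the function easy to test without touching the real environment.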

Deployment Tips and Best Practices¶

Troubleshooting Guide¶

Common Issues and Solutions¶

❌ Problem: Space fails to start ✅ Solution:

Check requirements.txt for version conflicts
Verify Python version compatibility
Review app.py for syntax errors
Check Space logs for detailed error messages

❌ Problem: Clustrix connection fails ✅ Solution:

Verify cluster host is accessible from HF Spaces
Check SSH key format and permissions
Ensure firewall allows connections from HF IPs
Implement fallback to local execution
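
One way to check reachability before attempting an SSH connection is a plain TCP probe. This is a sketch under the assumption that the cluster accepts SSH on port 22; can_reach is a hypothetical helper, not a Clustrix function.

```python
import socket


def can_reach(host, port=22, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS failures
        return False


# Example usage inside a Space (uncomment to run against your own host):
# import os
# host = os.getenv("CLUSTER_HOST")
# if host and can_reach(host):
#     print("Cluster reachable, enabling remote execution")
# else:
#     print("Cluster not reachable, falling back to local execution")
```

A successful probe only shows that the port is open; authentication problems with the SSH key still need to be handled separately.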

❌ Problem: GPU not detected ✅ Solution:

Upgrade to GPU-enabled hardware tier
Check torch.cuda.is_available() in code
Verify CUDA-compatible PyTorch version
Set suggested_hardware in the README metadata (the hardware itself is assigned in the Space settings)
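
The GPU check can be wrapped so that the app still starts when PyTorch is missing or no GPU is attached. A minimal sketch, with detect_device as an illustrative helper name:

```python
def detect_device():
    """Return "cuda" when a usable GPU is visible to PyTorch, else "cpu"."""
    try:
        import torch
    except ImportError:
        # PyTorch is not installed, so only CPU execution is possible
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


device = detect_device()
print(f"Running on: {device.upper()}")
```

If this prints CPU on a GPU tier, the installed PyTorch build is a likely cause and worth checking next.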

❌ Problem: Memory errors ✅ Solution:

Optimize batch sizes for available memory
Clear GPU cache with torch.cuda.empty_cache()
Use memory-efficient model loading
Consider model quantization or distillation
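
Limiting the batch size is often the simplest of these fixes. The sketch below splits a list of inputs into fixed-size batches; iter_batches is a hypothetical helper and the batch size of 2 is only an example.

```python
def iter_batches(items, batch_size):
    """Yield consecutive slices of items with at most batch_size elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


texts = ["great", "bad", "fine", "awful", "superb"]
for batch in iter_batches(texts, batch_size=2):
    # Run the model on one batch at a time here; on GPU tiers,
    # torch.cuda.empty_cache() can be called between batches if needed
    print(f"Processing {len(batch)} texts")
```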

❌ Problem: Slow performance ✅ Solution:

Profile code to identify bottlenecks
Use appropriate hardware tier
Implement model caching and warm-up
Optimize data preprocessing pipeline
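
Model caching can be sketched with functools.lru_cache so that the expensive load happens once per process. The loader below returns a placeholder dictionary instead of a real model; in an actual Space it would wrap the real loading call.

```python
import time
from functools import lru_cache


@lru_cache(maxsize=1)
def get_model(model_name):
    """Load the model once and reuse it for later requests."""
    # Placeholder for an expensive load such as a transformers pipeline
    return {"name": model_name, "loaded_at": time.time()}


model_name = "cardiffnlp/twitter-roberta-base-sentiment-latest"
first = get_model(model_name)   # triggers the load
second = get_model(model_name)  # served from the cache
print(get_model.cache_info())
```

Calling get_model once at startup also serves as a warm-up, so the first user request does not pay the loading cost.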

HuggingFace Spaces Hardware Tiers¶

🆓 CPU Basic (Free):

2 vCPUs, 16GB RAM
Good for: Simple demos, small models, prototyping
Clustrix use case: Local fallback, lightweight computations

💰 CPU Upgrade ($0.03/hour):

8 vCPUs, 32GB RAM
Good for: CPU-intensive tasks, larger datasets
Clustrix use case: Medium-scale local processing

🚀 T4 Small ($0.60/hour):

4 vCPUs, 15GB RAM, 1x T4 GPU (16GB VRAM)
Good for: Deep learning inference, computer vision
Clustrix use case: GPU-accelerated ML, model training demos

⚡ A10G Small ($1.05/hour):

4 vCPUs, 15GB RAM, 1x A10G GPU (24GB VRAM)
Good for: Large models, high-performance inference
Clustrix use case: Production-scale ML applications

🔥 A100 Large ($4.13/hour):

12 vCPUs, 142GB RAM, 1x A100 GPU (40GB VRAM)
Good for: Massive models, research applications
Clustrix use case: Distributed training coordination
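
A rough monthly cost can be estimated from the hourly rate and the expected uptime. The sketch below uses the T4 Small rate quoted above as an example input; rates change over time, so treat the number as illustrative and check the current pricing before budgeting.

```python
def estimate_cost(hourly_rate_usd, hours_per_day, days):
    """Estimate the cost in USD for running a tier a given number of hours."""
    return round(hourly_rate_usd * hours_per_day * days, 2)


# Example: T4 Small at $0.60/hour, running 4 hours a day for 30 days
monthly = estimate_cost(0.60, hours_per_day=4, days=30)
print(f"Estimated monthly cost: ${monthly:.2f}")
```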

HuggingFace Spaces + Clustrix Best Practices¶

🚀 Performance Optimization¶

Use appropriate hardware tier (CPU Basic → T4 Small → A10G Small)
Implement caching for models and data
Use batch processing for multiple requests
Optimize memory usage with careful tensor management
Consider async processing for long-running tasks

🔒 Security¶

Store all credentials in Spaces secrets
Use service accounts instead of personal credentials
Implement input validation and sanitization
Never log sensitive information
Use HTTPS for all external API calls
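
Input validation for a text box like the one in the sentiment demo can be sketched as follows. The limits and the helper name clean_texts are assumptions chosen for illustration, not values taken from the app.

```python
def clean_texts(raw_input, max_texts=20, max_chars=512):
    """Split user input into lines, drop empties, and enforce size limits."""
    if not isinstance(raw_input, str):
        raise TypeError("raw_input must be a string")
    cleaned = []
    for line in raw_input.splitlines():
        # Remove non-printable characters, then surrounding whitespace
        text = "".join(ch for ch in line if ch.isprintable()).strip()
        if text:
            cleaned.append(text[:max_chars])
    # Sliders may deliver floats, so coerce the limit to an integer
    return cleaned[:int(max_texts)]


sample = "I love this!\n\n   This is terrible.   \nIt's okay."
print(clean_texts(sample, max_texts=2))
```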

🎯 User Experience¶

Provide clear error messages and fallbacks
Show progress indicators for long operations
Include example inputs and use cases
Add comprehensive documentation
Implement graceful degradation when Clustrix is unavailable
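
Graceful degradation can be sketched by falling back to a no-op decorator when the clustrix package cannot be imported. The helper load_cluster_decorator is hypothetical; it assumes only that clustrix exposes a cluster decorator, as imported at the start of this tutorial.

```python
import importlib


def load_cluster_decorator(module_name="clustrix"):
    """Return (cluster_decorator, available) with a local no-op fallback."""
    try:
        module = importlib.import_module(module_name)
        return module.cluster, True
    except (ImportError, AttributeError):
        def cluster(*dargs, **dkwargs):
            # Support both @cluster and @cluster(cores=2, ...)
            if len(dargs) == 1 and callable(dargs[0]) and not dkwargs:
                return dargs[0]

            def wrap(func):
                return func

            return wrap

        return cluster, False


cluster, CLUSTRIX_AVAILABLE = load_cluster_decorator()
print(f"Clustrix available: {CLUSTRIX_AVAILABLE}")
```

Functions decorated with the fallback simply run in the Space process, so the rest of the app does not need a separate code path.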

📊 Monitoring and Debugging¶

Add logging for key operations
Include performance metrics in the UI
Monitor resource usage and costs
Set up alerts for failures
Use descriptive commit messages for versioning
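
Logging and basic timing can be combined in a small decorator. This is a sketch using only the standard library; the logger name space_app and the example function are assumptions.

```python
import functools
import logging
import time

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
logger = logging.getLogger("space_app")


def log_timing(func):
    """Log how long each call to func takes, even when it raises."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            elapsed = time.perf_counter() - start
            logger.info("%s finished in %.3f s", func.__name__, elapsed)
    return wrapper


@log_timing
def count_words(texts):
    return sum(len(text.split()) for text in texts)


total = count_words(["I love this product", "Not impressed at all"])
```

The log lines appear in the Space logs, which is also where startup errors are reported.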

🔄 Scalability¶

Design for both local and distributed execution
Implement proper error handling and retries
Use connection pooling for database/API connections
Consider rate limiting for external services
Plan for traffic spikes and scaling needs
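
Retries with exponential backoff can be sketched in a few lines. The helper name retry and its default values are illustrative; the sleep function is injectable so the behavior can be tested without waiting.

```python
import time


def retry(func, attempts=3, base_delay=0.5, sleep=time.sleep):
    """Call func until it succeeds, doubling the delay after each failure."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except Exception:
            if attempt == attempts:
                # Out of attempts: surface the last error to the caller
                raise
            sleep(base_delay * 2 ** (attempt - 1))


# Example usage with a hypothetical remote call:
# result = retry(lambda: submit_job_to_cluster(payload), attempts=4)
```

In a real app it is usually better to catch only the exception types that indicate a transient failure, such as connection errors.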

📦 Deployment¶

Pin specific package versions in requirements.txt
Test locally before deploying
Use environment variables for configuration
Implement health checks and status endpoints
Document deployment process and dependencies
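
Version pinning can be checked automatically before deployment. The sketch below scans requirements text for entries without an exact == pin; find_unpinned is a hypothetical helper and the parsing is deliberately simple, so it does not cover every requirements.txt feature.

```python
def find_unpinned(requirements_text):
    """Return requirement lines that are not pinned with '=='."""
    unpinned = []
    for raw_line in requirements_text.splitlines():
        # Drop comments and surrounding whitespace
        line = raw_line.split("#", 1)[0].strip()
        # Skip blanks and pip options such as "-r other.txt"
        if not line or line.startswith("-"):
            continue
        if "==" not in line:
            unpinned.append(line)
    return unpinned


requirements = """
gradio==4.44.0
torch==2.1.0
transformers==4.35.0
numpy==1.24.3
clustrix>=0.1.1
"""

print(find_unpinned(requirements))
```

Applied to the requirements used in the GPU example, this flags the clustrix entry, which uses a minimum version rather than an exact pin.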

Summary¶

This tutorial covered:

Gradio Integration: Interactive ML training interfaces with Clustrix backend
Streamlit Dashboards: Rich data science applications with distributed computing
GPU Acceleration: High-performance NLP processing with transformer models
Secrets Management: Secure credential storage and configuration
Deployment Best Practices: Performance optimization and troubleshooting
Hardware Selection: Choosing appropriate tiers for different use cases

Key Advantages of HuggingFace Spaces + Clustrix¶

Easy Deployment: Simple git-based deployment workflow
Community Sharing: Built-in discoverability and collaboration
Flexible Hardware: From free CPU to high-end GPU instances
Hybrid Computing: Local execution with optional distributed scaling
ML Focus: Optimized for machine learning and AI applications

Next Steps¶

Create your HuggingFace account and get an access token
Start with a simple Gradio app using the provided templates
Configure Clustrix integration using Spaces secrets
Test locally before deploying to ensure compatibility
Monitor performance and scale hardware as needed

Use Cases¶

Research Demos: Showcase distributed computing research
Educational Tools: Interactive learning environments
Prototype Testing: Rapid prototyping with real user feedback
Model Serving: Production-ready ML model deployment
Collaborative Computing: Shared access to distributed resources

Resources¶

Remember: HuggingFace Spaces provides an excellent platform for showcasing Clustrix capabilities and building interactive ML applications with distributed computing backends!