HuggingFace Spaces Tutorial

This tutorial demonstrates how to use Clustrix with HuggingFace Spaces for ML model deployment and distributed computing.

Overview

HuggingFace Spaces hosts ML applications and offers several features that pair well with Clustrix:

  • Gradio Apps: Interactive web interfaces for ML models

  • Streamlit Apps: Data science web applications

  • Static Spaces: HTML/JS applications

  • Docker Spaces: Custom containerized applications

  • GPU Support: Hardware acceleration for compute-intensive tasks

  • Persistent Storage: Data storage across sessions

  • Secrets Management: Secure credential storage

  • Community Hub: Easy sharing and collaboration

Prerequisites

  1. HuggingFace account (free)

  2. HuggingFace Hub token for authentication

  3. Basic understanding of Gradio or Streamlit

  4. Git for version control

Installation and Setup

Install Clustrix with HuggingFace dependencies:

[ ]:
# Install Clustrix with HuggingFace support
!pip install clustrix huggingface_hub gradio streamlit transformers datasets

# Import required libraries
import clustrix
from clustrix import cluster, configure
from huggingface_hub import HfApi, login, upload_file
import gradio as gr
import streamlit as st
import os
import numpy as np
import time
import json
import requests

HuggingFace Authentication Setup

Option 1: Interactive Login

[ ]:
# Login to HuggingFace (will prompt for token)
# login()

# Or set the token as an environment variable (HF_TOKEN is the name huggingface_hub reads)
# os.environ['HF_TOKEN'] = 'your-token-here'

# Test authentication
try:
    api = HfApi()
    user_info = api.whoami()
    print(f"Successfully authenticated as: {user_info['name']}")
except Exception as e:
    print(f"Authentication failed: {e}")

Get your token from https://huggingface.co/settings/tokens
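
For non-interactive environments such as CI jobs, the token can be read from the environment instead of prompting. The following is a minimal sketch, not part of Clustrix: `resolve_hf_token` and `whoami_name` are hypothetical helper names, and the sketch assumes the token is stored under `HF_TOKEN` or the legacy `HUGGING_FACE_HUB_TOKEN` variable.

```python
import os


def resolve_hf_token(env=None):
    """Return the first non-empty token found in the environment, else None."""
    env = os.environ if env is None else env
    # HF_TOKEN is the current variable name; HUGGING_FACE_HUB_TOKEN is the legacy one.
    for name in ("HF_TOKEN", "HUGGING_FACE_HUB_TOKEN"):
        token = (env.get(name) or "").strip()
        if token:
            return token
    return None


def whoami_name(token):
    """Return the account name for a token (requires network access)."""
    # Imported lazily so resolve_hf_token works without huggingface_hub installed.
    from huggingface_hub import HfApi

    return HfApi(token=token).whoami()["name"]
```

A typical use would be `token = resolve_hf_token()`, followed by `whoami_name(token)` only when a token was actually found.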

Method 1: Gradio Space with Clustrix Backend

Create a Gradio App with Distributed Computing

[ ]:
def create_gradio_clustrix_app():
    """
    Create a Gradio app that uses Clustrix for backend computations.
    """

    # This would typically be configured to point to your cluster
    # For demo purposes, we'll use local execution
    configure(
        cluster_host=None,  # Local execution for demo
        package_manager="auto"
    )

    @cluster(cores=2, memory="4GB")
    def distributed_model_training(dataset_size, model_type, n_estimators):
        """Train ML model using distributed computing."""
        import numpy as np
        from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
        from sklearn.datasets import make_classification
        from sklearn.model_selection import train_test_split, cross_val_score
        from sklearn.metrics import accuracy_score, classification_report
        import time

        start_time = time.time()

        # Generate synthetic dataset
        X, y = make_classification(
            n_samples=int(dataset_size),
            n_features=20,
            n_classes=3,
            n_informative=15,
            random_state=42
        )

        X_train, X_test, y_train, y_test = train_test_split(
            X, y, test_size=0.2, random_state=42
        )

        # Select model
        if model_type == "Random Forest":
            model = RandomForestClassifier(
                n_estimators=int(n_estimators),
                random_state=42,
                n_jobs=-1
            )
        else:  # Gradient Boosting
            model = GradientBoostingClassifier(
                n_estimators=int(n_estimators),
                random_state=42
            )

        # Train model
        model.fit(X_train, y_train)

        # Evaluate
        y_pred = model.predict(X_test)
        accuracy = accuracy_score(y_test, y_pred)

        # Cross-validation
        cv_scores = cross_val_score(model, X_train, y_train, cv=5)

        training_time = time.time() - start_time

        return {
            'accuracy': accuracy,
            'cv_mean': cv_scores.mean(),
            'cv_std': cv_scores.std(),
            'training_time': training_time,
            'model_type': model_type,
            'n_estimators': n_estimators,
            'dataset_size': dataset_size,
            'feature_importance': model.feature_importances_[:5].tolist()
        }

    def train_model_interface(dataset_size, model_type, n_estimators):
        """Gradio interface function."""
        try:
            # Run distributed training
            result = distributed_model_training(dataset_size, model_type, n_estimators)

            # Format results for display
            output = f"""
**Training Results:**

- **Model Type:** {result['model_type']}
- **Dataset Size:** {result['dataset_size']:,} samples
- **Number of Estimators:** {result['n_estimators']}
- **Test Accuracy:** {result['accuracy']:.4f}
- **CV Mean Score:** {result['cv_mean']:.4f} ± {result['cv_std']:.4f}
- **Training Time:** {result['training_time']:.2f} seconds

**Top 5 Feature Importances:**
{', '.join([f'{imp:.4f}' for imp in result['feature_importance']])}

*Computation completed using Clustrix distributed computing.*
"""
            return output

        except Exception as e:
            return f"Error during training: {str(e)}"

    # Create Gradio interface
    interface = gr.Interface(
        fn=train_model_interface,
        inputs=[
            gr.Slider(
                minimum=1000,
                maximum=50000,
                value=10000,
                step=1000,
                label="Dataset Size"
            ),
            gr.Radio(
                choices=["Random Forest", "Gradient Boosting"],
                value="Random Forest",
                label="Model Type"
            ),
            gr.Slider(
                minimum=10,
                maximum=200,
                value=100,
                step=10,
                label="Number of Estimators"
            )
        ],
        outputs=gr.Markdown(label="Training Results"),
        title="Clustrix Distributed ML Training",
        description="Train machine learning models using Clustrix distributed computing backend.",
        article="""
        ### About This Demo

        This Gradio app demonstrates how to integrate Clustrix with HuggingFace Spaces
        for distributed machine learning. The backend uses Clustrix to:

        - Distribute model training across multiple cores
        - Perform cross-validation in parallel
        - Handle large datasets efficiently

        **Note:** In a production deployment, Clustrix would be configured to use
        remote compute clusters (AWS, Azure, GCP, etc.) for true distributed computing.
        """,
        theme="default",
        examples=[
            [5000, "Random Forest", 50],
            [20000, "Gradient Boosting", 100],
            [10000, "Random Forest", 150]
        ]
    )

    return interface

# Create the Gradio app
app = create_gradio_clustrix_app()

Use ``app.launch()`` to run the Gradio app locally.
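
The Markdown formatting step of the handler can be checked without starting the UI. The sketch below isolates that step in a hypothetical helper, `format_training_result`, which is not part of the tutorial's code. It assumes a result dictionary with the same keys that `distributed_model_training` returns.

```python
def format_training_result(result):
    """Render a training-result dictionary as Markdown for gr.Markdown."""
    importances = ", ".join(f"{imp:.4f}" for imp in result["feature_importance"])
    return "\n".join([
        "**Training Results:**",
        "",
        f"- **Model Type:** {result['model_type']}",
        # Sliders may deliver floats, so cast before applying the thousands separator
        f"- **Dataset Size:** {int(result['dataset_size']):,} samples",
        f"- **Test Accuracy:** {result['accuracy']:.4f}",
        f"- **CV Mean Score:** {result['cv_mean']:.4f} ± {result['cv_std']:.4f}",
        "",
        f"**Top 5 Feature Importances:** {importances}",
    ])


# Made-up values, used only to exercise the formatter
sample = {
    "model_type": "Random Forest",
    "dataset_size": 10000.0,
    "accuracy": 0.9123,
    "cv_mean": 0.9,
    "cv_std": 0.01,
    "feature_importance": [0.1, 0.2],
}
print(format_training_result(sample))
```

Keeping the formatter separate from the training call makes it possible to unit-test the output layout without running scikit-learn or Gradio.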

Create Space Files Structure

[ ]:
def create_huggingface_space_files():
    """
    Create the necessary files for a HuggingFace Space.
    """

    # app.py - Main Gradio application
    app_py_content = '''
import gradio as gr
import numpy as np
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import accuracy_score
import time
import os

# Import clustrix if available, otherwise use local computation
try:
    from clustrix import cluster, configure
    CLUSTRIX_AVAILABLE = True

    # Configure clustrix (would normally point to remote cluster)
    configure(
        cluster_host=None,  # Local execution in HF Spaces
        package_manager="pip"
    )

    @cluster(cores=2, memory="4GB")
    def train_model_distributed(dataset_size, model_type, n_estimators):
        return train_model_local(dataset_size, model_type, n_estimators)

except ImportError:
    CLUSTRIX_AVAILABLE = False
    def train_model_distributed(dataset_size, model_type, n_estimators):
        return train_model_local(dataset_size, model_type, n_estimators)

def train_model_local(dataset_size, model_type, n_estimators):
    """Local model training function."""
    start_time = time.time()

    # Generate synthetic dataset
    X, y = make_classification(
        n_samples=int(dataset_size),
        n_features=20,
        n_classes=3,
        n_informative=15,
        random_state=42
    )

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42
    )

    # Select model
    if model_type == "Random Forest":
        model = RandomForestClassifier(
            n_estimators=int(n_estimators),
            random_state=42,
            n_jobs=-1
        )
    else:  # Gradient Boosting
        model = GradientBoostingClassifier(
            n_estimators=int(n_estimators),
            random_state=42
        )

    # Train model
    model.fit(X_train, y_train)

    # Evaluate
    y_pred = model.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)

    # Cross-validation (simplified for HF Spaces)
    cv_scores = cross_val_score(model, X_train, y_train, cv=3)  # Reduced CV folds

    training_time = time.time() - start_time

    return {
        'accuracy': accuracy,
        'cv_mean': cv_scores.mean(),
        'cv_std': cv_scores.std(),
        'training_time': training_time,
        'model_type': model_type,
        'n_estimators': n_estimators,
        'dataset_size': dataset_size,
        'feature_importance': model.feature_importances_[:5].tolist()
    }

def train_model_interface(dataset_size, model_type, n_estimators):
    """Gradio interface function."""
    try:
        # Run training (distributed if clustrix available, local otherwise)
        result = train_model_distributed(dataset_size, model_type, n_estimators)

        # Format results for display
        backend_info = "Clustrix Distributed" if CLUSTRIX_AVAILABLE else "Local Computation"

        output = f"""
**Training Results** ({backend_info}):

- **Model Type:** {result['model_type']}
- **Dataset Size:** {result['dataset_size']:,} samples
- **Number of Estimators:** {result['n_estimators']}
- **Test Accuracy:** {result['accuracy']:.4f}
- **CV Mean Score:** {result['cv_mean']:.4f} ± {result['cv_std']:.4f}
- **Training Time:** {result['training_time']:.2f} seconds

**Top 5 Feature Importances:**
{', '.join([f'{imp:.4f}' for imp in result['feature_importance']])}

*Backend: {backend_info}*
"""
        return output

    except Exception as e:
        return f"Error during training: {str(e)}"

# Create Gradio interface
demo = gr.Interface(
    fn=train_model_interface,
    inputs=[
        gr.Slider(
            minimum=1000,
            maximum=20000,  # Reduced for HF Spaces limits
            value=5000,
            step=1000,
            label="Dataset Size"
        ),
        gr.Radio(
            choices=["Random Forest", "Gradient Boosting"],
            value="Random Forest",
            label="Model Type"
        ),
        gr.Slider(
            minimum=10,
            maximum=100,  # Reduced for HF Spaces
            value=50,
            step=10,
            label="Number of Estimators"
        )
    ],
    outputs=gr.Markdown(label="Training Results"),
    title="Clustrix Distributed ML Training",
    description="Train machine learning models with optional Clustrix distributed computing backend.",
    article="""
    ### About This Demo

    This HuggingFace Space demonstrates integration between Clustrix and Gradio.

    **Features:**
    - Interactive ML model training
    - Automatic fallback to local computation
    - Real-time results and performance metrics

    **Clustrix Integration:**
    When properly configured, Clustrix can distribute computations across:
    - AWS EC2, Batch, or ParallelCluster
    - Azure VMs, Batch, or CycleCloud
    - Google Cloud Compute Engine, GKE, or Batch
    - On-premise SLURM, PBS, or SGE clusters

    Visit [Clustrix Documentation](https://clustrix.readthedocs.io/) for setup instructions.
    """,
    examples=[
        [3000, "Random Forest", 30],
        [8000, "Gradient Boosting", 50],
        [5000, "Random Forest", 70]
    ]
)

if __name__ == "__main__":
    demo.launch()
'''

    # requirements.txt
    requirements_content = '''
gradio==4.44.0
numpy==1.24.3
scikit-learn==1.3.0
clustrix>=0.1.1
'''

    # README.md
    readme_content = '''
---
title: Clustrix Distributed ML Training
emoji: 🚀
colorFrom: blue
colorTo: green
sdk: gradio
sdk_version: 4.44.0
app_file: app.py
pinned: false
license: mit
tags:
- machine-learning
- distributed-computing
- clustrix
- scikit-learn
---

# Clustrix Distributed ML Training

This HuggingFace Space demonstrates how to integrate Clustrix distributed computing
with Gradio for interactive machine learning applications.

## Features

- **Interactive Training**: Train ML models through a web interface
- **Multiple Algorithms**: Support for Random Forest and Gradient Boosting
- **Real-time Results**: See training progress and results immediately
- **Distributed Backend**: Optional Clustrix integration for scaling

## How It Works

1. **Data Generation**: Creates synthetic classification datasets
2. **Model Training**: Trains selected algorithm with specified parameters
3. **Evaluation**: Performs cross-validation and test set evaluation
4. **Results Display**: Shows metrics and feature importance

## Clustrix Integration

When Clustrix is properly configured, this app can distribute computations across:

- **Cloud Platforms**: AWS, Azure, Google Cloud
- **HPC Clusters**: SLURM, PBS/Torque, SGE
- **Container Orchestration**: Kubernetes, Docker Swarm
- **SSH Clusters**: Any SSH-accessible compute nodes

## Usage

1. Adjust the dataset size (1,000 - 20,000 samples)
2. Select the model type (Random Forest or Gradient Boosting)
3. Set the number of estimators (10 - 100)
4. Click "Submit" to start training
5. View results including accuracy, cross-validation scores, and timing

## Local Development

To run this app locally:

```bash
pip install -r requirements.txt
python app.py
```

## Learn More

- [Clustrix Documentation](https://clustrix.readthedocs.io/)
- [Gradio Documentation](https://gradio.app/docs/)
- [HuggingFace Spaces](https://huggingface.co/docs/hub/spaces)
'''

    files = {
        'app.py': app_py_content.strip(),
        'requirements.txt': requirements_content.strip(),
        'README.md': readme_content.strip()
    }

    print("HuggingFace Space Files:")
    print("========================")

    for filename, content in files.items():
        print(f"\n--- {filename} ---")
        print(content[:500] + "..." if len(content) > 500 else content)

    return files

space_files = create_huggingface_space_files()
print("\nSpace files created. Upload these to create your HuggingFace Space.")
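
The function above only returns the file contents as strings, so they still need to be written to disk before a `git push`. A minimal sketch of that step follows. `write_space_files` is a hypothetical helper, and the temporary directory stands in for a cloned Space repository.

```python
import tempfile
from pathlib import Path


def write_space_files(files, target_dir):
    """Write a {filename: content} mapping into target_dir and return the paths."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    written = []
    for filename, content in files.items():
        path = target / filename
        # UTF-8 is set explicitly because the README contains emoji
        path.write_text(content + "\n", encoding="utf-8")
        written.append(path)
    return written


# Example with a throwaway directory; in practice, pass the cloned Space directory
with tempfile.TemporaryDirectory() as tmp:
    paths = write_space_files({"app.py": "print('hi')"}, tmp)
    print([p.name for p in paths])
```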

Deploy to HuggingFace Spaces

[ ]:
def deploy_clustrix_space(username, space_name, space_files):
    """
    Deploy Clustrix app to HuggingFace Spaces.

    Args:
        username: Your HuggingFace username
        space_name: Name for the new space
        space_files: Dictionary of files to upload
    """

    # Commands to create and deploy the space
    deployment_commands = f"""
# Method 1: Using HuggingFace Hub (Recommended)

# Create space via web interface first:
# 1. Go to https://huggingface.co/new-space
# 2. Choose username: {username}
# 3. Space name: {space_name}
# 4. License: MIT
# 5. SDK: Gradio
# 6. Hardware: CPU basic (free) or upgrade as needed

# Then clone and upload files:
git clone https://huggingface.co/spaces/{username}/{space_name}
cd {space_name}

# Copy your files (app.py, requirements.txt, README.md) to this directory

git add .
git commit -m "Initial commit: Clustrix distributed ML training app"
git push

# Method 2: Using Python API
# (Run this in Python after authentication)
"""

    python_deployment = f'''
from huggingface_hub import HfApi, upload_file
import tempfile
import os

# Initialize API
api = HfApi()

# Create space
api.create_repo(
    repo_id="{username}/{space_name}",
    repo_type="space",
    space_sdk="gradio",
    private=False
)

# Upload files
space_files = {space_files}

for filename, content in space_files.items():
    with tempfile.NamedTemporaryFile(mode='w', suffix=f'_{{filename}}', delete=False, encoding='utf-8') as f:
        f.write(content)
        temp_path = f.name

    upload_file(
        path_or_fileobj=temp_path,
        path_in_repo=filename,
        repo_id="{username}/{space_name}",
        repo_type="space",
        commit_message=f"Add {{filename}}"
    )

    os.unlink(temp_path)

print(f"Space deployed: https://huggingface.co/spaces/{username}/{space_name}")
'''

    print("HuggingFace Space Deployment:")
    print("==============================")
    print(deployment_commands)
    print("\nPython Deployment Code:")
    print(python_deployment)

    return {
        'space_url': f'https://huggingface.co/spaces/{username}/{space_name}',
        'deployment_commands': deployment_commands,
        'python_code': python_deployment
    }

# Example deployment
deployment_info = deploy_clustrix_space(
    username='your-username',  # Replace with your HF username
    space_name='clustrix-ml-training',
    space_files=space_files
)

print("\nDeployment instructions generated.")
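
As an alternative to temporary files, `upload_file` also accepts raw bytes for `path_or_fileobj`. The sketch below builds the upload arguments separately from the network call so they can be inspected first. `build_upload_plan` and `run_upload_plan` are hypothetical helper names, and the repository ID is a placeholder.

```python
def build_upload_plan(space_files, repo_id):
    """Return one keyword-argument dict per file, ready for HfApi.upload_file."""
    return [
        {
            "path_or_fileobj": content.encode("utf-8"),  # bytes avoid temp files
            "path_in_repo": filename,
            "repo_id": repo_id,
            "repo_type": "space",
            "commit_message": f"Add {filename}",
        }
        for filename, content in space_files.items()
    ]


def run_upload_plan(plan, token=None):
    """Execute the plan (requires network access and a valid token)."""
    # Imported lazily so the plan can be built without huggingface_hub installed.
    from huggingface_hub import HfApi

    api = HfApi(token=token)
    for kwargs in plan:
        api.upload_file(**kwargs)


plan = build_upload_plan({"app.py": "x = 1"}, "your-username/clustrix-ml-training")
print([item["path_in_repo"] for item in plan])
```

Separating planning from execution is a design choice for this sketch: the plan can be logged or reviewed before anything is pushed.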

Method 2: Streamlit Space with Clustrix

[ ]:
def create_streamlit_clustrix_app():
    """
    Create a Streamlit app template for HuggingFace Spaces.
    """

    streamlit_app_content = '''
import streamlit as st
import numpy as np
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix
import time

# Import clustrix if available
try:
    from clustrix import cluster, configure
    CLUSTRIX_AVAILABLE = True
    configure(cluster_host=None, package_manager="pip")
except ImportError:
    CLUSTRIX_AVAILABLE = False

st.set_page_config(
    page_title="Clustrix ML Dashboard",
    page_icon="🚀",
    layout="wide",
    initial_sidebar_state="expanded"
)

st.title("🚀 Clustrix Distributed ML Dashboard")
st.markdown("""
This dashboard demonstrates machine learning with Clustrix distributed computing backend.
""")

# Sidebar controls
st.sidebar.header("Configuration")

dataset_size = st.sidebar.slider(
    "Dataset Size",
    min_value=1000,
    max_value=20000,
    value=5000,
    step=1000
)

n_features = st.sidebar.slider(
    "Number of Features",
    min_value=5,
    max_value=50,
    value=20,
    step=5
)

n_estimators = st.sidebar.slider(
    "Number of Estimators",
    min_value=10,
    max_value=200,
    value=100,
    step=10
)

max_depth = st.sidebar.slider(
    "Max Depth",
    min_value=3,
    max_value=20,
    value=10
)

# Backend selection
backend = st.sidebar.radio(
    "Computation Backend",
    ["Local", "Clustrix (if available)"]
)

if CLUSTRIX_AVAILABLE and backend == "Clustrix (if available)":
    @cluster(cores=2, memory="4GB")
    def train_model_clustrix(dataset_size, n_features, n_estimators, max_depth):
        return train_model_local(dataset_size, n_features, n_estimators, max_depth)

    train_function = train_model_clustrix
    backend_status = "🚀 Clustrix Distributed"
else:
    train_function = lambda *args: train_model_local(*args)
    backend_status = "💻 Local Computation"

def train_model_local(dataset_size, n_features, n_estimators, max_depth):
    """Train model locally."""
    # Generate dataset
    X, y = make_classification(
        n_samples=dataset_size,
        n_features=n_features,
        n_classes=3,
        n_informative=max(3, n_features // 2),
        random_state=42
    )

    # Split data
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42
    )

    # Train model
    start_time = time.time()
    model = RandomForestClassifier(
        n_estimators=n_estimators,
        max_depth=max_depth,
        random_state=42,
        n_jobs=-1
    )
    model.fit(X_train, y_train)
    training_time = time.time() - start_time

    # Evaluate
    y_pred = model.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)

    return {
        'model': model,
        'X_test': X_test,
        'y_test': y_test,
        'y_pred': y_pred,
        'accuracy': accuracy,
        'training_time': training_time,
        'feature_importance': model.feature_importances_
    }

# Main content
col1, col2 = st.columns([2, 1])

with col2:
    st.markdown(f"**Backend:** {backend_status}")
    st.markdown(f"**Clustrix Available:** {'✅' if CLUSTRIX_AVAILABLE else '❌'}")

if st.button("🚀 Train Model", type="primary"):
    with st.spinner("Training model..."):
        # Train model
        result = train_function(dataset_size, n_features, n_estimators, max_depth)

        # Display results
        col1, col2, col3 = st.columns(3)

        with col1:
            st.metric("Accuracy", f"{result['accuracy']:.4f}")

        with col2:
            st.metric("Training Time", f"{result['training_time']:.2f}s")

        with col3:
            st.metric("Test Samples", len(result['y_test']))

        # Feature importance plot
        st.subheader("Feature Importance")
        importance_df = pd.DataFrame({
            'Feature': [f'Feature {i}' for i in range(len(result['feature_importance']))],
            'Importance': result['feature_importance']
        }).sort_values('Importance', ascending=True)

        fig_importance = px.bar(
            importance_df.tail(10),
            x='Importance',
            y='Feature',
            title="Top 10 Feature Importances",
            orientation='h'
        )
        st.plotly_chart(fig_importance, use_container_width=True)

        # Confusion matrix
        st.subheader("Confusion Matrix")
        cm = confusion_matrix(result['y_test'], result['y_pred'])

        fig_cm = px.imshow(
            cm,
            text_auto=True,
            aspect="auto",
            title="Confusion Matrix",
            labels=dict(x="Predicted", y="Actual")
        )
        st.plotly_chart(fig_cm, use_container_width=True)

# Information section
st.markdown("---")
st.subheader("About Clustrix Integration")

col1, col2 = st.columns(2)

with col1:
    st.markdown("""
    **Clustrix Features:**
    - 🌐 Distributed computing across clusters
    - ☁️ Cloud platform integration (AWS, Azure, GCP)
    - 🐳 Container and Kubernetes support
    - 📊 Automatic workload distribution
    - 🔧 Simple decorator-based API
    """)

with col2:
    st.markdown("""
    **Supported Platforms:**
    - AWS EC2, Batch, ParallelCluster
    - Azure VMs, Batch, CycleCloud
    - Google Compute Engine, GKE, Batch
    - SLURM, PBS/Torque, SGE clusters
    - SSH-accessible compute nodes
    """)

st.markdown("""
**Learn More:**
- [Clustrix Documentation](https://clustrix.readthedocs.io/)
- [GitHub Repository](https://github.com/ContextLab/clustrix)
- [PyPI Package](https://pypi.org/project/clustrix/)
""")
'''

    streamlit_requirements = '''
streamlit==1.28.0
numpy==1.24.3
pandas==2.0.3
scikit-learn==1.3.0
plotly==5.15.0
clustrix>=0.1.1
'''

    streamlit_readme = '''
---
title: Clustrix ML Dashboard
emoji: 📊
colorFrom: purple
colorTo: pink
sdk: streamlit
sdk_version: 1.28.0
app_file: app.py
pinned: false
license: mit
tags:
- machine-learning
- distributed-computing
- clustrix
- dashboard
---

# Clustrix ML Dashboard

An interactive Streamlit dashboard demonstrating Clustrix distributed computing
for machine learning workflows.

## Features

- 📊 **Interactive Dashboard**: Real-time model training and visualization
- 🚀 **Distributed Computing**: Optional Clustrix backend for scaling
- 📈 **Rich Visualizations**: Feature importance and confusion matrix plots
- ⚙️ **Configurable Parameters**: Adjust dataset size, model parameters
- 🔄 **Backend Selection**: Choose between local and distributed computation

## Usage

1. Configure dataset and model parameters in the sidebar
2. Select computation backend (local or Clustrix)
3. Click "Train Model" to start training
4. View results, metrics, and visualizations

## Clustrix Integration

When Clustrix is available and configured, this dashboard can distribute
ML computations across various platforms for improved performance and scalability.
'''

    return {
        'app.py': streamlit_app_content.strip(),
        'requirements.txt': streamlit_requirements.strip(),
        'README.md': streamlit_readme.strip()
    }

streamlit_files = create_streamlit_clustrix_app()
print("Streamlit app files created for HuggingFace Spaces deployment.")
print("\nKey features:")
print("- Interactive dashboard with real-time training")
print("- Rich visualizations with Plotly")
print("- Configurable parameters and backend selection")
print("- Automatic fallback to local computation")
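
The fallback to local computation comes down to applying a decorator only when the backend is available. The sketch below shows that pattern in isolation. `optional_decorator` is a hypothetical helper, and `fake_cluster` is a stand-in that only mimics the call shape of a decorator factory; it is not the Clustrix API.

```python
def optional_decorator(factory, available, **resources):
    """Return factory(**resources) when available, else a no-op decorator."""
    if available:
        return factory(**resources)
    return lambda func: func


def fake_cluster(cores=1, memory="1GB"):
    """Stand-in decorator factory used only to make this sketch runnable."""
    def decorate(func):
        def wrapper(*args, **kwargs):
            # Tag the result so it is visible that the wrapped path was taken
            return {"cores": cores, "memory": memory, "value": func(*args, **kwargs)}
        return wrapper
    return decorate


@optional_decorator(fake_cluster, available=False, cores=2, memory="4GB")
def add_local(a, b):
    return a + b


@optional_decorator(fake_cluster, available=True, cores=2, memory="4GB")
def add_wrapped(a, b):
    return a + b


print(add_local(1, 2))
print(add_wrapped(1, 2))
```

In the real app, the factory would be `cluster` and the flag would be `CLUSTRIX_AVAILABLE`. The factory is only called when the flag is true, so a failed import never leads to an undefined name.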

Method 3: GPU-Accelerated Spaces

[ ]:
def create_gpu_clustrix_space():
    """
    Create a GPU-accelerated HuggingFace Space with Clustrix.
    """

    gpu_app_content = '''
import gradio as gr
import torch
import numpy as np
from transformers import pipeline, AutoTokenizer, AutoModelForSequenceClassification
import time
import json

# Import clustrix if available
try:
    from clustrix import cluster, configure
    CLUSTRIX_AVAILABLE = True

    # Configure for GPU-enabled remote clusters
    configure(
        cluster_host=None,  # Local for HF Spaces
        package_manager="pip",
        default_cores=1,  # GPU tasks typically use 1 core
        default_memory="8GB"
    )
except ImportError:
    CLUSTRIX_AVAILABLE = False

# Check GPU availability
CUDA_AVAILABLE = torch.cuda.is_available()
device = "cuda" if CUDA_AVAILABLE else "cpu"

print(f"Device: {device}")
print(f"Clustrix available: {CLUSTRIX_AVAILABLE}")

# Load a pre-trained model for demonstration
@cluster(cores=1, memory="8GB") if CLUSTRIX_AVAILABLE else (lambda f: f)
def load_sentiment_model():
    """Load sentiment analysis model."""
    model_name = "cardiffnlp/twitter-roberta-base-sentiment-latest"
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(model_name)

    if CUDA_AVAILABLE:
        model = model.to(device)

    return pipeline(
        "sentiment-analysis",
        model=model,
        tokenizer=tokenizer,
        device=0 if CUDA_AVAILABLE else -1
    )

# Initialize model
sentiment_pipeline = load_sentiment_model()

@cluster(cores=1, memory="4GB") if CLUSTRIX_AVAILABLE else (lambda f: f)
def batch_sentiment_analysis(texts, use_gpu=True):
    """Perform batch sentiment analysis."""
    start_time = time.time()

    # Process texts in batches
    batch_size = 16 if use_gpu and CUDA_AVAILABLE else 8
    results = []

    for i in range(0, len(texts), batch_size):
        batch = texts[i:i+batch_size]
        batch_results = sentiment_pipeline(batch)
        results.extend(batch_results)

    processing_time = time.time() - start_time

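    # Aggregate results; normalize labels because checkpoints may emit either
    # generic names (LABEL_0..LABEL_2) or words (negative/neutral/positive)
    generic = {'label_0': 'negative', 'label_1': 'neutral', 'label_2': 'positive'}
    labels = [generic.get(r['label'].lower(), r['label'].lower()) for r in results]
    positive_count = labels.count('positive')
    negative_count = labels.count('negative')
    neutral_count = labels.count('neutral')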

    avg_confidence = np.mean([r['score'] for r in results])

    return {
        'results': results,
        'summary': {
            'total_texts': len(texts),
            'positive': positive_count,
            'negative': negative_count,
            'neutral': neutral_count,
            'avg_confidence': avg_confidence,
            'processing_time': processing_time,
            'texts_per_second': len(texts) / processing_time,
            'device_used': device,
            'clustrix_enabled': CLUSTRIX_AVAILABLE
        }
    }

def process_text_input(text_input, sample_size):
    """Process text input for sentiment analysis."""
    try:
        # Split text into individual texts
        texts = [t.strip() for t in text_input.split('\\n') if t.strip()]

        # Limit sample size for demo
        if len(texts) > sample_size:
            texts = texts[:sample_size]

        if not texts:
            return "Please provide some text to analyze."

        # Run batch analysis
        result = batch_sentiment_analysis(texts)
        summary = result['summary']

        # Format output
        output = f"""
**Batch Sentiment Analysis Results**

📊 **Summary Statistics:**
- Total texts analyzed: {summary['total_texts']}
- Positive sentiment: {summary['positive']} ({summary['positive']/summary['total_texts']*100:.1f}%)
- Negative sentiment: {summary['negative']} ({summary['negative']/summary['total_texts']*100:.1f}%)
- Neutral sentiment: {summary['neutral']} ({summary['neutral']/summary['total_texts']*100:.1f}%)
- Average confidence: {summary['avg_confidence']:.3f}

⚡ **Performance:**
- Processing time: {summary['processing_time']:.2f} seconds
- Throughput: {summary['texts_per_second']:.1f} texts/second
- Device: {summary['device_used'].upper()}
- Backend: {'Clustrix Distributed' if summary['clustrix_enabled'] else 'Local Processing'}

πŸ“ **Individual Results:**
"""

        # Show first few individual results
        for i, (text, result_item) in enumerate(zip(texts[:5], result['results'][:5])):
            # Accept both generic LABEL_n names and human-readable label names
            raw_label = result_item['label'].lower()
            sentiment = {'label_0': 'Negative', 'label_1': 'Neutral', 'label_2': 'Positive'}.get(raw_label, raw_label.capitalize())
            confidence = result_item['score']
            output += f"\\n{i+1}. \\"{text[:50]}{'...' if len(text) > 50 else ''}\\" → {sentiment} ({confidence:.3f})"

        if len(texts) > 5:
            output += f"\\n... and {len(texts) - 5} more texts"

        return output

    except Exception as e:
        return f"Error during analysis: {str(e)}"

# Create Gradio interface
demo = gr.Interface(
    fn=process_text_input,
    inputs=[
        gr.Textbox(
            lines=10,
            placeholder="Enter texts to analyze (one per line)\\nExample:\\nI love this product!\\nThis is terrible.\\nIt's okay, nothing special.",
            label="Text Input"
        ),
        gr.Slider(
            minimum=1,
            maximum=100,
            value=20,
            step=1,
            label="Max Texts to Process"
        )
    ],
    outputs=gr.Markdown(label="Analysis Results"),
    title="🚀 Clustrix GPU-Accelerated Sentiment Analysis",
    description=f"""
    Batch sentiment analysis using transformer models with optional Clustrix distributed computing.

    **Current Setup:**
    - Device: {device.upper()}
    - Clustrix: {'✅ Available' if CLUSTRIX_AVAILABLE else '❌ Not Available'}
    - GPU Acceleration: {'✅ Enabled' if CUDA_AVAILABLE else '❌ CPU Only'}
    """,
    article="""
    ### About This Demo

    This HuggingFace Space demonstrates GPU-accelerated NLP processing with Clustrix:

    **Features:**
    - Batch processing of multiple texts
    - GPU acceleration when available
    - Comprehensive performance metrics
    - Optional distributed computing backend

    **Clustrix Integration:**
    In production, Clustrix can distribute GPU workloads across:
    - Cloud GPU instances (AWS P3/P4, Azure NC/ND, GCP A100)
    - Multi-GPU clusters with SLURM/PBS scheduling
    - Kubernetes GPU nodes
    - On-premise GPU clusters

    **Model:** `cardiffnlp/twitter-roberta-base-sentiment-latest`
    """,
    examples=[
        [
            "I absolutely love this new feature!\\nThis is the worst experience ever.\\nIt's pretty good, could be better.\\nAmazing work by the team!\\nNot impressed at all.",
            5
        ],
        [
            "Great product, highly recommend!\\nTerrible customer service.\\nAverage quality for the price.\\nOutstanding performance!\\nWaste of money.",
            5
        ]
    ]
)

if __name__ == "__main__":
    demo.launch()
'''

    gpu_requirements = '''
gradio==4.44.0
torch==2.1.0
transformers==4.35.0
numpy==1.24.3
clustrix>=0.1.1
'''

    gpu_readme = '''
---
title: Clustrix GPU Sentiment Analysis
emoji: ⚑
colorFrom: yellow
colorTo: orange
sdk: gradio
sdk_version: 4.44.0
app_file: app.py
pinned: false
license: mit
tags:
- nlp
- sentiment-analysis
- gpu
- distributed-computing
- clustrix
suggested_hardware: t4-small
---

# Clustrix GPU-Accelerated Sentiment Analysis

A high-performance sentiment analysis demo showcasing GPU acceleration
and Clustrix distributed computing integration.

## Features

- ⚑ **GPU Acceleration**: Utilizes GPU for faster inference
- 📊 **Batch Processing**: Efficiently processes multiple texts
- 🚀 **Clustrix Integration**: Optional distributed computing backend
- 📈 **Performance Metrics**: Real-time throughput and timing
- 🤖 **Transformer Models**: Uses state-of-the-art RoBERTa model

## Usage

1. Enter multiple texts (one per line) in the input box
2. Set the maximum number of texts to process
3. Click "Submit" to run batch sentiment analysis
4. View results including sentiment distribution and performance metrics

## Model

This demo uses `cardiffnlp/twitter-roberta-base-sentiment-latest`,
a RoBERTa model fine-tuned for sentiment analysis on Twitter data.

## Clustrix Scaling

In production environments, Clustrix can distribute GPU workloads across:
- Multi-GPU cloud instances
- GPU clusters with job schedulers
- Kubernetes GPU nodes
- Hybrid cloud-edge deployments
'''

    return {
        'app.py': gpu_app_content.strip(),
        'requirements.txt': gpu_requirements.strip(),
        'README.md': gpu_readme.strip()
    }

gpu_files = create_gpu_clustrix_space()
print("GPU-accelerated HuggingFace Space files created.")
print("\nKey features:")
print("- GPU acceleration for transformer models")
print("- Batch processing for improved throughput")
print("- Real-time performance metrics")
print("- Clustrix integration for distributed GPU computing")
print("\nNote: Requires GPU hardware tier on HuggingFace Spaces.")
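
The function above only returns the file contents as a dictionary. A minimal sketch of writing such a dictionary to disk is shown here; `write_space_files` and the stand-in `example_files` dictionary are hypothetical names used for illustration, not part of Clustrix or `huggingface_hub`.

```python
import tempfile
from pathlib import Path

def write_space_files(files, target_dir):
    """Write a {filename: content} mapping into target_dir and return the paths."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    written = []
    for name, content in files.items():
        path = target / name
        # End every file with a newline so git does not flag the last line.
        path.write_text(content + "\n", encoding="utf-8")
        written.append(path)
    return written

# Stand-in for the dictionary returned by create_gpu_clustrix_space().
example_files = {
    "app.py": "print('hello from the Space')",
    "requirements.txt": "gradio==4.44.0",
}

output_dir = tempfile.mkdtemp(prefix="gpu_space_")
paths = write_space_files(example_files, output_dir)
print([p.name for p in paths])
```

The resulting directory can then be committed to the Space repository with git, or uploaded with `HfApi.upload_folder` from `huggingface_hub`, assuming the Space already exists and you are authenticated.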

Secrets and Configuration Management

[ ]:
import os
import base64
import tempfile
from clustrix import configure

def setup_clustrix_from_secrets():
    """Configure Clustrix using HuggingFace Spaces secrets."""

    # Get cluster configuration from secrets
    cluster_host = os.getenv('CLUSTER_HOST')
    cluster_username = os.getenv('CLUSTER_USERNAME', 'clustrix')
    ssh_key_b64 = os.getenv('CLUSTER_SSH_KEY')

    if not cluster_host:
        print("No cluster host configured, using local execution")
        configure(cluster_host=None)
        return False

    # Handle SSH key
    key_file_path = None
    if ssh_key_b64:
        try:
            # Decode base64 SSH key
            ssh_key = base64.b64decode(ssh_key_b64).decode('utf-8')

            # Write to temporary file
            with tempfile.NamedTemporaryFile(mode='w', delete=False, suffix='.pem') as f:
                f.write(ssh_key)
                key_file_path = f.name

            # Set correct permissions
            os.chmod(key_file_path, 0o600)

        except Exception as e:
            print(f"Error processing SSH key: {e}")
            return False

    # Configure Clustrix
    try:
        configure(
            cluster_type="ssh",
            cluster_host=cluster_host,
            username=cluster_username,
            key_file=key_file_path,
            remote_work_dir="/tmp/clustrix",
            package_manager="auto",
            default_cores=2,
            default_memory="4GB",
            default_time="01:00:00"
        )

        print(f"✅ Clustrix configured for remote execution on {cluster_host}")
        return True

    except Exception as e:
        print(f"❌ Failed to configure Clustrix: {e}")
        configure(cluster_host=None)  # Fallback to local
        return False

def setup_cloud_credentials():
    """Setup cloud credentials from secrets."""

    # AWS credentials: Spaces secrets are already environment variables, and
    # AWS SDKs read these names directly, so only their presence is checked.
    if os.getenv('AWS_ACCESS_KEY_ID') and os.getenv('AWS_SECRET_ACCESS_KEY'):
        print("✅ AWS credentials configured")

    # Azure credentials: all three service principal values are required.
    azure_vars = ('AZURE_CLIENT_ID', 'AZURE_CLIENT_SECRET', 'AZURE_TENANT_ID')
    if all(os.getenv(name) for name in azure_vars):
        print("✅ Azure credentials configured")

    # Google Cloud credentials
    gcp_key = os.getenv('GCP_SERVICE_ACCOUNT_KEY')
    if gcp_key:
        with tempfile.NamedTemporaryFile(mode='w', delete=False, suffix='.json') as f:
            f.write(gcp_key)
            os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = f.name
        print("✅ Google Cloud credentials configured")

# Example usage in your Space app:
# setup_cloud_credentials()
# clustrix_enabled = setup_clustrix_from_secrets()
# print(f"Clustrix distributed computing: {'Enabled' if clustrix_enabled else 'Disabled (local mode)'}")

HuggingFace Spaces Secrets Management for Clustrix

1. Access Secrets in Space Settings

  • Go to your Space settings page

  • Navigate to the "Repository secrets" section

  • Add secrets as key-value pairs

2. Common Clustrix Secrets

  • CLUSTER_HOST: Hostname or IP address of your compute cluster

  • CLUSTER_USERNAME: SSH username for cluster access

  • CLUSTER_SSH_KEY: Private SSH key (base64 encoded)

  • AWS_ACCESS_KEY_ID: AWS credentials for cloud clusters

  • AWS_SECRET_ACCESS_KEY: AWS secret key

  • AZURE_CLIENT_ID: Azure service principal ID

  • AZURE_CLIENT_SECRET: Azure service principal secret

  • AZURE_TENANT_ID: Azure tenant ID (read together with the two values above)

  • GCP_SERVICE_ACCOUNT_KEY: Google Cloud service account JSON
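
Because CLUSTER_SSH_KEY is stored base64 encoded, the key has to be encoded once before it is pasted into the Space settings. The sketch below shows one way to do that and confirms that the value decodes the same way `setup_clustrix_from_secrets()` decodes it. The helper names are hypothetical and the key text is a placeholder, not a real key.

```python
import base64

def encode_key_for_secret(key_text: str) -> str:
    """Return a single-line base64 string suitable for a secret value."""
    return base64.b64encode(key_text.encode("utf-8")).decode("ascii")

def decode_key_from_secret(secret_value: str) -> str:
    """Reverse the encoding, as the Space does at startup."""
    return base64.b64decode(secret_value).decode("utf-8")

# Placeholder text standing in for the contents of a private key file.
fake_key = (
    "-----BEGIN OPENSSH PRIVATE KEY-----\n"
    "not-a-real-key\n"
    "-----END OPENSSH PRIVATE KEY-----\n"
)

encoded = encode_key_for_secret(fake_key)
round_trip = decode_key_from_secret(encoded)
print("Round trip OK:", round_trip == fake_key)
```

Encoding keeps the multi-line key on a single line, which avoids newline handling problems when the value is pasted into a form field.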

3. Security Best Practices

  • Use service accounts instead of personal credentials

  • Rotate secrets regularly

  • Apply principle of least privilege

  • Monitor secret usage and access logs

4. Environment Variables in Code

Secrets are exposed to the running Space as environment variables, so application code can read them with os.getenv() or os.environ.

Configuration Code Example
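
A minimal sketch of reading the secrets listed in this section and reporting which required values are missing. The function name `load_cluster_settings` and the returned dictionary layout are assumptions made for illustration. Only the environment variable names come from this tutorial.

```python
import os

REQUIRED_SECRETS = ("CLUSTER_HOST",)

def load_cluster_settings(env=None):
    """Return (settings, missing) built from environment variables.

    settings is None when a required secret is absent, so the caller
    can fall back to local execution.
    """
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_SECRETS if not env.get(name)]
    if missing:
        return None, missing
    settings = {
        "cluster_host": env["CLUSTER_HOST"],
        "username": env.get("CLUSTER_USERNAME") or "clustrix",
        # Only record whether a key exists; never keep or log its value here.
        "has_ssh_key": bool(env.get("CLUSTER_SSH_KEY")),
    }
    return settings, []

# Example with an explicit mapping instead of the real environment.
settings, missing = load_cluster_settings({"CLUSTER_HOST": "203.0.113.10"})
print(settings, missing)

settings_none, missing_names = load_cluster_settings({})
print(settings_none, missing_names)
```

Passing the mapping as a parameter keeps the function easy to test locally without setting real secrets.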

Deployment Tips and Best Practices

Troubleshooting Guide

Common Issues and Solutions

❌ Problem: Space fails to start

✅ Solution:

  • Check requirements.txt for version conflicts

  • Verify Python version compatibility

  • Review app.py for syntax errors

  • Check Space logs for detailed error messages

❌ Problem: Clustrix connection fails

✅ Solution:

  • Verify cluster host is accessible from HF Spaces

  • Check SSH key format and permissions

  • Ensure firewall allows connections from HF IPs

  • Implement fallback to local execution
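
A quick way to test whether the cluster host is reachable is to attempt a TCP connection to its SSH port before configuring Clustrix. The sketch below uses only the standard library. `can_reach` is a hypothetical helper, and the demonstration connects to a throwaway local listener rather than a real cluster.

```python
import socket

def can_reach(host: str, port: int = 22, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS failures.
        return False

# Demonstration against a temporary local listener.
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))   # port 0 lets the OS pick a free port
listener.listen(1)
demo_port = listener.getsockname()[1]

reachable_while_open = can_reach("127.0.0.1", demo_port, timeout=1.0)
listener.close()
reachable_after_close = can_reach("127.0.0.1", demo_port, timeout=1.0)
print(reachable_while_open, reachable_after_close)
```

In a Space, a failed check is a reasonable trigger for the local fallback, for example by calling `can_reach(os.getenv("CLUSTER_HOST"))` before the remote configuration step.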

❌ Problem: GPU not detected

✅ Solution:

  • Upgrade to GPU-enabled hardware tier

  • Check torch.cuda.is_available() in code

  • Verify CUDA-compatible PyTorch version

  • Set suggested_hardware in the README metadata; this only suggests a tier, so also assign GPU hardware in the Space settings
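
The GPU check mentioned above can be wrapped so that the app still starts when PyTorch is missing or no GPU is attached. This is a small sketch; `detect_device` is a hypothetical helper name.

```python
def detect_device() -> str:
    """Return "cuda" when PyTorch sees a GPU, otherwise "cpu"."""
    try:
        import torch
    except ImportError:
        # PyTorch is not installed, so only CPU execution is possible.
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"

device = detect_device()
print(f"Running on: {device}")
```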

❌ Problem: Memory errors

✅ Solution:

  • Optimize batch sizes for available memory

  • Clear GPU cache with torch.cuda.empty_cache()

  • Use memory-efficient model loading

  • Consider model quantization or distillation
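
One way to tune batch sizes to the available memory is to halve the batch size whenever a batch runs out of memory and then retry. The sketch below illustrates the idea with a fake model. `process_in_batches` and `fake_model` are hypothetical names. With PyTorch, the assumption is that you would pass `torch.cuda.OutOfMemoryError` in `oom_errors`.

```python
def process_in_batches(items, process_batch, batch_size=32, oom_errors=(MemoryError,)):
    """Process items in batches, halving batch_size after a memory error."""
    results = []
    start = 0
    while start < len(items):
        batch = items[start:start + batch_size]
        try:
            results.extend(process_batch(batch))
        except oom_errors:
            if batch_size == 1:
                raise  # cannot shrink further
            batch_size = max(1, batch_size // 2)
            continue  # retry the same position with a smaller batch
        start += len(batch)
    return results

def fake_model(batch):
    """Stand-in for inference that fails on batches larger than 4."""
    if len(batch) > 4:
        raise MemoryError("batch too large")
    return [text.upper() for text in batch]

outputs = process_in_batches(["a", "b", "c", "d", "e", "f"], fake_model, batch_size=16)
print(outputs)
```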

❌ Problem: Slow performance

✅ Solution:

  • Profile code to identify bottlenecks

  • Use appropriate hardware tier

  • Implement model caching and warm-up

  • Optimize data preprocessing pipeline
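
Model caching and warm-up can be sketched with `functools.lru_cache`, so that the expensive load happens once per process instead of once per request. The loader below is a stand-in that simulates the work. In a real app it would be replaced by the actual model-loading call.

```python
import time
from functools import lru_cache

load_calls = 0  # counts how often the expensive load really runs

@lru_cache(maxsize=1)
def get_model(model_name: str):
    """Load a model once and reuse it for later calls with the same name."""
    global load_calls
    load_calls += 1
    time.sleep(0.05)  # simulate slow loading
    return {"name": model_name}

def warm_up(model_name: str):
    """Trigger loading at startup so the first user request is not slow."""
    return get_model(model_name)

warm_up("cardiffnlp/twitter-roberta-base-sentiment-latest")
model = get_model("cardiffnlp/twitter-roberta-base-sentiment-latest")
print("Loads performed:", load_calls)
```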

HuggingFace Spaces Hardware Tiers

🆓 CPU Basic (Free):

  • 2 vCPUs, 16GB RAM

  • Good for: Simple demos, small models, prototyping

  • Clustrix use case: Local fallback, lightweight computations

💰 CPU Upgrade ($0.03/hour):

  • 8 vCPUs, 32GB RAM

  • Good for: CPU-intensive tasks, larger datasets

  • Clustrix use case: Medium-scale local processing

🚀 T4 Small ($0.60/hour):

  • 4 vCPUs, 15GB RAM, 1x T4 GPU (16GB VRAM)

  • Good for: Deep learning inference, computer vision

  • Clustrix use case: GPU-accelerated ML, model training demos

⚑ A10G Small ($1.05/hour):

  • 4 vCPUs, 15GB RAM, 1x A10G GPU (24GB VRAM)

  • Good for: Large models, high-performance inference

  • Clustrix use case: Production-scale ML applications

🔥 A100 Large ($4.13/hour):

  • 12 vCPUs, 142GB RAM, 1x A100 GPU (40GB VRAM)

  • Good for: Massive models, research applications

  • Clustrix use case: Distributed training coordination
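
Hourly prices are easier to compare once they are turned into an expected bill. The sketch below multiplies two of the hourly rates listed above by a usage pattern. The dictionary keys and the `estimate_cost` helper are illustrative names, and the rates are the ones quoted in this tutorial, which may differ from current pricing.

```python
# Hourly rates in USD, copied from the tier list in this tutorial.
HOURLY_RATE_USD = {
    "t4-small": 0.60,
    "a100-large": 4.13,
}

def estimate_cost(tier: str, hours_per_day: float, days: int) -> float:
    """Estimate the cost of running a tier for hours_per_day over days."""
    return round(HOURLY_RATE_USD[tier] * hours_per_day * days, 2)

t4_month = estimate_cost("t4-small", hours_per_day=8, days=30)
a100_month = estimate_cost("a100-large", hours_per_day=8, days=30)
print(f"T4 Small, 8 h/day for 30 days: ${t4_month:.2f}")
print(f"A100 Large, 8 h/day for 30 days: ${a100_month:.2f}")
```

Pausing a Space or setting a sleep timeout when it is idle reduces the billed hours, which is usually the largest factor in the estimate.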

HuggingFace Spaces + Clustrix Best Practices

🚀 Performance Optimization

  • Use appropriate hardware tier (CPU Basic → T4 Small → A10G Small)

  • Implement caching for models and data

  • Use batch processing for multiple requests

  • Optimize memory usage with careful tensor management

  • Consider async processing for long-running tasks

🔒 Security

  • Store all credentials in Spaces secrets

  • Use service accounts instead of personal credentials

  • Implement input validation and sanitization

  • Never log sensitive information

  • Use HTTPS for all external API calls
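
For an app that accepts one text per line, input validation can be as simple as trimming whitespace, dropping empty lines, and capping both the number of texts and their length. The limits and the `clean_text_lines` name below are assumptions chosen for illustration.

```python
def clean_text_lines(raw_text, max_texts=20, max_chars=512):
    """Split user input into cleaned lines, bounded in count and length."""
    if not isinstance(raw_text, str):
        raise TypeError("raw_text must be a string")
    lines = [line.strip() for line in raw_text.splitlines()]
    lines = [line for line in lines if line]          # drop empty lines
    return [line[:max_chars] for line in lines[:max_texts]]

cleaned = clean_text_lines("  I love this!  \n\n\nThis is terrible.\nIt's okay.", max_texts=2)
print(cleaned)
```

Bounding the input protects the Space from requests that would otherwise exhaust memory or run for a very long time.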

🎯 User Experience

  • Provide clear error messages and fallbacks

  • Show progress indicators for long operations

  • Include example inputs and use cases

  • Add comprehensive documentation

  • Implement graceful degradation when Clustrix is unavailable
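
Graceful degradation can start at import time: if Clustrix is not installed, a no-op decorator takes the place of `cluster` and decorated functions simply run locally. The sketch below assumes the decorator is used either bare or with keyword arguments. `local_fallback` is a hypothetical name.

```python
try:
    from clustrix import cluster
    CLUSTRIX_AVAILABLE = True
except ImportError:
    CLUSTRIX_AVAILABLE = False

def local_fallback(*d_args, **d_kwargs):
    """No-op replacement for a decorator that takes optional arguments."""
    # Bare usage: @local_fallback
    if len(d_args) == 1 and callable(d_args[0]) and not d_kwargs:
        return d_args[0]

    # Parameterized usage: @local_fallback(cores=2, memory="4GB")
    def decorator(func):
        return func
    return decorator

if not CLUSTRIX_AVAILABLE:
    cluster = local_fallback

# The fallback is used directly here so the example always runs locally.
@local_fallback(cores=2, memory="4GB")
def word_count(text):
    return len(text.split())

@local_fallback
def shout(text):
    return text.upper()

print(word_count("runs locally when Clustrix is missing"), shout("ok"))
```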

📊 Monitoring and Debugging

  • Add logging for key operations

  • Include performance metrics in the UI

  • Monitor resource usage and costs

  • Set up alerts for failures

  • Use descriptive commit messages for versioning
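
Logging and timing can be combined in one small decorator, which also keeps the last measured duration available for display in the UI. This is a sketch using only the standard library. The logger name and the `last_elapsed` attribute are arbitrary choices.

```python
import functools
import logging
import time

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("space_app")

def timed(func):
    """Log how long func takes and store the duration on the wrapper."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            wrapper.last_elapsed = time.perf_counter() - start
            logger.info("%s finished in %.3f s", func.__name__, wrapper.last_elapsed)
    wrapper.last_elapsed = None
    return wrapper

@timed
def analyze(texts):
    return [len(text) for text in texts]

result = analyze(["good", "bad"])
print(result, analyze.last_elapsed)
```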

🔄 Scalability

  • Design for both local and distributed execution

  • Implement proper error handling and retries

  • Use connection pooling for database/API connections

  • Consider rate limiting for external services

  • Plan for traffic spikes and scaling needs
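
Retries are usually paired with exponential backoff so that a struggling service is not hit again immediately. The helper below is a minimal sketch under that assumption. `retry` and its parameters are hypothetical, and the `sleep` argument is injectable so the example runs without waiting.

```python
import time

def retry(func, attempts=3, base_delay=0.5,
          retry_on=(ConnectionError, TimeoutError), sleep=time.sleep):
    """Call func, retrying on selected errors with exponential backoff."""
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except retry_on:
            if attempt == attempts:
                raise  # out of attempts, surface the original error
            sleep(base_delay * 2 ** (attempt - 1))

calls = {"count": 0}

def flaky():
    """Fail twice, then succeed."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("temporary failure")
    return "ok"

delays = []  # record requested delays instead of sleeping
outcome = retry(flaky, attempts=4, base_delay=0.5, sleep=delays.append)
print(outcome, delays)
```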

📦 Deployment

  • Pin specific package versions in requirements.txt

  • Test locally before deploying

  • Use environment variables for configuration

  • Implement health checks and status endpoints

  • Document deployment process and dependencies
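
Pinning can be checked automatically before a deploy. The sketch below flags requirement lines that do not use an exact `==` pin. `find_unpinned` is a hypothetical helper, and the check is deliberately simple: it ignores comments and option lines and does not parse extras or environment markers.

```python
def find_unpinned(requirements_text: str):
    """Return requirement lines that are not pinned with '=='."""
    unpinned = []
    for raw_line in requirements_text.splitlines():
        line = raw_line.split("#", 1)[0].strip()   # drop comments
        if not line or line.startswith("-"):       # skip blanks and options
            continue
        if "==" not in line:
            unpinned.append(line)
    return unpinned

sample_requirements = """
gradio==4.44.0
torch==2.1.0
transformers==4.35.0
numpy==1.24.3
clustrix>=0.1.1
"""

loose = find_unpinned(sample_requirements)
print("Unpinned requirements:", loose)
```

A range such as `>=` is flagged because a later release could change behavior between two builds of the same Space.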

Summary

This tutorial covered:

  1. Gradio Integration: Interactive ML training interfaces with Clustrix backend

  2. Streamlit Dashboards: Rich data science applications with distributed computing

  3. GPU Acceleration: High-performance NLP processing with transformer models

  4. Secrets Management: Secure credential storage and configuration

  5. Deployment Best Practices: Performance optimization and troubleshooting

  6. Hardware Selection: Choosing appropriate tiers for different use cases

Key Advantages of HuggingFace Spaces + Clustrix

  • Easy Deployment: Simple git-based deployment workflow

  • Community Sharing: Built-in discoverability and collaboration

  • Flexible Hardware: From free CPU to high-end GPU instances

  • Hybrid Computing: Local execution with optional distributed scaling

  • ML Focus: Optimized for machine learning and AI applications

Next Steps

  1. Create your HuggingFace account and get an access token

  2. Start with a simple Gradio app using the provided templates

  3. Configure Clustrix integration using Spaces secrets

  4. Test locally before deploying to ensure compatibility

  5. Monitor performance and scale hardware as needed

Use Cases

  • Research Demos: Showcase distributed computing research

  • Educational Tools: Interactive learning environments

  • Prototype Testing: Rapid prototyping with real user feedback

  • Model Serving: Production-ready ML model deployment

  • Collaborative Computing: Shared access to distributed resources

Resources

Remember: HuggingFace Spaces provides an excellent platform for showcasing Clustrix capabilities and building interactive ML applications with distributed computing backends!