HuggingFace Spaces Tutorial
This tutorial demonstrates how to use Clustrix with HuggingFace Spaces for ML model deployment and distributed computing.
Overview
HuggingFace Spaces is a hosting platform for ML applications. The features most relevant to a Clustrix integration are:
Gradio Apps: Interactive web interfaces for ML models
Streamlit Apps: Data science web applications
Static Spaces: HTML/JS applications
Docker Spaces: Custom containerized applications
GPU Support: Hardware acceleration for compute-intensive tasks
Persistent Storage: Data storage across sessions
Secrets Management: Secure credential storage
Community Hub: Easy sharing and collaboration
Prerequisites
HuggingFace account (free)
HuggingFace Hub token for authentication
Basic understanding of Gradio or Streamlit
Git for version control
Installation and Setup
Install Clustrix with HuggingFace dependencies:
# Install Clustrix with HuggingFace support
!pip install clustrix huggingface_hub gradio streamlit transformers datasets
# Import required libraries
import clustrix
from clustrix import cluster, configure
# Repository is not used in this tutorial and is deprecated in recent huggingface_hub releases
from huggingface_hub import HfApi, login, upload_file
import gradio as gr
import streamlit as st
import os
import numpy as np
import time
import json
import requests
HuggingFace Authentication Setup
Option 1: Interactive Login
# Log in to HuggingFace interactively (prompts for a token)
# login()
# Or set the token as an environment variable.
# huggingface_hub reads HF_TOKEN (and the older HUGGING_FACE_HUB_TOKEN).
# os.environ['HF_TOKEN'] = 'your-token-here'
# Test authentication
try:
    api = HfApi()
    user_info = api.whoami()
    print(f"Successfully authenticated as: {user_info['name']}")
except Exception as e:
    print(f"Authentication failed: {e}")
Get your token from https://huggingface.co/settings/tokens
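A minimal sketch of picking up the token from the environment instead of hard-coding it. It assumes the token is exported as `HF_TOKEN` or the older `HUGGING_FACE_HUB_TOKEN`; `resolve_hf_token` is a hypothetical helper written for this example, not part of `huggingface_hub`.

```python
import os


def resolve_hf_token(env=None):
    """Return the first non-empty token found in the environment, or None."""
    env = os.environ if env is None else env
    for name in ("HF_TOKEN", "HUGGING_FACE_HUB_TOKEN"):
        token = env.get(name, "").strip()
        if token:
            return token
    return None


token = resolve_hf_token()
print("Token found" if token else "No token set; call login() or export HF_TOKEN")
```

If a token is found, it can be passed explicitly, for example `HfApi(token=token)`.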
Method 1: Gradio Space with Clustrix Backend
Create a Gradio App with Distributed Computing
def create_gradio_clustrix_app():
    """
    Create a Gradio app that uses Clustrix for backend computations.
    """
    # This would typically be configured to point to your cluster.
    # For demo purposes, we use local execution.
    configure(
        cluster_host=None,  # Local execution for demo
        package_manager="auto"
    )

    @cluster(cores=2, memory="4GB")
    def distributed_model_training(dataset_size, model_type, n_estimators):
        """Train ML model using distributed computing."""
        import numpy as np
        from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
        from sklearn.datasets import make_classification
        from sklearn.model_selection import train_test_split, cross_val_score
        from sklearn.metrics import accuracy_score, classification_report
        import time

        start_time = time.time()
        # Generate synthetic dataset
        X, y = make_classification(
            n_samples=int(dataset_size),
            n_features=20,
            n_classes=3,
            n_informative=15,
            random_state=42
        )
        X_train, X_test, y_train, y_test = train_test_split(
            X, y, test_size=0.2, random_state=42
        )
        # Select model
        if model_type == "Random Forest":
            model = RandomForestClassifier(
                n_estimators=int(n_estimators),
                random_state=42,
                n_jobs=-1
            )
        else:  # Gradient Boosting
            model = GradientBoostingClassifier(
                n_estimators=int(n_estimators),
                random_state=42
            )
        # Train model
        model.fit(X_train, y_train)
        # Evaluate
        y_pred = model.predict(X_test)
        accuracy = accuracy_score(y_test, y_pred)
        # Cross-validation
        cv_scores = cross_val_score(model, X_train, y_train, cv=5)
        training_time = time.time() - start_time
        return {
            'accuracy': accuracy,
            'cv_mean': cv_scores.mean(),
            'cv_std': cv_scores.std(),
            'training_time': training_time,
            'model_type': model_type,
            'n_estimators': n_estimators,
            'dataset_size': dataset_size,
            'feature_importance': model.feature_importances_[:5].tolist()
        }

    def train_model_interface(dataset_size, model_type, n_estimators):
        """Gradio interface function."""
        try:
            # Run distributed training
            result = distributed_model_training(dataset_size, model_type, n_estimators)
            # Format results for display (kept flush-left so Markdown renders as text, not code)
            output = f"""
**Training Results:**
- **Model Type:** {result['model_type']}
- **Dataset Size:** {result['dataset_size']:,} samples
- **Number of Estimators:** {result['n_estimators']}
- **Test Accuracy:** {result['accuracy']:.4f}
- **CV Mean Score:** {result['cv_mean']:.4f} ± {result['cv_std']:.4f}
- **Training Time:** {result['training_time']:.2f} seconds

**Top 5 Feature Importances:**
{', '.join([f'{imp:.4f}' for imp in result['feature_importance']])}

*Computation completed using Clustrix distributed computing.*
"""
            return output
        except Exception as e:
            return f"Error during training: {str(e)}"

    # Create Gradio interface
    interface = gr.Interface(
        fn=train_model_interface,
        inputs=[
            gr.Slider(
                minimum=1000,
                maximum=50000,
                value=10000,
                step=1000,
                label="Dataset Size"
            ),
            gr.Radio(
                choices=["Random Forest", "Gradient Boosting"],
                value="Random Forest",
                label="Model Type"
            ),
            gr.Slider(
                minimum=10,
                maximum=200,
                value=100,
                step=10,
                label="Number of Estimators"
            )
        ],
        outputs=gr.Markdown(label="Training Results"),
        title="Clustrix Distributed ML Training",
        description="Train machine learning models using Clustrix distributed computing backend.",
        article="""
### About This Demo
This Gradio app demonstrates how to integrate Clustrix with HuggingFace Spaces
for distributed machine learning. The backend uses Clustrix to:
- Distribute model training across multiple cores
- Perform cross-validation in parallel
- Handle large datasets efficiently

**Note:** In a production deployment, Clustrix would be configured to use
remote compute clusters (AWS, Azure, GCP, etc.) for true distributed computing.
""",
        theme="default",
        examples=[
            [5000, "Random Forest", 50],
            [20000, "Gradient Boosting", 100],
            [10000, "Random Forest", 150]
        ]
    )
    return interface


# Create the Gradio app
app = create_gradio_clustrix_app()
Use ``app.launch()`` to run the Gradio app locally.
Create Space Files Structure
def create_huggingface_space_files():
    """
    Create the necessary files for a HuggingFace Space.
    """
    # app.py - Main Gradio application
    app_py_content = '''
import gradio as gr
import numpy as np
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import accuracy_score
import time
import os

# Import clustrix if available, otherwise use local computation
try:
    from clustrix import cluster, configure
    CLUSTRIX_AVAILABLE = True
    # Configure clustrix (would normally point to remote cluster)
    configure(
        cluster_host=None,  # Local execution in HF Spaces
        package_manager="pip"
    )

    @cluster(cores=2, memory="4GB")
    def train_model_distributed(dataset_size, model_type, n_estimators):
        return train_model_local(dataset_size, model_type, n_estimators)
except ImportError:
    CLUSTRIX_AVAILABLE = False

    def train_model_distributed(dataset_size, model_type, n_estimators):
        return train_model_local(dataset_size, model_type, n_estimators)


def train_model_local(dataset_size, model_type, n_estimators):
    """Local model training function."""
    start_time = time.time()
    # Generate synthetic dataset
    X, y = make_classification(
        n_samples=int(dataset_size),
        n_features=20,
        n_classes=3,
        n_informative=15,
        random_state=42
    )
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42
    )
    # Select model
    if model_type == "Random Forest":
        model = RandomForestClassifier(
            n_estimators=int(n_estimators),
            random_state=42,
            n_jobs=-1
        )
    else:  # Gradient Boosting
        model = GradientBoostingClassifier(
            n_estimators=int(n_estimators),
            random_state=42
        )
    # Train model
    model.fit(X_train, y_train)
    # Evaluate
    y_pred = model.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    # Cross-validation (simplified for HF Spaces)
    cv_scores = cross_val_score(model, X_train, y_train, cv=3)  # Reduced CV folds
    training_time = time.time() - start_time
    return {
        'accuracy': accuracy,
        'cv_mean': cv_scores.mean(),
        'cv_std': cv_scores.std(),
        'training_time': training_time,
        'model_type': model_type,
        'n_estimators': n_estimators,
        'dataset_size': dataset_size,
        'feature_importance': model.feature_importances_[:5].tolist()
    }


def train_model_interface(dataset_size, model_type, n_estimators):
    """Gradio interface function."""
    try:
        # Run training (distributed if clustrix available, local otherwise)
        result = train_model_distributed(dataset_size, model_type, n_estimators)
        # Format results for display
        backend_info = "Clustrix Distributed" if CLUSTRIX_AVAILABLE else "Local Computation"
        output = f"""
**Training Results** ({backend_info}):
- **Model Type:** {result['model_type']}
- **Dataset Size:** {result['dataset_size']:,} samples
- **Number of Estimators:** {result['n_estimators']}
- **Test Accuracy:** {result['accuracy']:.4f}
- **CV Mean Score:** {result['cv_mean']:.4f} ± {result['cv_std']:.4f}
- **Training Time:** {result['training_time']:.2f} seconds

**Top 5 Feature Importances:**
{', '.join([f'{imp:.4f}' for imp in result['feature_importance']])}

*Backend: {backend_info}*
"""
        return output
    except Exception as e:
        return f"Error during training: {str(e)}"


# Create Gradio interface
demo = gr.Interface(
    fn=train_model_interface,
    inputs=[
        gr.Slider(
            minimum=1000,
            maximum=20000,  # Reduced for HF Spaces limits
            value=5000,
            step=1000,
            label="Dataset Size"
        ),
        gr.Radio(
            choices=["Random Forest", "Gradient Boosting"],
            value="Random Forest",
            label="Model Type"
        ),
        gr.Slider(
            minimum=10,
            maximum=100,  # Reduced for HF Spaces
            value=50,
            step=10,
            label="Number of Estimators"
        )
    ],
    outputs=gr.Markdown(label="Training Results"),
    title="Clustrix Distributed ML Training",
    description="Train machine learning models with optional Clustrix distributed computing backend.",
    article="""
### About This Demo
This HuggingFace Space demonstrates integration between Clustrix and Gradio.

**Features:**
- Interactive ML model training
- Automatic fallback to local computation
- Real-time results and performance metrics

**Clustrix Integration:**
When properly configured, Clustrix can distribute computations across:
- AWS EC2, Batch, or ParallelCluster
- Azure VMs, Batch, or CycleCloud
- Google Cloud Compute Engine, GKE, or Batch
- On-premise SLURM, PBS, or SGE clusters

Visit [Clustrix Documentation](https://clustrix.readthedocs.io/) for setup instructions.
""",
    examples=[
        [3000, "Random Forest", 30],
        [8000, "Gradient Boosting", 50],
        [5000, "Random Forest", 70]
    ]
)

if __name__ == "__main__":
    demo.launch()
'''
    # requirements.txt
    requirements_content = '''
gradio==4.44.0
numpy==1.24.3
scikit-learn==1.3.0
clustrix>=0.1.1
'''
    # README.md
    readme_content = '''
---
title: Clustrix Distributed ML Training
emoji: 🚀
colorFrom: blue
colorTo: green
sdk: gradio
sdk_version: 4.44.0
app_file: app.py
pinned: false
license: mit
tags:
- machine-learning
- distributed-computing
- clustrix
- scikit-learn
---
# Clustrix Distributed ML Training
This HuggingFace Space demonstrates how to integrate Clustrix distributed computing
with Gradio for interactive machine learning applications.
## Features
- **Interactive Training**: Train ML models through a web interface
- **Multiple Algorithms**: Support for Random Forest and Gradient Boosting
- **Real-time Results**: See training progress and results immediately
- **Distributed Backend**: Optional Clustrix integration for scaling
## How It Works
1. **Data Generation**: Creates synthetic classification datasets
2. **Model Training**: Trains selected algorithm with specified parameters
3. **Evaluation**: Performs cross-validation and test set evaluation
4. **Results Display**: Shows metrics and feature importance
## Clustrix Integration
When Clustrix is properly configured, this app can distribute computations across:
- **Cloud Platforms**: AWS, Azure, Google Cloud
- **HPC Clusters**: SLURM, PBS/Torque, SGE
- **Container Orchestration**: Kubernetes, Docker Swarm
- **SSH Clusters**: Any SSH-accessible compute nodes
## Usage
1. Adjust the dataset size (1,000 - 20,000 samples)
2. Select the model type (Random Forest or Gradient Boosting)
3. Set the number of estimators (10 - 100)
4. Click "Submit" to start training
5. View results including accuracy, cross-validation scores, and timing
## Local Development
To run this app locally:
```bash
pip install -r requirements.txt
python app.py
```
## Learn More
- [Clustrix Documentation](https://clustrix.readthedocs.io/)
- [Gradio Documentation](https://gradio.app/docs/)
- [HuggingFace Spaces](https://huggingface.co/docs/hub/spaces)
'''
    files = {
        'app.py': app_py_content.strip(),
        'requirements.txt': requirements_content.strip(),
        'README.md': readme_content.strip()
    }
    print("HuggingFace Space Files:")
    print("========================")
    for filename, content in files.items():
        print(f"\n--- {filename} ---")
        print(content[:500] + "..." if len(content) > 500 else content)
    return files


space_files = create_huggingface_space_files()
print("\nSpace files created. Upload these to create your HuggingFace Space.")
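Before pushing with Git, the generated file contents have to exist on disk. A minimal sketch, assuming a mapping of file name to text like the one returned above; `write_space_files` and the stand-in `example_files` are hypothetical names used only for this illustration.

```python
import tempfile
from pathlib import Path


def write_space_files(files, target_dir):
    """Write a {relative_path: text} mapping into target_dir and return the paths."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    written = []
    for name, content in files.items():
        path = target / name
        path.parent.mkdir(parents=True, exist_ok=True)
        # UTF-8 is set explicitly because the README front matter contains emoji
        path.write_text(content + "\n", encoding="utf-8")
        written.append(path)
    return written


# Stand-in mapping; in practice pass the dictionary of Space files built above.
example_files = {"app.py": "print('hello')", "requirements.txt": "gradio"}
paths = write_space_files(example_files, tempfile.mkdtemp(prefix="space-"))
print([p.name for p in paths])
```

Writing into the cloned Space repository directory instead of a temporary directory leaves the files ready for `git add`.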
Deploy to HuggingFace Spaces
def deploy_clustrix_space(username, space_name, space_files):
    """
    Deploy Clustrix app to HuggingFace Spaces.

    Args:
        username: Your HuggingFace username
        space_name: Name for the new space
        space_files: Dictionary of files to upload
    """
    # Commands to create and deploy the space
    deployment_commands = f"""
# Method 1: Using HuggingFace Hub (Recommended)
# Create space via web interface first:
# 1. Go to https://huggingface.co/new-space
# 2. Choose username: {username}
# 3. Space name: {space_name}
# 4. License: MIT
# 5. SDK: Gradio
# 6. Hardware: CPU basic (free) or upgrade as needed
# Then clone and upload files:
git clone https://huggingface.co/spaces/{username}/{space_name}
cd {space_name}
# Copy your files (app.py, requirements.txt, README.md) to this directory
git add .
git commit -m "Initial commit: Clustrix distributed ML training app"
git push
# Method 2: Using Python API
# (Run this in Python after authentication)
"""
    # Braces that belong to the generated code are doubled ({{filename}});
    # single braces are filled in now by this f-string.
    python_deployment = f'''
from huggingface_hub import HfApi, upload_file
import tempfile
import os

# Initialize API
api = HfApi()

# Create space
api.create_repo(
    repo_id="{username}/{space_name}",
    repo_type="space",
    space_sdk="gradio",
    private=False
)

# Upload files
space_files = {space_files}
for filename, content in space_files.items():
    with tempfile.NamedTemporaryFile(
        mode='w', suffix=f'_{{filename}}', delete=False, encoding='utf-8'
    ) as f:
        f.write(content)
        temp_path = f.name
    upload_file(
        path_or_fileobj=temp_path,
        path_in_repo=filename,
        repo_id="{username}/{space_name}",
        repo_type="space",
        commit_message=f"Add {{filename}}"
    )
    os.unlink(temp_path)

print(f"Space deployed: https://huggingface.co/spaces/{username}/{space_name}")
'''
    print("HuggingFace Space Deployment:")
    print("==============================")
    print(deployment_commands)
    print("\nPython Deployment Code:")
    print(python_deployment)
    return {
        'space_url': f'https://huggingface.co/spaces/{username}/{space_name}',
        'deployment_commands': deployment_commands,
        'python_code': python_deployment
    }


# Example deployment
deployment_info = deploy_clustrix_space(
    username='your-username',  # Replace with your HF username
    space_name='clustrix-ml-training',
    space_files=space_files
)
print("\nDeployment instructions generated.")
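One pitfall when generating Python source from an f-string template, as the deployment helper does: any brace meant for the generated code must be doubled, otherwise the outer f-string tries to evaluate it immediately and fails with a `NameError`. A small sketch of the rule; `render_upload_snippet` is a hypothetical function for illustration only.

```python
def render_upload_snippet(repo_id):
    """Build Python source as text: repo_id is filled in now, filename stays a placeholder."""
    return f'''
for filename in ["app.py", "README.md"]:
    print(f"Uploading {{filename}} to {repo_id}")
'''.strip()


snippet = render_upload_snippet("your-username/clustrix-ml-training")
print(snippet)

# compile() raises SyntaxError if the generated text is not valid Python
compile(snippet, "<generated>", "exec")
```

Compiling the generated text is a cheap check that the template still produces valid Python after edits.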
Method 2: Streamlit Space with Clustrix
def create_streamlit_clustrix_app():
    """
    Create a Streamlit app template for HuggingFace Spaces.
    """
    streamlit_app_content = '''
import streamlit as st
import numpy as np
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix
import time

# Import clustrix if available
try:
    from clustrix import cluster, configure
    CLUSTRIX_AVAILABLE = True
    configure(cluster_host=None, package_manager="pip")
except ImportError:
    CLUSTRIX_AVAILABLE = False

st.set_page_config(
    page_title="Clustrix ML Dashboard",
    page_icon="🚀",
    layout="wide",
    initial_sidebar_state="expanded"
)
st.title("🚀 Clustrix Distributed ML Dashboard")
st.markdown("""
This dashboard demonstrates machine learning with Clustrix distributed computing backend.
""")

# Sidebar controls
st.sidebar.header("Configuration")
dataset_size = st.sidebar.slider(
    "Dataset Size",
    min_value=1000,
    max_value=20000,
    value=5000,
    step=1000
)
n_features = st.sidebar.slider(
    "Number of Features",
    min_value=5,
    max_value=50,
    value=20,
    step=5
)
n_estimators = st.sidebar.slider(
    "Number of Estimators",
    min_value=10,
    max_value=200,
    value=100,
    step=10
)
max_depth = st.sidebar.slider(
    "Max Depth",
    min_value=3,
    max_value=20,
    value=10
)

# Backend selection
backend = st.sidebar.radio(
    "Computation Backend",
    ["Local", "Clustrix (if available)"]
)
if CLUSTRIX_AVAILABLE and backend == "Clustrix (if available)":
    @cluster(cores=2, memory="4GB")
    def train_model_clustrix(dataset_size, n_features, n_estimators, max_depth):
        return train_model_local(dataset_size, n_features, n_estimators, max_depth)

    train_function = train_model_clustrix
    backend_status = "🚀 Clustrix Distributed"
else:
    train_function = lambda *args: train_model_local(*args)
    backend_status = "💻 Local Computation"


def train_model_local(dataset_size, n_features, n_estimators, max_depth):
    """Train model locally."""
    # Generate dataset
    X, y = make_classification(
        n_samples=dataset_size,
        n_features=n_features,
        n_classes=3,
        n_informative=max(3, n_features // 2),
        random_state=42
    )
    # Split data
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42
    )
    # Train model
    start_time = time.time()
    model = RandomForestClassifier(
        n_estimators=n_estimators,
        max_depth=max_depth,
        random_state=42,
        n_jobs=-1
    )
    model.fit(X_train, y_train)
    training_time = time.time() - start_time
    # Evaluate
    y_pred = model.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    return {
        'model': model,
        'X_test': X_test,
        'y_test': y_test,
        'y_pred': y_pred,
        'accuracy': accuracy,
        'training_time': training_time,
        'feature_importance': model.feature_importances_
    }


# Main content
col1, col2 = st.columns([2, 1])
with col2:
    st.markdown(f"**Backend:** {backend_status}")
    st.markdown(f"**Clustrix Available:** {'✅' if CLUSTRIX_AVAILABLE else '❌'}")

if st.button("🚀 Train Model", type="primary"):
    with st.spinner("Training model..."):
        # Train model
        result = train_function(dataset_size, n_features, n_estimators, max_depth)
    # Display results
    col1, col2, col3 = st.columns(3)
    with col1:
        st.metric("Accuracy", f"{result['accuracy']:.4f}")
    with col2:
        st.metric("Training Time", f"{result['training_time']:.2f}s")
    with col3:
        st.metric("Test Samples", len(result['y_test']))
    # Feature importance plot
    st.subheader("Feature Importance")
    importance_df = pd.DataFrame({
        'Feature': [f'Feature {i}' for i in range(len(result['feature_importance']))],
        'Importance': result['feature_importance']
    }).sort_values('Importance', ascending=True)
    fig_importance = px.bar(
        importance_df.tail(10),
        x='Importance',
        y='Feature',
        title="Top 10 Feature Importances",
        orientation='h'
    )
    st.plotly_chart(fig_importance, use_container_width=True)
    # Confusion matrix
    st.subheader("Confusion Matrix")
    cm = confusion_matrix(result['y_test'], result['y_pred'])
    fig_cm = px.imshow(
        cm,
        text_auto=True,
        aspect="auto",
        title="Confusion Matrix",
        labels=dict(x="Predicted", y="Actual")
    )
    st.plotly_chart(fig_cm, use_container_width=True)

# Information section
st.markdown("---")
st.subheader("About Clustrix Integration")
col1, col2 = st.columns(2)
with col1:
    st.markdown("""
**Clustrix Features:**
- 🚀 Distributed computing across clusters
- ☁️ Cloud platform integration (AWS, Azure, GCP)
- 🐳 Container and Kubernetes support
- 📊 Automatic workload distribution
- 🔧 Simple decorator-based API
""")
with col2:
    st.markdown("""
**Supported Platforms:**
- AWS EC2, Batch, ParallelCluster
- Azure VMs, Batch, CycleCloud
- Google Compute Engine, GKE, Batch
- SLURM, PBS/Torque, SGE clusters
- SSH-accessible compute nodes
""")
st.markdown("""
**Learn More:**
- [Clustrix Documentation](https://clustrix.readthedocs.io/)
- [GitHub Repository](https://github.com/ContextLab/clustrix)
- [PyPI Package](https://pypi.org/project/clustrix/)
""")
'''
    streamlit_requirements = '''
streamlit==1.28.0
numpy==1.24.3
pandas==2.0.3
scikit-learn==1.3.0
plotly==5.15.0
clustrix>=0.1.1
'''
    streamlit_readme = '''
---
title: Clustrix ML Dashboard
emoji: 📊
colorFrom: purple
colorTo: pink
sdk: streamlit
sdk_version: 1.28.0
app_file: app.py
pinned: false
license: mit
tags:
- machine-learning
- distributed-computing
- clustrix
- dashboard
---
# Clustrix ML Dashboard
An interactive Streamlit dashboard demonstrating Clustrix distributed computing
for machine learning workflows.
## Features
- 📊 **Interactive Dashboard**: Real-time model training and visualization
- 🚀 **Distributed Computing**: Optional Clustrix backend for scaling
- 📈 **Rich Visualizations**: Feature importance and confusion matrix plots
- ⚙️ **Configurable Parameters**: Adjust dataset size, model parameters
- 🔄 **Backend Selection**: Choose between local and distributed computation
## Usage
1. Configure dataset and model parameters in the sidebar
2. Select computation backend (local or Clustrix)
3. Click "Train Model" to start training
4. View results, metrics, and visualizations
## Clustrix Integration
When Clustrix is available and configured, this dashboard can distribute
ML computations across various platforms for improved performance and scalability.
'''
    return {
        'app.py': streamlit_app_content.strip(),
        'requirements.txt': streamlit_requirements.strip(),
        'README.md': streamlit_readme.strip()
    }


streamlit_files = create_streamlit_clustrix_app()
print("Streamlit app files created for HuggingFace Spaces deployment.")
print("\nKey features:")
print("- Interactive dashboard with real-time training")
print("- Rich visualizations with Plotly")
print("- Configurable parameters and backend selection")
print("- Automatic fallback to local computation")
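Both app templates fall back to local computation when Clustrix cannot be imported. A minimal sketch of that fallback as a reusable helper: `optional_decorator` is a hypothetical name, and `fake_cluster` is a stand-in for `clustrix.cluster` so the example runs without Clustrix installed.

```python
def optional_decorator(factory, enabled, **kwargs):
    """Return factory(**kwargs) when enabled, otherwise a pass-through decorator."""
    if enabled and factory is not None:
        return factory(**kwargs)
    return lambda func: func


def fake_cluster(cores=1, memory="1GB"):
    """Stand-in decorator factory that only records the requested resources."""
    def decorate(func):
        func.resources = {"cores": cores, "memory": memory}
        return func
    return decorate


distributed = optional_decorator(fake_cluster, enabled=True, cores=2, memory="4GB")
local_only = optional_decorator(None, enabled=False)


@distributed
def train(n):
    return n * 2


@local_only
def train_local(n):
    return n * 2


print(train(3), train.resources, train_local(3))
```

In an app, `factory` would be the imported `cluster` (or `None` after an `ImportError`) and `enabled` the availability flag, which keeps the decorated function identical in both modes.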
Method 3: GPU-Accelerated Spaces
def create_gpu_clustrix_space():
    """
    Create a GPU-accelerated HuggingFace Space with Clustrix.
    """
    gpu_app_content = '''
import gradio as gr
import torch
import numpy as np
from transformers import pipeline, AutoTokenizer, AutoModelForSequenceClassification
import time
import json

# Import clustrix if available
try:
    from clustrix import cluster, configure
    CLUSTRIX_AVAILABLE = True
    # Configure for GPU-enabled remote clusters
    configure(
        cluster_host=None,  # Local for HF Spaces
        package_manager="pip",
        default_cores=1,  # GPU tasks typically use 1 core
        default_memory="8GB"
    )
except ImportError:
    CLUSTRIX_AVAILABLE = False

# Check GPU availability
CUDA_AVAILABLE = torch.cuda.is_available()
device = "cuda" if CUDA_AVAILABLE else "cpu"
print(f"Device: {device}")
print(f"Clustrix available: {CLUSTRIX_AVAILABLE}")

# Map generic pipeline labels to names. This checkpoint is expected to report
# 'negative'/'neutral'/'positive' directly; older checkpoints report LABEL_0/1/2.
LABEL_NAMES = {'label_0': 'negative', 'label_1': 'neutral', 'label_2': 'positive'}


# Load a pre-trained model for demonstration
# (a conditional expression as a decorator requires Python 3.9+)
@cluster(cores=1, memory="8GB") if CLUSTRIX_AVAILABLE else (lambda f: f)
def load_sentiment_model():
    """Load sentiment analysis model."""
    model_name = "cardiffnlp/twitter-roberta-base-sentiment-latest"
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(model_name)
    if CUDA_AVAILABLE:
        model = model.to(device)
    return pipeline(
        "sentiment-analysis",
        model=model,
        tokenizer=tokenizer,
        device=0 if CUDA_AVAILABLE else -1
    )


# Initialize model
sentiment_pipeline = load_sentiment_model()


@cluster(cores=1, memory="4GB") if CLUSTRIX_AVAILABLE else (lambda f: f)
def batch_sentiment_analysis(texts, use_gpu=True):
    """Perform batch sentiment analysis."""
    start_time = time.time()
    # Process texts in batches
    batch_size = 16 if use_gpu and CUDA_AVAILABLE else 8
    results = []
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i+batch_size]
        batch_results = sentiment_pipeline(batch)
        results.extend(batch_results)
    processing_time = time.time() - start_time
    # Aggregate results using normalized label names
    labels = [LABEL_NAMES.get(r['label'].lower(), r['label'].lower()) for r in results]
    positive_count = labels.count('positive')
    negative_count = labels.count('negative')
    neutral_count = labels.count('neutral')
    avg_confidence = np.mean([r['score'] for r in results])
    return {
        'results': results,
        'summary': {
            'total_texts': len(texts),
            'positive': positive_count,
            'negative': negative_count,
            'neutral': neutral_count,
            'avg_confidence': avg_confidence,
            'processing_time': processing_time,
            'texts_per_second': len(texts) / processing_time,
            'device_used': device,
            'clustrix_enabled': CLUSTRIX_AVAILABLE
        }
    }


def process_text_input(text_input, sample_size):
    """Process text input for sentiment analysis."""
    try:
        # Split text into individual texts
        texts = [t.strip() for t in text_input.split('\\n') if t.strip()]
        # Limit sample size for demo (the slider may pass a float)
        sample_size = int(sample_size)
        if len(texts) > sample_size:
            texts = texts[:sample_size]
        if not texts:
            return "Please provide some text to analyze."
        # Run batch analysis
        result = batch_sentiment_analysis(texts)
        summary = result['summary']
        # Format output
        output = f"""
**Batch Sentiment Analysis Results**

📊 **Summary Statistics:**
- Total texts analyzed: {summary['total_texts']}
- Positive sentiment: {summary['positive']} ({summary['positive']/summary['total_texts']*100:.1f}%)
- Negative sentiment: {summary['negative']} ({summary['negative']/summary['total_texts']*100:.1f}%)
- Neutral sentiment: {summary['neutral']} ({summary['neutral']/summary['total_texts']*100:.1f}%)
- Average confidence: {summary['avg_confidence']:.3f}

⚡ **Performance:**
- Processing time: {summary['processing_time']:.2f} seconds
- Throughput: {summary['texts_per_second']:.1f} texts/second
- Device: {summary['device_used'].upper()}
- Backend: {'Clustrix Distributed' if summary['clustrix_enabled'] else 'Local Processing'}

📝 **Individual Results:**
"""
        # Show first few individual results.
        # Backslashes are doubled because this code lives inside a template string.
        for i, (text, result_item) in enumerate(zip(texts[:5], result['results'][:5])):
            raw_label = result_item['label'].lower()
            sentiment = LABEL_NAMES.get(raw_label, raw_label).capitalize()
            confidence = result_item['score']
            output += f"\\n{i+1}. \\"{text[:50]}{'...' if len(text) > 50 else ''}\\" → {sentiment} ({confidence:.3f})"
        if len(texts) > 5:
            output += f"\\n... and {len(texts) - 5} more texts"
        return output
    except Exception as e:
        return f"Error during analysis: {str(e)}"


# Create Gradio interface
demo = gr.Interface(
    fn=process_text_input,
    inputs=[
        gr.Textbox(
            lines=10,
            placeholder="Enter texts to analyze (one per line)\\nExample:\\nI love this product!\\nThis is terrible.\\nIt's okay, nothing special.",
            label="Text Input"
        ),
        gr.Slider(
            minimum=1,
            maximum=100,
            value=20,
            step=1,
            label="Max Texts to Process"
        )
    ],
    outputs=gr.Markdown(label="Analysis Results"),
    title="🚀 Clustrix GPU-Accelerated Sentiment Analysis",
    description=f"""
Batch sentiment analysis using transformer models with optional Clustrix distributed computing.

**Current Setup:**
- Device: {device.upper()}
- Clustrix: {'✅ Available' if CLUSTRIX_AVAILABLE else '❌ Not Available'}
- GPU Acceleration: {'✅ Enabled' if CUDA_AVAILABLE else '❌ CPU Only'}
""",
    article="""
### About This Demo
This HuggingFace Space demonstrates GPU-accelerated NLP processing with Clustrix:

**Features:**
- Batch processing of multiple texts
- GPU acceleration when available
- Comprehensive performance metrics
- Optional distributed computing backend

**Clustrix Integration:**
In production, Clustrix can distribute GPU workloads across:
- Cloud GPU instances (AWS P3/P4, Azure NC/ND, GCP A100)
- Multi-GPU clusters with SLURM/PBS scheduling
- Kubernetes GPU nodes
- On-premise GPU clusters

**Model:** `cardiffnlp/twitter-roberta-base-sentiment-latest`
""",
    examples=[
        [
            "I absolutely love this new feature!\\nThis is the worst experience ever.\\nIt's pretty good, could be better.\\nAmazing work by the team!\\nNot impressed at all.",
            5
        ],
        [
            "Great product, highly recommend!\\nTerrible customer service.\\nAverage quality for the price.\\nOutstanding performance!\\nWaste of money.",
            5
        ]
    ]
)

if __name__ == "__main__":
    demo.launch()
'''
    gpu_requirements = '''
gradio==4.44.0
torch==2.1.0
transformers==4.35.0
numpy==1.24.3
clustrix>=0.1.1
'''
    gpu_readme = '''
---
title: Clustrix GPU Sentiment Analysis
emoji: ⚡
colorFrom: yellow
colorTo: orange
sdk: gradio
sdk_version: 4.44.0
app_file: app.py
pinned: false
license: mit
tags:
- nlp
- sentiment-analysis
- gpu
- distributed-computing
- clustrix
suggested_hardware: t4-small
---
# Clustrix GPU-Accelerated Sentiment Analysis
A high-performance sentiment analysis demo showcasing GPU acceleration
and Clustrix distributed computing integration.
## Features
- ⚡ **GPU Acceleration**: Utilizes GPU for faster inference
- 📊 **Batch Processing**: Efficiently processes multiple texts
- 🚀 **Clustrix Integration**: Optional distributed computing backend
- 📈 **Performance Metrics**: Real-time throughput and timing
- 🤖 **Transformer Models**: Uses state-of-the-art RoBERTa model
## Usage
1. Enter multiple texts (one per line) in the input box
2. Set the maximum number of texts to process
3. Click "Submit" to run batch sentiment analysis
4. View results including sentiment distribution and performance metrics
## Model
This demo uses `cardiffnlp/twitter-roberta-base-sentiment-latest`,
a RoBERTa model fine-tuned for sentiment analysis on Twitter data.
## Clustrix Scaling
In production environments, Clustrix can distribute GPU workloads across:
- Multi-GPU cloud instances
- GPU clusters with job schedulers
- Kubernetes GPU nodes
- Hybrid cloud-edge deployments
'''
    return {
        'app.py': gpu_app_content.strip(),
        'requirements.txt': gpu_requirements.strip(),
        'README.md': gpu_readme.strip()
    }


gpu_files = create_gpu_clustrix_space()
print("GPU-accelerated HuggingFace Space files created.")
print("\nKey features:")
print("- GPU acceleration for transformer models")
print("- Batch processing for improved throughput")
print("- Real-time performance metrics")
print("- Clustrix integration for distributed GPU computing")
print("\nNote: Requires GPU hardware tier on HuggingFace Spaces.")
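The core of the GPU app is a batching loop followed by a tally of sentiment labels. A minimal sketch of that logic without any model dependency: `fake_classify` is a hypothetical stand-in for the transformers pipeline, and the label mapping assumes a checkpoint may report either generic names (`LABEL_0`, `LABEL_1`, `LABEL_2`) or plain ones (`negative`, `neutral`, `positive`).

```python
from collections import Counter

LEGACY_LABELS = {"label_0": "negative", "label_1": "neutral", "label_2": "positive"}


def batched(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def normalize_label(raw):
    """Lower-case the label and map generic LABEL_n names to sentiment names."""
    key = raw.lower()
    return LEGACY_LABELS.get(key, key)


def summarize(texts, classify, batch_size=16):
    """Classify texts batch by batch and count each sentiment."""
    results = []
    for batch in batched(texts, batch_size):
        results.extend(classify(batch))
    counts = Counter(normalize_label(r["label"]) for r in results)
    return {name: counts.get(name, 0) for name in ("positive", "neutral", "negative")}


def fake_classify(batch):
    """Stand-in classifier that mixes both label styles on purpose."""
    return [
        {"label": "LABEL_2" if "love" in text else "negative", "score": 0.9}
        for text in batch
    ]


summary = summarize(["I love it", "Awful", "love this"], fake_classify, batch_size=2)
print(summary)
```

Normalizing labels before counting keeps the summary correct regardless of which naming style the loaded checkpoint uses.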
Secrets and Configuration Management
import os
import base64
import tempfile
from clustrix import configure


def setup_clustrix_from_secrets():
    """Configure Clustrix using HuggingFace Spaces secrets."""
    # Get cluster configuration from secrets
    cluster_host = os.getenv('CLUSTER_HOST')
    cluster_username = os.getenv('CLUSTER_USERNAME', 'clustrix')
    ssh_key_b64 = os.getenv('CLUSTER_SSH_KEY')
    if not cluster_host:
        print("No cluster host configured, using local execution")
        configure(cluster_host=None)
        return False
    # Handle SSH key
    key_file_path = None
    if ssh_key_b64:
        try:
            # Decode base64 SSH key
            ssh_key = base64.b64decode(ssh_key_b64).decode('utf-8')
            # Write to temporary file
            with tempfile.NamedTemporaryFile(mode='w', delete=False, suffix='.pem') as f:
                f.write(ssh_key)
                key_file_path = f.name
            # Set correct permissions
            os.chmod(key_file_path, 0o600)
        except Exception as e:
            print(f"Error processing SSH key: {e}")
            return False
    # Configure Clustrix
    try:
        configure(
            cluster_type="ssh",
            cluster_host=cluster_host,
            username=cluster_username,
            key_file=key_file_path,
            remote_work_dir="/tmp/clustrix",
            package_manager="auto",
            default_cores=2,
            default_memory="4GB",
            default_time="01:00:00"
        )
        print(f"✅ Clustrix configured for remote execution on {cluster_host}")
        return True
    except Exception as e:
        print(f"❌ Failed to configure Clustrix: {e}")
        configure(cluster_host=None)  # Fallback to local
        return False


def setup_cloud_credentials():
    """Setup cloud credentials from secrets."""
    # AWS credentials
    aws_key = os.getenv('AWS_ACCESS_KEY_ID')
    aws_secret = os.getenv('AWS_SECRET_ACCESS_KEY')
    if aws_key and aws_secret:
        os.environ['AWS_ACCESS_KEY_ID'] = aws_key
        os.environ['AWS_SECRET_ACCESS_KEY'] = aws_secret
        print("✅ AWS credentials configured")
    # Azure credentials
    azure_client_id = os.getenv('AZURE_CLIENT_ID')
    azure_client_secret = os.getenv('AZURE_CLIENT_SECRET')
    azure_tenant_id = os.getenv('AZURE_TENANT_ID')
    if azure_client_id and azure_client_secret and azure_tenant_id:
        os.environ['AZURE_CLIENT_ID'] = azure_client_id
        os.environ['AZURE_CLIENT_SECRET'] = azure_client_secret
        os.environ['AZURE_TENANT_ID'] = azure_tenant_id
        print("✅ Azure credentials configured")
    # Google Cloud credentials
    gcp_key = os.getenv('GCP_SERVICE_ACCOUNT_KEY')
    if gcp_key:
        with tempfile.NamedTemporaryFile(mode='w', delete=False, suffix='.json') as f:
            f.write(gcp_key)
        os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = f.name
        print("✅ Google Cloud credentials configured")


# Example usage in your Space app:
# setup_cloud_credentials()
# clustrix_enabled = setup_clustrix_from_secrets()
# print(f"Clustrix distributed computing: {'Enabled' if clustrix_enabled else 'Disabled (local mode)'}")
HuggingFace Spaces Secrets Management for Clustrix
1. Access Secrets in Space Settings
Go to your Space settings page
Navigate to the "Repository secrets" section
Add secrets as key-value pairs
2. Common Clustrix Secrets
CLUSTER_HOST: Hostname or IP address of your compute cluster
CLUSTER_USERNAME: SSH username for cluster access (the setup code defaults to clustrix)
CLUSTER_SSH_KEY: Private SSH key (base64 encoded)
AWS_ACCESS_KEY_ID: AWS access key ID for cloud clusters
AWS_SECRET_ACCESS_KEY: AWS secret key
AZURE_CLIENT_ID: Azure service principal ID
AZURE_CLIENT_SECRET: Azure service principal secret
AZURE_TENANT_ID: Azure tenant ID (read by the setup code together with the two values above)
GCP_SERVICE_ACCOUNT_KEY: Google Cloud service account JSON
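The CLUSTER_SSH_KEY secret has to be stored base64 encoded, because the setup function decodes it with base64.b64decode before writing the key file. The following is a minimal sketch of producing that value on your own machine. The function names and the key path in the comment are hypothetical, not part of Clustrix or the HuggingFace tooling.

```python
import base64


def encode_key_for_secret(key_text: str) -> str:
    """Return the base64 string to paste into the CLUSTER_SSH_KEY secret."""
    return base64.b64encode(key_text.encode("utf-8")).decode("ascii")


def decode_key_from_secret(secret_value: str) -> str:
    """Inverse operation, mirroring what the Space does at startup."""
    return base64.b64decode(secret_value).decode("utf-8")


# Round-trip check with placeholder text (not a real key).
placeholder = "example-key-material\n"
encoded = encode_key_for_secret(placeholder)
assert decode_key_from_secret(encoded) == placeholder

# To encode a real key, read it from disk first (example path, adjust to yours):
# from pathlib import Path
# print(encode_key_for_secret(Path("~/.ssh/id_ed25519").expanduser().read_text()))
```

Encoding keeps the multi-line key intact when it is pasted into a single-line secret field. It is not encryption, so the encoded value needs the same care as the key itself.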
3. Security Best Practices
Use service accounts instead of personal credentials
Rotate secrets regularly
Apply principle of least privilege
Monitor secret usage and access logs
4. Environment Variables in Code
Spaces exposes each secret to the running app as an environment variable, so code reads it with os.getenv()
Configuration Code Example
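As a small illustration of reading secrets as environment variables, the sketch below reports which required secrets are missing and picks an execution mode without ever printing a secret value. The helper names are hypothetical. Treating only CLUSTER_HOST as required is an assumption that mirrors the setup function, which falls back to local execution only when the host is unset.

```python
import os
from typing import Mapping, Optional, Sequence


def missing_secrets(
    env: Optional[Mapping[str, str]] = None,
    required: Sequence[str] = ("CLUSTER_HOST",),
) -> list:
    """Return the names (never the values) of unset or empty secrets."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]


def execution_mode(env: Optional[Mapping[str, str]] = None) -> str:
    """Return 'remote' when every required secret is present, else 'local'."""
    return "local" if missing_secrets(env) else "remote"


# Passing a dict instead of os.environ keeps the example independent of the host.
print(execution_mode({}))                                # local
print(execution_mode({"CLUSTER_HOST": "203.0.113.10"}))  # remote
```

Passing the environment in as a mapping makes the check easy to unit test before the Space is deployed.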
Deployment Tips and Best Practices
Troubleshooting Guide
Common Issues and Solutions
Problem: Space fails to start
Solution:
Check requirements.txt for version conflicts
Verify Python version compatibility
Review app.py for syntax errors
Check Space logs for detailed error messages
Problem: Clustrix connection fails
Solution:
Verify cluster host is accessible from HF Spaces
Check SSH key format and permissions
Ensure firewall allows connections from HF IPs
Implement fallback to local execution
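One way to verify that the cluster host is reachable from the Space is a plain TCP connection test against the SSH port. This is a sketch using only the standard library. The function name can_reach is hypothetical, and port 22 is an assumption about where your cluster's SSH daemon listens.

```python
import socket


def can_reach(host: str, port: int = 22, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS failures.
        return False


# Example: decide on the execution mode before configuring Clustrix.
# import os
# remote_ok = can_reach(os.getenv("CLUSTER_HOST", ""))
```

A successful connection only shows that the port is open. Authentication problems, such as a malformed key, surface later when Clustrix actually connects.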
Problem: GPU not detected
Solution:
Upgrade to GPU-enabled hardware tier
Check torch.cuda.is_available() in code
Verify CUDA-compatible PyTorch version
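Declare the GPU requirement with the suggested_hardware field in the README metadata (the hardware itself is still selected in the Space settings)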
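The torch.cuda.is_available() check can be wrapped so that it also works on a Space where PyTorch is not installed. This is a minimal sketch, and the function name gpu_status is hypothetical.

```python
def gpu_status() -> dict:
    """Report whether PyTorch is installed and whether it can see a CUDA device."""
    try:
        import torch
    except ImportError:
        return {"torch_installed": False, "cuda_available": False, "device": "cpu"}

    available = bool(torch.cuda.is_available())
    return {
        "torch_installed": True,
        "cuda_available": available,
        # Device name of the first GPU, for example the model of the assigned card.
        "device": torch.cuda.get_device_name(0) if available else "cpu",
    }


print(gpu_status())
```

Printing this once at startup puts the result in the Space logs, which makes it easier to tell a hardware-tier problem from a PyTorch build problem.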
Problem: Memory errors
Solution:
Optimize batch sizes for available memory
Clear GPU cache with torch.cuda.empty_cache()
Use memory-efficient model loading
Consider model quantization or distillation
Problem: Slow performance
Solution:
Profile code to identify bottlenecks
Use appropriate hardware tier
Implement model caching and warm-up
Optimize data preprocessing pipeline
HuggingFace Spaces Hardware Tiers
CPU Basic (Free):
2 vCPUs, 16GB RAM
Good for: Simple demos, small models, prototyping
Clustrix use case: Local fallback, lightweight computations
CPU Upgrade ($0.03/hour):
8 vCPUs, 32GB RAM
Good for: CPU-intensive tasks, larger datasets
Clustrix use case: Medium-scale local processing
T4 Small ($0.60/hour):
4 vCPUs, 15GB RAM, 1x T4 GPU (16GB VRAM)
Good for: Deep learning inference, computer vision
Clustrix use case: GPU-accelerated ML, model training demos
A10G Small ($1.05/hour):
4 vCPUs, 15GB RAM, 1x A10G GPU (24GB VRAM)
Good for: Large models, high-performance inference
Clustrix use case: Production-scale ML applications
A100 Large ($4.13/hour):
12 vCPUs, 142GB RAM, 1x A100 GPU (40GB VRAM)
Good for: Massive models, research applications
Clustrix use case: Distributed training coordination
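Hourly prices translate into a monthly budget with simple arithmetic. The sketch below assumes billing is proportional to running hours. The function name estimate_cost is hypothetical, and the rate should be replaced with the current price for the tier you choose.

```python
def estimate_cost(hourly_rate_usd: float, hours_per_day: float, days: int = 30) -> float:
    """Estimated spend in USD for a Space running hours_per_day on paid hardware."""
    if hourly_rate_usd < 0 or hours_per_day < 0 or days < 0:
        raise ValueError("rate, hours and days must be non-negative")
    return round(hourly_rate_usd * hours_per_day * days, 2)


# T4 Small at the $0.60/hour listed above, running 8 hours a day for 30 days.
print(estimate_cost(0.60, hours_per_day=8))  # 144.0
```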
HuggingFace Spaces + Clustrix Best Practices
Performance Optimization
Start on the smallest hardware tier that fits the workload and move up only when needed (CPU Basic → T4 Small → A10G Small)
Implement caching for models and data
Use batch processing for multiple requests
Optimize memory usage with careful tensor management
Consider async processing for long-running tasks
Security
Store all credentials in Spaces secrets
Use service accounts instead of personal credentials
Implement input validation and sanitization
Never log sensitive information
Use HTTPS for all external API calls
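To help keep sensitive values out of logs, a logging filter can replace known secret values before a record is written. This is a sketch built on the standard logging module. The class name RedactSecrets is hypothetical, and it only catches values that appear verbatim in the formatted message.

```python
import logging
import os


class RedactSecrets(logging.Filter):
    """Replace known secret values with a placeholder before a record is emitted."""

    def __init__(self, secret_names):
        super().__init__()
        # Values are read once at construction; empty or unset secrets are skipped.
        self._values = [v for v in (os.getenv(n) for n in secret_names) if v]

    def filter(self, record: logging.LogRecord) -> bool:
        message = record.getMessage()
        for value in self._values:
            message = message.replace(value, "***")
        # Store the already formatted, redacted message on the record.
        record.msg, record.args = message, None
        return True


logger = logging.getLogger("space")
logger.addFilter(RedactSecrets(["CLUSTER_SSH_KEY", "AWS_SECRET_ACCESS_KEY"]))
```

A filter attached to a logger applies only to records logged through that logger, so attach it to the handler instead if several loggers share one output.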
User Experience
Provide clear error messages and fallbacks
Show progress indicators for long operations
Include example inputs and use cases
Add comprehensive documentation
Implement graceful degradation when Clustrix is unavailable
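Graceful degradation can be sketched as a wrapper that tries the remote path and runs the same work locally if it fails. The function run_with_fallback is hypothetical. In a real Space the remote callable would be a Clustrix-decorated function and the local one its undecorated equivalent.

```python
from typing import Callable, Tuple, TypeVar

T = TypeVar("T")


def run_with_fallback(remote: Callable[[], T], local: Callable[[], T]) -> Tuple[T, str]:
    """Try the remote path first; on any failure run the local path instead.

    Returns (result, mode) so the UI can tell the user which path was used.
    """
    try:
        return remote(), "remote"
    except Exception as exc:  # broad on purpose: any remote failure degrades
        print(f"Remote execution unavailable ({exc}); running locally")
        return local(), "local"


def failing_remote():
    raise ConnectionError("cluster unreachable")


result, mode = run_with_fallback(failing_remote, lambda: sum(range(10)))
print(result, mode)  # 45 local
```

Returning the mode alongside the result lets the interface show a clear notice instead of failing silently or surfacing a stack trace.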
Monitoring and Debugging
Add logging for key operations
Include performance metrics in the UI
Monitor resource usage and costs
Set up alerts for failures
Use descriptive commit messages for versioning
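Logging and performance metrics can share one mechanism: a decorator that times each call, logs the duration, and keeps the latest value for display in the UI. This is a minimal sketch. The decorator name timed and the logger name are hypothetical.

```python
import functools
import logging
import time

logger = logging.getLogger("space.metrics")


def timed(func):
    """Log how long each call takes and keep the last duration on the wrapper."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            # Runs on success and on error, so failures are timed too.
            wrapper.last_seconds = time.perf_counter() - start
            logger.info("%s took %.3fs", func.__name__, wrapper.last_seconds)
    wrapper.last_seconds = None
    return wrapper


@timed
def preprocess(values):
    return [v * 2 for v in values]


preprocess([1, 2, 3])
```

The value in preprocess.last_seconds can be shown next to the result so users see how long the last request took.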
Scalability
Design for both local and distributed execution
Implement proper error handling and retries
Use connection pooling for database/API connections
Consider rate limiting for external services
Plan for traffic spikes and scaling needs
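Error handling with retries is often implemented as exponential backoff: wait a little after the first failure and double the wait each time. The helper below is a sketch under that assumption. The name retry and the default set of retryable exceptions are hypothetical choices, not a Clustrix API.

```python
import time


def retry(func, attempts: int = 3, base_delay: float = 0.5,
          retry_on=(ConnectionError, TimeoutError)):
    """Call func(); on a retryable error wait base_delay * 2**n and try again."""
    for attempt in range(attempts):
        try:
            return func()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            time.sleep(base_delay * (2 ** attempt))


calls = {"n": 0}


def flaky():
    """Fails twice with a transient error, then succeeds."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient")
    return "ok"


print(retry(flaky, base_delay=0.01))  # ok
```

Limiting retry_on to transient errors matters: retrying a bad credential or a malformed request only delays the failure.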
Deployment
Pin specific package versions in requirements.txt
Test locally before deploying
Use environment variables for configuration
Implement health checks and status endpoints
Document deployment process and dependencies
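Pinning can be checked automatically before a deploy. The sketch below flags requirement lines that lack an exact == pin. The function name is hypothetical, the version number is only illustrative, and the parsing is deliberately simple (it does not handle URL requirements or environment markers).

```python
def unpinned_requirements(text: str) -> list:
    """Return requirement lines that do not pin an exact version with '=='."""
    unpinned = []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line or line.startswith("-"):  # skip blanks and pip options
            continue
        if "==" not in line:
            unpinned.append(line)
    return unpinned


example = """
gradio==4.44.0
clustrix
numpy>=1.24  # too loose
"""
print(unpinned_requirements(example))  # ['clustrix', 'numpy>=1.24']
```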
Summary
This tutorial covered:
Gradio Integration: Interactive ML training interfaces with Clustrix backend
Streamlit Dashboards: Rich data science applications with distributed computing
GPU Acceleration: High-performance NLP processing with transformer models
Secrets Management: Secure credential storage and configuration
Deployment Best Practices: Performance optimization and troubleshooting
Hardware Selection: Choosing appropriate tiers for different use cases
Key Advantages of HuggingFace Spaces + Clustrix
Easy Deployment: Simple git-based deployment workflow
Community Sharing: Built-in discoverability and collaboration
Flexible Hardware: From free CPU to high-end GPU instances
Hybrid Computing: Local execution with optional distributed scaling
ML Focus: Optimized for machine learning and AI applications
Next Steps
Create your HuggingFace account and get an access token
Start with a simple Gradio app using the provided templates
Configure Clustrix integration using Spaces secrets
Test locally before deploying to ensure compatibility
Monitor performance and scale hardware as needed
Use Cases
Research Demos: Showcase distributed computing research
Educational Tools: Interactive learning environments
Prototype Testing: Rapid prototyping with real user feedback
Model Serving: Production-ready ML model deployment
Collaborative Computing: Shared access to distributed resources
Resources
Remember: HuggingFace Spaces provides an excellent platform for showcasing Clustrix capabilities and building interactive ML applications with distributed computing backends!