Filesystem Utilities Tutorial
This notebook demonstrates how to use Clustrix's unified filesystem utilities for seamless file operations across local and remote clusters.
Overview
Clustrix provides a set of filesystem utilities that work identically whether you're operating on local files or files on remote clusters. This enables data-driven cluster computing workflows where your code can discover, analyze, and process files without worrying about whether they're local or remote.
[ ]:
# Import the filesystem utilities
from clustrix import (
    cluster_ls,
    cluster_find,
    cluster_stat,
    cluster_exists,
    cluster_isdir,
    cluster_isfile,
    cluster_glob,
    cluster_du,
    cluster_count_files,
    cluster,
)
from clustrix.config import ClusterConfig
print("Clustrix filesystem utilities imported successfully!")
Configuration
First, let's set up configurations for both local and remote operations:
[ ]:
# Local configuration - works with files on your local machine
local_config = ClusterConfig(
    cluster_type="local",
    local_work_dir=".",  # Current directory
)
# Remote cluster configuration - replace with your cluster details
remote_config = ClusterConfig(
    cluster_type="slurm",
    cluster_host="cluster.example.edu",
    username="researcher",
    remote_work_dir="/scratch/project",
)
# For this demo, we'll use local_config
config = local_config
print(f"Using config: {config.cluster_type}")
Basic Operations
Directory Listing
List files and directories:
[ ]:
# List files in current directory
files = cluster_ls(".", config)
print(f"Found {len(files)} items in current directory:")
for file in files[:10]:  # Show first 10
    print(f" - {file}")
if len(files) > 10:
    print(f" ... and {len(files) - 10} more")
File Discovery
Find files by pattern:
[ ]:
# Find Python files recursively
py_files = cluster_find("*.py", ".", config)
print(f"Found {len(py_files)} Python files:")
for file in py_files[:5]:
    print(f" - {file}")
# Find Jupyter notebooks
notebooks = cluster_find("*.ipynb", ".", config)
print(f"\nFound {len(notebooks)} Jupyter notebooks:")
for file in notebooks[:3]:
    print(f" - {file}")
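For intuition, a recursive pattern search on the local filesystem can be sketched with the standard library alone. This is only an illustration of the idea, not Clustrix's implementation: `local_find` is a hypothetical helper, and it is an assumption that `cluster_find` returns paths in a comparable form.

```python
import fnmatch
import os


def local_find(pattern, root="."):
    """Recursively collect files under root whose names match a shell-style pattern."""
    matches = []
    for dirpath, _dirnames, filenames in os.walk(root):
        # fnmatch.filter keeps only the names matching the pattern
        for name in fnmatch.filter(filenames, pattern):
            matches.append(os.path.join(dirpath, name))
    return sorted(matches)
```

Calling `local_find("*.py", ".")` would list every Python file below the current directory, which is roughly what the cell above asks Clustrix to do in a local configuration.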
File Information
Get detailed information about files:
[ ]:
# Check if README exists and get its info
readme_files = cluster_find("README*", ".", config)
if readme_files:
    readme = readme_files[0]
    print(f"Found README: {readme}")
    # Get detailed file information
    file_info = cluster_stat(readme, config)
    print(f" Size: {file_info.size:,} bytes")
    print(f" Modified: {file_info.modified_datetime}")
    print(f" Is directory: {file_info.is_dir}")
    print(f" Permissions: {file_info.permissions}")
else:
    print("No README file found")
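To make the fields printed above concrete, here is a minimal local sketch that builds a similar record from `os.stat`. The names `LocalFileInfo` and `local_stat` are hypothetical, and the exact types Clustrix uses for each field (for example, the format of `permissions`) are assumptions.

```python
import os
import stat
from dataclasses import dataclass
from datetime import datetime


@dataclass
class LocalFileInfo:
    """Hypothetical stand-in mirroring the attributes used in this tutorial."""
    size: int
    modified_datetime: datetime
    is_dir: bool
    permissions: str


def local_stat(path):
    """Collect size, modification time, type, and mode string for a local path."""
    st = os.stat(path)
    return LocalFileInfo(
        size=st.st_size,
        modified_datetime=datetime.fromtimestamp(st.st_mtime),
        is_dir=stat.S_ISDIR(st.st_mode),
        permissions=stat.filemode(st.st_mode),  # e.g. a string like '-rw-r--r--'
    )
```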
File Existence and Type Checking
[ ]:
# Check common files/directories
paths_to_check = ["setup.py", "requirements.txt", "docs", "tests", "src", "clustrix"]
for path in paths_to_check:
    if cluster_exists(path, config):
        if cluster_isdir(path, config):
            print(f"{path} (directory)")
        elif cluster_isfile(path, config):
            print(f"{path} (file)")
    else:
        print(f"{path} (not found)")
Pattern Matching with Glob
[ ]:
# Use glob patterns for flexible file matching
# Note: brace alternation ({a,b}) is not part of Python's fnmatch syntax;
# confirm that cluster_glob expands it before relying on those two patterns.
patterns = {
    "Python files": "*.py",
    "Config files": "*.{yml,yaml,json,toml}",
    "Documentation": "*.{md,rst,txt}",
    "Test files": "test_*.py",
}
for name, pattern in patterns.items():
    matches = cluster_glob(pattern, ".", config)
    print(f"{name}: {len(matches)} files")
    if matches:
        print(f" Examples: {', '.join(matches[:3])}")
    print()
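Two of the patterns above use brace alternation such as `*.{yml,yaml,json,toml}`. If the matcher behind `cluster_glob` follows Python's `fnmatch` rules (an assumption that is not verified here), braces are treated literally and such patterns match nothing. A small, hypothetical helper can expand non-nested brace groups into plain patterns first:

```python
import re


def expand_braces(pattern):
    """Expand non-nested {a,b,c} groups into a list of plain patterns."""
    match = re.search(r"\{([^{}]*)\}", pattern)
    if match is None:
        return [pattern]  # nothing to expand
    head, tail = pattern[:match.start()], pattern[match.end():]
    expanded = []
    for option in match.group(1).split(","):
        # Recurse so that any later brace groups in the tail are expanded too
        expanded.extend(expand_braces(head + option + tail))
    return expanded
```

Each expanded pattern can then be passed to `cluster_glob` separately and the results concatenated.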
Directory Usage Analysis
[ ]:
# Analyze current directory usage
usage = cluster_du(".", config)
print("Directory Usage Analysis:")
print(f" Total size: {usage.total_mb:.1f} MB ({usage.total_gb:.3f} GB)")
print(f" File count: {usage.file_count:,}")
if usage.file_count > 0:
    avg_size = usage.total_mb / usage.file_count
    print(f" Average file size: {avg_size:.2f} MB")
# Count files by type
total_files = cluster_count_files(".", "*", config)
python_files = cluster_count_files(".", "*.py", config)
notebook_files = cluster_count_files(".", "*.ipynb", config)
print("\nFile Counts:")
print(f" Total files: {total_files:,}")
print(f" Python files: {python_files:,}")
print(f" Notebooks: {notebook_files:,}")
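As a point of comparison, disk usage for a local directory tree can be approximated with `os.walk`. The sketch below is illustrative only: `local_du` and `to_mb` are hypothetical names, and the use of 1024 * 1024 bytes per MB is an assumption about how a value like `total_mb` might be derived.

```python
import os


def local_du(root="."):
    """Return (total_bytes, file_count) for all files under root."""
    total_bytes = 0
    file_count = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                total_bytes += os.path.getsize(path)
            except OSError:
                continue  # file vanished or is unreadable; skip it
            file_count += 1
    return total_bytes, file_count


def to_mb(num_bytes):
    """Convert bytes to megabytes, assuming 1 MB = 1024 * 1024 bytes."""
    return num_bytes / (1024 * 1024)
```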
Data-Driven Workflows
The real power comes when combining filesystem utilities with the @cluster decorator:
[ ]:
@cluster(cores=2)  # Use 2 cores for this example
def analyze_python_files(config):
    """Analyze all Python files in the project."""
    # Find all Python files
    py_files = cluster_find("*.py", ".", config)
    print(f"Found {len(py_files)} Python files to analyze")
    results = {
        'total_files': len(py_files),
        'total_lines': 0,
        'total_size': 0,
        'large_files': [],
        'file_details': []
    }
    # This loop will be automatically parallelized!
    for py_file in py_files:
        # Get file information
        file_info = cluster_stat(py_file, config)
        # Count lines (for local files)
        if config.cluster_type == "local":
            try:
                with open(py_file, 'r', encoding='utf-8') as f:
                    lines = len(f.readlines())
            except (UnicodeDecodeError, FileNotFoundError):
                lines = 0
        else:
            lines = 0  # Would need remote file reading for clusters
        results['total_lines'] += lines
        results['total_size'] += file_info.size
        # Track large files (> 10KB)
        if file_info.size > 10000:
            results['large_files'].append({
                'file': py_file,
                'size': file_info.size,
                'lines': lines
            })
        results['file_details'].append({
            'file': py_file,
            'size': file_info.size,
            'lines': lines,
            'modified': file_info.modified_datetime.isoformat()
        })
    return results
# Run the analysis
print("Analyzing Python files...")
analysis = analyze_python_files(config)
print("\nAnalysis Results:")
print(f" Total Python files: {analysis['total_files']}")
print(f" Total lines of code: {analysis['total_lines']:,}")
print(f" Total size: {analysis['total_size'] / 1024:.1f} KB")
print(f" Large files (>10KB): {len(analysis['large_files'])}")
if analysis['large_files']:
    print("\nLargest Python files:")
    large_files = sorted(analysis['large_files'], key=lambda x: x['size'], reverse=True)
    for file_info in large_files[:5]:
        print(f" - {file_info['file']}: {file_info['size']:,} bytes, {file_info['lines']:,} lines")
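To illustrate what running a per-file loop in parallel means, the sketch below maps a line counter over a list of local paths with a small thread pool from the standard library. It is not how `@cluster` works internally (that is not described in this tutorial); `count_lines` and `count_lines_parallel` are hypothetical helpers.

```python
from concurrent.futures import ThreadPoolExecutor


def count_lines(path):
    """Count lines in a file, reading bytes so undecodable text cannot fail."""
    with open(path, "rb") as f:
        return sum(1 for _ in f)


def count_lines_parallel(paths, max_workers=2):
    """Map count_lines over paths with a thread pool; returns {path: line_count}."""
    paths = list(paths)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so zip pairs each path with its result
        return dict(zip(paths, pool.map(count_lines, paths)))
```

Reading in binary mode avoids the `UnicodeDecodeError` branch handled in the cell above, at the cost of counting raw newline-terminated chunks rather than decoded text lines.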
Advanced Pattern: Conditional Processing
Process files only if certain conditions are met:
[ ]:
@cluster(cores=1)
def smart_documentation_check(config):
    """Check documentation completeness and suggest improvements."""
    results = {
        'has_readme': False,
        'has_contributing': False,
        'has_license': False,
        'docs_directory': False,
        'doc_file_count': 0,
        'notebook_count': 0,
        'suggestions': []
    }
    # Check for essential documentation files
    if cluster_exists("README.md", config) or cluster_exists("README.rst", config):
        results['has_readme'] = True
    else:
        results['suggestions'].append("Add a README.md file")
    if cluster_exists("CONTRIBUTING.md", config):
        results['has_contributing'] = True
    else:
        results['suggestions'].append("Add a CONTRIBUTING.md file")
    if cluster_exists("LICENSE", config) or cluster_exists("LICENSE.txt", config):
        results['has_license'] = True
    else:
        results['suggestions'].append("Add a LICENSE file")
    # Check for docs directory
    if cluster_exists("docs", config) and cluster_isdir("docs", config):
        results['docs_directory'] = True
        # Count documentation files (assumes cluster_find expands brace patterns)
        doc_files = cluster_find("*.{rst,md}", "docs", config)
        results['doc_file_count'] = len(doc_files)
    else:
        results['suggestions'].append("Create a docs/ directory with documentation")
    # Count notebooks
    notebooks = cluster_find("*.ipynb", ".", config)
    results['notebook_count'] = len(notebooks)
    if results['notebook_count'] == 0:
        results['suggestions'].append("Consider adding tutorial notebooks")
    return results
# Run documentation check
print("Checking documentation...")
doc_check = smart_documentation_check(config)
print("\nDocumentation Status:")
print(f" README: {'Yes' if doc_check['has_readme'] else 'No'}")
print(f" Contributing guide: {'Yes' if doc_check['has_contributing'] else 'No'}")
print(f" License: {'Yes' if doc_check['has_license'] else 'No'}")
print(f" Docs directory: {'Yes' if doc_check['docs_directory'] else 'No'}")
print(f" Notebooks: {doc_check['notebook_count']}")
if doc_check['suggestions']:
    print("\nSuggestions for improvement:")
    for suggestion in doc_check['suggestions']:
        print(f" - {suggestion}")
else:
    print("\nDocumentation looks complete!")
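The repeated if/else checks above follow one pattern: a requirement is met if any of several candidate paths exists, otherwise a suggestion is recorded. One possible way to express that as data is sketched below. `check_required_files` and `REQUIREMENTS` are hypothetical names, and the existence check is passed in as a callable so the sketch does not depend on Clustrix.

```python
def check_required_files(exists, requirements):
    """Return (status, suggestions) for a table of required files.

    exists: callable taking a path and returning True or False.
    requirements: dict mapping a label to (candidate_paths, suggestion).
    """
    status = {}
    suggestions = []
    for label, (candidates, suggestion) in requirements.items():
        # A requirement is satisfied if any candidate path exists
        status[label] = any(exists(path) for path in candidates)
        if not status[label]:
            suggestions.append(suggestion)
    return status, suggestions


REQUIREMENTS = {
    "readme": (["README.md", "README.rst"], "Add a README.md file"),
    "contributing": (["CONTRIBUTING.md"], "Add a CONTRIBUTING.md file"),
    "license": (["LICENSE", "LICENSE.txt"], "Add a LICENSE file"),
}
```

With Clustrix, the callable could be `lambda path: cluster_exists(path, config)`, so adding a new required file becomes a one-line change to the table.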
Working with Different File Types
Demonstrate handling various file types:
[ ]:
def categorize_files(config):
    """Categorize all files in the project."""
    categories = {
        'Source Code': ['*.py', '*.js', '*.ts', '*.java', '*.cpp', '*.c', '*.h'],
        'Documentation': ['*.md', '*.rst', '*.txt'],
        'Configuration': ['*.yml', '*.yaml', '*.json', '*.toml', '*.ini', '*.cfg'],
        'Data': ['*.csv', '*.json', '*.xml', '*.xlsx', '*.h5', '*.pkl'],
        'Images': ['*.png', '*.jpg', '*.jpeg', '*.gif', '*.svg'],
        'Notebooks': ['*.ipynb'],
        'Web': ['*.html', '*.css', '*.js']
    }
    results = {}
    for category, patterns in categories.items():
        files = set()
        for pattern in patterns:
            files.update(cluster_find(pattern, ".", config))
        # Remove duplicates before sizing so each file is counted once;
        # sorting keeps the example files stable between runs
        files = sorted(files)
        # Get size information
        total_size = 0
        for file in files:
            try:
                file_info = cluster_stat(file, config)
                total_size += file_info.size
            except Exception:
                pass  # Skip files that can't be stat'd
        results[category] = {
            'count': len(files),
            'size_mb': total_size / (1024 * 1024),
            'files': files[:5]  # Store first 5 as examples
        }
    return results
# Categorize files
print("Categorizing files by type...")
file_categories = categorize_files(config)
print("\nFile Categories:")
total_files = 0
total_size = 0
for category, info in file_categories.items():
    if info['count'] > 0:
        total_files += info['count']
        total_size += info['size_mb']
        print(f" {category}: {info['count']} files ({info['size_mb']:.1f} MB)")
        if info['files']:
            examples = ', '.join(info['files'][:3])
            print(f" Examples: {examples}")
print(f"\nSummary: {total_files} categorized files, {total_size:.1f} MB total")
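In the category table above, `*.json` and `*.js` each appear under two categories, so those files are counted in both and the summary total can exceed the real file count. An alternative, sketched here under the assumption that a flat list of paths is already available, is a single pass that assigns each path to exactly one category by extension. `EXTENSION_CATEGORIES` and `categorize_by_extension` are hypothetical names, and the mapping is deliberately abbreviated.

```python
import os
from collections import Counter

# Abbreviated, illustrative mapping: each extension belongs to one category
EXTENSION_CATEGORIES = {
    ".py": "Source Code",
    ".md": "Documentation",
    ".rst": "Documentation",
    ".yml": "Configuration",
    ".yaml": "Configuration",
    ".toml": "Configuration",
    ".csv": "Data",
    ".png": "Images",
    ".ipynb": "Notebooks",
}


def categorize_by_extension(paths, mapping=EXTENSION_CATEGORIES):
    """Count paths per category, assigning each path to exactly one category."""
    counts = Counter()
    for path in paths:
        ext = os.path.splitext(path)[1].lower()
        counts[mapping.get(ext, "Other")] += 1
    return dict(counts)
```

This needs only one listing call instead of one search per pattern, which may matter when each call goes to a remote cluster.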
Performance Tips
Here are some tips for using filesystem utilities efficiently:
[ ]:
import time
def demonstrate_performance_tips(config):
    """Show efficient vs inefficient patterns."""
    print("Performance Tips for Filesystem Operations:\n")
    # Tip 1: Use count to check before listing
    print("1. Check file counts before expensive operations:")
    py_count = cluster_count_files(".", "*.py", config)
    print(f" Found {py_count} Python files - deciding processing strategy")
    if py_count > 100:
        print(" -> Large number of files, using targeted search")
        # Use specific patterns instead of listing all
        test_files = cluster_find("test_*.py", ".", config)
        main_files = cluster_find("main*.py", ".", config)
    else:
        print(" -> Small number of files, safe to list all")
        all_py_files = cluster_find("*.py", ".", config)
    print()
    # Tip 2: Use exists() before stat()
    print("2. Check existence before getting file info:")
    config_files = ["setup.py", "pyproject.toml", "requirements.txt"]
    for config_file in config_files:
        if cluster_exists(config_file, config):  # Fast check first
            file_info = cluster_stat(config_file, config)  # Then get details
            print(f" {config_file}: {file_info.size:,} bytes")
        else:
            print(f" {config_file}: not found")
    print()
    # Tip 3: Use specific patterns instead of filtering
    print("3. Use specific patterns for better performance:")
    print(" Good: cluster_find('test_*.py', '.', config)")
    print(" Better than: [f for f in cluster_ls('.', config) if f.startswith('test_')]")
    # Demonstrate the difference (perf_counter is a monotonic, high-resolution timer)
    # Method 1: Specific pattern (efficient)
    start = time.perf_counter()
    test_files_direct = cluster_find("test_*.py", ".", config)
    time_direct = time.perf_counter() - start
    # Method 2: List all then filter (less efficient)
    start = time.perf_counter()
    all_files = cluster_ls(".", config)
    test_files_filtered = [f for f in all_files if f.startswith('test_') and f.endswith('.py')]
    time_filtered = time.perf_counter() - start
    print(f" Direct pattern: {len(test_files_direct)} files in {time_direct:.4f}s")
    print(f" List + filter: {len(test_files_filtered)} files in {time_filtered:.4f}s")
    ratio = time_filtered / time_direct if time_direct > 0 else 1
    print(f" List + filter took {ratio:.1f}x the time of the direct pattern")
# Run performance demo
demonstrate_performance_tips(config)
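A single timed call, as in the comparison above, is sensitive to caching and background noise, so the two numbers can easily swap places between runs. A slightly more robust approach, sketched below with a hypothetical `best_time` helper, repeats each call and keeps the fastest run.

```python
import time


def best_time(func, repeats=5):
    """Call func() several times and return (fastest_seconds, last_result)."""
    best = float("inf")
    result = None
    for _ in range(repeats):
        start = time.perf_counter()
        result = func()
        best = min(best, time.perf_counter() - start)
    return best, result
```

For example, `best_time(lambda: cluster_find("test_*.py", ".", config))` would time the direct pattern search; wrapping the list-and-filter variant in a similar lambda gives a comparable figure.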
Summary
This tutorial covered:
Basic Operations: cluster_ls, cluster_find, cluster_stat, cluster_exists
Pattern Matching: cluster_glob, cluster_count_files
Data-Driven Workflows: Using filesystem utilities with @cluster
Advanced Patterns: Conditional processing, file categorization
Performance Tips: Efficient patterns for large-scale operations
Key Benefits
Unified API: Same code works locally and on remote clusters
Automatic Parallelization: When used with @cluster, loop processing is parallelized
Data Discovery: Enable workflows that adapt based on actual file contents
Cross-Platform: Consistent behavior across different operating systems
Next Steps
Try these operations with your own data
Configure a remote cluster and test the same code
Build data processing pipelines using @cluster with filesystem utilities
Explore the API documentation for complete function references
Happy cluster computing!