Filesystem Utilities
Unified filesystem operations for local and remote clusters.
This module provides a consistent interface for filesystem operations that work both locally and on remote clusters based on the ClusterConfig object.
- class clustrix.filesystem.FileInfo(size, modified, is_dir, permissions, name='')
Bases: object
File information structure.
- __init__(size, modified, is_dir, permissions, name='')
Initialize FileInfo with file metadata.
- property is_file
Check if this is a file (not a directory).
- property modified_datetime
Get modified time as datetime object.
- class clustrix.filesystem.DiskUsage(total_bytes, file_count)
Bases: object
Disk usage information.
- class clustrix.filesystem.ClusterFilesystem(config)
Bases: object
Unified filesystem operations for local and remote clusters.
- clustrix.filesystem.cluster_ls(path='.', config=None)
List directory contents locally or remotely based on config.
- clustrix.filesystem.cluster_find(pattern, path='.', config=None)
Find files matching pattern locally or remotely based on config.
- clustrix.filesystem.cluster_stat(path, config=None)
Get file information locally or remotely based on config.
- Return type: FileInfo
- clustrix.filesystem.cluster_exists(path, config=None)
Check if file/directory exists locally or remotely based on config.
- Return type: bool
- clustrix.filesystem.cluster_isdir(path, config=None)
Check if path is directory locally or remotely based on config.
- Return type: bool
- clustrix.filesystem.cluster_isfile(path, config=None)
Check if path is file locally or remotely based on config.
- Return type: bool
- clustrix.filesystem.cluster_glob(pattern, path='.', config=None)
Pattern matching for files locally or remotely based on config.
- clustrix.filesystem.cluster_du(path='.', config=None)
Get directory usage locally or remotely based on config.
- Return type: DiskUsage
- clustrix.filesystem.cluster_count_files(path='.', pattern='*', config=None)
Count files matching pattern locally or remotely based on config.
- Return type: int
Overview
The filesystem utilities module provides a unified interface for filesystem operations that work seamlessly across local and remote clusters. All operations use the same API regardless of whether you're working locally or on a remote cluster.
Key Features
Unified API: Same function calls work locally and remotely
Automatic SSH Management: Transparent connection handling for remote operations
Path Normalization: Consistent path handling across platforms
Data Structures: Structured returns via FileInfo and DiskUsage classes
Config-Driven: Uses ClusterConfig to determine local vs remote execution
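
To make the config-driven idea concrete, here is a minimal sketch of how one function might choose between local and remote execution. It is not the clustrix implementation: `SketchConfig`, `build_remote_ls_command`, `sketch_ls`, and the `run_remote` callback are hypothetical names, and the rule "`cluster_type == "local"` means local" is an assumption based on the usage examples on this page.

```python
import os
import shlex
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class SketchConfig:
    """Hypothetical stand-in for ClusterConfig with only the fields used here."""
    cluster_type: str = "local"
    cluster_host: Optional[str] = None
    username: Optional[str] = None


def build_remote_ls_command(path: str, config: SketchConfig) -> List[str]:
    """Build (but do not run) an ssh command that lists `path` on the cluster."""
    target = f"{config.username}@{config.cluster_host}"
    return ["ssh", target, f"ls -1 {shlex.quote(path)}"]


def sketch_ls(
    path: str = ".",
    config: Optional[SketchConfig] = None,
    run_remote: Optional[Callable[[List[str]], str]] = None,
) -> List[str]:
    """List `path` locally or remotely, depending on the config."""
    config = config or SketchConfig()
    # Assumed rule: a "local" cluster type (or no host) means local execution.
    if config.cluster_type == "local" or config.cluster_host is None:
        return sorted(os.listdir(path))
    if run_remote is None:
        raise RuntimeError("remote listing needs a command runner")
    # The runner is injected so the sketch never opens a real connection.
    output = run_remote(build_remote_ls_command(path, config))
    return sorted(line for line in output.splitlines() if line)
```

Injecting the command runner keeps the dispatch logic testable without a cluster; a real implementation would manage the SSH connection itself, as the feature list describes.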
Core Functions
Directory Operations
- clustrix.cluster_ls(path='.', config=None)
List directory contents locally or remotely based on config.
- clustrix.cluster_find(pattern, path='.', config=None)
Find files matching pattern locally or remotely based on config.
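
As a rough mental model for a recursive, pattern-based search, the local case can be sketched with the standard library alone. `local_find` is a hypothetical helper, not part of clustrix, and whether `cluster_find` returns relative or absolute paths is not stated on this page.

```python
import fnmatch
import os
from typing import List


def local_find(pattern: str, path: str = ".") -> List[str]:
    """Return files under `path` (recursively) whose name matches `pattern`."""
    matches = []
    for root, _dirs, files in os.walk(path):
        for name in files:
            # Match on the file name only, as `find -name` does.
            if fnmatch.fnmatch(name, pattern):
                matches.append(os.path.join(root, name))
    return sorted(matches)
```

For example, `local_find("*.csv", "datasets/")` would return every CSV file below `datasets/`, including those in subdirectories.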
File Operations
- clustrix.cluster_stat(path, config=None)
Get file information locally or remotely based on config.
- Return type: FileInfo
- clustrix.cluster_exists(path, config=None)
Check if file/directory exists locally or remotely based on config.
- Return type: bool
Storage Operations
Data Classes
- class clustrix.filesystem.FileInfo(size, modified, is_dir, permissions, name='')
File information structure.
- __init__(size, modified, is_dir, permissions, name='')
Initialize FileInfo with file metadata.
- property is_file
Check if this is a file (not a directory).
- property modified_datetime
Get modified time as datetime object.
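
The two properties can be pictured as thin views over the constructor fields. The sketch below is illustrative only: `FileInfoSketch` is a hypothetical class, and the field types are assumptions (`modified` as seconds since the Unix epoch, `permissions` as a string), since this page lists only the field names.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class FileInfoSketch:
    size: int
    modified: float   # assumed: seconds since the Unix epoch
    is_dir: bool
    permissions: str  # assumed: a string such as "rw-r--r--"
    name: str = ""

    @property
    def is_file(self) -> bool:
        """A path is treated as a file when it is not a directory."""
        return not self.is_dir

    @property
    def modified_datetime(self) -> datetime:
        """Convert the timestamp to a datetime (UTC here, to stay deterministic)."""
        return datetime.fromtimestamp(self.modified, tz=timezone.utc)
```

The library may return a naive or local-time datetime instead; UTC is used in the sketch only so the result does not depend on the machine's time zone.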
Core Implementation
Usage Examples
Basic Operations
from clustrix import cluster_ls, cluster_find, cluster_stat
from clustrix.config import ClusterConfig
# Configure for remote cluster
config = ClusterConfig(
    cluster_type="slurm",
    cluster_host="cluster.edu",
    username="researcher",
    remote_work_dir="/scratch/project",
)
# List directory contents
files = cluster_ls("data/", config)
# Find CSV files recursively
csv_files = cluster_find("*.csv", "datasets/", config)
# Get file information
file_info = cluster_stat("large_dataset.h5", config)
print(f"Size: {file_info.size:,} bytes")
Data-Driven Workflows
from clustrix import cluster, cluster_glob, cluster_stat
# process_large_file and process_small_file are user-defined helpers, not part of clustrix
@cluster(cores=8)
def process_datasets(config):
    # Find all data files on the cluster
    data_files = cluster_glob("*.csv", "input/", config)
    results = []
    for filename in data_files:  # Loop gets parallelized automatically
        # Check file size before processing
        file_info = cluster_stat(filename, config)
        if file_info.size > 100_000_000:  # Large files (over 100 MB)
            result = process_large_file(filename, config)
        else:
            result = process_small_file(filename, config)
        results.append(result)
    return results
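
The size check in the workflow above can also be separated from the processing step, so that all metadata lookups happen before any work starts. A minimal sketch, assuming the sizes have already been collected as `(name, size)` pairs; `partition_by_size` is a hypothetical helper:

```python
from typing import Dict, Iterable, List, Tuple

LARGE_FILE_BYTES = 100_000_000  # same threshold as the workflow example


def partition_by_size(
    sized_files: Iterable[Tuple[str, int]],
    threshold: int = LARGE_FILE_BYTES,
) -> Dict[str, List[str]]:
    """Split (name, size) pairs into 'large' and 'small' groups."""
    groups: Dict[str, List[str]] = {"large": [], "small": []}
    for name, size in sized_files:
        # Strictly greater than the threshold counts as large.
        key = "large" if size > threshold else "small"
        groups[key].append(name)
    return groups
```

The pairs could be built with something like `[(f, cluster_stat(f, config).size) for f in data_files]`, after which each group can be handed to the matching processing function.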
Local vs Remote Operations
from clustrix import cluster_ls
from clustrix.config import ClusterConfig
# Local configuration
local_config = ClusterConfig(cluster_type="local", local_work_dir="./data")
# Remote configuration
remote_config = ClusterConfig(
    cluster_type="slurm",
    cluster_host="cluster.edu",
    username="researcher",
)
# Same function calls work for both
local_files = cluster_ls(".", local_config)
remote_files = cluster_ls(".", remote_config)
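Pattern Matching
from clustrix.filesystem import cluster_count_files, cluster_find, cluster_glob
# Find all Python files
py_files = cluster_find("*.py", "src/", config)
# Use glob patterns (brace alternation is not standard Python glob syntax; verify that your setup supports it)
data_files = cluster_glob("data_*.{csv,json}", "input/", config)
# Count files by type
total_files = cluster_count_files(".", "*", config)
python_files = cluster_count_files(".", "*.py", config)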
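
Python's own `glob` and `fnmatch` modules do not expand brace groups such as `{csv,json}`, and this page does not say whether `cluster_glob` does. If it turns out not to, one option is to expand the braces on the client side and issue one call per resulting pattern. The helpers below are a hypothetical sketch that handles non-nested groups only:

```python
import fnmatch
import re
from typing import List

_BRACE_GROUP = re.compile(r"\{([^{}]*)\}")


def expand_braces(pattern: str) -> List[str]:
    """Expand non-nested {a,b,...} groups into a list of plain patterns."""
    match = _BRACE_GROUP.search(pattern)
    if match is None:
        return [pattern]
    head, tail = pattern[:match.start()], pattern[match.end():]
    expanded: List[str] = []
    for option in match.group(1).split(","):
        # Recurse so that any later groups in the pattern are expanded too.
        expanded.extend(expand_braces(head + option + tail))
    return expanded


def matches_any(name: str, pattern: str) -> bool:
    """Check a file name against every expansion of `pattern`."""
    return any(fnmatch.fnmatch(name, p) for p in expand_braces(pattern))
```

With this, `expand_braces("data_*.{csv,json}")` yields `data_*.csv` and `data_*.json`, and each pattern can be passed to `cluster_glob` separately.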
Directory Usage Analysis
from clustrix.filesystem import cluster_du
# Get directory usage information
usage = cluster_du("/scratch/project", config)
print(f"Total size: {usage.total_gb:.2f} GB")
print(f"File count: {usage.file_count:,}")
# Skip the average for an empty directory to avoid dividing by zero
if usage.file_count:
    print(f"Average file size: {usage.total_mb / usage.file_count:.1f} MB")
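
The `total_mb` and `total_gb` attributes used above are presumably derived from `total_bytes`. A minimal sketch of such a structure follows; `DiskUsageSketch` and `average_mb` are hypothetical, and the binary units (1 MB = 1024² bytes) are an assumption, since this page does not say which convention the library uses.

```python
from dataclasses import dataclass


@dataclass
class DiskUsageSketch:
    total_bytes: int
    file_count: int

    @property
    def total_mb(self) -> float:
        """Total size in megabytes (assumed binary: 1024**2 bytes)."""
        return self.total_bytes / (1024 ** 2)

    @property
    def total_gb(self) -> float:
        """Total size in gigabytes (assumed binary: 1024**3 bytes)."""
        return self.total_bytes / (1024 ** 3)

    @property
    def average_mb(self) -> float:
        """Mean file size in MB, or 0.0 for an empty directory."""
        return self.total_mb / self.file_count if self.file_count else 0.0
```

Putting the empty-directory check inside a property keeps callers from having to repeat it at every call site.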
Error Handling
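from clustrix.filesystem import cluster_exists, cluster_stat
# FileNotFoundError is assumed here to be Python's built-in exception, which needs no import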
try:
    file_info = cluster_stat("nonexistent.txt", config)
except FileNotFoundError:
    print("File does not exist")
# Safe existence check
if cluster_exists("results/output.json", config):
    file_info = cluster_stat("results/output.json", config)
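
The two styles above differ in one respect: checking existence first leaves a window in which the file can disappear before the second call, and on a remote cluster it also costs an extra round trip. A small wrapper around the try/except form avoids both. `stat_or_none` is a hypothetical helper, and it assumes the stat function raises Python's built-in `FileNotFoundError`:

```python
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def stat_or_none(stat_fn: Callable[[str], T], path: str) -> Optional[T]:
    """Call `stat_fn(path)` and return None if the path does not exist."""
    try:
        return stat_fn(path)
    except FileNotFoundError:
        return None
```

Locally this can be tried with `stat_or_none(os.stat, "missing.txt")`; with clustrix the call might look like `stat_or_none(lambda p: cluster_stat(p, config), "results/output.json")`.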
Best Practices
Use config-driven execution: Pass ClusterConfig objects to enable local/remote switching
Check file existence: Use cluster_exists() before operations that assume file presence
Handle large directories carefully: Remote operations on large directories may be slow
Use appropriate patterns: Leverage cluster_find() and cluster_glob() for efficient file discovery
Cache results: Store file listings locally when processing many files
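
The caching advice can be sketched as a small memoizing wrapper around any listing function. `ListingCache` is a hypothetical class, not part of clustrix; it keeps results for the lifetime of the object and leaves invalidation to the caller.

```python
from typing import Callable, Dict, List


class ListingCache:
    """Remember directory listings so repeated lookups skip the remote call."""

    def __init__(self, list_fn: Callable[[str], List[str]]):
        self._list_fn = list_fn
        self._cache: Dict[str, List[str]] = {}

    def ls(self, path: str) -> List[str]:
        # Only call the underlying function the first time a path is seen.
        if path not in self._cache:
            self._cache[path] = list(self._list_fn(path))
        return self._cache[path]

    def invalidate(self, path: str) -> None:
        """Drop a cached listing, for example after writing new files."""
        self._cache.pop(path, None)
```

A cache for a remote cluster could be created with `ListingCache(lambda p: cluster_ls(p, config))`. Cached listings go stale once jobs write new files, so call `invalidate()` after any step that changes a directory.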
See Also
Filesystem Utilities Tutorial - Comprehensive tutorial with examples
Configuration API - Configuration management
Decorator API - Using filesystem utilities with the @cluster decorator