Filesystem Utilities

Unified filesystem operations for local and remote clusters.

This module provides a consistent interface for filesystem operations. The same calls run either locally or on a remote cluster, depending on the ClusterConfig object they receive.

class clustrix.filesystem.FileInfo(size, modified, is_dir, permissions, name='')

Bases: object

File information structure.

__init__(size, modified, is_dir, permissions, name=''): Initialize FileInfo with file metadata.

property is_file: Check if this is a file (not a directory).

property modified_datetime: Get modified time as datetime object.

__repr__(): String representation of FileInfo.

__eq__(other): Check equality with another FileInfo object.
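
To make the fields and derived properties listed for FileInfo concrete, here is a minimal stand-in, not the library's implementation. The class name FileInfoSketch is hypothetical, and two details are assumptions that the reference above does not state: that modified is a POSIX timestamp in seconds, and that permissions is a mode string such as "rw-r--r--".

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class FileInfoSketch:
    """Hypothetical stand-in mirroring the FileInfo fields listed above."""

    size: int
    modified: float  # assumption: seconds since the Unix epoch
    is_dir: bool
    permissions: str  # assumption: a mode string such as "rw-r--r--"
    name: str = ""

    @property
    def is_file(self) -> bool:
        # Anything that is not a directory is treated as a file.
        return not self.is_dir

    @property
    def modified_datetime(self) -> datetime:
        # Convert the timestamp to a naive datetime in local time.
        return datetime.fromtimestamp(self.modified)


info = FileInfoSketch(
    size=2048,
    modified=1_700_000_000.0,
    is_dir=False,
    permissions="rw-r--r--",
    name="a.csv",
)
```

Using a dataclass gives field-by-field equality and a readable repr for free, which matches the __eq__ and __repr__ entries in spirit.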

class clustrix.filesystem.DiskUsage(total_bytes, file_count)

Bases: object

Disk usage information.

__init__(total_bytes, file_count): Initialize DiskUsage with usage statistics.

property total_mb: float: Total size in megabytes.

property total_gb: float: Total size in gigabytes.

__repr__(): String representation of DiskUsage.

__eq__(other): Check equality with another DiskUsage object.
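
The unit conversions behind total_mb and total_gb can be sketched as follows. This is an illustration only: DiskUsageSketch is a hypothetical name, and the use of binary units (1024-based) is an assumption, since the reference above does not say whether megabytes are 10^6 or 2^20 bytes.

```python
from dataclasses import dataclass


@dataclass
class DiskUsageSketch:
    """Hypothetical stand-in mirroring the DiskUsage fields listed above."""

    total_bytes: int
    file_count: int

    @property
    def total_mb(self) -> float:
        # Assumption: binary megabytes (1024 * 1024 bytes).
        return self.total_bytes / (1024 * 1024)

    @property
    def total_gb(self) -> float:
        # Assumption: binary gigabytes (1024 ** 3 bytes).
        return self.total_bytes / (1024 ** 3)


usage = DiskUsageSketch(total_bytes=3 * 1024 ** 3, file_count=12)
```

If exact units matter for a report, it is worth checking the library source rather than relying on this sketch.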

class clustrix.filesystem.ClusterFilesystem(config)

Bases: object

Unified filesystem operations for local and remote clusters.

__init__(config): Initialize filesystem with cluster configuration.

__del__(): Clean up SSH connections.

ls(path='.')

List directory contents.

Return type: List[str]

find(pattern, path='.')

Find files matching pattern.

Return type: List[str]

stat(path)

Get file/directory information.

Return type: FileInfo

exists(path)

Check if file/directory exists.

Return type: bool

isdir(path)

Check if path is a directory.

Return type: bool

isfile(path)

Check if path is a file.

Return type: bool

glob(pattern, path='.')

Pattern matching for files.

Return type: List[str]

du(path='.')

Get directory usage information.

Return type: DiskUsage

count_files(path='.', pattern='*')

Count files in directory matching pattern.

Return type: int
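
For the local case, operations such as du and count_files map naturally onto the standard library. The sketch below shows what they could compute for a local directory using pathlib. The names local_du and local_count_files are hypothetical, and whether the library walks subdirectories or counts only regular files is an assumption made here for illustration.

```python
import tempfile
from pathlib import Path
from typing import Tuple


def local_du(path: str = ".") -> Tuple[int, int]:
    """Return (total_bytes, file_count) for all regular files under path."""
    total_bytes = 0
    file_count = 0
    for entry in Path(path).rglob("*"):  # assumption: recursive walk
        if entry.is_file():
            total_bytes += entry.stat().st_size
            file_count += 1
    return total_bytes, file_count


def local_count_files(path: str = ".", pattern: str = "*") -> int:
    """Count regular files under path whose name matches pattern."""
    return sum(1 for entry in Path(path).rglob(pattern) if entry.is_file())


# Demonstration on a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "a.csv").write_bytes(b"12345")
    (root / "sub").mkdir()
    (root / "sub" / "b.txt").write_bytes(b"123")
    du_result = local_du(tmp)
    csv_count = local_count_files(tmp, "*.csv")
```

A remote backend would have to produce the same numbers from shell commands run over SSH, which is why structured return types are useful: callers do not need to parse command output themselves.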

clustrix.filesystem.cluster_ls(path='.', config=None)

List directory contents locally or remotely based on config.

Return type: List[str]

clustrix.filesystem.cluster_find(pattern, path='.', config=None)

Find files matching pattern locally or remotely based on config.

Return type: List[str]

clustrix.filesystem.cluster_stat(path, config=None)

Get file information locally or remotely based on config.

Return type: FileInfo

clustrix.filesystem.cluster_exists(path, config=None)

Check if file/directory exists locally or remotely based on config.

Return type: bool

clustrix.filesystem.cluster_isdir(path, config=None)

Check if path is a directory locally or remotely based on config.

Return type: bool

clustrix.filesystem.cluster_isfile(path, config=None)

Check if path is a file locally or remotely based on config.

Return type: bool

clustrix.filesystem.cluster_glob(pattern, path='.', config=None)

Pattern matching for files locally or remotely based on config.

Return type: List[str]

clustrix.filesystem.cluster_du(path='.', config=None)

Get directory usage locally or remotely based on config.

Return type: DiskUsage

clustrix.filesystem.cluster_count_files(path='.', pattern='*', config=None)

Count files matching pattern locally or remotely based on config.

Return type: int

Overview

The filesystem utilities module provides a single interface for filesystem operations on local machines and remote clusters. Every operation has the same signature in both cases; the ClusterConfig passed to it determines where it runs.

Key Features

Unified API: Same function calls work locally and remotely
Automatic SSH Management: Transparent connection handling for remote operations
Path Normalization: Consistent path handling across platforms
Data Structures: Structured returns via FileInfo and DiskUsage classes
Config-Driven: Uses ClusterConfig to determine local vs remote execution
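
The config-driven dispatch described in the list above can be sketched in a few lines. This is not the library's code: SketchConfig, SketchFilesystem, and sketch_ls are hypothetical names, the only field modeled is cluster_type, and the remote branch is deliberately left unimplemented.

```python
import os
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SketchConfig:
    """Hypothetical stand-in for ClusterConfig with only the field used here."""

    cluster_type: str = "local"


class SketchFilesystem:
    def __init__(self, config: SketchConfig):
        self.config = config

    def ls(self, path: str = ".") -> List[str]:
        # One public method; the config decides which backend runs.
        if self.config.cluster_type == "local":
            return self._local_ls(path)
        return self._remote_ls(path)

    def _local_ls(self, path: str) -> List[str]:
        return sorted(os.listdir(path))

    def _remote_ls(self, path: str) -> List[str]:
        # A real implementation would run a listing command over SSH.
        raise NotImplementedError("remote backend omitted from this sketch")


def sketch_ls(path: str = ".", config: Optional[SketchConfig] = None) -> List[str]:
    """Module-level convenience wrapper in the style of the cluster_* functions."""
    return SketchFilesystem(config or SketchConfig()).ls(path)


listing = sketch_ls(".")
```

Keeping the branch inside each public method means calling code never inspects the config itself, which is what lets the same script run unchanged against a local directory or a cluster.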

Core Functions

Directory Operations

clustrix.cluster_ls(path='.', config=None)

List directory contents locally or remotely based on config.

Return type: List[str]

clustrix.cluster_find(pattern, path='.', config=None)

Find files matching pattern locally or remotely based on config.

Return type: List[str]

clustrix.cluster_glob(pattern, path='.', config=None)

Pattern matching for files locally or remotely based on config.

Return type: List[str]

clustrix.cluster_count_files(path='.', pattern='*', config=None)

Count files matching pattern locally or remotely based on config.

Return type: int

File Operations

clustrix.cluster_stat(path, config=None)

Get file information locally or remotely based on config.

Return type: FileInfo

clustrix.cluster_exists(path, config=None)

Check if file/directory exists locally or remotely based on config.

Return type: bool

clustrix.cluster_isdir(path, config=None)

Check if path is a directory locally or remotely based on config.

Return type: bool

clustrix.cluster_isfile(path, config=None)

Check if path is a file locally or remotely based on config.

Return type: bool

Storage Operations

clustrix.cluster_du(path='.', config=None)

Get directory usage locally or remotely based on config.

Return type: DiskUsage

Usage Examples

Basic Operations

from clustrix import cluster_ls, cluster_find, cluster_stat
from clustrix.config import ClusterConfig

# Configure for remote cluster
config = ClusterConfig(
    cluster_type="slurm",
    cluster_host="cluster.edu",
    username="researcher",
    remote_work_dir="/scratch/project"
)

# List directory contents
files = cluster_ls("data/", config)

# Find CSV files recursively
csv_files = cluster_find("*.csv", "datasets/", config)

# Get file information
file_info = cluster_stat("large_dataset.h5", config)
print(f"Size: {file_info.size:,} bytes")

Data-Driven Workflows

from clustrix import cluster, cluster_glob, cluster_stat

# process_large_file and process_small_file are user-defined placeholders.
@cluster(cores=8)
def process_datasets(config):
    # Find all data files on the cluster
    data_files = cluster_glob("*.csv", "input/", config)

    results = []
    for filename in data_files:  # Loop gets parallelized automatically
        # Check file size before processing
        file_info = cluster_stat(filename, config)
        if file_info.size > 100_000_000:  # Larger than 100 MB
            result = process_large_file(filename, config)
        else:
            result = process_small_file(filename, config)
        results.append(result)

    return results

Local vs Remote Operations

from clustrix import cluster_ls
from clustrix.config import ClusterConfig

# Local configuration
local_config = ClusterConfig(cluster_type="local", local_work_dir="./data")

# Remote configuration
remote_config = ClusterConfig(
    cluster_type="slurm",
    cluster_host="cluster.edu",
    username="researcher"
)

# Same function calls work for both
local_files = cluster_ls(".", local_config)
remote_files = cluster_ls(".", remote_config)

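Pattern Matching

from clustrix import cluster_find, cluster_glob, cluster_count_files

# Find all Python files
py_files = cluster_find("*.py", "src/", config)

# Use glob patterns. Brace alternation ({csv,json}) is not part of Python's
# standard glob/fnmatch syntax, so confirm that the backend supports it.
data_files = cluster_glob("data_*.{csv,json}", "input/", config)

# Count files by type
total_files = cluster_count_files(".", "*", config)
python_files = cluster_count_files(".", "*.py", config)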
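
Python's own glob and fnmatch modules do not expand brace groups such as {csv,json}, and this page does not say whether cluster_glob does. A portable fallback is to expand the braces into separate patterns first and run one glob call per pattern. The helper below is a sketch of that idea; expand_braces is a hypothetical name and it handles only non-nested groups.

```python
import re
from typing import List


def expand_braces(pattern: str) -> List[str]:
    """Expand the first {a,b,...} group, then recurse on each result."""
    match = re.search(r"\{([^{}]*)\}", pattern)
    if match is None:
        return [pattern]
    head, tail = pattern[: match.start()], pattern[match.end():]
    expanded: List[str] = []
    for option in match.group(1).split(","):
        expanded.extend(expand_braces(head + option + tail))
    return expanded


patterns = expand_braces("data_*.{csv,json}")
```

Each entry in patterns could then be passed to cluster_glob in turn and the results concatenated.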

Directory Usage Analysis

from clustrix import cluster_du

# Get directory usage information
usage = cluster_du("/scratch/project", config)
print(f"Total size: {usage.total_gb:.2f} GB")
print(f"File count: {usage.file_count:,}")
# Guard against division by zero for empty directories
if usage.file_count:
    print(f"Average file size: {usage.total_mb / usage.file_count:.1f} MB")

Error Handling

from clustrix import cluster_exists, cluster_stat

# FileNotFoundError is Python's built-in exception; it needs no import.
try:
    file_info = cluster_stat("nonexistent.txt", config)
except FileNotFoundError:
    print("File does not exist")

# Safe existence check
if cluster_exists("results/output.json", config):
    file_info = cluster_stat("results/output.json", config)
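
Checking for existence and then calling stat costs two round trips on a remote cluster, and the file can disappear between the two calls. A single call wrapped in exception handling avoids both problems. The helper below is a sketch of that pattern; stat_or_none is a hypothetical name, and os.stat stands in for cluster_stat so the example runs without a cluster. It assumes the stat function raises the built-in FileNotFoundError for missing paths, as the example above implies.

```python
import os
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def stat_or_none(stat_fn: Callable[[str], T], path: str) -> Optional[T]:
    """Call stat_fn(path) once; return None if the path does not exist."""
    try:
        return stat_fn(path)
    except FileNotFoundError:
        return None


missing = stat_or_none(os.stat, "definitely_missing_file_for_sketch.txt")
present = stat_or_none(os.stat, ".")
```

With clustrix, the same helper could be called as stat_or_none(lambda p: cluster_stat(p, config), "results/output.json").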

Best Practices

Use config-driven execution: Pass ClusterConfig objects to enable local/remote switching
Check file existence: Use cluster_exists() before operations that assume file presence
Handle large directories carefully: Remote operations on large directories may be slow
Use appropriate patterns: Leverage cluster_find() and cluster_glob() for efficient file discovery
Cache results: Store file listings locally when processing many files
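
The caching advice above can be sketched as a thin wrapper that remembers each directory listing after the first fetch. ListingCache is a hypothetical helper, not part of clustrix; it accepts any function that maps a path to a list of names, and os.listdir stands in for a remote listing call here.

```python
import os
from typing import Callable, Dict, List, Tuple


class ListingCache:
    """Memoize directory listings so each path is fetched at most once."""

    def __init__(self, list_fn: Callable[[str], List[str]]):
        self._list_fn = list_fn
        self._cache: Dict[str, Tuple[str, ...]] = {}
        self.fetches = 0  # number of calls that actually reached list_fn

    def ls(self, path: str) -> List[str]:
        if path not in self._cache:
            self._cache[path] = tuple(self._list_fn(path))
            self.fetches += 1
        # Return a copy so callers cannot mutate the cached listing.
        return list(self._cache[path])

    def invalidate(self, path: str) -> None:
        # Drop a cached entry after the directory is known to have changed.
        self._cache.pop(path, None)


cache = ListingCache(os.listdir)
first = cache.ls(".")
second = cache.ls(".")
```

For a remote cluster, the wrapped function could be lambda p: cluster_ls(p, config). Cached listings go stale when jobs write new files, so invalidate a path before reading results that a job may have just produced.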
