Filesystem Utilities

Unified filesystem operations for local and remote clusters.

This module provides a consistent interface for filesystem operations that run either locally or on a remote cluster, as determined by the ClusterConfig object.


clustrix.filesystem.cluster_ls(path='.', config=None)

List directory contents locally or remotely based on config.

Return type: List[str]

clustrix.filesystem.cluster_find(pattern, path='.', config=None)

Find files matching pattern locally or remotely based on config.

Return type: List[str]

clustrix.filesystem.cluster_stat(path, config=None)

Get file information locally or remotely based on config.

Return type: FileInfo

clustrix.filesystem.cluster_exists(path, config=None)

Check if file/directory exists locally or remotely based on config.

Return type: bool

clustrix.filesystem.cluster_isdir(path, config=None)

Check if path is directory locally or remotely based on config.

Return type: bool

clustrix.filesystem.cluster_isfile(path, config=None)

Check if path is file locally or remotely based on config.

Return type: bool

clustrix.filesystem.cluster_glob(pattern, path='.', config=None)

Pattern matching for files locally or remotely based on config.

Return type: List[str]

clustrix.filesystem.cluster_du(path='.', config=None)

Get directory usage locally or remotely based on config.

Return type: DiskUsage

clustrix.filesystem.cluster_count_files(path='.', pattern='*', config=None)

Count files matching pattern locally or remotely based on config.

Return type: int

Overview

The filesystem utilities module provides a single interface for filesystem operations on local machines and remote clusters. Every operation uses the same API whether you're working locally or on a remote cluster.

Key Features

  • Unified API: Same function calls work locally and remotely

  • Automatic SSH Management: Transparent connection handling for remote operations

  • Path Normalization: Consistent path handling across platforms

  • Data Structures: Structured returns via FileInfo and DiskUsage classes

  • Config-Driven: Uses ClusterConfig to determine local vs remote execution

Core Functions

Directory Operations

clustrix.cluster_ls(path='.', config=None)

List directory contents locally or remotely based on config.

Return type: List[str]

clustrix.cluster_find(pattern, path='.', config=None)

Find files matching pattern locally or remotely based on config.

Return type: List[str]

clustrix.cluster_glob(pattern, path='.', config=None)

Pattern matching for files locally or remotely based on config.

Return type: List[str]

clustrix.cluster_count_files(path='.', pattern='*', config=None)

Count files matching pattern locally or remotely based on config.

Return type: int

File Operations

clustrix.cluster_stat(path, config=None)

Get file information locally or remotely based on config.

Return type: FileInfo

clustrix.cluster_exists(path, config=None)

Check if file/directory exists locally or remotely based on config.

Return type: bool

clustrix.cluster_isdir(path, config=None)

Check if path is directory locally or remotely based on config.

Return type: bool

clustrix.cluster_isfile(path, config=None)

Check if path is file locally or remotely based on config.

Return type: bool

Storage Operations

clustrix.cluster_du(path='.', config=None)

Get directory usage locally or remotely based on config.

Return type: DiskUsage

Data Classes

class clustrix.filesystem.FileInfo(size, modified, is_dir, permissions, name='')

File information structure.

__init__(size, modified, is_dir, permissions, name='')

Initialize FileInfo with file metadata.

property is_file

Check if this is a file (not a directory).

property modified_datetime

Get modified time as datetime object.

__repr__()

String representation of FileInfo.

__eq__(other)

Check equality with another FileInfo object.

class clustrix.filesystem.DiskUsage(total_bytes, file_count)

Disk usage information.

__init__(total_bytes, file_count)

Initialize DiskUsage with usage statistics.

property total_mb: float

Total size in megabytes.

property total_gb: float

Total size in gigabytes.

__repr__()

String representation of DiskUsage.

__eq__(other)

Check equality with another DiskUsage object.
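
To make the shape of these two classes concrete, here is a minimal stand-in built on standard-library dataclasses. It is not the clustrix source. The class names FileInfoSketch and DiskUsageSketch are hypothetical, and the field types (a Unix timestamp for modified, a string for permissions) and the binary unit convention (1 MB = 1024 * 1024 bytes) are assumptions made for illustration only.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class FileInfoSketch:
    """Hypothetical stand-in mirroring the documented FileInfo fields."""
    size: int          # size in bytes
    modified: float    # assumption: Unix timestamp in seconds
    is_dir: bool
    permissions: str   # assumption: a string such as "rw-r--r--"
    name: str = ""

    @property
    def is_file(self) -> bool:
        # Treated as a file whenever it is not a directory.
        return not self.is_dir

    @property
    def modified_datetime(self) -> datetime:
        # Convert the assumed Unix timestamp to a local datetime.
        return datetime.fromtimestamp(self.modified)


@dataclass
class DiskUsageSketch:
    """Hypothetical stand-in mirroring the documented DiskUsage fields."""
    total_bytes: int
    file_count: int

    @property
    def total_mb(self) -> float:
        return self.total_bytes / 1024 ** 2   # assumption: binary units

    @property
    def total_gb(self) -> float:
        return self.total_bytes / 1024 ** 3   # assumption: binary units


info = FileInfoSketch(size=2048, modified=0.0, is_dir=False,
                      permissions="rw-r--r--", name="a.csv")
usage = DiskUsageSketch(total_bytes=3 * 1024 ** 3, file_count=12)
print(info.is_file, usage.total_mb, usage.total_gb)
```

The dataclass decorator generates __repr__ and __eq__ automatically, which is one simple way to get the comparison and display behaviour listed above.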

Core Implementation

class clustrix.filesystem.ClusterFilesystem(config)

Bases: object

Unified filesystem operations for local and remote clusters.

__init__(config)

Initialize filesystem with cluster configuration.

__del__()

Clean up SSH connections.

ls(path='.')

List directory contents.

Return type: List[str]

find(pattern, path='.')

Find files matching pattern.

Return type: List[str]

stat(path)

Get file/directory information.

Return type: FileInfo

exists(path)

Check if file/directory exists.

Return type: bool

isdir(path)

Check if path is a directory.

Return type: bool

isfile(path)

Check if path is a file.

Return type: bool

glob(pattern, path='.')

Pattern matching for files.

Return type: List[str]

du(path='.')

Get directory usage information.

Return type: DiskUsage

count_files(path='.', pattern='*')

Count files in directory matching pattern.

Return type: int
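
The config-driven dispatch described for this class can be pictured with a small sketch. This is not how ClusterFilesystem is actually implemented: SketchConfig, FilesystemSketch, and the run_remote callable are all hypothetical names, and the remote branch assumes a plain `ls -1` shell command executed through some SSH-like runner. Only the idea of choosing a local or remote code path from the config's cluster_type comes from the documentation above.

```python
import os
import shlex
import tempfile
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class SketchConfig:
    """Hypothetical stand-in for ClusterConfig with only the field used here."""
    cluster_type: str = "local"


class FilesystemSketch:
    """Runs ls() locally or through an injected remote command runner."""

    def __init__(self, config: SketchConfig,
                 run_remote: Optional[Callable[[str], str]] = None):
        self.config = config
        # run_remote stands in for an SSH session: it receives a shell
        # command and returns that command's stdout as text.
        self.run_remote = run_remote

    def ls(self, path: str = ".") -> List[str]:
        if self.config.cluster_type == "local":
            return sorted(os.listdir(path))
        if self.run_remote is None:
            raise RuntimeError("remote config given but no command runner supplied")
        # Quote the path so spaces or shell metacharacters are passed literally.
        output = self.run_remote(f"ls -1 {shlex.quote(path)}")
        return sorted(line for line in output.splitlines() if line)


# Local branch: list a temporary directory containing one file.
with tempfile.TemporaryDirectory() as tmp:
    open(os.path.join(tmp, "a.txt"), "w").close()
    print(FilesystemSketch(SketchConfig("local")).ls(tmp))

# Remote branch: a fake runner replaces the SSH connection.
remote_fs = FilesystemSketch(SketchConfig("slurm"),
                             run_remote=lambda command: "b.txt\na.txt\n")
print(remote_fs.ls("/scratch"))
```

Injecting the command runner keeps the sketch testable without a cluster; the real class manages its own SSH connections, as the __del__ description indicates.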

Usage Examples

Basic Operations

from clustrix import cluster_ls, cluster_find, cluster_stat
from clustrix.config import ClusterConfig

# Configure for remote cluster
config = ClusterConfig(
    cluster_type="slurm",
    cluster_host="cluster.edu",
    username="researcher",
    remote_work_dir="/scratch/project"
)

# List directory contents
files = cluster_ls("data/", config)

# Find CSV files recursively
csv_files = cluster_find("*.csv", "datasets/", config)

# Get file information
file_info = cluster_stat("large_dataset.h5", config)
print(f"Size: {file_info.size:,} bytes")

Data-Driven Workflows

from clustrix import cluster, cluster_glob, cluster_stat

# process_large_file and process_small_file are placeholders for your own
# processing functions; they are not part of clustrix.

@cluster(cores=8)
def process_datasets(config):
    # Find all data files on the cluster
    data_files = cluster_glob("*.csv", "input/", config)

    results = []
    for filename in data_files:  # Loop gets parallelized automatically
        # Check file size before processing
        file_info = cluster_stat(filename, config)
        if file_info.size > 100_000_000:  # Files larger than 100 MB
            result = process_large_file(filename, config)
        else:
            result = process_small_file(filename, config)
        results.append(result)

    return results
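
The size-based routing in this workflow can also be done up front, which separates the metadata lookups from the processing. The sketch below is an illustration under assumptions: partition_by_size and get_size are hypothetical names, and get_size stands in for something like `lambda f: cluster_stat(f, config).size`. Fake sizes are used so the example runs without a cluster.

```python
from typing import Callable, Iterable, List, Tuple


def partition_by_size(
    filenames: Iterable[str],
    get_size: Callable[[str], int],
    threshold: int = 100_000_000,
) -> Tuple[List[str], List[str]]:
    """Split filenames into (large, small) around a byte threshold."""
    large: List[str] = []
    small: List[str] = []
    for name in filenames:
        # Strictly greater than the threshold counts as large, matching
        # the `size > 100_000_000` test in the workflow.
        if get_size(name) > threshold:
            large.append(name)
        else:
            small.append(name)
    return large, small


# Fake sizes in bytes stand in for cluster_stat(name, config).size
sizes = {"a.csv": 250_000_000, "b.csv": 4_096, "c.csv": 100_000_000}
large, small = partition_by_size(sizes, sizes.__getitem__)
print(large, small)
```

A file of exactly 100,000,000 bytes lands in the small group because the comparison is strict.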

Local vs Remote Operations

from clustrix import cluster_ls
from clustrix.config import ClusterConfig

# Local configuration
local_config = ClusterConfig(cluster_type="local", local_work_dir="./data")

# Remote configuration
remote_config = ClusterConfig(
    cluster_type="slurm",
    cluster_host="cluster.edu",
    username="researcher"
)

# Same function calls work for both
local_files = cluster_ls(".", local_config)
remote_files = cluster_ls(".", remote_config)

Pattern Matching

from clustrix import cluster_count_files, cluster_find, cluster_glob

# config is the ClusterConfig created in Basic Operations

# Find all Python files
py_files = cluster_find("*.py", "src/", config)

# Use glob patterns
data_files = cluster_glob("data_*.{csv,json}", "input/", config)

# Count files by type
total_files = cluster_count_files(".", "*", config)
python_files = cluster_count_files(".", "*.py", config)
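
This page does not say whether cluster_glob expands brace alternatives such as `{csv,json}`; Python's own fnmatch and glob modules do not. If brace patterns turn out not to be supported, one option is to expand them on the client side and issue one call per resulting pattern. The helper below is a hypothetical sketch (expand_braces is not a clustrix function) that handles flat, non-nested groups only.

```python
import fnmatch
import re
from typing import List


def expand_braces(pattern: str) -> List[str]:
    """Expand flat {a,b,...} groups into separate patterns (no nesting)."""
    match = re.search(r"\{([^{}]*)\}", pattern)
    if match is None:
        return [pattern]
    head, tail = pattern[:match.start()], pattern[match.end():]
    expanded: List[str] = []
    for option in match.group(1).split(","):
        # Recurse so that later groups in the tail are expanded as well.
        expanded.extend(expand_braces(head + option + tail))
    return expanded


patterns = expand_braces("data_*.{csv,json}")
print(patterns)

# Apply the expanded patterns to a list of names with case-sensitive matching.
names = ["data_1.csv", "data_2.json", "data_3.txt", "notes.csv"]
matched = [n for n in names
           if any(fnmatch.fnmatchcase(n, p) for p in patterns)]
print(matched)
```

With a real cluster, each expanded pattern could be passed to cluster_glob in turn and the results concatenated.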

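Directory Usage Analysis

from clustrix import cluster_du

# Get directory usage information (config as created in Basic Operations)
usage = cluster_du("/scratch/project", config)
print(f"Total size: {usage.total_gb:.2f} GB")
print(f"File count: {usage.file_count:,}")
if usage.file_count:  # Avoid dividing by zero for an empty directory
    print(f"Average file size: {usage.total_mb / usage.file_count:.1f} MB")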

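Error Handling

from clustrix import cluster_exists, cluster_stat

# FileNotFoundError is Python's built-in exception, so no import is needed;
# the module reference lists no FileNotFoundError class in clustrix.filesystem.
try:
    file_info = cluster_stat("nonexistent.txt", config)
except FileNotFoundError:
    print("File does not exist")

# Safe existence check
if cluster_exists("results/output.json", config):
    file_info = cluster_stat("results/output.json", config)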
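
Checking existence and then calling stat takes two calls, and the file can disappear between them. A single try/except avoids both issues. The wrapper below is a hypothetical helper, not part of clustrix; it assumes, as the example above suggests, that the stat function raises the built-in FileNotFoundError for a missing path. fake_stat stands in for `lambda p: cluster_stat(p, config)` so the sketch runs on its own.

```python
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def stat_or_none(stat_fn: Callable[[str], T], path: str) -> Optional[T]:
    """Return stat_fn(path), or None if the path does not exist."""
    try:
        return stat_fn(path)
    except FileNotFoundError:
        return None


def fake_stat(path: str) -> dict:
    # Stand-in for cluster_stat: knows about exactly one file.
    if path != "results/output.json":
        raise FileNotFoundError(path)
    return {"size": 128}


print(stat_or_none(fake_stat, "results/output.json"))
print(stat_or_none(fake_stat, "nonexistent.txt"))
```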

Best Practices

  1. Use config-driven execution: Pass ClusterConfig objects to enable local/remote switching

  2. Check file existence: Use cluster_exists() before operations that assume file presence

  3. Handle large directories carefully: Remote operations on large directories may be slow

  4. Use appropriate patterns: Leverage cluster_find() and cluster_glob() for efficient file discovery

  5. Cache results: Store file listings locally when processing many files
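
The caching advice in item 5 can be sketched as a small wrapper that lists each directory at most once. ListingCache is a hypothetical helper, not a clustrix class; list_fn stands in for something like `lambda p: cluster_ls(p, config)`, and a fake listing function is used here so the example runs without a cluster.

```python
from typing import Callable, Dict, List


class ListingCache:
    """Remembers directory listings so each path is listed at most once."""

    def __init__(self, list_fn: Callable[[str], List[str]]):
        self._list_fn = list_fn
        self._listings: Dict[str, List[str]] = {}

    def ls(self, path: str) -> List[str]:
        if path not in self._listings:
            self._listings[path] = self._list_fn(path)
        # Return a copy so callers cannot modify the cached list.
        return list(self._listings[path])

    def invalidate(self, path: str) -> None:
        # Forget one path, for example after writing new files into it.
        self._listings.pop(path, None)


calls: List[str] = []


def fake_ls(path: str) -> List[str]:
    # Records every call so the effect of caching is visible.
    calls.append(path)
    return ["a.csv", "b.csv"]


cache = ListingCache(fake_ls)
cache.ls("input/")
cache.ls("input/")
print(len(calls))
```

A cached listing can go stale when jobs create or delete files, so invalidate a path after any step that changes its contents.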

See Also