{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Filesystem Utilities Tutorial\n", "\n", "This notebook demonstrates how to use Clustrix's unified filesystem utilities for seamless file operations across local and remote clusters.\n", "\n", "## Overview\n", "\n", "Clustrix provides a set of filesystem utilities that work identically whether you're operating on local files or files on remote clusters. This enables data-driven cluster computing workflows where your code can discover, analyze, and process files without worrying about whether they're local or remote." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Import the filesystem utilities\n", "from clustrix import (\n", " cluster_ls,\n", " cluster_find,\n", " cluster_stat,\n", " cluster_exists,\n", " cluster_isdir,\n", " cluster_isfile,\n", " cluster_glob,\n", " cluster_du,\n", " cluster_count_files,\n", " cluster\n", ")\n", "from clustrix.config import ClusterConfig\n", "\n", "print(\"āœ… Clustrix filesystem utilities imported successfully!\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Configuration\n", "\n", "First, let's set up configurations for both local and remote operations:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Local configuration - works with files on your local machine\n", "local_config = ClusterConfig(\n", " cluster_type=\"local\",\n", " local_work_dir=\".\" # Current directory\n", ")\n", "\n", "# Remote cluster configuration - replace with your cluster details\n", "remote_config = ClusterConfig(\n", " cluster_type=\"slurm\",\n", " cluster_host=\"cluster.example.edu\",\n", " username=\"researcher\",\n", " remote_work_dir=\"/scratch/project\"\n", ")\n", "\n", "# For this demo, we'll use local_config\n", "config = local_config\n", "print(f\"Using config: {config.cluster_type}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 
Basic Operations\n", "\n", "### Directory Listing\n", "\n", "List files and directories:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# List files in current directory\n", "files = cluster_ls(\".\", config)\n", "print(f\"Found {len(files)} items in current directory:\")\n", "for file in files[:10]:  # Show first 10\n", "    print(f\" - {file}\")\n", "if len(files) > 10:\n", "    print(f\" ... and {len(files) - 10} more\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### File Discovery\n", "\n", "Find files by pattern:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Find Python files recursively\n", "py_files = cluster_find(\"*.py\", \".\", config)\n", "print(f\"Found {len(py_files)} Python files:\")\n", "for file in py_files[:5]:\n", "    print(f\" - {file}\")\n", "\n", "# Find Jupyter notebooks\n", "notebooks = cluster_find(\"*.ipynb\", \".\", config)\n", "print(f\"\\nFound {len(notebooks)} Jupyter notebooks:\")\n", "for file in notebooks[:3]:\n", "    print(f\" - {file}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### File Information\n", "\n", "Get detailed information about files:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Check if README exists and get its info\n", "readme_files = cluster_find(\"README*\", \".\", config)\n", "if readme_files:\n", "    readme = readme_files[0]\n", "    print(f\"Found README: {readme}\")\n", "\n", "    # Get detailed file information\n", "    file_info = cluster_stat(readme, config)\n", "    print(f\" Size: {file_info.size:,} bytes\")\n", "    print(f\" Modified: {file_info.modified_datetime}\")\n", "    print(f\" Is directory: {file_info.is_dir}\")\n", "    print(f\" Permissions: {file_info.permissions}\")\n", "else:\n", "    print(\"No README file found\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### File Existence and Type Checking" ] }, { 
"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Check common files/directories\n", "paths_to_check = [\"setup.py\", \"requirements.txt\", \"docs\", \"tests\", \"src\", \"clustrix\"]\n", "\n", "for path in paths_to_check:\n", " if cluster_exists(path, config):\n", " if cluster_isdir(path, config):\n", " print(f\"šŸ“ {path} (directory)\")\n", " elif cluster_isfile(path, config):\n", " print(f\"šŸ“„ {path} (file)\")\n", " else:\n", " print(f\"āŒ {path} (not found)\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Pattern Matching with Glob" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Use glob patterns for flexible file matching\n", "patterns = {\n", " \"Python files\": \"*.py\",\n", " \"Config files\": \"*.{yml,yaml,json,toml}\",\n", " \"Documentation\": \"*.{md,rst,txt}\",\n", " \"Test files\": \"test_*.py\"\n", "}\n", "\n", "for name, pattern in patterns.items():\n", " matches = cluster_glob(pattern, \".\", config)\n", " print(f\"{name}: {len(matches)} files\")\n", " if matches:\n", " print(f\" Examples: {', '.join(matches[:3])}\")\n", " print()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Directory Usage Analysis" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Analyze current directory usage\n", "usage = cluster_du(\".\", config)\n", "print(f\"šŸ“Š Directory Usage Analysis:\")\n", "print(f\" Total size: {usage.total_mb:.1f} MB ({usage.total_gb:.3f} GB)\")\n", "print(f\" File count: {usage.file_count:,}\")\n", "if usage.file_count > 0:\n", " avg_size = usage.total_mb / usage.file_count\n", " print(f\" Average file size: {avg_size:.2f} MB\")\n", "\n", "# Count files by type\n", "total_files = cluster_count_files(\".\", \"*\", config)\n", "python_files = cluster_count_files(\".\", \"*.py\", config)\n", "notebook_files = cluster_count_files(\".\", \"*.ipynb\", config)\n", "\n", 
"print(f\"\\nšŸ“ˆ File Counts:\")\n", "print(f\" Total files: {total_files:,}\")\n", "print(f\" Python files: {python_files:,}\")\n", "print(f\" Notebooks: {notebook_files:,}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Data-Driven Workflows\n", "\n", "The real power comes when combining filesystem utilities with the `@cluster` decorator:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "@cluster(cores=2) # Use 2 cores for this example\n", "def analyze_python_files(config):\n", " \"\"\"Analyze all Python files in the project.\"\"\"\n", " \n", " # Find all Python files\n", " py_files = cluster_find(\"*.py\", \".\", config)\n", " print(f\"Found {len(py_files)} Python files to analyze\")\n", " \n", " results = {\n", " 'total_files': len(py_files),\n", " 'total_lines': 0,\n", " 'total_size': 0,\n", " 'large_files': [],\n", " 'file_details': []\n", " }\n", " \n", " # This loop will be automatically parallelized!\n", " for py_file in py_files:\n", " # Get file information\n", " file_info = cluster_stat(py_file, config)\n", " \n", " # Count lines (for local files)\n", " if config.cluster_type == \"local\":\n", " try:\n", " with open(py_file, 'r', encoding='utf-8') as f:\n", " lines = len(f.readlines())\n", " except (UnicodeDecodeError, FileNotFoundError):\n", " lines = 0\n", " else:\n", " lines = 0 # Would need remote file reading for clusters\n", " \n", " results['total_lines'] += lines\n", " results['total_size'] += file_info.size\n", " \n", " # Track large files (> 10KB)\n", " if file_info.size > 10000:\n", " results['large_files'].append({\n", " 'file': py_file,\n", " 'size': file_info.size,\n", " 'lines': lines\n", " })\n", " \n", " results['file_details'].append({\n", " 'file': py_file,\n", " 'size': file_info.size,\n", " 'lines': lines,\n", " 'modified': file_info.modified_datetime.isoformat()\n", " })\n", " \n", " return results\n", "\n", "# Run the analysis\n", "print(\"šŸ” Analyzing Python 
files...\")\n", "analysis = analyze_python_files(config)\n", "\n", "print(\"\\n📈 Analysis Results:\")\n", "print(f\" Total Python files: {analysis['total_files']}\")\n", "print(f\" Total lines of code: {analysis['total_lines']:,}\")\n", "print(f\" Total size: {analysis['total_size'] / 1024:.1f} KB\")\n", "print(f\" Large files (>10KB): {len(analysis['large_files'])}\")\n", "\n", "if analysis['large_files']:\n", "    print(\"\\n📄 Largest Python files:\")\n", "    large_files = sorted(analysis['large_files'], key=lambda x: x['size'], reverse=True)\n", "    for file_info in large_files[:5]:\n", "        print(f\" - {file_info['file']}: {file_info['size']:,} bytes, {file_info['lines']:,} lines\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Advanced Pattern: Conditional Processing\n", "\n", "Process files only if certain conditions are met:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "@cluster(cores=1)\n", "def smart_documentation_check(config):\n", "    \"\"\"Check documentation completeness and suggest improvements.\"\"\"\n", "\n", "    results = {\n", "        'has_readme': False,\n", "        'has_contributing': False,\n", "        'has_license': False,\n", "        'docs_directory': False,\n", "        'doc_file_count': 0,\n", "        'notebook_count': 0,\n", "        'suggestions': []\n", "    }\n", "\n", "    # Check for essential documentation files\n", "    if cluster_exists(\"README.md\", config) or cluster_exists(\"README.rst\", config):\n", "        results['has_readme'] = True\n", "    else:\n", "        results['suggestions'].append(\"Add a README.md file\")\n", "\n", "    if cluster_exists(\"CONTRIBUTING.md\", config):\n", "        results['has_contributing'] = True\n", "    else:\n", "        results['suggestions'].append(\"Add a CONTRIBUTING.md file\")\n", "\n", "    if cluster_exists(\"LICENSE\", config) or cluster_exists(\"LICENSE.txt\", config):\n", "        results['has_license'] = True\n", "    else:\n", "        results['suggestions'].append(\"Add a LICENSE file\")\n", "\n", "    # Check for docs directory\n", "    if 
cluster_exists(\"docs\", config) and cluster_isdir(\"docs\", config):\n", "        results['docs_directory'] = True\n", "\n", "        # Count documentation files (one search per extension, so no brace expansion is required)\n", "        doc_files = []\n", "        for pattern in (\"*.rst\", \"*.md\"):\n", "            doc_files.extend(cluster_find(pattern, \"docs\", config))\n", "        results['doc_file_count'] = len(doc_files)\n", "    else:\n", "        results['suggestions'].append(\"Create a docs/ directory with documentation\")\n", "\n", "    # Count notebooks\n", "    notebooks = cluster_find(\"*.ipynb\", \".\", config)\n", "    results['notebook_count'] = len(notebooks)\n", "\n", "    if results['notebook_count'] == 0:\n", "        results['suggestions'].append(\"Consider adding tutorial notebooks\")\n", "\n", "    return results\n", "\n", "# Run documentation check\n", "print(\"📚 Checking documentation...\")\n", "doc_check = smart_documentation_check(config)\n", "\n", "print(\"\\n📋 Documentation Status:\")\n", "print(f\" ✅ README: {'Yes' if doc_check['has_readme'] else 'No'}\")\n", "print(f\" ✅ Contributing guide: {'Yes' if doc_check['has_contributing'] else 'No'}\")\n", "print(f\" ✅ License: {'Yes' if doc_check['has_license'] else 'No'}\")\n", "print(f\" ✅ Docs directory: {'Yes' if doc_check['docs_directory'] else 'No'}\")\n", "print(f\" 📓 Notebooks: {doc_check['notebook_count']}\")\n", "\n", "if doc_check['suggestions']:\n", "    print(\"\\n💡 Suggestions for improvement:\")\n", "    for suggestion in doc_check['suggestions']:\n", "        print(f\" - {suggestion}\")\n", "else:\n", "    print(\"\\n🎉 Documentation looks complete!\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Working with Different File Types\n", "\n", "The next example groups project files into categories by extension:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def categorize_files(config):\n", "    \"\"\"Categorize all files in the project.\"\"\"\n", "\n", "    categories = {\n", "        'Source Code': ['*.py', '*.js', '*.ts', '*.java', '*.cpp', '*.c', '*.h'],\n", "        'Documentation': ['*.md', '*.rst', '*.txt'],\n", "        'Configuration': 
['*.yml', '*.yaml', '*.json', '*.toml', '*.ini', '*.cfg'],\n", "        'Data': ['*.csv', '*.json', '*.xml', '*.xlsx', '*.h5', '*.pkl'],\n", "        'Images': ['*.png', '*.jpg', '*.jpeg', '*.gif', '*.svg'],\n", "        'Notebooks': ['*.ipynb'],\n", "        'Web': ['*.html', '*.css', '*.js']\n", "    }\n", "\n", "    results = {}\n", "\n", "    for category, patterns in categories.items():\n", "        files = []\n", "\n", "        for pattern in patterns:\n", "            found_files = cluster_find(pattern, \".\", config)\n", "            files.extend(found_files)\n", "\n", "        # Remove duplicates before summing sizes so no file is counted twice\n", "        files = sorted(set(files))\n", "\n", "        # Get size information\n", "        total_size = 0\n", "        for file in files:\n", "            try:\n", "                file_info = cluster_stat(file, config)\n", "                total_size += file_info.size\n", "            except Exception:\n", "                pass  # Skip files that can't be stat'd\n", "\n", "        results[category] = {\n", "            'count': len(files),\n", "            'size_mb': total_size / (1024 * 1024),\n", "            'files': files[:5]  # Store first 5 as examples\n", "        }\n", "\n", "    return results\n", "\n", "# Categorize files\n", "print(\"🗂️ Categorizing files by type...\")\n", "file_categories = categorize_files(config)\n", "\n", "print(\"\\n📊 File Categories:\")\n", "total_files = 0\n", "total_size = 0\n", "\n", "for category, info in file_categories.items():\n", "    if info['count'] > 0:\n", "        total_files += info['count']\n", "        total_size += info['size_mb']\n", "        print(f\" 📁 {category}: {info['count']} files ({info['size_mb']:.1f} MB)\")\n", "        if info['files']:\n", "            examples = ', '.join(info['files'][:3])\n", "            print(f\" Examples: {examples}\")\n", "\n", "print(f\"\\n📈 Summary: {total_files} categorized files, {total_size:.1f} MB total\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Performance Tips\n", "\n", "Here are some tips for using filesystem utilities efficiently:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def demonstrate_performance_tips(config):\n", "    \"\"\"Show efficient vs 
inefficient patterns.\"\"\"\n", "\n", "    print(\"⚡ Performance Tips for Filesystem Operations:\\n\")\n", "\n", "    # Tip 1: Use count to check before listing\n", "    print(\"1. Check file counts before expensive operations:\")\n", "    py_count = cluster_count_files(\".\", \"*.py\", config)\n", "    print(f\" Found {py_count} Python files - deciding processing strategy\")\n", "\n", "    if py_count > 100:\n", "        print(\" → Large number of files, using targeted search\")\n", "        # Use specific patterns instead of listing all\n", "        test_files = cluster_find(\"test_*.py\", \".\", config)\n", "        main_files = cluster_find(\"main*.py\", \".\", config)\n", "    else:\n", "        print(\" → Small number of files, safe to list all\")\n", "        all_py_files = cluster_find(\"*.py\", \".\", config)\n", "\n", "    print()\n", "\n", "    # Tip 2: Use exists() before stat()\n", "    print(\"2. Check existence before getting file info:\")\n", "    config_files = [\"setup.py\", \"pyproject.toml\", \"requirements.txt\"]\n", "\n", "    for config_file in config_files:\n", "        if cluster_exists(config_file, config):  # Fast check first\n", "            file_info = cluster_stat(config_file, config)  # Then get details\n", "            print(f\" ✅ {config_file}: {file_info.size:,} bytes\")\n", "        else:\n", "            print(f\" ❌ {config_file}: not found\")\n", "\n", "    print()\n", "\n", "    # Tip 3: Use specific patterns instead of filtering\n", "    print(\"3. 
Use specific patterns for better performance:\")\n", "    print(\" Good: cluster_find('test_*.py', '.', config)\")\n", "    print(\" Better than: [f for f in cluster_ls('.', config) if f.startswith('test_')]\")\n", "\n", "    # Demonstrate the difference\n", "    import time\n", "\n", "    # Method 1: Specific pattern (efficient)\n", "    start = time.perf_counter()\n", "    test_files_direct = cluster_find(\"test_*.py\", \".\", config)\n", "    time_direct = time.perf_counter() - start\n", "\n", "    # Method 2: List all then filter (less efficient)\n", "    # Note: cluster_find searches recursively, so the two file counts may differ\n", "    start = time.perf_counter()\n", "    all_files = cluster_ls(\".\", config)\n", "    test_files_filtered = [f for f in all_files if f.startswith('test_') and f.endswith('.py')]\n", "    time_filtered = time.perf_counter() - start\n", "\n", "    print(f\" Direct pattern: {len(test_files_direct)} files in {time_direct:.4f}s\")\n", "    print(f\" List + filter: {len(test_files_filtered)} files in {time_filtered:.4f}s\")\n", "\n", "    # A ratio above 1 means the direct pattern was faster in this run\n", "    ratio = time_filtered / time_direct if time_direct > 0 else 1\n", "    print(f\" Time ratio (list + filter / direct): {ratio:.1f}x\")\n", "\n", "# Run performance demo\n", "demonstrate_performance_tips(config)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Summary\n", "\n", "This tutorial covered:\n", "\n", "1. **Basic Operations**: `cluster_ls`, `cluster_find`, `cluster_stat`, `cluster_exists`, `cluster_isdir`, `cluster_isfile`\n", "2. **Pattern Matching and Usage Analysis**: `cluster_glob`, `cluster_du`, `cluster_count_files`\n", "3. **Data-Driven Workflows**: Using filesystem utilities with `@cluster`\n", "4. **Advanced Patterns**: Conditional processing, file categorization\n", "5. 
**Performance Tips**: Efficient patterns for large-scale operations\n", "\n", "### Key Benefits\n", "\n", "- **Unified API**: Same code works locally and on remote clusters\n", "- **Automatic Parallelization**: When used with `@cluster`, loop processing is parallelized\n", "- **Data Discovery**: Enables workflows that adapt to the files actually present\n", "- **Cross-Platform**: Consistent behavior across different operating systems\n", "\n", "### Next Steps\n", "\n", "1. Try these operations with your own data\n", "2. Configure a remote cluster and test the same code\n", "3. Build data processing pipelines using `@cluster` with filesystem utilities\n", "4. Explore the [API documentation](../api/filesystem.rst) for complete function references\n", "\n", "Happy cluster computing! 🚀" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.0" } }, "nbformat": 4, "nbformat_minor": 4 }