{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Filesystem Utilities Tutorial\n",
    "\n",
    "This notebook demonstrates how to use Clustrix's unified filesystem utilities for seamless file operations across local and remote clusters.\n",
    "\n",
    "## Overview\n",
    "\n",
    "Clustrix provides a set of filesystem utilities that work identically whether you're operating on local files or files on remote clusters. This enables data-driven cluster computing workflows where your code can discover, analyze, and process files without worrying about whether they're local or remote."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Import the filesystem utilities\n",
    "from clustrix import (\n",
    "    cluster_ls,\n",
    "    cluster_find,\n",
    "    cluster_stat,\n",
    "    cluster_exists,\n",
    "    cluster_isdir,\n",
    "    cluster_isfile,\n",
    "    cluster_glob,\n",
    "    cluster_du,\n",
    "    cluster_count_files,\n",
    "    cluster\n",
    ")\n",
    "from clustrix.config import ClusterConfig\n",
    "\n",
    "print(\"✅ Clustrix filesystem utilities imported successfully!\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Configuration\n",
    "\n",
    "First, let's set up configurations for both local and remote operations:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Local configuration - works with files on your local machine\n",
    "local_config = ClusterConfig(\n",
    "    cluster_type=\"local\",\n",
    "    local_work_dir=\".\"  # Current directory\n",
    ")\n",
    "\n",
    "# Remote cluster configuration - replace with your cluster details\n",
    "remote_config = ClusterConfig(\n",
    "    cluster_type=\"slurm\",\n",
    "    cluster_host=\"cluster.example.edu\",\n",
    "    username=\"researcher\",\n",
    "    remote_work_dir=\"/scratch/project\"\n",
    ")\n",
    "\n",
    "# For this demo, we'll use local_config\n",
    "config = local_config\n",
    "print(f\"Using config: {config.cluster_type}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Basic Operations\n",
    "\n",
    "### Directory Listing\n",
    "\n",
    "List files and directories:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# List files in current directory\n",
    "files = cluster_ls(\".\", config)\n",
    "print(f\"Found {len(files)} items in current directory:\")\n",
    "for file in files[:10]:  # Show first 10\n",
    "    print(f\"  - {file}\")\n",
    "if len(files) > 10:\n",
    "    print(f\"  ... and {len(files) - 10} more\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### File Discovery\n",
    "\n",
    "Find files by pattern:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Find Python files recursively\n",
    "py_files = cluster_find(\"*.py\", \".\", config)\n",
    "print(f\"Found {len(py_files)} Python files:\")\n",
    "for file in py_files[:5]:\n",
    "    print(f\"  - {file}\")\n",
    "\n",
    "# Find Jupyter notebooks\n",
    "notebooks = cluster_find(\"*.ipynb\", \".\", config)\n",
    "print(f\"\\nFound {len(notebooks)} Jupyter notebooks:\")\n",
    "for file in notebooks[:3]:\n",
    "    print(f\"  - {file}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### File Information\n",
    "\n",
    "Get detailed information about files:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Check if README exists and get its info\n",
    "readme_files = cluster_find(\"README*\", \".\", config)\n",
    "if readme_files:\n",
    "    readme = readme_files[0]\n",
    "    print(f\"Found README: {readme}\")\n",
    "    \n",
    "    # Get detailed file information\n",
    "    file_info = cluster_stat(readme, config)\n",
    "    print(f\"  Size: {file_info.size:,} bytes\")\n",
    "    print(f\"  Modified: {file_info.modified_datetime}\")\n",
    "    print(f\"  Is directory: {file_info.is_dir}\")\n",
    "    print(f\"  Permissions: {file_info.permissions}\")\n",
    "else:\n",
    "    print(\"No README file found\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### File Existence and Type Checking"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Check common files/directories\n",
    "paths_to_check = [\"setup.py\", \"requirements.txt\", \"docs\", \"tests\", \"src\", \"clustrix\"]\n",
    "\n",
    "for path in paths_to_check:\n",
    "    if cluster_exists(path, config):\n",
    "        if cluster_isdir(path, config):\n",
    "            print(f\"📁 {path} (directory)\")\n",
    "        elif cluster_isfile(path, config):\n",
    "            print(f\"📄 {path} (file)\")\n",
    "    else:\n",
    "        print(f\"❌ {path} (not found)\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Pattern Matching with Glob"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Use glob patterns for flexible file matching.\n",
    "# Shell-style brace expansion such as *.{yml,yaml} may not be expanded by\n",
    "# glob matching, so each extension gets its own pattern here.\n",
    "patterns = {\n",
    "    \"Python files\": [\"*.py\"],\n",
    "    \"Config files\": [\"*.yml\", \"*.yaml\", \"*.json\", \"*.toml\"],\n",
    "    \"Documentation\": [\"*.md\", \"*.rst\", \"*.txt\"],\n",
    "    \"Test files\": [\"test_*.py\"]\n",
    "}\n",
    "\n",
    "for name, globs in patterns.items():\n",
    "    matches = []\n",
    "    for pattern in globs:\n",
    "        matches.extend(cluster_glob(pattern, \".\", config))\n",
    "    print(f\"{name}: {len(matches)} files\")\n",
    "    if matches:\n",
    "        print(f\"  Examples: {', '.join(matches[:3])}\")\n",
    "    print()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Directory Usage Analysis"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Analyze current directory usage\n",
    "usage = cluster_du(\".\", config)\n",
    "print(\"📊 Directory Usage Analysis:\")\n",
    "print(f\"  Total size: {usage.total_mb:.1f} MB ({usage.total_gb:.3f} GB)\")\n",
    "print(f\"  File count: {usage.file_count:,}\")\n",
    "if usage.file_count > 0:\n",
    "    avg_size = usage.total_mb / usage.file_count\n",
    "    print(f\"  Average file size: {avg_size:.2f} MB\")\n",
    "\n",
    "# Count files by type\n",
    "total_files = cluster_count_files(\".\", \"*\", config)\n",
    "python_files = cluster_count_files(\".\", \"*.py\", config)\n",
    "notebook_files = cluster_count_files(\".\", \"*.ipynb\", config)\n",
    "\n",
    "print(\"\\n📈 File Counts:\")\n",
    "print(f\"  Total files: {total_files:,}\")\n",
    "print(f\"  Python files: {python_files:,}\")\n",
    "print(f\"  Notebooks: {notebook_files:,}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Data-Driven Workflows\n",
    "\n",
    "The real power comes when combining filesystem utilities with the `@cluster` decorator:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "@cluster(cores=2)  # Use 2 cores for this example\n",
    "def analyze_python_files(config):\n",
    "    \"\"\"Analyze all Python files in the project.\"\"\"\n",
    "    \n",
    "    # Find all Python files\n",
    "    py_files = cluster_find(\"*.py\", \".\", config)\n",
    "    print(f\"Found {len(py_files)} Python files to analyze\")\n",
    "    \n",
    "    results = {\n",
    "        'total_files': len(py_files),\n",
    "        'total_lines': 0,\n",
    "        'total_size': 0,\n",
    "        'large_files': [],\n",
    "        'file_details': []\n",
    "    }\n",
    "    \n",
    "    # This loop will be automatically parallelized!\n",
    "    for py_file in py_files:\n",
    "        # Get file information\n",
    "        file_info = cluster_stat(py_file, config)\n",
    "        \n",
    "        # Count lines (for local files)\n",
    "        if config.cluster_type == \"local\":\n",
    "            try:\n",
    "                with open(py_file, 'r', encoding='utf-8') as f:\n",
    "                    lines = len(f.readlines())\n",
    "            except (UnicodeDecodeError, FileNotFoundError):\n",
    "                lines = 0\n",
    "        else:\n",
    "            lines = 0  # Would need remote file reading for clusters\n",
    "        \n",
    "        results['total_lines'] += lines\n",
    "        results['total_size'] += file_info.size\n",
    "        \n",
    "        # Track large files (> 10 KB, using 1 KB = 1024 bytes as in the report below)\n",
    "        if file_info.size > 10 * 1024:\n",
    "            results['large_files'].append({\n",
    "                'file': py_file,\n",
    "                'size': file_info.size,\n",
    "                'lines': lines\n",
    "            })\n",
    "        \n",
    "        results['file_details'].append({\n",
    "            'file': py_file,\n",
    "            'size': file_info.size,\n",
    "            'lines': lines,\n",
    "            'modified': file_info.modified_datetime.isoformat()\n",
    "        })\n",
    "    \n",
    "    return results\n",
    "\n",
    "# Run the analysis\n",
    "print(\"🔍 Analyzing Python files...\")\n",
    "analysis = analyze_python_files(config)\n",
    "\n",
    "print(\"\\n📈 Analysis Results:\")\n",
    "print(f\"  Total Python files: {analysis['total_files']}\")\n",
    "print(f\"  Total lines of code: {analysis['total_lines']:,}\")\n",
    "print(f\"  Total size: {analysis['total_size'] / 1024:.1f} KB\")\n",
    "print(f\"  Large files (>10KB): {len(analysis['large_files'])}\")\n",
    "\n",
    "if analysis['large_files']:\n",
    "    print(\"\\n📄 Largest Python files:\")\n",
    "    large_files = sorted(analysis['large_files'], key=lambda x: x['size'], reverse=True)\n",
    "    for file_info in large_files[:5]:\n",
    "        print(f\"  - {file_info['file']}: {file_info['size']:,} bytes, {file_info['lines']:,} lines\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Advanced Pattern: Conditional Processing\n",
    "\n",
    "Process files only if certain conditions are met:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "@cluster(cores=1)\n",
    "def smart_documentation_check(config):\n",
    "    \"\"\"Check documentation completeness and suggest improvements.\"\"\"\n",
    "    \n",
    "    results = {\n",
    "        'has_readme': False,\n",
    "        'has_contributing': False,\n",
    "        'has_license': False,\n",
    "        'docs_directory': False,\n",
    "        'doc_file_count': 0,\n",
    "        'notebook_count': 0,\n",
    "        'suggestions': []\n",
    "    }\n",
    "    \n",
    "    # Check for essential documentation files\n",
    "    if cluster_exists(\"README.md\", config) or cluster_exists(\"README.rst\", config):\n",
    "        results['has_readme'] = True\n",
    "    else:\n",
    "        results['suggestions'].append(\"Add a README.md file\")\n",
    "    \n",
    "    if cluster_exists(\"CONTRIBUTING.md\", config):\n",
    "        results['has_contributing'] = True\n",
    "    else:\n",
    "        results['suggestions'].append(\"Add a CONTRIBUTING.md file\")\n",
    "    \n",
    "    if cluster_exists(\"LICENSE\", config) or cluster_exists(\"LICENSE.txt\", config):\n",
    "        results['has_license'] = True\n",
    "    else:\n",
    "        results['suggestions'].append(\"Add a LICENSE file\")\n",
    "    \n",
    "    # Check for docs directory\n",
    "    if cluster_exists(\"docs\", config) and cluster_isdir(\"docs\", config):\n",
    "        results['docs_directory'] = True\n",
    "        \n",
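    "        # Count documentation files, one pattern per extension, since\n",
    "        # brace patterns such as *.{rst,md} may not be expanded\n",
    "        doc_files = []\n",
    "        for pattern in (\"*.rst\", \"*.md\"):\n",
    "            doc_files.extend(cluster_find(pattern, \"docs\", config))\n",
    "        results['doc_file_count'] = len(doc_files)\n",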
    "    else:\n",
    "        results['suggestions'].append(\"Create a docs/ directory with documentation\")\n",
    "    \n",
    "    # Count notebooks\n",
    "    notebooks = cluster_find(\"*.ipynb\", \".\", config)\n",
    "    results['notebook_count'] = len(notebooks)\n",
    "    \n",
    "    if results['notebook_count'] == 0:\n",
    "        results['suggestions'].append(\"Consider adding tutorial notebooks\")\n",
    "    \n",
    "    return results\n",
    "\n",
    "# Run documentation check\n",
    "print(\"📚 Checking documentation...\")\n",
    "doc_check = smart_documentation_check(config)\n",
    "\n",
    "def status(flag):\n",
    "    \"\"\"Return a status marker that matches the check result.\"\"\"\n",
    "    return '✅ Yes' if flag else '❌ No'\n",
    "\n",
    "print(\"\\n📋 Documentation Status:\")\n",
    "print(f\"  README: {status(doc_check['has_readme'])}\")\n",
    "print(f\"  Contributing guide: {status(doc_check['has_contributing'])}\")\n",
    "print(f\"  License: {status(doc_check['has_license'])}\")\n",
    "print(f\"  Docs directory: {status(doc_check['docs_directory'])}\")\n",
    "print(f\"  📓 Notebooks: {doc_check['notebook_count']}\")\n",
    "\n",
    "if doc_check['suggestions']:\n",
    "    print(\"\\n💡 Suggestions for improvement:\")\n",
    "    for suggestion in doc_check['suggestions']:\n",
    "        print(f\"  - {suggestion}\")\n",
    "else:\n",
    "    print(\"\\n🎉 Documentation looks complete!\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Working with Different File Types\n",
    "\n",
    "Demonstrate handling various file types:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def categorize_files(config):\n",
    "    \"\"\"Categorize all files in the project.\"\"\"\n",
    "    \n",
    "    categories = {\n",
    "        'Source Code': ['*.py', '*.js', '*.ts', '*.java', '*.cpp', '*.c', '*.h'],\n",
    "        'Documentation': ['*.md', '*.rst', '*.txt'],\n",
    "        'Configuration': ['*.yml', '*.yaml', '*.json', '*.toml', '*.ini', '*.cfg'],\n",
    "        'Data': ['*.csv', '*.json', '*.xml', '*.xlsx', '*.h5', '*.pkl'],\n",
    "        'Images': ['*.png', '*.jpg', '*.jpeg', '*.gif', '*.svg'],\n",
    "        'Notebooks': ['*.ipynb'],\n",
    "        'Web': ['*.html', '*.css', '*.js']\n",
    "    }\n",
    "    \n",
    "    results = {}\n",
    "    \n",
    "    for category, patterns in categories.items():\n",
    "        files = []\n",
    "        total_size = 0\n",
    "        \n",
    "        for pattern in patterns:\n",
    "            found_files = cluster_find(pattern, \".\", config)\n",
    "            files.extend(found_files)\n",
    "        \n",
    "        # Remove duplicates before summing sizes so no file is counted twice\n",
    "        files = sorted(set(files))\n",
    "        \n",
    "        # Get size information\n",
    "        for file in files:\n",
    "            try:\n",
    "                file_info = cluster_stat(file, config)\n",
    "                total_size += file_info.size\n",
    "            except Exception:\n",
    "                pass  # Skip files that can't be stat'd\n",
    "        \n",
    "        results[category] = {\n",
    "            'count': len(files),\n",
    "            'size_mb': total_size / (1024 * 1024),\n",
    "            'files': files[:5]  # Store first 5 as examples\n",
    "        }\n",
    "    \n",
    "    return results\n",
    "\n",
    "# Categorize files\n",
    "print(\"🗂️ Categorizing files by type...\")\n",
    "file_categories = categorize_files(config)\n",
    "\n",
    "print(\"\\n📊 File Categories:\")\n",
    "total_files = 0\n",
    "total_size = 0\n",
    "\n",
    "for category, info in file_categories.items():\n",
    "    if info['count'] > 0:\n",
    "        total_files += info['count']\n",
    "        total_size += info['size_mb']\n",
    "        print(f\"  📁 {category}: {info['count']} files ({info['size_mb']:.1f} MB)\")\n",
    "        if info['files']:\n",
    "            examples = ', '.join(info['files'][:3])\n",
    "            print(f\"     Examples: {examples}\")\n",
    "\n",
    "print(f\"\\n📈 Summary: {total_files} categorized files, {total_size:.1f} MB total\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Performance Tips\n",
    "\n",
    "Here are some tips for using filesystem utilities efficiently:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def demonstrate_performance_tips(config):\n",
    "    \"\"\"Show efficient vs inefficient patterns.\"\"\"\n",
    "    \n",
    "    print(\"⚡ Performance Tips for Filesystem Operations:\\n\")\n",
    "    \n",
    "    # Tip 1: Use count to check before listing\n",
    "    print(\"1. Check file counts before expensive operations:\")\n",
    "    py_count = cluster_count_files(\".\", \"*.py\", config)\n",
    "    print(f\"   Found {py_count} Python files - deciding processing strategy\")\n",
    "    \n",
    "    if py_count > 100:\n",
    "        print(\"   → Large number of files, using targeted search\")\n",
    "        # Use specific patterns instead of listing all\n",
    "        test_files = cluster_find(\"test_*.py\", \".\", config)\n",
    "        main_files = cluster_find(\"main*.py\", \".\", config)\n",
    "    else:\n",
    "        print(\"   → Small number of files, safe to list all\")\n",
    "        all_py_files = cluster_find(\"*.py\", \".\", config)\n",
    "    \n",
    "    print()\n",
    "    \n",
    "    # Tip 2: Use exists() before stat()\n",
    "    print(\"2. Check existence before getting file info:\")\n",
    "    config_files = [\"setup.py\", \"pyproject.toml\", \"requirements.txt\"]\n",
    "    \n",
    "    for config_file in config_files:\n",
    "        if cluster_exists(config_file, config):  # Fast check first\n",
    "            file_info = cluster_stat(config_file, config)  # Then get details\n",
    "            print(f\"   ✅ {config_file}: {file_info.size:,} bytes\")\n",
    "        else:\n",
    "            print(f\"   ❌ {config_file}: not found\")\n",
    "    \n",
    "    print()\n",
    "    \n",
    "    # Tip 3: Use specific patterns instead of filtering\n",
    "    print(\"3. Use specific patterns for better performance:\")\n",
    "    print(\"   Good: cluster_find('test_*.py', '.', config)\")\n",
    "    print(\"   Better than: [f for f in cluster_ls('.', config) if f.startswith('test_')]\")\n",
    "    \n",
    "    # Demonstrate the difference\n",
    "    import time\n",
    "    \n",
    "    # Method 1: Specific pattern (recursive search)\n",
    "    start = time.perf_counter()\n",
    "    test_files_direct = cluster_find(\"test_*.py\", \".\", config)\n",
    "    time_direct = time.perf_counter() - start\n",
    "    \n",
    "    # Method 2: List one directory, then filter in Python\n",
    "    start = time.perf_counter()\n",
    "    all_files = cluster_ls(\".\", config)\n",
    "    test_files_filtered = [f for f in all_files if f.startswith('test_') and f.endswith('.py')]\n",
    "    time_filtered = time.perf_counter() - start\n",
    "    \n",
    "    print(f\"   Direct pattern: {len(test_files_direct)} files in {time_direct:.4f}s\")\n",
    "    print(f\"   List + filter: {len(test_files_filtered)} files in {time_filtered:.4f}s\")\n",
    "    \n",
    "    # The two methods cover different scopes (recursive search vs. a single\n",
    "    # directory listing), so counts can differ and the ratio is only indicative\n",
    "    if time_direct > 0:\n",
    "        print(f\"   Time ratio (list + filter / direct): {time_filtered / time_direct:.1f}x\")\n",
    "\n",
    "# Run performance demo\n",
    "demonstrate_performance_tips(config)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Summary\n",
    "\n",
    "This tutorial covered:\n",
    "\n",
    "1. **Basic Operations**: `cluster_ls`, `cluster_find`, `cluster_stat`, `cluster_exists`, `cluster_isdir`, `cluster_isfile`\n",
    "2. **Pattern Matching and Usage Analysis**: `cluster_glob`, `cluster_du`, `cluster_count_files`\n",
    "3. **Data-Driven Workflows**: Using filesystem utilities with `@cluster`\n",
    "4. **Advanced Patterns**: Conditional processing, file categorization\n",
    "5. **Performance Tips**: Efficient patterns for large-scale operations\n",
    "\n",
    "### Key Benefits\n",
    "\n",
    "- **Unified API**: Same code works locally and on remote clusters\n",
    "- **Automatic Parallelization**: When used with `@cluster`, loop processing is parallelized\n",
    "- **Data Discovery**: Workflows can adapt to the files and directories that actually exist\n",
    "- **Cross-Platform**: Consistent behavior across different operating systems\n",
    "\n",
    "### Next Steps\n",
    "\n",
    "1. Try these operations with your own data\n",
    "2. Configure a remote cluster and test the same code\n",
    "3. Build data processing pipelines using `@cluster` with filesystem utilities\n",
    "4. Explore the [API documentation](../api/filesystem.rst) for complete function references\n",
    "\n",
    "Happy cluster computing! 🚀"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.0"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}