Metadata-Version: 2.4
Name: fsspec-utils
Version: 0.2.6.6
Summary: Enhanced utilities and extensions for fsspec filesystems with multi-format I/O support
Project-URL: Homepage, https://github.com/legout/fsspec-utils
Project-URL: Documentation, https://legout.github.io/fsspec-utils
Project-URL: Repository, https://github.com/legout/fsspec-utils.git
Project-URL: Issues, https://github.com/legout/fsspec-utils/issues
Author-email: legout <ligno.blades@gmail.com>
License-Expression: MIT
License-File: LICENSE
Keywords: azure,cloud-storage,csv,data-io,filesystem,fsspec,gcs,json,parquet,s3
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Information Analysis
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Classifier: Topic :: System :: Filesystems
Requires-Python: >=3.11
Requires-Dist: fsspec>=2025.1.0
Requires-Dist: joblib>=1.5.0
Requires-Dist: loguru>=0.7.0
Requires-Dist: msgspec>=0.18.0
Requires-Dist: orjson>=3.11.0
Requires-Dist: pandas>=2.2.0
Requires-Dist: polars>=1.30.0
Requires-Dist: pyarrow>=20.0.0
Requires-Dist: pyyaml>=6.0
Requires-Dist: requests>=2.25.0
Requires-Dist: rich>=14.0.0
Requires-Dist: sqlglot>=27.16.3
Provides-Extra: aws
Requires-Dist: boto3>=1.26.0; extra == 'aws'
Requires-Dist: s3fs>=2025.1.0; extra == 'aws'
Provides-Extra: azure
Requires-Dist: adlfs>=2024.12.0; extra == 'azure'
Provides-Extra: dev
Requires-Dist: mkdocs; extra == 'dev'
Requires-Dist: mkdocstrings[python]>=0.30.0; extra == 'dev'
Requires-Dist: mypy>=1.0.0; extra == 'dev'
Requires-Dist: pre-commit>=3.0.0; extra == 'dev'
Requires-Dist: pytest-cov>=4.0.0; extra == 'dev'
Requires-Dist: pytest-mock>=3.10.0; extra == 'dev'
Requires-Dist: pytest>=7.0.0; extra == 'dev'
Requires-Dist: ruff>=0.1.0; extra == 'dev'
Provides-Extra: gcp
Requires-Dist: gcsfs>=2025.1.0; extra == 'gcp'
Description-Content-Type: text/markdown

# fsspec-utils

Enhanced utilities and extensions for fsspec filesystems with multi-format I/O support.



## Overview

`fsspec-utils` is a comprehensive toolkit that extends [fsspec](https://filesystem-spec.readthedocs.io/) with:

- **Multi-cloud storage configuration** - Easy setup for AWS S3, Google Cloud Storage, Azure Storage, GitHub, and GitLab
- **Enhanced caching** - Improved caching filesystem with monitoring and path preservation
- **Extended I/O operations** - Read/write operations for JSON, CSV, Parquet with Polars/PyArrow integration
- **Utility functions** - Type conversion, parallel processing, and data transformation helpers

[![Ask DeepWiki](https://deepwiki.com/badge.svg)](https://deepwiki.com/legout/fsspec-utils)

## Installation

```bash
# Basic installation
pip install fsspec-utils

# With all cloud provider extras
pip install "fsspec-utils[aws,gcp,azure]"

# Specific cloud providers
pip install fsspec-utils[aws]     # AWS S3 support
pip install fsspec-utils[gcp]     # Google Cloud Storage
pip install fsspec-utils[azure]   # Azure Storage
```
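
To see which optional cloud backends are importable in the current environment, a check along these lines can help. It is a sketch using only the standard library; the module names `s3fs`, `gcsfs`, and `adlfs` are the ones the extras install, and `available_backends` is a hypothetical helper, not part of `fsspec-utils`.

```python
from importlib.util import find_spec

# Extra name -> module that the extra is expected to install.
BACKENDS = {"aws": "s3fs", "gcp": "gcsfs", "azure": "adlfs"}


def available_backends(backends=BACKENDS):
    """Return {extra: bool} telling whether each backend module is importable."""
    return {
        extra: find_spec(module) is not None
        for extra, module in backends.items()
    }


for extra, ok in available_backends().items():
    print(f"{extra}: {'installed' if ok else 'missing'}")
```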

## Quick Start

### Basic Filesystem Operations

```python
from fsspec_utils import filesystem

# Local filesystem
fs = filesystem("file")
files = fs.ls("/path/to/data")

# S3 with caching
fs = filesystem("s3://my-bucket/", cached=True)
data = fs.cat("data/file.txt")
```

### Storage Configuration

```python
from fsspec_utils import filesystem
from fsspec_utils.storage import AwsStorageOptions

# Configure S3 access
options = AwsStorageOptions(
    region="us-west-2",
    access_key_id="YOUR_KEY",
    secret_access_key="YOUR_SECRET"
)

fs = filesystem("s3", storage_options=options, cached=True)
```

### Environment-based Configuration

```python
from fsspec_utils import filesystem
from fsspec_utils.storage import AwsStorageOptions

# Load from environment variables
options = AwsStorageOptions.from_env()
fs = filesystem("s3", storage_options=options)
```
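
The variables that `from_env()` reads are not listed here. As a rough sketch, assuming it follows the standard AWS variable names (`AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`, `AWS_DEFAULT_REGION`), the lookup might resemble the following. `aws_options_from_env` is a hypothetical helper, not the library's implementation.

```python
import os


def aws_options_from_env(env=None):
    """Collect AWS settings from environment variables into a plain dict.

    Keys mirror the AwsStorageOptions arguments used in this README.
    Variables that are unset or empty are skipped.
    """
    env = os.environ if env is None else env
    # Assumed mapping: option name -> standard AWS environment variable.
    mapping = {
        "access_key_id": "AWS_ACCESS_KEY_ID",
        "secret_access_key": "AWS_SECRET_ACCESS_KEY",
        "region": "AWS_DEFAULT_REGION",
    }
    return {key: env[var] for key, var in mapping.items() if env.get(var)}


print(sorted(aws_options_from_env()))  # names of the options found, if any
```

Passing a dict instead of relying on `os.environ` keeps the helper easy to test without touching real credentials.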

### Multiple Cloud Providers

```python
from fsspec_utils import filesystem
from fsspec_utils.storage import (
    AwsStorageOptions,
    GcsStorageOptions,
    GitHubStorageOptions,
)

# AWS S3
s3_fs = filesystem("s3", storage_options=AwsStorageOptions.from_env())

# Google Cloud Storage
gcs_fs = filesystem("gs", storage_options=GcsStorageOptions.from_env())

# GitHub repository
github_fs = filesystem("github", storage_options=GitHubStorageOptions(
    org="microsoft",
    repo="vscode",
    token="ghp_xxxx"
))
```

## Storage Options

### AWS S3

```python
from fsspec_utils.storage import AwsStorageOptions

# Basic credentials
options = AwsStorageOptions(
    access_key_id="AKIAXXXXXXXX",
    secret_access_key="SECRET",
    region="us-east-1"
)

# From AWS profile
options = AwsStorageOptions.create(profile="dev")

# S3-compatible service (MinIO)
options = AwsStorageOptions(
    endpoint_url="http://localhost:9000",
    access_key_id="minioadmin",
    secret_access_key="minioadmin",
    allow_http=True
)
```

### Google Cloud Storage

```python
from fsspec_utils.storage import GcsStorageOptions

# Service account
options = GcsStorageOptions(
    token="path/to/service-account.json",
    project="my-project-123"
)

# From environment
options = GcsStorageOptions.from_env()
```

### Azure Storage

```python
from fsspec_utils.storage import AzureStorageOptions

# Account key
options = AzureStorageOptions(
    protocol="az",
    account_name="mystorageacct",
    account_key="key123..."
)

# Connection string
options = AzureStorageOptions(
    protocol="az",
    connection_string="DefaultEndpoints..."
)
```

### GitHub

```python
from fsspec_utils.storage import GitHubStorageOptions

# Public repository
options = GitHubStorageOptions(
    org="microsoft",
    repo="vscode",
    ref="main"
)

# Private repository
options = GitHubStorageOptions(
    org="myorg",
    repo="private-repo",
    token="ghp_xxxx",
    ref="develop"
)
```

### GitLab

```python
from fsspec_utils.storage import GitLabStorageOptions

# Public project
options = GitLabStorageOptions(
    project_name="group/project",
    ref="main"
)

# Private project with token
options = GitLabStorageOptions(
    project_id=12345,
    token="glpat_xxxx",
    ref="develop"
)
```

## Enhanced Caching

```python
from fsspec_utils import filesystem

# Enable caching with monitoring
fs = filesystem(
    "s3://my-bucket/",
    cached=True,
    cache_storage="/tmp/my_cache",
    verbose=True
)

# Cache preserves directory structure
data = fs.cat("deep/nested/path/file.txt")
# Cached at: /tmp/my_cache/deep/nested/path/file.txt
```
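
Path preservation means the cache location mirrors the remote path instead of using an opaque hashed filename. A minimal sketch of that mapping, assuming the remote path is simply appended to the cache directory as the comment above suggests (`cached_path` is a hypothetical helper; the library's actual logic may differ):

```python
import posixpath


def cached_path(cache_storage, remote_path):
    """Map a remote path to its mirrored location under the cache directory."""
    # Strip any leading slash so the join stays inside cache_storage.
    return posixpath.join(cache_storage, remote_path.lstrip("/"))


print(cached_path("/tmp/my_cache", "deep/nested/path/file.txt"))
# /tmp/my_cache/deep/nested/path/file.txt
```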

## Utilities

### Parallel Processing

```python
from fsspec_utils.utils import run_parallel

# Run function in parallel
def process_file(path, multiplier=1):
    return len(path) * multiplier

results = run_parallel(
    process_file,
    ["/path1", "/path2", "/path3"],
    multiplier=2,
    n_jobs=4,
    verbose=True
)
```
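
In the call above, the list supplies one `path` per task while `multiplier=2` is passed unchanged to every call. The internals of `run_parallel` are not shown here; the sketch below reproduces the same call pattern with the standard library's `ThreadPoolExecutor`, purely as an illustration. `run_parallel_sketch` is a hypothetical name.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import partial


def process_file(path, multiplier=1):
    return len(path) * multiplier


def run_parallel_sketch(func, items, n_jobs=4, **fixed_kwargs):
    """Apply func to each item concurrently, passing fixed_kwargs to every call."""
    task = partial(func, **fixed_kwargs)
    with ThreadPoolExecutor(max_workers=n_jobs) as pool:
        # Executor.map yields results in input order.
        return list(pool.map(task, items))


results = run_parallel_sketch(
    process_file, ["/path1", "/path2", "/path3"], multiplier=2
)
print(results)  # [12, 12, 12]
```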

### Type Conversion

```python
from fsspec_utils.utils import dict_to_dataframe, to_pyarrow_table

# Convert dict to DataFrame
data = {"col1": [1, 2, 3], "col2": [4, 5, 6]}
df = dict_to_dataframe(data)

# Convert to PyArrow table
table = to_pyarrow_table(df)
```

### Logging

```python
from fsspec_utils.utils import setup_logging

# Configure logging
setup_logging(level="DEBUG", format_string="{time} | {level} | {message}")
```

## Dependencies

### Core Dependencies
- `fsspec>=2025.1.0` - Filesystem interface
- `msgspec>=0.18.0` - Serialization
- `pyyaml>=6.0` - YAML support
- `requests>=2.25.0` - HTTP requests
- `loguru>=0.7.0` - Logging
- `orjson>=3.11.0` - Fast JSON processing
- `polars>=1.30.0` - Fast DataFrames
- `pyarrow>=20.0.0` - Columnar data
- `pandas>=2.2.0` - Data analysis
- `joblib>=1.5.0` - Parallel processing
- `rich>=14.0.0` - Progress bars
- `sqlglot>=27.16.3` - SQL parsing

### Cloud Provider Dependencies
- `boto3>=1.26.0`, `s3fs>=2025.1.0` - AWS S3 (`aws` extra)
- `gcsfs>=2025.1.0` - Google Cloud Storage (`gcp` extra)
- `adlfs>=2024.12.0` - Azure Storage (`azure` extra)

## Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

## License

This project is licensed under the MIT License - see the LICENSE file for details.

## Relationship to FlowerPower

This package was extracted from the [FlowerPower](https://github.com/legout/flowerpower) workflow framework to provide standalone filesystem utilities that can be used on their own or as a dependency in other projects.
