Advanced Usage
fsspec-utils extends the capabilities of fsspec to provide a more robust and feature-rich experience for handling diverse file systems and data formats. This section delves into advanced features, configurations, and performance tips to help you get the most out of the library.
Unified Filesystem Creation with filesystem
The fsspec_utils.core.filesystem function offers a centralized and enhanced way to instantiate fsspec filesystem objects. It supports:
- Intelligent Caching: Automatically wraps filesystems with
MonitoredSimpleCacheFileSystemfor improved performance and verbose logging of cache operations. - Structured Storage Options: Integrates seamlessly with
fsspec_utils.storage_optionsclasses, allowing for type-safe and organized configuration of cloud and Git-based storage. - Protocol Inference: Can infer the filesystem protocol directly from a URI or path, reducing boilerplate.
Example: Cached S3 Filesystem with Structured Options
from fsspec_utils.core import filesystem
from fsspec_utils.storage_options import AwsStorageOptions
# Configure S3 options using the structured class
s3_opts = AwsStorageOptions(
region="us-east-1",
access_key_id="YOUR_ACCESS_KEY",
secret_access_key="YOUR_SECRET_KEY"
)
# Create a cached S3 filesystem using the 'filesystem' helper
fs = filesystem(
"s3",
storage_options=s3_opts,
cached=True,
cache_storage="/tmp/s3_cache", # Optional: specify cache directory
verbose=True # Enable verbose cache logging
)
# Use the filesystem as usual
print(fs.ls("s3://your-bucket/"))Detailed Caching for Improved Performance
fsspec-utils provides an enhanced caching mechanism that improves performance for repeated file operations, especially useful for remote filesystems.
This example demonstrates how caching improves read performance. The first read populates the cache, while subsequent reads retrieve data directly from the cache, significantly reducing access time. It also shows that data can still be retrieved from the cache even if the original source becomes unavailable.
Caching in fsspec-utils is an enhanced mechanism that improves performance for repeated file operations, especially useful for remote filesystems where network latency can significantly impact performance.
The filesystem() function provides several parameters for configuring caching:
cached: When set toTrue, enables caching for all read operationscache_storage: Specifies the directory where cached files will be storedverbose: When set toTrue, provides detailed logging about cache operations
Step-by-step walkthrough:
First read (populating cache): When reading a file for the first time, the data is retrieved from the source (disk, network, etc.) and stored in the cache directory. This takes longer than subsequent reads because it involves both reading from the source and writing to the cache.
Second read (using cache): When the same file is read again, the data is retrieved directly from the cache instead of the source. This is significantly faster because it avoids network latency or disk I/O.
Demonstrating cache effectiveness: Even after the original file is removed, the cached version can still be accessed. This demonstrates that the cache acts as a persistent copy of the data, independent of the source file.
Performance comparison: The timing results clearly show the performance benefits of caching, with subsequent reads being orders of magnitude faster than the initial read.
This caching mechanism is particularly valuable when working with:
- Remote filesystems (S3, GCS, Azure) where network latency is a bottleneck
- Frequently accessed files that don’t change often
- Applications that read the same data multiple times
- Environments with unreliable network connections
Setup and First Read (Populating Cache)
In this step, we create a sample JSON file and initialize the fsspec-utils filesystem with caching enabled. The first read operation retrieves data from the source and populates the cache.
Setup steps:
- Create a temporary directory for our example
- Create sample data file
- Configure filesystem with caching
import tempfile
import time
import os
import shutil
from fsspec_utils import filesystem
from examples.caching.setup_data import create_sample_data_file
tmpdir = tempfile.mkdtemp()
print(f"Created temporary directory: {tmpdir}")
sample_file = create_sample_data_file(tmpdir)
cache_dir = os.path.join(tmpdir, "cache")
fs = filesystem(
protocol_or_path="file",
cached=True,
cache_storage=cache_dir,
verbose=True
)
print("\n=== First read (populating cache) ===")
start_time = time.time()
data1 = fs.read_json(sample_file)
first_read_time = time.time() - start_time
print(f"First read completed in {first_read_time:.4f} seconds")Second Read (Using Cache)
Now, let’s read the same file again to see the performance improvement from using the cache.
print("\n=== Second read (using cache) ===")
start_time = time.time()
data2 = fs.read_json(sample_file)
second_read_time = time.time() - start_time
print(f"Second read completed in {second_read_time:.4f} seconds")The second read retrieves data directly from the cache, which is significantly faster than reading from the source again.
Reading from Cache after Deletion
To demonstrate that the cache is persistent, we’ll remove the original file and try to read it again.
print("\n=== Demonstrating cache effectiveness ===")
print("Removing original file...")
os.remove(sample_file)
print(f"Original file exists: {os.path.exists(sample_file)}")
print("\n=== Third read (from cache only) ===")
start_time = time.time()
data3 = fs.read_json(sample_file)
third_read_time = time.time() - start_time
print(f"Third read completed in {third_read_time:.4f} seconds")
print("✓ Successfully read from cache even after original file was removed")
print("\n=== Performance Comparison ===")
print(f"First read (from disk): {first_read_time:.4f} seconds")
print(f"Second read (from cache): {second_read_time:.4f} seconds")
print(f"Third read (from cache): {third_read_time:.4f} seconds")
shutil.rmtree(tmpdir)
print(f"Cleaned up temporary directory: {tmpdir}")This step proves that the cache acts as a persistent copy of the data, allowing access even if the original source is unavailable.
Custom Filesystem Implementations
fsspec-utils provides specialized filesystem implementations for unique use cases:
GitLab Filesystem (GitLabFileSystem)
Access files directly from GitLab repositories. This is particularly useful for configuration files, datasets, or code stored in private or public GitLab instances.
Example: Reading from a GitLab Repository
from fsspec_utils.core import filesystem
# Instantiate a GitLab filesystem
gitlab_fs = filesystem(
"gitlab",
storage_options={
"project_name": "your-group/your-project", # Or "project_id": 12345
"ref": "main", # Branch, tag, or commit SHA
"token": "glpat_YOUR_PRIVATE_TOKEN" # Required for private repos
}
)
# List files in the repository root
print(gitlab_fs.ls("/"))
# Read a specific file
content = gitlab_fs.cat("README.md").decode("utf-8")
print(content[:200]) # Print first 200 charactersAdvanced Data Reading and Writing (read_files, write_files)
The fsspec_utils.core.ext module (exposed via AbstractFileSystem extensions) provides powerful functions for reading and writing various data formats (JSON, CSV, Parquet) with advanced features like:
- Batch Processing: Efficiently handle large datasets by processing files in configurable batches.
- Parallel Processing: Leverage multi-threading to speed up file I/O operations.
- Schema Unification & Optimization: Automatically unifies schemas when concatenating multiple files and optimizes data types for memory efficiency (e.g., using Polars’
opt_dtypesor PyArrow’s schema casting). - File Path Tracking: Optionally include the source file path as a column in the resulting DataFrame/Table.
Universal read_files
The read_files function acts as a universal reader, delegating to format-specific readers (JSON, CSV, Parquet) while maintaining consistent options.
Example: Reading CSVs in Batches with Parallelism
from fsspec_utils.core import filesystem
# Assuming you have multiple CSV files like 'data/part_0.csv', 'data/part_1.csv', etc.
# on your local filesystem
fs = filesystem("file")
# Read CSV files in batches of 10, using multiple threads, and including file path
for batch_df in fs.read_files(
"data/*.csv",
format="csv",
batch_size=10,
include_file_path=True,
use_threads=True,
verbose=True
):
print(f"Processed batch with {len(batch_df)} rows. Columns: {batch_df.columns}")
print(batch_df.head(2))Reading and Processing Multiple Files (PyArrow Tables, Batch Processing)
fsspec-utils simplifies reading multiple files of various formats (Parquet, CSV, JSON) from a folder into a single PyArrow Table or Polars DataFrame.
Reading multiple files into a single table is a powerful feature that allows you to efficiently process data distributed across multiple files. This is particularly useful when dealing with large datasets that are split into smaller files for better organization or parallel processing.
Key concepts demonstrated:
Glob patterns: The
**/*.parquet,**/*.csv, and**/*.jsonpatterns are used to select files recursively from the directory and its subdirectories. The**pattern matches any directories, allowing the function to find files in nested directories.Concat parameter: The
concat=Trueparameter tells the function to combine data from multiple files into a single table or DataFrame. When set toFalse, the function would return a list of individual tables/DataFrames.Format flexibility: The same interface can be used to read different file formats (Parquet, CSV, JSON), making it easy to work with heterogeneous data sources.
Step-by-step explanation:
Creating sample data: We create two subdirectories and populate them with sample data in three different formats (Parquet, CSV, JSON). Each format contains the same structured data but in different serialization formats.
Reading Parquet files: Using
fs.read_parquet("**/*.parquet", concat=True), we read all Parquet files recursively and combine them into a single PyArrow Table. Parquet is a columnar storage format that is highly efficient for analytical workloads.Reading CSV files: Using
fs.read_csv("**/*.csv", concat=True), we read all CSV files and combine them into a Polars DataFrame, which we then convert to a PyArrow Table for consistency.Reading JSON files: Using
fs.read_json("**/*.json", as_dataframe=True, concat=True), we read all JSON files and combine them into a Polars DataFrame, then convert it to a PyArrow Table.Verification: Finally, we verify that all three tables have the same number of rows, confirming that the data was correctly read and combined across all files and formats.
The flexibility of fsspec-utils allows you to use the same approach with different data sources, including remote filesystems like S3, GCS, or Azure Blob Storage, simply by changing the filesystem path.
Setup
First, we’ll create a temporary directory with sample data in different formats.
Setup steps:
- Create a temporary directory for our example
- Create sample data in subdirectories
import tempfile
import shutil
import os
from examples.read_folder.create_dataset import create_sample_dataset
temp_dir = tempfile.mkdtemp()
print(f"Created temporary directory: {temp_dir}")
create_sample_dataset(temp_dir)This step sets up the environment by creating a temporary directory and populating it with sample data files.
Reading Parquet Files
Now, let’s read all the Parquet files from the directory and its subdirectories into a single PyArrow Table.
Reading Parquet files:
- Read Parquet files using glob pattern
- Display table information and sample data
print("\n=== Reading Parquet Files ===")
from fsspec_utils import filesystem
fs = filesystem(temp_dir)
parquet_table = fs.read_parquet("**/*.parquet", concat=True)
print(f"Successfully read Parquet files into PyArrow Table")
print(f"Table shape: {parquet_table.num_rows} rows x {parquet_table.num_columns} columns")
print("First 3 rows:")
print(parquet_table.slice(0, 3).to_pandas())We use the read_parquet method with a glob pattern **/*.parquet to find all Parquet files recursively. The concat=True parameter combines them into a single table.
Reading CSV Files
Next, we’ll read all the CSV files into a Polars DataFrame and then convert it to a PyArrow Table.
Reading CSV files:
- Read CSV files using glob pattern
- Display DataFrame information and sample data
- Convert to PyArrow Table for consistency
print("\n=== Reading CSV Files ===")
csv_df = fs.read_csv("**/*.csv", concat=True)
print(f"Successfully read CSV files into Polars DataFrame")
print(f"DataFrame shape: {csv_df.shape}")
print("First 3 rows:")
print(csv_df.head(3))
csv_table = csv_df.to_arrow()Similarly, we use read_csv with the same glob pattern to read all CSV files.
Reading JSON Files
Finally, let’s read all the JSON files.
Reading JSON files:
- Read JSON files using glob pattern
- Display DataFrame information and sample data
- Convert to PyArrow Table for consistency
print("\n=== Reading JSON Files ===")
json_df = fs.read_json("**/*.json", as_dataframe=True, concat=True)
print(f"Successfully read JSON files into Polars DataFrame")
print(f"DataFrame shape: {json_df.shape}")
print("First 3 rows:")
print(json_df.head(3))
json_table = json_df.to_arrow()The read_json method is used to read all JSON files. We set as_dataframe=True to get a Polars DataFrame.
Verification
Let’s verify that all the tables have the same number of rows.
print("\n=== Verification ===")
print(f"All tables have the same number of rows: {parquet_table.num_rows == csv_table.num_rows == json_table.num_rows}")
shutil.rmtree(temp_dir)
print(f"\nCleaned up temporary directory: {temp_dir}")This final step confirms that our data reading and concatenation were successful.
This example shows how to read various file formats from a directory, including subdirectories, into a unified PyArrow Table or Polars DataFrame. It highlights the flexibility of fsspec-utils in handling different data sources and formats.
fsspec-utils enables efficient batch processing of large datasets by reading files in smaller, manageable chunks. This is particularly useful for memory-constrained environments or when processing streaming data.
Batch processing is a technique for handling large datasets by dividing them into smaller, manageable chunks. This is particularly important for:
- Memory-constrained environments: When working with datasets that are too large to fit in memory, batch processing allows you to process the data incrementally.
- Streaming data: When data is continuously generated (e.g., from IoT devices or real-time applications), batch processing enables you to process data as it arrives.
- Distributed processing: In distributed computing environments, batch processing allows different nodes to work on different chunks of data simultaneously.
The batch_size parameter controls how many files or records are processed together in each batch. A smaller batch size reduces memory usage but may increase processing overhead, while a larger batch size improves throughput but requires more memory.
Step-by-step walkthrough:
Creating sample batched data: We generate sample data and distribute it across multiple files in each format (Parquet, CSV, JSON). Each file contains a subset of the total data, simulating a real-world scenario where data is split across multiple files.
Reading Parquet files in batches: Using
fs.read_parquet(parquet_path, batch_size=2), we read all Parquet files in batches of 2 files at a time. Each iteration of the loop processes a batch of files, and thebatchvariable contains the combined data from those files.Reading CSV files in batches: Similarly, we use
fs.read_csv(csv_path, batch_size=2)to read CSV files in batches. The result is a Polars DataFrame for each batch, which we can process individually.Reading JSON files in batches: Finally, we use
fs.read_json(json_path, batch_size=2)to read JSON files in batches. The JSON data is automatically converted to Polars DataFrames for easy processing.
Benefits of batch processing:
- Reduced memory footprint: Instead of loading all files into memory at once, you only load the current batch.
- Progressive processing: You can start processing data as soon as the first batch is available, without waiting for all data to be loaded.
- Fault tolerance: If processing fails on one batch, you can restart from that batch without reprocessing all previous batches.
- Scalability: Batch processing scales well with both the size of the dataset and the available computational resources.
This approach is particularly valuable when working with large datasets in cloud storage, where downloading the entire dataset would be impractical due to network constraints or memory limitations.
Setup
First, we’ll create a temporary directory with sample data files.
This step sets up the environment by creating a temporary directory and populating it with sample data files in batches.
Setup steps:
- Create a temporary directory for our example
- Create sample batched data
import tempfile
import shutil
import os
from examples.batch_processing.generate_batched_data import create_batched_dataset
temp_dir = tempfile.mkdtemp()
print(f"Created temporary directory: {temp_dir}")
create_batched_dataset(temp_dir)Batch Reading Parquet Files
Now, let’s read the Parquet files in batches.
Reading Parquet files in batches:
- Initialize filesystem
- Set up path pattern for Parquet files
- Process files in batches of 2
print("\n=== Parquet Batch Reading ===")
from fsspec_utils import filesystem
fs = filesystem("file")
parquet_path = os.path.join(temp_dir, "*.parquet")
print("\nReading Parquet files in batches (batch_size=2):")
for i, batch in enumerate(fs.read_parquet(parquet_path, batch_size=2)):
print(f" Batch {i+1}: rows={batch.num_rows}")
print(f" - Data preview: {batch.to_pandas().head(1).to_dict('records')}")Batch Reading CSV Files
Next, we’ll read the CSV files in batches.
Reading CSV files in batches:
- Set up path pattern for CSV files
- Process files in batches of 2
print("\n=== CSV Batch Reading ===")
csv_path = os.path.join(temp_dir, "*.csv")
print("\nReading CSV files in batches (batch_size=2):")
for i, batch in enumerate(fs.read_csv(csv_path, batch_size=2)):
print(f" Batch {i+1}: shape={batch.shape}")
print(f" - Data preview: {batch.head(1).to_dicts()}")Similarly, we use read_csv with the same glob pattern to read all CSV files.
Batch Reading JSON Files
Finally, let’s read the JSON files in batches.
Reading JSON files in batches:
- Set up path pattern for JSON files
- Process files in batches of 2
- Clean up the temporary directory
print("\n=== JSON Batch Reading ===")
json_path = os.path.join(temp_dir, "*.json")
print("\nReading JSON files in batches (batch_size=2):")
for i, batch in enumerate(fs.read_json(json_path, batch_size=2)):
print(f" Batch {i+1}: shape={batch.shape}")
print(f" - Data preview: {batch.head(1).to_dicts()}")
shutil.rmtree(temp_dir)
print(f"\nCleaned up temporary directory: {temp_dir}")The read_json method is also used with batch_size=2 to process JSON files in batches.
This example illustrates how to read Parquet, CSV, and JSON files in batches using the batch_size parameter. This approach allows for processing of large datasets without loading the entire dataset into memory at once.
Advanced Parquet Handling and Delta Lake Integration
fsspec-utils enhances Parquet operations with deep integration with PyArrow and Pydala, enabling efficient dataset management, partitioning, and delta lake capabilities.
pyarrow_dataset: Create PyArrow datasets for optimized querying, partitioning, and predicate pushdown.pyarrow_parquet_dataset: Specialized for Parquet, handling_metadatafiles for overall dataset schemas.pydala_dataset: Integrates withpydalafor advanced features like Delta Lake operations (upserts, schema evolution).
Example: Writing to a PyArrow Dataset with Partitioning
import polars as pl
from fsspec_utils.core import filesystem
fs = filesystem("file")
base_path = "output/my_partitioned_data"
# Sample data
data = pl.DataFrame({
"id": [1, 2, 3, 4],
"value": ["A", "B", "C", "D"],
"year": [2023, 2023, 2024, 2024],
"month": [10, 11, 1, 2]
})
# Write data as a partitioned PyArrow dataset
fs.write_pyarrow_dataset(
data=data,
path=base_path,
partition_by=["year", "month"], # Partition by year and month
format="parquet",
compression="zstd",
mode="overwrite" # Overwrite if path exists
)
print(f"Data written to {base_path} partitioned by year/month.")
# Expected structure: output/my_partitioned_data/year=2023/month=10/data-*.parquetExample: Delta Lake Operations with Pydala Dataset
import polars as pl
from fsspec_utils.core import filesystem
fs = filesystem("file")
delta_path = "output/my_delta_table"
# Initial data
initial_data = pl.DataFrame({
"id": [1, 2],
"name": ["Alice", "Bob"],
"version": [1, 1]
})
# Write initial data to a Pydala dataset
fs.write_pydala_dataset(
data=initial_data,
path=delta_path,
mode="overwrite"
)
print("Initial Delta table created.")
# New data for an upsert: update Alice, add Charlie
new_data = pl.DataFrame({
"id": [1, 3],
"name": ["Alicia", "Charlie"],
"version": [2, 1]
})
# Perform a delta merge (upsert)
fs.write_pydala_dataset(
data=new_data,
path=delta_path,
mode="delta",
delta_subset=["id"] # Column(s) to use for merging
)
print("Delta merge completed.")
# Read the updated table
updated_df = fs.pydala_dataset(delta_path).to_polars()
print("Updated Delta table:")
print(updated_df)
# Expected: id=1 Alicia version=2, id=2 Bob version=1, id=3 Charlie version=1fsspec-utils facilitates integration with Delta Lake by providing StorageOptions that can be used to configure deltalake’s DeltaTable for various storage backends.
This example demonstrates how to use LocalStorageOptions with deltalake’s DeltaTable. It shows how to initialize a DeltaTable instance by passing the fsspec-utils storage options, enabling seamless interaction with Delta Lake tables across different storage types.
Step-by-step walkthrough:
- Create a temporary directory for our example
- Create a simple Polars DataFrame
- Write initial data to create the Delta table
- Create a LocalStorageOptions object for the temporary directory
- Create a DeltaTable instance, passing storage options
- Note: deltalake expects storage_options as a dict, which to_object_store_kwargs provides
- Read data from the DeltaTable
- Clean up the temporary directory
Delta Lake is an open-source storage layer that brings ACID transactions to Apache Spark and big data workloads. It provides a reliable, scalable, and performant way to work with data lakes, combining the benefits of data lakes (low cost, flexibility) with data warehouses (reliability, performance).
from deltalake import DeltaTable
from fsspec_utils.storage_options import LocalStorageOptions
import tempfile
import shutil
import os
import polars as pl
temp_dir = tempfile.mkdtemp()
print(f"Created temporary directory: {temp_dir}")
delta_table_path = os.path.join(temp_dir, "my_delta_table")
print(f"Creating a dummy Delta table at: {delta_table_path}")
data = pl.DataFrame({
"id": [1, 2, 3],
"value": ["A", "B", "C"]
})
data.write_delta(delta_table_path, mode="overwrite")
print("Initial data written to Delta table.")
local_options = LocalStorageOptions(path=temp_dir)
dt = DeltaTable(delta_table_path, storage_options=local_options.to_object_store_kwargs())
print(f"\nSuccessfully created DeltaTable instance from: {delta_table_path}")
print(f"DeltaTable version: {dt.version()}")
print(f"DeltaTable files: {dt.files()}")
table_data = dt.to_pyarrow_table()
print("\nData read from DeltaTable:")
print(table_data.to_pandas())
shutil.rmtree(temp_dir)
print(f"Cleaned up temporary directory: {temp_dir}")Key features of Delta Lake:
- ACID transactions: Ensures data integrity even with concurrent operations
- Time travel: Allows querying data as it existed at any point in time
- Schema enforcement: Maintains data consistency with schema validation
- Scalable metadata: Handles billions of files efficiently
- Unified analytics: Supports both batch and streaming workloads
Integrating fsspec-utils with Delta Lake:
The fsspec-utils StorageOptions classes can be used to configure deltalake’s DeltaTable for various storage backends. This integration allows you to:
- Use consistent configuration patterns across different storage systems
- Leverage the benefits of fsspec’s unified filesystem interface
- Seamlessly switch between local and cloud storage without changing your Delta Lake code
The to_object_store_kwargs() method converts fsspec-utils storage options into a dictionary format that deltalake expects for its storage_options parameter. This is necessary because deltalake requires storage options as a dictionary, while fsspec-utils provides them as structured objects.
Step-by-step walkthrough:
Creating a temporary directory: We create a temporary directory to store our Delta table, ensuring the example is self-contained and doesn’t leave artifacts on your system.
Creating sample data: We create a simple Polars DataFrame with sample data that will be written to our Delta table.
Writing to Delta table: Using the
write_deltamethod, we convert our DataFrame into a Delta table. This creates the necessary Delta Lake metadata alongside the data files.Configuring storage options: We create a
LocalStorageOptionsobject that points to our temporary directory. This object contains all the information needed to access the Delta table.Initializing DeltaTable: We create a
DeltaTableinstance by passing the table path and the storage options converted to a dictionary viato_object_store_kwargs(). This allowsdeltalaketo locate and access the Delta table files.Verifying the DeltaTable: We check the version and files of our Delta table to confirm it was created correctly. Delta tables maintain version history, allowing you to track changes over time.
Reading data: Finally, we read the data from our Delta table back into a PyArrow Table, demonstrating that we can successfully interact with the Delta Lake table using the fsspec-utils configuration.
This integration is particularly valuable when working with Delta Lake in cloud environments, as it allows you to use the same configuration approach for local development and production deployments across different cloud providers.
Storage Options Management
fsspec-utils provides a robust system for managing storage configurations, simplifying credential handling and environment setup.
Loading from Environment Variables
Instead of hardcoding credentials, you can load storage options directly from environment variables.
Example: Loading AWS S3 Configuration from Environment
Set these environment variables before running your script:
export AWS_ACCESS_KEY_ID="YOUR_ACCESS_KEY"
export AWS_SECRET_ACCESS_KEY="YOUR_SECRET_KEY"
export AWS_DEFAULT_REGION="us-west-2"Then in Python:
from fsspec_utils.storage_options import AwsStorageOptions
# Load AWS options directly from environment variables
aws_opts = AwsStorageOptions.from_env()
print(f"Loaded AWS region: {aws_opts.region}")
# Use it to create a filesystem
# fs = aws_opts.to_filesystem()Merging Storage Options
Combine multiple storage option configurations, useful for layering default settings with user-specific overrides.
Example: Merging S3 Options
from fsspec_utils.storage_options import AwsStorageOptions, merge_storage_options
# Base configuration
base_opts = AwsStorageOptions(
protocol="s3",
region="us-east-1",
access_key_id="DEFAULT_KEY"
)
# User-provided overrides
user_overrides = {
"access_key_id": "USER_KEY",
"allow_http": True # New setting
}
# Merge, with user_overrides taking precedence
merged_opts = merge_storage_options(base_opts, user_overrides, overwrite=True)
print(f"Merged Access Key ID: {merged_opts.access_key_id}") # USER_KEY
print(f"Merged Region: {merged_opts.region}") # us-east-1
print(f"Allow HTTP: {merged_opts.allow_http}") # TrueNote on GitHub Examples
For a comprehensive collection of executable examples demonstrating various functionalities and advanced patterns of fsspec-utils, including those discussed in this document, please refer to the examples directory on GitHub. Each example is designed to be runnable and provides detailed insights into practical usage.
Performance Tips
- Caching: Always consider using
cached=Truewith thefilesystemfunction, especially for remote filesystems, to minimize repeated downloads. - Parallel Reading: For multiple files, set
use_threads=Trueinread_json,read_csv, andread_parquetto leverage concurrent I/O. - Batch Processing: When dealing with a very large number of files or extremely large individual files, use the
batch_sizeparameter in reading functions to process data in chunks, reducing memory footprint. opt_dtypes: Utilizeopt_dtypes=Truein reading functions when converting to Polars or PyArrow to automatically optimize column data types, leading to more efficient memory usage and faster subsequent operations.- Parquet Datasets: For large, partitioned Parquet datasets, use
pyarrow_datasetorpydala_dataset. These leverage PyArrow’s dataset API for efficient metadata handling, partition pruning, and predicate pushdown, reading only the necessary data. - Compression: When writing Parquet files, choose an appropriate compression codec (e.g.,
zstd,snappy) to reduce file size and improve I/O performance.zstdoften provides a good balance of compression ratio and speed.
Flexible Storage Configuration
fsspec-utils simplifies configuring connections to various storage systems, including local filesystems, AWS S3, Azure Storage, and Google Cloud Storage, using StorageOptions classes. These options can then be converted into fsspec filesystems.
Local Storage Example
This example demonstrates how to initialize LocalStorageOptions and use it to interact with the local filesystem.
Step-by-step walkthrough:
- Create a temporary directory for our test
- Create a test file and write content to it
- List files in the directory to verify our file was created
- Read the content back to verify it was written correctly
- Clean up the temporary directory
StorageOptions classes simplify configuration for different storage systems and provide a consistent interface for creating fsspec filesystem objects.
import os
import tempfile
import shutil
from fsspec_utils.storage_options import LocalStorageOptions
print("=== LocalStorageOptions Example ===\n")
local_options = LocalStorageOptions(auto_mkdir=True)
local_fs = local_options.to_filesystem()
temp_dir = tempfile.mkdtemp()
print(f"Working in temporary directory: {temp_dir}")
temp_file = os.path.join(temp_dir, "test_file.txt")
with local_fs.open(temp_file, "w") as f:
f.write("Hello, LocalStorageOptions!")
print(f"Created test file: {temp_file}")
files = local_fs.ls(temp_dir)
print(f"Files in {temp_dir}: {[os.path.basename(f) for f in files]}")
with local_fs.open(temp_file, "r") as f:
content = f.read()
print(f"File content: '{content}'")
shutil.rmtree(temp_dir)
print(f"Cleaned up temporary directory: {temp_dir}")
print("Local storage example completed.\n")Conceptual AWS S3 Configuration
This example demonstrates the configuration pattern for AwsStorageOptions. It is expected to fail when attempting to connect to actual cloud services because it uses dummy credentials.
Note: The to_filesystem() method converts StorageOptions into fsspec-compatible objects, allowing seamless integration with any fsspec-compatible library.
from fsspec_utils.storage_options import AwsStorageOptions
print("=== Conceptual AwsStorageOptions Example (using a dummy endpoint) ===\n")
aws_options = AwsStorageOptions(
endpoint_url="http://s3.dummy-endpoint.com",
access_key_id="DUMMY_KEY",
secret_access_key="DUMMY_SECRET",
allow_http=True,
region="us-east-1"
)
aws_fs = aws_options.to_filesystem()
print(f"Created fsspec filesystem for S3: {type(aws_fs).__name__}")
print("AWS storage example completed.\n")Conceptual Azure Configuration
This example shows how to configure AzureStorageOptions. It is expected to fail when attempting to connect to actual cloud services because it uses dummy credentials.
from fsspec_utils.storage_options import AzureStorageOptions
print("=== Conceptual AzureStorageOptions Example (using a dummy connection string) ===\n")
azure_options = AzureStorageOptions(
protocol="az",
account_name="demoaccount",
connection_string="DefaultEndpointsProtocol=https;AccountName=demoaccount;AccountKey=demokey==;EndpointSuffix=core.windows.net"
)
azure_fs = azure_options.to_filesystem()
print(f"Created fsspec filesystem for Azure: {type(azure_fs).__name__}")
print("Azure storage example completed.\n")Conceptual GCS Configuration
This example shows how to configure GcsStorageOptions. It is expected to fail when attempting to connect to actual cloud services because it uses dummy credentials.
StorageOptions classes provide a simplified, consistent interface for configuring connections to various storage systems. They abstract away the complexity of different storage backends and provide a unified way to create fsspec filesystem objects.
The to_filesystem() method converts these options into fsspec compatible objects, enabling seamless integration with any fsspec-compatible library or tool.
Important Note: The AWS, Azure, and GCS examples use dummy credentials and are for illustrative purposes only. These examples are expected to fail when attempting to connect to actual cloud services because:
- The endpoint URLs are not real service endpoints
- The credentials are placeholder values that don’t correspond to actual accounts
- The connection strings and tokens are examples, not valid credentials
This approach allows you to understand the configuration pattern without needing actual cloud credentials. When using these examples in production, you would replace the dummy values with your real credentials and service endpoints.
```python from fsspec_utils.storage_options import GcsStorageOptions
print(“=== Conceptual GcsStorageOptions Example (using a dummy token path) ===”) gcs_options = GcsStorageOptions( protocol=“gs”, project=“demo-project”, token=“path/to/dummy-service-account.json” )
gcs_fs = gcs_options.to_filesystem() print(f”Created fsspec filesystem for GCS: {type(gcs_fs).__name__}“) print(”GCS storage example completed.“)