fsspec_utils.core.ext API Documentation¶
This module extends fsspec.AbstractFileSystem with methods for reading and writing various file formats (JSON, CSV, Parquet), with advanced options such as batch processing, parallelization, and data type optimization. It also includes functions for creating PyArrow and Pydala datasets.
path_to_glob()¶
Convert a path to a glob pattern for file matching.
Intelligently converts paths to glob patterns that match files of the specified format, handling various directory and wildcard patterns.
| Parameter | Type | Description |
|---|---|---|
| `path` | str | Base path to convert. Can include wildcards (`*` or `**`). Examples: `"data/"`, `"data/*.json"`, `"data/*"` |
| `format` | str or None | File format to match (without dot). If None, inferred from the path. Examples: `"json"`, `"csv"`, `"parquet"` |

| Returns | Description |
|---|---|
| str | Glob pattern that matches files of the specified format. Examples: `"data/**/*.json"`, `"data/*.csv"` |
Example:
read_json_file()¶
Read a single JSON file from any filesystem.
A public wrapper around _read_json_file providing a clean interface for reading individual JSON files.
| Parameter | Type | Description |
|---|---|---|
| `self` | AbstractFileSystem | Filesystem instance to use for reading |
| `path` | str | Path to the JSON file to read |
| `include_file_path` | bool | Whether to return a dict with the filepath as key |
| `jsonlines` | bool | Whether to read as JSON Lines format |

| Returns | Description |
|---|---|
| dict or list[dict] | Parsed JSON data. For regular JSON, returns a dict. For JSON Lines, returns a list of dicts. If include_file_path=True, returns {filepath: data}. |
Example:
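As an illustration, a plain-`json` sketch of the documented return shapes (the real method reads through the filesystem instance but returns these same structures):

```python
import json

# jsonlines=False: the file holds one JSON document -> dict
doc = json.loads('{"id": 1, "name": "a"}')

# jsonlines=True: one JSON object per line -> list[dict]
jsonl_text = '{"id": 1}\n{"id": 2}\n'
records = [json.loads(line) for line in jsonl_text.splitlines() if line]

# include_file_path=True: the result is wrapped as {filepath: data}
wrapped = {"data/record.json": doc}
```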
read_json()¶
Read JSON data from one or more files with powerful options.
Provides a flexible interface for reading JSON data with support for:
- Single file or multiple files
- Regular JSON or JSON Lines format
- Batch processing for large datasets
- Parallel processing
- DataFrame conversion
- File path tracking
| Parameter | Type | Description |
|---|---|---|
| `path` | str or list[str] | Path(s) to JSON file(s): a single path string (globs supported) or a list of path strings |
| `batch_size` | int or None | If set, enables batch reading with this many files per batch |
| `include_file_path` | bool | Include the source filepath in the output |
| `jsonlines` | bool | Whether to read as JSON Lines format |
| `as_dataframe` | bool | Convert output to Polars DataFrame(s) |
| `concat` | bool | Combine multiple files/batches into a single result |
| `use_threads` | bool | Enable parallel file reading |
| `verbose` | bool | Print progress information |
| `opt_dtypes` | bool | Optimize DataFrame dtypes for performance |
| `**kwargs` | Any | Additional arguments passed to DataFrame conversion |

| Returns | Description |
|---|---|
| dict, list[dict], pl.DataFrame, list[pl.DataFrame], or Generator | Varies with the arguments: dict (single JSON file as dictionary); list[dict] (multiple JSON files); pl.DataFrame (single or concatenated DataFrame); list[pl.DataFrame] (if concat=False); Generator (if batch_size is set, yields batches of the above) |
Example:
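A rough sketch of the multi-file semantics using only the standard library (the real `read_json` works through an fsspec filesystem and can also return Polars DataFrames):

```python
import glob
import json
import os
import tempfile

# Write three small JSON Lines files, then read them all back,
# mirroring the concat=False (per-file lists) and concat=True
# (one combined list) behaviors described above.
with tempfile.TemporaryDirectory() as d:
    for i in range(3):
        with open(os.path.join(d, f"part{i}.jsonl"), "w") as f:
            f.write(json.dumps({"part": i}) + "\n")

    paths = sorted(glob.glob(os.path.join(d, "*.jsonl")))
    per_file = []
    for p in paths:
        with open(p) as f:
            per_file.append([json.loads(line) for line in f])
    # concat=True: flatten the per-file results into one list
    combined = [row for rows in per_file for row in rows]
```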
read_csv_file()¶
Read a single CSV file from any filesystem.
Internal function that handles reading individual CSV files and optionally adds the source filepath as a column.
| Parameter | Type | Description |
|---|---|---|
| `self` | AbstractFileSystem | Filesystem instance to use for reading |
| `path` | str | Path to the CSV file |
| `include_file_path` | bool | Add the source filepath as a column |
| `opt_dtypes` | bool | Optimize DataFrame dtypes |
| `**kwargs` | Any | Additional arguments passed to pl.read_csv() |

| Returns | Description |
|---|---|
| pl.DataFrame | DataFrame containing the CSV data |
Example:
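Illustrative only: a stdlib sketch of the include_file_path behavior (the real method returns a Polars DataFrame; the `file_path` column name used here is an assumption):

```python
import csv
import io

# Parse a small CSV, then attach the source path to every row,
# mimicking include_file_path=True.
raw = "id,name\n1,a\n2,b\n"
rows = list(csv.DictReader(io.StringIO(raw)))
source = "data/users.csv"
for row in rows:
    row["file_path"] = source   # assumed column name
```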
read_csv()¶
Read CSV data from one or more files with powerful options.
Provides a flexible interface for reading CSV files with support for:
- Single file or multiple files
- Batch processing for large datasets
- Parallel processing
- File path tracking
- Polars DataFrame output
| Parameter | Type | Description |
|---|---|---|
| `path` | str or list[str] | Path(s) to CSV file(s): a single path string (globs supported) or a list of path strings |
| `batch_size` | int or None | If set, enables batch reading with this many files per batch |
| `include_file_path` | bool | Add the source filepath as a column |
| `concat` | bool | Combine multiple files/batches into a single DataFrame |
| `use_threads` | bool | Enable parallel file reading |
| `verbose` | bool | Print progress information |
| `**kwargs` | Any | Additional arguments passed to pl.read_csv() |

| Returns | Description |
|---|---|
| pl.DataFrame, list[pl.DataFrame], or Generator | Varies with the arguments: pl.DataFrame (single or concatenated DataFrame); list[pl.DataFrame] (if concat=False); Generator (if batch_size is set, yields batches of the above) |
Example:
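A minimal sketch of the batch_size semantics (a hypothetical helper, not the library's internal code): files are processed in fixed-size groups, so large datasets never have to be loaded at once.

```python
# Yield file paths in groups of batch_size, as a batched reader would
# before loading each group into a DataFrame.
def batches(paths: list[str], batch_size: int):
    for i in range(0, len(paths), batch_size):
        yield paths[i : i + batch_size]

files = [f"data/part{i}.csv" for i in range(5)]
groups = list(batches(files, batch_size=2))
```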
read_parquet_file()¶
Read a single Parquet file from any filesystem.
Internal function that handles reading individual Parquet files and optionally adds the source filepath as a column.
| Parameter | Type | Description |
|---|---|---|
| `self` | AbstractFileSystem | Filesystem instance to use for reading |
| `path` | str | Path to the Parquet file |
| `include_file_path` | bool | Add the source filepath as a column |
| `opt_dtypes` | bool | Optimize DataFrame dtypes |
| `**kwargs` | Any | Additional arguments passed to pq.read_table() |

| Returns | Description |
|---|---|
| pa.Table | PyArrow Table containing the Parquet data |
Example:
read_parquet()¶
Read Parquet data with advanced features and optimizations.
Provides a high-performance interface for reading Parquet files with support for:
- Single file or multiple files
- Batch processing for large datasets
- Parallel processing
- File path tracking
- Automatic concatenation
- PyArrow Table output
The function automatically uses optimal reading strategies:
- Direct dataset reading for simple cases
- Parallel processing for multiple files
- Batched reading for memory efficiency
| Parameter | Type | Description |
|---|---|---|
| `path` | str or list[str] | Path(s) to Parquet file(s): a single path string (globs supported), a list of path strings, or a directory containing a _metadata file |
| `batch_size` | int or None | If set, enables batch reading with this many files per batch |
| `include_file_path` | bool | Add the source filepath as a column |
| `concat` | bool | Combine multiple files/batches into a single Table |
| `use_threads` | bool | Enable parallel file reading |
| `verbose` | bool | Print progress information |
| `opt_dtypes` | bool | Optimize Table dtypes for performance |
| `**kwargs` | Any | Additional arguments passed to pq.read_table() |

| Returns | Description |
|---|---|
| pa.Table, list[pa.Table], or Generator | Varies with the arguments: pa.Table (single or concatenated Table); list[pa.Table] (if concat=False); Generator (if batch_size is set, yields batches of the above) |
Example:
read_files()¶
Universal interface for reading data files of any supported format.
A unified API that automatically delegates to the appropriate reading function based on file format, while preserving all advanced features like:
- Batch processing
- Parallel reading
- File path tracking
- Format-specific optimizations
| Parameter | Type | Description |
|---|---|---|
| `path` | str or list[str] | Path(s) to data file(s): a single path string (globs supported) or a list of path strings |
| `format` | str | File format to read. Supported values: "json" (regular JSON or JSON Lines), "csv", "parquet" |
| `batch_size` | int or None | If set, enables batch reading with this many files per batch |
| `include_file_path` | bool | Add the source filepath as a column/field |
| `concat` | bool | Combine multiple files/batches into a single result |
| `jsonlines` | bool | For JSON format, whether to read as JSON Lines |
| `use_threads` | bool | Enable parallel file reading |
| `verbose` | bool | Print progress information |
| `opt_dtypes` | bool | Optimize DataFrame/Arrow Table dtypes for performance |
| `**kwargs` | Any | Additional format-specific arguments |

| Returns | Description |
|---|---|
| pl.DataFrame, pa.Table, list[pl.DataFrame], list[pa.Table], or Generator | Varies with the format and arguments: pl.DataFrame (for CSV and optionally JSON); pa.Table (for Parquet); a list of the above (without concatenation); Generator (if batch_size is set, yields batches) |
Example:
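A hypothetical sketch of the dispatch described above; the reader functions here are stand-ins, not the library's actual internals:

```python
# Stand-in readers that just report which path they were given.
def read_json(path, **kwargs):
    return ("json", path)

def read_csv(path, **kwargs):
    return ("csv", path)

def read_parquet(path, **kwargs):
    return ("parquet", path)

READERS = {"json": read_json, "csv": read_csv, "parquet": read_parquet}

def read_files(path, format, **kwargs):
    """Delegate to the reader registered for `format`."""
    try:
        return READERS[format](path, **kwargs)
    except KeyError:
        raise ValueError(f"unsupported format: {format!r}") from None
```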
pyarrow_dataset()¶
Create a PyArrow dataset from files in any supported format.
Creates a dataset that provides optimized reading and querying capabilities including:
- Schema inference and enforcement
- Partition discovery and pruning
- Predicate pushdown
- Column projection
| Parameter | Type | Description |
|---|---|---|
| `path` | str | Base path to the dataset files |
| `format` | str | File format. Currently supports: "parquet" (default), "csv", "json" (experimental) |
| `schema` | pa.Schema or None | Optional schema to enforce. If None, inferred from the data. |
| `partitioning` | str, list[str], or pds.Partitioning | How the dataset is partitioned: a single partition field (str), multiple partition fields (list[str]), or a custom partitioning scheme (pds.Partitioning) |
| `**kwargs` | Any | Additional arguments for dataset creation |

| Returns | Description |
|---|---|
| pds.Dataset | PyArrow dataset instance |
Example:
pyarrow_parquet_dataset()¶
Create a PyArrow dataset optimized for Parquet files.
Creates a dataset specifically for Parquet data, automatically handling _metadata files for optimized reading.
This function is particularly useful for:
- Datasets with existing _metadata files
- Multi-file datasets that should be treated as one
- Partitioned Parquet datasets
| Parameter | Type | Description |
|---|---|---|
| `path` | str | Path to the dataset directory or _metadata file |
| `schema` | pa.Schema or None | Optional schema to enforce. If None, inferred from the data. |
| `partitioning` | str, list[str], or pds.Partitioning | How the dataset is partitioned: a single partition field (str), multiple partition fields (list[str]), or a custom partitioning scheme (pds.Partitioning) |
| `**kwargs` | Any | Additional dataset arguments |

| Returns | Description |
|---|---|
| pds.Dataset | PyArrow dataset instance |
Example:
pydala_dataset()¶
Create a Pydala dataset for advanced Parquet operations.
Creates a dataset with additional features beyond PyArrow including:
- Delta table support
- Schema evolution
- Advanced partitioning
- Metadata management
- Sort key optimization
| Parameter | Type | Description |
|---|---|---|
| `path` | str | Path to the dataset directory |
| `partitioning` | str, list[str], or pds.Partitioning | How the dataset is partitioned: a single partition field (str), multiple partition fields (list[str]), or a custom partitioning scheme (pds.Partitioning) |
| `**kwargs` | Any | Additional dataset configuration |

| Returns | Description |
|---|---|
| ParquetDataset | Pydala dataset instance |
Example:
write_parquet()¶
Write data to a Parquet file with automatic format conversion.
Handles writing data from multiple input formats to Parquet with:
- Automatic conversion to PyArrow
- Schema validation/coercion
- Metadata collection
- Compression and encoding options
| Parameter | Type | Description |
|---|---|---|
| `data` | pl.DataFrame, pl.LazyFrame, pa.Table, pd.DataFrame, dict, or list[dict] | Input data: Polars DataFrame/LazyFrame, PyArrow Table, Pandas DataFrame, or dict / list of dicts |
| `path` | str | Output Parquet file path |
| `schema` | pa.Schema or None | Optional schema to enforce on write |
| `**kwargs` | Any | Additional arguments for pq.write_table() |

| Returns | Description |
|---|---|
| pq.FileMetaData | Metadata of the written Parquet file |

| Raises | Description |
|---|---|
| SchemaError | If the data doesn't match the schema |
| ValueError | If the data cannot be converted |
Example:
write_json()¶
Write data to a JSON file with flexible input support.
Handles writing data in various formats to JSON or JSON Lines, with optional appending for streaming writes.
| Parameter | Type | Description |
|---|---|---|
| `data` | dict, list[dict], pl.DataFrame, pl.LazyFrame, pa.Table, or pd.DataFrame | Input data: dict or list of dicts, Polars DataFrame/LazyFrame, PyArrow Table, or Pandas DataFrame |
| `path` | str | Output JSON file path |
| `append` | bool | Whether to append to an existing file (JSON Lines mode) |
Example:
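A stdlib sketch of the append-for-streaming behavior (a hypothetical stand-in; the real `write_json` also accepts DataFrames and writes through the filesystem instance):

```python
import json
import os
import tempfile

def write_json(data: dict, path: str, append: bool = False):
    """Stand-in: append=True writes one JSON Lines record per call."""
    with open(path, "a" if append else "w") as f:
        f.write(json.dumps(data) + ("\n" if append else ""))

with tempfile.TemporaryDirectory() as d:
    p = os.path.join(d, "events.jsonl")
    # Two appending calls build a streaming log, one record per line.
    write_json({"event": "start"}, p, append=True)
    write_json({"event": "stop"}, p, append=True)
    with open(p) as f:
        lines = [json.loads(line) for line in f]
```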
write_csv()¶
Write data to a CSV file with flexible input support.
Handles writing data from multiple formats to CSV with options for:
- Appending to existing files
- Custom delimiters and formatting
- Automatic type conversion
- Header handling
| Parameter | Type | Description |
|---|---|---|
| `data` | pl.DataFrame, pl.LazyFrame, pa.Table, pd.DataFrame, dict, or list[dict] | Input data: Polars DataFrame/LazyFrame, PyArrow Table, Pandas DataFrame, or dict / list of dicts |
| `path` | str | Output CSV file path |
| `append` | bool | Whether to append to an existing file |
| `**kwargs` | Any | Additional arguments for CSV writing: delimiter (field separator, default ","), header (whether to write a header row), quote_char (character for quoting fields), date_format (format for date/time fields), float_precision (decimal places for floats) |
Example:
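A stdlib sketch of the append and header handling (a hypothetical stand-in; the real `write_csv` accepts DataFrames and writes through the filesystem instance):

```python
import csv
import os
import tempfile

def write_csv(rows: list[dict], path: str, append: bool = False):
    """Stand-in: write the header only when creating a new file."""
    new_file = not (append and os.path.exists(path))
    with open(path, "a" if append else "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(rows[0]))
        if new_file:
            writer.writeheader()
        writer.writerows(rows)

with tempfile.TemporaryDirectory() as d:
    p = os.path.join(d, "out.csv")
    # First call creates the file with a header; the second appends rows only.
    write_csv([{"id": "1"}], p, append=True)
    write_csv([{"id": "2"}], p, append=True)
    with open(p) as f:
        content = f.read()
```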
write_file()¶
Write a DataFrame to a file in the given format.
| Parameter | Type | Description |
|---|---|---|
| `data` | pl.DataFrame, pl.LazyFrame, pa.Table, pd.DataFrame, or dict | Data to write. |
| `path` | str | Path to write the data to. |
| `format` | str | Format of the file. |
| `**kwargs` | Any | Additional keyword arguments. |

| Returns | Description |
|---|---|
| None | No return value. |
write_files()¶
Write a DataFrame or a list of DataFrames to a file or a list of files.
| Parameter | Type | Description |
|---|---|---|
| `data` | pl.DataFrame, pl.LazyFrame, pa.Table, pa.RecordBatch, pa.RecordBatchReader, pd.DataFrame, dict, or a list of any of these | Data to write. |
| `path` | str or list[str] | Path(s) to write the data to. |
| `basename` | str | Basename of the files. Defaults to None. |
| `format` | str | Format of the data. Defaults to None. |
| `concat` | bool | If True, concatenate the DataFrames. Defaults to True. |
| `unique` | bool, str, or list[str] | If True, remove duplicates. Defaults to False. |
| `mode` | str | Write mode. Defaults to 'append'. Options: 'append', 'overwrite', 'delete_matching', 'error_if_exists'. |
| `use_threads` | bool | If True, use parallel processing. Defaults to True. |
| `verbose` | bool | If True, print verbose output. Defaults to True. |
| `**kwargs` | Any | Additional keyword arguments. |

| Returns | Description |
|---|---|
| None | No return value. |

| Raises | Description |
|---|---|
| FileExistsError | If the file already exists and mode is 'error_if_exists'. |
write_pyarrow_dataset()¶
Write tabular data to a PyArrow dataset.
| Parameter | Type | Description |
|---|---|---|
| `data` | pl.DataFrame, pa.Table, pa.RecordBatch, pa.RecordBatchReader, pd.DataFrame, or a list of any of these | Data to write. |
| `path` | str | Path to write the data to. |
| `basename` | str or None | Basename of the files. Defaults to None. |
| `schema` | pa.Schema or None | Schema of the data. Defaults to None. |
| `partition_by` | str, list[str], pds.Partitioning, or None | Partitioning of the data. Defaults to None. |
| `partitioning_flavor` | str | Partitioning flavor. Defaults to 'hive'. |
| `mode` | str | Write mode. Defaults to 'append'. |
| `format` | str or None | Format of the data. Defaults to 'parquet'. |
| `compression` | str | Compression algorithm. Defaults to 'zstd'. |
| `max_rows_per_file` | int or None | Maximum number of rows per file. Defaults to 2,500,000. |
| `row_group_size` | int or None | Row group size. Defaults to 250,000. |
| `concat` | bool | If True, concatenate the DataFrames. Defaults to True. |
| `unique` | bool, str, or list[str] | If True, remove duplicates. Defaults to False. |
| `**kwargs` | Any | Additional keyword arguments for pds.write_dataset. |

| Returns | Description |
|---|---|
| list[pq.FileMetaData] or None | List of Parquet file metadata, or None. |
write_pydala_dataset()¶
Write tabular data to a Pydala dataset.
| Parameter | Type | Description |
|---|---|---|
| `data` | pl.DataFrame, pa.Table, pa.RecordBatch, pa.RecordBatchReader, pd.DataFrame, or a list of any of these | Data to write. |
| `path` | str | Path to write the data to. |
| `mode` | str | Write mode. Defaults to 'append'. Options: 'append', 'delta', 'overwrite'. |
| `basename` | str or None | Basename of the files. Defaults to None. |
| `partition_by` | str, list[str], or None | Partitioning of the data. Defaults to None. |
| `partitioning_flavor` | str | Partitioning flavor. Defaults to 'hive'. |
| `max_rows_per_file` | int or None | Maximum number of rows per file. Defaults to 2,500,000. |
| `row_group_size` | int or None | Row group size. Defaults to 250,000. |
| `compression` | str | Compression algorithm. Defaults to 'zstd'. |
| `sort_by` | str, list[str], list[tuple[str, str]], or None | Columns to sort by. Defaults to None. |
| `unique` | bool, str, or list[str] | If True, ensure unique values. Defaults to False. |
| `delta_subset` | str, list[str], or None | Subset of columns to include in the delta table. Defaults to None. |
| `update_metadata` | bool | If True, update metadata. Defaults to True. |
| `alter_schema` | bool | If True, alter the schema. Defaults to False. |
| `timestamp_column` | str or None | Timestamp column. Defaults to None. |
| `verbose` | bool | If True, print verbose output. Defaults to True. |
| `**kwargs` | Any | Additional keyword arguments for ParquetDataset.write_to_dataset. |

| Returns | Description |
|---|---|
| None | No return value. |