🚀 filoma demo

Fast, multi-backend file analysis with a tiny API surface

In [1]:
from IPython.display import Image

import filoma

print(f"filoma version: {filoma.__version__}")
Image("../images/flow.png", width=600)
filoma version: 1.7.3
Out[1]:

Let's start with something simple, like getting a handy dataclass for a single file:

๐Ÿ”๐Ÿ“„ File analysisยถ

📄 Single file (any kind)

In [2]:
from filoma import probe_file

file_info = probe_file("../README.md")
print(f"Path: {file_info.path}")
print(f"Size: {file_info.size}")
print(f"Modified: {file_info.modified}")
print(f"file_info: {[i for i in dir(file_info) if not i.startswith('_')]}")
Path: /home/kalfasy/repos/filoma/README.md
Size: 10914
Modified: 2025-09-13 16:01:30
file_info: ['accessed', 'as_dict', 'created', 'from_report', 'get', 'group', 'inode', 'is_dir', 'is_file', 'is_symlink', 'items', 'keys', 'mode', 'mode_str', 'modified', 'nlink', 'owner', 'path', 'rights', 'sha256', 'size', 'target_is_dir', 'target_is_file', 'to_dict', 'values', 'xattrs']
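
For intuition, here is roughly the kind of metadata gathering `probe_file` performs, sketched with the standard library only (an illustrative approximation, not filoma's implementation):

```python
import datetime
import os
import stat
import tempfile

def probe_file_sketch(path: str) -> dict:
    """Gather probe_file-style metadata for one path via os.stat."""
    st = os.stat(path, follow_symlinks=False)
    return {
        "path": os.path.abspath(path),
        "size": st.st_size,
        "mode_str": stat.filemode(st.st_mode),  # e.g. '-rw-rw-r--'
        "inode": st.st_ino,
        "nlink": st.st_nlink,
        "is_file": stat.S_ISREG(st.st_mode),
        "is_dir": stat.S_ISDIR(st.st_mode),
        "modified": datetime.datetime.fromtimestamp(st.st_mtime).strftime("%Y-%m-%d %H:%M:%S"),
    }

# demo on a throwaway 5-byte file
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"hello")
info = probe_file_sketch(f.name)
print(info["size"], info["mode_str"])
os.unlink(f.name)
```

filoma's dataclass additionally exposes dict-like helpers (`get`, `keys`, `items`, `as_dict`) over the same fields, as the listing above shows.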

or specifically for image files:

🖼️ Image file analysis

In [3]:
from filoma import probe_image

img = probe_image("../images/logo.png")
print(f"Type of file: {img.file_type}, Type of img object: {type(img)}")
print(f"Shape: {img.shape}")
print(f"Data range: {img.min} - {img.max}")
print(f"img info: {img.as_dict()}")
Type of file: png, Type of img object: <class 'filoma.images.image_profiler.ImageReport'>
Shape: (762, 628, 4)
Data range: 0.0 - 255.0
img info: {'path': '../images/logo.png', 'file_type': 'png', 'shape': (762, 628, 4), 'dtype': 'uint8', 'min': 0.0, 'max': 255.0, 'mean': 230.89027732500793, 'nans': 0, 'infs': 0, 'unique': 256, 'status': None}
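
The summary statistics in the report (min, max, mean, unique values) are plain reductions over the decoded pixel array. A minimal pure-Python illustration of the same idea, using a hypothetical 2×2 single-channel image:

```python
# hypothetical 2x2 single-channel "image" as nested lists
pixels = [[0, 128], [255, 64]]
flat = [v for row in pixels for v in row]

report = {
    "shape": (len(pixels), len(pixels[0])),
    "min": min(flat),
    "max": max(flat),
    "mean": sum(flat) / len(flat),   # (0 + 128 + 255 + 64) / 4 = 111.75
    "unique": len(set(flat)),
}
print(report)
```

filoma does the same over the real decoded array (and also counts NaNs/Infs for float data, as the output above shows).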

๐Ÿ”๐Ÿ“ Directory Analysisยถ

Do you want to analyze a directory of files and extract metadata, text content, and other useful information?
filoma makes it super easy to do so with just a few lines of code.

In [4]:
from filoma.directories import DirectoryProfiler, DirectoryProfilerConfig

# Create a profiler using the typed config dataclass
config = DirectoryProfilerConfig(use_rust=True)
dir_prof = DirectoryProfiler(config)

analysis = dir_prof.probe("../")
dir_prof.print_summary(analysis)
2025-09-13 17:25:20.543 | DEBUG    | filoma.directories.directory_profiler:__init__:343 - Interactive environment detected, disabling progress bars to avoid conflicts
2025-09-13 17:25:20.543 | INFO     | filoma.directories.directory_profiler:probe:430 - Starting directory analysis of '../' using 🦀 Rust (Parallel) implementation
2025-09-13 17:25:21.221 | SUCCESS  | filoma.directories.directory_profiler:probe:446 - Directory analysis completed in 0.68s - Found 67,430 items (63,126 files, 4,304 folders) using 🦀 Rust (Parallel)
 Directory Analysis: /home/kalfasy/repos/filoma (🦀 Rust (Parallel)) - 0.68s
┏━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ Metric                   ┃ Value                  ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━┩
│ Total Files              │ 63,126                 │
│ Total Folders            │ 4,304                  │
│ Total Size               │ 2,049.2219619750977 MB │
│ Average Files per Folder │ 14.66682156133829      │
│ Maximum Depth            │ 14                     │
│ Empty Folders            │ 53                     │
│ Analysis Time            │ 0.68s                  │
│ Processing Speed         │ 99,451 items/sec       │
└──────────────────────────┴────────────────────────┘
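
For a sense of what such a scan computes, here is a small `os.walk`-based sketch of the headline metrics (file/folder counts, total size, max depth, empty folders). It is a naive single-threaded approximation, not filoma's parallel Rust walker:

```python
import os
import shutil
import tempfile

def summarize_dir(root: str) -> dict:
    """Walk a tree and compute the headline metrics shown in the table above."""
    root = os.path.abspath(root)
    n_files = n_dirs = total_bytes = max_depth = n_empty = 0
    for dirpath, dirnames, filenames in os.walk(root):
        # depth of this folder relative to the scan root
        depth = os.path.relpath(dirpath, root).count(os.sep) + (dirpath != root)
        max_depth = max(max_depth, depth)
        n_dirs += len(dirnames)
        n_files += len(filenames)
        if not dirnames and not filenames:
            n_empty += 1
        for name in filenames:
            total_bytes += os.path.getsize(os.path.join(dirpath, name))
    return {"files": n_files, "folders": n_dirs, "bytes": total_bytes,
            "max_depth": max_depth, "empty_folders": n_empty}

# demo on a throwaway tree: <tmp>/a/file.txt plus an empty folder <tmp>/b
tmp = tempfile.mkdtemp()
os.makedirs(os.path.join(tmp, "a"))
os.makedirs(os.path.join(tmp, "b"))
with open(os.path.join(tmp, "a", "file.txt"), "w") as fh:
    fh.write("x" * 10)
summary = summarize_dir(tmp)
print(summary)
shutil.rmtree(tmp)
```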

Want to quickly see a report of your findings? filoma has you covered.

In [5]:
dir_prof.print_report(analysis)
 Directory Analysis: /home/kalfasy/repos/filoma (🦀 Rust (Parallel)) - 0.68s
┏━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ Metric                   ┃ Value                  ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━┩
│ Total Files              │ 63,126                 │
│ Total Folders            │ 4,304                  │
│ Total Size               │ 2,049.2219619750977 MB │
│ Average Files per Folder │ 14.66682156133829      │
│ Maximum Depth            │ 14                     │
│ Empty Folders            │ 53                     │
│ Analysis Time            │ 0.68s                  │
│ Processing Speed         │ 99,451 items/sec       │
└──────────────────────────┴────────────────────────┘

         File Extensions
┏━━━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━━┓
┃ Extension ┃ Count ┃ Percentage ┃
┡━━━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━━━┩
│ .lock     │ 8     │ 0.0%       │
│ .rlib     │ 136   │ 0.2%       │
│ .sample   │ 14    │ 0.0%       │
│ .crt      │ 2     │ 0.0%       │
│ .build    │ 6     │ 0.0%       │
│ .h        │ 1,050 │ 1.7%       │
│ .cmd      │ 2     │ 0.0%       │
│ .po       │ 2     │ 0.0%       │
│ .d        │ 212   │ 0.3%       │
│ .fits     │ 2     │ 0.0%       │
└───────────┴───────┴────────────┘

                Common Folder Names
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┓
┃ Folder Name                       ┃ Occurrences ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━┩
│ 50                                │ 1           │
│ datetimerange                     │ 2           │
│ pydev_sitecustomize               │ 2           │
│ it                                │ 4           │
│ assets                            │ 3           │
│ lib2to3                           │ 2           │
│ mark                              │ 2           │
│ requests_toolbelt-1.0.0.dist-info │ 2           │
│ certifi-2025.6.15.dist-info       │ 2           │
│ click                             │ 4           │
└───────────────────────────────────┴─────────────┘

                             Empty Folders (showing 20 of 53)
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ Path                                                                                   ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ /home/kalfasy/repos/filoma/notebooks/tests                                             │
│ /home/kalfasy/repos/filoma/target/debug/build/proc-macro2-fdcb222da4373f07/out         │
│ /home/kalfasy/repos/filoma/target/debug/build/pyo3-17c1b0b456a0c7e4/out                │
│ /home/kalfasy/repos/filoma/target/debug/build/crossbeam-utils-7b9bd07d9fad49b5/out     │
│ /home/kalfasy/repos/filoma/target/debug/build/serde_json-85cdf3d882972d5c/out          │
│ /home/kalfasy/repos/filoma/target/debug/build/pyo3-8813d706ead48d90/out                │
│ /home/kalfasy/repos/filoma/target/debug/build/pyo3-macros-backend-6807c5cda8b1462a/out │
│ /home/kalfasy/repos/filoma/target/debug/build/memoffset-f4a6c90d4e3f18eb/out           │
│ /home/kalfasy/repos/filoma/target/debug/build/rayon-core-03a79c366c595edc/out          │
│ /home/kalfasy/repos/filoma/target/debug/build/pyo3-macros-backend-3864f2b2dbf9644d/out │
│ /home/kalfasy/repos/filoma/target/debug/build/pyo3-264b9b4aa4d6a2b5/out                │
│ /home/kalfasy/repos/filoma/target/debug/build/portable-atomic-f4353affba5b92e8/out     │
│ /home/kalfasy/repos/filoma/target/debug/build/pyo3-ffi-747a9d93a69eb79b/out            │
│ /home/kalfasy/repos/filoma/target/debug/build/proc-macro2-4d577003447afa24/out         │
│ /home/kalfasy/repos/filoma/target/debug/build/pyo3-ffi-8e536f56d2d22bd8/out            │
│ /home/kalfasy/repos/filoma/target/debug/build/libc-9d68f4abca27c9af/out                │
│ /home/kalfasy/repos/filoma/target/debug/build/lock_api-96adb08174198570/out            │
│ /home/kalfasy/repos/filoma/target/debug/build/serde-6690732fba37d144/out               │
│ /home/kalfasy/repos/filoma/target/debug/build/parking_lot_core-d5ba795b49a036a8/out    │
│ /home/kalfasy/repos/filoma/target/debug/build/pyo3-ffi-0b1b893f015622ee/out            │
│ ... and 33 more                                                                        │
└────────────────────────────────────────────────────────────────────────────────────────┘

๐Ÿ“ Directory of files --> DataFrame --> Data Explorationยถ

Now that you've seen what's up with your files, you might want to explore the data in a familiar format.
filoma can convert the analysis results into a Polars (or Pandas) DataFrame real quick.
NOTE: Pandas support requires the pd extra, which you can install by running uv sync --extra pd in your terminal.

In [6]:
from filoma import probe_to_df

df = probe_to_df("../", max_depth=2, enrich=True)
print(f"Found {len(df)} files")
df.head()
2025-09-13 17:25:21.238 | DEBUG    | filoma.directories.directory_profiler:__init__:343 - Interactive environment detected, disabling progress bars to avoid conflicts
2025-09-13 17:25:21.239 | INFO     | filoma.directories.directory_profiler:probe:430 - Starting directory analysis of '../' using 🐍 Python implementation
2025-09-13 17:25:21.584 | SUCCESS  | filoma.directories.directory_profiler:probe:446 - Directory analysis completed in 0.35s - Found 385 items (323 files, 62 folders) using 🐍 Python
Found 384 files
Out[6]:
shape: (5, 18)
| path | depth | parent | name | stem | suffix | size_bytes | modified_time | created_time | is_file | is_dir | owner | group | mode_str | inode | nlink | sha256 | xattrs |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| str | i64 | str | str | str | str | i64 | str | str | bool | bool | str | str | str | i64 | i64 | str | str |
| "../pyproject.toml" | 1 | ".." | "pyproject.toml" | "pyproject" | ".toml" | 1932 | "2025-09-13 17:16:30" | "2025-09-13 17:16:30" | true | false | "kalfasy" | "kalfasy" | "-rw-rw-r--" | 7579961 | 1 | null | "{}" |
| "../scripts" | 1 | ".." | "scripts" | "scripts" | "" | 4096 | "2025-09-05 20:26:25" | "2025-09-05 20:26:25" | false | true | "kalfasy" | "kalfasy" | "drwxrwxr-x" | 7603122 | 2 | null | "{}" |
| "../.pytest_cache" | 1 | ".." | ".pytest_cache" | ".pytest_cache" | "" | 4096 | "2025-07-05 22:28:03" | "2025-07-05 22:28:03" | false | true | "kalfasy" | "kalfasy" | "drwxrwxr-x" | 7604845 | 3 | null | "{}" |
| "../.vscode" | 1 | ".." | ".vscode" | ".vscode" | "" | 4096 | "2025-07-06 11:11:18" | "2025-07-06 11:11:18" | false | true | "kalfasy" | "kalfasy" | "drwxrwxr-x" | 7591635 | 2 | null | "{}" |
| "../Makefile" | 1 | ".." | "Makefile" | "Makefile" | "" | 2876 | "2025-09-13 12:14:25" | "2025-09-13 12:14:25" | true | false | "kalfasy" | "kalfasy" | "-rw-rw-r--" | 7603119 | 1 | null | "{}" |
In [7]:
print(f"Type of df:\t{type(df.to_pandas())}, \nShape of df:\t{df.to_pandas().shape}")
Type of df:	<class 'pandas.core.frame.DataFrame'>, 
Shape of df:	(384, 18)

⚡ DataFrame enrichment

You're probably wondering: what does enrich=True do?
Well, since filoma gathers the paths of your files in a DataFrame, why not enrich that DataFrame with additional metadata? filoma's own DataFrame class has convenience functions such as add_path_components(), add_file_stats_cols(), and add_depth_col().
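
For intuition, deriving path components like these with pathlib alone might look like this (a sketch of the idea, not filoma's code; the paths are hypothetical):

```python
from pathlib import PurePosixPath

# hypothetical relative paths like those returned by a scan
paths = ["../pyproject.toml", "../scripts/run.sh"]

rows = []
for p in paths:
    pp = PurePosixPath(p)
    rows.append({
        "path": p,
        "parent": str(pp.parent),
        "name": pp.name,
        "stem": pp.stem,
        "suffix": pp.suffix,
        "depth": len(pp.parts) - 1,  # components below the scan root
    })
print(rows[0])
```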

Let's see it in action:

In [8]:
from rich.console import Console
from rich.panel import Panel

console = Console()

cfg = DirectoryProfilerConfig(build_dataframe=True, use_rust=True, return_absolute_paths=True)
dprof = DirectoryProfiler(cfg)
res = dprof.probe("../")

orig_cols = list(res.dataframe.columns)
console.print(Panel(f"Columns before enrich: [bold]{', '.join(orig_cols)}[/]"))
console.print(Panel(res.dataframe.head(3).to_pandas().to_string(index=False), title="DataFrame head (before enrich)"))

df = res.dataframe.enrich()
new_cols = sorted(set(df.columns) - set(orig_cols))
console.print(Panel(f"New columns after enrich: [bold]{', '.join(new_cols)}[/]"))
console.print(Panel(df.head(3).to_pandas().to_string(index=False), title="DataFrame head (after enrich)"))
2025-09-13 17:25:21.657 | DEBUG    | filoma.directories.directory_profiler:__init__:343 - Interactive environment detected, disabling progress bars to avoid conflicts
2025-09-13 17:25:21.657 | INFO     | filoma.directories.directory_profiler:probe:430 - Starting directory analysis of '../' using 🦀 Rust (Parallel) implementation
2025-09-13 17:25:22.558 | SUCCESS  | filoma.directories.directory_profiler:probe:446 - Directory analysis completed in 0.90s - Found 67,432 items (63,128 files, 4,304 folders) using 🦀 Rust (Parallel)
╭────────────────────────────────────────────────────────────────╮
│ Columns before enrich: path                                    │
╰────────────────────────────────────────────────────────────────╯
╭──────────────── DataFrame head (before enrich) ────────────────╮
│              path                                              │
│ ../pyproject.toml                                              │
│        ../scripts                                              │
│  ../.pytest_cache                                              │
╰────────────────────────────────────────────────────────────────╯
╭────────────────────────────────────────────────────────────────╮
│ New columns after enrich: created_time, depth, group, inode,   │
│ is_dir, is_file, mode_str, modified_time, name, nlink, owner,  │
│ parent, sha256, size_bytes, stem, suffix, xattrs               │
╰────────────────────────────────────────────────────────────────╯
╭──────────────── DataFrame head (after enrich) ─────────────────╮
│              path parent           name          stem suffix   │
│ ../pyproject.toml     .. pyproject.toml     pyproject  .toml   │
│        ../scripts     ..        scripts       scripts          │
│  ../.pytest_cache     ..  .pytest_cache .pytest_cache          │
│                                                                │
│ size_bytes       modified_time        created_time  is_file    │
│       1932 2025-09-13 17:16:30 2025-09-13 17:16:30     True    │
│       4096 2025-09-05 20:26:25 2025-09-05 20:26:25    False    │
│       4096 2025-07-05 22:28:03 2025-07-05 22:28:03    False    │
│                                                                │
│ is_dir   owner   group   mode_str   inode  nlink sha256        │
│  False kalfasy kalfasy -rw-rw-r-- 7579961      1   None        │
│   True kalfasy kalfasy drwxrwxr-x 7603122      2   None        │
│   True kalfasy kalfasy drwxrwxr-x 7604845      3   None        │
│                                                                │
│ xattrs depth                                                   │
│     {}     1                                                   │
│     {}     1                                                   │
│     {}     1                                                   │
╰────────────────────────────────────────────────────────────────╯

🤖 ML-ready splits (train, val, test)

The next logical thing for filoma to offer in data science workflows is an easy way to create ML-ready splits.

In a very simple case, you have a dataframe with a split/label/... column that you want to use to create the splits like so:

train, val, test = df[df["split"] == "train"], df[df["split"] == "val"], df[df["split"] == "test"]

But things are rarely that simple in practice.

When you're given data in the "real world", either in folders & files or in a dataframe, you often need to create the splits yourself.

Unfortunately, many practitioners disregard the importance of data splits and split their data randomly, which can lead to data leakage, overfitting, and unrealistic performance metrics.

A minimum best practice is to split your data into 3 sets:

  • Training set: used to train your model
  • Validation set: used to tune your model's hyperparameters
  • Testing set: used to evaluate the final performance of your model

Ideally, you'd want to use a validation set that is representative of your test set, and both should be representative of your real-world data. So, special care should be taken to ensure that the splits are done correctly.

📚 traditional sklearn way

A very popular (if not the most popular) way of splitting data is to use scikit-learn's train_test_split function, which splits your data into training and testing sets. It does that rather easily, although you'll need to call it twice if you want a 3-way split:

from sklearn.model_selection import train_test_split
train, temp = train_test_split(df, test_size=0.4)
val, test = train_test_split(temp, test_size=0.5)

Isn't it confusing that you just wanted a 60/20/20 split but had to specify 40% and then 50%? Imagine doing this for a 70/20/10 split...

from sklearn.model_selection import train_test_split
train, temp = train_test_split(df, test_size=0.3)
val, test = train_test_split(temp, test_size=0.3333)
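
A tiny hand-rolled helper (hypothetical, not part of sklearn or filoma) shows how little arithmetic is needed to accept the three ratios directly in one call:

```python
import random

def split_three_way(items, ratios=(70, 20, 10), seed=42):
    """Shuffle once, then cut at cumulative fractions of `ratios`."""
    items = list(items)
    random.Random(seed).shuffle(items)  # deterministic shuffle
    total = sum(ratios)
    n_train = round(len(items) * ratios[0] / total)
    n_val = round(len(items) * ratios[1] / total)
    return items[:n_train], items[n_train:n_train + n_val], items[n_train + n_val:]

train, val, test = split_three_way(range(100), ratios=(70, 20, 10))
print(len(train), len(val), len(test))  # → 70 20 10
```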

🧙‍♂️ filoma's split method

filoma takes this a step further by providing a function that can not only split your data into training, validation, and testing sets in one go, but also do so based on features found in your file names or directories.

For example, your data might have subcategories encoded in filenames like dog_bulldog_001.jpg, dog_bulldog_002.jpg, cat_siamese_001.jpg, etc. You can use filoma's add_filename_features() to extract these features into separate columns. Plus, you can name these columns whatever you want!
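
The idea behind extracting such filename tokens is a simple split of the file stem; a minimal sketch (illustrative, not filoma's implementation):

```python
from pathlib import PurePosixPath

def filename_features(path, token_names, sep="_"):
    """Split the file stem on `sep` and pair tokens with chosen column names."""
    tokens = PurePosixPath(path).stem.split(sep)
    return dict(zip(token_names, tokens))

feats = filename_features("dog/bulldog_fawn_001.jpg", ["breed", "color", "number"])
print(feats)  # → {'breed': 'bulldog', 'color': 'fawn', 'number': '001'}
```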

Simple Random Split

Let's start with a simple random split of our scanned files. This is useful when you don't need to group related files together.

๐Ÿถ Example: splitting pet images by breedยถ

Create a DataFrame with file paths
In [9]:
from filoma import DataFrame

breeds_by_species = {
    "dog": ["labrador", "beagle", "bulldog", "poodle"],
    "cat": ["siamese", "mainecoon", "persian", "ragdoll"],
    "bird": ["sparrow", "robin", "parrot", "crow"],
}

colors_by_breed = {
    "labrador": ["black", "yellow", "chocolate"],
    "beagle": ["tricolor", "brown"],
    "bulldog": ["fawn", "white"],
    "poodle": ["white", "black"],
    "siamese": ["sealpoint", "bluepoint"],
    "mainecoon": ["tabby", "brown"],
    "persian": ["white", "silver"],
    "ragdoll": ["colorpoint", "sealpoint"],
    "sparrow": ["hatchling", "adult"],
    "robin": ["adult", "juvenile"],
    "parrot": ["green", "red"],
    "crow": ["black"],
}

# build all (species, breed, color) combos
combos = []
for species, breeds in breeds_by_species.items():
    for breed in breeds:
        colors = colors_by_breed.get(breed, ["unknown"])
        for color in colors:
            combos.append((species, breed, color))

# generate 100 file paths cycling through combos and numbering files
paths = []
for i in range(1, 101):
    species, breed, color = combos[(i - 1) % len(combos)]
    paths.append(f"{species}/{breed}_{color}_{i:03d}.jpg")

data = {"path": paths}

df = DataFrame(data)
print("Small sample of the DataFrame:")
df.sample(5)
Small sample of the DataFrame:
Out[9]:
shape: (5, 1)
path
str
"bird/sparrow_hatchling_090.jpg"
"dog/bulldog_white_031.jpg"
"cat/persian_white_062.jpg"
"dog/beagle_tricolor_052.jpg"
"cat/mainecoon_tabby_012.jpg"
Add path components as columns
In [10]:
df.add_filename_features(path_col="path", sep="_", token_names=["breed", "color", "number"], inplace=True)
Out[10]:
shape: (100, 4)
| path | breed | color | number |
|---|---|---|---|
| str | str | str | str |
| "dog/labrador_black_001.jpg" | "labrador" | "black" | "001" |
| "dog/labrador_yellow_002.jpg" | "labrador" | "yellow" | "002" |
| "dog/labrador_chocolate_003.jpg" | "labrador" | "chocolate" | "003" |
| "dog/beagle_tricolor_004.jpg" | "beagle" | "tricolor" | "004" |
| "dog/beagle_brown_005.jpg" | "beagle" | "brown" | "005" |
| … | … | … | … |
| "bird/crow_black_096.jpg" | "crow" | "black" | "096" |
| "dog/labrador_black_097.jpg" | "labrador" | "black" | "097" |
| "dog/labrador_yellow_098.jpg" | "labrador" | "yellow" | "098" |
| "dog/labrador_chocolate_099.jpg" | "labrador" | "chocolate" | "099" |
| "dog/beagle_tricolor_100.jpg" | "beagle" | "tricolor" | "100" |
Split by breed while preserving distribution
In [11]:
train, val, test = df.split_data(seed=42, train_val_test=(70, 20, 10), feature="breed")
print(f"Shapes of train, val, test: {train.shape}, {val.shape}, {test.shape}")
2025-09-13 17:25:28.359 | WARNING  | filoma.ml:split_data:460 - filoma.ml.split_data: unique feature values differ across splits for '_feat_group' - counts train=8, val=3, test=1; examples missing_in_train=['poodle', 'labrador', 'beagle', 'bulldog'], missing_in_val=['crow', 'siamese', 'poodle', 'ragdoll', 'parrot'], missing_in_test=['crow', 'siamese', 'labrador', 'ragdoll', 'bulldog']
2025-09-13 17:25:28.359 | WARNING  | filoma.ml:_maybe_log_ratio_drift:329 - filoma.ml.split_data: achieved counts 60.0%,32.0%,8.0% ((60, 32, 8)) vs requested (70.0%,20.0%,10.0%) total=100 (grouped hashing can cause drift)
Shapes of train, val, test: (60, 5), (32, 5), (8, 5)

Note that filoma warns you of two things:

  1. It found that some categories have very few samples, which led to splits that do not have all categories represented.
  2. It tried to preserve the distribution of your selected column(s) across the splits, but it couldn't do it perfectly because some categories had very few samples.
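
To see why grouped hashing causes such drift, here is a toy version of the idea: every row is routed by a hash of its group value, so whole groups move together and the achieved split sizes can miss the requested ratios (a sketch of the concept; filoma's actual logic may differ):

```python
import hashlib

def group_split(rows, group_key, ratios=(70, 20, 10)):
    """Route each row by hashing its group value, so groups never straddle splits."""
    total = sum(ratios)
    bounds, acc = [], 0
    for r in ratios:
        acc += r
        bounds.append(acc / total)  # cumulative cut points, e.g. [0.7, 0.9, 1.0]
    splits = ([], [], [])
    for row in rows:
        digest = hashlib.sha256(str(row[group_key]).encode()).hexdigest()
        u = (int(digest, 16) % 10_000) / 10_000  # deterministic value in [0, 1)
        idx = next(i for i, b in enumerate(bounds) if u < b)
        splits[idx].append(row)
    return splits

# 50 toy rows spread over 5 breeds; each breed lands entirely in one split
rows = [{"path": f"img_{i:03d}.jpg", "breed": f"breed{i % 5}"} for i in range(50)]
train, val, test = group_split(rows, "breed")
print(len(train), len(val), len(test))
```

Because only 5 groups exist, the three buckets get multiples of 10 rows each, rarely the requested 35/10/5: the same effect the warnings above report.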

✅ Conclusion

So this is how you can use filoma to go from a scary dataset directory tree to a clean DataFrame with enriched metadata and ML-ready splits, all with just a few lines of code!