filoma demo
Fast, multi-backend file analysis with a tiny API surface.
from IPython.display import Image
import filoma
print(f"filoma version: {filoma.__version__}")
Image("../images/flow.png", width=600)
filoma version: 1.7.3
Let's start with something simple, like getting a handy dataclass for a single file:
File analysis
Single file (any kind)
from filoma import probe_file
file_info = probe_file("../README.md")
print(f"Path: {file_info.path}")
print(f"Size: {file_info.size}")
print(f"Modified: {file_info.modified}")
print(f"file_info: {[i for i in dir(file_info) if not i.startswith('_')]}")
Path: /home/kalfasy/repos/filoma/README.md
Size: 10914
Modified: 2025-09-13 16:01:30
file_info: ['accessed', 'as_dict', 'created', 'from_report', 'get', 'group', 'inode', 'is_dir', 'is_file', 'is_symlink', 'items', 'keys', 'mode', 'mode_str', 'modified', 'nlink', 'owner', 'path', 'rights', 'sha256', 'size', 'target_is_dir', 'target_is_file', 'to_dict', 'values', 'xattrs']
or specifically for image files:
Image file analysis
from filoma import probe_image
img = probe_image("../images/logo.png")
print(f"Type of file: {img.file_type}, Type of img object: {type(img)}")
print(f"Shape: {img.shape}")
print(f"Data range: {img.min} - {img.max}")
print(f"img info: {img.as_dict()}")
Type of file: png, Type of img object: <class 'filoma.images.image_profiler.ImageReport'>
Shape: (762, 628, 4)
Data range: 0.0 - 255.0
img info: {'path': '../images/logo.png', 'file_type': 'png', 'shape': (762, 628, 4), 'dtype': 'uint8', 'min': 0.0, 'max': 255.0, 'mean': 230.89027732500793, 'nans': 0, 'infs': 0, 'unique': 256, 'status': None}
Directory Analysis
Do you want to analyze a directory of files and extract metadata, text content, and other useful information?
filoma makes it super easy to do so with just a few lines of code.
from filoma.directories import DirectoryProfiler, DirectoryProfilerConfig
# Create a profiler using the typed config dataclass
config = DirectoryProfilerConfig(use_rust=True)
dir_prof = DirectoryProfiler(config)
analysis = dir_prof.probe("../")
dir_prof.print_summary(analysis)
2025-09-13 17:25:20.543 | DEBUG   | filoma.directories.directory_profiler:__init__:343 - Interactive environment detected, disabling progress bars to avoid conflicts
2025-09-13 17:25:20.543 | INFO    | filoma.directories.directory_profiler:probe:430 - Starting directory analysis of '../' using 🦀 Rust (Parallel) implementation
2025-09-13 17:25:21.221 | SUCCESS | filoma.directories.directory_profiler:probe:446 - Directory analysis completed in 0.68s - Found 67,430 items (63,126 files, 4,304 folders) using 🦀 Rust (Parallel)
Directory Analysis: /home/kalfasy/repos/filoma (🦀 Rust (Parallel)) - 0.68s

| Metric | Value |
|---|---|
| Total Files | 63,126 |
| Total Folders | 4,304 |
| Total Size | 2,049.2219619750977 MB |
| Average Files per Folder | 14.66682156133829 |
| Maximum Depth | 14 |
| Empty Folders | 53 |
| Analysis Time | 0.68s |
| Processing Speed | 99,451 items/sec |
Want to quickly see a report of your findings? filoma has you covered.
dir_prof.print_report(analysis)
Directory Analysis: /home/kalfasy/repos/filoma (🦀 Rust (Parallel)) - 0.68s

| Metric | Value |
|---|---|
| Total Files | 63,126 |
| Total Folders | 4,304 |
| Total Size | 2,049.2219619750977 MB |
| Average Files per Folder | 14.66682156133829 |
| Maximum Depth | 14 |
| Empty Folders | 53 |
| Analysis Time | 0.68s |
| Processing Speed | 99,451 items/sec |
File Extensions

| Extension | Count | Percentage |
|---|---|---|
| .lock | 8 | 0.0% |
| .rlib | 136 | 0.2% |
| .sample | 14 | 0.0% |
| .crt | 2 | 0.0% |
| .build | 6 | 0.0% |
| .h | 1,050 | 1.7% |
| .cmd | 2 | 0.0% |
| .po | 2 | 0.0% |
| .d | 212 | 0.3% |
| .fits | 2 | 0.0% |
Common Folder Names

| Folder Name | Occurrences |
|---|---|
| 50 | 1 |
| datetimerange | 2 |
| pydev_sitecustomize | 2 |
| it | 4 |
| assets | 3 |
| lib2to3 | 2 |
| mark | 2 |
| requests_toolbelt-1.0.0.dist-info | 2 |
| certifi-2025.6.15.dist-info | 2 |
| click | 4 |
Empty Folders (showing 20 of 53)

/home/kalfasy/repos/filoma/notebooks/tests
/home/kalfasy/repos/filoma/target/debug/build/proc-macro2-fdcb222da4373f07/out
/home/kalfasy/repos/filoma/target/debug/build/pyo3-17c1b0b456a0c7e4/out
/home/kalfasy/repos/filoma/target/debug/build/crossbeam-utils-7b9bd07d9fad49b5/out
/home/kalfasy/repos/filoma/target/debug/build/serde_json-85cdf3d882972d5c/out
/home/kalfasy/repos/filoma/target/debug/build/pyo3-8813d706ead48d90/out
/home/kalfasy/repos/filoma/target/debug/build/pyo3-macros-backend-6807c5cda8b1462a/out
/home/kalfasy/repos/filoma/target/debug/build/memoffset-f4a6c90d4e3f18eb/out
/home/kalfasy/repos/filoma/target/debug/build/rayon-core-03a79c366c595edc/out
/home/kalfasy/repos/filoma/target/debug/build/pyo3-macros-backend-3864f2b2dbf9644d/out
/home/kalfasy/repos/filoma/target/debug/build/pyo3-264b9b4aa4d6a2b5/out
/home/kalfasy/repos/filoma/target/debug/build/portable-atomic-f4353affba5b92e8/out
/home/kalfasy/repos/filoma/target/debug/build/pyo3-ffi-747a9d93a69eb79b/out
/home/kalfasy/repos/filoma/target/debug/build/proc-macro2-4d577003447afa24/out
/home/kalfasy/repos/filoma/target/debug/build/pyo3-ffi-8e536f56d2d22bd8/out
/home/kalfasy/repos/filoma/target/debug/build/libc-9d68f4abca27c9af/out
/home/kalfasy/repos/filoma/target/debug/build/lock_api-96adb08174198570/out
/home/kalfasy/repos/filoma/target/debug/build/serde-6690732fba37d144/out
/home/kalfasy/repos/filoma/target/debug/build/parking_lot_core-d5ba795b49a036a8/out
/home/kalfasy/repos/filoma/target/debug/build/pyo3-ffi-0b1b893f015622ee/out
... and 33 more
Directory of files --> DataFrame --> Data Exploration
Now that you've seen what's up with your files, you might want to explore the data in a familiar format.
filoma can convert the analysis results into a Polars (or Pandas) DataFrame real quick.
NOTE: Pandas support requires the pd extra, which you can install by running uv sync --extra pd in your terminal.
from filoma import probe_to_df
df = probe_to_df("../", max_depth=2, enrich=True)
print(f"Found {len(df)} files")
df.head()
2025-09-13 17:25:21.238 | DEBUG   | filoma.directories.directory_profiler:__init__:343 - Interactive environment detected, disabling progress bars to avoid conflicts
2025-09-13 17:25:21.239 | INFO    | filoma.directories.directory_profiler:probe:430 - Starting directory analysis of '../' using 🐍 Python implementation
2025-09-13 17:25:21.584 | SUCCESS | filoma.directories.directory_profiler:probe:446 - Directory analysis completed in 0.35s - Found 385 items (323 files, 62 folders) using 🐍 Python
Found 384 files
| path | depth | parent | name | stem | suffix | size_bytes | modified_time | created_time | is_file | is_dir | owner | group | mode_str | inode | nlink | sha256 | xattrs |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| str | i64 | str | str | str | str | i64 | str | str | bool | bool | str | str | str | i64 | i64 | str | str |
| "../pyproject.toml" | 1 | ".." | "pyproject.toml" | "pyproject" | ".toml" | 1932 | "2025-09-13 17:16:30" | "2025-09-13 17:16:30" | true | false | "kalfasy" | "kalfasy" | "-rw-rw-r--" | 7579961 | 1 | null | "{}" |
| "../scripts" | 1 | ".." | "scripts" | "scripts" | "" | 4096 | "2025-09-05 20:26:25" | "2025-09-05 20:26:25" | false | true | "kalfasy" | "kalfasy" | "drwxrwxr-x" | 7603122 | 2 | null | "{}" |
| "../.pytest_cache" | 1 | ".." | ".pytest_cache" | ".pytest_cache" | "" | 4096 | "2025-07-05 22:28:03" | "2025-07-05 22:28:03" | false | true | "kalfasy" | "kalfasy" | "drwxrwxr-x" | 7604845 | 3 | null | "{}" |
| "../.vscode" | 1 | ".." | ".vscode" | ".vscode" | "" | 4096 | "2025-07-06 11:11:18" | "2025-07-06 11:11:18" | false | true | "kalfasy" | "kalfasy" | "drwxrwxr-x" | 7591635 | 2 | null | "{}" |
| "../Makefile" | 1 | ".." | "Makefile" | "Makefile" | "" | 2876 | "2025-09-13 12:14:25" | "2025-09-13 12:14:25" | true | false | "kalfasy" | "kalfasy" | "-rw-rw-r--" | 7603119 | 1 | null | "{}" |
print(f"Type of df:\t{type(df.to_pandas())}, \nShape of df:\t{df.to_pandas().shape}")
Type of df:	<class 'pandas.core.frame.DataFrame'>,
Shape of df:	(384, 18)
DataFrame enrichment
You're probably wondering: what is enrich=True?
Well, since filoma gathers the paths of your files in a DataFrame, why not enrich that DataFrame with additional metadata? filoma's own DataFrame class has convenience methods such as add_path_components(), add_file_stats_cols(), and add_depth_col().
Let's see it in action:
from rich.console import Console
from rich.panel import Panel
console = Console()
cfg = DirectoryProfilerConfig(build_dataframe=True, use_rust=True, return_absolute_paths=True)
dprof = DirectoryProfiler(cfg)
res = dprof.probe("../")
orig_cols = list(res.dataframe.columns)
console.print(Panel(f"Columns before enrich: [bold]{', '.join(orig_cols)}[/]"))
console.print(Panel(res.dataframe.head(3).to_pandas().to_string(index=False), title="DataFrame head (before enrich)"))
df = res.dataframe.enrich()
new_cols = sorted(set(df.columns) - set(orig_cols))
console.print(Panel(f"New columns after enrich: [bold]{', '.join(new_cols)}[/]"))
console.print(Panel(df.head(3).to_pandas().to_string(index=False), title="DataFrame head (after enrich)"))
2025-09-13 17:25:21.657 | DEBUG   | filoma.directories.directory_profiler:__init__:343 - Interactive environment detected, disabling progress bars to avoid conflicts
2025-09-13 17:25:21.657 | INFO    | filoma.directories.directory_profiler:probe:430 - Starting directory analysis of '../' using 🦀 Rust (Parallel) implementation
2025-09-13 17:25:22.558 | SUCCESS | filoma.directories.directory_profiler:probe:446 - Directory analysis completed in 0.90s - Found 67,432 items (63,128 files, 4,304 folders) using 🦀 Rust (Parallel)
Columns before enrich: path
DataFrame head (before enrich)

| path |
|---|
| ../pyproject.toml |
| ../scripts |
| ../.pytest_cache |
New columns after enrich: created_time, depth, group, inode, is_dir, is_file, mode_str, modified_time, name, nlink, owner, parent, sha256, size_bytes, stem, suffix, xattrs
DataFrame head (after enrich)

| path | parent | name | stem | suffix | size_bytes | modified_time | created_time | is_file | is_dir | owner | group | mode_str | inode | nlink | sha256 | xattrs | depth |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ../pyproject.toml | .. | pyproject.toml | pyproject | .toml | 1932 | 2025-09-13 17:16:30 | 2025-09-13 17:16:30 | True | False | kalfasy | kalfasy | -rw-rw-r-- | 7579961 | 1 | None | {} | 1 |
| ../scripts | .. | scripts | scripts |  | 4096 | 2025-09-05 20:26:25 | 2025-09-05 20:26:25 | False | True | kalfasy | kalfasy | drwxrwxr-x | 7603122 | 2 | None | {} | 1 |
| ../.pytest_cache | .. | .pytest_cache | .pytest_cache |  | 4096 | 2025-07-05 22:28:03 | 2025-07-05 22:28:03 | False | True | kalfasy | kalfasy | drwxrwxr-x | 7604845 | 3 | None | {} | 1 |
ML-ready splits (train, val, test)
The next logical step for filoma in data science workflows is to make it easy to create ML-ready splits.
In the simplest case, your dataframe already has a split/label/... column that you can filter on to create the splits, like so:
train, val, test = df[df["split"] == "train"], df[df["split"] == "val"], df[df["split"] == "test"]
But things are rarely that simple in practice.
When you're given data in the "real world", either as folders & files or as a dataframe, you often need to create the splits yourself.
Unfortunately, many practitioners disregard the importance of data splits and split their data randomly, which can lead to data leakage, overfitting, and unrealistic performance metrics.
A minimum best practice is to split your data into 3 sets:
- Training set: used to train your model
- Validation set: used to tune your model's hyperparameters
- Testing set: used to evaluate the final performance of your model
Ideally, you'd want to use a validation set that is representative of your test set, and both should be representative of your real-world data. So, special care should be taken to ensure that the splits are done correctly.
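To make the three-set idea concrete, here is a minimal, stdlib-only sketch of a seeded random 3-way split (three_way_split is a hypothetical helper for illustration, not part of filoma's API):

```python
import random

def three_way_split(items, fractions=(0.6, 0.2, 0.2), seed=42):
    """Shuffle items deterministically and cut them into train/val/test.

    A minimal sketch of the idea above; three_way_split is a hypothetical
    helper, not a filoma function.
    """
    assert abs(sum(fractions) - 1.0) < 1e-9, "fractions must sum to 1"
    items = list(items)
    random.Random(seed).shuffle(items)  # seeded shuffle -> reproducible splits
    n = len(items)
    n_train = int(n * fractions[0])
    n_val = int(n * fractions[1])
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]  # remainder absorbs rounding
    return train, val, test

train, val, test = three_way_split(range(100))
print(len(train), len(val), len(test))  # 60 20 20
```

Fixing the seed is what makes the split reproducible across runs, which matters as soon as you want to compare models trained on the "same" data.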
The traditional sklearn way
A very popular (if not the most popular) way of splitting data is to use scikit-learn's train_test_split, which splits your data into training and testing sets. It does this rather easily, although you'll need to call it twice if you want a 3-way split:
from sklearn.model_selection import train_test_split
train, temp = train_test_split(df, test_size=0.4)
val, test = train_test_split(temp, test_size=0.5)
Isn't it confusing that you just wanted a 60/20/20 split but had to specify 40% and then 50%? Imagine doing this for a 70/20/10 split...
from sklearn.model_selection import train_test_split
train, temp = train_test_split(df, test_size=0.3)
val, test = train_test_split(temp, test_size=0.3333)
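The two-call arithmetic above can be captured in a tiny helper that derives both test_size values from the percentages you actually care about (a sketch; chained_test_sizes is a made-up name, not a filoma or sklearn function):

```python
def chained_test_sizes(train, val, test):
    """Turn a one-shot (train, val, test) percentage spec into the two
    test_size values needed for back-to-back train_test_split calls."""
    total = train + val + test
    first = (val + test) / total   # split val+test off the full set
    second = test / (val + test)   # then split test out of that remainder
    return first, second

print(chained_test_sizes(60, 20, 20))  # (0.4, 0.5)
print(chained_test_sizes(70, 20, 10))  # second value is ~0.3333
```

This is exactly where the 0.4/0.5 and 0.3/0.3333 values in the snippets above come from.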
filoma's split method
filoma takes this a step further with a function that not only splits your data into training, validation, and testing sets in one go, but can also do so based on features found in your filenames or directories.
For example, your data might have subcategories encoded in filenames
like dog_bulldog_001.jpg, dog_bulldog_002.jpg, cat_siamese_001.jpg, etc.
You can use filoma's add_filename_features() to extract these features into separate columns. Plus, you can name these columns whatever you want!
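For a single path, the token extraction boils down to something like this plain-Python sketch (filename_features is a hypothetical stand-in; the real add_filename_features() works on a whole DataFrame column at once):

```python
from pathlib import Path

def filename_features(path, sep="_", token_names=("breed", "color", "number")):
    """Split a file's stem on sep and map the pieces to named features.

    A plain-Python illustration of the idea, not filoma's API.
    """
    tokens = Path(path).stem.split(sep)
    return dict(zip(token_names, tokens))

print(filename_features("dog/bulldog_fawn_001.jpg"))
# {'breed': 'bulldog', 'color': 'fawn', 'number': '001'}
```

zip() silently truncates to the shorter sequence, so a filename with missing tokens simply yields fewer features rather than raising.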
Simple Random Split
Let's start with a simple random split of our scanned files. This is useful when you don't need to group related files together.
from filoma import DataFrame
breeds_by_species = {
"dog": ["labrador", "beagle", "bulldog", "poodle"],
"cat": ["siamese", "mainecoon", "persian", "ragdoll"],
"bird": ["sparrow", "robin", "parrot", "crow"],
}
colors_by_breed = {
"labrador": ["black", "yellow", "chocolate"],
"beagle": ["tricolor", "brown"],
"bulldog": ["fawn", "white"],
"poodle": ["white", "black"],
"siamese": ["sealpoint", "bluepoint"],
"mainecoon": ["tabby", "brown"],
"persian": ["white", "silver"],
"ragdoll": ["colorpoint", "sealpoint"],
"sparrow": ["hatchling", "adult"],
"robin": ["adult", "juvenile"],
"parrot": ["green", "red"],
"crow": ["black"],
}
# build all (species, breed, color) combos
combos = []
for species, breeds in breeds_by_species.items():
for breed in breeds:
colors = colors_by_breed.get(breed, ["unknown"])
for color in colors:
combos.append((species, breed, color))
# generate 100 file paths cycling through combos and numbering files
paths = []
for i in range(1, 101):
species, breed, color = combos[(i - 1) % len(combos)]
paths.append(f"{species}/{breed}_{color}_{i:03d}.jpg")
data = {"path": paths}
df = DataFrame(data)
print("Small sample of the DataFrame:")
df.sample(5)
Small sample of the DataFrame:
| path |
|---|
| str |
| "bird/sparrow_hatchling_090.jpg" |
| "dog/bulldog_white_031.jpg" |
| "cat/persian_white_062.jpg" |
| "dog/beagle_tricolor_052.jpg" |
| "cat/mainecoon_tabby_012.jpg" |
Add path components as columns
df.add_filename_features(path_col="path", sep="_", token_names=["breed", "color", "number"], inplace=True)
| path | breed | color | number |
|---|---|---|---|
| str | str | str | str |
| "dog/labrador_black_001.jpg" | "labrador" | "black" | "001" |
| "dog/labrador_yellow_002.jpg" | "labrador" | "yellow" | "002" |
| "dog/labrador_chocolate_003.jpg" | "labrador" | "chocolate" | "003" |
| "dog/beagle_tricolor_004.jpg" | "beagle" | "tricolor" | "004" |
| "dog/beagle_brown_005.jpg" | "beagle" | "brown" | "005" |
| โฆ | โฆ | โฆ | โฆ |
| "bird/crow_black_096.jpg" | "crow" | "black" | "096" |
| "dog/labrador_black_097.jpg" | "labrador" | "black" | "097" |
| "dog/labrador_yellow_098.jpg" | "labrador" | "yellow" | "098" |
| "dog/labrador_chocolate_099.jpg" | "labrador" | "chocolate" | "099" |
| "dog/beagle_tricolor_100.jpg" | "beagle" | "tricolor" | "100" |
Split by breed while preserving distribution
train, val, test = df.split_data(seed=42, train_val_test=(70, 20, 10), feature="breed")
print(f"Shapes of train, val, test: {train.shape}, {val.shape}, {test.shape}")
2025-09-13 17:25:28.359 | WARNING | filoma.ml:split_data:460 - filoma.ml.split_data: unique feature values differ across splits for '_feat_group' - counts train=8, val=3, test=1; examples missing_in_train=['poodle', 'labrador', 'beagle', 'bulldog'], missing_in_val=['crow', 'siamese', 'poodle', 'ragdoll', 'parrot'], missing_in_test=['crow', 'siamese', 'labrador', 'ragdoll', 'bulldog']
2025-09-13 17:25:28.359 | WARNING | filoma.ml:_maybe_log_ratio_drift:329 - filoma.ml.split_data: achieved counts 60.0%,32.0%,8.0% ((60, 32, 8)) vs requested (70.0%,20.0%,10.0%) total=100 (grouped hashing can cause drift)
Shapes of train, val, test: (60, 5), (32, 5), (8, 5)
Note that filoma warns you about two things:
- Some categories have very few samples, which led to splits where not all categories are represented.
- It tried to preserve the distribution of your selected column(s) across the splits, but couldn't do so perfectly because of those small categories.
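You can anticipate that warning with a quick pre-flight check on the feature column before splitting. A small stdlib sketch (sparse_groups is a hypothetical helper, not a filoma function):

```python
from collections import Counter

def sparse_groups(values, min_count=3):
    """Flag feature values with too few samples to appear in every split.

    With a 3-way split, any group with fewer than 3 samples cannot show up
    in train, val, AND test at the same time.
    """
    counts = Counter(values)
    return {value: count for value, count in counts.items() if count < min_count}

# hypothetical breed column with one rare category
breeds = ["labrador"] * 10 + ["beagle"] * 8 + ["crow"] * 2
print(sparse_groups(breeds))  # {'crow': 2}
```

If this returns anything, you can merge rare categories, drop them, or accept the drift that filoma warns about.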
Conclusion
So this is how you can use filoma to go from a scary dataset directory tree to a clean DataFrame with enriched metadata and ML-ready splits, all with just a few lines of code!