filoma dedup tutorial¶

This notebook explains duplicate detection and near-duplicate matching in a friendly, beginner-oriented way.

Why care about duplicates?

  • Duplicate detection helps you find identical files (byte-for-byte) and near-duplicates (same content with small changes).
  • Use exact hashing (SHA256) to find perfect duplicates quickly.
  • Use text similarity (shingles + Jaccard or MinHash) to find documents that are very similar but not identical.
  • Use perceptual image hashes (aHash/dHash/pHash) to find images that look the same even if they were resized, compressed, or slightly edited.
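As a taste of the first technique, exact duplicate detection reduces to grouping files by their SHA256 digest. Here is a minimal standalone sketch using only the standard library (not filoma's API) with throwaway temp files:

```python
import hashlib
import os
import tempfile

def sha256_of(path, chunk_size=65536):
    """Stream a file through SHA256 so large files don't need to fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

with tempfile.TemporaryDirectory() as td:
    paths = [os.path.join(td, name) for name in ("a.bin", "b.bin", "c.bin")]
    for p, data in zip(paths, [b"same bytes", b"same bytes", b"different"]):
        with open(p, "wb") as f:
            f.write(data)

    # Group paths by digest; any group with more than one member is a set of
    # byte-for-byte identical files.
    by_hash = {}
    for p in paths:
        by_hash.setdefault(sha256_of(p), []).append(p)
    dup_groups = [g for g in by_hash.values() if len(g) > 1]
    print(len(dup_groups), "duplicate group(s) found")  # → 1
```

filoma wraps this pattern for you; the sketch just shows there is no magic behind "exact" duplicates.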

What is covered in this tutorial:

  • How to run quick, practical checks on small datasets.
  • How the core algorithms work at a high level and what the parameters mean.
  • How to use FileProfiler, ImageProfiler, and DataFrame.evaluate_duplicates() to integrate dedup checks into your workflow.

Each section has a short "big idea" explanation followed by a compact runnable example.

Installing optional dependencies¶

What's needed to use the dedup tools?

  • Some features (image perceptual hashing, MinHash acceleration) require third-party packages that are not part of the small core.
  • Pillow lets us open and process images in Python so we can compute perceptual hashes.
  • datasketch provides a MinHash implementation and LSH helpers that scale text similarity to large collections.

When to install

  • If you only need exact duplicate checks (SHA256) and small-scale text comparisons, you can skip installation.
  • If you plan to deduplicate images or run similarity over many documents, install the optional packages.

Run this cell to install the optional tools (or use your environment manager):

!uv pip install --upgrade pillow datasketch

or

!uv sync --extra dedup

If you run in an environment that already has these packages (for example the project's dev environment), you can skip installation.

In [1]:
# This cell prepares the environment and imports the filoma dedup utilities.
# We'll create temporary files and run short examples in-memory so nothing is written to your project.

import os
import tempfile

from filoma import dedup
from filoma.dataframe import DataFrame
from filoma.files.file_profiler import FileProfiler
from filoma.images.image_profiler import ImageProfiler

# check Pillow availability for image examples
try:
    from PIL import Image

    _HAS_PIL = True
except Exception:
    _HAS_PIL = False

print("dedup module loaded:", dedup)
print("Pillow available:", _HAS_PIL)
dedup module loaded: <module 'filoma.dedup' from '/home/kalfasy/repos/filoma/src/filoma/dedup.py'>
Pillow available: True

Standalone text dedup example¶

What do we use for text deduplication?

  • Text near-duplicate detection often uses "shingles": short sequences of consecutive words (for example 3-word shingles).
  • The Jaccard similarity between the sets of shingles of two documents gives a simple measure of how similar they are (intersection / union).
  • MinHash is a probabilistic technique that approximates Jaccard similarity much faster for large document collections; datasketch provides an implementation.
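To build intuition for the MinHash bullet above, here is a deliberately naive pure-Python sketch of the idea: keep, for each of many salted hash functions, the minimum hash value over a document's shingles; the fraction of positions where two signatures agree estimates the Jaccard similarity. This is only an illustration; datasketch's `MinHash` is the optimized version you would actually use at scale.

```python
import hashlib

def word_shingles(text, k=3):
    toks = text.lower().split()
    return {" ".join(toks[i : i + k]) for i in range(len(toks) - k + 1)}

def minhash_signature(shingle_set, num_perm=128):
    # One "hash function" per seed: the signature stores the minimum value
    # that function assigns to any shingle in the set.
    return [
        min(int(hashlib.md5(f"{seed}:{s}".encode()).hexdigest(), 16) for s in shingle_set)
        for seed in range(num_perm)
    ]

def estimate_jaccard(sig_a, sig_b):
    # Agreement rate between signatures approximates the true Jaccard index.
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

a = word_shingles("the quick brown fox jumps over the lazy dog")
b = word_shingles("the quick brown fox jumped over the lazy dog")
true_j = len(a & b) / len(a | b)
est_j = estimate_jaccard(minhash_signature(a), minhash_signature(b))
print(f"true Jaccard = {true_j:.3f}, MinHash estimate = {est_j:.3f}")
```

The payoff is that signatures have fixed size regardless of document length, and LSH can bucket similar signatures without comparing every pair.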

Create two small text files that are near-duplicates and run find_duplicates with a lower threshold for short texts.

In [2]:
# Create two small text files that are almost the same, then run dedup and explain the results.
with tempfile.TemporaryDirectory() as td:
    p1 = os.path.join(td, "a.txt")
    p2 = os.path.join(td, "b.txt")

    text1 = "the quick brown fox jumps over the lazy dog"
    text2 = "the quick brown fox jumped over the lazy dog"  # only "jumps" -> "jumped"

    # Write files
    with open(p1, "w") as f:
        f.write(text1)
    with open(p2, "w") as f:
        f.write(text2)

    print("Created two files:")
    print(" -", p1)
    print("   contents:", text1)
    print(" -", p2)
    print("   contents:", text2)
    print()

    # Helper to compute k-word shingles and a human-readable Jaccard similarity
    def word_shingles(text, k=3):
        toks = text.lower().split()
        if len(toks) < k:
            return set()
        return {" ".join(toks[i : i + k]) for i in range(len(toks) - k + 1)}

    k = 3
    s1 = word_shingles(text1, k=k)
    s2 = word_shingles(text2, k=k)
    inter = s1 & s2
    union = s1 | s2
    jaccard = len(inter) / len(union) if union else 0.0

    print(f"Computed {k}-word shingles for each file:")
    print(" shingles a.txt:", s1)
    print(" shingles b.txt:", s2)
    print(f" Jaccard similarity (intersection/union): {len(inter)}/{len(union)} = {jaccard:.3f}")
    print()

    # Parameters used by the dedup routine
    text_threshold = 0.4
    print(f"Calling dedup.find_duplicates(text_k={k}, text_threshold={text_threshold})")
    res = dedup.find_duplicates([p1, p2], text_k=k, text_threshold=text_threshold)
    print("Raw return value from find_duplicates():")
    print(res)
    print()

    # Interpret the result in plain language
    if res.get("exact"):
        print("Exact duplicate groups (byte-for-byte identical files):")
        for group in res["exact"]:
            print("  -", group)
    else:
        print("No exact (byte-for-byte) duplicates found.")

    if res.get("text"):
        print("\nText-based near-duplicate groups (based on shingles/jaccard):")
        for group in res["text"]:
            # group is a list of file paths (or items); print paths and our computed jaccard for clarity
            print("  Group:")
            for item in group:
                # item may be a path string or a dict; handle both gracefully
                path = item if isinstance(item, str) else item.get("path", str(item))
                print("   -", path)
            # since we only compared two files here, print the computed Jaccard as additional context
            print(f"   (Computed {k}-word Jaccard between the two files = {jaccard:.3f}; threshold was {text_threshold})")
    else:
        print("\nNo text near-duplicate groups found (threshold was not met).")

    if res.get("image"):
        print("\nImage-based groups (not expected here):")
        print(res["image"])
Created two files:
 - /tmp/tmpjudpnmpf/a.txt
   contents: the quick brown fox jumps over the lazy dog
 - /tmp/tmpjudpnmpf/b.txt
   contents: the quick brown fox jumped over the lazy dog

Computed 3-word shingles for each file:
 shingles a.txt: {'the lazy dog', 'the quick brown', 'brown fox jumps', 'quick brown fox', 'over the lazy', 'fox jumps over', 'jumps over the'}
 shingles b.txt: {'jumped over the', 'the quick brown', 'quick brown fox', 'over the lazy', 'fox jumped over', 'the lazy dog', 'brown fox jumped'}
 Jaccard similarity (intersection/union): 4/10 = 0.400

Calling dedup.find_duplicates(text_k=3, text_threshold=0.4)
Raw return value from find_duplicates():
{'exact': [], 'text': [['/tmp/tmpjudpnmpf/a.txt', '/tmp/tmpjudpnmpf/b.txt']], 'image': []}

No exact (byte-for-byte) duplicates found.

Text-based near-duplicate groups (based on shingles/jaccard):
  Group:
   - /tmp/tmpjudpnmpf/a.txt
   - /tmp/tmpjudpnmpf/b.txt
   (Computed 3-word Jaccard between the two files = 0.400; threshold was 0.4)

Standalone image dedup example¶

What do we use for image deduplication?

  • Perceptual hashing converts an image into a compact fingerprint that captures visual structure rather than raw bytes.
  • Simple hashes like aHash and dHash are fast and robust to small changes like resizing or compression.
  • Compare hashes with Hamming distance: small distances mean visually similar images.
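To see why the Hamming distance stays small under tiny edits, here is a from-scratch sketch of the aHash idea on an 8x8 grayscale grid (assume the image has already been downsampled; filoma's implementation may differ in details such as resizing, bit order, and hex encoding):

```python
# A from-scratch sketch of the aHash idea, assuming the image has already been
# downsampled to an 8x8 grayscale grid.
def ahash_bits(gray):
    flat = [v for row in gray for v in row]
    avg = sum(flat) / len(flat)
    # One bit per cell: 1 if the cell is brighter than the average.
    return [1 if v > avg else 0 for v in flat]

def hamming(bits_a, bits_b):
    return sum(a != b for a, b in zip(bits_a, bits_b))

# A smooth gradient "image" and a copy with one cell nudged slightly.
grid = [[(r * 8 + c) * 4 for c in range(8)] for r in range(8)]
tweaked = [row[:] for row in grid]
tweaked[0][0] += 3  # tiny edit, like a one-pixel change

h1, h2 = ahash_bits(grid), ahash_bits(tweaked)
print("Hamming distance after a tiny edit:", hamming(h1, h2))  # → 0
```

A small perturbation barely moves the average, so almost no bits flip; that robustness is exactly what makes perceptual hashes useful for near-duplicate images.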

If Pillow is available, this creates two nearly identical images (the second differs by a single pixel) and demonstrates perceptual hashing and grouping.

In [3]:
# Image dedup demo with clear, human-friendly explanations
if not _HAS_PIL:
    print("Pillow is not available; skipping image dedup example")
else:
    with tempfile.TemporaryDirectory() as td:
        # Create two images: start identical, then make a very small change to the second
        p1 = os.path.join(td, "img1.png")
        p2 = os.path.join(td, "img2.png")

        # Base image (uniform color)
        img = Image.new("RGB", (64, 64), color=(123, 200, 100))
        img.save(p1)

        # Make a nearly imperceptible change: alter a single pixel in a copy
        img2 = img.copy()
        img2.putpixel((0, 0), (124, 200, 100))  # tiny change, one pixel different
        img2.save(p2)

        print("Created two images:")
        print(" -", p1, "(original)")
        print(" -", p2, "(one pixel changed)\n")

        # Compute perceptual hashes (aHash) for both files
        h1 = dedup.ahash_image(p1)
        h2 = dedup.ahash_image(p2)
        print("Perceptual hashes (aHash):")
        print(" aHash img1:", h1)
        print(" aHash img2:", h2)

        # Compute Hamming distance between the two hex hashes.
        # Convert hex to int, xor, then count differing bits.
        try:
            xor = int(h1, 16) ^ int(h2, 16)
            ham_dist = xor.bit_count()  # int.bit_count() requires Python 3.10+; older versions fall through below
        except Exception:
            # Fallback if hash format is unexpected
            ham_dist = sum(c1 != c2 for c1, c2 in zip(h1, h2))

        print(f"\nHamming distance between the two aHashes: {ham_dist}")

        # Use the high-level dedup helper to find duplicate/near-duplicate groups
        threshold = 2
        print(f"\nCalling dedup.find_duplicates([...], image_max_distance={threshold})")
        res = dedup.find_duplicates([p1, p2], image_max_distance=threshold)
        print("Raw return value from find_duplicates():")
        print(res)
        print()

        # Interpret the results in plain language
        if res.get("image"):
            print("Image-based near-duplicate groups found:")
            for group in res["image"]:
                print(" Group:")
                for item in group:
                    path = item if isinstance(item, str) else item.get("path", str(item))
                    print("  -", path)
            print(
                f"\nInterpretation: the two images were grouped because their aHash Hamming distance ({ham_dist}) "
                f"is <= the threshold ({threshold}). Small visual changes (like one pixel) often produce small "
                "Hamming distances so perceptual hashing is useful to catch visually-similar images."
            )
        else:
            print("No image-based groups found.")
            print(
                f"Interpretation: the two images were considered different for the chosen threshold ({threshold}). "
                "You can increase the threshold to treat more-distorted images as near-duplicates, or try a different "
                "hash (dHash/pHash) if you need different sensitivity characteristics."
            )
Created two images:
 - /tmp/tmpnn_uqkgz/img1.png (original)
 - /tmp/tmpnn_uqkgz/img2.png (one pixel changed)

Perceptual hashes (aHash):
 aHash img1: ffffffffffffffff
 aHash img2: ffffffffffffffff

Hamming distance between the two aHashes: 0

Calling dedup.find_duplicates([...], image_max_distance=2)
Raw return value from find_duplicates():
{'exact': [], 'text': [], 'image': [['/tmp/tmpnn_uqkgz/img1.png', '/tmp/tmpnn_uqkgz/img2.png']]}

Image-based near-duplicate groups found:
 Group:
  - /tmp/tmpnn_uqkgz/img1.png
  - /tmp/tmpnn_uqkgz/img2.png

Interpretation: the two images were grouped because their aHash Hamming distance (0) is <= the threshold (2). Small visual changes (like one pixel) often produce small Hamming distances so perceptual hashing is useful to catch visually-similar images.

Using FileProfiler for dedup fingerprints¶

What does filoma's FileProfiler provide for deduplication?

  • FileProfiler gathers filesystem metadata (size, timestamps) and can compute a SHA256 fingerprint of file contents.
  • fingerprint_for_dedup() is a compact representation useful when scanning many files: you can store fingerprints and later group files with identical fingerprints or similar derived features (text shingles or image hashes).

FileProfiler exposes fingerprint_for_dedup() which produces a compact dict with sha256 and optional text_shingles or image_hash. This is handy for pipeline-style scanning.

In [4]:
prof = FileProfiler()  # create a FileProfiler instance for the example
# FileProfiler example with clear, human-friendly explanations.
with tempfile.TemporaryDirectory() as td:
    p = os.path.join(td, "doc.txt")
    text = "this is a sample document used for dedup testing"
    with open(p, "w") as f:
        f.write(text)

    # ask the profiler for the compact fingerprint used by dedup routines
    fp = prof.fingerprint_for_dedup(p, compute_text=True)

    # Friendly, step-by-step explanation of the results
    print("Fingerprint for dedup (human-friendly explanation):\n")

    print("1) Basic file info")
    print("   path :", fp.get("path"))
    print("   size :", fp.get("size"), "bytes  -> size quickly rules out equality when different")
    print()

    print("2) Exact fingerprint")
    print("   sha256:", fp.get("sha256"))
    print("   -> SHA256 is an exact content fingerprint: two files with the same SHA256 are byte-for-byte identical.")
    print()

    print("3) Text shingles (used for near-duplicate text detection)")
    shingles = fp.get("text_shingles")
    if shingles:
        print(f"   {len(shingles)} shingles (overlapping phrases):")
        for s in sorted(shingles):
            print("    -", s)
        print("\n   What this means:")
        print("    - Each shingle is a short overlapping phrase (here 3-word phrases).")
        print("    - To estimate text similarity between two files we compute the Jaccard index between their shingle sets")
        print("      (intersection size / union size). More shared shingles -> more similar text.")
        print()

        # quick verification so the user sees where the shingles come from
        def word_shingles(text, k=3):
            toks = text.lower().split()
            if len(toks) < k:
                return set()
            return {" ".join(toks[i : i + k]) for i in range(len(toks) - k + 1)}

        print("   Verification (recomputing 3-word shingles from the file contents):")
        print("   ", word_shingles(text, k=3))
    else:
        print("   No text shingles were computed for this file.")
        print("   (If compute_text=True was passed, check profiler configuration or file contents.)")

    print()
    print("4) Image hash field (not relevant here):", fp.get("image_hash"))
    print("\nSummary: the fingerprint contains an exact hash (SHA256) for byte-perfect deduping and")
    print("a set of text shingles useful for near-duplicate detection using Jaccard/MinHash approaches.")
Fingerprint for dedup (human-friendly explanation):

1) Basic file info
   path : /tmp/tmpiucdpm15/doc.txt
   size : 48 bytes  -> size quickly rules out equality when different

2) Exact fingerprint
   sha256: 064d7354bd3bf25c401f0899a9cde918cedf90f80392ff43080028a551e15782
   -> SHA256 is an exact content fingerprint: two files with the same SHA256 are byte-for-byte identical.

3) Text shingles (used for near-duplicate text detection)
   7 shingles (overlapping phrases):
    - a sample document
    - document used for
    - for dedup testing
    - is a sample
    - sample document used
    - this is a
    - used for dedup

   What this means:
    - Each shingle is a short overlapping phrase (here 3-word phrases).
    - To estimate text similarity between two files we compute the Jaccard index between their shingle sets
      (intersection size / union size). More shared shingles -> more similar text.

   Verification (recomputing 3-word shingles from the file contents):
    {'this is a', 'for dedup testing', 'sample document used', 'is a sample', 'a sample document', 'document used for', 'used for dedup'}

4) Image hash field (not relevant here): None

Summary: the fingerprint contains an exact hash (SHA256) for byte-perfect deduping and
a set of text shingles useful for near-duplicate detection using Jaccard/MinHash approaches.
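Fingerprints like the one above are easy to accumulate across a scan and group afterwards. The sketch below uses hand-written fingerprint dicts whose keys mirror the fields shown (path, size, sha256); in a real pipeline each dict would come from `prof.fingerprint_for_dedup()`:

```python
# Hypothetical pre-computed fingerprints; in practice each dict would be
# produced by FileProfiler.fingerprint_for_dedup() during a scan.
fingerprints = [
    {"path": "/data/a.txt", "size": 48, "sha256": "aaa111"},
    {"path": "/data/a_copy.txt", "size": 48, "sha256": "aaa111"},
    {"path": "/data/b.txt", "size": 51, "sha256": "bbb222"},
]

# Group by (size, sha256): matching size is a cheap pre-filter, and a matching
# SHA256 means the contents are byte-for-byte identical.
groups = {}
for fp in fingerprints:
    groups.setdefault((fp["size"], fp["sha256"]), []).append(fp["path"])

exact_dups = [paths for paths in groups.values() if len(paths) > 1]
print(exact_dups)  # → [['/data/a.txt', '/data/a_copy.txt']]
```

Storing fingerprints (rather than re-reading files) lets you run this grouping step repeatedly, or extend it later with shingle or image-hash comparisons.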

Using ImageProfiler to compute perceptual hashes¶

What does filoma's ImageProfiler provide for deduplication?

  • Use ImageProfiler when you already treat images as data (numpy arrays) or when you want a consistent interface for computing image statistics and perceptual hashes.
  • Perceptual hashes let you cluster or filter visually-similar images before more expensive visual comparisons.

The ImageProfiler exposes compute_ahash / compute_dhash which delegate to filoma.dedup for consistent hash computation.

In [5]:
# ImageProfiler example with clear, human-friendly explanations.

ip = ImageProfiler()  # create an ImageProfiler instance for the example

# Reuse the `ip` instance created above for all hash computations.
if not _HAS_PIL:
    print("Pillow is not available; skipping ImageProfiler example")
else:
    print("Using existing ImageProfiler instance:", ip)
    with tempfile.TemporaryDirectory() as td:
        img_path = os.path.join(td, "img.png")

        # Create a small RGB image and save it to a temp file.
        new_img = Image.new("RGB", (32, 32), color=(10, 20, 30))
        new_img.save(img_path)
        print("\nCreated a test image at:", img_path)
        print(" Image size:", new_img.size, "mode:", new_img.mode)

        # Compute perceptual hashes via ImageProfiler.
        ahash = ip.compute_ahash(img_path)
        dhash = ip.compute_dhash(img_path)

        # Explain what each hash is and show the values.
        print("\nComputed perceptual hashes:")
        print(" - aHash (average hash):", ahash)
        print("   -> aHash summarizes average light/dark pattern; small visual edits usually produce small Hamming distances.")
        print(" - dHash (difference hash):", dhash)
        print("   -> dHash captures gradient differences; complementary sensitivity to aHash.")

        # Quick verification: computing the same hash twice on the same file should give identical results.
        ahash2 = ip.compute_ahash(img_path)
        print("\nVerification: recomputing aHash on the same file yields identical value:", ahash2 == ahash)
        print(" Recomputed aHash:", ahash2)

        # Show how you'd compare two hashes (Hamming distance example).
        # For demonstration we compare the aHash with itself (distance should be 0).
        try:
            xor = int(ahash, 16) ^ int(ahash2, 16)
            ham_dist = xor.bit_count()
            print("\nExample Hamming distance (aHash vs same aHash):", ham_dist, "(0 means identical)")
        except Exception:
            print("\nCould not compute Hamming distance (unexpected hash format).")

        # Short guidance for next steps.
        print("\nNotes:")
        print(" - To detect near-duplicate images, compute hashes for many files and compare Hamming distances.")
        print(" - Small distances indicate visually similar images; choose a threshold based on your tolerance for differences.")
        print(" - When comparing aHash and dHash directly, be careful: different algorithms may have different bit lengths/meanings.")
Using existing ImageProfiler instance: <filoma.images.image_profiler.ImageProfiler object at 0x79862ffc6950>

Created a test image at: /tmp/tmpuks7fj71/img.png
 Image size: (32, 32) mode: RGB

Computed perceptual hashes:
 - aHash (average hash): ffffffffffffffff
   -> aHash summarizes average light/dark pattern; small visual edits usually produce small Hamming distances.
 - dHash (difference hash): 0000000000000000
   -> dHash captures gradient differences; complementary sensitivity to aHash.

Verification: recomputing aHash on the same file yields identical value: True
 Recomputed aHash: ffffffffffffffff

Example Hamming distance (aHash vs same aHash): 0 (0 means identical)

Notes:
 - To detect near-duplicate images, compute hashes for many files and compare Hamming distances.
 - Small distances indicate visually similar images; choose a threshold based on your tolerance for differences.
 - When comparing aHash and dHash directly, be careful: different algorithms may have different bit lengths/meanings.

DataFrame convenience: evaluate_duplicates()¶

What does DataFrame.evaluate_duplicates() provide for deduplication?

  • DataFrame.evaluate_duplicates() is a convenience for quickly assessing duplicates in a DataFrame that contains file paths.
  • It's useful for exploratory data analysis and quick cleaning steps, but for production-scale deduplication you should prefer MinHash/LSH or a specialized image search index.
  • The method returns groups and can be used to export or mark duplicates for removal.

DataFrame.evaluate_duplicates() scans the path column and prints a small Rich summary table. It returns the raw groups for programmatic use.

In [6]:
# Friendly, step-by-step explanation of DataFrame.evaluate_duplicates()

# Build a DataFrame from the two text files used earlier and run the evaluation.
# Note: everything runs inside the `with` block so the temporary files still
# exist when evaluate_duplicates() reads them.
with tempfile.TemporaryDirectory() as td:
    p1 = os.path.join(td, "a.txt")
    p2 = os.path.join(td, "b.txt")
    with open(p1, "w") as f:
        f.write("the quick brown fox jumps over the lazy dog")
    with open(p2, "w") as f:
        f.write("the quick brown fox jumped over the lazy dog")

    df = DataFrame([p1, p2])

    print("DataFrame we'll evaluate (contains file paths):")
    print(df)
    print()

    print("Plan:")
    print(f"- Run df.evaluate_duplicates(text_k={k}, text_threshold={text_threshold})")
    print("- This checks for exact (SHA256) duplicates and text near-duplicates using k-word shingles + Jaccard.")
    print()

    groups = df.evaluate_duplicates(text_k=k, text_threshold=text_threshold, show_table=True)
    print("\nRaw returned groups:")
    print(groups)
    print()

    # Human-friendly interpretation using the shingle/Jaccard values already
    # computed in the earlier text example (the file contents are identical).
    print("Interpretation:")
    if groups.get("exact"):
        print(" - Exact duplicate groups (byte-for-byte identical):")
        for grp in groups["exact"]:
            print("   >", grp)
    else:
        print(" - No exact duplicates found.")

    if groups.get("text"):
        print(" - Text near-duplicate groups found:")
        print(f"   The two files were grouped because their {k}-word Jaccard similarity = {jaccard:.3f} >= threshold {text_threshold}.")
    else:
        print(" - No text near-duplicate groups found (threshold not met).")

    if groups.get("image"):
        print(" - Image groups (not relevant for these text files):", groups["image"])

    print("\nSummary: evaluate_duplicates scanned the 'path' column, computed exact hashes and text shingles,")
    print("and grouped files whose shingle-based Jaccard met the threshold. To change sensitivity, adjust text_k or text_threshold.")
DataFrame we'll evaluate (contains file paths):
filoma.DataFrame with 2 rows
shape: (2, 1)
┌────────────────────────┐
│ path                   │
│ ---                    │
│ str                    │
╞════════════════════════╡
│ /tmp/tmps6z965tp/a.txt │
│ /tmp/tmps6z965tp/b.txt │
└────────────────────────┘

Plan:
- Run df.evaluate_duplicates(text_k=3, text_threshold=0.4)
- This checks for exact (SHA256) duplicates and text near-duplicates using k-word shingles + Jaccard.

         Duplicate Summary          
┏━━━━━━━┳━━━━━━━━┳━━━━━━━━━━━━━━━━━┓
┃ Type  ┃ Groups ┃ Files In Groups ┃
┡━━━━━━━╇━━━━━━━━╇━━━━━━━━━━━━━━━━━┩
│ exact │ 0      │ 0               │
│ text  │ 1      │ 2               │
│ image │ 0      │ 0               │
└───────┴────────┴─────────────────┘
2025-09-13 22:43:56.446 | INFO     | filoma.dataframe:evaluate_duplicates:915 - Duplicate summary: exact=0 groups (0 files), text=1 groups (2 files), image=0 groups (0 files)

Raw returned groups:
{'exact': [], 'text': [['/tmp/tmps6z965tp/a.txt', '/tmp/tmps6z965tp/b.txt']], 'image': []}

Interpretation:
 - No exact duplicates found.
 - Text near-duplicate groups found:
   The two files were grouped because their 3-word Jaccard similarity = 0.400 >= threshold 0.4.

Summary: evaluate_duplicates scanned the 'path' column, computed exact hashes and text shingles,
and grouped files whose shingle-based Jaccard met the threshold. To change sensitivity, adjust text_k or text_threshold.
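Once you have the returned groups, a common follow-up is to keep one representative per group and flag the rest for review or removal. A small sketch over the returned `{'exact': ..., 'text': ..., 'image': ...}` shape (the paths here are illustrative, not from a real run):

```python
# Illustrative groups in the shape returned by evaluate_duplicates().
groups = {
    "exact": [["/data/a.txt", "/data/a_copy.txt"]],
    "text": [["/data/b.txt", "/data/b_edited.txt"]],
    "image": [],
}

def removal_candidates(groups):
    """Keep the first path in each group; flag the rest as removal candidates."""
    to_remove = []
    for kind, kind_groups in groups.items():
        for group in kind_groups:
            keep, *rest = group
            to_remove.extend((kind, keep, path) for path in rest)
    return to_remove

for kind, keep, path in removal_candidates(groups):
    print(f"{kind}: keep {keep}, candidate for removal: {path}")
```

Whether you drop, label, or move the flagged files is a workflow decision; the groups themselves are just data you can export.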

Closing notes¶

  • For large datasets consider using datasketch.MinHash + LSH to scale text similarity.
  • For image deduping at scale consider using perceptual hashes + a nearest-neighbor index.
  • DataFrame.evaluate_duplicates() is intended as a quick way to get actionable groups; you can export the groups and apply cleaning workflows (drop, label, or move duplicates).