Metadata-Version: 2.4
Name: semhash
Version: 0.3.3
Summary: Fast Semantic Text Deduplication & Filtering
Author-email: Thomas van Dongen <thomas123@live.nl>, Stéphan Tulkens <stephantul@gmail.com>
License: MIT License
        
        Copyright (c) 2024 The Minish Lab
        
        Permission is hereby granted, free of charge, to any person obtaining a copy
        of this software and associated documentation files (the "Software"), to deal
        in the Software without restriction, including without limitation the rights
        to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
        copies of the Software, and to permit persons to whom the Software is
        furnished to do so, subject to the following conditions:
        
        The above copyright notice and this permission notice shall be included in all
        copies or substantial portions of the Software.
        
        THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
        IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
        FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
        AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
        LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
        OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
        SOFTWARE.
        
Project-URL: Homepage, https://github.com/MinishLab
Project-URL: Bug Reports, https://github.com/MinishLab/semhash/issues
Project-URL: Source, https://github.com/MinishLab/semhash
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Software Development :: Libraries
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3 :: Only
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Natural Language :: English
Requires-Python: >=3.9
Description-Content-Type: text/markdown
Requires-Dist: model2vec>=0.3.4
Requires-Dist: vicinity[usearch]>=0.4.3
Requires-Dist: frozendict
Provides-Extra: dev
Requires-Dist: black; extra == "dev"
Requires-Dist: ipython; extra == "dev"
Requires-Dist: mypy; extra == "dev"
Requires-Dist: pre-commit; extra == "dev"
Requires-Dist: pytest; extra == "dev"
Requires-Dist: pytest-coverage; extra == "dev"
Requires-Dist: ruff; extra == "dev"



<h2 align="center">
  <img width="30%" alt="SemHash logo" src="assets/images/semhash_logo_v2.png"><br/>
  Fast Semantic Text Deduplication & Filtering
</h2>


<div align="center">




<h2>
    <a href="https://pypi.org/project/semhash/"><img src="https://img.shields.io/pypi/v/semhash?color=%23007ec6&label=pypi%20package" alt="Package version"></a>
    <a href="https://pypi.org/project/semhash/"><img src="https://img.shields.io/pypi/pyversions/semhash" alt="Supported Python versions"></a>
    <a href="https://pepy.tech/project/semhash">
      <img src="https://static.pepy.tech/badge/semhash" alt="Downloads">
    </a>
    <a href="https://app.codecov.io/gh/MinishLab/semhash">
        <img src="https://codecov.io/gh/MinishLab/semhash/graph/badge.svg?token=YPOD6HD0MG" alt="Codecov">
    </a>
    <a href="https://discord.gg/4BDPR5nmtK">
        <img src="https://img.shields.io/badge/Join-Discord-5865F2?logo=discord&logoColor=white" alt="Join Discord">
    </a>
    <a href="https://github.com/MinishLab/semhash/blob/main/LICENSE">
        <img src="https://img.shields.io/badge/license-MIT-green" alt="License - MIT">
    </a>
</h2>



[Quickstart](#quickstart) •
[Main Features](#main-features) •
[Usage](#usage) •
[Benchmarks](#benchmarks)

</div>


SemHash is a lightweight and flexible tool for deduplicating datasets, filtering outliers, and finding representative samples using semantic similarity. It combines fast embedding generation from [Model2Vec](https://github.com/MinishLab/model2vec) with efficient ANN-based similarity search through [Vicinity](https://github.com/MinishLab/vicinity).

SemHash supports both single-dataset deduplication & filtering (e.g., cleaning up a train set by removing duplicates and outliers) and multi-dataset deduplication & filtering (e.g., ensuring no overlap between a test set and a train set). It works with simple datasets, such as text lists, and more complex ones, like multi-column QA datasets. Additionally, it includes functions to inspect deduplication results, making it easier to understand and refine your data cleaning process.

## Quickstart

Install the package with:
```bash
pip install semhash
```

Deduplicate a single dataset, filter outliers, and find representative samples with the following code (note: the examples assume you have `datasets` installed, which you can install with `pip install datasets`):

```python
from datasets import load_dataset
from semhash import SemHash

# Load a dataset to deduplicate
texts = load_dataset("ag_news", split="train")["text"]

# Initialize a SemHash instance
semhash = SemHash.from_records(records=texts)

# Deduplicate the texts
deduplicated_texts = semhash.self_deduplicate().selected

# Filter outliers
filtered_texts = semhash.self_filter_outliers().selected

# Find representative texts
representative_texts = semhash.self_find_representative().selected
```

Or, deduplicate across two datasets, filter outliers, and find representative samples with the following code (e.g., eliminating train/test leakage):

```python
from datasets import load_dataset
from semhash import SemHash

# Load two datasets to deduplicate
train_texts = load_dataset("ag_news", split="train")["text"]
test_texts = load_dataset("ag_news", split="test")["text"]

# Initialize a SemHash instance with the training data
semhash = SemHash.from_records(records=train_texts)

# Deduplicate the test data against the training data, optionally with a specific threshold
deduplicated_test_texts = semhash.deduplicate(records=test_texts, threshold=0.9).selected

# Filter outliers from the test data against the training data,
# optionally with a specific percentage
filtered_test_texts = semhash.filter_outliers(records=test_texts, outlier_percentage=0.1).selected

# Find representative texts in the test data against the training data,
# optionally with a specific selection size
representative_test_texts = semhash.find_representative(
    records=test_texts, selection_size=10).selected


```

Or, deduplicate multi-column dataset, filter outliers, and find representative samples with the following code (e.g., deduplicating a QA dataset):

```python
from datasets import load_dataset
from semhash import SemHash

# Load the dataset
dataset = load_dataset("squad_v2", split="train")

# Convert the dataset to a list of dictionaries
records = [dict(row) for row in dataset]

# Initialize SemHash with the columns to deduplicate
semhash = SemHash.from_records(records=records, columns=["question", "context"])

# Deduplicate the records
deduplicated_records = semhash.self_deduplicate().selected

# Filter outliers from the records
filtered_texts = semhash.self_filter_outliers().selected

# Find representative texts in the records
representative_texts = semhash.self_find_representative().selected
```

The `deduplicate` and `self_deduplicate` functions return a [DeduplicationResult](https://github.com/MinishLab/semhash/blob/main/semhash/datamodels.py#L30). This object stores the deduplicated corpus, a set of duplicate object (along with the objects that caused duplication), and several useful functions to further inspect the deduplication result. Examples of how these functions can be used can be found in the [usage](#usage) section.

The `filter_outliers`, `self_filter_outliers`, `find_representative`, and `self_find_representative` functions return a [FilterResult](https://github.com/MinishLab/semhash/blob/main/semhash/datamodels.py#106). This object stores the found outliers/representative samples.

For both the `DeduplicationResult` and `FilterResult` objects, you can easily view the filtered records with the `selected` attribute (e.g. to view outliers: `outliers = semhash.self_filter_outliers().filtered`)
## Main Features

- **Fast**: SemHash uses [model2vec](https://github.com/MinishLab/model2vec) to embed texts and [vicinity](https://github.com/MinishLab/vicinity) to perform similarity search, making it extremely fast.
- **Scalable**: SemHash can deduplicate & filter large datasets with millions of records thanks to the ANN backends in Vicinity.
- **Flexible**: SemHash can be used to deduplicate & filter a single dataset or across two datasets, and can also be used to deduplicate & filter multi-column datasets (such as QA datasets).
- **Lightweight**: SemHash is a lightweight package with minimal dependencies, making it easy to install and use.
- **Explainable**: Easily inspect the duplicates and what caused them with the `DeduplicationResult` object. You can also view the lowest similarity duplicates to find the right threshold for deduplication for your dataset.

## Usage

The following examples show the various ways you can use SemHash to deduplicate datasets, filter outliers, and find representative samples. These examples assume you have the `datasets` library installed, which you can install with `pip install datasets`.

<details>
<summary>  Deduplicate, filter outliers, and find representative samples on a single dataset </summary>
<br>

The following code snippet shows how to deduplicate a single dataset, filter outliers, and find representative samples using SemHash (in this example, the train split of the [AG News dataset](https://huggingface.co/datasets/fancyzhx/ag_news)):

```python
from datasets import load_dataset
from semhash import SemHash

# Load a dataset to deduplicate
texts = load_dataset("ag_news", split="train")["text"]

# Initialize a SemHash instance
semhash = SemHash.from_records(records=texts)

# Deduplicate the texts
deduplicated_texts = semhash.self_deduplicate().selected

# Filter outliers
filtered_texts = semhash.self_filter_outliers().selected

# Find representative texts
representative_texts = semhash.self_find_representative().selected
```
</details>

<details>
<summary>  Deduplicate, filter outliers, and find representative samples across two datasets </summary>
<br>

The following code snippet shows how to deduplicate across two datasets, filter outliers, and find representative samples using SemHash (in this example, the train/test split of the [AG News dataset](https://huggingface.co/datasets/fancyzhx/ag_news)):

```python
from datasets import load_dataset
from semhash import SemHash

# Initialize a SemHash instance
semhash = SemHash()

# Load two datasets to deduplicate
train_texts = load_dataset("ag_news", split="train")["text"]
test_texts = load_dataset("ag_news", split="test")["text"]

# Initialize a SemHash instance
semhash = SemHash.from_records(records=train_texts)

# Deduplicate the test data against the training data
deduplicated_test_texts = semhash.deduplicate(records=test_texts).selected

# Filter outliers from the test data
filtered_test_texts = semhash.filter_outliers(records=test_texts).selected

# Find representative texts in the test data
representative_test_texts = semhash.find_representative(records=test_texts).selected
```

</details>

<details>
<summary>  Deduplicate, filter outliers, and find representative samples on multi-column datasets </summary>
<br>

The following code snippet shows how to deduplicate multi-column datasets, filter outliers, and find representative samples using SemHash (in this example, the train split of the QA dataset [SQuAD 2.0](https://huggingface.co/datasets/rajpurkar/squad_v2), which consists of questions, contexts, and answers):

```python
from datasets import load_dataset
from semhash import SemHash

# Load the dataset
dataset = load_dataset("squad_v2", split="train")

# Convert the dataset to a list of dictionaries
records = [dict(row) for row in dataset]

# Initialize SemHash with the columns to deduplicate
semhash = SemHash.from_records(records=records, columns=["question", "context"])

# Deduplicate the records
deduplicated_records = semhash.self_deduplicate().selected

# Filter outliers from the records
filtered_records = semhash.self_filter_outliers().selected

# Find representative samples in the records
representative_records = semhash.self_find_representative().selected
```

</details>

<details>
<summary>  DeduplicationResult functionality </summary>
<br>

The `DeduplicationResult` object returned by the `deduplicate` and `self_deduplicate` functions contains several useful functions to inspect the deduplication result. The following code snippet shows how to use these functions:

```python
from datasets import load_dataset
from semhash import SemHash

# Load a dataset to deduplicate
texts = load_dataset("ag_news", split="train")["text"]

# Initialize a SemHash instance
semhash = SemHash.from_records(records=texts)

# Deduplicate the texts
deduplication_result = semhash.self_deduplicate()

# Check the deduplicated texts
deduplication_result.selected
# Check the duplicates
deduplication_result.filtered
# See what percentage of the texts were duplicates
deduplication_result.duplicate_ratio
# See what percentage of the texts were exact duplicates
deduplication_result.exact_duplicate_ratio

# Get the least similar text from the duplicates. This is useful for finding the right threshold for deduplication.
least_similar = deduplication_result.get_least_similar_from_duplicates()

# Rethreshold the duplicates. This allows you to instantly rethreshold the duplicates with a new threshold without having to re-deduplicate the texts.
deduplication_result.rethreshold(0.95)

# View selected records along with their duplicates.
# This is the opposite of the `filtered` attribute, which shows for every duplicate the record that caused it.
deduplication_result.selected_with_duplicates
```

</details>

<details>
<summary>  Using custom encoders </summary>
<br>

The following code snippet shows how to use a custom encoder with SemHash:

```python
from datasets import load_dataset
from model2vec import StaticModel
from semhash import SemHash

# Load a dataset to deduplicate
texts = load_dataset("ag_news", split="train")["text"]

# Load an embedding model (in this example, a multilingual model)
model = StaticModel.from_pretrained("minishlab/M2V_multilingual_output")

# Initialize a SemHash with the model and custom encoder
semhash = SemHash.from_records(records=texts, model=model)

# Deduplicate the texts
deduplicated_texts = semhash.self_deduplicate()
```

Any encoder can be used that adheres to our [encoder protocol](https://github.com/MinishLab/semhash/blob/main/semhash/utils.py). For example, any [sentence-transformers](https://github.com/UKPLab/sentence-transformers) model can be used as an encoder:

```python
from datasets import load_dataset
from semhash import SemHash
from sentence_transformers import SentenceTransformer

# Load a dataset to deduplicate
texts = load_dataset("ag_news", split="train")["text"]

# Load a sentence-transformers model
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

# Initialize a SemHash with the model and custom encoder
semhash = SemHash.from_records(records=texts, model=model)

# Deduplicate the texts
deduplicated_texts = semhash.self_deduplicate()
```

</details>




<details>
<summary>  Using custom ANN backends </summary>
<br>

The following code snippet shows how to use a custom ANN backend and custom args with SemHash:

```python
from datasets import load_dataset
from semhash import SemHash
from vicinity import Backend

# Load a dataset to deduplicate
texts = load_dataset("ag_news", split="train")["text"]

# Initialize a SemHash with the model and custom ann backend and custom args
semhash = SemHash.from_records(records=texts, ann_backend=Backend.FAISS, nlist=50)

# Deduplicate the texts
deduplicated_texts = semhash.self_deduplicate()
```

For the full list of supported ANN backends and args, see the [Vicinity docs](https://github.com/MinishLab/vicinity/tree/main?tab=readme-ov-file#supported-backends).

</details>


<details>
<summary>  Using Pandas DataFrames </summary>
<br>

You can easily use Pandas DataFrames with SemHash. The following code snippet shows how to deduplicate a Pandas DataFrame:

```python
import pandas as pd
from datasets import load_dataset
from semhash import SemHash

# Load a dataset as a pandas dataframe
dataframe = load_dataset("ag_news", split="train").to_pandas()

# Convert the dataframe to a list of dictionaries
dataframe = dataframe.to_dict(orient="records")

# Initialize a SemHash instance with the columns to deduplicate
semhash = SemHash.from_records(records=dataframe, columns=["text"])

# Deduplicate the texts
deduplicated_records = semhash.self_deduplicate().selected

# Convert the deduplicated records back to a pandas dataframe
deduplicated_dataframe = pd.DataFrame(deduplicated_records)
```

</details>

NOTE: By default, we use the ANN (approximate-nearest neighbors) backend for deduplication. We recommend keeping this since the recall for smaller datasets is ~100%, and it's needed for larger datasets (>1M samples) since these will take too long to deduplicate without ANN. If you want to use the flat/exact-matching backend, you can set `use_ann=False` in the SemHash constructor:

```python
semhash = SemHash.from_records(records=texts, use_ann=False)
```



## Benchmarks

We've benchmarked SemHash on a variety of datasets to measure the deduplication performance and speed. The benchmarks were run with the following setup:
- The benchmarks were all run on CPU
- The benchmarks were all run with `use_ann=True`
- The used encoder is the default encoder ([potion-base-8M](https://huggingface.co/minishlab/potion-base-8M)).
- The timings include the encoding time, index building time, and deduplication time.
### Train Deduplication Benchmark

| Dataset              |  Original Train Size |  Deduplicated Train Size |  % Removed |   Deduplication Time (s) |
|----------------------|----------------------|--------------------------|------------|--------------------------|
| bbc                  |                 1225 |                     1144 |       6.61 |                     0.57 |
| senteval_cr          |                 3012 |                     2990 |       0.73 |                     0.14 |
| tweet_sentiment_extraction |                27481 |                    26695 |       2.86 |                     1.77 |
| emotion              |                16000 |                    15695 |       1.91 |                     0.77 |
| amazon_counterfactual |                 5000 |                     4992 |       0.16 |                     0.33 |
| ag_news              |               120000 |                   106921 |      10.90 |                     5.20 |
| enron_spam           |                31716 |                    20540 |      35.24 |                     2.03 |
| subj                 |                 8000 |                     7990 |       0.12 |                     0.63 |
| sst5                 |                 8544 |                     8526 |       0.21 |                     0.58 |
| 20_newgroups         |                11314 |                    10684 |       5.57 |                     0.73 |
| hatespeech_offensive |                22783 |                    22090 |       3.04 |                     0.92 |
| ade                  |                17637 |                    15718 |      10.88 |                     0.73 |
| imdb                 |                25000 |                    24830 |       0.68 |                     1.76 |
| massive_scenario     |                11514 |                     9366 |      18.66 |                     0.47 |
| student              |               117519 |                    63856 |      45.66 |                     8.80 |
| squad_v2             |               130319 |                   109698 |      15.82 |                     8.81 |
| wikitext             |              1801350 |                   884645 |      50.89 |                    83.53 |


### Train/Test Deduplication Benchmark

| Dataset              |   Train Size |    Test Size |   Deduplicated Test Size |  % Removed |   Deduplication Time (s) |
|----------------------|--------------|--------------|--------------------------|------------|--------------------------|
| bbc                  |         1225 |         1000 |                      870 |      13.00 |                     0.71 |
| senteval_cr          |         3012 |          753 |                      750 |       0.40 |                     0.13 |
| tweet_sentiment_extraction |        27481 |         3534 |                     3412 |       3.45 |                     1.53 |
| emotion              |        16000 |         2000 |                     1926 |       3.70 |                     0.65 |
| amazon_counterfactual |         5000 |         5000 |                     4990 |       0.20 |                     0.51 |
| ag_news              |       120000 |         7600 |                     6198 |      18.45 |                     3.74 |
| enron_spam           |        31716 |         2000 |                     1060 |      47.00 |                     1.94 |
| subj                 |         8000 |         2000 |                     1999 |       0.05 |                     0.62 |
| sst5                 |         8544 |         2210 |                     2205 |       0.23 |                     0.59 |
| 20_newgroups         |        11314 |         7532 |                     7098 |       5.76 |                     2.25 |
| hatespeech_offensive |        22783 |         2000 |                     1925 |       3.75 |                     0.77 |
| ade                  |        17637 |         5879 |                     4952 |      15.77 |                     0.81 |
| imdb                 |        25000 |        25000 |                    24795 |       0.82 |                     2.81 |
| massive_scenario     |        11514 |         2974 |                     2190 |      26.36 |                     0.46 |
| student              |       117519 |         5000 |                     2393 |      52.14 |                     3.78 |
| squad_v2             |       130319 |        11873 |                    11863 |       0.08 |                     7.13 |
| wikitext             |      1801350 |         4358 |                     2139 |      50.92 |                    40.32 |


As can be seen, SemHash is extremely fast, and scales to large datasets with millions of records. There are some notable examples of train/test leakage, such as `enron_spam` and `student`, where the test dataset contains a significant amount of semantic overlap with the training dataset.

### Reproducing the Benchmarks

To run the benchmarks yourself, you can use the following command (assuming you have the `datasets` library installed):

```bash
python -m benchmarks.run_benchmarks
```
Optionally, the datasets can be updated in the [data.py](https://github.com/MinishLab/semhash/blob/main/benchmarks/data.py) file.

## License

MIT

## Citing

If you use SemHash in your research, please cite the following:
```bibtex
@software{minishlab2025semhash,
  author       = {{van Dongen}, Thomas and Stephan Tulkens},
  title        = {SemHash: Fast Semantic Text Deduplication \& Filtering},
  year         = {2025},
  publisher    = {Zenodo},
  doi          = {10.5281/zenodo.17265942},
  url          = {https://github.com/MinishLab/semhash},
  license      = {MIT}
}
```
