Metadata-Version: 2.1
Name: impresso-pipelines
Version: 0.4.6.7
Summary: A custom package containing multiple pipelines simplifying reproducibility of impresso steps
Author: Glebs Vinarskis
Author-email: gleb.vinarskis@gmail.com
Requires-Python: >=3.11,<3.12
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.11
Provides-Extra: all
Provides-Extra: langident
Provides-Extra: ldatopics
Provides-Extra: newsagencies
Provides-Extra: ocrqa
Provides-Extra: solrnormalization
Requires-Dist: boto3 (>=1.28.0,<2.0.0) ; extra == "ldatopics" or extra == "all"
Requires-Dist: cython
Requires-Dist: floret ; extra == "langident" or extra == "ocrqa" or extra == "ldatopics" or extra == "solrnormalization" or extra == "all"
Requires-Dist: huggingface_hub ; extra == "langident" or extra == "ocrqa" or extra == "ldatopics" or extra == "solrnormalization" or extra == "all"
Requires-Dist: jpype1 (>=1.4.1,<2.0.0) ; extra == "ldatopics" or extra == "solrnormalization" or extra == "all"
Requires-Dist: jsonschema (==4.23.0) ; extra == "ldatopics" or extra == "all"
Requires-Dist: numpy (==1.26.4)
Requires-Dist: pybloomfilter3 ; extra == "ocrqa" or extra == "all"
Requires-Dist: pysolr ; extra == "solrnormalization"
Requires-Dist: python-dotenv (>=1.0.0,<2.0.0) ; extra == "ldatopics" or extra == "all"
Requires-Dist: smart_open (>=6.3.0,<7.0.0) ; extra == "ldatopics" or extra == "all"
Requires-Dist: spacy (==3.6.0) ; extra == "ldatopics" or extra == "all"
Requires-Dist: torch (==2.3.0) ; extra == "newsagencies" or extra == "all"
Requires-Dist: torchvision (==0.18.0) ; extra == "newsagencies"
Requires-Dist: transformers (>=4.49.0,<5.0.0) ; extra == "newsagencies" or extra == "all"
Project-URL: homepage, https://github.com/impresso/impresso-pipelines
Description-Content-Type: text/markdown

# Python Package: `impresso-pipelines`

[![PyPI](https://img.shields.io/pypi/v/impresso-pipelines)](https://pypi.org/project/impresso-pipelines/)
[![Python versions](https://img.shields.io/pypi/pyversions/impresso-pipelines)](https://pypi.org/project/impresso-pipelines/)
[![Weekly Downloads](https://img.shields.io/pypi/dm/impresso-pipelines)](https://pypi.org/project/impresso-pipelines/)
[![Contributors](https://img.shields.io/github/contributors/impresso/impresso-pipelines)](https://github.com/impresso/impresso-pipelines/graphs/contributors)
[![QA Workflow](https://github.com/impresso/impresso-pipelines/actions/workflows/qa.yml/badge.svg)](https://github.com/impresso/impresso-pipelines/actions/workflows/qa.yml)


## Overview
This repository contains a Python package designed for modular and efficient text processing workflows. Currently, it includes the following subpackages:

- **Language Identification Pipeline**: Identifies the language of input text and returns a probability score.
- **OCR QA Pipeline**: Assesses the quality of OCR text by estimating the proportion of recognized words (0–1), using efficient language-specific Bloom filters.
- **LDA Topic Modeling Pipeline**: Soft clustering of input texts using LDA-based topic modeling. 
- **News Agencies Pipeline**: Extracts and ranks news agency entities from text, providing relevance scores and optional links to Wikidata.
- **Lucene/Solr ormalization Pipeline**: Replicates Solr’s language-specific text normalization to clarify how input text is tokenized and indexed in impresso.


## Installation
To install the full package with all submodules:
```bash
pip install "impresso-pipelines[all]"
```
The `[all]` extra installs all dependencies required for each component.


To install individual modules without unnecessary dependencies, use:
```bash
pip install "impresso-pipelines[langident]"         # Language Identification
pip install "impresso-pipelines[ocrqa]"             # OCR QA
pip install "impresso-pipelines[ldatopics]"         # LDA Topics
pip install "impresso-pipelines[newsagencies]"      # News Agencies
pip install "impresso-pipelines[solrnormalization]" # Solr text normalization
```

## Usage
Each pipeline is instantiated from a corresponding class. 
```python
from impresso_pipelines.langident import LangIdentPipeline
from impresso_pipelines.ocrqa import OCRQAPipeline
from impresso_pipelines.ldatopics import LDATopicsPipeline
from impresso_pipelines.newsagencies import NewsAgenciesPipeline
from impresso_pipelines.solrnormalization import SolrNormalizationPipeline
```

## Pipeline Examples
For usage examples, refer to the individual README files:

 - [Langident Pipeline](README_langident.md)
 - [OCR QA Pipeline](REAMDE_ocrqa.md)
 - [LDA Topics Pipeline](README_ldatopics.md)
 - [News Agencies Pipeline](README_newsagencies.md)
 - [Solr normalization Pipeline](README_solrnormalization.md)



See also the interactive notebooks for further examples:
 - [langident_pipeline_demo.ipynb](https://github.com/impresso/impresso-datalab-notebooks/tree/main/annotate/langident_pipeline_demo.ipynb)
 - [ocrqa_pipeline_demo.ipynb](https://github.com/impresso/impresso-datalab-notebooks/tree/main/annotate/ocrqa_pipeline_demo.ipynb)
 - [ldatopics_pipeline_demo.ipynb](https://github.com/impresso/impresso-datalab-notebooks/tree/main/annotate/ldatopics_pipeline_demo.ipynb)
 - [newsagencies_pipeline_demo.ipynb](https://github.com/impresso/impresso-datalab-notebooks/tree/main/annotate/newsagencies_pipeline_demo.ipynb)
 - [solrnormalization_pipeline_demo.ipynb](https://github.com/impresso/impresso-datalab-notebooks/tree/main/annotate/solrnormalization_pipeline_demo.ipynb).

## Future Plans
Additional functionality will be added to extend use cases and support further processing tasks.


## About Impresso

### Impresso project

[Impresso - Media Monitoring of the Past](https://impresso-project.ch) is an interdisciplinary research project that aims to develop and consolidate tools for processing and exploring large collections of media archives across modalities, time, languages and national borders. The first project (2017-2021) was funded by the Swiss National Science Foundation under grant No. [CRSII5_173719](http://p3.snf.ch/project-173719) and the second project (2023-2027) by the SNSF under grant No. [CRSII5_213585](https://data.snf.ch/grants/grant/213585) and the Luxembourg National Research Fund under grant No. 17498891.

### Copyright

Copyright (C) 2025 The Impresso team.

### License

This program is provided as open source under the [GNU Affero General Public License](https://github.com/impresso/impresso-pyindexation/blob/master/LICENSE) v3 or later.

---

<p align="center">
  <img src="https://github.com/impresso/impresso.github.io/blob/master/assets/images/3x1--Yellow-Impresso-Black-on-White--transparent.png?raw=true" width="350" alt="Impresso Project Logo"/>
</p>



