Metadata-Version: 2.4
Name: deepdoctection
Version: 0.44.1
Summary: Repository for Document AI
Home-page: https://github.com/deepdoctection/deepdoctection
Author: Dr. Janis Meyer
License: Apache License 2.0
Classifier: Development Status :: 4 - Beta
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Natural Language :: English
Classifier: Operating System :: POSIX :: Linux
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.9
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: catalogue==2.0.10
Requires-Dist: huggingface_hub>=0.26.0
Requires-Dist: importlib-metadata>=5.0.0
Requires-Dist: jsonlines==3.1.0
Requires-Dist: lazy-imports==0.3.1
Requires-Dist: mock==4.0.3
Requires-Dist: networkx>=2.7.1
Requires-Dist: numpy<2.0,>=1.21
Requires-Dist: packaging>=20.0
Requires-Dist: Pillow>=10.0.0
Requires-Dist: pypdf>=6.0.0
Requires-Dist: pypdfium2>=4.30.0
Requires-Dist: pyyaml>=6.0.1
Requires-Dist: pyzmq>=16
Requires-Dist: scipy>=1.13.1
Requires-Dist: termcolor>=1.1
Requires-Dist: tabulate>=0.7.7
Requires-Dist: tqdm>=4.64.0
Provides-Extra: tf
Requires-Dist: catalogue==2.0.10; extra == "tf"
Requires-Dist: huggingface_hub>=0.26.0; extra == "tf"
Requires-Dist: importlib-metadata>=5.0.0; extra == "tf"
Requires-Dist: jsonlines==3.1.0; extra == "tf"
Requires-Dist: lazy-imports==0.3.1; extra == "tf"
Requires-Dist: mock==4.0.3; extra == "tf"
Requires-Dist: networkx>=2.7.1; extra == "tf"
Requires-Dist: numpy<2.0,>=1.21; extra == "tf"
Requires-Dist: packaging>=20.0; extra == "tf"
Requires-Dist: Pillow>=10.0.0; extra == "tf"
Requires-Dist: pypdf>=6.0.0; extra == "tf"
Requires-Dist: pypdfium2>=4.30.0; extra == "tf"
Requires-Dist: pyyaml>=6.0.1; extra == "tf"
Requires-Dist: pyzmq>=16; extra == "tf"
Requires-Dist: scipy>=1.13.1; extra == "tf"
Requires-Dist: termcolor>=1.1; extra == "tf"
Requires-Dist: tabulate>=0.7.7; extra == "tf"
Requires-Dist: tqdm>=4.64.0; extra == "tf"
Requires-Dist: tensorpack==0.11; extra == "tf"
Requires-Dist: protobuf==3.20.1; extra == "tf"
Requires-Dist: tensorflow-addons>=0.17.1; extra == "tf"
Requires-Dist: tf2onnx>=1.9.2; extra == "tf"
Requires-Dist: python-doctr==0.9.0; extra == "tf"
Requires-Dist: pycocotools>=2.0.2; extra == "tf"
Requires-Dist: boto3==1.34.102; extra == "tf"
Requires-Dist: pdfplumber>=0.11.0; extra == "tf"
Requires-Dist: fasttext-wheel; extra == "tf"
Requires-Dist: jdeskew>=0.2.2; extra == "tf"
Requires-Dist: apted==1.0.3; extra == "tf"
Requires-Dist: distance==0.1.3; extra == "tf"
Requires-Dist: lxml>=4.9.1; extra == "tf"
Provides-Extra: pt
Requires-Dist: catalogue==2.0.10; extra == "pt"
Requires-Dist: huggingface_hub>=0.26.0; extra == "pt"
Requires-Dist: importlib-metadata>=5.0.0; extra == "pt"
Requires-Dist: jsonlines==3.1.0; extra == "pt"
Requires-Dist: lazy-imports==0.3.1; extra == "pt"
Requires-Dist: mock==4.0.3; extra == "pt"
Requires-Dist: networkx>=2.7.1; extra == "pt"
Requires-Dist: numpy<2.0,>=1.21; extra == "pt"
Requires-Dist: packaging>=20.0; extra == "pt"
Requires-Dist: Pillow>=10.0.0; extra == "pt"
Requires-Dist: pypdf>=6.0.0; extra == "pt"
Requires-Dist: pypdfium2>=4.30.0; extra == "pt"
Requires-Dist: pyyaml>=6.0.1; extra == "pt"
Requires-Dist: pyzmq>=16; extra == "pt"
Requires-Dist: scipy>=1.13.1; extra == "pt"
Requires-Dist: termcolor>=1.1; extra == "pt"
Requires-Dist: tabulate>=0.7.7; extra == "pt"
Requires-Dist: tqdm>=4.64.0; extra == "pt"
Requires-Dist: timm>=0.9.16; extra == "pt"
Requires-Dist: transformers>=4.48.0; extra == "pt"
Requires-Dist: accelerate>=0.29.1; extra == "pt"
Requires-Dist: python-doctr==0.9.0; extra == "pt"
Requires-Dist: pycocotools>=2.0.2; extra == "pt"
Requires-Dist: boto3==1.34.102; extra == "pt"
Requires-Dist: pdfplumber>=0.11.0; extra == "pt"
Requires-Dist: fasttext-wheel; extra == "pt"
Requires-Dist: jdeskew>=0.2.2; extra == "pt"
Requires-Dist: apted==1.0.3; extra == "pt"
Requires-Dist: distance==0.1.3; extra == "pt"
Requires-Dist: lxml>=4.9.1; extra == "pt"
Provides-Extra: docs
Requires-Dist: tensorpack==0.11; extra == "docs"
Requires-Dist: boto3==1.34.102; extra == "docs"
Requires-Dist: transformers>=4.48.0; extra == "docs"
Requires-Dist: accelerate>=0.29.1; extra == "docs"
Requires-Dist: pdfplumber>=0.11.0; extra == "docs"
Requires-Dist: lxml>=4.9.1; extra == "docs"
Requires-Dist: lxml-stubs>=0.5.1; extra == "docs"
Requires-Dist: jdeskew>=0.2.2; extra == "docs"
Requires-Dist: jinja2; extra == "docs"
Requires-Dist: mkdocs-material; extra == "docs"
Requires-Dist: mkdocstrings-python; extra == "docs"
Requires-Dist: griffe==0.25.0; extra == "docs"
Provides-Extra: dev
Requires-Dist: python-dotenv==1.0.0; extra == "dev"
Requires-Dist: click; extra == "dev"
Requires-Dist: black==23.7.0; extra == "dev"
Requires-Dist: isort==5.13.2; extra == "dev"
Requires-Dist: pylint==2.17.4; extra == "dev"
Requires-Dist: mypy==1.4.1; extra == "dev"
Requires-Dist: wandb; extra == "dev"
Requires-Dist: types-PyYAML>=6.0.12.12; extra == "dev"
Requires-Dist: types-termcolor>=1.1.3; extra == "dev"
Requires-Dist: types-tabulate>=0.9.0.3; extra == "dev"
Requires-Dist: types-tqdm>=4.66.0.5; extra == "dev"
Requires-Dist: lxml-stubs>=0.5.1; extra == "dev"
Requires-Dist: types-Pillow>=10.2.0.20240406; extra == "dev"
Requires-Dist: types-urllib3>=1.26.25.14; extra == "dev"
Provides-Extra: test
Requires-Dist: pytest==8.0.2; extra == "test"
Requires-Dist: pytest-cov; extra == "test"
Dynamic: author
Dynamic: classifier
Dynamic: description
Dynamic: description-content-type
Dynamic: home-page
Dynamic: license
Dynamic: license-file
Dynamic: provides-extra
Dynamic: requires-dist
Dynamic: requires-python
Dynamic: summary

<p align="center">
  <img src="https://github.com/deepdoctection/deepdoctection/raw/master/docs/tutorials/_imgs/dd_logo.png" alt="Deep Doctection Logo" width="60%">
</p>

![GitHub Repo stars](https://img.shields.io/github/stars/deepdoctection/deepdoctection)
![PyPI - Version](https://img.shields.io/pypi/v/deepdoctection)
![PyPI - License](https://img.shields.io/pypi/l/deepdoctection)


------------------------------------------------------------------------------------------------------------------------
# NEW 

Version `v0.43` includes a significant redesign of the Analyzer's default configuration. Key changes include:

* More powerful models for Document Layout Analysis and OCR.
* Expanded functionality.
* Fewer dependencies.

------------------------------------------------------------------------------------------------------------------------

<p align="center">
  <h1 align="center">
  A Package for Document Understanding
  </h1>
</p>


**deep**doctection is a Python library that orchestrates document layout analysis and extraction for scanned and native PDF documents, e.g. as a preprocessing step for RAG.
It also provides a framework for training, evaluating and running inference with Document AI models.

# Overview

- Document layout analysis and table recognition in PyTorch with
  [**Detectron2**](https://github.com/facebookresearch/detectron2/tree/main/detectron2) and
  [**Transformers**](https://github.com/huggingface/transformers)
  or Tensorflow and [**Tensorpack**](https://github.com/tensorpack),
- OCR with support for [**Tesseract**](https://github.com/tesseract-ocr/tesseract), [**DocTr**](https://github.com/mindee/doctr) and
  [**AWS Textract**](https://aws.amazon.com/textract/),
- Document and token classification with the [**LayoutLM**](https://github.com/microsoft/unilm) family,
  [**LiLT**](https://github.com/jpWang/LiLT) and selected
  [**Bert**](https://huggingface.co/docs/transformers/model_doc/xlm-roberta)-style models, including features like sliding windows,
- Text mining for native PDFs with [**pdfplumber**](https://github.com/jsvine/pdfplumber),
- Language detection with [**fastText**](https://github.com/facebookresearch/fastText),
- Deskewing and rotating images with [**jdeskew**](https://github.com/phamquiluan/jdeskew),
- Fine-tuning and evaluation tools,
- Lots of [tutorials](https://github.com/deepdoctection/notebooks).

Have a look at the [**introduction notebook**](https://github.com/deepdoctection/notebooks/blob/main/Analyzer_Get_Started.ipynb)
for an easy start.

Check the [**release notes**](https://github.com/deepdoctection/deepdoctection/releases) for recent updates.


----------------------------------------------------------------------------------------

# Hugging Face Space Demo

Check the demo of a document layout analysis pipeline with OCR on 🤗
[**Hugging Face spaces**](https://huggingface.co/spaces/deepdoctection/deepdoctection) or use the gradio client. 

```
pip install gradio_client   # requires Python >= 3.10 
```

To process a single image:

```python
from gradio_client import Client, handle_file

if __name__ == "__main__":

    client = Client("deepdoctection/deepdoctection")
    result = client.predict(
        img=handle_file('/local_path/to/dir/file_name.jpeg'),  # accepts image files, e.g. JPEG, PNG
        pdf=None,
        max_datapoints=2,
        api_name="/analyze_image"
    )
    print(result)
```

To process a PDF document:

```python
from gradio_client import Client, handle_file

if __name__ == "__main__":

    client = Client("deepdoctection/deepdoctection")
    result = client.predict(
        img=None,
        pdf=handle_file("/local_path/to/dir/your_doc.pdf"),
        max_datapoints=2,  # increase to process up to 9 pages
        api_name="/analyze_image"
    )
    print(result)
```

--------------------------------------------------------------------------------------------------------

# Example

```python
import deepdoctection as dd
from IPython.core.display import HTML
from matplotlib import pyplot as plt

analyzer = dd.get_dd_analyzer()  # instantiate the built-in analyzer similar to the Hugging Face space demo

df = analyzer.analyze(path = "/path/to/your/doc.pdf")  # setting up pipeline
df.reset_state()                 # Trigger some initialization

doc = iter(df)
page = next(doc) 

image = page.viz(show_figures=True, show_residual_layouts=True)
plt.figure(figsize = (25,17))
plt.axis('off')
plt.imshow(image)
```

<p align="center">
  <img src="https://github.com/deepdoctection/deepdoctection/raw/master/docs/tutorials/_imgs/dd_rm_sample.png" 
alt="sample" width="40%">
</p>

```python
HTML(page.tables[0].html)
```

<p align="center">
  <img src="https://github.com/deepdoctection/deepdoctection/raw/master/docs/tutorials/_imgs/dd_rm_table.png" 
alt="table" width="40%">
</p>

```python
print(page.text)
```

<p align="center">
  <img src="https://github.com/deepdoctection/deepdoctection/raw/master/docs/tutorials/_imgs/dd_rm_text.png" 
alt="text" width="40%">
</p>
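`page.tables[0].html` returns the recognized table as an HTML string. If you want the cell contents as plain Python rows instead of rendered HTML, the standard library's `html.parser` is enough. The following is a minimal sketch using only the stdlib; the exact markup **deep**doctection emits may contain additional attributes or spans, so treat this as a starting point rather than a complete converter (it ignores `rowspan`/`colspan`, for example).

```python
from html.parser import HTMLParser


class TableRowExtractor(HTMLParser):
    """Collect the cell texts of an HTML table into a list of rows."""

    def __init__(self):
        super().__init__()
        self.rows = []          # finished rows
        self._row = None        # row currently being filled
        self._in_cell = False   # True while inside <td>/<th>
        self._cell = []         # text fragments of the current cell

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th"):
            self._in_cell = True
            self._cell = []

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._row is not None:
            self._row.append("".join(self._cell).strip())
            self._in_cell = False
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None

    def handle_data(self, data):
        if self._in_cell:
            self._cell.append(data)


def html_table_to_rows(html: str) -> list[list[str]]:
    """Parse an HTML table string into a list of rows of cell texts."""
    parser = TableRowExtractor()
    parser.feed(html)
    return parser.rows
```

With a table from the example above you would call `html_table_to_rows(page.tables[0].html)` and get back a list of rows suitable for e.g. `csv.writer`.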


-----------------------------------------------------------------------------------------

# Requirements

![requirements](https://github.com/deepdoctection/deepdoctection/raw/master/docs/tutorials/_imgs/install_01.png)

- Linux or macOS. Windows is not supported but there is a [Dockerfile](./docker/pytorch-cpu-jupyter/Dockerfile) available.
- Python >= 3.9
- PyTorch >= 2.2 **or** 2.11 <= Tensorflow < 2.16 (for lower Tensorflow versions the code will only run on a GPU).
  Tensorflow support will be dropped from Python 3.11 onwards.
- To fine-tune models, a GPU is recommended.

| Task                                        | PyTorch | Torchscript   | Tensorflow    |
|---------------------------------------------|:-------:|:-------------:|:-------------:|
| Layout detection via Detectron2/Tensorpack  | ✅      | ✅ (CPU only) | ✅ (GPU only) |
| Table recognition via Detectron2/Tensorpack | ✅      | ✅ (CPU only) | ✅ (GPU only) |
| Table transformer via Transformers          | ✅      | ❌            | ❌            |
| Deformable-Detr                             | ✅      | ❌            | ❌            |
| DocTr                                       | ✅      | ❌            | ✅            |
| LayoutLM (v1, v2, v3, XLM) via Transformers | ✅      | ❌            | ❌            |

------------------------------------------------------------------------------------------

# Installation

We recommend using a virtual environment.

## Get started installation

For a simple setup that is sufficient to parse documents with the default settings, install the following:

**PyTorch**

```
pip install transformers
pip install python-doctr==0.9.0
pip install deepdoctection
```

**TensorFlow**

```
pip install tensorpack
pip install python-doctr==0.9.0
pip install deepdoctection
```

Both setups are sufficient to run the [**introduction notebook**](https://github.com/deepdoctection/notebooks/blob/main/Get_Started.ipynb).

### Full installation

The following installation will give you ALL models available within the Deep Learning framework as well as all models
that are independent of Tensorflow/PyTorch.

**PyTorch**

First install **Detectron2** separately, as it is not distributed via PyPI. Check the instructions
[here](https://detectron2.readthedocs.io/en/latest/tutorials/install.html) or try:

```
pip install detectron2@git+https://github.com/deepdoctection/detectron2.git
```

Then install **deep**doctection with all its dependencies:

```
pip install deepdoctection[pt]
```

**Tensorflow**

```
pip install deepdoctection[tf]
```


For further information, please consult the [**full installation instructions**](https://deepdoctection.readthedocs.io/en/latest/install/).


## Installation from source

Download the repository or clone via

```
git clone https://github.com/deepdoctection/deepdoctection.git
```

**PyTorch**

```
cd deepdoctection
pip install ".[pt]" # or "pip install -e .[pt]"
```

**Tensorflow**

```
cd deepdoctection
pip install ".[tf]" # or "pip install -e .[tf]"
```


## Running a Docker container from Docker hub

Pre-existing Docker images can be downloaded from the [Docker hub](https://hub.docker.com/r/deepdoctection/deepdoctection).

```
docker pull deepdoctection/deepdoctection:<release_tag> 
```

Use the Docker compose file `./docker/pytorch-gpu/docker-compose.yaml`.
In the `.env` file provided, specify the host directory where **deep**doctection's cache should be stored.
Additionally, specify a working directory to mount files to be processed into the container.

```
docker compose up -d
```

will start the container. There is no endpoint exposed, though.
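For orientation, a `.env` file for the compose setup could look like the fragment below. The variable names here are **hypothetical**; check the `.env` file shipped in `./docker/pytorch-gpu/` for the names the compose file actually reads.

```shell
# Hypothetical example .env — replace the variable names with those used
# in ./docker/pytorch-gpu/.env and the paths with directories on your host.
CACHE_DIR=/home/user/.cache/deepdoctection   # host directory for deepdoctection's model cache
WORK_DIR=/home/user/documents                # host directory mounted into the container for input files
```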

-----------------------------------------------------------------------------------------------

# Credits

We thank all libraries that provide high-quality code and pre-trained models. Without them, it would have been impossible
to develop this framework.


# If you like **deep**doctection ...

...you can easily support the project by making it more visible. Leaving a star or a recommendation will help.

# License

Distributed under the Apache 2.0 License. Check [LICENSE](https://github.com/deepdoctection/deepdoctection/blob/master/LICENSE) for additional information.
