Metadata-Version: 2.4
Name: mistocr
Version: 0.1.1
Summary: Simple batch OCR for PDFs using Mistral's state-of-the-art vision model
Home-page: https://github.com/franckalbinet/mistocr
Author: Solveit
Author-email: nobody@fast.ai
License: Apache Software License 2.0
Keywords: nbdev jupyter notebook python
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Natural Language :: English
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: License :: OSI Approved :: Apache Software License
Requires-Python: >=3.9
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: fastcore
Requires-Dist: mistralai
Requires-Dist: pillow
Requires-Dist: dotenv
Requires-Dist: lisette
Provides-Extra: dev
Dynamic: author
Dynamic: author-email
Dynamic: classifier
Dynamic: description
Dynamic: description-content-type
Dynamic: home-page
Dynamic: keywords
Dynamic: license
Dynamic: license-file
Dynamic: provides-extra
Dynamic: requires-dist
Dynamic: requires-python
Dynamic: summary

# mistocr


<!-- WARNING: THIS FILE WAS AUTOGENERATED! DO NOT EDIT! -->

## Why mistocr?

**Performance**: Mistral’s OCR delivers state-of-the-art accuracy on
complex documents including tables, charts, and multi-column layouts.

**Scale**: Process entire folders of PDFs in a single batch job. Upload
once, process asynchronously, and retrieve results when ready - perfect
for large document sets.

**Cost savings**: Batch OCR mode reduces costs from \$1/1000 pages to
\$0.50/1000 pages - a 50% reduction compared to synchronous processing.
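
As a rough illustration of that arithmetic, the sketch below estimates the cost of a job at the two per-page prices quoted above. The function name `ocr_cost` and the constants are hypothetical and not part of `mistocr`, and the prices are taken from this README and may change.

``` python
# Prices quoted in this README, in US dollars per 1000 pages (may change).
SYNC_PER_1000 = 1.00
BATCH_PER_1000 = 0.50

def ocr_cost(pages: int, batch: bool = True) -> float:
    """Estimate the OCR cost in dollars for `pages` pages (hypothetical helper)."""
    rate = BATCH_PER_1000 if batch else SYNC_PER_1000
    return pages * rate / 1000

pages = 12_000
print(ocr_cost(pages, batch=False))  # 12.0 (synchronous)
print(ocr_cost(pages))               # 6.0 (batch)
```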

**Simplicity**: A single
[`ocr()`](https://franckalbinet.github.io/mistocr/core.html#ocr)
function handles everything - uploading, batch submission, polling for
completion, and saving results as markdown with extracted images.
Process one PDF or an entire folder with the same simple interface.

**Organized output**: Each PDF is automatically saved to its own folder
with pages as separate markdown files and images in an `img` subfolder,
making results easy to navigate and process further.

## Installation

Install the latest version from the GitHub
[repository](https://github.com/franckalbinet/mistocr):

``` sh
$ pip install git+https://github.com/franckalbinet/mistocr.git
```

Or install the released version from [PyPI](https://pypi.org/project/mistocr/):

``` sh
$ pip install mistocr
```

## How to use

### Basic usage

Process a single PDF:

``` python
from mistocr.core import ocr

fname = 'files/test/attention-is-all-you-need.pdf'
result = ocr(fname)
```

Or process an entire folder:

``` python
results = ocr('files/test')
```

### Output structure

Each PDF is saved to its own folder with pages as separate markdown
files and images in an `img` subfolder:

    files/test/md/
    ├── attention-is-all-you-need/
    │   ├── img/
    │   │   ├── img-0.jpeg
    │   │   ├── img-1.jpeg
    │   │   └── ...
    │   ├── page_1.md
    │   ├── page_2.md
    │   └── ...
    └── resnet/
        ├── img/
        └── ...
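
If you want to post-process this layout yourself, the following is a minimal sketch using only the standard library. It assumes the `page_N.md` and `img/` naming shown above; `page_number`, `list_pages`, and `summarize` are hypothetical helpers, not part of `mistocr`. Sorting by the numeric suffix keeps `page_10.md` after `page_2.md`, which a plain alphabetical sort would not.

``` python
import re
from pathlib import Path

def page_number(p: Path) -> int:
    """Return N for a file named page_N.md, or -1 if the name does not match."""
    m = re.fullmatch(r"page_(\d+)", p.stem)
    return int(m.group(1)) if m else -1

def list_pages(doc_dir):
    """Return a document's page files sorted by page number."""
    pages = [p for p in Path(doc_dir).glob("page_*.md") if page_number(p) >= 0]
    return sorted(pages, key=page_number)

def summarize(md_root):
    """Map each document folder name to (page count, image count)."""
    summary = {}
    for doc in sorted(Path(md_root).iterdir()):
        if not doc.is_dir():
            continue
        img_dir = doc / "img"
        # The img folder may be absent, e.g. when images were not saved.
        n_img = sum(1 for f in img_dir.iterdir() if f.is_file()) if img_dir.is_dir() else 0
        summary[doc.name] = (len(list_pages(doc)), n_img)
    return summary
```

For the example tree above, `summarize('files/test/md')` would return one entry per PDF folder, such as `attention-is-all-you-need` and `resnet`.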

### Reading results

Read all pages from a processed PDF:

``` python
from mistocr.core import read_pgs

text = read_pgs('files/test/md/attention-is-all-you-need')
```

Or read a specific page:

``` python
text = read_pgs('files/test/md/attention-is-all-you-need', 10)
```

### Customization

Customize output directory, image inclusion, and polling interval:

``` python
results = ocr('files/test', out_dir='output', inc_img=False, poll_interval=5)
```

**Parameters:**

- **`path`**: A single PDF file or folder containing multiple PDFs
- **`out_dir`**: Directory name for saving markdown output (default:
  `'md'`)
- **`inc_img`**: Include extracted images in the output (default:
  `True`)
- **`key`**: Your Mistral API key (uses `MISTRAL_API_KEY` environment
  variable if not provided)
- **`poll_interval`**: Seconds between batch job status checks (default:
  `2`)

**Returns:** List of paths to the generated markdown files
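
Because a batch job can take a while, it can help to confirm that an API key is available before submitting anything. The sketch below mirrors the fallback described for `key` (explicit argument first, then the `MISTRAL_API_KEY` environment variable). `resolve_api_key` is a hypothetical helper, not part of `mistocr`, and it does not claim to reproduce how the library resolves the key internally.

``` python
import os

def resolve_api_key(key=None, env_var="MISTRAL_API_KEY"):
    """Return the explicit key if given, else the environment variable.

    Hypothetical pre-flight check: raises early if no key can be found.
    """
    key = key or os.environ.get(env_var)
    if not key:
        raise RuntimeError(f"No API key found: pass `key=` or set {env_var}")
    return key
```

The resolved value can then be passed explicitly, for example `ocr('files/test', key=resolve_api_key())`.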

## Developer Guide

If you are new to `nbdev`, here are some useful pointers to get you
started.

### Install mistocr in development mode

``` sh
# make sure mistocr package is installed in development mode
$ pip install -e .

# make changes under nbs/ directory
# ...

# compile to have changes apply to mistocr
$ nbdev_prepare
```

### Documentation

Documentation is hosted on this GitHub
[repository](https://github.com/franckalbinet/mistocr)’s
[pages](https://franckalbinet.github.io/mistocr/). Package-manager-specific
guidelines are available on
[conda](https://anaconda.org/franckalbinet/mistocr) and
[PyPI](https://pypi.org/project/mistocr/).
