Metadata-Version: 2.4
Name: woolworm
Version: 0.0.3
Summary: Add your description here
Requires-Python: >=3.12
Description-Content-Type: text/markdown
Requires-Dist: loguru>=0.7.3
Requires-Dist: marker-pdf[full]>=1.9.3
Requires-Dist: ollama>=0.5.4
Requires-Dist: opencv-python>=4.12.0.88
Requires-Dist: pytesseract>=0.3.13
Requires-Dist: scikit-image>=0.25.2
Requires-Dist: scipy>=1.16.2

# woolworm (Pre-Alpha State)

Hello Northwestern Digitization team, welcome to woolworm, your new (hopefully) one-stop shop for digitization.
I have attempted to abstract as much of the intricacies of image transformation in python. At least to the best of my ability.
While we are working on this grant, I will be working on build automation and a CLI for you all so that it can be even easier to use.

Here is my current feature list, where I am open to suggestions or requests, because I like this sort of thing:

- [ ] API
  - [x] Load image
  - [x] Deskew
  - [x] Intelligent document binarization/grayscale
  - [x] Tesseract OCR
  - [x] Standalone Ollama LLM OCR
  - [x] Marker Document Understanding LLM OCR
  - [ ] HathiTrust (currently experimental in a standalone script)
- [ ] Pipelines
  - [x] Image processing
  - [ ] OCR
  - [ ] HathiTrust
  - [ ] ???
  - [ ] Profit
- [ ] CLI (To be done later)
- [ ] Figure out how the hell I publish a python package

## Prerequisites

You will want to familiarize yourself with the absolute basics of calling object-methods. If you want to use any LLM models, you will need to install [Ollama](https://ollama.com/). Feel free to contact me if you need assistance in setting up Ollama.

## Quickstart

If you are extremely impatient, you can get started with two lines of code

```python
from woolworm import woolworm

woolworm.Pipelines.process_image("inputfilename.jpg", "outputfilename.jpg")
```

In the backend, it looks like this. You can find this code in the `cookbook` directory

```python
from woolworm import woolworm

p = woolworm()  # Creates the "woolworm" class

f = "filename.jpg"
base_name = f.replace(".jpg", "")

# Step 1: Load original
img = p.load(f)

# Step 2: de-skew
img = p.deskew_with_hough(img)

# Step 3: This is kinda weird, and currently fine-tuned for use with NU's environmental impact statements
# Long story short, the programming will use some heuristics to detect if the image is a diagram or mostly text
# If the program thinks it is text, it will binarize, if it thinks it is a diagram, it will not.
img = p.binarize_or_gray(img)

p.save_image(img)
```
# woolworm
