Metadata-Version: 2.4
Name: jurispacy-tokenizer
Version: 1.2.0
Summary: Flair tokenizer adapted to French court decisions using spacy tokenization
Keywords: nlp,spacy,flair,tokenizer,legal
Author-email: Cour de cassation <amaury.fouret@justice.fr>
Requires-Python: >=3.10
Description-Content-Type: text/markdown
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Requires-Dist: torch==2.7.0
Requires-Dist: scipy<1.13.0
Requires-Dist: spacy==3.6.1
Requires-Dist: flair==0.15.1
Requires-Dist: regex==2024.9.11
Project-URL: Homepage, https://github.com/Cour-de-cassation/jurispacy-tokenizer
Project-URL: Issues, https://github.com/Cour-de-cassation/jurispacy-tokenizer/issues
Project-URL: Repository, https://github.com/Cour-de-cassation/jurispacy-tokenizer

# JuriSpacyTokenizer

## Description

Tokenizers used in our NLP projects on French court decisions, built with [Flair](https://github.com/flairNLP/flair) and [spaCy](https://github.com/explosion/spaCy/).

## Installation

```bash
pip install jurispacy-tokenizer
python -m spacy download fr_core_news_sm-3.6.0 --direct
```

## Usage

### Tokenize strings

You can use this library to tokenize a string into a list of strings representing tokens:

```python
from jurispacy_tokenizer import JuriSpacyTokenizer

tokenizer = JuriSpacyTokenizer()
text = "M.Paul et Jean-Pierre sont heureux."

tokens = tokenizer.tokenize(text)

for token in tokens:
    print(token)
```

This should output:

```
M.
Paul
et
Jean-Pierre
sont
heureux
.
```
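
If you need character offsets for the tokens (for example, to map annotations back onto the original text), the returned strings can be realigned with the input. The sketch below is not part of the library: `token_offsets` is a hypothetical helper, and it assumes that every token appears verbatim in the text, in order. The token list is copied from the expected output above so that the snippet runs without the library installed.

```python
def token_offsets(text: str, tokens: list[str]) -> list[tuple[str, int, int]]:
    """Locate each token in `text`, scanning left to right.

    Returns (token, start, end) triples such that text[start:end] == token.
    Assumes tokens occur verbatim and in order (hypothetical helper).
    """
    offsets = []
    cursor = 0  # never search before the end of the previous token
    for token in tokens:
        start = text.find(token, cursor)
        if start == -1:
            raise ValueError(f"token {token!r} not found after position {cursor}")
        end = start + len(token)
        offsets.append((token, start, end))
        cursor = end
    return offsets


text = "M.Paul et Jean-Pierre sont heureux."
# In real use: tokens = JuriSpacyTokenizer().tokenize(text)
tokens = ["M.", "Paul", "et", "Jean-Pierre", "sont", "heureux", "."]

for token, start, end in token_offsets(text, tokens):
    print(start, end, token)
```

Searching from the end of the previous token is what keeps the final `.` from matching the period inside `M.`.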

### Tokenize longer text into sentences

You can also parse longer text to create Flair Sentence objects:

```python
from jurispacy_tokenizer import JuriSpacyTokenizer

tokenizer = JuriSpacyTokenizer()

text = """Bonjour tout le monde! Je m'appelle Amaury.

Je travaille avec Paul."""

sentences = tokenizer.get_tokenized_sentences(text)

for s in sentences:
    print(s)
```

This should output:

```
Sentence[5]: "Bonjour tout le monde!"
Sentence[5]: "Je m'appelle Amaury."
Sentence[5]: "Je travaille avec Paul."
```
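
To get plain token strings back from the sentences, you can iterate over each one and read the `text` attribute of its tokens, which is how Flair `Sentence` and `Token` objects behave. The sketch below is an illustration, not part of the library: `sentences_to_token_lists` is a hypothetical helper, and `FakeToken` and `FakeSentence` are stand-ins that mimic only those two Flair features so the snippet runs without Flair installed. The token split shown is assumed from the `Sentence[5]` count above.

```python
from dataclasses import dataclass
from typing import Iterable


def sentences_to_token_lists(sentences: Iterable) -> list[list[str]]:
    """Return one list of token strings per sentence.

    Only requires each sentence to be iterable over objects with a `.text`
    attribute (hypothetical helper, duck-typed on purpose).
    """
    return [[token.text for token in sentence] for sentence in sentences]


# Stand-ins for illustration only; real code would receive Flair objects.
@dataclass
class FakeToken:
    text: str


class FakeSentence(list):
    """A list of FakeToken objects, iterable like a Flair Sentence."""


sentences = [
    FakeSentence(FakeToken(t) for t in ["Je", "travaille", "avec", "Paul", "."]),
]
# In real use: sentences = JuriSpacyTokenizer().get_tokenized_sentences(text)
print(sentences_to_token_lists(sentences))
```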
