Metadata-Version: 2.3
Name: fandom-scraper
Version: 0.6.2
Summary: A simple AI (span-marker) powered fandom scraper
Author: AnthonyP57
Author-email: AnthonyP57 <antonipawlowicz123@gmail.com>
Requires-Dist: accelerate==1.10.1
Requires-Dist: aiohappyeyeballs==2.4.4
Requires-Dist: aiohttp==3.11.9
Requires-Dist: aiosignal==1.3.1
Requires-Dist: attrs==24.2.0
Requires-Dist: beautifulsoup4==4.13.5
Requires-Dist: certifi==2025.8.3
Requires-Dist: charset-normalizer==3.4.3
Requires-Dist: datasets==3.1.0
Requires-Dist: dill==0.3.8
Requires-Dist: evaluate==0.4.5
Requires-Dist: fandom-py==0.2.1
Requires-Dist: fandom-scraper==0.1.4
Requires-Dist: filelock==3.13.1
Requires-Dist: frozenlist==1.5.0
Requires-Dist: fsspec==2024.9.0
Requires-Dist: huggingface-hub==0.26.3
Requires-Dist: idna==3.10
Requires-Dist: jinja2==3.1.6
Requires-Dist: joblib==1.5.2
Requires-Dist: markupsafe==3.0.2
Requires-Dist: mpmath==1.3.0
Requires-Dist: multidict==6.6.4
Requires-Dist: multiprocess==0.70.16
Requires-Dist: networkx==3.3
Requires-Dist: numpy==2.1.3
Requires-Dist: nvidia-cublas-cu11==11.11.3.6
Requires-Dist: nvidia-cuda-cupti-cu11==11.8.87
Requires-Dist: nvidia-cuda-nvrtc-cu11==11.8.89
Requires-Dist: nvidia-cuda-runtime-cu11==11.8.89
Requires-Dist: nvidia-cudnn-cu11==9.1.0.70
Requires-Dist: nvidia-cufft-cu11==10.9.0.58
Requires-Dist: nvidia-curand-cu11==10.3.0.86
Requires-Dist: nvidia-cusolver-cu11==11.4.1.48
Requires-Dist: nvidia-cusparse-cu11==11.7.5.86
Requires-Dist: nvidia-nccl-cu11==2.21.5
Requires-Dist: nvidia-nvtx-cu11==11.8.86
Requires-Dist: packaging==24.2
Requires-Dist: pandas==2.2.3
Requires-Dist: propcache==0.2.1
Requires-Dist: psutil==6.1.0
Requires-Dist: pyarrow==18.1.0
Requires-Dist: python-dateutil==2.9.0.post0
Requires-Dist: pytz==2025.2
Requires-Dist: pyyaml==6.0.2
Requires-Dist: regex==2024.11.6
Requires-Dist: requests==2.32.3
Requires-Dist: safetensors==0.6.2
Requires-Dist: scikit-learn==1.6.0
Requires-Dist: scipy==1.14.1
Requires-Dist: seqeval==1.2.2
Requires-Dist: six==1.17.0
Requires-Dist: soupsieve==2.8
Requires-Dist: span-marker==1.7.0
Requires-Dist: sympy==1.13.1
Requires-Dist: threadpoolctl==3.6.0
Requires-Dist: tokenizers==0.21.0
Requires-Dist: torch==2.5.1+cu118
Requires-Dist: tqdm==4.67.1
Requires-Dist: transformers==4.47.0
Requires-Dist: triton==3.1.0
Requires-Dist: typing-extensions==4.12.2
Requires-Dist: tzdata==2024.2
Requires-Dist: urllib3==2.2.3
Requires-Dist: xxhash==3.5.0
Requires-Dist: yarl==1.18.3
Requires-Python: >=3.11
Description-Content-Type: text/markdown

# Fandom Scraper
A simple AI-powered (span-marker) fandom scraper.

> [!NOTE]  
> This package is a part of the [Cirilla project](https://github.com/AnthonyP57/Cirilla---a-LLM-made-on-a-budget)

> [!IMPORTANT]  
> An NVIDIA GPU is required to use this package.
## Installation
```bash
# (recommended)
uv add fandom-scraper

# or
pip install fandom-scraper
```
## Usage
Usage is simple: the function requires a path to so-called seeds from which to start scraping, e.g. `examples/witcher_json/witcher_1.json`
```json
[
    "Geralt of Rivia", "Triss Merigold", "Vesemir", "Leo", "Lambert", 
    "Eskel", "Alvin", "Shani", "Zoltan Chivay", "Dandelion (Jaskier)", 
    "King Foltest", "Adda the White",

    "Jacques de Aldersberg", "Azar Javed", "Professor (leader of Salamandra)", 
    ...
]
```
and later uses suggestions provided by a Named Entity Recognition (NER) model. The script saves the scraped pages and instructions into their respective folders.
```python
from pathlib import Path

from fandom_scraper import scrape_fandom

in_path = Path("./examples/witcher_json")  # folder with the seed JSON files
out_path = Path("./examples/async_fandom")  # scraped pages are saved here
instruct_path = Path("./examples/async_fandom_instruct")  # instructions are saved here

scrape_fandom(in_path, out_path, instruct_path)
```
See `examples/async_fandom/` and `examples/async_fandom_instruct/` for more examples.
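
The seed file shown above is a plain JSON array of names, so one for another fandom can be written with the standard library alone. Below is a minimal sketch, not part of the package: the helper `write_seeds` and the file name `my_seeds.json` are hypothetical, and the file is written to a temporary directory only so the snippet runs anywhere.

```python
import json
import tempfile
from pathlib import Path


def write_seeds(titles, path):
    """Write names as a JSON array, the seed format shown above.

    Hypothetical helper, not part of fandom-scraper.
    """
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    # Drop duplicates while keeping the original order.
    unique = list(dict.fromkeys(titles))
    path.write_text(
        json.dumps(unique, indent=4, ensure_ascii=False), encoding="utf-8"
    )
    return unique


with tempfile.TemporaryDirectory() as tmp:
    seed_file = Path(tmp) / "my_seeds.json"  # hypothetical file name
    seeds = write_seeds(
        ["Geralt of Rivia", "Triss Merigold", "Vesemir", "Geralt of Rivia"],
        seed_file,
    )
    # Read the file back to confirm it is a valid JSON array of strings.
    loaded = json.loads(seed_file.read_text(encoding="utf-8"))
```

In practice the file would live in a folder of your choosing, and that folder would then be passed as `in_path`, as in the usage example above.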