Metadata-Version: 2.3
Name: fandom-scraper
Version: 0.6.4
Summary: A simple AI (span-marker) powered fandom scraper
Author: AnthonyP57
Author-email: AnthonyP57 <antonipawlowicz123@gmail.com>
Requires-Dist: accelerate==1.10.1
Requires-Dist: aiohappyeyeballs==2.4.4
Requires-Dist: aiohttp==3.11.9
Requires-Dist: aiosignal==1.3.1
Requires-Dist: attrs==24.2.0
Requires-Dist: beautifulsoup4==4.13.5
Requires-Dist: certifi==2025.8.3
Requires-Dist: charset-normalizer==3.4.3
Requires-Dist: datasets==3.1.0
Requires-Dist: dill==0.3.8
Requires-Dist: evaluate==0.4.5
Requires-Dist: fandom-py==0.2.1
Requires-Dist: filelock==3.13.1
Requires-Dist: frozenlist==1.5.0
Requires-Dist: fsspec==2024.9.0
Requires-Dist: huggingface-hub==0.26.3
Requires-Dist: idna==3.10
Requires-Dist: jinja2==3.1.6
Requires-Dist: joblib==1.5.2
Requires-Dist: markupsafe==3.0.2
Requires-Dist: mpmath==1.3.0
Requires-Dist: multidict==6.6.4
Requires-Dist: multiprocess==0.70.16
Requires-Dist: networkx==3.3
Requires-Dist: numpy==2.1.3
Requires-Dist: nvidia-cublas-cu11==11.11.3.6 ; sys_platform == 'linux' or sys_platform == 'win32'
Requires-Dist: nvidia-cuda-cupti-cu11==11.8.87 ; sys_platform == 'linux' or sys_platform == 'win32'
Requires-Dist: nvidia-cuda-nvrtc-cu11==11.8.89 ; sys_platform == 'linux' or sys_platform == 'win32'
Requires-Dist: nvidia-cuda-runtime-cu11==11.8.89 ; sys_platform == 'linux' or sys_platform == 'win32'
Requires-Dist: nvidia-cudnn-cu11==9.1.0.70 ; sys_platform == 'linux' or sys_platform == 'win32'
Requires-Dist: nvidia-cufft-cu11==10.9.0.58 ; sys_platform == 'linux' or sys_platform == 'win32'
Requires-Dist: nvidia-curand-cu11==10.3.0.86 ; sys_platform == 'linux' or sys_platform == 'win32'
Requires-Dist: nvidia-cusolver-cu11==11.4.1.48 ; sys_platform == 'linux' or sys_platform == 'win32'
Requires-Dist: nvidia-cusparse-cu11==11.7.5.86 ; sys_platform == 'linux' or sys_platform == 'win32'
Requires-Dist: nvidia-nccl-cu11==2.21.5 ; sys_platform == 'linux' or sys_platform == 'win32'
Requires-Dist: nvidia-nvjitlink-cu12==12.4.127 ; sys_platform == 'linux' or sys_platform == 'win32'
Requires-Dist: nvidia-nvtx-cu11==11.8.86 ; sys_platform == 'linux' or sys_platform == 'win32'
Requires-Dist: packaging==24.2
Requires-Dist: pandas==2.2.3
Requires-Dist: propcache==0.2.1
Requires-Dist: psutil==6.1.0
Requires-Dist: pyarrow==18.1.0
Requires-Dist: python-dateutil==2.9.0.post0
Requires-Dist: pytz==2025.2
Requires-Dist: pyyaml==6.0.2
Requires-Dist: regex==2024.11.6
Requires-Dist: requests==2.32.3
Requires-Dist: safetensors==0.6.2
Requires-Dist: scikit-learn==1.6.0
Requires-Dist: scipy==1.14.1
Requires-Dist: seqeval==1.2.2
Requires-Dist: six==1.17.0
Requires-Dist: soupsieve==2.8
Requires-Dist: span-marker==1.7.0
Requires-Dist: sympy==1.13.1
Requires-Dist: threadpoolctl==3.6.0
Requires-Dist: tokenizers==0.21.0
Requires-Dist: torch==2.5.1
Requires-Dist: tqdm==4.67.1
Requires-Dist: transformers==4.47.0
Requires-Dist: triton==3.1.0
Requires-Dist: typing-extensions==4.12.2
Requires-Dist: tzdata==2024.2
Requires-Dist: urllib3==2.2.3
Requires-Dist: xxhash==3.5.0
Requires-Dist: yarl==1.18.3
Requires-Python: >=3.11
Description-Content-Type: text/markdown

# Fandom Scraper
A simple AI (span marker) powered fandom scraper.

> [!NOTE]  
> This package is a part of the [Cirilla project](https://github.com/AnthonyP57/Cirilla---a-LLM-made-on-a-budget)

> [!IMPORTANT]  
> An NVIDIA GPU is required to use this package.
> 
> Because Hugging Face's span marker can be *fragile*, the dependency versions are pinned, so I advise creating a separate project used only for scraping the data.
## Installation
```bash
# (recommended)
uv add fandom-scraper

# or
pip install fandom-scraper
```
## Usage
Usage is simple: the function requires a path containing so-called seeds from which to start scraping, e.g. `examples/witcher_json/witcher_1.json`
```json
[
    "Geralt of Rivia", "Triss Merigold", "Vesemir", "Leo", "Lambert", 
    "Eskel", "Alvin", "Shani", "Zoltan Chivay", "Dandelion (Jaskier)", 
    "King Foltest", "Adda the White",

    "Jacques de Aldersberg", "Azar Javed", "Professor (leader of Salamandra)", 
    ...
]
```
After starting from the seeds, the scraper uses suggestions provided by a Named Entity Recognition (NER) model. The script saves the scraped pages and instructions into their respective folders.
```python
from fandom_scraper import scrape_fandom
in_path = "./examples/witcher_json"
out_path = "./examples/async_fandom"
instruct_path = "./examples/async_fandom_instruct"

wiki = "Witcher"
lang = "en"

scrape_fandom(in_path=in_path,
              out_path=out_path,
              instruct_path=instruct_path,
              n_workers=50,  # number of async page "fetchers"
              wiki=wiki,
              lang=lang)
```
See `examples/async_fandom/` and `examples/async_fandom_instruct/` for more examples.
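
Since a seed file is a plain JSON list of page titles, it can also be generated with the standard library. The following is a minimal sketch, not part of the package: `write_seeds` and the file name `my_seeds.json` are hypothetical, and it assumes the scraper accepts any seed file in the same flat-list format as `witcher_1.json`.

```python
import json
from pathlib import Path


def write_seeds(titles, seed_dir, filename="my_seeds.json"):
    """Write page titles as a JSON list (hypothetical helper, not part of fandom-scraper)."""
    seed_dir = Path(seed_dir)
    seed_dir.mkdir(parents=True, exist_ok=True)

    # Drop duplicate titles while keeping their original order.
    unique_titles = list(dict.fromkeys(titles))

    seed_file = seed_dir / filename
    seed_file.write_text(
        json.dumps(unique_titles, indent=4, ensure_ascii=False),
        encoding="utf-8",
    )
    return seed_file
```

Under the assumption above, the directory passed as `seed_dir` could then be passed as `in_path` to `scrape_fandom`.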

To convert the scraped instructions into a `.jsonl` file that can be used by [Cirilla](https://github.com/AnthonyP57/Cirilla---a-LLM-made-on-a-budget) models, use:
```python
from fandom_scraper import instructions_into_conv

instructions_into_conv('./examples/async_fandom_instruct', './examples/fandom_instruct.jsonl')
```

The resulting file will look like:
```json
{"subject": "Shani", "text": [{"role": "user", "content": "What role did Shani play in the Battle of Brenna?"}, {"role": "assistant", "content": "Shani played a vital role in the Battle of Brenna in March 1268, serving in a field hospital under Milo Rusty Vanderbeck. Despite her youth, she effectively managed the demanding and gruesome conditions, providing essential medical aid to the wounded, contributing to the victory of the Northern Kingdoms."}], "data type": "conv", "source": "fandom"}
{"subject": "Shani", "text": [{"role": "user", "content": "How did Shani become the Dean of the Department of Medicine at Oxenfurt Academy?"}, {"role": "assistant", "content": "Shani, initially a medical student at Oxenfurt Academy, demonstrated her medical proficiency at a young age. She served in Milo Rusty Vanderbecks field hospital at the Battle of Brenna. Her exceptional skills and commitment to the field led to her appointment as the Dean of the Department of Medicine at Oxenfurt Academy."}], "data type": "conv", "source": "fandom"}
...
```
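
To inspect the resulting file, each line can be parsed as an independent JSON object. The sketch below assumes only the `subject` key visible in the sample above; `load_conversations` and `count_by_subject` are hypothetical helper names, not functions provided by the package.

```python
import json
from collections import Counter


def load_conversations(jsonl_path):
    """Yield one parsed record per non-empty line of a .jsonl file."""
    with open(jsonl_path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)


def count_by_subject(records):
    """Count how many conversations exist for each subject."""
    return Counter(record["subject"] for record in records)
```

For example, `count_by_subject(load_conversations("./examples/fandom_instruct.jsonl")).most_common(5)` would list the five subjects with the most conversations.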
## Effectiveness
For the Witcher fandom, the scraper gathered 7506 pages and 1494 instructions, for a total of around 40 MiB of plain text in around 4 hours.