Metadata-Version: 2.4
Name: kgdata
Version: 7.0.12
Summary: Library to process dumps of knowledge graphs (Wikipedia, DBpedia, Wikidata)
License: MIT
License-File: LICENSE
Author: Binh Vu
Author-email: binh@toan2.com
Requires-Python: >=3.10,<4.0
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Programming Language :: Python :: 3.14
Provides-Extra: all
Provides-Extra: dev
Provides-Extra: pyspark
Provides-Extra: ray
Requires-Dist: beautifulsoup4 (>=4.9.3,<5.0.0)
Requires-Dist: chardet (>=5.0.0,<6.0.0)
Requires-Dist: click (>=8.1.3,<9.0.0)
Requires-Dist: ftfy (>=6.1.3,<7.0.0)
Requires-Dist: hugedict (>=2.12.10,<3.0.0)
Requires-Dist: kgdata_core (>=4.0.1,<5.0.0)
Requires-Dist: loguru (>=0.7.0,<0.8.0)
Requires-Dist: lxml (>=6.0.2,<7.0.0)
Requires-Dist: numpy (>=2.1.1,<3.0.0)
Requires-Dist: orjson (>=3.9.0,<4.0.0)
Requires-Dist: parsimonious (>=0.8.1,<0.9.0)
Requires-Dist: pqdict (>=1.3.0,<2.0.0)
Requires-Dist: pyspark (>=3.5.0,<4.0.0) ; extra == "pyspark" or extra == "all"
Requires-Dist: ray (>=2.0.1,<3.0.0) ; extra == "ray" or extra == "all"
Requires-Dist: rdflib (>=7.0.0,<8.0.0)
Requires-Dist: redis (>=3.5.3,<4.0.0)
Requires-Dist: requests (>=2.28.0,<3.0.0)
Requires-Dist: rsoup (>=3.1.7,<4.0.0)
Requires-Dist: ruamel.yaml (>=0.17.21,<0.18.0)
Requires-Dist: sem-desc (>=6.11.2,<7.0.0)
Requires-Dist: six (>=1.16.0,<2.0.0)
Requires-Dist: tqdm (>=4.64.0,<5.0.0)
Requires-Dist: ujson (>=5.5.0,<6.0.0)
Project-URL: Homepage, https://github.com/binh-vu/kgdata
Project-URL: Repository, https://github.com/binh-vu/kgdata
Description-Content-Type: text/markdown

# kgdata ![PyPI](https://img.shields.io/pypi/v/kgdata) ![Documentation](https://readthedocs.org/projects/kgdata/badge/?version=latest&style=flat)

KGData is a library to process dumps of knowledge graphs such as Wikipedia, DBpedia, and Wikidata. What it can do:

- Clean up the dumps to ensure the data is consistent (resolving redirects, removing dangling references).
- Create embedded key-value databases to access entities from the dumps.
- Extract Wikidata ontology.
- Extract Wikipedia tables and convert the hyperlinks to Wikidata entities.
- Create Pyserini indices to search Wikidata’s entities.
- and more

For full documentation, please see [the website](https://kgdata.readthedocs.io/).

## Installation

From PyPI (using pre-built binaries):

```bash
pip install kgdata[pyspark]   # omit the extra and install pyspark manually if your cluster runs a different version
```
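The package also declares `ray`, `all`, and `dev` extras (see the metadata above). A sketch of the other installation variants, assuming the standard pip extras syntax:

```bash
# Ray support for distributed processing
pip install "kgdata[ray]"

# both pyspark and ray
pip install "kgdata[all]"
```

Quoting the requirement avoids shell globbing of the square brackets in zsh.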

