Metadata-Version: 2.4
Name: data_gatherer
Version: 0.1.2
Summary: DataGatherer Library
Home-page: https://github.com/VIDA-NYU/data-gatherer
Keywords: Information Extraction,NYU
Requires-Python: >=3.11
Description-Content-Type: text/x-rst
Requires-Dist: beautifulsoup4>=4.12.3
Requires-Dist: bs4
Requires-Dist: lxml
Requires-Dist: numpy>=1.26
Requires-Dist: ollama
Requires-Dist: openai
Requires-Dist: pandas
Requires-Dist: pydantic
Requires-Dist: pydantic_core
Requires-Dist: python-dotenv
Requires-Dist: PyYAML
Requires-Dist: regex
Requires-Dist: requests
Requires-Dist: selenium>=4.28.0
Requires-Dist: tokenizers
Requires-Dist: transformers
Requires-Dist: typing_extensions
Requires-Dist: webdriver-manager
Requires-Dist: google.generativeai
Requires-Dist: tiktoken
Requires-Dist: cloudscraper
Requires-Dist: pyui
Requires-Dist: pysdl2-dll
Requires-Dist: pyarrow
Requires-Dist: streamlit
Requires-Dist: ipywidgets
Requires-Dist: portkey-ai
Requires-Dist: xlsxwriter
Requires-Dist: sentence-transformers
Requires-Dist: pymupdf
Dynamic: description
Dynamic: description-content-type
Dynamic: home-page
Dynamic: keywords
Dynamic: requires-dist
Dynamic: requires-python
Dynamic: summary

.. image:: https://readthedocs.org/projects/data-gatherer/badge/?version=latest
   :target: https://data-gatherer.readthedocs.io/en/latest/
   :alt: Documentation Status

Data Gatherer
=============

**Data Gatherer** is a Python library for automatically extracting dataset references from scientific publications.
It processes full-text articles in HTML or XML format and combines rule-based and LLM-based methods
to identify and structure dataset citations.

What It Does
------------

- Parses scientific articles from open-access sources such as PubMed Central (PMC).
- Extracts dataset mentions from structured sections (e.g., Data Availability, Supplementary Material).
- Supports two main strategies:

  - **Retrieve-Then-Read (RTR)**: First retrieves relevant sections using hand-crafted rules, then applies LLMs.
  - **Full-Document Read (FDR)**: Applies LLMs to the full text without section filtering.

- Outputs structured results in JSON format.
- Includes support for known repositories (e.g., GEO, PRIDE, MassIVE) via a configurable ontology.
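
The difference between the two strategies can be illustrated with a minimal, standard-library-only sketch.
It does not use the ``data_gatherer`` API: every name below (``REPOSITORY_PATTERNS``, ``TITLE_RULE``,
``retrieve_then_read``, ``full_document_read``, ``regex_reader``) is hypothetical, the sample article is invented,
the accession patterns are assumptions about common identifier formats, and a regex-based reader stands in
for the LLM call so that the example runs offline.

.. code-block:: python

   import json
   import re
   import xml.etree.ElementTree as ET

   # Hypothetical mini-ontology: repository name -> assumed accession pattern.
   REPOSITORY_PATTERNS = {
       "GEO": re.compile(r"\bGSE\d+\b"),
       "PRIDE": re.compile(r"\bPXD\d{6}\b"),
       "MassIVE": re.compile(r"\bMSV\d{9}\b"),
   }

   # Hand-crafted retrieval rule: section titles likely to mention datasets.
   TITLE_RULE = re.compile(r"data availability|supplementary", re.IGNORECASE)

   # Invented sample article with a simplified, JATS-like structure.
   ARTICLE_XML = """
   <article>
     <sec><title>Introduction</title><p>Prior work reused GSE11111.</p></sec>
     <sec><title>Data Availability</title>
       <p>Raw data are in PRIDE under PXD012345 and in GEO under GSE99999.</p></sec>
   </article>
   """.strip()


   def section_texts(root):
       """Yield (title, text) for each <sec> element in the article."""
       for sec in root.iter("sec"):
           title = sec.findtext("title", default="")
           text = " ".join(" ".join(p.itertext()).split() for p in sec.iter("p"))
           yield title, text


   def retrieve_then_read(root, read):
       """RTR: keep only sections whose title matches the rule, then read them."""
       selected = [text for title, text in section_texts(root) if TITLE_RULE.search(title)]
       return read(" ".join(selected))


   def full_document_read(root, read):
       """FDR: read the text of every section, with no filtering."""
       return read(" ".join(text for _, text in section_texts(root)))


   def regex_reader(text):
       """Stand-in for the LLM step: match accession patterns from the ontology."""
       found = []
       for repository, pattern in REPOSITORY_PATTERNS.items():
           for accession in pattern.findall(text):
               found.append({"repository": repository, "dataset_id": accession})
       return found


   root = ET.fromstring(ARTICLE_XML)
   rtr_results = retrieve_then_read(root, regex_reader)
   fdr_results = full_document_read(root, regex_reader)
   print(json.dumps({"rtr": rtr_results, "fdr": fdr_results}, indent=2))

In this toy article, the RTR path only sees the Data Availability section, so it reports the two deposited
datasets, while the FDR path also picks up the accession mentioned in passing in the Introduction. This
reflects the general trade-off: section filtering gives the reader less text to process, whereas reading
the full document can surface mentions outside the expected sections.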

Use Cases
---------

- Helping data curators and librarians identify datasets cited in publications.
- Supporting meta-analysis and secondary data discovery.
- Enabling dataset indexing and retrieval across the open-access literature.
