Metadata-Version: 2.4
Name: AutoWebPdfSummarizer
Version: 0.1.1
Summary: Summarize web pages and PDFs with Google Gemini
License: MIT
Requires-Python: >=3.9
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: requests>=2.31
Requires-Dist: playwright>=1.43
Requires-Dist: pillow>=9.0
Requires-Dist: PyMuPDF>=1.23
Requires-Dist: nest_asyncio>=1.5
Requires-Dist: google-generativeai>=0.5.0
Dynamic: license-file

# AI Knowledge Summarizer

`AutoWebPdfSummarizer` packages the core logic of the original notebook as a reusable
library that can be published on PyPI. It classifies each incoming URL as either a standard
web page or a PDF document, extracts text and imagery, and sends the material to Google
Gemini for a structured summary.

## Installation

The project uses Playwright for browser automation. Install the Python package and the
Chromium browser binaries:

```bash
pip install AutoWebPdfSummarizer
playwright install chromium
```

Additional runtime dependencies (such as PyMuPDF) are pulled in automatically via the
package metadata.

## Usage

```python
import logging
from AutoWebPdfSummarizer import summarize_url

logger = logging.getLogger("demo")
logger.setLevel(logging.INFO)
logger.addHandler(logging.StreamHandler())

result = summarize_url(
    "https://example.com/article",
    google_api_key="YOUR_API_KEY",
    logger=logger,
)

print(result.summary)
```

Key features:

- Automatic detection of PDF vs. HTML content.
- Smart truncation of large text blocks and screenshot size management for web pages.
- PDF rendering and text extraction powered by PyMuPDF.
- Customizable logging: pass any `logging.Logger` instance or rely on the built-in
  no-op logger.
- Configurable Gemini prompt, model selection, and request limits.
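
The PDF vs. HTML detection mentioned above could look roughly like the sketch below. This is
not the library's actual implementation: `looks_like_pdf` is a hypothetical helper, and the
rule it applies (trust the `Content-Type` header when present, otherwise fall back to the
URL path extension) is an assumption made for illustration.

```python
from typing import Optional
from urllib.parse import urlparse


def looks_like_pdf(url: str, content_type: Optional[str] = None) -> bool:
    """Hypothetical helper: guess whether a URL points at a PDF document."""
    # Prefer the server-reported media type when one is available.
    if content_type:
        media_type = content_type.split(";", 1)[0].strip().lower()
        if media_type == "application/pdf":
            return True
    # Fall back to the path extension; urlparse keeps the query string
    # and fragment out of the path component.
    path = urlparse(url).path.lower()
    return path.endswith(".pdf")


print(looks_like_pdf("https://example.com/paper.pdf"))
print(looks_like_pdf("https://example.com/article", "text/html"))
```

A header-first check matters for download endpoints whose URLs carry no `.pdf` suffix; the
extension fallback covers cases where no header has been fetched yet.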

## Configuration Options

`summarize_url` accepts several optional keyword arguments:

- `prompt`: supply a custom Gemini prompt string. The default prompt produces an English
  analyst-style summary.
- `max_chars`: maximum number of characters retained from the extracted text (default
  `6000`).
- `max_image_mb`: per-image size ceiling in megabytes for web page screenshots (default
  `4.0`).
- `max_pdf_pages`: number of PDF pages to process (default `5`).
- `request_timeout`: timeout in seconds applied to HTTP requests and Playwright navigation
  (default `20`).
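
One way to keep overrides in a single place is to merge them onto the documented defaults
before calling `summarize_url`. The sketch below is illustrative only: `DOCUMENTED_DEFAULTS`
and `build_options` are hypothetical names that are not part of the package, and the default
values are copied from the list above. `prompt` is accepted but has no entry in the defaults
because its default text is not reproduced here.

```python
from typing import Any, Dict

# Defaults copied from the option list above (hypothetical constant).
DOCUMENTED_DEFAULTS: Dict[str, Any] = {
    "max_chars": 6000,
    "max_image_mb": 4.0,
    "max_pdf_pages": 5,
    "request_timeout": 20,
}

# `prompt` is a valid option too, but it is only forwarded when supplied.
ALLOWED_OPTIONS = set(DOCUMENTED_DEFAULTS) | {"prompt"}


def build_options(**overrides: Any) -> Dict[str, Any]:
    """Hypothetical helper: merge overrides onto the documented defaults."""
    unknown = sorted(set(overrides) - ALLOWED_OPTIONS)
    if unknown:
        # Fail early on typos such as `max_char` instead of at call time.
        raise TypeError(f"unknown option(s): {', '.join(unknown)}")
    return {**DOCUMENTED_DEFAULTS, **overrides}


options = build_options(max_chars=12000, max_pdf_pages=10)
print(options)

# The merged dictionary can then be unpacked into the call, for example:
# summarize_url("https://example.com/report.pdf", **options)
```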

The Google API key can be provided explicitly or via the `GOOGLE_API_KEY` environment
variable.
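
The lookup order described above (explicit argument first, then the environment) can be
sketched as follows. `resolve_api_key` is a hypothetical helper written for illustration;
the library's own resolution logic and error type may differ.

```python
import os
from typing import Optional


def resolve_api_key(explicit_key: Optional[str] = None) -> str:
    """Hypothetical helper: pick the explicit key, else GOOGLE_API_KEY."""
    key = explicit_key or os.environ.get("GOOGLE_API_KEY")
    if not key:
        # Assumption: a missing key is treated as a configuration error.
        raise ValueError(
            "No API key found: pass google_api_key or set GOOGLE_API_KEY."
        )
    return key


os.environ["GOOGLE_API_KEY"] = "key-from-environment"
print(resolve_api_key())                # falls back to the environment variable
print(resolve_api_key("explicit-key"))  # explicit argument takes precedence
```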

## Development

Install the package in editable mode, then install the Chromium binary for Playwright:

```bash
pip install -e .
playwright install chromium
```

Then byte-compile the sources as a basic syntax check:

```bash
python -m compileall src
```

## License

MIT
