Metadata-Version: 2.4
Name: scrapeMM
Version: 0.3.1
Summary: LLM-friendly scraper for media and text from social media and the open web.
Author-email: Mark Rothermel <mark.rothermel@tu-darmstadt.de>
License-Expression: Apache-2.0
Project-URL: Homepage, https://github.com/multimodal-ai-lab/scrapeMM
Project-URL: Issues, https://github.com/multimodal-ai-lab/scrapeMM
Requires-Python: >=3.8
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: ezmm
Requires-Dist: telethon
Requires-Dist: tweepy
Requires-Dist: markdownify
Requires-Dist: platformdirs
Requires-Dist: PyYAML
Requires-Dist: atproto
Requires-Dist: TikTokResearchApi
Requires-Dist: yt-dlp
Requires-Dist: cryptography
Requires-Dist: firecrawl-py
Dynamic: license-file

# scrapeMM: Multimodal Web Retrieval
A simple web scraper that asynchronously retrieves webpages and social media content, fetching text along with media (images and videos).

This library helps developers and researchers access multimodal data from the web and use it for LLM processing.

## Usage
```python
import asyncio

from scrapemm import retrieve

url = "https://example.com"
# retrieve() is a coroutine; asyncio.run() creates the event loop and runs it to completion
result = asyncio.run(retrieve(url))
result.render()
```
`scrapeMM` will ask you for the **API keys** needed for the social media integrations. You may skip any that you don't need.
You will also be prompted to choose a **password**, which is used to secure the secrets in an encrypted file.
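
Because `retrieve()` is a coroutine, several URLs can be fetched concurrently. The following is a minimal sketch, not part of the library: `retrieve_many`, `max_concurrency`, and `fake_retrieve` are hypothetical names introduced for illustration, and the sketch assumes the retrieval function is called with one URL at a time, as in the example above. The stand-in `fake_retrieve` keeps the snippet runnable without API keys or network access.

```python
import asyncio
from typing import Any, Awaitable, Callable, Dict, List


async def retrieve_many(
    urls: List[str],
    retrieve_fn: Callable[[str], Awaitable[Any]],
    max_concurrency: int = 4,
) -> Dict[str, Any]:
    """Run retrieve_fn on every URL, at most max_concurrency at a time.

    Returns a dict mapping each URL to its result, or to the exception
    that was raised while retrieving it.
    """
    semaphore = asyncio.Semaphore(max_concurrency)

    async def fetch_one(url: str) -> Any:
        # The semaphore caps the number of simultaneous requests.
        async with semaphore:
            return await retrieve_fn(url)

    # return_exceptions=True keeps one failing URL from cancelling the rest.
    results = await asyncio.gather(
        *(fetch_one(url) for url in urls), return_exceptions=True
    )
    return dict(zip(urls, results))


async def fake_retrieve(url: str) -> str:
    """Hypothetical stand-in for scrapemm.retrieve, used only for this demo."""
    await asyncio.sleep(0)
    if "bad" in url:
        raise ValueError(f"cannot scrape {url}")
    return f"content of {url}"


demo = asyncio.run(
    retrieve_many(["https://example.com", "https://bad.example"], fake_retrieve)
)
```

To use this with real pages, pass `retrieve` (imported from `scrapemm`) instead of `fake_retrieve`. Failed URLs appear in the returned dict as exception objects, so check each value with `isinstance(value, Exception)` before using it.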

## How it works
```
Input:                                  Output:
URL (string)   -->   retrieve()   -->   MultimodalSequence
```
`MultimodalSequence` is a type provided by the [ezMM](https://github.com/multimodal-ai-lab/ezmm) library. It represents a sequence of Markdown-formatted text and media.

Web scraping is done with [Firecrawl](https://github.com/mendableai/firecrawl) and [Decodo](https://decodo.com/).

## Supported Proprietary APIs
- ✅ X/Twitter
- ✅ Telegram
- ✅ Bluesky
- ✅ TikTok
- ⚠️ Facebook (works only intermittently, and only with yt-dlp and Decodo)
- ⚠️ Instagram (videos are supported; images are not yet)
- ⚠️ YouTube (works only intermittently)
- ⏳ Threads
- ⏳ Reddit
