Metadata-Version: 2.1
Name: webtoolkit
Version: 0.0.10
Summary: Web tools and interfaces for Internet data processing.
License: GPL3
Author: Iwan Grozny
Author-email: renegat@renegat0x0.ddns.net
Requires-Python: >=3.9,<4.0
Classifier: License :: Other/Proprietary License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Requires-Dist: beautifulsoup4 (>=4.13.5,<5.0.0)
Requires-Dist: brutefeedparser (>=0.10.5,<0.11.0)
Requires-Dist: lxml (>=5.4.0,<6.0.0)
Requires-Dist: psutil
Requires-Dist: python-dateutil (>=2.8.2,<3.0.0)
Requires-Dist: pytz (>=2024.2,<2025.0)
Requires-Dist: requests (>=2.32.3,<3.0.0)
Requires-Dist: tldextract (>=5.1.2,<6.0.0)
Requires-Dist: url-cleaner
Description-Content-Type: text/markdown

# webtoolkit

Provides classes and tools for Internet data processing.

 - URL parsing
 - HTTP status code identification
 - Page definitions: HtmlPage, RssPage, OpmlPage, and content interfaces
 - Crawling interfaces: means of calling crawling systems

Remote crawling interfaces are implemented by [crawler-buddy](https://github.com/rumca-js/crawler-buddy).

Available on [PyPI](https://pypi.org/project/webtoolkit).


# URL parsing

Remove trackers from a link and sanitize it
```
UrlLocation.get_cleaned_link
```

Obtain the domain of a link
```
UrlLocation(link).get_domain()
```
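
To illustrate what link cleaning and domain extraction involve, here is a minimal standard-library sketch. It is not how `UrlLocation` is implemented. The names `strip_trackers` and `host_of` and the list of tracking parameters are assumptions made for this example, and the library's own results may differ.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical tracker list for this example; the library may use another one.
TRACKING_PREFIXES = ("utm_",)
TRACKING_NAMES = {"fbclid", "gclid"}


def strip_trackers(link: str) -> str:
    """Return the link without known tracking query parameters."""
    parts = urlsplit(link)
    kept = [
        (name, value)
        for name, value in parse_qsl(parts.query, keep_blank_values=True)
        if not name.startswith(TRACKING_PREFIXES) and name not in TRACKING_NAMES
    ]
    return urlunsplit(parts._replace(query=urlencode(kept)))


def host_of(link: str) -> str:
    """Return the lower-cased host name of the link, without the port."""
    return urlsplit(link).hostname or ""


print(strip_trackers("https://example.com/a?id=1&utm_source=x&fbclid=abc"))
print(host_of("https://www.Example.com:8080/x"))
```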

# HTTP processing

Identify status codes that mark a page as valid
```
PageResponseObject().is_valid
```

Identify status codes that mark a page as invalid
```
PageResponseObject().is_invalid
```

Some codes indicate neither that a page is valid nor that it is invalid. For example, if the crawler is throttled because of too many requests, we do not yet know whether the page is valid.
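
The three possible outcomes can be sketched as follows. This is an illustration only, not the logic of `PageResponseObject`. The `PageState` enum, the `classify` function and the choice of inconclusive codes are assumptions made for this example.

```python
from enum import Enum


class PageState(Enum):
    VALID = "valid"
    INVALID = "invalid"
    UNKNOWN = "unknown"


# Assumption: 429 (Too Many Requests) and 503 (Service Unavailable)
# describe the state of the server, not the state of the page.
INCONCLUSIVE_CODES = {429, 503}


def classify(status_code: int) -> PageState:
    """Map an HTTP status code to one of three states."""
    if status_code in INCONCLUSIVE_CODES:
        return PageState.UNKNOWN
    if 200 <= status_code < 400:
        return PageState.VALID
    if status_code >= 400:
        return PageState.INVALID
    # Informational 1xx codes and anything else carry no verdict.
    return PageState.UNKNOWN


for code in (200, 404, 429):
    print(code, classify(code).value)
```

With this split, a page is neither valid nor invalid while the state is `UNKNOWN`, so a caller can retry later instead of discarding the link.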

# Page definitions

Easy access to HTML properties
```
page = HtmlPage(url, contents)
page.get_title()
page.get_description()
```
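
For readers who want to see what reading these properties involves, below is a rough standard-library sketch. It does not show how `HtmlPage` works internally. The `HeadProbe` class and the `probe` function are hypothetical names used only here.

```python
from html.parser import HTMLParser


class HeadProbe(HTMLParser):
    """Collect the <title> text and the meta description of a document."""

    def __init__(self):
        super().__init__()
        self.title = None
        self.description = None
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and (attrs.get("name") or "").lower() == "description":
            self.description = attrs.get("content")

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        # Title text may arrive in several chunks, so append.
        if self._in_title:
            self.title = (self.title or "") + data


def probe(contents: str):
    """Return (title, description); either may be None when absent."""
    parser = HeadProbe()
    parser.feed(contents)
    parser.close()
    return parser.title, parser.description


sample = (
    "<html><head><title>Hello</title>"
    '<meta name="description" content="A page"></head><body></body></html>'
)
print(probe(sample))
```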

Easy access to RSS properties
```
page = RssPage(url, contents)
page.get_title()
page.get_description()
page.get_entries()
```
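
The same three properties can be read from a plain RSS 2.0 document with the standard library, as sketched below. This is only an illustration of the data involved, not the `RssPage` implementation. The `read_rss` function and the shape of the returned dictionary are assumptions for this example.

```python
import xml.etree.ElementTree as ET


def read_rss(contents: str) -> dict:
    """Extract channel title, description and items from RSS 2.0 text."""
    channel = ET.fromstring(contents).find("channel")
    if channel is None:
        raise ValueError("not an RSS 2.0 document")
    return {
        "title": channel.findtext("title"),
        "description": channel.findtext("description"),
        "entries": [
            {"title": item.findtext("title"), "link": item.findtext("link")}
            for item in channel.findall("item")
        ],
    }


sample = (
    "<rss version='2.0'><channel>"
    "<title>Feed</title><description>Demo</description>"
    "<item><title>One</title><link>https://example.com/1</link></item>"
    "</channel></rss>"
)
print(read_rss(sample))
```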

Easy access to OPML properties
```
page = OpmlPage(url, contents)
page.get_entries()
```
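
An OPML file is a tree of `outline` elements, and feed entries are the outlines that carry an `xmlUrl` attribute. The sketch below collects them with the standard library. It is not the `OpmlPage` implementation, and the `read_opml` function and its return shape are assumptions for this example.

```python
import xml.etree.ElementTree as ET


def read_opml(contents: str) -> list[dict]:
    """Return every outline that points at a feed, including nested ones."""
    root = ET.fromstring(contents)
    return [
        {"title": node.get("title") or node.get("text"), "url": node.get("xmlUrl")}
        for node in root.iter("outline")
        if node.get("xmlUrl")  # grouping outlines have no xmlUrl
    ]


sample = (
    '<opml version="2.0"><body>'
    '<outline text="Group">'
    '<outline text="Feed" xmlUrl="https://example.com/rss"/>'
    "</outline></body></opml>"
)
print(read_opml(sample))
```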

# Interfaces

 - RemoteServer - provides means of calling remote crawling systems
 - Url - wrapper for RemoteServer that returns ready-to-use data

