Metadata-Version: 2.2
Name: spiderforce4ai
Version: 1.2
Summary: Python wrapper for SpiderForce4AI HTML-to-Markdown conversion service
Home-page: https://petertam.pro
Author: Piotr Tamulewicz
Author-email: Piotr Tamulewicz <pt@petertam.pro>
License: MIT
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Requires-Python: >=3.11
Description-Content-Type: text/markdown
Requires-Dist: aiohttp>=3.8.0
Requires-Dist: asyncio>=3.4.3
Requires-Dist: rich>=10.0.0
Requires-Dist: aiofiles>=0.8.0
Requires-Dist: httpx>=0.24.0
Dynamic: author
Dynamic: home-page
Dynamic: requires-python

# SpiderForce4AI Python Wrapper

A Python package for web-content crawling and HTML-to-Markdown conversion, built for seamless integration with the SpiderForce4AI service.

## Features

- HTML to Markdown conversion
- Parallel and async crawling support
- Sitemap processing
- Custom content selection
- Automatic retry mechanism
- Detailed progress tracking
- Webhook notifications
- Customizable reporting

## Installation

```bash
pip install spiderforce4ai
```

## Quick Start

```python
from spiderforce4ai import SpiderForce4AI, CrawlConfig
from pathlib import Path

# Initialize crawler
spider = SpiderForce4AI("http://localhost:3004")

# Configure crawling options
config = CrawlConfig(
    target_selector="article",
    remove_selectors=[".ads", ".navigation"],
    max_concurrent_requests=5,
    save_reports=True
)

# Crawl a sitemap
results = spider.crawl_sitemap_server_parallel("https://example.com/sitemap.xml", config)
```

## Key Features

### 1. Smart Retry Mechanism
- Automatically retries failed URLs
- Monitors failure ratio to prevent server overload
- Detailed retry statistics and progress tracking
- Aborts retries if failure rate exceeds 20%

```python
# Retry behavior is automatic
import asyncio

config = CrawlConfig(
    max_concurrent_requests=5,
    request_delay=1.0  # Delay between requests (seconds), also applied between retries
)
results = asyncio.run(spider.crawl_urls_async(urls, config))
```
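The 20% abort threshold can be pictured with a small sketch. The `should_retry` helper below is hypothetical and shown only for illustration; the actual check is internal to the package:

```python
# Hypothetical sketch of the failure-ratio guard described above.
# The 20% threshold comes from the docs; the real check is internal
# to spiderforce4ai.
def should_retry(failed: int, total: int, threshold: float = 0.2) -> bool:
    """Retry only while the failure rate stays at or below the threshold."""
    return total > 0 and (failed / total) <= threshold

print(should_retry(18, 156))   # 11.5% failed -> retries proceed
print(should_retry(40, 156))   # 25.6% failed -> retries aborted
```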

### 2. Custom Webhook Integration
- Flexible payload formatting
- Custom headers support
- Variable substitution in templates

```python
config = CrawlConfig(
    webhook_url="https://your-webhook.com",
    webhook_headers={
        "Authorization": "Bearer token",
        "X-Custom-Header": "value"
    },
    webhook_payload_template='''{
        "url": "{url}",
        "content": "{markdown}",
        "status": "{status}",
        "custom_field": "value"
    }'''
)
```
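The wrapper fills each `{placeholder}` per crawled URL. One plausible way to picture that substitution (a sketch, not the package's actual code; plain `str.replace` is used here because the template's literal JSON braces would trip up `str.format`):

```python
# Sketch of per-URL placeholder substitution for the webhook payload.
# Field names mirror the template above; the real implementation is
# internal to spiderforce4ai.
template = '''{
    "url": "{url}",
    "content": "{markdown}",
    "status": "{status}"
}'''

fields = {
    "url": "https://example.com/page1",
    "markdown": "# Page 1",
    "status": "success",
}

payload = template
for key, value in fields.items():
    payload = payload.replace("{" + key + "}", value)

print(payload)
```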

### 3. Flexible Report Generation
- Optional report saving
- Customizable report location
- Detailed success/failure statistics

```python
config = CrawlConfig(
    save_reports=True,
    report_file=Path("custom_report.json"),
    output_dir=Path("content")
)
```

## Crawling Methods

### 1. Single URL Processing

```python
# Synchronous
result = spider.crawl_url("https://example.com", config)

# Asynchronous
import asyncio

async def crawl():
    return await spider.crawl_url_async("https://example.com", config)

result = asyncio.run(crawl())
```

### 2. Multiple URLs

```python
urls = ["https://example.com/page1", "https://example.com/page2"]

# Server-side parallel (recommended)
results = spider.crawl_urls_server_parallel(urls, config)

# Client-side parallel
results = spider.crawl_urls_parallel(urls, config)

# Asynchronous
import asyncio

async def crawl():
    return await spider.crawl_urls_async(urls, config)

results = asyncio.run(crawl())
```

### 3. Sitemap Processing

```python
# Server-side parallel (recommended)
results = spider.crawl_sitemap_server_parallel("https://example.com/sitemap.xml", config)

# Client-side parallel
results = spider.crawl_sitemap_parallel("https://example.com/sitemap.xml", config)

# Asynchronous
import asyncio

async def crawl():
    return await spider.crawl_sitemap_async("https://example.com/sitemap.xml", config)

results = asyncio.run(crawl())
```

## Configuration Options

```python
config = CrawlConfig(
    # Content Selection
    target_selector="article",              # Target element to extract
    remove_selectors=[".ads", "#popup"],    # Elements to remove
    remove_selectors_regex=["modal-\\d+"],  # Regex patterns for removal
    
    # Processing
    max_concurrent_requests=5,              # Parallel processing limit
    request_delay=0.5,                      # Delay between requests (seconds)
    timeout=30,                             # Request timeout (seconds)
    
    # Output
    output_dir=Path("content"),             # Output directory
    save_reports=False,                     # Enable/disable report saving
    report_file=Path("report.json"),        # Report location
    
    # Webhook
    webhook_url="https://webhook.com",      # Webhook endpoint
    webhook_timeout=10,                     # Webhook timeout (seconds)
    webhook_headers={                       # Custom headers
        "Authorization": "Bearer token"
    },
    # Custom payload format ({placeholders} are substituted per URL)
    webhook_payload_template='''{
        "url": "{url}",
        "content": "{markdown}",
        "status": "{status}",
        "error": "{error}",
        "time": "{timestamp}"
    }'''
)
```

## Progress Tracking

The package provides detailed progress information:

```
Fetching sitemap from https://example.com/sitemap.xml...
Found 156 URLs in sitemap
[━━━━━━━━━━━━━━━━━━━━━━━━━━━━] 100% • 156/156 URLs

Retrying failed URLs: 18 (11.5% failed)
[━━━━━━━━━━━━━━━━━━━━━━━━━━━━] 100% • 18/18 retries

Crawling Summary:
Total URLs processed: 156
Initial failures: 18 (11.5%)
Final results:
  ✓ Successful: 150
  ✗ Failed: 6
Retry success rate: 12/18 (66.7%)
```

## Output Structure

### 1. Directory Layout
```
content/                    # Output directory
├── example-com-page1.md   # Markdown files
├── example-com-page2.md
└── report.json            # Crawl report
```
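The Markdown filenames appear to be slugified from the page URLs. A plausible sketch of such a mapping (`url_to_filename` is hypothetical; the package's actual scheme may differ):

```python
# Hypothetical URL-to-filename slugging, matching the layout above.
from urllib.parse import urlparse

def url_to_filename(url: str) -> str:
    parts = urlparse(url)
    # Join host and path, then replace dots and slashes with dashes.
    slug = (parts.netloc + parts.path).replace(".", "-").replace("/", "-").strip("-")
    return f"{slug}.md"

print(url_to_filename("https://example.com/page1"))  # example-com-page1.md
```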

### 2. Report Format
```json
{
  "timestamp": "2025-02-15T10:30:00",
  "config": {
    "target_selector": "article",
    "remove_selectors": [".ads"]
  },
  "results": {
    "successful": [...],
    "failed": [...]
  },
  "summary": {
    "total": 156,
    "successful": 150,
    "failed": 6
  }
}
```
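Because the report is plain JSON, it is easy to post-process. A minimal sketch, using the structure shown above (the JSON string stands in for a saved `report.json`):

```python
# Parse a crawl report and cross-check the summary counts.
import json

report_json = '''{
  "summary": {"total": 156, "successful": 150, "failed": 6}
}'''

report = json.loads(report_json)
summary = report["summary"]
assert summary["successful"] + summary["failed"] == summary["total"]
print(f"{summary['successful']}/{summary['total']} pages converted")
```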

## Performance Optimization

1. Server-side Parallel Processing
   - Recommended for most cases
   - Single HTTP request
   - Reduced network overhead
   - Built-in load balancing

2. Client-side Parallel Processing
   - Better control over processing
   - Customizable concurrency
   - Progress tracking per URL
   - Automatic retry handling

3. Asynchronous Processing
   - Ideal for async applications
   - Non-blocking operation
   - Real-time progress updates
   - Efficient resource usage

## Error Handling

The package provides comprehensive error handling:

- Automatic retry for failed URLs
- Failure ratio monitoring
- Detailed error reporting
- Webhook error notifications
- Progress tracking during retries

## Requirements

- Python 3.11+
- Running SpiderForce4AI service
- Internet connection

## Dependencies

- aiohttp
- asyncio (part of the Python standard library; declared for compatibility)
- rich
- aiofiles
- httpx

## License

MIT License

## Credits

Created by [Peter Tam](https://petertam.pro)
