Metadata-Version: 2.4
Name: freecrawl-mcp
Version: 0.1.1
Summary: FreeCrawl MCP Server - Self-hosted web scraping and document processing as a Firecrawl replacement
Project-URL: Homepage, https://github.com/dylan-gluck/freecrawl-mcp
Project-URL: Repository, https://github.com/dylan-gluck/freecrawl-mcp
Project-URL: Issues, https://github.com/dylan-gluck/freecrawl-mcp/issues
Author-email: Dylan Gluck <dylan@dylangluck.com>
License: MIT
Keywords: document-processing,firecrawl,mcp,playwright,web-scraping
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Internet :: WWW/HTTP :: Browsers
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Requires-Python: >=3.12
Requires-Dist: aiohttp>=3.9.0
Requires-Dist: aiosqlite>=0.19.0
Requires-Dist: beautifulsoup4>=4.12.0
Requires-Dist: markdownify>=0.11.0
Requires-Dist: mcp>=1.0.0
Requires-Dist: playwright>=1.40.0
Requires-Dist: psutil>=5.9.0
Requires-Dist: pydantic>=2.0.0
Requires-Dist: tenacity>=8.0.0
Description-Content-Type: text/markdown

# FreeCrawl MCP Server

A production-ready Model Context Protocol (MCP) server for web scraping and document processing, designed as a self-hosted replacement for Firecrawl.

## 🚀 Features

- **JavaScript-enabled web scraping** with Playwright and anti-detection measures
- **Document processing** with fallback support for various formats
- **Concurrent batch processing** with configurable limits
- **Intelligent caching** with SQLite backend
- **Rate limiting** per domain
- **Comprehensive error handling** with retry logic
- **Easy installation** via `uvx` or local development setup
- **Health monitoring** and metrics collection

## 📦 Installation & Usage

### Quick Start with uvx (Recommended)

The easiest way to use FreeCrawl is with `uvx`, which automatically manages dependencies:

```bash
# Install and run directly
uvx freecrawl-mcp

# Install browsers on first run
uvx freecrawl-mcp --install-browsers

# Test functionality
uvx freecrawl-mcp --test

# Get help
uvx freecrawl-mcp --help
```

### Local Development Setup

For local development or customization:

1. **Clone from GitHub:**
   ```bash
   git clone https://github.com/dylan-gluck/freecrawl-mcp.git
   cd freecrawl-mcp
   ```

2. **Set up environment:**
   ```bash
   # Sync dependencies
   uv sync

   # Install browser dependencies
   uv run freecrawl-mcp --install-browsers

   # Run tests
   uv run freecrawl-mcp --test
   ```

3. **Run the server:**
   ```bash
   uv run freecrawl-mcp
   ```

## 🛠 Configuration

Configure FreeCrawl using environment variables:

### Basic Configuration
```bash
# Transport (stdio for MCP, http for REST API)
export FREECRAWL_TRANSPORT=stdio

# Browser pool settings
export FREECRAWL_MAX_BROWSERS=3
export FREECRAWL_HEADLESS=true

# Concurrency limits
export FREECRAWL_MAX_CONCURRENT=10
export FREECRAWL_MAX_PER_DOMAIN=3

# Cache settings
export FREECRAWL_CACHE=true
export FREECRAWL_CACHE_DIR=/tmp/freecrawl_cache
export FREECRAWL_CACHE_TTL=3600
export FREECRAWL_CACHE_SIZE=536870912  # 512MB

# Rate limiting
export FREECRAWL_RATE_LIMIT=60  # requests per minute

# Logging
export FREECRAWL_LOG_LEVEL=INFO
```

### Security Settings
```bash
# API authentication (optional)
export FREECRAWL_REQUIRE_API_KEY=false
export FREECRAWL_API_KEYS=key1,key2,key3

# Domain blocking
export FREECRAWL_BLOCKED_DOMAINS=localhost,127.0.0.1

# Anti-detection
export FREECRAWL_ANTI_DETECT=true
export FREECRAWL_ROTATE_UA=true
```

## 🔧 MCP Tools

FreeCrawl provides the following MCP tools:

### `freecrawl_scrape`
Scrape content from a single URL with advanced options.

**Parameters:**
- `url` (string): URL to scrape
- `formats` (array): Output formats - `["markdown", "html", "text", "screenshot", "structured"]`
- `javascript` (boolean): Enable JavaScript execution (default: true)
- `wait_for` (string, optional): CSS selector or time (ms) to wait
- `anti_bot` (boolean): Enable anti-detection measures (default: true)
- `headers` (object, optional): Custom HTTP headers
- `cookies` (object, optional): Custom cookies
- `cache` (boolean): Use cached results if available (default: true)
- `timeout` (number): Total timeout in milliseconds (default: 30000)

**Example:**
```json
{
  "name": "freecrawl_scrape",
  "arguments": {
    "url": "https://example.com",
    "formats": ["markdown", "screenshot"],
    "javascript": true,
    "wait_for": "2000"
  }
}
```

### `freecrawl_batch_scrape`
Scrape multiple URLs concurrently.

**Parameters:**
- `urls` (array): List of URLs to scrape (max 100)
- `concurrency` (number): Maximum concurrent requests (default: 5)
- `formats` (array): Output formats (default: `["markdown"]`)
- `common_options` (object, optional): Options applied to all URLs
- `continue_on_error` (boolean): Continue if individual URLs fail (default: true)

**Example:**
```json
{
  "name": "freecrawl_batch_scrape",
  "arguments": {
    "urls": [
      "https://example.com/page1",
      "https://example.com/page2"
    ],
    "concurrency": 3,
    "formats": ["markdown", "text"]
  }
}
```

### `freecrawl_extract`
Extract structured data using schema-driven approach.

**Parameters:**
- `url` (string): URL to extract data from
- `schema` (object): JSON Schema or Pydantic model definition
- `prompt` (string, optional): Custom extraction instructions
- `validation` (boolean): Validate against schema (default: true)
- `multiple` (boolean): Extract multiple matching items (default: false)

**Example:**
```json
{
  "name": "freecrawl_extract",
  "arguments": {
    "url": "https://example.com/product",
    "schema": {
      "type": "object",
      "properties": {
        "title": {"type": "string"},
        "price": {"type": "number"}
      }
    }
  }
}
```

### `freecrawl_process_document`
Process documents (PDF, DOCX, etc.) with OCR support.

**Parameters:**
- `file_path` (string, optional): Path to document file
- `url` (string, optional): URL to download document from
- `strategy` (string): Processing strategy - `"fast"`, `"hi_res"`, `"ocr_only"` (default: "hi_res")
- `formats` (array): Output formats - `["markdown", "structured", "text"]`
- `languages` (array, optional): OCR languages (e.g., `["eng", "fra"]`)
- `extract_images` (boolean): Extract embedded images (default: false)
- `extract_tables` (boolean): Extract and structure tables (default: true)

**Example:**
```json
{
  "name": "freecrawl_process_document",
  "arguments": {
    "url": "https://example.com/document.pdf",
    "strategy": "hi_res",
    "formats": ["markdown", "structured"]
  }
}
```

### `freecrawl_health_check`
Get server health status and metrics.

**Example:**
```json
{
  "name": "freecrawl_health_check",
  "arguments": {}
}
```

## 🔄 Integration with Claude Code

### MCP Configuration

Add FreeCrawl to your MCP configuration:

**Using uvx (Recommended):**
```json
{
  "mcpServers": {
    "freecrawl": {
      "command": "uvx",
      "args": ["freecrawl-mcp"]
    }
  }
}
```

**Using local development setup:**
```json
{
  "mcpServers": {
    "freecrawl": {
      "command": "uv",
      "args": ["run", "freecrawl-mcp"],
      "cwd": "/path/to/freecrawl-mcp"
    }
  }
}
```

### Usage in Prompts

```
Please scrape the content from https://example.com and extract the main article text in markdown format.
```

Claude Code will automatically use the `freecrawl_scrape` tool to fetch and process the content.

## 🚀 Performance & Scalability

### Resource Usage
- **Memory**: ~100MB base + ~50MB per browser instance
- **CPU**: Moderate usage during active scraping
- **Storage**: Cache grows based on configured limits

### Throughput
- **Single requests**: 2-5 seconds typical response time
- **Batch processing**: 10-50 concurrent requests depending on configuration
- **Cache hit ratio**: 30%+ for repeated content

### Optimization Tips
1. **Enable caching** for frequently accessed content
2. **Adjust concurrency** based on target site rate limits
3. **Use appropriate formats** - markdown is faster than screenshots
4. **Configure rate limiting** to avoid being blocked

## 🛡 Security Considerations

### Anti-Detection
- Rotating user agents
- Realistic browser fingerprints
- Request timing randomization
- JavaScript execution in sandboxed environment

### Input Validation
- URL format validation
- Private IP blocking
- Domain blocklist support
- Request size limits

### Resource Protection
- Memory usage monitoring
- Browser pool size limits
- Request timeout enforcement
- Rate limiting per domain

## 🔧 Troubleshooting

### Common Issues

| Issue | Possible Cause | Solution |
|-------|----------------|----------|
| High memory usage | Too many browser instances | Reduce `FREECRAWL_MAX_BROWSERS` |
| Slow responses | JavaScript-heavy sites | Increase timeout or disable JS |
| Bot detection | Missing anti-detection | Ensure `FREECRAWL_ANTI_DETECT=true` |
| Cache misses | TTL too short | Increase `FREECRAWL_CACHE_TTL` |
| Import errors | Missing dependencies | Run `uvx freecrawl-mcp --test` |

### Debug Mode

**With uvx:**
```bash
export FREECRAWL_LOG_LEVEL=DEBUG
uvx freecrawl-mcp --test
```

**Local development:**
```bash
export FREECRAWL_LOG_LEVEL=DEBUG
uv run freecrawl-mcp --test
```

## 📈 Monitoring & Observability

### Health Metrics
- Browser pool status
- Memory and CPU usage
- Cache hit rates
- Request success rates
- Response times

### Logging
FreeCrawl provides structured logging with configurable levels:
- ERROR: Critical failures
- WARNING: Recoverable issues
- INFO: General operations
- DEBUG: Detailed troubleshooting

## 🔧 Development

### Running Tests

**With uvx:**
```bash
# Basic functionality test
uvx freecrawl-mcp --test
```

**Local development:**
```bash
# Basic functionality test
uv run freecrawl-mcp --test
```

### Code Structure
- **Core server**: `FreeCrawlServer` class
- **Browser management**: `BrowserPool` for resource pooling
- **Content extraction**: `ContentExtractor` with multiple strategies
- **Caching**: `CacheManager` with SQLite backend
- **Rate limiting**: `RateLimiter` with token bucket algorithm

## 📄 License

This project is licensed under the MIT License - see the technical specification for details.

## 🤝 Contributing

1. Fork the repository at https://github.com/dylan-gluck/freecrawl-mcp
2. Create a feature branch
3. Set up local development: `uv sync`
4. Run tests: `uv run freecrawl-mcp --test`
5. Submit a pull request

## 📚 Technical Specification

For detailed technical information, see `ai_docs/FREECRAWL_TECHNICAL_SPEC.md`.

---

**FreeCrawl MCP Server** - Self-hosted web scraping for the modern web 🚀
