# llmsbrieftxt

Generate llms-brief.txt files from any documentation website using AI. A focused, production-ready CLI tool that does one thing exceptionally well.

## Quick Start

```bash
# Install
pip install llmsbrieftxt

# Set your OpenAI API key
export OPENAI_API_KEY="sk-your-api-key-here"

# Generate llms-brief.txt from a documentation site
llmtxt https://docs.python.org/3/

# Preview URLs before processing
llmtxt https://react.dev --show-urls

# Use a different model
llmtxt https://react.dev --model gpt-4o
```

## What It Does

Crawls documentation websites, extracts content, and uses OpenAI to generate structured llms-brief.txt files. Each entry contains a title, URL, keywords, and a one-line summary, which makes it easy for LLMs and developers to navigate documentation.

**Key Features:**
- **Smart Crawling**: Breadth-first discovery up to depth 3, with URL deduplication
- **Content Extraction**: HTML to Markdown using trafilatura
- **AI Summarization**: Structured output using OpenAI
- **Automatic Caching**: Summaries cached in `.llmsbrieftxt_cache/` to avoid reprocessing
- **Production-Ready**: Clean output, proper error handling, scriptable

## Installation

```bash
# With pip
pip install llmsbrieftxt

# With uv (recommended)
uv pip install llmsbrieftxt
```

## Prerequisites

- **Python 3.10+**
- **OpenAI API Key**: Required for generating summaries
  ```bash
  export OPENAI_API_KEY="sk-your-api-key-here"
  ```

## Usage

### Basic Command

```bash
llmtxt <url> [options]
```

Output is automatically saved to `~/.claude/docs/<domain>.txt` (e.g., `~/.claude/docs/docs.python.org.txt`).
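
The mapping from URL to default output file can be pictured with a short sketch. This is an illustration of the naming rule stated above, not the tool's actual code; the function name `default_output_path` and its `docs_dir` parameter are hypothetical.

```python
from pathlib import Path
from urllib.parse import urlparse


def default_output_path(url: str, docs_dir: Path = Path("~/.claude/docs")) -> Path:
    """Map a documentation URL to <docs_dir>/<domain>.txt (illustrative only)."""
    domain = urlparse(url).netloc  # e.g. "docs.python.org"
    if not domain:
        # urlparse leaves netloc empty when the scheme is missing ("react.dev")
        raise ValueError(f"URL has no domain: {url!r}")
    return docs_dir.expanduser() / f"{domain}.txt"


print(default_output_path("https://docs.python.org/3/"))
```

Note that the path component (`/3/`) plays no part in the file name, so two runs against different sections of the same domain would presumably write to the same file unless `--output` is given.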

### Options

- `--output PATH` - Custom output path (default: `~/.claude/docs/<domain>.txt`)
- `--model MODEL` - OpenAI model to use (default: `gpt-5-mini`)
- `--max-concurrent-summaries N` - Concurrent LLM requests (default: 10)
- `--show-urls` - Preview discovered URLs without processing
- `--max-urls N` - Limit number of URLs to process

### Examples

```bash
# Basic usage - saves to ~/.claude/docs/docs.python.org.txt
llmtxt https://docs.python.org/3/

# Use a different model
llmtxt https://react.dev --model gpt-4o

# Preview URLs before processing (no API calls)
llmtxt https://react.dev --show-urls

# Limit scope for testing
llmtxt https://docs.python.org --max-urls 50

# Custom output location
llmtxt https://react.dev --output ./my-docs/react.txt

# Process with higher concurrency (if you have high rate limits)
llmtxt https://fastapi.tiangolo.com --max-concurrent-summaries 20
```

## Searching and Listing

This tool focuses on **generating** llms-brief.txt files. For searching and listing, use standard Unix tools:

### Search Documentation

```bash
# Search all docs
rg "async functions" ~/.claude/docs/

# Search specific file
rg "hooks" ~/.claude/docs/react.dev.txt

# Case-insensitive search
rg -i "error handling" ~/.claude/docs/

# Show context around matches
rg -C 2 "api" ~/.claude/docs/

# Or use grep
grep -r "async" ~/.claude/docs/
```

### List Documentation

```bash
# List all docs
ls ~/.claude/docs/

# List with details
ls -lh ~/.claude/docs/

# Count entries in a file
grep -c "^Title:" ~/.claude/docs/react.dev.txt

# Find all docs and show their line counts
find ~/.claude/docs/ -name "*.txt" -exec wc -l {} +
```

**Why use standard tools?** They're:
- Already installed on most systems (`grep`, `ls`, `find`); ripgrep (`rg`) is a separate, optional install
- More powerful and flexible
- Well-documented
- Composable with other commands
- Fast, even across many files

## How It Works

### URL Discovery

The tool discovers pages with a breadth-first crawl:
- Explores links up to 3 levels deep from your starting URL
- Automatically excludes assets (CSS, JS, images) and non-documentation pages
- Normalizes URLs so the same page is not processed twice
- Typically discovers 100 to 300 or more pages on a documentation site
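
A depth-limited breadth-first crawl with deduplication can be sketched as follows. This is a minimal illustration under stated assumptions, not the tool's implementation: the `links` dictionary stands in for fetching and parsing real pages, and `normalize`, `discover`, the same-host check, and the `ASSET_SUFFIXES` list are hypothetical choices made for the example.

```python
from collections import deque
from urllib.parse import urldefrag, urlparse

ASSET_SUFFIXES = (".css", ".js", ".png", ".jpg", ".svg")  # assumed, not the tool's list


def normalize(url: str) -> str:
    """Drop the #fragment and trailing slashes so equivalent URLs compare equal."""
    url, _fragment = urldefrag(url)
    return url.rstrip("/")


def discover(start: str, links: dict[str, list[str]], max_depth: int = 3) -> list[str]:
    """Breadth-first walk over `links`, which maps a normalized URL to its outgoing links."""
    start = normalize(start)
    host = urlparse(start).netloc
    seen = {start}
    order = [start]
    queue = deque([(start, 0)])
    while queue:
        url, depth = queue.popleft()
        if depth == max_depth:
            continue  # pages at the depth limit are kept, but their links are not followed
        for raw in links.get(url, []):
            link = normalize(raw)
            parsed = urlparse(link)
            if parsed.netloc != host or parsed.path.lower().endswith(ASSET_SUFFIXES):
                continue  # skip other sites and static assets
            if link not in seen:
                seen.add(link)
                order.append(link)
                queue.append((link, depth + 1))
    return order


site = {
    "https://example.com/docs": [
        "https://example.com/docs/intro/",       # trailing slash
        "https://example.com/docs/intro#setup",  # same page, different fragment
        "https://example.com/static/site.css",   # asset
        "https://other.example.org/page",        # different host
    ],
    "https://example.com/docs/intro": ["https://example.com/docs/api"],
}
print(discover("https://example.com/docs/", site))
```

The `seen` set is what keeps each page from being queued more than once, and storing the depth alongside each queued URL is what enforces the depth limit.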

### Content Processing Pipeline

```
URL Discovery → Content Extraction → LLM Summarization → File Generation
```

1. **Crawl**: Discover all documentation URLs
2. **Extract**: Convert HTML to markdown using trafilatura
3. **Summarize**: Generate structured summaries using OpenAI
4. **Cache**: Store summaries in `.llmsbrieftxt_cache/` for reuse
5. **Generate**: Compile into searchable llms-brief.txt format
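
The caching step is what makes reruns cheap. One way to picture it is a small on-disk cache keyed by a hash of the URL, sketched below. The file layout, the key scheme, and the names `cached_summary` and `fake_summarize` are assumptions made for illustration; the actual contents of `.llmsbrieftxt_cache/` may be organized differently.

```python
import hashlib
import json
import tempfile
from pathlib import Path
from typing import Callable


def cached_summary(url: str, summarize: Callable[[str], dict], cache_dir: Path) -> dict:
    """Return the cached summary for `url`, calling `summarize` only on a cache miss."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    key = hashlib.sha256(url.encode("utf-8")).hexdigest()  # assumed key scheme
    path = cache_dir / f"{key}.json"
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))
    summary = summarize(url)
    path.write_text(json.dumps(summary), encoding="utf-8")
    return summary


calls = []  # records every time the (fake) LLM is actually invoked


def fake_summarize(url: str) -> dict:
    """Stand-in for the LLM call so the sketch runs without an API key."""
    calls.append(url)
    return {"title": "Intro", "summary": "Getting started guide."}


with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp) / ".llmsbrieftxt_cache"
    first = cached_summary("https://example.com/docs/intro", fake_summarize, cache)
    second = cached_summary("https://example.com/docs/intro", fake_summarize, cache)

print(len(calls))  # the second lookup is served from disk
```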

### Output Format

Each entry in the generated file contains:
```
Title: [Page Name](URL)
Keywords: searchable, terms, functions, concepts
Summary: One-line description of page content

```
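
Because the format is line-oriented, generated files are also easy to read programmatically. The parser below is a minimal sketch based only on the three-field layout shown above; `parse_entries` is a hypothetical helper (not part of llmsbrieftxt), the sample entries are made up, and URLs containing a closing parenthesis are not handled.

```python
import re

TITLE_RE = re.compile(r"^Title: \[(?P<title>.*)\]\((?P<url>[^)]*)\)$")


def parse_entries(text: str) -> list[dict]:
    """Split llms-brief.txt content into dicts with title, url, keywords, and summary."""
    entries = []
    current = None
    for line in text.splitlines():
        line = line.strip()
        match = TITLE_RE.match(line)
        if match:
            # A Title line starts a new entry
            current = {"title": match["title"], "url": match["url"], "keywords": [], "summary": ""}
            entries.append(current)
        elif current is not None and line.startswith("Keywords:"):
            raw = line[len("Keywords:"):]
            current["keywords"] = [k.strip() for k in raw.split(",") if k.strip()]
        elif current is not None and line.startswith("Summary:"):
            current["summary"] = line[len("Summary:"):].strip()
    return entries


sample = """Title: [Hooks](https://react.dev/reference/react/hooks)
Keywords: useState, useEffect, hooks
Summary: Overview of the built-in React Hooks.

Title: [asyncio](https://docs.python.org/3/library/asyncio.html)
Keywords: async, await, event loop
Summary: Reference for the asyncio library.
"""

entries = parse_entries(sample)
for entry in entries:
    print(entry["title"], "->", entry["url"])
```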

## Development

### Setup

```bash
# Clone and install with dev dependencies
git clone https://github.com/stevennevins/llmsbrief.git
cd llmsbrief
uv sync --group dev
```

### Running Tests

```bash
# All tests
uv run pytest

# Unit tests only
uv run pytest tests/unit/

# Specific test file
uv run pytest tests/unit/test_cli.py

# With verbose output
uv run pytest -v
```

### Code Quality

```bash
# Lint code
uv run ruff check llmsbrieftxt/ tests/

# Format code
uv run ruff format llmsbrieftxt/ tests/

# Type checking
uv run mypy llmsbrieftxt/
```

## Configuration

### Default Settings

- **Crawl Depth**: 3 levels (hardcoded)
- **Output Location**: `~/.claude/docs/<domain>.txt`
- **Cache Directory**: `.llmsbrieftxt_cache/`
- **OpenAI Model**: `gpt-5-mini`
- **Concurrent Requests**: 10

### Environment Variables

- `OPENAI_API_KEY` - Required for all operations

## Usage Tips

### Managing API Costs

- Use `--show-urls` first to preview scope
- Use `--max-urls` to limit processing during testing
- Summaries are cached automatically, so rerunning is cheap
- Default model `gpt-5-mini` is cost-effective for most documentation

### Organizing Documentation

All docs are saved to `~/.claude/docs/` by domain name:
```
~/.claude/docs/
├── docs.python.org.txt
├── react.dev.txt
├── pytorch.org.txt
└── fastapi.tiangolo.com.txt
```

This makes it easy for Claude Code and other tools to find and reference documentation.

## Integrations

### Claude Code

This tool is designed to work seamlessly with Claude Code. Once you've generated documentation files, Claude can search and reference them during development sessions.

### MCP Servers

Generated llms-brief.txt files can be served via MCP (Model Context Protocol) servers. See the [mcpdoc project](https://github.com/langchain-ai/mcpdoc) for an example integration.

## Troubleshooting

### API Key Issues

```bash
# Verify API key is set
echo "$OPENAI_API_KEY"

# Set it if missing
export OPENAI_API_KEY="sk-your-api-key-here"
```

### Rate Limiting

If you hit rate limits, reduce concurrent requests:
```bash
llmtxt https://example.com --max-concurrent-summaries 5
```
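
To see why a lower value helps, it may be useful to picture how a concurrency cap generally works. The sketch below bounds in-flight requests with `asyncio.Semaphore`; it is an assumption about the general technique, not a description of the tool's internals, and `summarize_all` and `fake_summarize` are hypothetical names.

```python
import asyncio


async def summarize_all(urls, summarize, max_concurrent: int = 10):
    """Run `summarize(url)` for every URL with at most `max_concurrent` calls in flight."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def bounded(url):
        async with semaphore:  # waits here while `max_concurrent` calls are active
            return await summarize(url)

    # gather returns results in the same order as `urls`
    return await asyncio.gather(*(bounded(u) for u in urls))


stats = {"in_flight": 0, "peak": 0}


async def fake_summarize(url: str) -> str:
    """Stand-in for an LLM request; tracks how many calls overlap."""
    stats["in_flight"] += 1
    stats["peak"] = max(stats["peak"], stats["in_flight"])
    await asyncio.sleep(0.01)  # simulates the network round trip
    stats["in_flight"] -= 1
    return f"summary of {url}"


urls = [f"https://example.com/page{i}" for i in range(20)]
results = asyncio.run(summarize_all(urls, fake_summarize, max_concurrent=5))
print(len(results), "summaries, peak concurrency", stats["peak"])
```

With a cap like this, fewer requests reach the API at any one moment, which lowers the chance of tripping a rate limit at the cost of a longer total run.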

### Large Documentation Sites

For very large sites (500+ pages):
1. Start with `--show-urls` to see scope
2. Use `--max-urls` to process in batches
3. Increase `--max-concurrent-summaries` if you have high rate limits

## Migrating from 0.x

Version 1.0.0 removes the `search` and `list` subcommands in favor of Unix tools, and generation now runs through the `llmtxt` command without a `generate` subcommand:

```bash
# Before (v0.x)
llmsbrieftxt generate https://docs.python.org/3/
llmsbrieftxt search "async"
llmsbrieftxt list

# After (v1.0.0)
llmtxt https://docs.python.org/3/
rg "async" ~/.claude/docs/
ls ~/.claude/docs/
```

**Why the change?** Focus on doing one thing well. Search and list are better served by mature, powerful Unix tools you already have.

## License

MIT

## Contributing

Contributions welcome! Please:
1. Run tests: `uv run pytest`
2. Lint code: `uv run ruff check llmsbrieftxt/ tests/`
3. Format code: `uv run ruff format llmsbrieftxt/ tests/`
4. Check types: `uv run mypy llmsbrieftxt/`
5. Submit a PR

## Links

- **Homepage**: https://github.com/stevennevins/llmsbrief
- **Issues**: https://github.com/stevennevins/llmsbrief/issues
- **llms.txt Spec**: https://llmstxt.org/
