Metadata-Version: 2.4
Name: attachments
Version: 0.24.0
Summary: The Python funnel for LLM context - turn any file into model-ready text + images, in one line.
Author-email: Maxime Rivest <mrive052@gmail.com>
License-Expression: MIT
Project-URL: Homepage, https://maximerivest.github.io/attachments/
Project-URL: Documentation, https://maximerivest.github.io/attachments/
Project-URL: Repository, https://github.com/maximerivest/attachments
Project-URL: Bug Tracker, https://github.com/maximerivest/attachments/issues
Project-URL: Changelog, https://maximerivest.github.io/attachments/changelog.html
Keywords: llm,ai,pdf,document,multimodal,openai,claude,context,attachments
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Classifier: Topic :: Text Processing
Classifier: Topic :: Multimedia
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: requests>=2.25.0
Requires-Dist: beautifulsoup4>=4.9.0
Requires-Dist: pillow>=11.2.1
Requires-Dist: pydantic>=2.11.5
Requires-Dist: pdfplumber>=0.11.6
Requires-Dist: pillow-heif>=0.22.0
Requires-Dist: nbformat>=5.10.4
Requires-Dist: copykitten>=1.2.2
Requires-Dist: typer>=0.12
Requires-Dist: pyperclip>=1.8
Requires-Dist: pytesseract>=0.3.13
Requires-Dist: pypdfium2>=4.30.0
Provides-Extra: dev
Requires-Dist: pytest>=8.0.0; extra == "dev"
Requires-Dist: pytest-randomly>=3.15.0; extra == "dev"
Requires-Dist: pytest-cov>=4.0.0; extra == "dev"
Requires-Dist: black>=24.0.0; extra == "dev"
Requires-Dist: ruff>=0.5.6; extra == "dev"
Requires-Dist: mypy>=1.8.0; extra == "dev"
Requires-Dist: pre-commit>=3.6.0; extra == "dev"
Requires-Dist: mystmd>=1.2.0; extra == "dev"
Requires-Dist: jupyter>=1.0.0; extra == "dev"
Requires-Dist: jupytext>=1.16.0; extra == "dev"
Requires-Dist: sphinx>=7.0.0; extra == "dev"
Requires-Dist: sphinx-autodoc2>=0.5.0; extra == "dev"
Requires-Dist: ipykernel>=6.29.5; extra == "dev"
Requires-Dist: dspy-ai>=2.6.24; extra == "dev"
Requires-Dist: python-pptx>=1.0.2; extra == "dev"
Provides-Extra: common
Requires-Dist: pandas>=2.0.0; extra == "common"
Requires-Dist: Pillow>=10.0.0; extra == "common"
Requires-Dist: pdfplumber>=0.10.0; extra == "common"
Requires-Dist: python-pptx>=1.0.2; extra == "common"
Requires-Dist: python-docx>=1.1.0; extra == "common"
Requires-Dist: openpyxl>=3.1.0; extra == "common"
Requires-Dist: pypdfium2>=4.0.0; extra == "common"
Provides-Extra: office
Requires-Dist: python-pptx>=0.6.0; extra == "office"
Requires-Dist: python-docx>=1.1.0; extra == "office"
Requires-Dist: openpyxl>=3.1.0; extra == "office"
Provides-Extra: 3d
Requires-Dist: bpy>=4.4.0; (python_version >= "3.11" and python_version < "3.12") and extra == "3d"
Requires-Dist: trimesh>=4.3.0; extra == "3d"
Requires-Dist: pyrender>=0.1.45; extra == "3d"
Requires-Dist: numpy>=1.24.0; extra == "3d"
Provides-Extra: pdf
Requires-Dist: pdfplumber>=0.7.0; extra == "pdf"
Requires-Dist: pypdf>=5.0.0; extra == "pdf"
Requires-Dist: pypdfium2>=4.26.0; extra == "pdf"
Provides-Extra: extended
Requires-Dist: pillow-heif>=0.22.0; extra == "extended"
Requires-Dist: pytesseract>=0.3.10; extra == "extended"
Requires-Dist: copykitten>=1.0.0; extra == "extended"
Provides-Extra: browser
Requires-Dist: playwright>=1.43.0; extra == "browser"
Provides-Extra: most
Requires-Dist: dspy-ai>=2.6.24; extra == "most"
Requires-Dist: pandas>=1.3.0; extra == "most"
Requires-Dist: Pillow>=8.0.0; extra == "most"
Requires-Dist: pdfplumber>=0.7.0; extra == "most"
Requires-Dist: python-pptx>=0.6.21; extra == "most"
Requires-Dist: python-docx>=0.8.11; extra == "most"
Requires-Dist: openpyxl>=3.0.9; extra == "most"
Requires-Dist: pillow-heif>=0.22.0; extra == "most"
Requires-Dist: pytesseract>=0.3.10; extra == "most"
Requires-Dist: copykitten>=1.0.0; extra == "most"
Requires-Dist: pypdf>=5.0.0; extra == "most"
Requires-Dist: pypdfium2>=4.26.0; extra == "most"
Requires-Dist: playwright>=1.43.0; extra == "most"
Requires-Dist: trimesh>=4.3.0; extra == "most"
Requires-Dist: pyrender>=0.1.45; extra == "most"
Provides-Extra: test
Requires-Dist: pytest>=8.0.0; extra == "test"
Requires-Dist: pytest-randomly>=3.15.0; extra == "test"
Requires-Dist: pytest-cov>=4.0.0; extra == "test"
Requires-Dist: tox>=4.0.0; extra == "test"
Requires-Dist: coverage[toml]>=7.4.0; extra == "test"
Dynamic: license-file

# Attachments – the Python funnel for LLM context

### Turn *any* file into model-ready text + images, in one line

Most users will not have to learn anything more than: `Attachments("path/to/file.pdf")`

## 🎬 Demo

![Demo](https://github.com/MaximeRivest/attachments/raw/main/demo_full.gif)

> **TL;DR**  
> ```bash
> pip install attachments
> ```
> ```python
> from attachments import Attachments
> ctx = Attachments("https://github.com/MaximeRivest/attachments/raw/main/src/attachments/data/sample.pdf",
>                    "https://github.com/MaximeRivest/attachments/raw/refs/heads/main/src/attachments/data/sample_multipage.pptx")
> llm_ready_text   = str(ctx)       # all extracted text, already "prompt-engineered"
> llm_ready_images = ctx.images     # list[str] – base64 PNGs
> ```


Attachments aims to be **the** community funnel from *file → text + base64 images* for LLMs.  
Stop rewriting that plumbing in every project – contribute your *loader / modifier / presenter / refiner / adapter* plugin instead!

## Quick-start ⚡

```bash
pip install attachments
```

### Try it now with sample files

```python
from attachments import Attachments
from attachments.data import get_sample_path

# Option 1: Use included sample files (works offline)
pdf_path = get_sample_path("sample.pdf")
txt_path = get_sample_path("sample.txt")
ctx = Attachments(pdf_path, txt_path)

print(str(ctx))      # Pretty text view
print(len(ctx.images))  # Number of extracted images

# Try different file types
docx_path = get_sample_path("test_document.docx")
csv_path = get_sample_path("test.csv")
json_path = get_sample_path("sample.json")

ctx = Attachments(docx_path, csv_path, json_path)
print(f"Processed {len(ctx)} files: Word doc, CSV data, and JSON")

# Option 2: Use URLs (same API, works with any URL)
ctx = Attachments(
    "https://github.com/MaximeRivest/attachments/raw/main/src/attachments/data/sample.pdf",
    "https://github.com/MaximeRivest/attachments/raw/main/src/attachments/data/sample_multipage.pptx"
)

print(str(ctx))      # Pretty text view  
print(len(ctx.images))  # Number of extracted images
```

### Advanced usage with DSL

```python
from attachments import Attachments

a = Attachments(
    "https://github.com/MaximeRivest/attachments/raw/main/src/attachments/data/" \
    "sample_multipage.pptx[3-5]"
)
print(a)           # pretty text view
len(a.images)      # 👉 base64 PNG list
```

### Send to OpenAI

```bash
pip install openai
```

```python
from openai import OpenAI
from attachments import Attachments

pptx = Attachments("https://github.com/MaximeRivest/attachments/raw/main/src/attachments/data/sample_multipage.pptx[3-5]")

client = OpenAI()
resp = client.chat.completions.create(
    model="gpt-4.1-nano",
    messages=pptx.openai_chat("Analyse the following document:")
)
print(resp.choices[0].message.content)
```

Or with the Responses API:

```python
from openai import OpenAI
from attachments import Attachments

pptx = Attachments("https://github.com/MaximeRivest/attachments/raw/main/src/attachments/data/sample_multipage.pptx[3-5]")

client = OpenAI()
resp = client.responses.create(
    input=pptx.openai_responses("Analyse the following document:"),
    model="gpt-4.1-nano"
)
print(resp.output[0].content[0].text)
```

### Send to Anthropic / Claude

```bash
pip install anthropic
```

```python
import anthropic
from attachments import Attachments

pptx = Attachments("https://github.com/MaximeRivest/attachments/raw/main/src/attachments/data/sample_multipage.pptx[3-5]")

msg = anthropic.Anthropic().messages.create(
    model="claude-3-5-haiku-20241022",
    max_tokens=8_192,
    messages=pptx.claude("Analyse the slides:")
)
print(msg.content)
```

### DSPy Integration

We have a special `dspy` module that allows you to use Attachments with DSPy.

```bash
pip install dspy
```

```python
from attachments.dspy import Attachments  # Automatic type registration!
import dspy

# Configure DSPy
dspy.configure(lm=dspy.LM('openai/gpt-4.1-nano'))

# Both approaches work seamlessly:

# 1. Class-based signatures (recommended)
class DocumentAnalyzer(dspy.Signature):
    """Analyze document content and extract insights."""
    document: Attachments = dspy.InputField()
    insights: str = dspy.OutputField()

# 2. String-based signatures (works automatically!)
analyzer = dspy.Signature("document: Attachments -> insights: str")

# Use with any file type
doc = Attachments("report.pdf")
result = dspy.ChainOfThought(DocumentAnalyzer)(document=doc)
print(result.insights)
```

**Key Features:**
- 🎯 **Automatic Type Registration**: Import from `attachments.dspy` and use `Attachments` in string signatures immediately
- 🔄 **Seamless Serialization**: Handles complex multimodal content automatically  
- 🖼️ **Image Support**: Base64 images work perfectly with vision models
- 📝 **Rich Text**: Preserves formatting and structure
- 🧩 **Full Compatibility**: Works with all DSPy signatures and programs

### Optional: CSS Selector Highlighting 🎯

For advanced web scraping with visual element highlighting in screenshots:

```bash
# Install Playwright for CSS selector highlighting
pip install playwright
playwright install chromium

# Or with uv
uv add playwright
uv run playwright install chromium

# Or install with browser extras
pip install attachments[browser]
playwright install chromium
```

**What this enables:**
- 🎯 Visual highlighting of selected elements with animations
- 📸 High-quality screenshots with JavaScript rendering  
- 🎨 Professional styling with glowing borders and badges
- 🔍 Perfect for extracting specific page elements

```python
# CSS selector highlighting examples
title = Attachments("https://example.com[select:h1]")  # Highlights H1 elements
content = Attachments("https://example.com[select:.content]")  # Highlights .content class
main = Attachments("https://example.com[select:#main]")  # Highlights #main ID

# Multiple elements with counters and different colors
multi = Attachments("https://example.com[select:h1, .important][viewport:1920x1080]")
```

*Note: Without Playwright, CSS selectors still work for text extraction, but no visual highlighting screenshots are generated.*

### Optional: Microsoft Office Support 📄

For dedicated Microsoft Office format processing:

```bash
# Install just Office format support
pip install attachments[office]

# Or with uv
uv add attachments[office]
```

**What this enables:**
- 📊 PowerPoint (.pptx) slide extraction and processing
- 📝 Word (.docx) document text and formatting extraction  
- 📈 Excel (.xlsx) spreadsheet data analysis
- 🎯 Lightweight installation for Office-only workflows

```python
# Office format examples
presentation = Attachments("slides.pptx[1-5]")  # Extract specific slides
document = Attachments("report.docx")           # Word document processing
spreadsheet = Attachments("data.xlsx[summary:true]")  # Excel with summary
```

*Note: Office formats are also included in the `common` and `most` dependency groups.*

### Advanced Pipeline Processing

For power users, use the full grammar system with composable pipelines:

```python
from attachments import attach, load, modify, present, refine, adapt

# Custom processing pipeline
result = (attach("document.pdf[pages:1-5]") 
         | load.pdf_to_pdfplumber 
         | modify.pages 
         | present.markdown + present.images
         | refine.add_headers | refine.truncate
         | adapt.claude("Analyze this content"))

# Web scraping pipeline
title = (attach("https://en.wikipedia.org/wiki/Llama[select:title]")
        | load.url_to_bs4 
        | modify.select 
        | present.text)

# Reusable processors
csv_analyzer = (load.csv_to_pandas 
               | modify.limit 
               | present.head + present.summary + present.metadata
               | refine.add_headers)

# Use as function
result = csv_analyzer("data.csv[limit:1000]")
analysis = result.claude("What patterns do you see?")
```



---

## DSL cheatsheet 📝

| Piece                     | Example                   | Notes                                         |
| ------------------------- | ------------------------- | --------------------------------------------- |
| **Select pages / slides** | `report.pdf[1,3-5,-1]`    | Supports ranges and negative indices (`-1` = last) |
| **Image transforms**      | `photo.jpg[rotate:90]`    | Any token implemented by a `Transform` plugin |
| **Data-frame summary**    | `table.csv[summary:true]` | Ships with a quick `df.describe()` renderer   |
| **Web content selection** | `url[select:title]`       | CSS selectors for web scraping               |
| **Web element highlighting** | `url[select:h1][viewport:1920x1080]` | Visual highlighting in screenshots |
| **Image processing**      | `image.jpg[crop:100,100,400,300][rotate:45]` | Chain multiple transformations |
| **Content filtering**     | `doc.pdf[format:plain][images:false]` | Control text/image extraction |
| **Repository processing** | `repo[files:false][ignore:standard]` | Smart codebase analysis |
| **Content Control**       | `doc.pdf[truncate:5000]`  | *Explicit* truncation when needed (user choice) |
| **Repository Filtering**  | `repo[max_files:100]`     | Limit file processing (performance, not content) |
| **Processing Limits**     | `data.csv[limit:1000]`    | Row limits for large datasets (explicit) |

> 🔒 **Default Philosophy**: All content preserved unless you explicitly request limits
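
The bracketed commands above are plain-text suffixes on the source string, so the gist of how they might be split out can be sketched in a few lines. This is a hypothetical parser for illustration only, not the library's actual implementation (the real DSL also validates types, defaults, and allowable values):

```python
import re

def parse_dsl(source: str):
    """Split 'report.pdf[pages:1-5][images:false]' into the bare path
    and a dict of DSL commands (illustrative sketch only)."""
    commands = {}
    matches = list(re.finditer(r"\[([^\[\]]*)\]", source))
    path = source[:matches[0].start()] if matches else source
    for m in matches:
        token = m.group(1)
        if ":" in token:
            key, _, value = token.partition(":")
            commands[key.strip()] = value.strip()
        else:
            # A bare group like [3-5] is a page/slide selection
            commands["select_pages"] = token.strip()
    return path, commands

path, cmds = parse_dsl("report.pdf[pages:1-5][images:false]")
```

Multiple bracket groups simply accumulate, which is why chained commands like `[crop:...][rotate:...]` compose left to right.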

---

## Supported formats (out of the box)

* **Docs**: PDF, PowerPoint (`.pptx`), CSV, TXT, Markdown, HTML
* **Images**: PNG, JPEG, BMP, GIF, WEBP, HEIC/HEIF, …
* **Web**: URLs with BeautifulSoup parsing and CSS selection
* **Archives**: ZIP files → image collections with tiling
* **Repositories**: Git repos with smart ignore patterns
* **Data**: CSV with pandas, JSON

---

## Advanced Examples 🧩

### **Multimodal Document Processing**
```python
# PDF with image tiling and analysis
result = Attachments("report.pdf[tile:2x3][resize_images:400]")
analysis = result.claude("Analyze both text and visual elements")

# Multiple file types in one context
ctx = Attachments("report.pdf", "data.csv", "chart.png")
comparison = ctx.openai("Compare insights across all documents")
```

### **Repository Analysis**
```python
# Codebase structure only
structure = Attachments("./my-project[mode:structure]")

# Full codebase analysis with smart filtering
codebase = Attachments("./my-project[ignore:standard]")
review = codebase.claude("Review this code for best practices")

# Custom ignore patterns
filtered = Attachments("./app[ignore:.env,*.log,node_modules]")
```

### **Web Scraping with CSS Selectors**
```python
# Extract specific content from web pages
title = Attachments("https://example.com[select:h1]")
paragraphs = Attachments("https://example.com[select:p]")

# Visual highlighting in screenshots with animations
highlighted = Attachments("https://example.com[select:h1][viewport:1920x1080]")
# Creates screenshot with animated highlighting of h1 elements

# Multiple element highlighting with counters
multi_select = Attachments("https://example.com[select:h1, .important][fullpage:true]")
# Shows "H1 (1/3)", "DIV (2/3)", etc. with different colors for multiple selections

# Pipeline approach for complex scraping
content = (attach("https://en.wikipedia.org/wiki/Llama[select:p]")
          | load.url_to_bs4 
          | modify.select 
          | present.text
          | refine.truncate)
```

### **Image Processing Chains**
```python
# HEIC support with transformations
processed = Attachments("IMG_2160.HEIC[crop:100,100,400,300][rotate:90]")

# Batch image processing with tiling
collage = Attachments("photos.zip[tile:3x2][resize_images:800]")
description = collage.claude("Describe this image collage")
```

### **Data Analysis Workflows**
```python
# Rich data presentation
data_summary = Attachments("sales_data.csv[limit:1000][summary:true]")

# Pipeline for complex data processing
result = (attach("data.csv[limit:500]")
         | load.csv_to_pandas 
         | modify.limit
         | present.head + present.summary + present.metadata
         | refine.add_headers
         | adapt.claude("What trends do you see?"))
```

---

## Extending 🧩

```python
# my_ocr_presenter.py
from attachments.core import Attachment, presenter

@presenter
def ocr_text(att: Attachment, pil_image: 'PIL.Image.Image') -> Attachment:
    """Extract text from images using OCR."""
    try:
        import pytesseract
        
        # Extract text using OCR
        extracted_text = pytesseract.image_to_string(pil_image)
        
        # Add OCR text to attachment
        att.text += f"\n## OCR Extracted Text\n\n{extracted_text}\n"
        
        # Add metadata
        att.metadata['ocr_extracted'] = True
        att.metadata['ocr_text_length'] = len(extracted_text)
        
        return att
        
    except ImportError:
        att.text += "\n## OCR Not Available\n\nInstall pytesseract: pip install pytesseract\n"
        return att
```

**How it works:**
1. **Save the file** anywhere in your project
2. **Import it** before using attachments: `import my_ocr_presenter`
3. **Use automatically**: `Attachments("scanned_document.png")` will now include OCR text

**Other extension points:**
- `@loader` - Add support for new file formats
- `@modifier` - Add new transformations (crop, rotate, etc.)
- `@presenter` - Add new content extraction methods
- `@refiner` - Add post-processing steps
- `@adapter` - Add new API format outputs

---

## API reference (essentials)

| Object / method         | Description                                                     |
| ----------------------- | --------------------------------------------------------------- |
| `Attachments(*sources)` | Many `Attachment` objects flattened into one container          |
| `Attachments.text`      | All text joined with blank lines                                |
| `Attachments.images`    | Flat list of base64 PNGs                                        |
| `.claude(prompt="")`    | Claude API format with image support                            |
| `.openai_chat(prompt="")` | OpenAI Chat Completions API format                            |
| `.openai_responses(prompt="")` | OpenAI Responses API format (different structure)       |
| `.openai(prompt="")`    | Alias for openai_chat (backwards compatibility)                 |
| `.dspy()`               | DSPy BaseType-compatible objects                                 |
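
As an illustration of what the adapter output plausibly looks like, here is a hand-built sketch of the OpenAI Chat Completions message shape with base64 images. The exact structure is an assumption for illustration; rely on `.openai_chat()` for the real payload:

```python
import base64

def to_openai_chat(prompt: str, text: str, images: list) -> list:
    """Build a Chat Completions-style message list with text plus
    base64 PNG image parts (illustrative shape, not the library's)."""
    content = [{"type": "text", "text": f"{prompt}\n\n{text}"}]
    for b64_png in images:
        content.append({
            "type": "image_url",
            "image_url": {"url": f"data:image/png;base64,{b64_png}"},
        })
    return [{"role": "user", "content": content}]

fake_png = base64.b64encode(b"\x89PNG...").decode()
messages = to_openai_chat("Summarize:", "extracted text", [fake_png])
```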

### Grammar System (Advanced)

| Namespace | Purpose | Examples |
|-----------|---------|----------|
| `load.*` | File format → objects | `pdf_to_pdfplumber`, `csv_to_pandas`, `url_to_bs4` |
| `modify.*` | Transform objects | `pages`, `limit`, `select`, `crop`, `rotate` |
| `present.*` | Extract content | `text`, `images`, `markdown`, `summary` |
| `refine.*` | Post-process | `truncate`, `add_headers`, `tile_images` |
| `adapt.*` | Format for APIs | `claude`, `openai`, `dspy` |

**Operators**: `|` (sequential), `+` (additive)
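
The two operators map naturally onto Python's `__or__` and `__add__`. A toy sketch of the idea on plain strings (hypothetical, not the library's implementation): `|` feeds one step's output into the next, while `+` runs both steps on the same input and joins the results:

```python
class Step:
    """Composable pipeline step: `|` chains, `+` combines outputs."""
    def __init__(self, fn):
        self.fn = fn
    def __call__(self, x):
        return self.fn(x)
    def __or__(self, other):
        return Step(lambda x: other(self(x)))   # sequential
    def __add__(self, other):
        return Step(lambda x: self(x) + other(x))  # additive

upper = Step(str.upper)
exclaim = Step(lambda s: s + "!")
header = Step(lambda s: "# " + s)

pipeline = (upper | exclaim) + header
```

This is why `present.markdown + present.images` yields both representations while `refine.add_headers | refine.truncate` applies one refinement after the other.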

---

### Roadmap
- [ ] **Documentation**: Architecture, Grammar, How to extend, examples (at least 1 per pipeline), DSL, API reference
- [ ] **Test coverage**: 100% for pipelines, 100% for DSL.
- [ ] **More pipelines**: Google Suite, Google Drive, email(!?), YouTube URLs, X links, ChatGPT URLs, Slack URLs(?), data (Parquet, DuckDB, Arrow, SQLite), etc.
- [ ] **More adapters**: Bedrock, Azure, OpenRouter, Ollama(?), LiteLLM, LangChain, vLLM(?), SGLang(?), cosette, claudette, etc.
- [ ] **Add .audio and .video**: and corresponding pipelines.

Join us – file an issue or open a PR! 🚀

## Star History

[![Star History Chart](https://api.star-history.com/svg?repos=maximerivest/attachments&type=Date)](https://www.star-history.com/#maximerivest/attachments&Date)

## Installation

```bash
# Stable release (recommended for most users)
pip install attachments

# Pre-release / alpha testing (latest features, may have bugs)
pip install --pre attachments
```

### 🧪 Alpha Testing

We're actively developing new features! If you want to test the latest capabilities:

**Install alpha version:**
```bash
pip install --pre attachments
```

**What's new in alpha:**
- 🔍 Enhanced DSL cheatsheet with types, defaults, and allowable values
- 📊 Automatic DSL command discovery and documentation
- 🚀 Improved logging and verbosity system
- 🛠️ Better error messages and suggestions

**Feedback welcome:** [GitHub Issues](https://github.com/MaximeRivest/attachments/issues)

---
