Metadata-Version: 2.4
Name: grepctl
Version: 0.1.1
Summary: One-command orchestration for multimodal semantic search in BigQuery
Author-email: Gregory Mulla <gregory.cr.mulla@gmail.com>
Maintainer-email: Gregory Mulla <gregory.cr.mulla@gmail.com>
License: MIT
Project-URL: Homepage, https://github.com/yourusername/grepctl
Project-URL: Documentation, https://github.com/yourusername/grepctl#readme
Project-URL: Repository, https://github.com/yourusername/grepctl.git
Project-URL: Issues, https://github.com/yourusername/grepctl/issues
Project-URL: Changelog, https://github.com/yourusername/grepctl/blob/main/CHANGELOG.md
Keywords: bigquery,semantic-search,vector-search,multimodal,google-cloud,machine-learning,embeddings,gcs,vertex-ai,document-search
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Database
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Classifier: Topic :: Text Processing :: Indexing
Requires-Python: >=3.11
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: click>=8.2.1
Requires-Dist: google-cloud-aiplatform>=1.113.0
Requires-Dist: google-cloud-bigquery>=3.37.0
Requires-Dist: google-cloud-documentai>=3.6.0
Requires-Dist: google-cloud-speech>=2.33.0
Requires-Dist: google-cloud-videointelligence>=2.16.2
Requires-Dist: google-cloud-vision>=3.10.2
Requires-Dist: google-cloud-storage>=2.10.0
Requires-Dist: pypdf2>=3.0.1
Requires-Dist: pyyaml>=6.0.2
Requires-Dist: rich>=14.1.0
Requires-Dist: tqdm>=4.67.1
Provides-Extra: dev
Requires-Dist: pytest>=7.4.0; extra == "dev"
Requires-Dist: pytest-cov>=4.1.0; extra == "dev"
Requires-Dist: black>=23.0.0; extra == "dev"
Requires-Dist: flake8>=6.0.0; extra == "dev"
Requires-Dist: mypy>=1.5.0; extra == "dev"
Requires-Dist: diagrams>=0.24.4; extra == "dev"
Provides-Extra: multimedia
Requires-Dist: imageio>=2.37.0; extra == "multimedia"
Requires-Dist: imageio-ffmpeg>=0.6.0; extra == "multimedia"
Requires-Dist: matplotlib>=3.10.6; extra == "multimedia"
Provides-Extra: research
Requires-Dist: arxiv>=2.2.0; extra == "research"
Requires-Dist: datasets>=4.0.0; extra == "research"
Requires-Dist: tensorflow-datasets>=4.9.9; extra == "research"
Requires-Dist: yt-dlp>=2025.9.5; extra == "research"
Requires-Dist: graphviz>=0.20.3; extra == "research"
Dynamic: license-file

# grepctl - BigQuery Semantic Search Orchestrator

[![PyPI version](https://badge.fury.io/py/grepctl.svg)](https://pypi.org/project/grepctl/)
[![Python 3.11+](https://img.shields.io/badge/python-3.11+-blue.svg)](https://www.python.org/downloads/)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)

🚀 **One-command multimodal semantic search across your entire data lake using BigQuery ML and Google Cloud AI.**

## 📦 Installation

```bash
# Install from PyPI
pip install grepctl
```

## 🎯 Quick Start - From Zero to Search in One Command

```bash
# Complete setup with automatic data ingestion
grepctl init all --bucket your-bucket --auto-ingest

# Start searching immediately
grepctl search "find all mentions of machine learning"
```

That's it! The system automatically:
- ✅ Enables all required Google Cloud APIs
- ✅ Creates BigQuery dataset and tables
- ✅ Deploys Vertex AI embedding models
- ✅ Ingests all 8 data modalities from your GCS bucket
- ✅ Generates 768-dimensional embeddings
- ✅ Configures semantic search with VECTOR_SEARCH

## 📊 What is grepctl?

`grepctl` is a powerful command-line orchestration tool that transforms your Google Cloud Storage data lake into a searchable knowledge base. It provides a unified interface for searching across **8 different data types**:
- 📄 **Text & Markdown** - Direct content extraction
- 📑 **PDF Documents** - OCR with Document AI
- 🖼️ **Images** - Vision API analysis (labels, text, objects, faces)
- 🎵 **Audio Files** - Speech-to-Text transcription
- 🎬 **Video Files** - Video Intelligence analysis
- 📊 **JSON & CSV** - Structured data parsing

All searchable through semantic understanding, not just keywords!

## 🏗️ Architecture Overview

```
┌─────────────────────────────────────────────────────────────┐
│                     GCS DATA LAKE                           │
│                    (Your Documents)                         │
│  📄 Text  📑 PDF  🖼️ Images  🎵 Audio  🎬 Video  📊 Data    │
└─────────────────────────┬───────────────────────────────────┘
                          │
                    ┌─────▼─────┐
                    │  grepctl  │ ← One command orchestration
                    └─────┬─────┘
                          │
        ┌─────────────────┼─────────────────┐
        ▼                 ▼                 ▼
┌──────────────┐  ┌──────────────┐  ┌──────────────┐
│ Ingestion    │  │ Google APIs  │  │ Processing   │
│ • 6 scripts  │  │ • Vision     │  │ • Extract    │
│ • All types  │  │ • Speech     │  │ • Transform  │
│              │  │ • Video      │  │ • Enrich     │
└──────┬───────┘  └──────┬───────┘  └──────┬───────┘
        └─────────────────┼─────────────────┘
                          ▼
                ┌─────────────────────┐
                │  BigQuery Dataset   │
                │   search_corpus     │
                └─────────┬───────────┘
                          ▼
                ┌─────────────────────┐
                │   Vertex AI         │
                │ text-embedding-004  │
                │  768 dimensions     │
                └─────────┬───────────┘
                          ▼
                ┌─────────────────────┐
                │  Semantic Search    │
                │   VECTOR_SEARCH     │
                │  <1 second query    │
                └─────────────────────┘
```
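
The last stage of the diagram ranks stored embeddings by their distance to the query embedding. As a rough illustration of that idea only (this is not grepctl's or BigQuery's implementation), a brute-force cosine-distance ranking might look like the sketch below. The function names and the 3-dimensional toy vectors are made up for the example; the real embeddings have 768 dimensions.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Return 1 - cosine similarity; 0.0 means the vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine distance is undefined for zero vectors")
    return 1.0 - dot / (norm_a * norm_b)


def top_k(query: Sequence[float],
          corpus: Dict[str, Sequence[float]],
          k: int = 10) -> List[Tuple[str, float]]:
    """Rank corpus entries (doc_id -> vector) by distance to the query."""
    scored = [(doc_id, cosine_distance(query, vec)) for doc_id, vec in corpus.items()]
    scored.sort(key=lambda pair: pair[1])  # smallest distance first
    return scored[:k]


# Toy vectors standing in for 768-dimensional embeddings (hypothetical data).
corpus = {
    "doc-a": [1.0, 0.0, 0.0],
    "doc-b": [0.0, 1.0, 0.0],
    "doc-c": [0.9, 0.1, 0.0],
}
ranked = top_k([1.0, 0.0, 0.0], corpus, k=2)
print(ranked)
```

A smaller distance means a closer match, which is why the results are sorted in ascending order.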

## 🛠️ Installation & Setup

### Prerequisites

1. **Google Cloud Project** with billing enabled
2. **Python 3.11+**
3. **gcloud CLI** authenticated with appropriate permissions

### Install from PyPI

```bash
# Install the package
pip install grepctl

# Verify installation
grepctl --help
```

### Install from Source

```bash
# Clone repository
git clone https://github.com/yourusername/grepctl.git
cd grepctl

# Install with uv (recommended)
uv sync
uv run grepctl --help

# Or install with pip
pip install -e .
grepctl --help
```

### Complete System Setup

#### Option 1: Fully Automated (Recommended)

```bash
# One command does everything!
grepctl init all --bucket your-bucket --auto-ingest

# This single command:
# 1. Enables 7 Google Cloud APIs
# 2. Creates BigQuery dataset and 3 tables
# 3. Deploys 3 Vertex AI models
# 4. Ingests all files from GCS
# 5. Generates embeddings
# 6. Sets up semantic search
```

#### Option 2: Step-by-Step Control

```bash
# Enable APIs
grepctl apis enable --all

# Initialize BigQuery
grepctl init dataset
grepctl init models

# Ingest data
grepctl ingest all

# Generate embeddings
grepctl index update

# Start searching
grepctl search "your query"
```

## 🔍 Using the System

### Command Line Interface

```bash
# Search across all data
grepctl search "machine learning algorithms"

# Search specific modalities
grepctl search "error handling" -k 20 -m pdf -m markdown

# Check system status
grepctl status

# View all available commands
grepctl --help
```

### SQL Interface

```sql
-- Direct semantic search
WITH query_embedding AS (
  SELECT ml_generate_embedding_result AS embedding
  FROM ML.GENERATE_EMBEDDING(
    MODEL `your-project.mmgrep.text_embedding_model`,
    (SELECT 'machine learning' AS content),
    STRUCT(TRUE AS flatten_json_output)
  )
)
-- VECTOR_SEARCH returns each matched row under `base`; a smaller distance is a closer match
SELECT base.doc_id, base.source, base.text_content, distance
FROM VECTOR_SEARCH(
  TABLE `your-project.mmgrep.search_corpus`,
  'embedding',
  (SELECT embedding FROM query_embedding),
  top_k => 10
)
ORDER BY distance;
```
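
If you would rather run a query of this shape from Python, the sketch below shows one possible way using the `google-cloud-bigquery` client. The helper names `build_search_sql` and `run_search` are hypothetical and not part of grepctl. The table and model names are the placeholders used in this README. The search text is passed as a query parameter, while `top_k` is validated and inlined into the SQL.

```python
from typing import Dict, List


def build_search_sql(project: str, dataset: str = "mmgrep", top_k: int = 10) -> str:
    """Build the semantic-search SQL; the query text is bound later as @query_text."""
    if isinstance(top_k, bool) or not isinstance(top_k, int) or top_k < 1:
        raise ValueError("top_k must be a positive integer")
    return f"""
WITH query_embedding AS (
  SELECT ml_generate_embedding_result AS embedding
  FROM ML.GENERATE_EMBEDDING(
    MODEL `{project}.{dataset}.text_embedding_model`,
    (SELECT @query_text AS content),
    STRUCT(TRUE AS flatten_json_output)
  )
)
SELECT base.doc_id, base.source, base.text_content, distance
FROM VECTOR_SEARCH(
  TABLE `{project}.{dataset}.search_corpus`,
  'embedding',
  (SELECT embedding FROM query_embedding),
  top_k => {top_k}
)
ORDER BY distance
"""


def run_search(project: str, query_text: str, top_k: int = 10) -> List[Dict]:
    """Run the query against BigQuery (needs credentials and an initialized dataset)."""
    from google.cloud import bigquery  # imported lazily so the SQL builder works offline

    client = bigquery.Client(project=project)
    job_config = bigquery.QueryJobConfig(
        query_parameters=[
            bigquery.ScalarQueryParameter("query_text", "STRING", query_text)
        ]
    )
    rows = client.query(build_search_sql(project, top_k=top_k), job_config=job_config).result()
    return [dict(row.items()) for row in rows]


# Printing the SQL needs no cloud access; run_search() does.
print(build_search_sql("your-project", top_k=5))
```

Calling `run_search("your-project", "machine learning")` would return one dictionary per matching row, ordered from closest to farthest.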

### Python API (when installed from source)

```python
from bq_semgrep.search.vector_search import SemanticSearch
from bq_semgrep.bigquery.connection import BigQueryClient
from bq_semgrep.config import load_config

# Load configuration
config = load_config()
client = BigQueryClient(config)

# Initialize searcher
searcher = SemanticSearch(client, config)

# Search across all modalities
results = searcher.search(
    query="neural networks",
    top_k=20,
    source_filter=['pdf', 'images'],
    use_rerank=True
)
```

## 📈 System Capabilities

### Current Status (Beta)
- ✅ **425+ documents** indexed across 8 modalities
- ✅ **768-dimensional embeddings** for semantic understanding
- ✅ **Sub-second query response** times
- ✅ **100% embedding coverage** for all documents
- ✅ **5 Google Cloud APIs** integrated
- ✅ **Auto-recovery** from embedding issues

### Supported Operations
| Operation | Command | Description |
|-----------|---------|-------------|
| **Setup** | `grepctl init all --auto-ingest` | Complete one-command setup |
| **Ingest** | `grepctl ingest all` | Process all file types |
| **Index** | `grepctl index update` | Generate embeddings |
| **Fix** | `grepctl fix embeddings` | Auto-fix dimension issues |
| **Search** | `grepctl search "query"` | Semantic search |
| **Status** | `grepctl status` | System health check |

## 🧰 grepctl Commands

### Complete CLI Management

```bash
# System initialization
grepctl init all --bucket your-bucket --auto-ingest

# API management
grepctl apis enable --all
grepctl apis check

# Data ingestion
grepctl ingest pdf        # Process PDFs
grepctl ingest images     # Analyze images with Vision API
grepctl ingest audio      # Transcribe audio
grepctl ingest video      # Analyze videos

# Index management
grepctl index rebuild     # Rebuild from scratch
grepctl index update      # Update missing embeddings
grepctl index verify      # Check embedding health

# Troubleshooting
grepctl fix embeddings    # Fix dimension issues
grepctl fix stuck         # Handle stuck processing
grepctl fix validate      # Check data integrity

# Search
grepctl search "query" -k 20 -o json
```
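
To drive the CLI from a script, one option is a thin `subprocess` wrapper. The sketch below uses only the flags that appear in this README (`-k`, `-m`, `-o`). The helper names are hypothetical, and because the shape of the JSON output is not documented here, the wrapper returns the raw standard output rather than parsing it.

```python
import subprocess
from typing import Iterable, List, Optional


def build_search_command(query: str,
                         top_k: Optional[int] = None,
                         modalities: Iterable[str] = (),
                         output: Optional[str] = None) -> List[str]:
    """Assemble the argv list for `grepctl search` without invoking a shell."""
    cmd = ["grepctl", "search", query]
    if top_k is not None:
        cmd += ["-k", str(top_k)]
    for modality in modalities:
        cmd += ["-m", modality]  # -m may be repeated, one modality per flag
    if output is not None:
        cmd += ["-o", output]
    return cmd


def run_cli_search(query: str, **options) -> str:
    """Run grepctl and return its stdout; raises CalledProcessError on failure."""
    completed = subprocess.run(
        build_search_command(query, **options),
        capture_output=True,
        text=True,
        check=True,
    )
    return completed.stdout


print(build_search_command("error handling", top_k=20, modalities=["pdf", "markdown"]))
```

Passing the arguments as a list avoids shell quoting problems when the query contains spaces or quotes.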

### Configuration

grepctl uses `~/.grepctl.yaml` for configuration:

```yaml
project_id: your-project
dataset: mmgrep
bucket: your-bucket
location: US
batch_size: 100
chunk_size: 1000
```
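
For scripts that need the same settings, a minimal sketch of reading this file might look like the following. It assumes the six keys shown above are the complete schema and reuses the example values as defaults; both are assumptions, not documented grepctl behavior. The class and function names are hypothetical.

```python
from dataclasses import dataclass, fields
from pathlib import Path
from typing import Optional


@dataclass
class GrepctlConfig:
    # Defaults mirror the example values above; they are placeholders, not real defaults.
    project_id: str = "your-project"
    dataset: str = "mmgrep"
    bucket: str = "your-bucket"
    location: str = "US"
    batch_size: int = 100
    chunk_size: int = 1000


def config_from_mapping(raw: dict) -> GrepctlConfig:
    """Build a config object, rejecting keys this sketch does not know about."""
    known = {f.name for f in fields(GrepctlConfig)}
    unknown = set(raw) - known
    if unknown:
        raise KeyError(f"unknown config keys: {sorted(unknown)}")
    return GrepctlConfig(**raw)


def read_config_file(path: Optional[Path] = None) -> GrepctlConfig:
    """Load ~/.grepctl.yaml (or the given path) using PyYAML."""
    import yaml  # PyYAML is listed among the package dependencies

    path = path or Path.home() / ".grepctl.yaml"
    with open(path, encoding="utf-8") as handle:
        raw = yaml.safe_load(handle) or {}
    return config_from_mapping(raw)


cfg = config_from_mapping({"project_id": "demo-project", "batch_size": 50})
print(cfg)
```

Keys that are missing from the mapping fall back to the placeholder defaults, so a partial file still produces a complete object.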

## 📊 Supported Data Types

| Modality | Extensions | Processing Method | Google API Used |
|----------|------------|-------------------|-----------------|
| Text | .txt, .log | Direct extraction | — |
| Markdown | .md | Markdown parsing | — |
| PDF | .pdf | OCR extraction | Document AI |
| Images | .jpg, .png, .gif | Visual analysis | Vision API |
| Audio | .mp3, .wav, .m4a | Transcription | Speech-to-Text |
| Video | .mp4, .avi, .mov | Frame + audio analysis | Video Intelligence |
| JSON | .json, .jsonl | Structured parsing | — |
| CSV | .csv, .tsv | Tabular analysis | — |
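
The table above amounts to a lookup from file extension to modality. A minimal sketch of that lookup is shown below. The extensions come from the table; the labels `pdf`, `markdown`, `images`, and `video` match the `-m` values used elsewhere in this README, while `text`, `audio`, `json`, and `csv` are assumed names. The function `modality_for` is hypothetical.

```python
from pathlib import PurePosixPath
from typing import Optional

# Extension -> modality, transcribed from the table of supported data types.
EXTENSION_TO_MODALITY = {
    ".txt": "text", ".log": "text",
    ".md": "markdown",
    ".pdf": "pdf",
    ".jpg": "images", ".png": "images", ".gif": "images",
    ".mp3": "audio", ".wav": "audio", ".m4a": "audio",
    ".mp4": "video", ".avi": "video", ".mov": "video",
    ".json": "json", ".jsonl": "json",
    ".csv": "csv", ".tsv": "csv",
}


def modality_for(object_name: str) -> Optional[str]:
    """Return the modality for a GCS object name, or None if unsupported."""
    suffix = PurePosixPath(object_name).suffix.lower()  # case-insensitive match
    return EXTENSION_TO_MODALITY.get(suffix)


for name in ["reports/Q3.PDF", "notes.md", "archive.zip"]:
    print(name, "->", modality_for(name))
```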

## 🚀 Advanced Features

### Multimodal Search
Search across all data types simultaneously:
```bash
# Find mentions across PDFs, images, and videos
grepctl search "quarterly revenue" -m pdf -m images -m video
```

### Automatic Processing
- **Vision API** extracts text, labels, objects from images
- **Document AI** performs OCR on scanned PDFs
- **Speech-to-Text** transcribes audio with punctuation
- **Video Intelligence** analyzes frames and transcribes speech

### Error Recovery
```bash
# Automatic fix for common issues
grepctl fix embeddings    # Fixes dimension mismatches
grepctl fix stuck         # Clears stuck processing
```

## 📚 Documentation

- **[grepctl Documentation](grepctl_README.md)** - Complete grepctl usage guide
- **[Architecture Diagrams](visualize_architecture.py)** - System visualization
- **[Lessons Learned](lessons_learned.md)** - Implementation insights
- **[API Integration](api_integration_detail.png)** - Google Cloud API details

## 🔧 Troubleshooting

### Common Issues & Solutions

| Issue | Solution |
|-------|----------|
| "Permission denied" | Run `gcloud auth login` and make sure your account has the BigQuery Admin role |
| "Dataset not found" | Run `grepctl init dataset` |
| "Embedding dimension mismatch" | Run `grepctl fix embeddings` |
| "No search results" | Check `grepctl status` and run `grepctl index update` |
| "API not enabled" | Run `grepctl apis enable --all` |
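
If you wrap grepctl in your own tooling, the table above can be turned into a small lookup that suggests a command for a given error message. This is only a sketch: the matching is a case-insensitive substring test against the phrases in the table, the exact error text grepctl prints may differ, and `suggest_fix` is a hypothetical helper.

```python
from typing import Optional

# (phrase from the table, suggested command) pairs, checked in order.
REMEDIES = [
    ("permission denied", "gcloud auth login"),
    ("dataset not found", "grepctl init dataset"),
    ("embedding dimension mismatch", "grepctl fix embeddings"),
    ("no search results", "grepctl index update"),
    ("api not enabled", "grepctl apis enable --all"),
]


def suggest_fix(message: str) -> Optional[str]:
    """Return the first suggested command whose phrase appears in the message."""
    lowered = message.lower()
    for phrase, command in REMEDIES:
        if phrase in lowered:
            return command
    return None


print(suggest_fix("Error: Dataset not found: mmgrep"))
```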

### Quick Diagnostics

```bash
# Check everything
grepctl status

# Verify APIs
grepctl apis check

# Check embeddings
grepctl index verify

# Fix any issues
grepctl fix embeddings
```

## 🎯 Example Use Cases

1. **Code Search**: Find code patterns across repositories
2. **Document Discovery**: Search PDFs for specific topics
3. **Media Analysis**: Find content in images and videos
4. **Log Analysis**: Semantic search through log files
5. **Data Mining**: Query structured data semantically

## 📈 Performance

- **Ingestion**: ~50 docs/second for text
- **Embedding Generation**: ~20 docs/second
- **Search Latency**: <1 second for most queries
- **Storage**: ~500MB for 425+ documents
- **Embedding size**: 768 dimensions per document (text-embedding-004)
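
As a back-of-the-envelope use of these figures, the sketch below estimates how long ingestion and embedding generation would take for a given number of documents. It assumes the approximate rates above hold constant. The ingestion rate is quoted for text only, so other modalities will likely be slower. The function name is hypothetical.

```python
def estimate_seconds(num_docs: int,
                     ingest_rate: float = 50.0,
                     embed_rate: float = 20.0) -> dict:
    """Estimate per-stage durations in seconds from docs/second throughput."""
    if num_docs < 0 or ingest_rate <= 0 or embed_rate <= 0:
        raise ValueError("document count must be >= 0 and rates must be > 0")
    return {
        "ingest": num_docs / ingest_rate,
        "embed": num_docs / embed_rate,
    }


estimate = estimate_seconds(425)
print(estimate)
```

For 425 text documents this works out to roughly 8.5 seconds of ingestion and about 21 seconds of embedding generation.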

## 📦 Package Information

- **PyPI**: https://pypi.org/project/grepctl/
- **Version**: 0.1.1
- **Requirements**: Python 3.11+, Google Cloud Project
- **License**: MIT

## 🤝 Contributing

Contributions welcome! Please see [CONTRIBUTING.md](CONTRIBUTING.md) for guidelines.

### Development Setup

```bash
# Clone the repository
git clone https://github.com/yourusername/grepctl.git
cd grepctl

# Install in development mode with uv
uv sync
uv add --dev pytest black flake8

# Run tests
uv run pytest

# Format code
uv run black .
```

## 📄 License

MIT License - see [LICENSE](LICENSE) for details.

## 🙏 Acknowledgments

Built with:
- Google BigQuery ML
- Vertex AI (text-embedding-004)
- Google Cloud Vision, Document AI, Speech-to-Text, Video Intelligence APIs
- Python, Click, and Rich CLI libraries

## 📊 Citation

If you use grepctl in your research or project, please cite:

```bibtex
@software{grepctl2024,
  title = {grepctl: One-Command Orchestration for Multimodal Semantic Search in BigQuery},
  author = {Mulla, Gregory},
  year = {2024},
  url = {https://github.com/yourusername/grepctl},
  version = {0.1.0}
}
```

---

**Ready to transform your data lake into a searchable knowledge base?**

```bash
pip install grepctl
grepctl init all --bucket your-bucket --auto-ingest
```

🎉 That's all it takes!
