Metadata-Version: 2.4
Name: go-bm25
Version: 0.2.7
Summary: High-performance BM25 ranking algorithm implementation with Go core and Python bindings
Home-page: https://github.com/pentney/go-bm25
Author: BM25 Contributors
Maintainer: BM25 Contributors
License: MIT
Project-URL: Homepage, https://github.com/pentney/go-bm25
Project-URL: Documentation, https://github.com/pentney/go-bm25#readme
Project-URL: Repository, https://github.com/pentney/go-bm25
Project-URL: Bug Tracker, https://github.com/pentney/go-bm25/issues
Keywords: bm25,search,ranking,information-retrieval,go,python-bindings
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Information Analysis
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Classifier: Topic :: Text Processing :: Linguistic
Requires-Python: >=3.8
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: rank-bm25>=0.2.2
Requires-Dist: matplotlib>=3.5.0
Requires-Dist: numpy>=1.21.0
Provides-Extra: dev
Requires-Dist: pytest>=6.0; extra == "dev"
Requires-Dist: pytest-cov>=2.0; extra == "dev"
Requires-Dist: black>=22.0; extra == "dev"
Requires-Dist: flake8>=4.0; extra == "dev"
Requires-Dist: mypy>=0.950; extra == "dev"
Provides-Extra: docs
Requires-Dist: sphinx>=4.0; extra == "docs"
Requires-Dist: sphinx-rtd-theme>=1.0; extra == "docs"
Dynamic: home-page
Dynamic: license-file
Dynamic: requires-python

# BM25 Go Library

[![Go Version](https://img.shields.io/badge/Go-1.18+-blue.svg)](https://golang.org)
[![License](https://img.shields.io/badge/License-MIT-green.svg)](LICENSE)
[![Go Report Card](https://goreportcard.com/badge/github.com/pentney/go-bm25)](https://goreportcard.com/report/github.com/pentney/go-bm25)

A **highly optimized Go implementation** of the BM25 ranking algorithm with **Python bindings** and **PostgreSQL integration**. This library provides lightning-fast text search and ranking capabilities suitable for production applications, search engines, and data analysis systems.

## 📋 **Table of Contents**

- [Features](#-features)
- [Performance Characteristics](#-performance-characteristics)
- [Architecture](#️-architecture)
- [Quick Start](#-quick-start)
- [Installation](#-installation)
- [Documentation](#-documentation)
- [Use Cases](#-use-cases)
- [Performance Optimizations](#️-performance-optimizations)
- [Testing & Benchmarks](#-testing--benchmarks)
- [API Reference](#-api-reference)
- [Why This Library?](#-why-this-library)
- [Contributing](#-contributing)
- [License](#-license)
- [Acknowledgments](#-acknowledgments)

## 🚀 **Features**

### **Core BM25 Implementation**
- **High-Performance**: Optimized Go implementation with sub-millisecond search times
- **Memory Efficient**: Smart memory management with configurable capacities
- **Thread-Safe**: Concurrent search operations with read-write mutexes
- **Flexible Tokenization**: Pluggable tokenizer interface with built-in implementations

### **Smart Tokenization**
- **Intelligent Stopword Removal**: Curated list of essential function words only
- **Multi-Language Support**: English, French, German, Spanish, Italian, Portuguese, Russian, Japanese
- **Punctuation Handling**: Automatic cleaning and normalization
- **Number Preservation**: Maintains numeric tokens for technical content
- **Compound Word Support**: Preserves multi-word terms as single tokens while also indexing individual components

### **Compound Token Support**
- **Multi-Word Terms**: Handle technical terms, medical conditions, product names, etc.
- **Dual Indexing**: Both compound and individual terms are indexed for comprehensive search
- **Flexible Configuration**: Add/remove compound words dynamically
- **Domain-Specific**: Configure for medical, technical, legal, or any specialized vocabulary
- **Natural Search**: Queries for individual terms find documents containing compound terms

### **PostgreSQL Integration**
- **Persistent Storage**: ACID-compliant document storage and retrieval
- **Automatic Schema Management**: Tables and indexes created automatically
- **High-Performance Queries**: Leverages PostgreSQL's query optimizer
- **Batch Mode**: In-memory processing for high-frequency searches (10-100x faster)

### **Python Bindings**
- **Native Performance**: Go-based core with Python interface
- **Easy Integration**: Simple Python API for existing Python applications
- **Cross-Platform**: Works on Linux, macOS, and Windows
- **bm25s Compatible**: 100% drop-in replacement for popular bm25s library

## 📊 **Performance Characteristics**

| Dataset Size | Search Time | Memory Usage | Throughput |
|--------------|-------------|--------------|------------|
| 1K documents | < 0.1ms | ~10MB | 10,000+ queries/sec |
| 10K documents | < 0.5ms | ~100MB | 5,000+ queries/sec |
| 100K documents | < 2ms | ~1GB | 1,000+ queries/sec |
| 1M documents | < 10ms | ~10GB | 100+ queries/sec |

**Batch Mode Performance**: 10-500x faster than database queries for high-frequency operations.

## 🏆 **Benchmark Comparisons**

### **vs bm25s (Pure Python)**
- **Indexing Speed**: 2-5x faster document processing
- **Search Performance**: 3-10x faster query execution
- **Memory Efficiency**: Better memory usage for large datasets
- **Production Ready**: Go-based core vs Python implementation

### **vs rank-bm25**
- **Performance**: Competitive or better performance
- **Features**: Additional PostgreSQL integration and batch mode
- **API**: Cleaner, more intuitive interface
- **Extensibility**: Pluggable tokenization and custom backends

## 🏗️ **Architecture**

```
┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│   Python API    │    │   Go Core       │    │  PostgreSQL    │
│   (Bindings)    │◄──►│   (BM25 +       │◄──►│  (Storage +     │
│                 │    │    Index)       │    │   Batch Mode)   │
└─────────────────┘    └─────────────────┘    └─────────────────┘
```

## 🚀 **Quick Start**

### **Go Usage**

```go
package main

import (
    "fmt"
    "github.com/pentney/go-bm25"
)

func main() {
    // Create index with capacity estimates
    index := bm25.NewIndex(1000, 10000)
    
    // Use SmartTokenizer for intelligent text processing
    tokenizer := bm25.NewEnglishSmartTokenizer()
    
    // Add documents
    index.AddDocument("doc1", "The quick brown fox jumps over the lazy dog", tokenizer)
    index.AddDocument("doc2", "A quick brown dog runs fast", tokenizer)
    
    // Search with BM25 ranking
    results := index.Search("quick brown fox", tokenizer, 10)
    
    for _, result := range results {
        fmt.Printf("%s: %.4f\n", result.DocID, result.Score)
    }
}
```

### **Smart Tokenization**

The library includes an intelligent `SmartTokenizer` that provides:

```go
// Create English tokenizer with stopword removal
tokenizer := bm25.NewEnglishSmartTokenizer()

// Multi-language support
frTokenizer := bm25.NewSmartTokenizer("fr")
deTokenizer := bm25.NewSmartTokenizer("de")

// Intelligent text processing
text := "The quick brown fox jumps over the lazy dog"
tokens := tokenizer.Tokenize(text)
// Result: ["quick", "brown", "fox", "jumps", "lazy", "dog"]
// Note: "the", "over" removed as stopwords
```

**Features:**
- **Essential Stopwords Only**: Removes only the most common function words
- **Number Preservation**: Keeps numeric tokens for technical content
- **Punctuation Handling**: Automatic cleaning and normalization
- **Multi-Language**: Support for 8+ languages

### **PostgreSQL Integration**

```go
// Create PostgreSQL index
config := bm25.DefaultPostgresConfig()
pgIndex, err := bm25.NewPostgresIndex(config)

// High-performance batch mode for stable indexes
batchMode, err := pgIndex.NewBatchMode()
results := batchMode.Search("query", tokenizer, 10)
```

### **Batch Mode for High-Performance Processing**

For high-frequency searches on stable indexes, use Batch Mode:

```go
// Create batch mode (loads all data into memory)
batchMode, err := pgIndex.NewBatchMode()
defer batchMode.Close()

// Sub-millisecond search performance
for i := 0; i < 1000; i++ {
    results := batchMode.Search("query", tokenizer, 10)
    // 10-500x faster than database queries!
}

// Check memory usage
estimatedBytes, docCount, termCount := batchMode.GetMemoryStats()
fmt.Printf("Memory: %s, Docs: %d, Terms: %d\n", 
    formatBytes(estimatedBytes), docCount, termCount)

// Refresh when database changes
batchMode.Refresh()
```

**Perfect for:**
- **High-frequency searches** (100+ queries/second)
- **Stable indexes** with infrequent updates
- **Batch processing** operations
- **Real-time search** applications

### **Python Usage**

```python
import bm25

# Create index
index = bm25.Index(1000, 10000)

# Add documents
index.add_document("doc1", "The quick brown fox jumps over the lazy dog")

# Search
results = index.search("quick brown fox", 10)
for doc_id, score in results:
    print(f"{doc_id}: {score}")
```

### **bm25s API Compatibility**

This library provides **true drop-in replacement** compatibility with the popular `bm25s` library, making migration seamless:

```python
# bm25s-style usage (exact same API!)
from bm25.bm25s_compat import BM25

# Create index (identical to bm25s.BM25())
index = BM25(documents)

# Index documents (identical to bm25s indexing)
index.index(tokenized_documents)

# Search (identical to bm25s.retrieve())
results = index.retrieve([query_tokens], k=10)

# Get scores (identical to bm25s.get_scores())
scores = index.get_scores(query_tokens)

# All other methods work identically
doc_count = index.get_document_count()
term_count = index.get_term_count()
avgdl = index.get_avgdl()
```

**Perfect Migration Path:**
```python
# Before (bm25s)
import bm25s
index = bm25s.BM25(documents)
results = index.retrieve([query_tokens], k=10)

# After (this library) - just change the import!
from bm25.bm25s_compat import BM25
index = BM25(documents)
results = index.retrieve([query_tokens], k=10)
# Everything else stays exactly the same!
```

**Migration Benefits:**
- **100% API Compatible**: Exact same methods, signatures, and behavior
- **Better Performance**: 2-10x faster than pure Python implementations
- **Enhanced Features**: Smart tokenization, PostgreSQL support, batch mode
- **Production Ready**: Go-based core with Python convenience
- **Zero Code Changes**: Just change the import statement

**Performance Comparison:**
- **bm25s**: Pure Python, good for prototyping
- **This Library**: Go core + Python bindings, production performance

## 🔧 **Installation**

### **Go**

```bash
go get github.com/pentney/go-bm25
```

### **Python**

```bash
pip install bm25-go
```

**bm25s Compatibility Layer:**
```python
# For drop-in replacement of bm25s
from bm25.bm25s_compat import BM25

# Use exactly like bm25s
index = BM25(documents)
results = index.retrieve([query_tokens], k=10)
```

### **PostgreSQL Setup**

```bash
# Install PostgreSQL dependencies
go mod tidy

# Create database
createdb bm25

# Run schema setup (automatic in Go code)
```

## 📚 **Documentation**

- **[Main Documentation](README.md)**: This file - package overview and quick start
- **[PostgreSQL Guide](README_POSTGRES.md)**: Detailed PostgreSQL integration guide
- **[Examples](example/)**: Complete working examples for all features
- **[Benchmarks](postgres_benchmark.go)**: Performance benchmarks and comparisons

## 🎯 **Use Cases**

### **Search Engines**
- **Web Search**: High-performance document ranking
- **Enterprise Search**: Internal document repositories
- **E-commerce**: Product search and recommendations

### **Data Analysis**
- **Text Mining**: Large document collections
- **Content Analysis**: Document similarity and clustering
- **Research**: Academic paper search and ranking

### **Production Applications**
- **Content Management**: Fast document retrieval
- **API Services**: High-throughput search endpoints
- **Real-time Systems**: Low-latency search operations

## ⚡ **Performance Optimizations**

### **Memory Management**
- **Smart Capacity Planning**: Automatic size estimation with growth buffers
- **Efficient Data Structures**: Optimized maps and slices for Go
- **Garbage Collection**: Minimal allocation during search operations
- **Memory Pooling**: Reusable score maps to reduce allocations
- **Bulk Operations**: Efficient batch document processing with pre-allocated memory

### **Search Algorithms**
- **Heap-based Top-K**: Efficient selection of best results
- **Precomputed IDF**: Cached inverse document frequency values
- **Lazy Statistics**: On-demand computation of index statistics
- **Early Termination**: Stop processing when sufficient high-scoring results are found
- **Score Thresholding**: Skip documents below minimum score thresholds
- **Vectorized Scoring**: Process multiple terms simultaneously for better performance
- **Term Impact Sorting**: Process most impactful terms first for faster convergence

### **Caching & Optimization**
- **Search Result Caching**: LRU cache for frequently accessed queries
- **Configurable Parameters**: Tunable K1, B, and Epsilon values for different use cases
- **Batch Search**: Parallel processing of multiple queries
- **Concurrent Operations**: Thread-safe operations with controlled concurrency
- **Performance Monitoring**: Real-time memory usage and performance statistics

### **PostgreSQL Optimizations**
- **Batch Operations**: Efficient bulk document processing
- **Indexed Queries**: Optimized database schema with proper indexes
- **Connection Pooling**: Efficient database connection management

### **bm25s-Inspired Features**
- **Epsilon Smoothing**: Improved IDF calculation stability
- **Parameter Tuning**: Easy adjustment of ranking behavior
- **Early Termination**: Skip low-impact terms and documents
- **Memory Efficiency**: Optimized data structures and pooling
- **Batch Processing**: Efficient handling of large document collections

### **Performance Characteristics by Method**

| Search Method | Use Case | Performance | Memory |
|---------------|----------|-------------|---------|
| `Search()` | General purpose | Good | Standard |
| `SearchOptimized()` | Top-K results | Better | Standard |
| `VectorizedSearch()` | Complex queries | Best | Higher |
| `SearchWithThreshold()` | Filtered results | Good | Standard |
| `BatchSearch()` | Multiple queries | Excellent | Higher |
| `SearchWithCache()` | Repeated queries | Best | Higher |

### **Parameter Tuning Guide**

**Conservative Settings** (K1=1.0, B=0.5, Epsilon=0.25):
- Good for general-purpose search
- Balanced precision and recall
- Suitable for most applications

**Default Settings** (K1=1.2, B=0.75, Epsilon=0.25):
- Standard BM25 behavior
- Good balance of performance and accuracy
- Recommended starting point

**Aggressive Settings** (K1=1.5, B=0.8, Epsilon=0.1):
- Higher precision, lower recall
- Better for focused searches
- Faster early termination

**Very Aggressive Settings** (K1=2.0, B=0.9, Epsilon=0.05):
- Maximum precision
- Fastest search performance
- Best for high-frequency queries

### **Memory Usage Optimization**

- **Capacity Planning**: Pre-allocate based on expected document count
- **Batch Processing**: Use `BulkAddDocuments()` for large collections
- **Cache Management**: Adjust cache size based on available memory
- **Memory Monitoring**: Use `GetPerformanceStats()` to track usage
- **Garbage Collection**: Automatic cleanup of temporary objects

## 🧪 **Testing & Benchmarks**

```bash
# Run all tests
go test -v

# Run optimization benchmarks
go test -bench=BenchmarkSearchMethods -v
go test -bench=BenchmarkParameterConfigurations -v
go test -bench=BenchmarkCaching -v
go test -bench=BenchmarkBatchOperations -v

# Run PostgreSQL benchmarks
go test -bench=Postgres -v

# Run batch mode benchmarks
go test -bench=BenchmarkBatchModeVsPostgres -v
```

## 🔍 **API Reference**

### **Core Types**

- `Index`: Main BM25 index with configurable capacities and parameters
- `BM25Params`: Configurable K1, B, and Epsilon parameters
- `SearchCache`: LRU cache for search results
- `SmartTokenizer`: Intelligent tokenization with stopword removal
- `SearchResult`: Search result with document ID and BM25 score
- `BatchSearchResult`: Result from batch search operations
- `PostgresIndex`: PostgreSQL-backed persistent index
- `BatchMode`: High-performance in-memory processing mode

### **Key Methods**

- `Search()`: Perform BM25 search with ranking
- `SearchOptimized()`: Optimized search with early termination
- `VectorizedSearch()`: Vectorized scoring for complex queries
- `SearchWithThreshold()`: Search with score thresholding
- `BatchSearch()`: Process multiple queries concurrently
- `SearchWithCache()`: Search with optional result caching
- `BulkAddDocuments()`: Efficiently add multiple documents
- `GetPerformanceStats()`: Get memory and performance statistics
- `SetParameters()`: Update BM25 parameters dynamically

### **New Optimization Methods**

- `NewIndexWithParams()`: Create index with custom parameters
- `NewIndexWithCache()`: Create index with search result caching
- `EnableCache()`: Enable/disable result caching
- `SetCacheSize()`: Adjust cache size dynamically
- `GetTermImpact()`: Get term impact score
- `GetQueryTermsImpact()`: Get terms sorted by impact

## 🌟 **Why This Library?**

### **Performance**
- **Go Speed**: Native Go performance with optimized algorithms
- **Memory Efficiency**: Smart memory management for large datasets
- **Concurrent Access**: Thread-safe operations with controlled concurrency
- **Early Termination**: Stop processing when sufficient results are found
- **Vectorized Operations**: Process multiple terms simultaneously

### **Flexibility**
- **Multiple Backends**: In-memory, PostgreSQL, and Python bindings
- **Custom Tokenization**: Pluggable tokenizer interface
- **Configurable Parameters**: Tunable K1, B, and Epsilon values
- **Caching Options**: Optional search result caching
- **Batch Operations**: Efficient bulk document processing

### **Production Ready**
- **Comprehensive Testing**: Extensive test coverage with benchmarks
- **Error Handling**: Robust error handling and recovery
- **Documentation**: Complete examples and performance guides
- **Performance Monitoring**: Real-time statistics and memory tracking
- **Memory Optimization**: Automatic cleanup and efficient data structures

### **Migration Benefits**
- **bm25s Compatibility**: Familiar API for existing users
- **Performance Upgrade**: 2-10x faster than pure Python implementations
- **Feature Enhancement**: Smart tokenization, PostgreSQL, batch mode
- **Production Scaling**: Handle larger datasets with better performance
- **Advanced Optimizations**: Early termination, caching, vectorized search

## 🚀 **Quick Start with Optimizations**

### **Basic Usage with Custom Parameters**

```go
package main

import (
    "fmt"
    "github.com/pentney/go-bm25"
)

func main() {
    // Create index with custom parameters
    params := bm25.BM25Params{
        K1:      1.5,    // Higher term frequency saturation
        B:       0.8,    // Stronger length normalization
        Epsilon: 0.1,    // Lower threshold for better precision
    }
    
    index := bm25.NewIndexWithParams(1000, 10000, params)
    tokenizer := bm25.NewEnglishSmartTokenizer()
    
    // Add documents
    index.AddDocument("doc1", "machine learning algorithm", tokenizer)
    index.AddDocument("doc2", "database system performance", tokenizer)
    
    // Use optimized search
    results := index.SearchOptimized("machine learning", tokenizer, 5)
    
    for _, result := range results {
        fmt.Printf("%s: %.4f\n", result.DocID, result.Score)
    }
}
```

### **Advanced Usage with Caching**

```go
// Create cached index
cachedIndex := bm25.NewIndexWithCache(1000, 10000, 1000)
cachedIndex.EnableCache(true)

// Bulk add documents
documents := []bm25.Document{
    {ID: "doc1", Content: "machine learning algorithm"},
    {ID: "doc2", Content: "database system performance"},
    // ... more documents
}
cachedIndex.BulkAddDocuments(documents, tokenizer)

// Batch search with caching
queries := []string{"machine learning", "database system", "algorithm"}
batchResults := cachedIndex.BatchSearch(queries, tokenizer, 5)

// Get performance statistics
stats := cachedIndex.GetPerformanceStats()
fmt.Printf("Memory usage: %v MB\n", stats["memory_usage_mb"])
```

## 🤝 **Contributing**

We welcome contributions! Please see our contributing guidelines:

1. **Fork** the repository
2. **Create** a feature branch
3. **Make** your changes
4. **Add** tests for new functionality
5. **Submit** a pull request

## 📄 **License**

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.

## 🙏 **Acknowledgments**

- **BM25 Algorithm**: Based on the probabilistic relevance framework
- **bm25s Library**: Inspiration for parameter tuning and optimizations
- **Go Community**: Built with Go's excellent standard library
- **PostgreSQL**: Robust database backend for persistence
- **Python Community**: Python bindings for broader adoption

---

**Ready to build fast, scalable search?** Get started with `go get github.com/pentney/go-bm25` or `pip install bm25-go`!
