Metadata-Version: 2.4
Name: hallunox
Version: 0.5.2
Summary: A confidence-aware routing system for LLM hallucination detection using multi-signal approach
Home-page: https://github.com/convai-innovations/hallunox
Author: Nandakishor M
Author-email: Nandakishor M <support@convaiinnovations.com>
Maintainer-email: "Convai Innovations Pvt. Ltd." <support@convaiinnovations.com>
License: AGPL-3.0
Project-URL: Homepage, https://convaiinnovations.com
Project-URL: Repository, https://github.com/convai-innovations/hallunox
Project-URL: Documentation, https://hallunox.readthedocs.io
Project-URL: Bug Reports, https://github.com/convai-innovations/hallunox/issues
Project-URL: Source Code, https://github.com/convai-innovations/hallunox
Keywords: hallucination-detection,llm,confidence-estimation,model-reliability,uncertainty-quantification,ai-safety
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: GNU Affero General Public License v3
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Requires-Python: >=3.8
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: torch>=1.13.0
Requires-Dist: transformers>=4.45.0
Requires-Dist: datasets>=2.0.0
Requires-Dist: FlagEmbedding>=1.2.0
Requires-Dist: scikit-learn>=1.0.0
Requires-Dist: numpy==1.26.4
Requires-Dist: tqdm>=4.64.0
Requires-Dist: pathlib
Requires-Dist: Pillow>=8.0.0
Requires-Dist: bitsandbytes>=0.41.0
Requires-Dist: requests>=2.31.0
Requires-Dist: huggingface_hub>=0.23.0
Requires-Dist: dash>=3.0.0
Requires-Dist: dash-bootstrap-components>=2.0.0
Provides-Extra: dev
Requires-Dist: pytest>=7.0.0; extra == "dev"
Requires-Dist: pytest-cov>=3.0.0; extra == "dev"
Requires-Dist: black>=22.0.0; extra == "dev"
Requires-Dist: isort>=5.10.0; extra == "dev"
Requires-Dist: flake8>=4.0.0; extra == "dev"
Requires-Dist: mypy>=0.950; extra == "dev"
Provides-Extra: training
Requires-Dist: wandb>=0.12.0; extra == "training"
Requires-Dist: tensorboard>=2.8.0; extra == "training"
Provides-Extra: dashboard
Requires-Dist: dash>=3.0.0; extra == "dashboard"
Requires-Dist: dash-bootstrap-components>=2.0.0; extra == "dashboard"
Dynamic: author
Dynamic: home-page
Dynamic: license-file
Dynamic: requires-python

# HalluNox

**Confidence-Aware Routing for Large Language Model Reliability Enhancement**

A Python package implementing a multi-signal approach to pre-generation hallucination mitigation for Large Language Models. HalluNox combines semantic alignment measurement, internal convergence analysis, and learned confidence estimation to produce unified confidence scores for proactive routing decisions.

## ✨ Features

- **🎯 Pre-generation Hallucination Detection**: Assess model reliability before generation begins
- **🔄 Confidence-Aware Routing**: Automatically route queries based on estimated confidence
- **🧠 Multi-Signal Approach**: Combines semantic alignment, internal convergence, and learned confidence
- **⚡ Multi-Model Support**: Llama-3.2-3B-Instruct and MedGemma-4B-IT architectures
- **🏥 Medical Domain Specialization**: Enhanced MedGemma 4B-IT support with medical-grade confidence thresholds
- **🖼️ Multimodal Capabilities**: Image analysis and response generation for MedGemma models
- **📊 Comprehensive Evaluation**: Built-in metrics and routing strategy analysis
- **🚀 Easy Integration**: Simple API for both training and inference
- **🏃‍♂️ Performance Optimizations**: Optional LLM loading for faster initialization and lower memory usage
- **📝 Enhanced Query-Context**: Improved accuracy with structured prompt formatting
- **🎛️ Adaptive Thresholds**: Dynamic confidence thresholds based on model type (0.62 for medical, 0.65 for general)
- **💬 Response Generation**: Built-in response generation with confidence-gated output
- **🔧 Automatic Model Management**: Auto-download and configuration for supported models

## 🔬 Research Foundation

Based on the research paper "Confidence-Aware Routing for Large Language Model Reliability Enhancement: A Multi-Signal Approach to Pre-Generation Hallucination Mitigation" by Nandakishor M (Convai Innovations).

The approach implements deterministic routing to appropriate response pathways:

### General Models (Llama-3.2-3B)
- **High Confidence (≥0.65)**: Local generation  
- **Medium Confidence (0.60-0.65)**: Retrieval-augmented generation
- **Low Confidence (0.4-0.60)**: Route to larger models
- **Very Low Confidence (<0.4)**: Human review required

### Medical Models (MedGemma-4B-IT)
- **High Medical Confidence (≥0.60)**: Local generation with medical validation
- **Medium Medical Confidence (0.55-0.60)**: Medical literature verification required
- **Low Medical Confidence (0.50-0.55)**: Professional medical verification required
- **Very Low Medical Confidence (<0.50)**: Seek professional medical advice
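The two threshold tables above can be sketched as a simple routing function. This is an illustrative helper, not the package's API; the action names are made up for the example, only the numeric thresholds come from the lists above:

```python
def route(confidence: float, medical: bool = False) -> str:
    """Map a confidence score to a routing action (illustrative names)."""
    if medical:
        # Medical thresholds are more conservative than the general ones.
        if confidence >= 0.60:
            return "LOCAL_GENERATION_WITH_MEDICAL_VALIDATION"
        if confidence >= 0.55:
            return "MEDICAL_LITERATURE_VERIFICATION"
        if confidence >= 0.50:
            return "PROFESSIONAL_MEDICAL_VERIFICATION"
        return "SEEK_PROFESSIONAL_MEDICAL_ADVICE"
    if confidence >= 0.65:
        return "LOCAL_GENERATION"
    if confidence >= 0.60:
        return "RAG_RETRIEVAL"
    if confidence >= 0.40:
        return "ROUTE_TO_LARGER_MODEL"
    return "HUMAN_REVIEW"
```

Because the routing is deterministic, the same score always yields the same pathway, which makes the behavior easy to audit.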

## 🌐 Web Dashboard (NEW!)

HalluNox now includes a ChatGPT-like web interface for interactive confidence-aware conversations!

### 🎨 Dashboard Features

- **💬 ChatGPT-like Interface**: Modern, responsive chat interface
- **🔄 Conversation Management**: Create, save, edit, and delete conversation threads
- **🏥 Model Selection**: Switch between Llama-3.2-3B and MedGemma-4B models
- **⚙️ Real-time Settings**: Adjust confidence threshold, temperature, and max tokens
- **📊 Confidence Display**: Visual confidence scores with color-coded responses
- **🚫 Smart Blocking**: Automatically blocks low-confidence responses with helpful suggestions
- **📝 Message Actions**: Regenerate, copy, and edit conversation messages
- **📱 Responsive Design**: Works on desktop, tablet, and mobile devices
- **💾 Auto-save**: Conversations automatically saved with unique thread names

### 🚀 Quick Start Dashboard

```bash
# Install dashboard dependencies
pip install hallunox[dashboard]

# Launch the web interface
hallunox-dashboard

# Or with custom settings
hallunox-dashboard --host 0.0.0.0 --port 8080 --debug
```

The dashboard will be available at `http://localhost:8050`
by default.

### 🎛️ Dashboard Interface

- **Left Sidebar**: Conversation threads with search and management
- **Center**: Chat messages with confidence indicators
- **Right Panel**: Model settings and hyperparameter controls

**Confidence-Based Responses**:
- 🟢 **High Confidence (≥0.6)**: Response generated normally
- 🟡 **Medium Confidence (0.4-0.6)**: Response with caution warning  
- 🔴 **Low Confidence (<0.4)**: Response blocked with "I don't know" message and web search suggestion
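The dashboard's three color-coded buckets boil down to two cutoffs. A minimal sketch (the function name is an assumption, not part of the dashboard's code; the thresholds are the 0.6/0.4 values listed above):

```python
def dashboard_bucket(confidence: float) -> str:
    """Return the dashboard's display bucket for a confidence score."""
    if confidence >= 0.6:
        return "high"    # green: response generated normally
    if confidence >= 0.4:
        return "medium"  # yellow: response shown with a caution warning
    return "low"         # red: response blocked, web search suggested
```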

### 🔧 Dashboard Configuration

```python
from hallunox.dashboard import create_app

# Create custom dashboard app
app = create_app(debug=True)

# Run with custom configuration (Dash 3 removed run_server; use app.run)
app.run(host='0.0.0.0', port=8050, debug=False)
```

### 💡 Usage Examples

The dashboard automatically integrates with your HalluNox detector settings:

- **Model switching**: Toggle between general and medical models
- **Confidence threshold**: Real-time adjustment from 0.1 to 0.9
- **Temperature control**: Adjust response creativity
- **Token limits**: Control response length

When confidence drops below threshold, users see:
> "I don't know the answer to that question with sufficient confidence (confidence: 0.45 < 0.60). You should search the web for accurate information or consult other reliable sources."

## 🆕 What's New in v0.5.2

### ✅ Major Stability Improvements
- **🔧 Fixed NaN/Inf Issues**: Completely resolved numerical instability problems that affected earlier versions
- **🚀 Simplified Architecture**: Adopted proven approach from working inference pipeline
- **⚡ Better Performance**: Removed quantization overhead while maintaining accuracy
- **🛡️ Enhanced Reliability**: More stable model loading and inference

### 🔧 Technical Improvements
- **Disabled 4-bit Quantization**: Root cause of numerical instabilities removed
- **Simplified Model Loading**: Uses `torch.bfloat16` precision for optimal stability
- **Clean Inference Pipeline**: Removed complex stability measures that were interfering
- **Consistent Architecture**: Now matches proven `inference_gemma.py` approach

### 🎯 Generation Quality Improvements
- **Fixed Repetitive Text**: Resolved issues with repetitive output and unwanted artifacts
- **Deterministic Generation**: Uses `do_sample=False` for clean, consistent responses
- **Proper Chat Formatting**: Adopted exact Jupyter notebook message formatting approach
- **Clean Output**: Eliminated unwanted code blocks and repetitive patterns

### 📦 Backward Compatibility
- **Zero Breaking Changes**: Existing code continues to work without modifications
- **Automatic Upgrades**: `for_low_memory()` method now uses stable approach by default
- **Seamless Migration**: No code changes required for existing implementations

## 🚀 Installation

### Requirements

- Python 3.8+
- PyTorch 1.13+
- CUDA-compatible GPU (recommended)
- At least 8GB GPU memory for inference (improved efficiency in v0.5.2+)
- 16GB RAM minimum (32GB recommended for training)

### Install from PyPI

```bash
pip install hallunox
```

### Install from Source

```bash
git clone https://github.com/convai-innovations/hallunox.git
cd hallunox
pip install -e .
```

### MedGemma Model Access

HalluNox uses the open-access `convaiinnovations/gemma-finetuned-4b-it` model, which doesn't require authentication. The model will be automatically downloaded on first use.

### Core Dependencies

HalluNox automatically installs:

- `torch>=1.13.0` - PyTorch framework
- `transformers>=4.45.0` - Hugging Face Transformers
- `FlagEmbedding>=1.2.0` - BGE-M3 embedding model
- `datasets>=2.0.0` - Dataset loading utilities
- `scikit-learn>=1.0.0` - Evaluation metrics
- `numpy==1.26.4` - Numerical computations (pinned version)
- `tqdm>=4.64.0` - Progress bars
- `Pillow>=8.0.0` - Image processing for multimodal capabilities
- `bitsandbytes>=0.41.0` - 4-bit quantization for memory optimization

## 📖 Quick Start

### Basic Usage (Llama-3.2-3B)

```python
from hallunox import HallucinationDetector

# Initialize detector (downloads pre-trained model automatically)
detector = HallucinationDetector()

# Analyze text for hallucination risk
results = detector.predict([
    "The capital of France is Paris.",  # High confidence
    "Your password is 12345678.",       # Low confidence  
    "The Moon is made of cheese."       # Very low confidence
])

# View results
for pred in results["predictions"]:
    print(f"Text: {pred['text']}")
    print(f"Confidence: {pred['confidence_score']:.3f}")
    print(f"Risk Level: {pred['risk_level']}")
    print(f"Routing Action: {pred['routing_action']}")
    print()
```

### 🏥 MedGemma Medical Domain Usage

For medical applications using MedGemma 4B-IT with multimodal capabilities:

```python
from hallunox import HallucinationDetector
from PIL import Image

# Initialize MedGemma detector (auto-downloads medical model)
detector = HallucinationDetector(
    llm_model_id="convaiinnovations/gemma-finetuned-4b-it",
    confidence_threshold=0.60,  # Medical-grade threshold
    enable_response_generation=True,  # Enable response generation
    enable_inference=True,
    mode="text"  # Text-only mode (default)
)

# Medical text analysis
medical_results = detector.predict([
    "Aspirin can help reduce heart attack risk when prescribed by a doctor.",
    "Drinking bleach will cure COVID-19.",  # Dangerous misinformation
    "Type 2 diabetes requires insulin injections in all cases.",  # Partially incorrect
])

for pred in medical_results["predictions"]:
    print(f"Medical Text: {pred['text'][:60]}...")
    print(f"Confidence: {pred['confidence_score']:.3f}")
    print(f"Risk Level: {pred['risk_level']}")
    print(f"Medical Action: {pred['routing_action']}")
    print(f"Description: {pred['description']}")
    print("-" * 50)

# Response generation with confidence checking
question = "What are the symptoms of pneumonia?"
response = detector.generate_response(question, check_confidence=True)

if response["should_generate"]:
    print(f"✅ Medical Response Generated (confidence: {response['confidence_score']:.3f})")
    print(f"Response: {response['response']}")
    print(f"Meets threshold: {response['meets_threshold']}")
    if response.get('forced_generation'):
        print("⚠️ Note: Response was generated despite low confidence")
else:
    print(f"⚠️ Response blocked (confidence: {response['confidence_score']:.3f})")
    print(f"Reason: {response['reason']}")
    print(f"Recommendation: {response['recommendation']}")

# Force generation for reference regardless of confidence
forced_response = detector.generate_response(
    question, 
    check_confidence=True, 
    force_generate=True  # Generate even if confidence is low
)
print(f"🔬 Reference Response (forced): {forced_response['response']}")
print(f"📊 Confidence: {forced_response['confidence_score']:.3f}")
print(f"🎯 Forced Generation: {forced_response['forced_generation']}")

# Multimodal image analysis (MedGemma 4B-IT only)
if detector.is_multimodal:
    print("\n🖼️ Multimodal Image Analysis")
    
    # Load medical image (replace with actual medical image)
    try:
        image = Image.open("chest_xray.jpg")
    except (FileNotFoundError, OSError):
        # Fall back to a demo image for testing
        image = Image.new('RGB', (224, 224), color='lightgray')
    
    # Analyze image confidence
    image_results = detector.predict_images([image], ["Chest X-ray"])
    
    for pred in image_results["predictions"]:
        print(f"Image: {pred['image_description']}")
        print(f"Confidence: {pred['confidence_score']:.3f}")
        print(f"Interpretation: {pred['interpretation']}")
        print(f"Risk Level: {pred['risk_level']}")
    
    # Generate image description
    description = detector.generate_image_response(
        image, 
        "Describe the findings in this chest X-ray."
    )
    print(f"Generated Description: {description}")
```

### 🔧 Advanced Configuration

```python
from hallunox import HallucinationDetector

# Full configuration example
detector = HallucinationDetector(
    # Model selection
    llm_model_id="convaiinnovations/gemma-finetuned-4b-it",  # or "unsloth/Llama-3.2-3B-Instruct"
    embed_model_id="BAAI/bge-m3",
    
    # Custom model weights (optional)
    model_path="/path/to/custom/model.pt",  # None = auto-download
    
    # Hardware configuration
    device="cuda",  # or "cpu"
    use_fp16=True,  # Mixed precision for faster inference
    
    # Sequence lengths
    max_length=512,      # LLM context length
    bge_max_length=512,  # BGE-M3 context length
    
    # Feature toggles
    load_llm=True,                    # Load LLM for embeddings
    enable_inference=True,            # Enable LLM inference
    enable_response_generation=True,  # Enable response generation
    
    # Confidence settings
    confidence_threshold=0.60,  # Custom threshold (auto-detected by model type)
    
    # Operation mode
    mode="text",  # "text", "image", "both", or "auto"
)

# Check model capabilities
print(f"Model type: {'Medical' if detector.is_medgemma_4b else 'General'}")
print(f"Multimodal support: {detector.is_multimodal}")
print(f"Operation mode: {detector.effective_mode} (requested: {detector.mode})")
print(f"Confidence threshold: {detector.confidence_threshold}")
```

### 🎛️ Operation Mode Configuration

The `mode` parameter controls what types of input the detector can process:

```python
from hallunox import HallucinationDetector

# Text mode (default) - processes text inputs only
detector = HallucinationDetector(
    llm_model_id="convaiinnovations/gemma-finetuned-4b-it",
    mode="text"  # Text-only processing (default)
)

# Auto mode - detects capabilities from model
detector = HallucinationDetector(
    llm_model_id="convaiinnovations/gemma-finetuned-4b-it",
    mode="auto"  # Auto: detects based on model capabilities
)

# Image-only mode - processes images only (requires multimodal model)
detector = HallucinationDetector(
    llm_model_id="convaiinnovations/gemma-finetuned-4b-it",
    mode="image"  # Image processing only
)

# Both mode - processes text and images (requires multimodal model)
detector = HallucinationDetector(
    llm_model_id="convaiinnovations/gemma-finetuned-4b-it",
    mode="both"  # Explicit multimodal mode
)
```

#### Mode Validation

- **Text mode**: Available for all models (default)
- **Image mode**: Requires multimodal model (e.g., convaiinnovations/gemma-finetuned-4b-it)
- **Both mode**: Requires multimodal model (e.g., convaiinnovations/gemma-finetuned-4b-it)
- **Auto mode**: Automatically selects based on model capabilities
  - Multimodal models → `effective_mode = "both"`
  - Text-only models → `effective_mode = "text"`
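The mode-validation rules above can be expressed as a small resolution function. This is a hedged sketch of the logic as documented, not the actual implementation inside `HallucinationDetector`:

```python
def resolve_mode(requested: str, is_multimodal: bool) -> str:
    """Return the effective operation mode, or raise on invalid combinations."""
    if requested == "auto":
        # Auto mode selects based on model capabilities.
        return "both" if is_multimodal else "text"
    if requested not in ("text", "image", "both"):
        raise ValueError(f"Unknown mode: {requested!r}")
    if requested in ("image", "both") and not is_multimodal:
        raise ValueError(f"Mode {requested!r} requires a multimodal model")
    return requested
```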

#### Error Examples

```python
# This will raise an error - image mode requires multimodal model
detector = HallucinationDetector(
    llm_model_id="unsloth/Llama-3.2-3B-Instruct",
    mode="image"  # ❌ Error: Image mode requires multimodal model
)

# This will raise an error - calling image methods in text mode
detector = HallucinationDetector(
    llm_model_id="convaiinnovations/gemma-finetuned-4b-it",
    mode="text"
)
detector.predict_images([image])  # ❌ Error: Current mode is 'text'
```

### ⚡ Performance Optimized Usage

For faster initialization when only doing embedding comparisons:

```python
from hallunox import HallucinationDetector

# Option 1: Factory method for embedding-only usage
detector = HallucinationDetector.for_embedding_only(
    device="cuda",
    use_fp16=True
)

# Option 2: Explicit parameter control
detector = HallucinationDetector(
    load_llm=False,         # Skip expensive LLM loading
    enable_inference=False, # Disable inference capabilities
    use_fp16=True          # Use mixed precision
)

# Note: This configuration cannot perform predictions
# Use for preprocessing or embedding extraction only
```

### 🧠 Memory Optimization with Quantization

For GPUs with limited VRAM (8-16GB), use 4-bit quantization:

```python
from hallunox import HallucinationDetector

# Option 1: Auto-optimized for low memory (recommended)
detector = HallucinationDetector.for_low_memory(
    llm_model_id="convaiinnovations/gemma-finetuned-4b-it",  # Or any supported model
    device="cuda",
    enable_response_generation=True,  # Enable response generation for evaluation
    verbose=True  # Show loading progress (optional)
)

# Option 2: Manual quantization configuration
detector = HallucinationDetector(
    llm_model_id="convaiinnovations/gemma-finetuned-4b-it",
    use_quantization=True,  # Enable 4-bit quantization
    enable_response_generation=True,
    device="cuda"
)

# Option 3: Custom quantization settings
from transformers import BitsAndBytesConfig
import torch

custom_quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",          # NF4 quantization type
    bnb_4bit_use_double_quant=True,     # Double quantization for extra savings
    bnb_4bit_compute_dtype=torch.bfloat16  # Compute in bfloat16
)

detector = HallucinationDetector(
    llm_model_id="convaiinnovations/gemma-finetuned-4b-it",
    quantization_config=custom_quant_config,
    device="cuda"
)

print(f"✅ Memory optimized: {detector.use_quantization}")
print(f"🔧 Quantization: 4-bit NF4 with double quantization")
```

## 🤖 Response Generation & Evaluation

### Enabling Response Generation

When `enable_response_generation=True`, HalluNox can generate responses for evaluation and display the model's actual output alongside confidence scores:

```python
from hallunox import HallucinationDetector

# Enable response generation for evaluation
detector = HallucinationDetector.for_low_memory(
    llm_model_id="convaiinnovations/gemma-finetuned-4b-it",
    device="cuda",
    enable_response_generation=True,  # Enable response generation
    verbose=False  # Clean logs for evaluation
)

# Test questions for evaluation
test_questions = [
    "What are the symptoms of diabetes?",
    "Drinking bleach will cure COVID-19.",  # Dangerous misinformation
    "How does aspirin help prevent heart attacks?",
    "All vaccines cause autism in children.",  # Medical misinformation
]

# Analyze with response generation
for question in test_questions:
    # The model will generate a response and analyze it
    results = detector.predict([question])
    prediction = results["predictions"][0]
    
    print(f"Question: {question}")
    print(f"Confidence: {prediction['confidence_score']:.3f}")
    print(f"Risk Level: {prediction['risk_level']}")
    print(f"Action: {prediction['medical_action']}")
    print(f"Description: {prediction['description']}")
    print("-" * 50)
```

### Response Generation Modes

```python
# Generate and analyze responses with confidence checking
response = detector.generate_response(
    "What are the side effects of ibuprofen?", 
    check_confidence=True
)

if response["should_generate"]:
    print(f"✅ Generated Response: {response['response']}")
    print(f"Confidence: {response['confidence_score']:.3f}")
    print(f"Meets threshold: {response['meets_threshold']}")
else:
    print(f"⚠️ Response blocked (confidence: {response['confidence_score']:.3f})")
    print(f"Reason: {response['reason']}")
    print(f"Recommendation: {response['recommendation']}")

# Force generation for reference (useful for evaluation)
forced_response = detector.generate_response(
    "What are the side effects of ibuprofen?", 
    check_confidence=True, 
    force_generate=True
)
print(f"🔬 Reference Response: {forced_response['response']}")
print(f"📊 Confidence: {forced_response['confidence_score']:.3f}")
print(f"🎯 Forced Generation: {forced_response['forced_generation']}")
```

### Evaluation Output Example

```
Question: What are the symptoms of diabetes?
Generated Response: Common symptoms of diabetes include increased thirst, frequent urination, excessive hunger, unexplained weight loss, fatigue, and blurred vision. It's important to consult a healthcare provider for proper diagnosis.
Confidence: 0.857
Risk Level: LOW_MEDICAL_RISK
Action: ✅ Information can be used as reference
--------------------------------------------------
Question: Drinking bleach will cure COVID-19.
Generated Response: [Response blocked - confidence too low]
Confidence: 0.123
Risk Level: VERY_HIGH_MEDICAL_RISK
Action: ⛔ Do not use - seek professional medical advice
--------------------------------------------------
```

### 💾 Memory Usage Comparison

| Configuration | Model Size | VRAM Usage | Performance |
|--------------|------------|------------|-------------|
| **Full Precision** | ~16GB | ~14GB | 100% speed |
| **FP16 Mixed Precision** | ~8GB | ~7GB | 95% speed |
| **4-bit Quantization** | ~4GB | ~3.5GB | 85-90% speed |
| **4-bit + Double Quant** | ~3.5GB | ~3GB | 85-90% speed |

**Recommendation**: Use `HallucinationDetector.for_low_memory()` for GPUs with 8GB or less VRAM.
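The table and recommendation above suggest a simple rule of thumb for picking a configuration from available VRAM. An illustrative helper, not part of HalluNox (the cutoffs are assumptions derived from the VRAM column):

```python
def suggest_config(vram_gb: float) -> dict:
    """Suggest loading flags for a given amount of GPU memory (GB)."""
    if vram_gb >= 14:
        # Full precision fits comfortably (~14GB VRAM).
        return {"use_fp16": False, "use_quantization": False}
    if vram_gb > 8:
        # FP16 mixed precision (~7GB VRAM) at ~95% speed.
        return {"use_fp16": True, "use_quantization": False}
    # 8GB-class cards or smaller: 4-bit quantization (~3-3.5GB VRAM),
    # i.e. the HallucinationDetector.for_low_memory() path.
    return {"use_fp16": True, "use_quantization": True}
```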

### 📝 Enhanced Query-Context Formatting

For better accuracy with contextual information:

```python
from hallunox import HallucinationDetector

detector = HallucinationDetector()

# Use query-context pairs for improved embedding accuracy
query_context_pairs = [
    {
        "query": "What is the capital of France?",
        "context": "France is a European country with rich history and culture."
    },
    {
        "query": "The Moon is made of green cheese",
        "context": "The Moon is Earth's natural satellite composed of rock and metal."
    }
]

# Method 1: Direct query-context prediction
results = detector.predict_with_query_context(query_context_pairs)

# Method 2: Using the predict method with context parameter
texts = [pair["query"] for pair in query_context_pairs]
results = detector.predict(texts, query_context_pairs=query_context_pairs)

# Enhanced accuracy for contextual queries
for pred in results["predictions"]:
    print(f"Query: {pred['text']}")
    print(f"Enhanced Confidence: {pred['confidence_score']:.3f}")
```

## 🖥️ Command Line Interface

HalluNox provides a comprehensive CLI for various use cases:

### Web Dashboard
```bash
# Launch web interface (recommended)
hallunox-dashboard

# With custom configuration
hallunox-dashboard --host 0.0.0.0 --port 8080 --debug
```

### Interactive Mode
```bash
# General model interactive mode
hallunox-infer --interactive

# MedGemma medical interactive mode
hallunox-infer --llm_model_id convaiinnovations/gemma-finetuned-4b-it --interactive --show_generated_text
```

### Batch Processing
```bash
# Process file with general model
hallunox-infer --input_file medical_texts.txt --output_file results.json

# Process with MedGemma and medical settings
hallunox-infer \
    --llm_model_id convaiinnovations/gemma-finetuned-4b-it \
    --input_file medical_texts.txt \
    --output_file medical_results.json \
    --show_routing \
    --show_generated_text
```

### Image Analysis (Multimodal models only)
```bash
# Single image analysis
hallunox-infer \
    --llm_model_id convaiinnovations/gemma-finetuned-4b-it \
    --image_path chest_xray.jpg \
    --show_generated_text

# Batch image analysis
hallunox-infer \
    --llm_model_id convaiinnovations/gemma-finetuned-4b-it \
    --image_folder /path/to/medical/images \
    --output_file image_analysis.json
```

### Demo Mode
```bash
# General demo
hallunox-infer --demo --show_routing

# Medical demo with MedGemma
hallunox-infer \
    --llm_model_id convaiinnovations/gemma-finetuned-4b-it \
    --demo \
    --mode both \
    --show_routing

# Text-only demo (faster initialization)
hallunox-infer \
    --llm_model_id convaiinnovations/gemma-finetuned-4b-it \
    --demo \
    --mode text \
    --show_routing
```

## 🔨 Training Your Own Model

### Quick Training

```python
from hallunox import Trainer, TrainingConfig

# Configure training
config = TrainingConfig(
    # Model selection
    model_id="convaiinnovations/gemma-finetuned-4b-it",  # or "unsloth/Llama-3.2-3B-Instruct"
    embed_model_id="BAAI/bge-m3",
    
    # Training parameters
    batch_size=8,
    learning_rate=5e-4,
    max_epochs=6,
    warmup_steps=300,
    
    # Dataset configuration
    use_truthfulqa=True,
    use_halueval=True,
    use_fever=True,
    max_samples_per_dataset=3000,
    
    # Output
    output_dir="./models/my_medical_model"
)

# Train model
trainer = Trainer(config)
trainer.train()
```

### Command Line Training
```bash
# Train general model
hallunox-train --batch_size 8 --learning_rate 5e-4 --max_epochs 6

# Train medical model
hallunox-train \
    --model_id convaiinnovations/gemma-finetuned-4b-it \
    --batch_size 4 \
    --learning_rate 3e-4 \
    --max_epochs 8 \
    --output_dir ./models/custom_medgemma
```

## 🏗️ Model Architecture

HalluNox supports two main architectures:

### General Architecture (Llama-3.2-3B)
1. **LLM Component**: Llama-3.2-3B-Instruct
   - Extracts internal hidden representations (3072D)
   - Supports any Llama-architecture model
   
2. **Embedding Model**: BGE-M3 (fixed)
   - Provides reference semantic embeddings
   - 1024-dimensional dense vectors

3. **Projection Network**: Standard ProjectionHead
   - Maps LLM hidden states to embedding space
   - 3-layer MLP with ReLU activations and dropout
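
The projection step can be sketched conceptually in NumPy (the package itself uses PyTorch): a 3-layer MLP maps a 3072-D LLM hidden state into the 1024-D BGE-M3 embedding space, and cosine similarity to a reference embedding provides the semantic-alignment signal. The 2048-unit hidden layer and the random weights are illustrative assumptions, standing in for the trained ProjectionHead parameters:

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

# Random weights stand in for the trained ProjectionHead parameters.
W1 = rng.normal(size=(3072, 2048))
W2 = rng.normal(size=(2048, 1024))
W3 = rng.normal(size=(1024, 1024))

def project(hidden_state: np.ndarray) -> np.ndarray:
    """Project an LLM hidden state (3072-D) into embedding space (1024-D)."""
    h = relu(hidden_state @ W1)
    h = relu(h @ W2)
    return h @ W3

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

hidden = rng.normal(size=3072)     # stand-in LLM hidden state
reference = rng.normal(size=1024)  # stand-in BGE-M3 reference embedding
similarity = cosine(project(hidden), reference)
```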

### Medical Architecture (MedGemma-4B-IT)
1. **Unified Multimodal Model**: 
   - **Single Model**: AutoModelForImageTextToText handles both text and images
   - **Memory Optimized**: Avoids double loading (saves ~8GB VRAM)
   - **Fallback Support**: Graceful degradation to text-only if needed
   
2. **Embedding Model**: BGE-M3 (same as general)
   - Enhanced with medical context formatting
   
3. **Projection Network**: UltraStableProjectionHead
   - Ultra-stable architecture with heavy normalization
   - Conservative weight initialization for medical precision
   - Tanh activations for stability
   - Enhanced dropout and layer normalization

4. **Multimodal Processor**: AutoProcessor
   - Handles image + text inputs
   - Supports chat template formatting

5. **Quantization Support**: 4-bit NF4 with double quantization
   - Reduces memory usage by ~75%
   - Maintains 85-90% performance
   - Automatic fallback for CPU

## 📊 API Reference

### HallucinationDetector

#### Constructor Parameters

```python
HallucinationDetector(
    model_path: str = None,                    # Path to trained model (None = auto-download)
    llm_model_id: str = "unsloth/Llama-3.2-3B-Instruct",  # LLM model ID
    embed_model_id: str = "BAAI/bge-m3",      # Embedding model ID
    device: str = None,                        # Device (None = auto-detect)
    max_length: int = 512,                     # LLM sequence length
    bge_max_length: int = 512,                # BGE-M3 sequence length
    use_fp16: bool = True,                     # Mixed precision
    load_llm: bool = True,                     # Load LLM
    enable_inference: bool = False,            # Enable LLM inference
    confidence_threshold: float = None,        # Custom threshold (auto-detected)
    enable_response_generation: bool = False,  # Enable response generation
    use_quantization: bool = False,            # Enable 4-bit quantization for memory savings
    quantization_config: BitsAndBytesConfig = None,  # Custom quantization config
    mode: str = "text",                        # Operation mode: "text", "image", "both", "auto" (default: "text")
)
```

#### Core Methods

**Text Analysis:**
- `predict(texts, query_context_pairs=None)` - Analyze texts for hallucination confidence
- `predict_with_query_context(query_context_pairs)` - Query-context prediction
- `batch_predict(texts, batch_size=16)` - Efficient batch processing

**Response Generation:**
- `generate_response(prompt, max_length=512, check_confidence=True, force_generate=False)` - Generate responses with confidence checking

**Multimodal (MedGemma only):**
- `predict_images(images, image_descriptions=None)` - Analyze image confidence
- `generate_image_response(image, prompt, max_length=200)` - Generate image descriptions

**Analysis:**
- `evaluate_routing_strategy(texts)` - Analyze routing decisions

**Factory Methods:**
- `for_embedding_only()` - Create embedding-only detector
- `for_low_memory()` - Create memory-optimized detector with 4-bit quantization

#### Response Format

```python
{
    "predictions": [
        {
            "text": "input text",
            "confidence_score": 0.85,           # 0.0 to 1.0
            "similarity_score": 0.92,          # Cosine similarity
            "interpretation": "HIGH_CONFIDENCE", # or HIGH_MEDICAL_CONFIDENCE
            "risk_level": "LOW_RISK",          # or LOW_MEDICAL_RISK
            "routing_action": "LOCAL_GENERATION",
            "description": "This response appears to be factual and reliable."
        }
    ],
    "summary": {
        "total_texts": 1,
        "avg_confidence": 0.85,
        "high_confidence_count": 1,
        "medium_confidence_count": 0,
        "low_confidence_count": 0,
        "very_low_confidence_count": 0
    }
}
```
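The `summary` block can be reproduced from the per-text predictions with a small aggregation. An illustrative sketch, not the library's code; the bucket boundaries are assumptions mirroring the general routing thresholds earlier in this README:

```python
def summarize(predictions):
    """Aggregate per-text predictions into a summary dict."""
    scores = [p["confidence_score"] for p in predictions]
    return {
        "total_texts": len(predictions),
        "avg_confidence": sum(scores) / len(scores),
        "high_confidence_count": sum(s >= 0.65 for s in scores),
        "medium_confidence_count": sum(0.60 <= s < 0.65 for s in scores),
        "low_confidence_count": sum(0.40 <= s < 0.60 for s in scores),
        "very_low_confidence_count": sum(s < 0.40 for s in scores),
    }
```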

#### Response Generation Format

```python
{
    "response": "Generated response text",
    "confidence_score": 0.85,
    "should_generate": True,
    "meets_threshold": True,
    "forced_generation": False,  # True if generated despite low confidence
    # Or when blocked:
    "reason": "Confidence 0.45 below threshold 0.60",
    "recommendation": "RAG_RETRIEVAL"
}
```

### Training Classes

- **`TrainingConfig`**: Configuration dataclass for training parameters
- **`Trainer`**: Main training class with dataset loading and model training
- **`MultiDatasetLoader`**: Loads and combines multiple hallucination detection datasets

### Utility Functions

- **`download_model()`**: Download general pre-trained model
- **`download_medgemma_model(model_name)`**: Download MedGemma medical model
- **`setup_logging(level)`**: Configure logging
- **`check_gpu_availability()`**: Check CUDA compatibility
- **`validate_model_requirements()`**: Verify dependencies

## 📈 Performance

Our confidence-aware routing system demonstrates:

- **74% hallucination detection rate** (vs 42% baseline)
- **9% false positive rate** (vs 15% baseline)  
- **40% reduction in computational cost** vs post-hoc methods
- **1.6x cost multiplier** vs 4.2x when always routing to expensive operations

### Medical Domain Performance (MedGemma)
- **Enhanced medical accuracy** with 0.62 confidence threshold
- **Multimodal capability** for medical image analysis
- **Safety-first approach** with conservative thresholds
- **Professional verification workflow** for low-confidence cases
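
The conservative 0.62 medical threshold can be applied as a simple output gate. A minimal sketch (the function, disclaimer text, and deferral message are illustrative; HalluNox does not ship this helper):

```python
MEDICAL_THRESHOLD = 0.62  # conservative medical cutoff noted above

def gate_medical_output(text, confidence, threshold=MEDICAL_THRESHOLD):
    """Return the text with a disclaimer, or defer to professional review.

    Illustrative wrapper demonstrating the safety-first workflow.
    """
    if confidence >= threshold:
        return text + "\n\n[Not medical advice; verify with a qualified professional.]"
    return "[Deferred: confidence below medical threshold; professional review required.]"
```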

## 🖥️ Hardware Requirements

### Minimum (Inference Only)
- **CPU**: Modern multi-core processor
- **RAM**: 16GB system memory
- **GPU**: 8GB VRAM (RTX 3070, RTX 4060 Ti+)
- **Storage**: 15GB free space
- **Models**: ~5GB each (Llama/MedGemma)

### Recommended (Inference)
- **CPU**: Intel i7/AMD Ryzen 7+
- **RAM**: 32GB system memory  
- **GPU**: 12GB+ VRAM (RTX 4070, RTX 3080+)
- **Storage**: NVMe SSD, 25GB+ free
- **CUDA**: 11.8+ compatible driver

### Training Requirements
- **CPU**: High-performance multi-core (i9/Ryzen 9)
- **RAM**: 64GB+ system memory
- **GPU**: 24GB+ VRAM (RTX 4090, A100, H100)
- **Storage**: 200GB+ NVMe SSD
  - Model checkpoints: ~10GB per epoch
  - Training datasets: ~30GB
  - Logs and outputs: ~50GB
- **Network**: High-speed internet for downloads

### MedGemma Specific
- **Additional storage**: +10GB for multimodal models
- **Image processing**: PIL/Pillow for image capabilities
- **Memory**: +4GB RAM for image processing pipeline

### CPU-Only Mode
- **RAM**: 32GB minimum (64GB recommended)
- **Performance**: 10-50x slower than GPU
- **Not recommended**: For production medical applications

## 🔒 Safety Considerations

### Medical Applications
- **Professional oversight required**: HalluNox is a research tool, not medical advice
- **Validation needed**: All medical outputs should be verified by qualified professionals
- **Conservative thresholds**: 0.62 threshold ensures high precision for medical content
- **Clear disclaimers**: Always include appropriate medical disclaimers in applications

### General Use
- **Confidence-based routing**: Use routing recommendations for appropriate escalation
- **Human oversight**: Very low confidence predictions require human review
- **Regular evaluation**: Monitor performance on your specific use cases

## 🛠️ Troubleshooting

### Common Issues and Solutions

#### CUDA Out of Memory Error
```
OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB...
```
**Solution**: Use 4-bit quantization
```python
detector = HallucinationDetector.for_low_memory()
```

#### Deprecated torch_dtype Warning
```
`torch_dtype` is deprecated! Use `dtype` instead!
```
**Solution**: Already fixed in HalluNox v0.3.2+ - the package now uses the correct `dtype` parameter.

#### Double Model Loading (MedGemma)
```
Loading checkpoint shards: 100% 2/2 [00:37<00:00, 18.20s/it]
Loading checkpoint shards: 100% 2/2 [00:36<00:00, 17.88s/it]
```
**Solution**: Already optimized in HalluNox v0.3.2+ - MedGemma now uses a unified model approach that avoids double loading.

#### Accelerate Warning
```
WARNING:accelerate.big_modeling:Some parameters are on the meta device...
```
**Solution**: This is normal with quantization - parameters are automatically moved to GPU during inference.

#### Dependency Version Conflict (AutoProcessor)
```
⚠️ Could not load AutoProcessor: module 'requests' has no attribute 'exceptions'
AttributeError: module 'requests' has no attribute 'exceptions'
```
**Solution**: This is a compatibility issue between transformers and requests versions.
```bash
pip install --upgrade transformers requests huggingface_hub
# Or force reinstall
pip install --force-reinstall transformers>=4.45.0 requests>=2.31.0
```
**Fallback**: HalluNox automatically falls back to text-only mode when this occurs.

#### Model Hidden States NaN/Inf Issues ✅ RESOLVED
```
⚠️ Warning: NaN/Inf detected in model hidden states
   Hidden shape: torch.Size([3, 16, 2560])
   NaN count: 122880
```
**✅ FIXED in HalluNox v0.5.2+**: This issue has been completely resolved by adopting the proven approach from our working inference pipeline:

**Root Cause**: 4-bit quantization was causing numerical instabilities with certain model architectures.

**Solution Applied**:
- **Disabled Quantization**: Removed 4-bit quantization that was causing NaN issues
- **Simplified Model Loading**: Now uses the same approach as our proven `inference_gemma.py` 
- **Clean Architecture**: Removed complex stability measures that were interfering
- **Stable Precision**: Uses `torch.bfloat16` for optimal performance without instabilities

#### Repetitive Text and Unwanted Artifacts ✅ RESOLVED
```
🔬 Reference Response (forced): I am programmed to be a harmless AI assistant...
g
I am programmed to be a harmless AI assistant...
g
[repetitive output continues...]
```
**✅ FIXED in HalluNox v0.5.2+**: Repetitive text generation and unwanted artifacts have been completely resolved:

**Root Cause**: Improper message formatting and sampling parameters causing the model to not understand conversation boundaries.

**Solution Applied**:
- **Deterministic Generation**: Changed from `do_sample=True` to `do_sample=False` matching Jupyter notebook approach
- **Proper Chat Templates**: Adopted exact message formatting from working Jupyter notebook implementation  
- **Removed Sampling Parameters**: Eliminated `temperature`, `top_p`, `repetition_penalty` that were causing issues
- **Clean Tokenization**: Uses `tokenizer.apply_chat_template()` with proper parameters for conversation structure

**Current Recommended Usage** (v0.5.2+):
```python
# Standard usage - now stable by default
detector = HallucinationDetector.for_low_memory(
    llm_model_id="convaiinnovations/gemma-finetuned-4b-it",
    device="cuda"
)

# Both NaN issues and repetitive text are now automatically resolved
```

**Migration from v0.4.9 and earlier**: No code changes needed - existing code will automatically use the stable approach.

#### Environment Optimization
For better memory management, set:
```bash
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
```

### Memory Requirements by Configuration

| GPU VRAM | Recommended Configuration | Expected Performance |
|----------|--------------------------|---------------------|
| **4-6GB** | `for_low_memory()` + reduce batch size | Basic functionality |
| **8-12GB** | `for_low_memory()` | Full functionality |
| **16GB+** | Standard configuration | Optimal performance |
| **24GB+** | Multiple models + training | Development/research |
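
A small helper can map detected VRAM onto the table above. A hedged sketch (the returned labels simply mirror the table; probing VRAM via `torch.cuda.get_device_properties` is optional and requires a CUDA build of PyTorch):

```python
def pick_configuration(vram_gb):
    """Map available GPU VRAM (GiB) to the configuration table above."""
    if vram_gb >= 24:
        return "standard (multiple models / training possible)"
    if vram_gb >= 16:
        return "standard"
    if vram_gb >= 8:
        return "for_low_memory()"
    if vram_gb >= 4:
        return "for_low_memory() with reduced batch size"
    return "cpu-only (not recommended for production)"

# Example probe (requires torch with CUDA available):
# vram_gb = torch.cuda.get_device_properties(0).total_memory / 2**30
```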

## 📄 License

This project is licensed under the GNU Affero General Public License v3.0 (AGPL-3.0).

## 📚 Citation

If you use HalluNox in your research, please cite:

```bibtex
@article{nandakishor2024hallunox,
    title={Confidence-Aware Routing for Large Language Model Reliability Enhancement: A Multi-Signal Approach to Pre-Generation Hallucination Mitigation},
    author={Nandakishor M},
    journal={AI Safety Research},
    year={2024},
    organization={Convai Innovations}
}
```

## 🤝 Contributing

We welcome contributions! Please see our contributing guidelines and submit pull requests to our repository.

### Development Setup
```bash
git clone https://github.com/convai-innovations/hallunox.git
cd hallunox
pip install -e ".[dev]"
```

## 📞 Support

For technical support and questions:
- **Email**: support@convaiinnovations.com  
- **Issues**: [GitHub Issues](https://github.com/convai-innovations/hallunox/issues)
- **Documentation**: Full API docs available online

## 👨‍💻 Author

**Nandakishor M**  
AI Safety Research  
Convai Innovations Pvt. Ltd.  
Email: support@convaiinnovations.com

---

**Disclaimer**: HalluNox is a research tool for hallucination detection and should not be used as the sole basis for critical decisions, especially in medical contexts. Always seek professional advice for medical applications.
