Metadata-Version: 2.4
Name: pyragix
Version: 0.1.1
Summary: A document ingestion and RAG query system with FAISS indexing and OCR support
Author-email: psarno <psarno@users.noreply.github.com>
License-Expression: MIT
Project-URL: Homepage, https://github.com/psarno/PyRagix
Project-URL: Repository, https://github.com/psarno/PyRagix
Project-URL: Issues, https://github.com/psarno/PyRagix/issues
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Requires-Python: >=3.8
Description-Content-Type: text/markdown
Requires-Dist: torch>=2.0.0
Requires-Dist: transformers>=4.30.0
Requires-Dist: sentence-transformers>=2.5.0
Requires-Dist: huggingface-hub>=0.20.0
Requires-Dist: tokenizers>=0.21.0
Requires-Dist: faiss-cpu>=1.8.0
Requires-Dist: paddleocr>=2.8.0
Requires-Dist: paddlepaddle>=2.6.0
Requires-Dist: pillow>=10.0.0
Requires-Dist: numpy>=1.24.0
Requires-Dist: requests>=2.31.0
Requires-Dist: fastapi>=0.104.0
Requires-Dist: uvicorn[standard]>=0.24.0
Requires-Dist: beautifulsoup4>=4.12.0
Requires-Dist: lxml>=4.9.0
Requires-Dist: psutil>=5.9.0

# PyRagix

A clean, typed, Pythonic pipeline for Retrieval-Augmented Generation (RAG).
Ingest HTML, PDF, and image-based documents, build a FAISS vector store, and
search with ease using Ollama for answer generation. Designed for developers
learning RAG, vector search, and document processing in Python.

PyRagix is a lightweight, educational project to help you explore how to process
diverse documents (HTML, PDF, images) and enable intelligent search using modern
AI tools. It's tuned for modest hardware (e.g., 16GB RAM / 6GB VRAM) with memory
optimizations, but can be customized via `settings.json`. This project is meant
to be a practical, well-structured example for Python developers diving into
RAG.

## Features

- **Cross-Platform**: Runs natively on Windows, Linux, and macOS with identical
  functionality. Uses `pathlib` for universal file handling.
- **Document Ingestion**: Extract text from HTML, PDF, and images using
  BeautifulSoup for HTML, PyMuPDF for PDFs, and PaddleOCR as the OCR fallback
  for images and scanned pages.
- **Vector Store**: Build a FAISS index with Sentence Transformers embeddings.
  Supports both Flat and IVF (Inverted File) indexing for optimal performance
  scaling.
- **Console Search**: Query your document collection via an interactive
  command-line interface, with Ollama generating human-like answers from
  retrieved contexts.
- **Web Interface**: Responsive TypeScript/FastAPI web application with a dark
  theme, real-time search, a server status indicator, and configurable search
  options.
- **Pythonic Design**: Clean, typed, idiomatic Python code with protocols,
  context managers, and memory cleanup for clarity and maintainability.
- **Memory Optimizations**: Adaptive memory settings based on system RAM, tiled
  OCR for large pages, batch embedding with retry logic, and automatic garbage
  collection.
- **Modular Architecture**: Separate classes for OCR processing and
  configuration management for better code organization and testing.
- **Advanced Indexing**: Configurable FAISS indexing with IVF support for faster
  search on large datasets, with a fallback to flat indexing for robust
  operation.
- **Hybrid CPU/GPU Support**: Automatic detection of GPU FAISS capabilities with
  graceful fallback to CPU-only operation for universal compatibility.

## Project Structure

```
PyRagix/
├── ingest_folder.py        # Main ingestion script
├── query_rag.py           # RAG query interface (console)
├── web_server.py          # FastAPI web server
├── start_web.bat          # Web interface startup script
├── config.py              # Configuration loader and validation
├── settings.json          # User configuration file (auto-generated)
├── classes/
│   ├── ProcessingConfig.py # Data class for processing configuration
│   └── OCRProcessor.py     # OCR operations handler
├── web/                   # Web interface files
│   ├── index.html         # Main web interface
│   ├── style.css          # Modern dark theme styling
│   ├── script.ts          # TypeScript source (ES2024)
│   ├── script.js          # Compiled JavaScript
│   ├── tsconfig.json      # TypeScript configuration
│   └── dev.bat           # TypeScript development script
├── requirements.in         # Package dependencies (source)
├── requirements.txt        # Compiled dependencies
├── local_faiss.index      # Generated FAISS vector index
├── documents.pkl          # Document metadata
├── processed_files.txt    # Log of processed files
├── ingestion.log         # Processing logs
└── crash_log.txt         # Error logs (when failures occur)
```

## Installation

1. **Clone the Repository**:

   ```bash
   git clone https://github.com/psarno/PyRagix.git
   cd PyRagix
   ```

2. **Set Up a Virtual Environment** (recommended):

   ```bash
   # Linux/Mac
   python -m venv venv
   source venv/bin/activate
   
   # Windows
   python -m venv venv
   venv\Scripts\activate.bat
   ```

3. **Install Dependencies**: PyRagix uses a `requirements.in` file for
   dependency management. Ensure you have `pip` and `pip-tools` installed, then
   run:

   ```bash
   pip install pip-tools
   pip-compile requirements.in  # Generates requirements.txt
   pip install -r requirements.txt
   ```

   **Note**: The dependency list includes `torch`, `transformers`, `faiss-cpu`,
   `paddleocr`, `paddlepaddle`, `sentence-transformers`, PyMuPDF (imported as
   `fitz`), `fastapi`, `uvicorn`, and others. Ensure you have sufficient disk
   space and Python 3.8 or later. For GPU acceleration, install CUDA-enabled
   builds of these packages where available.

4. **Ollama Setup** (for Querying):

   - Install Ollama: Follow instructions at [ollama.com](https://ollama.com).
   - Pull the default model: `ollama pull llama3.2:3b-instruct-q4_0`.
   - Start the Ollama server: `ollama serve`.

   To change the Ollama model or URL, edit the Ollama settings in
   `settings.json`; `query_rag.py` loads them via `config.py`.

## Usage

PyRagix provides both console and web interfaces for document search:

- `ingest_folder.py`: Processes a folder of documents (HTML, PDF, images) and
  builds a FAISS vector store.
- `query_rag.py`: Interactive console-based search interface.
- `web_server.py`: Modern web interface with REST API backend.

### Step 1: Ingest Documents

Run the ingestion script to process a folder and create a FAISS index:

```bash
python ingest_folder.py [path/to/documents]
```

- If no folder is provided, it uses the default from `config.py` (e.g.,
  `./docs`).
- Supported formats: PDF, HTML/HTM, images (via OCR).
- Outputs: `local_faiss.index` (FAISS index), `documents.pkl` (metadata),
  `processed_files.txt` (processed file log), `ingestion.log` (processing log),
  and `crash_log.txt` (errors if any).
- Resumes from existing index if available; skips already processed files.
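
The resume behavior can be pictured with a small sketch. This is not the code in `ingest_folder.py`; the helper names (`load_processed`, `pending_files`, `mark_processed`) and the one-path-per-line log format are assumptions made for illustration.

```python
from pathlib import Path
from typing import Iterable, List, Set


def load_processed(log_path: Path) -> Set[str]:
    """Return the paths already recorded in the log (empty set if no log yet)."""
    if not log_path.exists():
        return set()
    lines = log_path.read_text(encoding="utf-8").splitlines()
    return {line.strip() for line in lines if line.strip()}


def pending_files(candidates: Iterable[Path], log_path: Path) -> List[Path]:
    """Keep only the candidates that are not yet listed in the log."""
    done = load_processed(log_path)
    return [path for path in candidates if str(path) not in done]


def mark_processed(path: Path, log_path: Path) -> None:
    """Append one path to the log so a later run can skip it."""
    with log_path.open("a", encoding="utf-8") as handle:
        handle.write(f"{path}\n")
```

Appending to the log after each file, rather than rewriting it at the end, means an interrupted run still records everything it finished.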

**Customization**: Edit `settings.json` for hardware tuning (e.g., batch size,
thread counts, index type). The file is auto-generated on first run with
system-appropriate defaults. IVF indexing is enabled by default because it
scales better to large collections.

**Example**:

```bash
python ingest_folder.py ./my_documents
```

This scans `./my_documents` and subfolders, extracts text (with OCR fallback for
images/scans), chunks it, embeds with `all-MiniLM-L6-v2`, and adds to a FAISS
IVF index optimized for fast retrieval.
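
The chunking step can be sketched as follows. The README does not state the chunk size or strategy PyRagix uses, so `chunk_text`, its character-based splitting, and the default sizes are illustrative assumptions only.

```python
from typing import List


def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> List[str]:
    """Split text into fixed-size character chunks that overlap by `overlap`."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap
    chunks: List[str] = []
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if chunk.strip():  # skip whitespace-only chunks
            chunks.append(chunk)
        if start + chunk_size >= len(text):
            break  # the last chunk already reached the end of the text
    return chunks
```

Overlap keeps a sentence that straddles a chunk boundary retrievable from at least one chunk, at the cost of embedding some text twice.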

### Step 2: Search Documents

PyRagix offers two search interfaces:

#### Option A: Web Interface (Recommended)

Launch the modern web interface:

```bash
# Windows (using convenience script)
start_web.bat

# Linux/Mac/Windows (direct command)
python web_server.py
```

Then open your browser to:
- **Web Interface**: http://localhost:8000/web/
- **API Documentation**: http://localhost:8000/docs
- **Health Check**: http://localhost:8000/health

**Web Interface Features:**
- Modern, responsive dark theme design
- Real-time server status indicator
- Configurable search options (results count, sources, debug mode)
- Beautiful answer presentation with source highlighting
- TypeScript-powered frontend with ES2024 features
- REST API backend for integration

#### Option B: Console Interface

Launch the interactive console-based search interface:

```bash
python query_rag.py
```

- Loads the FAISS index and metadata.
- Enter queries at the prompt; get generated answers from Ollama based on
  retrieved contexts.
- Shows sources with scores and chunk indices.
- Type 'quit' or 'exit' to stop.

**Example Interaction**:

```
Query: What is machine learning?

Answer:
===========
Machine learning is a subset of AI that focuses on building systems that learn from data...
(Generated from Ollama using retrieved contexts)
===========

Sources:
1. intro.pdf (chunk 0, score: 0.920)
2. ml_basics.html (chunk 1, score: 0.850)
...
```

**Platform Notes**:

- **All platforms**: Core Python functionality is identical across Windows, Linux, and macOS.
- **Windows users**: Convenience `.bat` scripts are provided (`start_web.bat`, `ingest.bat`, `query.bat`).
- **Linux/Mac users**: Run the Python commands directly or adapt the `.bat` scripts to shell scripts.
- **TypeScript development**: Compiling `script.ts` requires the TypeScript compiler (`npm install -g typescript`).
- **Ollama**: Ensure Ollama is running before starting queries on any platform.

## Configuration

- **settings.json**: Main configuration file for hardware tuning (e.g., thread
  limits, batch size, CUDA settings, FAISS index type). Auto-generated with
  system-appropriate defaults.
- **classes/ProcessingConfig.py**: Adaptive configuration that automatically
  adjusts memory settings based on available system RAM.
- **query_rag.py**: Ollama API settings loaded from `settings.json` via
  `config.py`.
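
A minimal sketch of how an auto-generated settings file with defaults might be loaded. It is not the actual `config.py`: `load_settings` and the `DEFAULTS` dictionary are hypothetical, and only a few keys quoted in this README are included.

```python
import json
from pathlib import Path
from typing import Any, Dict

# A few defaults quoted in this README; the real file contains more keys.
DEFAULTS: Dict[str, Any] = {
    "INDEX_TYPE": "ivf",
    "NLIST": 1024,
    "NPROBE": 16,
    "BATCH_SIZE": 16,
}


def load_settings(path: Path) -> Dict[str, Any]:
    """Load settings, writing defaults on first run and filling in missing keys."""
    if not path.exists():
        path.write_text(json.dumps(DEFAULTS, indent=2), encoding="utf-8")
        return dict(DEFAULTS)

    user = json.loads(path.read_text(encoding="utf-8"))
    settings = {**DEFAULTS, **user}  # user values override defaults
    if settings["INDEX_TYPE"] not in ("ivf", "flat"):
        raise ValueError(f"unsupported INDEX_TYPE: {settings['INDEX_TYPE']!r}")
    return settings
```

Merging user values over defaults lets a hand-edited file contain only the keys that differ.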

### FAISS Index Types

PyRagix supports two FAISS index types via the `INDEX_TYPE` setting:

- **"ivf"** (default): IVF (Inverted File) indexing for faster searches on large
  datasets. Configurable via `NLIST` (clusters, default: 1024) and `NPROBE`
  (search clusters, default: 16). Recommended for >10k documents.
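- **"flat"**: Flat indexing for exact, exhaustive search. Every query is compared
  against every stored vector, so it is slower on large datasets but never
  misses a nearest neighbor. Recommended for smaller datasets or when maximum
  precision is required.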

Suggested settings for modest hardware (16GB RAM, 6GB VRAM):

```json
{
  "INDEX_TYPE": "ivf",
  "NLIST": 1024,
  "NPROBE": 16
}
```
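
To show what `NLIST` and `NPROBE` trade off, here is a toy pure-Python model of the IVF idea. It does not use FAISS: the centroids are supplied by hand (FAISS learns them during training), and all function names are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]


def build_ivf(vectors: List[Vector], centroids: List[Vector]) -> Dict[int, List[int]]:
    """Put each vector id into the inverted list of its nearest centroid."""
    lists: Dict[int, List[int]] = {c: [] for c in range(len(centroids))}
    for vid, vec in enumerate(vectors):
        nearest = min(range(len(centroids)), key=lambda c: math.dist(vec, centroids[c]))
        lists[nearest].append(vid)
    return lists


def search_ivf(query: Vector, vectors: List[Vector], centroids: List[Vector],
               lists: Dict[int, List[int]], k: int, nprobe: int) -> List[Tuple[float, int]]:
    """Scan only the `nprobe` clusters whose centroids are closest to the query."""
    order = sorted(range(len(centroids)), key=lambda c: math.dist(query, centroids[c]))
    candidates = [vid for c in order[:nprobe] for vid in lists[c]]
    return sorted((math.dist(query, vectors[vid]), vid) for vid in candidates)[:k]


def search_flat(query: Vector, vectors: List[Vector], k: int) -> List[Tuple[float, int]]:
    """Exhaustive search: compare the query with every vector."""
    return sorted((math.dist(query, v), i) for i, v in enumerate(vectors))[:k]


centroids = [(0.0, 0.0), (10.0, 10.0)]  # two clusters, i.e. NLIST = 2
vectors = [(0.0, 1.0), (1.0, 0.0), (9.0, 10.0), (10.0, 8.0), (4.9, 4.9)]
lists = build_ivf(vectors, centroids)
query = (5.2, 5.2)

exact = search_flat(query, vectors, k=1)
approx = search_ivf(query, vectors, centroids, lists, k=1, nprobe=1)
wider = search_ivf(query, vectors, centroids, lists, k=1, nprobe=2)
```

With `nprobe=1` the search visits only the cluster nearest the query and misses the true neighbor, which sits just across the cluster boundary. Raising `nprobe` to 2 recovers it at the cost of scanning more vectors.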

### GPU Acceleration

PyRagix includes intelligent GPU detection and hybrid CPU/GPU support:

- **Automatic Detection**: Detects if GPU FAISS functions are available
- **Graceful Fallback**: Uses CPU when GPU unavailable (default behavior)
- **Configurable**: Enable GPU acceleration via `settings.json`:

```json
{
  "GPU_ENABLED": true,
  "GPU_DEVICE": 0,
  "GPU_MEMORY_FRACTION": 0.8
}
```

**Note**: GPU FAISS requires compatible hardware and a GPU-enabled FAISS build, which is installed separately from `faiss-cpu`. PyRagix is fully functional with CPU-only FAISS (the default) and uses GPU capabilities automatically when they are available.
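
One way such detection and fallback could look is sketched below. This is an assumption about the approach, not the project's code: `gpu_faiss_available` and `choose_device` are hypothetical names, and the check relies on GPU-enabled FAISS builds exposing `StandardGpuResources` and `get_num_gpus`.

```python
from typing import Any, Dict


def gpu_faiss_available() -> bool:
    """Return True only if faiss imports, has GPU classes, and sees a GPU."""
    try:
        import faiss  # faiss-cpu or a GPU-enabled build
    except ImportError:
        return False
    get_num_gpus = getattr(faiss, "get_num_gpus", None)
    if not hasattr(faiss, "StandardGpuResources") or not callable(get_num_gpus):
        return False  # CPU-only build
    return get_num_gpus() > 0


def choose_device(settings: Dict[str, Any]) -> str:
    """Use the GPU only when it is both requested and available."""
    if settings.get("GPU_ENABLED", False) and gpu_faiss_available():
        return "gpu"
    return "cpu"
```

Checking for the attribute instead of catching errors at index-build time keeps the fallback decision in one place.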

For larger document collections, increase the IVF parameters `NLIST` (more clusters) and `NPROBE` (more clusters searched per query).

## Advanced Configuration

PyRagix provides extensive configuration options in `settings.json` for fine-tuning performance and behavior. Here's a breakdown of the more technical parameters:

### Performance & Threading

- **`TORCH_NUM_THREADS`, `OPENBLAS_NUM_THREADS`, `MKL_NUM_THREADS`, `OMP_NUM_THREADS`, `NUMEXPR_MAX_THREADS`**: Control CPU parallelism for different math libraries. Default is 6 threads. Increase for high-core CPUs, decrease for shared systems or to reduce memory usage.

- **`BATCH_SIZE`**: Number of documents processed simultaneously during embedding (default: 16). Larger values use more memory but can be faster. Reduce if you encounter out-of-memory errors.

- **`BATCH_SIZE_RETRY_DIVISOR`**: When batch processing fails due to memory, the batch size is divided by this value (default: 4) and retried. Higher values mean more aggressive fallback.
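
The retry behavior described above can be sketched like this. It is a simplified stand-in, not the project's implementation: `embed_with_retry` and `embed_batch` are hypothetical, and `MemoryError` substitutes for whatever out-of-memory exception the real embedding backend raises. Whether PyRagix keeps the reduced batch size for later batches is also an assumption.

```python
from typing import Callable, List, Sequence


def embed_with_retry(
    items: Sequence[str],
    embed_batch: Callable[[Sequence[str]], List[List[float]]],
    batch_size: int = 16,
    retry_divisor: int = 4,
) -> List[List[float]]:
    """Embed items in batches, shrinking the batch size after a MemoryError."""
    results: List[List[float]] = []
    start = 0
    while start < len(items):
        batch = items[start:start + batch_size]
        try:
            results.extend(embed_batch(batch))
        except MemoryError:
            if batch_size == 1:
                raise  # cannot shrink any further
            batch_size = max(1, batch_size // retry_divisor)
            continue  # retry from the same position with a smaller batch
        start += len(batch)
    return results
```

With the defaults, a failing batch of 16 is retried as batches of 4, and a second failure drops to 1.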

### CUDA Memory Management

- **`PYTORCH_CUDA_ALLOC_CONF`**: Advanced CUDA memory allocation settings:
  - `max_split_size_mb:1024`: Prevents the allocator from splitting cached blocks larger than this size (in MB), which can reduce fragmentation.
  - `garbage_collection_threshold:0.9`: Starts reclaiming unused cached GPU memory once usage exceeds 90% of the memory available to the process. Lower values free memory more aggressively.
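
These settings only take effect if they are in the environment before the libraries initialize. A hedged sketch of applying them follows. `apply_runtime_env` is a hypothetical helper, and storing `PYTORCH_CUDA_ALLOC_CONF` as a single comma-separated string in `settings.json` is an assumption. `TORCH_NUM_THREADS` is left out because it is presumably applied another way (for example `torch.set_num_threads`).

```python
import os
from typing import Any, Dict

# Environment variables read by OpenBLAS, MKL, OpenMP, and NumExpr.
THREAD_KEYS = (
    "OPENBLAS_NUM_THREADS",
    "MKL_NUM_THREADS",
    "OMP_NUM_THREADS",
    "NUMEXPR_MAX_THREADS",
)


def apply_runtime_env(settings: Dict[str, Any]) -> None:
    """Export thread limits and CUDA allocator options as environment variables."""
    for key in THREAD_KEYS:
        if key in settings:
            os.environ[key] = str(settings[key])
    if "PYTORCH_CUDA_ALLOC_CONF" in settings:
        os.environ["PYTORCH_CUDA_ALLOC_CONF"] = str(settings["PYTORCH_CUDA_ALLOC_CONF"])


apply_runtime_env({
    "OMP_NUM_THREADS": 6,
    "PYTORCH_CUDA_ALLOC_CONF": "max_split_size_mb:1024,garbage_collection_threshold:0.9",
})
# Import torch, numpy, etc. only after this point so they see the values.
```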

### OCR Processing

- **`BASE_DPI`**: Resolution for OCR processing (default: 150). Higher values (200-300) improve text recognition accuracy but increase processing time and memory usage. Lower values (100-120) speed up processing for simple documents.
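
The memory cost of a higher DPI follows from simple arithmetic, since PDF page sizes are measured in points (1/72 inch). The helper names below are illustrative and not part of PyRagix.

```python
from typing import Tuple


def render_size(width_pt: float, height_pt: float, dpi: int) -> Tuple[int, int]:
    """Pixel dimensions of a page rendered at `dpi` (1 point = 1/72 inch)."""
    scale = dpi / 72
    return round(width_pt * scale), round(height_pt * scale)


def rgb_bytes(width_px: int, height_px: int) -> int:
    """Size in bytes of an uncompressed 8-bit RGB bitmap."""
    return width_px * height_px * 3


# US Letter is 612 x 792 points.
low = render_size(612, 792, dpi=150)   # (1275, 1650)
high = render_size(612, 792, dpi=300)  # (2550, 3300)
ratio = rgb_bytes(*high) / rgb_bytes(*low)  # 4.0
```

Doubling the DPI quadruples the pixel count, so a page that needs about 6.3 MB as a raw RGB bitmap at 150 DPI needs about 25 MB at 300 DPI.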

### Document Processing

- **`SKIP_FILES`**: Array of file patterns to ignore during ingestion (e.g., `["*.tmp", "backup_*"]`). Supports glob patterns.

- **`INGESTION_LOG_FILE`, `CRASH_LOG_FILE`**: Customize log file names for processing events and errors.
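
Glob-style skip patterns can be checked with the standard library, as in this sketch. `should_skip` is a hypothetical helper, and matching against the file name rather than the full path is an assumption about how PyRagix applies `SKIP_FILES`.

```python
from fnmatch import fnmatch
from pathlib import Path
from typing import Iterable


def should_skip(path: Path, patterns: Iterable[str]) -> bool:
    """Return True if the file name matches any of the glob patterns."""
    return any(fnmatch(path.name, pattern) for pattern in patterns)


skip_patterns = ["*.tmp", "backup_*"]
files = [Path("docs/notes.tmp"), Path("docs/backup_2024.pdf"), Path("docs/report.pdf")]
to_ingest = [f for f in files if not should_skip(f, skip_patterns)]
```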

### LLM Generation Parameters

- **`TEMPERATURE`**: Controls response creativity (0.0-1.0, default: 0.1). Lower values produce more focused, deterministic answers. Higher values increase creativity but may reduce accuracy.

- **`TOP_P`**: Nucleus sampling parameter (default: 0.9). Controls diversity by only considering tokens comprising the top 90% probability mass. Lower values make responses more focused.

- **`MAX_TOKENS`**: Maximum length of generated answers (default: 500). Increase for longer responses, decrease to save time and tokens.

- **`DEFAULT_TOP_K`**: Number of document chunks retrieved for each query (default: 7). More chunks provide richer context but may include less relevant information.

- **`REQUEST_TIMEOUT`**: Ollama API timeout in seconds (default: 60). Increase for complex queries or slower models.
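
As a rough illustration, the generation settings could map onto a request to Ollama's `/api/generate` endpoint as shown below. This is a sketch, not the code in `query_rag.py`: it uses `urllib` from the standard library instead of `requests`, and the `OLLAMA_MODEL` key and the mapping of `MAX_TOKENS` to Ollama's `num_predict` option are assumptions.

```python
import json
import urllib.request
from typing import Any, Dict


def build_payload(prompt: str, settings: Dict[str, Any]) -> Dict[str, Any]:
    """Translate PyRagix-style settings into an Ollama generate request body."""
    return {
        "model": settings.get("OLLAMA_MODEL", "llama3.2:3b-instruct-q4_0"),
        "prompt": prompt,
        "stream": False,  # ask for one JSON response instead of a stream
        "options": {
            "temperature": settings.get("TEMPERATURE", 0.1),
            "top_p": settings.get("TOP_P", 0.9),
            "num_predict": settings.get("MAX_TOKENS", 500),
        },
    }


def generate(prompt: str, settings: Dict[str, Any],
             url: str = "http://localhost:11434/api/generate") -> str:
    """Send the prompt to a running Ollama server and return the answer text."""
    data = json.dumps(build_payload(prompt, settings)).encode("utf-8")
    request = urllib.request.Request(
        url, data=data, headers={"Content-Type": "application/json"}
    )
    timeout = settings.get("REQUEST_TIMEOUT", 60)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read())["response"]
```

`generate` needs `ollama serve` running locally; `build_payload` can be inspected without a server.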

### Tuning Tips

- **Memory-constrained systems**: Reduce `BATCH_SIZE` to 8 or lower, decrease thread counts to 2-4, and set `BASE_DPI` to 100.
- **High-performance systems**: Increase thread counts to match CPU cores, raise `BATCH_SIZE` to 32+, and use `BASE_DPI` 200-300 for better OCR.
- **Better answers**: Increase `DEFAULT_TOP_K` to 10-15, raise `MAX_TOKENS` to 800-1000, and adjust `TEMPERATURE` (0.2-0.3 for creative but focused responses).

## Requirements

PyRagix depends on a robust set of Python libraries for AI, document processing,
and vector search. Key dependencies include:

- `torch` and `transformers`/`sentence-transformers` for embedding models
- `faiss-cpu` for vector storage and search (with optional GPU support detection)
- `paddleocr` and `paddlepaddle` for OCR operations
- PyMuPDF (imported as `fitz`) for PDF processing
- `beautifulsoup4` (with optional `lxml`) for HTML parsing
- `requests` for Ollama API calls
- `fastapi` and `uvicorn` for the web interface and REST API
- `psutil` for system memory detection

See [requirements.in](requirements.in) for the complete dependency list and
`requirements.txt` for pinned versions. The system automatically adapts memory
settings based on available RAM (16GB+ recommended for optimal performance).

## Contributing

We welcome contributions! If you’re learning RAG or want to enhance PyRagix,
here’s how to get started:

1. Fork the repo and create a feature branch.
2. Follow the installation steps above.
3. Submit a pull request with clear descriptions of your changes.

Ideas for contributions:

- Add support for more document formats (e.g., DOCX).
- Extend the existing web interface.
- Optimize for different hardware (e.g., high-end GPUs or cloud).
- Enhance OCR handling or embedding models.

Please adhere to Python’s PEP 8 style guide and include type hints for
consistency.

## License

This project is licensed under the MIT License. See [LICENSE](LICENSE) for
details.

## Acknowledgements

- Built with love for the Python and AI communities.
- Thanks to the creators of `faiss`, `sentence-transformers`, `paddleocr`,
  `ollama`, and `langchain` for their amazing tools.

Happy learning, and enjoy searching your documents with PyRagix! 🚀
