# PyUlysses

**PyUlysses** is a Python library for connecting to Dremio DataHub over Apache Arrow Flight. It provides a client interface for executing SQL queries and managing data operations, with built-in DuckDB integration for query results.

[![Python Version](https://img.shields.io/badge/python-3.11%2B-blue.svg)](https://www.python.org/downloads/)
[![License](https://img.shields.io/badge/license-Apache%202.0-green.svg)](LICENSE)

## Features

✨ **Key Features:**
- 🚀 High-performance Arrow Flight connectivity to Dremio
- 🦆 Native DuckDB integration for query results
- 🔄 Automatic retry logic with configurable timeouts
- 🛡️ Robust error handling and logging
- 🔐 Secure authentication via PAT tokens
- ⚡ Zero-copy data transfer with Apache Arrow
- 📊 Direct integration with analytical workflows

## Installation

### Virtual Environment (Recommended)

While optional, we **strongly recommend** using a virtual environment to isolate PyUlysses dependencies from your system Python installation.

#### Linux/macOS (Cortex)

```bash
# Create virtual environment
python3 -m venv venv

# Activate virtual environment
source venv/bin/activate

# Upgrade pip
pip install --upgrade pip
```

#### Windows

```bash
# Create virtual environment
python -m venv venv

# Activate virtual environment (Command Prompt)
venv\Scripts\activate.bat

# Or PowerShell
venv\Scripts\Activate.ps1

# Upgrade pip
python -m pip install --upgrade pip
```

### Using pip

```bash
pip install -r requirements.txt
```

### Using Poetry

```bash
poetry install --all-extras
```

## Quick Start

### 1. Configure Environment Variables

Create a `.env` file in your project root:

```bash
DREMIO_USERNAME=your_username
DREMIO_ACCESS_TOKEN=your_personal_access_token
DREMIO_HOST=dremio.example.com
DREMIO_PORT=9047
```

### 2. Basic Usage

```python
import os
from dotenv import load_dotenv
from pyulysses.connector.datahub_connector import Client

# Load credentials
load_dotenv()
username = os.getenv("DREMIO_USERNAME")
token = os.getenv("DREMIO_ACCESS_TOKEN")

# Initialize client
client = Client(username=username, token=token)

# Execute query
query = "SELECT * FROM your_schema.your_table LIMIT 10"
result = client.query(
    query=query,
    retries=3,
    delay=3,
    query_timeout=60,
    arrow_format=False,  # Returns DuckDB relation
)

# Display results
print(result)
```

### 3. Using the CLI Tool

PyUlysses includes a convenient command-line interface:

```bash
python src/main.py
```

Then enter your SQL query when prompted. Type `quit` to exit.
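
The prompt behavior described above follows a common read-execute-print pattern. The sketch below illustrates that pattern only; it is not the contents of `main.py`. The names `run_prompt_loop`, `execute`, `lines`, and `emit` are hypothetical, and `execute` stands in for a call such as `client.query`.

```python
from typing import Callable, Iterable


def run_prompt_loop(
    execute: Callable[[str], object],
    lines: Iterable[str],
    emit: Callable[[object], None] = print,
) -> int:
    """Pass each entered query to `execute` until `quit` is seen.

    Returns the number of queries attempted. All names are hypothetical.
    """
    attempted = 0
    for raw in lines:
        query = raw.strip()
        if not query:
            continue  # ignore blank input
        if query.lower() == "quit":
            break  # the documented exit command
        attempted += 1
        try:
            emit(execute(query))
        except Exception as exc:
            # Report the failure and keep the session alive.
            emit(f"Query failed: {exc}")
    return attempted


# Demo with a stand-in executor that echoes the query instead of calling Dremio.
scripted_input = ["SELECT 1 AS test", "", "quit", "SELECT 2"]
count = run_prompt_loop(lambda q: f"ran: {q}", scripted_input)
print(f"{count} query attempted")
```

Taking the executor and the input source as parameters keeps the loop testable without a live Dremio connection; an interactive version could pass `iter(input, None)`-style input instead of a list.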

## Usage Examples

### Execute a Simple Query

```python
from pyulysses.connector.datahub_connector import Client

client = Client(username="user", token="token")

# Simple SELECT query
result = client.query("SELECT 1 as test")
print(result)
```

### Query with DuckDB Integration

```python
# Get results as DuckDB relation
result = client.query(
    query="SELECT * FROM my_table",
    arrow_format=False,  # Returns DuckDB relation
)

# You can now use DuckDB operations
filtered = result.filter("column > 100")
aggregated = filtered.aggregate("SUM(value)")
```

### Advanced Configuration

```python
client = Client(username="user", token="token")

# Custom host and port
client.set_host(host="custom-dremio.company.com", port=32010)

# Query with custom timeout and retries
result = client.query(
    query="SELECT * FROM large_table", retries=5, delay=5, query_timeout=120
)
```
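
To make the `retries` and `delay` parameters more concrete, here is a minimal sketch of a fixed-delay retry wrapper. It assumes that `retries` is the maximum number of attempts and `delay` is the pause in seconds between them; the actual semantics inside PyUlysses may differ. `call_with_retries` and its `sleep` parameter are hypothetical and not part of the library.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retries(
    operation: Callable[[], T],
    retries: int = 3,
    delay: float = 3.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `operation` up to `retries` times, pausing `delay` seconds between attempts."""
    if retries < 1:
        raise ValueError("retries must be at least 1")
    for attempt in range(1, retries + 1):
        try:
            return operation()
        except Exception:
            if attempt == retries:
                raise  # out of attempts: surface the last error
            sleep(delay)
    raise AssertionError("unreachable")


# Demo: an operation that fails twice before succeeding.
attempts = {"count": 0}


def flaky_query() -> str:
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ConnectionError("transient failure")
    return "ok"


# delay=0 keeps the demo instant; a real caller would use a positive delay.
print(call_with_retries(flaky_query, retries=5, delay=0))
```

The sketch retries on any exception for brevity. A production wrapper would normally retry only on errors known to be transient.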

### List Available Tables

```python
# Get all tables
tables = client.list_tables()
print(tables)

# Get columns for a specific table
columns = client.list_columns("my_table")
print(columns)
```

## Architecture

PyUlysses leverages modern data infrastructure components:

- **Apache Arrow Flight**: High-performance RPC framework for large datasets
- **PyArrow**: Python bindings for Apache Arrow
- **DuckDB**: In-process analytical database for query results
- **Python-dotenv**: Environment variable management

## Configuration

### Environment Variables

| Variable | Description | Required | Default |
|----------|-------------|----------|----------|
| `DREMIO_USERNAME` | Dremio username | Yes | - |
| `DREMIO_ACCESS_TOKEN` | Personal Access Token | Yes | - |
| `DREMIO_HOST` | Dremio server hostname | No | dremio.example.com |
| `DREMIO_PORT` | Dremio server port | No | 9047 |
| `ULYSSES_LOGGING_LEVEL` | Logging level (DEBUG, INFO, WARNING, ERROR) | No | INFO |
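
As an illustration of how the variables in this table could be resolved, the sketch below reads them from a mapping and applies the documented defaults. It uses only the standard library. `DremioSettings` and `load_settings` are hypothetical names, not part of the PyUlysses API, and the library's own loading logic may differ.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

VALID_LEVELS = ("DEBUG", "INFO", "WARNING", "ERROR")
REQUIRED = ("DREMIO_USERNAME", "DREMIO_ACCESS_TOKEN")


@dataclass(frozen=True)
class DremioSettings:
    """Hypothetical container mirroring the environment variable table."""

    username: str
    token: str
    host: str = "dremio.example.com"
    port: int = 9047
    log_level: str = "INFO"


def load_settings(env: Optional[Mapping[str, str]] = None) -> DremioSettings:
    """Build settings from `env` (defaults to os.environ), applying table defaults."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise ValueError(f"Missing required variables: {', '.join(missing)}")
    level = env.get("ULYSSES_LOGGING_LEVEL", "INFO").upper()
    if level not in VALID_LEVELS:
        raise ValueError(f"Unsupported logging level: {level}")
    return DremioSettings(
        username=env["DREMIO_USERNAME"],
        token=env["DREMIO_ACCESS_TOKEN"],
        host=env.get("DREMIO_HOST", "dremio.example.com"),
        port=int(env.get("DREMIO_PORT", "9047")),  # env values are strings
        log_level=level,
    )


# Demo with an explicit mapping instead of the real environment.
settings = load_settings({"DREMIO_USERNAME": "user", "DREMIO_ACCESS_TOKEN": "token"})
print(settings.host, settings.port, settings.log_level)
```

Failing early on missing required variables gives a clearer error than a rejected connection later on.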

### Client Configuration

```python
# Client will read DREMIO_HOST and DREMIO_PORT from .env
client = Client(username="user", token="token")

# Or override via constructor
client = Client(username="user", token="token", host="dremio.example.com", port=9047)

# Or change after initialization
client.set_host(host="dremio.example.com", port=9047)
```

## Testing

PyUlysses uses pytest for testing. The test suite covers Dremio connectivity and query execution.

### Running Tests

```bash
# Run all tests
pytest tests/test_connection_datahub.py -v

# Run only fast tests (exclude slow tests)
pytest tests/test_connection_datahub.py -v -m "not slow"

# Run only slow tests
pytest tests/test_connection_datahub.py -v -m "slow"

# Run with detailed output
pytest tests/test_connection_datahub.py -v -s
```

### Test Structure

The test suite includes:

1. **Client Initialization** - Validates client setup and configuration
2. **Simple Queries** - Tests basic SQL query execution
3. **Timeout Handling** - Verifies timeout behavior
4. **Real Data Queries** - Tests queries against actual Dremio tables (marked as slow)

### Prerequisites for Testing

Ensure you have a `.env` file with valid credentials:

```bash
DREMIO_USERNAME=your_username
DREMIO_ACCESS_TOKEN=your_token
```

If credentials are not configured, tests will be automatically skipped.
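
One way to implement this kind of skip is to check for the required variables once and feed the result to a skip marker. The helper below is a standard-library sketch; `missing_credentials` is a hypothetical name, and the actual test suite may detect credentials differently.

```python
import os
from typing import List, Mapping, Optional

REQUIRED_VARS = ("DREMIO_USERNAME", "DREMIO_ACCESS_TOKEN")


def missing_credentials(env: Optional[Mapping[str, str]] = None) -> List[str]:
    """Return the names of required variables that are unset or blank."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name, "").strip()]


# Demo with explicit mappings instead of the real environment.
print(missing_credentials({"DREMIO_USERNAME": "user"}))
# → ['DREMIO_ACCESS_TOKEN']
print(missing_credentials({"DREMIO_USERNAME": "user", "DREMIO_ACCESS_TOKEN": "token"}))
# → []
```

In a pytest module, the result could drive a module-level marker such as `pytestmark = pytest.mark.skipif(bool(missing_credentials()), reason="Dremio credentials not configured")`. Because the helper reads `os.environ` by default, the `.env` file needs to be loaded (for example with `load_dotenv()`) before the check runs.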

## Documentation

Full documentation is available in the `docs/` directory. To build and serve it locally:

```bash
mkdocs serve
```

Then visit `http://localhost:8000`.

## Project Structure

```text
pyulysses/
├── src/
│   ├── main.py                           # CLI interface
│   └── pyulysses/
│       ├── connector/
│       │   ├── __init__.py
│       │   ├── datahub_connector.py          # Main Dremio client
│       │   └── pyarrow_client.py             # Arrow Flight client
│       ├── env_var/
│       │   ├── __init__.py
│       │   └── ulysses_env_var.py            # Environment utilities
│       └── logger/
│           ├── __init__.py
│           └── ulysses_logger.py             # Logging utilities
├── scripts/
│   └── query_sample.sql                  # Sample SQL queries
├── docs/                                 # MkDocs documentation
│   ├── index.md
│   ├── getting-started.md
│   ├── user-guide.md
│   ├── configuration.md
│   └── examples.md
├── .env                                  # Configuration (not in git)
├── pyproject.toml                        # Project metadata & dependencies
├── mkdocs.yml                            # Documentation configuration
├── LICENSE                               # Apache 2.0 License
└── README.md                             # This file
```

## Contributing

Contributions are welcome! Please feel free to submit a merge request.

## License

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.

## Authors

- Wallace Camargo - *Initial work*

## Acknowledgments

- Built on top of Apache Arrow Flight and Dremio
- Inspired by modern data engineering best practices

## Support

For issues, questions, or contributions, please open an issue on GitLab.
