Metadata-Version: 2.4
Name: dbqt
Version: 0.1.12
Summary: DataBase Quality Tool
Project-URL: Repository, https://github.com/namilinklabs/dbqt
Project-URL: Documentation, https://github.com/namilinklabs/dbqt
Author: NamiLink
License: MIT
License-File: LICENSE
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Topic :: Software Development :: Quality Assurance
Requires-Python: >=3.9
Requires-Dist: boto3>=1.28.30
Requires-Dist: openpyxl>=3.1.2
Requires-Dist: polars>=1.25.2
Requires-Dist: pyarrow>=18.1.0
Requires-Dist: pyodbc>=5.2.0
Requires-Dist: pyyaml>=6.0.1
Requires-Dist: reladiff>=0.6.0
Requires-Dist: sqeleton[mysql,postgresql,snowflake]>=0.1.7
Requires-Dist: tqdm>=4.67.1
Description-Content-Type: text/markdown

# DBQT (DataBase Quality Tool) 🎯

DBQT is a lightweight, Python-first data quality testing framework that helps data teams maintain high-quality data through automated checks and intelligent suggestions.

## 🛠️ Current Tools

### Column Comparison Tool (dbqt compare)
Compare schemas between databases or files:
- Table-level comparison
- Column-level comparison with data type compatibility checks
- Support for CSV and Parquet files
- Handles nested Parquet schemas (arrays, structs, maps)
- Intelligent data type compatibility checking
- Generates detailed Excel report with:
  - Table differences
  - Column differences
  - Data type mismatches
  - Formatted worksheets for easy analysis

Usage:
```bash
dbqt compare source_schema.csv target_schema.csv
# Or compare Parquet files directly:
dbqt compare source.parquet target.parquet
```

To generate CSV schema files from your database, run this query:
```sql
SELECT
    UPPER(table_schema) AS SCH,        -- optional
    UPPER(table_name) AS TABLE_NAME,
    UPPER(column_name) AS COL_NAME,
    UPPER(data_type) AS DATA_TYPE      -- optional
FROM information_schema.columns
WHERE UPPER(table_schema) = UPPER('YOUR_SCHEMA')
ORDER BY table_name, ordinal_position;
```

Export the results to CSV format to use with the compare tool.
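
To make the report categories concrete, the comparison of two exported schema CSVs can be sketched in plain Python. This is a minimal sketch and not DBQT's implementation: the function names `load_schema` and `diff_schemas` are hypothetical, the CSV headers are assumed to be the `TABLE_NAME`, `COL_NAME`, and `DATA_TYPE` aliases from the query above, and data types are compared by strict equality rather than by any compatibility rules.

```python
import csv
import io


def load_schema(csv_text):
    """Parse schema CSV text into {table: {column: data_type}}."""
    schema = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        table = row["TABLE_NAME"].upper()
        column = row["COL_NAME"].upper()
        # DATA_TYPE is optional in the export, so fall back to an empty string.
        schema.setdefault(table, {})[column] = (row.get("DATA_TYPE") or "").upper()
    return schema


def diff_schemas(source, target):
    """Return tables and columns missing on either side, plus type mismatches."""
    report = {
        "tables_only_in_source": sorted(set(source) - set(target)),
        "tables_only_in_target": sorted(set(target) - set(source)),
        "columns_only_in_source": [],
        "columns_only_in_target": [],
        "type_mismatches": [],
    }
    # Column-level checks only make sense for tables present on both sides.
    for table in sorted(set(source) & set(target)):
        src_cols, tgt_cols = source[table], target[table]
        for col in sorted(set(src_cols) - set(tgt_cols)):
            report["columns_only_in_source"].append((table, col))
        for col in sorted(set(tgt_cols) - set(src_cols)):
            report["columns_only_in_target"].append((table, col))
        for col in sorted(set(src_cols) & set(tgt_cols)):
            if src_cols[col] != tgt_cols[col]:
                report["type_mismatches"].append(
                    (table, col, src_cols[col], tgt_cols[col])
                )
    return report
```

Each key of the returned dictionary corresponds loosely to one of the report sections listed earlier (table differences, column differences, data type mismatches).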

### Parquet Combine Tool (dbqt combine)
Combine multiple Parquet files into a single file:
- Validates schema compatibility
- Preserves nested data structures
- Handles large datasets efficiently

Usage:
```bash
dbqt combine [output.parquet]  # Combines all .parquet files in current directory
```

### Database Statistics Tool (dbqt dbstats)
Collect and analyze database statistics:
- Fetches table row counts in parallel for faster execution.
- Updates statistics in a CSV file.
- Configurable through YAML.

Usage:
```bash
dbqt dbstats config.yaml
```

Example config.yaml:
```yaml
# Database connection configuration
connection:
  type: mysql  # mysql, snowflake, duckdb, csv, parquet, s3parquet
  host: localhost
  user: myuser
  password: mypassword
  database: mydb
  # Optional AWS configs for s3parquet
  # aws_profile: default
  # aws_region: us-west-2
  # bucket: my-bucket

  # Snowflake-specific configs
  # type: snowflake
  # account: your_account.region
  # warehouse: YOUR_WAREHOUSE
  # database: YOUR_DB
  # schema: YOUR_SCHEMA
  # role: YOUR_ROLE
  # authenticator: externalbrowser  # Optional: use SSO authentication
  # user: your_username
  # password: your_password  # Not needed if using externalbrowser auth

# Path to CSV file containing table names to analyze
tables_file: tables.csv
```

The tables.csv file must contain at least a `table_name` column. The tool adds or updates a `row_count` column with the results.
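
The read, count in parallel, and write-back flow can be sketched with the standard library. This is a rough sketch under stated assumptions, not the code DBQT runs: `update_row_counts` is a hypothetical name, and `count_rows` stands in for whatever callable issues `SELECT COUNT(*)` against your database.

```python
import csv
from concurrent.futures import ThreadPoolExecutor


def update_row_counts(tables_csv_path, count_rows, max_workers=4):
    """Fill in a row_count column for every table listed in the CSV.

    count_rows is a hypothetical callable mapping a table name to its row count.
    """
    with open(tables_csv_path, newline="") as handle:
        reader = csv.DictReader(handle)
        fieldnames = list(reader.fieldnames or [])
        rows = list(reader)

    if "row_count" not in fieldnames:
        fieldnames.append("row_count")

    # pool.map preserves input order, so counts line up with rows.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        counts = list(pool.map(count_rows, [row["table_name"] for row in rows]))

    for row, count in zip(rows, counts):
        row["row_count"] = count

    # Rewrite the same file with the added or refreshed column.
    with open(tables_csv_path, "w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
    return rows
```

Threads suit this workload because each count spends most of its time waiting on the database rather than using the CPU.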

### Null Column Check Tool (dbqt nullcheck)
Find columns in which every value is null, across multiple Snowflake tables:
- Identifies completely empty columns.
- Reports columns with few distinct values (5 or fewer).
- Efficiently checks multiple tables in parallel.
- Generates a markdown report summarizing the findings.

Usage:
```bash
dbqt nullcheck --config snowflake_config.yaml
```
This tool currently only supports Snowflake.
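
One way to detect both conditions in a single pass per table is to compute `COUNT(col)` and `COUNT(DISTINCT col)` for every column. The sketch below illustrates that idea only; it is not necessarily the query DBQT issues. The names `build_profile_query` and `classify_columns` are hypothetical, table and column names are assumed to be trusted identifiers, and SQLite is used purely as a local stand-in for Snowflake.

```python
import sqlite3


def build_profile_query(table, columns):
    """Build one query returning non-null and distinct counts per column."""
    parts = []
    for col in columns:
        # COUNT(col) ignores NULLs, so 0 means the column is entirely null.
        parts.append(f"COUNT({col}) AS {col}__non_null")
        parts.append(f"COUNT(DISTINCT {col}) AS {col}__distinct")
    return f"SELECT {', '.join(parts)} FROM {table}"


def classify_columns(row, columns, low_distinct_threshold=5):
    """Split columns into all-null and low-distinct groups from one result row."""
    all_null, low_distinct = [], []
    for index, col in enumerate(columns):
        non_null, distinct = row[2 * index], row[2 * index + 1]
        if non_null == 0:
            all_null.append(col)
        elif distinct <= low_distinct_threshold:
            low_distinct.append(col)
    return all_null, low_distinct


# Demo against an in-memory SQLite table (stand-in for a Snowflake table).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, status TEXT, notes TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(i, "open" if i % 2 else "closed", None) for i in range(10)],
)
columns = ["id", "status", "notes"]
row = conn.execute(build_profile_query("orders", columns)).fetchone()
result = classify_columns(row, columns)
print(result)
```

In the demo, `notes` is reported as entirely null and `status` as low-distinct, while `id` has ten distinct values and is not flagged.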

### Dynamic Query Tool (dbqt dynamic-query)
Run a templated SQL query against Athena once for each value in a CSV file:
- Substitutes values from a CSV into a query template.
- Executes queries sequentially and writes results to an output file.
- Useful for running the same query against multiple tables or with different parameters.

Usage:
```bash
dbqt dynamic-query --config athena_config.yaml --csv values.csv --query "SELECT COUNT(1) FROM {var_from_csv}"
```
This tool currently only supports AWS Athena.
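
The substitution step can be illustrated with Python's built-in string formatting. This is a sketch under an assumption the usage line suggests but does not confirm: that each `{placeholder}` in the template is named after a CSV column. The function name `render_queries` is hypothetical, and the step that submits each query to Athena is left out.

```python
import csv
import io


def render_queries(template, csv_text):
    """Yield one rendered query per CSV row, using column names as placeholders."""
    for row in csv.DictReader(io.StringIO(csv_text)):
        # str.format replaces {column_name} with that row's value.
        yield template.format(**row)


values_csv = "var_from_csv\nsales.orders\nsales.customers\n"
for query in render_queries("SELECT COUNT(1) FROM {var_from_csv}", values_csv):
    print(query)
```

Values are substituted verbatim, so a template like this should only be fed CSV files you trust.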

### Parquetizer Tool (dbqt parquetizer)
Recursively find Parquet files that lack the `.parquet` extension and rename them:
- Scans a directory for files without extensions.
- Validates if a file is a Parquet file by checking its magic bytes.
- Renames valid Parquet files to include the `.parquet` extension.

Usage:
```bash
dbqt parquetizer [directory] # Scans from the specified directory (or current if not provided)
```
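
The magic-byte check relies on the Parquet format itself: a Parquet file begins and ends with the four bytes `PAR1`. The sketch below shows one way to apply that check and rename matching files. It is not DBQT's code; `is_parquet` and `add_parquet_extension` are hypothetical names, and whether the real tool inspects the header, the footer, or both is an assumption.

```python
from pathlib import Path

PARQUET_MAGIC = b"PAR1"


def is_parquet(path):
    """Return True if the file starts and ends with the Parquet magic bytes."""
    path = Path(path)
    # A valid file holds at least the header magic, a 4-byte footer length,
    # and the footer magic, so anything under 12 bytes cannot be Parquet.
    if path.stat().st_size < 12:
        return False
    with path.open("rb") as handle:
        head = handle.read(4)
        handle.seek(-4, 2)  # 2 = seek relative to the end of the file
        tail = handle.read(4)
    return head == PARQUET_MAGIC and tail == PARQUET_MAGIC


def add_parquet_extension(directory="."):
    """Rename extension-less Parquet files under directory; return the new paths."""
    renamed = []
    # sorted() materializes the listing before any file is renamed.
    for path in sorted(Path(directory).rglob("*")):
        if path.is_file() and path.suffix == "" and is_parquet(path):
            target = path.with_name(path.name + ".parquet")
            if not target.exists():  # never overwrite an existing file
                path.rename(target)
                renamed.append(target)
    return renamed
```

Checking only eight bytes per file keeps the scan cheap even across large directory trees, at the cost of not validating the footer metadata.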

## 🚀 Future Plans

### Core DBQT Features (Coming Soon)
- AI-powered column classification using Qwen2 0.5B
- Automatic check suggestions
- 20+ built-in data quality checks
- Python-first API
- No backend required
- Customizable check framework

### Planned Checks
- Completeness checks (null values)
- Uniqueness validation
- Format validation (regex, dates, emails)
- Range/boundary checks
- Value validation
- Statistical analysis
- Dependency checks

### Integration Plans
- Data pipeline integration
- Scheduled runs
- Parallel check execution
- Multiple database backend support

## 📄 License

This project is licensed under the MIT License.
