Metadata-Version: 2.4
Name: pydantic-glue
Version: 0.7.0
Summary: Convert pydantic model to aws glue schema for terraform
License: MIT
License-File: LICENSE
Keywords: pydantic,glue,athena,types,convert
Author: Serhii Dimchenko
Author-email: svdimchenko@gmail.com
Requires-Python: >=3.9,<4.0
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Programming Language :: Python :: 3.14
Requires-Dist: jsonref (>=1.1.0,<2.0.0)
Requires-Dist: pydantic (>=2.7.1,<3.0.0)
Project-URL: Bug Tracker, https://github.com/svdimchenko/pydantic-glue/issues
Project-URL: Repository, https://github.com/svdimchenko/pydantic-glue
Project-URL: Releases, https://github.com/svdimchenko/pydantic-glue/releases
Description-Content-Type: text/markdown

# JSON Schema to AWS Glue schema converter

<!-- markdownlint-disable MD013 -->

[![pre-commit](https://img.shields.io/badge/pre--commit-enabled-brightgreen?logo=pre-commit&logoColor=white)](https://github.com/pre-commit/pre-commit)
[![Ruff](https://img.shields.io/endpoint?url=https://raw.githubusercontent.com/astral-sh/ruff/main/assets/badge/v2.json)](https://github.com/astral-sh/ruff)
[![Checked with mypy](http://www.mypy-lang.org/static/mypy_badge.svg)](http://mypy-lang.org/)
[![image](https://img.shields.io/pypi/v/pydantic-glue.svg)](https://pypi.python.org/pypi/pydantic-glue)
[![image](https://img.shields.io/pypi/l/pydantic-glue.svg)](https://github.com/svdimchenko/pydantic-glue/blob/main/LICENSE)
[![image](https://img.shields.io/pypi/pyversions/pydantic-glue.svg)](https://pypi.python.org/pypi/pydantic-glue)
[![Actions status](https://github.com/svdimchenko/pydantic-glue/actions/workflows/ci.yml/badge.svg)](https://github.com/svdimchenko/pydantic-glue/actions)

<!-- markdownlint-restore -->

<!-- TOC -->
* [JSON Schema to AWS Glue schema converter](#json-schema-to-aws-glue-schema-converter)
  * [Installation](#installation)
  * [What?](#what)
  * [Why?](#why)
  * [Example](#example)
  * [Override the type for the AWS Glue Schema](#override-the-type-for-the-aws-glue-schema)
  * [How it works?](#how-it-works)
  * [Future work](#future-work)
<!-- TOC -->

## Installation

```bash
pip install pydantic-glue
```

## What?

Converts `pydantic` models to JSON Schema and then to an AWS Glue schema,
so in theory anything that can be converted to JSON Schema *could* also work.

## Why?

When `AWS Kinesis Firehose` is configured to receive JSON and write `parquet` files to S3,
you need to define an `AWS Glue` table so Firehose knows which schema to use when creating the parquet files.

AWS Glue lets you define a schema using `Avro` or `JSON Schema` and then create a table from that schema,
but as of *May 2022*
tables created that way cannot be used with Kinesis Firehose.

<https://stackoverflow.com/questions/68125501/invalid-schema-error-in-aws-glue-created-via-terraform>

This is also confirmed by AWS support.

What one could do is create a table and set the columns manually,
but this means you now have two sources of truth to maintain.

This tool allows you to define a table in `pydantic`
and generate a JSON with column types that can be used with `terraform` to create a Glue table.

## Example

Take the following pydantic class:

```python title="example.py"
from pydantic import BaseModel
from typing import List


class Bar(BaseModel):
    name: str
    age: int


class Foo(BaseModel):
    nums: List[int]
    bars: List[Bar]
    other: str

```

Running `pydantic-glue`

```bash
pydantic-glue -f example.py -c Foo
```

you get this JSON in the terminal:

```json
{
  "//": "Generated by pydantic-glue at 2022-05-25 12:35:55.333570. DO NOT EDIT",
  "columns": {
    "nums": "array<int>",
    "bars": "array<struct<name:string,age:int>>",
    "other": "string"
  }
}
```

which can be used in Terraform like this:

```terraform
locals {
  columns = jsondecode(file("${path.module}/glue_schema.json")).columns
}

resource "aws_glue_catalog_table" "table" {
  name          = "table_name"
  database_name = "db_name"

  storage_descriptor {
    dynamic "columns" {
      for_each = local.columns

      content {
        name = columns.key
        type = columns.value
      }
    }
  }
}
```

Alternatively, you can run the CLI with the `-o` flag to set the output file location:

```bash
pydantic-glue -f example.py -c Foo -o example.json -l
```

If your pydantic models use field aliases but you prefer the field names to appear in the generated schema,
pass the `--schema-by-name` flag.

See the pydantic documentation on [aliases](https://docs.pydantic.dev/latest/concepts/alias/) for details.

The following model is converted differently when `--schema-by-name` is passed.

```python
from pydantic import BaseModel, Field

class A(BaseModel):
    hey: str = Field(alias="h")
    ho: str
```

```bash
pydantic-glue -f tests/data/input.py -c A

2025-02-01 00:08:45,046 - INFO - Generated file content:
{
  "//": "Generated by pydantic-glue at 2025-01-31 23:08:45.046012+00:00. DO NOT EDIT",
  "columns": {
    "h": "string",
    "ho": "string"
  }
}
```

```bash
pydantic-glue -f tests/data/input.py -c A --schema-by-name

2025-02-01 00:09:18,381 - INFO - Generated file content:
{
  "//": "Generated by pydantic-glue at 2025-01-31 23:09:18.380586+00:00. DO NOT EDIT",
  "columns": {
    "hey": "string",
    "ho": "string"
  }
}
```
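The difference between the two outputs mirrors how pydantic itself names properties when it generates JSON Schema. The sketch below uses pydantic's `model_json_schema` with its `by_alias` parameter to show both variants for the model above. That `--schema-by-name` maps onto this parameter internally is an assumption, not something this README states.

```python
from pydantic import BaseModel, Field


class A(BaseModel):
    hey: str = Field(alias="h")
    ho: str


# Default behaviour: property keys use the alias where one is defined.
schema_by_alias = A.model_json_schema(by_alias=True)

# Property keys use the Python field names instead.
schema_by_name = A.model_json_schema(by_alias=False)

print(list(schema_by_alias["properties"]))
print(list(schema_by_name["properties"]))
```

The first list contains `h` and `ho`, the second `hey` and `ho`, matching the column names in the two CLI runs above.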

## Override the type for the AWS Glue Schema

Wherever there is a `type` key in the input JSON Schema, an additional key `glue_type` may be
defined to override the type that is used in the AWS Glue Schema. This is, for example, useful for
a pydantic model that has a field of type `int` that is unix epoch time, while the column type you
would like in Glue is of type `timestamp`.

Additional JSON Schema keys can be added to a pydantic model by using the
[`Field` function](https://docs.pydantic.dev/latest/api/fields/#pydantic.fields.Field)
with the argument `json_schema_extra` like so:

```python
from pydantic import BaseModel, Field

class A(BaseModel):
    epoch_time: int = Field(
        ...,
        json_schema_extra={
            "glue_type": "timestamp",
        },
    )
```

The resulting JSON Schema will be:

```json
{
    "properties": {
        "epoch_time": {
            "glue_type": "timestamp",
            "title": "Epoch Time",
            "type": "integer"
        }
    },
    "required": [
        "epoch_time"
    ],
    "title": "A",
    "type": "object"
}
```

And the result after processing with pydantic-glue:

```json
{
  "//": "Generated by pydantic-glue at 2022-05-25 12:35:55.333570. DO NOT EDIT",
  "columns": {
    "epoch_time": "timestamp"
  }
}
```

Recursion through object properties stops at any node that supplies a `glue_type`. If the type is
complex, you must supply the full complex type yourself.
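To illustrate what "recursion stops" means in practice, here is a minimal sketch that works on a hand-written JSON Schema dictionary. The function `resolve_type` and the `tags` field are hypothetical and exist only for this example; the real implementation in pydantic-glue may be organised differently.

```python
def resolve_type(node: dict) -> str:
    """Return a Glue type for a JSON Schema node (illustrative only)."""
    # An explicit glue_type wins, and nothing below this node is inspected.
    if "glue_type" in node:
        return node["glue_type"]
    if node["type"] == "object":
        fields = ",".join(
            f"{name}:{resolve_type(sub)}" for name, sub in node["properties"].items()
        )
        return f"struct<{fields}>"
    # Only the scalar types shown in this README are covered here.
    return {"string": "string", "integer": "int"}[node["type"]]


schema = {
    "type": "object",
    "properties": {
        "epoch_time": {"type": "integer", "glue_type": "timestamp"},
        # The override is a complete complex type, so "properties" is never visited.
        "tags": {
            "type": "object",
            "glue_type": "map<string,string>",
            "properties": {"ignored": {"type": "integer"}},
        },
        "count": {"type": "integer"},
    },
}

columns = {name: resolve_type(sub) for name, sub in schema["properties"].items()}
print(columns)
```

Because `tags` carries its own `glue_type`, the nested `ignored` property never contributes to the result, which is why the override has to spell out the whole complex type.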

## How it works?

* `pydantic` gets converted to JSON Schema
* the JSON Schema types get mapped to Glue types recursively
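The second step can be sketched as a small recursive function. This is a simplified illustration, not the library's code: `to_glue_type` and `SCALAR_TYPES` are hypothetical names, only the types that appear in the example output of this README are mapped, and the schema is written by hand with nested models already inlined rather than referenced through `$ref`.

```python
# Scalar mappings taken from the example output in this README.
SCALAR_TYPES = {
    "string": "string",
    "integer": "int",
}


def to_glue_type(node: dict) -> str:
    """Recursively map a dereferenced JSON Schema node to a Glue type string."""
    json_type = node["type"]
    if json_type == "array":
        return f"array<{to_glue_type(node['items'])}>"
    if json_type == "object":
        fields = ",".join(
            f"{name}:{to_glue_type(sub)}" for name, sub in node["properties"].items()
        )
        return f"struct<{fields}>"
    return SCALAR_TYPES[json_type]


# Hand-written schema equivalent to the Foo model from the Example section.
bar_schema = {
    "type": "object",
    "properties": {"name": {"type": "string"}, "age": {"type": "integer"}},
}
foo_schema = {
    "type": "object",
    "properties": {
        "nums": {"type": "array", "items": {"type": "integer"}},
        "bars": {"type": "array", "items": bar_schema},
        "other": {"type": "string"},
    },
}

columns = {name: to_glue_type(sub) for name, sub in foo_schema["properties"].items()}
print(columns)
```

The resulting dictionary has the same column types as the JSON shown in the Example section. Supporting a further type would mean adding an entry to the mapping or another branch to the function.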

## Future work

* Not all types are supported. I add types as I need them, but adding one is very easy,
  so feel free to open an issue or send a PR if you stumble upon an unsupported use case.
* The tool could easily be extended to work with JSON Schema directly,
  so anything that can be converted to JSON Schema should also work.

