Metadata-Version: 2.1
Name: aligned
Version: 0.0.23
Summary: A scalable feature store that makes it easy to align offline and online ML systems
Home-page: https://github.com/otovo/aligned
License: Apache-2.0
Keywords: python,typed,ml,prediction,feature,store,feature-store,feast,tecton
Author: Mats E. Mollestad
Author-email: mats@mollestad.no
Requires-Python: >=3.10,<4.0
Classifier: Development Status :: 3 - Alpha
Classifier: Environment :: Web Environment
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Topic :: Internet :: WWW/HTTP
Classifier: Topic :: Internet :: WWW/HTTP :: Dynamic Content
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Software Development
Classifier: Topic :: Software Development :: Libraries
Classifier: Topic :: Software Development :: Libraries :: Application Frameworks
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Provides-Extra: aws
Provides-Extra: image
Provides-Extra: kafka
Provides-Extra: pandera
Provides-Extra: psql
Provides-Extra: redis
Provides-Extra: server
Requires-Dist: Jinja2 (>=3.1.2,<4.0.0)
Requires-Dist: aioaws (>=0.12,<0.13) ; extra == "aws"
Requires-Dist: asgi-correlation-id (>=3.0.0,<4.0.0) ; extra == "server"
Requires-Dist: click (>=8.1.3,<9.0.0)
Requires-Dist: connectorx (>=0.3.2,<0.4.0) ; extra == "aws" or extra == "psql"
Requires-Dist: dill (>=0.3.4,<0.4.0)
Requires-Dist: fastapi (>=0.95.2,<0.96.0) ; extra == "server"
Requires-Dist: httpx (>=0.23.0,<0.24.0)
Requires-Dist: kafka-python (>=2.0.2,<3.0.0) ; extra == "kafka"
Requires-Dist: mashumaro (>=3.0.1,<4.0.0)
Requires-Dist: nest-asyncio (>=1.5.5,<2.0.0)
Requires-Dist: pandas (>=1.3.1,<2.0.0)
Requires-Dist: pandera (>=0.13.3,<0.14.0) ; extra == "pandera"
Requires-Dist: pillow (>=9.4.0,<10.0.0) ; extra == "image"
Requires-Dist: polars[all] (>=0.17.15,<0.18.0)
Requires-Dist: prometheus-fastapi-instrumentator (>=5.9.1,<6.0.0) ; extra == "server"
Requires-Dist: prometheus_client (>=0.16.0,<0.17.0)
Requires-Dist: pyarrow (>=12.0.0,<13.0.0)
Requires-Dist: pydantic (>=1.10.2,<2.0.0)
Requires-Dist: python-dotenv (>=0.21.0,<0.22.0)
Requires-Dist: redis (>=4.3.1,<5.0.0) ; extra == "redis"
Requires-Dist: uvicorn (>=0.17.6,<0.18.0) ; extra == "server"
Project-URL: Repository, https://github.com/otovo/aligned
Description-Content-Type: text/markdown

# Aligned

Aligned helps improve ML system visibility while also reducing technical and data debt, as described in [Sculley et al. [2015]](https://papers.nips.cc/paper/2015/file/86df7dcfd896fcaf2674f757a2463eba-Paper.pdf).

Want to look at examples of how to use `aligned`?
View the [`MatsMoll/aligned-example` repo](https://github.com/MatsMoll/aligned-example).

Aligned does this by providing a new way of describing feature transformations and data flow in ML systems, while also collecting dependency metadata that would otherwise be too inconvenient and error-prone to type out manually.

Therefore, you get the following:
- [Feature Store](https://matsmoll.github.io/posts/understanding-the-chaotic-landscape-of-mlops#feature-store)
- [Feature Server](#feature-server)
- [Stream Processing](#stream-worker)
- Model Performance Monitoring - Documentation coming soon
- Data Catalog - Documentation coming soon
- Data Lineage - Documentation coming soon
- [Data Quality Assurance](#data-quality)
- [Easy Data Loading](#access-data)
- [Load From Multiple Sources](#fast-development)


All from the simple API of defining
- [Data Sources](#data-sources)
- [Feature Views](#feature-views)
- [Models](#describe-models)

As a result, loading model features is as easy as:

```python
entities = {"passenger_id": [1, 2, 3, 4]}
await store.model("titanic").features_for(entities).to_pandas()
```

Aligned is still in active development, so changes are likely.

## Feature Views

Write features as they should be: as data models.
Then get code completion and type safety by referencing them in other features.

This makes the features lightweight, data source independent, and flexible.

```python
class TitanicPassenger(FeatureView):

    metadata = FeatureView.metadata_with(
        name="passenger",
        description="Some features from the titanic dataset",
        batch_source=FileSource.csv_at("titanic.csv"),
        stream_source=HttpStreamSource(topic_name="titanic")
    )

    passenger_id = Int32().as_entity()

    # Input values
    age = (
        Float()
            .description("A float as some have decimals")
            .is_required()
            .lower_bound(0)
            .upper_bound(110)
    )

    name = String()
    sex = String().accepted_values(["male", "female"])
    survived = Bool().description("If the passenger survived")
    sibsp = Int32().lower_bound(0, is_inclusive=True).description("Number of siblings on titanic")
    cabin = String()

    # Creates two one hot encoded values
    is_male, is_female = sex.one_hot_encode(['male', 'female'])
```
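
To make the last line of the example concrete: one-hot encoding turns a categorical column into one boolean column per accepted value. The following is a minimal pure-Python sketch of that idea, not Aligned's implementation. The `one_hot_encode` helper below is hypothetical and only illustrates the expected shape of the result.

```python
from __future__ import annotations


def one_hot_encode(values: list[str | None], labels: list[str]) -> list[list[bool]]:
    """Return one boolean column per label, in the same order as `labels`."""
    return [[value == label for value in values] for label in labels]


# Hypothetical input column; a missing value matches neither label.
sex = ["male", "female", "female", None]
is_male, is_female = one_hot_encode(sex, ["male", "female"])

print(is_male)    # [True, False, False, False]
print(is_female)  # [False, True, True, False]
```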

## Data sources

Aligned makes handling data sources easy, as you do not have to think about how the data is loaded.
Only define where the data is, and Aligned handles the dirty work.

```python
my_db = PostgreSQLConfig(env_var="DATABASE_URL")
redis = RedisConfig(env_var="REDIS_URL")

class TitanicPassenger(FeatureView):

    metadata = FeatureView.metadata_with(
        name="passenger",
        description="Some features from the titanic dataset",
        batch_source=my_db.table(
            "passenger",
            mapping_keys={
                "Passenger_Id": "passenger_id"
            }
        ),
        stream_source=redis.stream(topic="titanic")
    )

    passenger_id = Int32().as_entity()
```
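
The `mapping_keys` argument above pairs a column name in the source with the feature name used in the view. As a rough illustration of that renaming step, here is a pure-Python sketch. It assumes the dictionary keys are source column names and the values are feature names; `rename_columns` is a hypothetical helper, not part of Aligned.

```python
from __future__ import annotations


def rename_columns(rows: list[dict], mapping_keys: dict[str, str]) -> list[dict]:
    """Rename source columns to feature names, leaving unmapped columns untouched."""
    return [
        {mapping_keys.get(column, column): value for column, value in row.items()}
        for row in rows
    ]


# Hypothetical rows as they might come back from the "passenger" table.
source_rows = [
    {"Passenger_Id": 1, "age": 22.0},
    {"Passenger_Id": 2, "age": 38.0},
]
renamed = rename_columns(source_rows, {"Passenger_Id": "passenger_id"})

print(renamed[0])  # {'passenger_id': 1, 'age': 22.0}
```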

### Fast development

Fast, iterative exploration is important in ML. This is why Aligned also makes it easy to combine and test multiple sources.

```python
my_db = PostgreSQLConfig.localhost()

aws_bucket = AwsS3Config(...)

class SomeFeatures(FeatureView):

    metadata = FeatureViewMetadata(
        name="some_features",
        description="...",
        batch_source=my_db.table("local_features")
    )

    # Some features
    ...

class AwsFeatures(FeatureView):

    metadata = FeatureViewMetadata(
        name="aws",
        description="...",
        batch_source=aws_bucket.file_at("path/to/file.parquet")
    )

    # Some features
    ...
```

## Describe Models

You will usually need to combine multiple features for each model.
This is where a `Model` comes in.
Here you can define which features should be exposed.

```python
class Titanic(Model):

    passenger = TitanicPassenger()
    location = LocationFeatures()

    metadata = Model.metadata_with(
        name="titanic",
        features=[
            passenger.constant_filled_age,
            passenger.ordinal_sex,
            passenger.sibsp,

            location.distance_to_shore,
            location.distance_to_closest_boat
        ]
    )

    # Referencing the passenger's survived feature as the target
    did_survive = passenger.survived.as_classification_target()
```


## Data Enrichers

In many cases, extra data is needed in order to generate some features.
We therefore need some way of enriching the data.
This can easily be done with Aligned's `DataEnricher`s.

```python
from datetime import timedelta

from pandas import DataFrame, Series

my_db = PostgreSQLConfig.localhost()
redis = RedisConfig.localhost()

user_location = my_db.data_enricher( # Fetch all user locations
    sql="SELECT * FROM user_location"
).cache( # Cache them for one day
    ttl=timedelta(days=1),
    cache_key="user_location_cache"
).lock( # Make sure only one process fetches the data at a time
    lock_name="user_location_lock",
    redis_config=redis
)


async def distance_to_users(df: DataFrame) -> Series:
    user_location_df = await user_location.load()
    ...
    return distances

class SomeFeatures(FeatureView):

    metadata = FeatureViewMetadata(...)

    latitude = Float()
    longitude = Float()

    distance_to_users = Float().transformed_using_features_pandas(
        [latitude, longitude],
        distance_to_users
    )
```
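
The `.cache(...)` and `.lock(...)` calls above describe two ideas: keep the loaded data for a fixed time, and let only one process refresh it at a time. The sketch below illustrates both with the standard library only. It is not Aligned's implementation: `CachedLoader` and `fetch_user_locations` are hypothetical names, and an in-process `asyncio.Lock` stands in for the Redis-backed lock.

```python
import asyncio
import time


class CachedLoader:
    """Hypothetical stand-in for a cached, locked enricher (not Aligned's class)."""

    def __init__(self, fetch, ttl_seconds: float) -> None:
        self._fetch = fetch                  # coroutine function that loads the data
        self._ttl_seconds = ttl_seconds
        self._lock = asyncio.Lock()          # in-process lock, assumed in place of Redis
        self._value = None
        self._expires_at = float("-inf")     # forces a fetch on the first call

    async def load(self):
        async with self._lock:               # only one caller refreshes at a time
            if time.monotonic() >= self._expires_at:
                self._value = await self._fetch()
                self._expires_at = time.monotonic() + self._ttl_seconds
            return self._value


fetch_count = 0


async def fetch_user_locations() -> list:
    global fetch_count
    fetch_count += 1
    await asyncio.sleep(0.01)                # simulate a slow database query
    return [{"user_id": 1, "latitude": 59.9, "longitude": 10.7}]


async def main() -> list:
    loader = CachedLoader(fetch_user_locations, ttl_seconds=60)
    # Five concurrent callers, but only the first one triggers a fetch.
    return await asyncio.gather(*(loader.load() for _ in range(5)))


results = asyncio.run(main())
print(fetch_count)  # 1
```

Holding the lock while checking the expiry time is what prevents several callers from refreshing the cache simultaneously once it has expired.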


## Access Data

You can easily create a feature store that contains all your feature definitions.
This can then be used to generate datasets, set up an instance to serve features, build DAGs, etc.

```python
store = await FileSource.json_at("./feature-store.json").feature_store()

# Select all features from a single feature view
df = await store.all_for("passenger", limit=100).to_pandas()
```
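
The snippets in this README call `await` at the top level, which works in a notebook or an `asyncio` REPL. In a regular script, the calls need to live inside a coroutine that is started with `asyncio.run`. A minimal sketch of that pattern follows, where `load_passengers` is a hypothetical stand-in for an awaitable call such as `store.all_for(...).to_pandas()`:

```python
import asyncio


async def load_passengers(limit: int) -> list:
    """Hypothetical stand-in for an awaitable feature store call."""
    await asyncio.sleep(0)  # yield to the event loop, as real I/O would
    return [{"passenger_id": i} for i in range(1, limit + 1)]


async def main() -> list:
    # Inside a coroutine, `await` works exactly as in the README snippets.
    return await load_passengers(limit=3)


rows = asyncio.run(main())
print(rows)  # [{'passenger_id': 1}, {'passenger_id': 2}, {'passenger_id': 3}]
```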

### Centralized Feature Store Definition
You will often want to share features with coworkers, or split them into different stages, like `staging`, `shadow`, or `production`.
One option is therefore to reference the storage you use, and load the `FeatureStore` from there.

```python
aws_bucket = AwsS3Config(...)
store = await aws_bucket.json_at("production.json").feature_store()

# This switches from the production online store to the offline store
# Aka. the batch sources defined on the feature views
experimental_store = store.offline_store()
```
This JSON file can be generated by running `aligned apply`.

### Select multiple feature views

```python
df = await store.features_for({
    "passenger_id": [1, 50, 110]
}, features=[
    "passenger:scaled_age",
    "passenger:is_male",
    "passenger:sibsp",
    "other_features:distance_to_closest_boat",
]).to_polars()
```
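
Each entry in `features` follows a `view_name:feature_name` pattern. As a sketch of how such references can be grouped per feature view, consider the following. The `group_by_view` helper is hypothetical and not part of Aligned.

```python
from __future__ import annotations

from collections import defaultdict


def group_by_view(references: list[str]) -> dict[str, list[str]]:
    """Group "view:feature" references by their feature view name."""
    grouped: dict[str, list[str]] = defaultdict(list)
    for reference in references:
        view, separator, feature = reference.partition(":")
        if not separator or not view or not feature:
            raise ValueError(f"Expected 'view:feature', got {reference!r}")
        grouped[view].append(feature)
    return dict(grouped)


grouped = group_by_view([
    "passenger:scaled_age",
    "passenger:is_male",
    "passenger:sibsp",
    "other_features:distance_to_closest_boat",
])
for view, features in grouped.items():
    print(view, features)
# passenger ['scaled_age', 'is_male', 'sibsp']
# other_features ['distance_to_closest_boat']
```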

### Model Service

Selecting features for a model is super simple.


```python
df = await store.model("titanic").features_for({
    "passenger_id": [1, 50, 110]
}).to_pandas()
```

### Feature View

If you want to only select features for a specific feature view, then this is also possible.

```python
prev_30_days = await store.feature_view("match").previous(days=30).to_pandas()
sample_of_20 = await store.feature_view("match").all(limit=20).to_pandas()
```

## Data quality
Aligned will make sure all the different features get formatted as the correct data type.
In addition, Aligned will make sure that the returned features align with the defined constraints.

```python
class TitanicPassenger(FeatureView):

    ...

    age = (
        Float()
            .is_required()
            .lower_bound(0)
            .upper_bound(110)
    )
    sibsp = Int32().lower_bound(0, is_inclusive=True)
```

Since our feature view has an `is_required` and a `lower_bound` constraint, the `.validate(...)` call will filter out the entities that violate them.

```python
from aligned.validation.pandera import PanderaValidator

df = await store.model("titanic").features_for({
    "passenger_id": [1, 50, 110]
}).validate(
    PanderaValidator()  # Validates all features
).to_pandas()
```
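
To illustrate what filtering on these constraints means, here is a pure-Python sketch that drops rows violating `is_required`, `lower_bound(0)` and `upper_bound(110)` on `age`. It only illustrates the behavior described above and is not how `PanderaValidator` is implemented. `is_valid_age` is a hypothetical helper, and treating both bounds as inclusive is an assumption.

```python
from __future__ import annotations


def is_valid_age(age: float | None, lower: float = 0, upper: float = 110) -> bool:
    """Required (not None) and within [lower, upper], both bounds assumed inclusive."""
    return age is not None and lower <= age <= upper


# Hypothetical rows; only the first one satisfies every constraint.
rows = [
    {"passenger_id": 1, "age": 22.0},
    {"passenger_id": 50, "age": None},    # violates is_required
    {"passenger_id": 110, "age": -3.0},   # violates lower_bound(0)
    {"passenger_id": 200, "age": 130.0},  # violates upper_bound(110)
]
valid_rows = [row for row in rows if is_valid_age(row["age"])]

print([row["passenger_id"] for row in valid_rows])  # [1]
```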

## Feature Server

You can define how to serve your features with the `FeatureServer`. Here you can define where you want to load your features from, and potentially where to write them.

By default, `aligned` will look for a file called `server.py` and a `FeatureServer` object called `server`. However, this can also be specified manually.

```python
from aligned import RedisConfig, FileSource
from aligned.schemas.repo_definition import FeatureServer

store = FileSource.json_at("feature-store.json")

server = FeatureServer.from_reference(
    store,
    RedisConfig.localhost()
)
```

Then run `aligned serve`, and a FastAPI server will start. Here you can push new features, which are then transformed and stored, or just fetch them.

## Stream Worker

You can also set up stream processing with a similar structure. However, here a `StreamWorker` is used.

By default, `aligned` will look for a `worker.py` file with an object called `worker`. An example would be the following.

```python
from aligned import RedisConfig, FileSource
from aligned.schemas.repo_definition import FeatureServer

store = FileSource.json_at("feature-store.json")

server = FeatureServer.from_reference(
    store,
    RedisConfig.localhost()
)
```

