# -*- coding: utf-8 -*-
from setuptools import setup

package_dir = \
{'': 'src'}

packages = \
['instrumentum',
 'instrumentum.analysis',
 'instrumentum.feature_generation',
 'instrumentum.feature_preprocess',
 'instrumentum.feature_selection',
 'instrumentum.image_processing',
 'instrumentum.model_tuning',
 'instrumentum.time_series',
 'instrumentum.utils']

package_data = \
{'': ['*']}

install_requires = \
['fastcluster>=1.2.6,<2.0.0',
 'joblib>=1.1.0,<2.0.0',
 'numpy>=1.21.2,<2.0.0',
 'optbinning>=0.13.0,<0.14.0',
 'optuna>=2.10.0,<3.0.0',
 'pandas>=1.3.3,<2.0.0',
 'scikit-learn']

setup_kwargs = {
    'name': 'instrumentum',
    'version': '0.8.16',
    'description': 'General utilities for data science projects',
    'long_description': '# Instrumentum\n\n**General utilities for data science projects**\n\n\n\nThe goal of this repository is to provide functionality that is typically not found in other packages and that can facilitate some steps of a data science project.\n\n\nThe classes in `instrumentum` implement sklearn interfaces, which makes them easy to work with for anyone familiar with the sklearn design. Classes use parallelism whenever possible to speed up execution.\n\n\n\n## Table of Contents <!-- omit in toc -->\n  - [Feature Selection](#feature-selection)\n    - [Dynamic Stepwise](#dynamic-stepwise)\n    - [Clustering Selection](#clustering-selection)\n  - [Model Tuning](#model-tuning)\n  - [Features Interaction](#features-interaction)\n  - [Dashboards & Plots](#dashboards--plots)\n  - [Contributing](#contributing)\n  - [License](#license)\n  - [Credits](#credits)\n\n\n# Feature Selection #\n\n## Dynamic Stepwise\n\n\nStepwise is a method used to reduce the dimensionality of datasets. It consists of iteratively adding and removing predictors, one by one. One of its limitations is that some variables may interact with each other, or be highly correlated, and the real predictive power of those variables will not be realized if they are evaluated individually.\nTo illustrate this, the following code artificially creates 2 variables that have strong predictive power when combined, but not individually. 
A third variable is created with minimal predictive power.\n\n```python\nimport numpy as np\nimport pandas as pd\n\ndf = pd.DataFrame(np.random.choice([0, 1], size=(10000, 2)), columns=["x1", "x2"])\n\n# y is created based on x1 and x2, ~70% prediction power, when combined\ndf["y"] = df.apply(\n    lambda x: np.logical_xor(x["x1"], x["x2"]) * 1\n    if np.random.ranf() < 0.7\n    else np.random.choice([0, 1]),\n    axis=1,\n)\n\n# another predictor, with ~20% prediction power\ndf["x0"] = df.apply(\n    lambda x: x.y if np.random.ranf() < 0.2 else np.random.choice([0, 1]),\n    axis=1,\n)\n\nX, y = df[["x0", "x1", "x2"]], df["y"]\n```\nPlotting the correlation matrix clearly shows that, individually, only x0 has a relationship with y:\n\n<img src="images/correlation.png" width=35% height=35%>\n\nIf the classic forward stepwise is used in this scenario, and assuming that it stops when there is no improvement (which doesn\'t make sense in this small dataset, but does in gigantic ones), the two interacting variables will not be discovered.\nThe class `DynamicStepwise` in this library keeps adding/removing predictors iteratively, with an arbitrary number of predictors at each iteration (parameter `combo`). When the number is 1, it is the typical stepwise; if it is larger than 1, it becomes dynamic and tries all combinations of predictors up to that number.\nIf, for example, combo is equal to 3, the library will try all possible combinations of 1, 2 and 3 variables from the set of variables not yet added, and will select the combination that, added to the already selected variables, yields the best result. 
It then keeps adding the next best combination of up to 3 variables until the end condition is met (the end condition is highly customizable).\n\nContinuing with the previous dataset example, if `DynamicStepwise` is run evaluating just 1 predictor at a time (which makes it the classic stepwise), it will discover only x0:\n\n```python\ncombs = 1\n\nstepw = DynamicStepwise(\n    estimator=os,\n    rounding=rounding,\n    n_combs=combs,\n    verbose=logging.INFO,\n    direction="forward",\n)\nstepw.fit(X, y)\n```\nThe output is:\n\n```\n22-05-22 01:19 | INFO | Number of cores to be used: 8, total available: 8\n\n22-05-22 01:19 | INFO | Remaining columns to test: 3\n22-05-22 01:19 | INFO | Combinations to test: 3\n22-05-22 01:20 | INFO | Best score from combinations: 0.52, global score 0\n22-05-22 01:20 | INFO | Best score comes from adding columns: [\'x0\']\n22-05-22 01:20 | INFO | Best columns were added. All columns added so far [\'x0\']\n\n22-05-22 01:20 | INFO | Remaining columns to test: 2\n22-05-22 01:20 | INFO | Combinations to test: 2\n22-05-22 01:20 | INFO | Best score from combinations: 0.52, global score 0.52\n22-05-22 01:20 | INFO | Best score comes from adding columns: [\'x1\']\n22-05-22 01:20 | INFO | Columns were not added as they do not improve the score. Finishing\n\n22-05-22 01:20 | INFO | Function fit executed in 12.009625673294067 seconds\n\nForward Best Features:  [\'x0\']\n\n    Score Columns Added\n0   0.52          [x0]\n```\n\nThe score obtained by using x0 is only 0.52.\n\nIf we make a simple change to the parameters, and indicate that the combos to be evaluated (i.e. 
how many combinations of remaining predictors to evaluate at each step) is 2:\n\n```python\ncombs = 2\n```\n\nThe output is:\n```\n22-05-22 01:49 | INFO | Number of cores to be used: 8, total available: 8\n\n22-05-22 01:49 | INFO | Remaining columns to test: 3\n22-05-22 01:49 | INFO | Combinations to test: 6\n22-05-22 01:50 | INFO | Best score from combinations: 0.85, global score 0\n22-05-22 01:50 | INFO | Best score comes from adding columns: [\'x1\' \'x2\']\n22-05-22 01:50 | INFO | Best columns were added. All columns added so far [\'x1\' \'x2\']\n\n22-05-22 01:50 | INFO | Remaining columns to test: 1\n22-05-22 01:50 | INFO | Combinations to test: 1\n22-05-22 01:50 | INFO | Best score from combinations: 0.87, global score 0.85\n22-05-22 01:50 | INFO | Best score comes from adding columns: [\'x0\']\n22-05-22 01:50 | INFO | Best columns were added. All columns added so far [\'x0\' \'x1\' \'x2\']\n\n22-05-22 01:50 | INFO | All columns were added. Finishing.\n\n22-05-22 01:50 | INFO | Function fit executed in 13.473716974258423 seconds\n\nForward Best Features:  [\'x0\' \'x1\' \'x2\']\n\n    Score Columns Added\n0   0.85      [x1, x2]\n1   0.87          [x0]\n```\n\nThe score increased to 0.87, and x1 and x2 were correctly selected together in the first iteration, as that pair is the combination that yields the best result.\n\nIndeed, the larger the combo parameter, the better the selection of features, but the time needed to complete grows exponentially. There is a tradeoff between predictive power and performance; typically a combo value of 2 or 3 is enough. 
The combo can be as large as the total number of predictors, which guarantees that the algorithm finds the best predictors, but it does so by evaluating absolutely all possible combinations, which becomes infeasible for any dataset with more than a handful of predictors.\n\n\n## Clustering Selection\n\nOne of the premises of a good set of predictors is that they are highly correlated with the target, without being correlated among themselves.\nThe class `ClusterSelection` tries to obtain that ideal scenario, which is desirable in datasets with high dimensionality, by clustering all predictors based on their similarity ("similarity" can be parameterized as a correlation matrix, with Pearson as the default). Once the clusters are identified, the class selects the best predictors within each. The selection of the best predictors (be it one per cluster or n) can be performed using `DynamicStepwise`.\n\n`DynamicStepwise` can be passed to `ClusterSelection` preconfigured with the maximum number of columns n, and by doing so it will get the best combination of n variables within each cluster. Combining `DynamicStepwise` with `ClusterSelection` produces a sophisticated pipeline that is heuristic in nature, yet yields results close to the global optimal solution.\n\nOne of the key parameters of this class is how to find the clusters once the correlation matrix is created. There are many techniques to form flat clusters; `ClusterSelection` uses the [fcluster](https://docs.scipy.org/doc/scipy/reference/generated/scipy.cluster.hierarchy.fcluster.html) method behind the scenes, and the parameter *t* is passed from the class to this function.\nRoughly, the parameter t can define a threshold or a fixed number of clusters. If using a threshold, a dendrogram analysis can help visualize the best possible "cut" line. See the docs for details. 
In this plot the dendrogram shows the clusters created, and the two selected due to the threshold defined at 0.8:\n\n<img src="images/dendogram.png">\n\n\n# Model Tuning\n\nClass `OptunaSearchCV` implements a sklearn wrapper for the great Optuna library. It provides a set of parameter distributions that can be easily extended. This example uses the dispatcher to fetch the search space for a decision tree (dispatcher entries are named after the sklearn class):\n\n```python\nsearch_function = optuna_param_disp[DecisionTreeClassifier.__name__]\ncv = RepeatedStratifiedKFold(n_splits=5, n_repeats=2)\n\nos = OptunaSearchCV(\n    estimator=DecisionTreeClassifier(),\n    scoring="roc_auc",\n    cv=cv,\n    search_space=search_function,\n    n_iter=5,\n)\nos.fit(X_train, y_train)\n```\n\nThe output presents the details, depending on the verbosity:\n\n```\n22-05-22 11:34 | INFO | Trials: 1, Best Score: 0.8791199817742967, Score 0.8791199817742967\n22-05-22 11:34 | INFO | Trials: 2, Best Score: 0.8797784704316944, Score 0.8797784704316944\n22-05-22 11:34 | INFO | Trials: 3, Best Score: 0.9500029865511614, Score 0.9500029865511614\n22-05-22 11:34 | INFO | Trials: 4, Best Score: 0.9505017406869891, Score 0.9505017406869891\n22-05-22 11:34 | INFO | Trials: 5, Best Score: 0.9505017406869891, Score 0.931279172306293\n\nBest parameters:  {\'max_depth\': 3, \'criterion\': \'entropy\'}\nBest score cv:  0.9505017406869891\nScoring with best parameters:  0.9495986837807825\n```\n\n# Features Interaction\n\nClass `Interactions` offers an easy way to create combinations of existing features. It is a lightweight class that can be extended with minimum effort.\n\nThis simple example showcases how this class can be used with a small DataFrame. 
The degree indicates how many columns are combined in each interaction (careful, the number of interactions grows exponentially):\n\n```python\narr = np.array([[5, 2, 3], [5, 2, 3], [1, 2, 3]])\narr = pd.DataFrame(arr, columns=["a", "b", "c"])\n\ninteractions = Interactions(operations=["sum", "prod"], degree=(2, 3), verbose=logging.DEBUG)\ninteractions.fit(arr)\n\n\npd.DataFrame(interactions.transform(arr), columns=interactions.get_feature_names_out())\n```\nDepending on the verbosity, the output can provide a large amount of information:\n\n```\n22-05-16 23:11 | INFO | Total number of interactions calculated: 8\n22-05-16 23:11 | INFO | Function fit executed in 0.004034996032714844 seconds\n22-05-16 23:11 | DEBUG | New feature created: a_sum_b\n22-05-16 23:11 | DEBUG | New feature created: a_prod_b\n22-05-16 23:11 | DEBUG | New feature created: a_sum_c\n22-05-16 23:11 | DEBUG | New feature created: a_prod_c\n22-05-16 23:11 | DEBUG | New feature created: b_sum_c\n22-05-16 23:11 | DEBUG | New feature created: b_prod_c\n22-05-16 23:11 | DEBUG | New feature created: a_sum_b_sum_c\n22-05-16 23:11 | DEBUG | New feature created: a_prod_b_prod_c\n\n```\n\n| a   | b   | c   | a_sum_b | a_prod_b | a_sum_c | a_prod_c | b_sum_c | b_prod_c | a_sum_b_sum_c | a_prod_b_prod_c |\n| --- | --- | --- | ------- | -------- | ------- | -------- | ------- | -------- | ------------- | --------------- |\n| 5   | 2   | 3   | 7       | 10       | 8       | 15       | 5       | 6        | 10            | 30              |\n| 5   | 2   | 3   | 7       | 10       | 8       | 15       | 5       | 6        | 10            | 30              |\n| 1   | 2   | 3   | 3       | 2        | 4       | 3        | 5       | 6        | 6             | 6               |\n\n# Dashboards & Plots #\n\nThe `instrumentum` library includes several visuals that facilitate quick analysis of predictors. Visuals are available as standalone plots and as dashboards that combine several plots. 
There is also a class `DistAnalyzer`, which aims to automate the creation of dashboards by automatically identifying the type of variable (categorical, continuous) and the type of target (binary, categorical, continuous), and drawing the most appropriate dashboard.\n\nAll the plots, dashboards and classes in instrumentum typically include the following parameters:\n- x: the predictor to be plotted\n- y: the target. If included, the visuals may plot x and y together\n- cluster: this parameter groups rows that are logically connected. For example, if we have a distribution generated at time a, and another at time a\', that value can be used to separate those entries and compare them visually (think of "hue" in seaborn)\n- target_true: for cases where y is binary, it indicates which value is the true value (default is 1)\n\nRead the extensive examples in the [docs](https://github.com/FedericoMontana/instrumentum/blob/master/docs/examples_plots.ipynb)\n\nAdvanced dashboards that visualize the different perspectives of a predictor can be generated with a single line of code:\n\n<img src="images/plot_01.png">\n\nFull datasets can be analyzed to identify particular patterns, for example the existence of null ("nans") values:\n\n<img src="images/plot_02.png">\n<img src="images/plot_03.png">\n\nVisuals are constantly being enhanced. See the docs for details.\n\n## Contributing\n\nInterested in contributing? Check out the contributing guidelines. Please note that this project is released with a Code of Conduct. By contributing to this project, you agree to abide by its terms.\n\n## License\n\n`instrumentum` was created by Federico Montanana. It is licensed under the terms of the MIT license.\n\n## Credits\n\n`instrumentum` uses:\n- OptBinning for binning the visuals: https://github.com/guillermo-navas-palencia/optbinning\n',
    'author': 'Federico Montanana',
    'author_email': None,
    'maintainer': None,
    'maintainer_email': None,
    'url': None,
    'package_dir': package_dir,
    'packages': packages,
    'package_data': package_data,
    'install_requires': install_requires,
    'python_requires': '>=3.8,<3.11',
}


setup(**setup_kwargs)
