Metadata-Version: 2.4
Name: co-datascientist
Version: 0.4.7
Summary: A tool for agentic recursive model improvement
Project-URL: Homepage, https://github.com/TropiFloAI/co-datascientist
Project-URL: Issues, https://github.com/TropiFloAI/co-datascientist/issues
Author-email: David Gedalevich <davidgdalevich7@gmail.com>, Oz Kilim <oz.kilim@tropiflo.io>
License: Copyright (c) 2018 The Python Packaging Authority
        
        Permission is hereby granted, free of charge, to any person obtaining a copy
        of this software and associated documentation files (the "Software"), to deal
        in the Software without restriction, including without limitation the rights
        to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
        copies of the Software, and to permit persons to whom the Software is
        furnished to do so, subject to the following conditions:
        
        The above copyright notice and this permission notice shall be included in all
        copies or substantial portions of the Software.
        
        THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
        IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
        FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
        AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
        LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
        OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
        SOFTWARE.
License-File: LICENSE
Requires-Python: >=3.12
Requires-Dist: click>=8.1.8
Requires-Dist: fastmcp>=2.2.5
Requires-Dist: httpx>=0.28.1
Requires-Dist: ipdb>=0.13.13
Requires-Dist: keyring>=25.6.0
Requires-Dist: keyrings-alt>=5.0.0
Requires-Dist: pydantic-settings>=2.9.1
Requires-Dist: python-dotenv>=1.0.0
Requires-Dist: pyyaml>=6.0.0
Requires-Dist: yaspin>=3.1.0
Description-Content-Type: text/markdown

# Introducing the Co-DataScientist!

<!-- <div align="center">
  <img src="figures/Co-DataScientist.png" alt="Co-DataScientist Logo" width="420"/>
</div> -->

<p align="center">
  <img src="https://img.shields.io/badge/version-1.0.0-blue.svg" alt="Version"/>
  <img src="https://img.shields.io/badge/license-MIT-green.svg" alt="License"/>
  <img src="https://img.shields.io/badge/license-EPIC🔥-orange.svg" alt="Epic License"/>
  <img src="https://img.shields.io/badge/license-ML%20Beast-red.svg" alt="ML Beast License"/>
</p>

> **Kick back, relax, and tomorrow morning greet a shiny KPI you can parade at ML stand-up. 🎉**

---

## Why is everyone talking about the Co-DataScientist?

- 🧪 **Idea Explosion** — Launches a swarm of models, feature recipes & hyper-parameters you never knew existed.
- 🌌 **Full-Map Exploration** — Charts the entire optimization galaxy so you can stop guessing and start winning.
- ☕ **Hands-Free Mode** — Hit *run*, kick back with a latte (or snooze) and let the search party work through the night.
- 📈 **KPI Fanatic** — Every evolutionary step is laser-focused on cranking that one number sky-high.
- 🔒 **Data Stays Home** — Your training and testing data **never leaves your server**; everything runs 100 % locally.
- 🤑 **Zero-Surprise Costs** — Live token & dollar tracking keeps the finance goblins happy.

Fast-track your ML pipelines from 😩 _painful_ to 🏆 _heroic_
---

## 🔧 Quickstart — ⏱️ *30-Second Setup*

## 1. **Install**

```bash
pip install co-datascientist
```

## 2. **Write a tiny script** (e.g. `xor.py`). The _only_ rule: **print your KPI**! 🏷️

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
import numpy as np

# XOR toy-set
X = np.array([[0,0],[0,1],[1,0],[1,1]])
y = np.array([0,1,1,0])

# CO_DATASCIENTIST_BLOCK_START

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(random_state=0))
])

pipe.fit(X, y)
acc = accuracy_score(y, pipe.predict(X))

# CO_DATASCIENTIST_BLOCK_END


print(f"KPI: {acc:.4f}")  # Tag your metric!

# comments
# This is the classic XOR problem — it's not linearly separable!
# A linear model like LogisticRegression can't solve it perfectly,
# because no straight line can separate the classes in 2D.
# This makes it a great test for feature engineering or non-linear models.
```

Sanity check the baseline runs in your environment e.g. (you might need to pip install scikit-learn first!)

```bash
python xor.py
```

## 3. **Set your API Token (one time only!)**

Before running any commands, you need to set your Co-DataScientist API token. You only need to do this once per machine.

```bash
co-datascientist set-token --token <YOUR_TOKEN>
```


## 4. **Run the magic!** ✨


Then run the co-datascientist!

```bash
co-datascientist run --script-path xor.py --parallel 3
```

Watch accuracy jump from `0.5` 🫠 to `1.0` 🏆! 
##### (when you get to KPI=1.0 you can stop it)

You will find the new glowed up code in the `co_datascientist_checkpoints` directory.

### Yes, it's that simple.

<h2 align="center"><b>
Try it on <i>your</i> toughest problem and see how your KPI improves.<br>
<span style="font-size:2em;">🎯🚀</span><br>
<b>Co-DataScientist helps you get better results—no matter how big your challenge.</b>
</b></h2>

---

> **Important Notes About Your Input Script**


## 🎯 KPI Tagging

Co-DataScientist scans your stdout for the pattern `KPI: <number>` — that’s the metric it maximizes. Use **anything**: accuracy, F1, revenue per click, unicorns-per-second… you name it!

---

## 📁 Hardcode Your Data Paths

> **Important:** Please hardcode any data file paths directly in your script.  
> For example, use `data = np.loadtxt("full/path/to/my/my_data.csv")` or similar.  
> Do **not** use `input()` or command-line arguments to specify file paths.  
> This ensures Co-DataScientist can run your script automatically without manual intervention.


## 🧬 Blocks to evolve

As you will see in the XOR example, Co-DataScientist uses **# CO_DATASCIENTIST_BLOCK_START** and **# CO_DATASCIENTIST_BLOCK_END** tags to identify the parts of the system you want it to improve. Make sure to tag parts of your system you care about improving! It will help to Co-DataScientist stay focused on its job.

---

## 🗂️ One File Only: Self-Contained Scripts Required

> **Note:** Co-DataScientist currently supports only scripts written as a **single, self-contained Python file**. Please put all your code in one `.py` file—multi-file projects are not supported (yet!). Everything your workflow needs should be in that one file.

---

## 📝 Add Domain-Specific Notes for Best Results

After your code, add **comments** with any extra context, known issues, or ideas you have about your problem. This helps Co-DataScientist understand your goals and constraints! The Co-Datascientist UNDERSTANDS your problem. It's not just doing a blind search! 


> **Other helpful stuff**

## 💰 Cost Tracking

Stay on budget with one-liners:

```bash
co-datascientist costs            # summary
co-datascientist costs --detailed # per-run breakdown
```

Powered by **LiteLLM**’s real-time pricing.

---

## 📝 Before vs After
<table>
<tr>
<th>📥 "Meh" Pipeline <br><sub>KPI ≈ 0.50</sub></th>
<th>🚀 Turbocharged by Co-DataScientist <br><sub>KPI 🚀 1.00</sub></th>
</tr>
<tr>
<td>

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score
import numpy as np

# XOR data
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 1, 1, 0])

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', RandomForestClassifier(n_estimators=10, random_state=0))
])

pipeline.fit(X, y)
preds = pipeline.predict(X)
accuracy = accuracy_score(y, preds)
print(f'Accuracy: {accuracy:.2f}')
print(f'KPI: {accuracy:.4f}')
```

</td>
<td>

```python
import numpy as np
from sklearn.base import TransformerMixin, BaseEstimator
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score
from tqdm import tqdm

class ChebyshevPolyExpansion(BaseEstimator, TransformerMixin):
    def __init__(self, degree=3):
        self.degree = degree
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        X = np.asarray(X)
        X_scaled = 2 * X - 1
        n_samples, n_features = X_scaled.shape
        features = []
        for f in tqdm(range(n_features), desc='Chebyshev features'):
            x = X_scaled[:, f]
            T = np.empty((self.degree + 1, n_samples))
            T[0] = 1
            if self.degree >= 1:
                T[1] = x
            for d in range(2, self.degree + 1):
                T[d] = 2 * x * T[d - 1] - T[d - 2]
            features.append(T.T)
        return np.hstack(features)

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 1, 1, 0])

pipeline = Pipeline([
    ('cheb', ChebyshevPolyExpansion(degree=3)),
    ('scaler', StandardScaler()),
    ('clf', RandomForestClassifier(n_estimators=10, random_state=0))
])

pipeline.fit(X, y)
preds = pipeline.predict(X)
accuracy = accuracy_score(y, preds)
print(f'Accuracy: {accuracy:.2f}')
print(f'KPI: {accuracy:.4f}')
```

</td>
</tr>
</table>

---


## We now support Databricks!

<img src="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAATYAAACjCAMAAAA3vsLfAAAAxlBMVEX///8AAAD/OSff39+QkJCJiYn/NiP/IAC9vb2fn5/u7u7/MRzb29v6+vr/NSJgYGCvr69zc3P/LhjT09P/fXT+5eP/KAtaWlr09PT/9vX/QC//KxPn5+fIyMiqqqr/ZVp+fn7/wLz/oJr/xsNJSUn/1NH/cGb/d27/hn7/VUj/tbD/6ej/8O//2NX/r6r/393/STo/Pz//lY7/aV7/pZ//wr7/VEcaGhopKSltbW0jIyP/X1P/j4f/mZL/RjYREREyMjJRT085yGswAAALf0lEQVR4nO2cC1viOhCGuRShCAKyQl0Kdr2s63V1jyuC7iL//0+dNtfJJC0X6yIm3/Ocs9A2JXk7SWYyqYWCk5OTk5OTk5OTk5OTk5OTk9OGdHKz6RpsoQ6OqtUv15uuxZbp8Ed1t1TarR5vuiJbpYfefqm0t1cqDXYuN12XrdHvr71Saad6dNHbif95PNt0fbZCP5+rO6VS7+tJoXC5O4iNrvr9atN1+vC6/lKN++Z+74F8a/6iQ9yv5oar9cH1h2L6ccgPXJ8nGAeD203W6oOLdcr/DuDBm8e40+70/p5sqlYfXGd3hM+dNgV8Kw3IFPFzE7X64Lr6TnrjntHhOE4cEth3nYgOX+ig9iftPHF/+UzhRHW7n7i31fOMYOrnEfVLvv27Wn1wndwRv/ZiQeh+8pd4wc8uwk9EzWhQWsKMbnsLjdISsUGrutygxYdA2wPVh+qKUySZcHsv71mnDy/ikO2t6JCdPSJ/2DLdXFD3dmX33+Y5gQWb+28INs9+Wef+XiVje2n/DQ2/Oq8OqjnWaCt0UI2p7b5haCdLStZhO0wGttL+1zXXNS7348nEQmyx00+Xvi/WWNc4+5uEFTtWYhNO/6ppvQMaVvy92LET28J1D2MZthZyXPiyaym22HL+y1hlM+lhQMKKxEItxpas6a6Q1mOpwGcyHlqNbYW0HkgFJrIcm8hXvWSm9XgqUIQVdmKDvZJGWgCJLo5WhhU2YrsuVY/gOgaJ62UHxDJ15HMLscXBFVpm+8Z2fhjc37NHPRV49X3fQneXxKQoE/XA0nrI/TWlArnH9w9q+qFEsCW98jc4KF1ZcMzkEt/SPntu3Z6aBNvuQMtEafkYUyrw5O9Sma7PqBjb/p/bqh6UsrQeZWICxMjuWJkwjbENjsVeU7UH8rTewTM1PUM/tnV3KsVGljPIRiwlKGXDmSEVuHKm65OJY+NBKfYuEpgaILb16NnerUcSm3lbW+Kq7Rg8Yss3ukFsYq/pCzStyz0F0BWNv96S6foESrCBUYtBUfaaKuOXHpLGYO+sm01jbHvKGJUZlKZ04z0ro4Q9NODvpAz46ZOGddiapp1Gxwb3l6+b7+ohqYXYRAJK2demv2qV4RDbaG2JUkInuNzxYMgJsvDr6GjPTmwpgbrMybAMtCHYT1ZObFzdZRLLQjCNwGEKQFLXX+SOcYuxiUVIw4i/u6ctZCrvJ1iNLW3JO4FpDkl5xGU5tlRvFoWk+N0rK7EdKktF2pt9sS71kFR5089GbNfVqhJHwbFel/G90u8WYouDKxRHoa0KUGQTIMqSnt3ZmPC7qoqdQ0IncGOMlGlzDYu4ev+iqh9KpuSeKShggNQYbI19cZ9GppetcAhq/Hsgtr96xaLLRxg8KRsFH9iSCJxg+Z5fC7OkQrcD02BPxrK7m5OSPtbxieM3vpNlMrkW1P3t9bSZVf1zF1aLObJKUEph4uRC4djgFNurM8M7a8lGQZRxYe8DWv16nyqSRkBEbi7UGHXd9wE/tY6zdymw5aV9219b1pQ52r+4P3WUKuZblDTfgrm37g9rpcjoyZp8YidVtwMUlK7w5y5slhqlG+N9J5MO5MZwtnXc/UHUpcT2fFzcGZffnFJFl3OzXotxMuoP3Vlj9ybANRQHpQte+XMy6szF7E5OTk5OTk5OTk5OTk5OTk5OTk5OTk5braBdq9Xa3Xe4c0juHK5SpF5bWJEuuWuwfrXyUaOYqPIOd26TO9dWKOHF14/r2ddE5K7lN9UsB1Fs3jvceWVsNVLgKfuiyhZiC4iWvPPK2FrFJZBsIbYmuXay5J1XxnZKsWUPb1uLrbPknVfG5lFs2Ul8h027/TQp0M6+yGHT5bc6izwQh20tOWxryWFbSxvE1h1Vosino4gBW9ePosqIDTJN9l+iOrm2RQ6D+a45JAUa4EjyP4itQX9RmySb/AZeCH4HnCdVrbTlvc3YeHXCtmxZvmp6syJTJdCx1SN+tug1uede49WVuucFwgk/NPbjrz75mDRTYhs98UsiGaL6vP1BJ/kwVJ4KVVfcOi7JSCFswZSeJvftnoqq502tpjS+hrH5yunuYmzK8WnZgC18hZegXyoXuvS4ji1oqb84CXRswRhQ88DFM2D7OShSq1IclZXGoJoWhwuwNe/R4TLGNiyjK+YKtpCf1rB1i5pqGBunVsfUiouCjbdRix8LxDbXq8rra8b2arwcYsNchZ1SbG1+B4xtmFkPiq3Jeiih1sAXL1hKWUFtcc9+p3MKfoJik2hanQlsb1xdf97vsxKtfiwynwrjHMcFIMKG8mMxqklHGnIEsAkhbKE4MW1N+jNZS4BNoVbo09LdsMzM7r6QkwJeEzq9NWXNCTb+vO6H5HRdUuReBPkiHRA+TvbpQBJ2UrBVaMOG/El0Eba+3x6FKrZXtWiY1MWXj7YMqIWgboD5ODfXhzVLtjvgtSPYTsFnIjEssRpgv42dlVGk6FkKNuktQDsQ2Pq8NwFsrOyTnHiD/oj8K7A1nxRq1DzpA08MYPQmUlDM2ODCT3MsUTUwNdlVzNhqmJocxyE2uCzOuCXmxrHJpwCwUUMaGxZDODZEjT1ibmG1HN+FGJE7z5RjgFUkLUGonYWtrz0FwaWhlWWi/TQqCGyv8pzE1pV3waowWIgae8Ljd0gy0GYi620JbE/CEICeMrDpxiQOSmxojZsCmRYEtqE8J7F5hgfCRLGN7rUfZ8br5w6Odkh02xrHRrvwGJXx07FRQz1FBSKEzU+rhK/9nsQ2x0SlFEcIPjIxVs7zzWtRLnharnNsDeMT7qZjE8AV1RA27HVO+HnazEirSoKN2rhxfILY1EcCHKC5kfh6CkStFPHGD40U6unY6FCJF2O7CBt2Oj1uSL5WXGKjfpqxFYq1KYNfAB3H+5Xys1laChvuUwuxYe9oRWzQLFbHhkYcJbzKy+AoNjwWBRxbV2HCpczrpk6KOeNOipd4qO/Y5dhgH5bYpguxdegY+Yoa44PYJi97K5oq0+XY6qZ6qGuNKrau0XorCBvuxK/cSrKwnWrn0P097jlps2044kHj3FR8DZ1CBFwRx8bmuHJGGTVPSr/hmXmMsCHrptabzJ9Z2CibyNQI7rdxtxBbe0GuAOYUy3sGe6JVJdg6ButhfqeCTTxFilTdPsIaA9xd1WZawmCzsA3T2y1jUraaYxrDOqln1lAoEQmBOHSoIKKaqcfIFxFmjHQu7ClAbEpUUpNNzcLGxhM8DicCKyCs7mIM8175J8VterPmml3z1RyCcoq5iUVIBZvsx4ARVX0GDvGY9F76X4xaEiRkY2NzYh+c9anxwIUj9lzo/X3Qsm6e1iZWhlrsAXXFGr8HGlWMAqWRANtcNQLumHsMzEgUUBeOWPEggt8zsTEixRkfHtpTdi1cpmTtIfWhH9kERK0hN9dNLO6eeu1RRUDjHVesJPbj09FMnubYuGMUVSqdpIhwMCd+25erbdoy5Swatb0+/9YCzFOwgUc28XxetIGw8Z+IZNP6w6A+nJOPuS1T8gUIKGVRfKqdHivYQnCmVZBDmXY/vLoLNS0sgQ1nBqgCnIJhsEbSPqXyzNBibr6SggkwN5mCUaopmhei6/UUTBstfhenwVLYDNxmycUo4Sc3d+FUz7JZ8OXUUe49wgk/NXU1LCBsBZmAoM0LYEaiODMl/EbwCungLMKmJWGop4OwcSOLfZXyDF6dL7V4HpjLJoR80pHO11DaYzI1aDsShBHwWa4mLdQT5xNsImYN5bO4l7NbGjbg3END7bABHu/d5UbWhHUrPuW4BMIVjib9fn8yIt5ks5wIOpahn5zutMn0GJDTSiQw9D3P98H+goaf5LIiyrZOCjTVovVRJ/lFH85t8kIhckSZ/xqkLvG9m6nF6uUwFmtBLYqvbnn55padnJycnJycnJycnJycnJyctlr/A6tV92rnvlz6AAAAAElFTkSuQmCC" alt="Databricks Logo" width="300"/>


Databricks setup

1. Download the databricks CLI package
```
curl -fsSL https://raw.githubusercontent.com/databricks/setup-cli/main/install.sh | sudo sh
```

2. Get a databricks token : and test the CLI works
[Get your Databricks token here](https://docs.databricks.com/aws/en/dev-tools/auth/pat)
3. Prepare a config file with all of your compute/environmental requirements in  databricks_config.yaml example below

```yaml
# Enable Databricks integration
databricks: true

# Databricks configuration for XOR demo
databricks:
  cli: "databricks"  # databricks CLI command (optional, defaults to "databricks")
  volume_uri: "dbfs:/Volumes/workspace/default/volume"  # DBFS volume URI for file uploads
  code_path: "dbfs:/Volumes/workspace/default/volume/xor.py"  # Specific code path (optional, overrides volume_uri + temp filename)
  timeout: "30m"  # Job timeout duration
  
  job:
    name: "run-<script-stem>-<timestamp>"  # Job name template (supports <script-stem> and <timestamp>)
    tasks:
      - task_key: "t"
        spark_python_task:
          python_file: "<remote_path>"  # Will be automatically replaced with actual remote path
        environment_key: "default"
    environments:
      - environment_key: "default"
        spec:
          client: "1"
          dependencies:
            - "scikit-learn>=1.0.0"
            - "numpy>=1.20.0"
```

Then run the co-datascientist with: 
```bash
co-datascientist run --cloud-config databricks_config.yaml
```

Now your new optimized model checkpoints will save in : dbfs:/Volumes/workspace/default/volume/co-datascientist-checkpoints

## 🙋‍♀️ Need help?

We’d love to chat: [oz.kilim@tropiflo.io](mailto:oz.kilim@tropiflo.io)

---

<p align="center"><strong>All set? Ignite your pipelines and watch them soar! 🚀</strong></p>

<p align="center"><em>⚠️  Disclaimer: Co-DataScientist executes your scripts on your own machine. Make sure you trust the code you feed it!</em></p>

<p align="center">Made with ❤️ by the Tropiflo team</p>
