# Introducing the Co-DataScientist!

<!-- <div align="center">
  <img src="figures/Co-DataScientist.png" alt="Co-DataScientist Logo" width="420"/>
</div> -->

<p align="center">
  <img src="https://img.shields.io/badge/version-1.0.0-blue.svg" alt="Version"/>
  <img src="https://img.shields.io/badge/license-MIT-green.svg" alt="License"/>
  <img src="https://img.shields.io/badge/license-EPIC🔥-orange.svg" alt="Epic License"/>
  <img src="https://img.shields.io/badge/license-ML%20Beast-red.svg" alt="ML Beast License"/>
</p>

> **Kick back, relax, and tomorrow morning greet a shiny KPI you can parade at ML stand-up. 🎉**

---

## Why is everyone talking about the Co-DataScientist?

- 🧪 **Idea Explosion** — Launches a swarm of models, feature recipes & hyper-parameters you never knew existed.
- 🌌 **Full-Map Exploration** — Charts the entire optimization galaxy so you can stop guessing and start winning.
- ☕ **Hands-Free Mode** — Hit *run*, kick back with a latte (or snooze) and let the search party work through the night.
- 📈 **KPI Fanatic** — Every evolutionary step is laser-focused on cranking that one number sky-high.
- 🔒 **Data Stays Home** — Your training and testing data **never leaves your server**; everything runs 100 % locally.
- 🤑 **Zero-Surprise Costs** — Live token & dollar tracking keeps the finance goblins happy.

Fast-track your ML pipelines from 😩 _painful_ to 🏆 _heroic_
---

## 🔧 Quickstart — ⏱️ *30-Second Setup*

## 1. **Install**

```bash
pip install co-datascientist
```

## 2. **Write a tiny script** (e.g. `xor.py`). The _only_ rule: **print your KPI**! 🏷️

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
import numpy as np

# XOR toy-set
X = np.array([[0,0],[0,1],[1,0],[1,1]])
y = np.array([0,1,1,0])

# CO_DATASCIENTIST_BLOCK_START

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(random_state=0))
])

pipe.fit(X, y)
acc = accuracy_score(y, pipe.predict(X))

# CO_DATASCIENTIST_BLOCK_END


print(f"KPI: {acc:.4f}")  # Tag your metric!

# comments
# This is the classic XOR problem — it's not linearly separable!
# A linear model like LogisticRegression can't solve it perfectly,
# because no straight line can separate the classes in 2D.
# This makes it a great test for feature engineering or non-linear models.
```

Sanity check the baseline runs in your environment e.g. (you might need to pip install scikit-learn first!)

```bash
python xor.py
```

## 3. **Set your API Token (one time only!)**

Before running any commands, you need to set your Co-DataScientist API token. You only need to do this once per machine.

```bash
co-datascientist set-token --token <YOUR_TOKEN>
```


## 4. **Run the magic!** ✨


Then run the co-datascientist!

```bash
co-datascientist run --script-path xor.py --parallel 3
```

Watch accuracy jump from `0.5` 🫠 to `1.0` 🏆! 
##### (when you get to KPI=1.0 you can stop it)

You will find the new glowed up code in the `co_datascientist_checkpoints` directory.

### Yes, it's that simple.

<h2 align="center"><b>
Try it on <i>your</i> toughest problem and see how your KPI improves.<br>
<span style="font-size:2em;">🎯🚀</span><br>
<b>Co-DataScientist helps you get better results—no matter how big your challenge.</b>
</b></h2>

---

> **Important Notes About Your Input Script**


## 🎯 KPI Tagging

Co-DataScientist scans your stdout for the pattern `KPI: <number>` — that’s the metric it maximizes. Use **anything**: accuracy, F1, revenue per click, unicorns-per-second… you name it!

---

## 📁 Hardcode Your Data Paths

> **Important:** Please hardcode any data file paths directly in your script.  
> For example, use `data = np.loadtxt("full/path/to/my/my_data.csv")` or similar.  
> Do **not** use `input()` or command-line arguments to specify file paths.  
> This ensures Co-DataScientist can run your script automatically without manual intervention.


## 🧬 Blocks to evolve

As you will see in the XOR example, Co-DataScientist uses **# CO_DATASCIENTIST_BLOCK_START** and **# CO_DATASCIENTIST_BLOCK_END** tags to identify the parts of the system you want it to improve. Make sure to tag parts of your system you care about improving! It will help to Co-DataScientist stay focused on its job.

---

## 🗂️ One File Only: Self-Contained Scripts Required

> **Note:** Co-DataScientist currently supports only scripts written as a **single, self-contained Python file**. Please put all your code in one `.py` file—multi-file projects are not supported (yet!). Everything your workflow needs should be in that one file.

---

## 📝 Add Domain-Specific Notes for Best Results

After your code, add **comments** with any extra context, known issues, or ideas you have about your problem. This helps Co-DataScientist understand your goals and constraints! The Co-Datascientist UNDERSTANDS your problem. It's not just doing a blind search! 


> **Other helpful stuff**

## 💰 Cost Tracking

Stay on budget with one-liners:

```bash
co-datascientist costs            # summary
co-datascientist costs --detailed # per-run breakdown
```

Powered by **LiteLLM**’s real-time pricing.

---

## 📝 Before vs After
<table>
<tr>
<th>📥 "Meh" Pipeline <br><sub>KPI ≈ 0.50</sub></th>
<th>🚀 Turbocharged by Co-DataScientist <br><sub>KPI 🚀 1.00</sub></th>
</tr>
<tr>
<td>

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score
import numpy as np

# XOR data
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 1, 1, 0])

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', RandomForestClassifier(n_estimators=10, random_state=0))
])

pipeline.fit(X, y)
preds = pipeline.predict(X)
accuracy = accuracy_score(y, preds)
print(f'Accuracy: {accuracy:.2f}')
print(f'KPI: {accuracy:.4f}')
```

</td>
<td>

```python
import numpy as np
from sklearn.base import TransformerMixin, BaseEstimator
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score
from tqdm import tqdm

class ChebyshevPolyExpansion(BaseEstimator, TransformerMixin):
    def __init__(self, degree=3):
        self.degree = degree
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        X = np.asarray(X)
        X_scaled = 2 * X - 1
        n_samples, n_features = X_scaled.shape
        features = []
        for f in tqdm(range(n_features), desc='Chebyshev features'):
            x = X_scaled[:, f]
            T = np.empty((self.degree + 1, n_samples))
            T[0] = 1
            if self.degree >= 1:
                T[1] = x
            for d in range(2, self.degree + 1):
                T[d] = 2 * x * T[d - 1] - T[d - 2]
            features.append(T.T)
        return np.hstack(features)

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 1, 1, 0])

pipeline = Pipeline([
    ('cheb', ChebyshevPolyExpansion(degree=3)),
    ('scaler', StandardScaler()),
    ('clf', RandomForestClassifier(n_estimators=10, random_state=0))
])

pipeline.fit(X, y)
preds = pipeline.predict(X)
accuracy = accuracy_score(y, preds)
print(f'Accuracy: {accuracy:.2f}')
print(f'KPI: {accuracy:.4f}')
```

</td>
</tr>
</table>

---


## We now support Databricks!

<img src="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAATYAAACjCAMAAAA3vsLfAAAAxlBMVEX///8AAAD/OSff39+QkJCJiYn/NiP/IAC9vb2fn5/u7u7/MRzb29v6+vr/NSJgYGCvr69zc3P/LhjT09P/fXT+5eP/KAtaWlr09PT/9vX/QC//KxPn5+fIyMiqqqr/ZVp+fn7/wLz/oJr/xsNJSUn/1NH/cGb/d27/hn7/VUj/tbD/6ej/8O//2NX/r6r/393/STo/Pz//lY7/aV7/pZ//wr7/VEcaGhopKSltbW0jIyP/X1P/j4f/mZL/RjYREREyMjJRT085yGswAAALf0lEQVR4nO2cC1viOhCGuRShCAKyQl0Kdr2s63V1jyuC7iL//0+dNtfJJC0X6yIm3/Ocs9A2JXk7SWYyqYWCk5OTk5OTk5OTk5OTk5OTk9OGdHKz6RpsoQ6OqtUv15uuxZbp8Ed1t1TarR5vuiJbpYfefqm0t1cqDXYuN12XrdHvr71Saad6dNHbif95PNt0fbZCP5+rO6VS7+tJoXC5O4iNrvr9atN1+vC6/lKN++Z+74F8a/6iQ9yv5oar9cH1h2L6ccgPXJ8nGAeD203W6oOLdcr/DuDBm8e40+70/p5sqlYfXGd3hM+dNgV8Kw3IFPFzE7X64Lr6TnrjntHhOE4cEth3nYgOX+ig9iftPHF/+UzhRHW7n7i31fOMYOrnEfVLvv27Wn1wndwRv/ZiQeh+8pd4wc8uwk9EzWhQWsKMbnsLjdISsUGrutygxYdA2wPVh+qKUySZcHsv71mnDy/ikO2t6JCdPSJ/2DLdXFD3dmX33+Y5gQWb+28INs9+Wef+XiVje2n/DQ2/Oq8OqjnWaCt0UI2p7b5haCdLStZhO0wGttL+1zXXNS7348nEQmyx00+Xvi/WWNc4+5uEFTtWYhNO/6ppvQMaVvy92LET28J1D2MZthZyXPiyaym22HL+y1hlM+lhQMKKxEItxpas6a6Q1mOpwGcyHlqNbYW0HkgFJrIcm8hXvWSm9XgqUIQVdmKDvZJGWgCJLo5WhhU2YrsuVY/gOgaJ62UHxDJ15HMLscXBFVpm+8Z2fhjc37NHPRV49X3fQneXxKQoE/XA0nrI/TWlArnH9w9q+qFEsCW98jc4KF1ZcMzkEt/SPntu3Z6aBNvuQMtEafkYUyrw5O9Sma7PqBjb/p/bqh6UsrQeZWICxMjuWJkwjbENjsVeU7UH8rTewTM1PUM/tnV3KsVGljPIRiwlKGXDmSEVuHKm65OJY+NBKfYuEpgaILb16NnerUcSm3lbW+Kq7Rg8Yss3ukFsYq/pCzStyz0F0BWNv96S6foESrCBUYtBUfaaKuOXHpLGYO+sm01jbHvKGJUZlKZ04z0ro4Q9NODvpAz46ZOGddiapp1Gxwb3l6+b7+ohqYXYRAJK2demv2qV4RDbaG2JUkInuNzxYMgJsvDr6GjPTmwpgbrMybAMtCHYT1ZObFzdZRLLQjCNwGEKQFLXX+SOcYuxiUVIw4i/u6ctZCrvJ1iNLW3JO4FpDkl5xGU5tlRvFoWk+N0rK7EdKktF2pt9sS71kFR5089GbNfVqhJHwbFel/G90u8WYouDKxRHoa0KUGQTIMqSnt3ZmPC7qoqdQ0IncGOMlGlzDYu4ev+iqh9KpuSeKShggNQYbI19cZ9GppetcAhq/Hsgtr96xaLLRxg8KRsFH9iSCJxg+Z5fC7OkQrcD02BPxrK7m5OSPtbxieM3vpNlMrkW1P3t9bSZVf1zF1aLObJKUEph4uRC4djgFNurM8M7a8lGQZRxYe8DWv16nyqSRkBEbi7UGHXd9wE/tY6zdymw5aV9219b1pQ52r+4P3WUKuZblDTfgrm37g9rpcjoyZp8YidVtwMUlK7w5y5slhqlG+N9J5MO5MZwtnXc/UHUpcT2fFzcGZffnFJFl3OzXotxMuoP3Vlj9ybANRQHpQte+XMy6szF7E5OTk5OTk5OTk5OTk5OTk5OTk5OTk5braBdq9Xa3Xe4c0juHK5SpF5bWJEuuWuwfrXyUaOYqPIOd26TO9dWKOHF14/r2ddE5K7lN9UsB1Fs3jvceWVsNVLgKfuiyhZiC4iWvPPK2FrFJZBsIbYmuXay5J1XxnZKsWUPb1uLrbPknVfG5lFs2Ul8h027/TQp0M6+yGHT5bc6izwQh20tOWxryWFbSxvE1h1Vosino4gBW9ePosqIDTJN9l+iOrm2RQ6D+a45JAUa4EjyP4itQX9RmySb/AZeCH4HnCdVrbTlvc3YeHXCtmxZvmp6syJTJdCx1SN+tug1uede49WVuucFwgk/NPbjrz75mDRTYhs98UsiGaL6vP1BJ/kwVJ4KVVfcOi7JSCFswZSeJvftnoqq502tpjS+hrH5yunuYmzK8WnZgC18hZegXyoXuvS4ji1oqb84CXRswRhQ88DFM2D7OShSq1IclZXGoJoWhwuwNe/R4TLGNiyjK+YKtpCf1rB1i5pqGBunVsfUiouCjbdRix8LxDbXq8rra8b2arwcYsNchZ1SbG1+B4xtmFkPiq3Jeiih1sAXL1hKWUFtcc9+p3MKfoJik2hanQlsb1xdf97vsxKtfiwynwrjHMcFIMKG8mMxqklHGnIEsAkhbKE4MW1N+jNZS4BNoVbo09LdsMzM7r6QkwJeEzq9NWXNCTb+vO6H5HRdUuReBPkiHRA+TvbpQBJ2UrBVaMOG/El0Eba+3x6FKrZXtWiY1MWXj7YMqIWgboD5ODfXhzVLtjvgtSPYTsFnIjEssRpgv42dlVGk6FkKNuktQDsQ2Pq8NwFsrOyTnHiD/oj8K7A1nxRq1DzpA08MYPQmUlDM2ODCT3MsUTUwNdlVzNhqmJocxyE2uCzOuCXmxrHJpwCwUUMaGxZDODZEjT1ibmG1HN+FGJE7z5RjgFUkLUGonYWtrz0FwaWhlWWi/TQqCGyv8pzE1pV3waowWIgae8Ljd0gy0GYi620JbE/CEICeMrDpxiQOSmxojZsCmRYEtqE8J7F5hgfCRLGN7rUfZ8br5w6Odkh02xrHRrvwGJXx07FRQz1FBSKEzU+rhK/9nsQ2x0SlFEcIPjIxVs7zzWtRLnharnNsDeMT7qZjE8AV1RA27HVO+HnazEirSoKN2rhxfILY1EcCHKC5kfh6CkStFPHGD40U6unY6FCJF2O7CBt2Oj1uSL5WXGKjfpqxFYq1KYNfAB3H+5Xys1laChvuUwuxYe9oRWzQLFbHhkYcJbzKy+AoNjwWBRxbV2HCpczrpk6KOeNOipd4qO/Y5dhgH5bYpguxdegY+Yoa44PYJi97K5oq0+XY6qZ6qGuNKrau0XorCBvuxK/cSrKwnWrn0P097jlps2044kHj3FR8DZ1CBFwRx8bmuHJGGTVPSr/hmXmMsCHrptabzJ9Z2CibyNQI7rdxtxBbe0GuAOYUy3sGe6JVJdg6ButhfqeCTTxFilTdPsIaA9xd1WZawmCzsA3T2y1jUraaYxrDOqln1lAoEQmBOHSoIKKaqcfIFxFmjHQu7ClAbEpUUpNNzcLGxhM8DicCKyCs7mIM8175J8VterPmml3z1RyCcoq5iUVIBZvsx4ARVX0GDvGY9F76X4xaEiRkY2NzYh+c9anxwIUj9lzo/X3Qsm6e1iZWhlrsAXXFGr8HGlWMAqWRANtcNQLumHsMzEgUUBeOWPEggt8zsTEixRkfHtpTdi1cpmTtIfWhH9kERK0hN9dNLO6eeu1RRUDjHVesJPbj09FMnubYuGMUVSqdpIhwMCd+25erbdoy5Swatb0+/9YCzFOwgUc28XxetIGw8Z+IZNP6w6A+nJOPuS1T8gUIKGVRfKqdHivYQnCmVZBDmXY/vLoLNS0sgQ1nBqgCnIJhsEbSPqXyzNBibr6SggkwN5mCUaopmhei6/UUTBstfhenwVLYDNxmycUo4Sc3d+FUz7JZ8OXUUe49wgk/NXU1LCBsBZmAoM0LYEaiODMl/EbwCungLMKmJWGop4OwcSOLfZXyDF6dL7V4HpjLJoR80pHO11DaYzI1aDsShBHwWa4mLdQT5xNsImYN5bO4l7NbGjbg3END7bABHu/d5UbWhHUrPuW4BMIVjib9fn8yIt5ks5wIOpahn5zutMn0GJDTSiQw9D3P98H+goaf5LIiyrZOCjTVovVRJ/lFH85t8kIhckSZ/xqkLvG9m6nF6uUwFmtBLYqvbnn55padnJycnJycnJycnJycnJyctlr/A6tV92rnvlz6AAAAAElFTkSuQmCC" alt="Databricks Logo" width="300"/>


Databricks setup

1. Download the databricks CLI package
```
curl -fsSL https://raw.githubusercontent.com/databricks/setup-cli/main/install.sh | sudo sh
```

2. Get a databricks token : and test the CLI works
[Get your Databricks token here](https://docs.databricks.com/aws/en/dev-tools/auth/pat)
3. Prepare a config file with all of your compute/environmental requirements in  databricks_config.yaml example below

```yaml
# Enable Databricks integration
databricks: true

# Databricks configuration for XOR demo
databricks:
  cli: "databricks"  # databricks CLI command (optional, defaults to "databricks")
  volume_uri: "dbfs:/Volumes/workspace/default/volume"  # DBFS volume URI for file uploads
  code_path: "dbfs:/Volumes/workspace/default/volume/xor.py"  # Specific code path (optional, overrides volume_uri + temp filename)
  timeout: "30m"  # Job timeout duration
  
  job:
    name: "run-<script-stem>-<timestamp>"  # Job name template (supports <script-stem> and <timestamp>)
    tasks:
      - task_key: "t"
        spark_python_task:
          python_file: "<remote_path>"  # Will be automatically replaced with actual remote path
        environment_key: "default"
    environments:
      - environment_key: "default"
        spec:
          client: "1"
          dependencies:
            - "scikit-learn>=1.0.0"
            - "numpy>=1.20.0"
```

Then run the co-datascientist with: 
```bash
co-datascientist run --cloud-config databricks_config.yaml
```

Now your new optimized model checkpoints will save in : dbfs:/Volumes/workspace/default/volume/co-datascientist-checkpoints

## 🙋‍♀️ Need help?

We’d love to chat: [oz.kilim@tropiflo.io](mailto:oz.kilim@tropiflo.io)

---

<p align="center"><strong>All set? Ignite your pipelines and watch them soar! 🚀</strong></p>

<p align="center"><em>⚠️  Disclaimer: Co-DataScientist executes your scripts on your own machine. Make sure you trust the code you feed it!</em></p>

<p align="center">Made with ❤️ by the Tropiflo team</p>
