Metadata-Version: 2.1
Name: persine
Version: 0.1.1
Summary: Persine is an automated tool to study and reverse-engineer algorithmic recommendation systems. It has a simple interface and encourages reproducible results.
Home-page: https://github.com/jsoma/persine
License: MIT
Keywords: algorithmic accountability,recommendation systems,scraping
Author: Jonathan Soma
Author-email: jonathan.soma@gmail.com
Requires-Python: >=3.6.3
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.7
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Requires-Dist: Pillow (>=7.0.0)
Requires-Dist: beautifulsoup4 (>=4.6.3)
Requires-Dist: pandas (>=1.1.5,<2.0.0)
Requires-Dist: selenium (>=3.141.0,<4.0.0)
Project-URL: Repository, https://github.com/jsoma/persine
Description-Content-Type: text/markdown

# Persine, the Persona Engine

Persine is an **automated tool to study and reverse-engineer algorithmic recommendation systems**. It has a simple interface and encourages reproducible results. You tell Persine to drive around YouTube and it gives back a spreadsheet of what else YouTube suggests you watch!

> Persine => **Pers**[ona Eng]**ine**

### For example!

People have suggested that if you watch a few lightly political videos, YouTube starts suggesting more and more extreme content – _but does it really?_

The theory is difficult to test since it involves a lot of boring clicking and YouTube already knows what you usually watch. **Persine to the rescue!**

1. Persine starts a new fresh-as-snow Chrome
2. You provide a list of videos to watch and buttons to click (like, dislike, "next up", etc.)
3. As it watches and clicks more and more, YouTube customizes and customizes
4. When you're all done, Persine will save your winding path and the video/playlist/channel recommendations to nice neat CSV files.

Beyond analysis, these files can be used to repeat the experiment later to see whether recommendations change with time, location, user history, etc.

If you didn't quite get enough data, don't worry – you can resume your exploration later, picking up right where you left off. Since each "persona" is based on Chrome profiles, all your cookies and history will be safely stored until your next run.

### An actual example

See Persine in action [on Google Colab](https://colab.research.google.com/drive/1eAbfwV9mL34LVVIzW4AgwZt5NZJ21LwT?usp=sharing). Includes a few examples for analysis, too.

## Installation

```
pip install persine
```

Persine will automatically install Selenium and BeautifulSoup for browsing/scraping, pandas for data analysis, and Pillow for processing screenshots.

You will need to install [chromedriver](https://chromedriver.chromium.org/) to allow Selenium to control Chrome. **Persine won't work without it!**

* **Installing chromedriver on OS X:** I hear you can install it [using Homebrew](https://formulae.brew.sh/cask/chromedriver), but I've never done it! You can also follow the link above and click the "latest stable release" link, then download `chromedriver_mac64.zip`. Unzip it, then move the `chromedriver` file into a directory on your `PATH`. I typically put it in `/usr/local/bin`.
* **Installing chromedriver on Windows:** Follow the link above, click the "latest stable release" link. Download `chromedriver_win32.zip`, unzip it, and move `chromedriver.exe` into your `PATH` (in the spirit of anarchy I just put it in `C:\Windows`).
* **Installing chromedriver on Debian/Ubuntu:** Just run `apt install chromium-chromedriver` and it'll work.
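
If you're not sure whether chromedriver actually made it onto your `PATH`, a quick check from Python can save some head-scratching. This is a minimal sketch using only the standard library; `find_chromedriver` is a hypothetical helper name, not part of Persine.

```python
import shutil


def find_chromedriver(name="chromedriver"):
    """Return the full path of the chromedriver executable on PATH, or None."""
    # shutil.which searches PATH the same way your shell would
    return shutil.which(name)


path = find_chromedriver()
if path is None:
    print("chromedriver was not found on your PATH")
else:
    print("chromedriver found at", path)
```

If it reports that nothing was found, revisit the install steps for your operating system above.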

## Quickstart

In this example, we start a new session by visiting a YouTube video and clicking the "next up" video three times to see where it leads us. We then save the results for later analysis.

```python
from persine import PersonaEngine

engine = PersonaEngine(headless=False)

with engine.persona() as persona:
    persona.run("https://www.youtube.com/watch?v=hZw23sWlyG0")
    persona.run("youtube:next_up#3")
    persona.history.to_csv("history.csv")
    persona.recommendations.to_csv("recs.csv")
```

We turn off headless mode because it's fun to watch!

## Persine basics

Persine is built around an **engine** that stores all of your global settings, and **personas** that represent the individual users who browse the web.

### Creating Personas

Personas are always generated by an engine.

```python
from persine import PersonaEngine

engine = PersonaEngine()
persona = engine.persona()
```

By default, personas are single-use and their browsing history will be discarded after your script is run. If you give them a name, though, they'll save their browsing/recommendation history so you can resume them later.

```python
persona = engine.persona('Mulberry')
```

This is useful in conjunction with signing in to YouTube (see below), allowing you to imitate a real user watching videos over multiple sessions.

### Launching Chrome and visiting pages

You can use `with` to automatically start/stop Chrome. Makes life easy.

```python
with engine.persona() as persona:
    persona.run("https://www.youtube.com/watch?v=hZw23sWlyG0")
    persona.run("youtube:next_up#3")
```

If you prefer more control, or want to visit sites one by one, you can manually call `.quit()` when you're done.

```python
persona.run("https://www.youtube.com/watch?v=hZw23sWlyG0")
persona.run("youtube:next_up#3")

# Quit Chrome
persona.quit()
```

We can turn headless mode off or on depending on whether we want to actually watch what Chrome is up to. When running in non-headless mode, Persine automatically installs [uBlock Origin](https://chrome.google.com/webstore/detail/ublock-origin/cjpalhdlnbpafiamejdnhcphjbkeiagm) so you don't have to deal with ads.

```python
engine = PersonaEngine(headless=False)
```

> Headless mode doesn't support extensions, so by default our invisible Chrome is unfortunately watching ads. We should probably switch to Firefox but it has [its own problems](https://firefox-source-docs.mozilla.org/testing/geckodriver/Notarization.html).
 
### Seeing and saving results

**History** is every command you've run and every page you've visited, while **recommendations** are what you've been recommended. Recommendations include video sidebars, homepage listings, and search results.

> Right now recommendations also include ads and unrelated promoted content. I'm on the fence about whether they should stay or go.

For convenience, you can use `.to_df()` to see history and recommendations as pandas DataFrames.

```python
persona.recommendations.to_df()
persona.history.to_df()
```

If you'd prefer to do your analysis elsewhere, you can save them to CSV files.

```python
persona.recommendations.to_csv('recs.csv')
persona.history.to_csv('hist.csv')
```
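
As a rough sketch of what "elsewhere" might look like, the snippet below reads one of the saved files with Python's built-in `csv` module and reports how many rows and which columns it has. It doesn't assume any particular column names, since the columns Persine writes aren't listed here; `summarize_csv` is a made-up helper, not part of Persine.

```python
import csv


def summarize_csv(path):
    """Report the number of data rows and the column names of a CSV file."""
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        rows = list(reader)  # reads every data row; the header row is consumed first
        columns = list(reader.fieldnames or [])  # fieldnames is None for an empty file
    return {"rows": len(rows), "columns": columns}
```

Calling `summarize_csv('recs.csv')` after saving returns a dictionary with the row count and the column names, which is a handy sanity check before starting a longer analysis.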

## Bridges

**Bridges** are site-specific scrapers that tell Persine what to click, what to scrape, and other site-specific commands. Right now the only completed bridge we have is for **YouTube**, while an Amazon one is in the works.

### YouTube commands

The YouTube bridge supports the following custom commands:

|command|action|
|---|---|
|`youtube:homepage`|Visits youtube.com|
|`youtube:search?SEARCHTERM`|Searches YouTube for the specified term|
|`youtube:next_up`|When on a video page, clicks the "next up" video|
|`youtube:like`|Clicks the like button|
|`youtube:dislike`|Clicks the dislike button|
|`youtube:subscribe`|Clicks the subscribe button|
|`youtube:unsubscribe`|Clicks the unsubscribe button|
|`youtube:sign_in`|Begins the sign-in process. You'll need to complete the process manually, but Persine will resume as soon as it notices you're logged in.|
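
Since every command is just a string passed to `persona.run()`, a whole session can be described as a plain list. The sketch below assumes nothing beyond the `.run()` method shown earlier; `run_script` is a hypothetical helper, not part of Persine.

```python
def run_script(persona, commands):
    """Run each command in order on any object that has a .run(command) method."""
    for command in commands:
        persona.run(command)


# Visit a video, react to it, then follow "next up" three times.
# The URL is the one from the Quickstart; the commands come from the table above.
script = [
    "https://www.youtube.com/watch?v=hZw23sWlyG0",
    "youtube:like",
    "youtube:subscribe",
    "youtube:next_up#3",
]
```

Inside a `with engine.persona() as persona:` block, `run_script(persona, script)` would run the four commands in order. Keeping the script as a list also makes it easy to save alongside your results.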

### Repeating commands

If you'd like to repeat a command multiple times, you can append `#[NUMBER]` to it. For example, `youtube:next_up#50` will watch the next fifty "next up" videos.
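
To make the command format concrete, here is one way such strings could be pulled apart. This is an illustrative sketch, not Persine's actual parser: `parse_command` and the regular expression are assumptions, as is the idea of combining a `?` argument with a `#` count.

```python
import re

# bridge:action, then an optional ?argument, then an optional #count at the very end
COMMAND_RE = re.compile(
    r"^(?P<bridge>[a-z_]+):(?P<action>[a-z_]+)"
    r"(?:\?(?P<arg>.*?))?"
    r"(?:#(?P<count>\d+))?$"
)


def parse_command(command):
    """Split a bridge command into its parts; return None for plain URLs."""
    if command.startswith(("http://", "https://")):
        return None  # a URL is visited directly, not treated as a bridge command
    match = COMMAND_RE.match(command)
    if match is None:
        raise ValueError("not a recognised command: " + repr(command))
    return {
        "bridge": match.group("bridge"),
        "action": match.group("action"),
        "arg": match.group("arg"),  # None when there is no ?argument
        "count": int(match.group("count") or 1),  # a missing #NUMBER means run once
    }


print(parse_command("youtube:next_up#50"))
print(parse_command("youtube:search?lofi beats"))
```

Under this reading, `youtube:next_up#50` is the `next_up` action on the `youtube` bridge with a repeat count of 50, and a command without `#[NUMBER]` runs once.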
