Persine, the Persona Engine

Persine is an automated tool to study and reverse-engineer algorithmic recommendation systems. It has a simple interface and encourages reproducible results. You tell Persine to drive around YouTube and it gives back a spreadsheet of what else YouTube suggests you watch!

Persine => Per[sona Eng]ine

For example!

People have suggested that if you watch a few lightly political videos, YouTube starts suggesting more and more extreme content – but does it really?

The theory is difficult to test since it involves a lot of boring clicking and YouTube already knows what you usually watch. Persine to the rescue!

  1. Persine starts a new fresh-as-snow Chrome
  2. You provide a list of videos to watch and buttons to click (like, dislike, "next up", etc.)
  3. As it watches and clicks more and more, YouTube customizes and customizes
  4. When you're all done, Persine will save your winding path and the video/playlist/channel recommendations to nice neat CSV files.

Beyond analysis, these files can be used to repeat the experiment again later, seeing if recommendations change by time, location, user history, etc.
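As a rough sketch of that kind of comparison, the snippet below checks two saved recommendation files against each other using only the standard library. The column name `url` is an assumption for illustration, not something this README documents, so check the header row of your own CSV and pass the right name as `key`.

```python
import csv


def load_keys(path, key="url"):
    """Return the set of non-empty values found in column `key` of a CSV file.

    `key` defaults to "url", which is a hypothetical column name.
    """
    with open(path, newline="", encoding="utf-8") as handle:
        return {row[key] for row in csv.DictReader(handle) if row.get(key)}


def compare_runs(path_a, path_b, key="url"):
    """Compare the recommendations saved by two separate runs."""
    first = load_keys(path_a, key)
    second = load_keys(path_b, key)
    return {
        "shared": first & second,        # recommended in both runs
        "only_first": first - second,    # dropped by the second run
        "only_second": second - first,   # new in the second run
    }
```

Calling `compare_runs("recs_monday.csv", "recs_friday.csv")` (hypothetical file names) would then show which recommendations stayed the same and which changed between the two runs.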

If you didn't quite get enough data, don't worry – you can resume your exploration later, picking up right where you left off. Since each persona is backed by a Chrome profile, all your cookies and history will be safely stored in the meantime.

An actual example

See Persine in action on Google Colab.

Installation

pip install persine

Persine will automatically install Selenium and BeautifulSoup for browsing/scraping, pandas for data analysis, and pillow for processing screenshots.

You will need to install chromedriver to allow Selenium to control Chrome. Persine won't work without chromedriver!

  • Installing chromedriver on OS X: Go to the chromedriver download page and click the "latest stable release" link. Download chromedriver_mac64.zip, unzip it, and move the chromedriver file into your PATH. I typically put it in /usr/local/bin.
  • Installing chromedriver on Windows: Go to the chromedriver download page and click the "latest stable release" link. Download chromedriver_win32.zip, unzip it, and move chromedriver.exe into your PATH (in the spirit of anarchy I just put it into C:\Windows).
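If you want to confirm that chromedriver really ended up on your PATH before running Persine, a quick check like the one below should do it. It only uses Python's standard `shutil.which`; the helper name `find_chromedriver` is made up for this example and is not part of Persine.

```python
import shutil


def find_chromedriver(name="chromedriver"):
    """Return the full path to the executable if it is on PATH, else None."""
    # On Windows, shutil.which also tries extensions such as .exe.
    return shutil.which(name)


path = find_chromedriver()
if path is None:
    print("chromedriver was not found on PATH")
else:
    print(f"chromedriver found at {path}")
```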

Quickstart

In this example, we start a new session by visiting a YouTube video and clicking the "next up" video three times to see where it leads us. We then save the results for later analysis.

from persine import PersonaEngine

engine = PersonaEngine(headless=False)

with engine.persona() as persona:
    persona.run("https://www.youtube.com/watch?v=hZw23sWlyG0")
    persona.run("youtube:next_up#3")
    persona.history.to_csv("history.csv")
    persona.recommendations.to_csv("recs.csv")

We turn off headless mode because it's fun to watch!

Persine basics

Persine is built around an Engine that stores all of your global settings, and Personas that represent the individual users who browse the web.

Creating Personas

Personas are always generated by an engine.

from persine import PersonaEngine

engine = PersonaEngine()
persona = engine.persona()

By default, personas are single-use and their browsing history will be discarded after your script is run. If you give them a name, though, they'll save their browsing/recommendation history so you can resume them later.

persona = engine.persona('Mulberry')
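A small sketch of how a named persona might be resumed across sessions is below. It assumes a named persona can be used in a `with` block the same way as the unnamed one in the quickstart; the README only shows the two separately, so treat that as an assumption. The helper `run_commands` is hypothetical and not part of Persine.

```python
def run_commands(engine, name, commands):
    """Run each command in order as the persona called `name`.

    Assumes engine.persona(name) works as a context manager, like the
    unnamed engine.persona() shown in the quickstart.
    """
    with engine.persona(name) as persona:
        for command in commands:
            persona.run(command)


# Usage (needs Persine, Chrome, and chromedriver installed):
#   from persine import PersonaEngine
#   engine = PersonaEngine()
#   run_commands(engine, "Mulberry", ["youtube:next_up#3"])
#   # ...later, in another script, continue as the same persona:
#   run_commands(engine, "Mulberry", ["youtube:next_up#3"])
```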

Launching Chrome and visiting pages

You can use with to automatically start/stop Chrome.

with engine.persona() as persona:
    persona.run("https://www.youtube.com/watch?v=hZw23sWlyG0")
    persona.run("youtube:next_up#3")

If you prefer more control or to visit sites one-by-one, you can manually call .quit() when you're done.

persona.run("https://www.youtube.com/watch?v=hZw23sWlyG0")
persona.run("youtube:next_up#3")

# Quit Chrome
persona.quit()

We can turn off headless mode if we want to actually watch what Chrome is up to. When running in this mode, Persine automatically installs uBlock Origin so you don't have to deal with ads.

engine = PersonaEngine(headless=False)

Headless mode doesn't support extensions, so by default our invisible Chrome is unfortunately watching ads. We should probably switch to Firefox but it has its own problems.

Seeing and saving results

History is all of your commands and the pages visited, while recommendations are what you've been recommended to watch. Recommendations include video sidebars, homepage listings, and search results.

For convenience, you can use .to_df() to see these as pandas DataFrames.

persona.recommendations.to_df()
persona.history.to_df()

If you'd prefer to do your analysis elsewhere, you can save them to CSV files.

persona.recommendations.to_csv('recs.csv')
persona.history.to_csv('hist.csv')
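Once the CSV is saved, a first pass at analysis can be as simple as counting which items show up most often. The sketch below uses only the standard library so it runs without pandas. The column name `url` is a guess for illustration; look at the header row of your own file and pass the real name as `column`.

```python
import csv
from collections import Counter


def most_recommended(path, column="url", top=10):
    """Return the `top` most frequent values of `column` as (value, count) pairs.

    `column` defaults to "url", which is a hypothetical column name.
    """
    with open(path, newline="", encoding="utf-8") as handle:
        counts = Counter(
            row[column] for row in csv.DictReader(handle) if row.get(column)
        )
    return counts.most_common(top)
```

For example, `most_recommended("recs.csv", top=5)` would list the five values that appear most often in that column.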

Bridges

Bridges are site-specific scrapers that tell Persine what to click, what to scrape, and other site-specific commands. Right now the only bridge we have is for YouTube (add more, please?).

YouTube commands

The YouTube bridge supports the following custom commands:

command                      action
youtube:homepage             Visits youtube.com
youtube:search?SEARCHTERM    Searches YouTube for the specified term
youtube:next_up              When on a video page, clicks the "next up" video
youtube:like                 Clicks the like button
youtube:dislike              Clicks the dislike button
youtube:subscribe            Clicks the subscribe button
youtube:unsubscribe          Clicks the unsubscribe button
youtube:sign_in              Begins the signin process. You'll need to complete it manually

Repeating commands

If you'd like to repeat a command multiple times, you can append #[NUMBER] to it. For example, youtube:next_up#50 will watch the next fifty "next up" videos.
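If you're generating command strings in a script, two tiny helpers can keep the `#[NUMBER]` suffix consistent. Both `repeat` and `split_repeat` are hypothetical names for this example. They only illustrate the syntax described above and are not how Persine itself parses commands.

```python
def repeat(command, times):
    """Build a repeated command, e.g. repeat("youtube:next_up", 50)."""
    if times < 1:
        raise ValueError("times must be at least 1")
    return f"{command}#{times}"


def split_repeat(command):
    """Split a command into (base, count); count is 1 with no numeric suffix."""
    base, separator, count = command.rpartition("#")
    if separator and count.isdigit():
        return base, int(count)
    return command, 1
```

The result of `repeat("youtube:next_up", 50)` can be passed straight to `persona.run(...)`.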
