Metadata-Version: 2.3
Name: segment-downloader
Version: 0.1.0
Summary: Fully downloads a Strava segment leaderboard and saves it to a CSV file for further statistical analysis
Requires-Dist: selenium~=4.25.0
Requires-Python: >=3.10
Description-Content-Type: text/markdown

# About

## Author

This script was written by Dominik Rappaport. You can contact me via 
email: [dominik@rappaport.at](mailto:dominik@rappaport.at?subject=SegmentDownloader).

## Introduction

The Strava Segment Downloader is a Python-based script that downloads the full leaderboard of a
given Strava segment. The data is stored in a CSV file, which is the de facto standard for exchanging
statistical data.

## Why do you want to use this script?

Strava does not provide its users with advanced analysis methods for the segment leaderboards.
You cannot apply advanced filters or calculate statistical values like mean, median, or standard
deviation. All this can be easily done using software like R or Excel. The CSV file generated by
this script can be easily imported into these tools.
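
As a minimal sketch of such an analysis in Python, the following computes mean, median, and standard deviation with the standard library only. The column name `time_seconds` and the sample rows are assumptions made for illustration; check the header of the generated CSV file for the actual column names.

```python
import csv
import io
import statistics

# Hypothetical sample data. The real file is written by the script, and its
# column names may differ from the `time_seconds` column assumed here.
SAMPLE_CSV = """athlete,time_seconds
A,600
B,630
C,660
D,750
"""


def summarize(csv_file, column="time_seconds"):
    """Return mean, median and sample standard deviation of one numeric column."""
    values = [float(row[column]) for row in csv.DictReader(csv_file)]
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),
    }


summary = summarize(io.StringIO(SAMPLE_CSV))
print(summary)
```

To run this against a real download, pass a file opened with `open("leaderboard_12345678.csv", newline="")` instead of the in-memory sample, and adjust `column` to match the file.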

## Background details

Strava provides a public API to programmatically interact with their data. That would be the most
natural way of fetching the leaderboard data. Unfortunately, Strava deprecated the API endpoint for
downloading leaderboards in 2020. This link provides more information:

https://developers.strava.com/docs/segment-changes/

The background of that controversial decision is described in an article by the well-known cycling
blogger DC Rainmaker:

https://www.dcrainmaker.com/2020/05/strava-leaderboard-reduces.html

As a consequence, traditional screen scraping is the only remaining way to get that data. Because Strava's
website makes extensive use of JavaScript, the leaderboard is not present in the static HTML that a
parser like BeautifulSoup would see, so we have to use Selenium to remote-control a browser.

## Challenges that come with screen scraping

Screen scraping is a fragile method to get data from a website. The website's structure may change
anytime and the script may break as a consequence.

In addition, Strava imposes measures to prevent people from doing exactly that. In particular,
they apply a rate limit to the number of requests you can make to their website. If you exceed
that limit, you will be blocked from accessing the leaderboard data for a certain period of time
(typically 24 hours).

To cope with this, you can interrupt the script using Ctrl+C (SIGINT) and continue another day.
With the `--resume` switch, it continues where it left off. Note that this
may introduce inconsistencies in the data, as the leaderboard may have changed in the meantime.
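
The interrupt-and-resume pattern can be sketched as follows. This is an illustration of the general technique, not the script's actual implementation: the function `download_pages`, the JSON state file, and the `fetch_page` callback are all hypothetical names.

```python
import json
import os
import tempfile


def download_pages(fetch_page, total_pages, state_path, resume=False):
    """Fetch pages one by one; on Ctrl+C, save progress so a later run can resume."""
    state = {"next_page": 1, "rows": []}
    if resume and os.path.exists(state_path):
        with open(state_path) as f:
            state = json.load(f)
    try:
        while state["next_page"] <= total_pages:
            state["rows"].extend(fetch_page(state["next_page"]))
            state["next_page"] += 1
    except KeyboardInterrupt:
        # Ctrl+C (SIGINT) raises KeyboardInterrupt in Python by default.
        pass
    finally:
        # Always persist the progress, whether finished or interrupted.
        with open(state_path, "w") as f:
            json.dump(state, f)
    return state


def fake_fetch(page):
    # Stand-in for the real scraping step; simulates one Ctrl+C on page 3.
    if page == 3 and not fake_fetch.interrupted:
        fake_fetch.interrupted = True
        raise KeyboardInterrupt
    return [f"row-{page}"]


fake_fetch.interrupted = False

state_file = os.path.join(tempfile.mkdtemp(), "state.json")
first = download_pages(fake_fetch, 4, state_file)
second = download_pages(fake_fetch, 4, state_file, resume=True)
print(first["next_page"], len(second["rows"]))
```

The first call stops at the simulated interrupt and records page 3 as the next page to fetch; the second call picks up from there.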

When the rate limit kicks in, the website freezes in the "Loading" state, and the script typically
fails with the following error message:

```text
Error: Can't navigate to the next page (Element <a href="/segments/..."> is not clickable at point (856,935) because another element <div class="loading-panel"> obscures it).
```

Please refer to the section [Usage with large segments](#usage-with-large-segments) for details on how to deal with
such challenges.

# How to use the script

## Installation

The segment_downloader is distributed as a Python package. Several installation methods are available.

### Using pip

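Running `pip` installs the package into your current Python environment. Installing into the system-wide Python
used to be possible, but many modern Linux distributions mark it as externally managed (PEP 668), and `pip`
refuses to install there. Use a virtual environment or one of the tools described below instead.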

```bash
pip install segment_downloader
```

### Using pipx or uv

Both `pipx` and `uv` enable global tool installation. The package can be installed as follows:

```bash
pipx install segment_downloader
```

or

```bash
uv tool install segment_downloader
```

## Usage

Selenium starts the browser with a blank profile, so we have to log in to Strava first.
If the script logged in on every run, frequent use could cause Strava to temporarily block your account.
To avoid this, a separate authentication script logs in to Strava once and saves the session cookies in a file.
The main script then uses this file to authenticate.

Username and password are stored in environment variables. I decided to use environment variables
instead of command line parameters to make it easier to use the script programmatically, for example in
GitHub Actions together with GitHub secrets.

```bash
export STRAVA_USERNAME="your_username"
export STRAVA_PASSWORD="your_password"
segment_downloader_authenticate
```

The script saves the cookies in a file `cookies.pkl`. As of today, the filename is hardcoded.
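
Saving and restoring cookies with `pickle` can be sketched as follows. This is a minimal illustration under stated assumptions, not the script's actual code: `save_cookies` and `load_cookies` are hypothetical helper names, the cookie shown is made up, and `FakeDriver` is a stand-in so the sketch runs without a browser. Only `get_cookies()` and `add_cookie()` are real Selenium WebDriver methods.

```python
import os
import pickle
import tempfile


def save_cookies(driver, path):
    """Write the browser session's cookies to a pickle file."""
    with open(path, "wb") as f:
        pickle.dump(driver.get_cookies(), f)


def load_cookies(driver, path):
    """Add previously saved cookies to a new browser session."""
    with open(path, "rb") as f:
        for cookie in pickle.load(f):
            driver.add_cookie(cookie)


class FakeDriver:
    """Stand-in for a Selenium WebDriver (hypothetical, for demonstration only)."""

    def __init__(self, cookies=None):
        self.cookies = list(cookies or [])

    def get_cookies(self):
        return list(self.cookies)

    def add_cookie(self, cookie):
        self.cookies.append(cookie)


path = os.path.join(tempfile.mkdtemp(), "cookies.pkl")
logged_in = FakeDriver([{"name": "session", "value": "abc"}])  # made-up cookie
save_cookies(logged_in, path)

fresh = FakeDriver()
load_cookies(fresh, path)
print(fresh.get_cookies())
```

With a real Selenium driver, the browser has to navigate to the Strava domain before `add_cookie()` accepts cookies for it.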

Then you can run the main script passing the segment ID as a command line parameter:

```bash
segment_downloader 12345678
```

The script will download the leaderboard of the segment with the ID 12345678. It creates a CSV file with 
the name `leaderboard_12345678.csv`.

You can interrupt the script at any time using Ctrl+C, as described above in the section
[Challenges that come with screen scraping](#challenges-that-come-with-screen-scraping).
If you want to continue where you left off, you can use the `--resume` switch:

```bash
segment_downloader --resume 12345678
```

## Usage with large segments

To work around Strava's rate limit we recommend the following strategy:

1. Download the segment in smaller chunks. At the moment, the script exits with an error message when it gets blocked by Strava, and
   the download effort up to that point is wasted.
2. Use Ctrl+C and the `--resume` option to interrupt and resume the download, repeating this as often as needed.
3. This can be automated with a tool like `gtimeout`. The following example illustrates how to download a large segment in chunks of 10 minutes.

```bash
# First download
gtimeout -f -s INT 10m python segment_downloader.py 2891805
# Call the script with the resume option as often as needed by repeating the following line:
gtimeout -f -s INT 10m python segment_downloader.py --resume 2891805
```

Note: By default, `gtimeout` repeats the signal if the script doesn't exit instantly. Because the script catches SIGINT and saves the data,
receiving the signal a second time breaks it. That will hopefully be fixed in the future. In the meantime, the `-f` option prevents the
repeated signal. `gtimeout` is the name under which GNU `timeout` is installed by Homebrew's coreutils on macOS; on Linux the command is
usually called `timeout`. Of course, the approach described above can also be carried out manually, without tools like `gtimeout`.
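
The same loop can also be sketched in Python. The following is an illustration for POSIX systems, not part of the package: `run_chunk` and `download_in_chunks` are hypothetical names, and the sketch assumes that a run which exits before the time limit has finished the download, which the source does not confirm. The demonstration uses stand-in commands instead of the real downloader.

```python
import signal
import subprocess
import sys


def run_chunk(command, seconds):
    """Run a command; after `seconds`, send SIGINT once.

    Returns True if the command exited on its own before the time limit.
    """
    process = subprocess.Popen(command)
    try:
        process.wait(timeout=seconds)
        return True
    except subprocess.TimeoutExpired:
        process.send_signal(signal.SIGINT)  # sent exactly once, never repeated
        process.wait()  # give the child time to save its data and exit
        return False


def download_in_chunks(first_command, resume_command, seconds, max_runs):
    """Run the first command, then the resume command, until one run finishes.

    Returns the number of runs needed, or None if max_runs was not enough.
    """
    command = first_command
    for run in range(1, max_runs + 1):
        if run_chunk(command, seconds):
            return run
        command = resume_command
    return None


# Stand-in commands: the first one runs too long and handles Ctrl+C,
# the second one finishes immediately.
slow = [sys.executable, "-c",
        "import time\ntry:\n    time.sleep(5)\nexcept KeyboardInterrupt:\n    pass"]
fast = [sys.executable, "-c", "pass"]

runs = download_in_chunks(slow, fast, seconds=1, max_runs=3)
print(runs)
```

For a real download, the two commands would be `["segment_downloader", "2891805"]` and `["segment_downloader", "--resume", "2891805"]` with `seconds=600`.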

# Notes and Warnings

- The script uses the Firefox browser and expects the Strava pages to be in English. It may fail if
  the pages are in a different language because it identifies elements such as buttons and categories
  by their labels.
- The script tries to compile a single leaderboard with all data in one table. On Strava, attributes
  such as age group, sex, and weight group are not included in the full table. You can only
  see whether a user is male or female when the leaderboard is displayed with the respective filter
  applied. To get the full data, the script downloads the leaderboard for each category separately
  and joins the tables.
- Please note that no user is obliged to specify their sex, weight or age or keep these values up to date. 
  You may end up with missing data or wrong data in these columns.
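
The per-category join mentioned in the notes above can be sketched as follows. This is a minimal illustration, not the script's actual logic: the function `merge_leaderboards`, the `athlete_id` key, and all sample values are assumptions. Athletes who do not appear under any filter of a category get `None`, which corresponds to the missing data described in the last note.

```python
def merge_leaderboards(overall, by_category):
    """Attach category labels to the overall rows, matching on athlete ID.

    overall: list of row dicts from the unfiltered leaderboard.
    by_category: {column: {label: [row dicts seen under that filter]}}.
    """
    merged = {row["athlete_id"]: dict(row) for row in overall}
    for column, groups in by_category.items():
        # Default to None so athletes missing from every filter stay visible.
        for row in merged.values():
            row[column] = None
        for label, rows in groups.items():
            for row in rows:
                if row["athlete_id"] in merged:
                    merged[row["athlete_id"]][column] = label
    return list(merged.values())


# Hypothetical sample data.
overall = [
    {"athlete_id": 1, "time": "10:00"},
    {"athlete_id": 2, "time": "10:30"},
    {"athlete_id": 3, "time": "11:00"},
]
by_category = {
    "sex": {"men": [{"athlete_id": 1}], "women": [{"athlete_id": 2}]},
    "age_group": {"25 to 34": [{"athlete_id": 1}, {"athlete_id": 3}]},
}

table = merge_leaderboards(overall, by_category)
for row in table:
    print(row)
```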