Classifying Bay Area Strava road segments

I have built an online tool for classifying and visualizing Bay Area Strava road segments according to specific competitive cyclist workout types. Below is a screenshot but click here to use the online training roads classification app.


As a local Bay Area cyclist I have always been a strong advocate for bicycle exploration and adventure. Seeking new terrain, finding unfamiliar roads, and discovering amazing landscapes is what fuels my passion for cycling - constantly pushing the boundaries of my mental and physical space. For me, it is in sharing these experiences and giving others the chance to create their own adventures that makes cycling personal and unique. 

Over the years this zeal for pushing boundaries on a bicycle has also brought about an enthusiasm for bicycle road racing and with it another perspective on seeking new roads. As a devoted road racer following a focused training program, I am constantly asking myself the question:

How do I find the best roads for specific training?

With detailed knowledge of the area it is easy to simply go to the mountains for climbing or go to the coast for flat roads, but I wanted more detail - I wanted to know where to find specific terrain and how cyclist are riding it. With a knowledge of the physical characteristics of certain roads combined with a measure of how other competitive cyclist are riding these roads, one can make better decisions on where to train according to specific workout types like: 

  • Tempo and endurance 
  • Leg speed and cadence
  • Climbing and resistance
  • Sprinting and power

So I decided to use some analytical tools and construct a road classifier to help myself and anyone else, easily find the some of the best roads in the Bay Area for specific workout types. 

Getting some data

To find data on the physical information of a road along with information on how people are riding it, I turned to Strava, the social fitness network based here in San Francisco. Strava allows users to upload and track their rides, but more importantly for my project, they also track information on how cyclists are riding certain Strava road segments. Specifically a Strava segment is a user defined stretch of road with a segment page providing information on the exact physical description of the road segment and a leaderboard of who has ridden that segment the fastest. 

This was all the data I needed in one place. Using Strava's API I could collect physical data sets of the total distance, total elevation gain, and the average gradient for any segment. Furthermore, I could collect rider data sets from the leaderboard and compute the average speed, average time, average heart rate, and average power for any rides completed on any segment. 

So the next question was what segments will I be looking at and how will I collect their specific data? First I started to think about how to collect the physical data sets. When it came to using Strava's API, I had a couple options:

  1. I could collect the data from a segment as long as I had its segment id. 
    • The only problem with this was that I needed to already know all the segment ids for the roads I want to look at across the Bay Area. 
  2. I could collect the data from a segment explorer.
    • However this search engine will only give the 10 most popular segments for given latitude/longitude coordinate window and would only provide more within the same area if you zoom in on the area. Furthermore this would not let me choose specific segments. I would just get any segment that happened to be in that window. 
  3. I could collect the data from segments I have personally starred as segments I would like to follow. 
    • The difficulty with this is the work I would have to do ahead of time of picking segments to be starred. 

After trying out the three different options I decided to go with method number three. I scrapped the first method because I didn't have a way of getting all the ids of segments in the Bay Area. Then I scrapped the second method because implementing some code that would handle the segment explorer was more work than it was worth and in the end I would not have any control over what segments I was collecting data on. So even though the third method meant I would have to personally go through and pick segments, it gave me more control in terms of collecting data on segments that are actually useful for training. By useful I mean segments that are not less than 500 meters long going straight through an intersection in downtown San francisco's market street. Surprisingly there are a lot of segments on Strava that are too short or dangerous for actual training efforts and I wanted to exclude those from my clustering model.

So after taking some time to collect a solid representation of useful road segments for training purposes around the Bay Area, I was able to get my physical data set of the total distance, total elevation gain, and the average gradient for these select Strava road segments. Not only that, I also had all the segment ids, which will help me collect and format the rider data set, and I had the encoded polylines for all the segments, which would allow me to plot these segments on a map. 

The only thing left to do was collect the rider data set, but with the segment ids, this was no problem at all. Using Strava's API and the list of all the segment ids from the physical data set, I was able to collect any data I needed from each segments leaderboard. Furthermore to ensure that I was using data that represented competitive athletes riding efforts over these segments, I specifically looked at the top 20% of rides on the leaderboard to compute my averages of how competitive athletes are riding these Strava segments. The result was a rider data set composed of the average speed, average time, average heart rate, and average power for each segment I collected earlier in the physical data set. Now its time for some modeling. 

Modeling the physical data set

After cleaning up the data set so I could just work with the numerical physical data, I decided to run an unsupervised machine learning algorithm to see what results I would get. Specifically I implemented the unlabeled k-means++ clustering model to make sure that the initial values for the k-means clustering algorithm were picked in a rational way and ensure a more meaningful clustering of the segments.  Also, a model was constructed for a various numbers of clusters to analyze the error and accuracy for finding the optimal number of clusters.

As a measure of error (the distance between all data points and the center of their assigned cluster) I was looking at the models moment of inertia, while as a measure of accuracy (how well defined are the clusters) I was considering the the models silhouette constant, with a value between -1 and 1. When trying to minimize this measure of error and maximize this measure of accuracy, it was obvious that the error will always decrease as the number of clusters increases, which is due to simply overfitting the data. However the accuracy would not continue to increase with more more clusters. Therefore I was able to determine the most optimal number of clusters for this data set to be 4. 

So with this number of clusters I implemented the clustering model, analyzed the resulting cluster centers, and looked at the respective segments labeled to each cluster. To my delight the clusters were very accurate and could be clearly defined by their physical descriptions: moderately flat terrain, steeper climbs, rolling hills, and longer steady climbs.

An online tool

Now that I had a working model and defined labels of useful Strava segments according to their physical data, it was time to visualize the information. I used the google maps API to access a map of the Bay Area and with google's the polyline decoder method I was able to plot all the specific Strava segments in color coordination to models labeling. Furthermore, a user can select any of the four labels and individually click on any segment to obtain an info window containing the exact physical data used in the model and a link to the specific Strava segment page. Check here to use the online physical data app.

With all the work I did for the physical data set, the pipeline was complete. I now had a way to take a data set, run the k-means++ clustering model, analyze its metrics to find the optimal number of clusters, and visualize the data with the models labeling. Thus I had closed the loop of making a useful tool for classifying useful Bay Area Strava road segments according to their physical information. Now it was time to try some more data see some different classifications. 

The rider data set

With the infrastructure in place and the data sets already collected and cleaned up, all I had to do was run through the pipeline with the rider data set. The result was another clustering model optimized with 4 clusters and an online tool for classifying and visualizing useful Bay Area Strava road segments according to how competitive cyclists are riding them. Once again the clusters were accurate and could be clearly defined by the type of effort (in relation to the average time, average speed, average heart rate, and average power): long steady efforts, long and short high speed efforts, steady/medium efforts, and medium/high efforts. Check here to use the online rider data app.

The combined data set

Now its time for the last step, combining both the physical data set and rider data set to make the most effective clustering on our Strava segments according to specific workout types. Again, with the completed pipeline all I had to do was combine the data sets, run the model for optimal clusters, and map the results. Interesting the optimal number of clusters was again 4, and after analyzing the respective clusters centers I was able to clearly define these clusters according to the following workout types (with their respective generalized terrain description):

  • Endurance, cadence, and leg speed intervals - flat and rolling
  • Climbing and resistance intervals - steep
  • Tempo and climbing speed intervals - rolling and climbs
  • Sprints and tempo intervals - flat and rolling.

Thus the end result is a tool for classifying and visualizing useful Bay Area Strava road segments according to specific workout types. Check here to use the online training roads classification app.


Overall I was fairly pleased with the results and the end product, but there are a couple things I would like to add in the future.

  1. I would like to use more data - this would be a much more powerful tool if I could select look at all the segments in the Strava database and select specific roads that are useful for training by some predefined constraints.
  2. I would also like to provide a search engine with my combine data set - users could input the name or id and receive the segments visualisation and its associated label. 
  3. Construct and routing system - allow users to select specific segments of any workout type and the provide suggested routes for riders according to these segments. 

I hope people find this to be a useful tool. I believe everyone can get some use out of this app - from beginner racers who are just starting to train more and are looking for the best roads, to veterans who have been riding these roads for years and are able to either reaffirm their understanding of the terrain or see familiar roads in a new perspective.  

For my code and data sets see the github repo here.