Space-Filling Curves

In an effort to combine some of my new skills with my passion for mathematics, I recently wrote a bit of Python code that builds various space-filling curves by way of complex-valued functions and plots them with matplotlib. I also wanted to try some of the awesome online graphing tools provided through Plotly using their Python API. The visualizations in this post are the results.

What are space-filling curves?

In analytic geometry a curve is defined as a continuous map from a one-dimensional space to an n-dimensional space. Intuitively, in two dimensions this is a line (not necessarily straight) that represents the graph of some "function", and it is how one defines this "function" that opens up a world of amazing objects. One class of such objects is space-filling curves: curves whose range (outputs) contains the entire 2-dimensional unit square. Most (not all) of these curves are constructed iteratively as the limit of a sequence of piecewise, self-avoiding continuous lines, each becoming closer and closer to its space-filling limit. See the image below:


Visualizing the iterative process of the curves becoming space-filling in their limit.

The images above represent Hilbert's space-filling curve. The first visualization at the beginning of this post is the Plotly graph I created by constructing the actual data points for each iteration of the first five curves using complex values and complex-valued functions. Then with Plotly's Python API I was able to further analyze the data and use some of their visualization tools to construct the layered final image.

The beauty of these plots is the fact that all the data is actually plotted using complex values, representing the real and imaginary parts as (x,y) ordered pairs. This way I can manipulate the geometric aspects of complex analysis to creatively find other space-filling curves and fractal-like visualizations.
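For those curious how complex numbers make this construction so compact, here is a minimal sketch of one standard complex-number recursion for the Hilbert iterations (a sketch of the idea, not necessarily the exact code linked below). The key trick: multiplying the conjugate of a point by i swaps its real and imaginary parts, i.e. reflects it across the diagonal, which is exactly the twist the bottom copies of the recursion need.

```python
import numpy as np
import matplotlib.pyplot as plt

def hilbert_points(order):
    """Vertices of the order-n Hilbert curve as complex numbers in the unit square."""
    pts = np.array([0.5 + 0.5j])  # order 0: the center of the unit square
    for _ in range(order):
        pts = np.concatenate([
            1j * pts.conjugate(),           # lower-left: reflect across the diagonal
            pts + 1j,                       # upper-left: translate up
            pts + 1 + 1j,                   # upper-right: translate up and right
            2 + 1j - 1j * pts.conjugate(),  # lower-right: reflect across the anti-diagonal
        ]) / 2                              # rescale the four copies into the unit square
    return pts

for n in range(1, 6):            # the first five iterations
    z = hilbert_points(n)
    plt.plot(z.real, z.imag)     # real/imaginary parts as (x, y)
plt.gca().set_aspect("equal")
plt.show()
```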

Furthermore, all Plotly graphs are embedded into this page so you can play around with the plots. Check the constructed values, zoom in and out on the graphs, and link to the actual spreadsheets of the data. Take a second and mess around with them. Zoom in to see the fractal-like nature of these curves!

Feel free to check out the code on GitHub here and my plots at Plotly here.

Road Route Planner - Applied graph theory and some bicycle touring

Here are some notes on a recent tool I built for finding road itineraries between user-defined geographical locations in California:

I have always had a 'healthy' obsession with maps for as long as I can remember. There is just something so pleasing to me about the aesthetic display of information communicated through graphic forms. I find myself looking at maps all the time, studying the shape of the terrain and committing to memory the roads that traverse the landscape like fibers of some living organism. This obsession has led to many adventures and has even resulted in the nickname 'junior cartographer' - yet the junior part is up for debate lately, due to some recent exploits over the past year.

With a passion for mapping and some newly acquired coding skills, it was only a matter of time until I made my own road route planner. To do this I started with the California road network dataset from the United States Census Bureau and read the data as a graph (nodes/intersections and edges/streets) using a Python library called NetworkX, which was made for studying graphs and networks. Once the graph was built, all that was needed was to find the starting point/node and trace through the graph to arrive at the end point/node. This seemed simple enough, but it makes a big assumption - that every intersection in the dataset is continuously connected to every other intersection by some sequence of roads, i.e. that the data forms a connected graph. Unfortunately this was not the case, so using the connected component functions in NetworkX, I found the largest connected subgraph within the original graph. It turns out that this was much smaller than the original graph, which basically means I was limited in the number of places between which I could make meaningful routes. However, this was just a constraint due to a lack of good data, and I was still determined to make a routing tool.
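As a rough sketch of this step (the file name and the plain edge-list format are assumptions; the actual Census Bureau files need their own parsing), building the graph and pulling out its largest connected component with NetworkX looks something like this:

```python
import networkx as nx

# Assumed input: a plain edge list, one "node_a node_b" pair per line,
# with comment lines starting with '#'. Adapt to the real dataset format.
G = nx.Graph()
with open("roadNet-CA.txt") as f:
    for line in f:
        if line.startswith("#"):
            continue
        a, b = line.split()
        G.add_edge(int(a), int(b))

# The road network is not fully connected, so routing is restricted
# to the largest connected component.
largest_cc = max(nx.connected_components(G), key=len)
road_graph = G.subgraph(largest_cc).copy()
print(road_graph.number_of_nodes(), "of", G.number_of_nodes(),
      "intersections are in the largest component")
```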

So the next step was to figure out what type of routing I was going to make. To keep it fairly straightforward, I decided to make a route planner that determines the shortest path between points. That means for every pair of adjacent nodes (i.e. every edge) in the subgraph I needed to compute the physical distance between the points and then search for the path through my connected subgraph that minimizes the total sum of distances. Such a path is a simple application of Dijkstra's algorithm, in which the weight assigned to an edge is the calculated distance between the nodes connected by that edge. Using the shortest_path function in NetworkX, I was able to employ this graph theory algorithm. The only non-trivial part was calculating those distances as accurate distances on the surface of the earth using the lat/long information from the dataset. However, with a bit of trigonometry, an understanding of geodesic lines, and some online research, I was able to build the distance function I needed. Now, with all the right tools, my route planner was done and I could find the shortest path between any two points in my connected subgraph.
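Here is a hedged sketch of the weighting and search step. The haversine formula below is one standard way to approximate great-circle distance; the 'pos' node attribute holding (lat, lon) pairs and the start_node/end_node names are illustrative assumptions, not the dataset's actual fields:

```python
from math import radians, sin, cos, asin, sqrt

import networkx as nx

def haversine(p, q):
    """Great-circle distance in kilometers between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(radians, (*p, *q))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371 * asin(sqrt(a))  # mean Earth radius ~6371 km

# Weight every edge by the physical distance between its endpoints,
# assuming each node carries a hypothetical (lat, lon) 'pos' attribute.
for u, v in road_graph.edges():
    road_graph[u][v]["distance"] = haversine(road_graph.nodes[u]["pos"],
                                             road_graph.nodes[v]["pos"])

# With a weight argument, shortest_path runs Dijkstra's algorithm.
# start_node / end_node: hypothetical node ids (see the lookup sketch below).
route = nx.shortest_path(road_graph, source=start_node, target=end_node,
                         weight="distance")
```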

Now these paths are specifically between existing nodes in the connected subgraph, but how do you know what and where those nodes are on a map just by looking at the lat/long values? So to make this tool a bit more useful I added some code that takes the lat/long coordinates of any starting and ending points and then finds the nodes in the subgraph that are closest to those coordinates - 'closest' being measured by the minimum Euclidean distance, since here we are not concerned with the actual physical distance.
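A simple linear scan over the nodes is enough for that lookup; a sketch, again assuming the hypothetical (lat, lon) 'pos' attribute from above:

```python
def nearest_node(graph, point):
    """Node whose (lat, lon) is closest to 'point' in plain Euclidean terms."""
    return min(graph.nodes,
               key=lambda n: (graph.nodes[n]["pos"][0] - point[0]) ** 2
                           + (graph.nodes[n]["pos"][1] - point[1]) ** 2)

start_node = nearest_node(road_graph, (36.60, -121.89))  # e.g. near Monterey
end_node = nearest_node(road_graph, (32.54, -117.03))    # e.g. near the border
```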

Thus, with some given inputs (starting/ending points), this road routing tool implements some graph theory methods and returns the path of shortest distance. Furthermore, with the information obtained from the dataset and the calculated physical distances, the code can give a cue sheet of the road itinerary along with the total distance of the path. To put it all together and visualize the path on a map, I decided to use OpenStreetMap (as opposed to the Google maps I used last time). Below is one of the resulting maps, with some modification to the code for calculating the path between three points instead of just two.

For the code and dataset see the GitHub repo here.


Tail end of a Canada - Mexico bicycle tour: Monterey area => Los Angeles area => California/Mexico border.

I picked this route because I wanted to document a path that was part of an amazing bicycle tour I completed back in the summer of 2007. With a completed undergraduate degree and the summer off before heading to graduate school, I decided to do a bicycle tour from Canada to Mexico along the west coast. I had done a lot of camping and overnights, but never bicycle touring. Furthermore, besides just riding/commuting around San Diego, this was by far the longest distance I had ever expected to ride. So I placed all my belongings in my parents' house, loaded up some panniers with way too much stuff, gathered up some camping necessities, picked up some cumbersome cooking equipment, and hopped on a flight to Seattle. With two other friends joining me and a lot of time on our hands, we had a very simple task - ride back to San Diego. Over the next month and a half we rode north from Seattle, ferry-hopped the San Juan Islands, rode down Vancouver Island, came back to Washington, headed to the coast, and rode all the way to Mexico. I will leave the many stories and adventures for another time, but without a doubt I can easily say this was one of the best experiences of my life. This map documents the tail end of the tour: along the amazing Big Sur coastline, through the endless rolling hills between San Luis Obispo and Santa Barbara, surviving the chaos of riding through Los Angeles, to the home stretch of San Diego and the Mexico border.

Here's to a great adventure and many more to come.


L to R - Chen, Will, David (me).

Chen came along for the first two weeks and left Will and me to make it the rest of the way home. This was taken when we got off the ferry at Vancouver Island. Still can't believe I carried so much stuff, but somehow I just got used to pushing around a 50-pound bike.

Classifying Bay Area Strava road segments

I have built an online tool for classifying and visualizing Bay Area Strava road segments according to specific competitive cyclist workout types. Below is a screenshot, but click here to use the online training roads classification app.


As a local Bay Area cyclist I have always been a strong advocate for bicycle exploration and adventure. Seeking new terrain, finding unfamiliar roads, and discovering amazing landscapes is what fuels my passion for cycling - constantly pushing the boundaries of my mental and physical space. For me, sharing these experiences and giving others the chance to create their own adventures is what makes cycling personal and unique.

Over the years this zeal for pushing boundaries on a bicycle has also brought about an enthusiasm for bicycle road racing and with it another perspective on seeking new roads. As a devoted road racer following a focused training program, I am constantly asking myself the question:

How do I find the best roads for specific training?

With detailed knowledge of the area it is easy to simply go to the mountains for climbing or go to the coast for flat roads, but I wanted more detail - I wanted to know where to find specific terrain and how cyclists are riding it. With knowledge of the physical characteristics of certain roads, combined with a measure of how other competitive cyclists are riding those roads, one can make better decisions on where to train according to specific workout types like:

  • Tempo and endurance 
  • Leg speed and cadence
  • Climbing and resistance
  • Sprinting and power

So I decided to use some analytical tools and construct a road classifier to help myself, and anyone else, easily find some of the best roads in the Bay Area for specific workout types.

Getting some data

To find data on the physical characteristics of a road along with information on how people are riding it, I turned to Strava, the social fitness network based here in San Francisco. Strava allows users to upload and track their rides, but more importantly for my project, they also track information on how cyclists are riding certain Strava road segments. Specifically, a Strava segment is a user-defined stretch of road with a segment page providing the exact physical description of the road segment and a leaderboard of who has ridden that segment the fastest.

This was all the data I needed in one place. Using Strava's API I could collect physical data sets of the total distance, total elevation gain, and the average gradient for any segment. Furthermore, I could collect rider data sets from the leaderboard and compute the average speed, average time, average heart rate, and average power for any rides completed on any segment. 
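For reference, pulling the physical fields for a single segment through the v3 API looks roughly like this. The access token is a placeholder, the segment id is just an example, and the field names follow Strava's documented segment object:

```python
import requests

ACCESS_TOKEN = "YOUR_TOKEN"  # placeholder; obtain via Strava OAuth

seg = requests.get(
    "https://www.strava.com/api/v3/segments/229781",  # example segment id
    headers={"Authorization": f"Bearer {ACCESS_TOKEN}"},
).json()

# The fields used for the physical data set:
physical = {
    "distance_m": seg["distance"],
    "elevation_gain_m": seg["total_elevation_gain"],
    "avg_gradient_pct": seg["average_grade"],
}
```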

So the next question was which segments I would be looking at and how I would collect their specific data. First I started to think about how to collect the physical data sets. When it came to using Strava's API, I had a few options:

  1. I could collect the data from a segment as long as I had its segment id. 
    • The only problem with this was that I needed to already know all the segment ids for the roads I wanted to look at across the Bay Area.
  2. I could collect the data from Strava's segment explorer.
    • However this search engine will only give the 10 most popular segments for a given latitude/longitude coordinate window, and would only provide more within the same area if you zoom in. Furthermore, this would not let me choose specific segments; I would just get any segment that happened to be in that window.
  3. I could collect the data from segments I had personally starred as segments I would like to follow.
    • The difficulty with this is the work I would have to do ahead of time picking the segments to star.

After trying out the three different options I decided to go with method number three. I scrapped the first method because I didn't have a way of getting all the ids of segments in the Bay Area. Then I scrapped the second method because implementing code to handle the segment explorer was more work than it was worth, and in the end I would not have any control over which segments I was collecting data on. So even though the third method meant I would have to personally go through and pick segments, it gave me more control in terms of collecting data on segments that are actually useful for training. By useful I mean, for example, not a segment less than 500 meters long running straight through an intersection on downtown San Francisco's Market Street. Surprisingly, there are a lot of segments on Strava that are too short or too dangerous for actual training efforts, and I wanted to exclude those from my clustering model.

So after taking some time to collect a solid representation of useful road segments for training purposes around the Bay Area, I was able to get my physical data set of the total distance, total elevation gain, and average gradient for these select Strava road segments. Not only that, I also had all the segment ids, which would help me collect and format the rider data set, and I had the encoded polylines for all the segments, which would allow me to plot these segments on a map.

The only thing left to do was collect the rider data set, but with the segment ids this was no problem at all. Using Strava's API and the list of all the segment ids from the physical data set, I was able to collect any data I needed from each segment's leaderboard. Furthermore, to ensure that I was using data that represented competitive athletes' efforts over these segments, I specifically looked at the top 20% of rides on the leaderboard to compute my averages of how competitive athletes are riding these Strava segments. The result was a rider data set composed of the average speed, average time, average heart rate, and average power for each segment I collected earlier in the physical data set. Now it's time for some modeling.
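A sketch of that top-20% computation. Note that Strava has since retired the leaderboard endpoint, and the entry field names here are taken from its old v3 documentation, so treat both as assumptions:

```python
import requests

def leaderboard_averages(segment_id, token):
    """Averages over the top 20% of a segment leaderboard (field names
    follow Strava's old v3 docs; the endpoint has since been retired)."""
    resp = requests.get(
        f"https://www.strava.com/api/v3/segments/{segment_id}/leaderboard",
        headers={"Authorization": f"Bearer {token}"},
        params={"per_page": 200},
    )
    entries = resp.json()["entries"]            # already rank-ordered
    top = entries[: max(1, len(entries) // 5)]  # top 20% of rides
    n = len(top)
    return {
        "avg_time_s": sum(e["elapsed_time"] for e in top) / n,
        "avg_hr": sum(e.get("average_hr") or 0 for e in top) / n,
        "avg_watts": sum(e.get("average_watts") or 0 for e in top) / n,
        # average speed then follows as segment distance / avg_time_s
    }
```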

Modeling the physical data set

After cleaning up the data set so I could work with just the numerical physical data, I decided to run an unsupervised machine learning algorithm to see what results I would get. Specifically, I implemented k-means clustering with k-means++ initialization to make sure that the initial values for the algorithm were picked in a rational way, ensuring a more meaningful clustering of the segments. A model was also constructed for various numbers of clusters to analyze the error and accuracy, in order to find the optimal number of clusters.

As a measure of error (the total distance between all data points and the center of their assigned cluster) I used the model's inertia, while as a measure of accuracy (how well defined the clusters are) I considered the model's silhouette coefficient, a value between -1 and 1. When trying to minimize this measure of error and maximize this measure of accuracy, it was obvious that the error will always decrease as the number of clusters increases, which simply amounts to overfitting the data. However, the accuracy would not continue to increase with more clusters. Therefore I was able to determine the optimal number of clusters for this data set to be 4.
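In scikit-learn terms the sweep over cluster counts looks like the following sketch. The demo data here is random stand-in values, not my actual segment data:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Random stand-in rows of [distance_m, elevation_gain_m, avg_gradient_pct];
# in the real pipeline these come from the collected physical data set.
rng = np.random.default_rng(0)
physical_features = rng.random((200, 3)) * [10000, 500, 12]

X = StandardScaler().fit_transform(physical_features)
for k in range(2, 9):
    model = KMeans(n_clusters=k, init="k-means++", n_init=10, random_state=0)
    labels = model.fit_predict(X)
    # Inertia falls monotonically with k; the silhouette score instead
    # peaks at a well-separated clustering, which is how k = 4 was chosen.
    print(f"k={k}  inertia={model.inertia_:,.1f}  "
          f"silhouette={silhouette_score(X, labels):.3f}")
```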

So with this number of clusters I implemented the clustering model, analyzed the resulting cluster centers, and looked at the respective segments assigned to each cluster. To my delight the clusters were very accurate and could be clearly defined by their physical descriptions: moderately flat terrain, steeper climbs, rolling hills, and longer steady climbs.

An online tool

Now that I had a working model and defined labels for useful Strava segments according to their physical data, it was time to visualize the information. I used the Google Maps API to access a map of the Bay Area, and with Google's polyline decoding method I was able to plot all the specific Strava segments, color-coded according to the model's labeling. Furthermore, a user can select any of the four labels and individually click on any segment to obtain an info window containing the exact physical data used in the model and a link to the specific Strava segment page. Check here to use the online physical data app.
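The map itself runs on the Google Maps JavaScript API, but for anyone reproducing the plotting step in Python, the third-party polyline package decodes the same Google-encoded strings; the example string below is the one from Google's own encoding documentation:

```python
import polyline  # third-party package: pip install polyline

# Strava returns segment geometry as a Google-encoded polyline string
# (the seg["map"]["polyline"] field from the segment endpoint above).
coords = polyline.decode("_p~iF~ps|U_ulLnnqC_mqNvxq`@")
print(coords)  # [(38.5, -120.2), (40.7, -120.95), (43.252, -126.453)]
```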

With all the work I did for the physical data set, the pipeline was complete. I now had a way to take a data set, run the k-means++ clustering model, analyze its metrics to find the optimal number of clusters, and visualize the data with the model's labeling. Thus I had closed the loop of making a useful tool for classifying useful Bay Area Strava road segments according to their physical information. Now it was time to try some more data and see some different classifications.

The rider data set

With the infrastructure in place and the data sets already collected and cleaned up, all I had to do was run through the pipeline with the rider data set. The result was another clustering model optimized with 4 clusters and an online tool for classifying and visualizing useful Bay Area Strava road segments according to how competitive cyclists are riding them. Once again the clusters were accurate and could be clearly defined by the type of effort (in relation to the average time, average speed, average heart rate, and average power): long steady efforts, long and short high-speed efforts, steady/medium efforts, and medium/high efforts. Check here to use the online rider data app.

The combined data set

Now it's time for the last step: combining both the physical data set and the rider data set to make the most effective clustering of our Strava segments according to specific workout types. Again, with the completed pipeline, all I had to do was combine the data sets, run the model for the optimal number of clusters, and map the results. Interestingly, the optimal number of clusters was again 4, and after analyzing the respective cluster centers I was able to clearly define these clusters according to the following workout types (with their respective generalized terrain descriptions):

  • Endurance, cadence, and leg speed intervals - flat and rolling
  • Climbing and resistance intervals - steep
  • Tempo and climbing speed intervals - rolling and climbs
  • Sprints and tempo intervals - flat and rolling

Thus the end result is a tool for classifying and visualizing useful Bay Area Strava road segments according to specific workout types. Check here to use the online training roads classification app.

Conclusions

Overall I was fairly pleased with the results and the end product, but there are a couple things I would like to add in the future.

  1. I would like to use more data - this would be a much more powerful tool if I could look at all the segments in the Strava database and select specific roads that are useful for training by some predefined constraints.
  2. I would also like to provide a search engine with my combined data set - users could input a segment's name or id and receive the segment's visualization and its associated label.
  3. Construct a routing system - allow users to select specific segments of any workout type and then provide suggested routes for riders according to these segments.

I hope people find this to be a useful tool. I believe everyone can get some use out of this app - from beginner racers who are just starting to train more and are looking for the best roads, to veterans who have been riding these roads for years and are able to either reaffirm their understanding of the terrain or see familiar roads from a new perspective.

For my code and data sets see the GitHub repo here.