Here at Triphappy, we love two things: cool travel analytics and Reddit. Between our 3 founders, we’ve been having meaningless arguments with complete strangers for over 19 years on various subreddits. Our startup strives to make travel better by understanding exactly how people travel around the world. So, we decided it would be a fun exercise to try to figure out where the Reddit community is traveling, and whether it differs from the typical traveler.
Step 1: Gathering the Data
It’s fairly common for users to ask for advice on their itinerary, and we decided these posts were the most straightforward way to figure out where everyone was going. After a bit of research, we discovered Reddit releases all thread and comment data for free on Google’s BigQuery. Using this, we were able to isolate any thread created after December 2015 with the word ‘itinerary’ in the title.
Using this method, we discovered 6,478 threads across 714 different subreddits with the word itinerary in the title.
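For readers who want to try something similar on their own export of Reddit threads, here is a minimal Python sketch of the same kind of filter. It is not the BigQuery query behind the numbers above; the field names (`title`, `subreddit`, `created_utc`) and the reading of "after December 2015" as "on or after January 1, 2016 UTC" are assumptions made for illustration.

```python
from datetime import datetime, timezone

# Assumption: "after December 2015" is read as on or after 2016-01-01 UTC.
CUTOFF = datetime(2016, 1, 1, tzinfo=timezone.utc)

def is_itinerary_thread(thread, keyword="itinerary", cutoff=CUTOFF):
    """True if the title contains the keyword and the thread is recent enough."""
    created = datetime.fromtimestamp(thread["created_utc"], tz=timezone.utc)
    return keyword in thread["title"].lower() and created >= cutoff

# Made-up sample threads, only to show the shape of the data.
threads = [
    {"title": "Two-week Japan itinerary, thoughts?", "subreddit": "JapanTravel", "created_utc": 1465000000},
    {"title": "Best ramen in Tokyo", "subreddit": "JapanTravel", "created_utc": 1465000000},
    {"title": "Itinerary check: Peru", "subreddit": "travel", "created_utc": 1420070400},
]
matches = [t for t in threads if is_itinerary_thread(t)]
```

Only the first sample thread survives: the second has no keyword in its title, and the third was created in January 2015, before the cutoff.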
Step 2: Initial Cleanup
Clearly, not all the threads we pulled will be relevant to our analysis. Specifically, we are interested in inter-city travel itineraries. As a starting point, we eliminated threads from certain subreddits. For example, we removed any threads from /r/AskNYC or /r/AskSF, because these threads are likely asking questions about city-specific itineraries. We removed any threads from /r/CampingandHiking because they are unlikely to be talking about travel between cities. We removed threads from subreddits like /r/news and /r/AutoNewspaper because these aren’t likely to be user itineraries. /r/the_donald had a surprising number of itinerary threads, which we removed because…well…whatever they are is probably not relevant. Finally, we removed any subreddits that had fewer than 10 itinerary threads in order to focus further on travel-specific subreddits.
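The two cleanup rules described above (drop a blocklist of subreddits, then drop any subreddit with fewer than 10 itinerary threads) can be sketched in a few lines of Python. The blocklist here contains only the subreddits named in this post, and the `subreddit` field name is an assumption; the real exclusion list was likely longer.

```python
from collections import Counter

# Hypothetical blocklist: just the subreddits mentioned in the text.
EXCLUDED = {"asknyc", "asksf", "campingandhiking", "news", "autonewspaper", "the_donald"}
MIN_THREADS = 10  # subreddits with fewer itinerary threads are dropped

def clean_threads(threads, excluded=EXCLUDED, min_threads=MIN_THREADS):
    """Apply the blocklist first, then the minimum-thread-count rule."""
    kept = [t for t in threads if t["subreddit"].lower() not in excluded]
    counts = Counter(t["subreddit"].lower() for t in kept)
    return [t for t in kept if counts[t["subreddit"].lower()] >= min_threads]

# Made-up sample: 12 threads in /r/travel, 15 in /r/AskNYC, 2 in /r/solotravel.
sample = (
    [{"subreddit": "travel"}] * 12
    + [{"subreddit": "AskNYC"}] * 15
    + [{"subreddit": "solotravel"}] * 2
)
cleaned = clean_threads(sample)
```

In this sample only the 12 /r/travel threads remain: /r/AskNYC is on the blocklist and /r/solotravel falls under the 10-thread minimum.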
After this cleanup, we ended up with 4,900 threads across 37 different subreddits. Looking through the list, the results look pretty good. Every thread seemed to be a user asking for advice on their itinerary. Below is a breakdown of the frequency of itinerary threads by subreddit.
Step 3: Parsing Itineraries
Now for the fun part - how can we systematically take a user’s post, analyze it, and isolate the cities of their trip? Ideally, we’d take a thread and identify the unique cities like this:
We realized the only way to do this was to make the assumption that cities mentioned in a specific order corresponded to the stops in a user’s itinerary. Even though this is not always the case, we believed at the macro-level we would see good results.
We knew we would have to use some sort of string matching algorithm, but that meant we first needed a giant list of the world’s cities to compare against. Luckily, the friendly folks at Geonames have a free, open-source list of 1.5 million city names, organized by country.
Next, in order to actually detect the cities, we had to decide which string matching algorithm to use. We knew cities were likely to have a variety of forms (e.g. Chiang Mai vs. chiang mai vs. Chang mai), so we initially tried to use a fuzzy-matching algorithm. We soon realized fuzzy matching presented a whole host of issues, so we switched to the Aho-Corasick exact-match algorithm. Since it was exact match, we had to modify our database of places to add common variations to city names. For example, a city like San José had the following variations:
- San José (original)
- San josé (2nd word lowercase)
- san josé (all lower-case)
- San Jose (without accent)
- San jose (2nd word lowercase without accent)
- san jose (all lowercase, without accent)
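As a rough illustration, the six variations listed above can be generated programmatically. This Python sketch is an assumption about how such an expansion could work, not the code used for the post; in particular, it generalizes "2nd word lowercase" to "every word after the first lowercase", and `name_variants` and `strip_accents` are hypothetical helper names.

```python
import unicodedata

def strip_accents(text):
    """Remove combining marks, so that 'José' becomes 'Jose'."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def name_variants(name):
    """Return casing and accent variations of a place name, without duplicates."""
    words = name.split()
    if not words:
        return []
    # Assumption: keep the first word as-is and lowercase all later words.
    rest_lower = " ".join([words[0]] + [w.lower() for w in words[1:]])
    cased = [name, rest_lower, name.lower()]
    variants = []
    for candidate in cased + [strip_accents(c) for c in cased]:
        if candidate not in variants:
            variants.append(candidate)
    return variants

print(name_variants("San José"))
# ['San José', 'San josé', 'san josé', 'San Jose', 'San jose', 'san jose']
```

Single-word names without accents produce fewer variations (for example, "Hanoi" yields only "Hanoi" and "hanoi"), which fits with the database roughly doubling rather than growing sixfold.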
At the end of this exercise, our database doubled in size to about 3 million place names, but we were finally picking up most results in our suite of test itineraries.
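For anyone curious what the exact-match step looks like, below is a compact, pure-Python sketch of the Aho-Corasick algorithm. It is a teaching version written for this explanation, not the implementation used in the analysis (a production setup would more likely use an existing library). The function names `build_automaton` and `find_all` are made up, and the sketch does not check word boundaries, so a short name can match inside a longer word.

```python
from collections import deque

def build_automaton(patterns):
    """Build the trie (goto), failure links and output lists."""
    goto = [{}]     # goto[state][char] -> next state
    fail = [0]      # fail[state] -> state to fall back to on a mismatch
    output = [[]]   # output[state] -> patterns that end at this state
    for pattern in patterns:
        state = 0
        for ch in pattern:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                output.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        output[state].append(pattern)
    # Breadth-first pass: depth-1 states keep fail = 0, deeper states inherit.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            output[nxt] = output[nxt] + output[fail[nxt]]
    return goto, fail, output

def find_all(text, automaton):
    """Return (start_index, pattern) for every match, in one pass over text."""
    goto, fail, output = automaton
    state = 0
    hits = []
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for pattern in output[state]:
            hits.append((i - len(pattern) + 1, pattern))
    return hits

automaton = build_automaton(["Bangkok", "Chiang Mai", "chiang mai", "Hanoi"])
hits = find_all("Flying into Bangkok, then chiang mai and Hanoi.", automaton)
print([name for _, name in hits])  # ['Bangkok', 'chiang mai', 'Hanoi']
```

The appeal of this approach for a multi-million-entry name list is that the text is scanned once regardless of how many names are loaded, and the matches come back in the order they appear in the post.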
For one final improvement, we found that if we’re able to identify the countries the itinerary was talking about, the results improved significantly. For example, when looking at an itinerary from /r/Vietnam, we only matched against city names in Vietnam. For a general travel subreddit (e.g. /r/travel), we added an additional first step to detect the country the user was talking about, then match against cities in those countries. If we weren’t able to determine the country, we ignored the thread because matching against the whole world resulted in too many false-positives.
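The country-scoping logic can be sketched as a small decision function. Everything in this Python example is hypothetical: the lookup tables are tiny stand-ins for the Geonames data, the `title` and `body` field names are assumed, and country detection is reduced to a plain substring check.

```python
# Hypothetical lookup tables; real ones would be built from the Geonames data.
CITIES_BY_COUNTRY = {
    "Vietnam": ["Hanoi", "Hue", "Hoi An"],
    "Thailand": ["Bangkok", "Chiang Mai"],
}
SUBREDDIT_COUNTRY = {"vietnam": "Vietnam", "thailand": "Thailand"}

def countries_for(thread):
    """Use the subreddit's country if it has one, else look for country names."""
    country = SUBREDDIT_COUNTRY.get(thread["subreddit"].lower())
    if country:
        return [country]
    text = thread["title"] + " " + thread.get("body", "")
    return [c for c in CITIES_BY_COUNTRY if c in text]

def candidate_cities(thread):
    """Return the city names to match against, or None to skip the thread."""
    countries = countries_for(thread)
    if not countries:
        return None  # no country detected: matching the whole world is too noisy
    return [city for c in countries for city in CITIES_BY_COUNTRY[c]]

in_country_sub = {"subreddit": "Vietnam", "title": "2 week itinerary"}
general_sub = {"subreddit": "travel", "title": "Thailand itinerary", "body": "First time in Asia."}
no_country = {"subreddit": "travel", "title": "Itinerary help", "body": "Where should I go?"}
```

Here `candidate_cities(in_country_sub)` returns only the Vietnamese cities, `candidate_cities(general_sub)` returns only the Thai ones, and `candidate_cities(no_country)` returns `None`, which corresponds to ignoring the thread.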
At the end of this step, we were able to distill this down to 2,582 threads that correspond to 7,725 stops on users' itineraries. Here are the 10 most popular cities, according to our analysis:
Interestingly, if we compare this to the list of the 10 most popular destinations by overnight visitors in 2016, we only see two places in common:
- New York
- Kuala Lumpur
Step 4: Visualization
We decided to use a free visualization tool called Gephi after reading /u/flashman’s post on NSFW subreddits. Among other things, Gephi is a great program for visualizing directed graphs - where there are one-way connections between two places, similar to a travel itinerary where someone moves from city A to city B.
This graph represents the connections between the cities in Reddit travel itineraries. Each circle represents a unique city, with the size proportional to the number of mentions. There are also lines connecting each city to any other city that came after it in someone’s itinerary. For example, there’s a connection between Bangkok and Chiang Mai because there was an itinerary going from Bangkok to Chiang Mai.
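To make the graph structure concrete, here is a small Python sketch that turns a list of itineraries into node sizes and weighted directed edges. It assumes an edge is drawn only between consecutive stops (Bangkok -> Chiang Mai), which is one reading of the description above; `build_graph` is a hypothetical helper, not part of Gephi.

```python
from collections import Counter

def build_graph(itineraries):
    """Count city mentions (node size) and consecutive city pairs (edge weight)."""
    node_weights = Counter()
    edge_weights = Counter()
    for stops in itineraries:
        node_weights.update(stops)
        # Assumption: connect each stop only to the stop right after it.
        for origin, destination in zip(stops, stops[1:]):
            if origin != destination:
                edge_weights[(origin, destination)] += 1
    return node_weights, edge_weights

# Made-up itineraries for illustration.
itineraries = [
    ["Bangkok", "Chiang Mai", "Hanoi"],
    ["Bangkok", "Chiang Mai"],
]
nodes, edges = build_graph(itineraries)
print(edges[("Bangkok", "Chiang Mai")])  # 2
```

Because the edge keys are ordered pairs, Bangkok -> Chiang Mai and Chiang Mai -> Bangkok are counted separately, which is what makes the graph directed.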
All of these connections were then analyzed with Louvain Modularity, and assigned a color based on their similarity. With this analysis, cities, and the connections between them, are compared to the connections between all cities, to find which connections are more similar than others. Luckily, this is a built-in feature of Gephi so nothing new had to be coded.
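For intuition about what is being optimized, the Louvain method searches for a grouping of nodes that maximizes a score called modularity: the share of edge weight that falls inside groups, minus the share expected if edges were placed at random. The Python sketch below only computes that score for a grouping you hand it. It treats edges as undirected for simplicity and does not perform the Louvain search itself, and it is not a description of Gephi's internals.

```python
def modularity(edges, community_of):
    """Modularity of a partition.

    edges: {(u, v): weight}, treated as undirected.
    community_of: {node: community label}.
    """
    total = sum(edges.values())  # total edge weight m
    internal = {}  # weight of edges with both ends inside a community
    degree = {}    # summed degree of each community's nodes
    for (u, v), weight in edges.items():
        cu, cv = community_of[u], community_of[v]
        degree[cu] = degree.get(cu, 0) + weight
        degree[cv] = degree.get(cv, 0) + weight
        if cu == cv:
            internal[cu] = internal.get(cu, 0) + weight
    return sum(
        internal.get(c, 0) / total - (degree[c] / (2 * total)) ** 2
        for c in degree
    )

# Two triangles joined by a single edge (made-up toy graph).
toy_edges = {
    ("A", "B"): 1, ("B", "C"): 1, ("A", "C"): 1,
    ("D", "E"): 1, ("E", "F"): 1, ("D", "F"): 1,
    ("C", "D"): 1,
}
split = {"A": 0, "B": 0, "C": 0, "D": 1, "E": 1, "F": 1}
lumped = {node: 0 for node in split}
q_split = modularity(toy_edges, split)    # 5/14, about 0.357
q_lumped = modularity(toy_edges, lumped)  # 0.0
```

Splitting the toy graph into its two triangles scores higher than lumping everything together, which is the same reason cities within one country or region tend to end up with the same color.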
The results are interesting, as the different colored groups represent different parts of the world. This makes sense since travelers are more likely to move between cities in a particular part of the world versus traveling thousands of miles away to a different region. For example, the lavender section in the top left represents Thailand while the orange section next to it is Vietnam. From this, we can see how the dark blue circles on the left represent India, yet Hong Kong is also included. This implies that there are a lot of travelers stopping over at Hong Kong before traveling into India, possibly because Hong Kong has a major international airport.
The only caveat to this clustering is that there are way more clustered groups than there are discernible colors! So the peach circles in the middle are actually twenty individual groups of 3-4 cities.
There’s also a clear bias towards locations in Western Europe and Southeast Asia, two of the most popular travel destinations in the world, and especially among Redditors. There are virtually no itineraries through China, Australia, and Africa. Morocco and South Africa are the only African countries present.
We also tried some different layouts to see if we can find any other insights.
This graph uses the same data, but emphasizes the connections between the cities. We can now see how Barcelona (near 7 o’clock) is a major hub connecting to other European cities - showing how lots of Europeans travel to Barcelona for vacation. You can also see the intra-region connections, i.e. how each colored part of the circle has multiple connections within itself.
It’s also easier to see the smaller clustered regions in the peach section - for example Indonesia is now visible (near 10:30) with Jakarta and Bali being the two biggest travel destinations there. It’s possible to discern Australia and Brazil this way as well.
Below, we’ve included the .gephi project files with the data, so feel free to play around with it too!
Step 5: Future Improvements
We were very pleased with the way the graphic turned out. It looks super cool and manages to show the most popular destinations. That being said, there are some obvious limitations with our method:
Misspellings aren’t being picked up
Since we ended up using exact match, any misspellings were simply ignored. This could be addressed by implementing fuzzy matching, but that introduced a huge problem when we tried it: cities with similar names were being picked up as false matches.
Not every thread is a straight list of destinations
We made a big assumption that all cities detected in the post are the stops on a user’s itinerary. However, this creates errors when a user says something like, “should I go to Madrid or Seville after Barcelona?” Our program incorrectly detects the itinerary as Madrid -> Seville -> Barcelona.
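The failure is easy to reproduce with a toy version of the "order of mention equals order of travel" assumption. This Python sketch uses a plain substring search as a stand-in for the real matcher, and `stops_in_mention_order` is a hypothetical helper.

```python
def stops_in_mention_order(text, cities):
    """Order detected cities by where they first appear in the text."""
    positions = {city: text.find(city) for city in cities if city in text}
    return sorted(positions, key=positions.get)

question = "should I go to Madrid or Seville after Barcelona?"
stops = stops_in_mention_order(question, ["Barcelona", "Madrid", "Seville"])
print(" -> ".join(stops))  # Madrid -> Seville -> Barcelona
```

Getting this right would require understanding words like "after" and "or", which goes beyond string matching.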
Our cities database lacks colloquial names
Many cities have a formal name and a more common name. For example, many people just call Rio de Janeiro, “Rio”, or Santiago de Queretaro, “Queretaro”. Our place database only has the more formal version of the name. Furthermore, if we tried to match something like “Rio”, we would incorrectly tag any city with the word Rio in the name to Rio de Janeiro.
Cities with names that are other common words
Cities like Nice, France had to be removed from the analysis because of all the false-positives. There are a surprising number of cities with common names like that, and unfortunately, all had to be excluded.