Create a clustering model for segmenting Toronto neighborhoods to find the best locations for starting a coffee shop business
*Note: The project is documented on GitHub* *This article has also been posted on Towards Data Science*
Built a k-means clustering model in Python with scikit-learn to find the best location for starting a coffee shop business in Toronto.
Scraped over 100 Toronto postal codes and 10 boroughs from Wikipedia using Python and Beautiful Soup; transformed them into a pandas DataFrame.
Defined neighborhoods based on their postal codes and retrieved the coordinates (latitude and longitude) for each using the Geocoder module.
Analyzed pedestrian and vehicle volumes from Toronto Open Data and visualized them using Folium, identifying the 139 busiest roads.
Visualized the crime statistics over the last 5 years (2014-2019) using Seaborn, narrowing the candidates to 3 boroughs and 19 neighborhoods suitable for the business.
Obtained over 900 venues in 172 unique categories using the Foursquare API (Location Data).
Engineered features for the k-means clustering model from the mean frequency of each venue category’s occurrences.
Used a silhouette score elbow plot to tune the model; recommended the 3 best neighborhoods to stakeholders for starting the coffee shop business.
A.1. Background and Business Problem
Toronto is Canada’s largest city, with a population of more than 2.7 million and a density of 4,334.4 people per square kilometer. The city is renowned as one of the most multicultural cities in the world thanks to its large population of immigrants from all over the globe. This has made the city a leader among the world's metropolitan and cosmopolitan cities in many sectors, including business.
Now, imagine that you own a coffee shop called Kopiasli (fictitious) that has been doing business successfully in New York. This year, your team plans to expand the business and decides to look for a city that shares New York's traits, one of which is Toronto.
To ensure this project’s success, the team requires insights into the demographics, neighboring businesses, and crime rates. For each neighborhood, we can ask:
How many cafes exist?
What are the most popular venues?
Can we get information about the vehicle and foot traffic?
What is the neighborhood's crime rate? And so on.
Thus, the goal of the project is to identify the best locations for opening a new coffee shop in Toronto.
Entrepreneurs who are passionate about opening a coffee shop in a metropolitan city would be very interested in this project. The project is also for business owners and stakeholders who want to expand their businesses and wonder how data science could be applied to the questions at hand.
A.2. Data Description
The following data sources are used in this project:
1st Data: The most recent record of vehicle and pedestrian volumes at traffic signals in Toronto. The data is typically collected between 7:30 a.m. and 6:00 p.m. at intersections that have traffic signals.
2nd Data: The most recent record of crime incidents reported in Toronto, provided by the Toronto Police Service.
3rd Data: The list of Toronto neighborhoods represented by postal codes and their boroughs. We will use the Geocoder Python package to retrieve each postal code’s coordinates.
4th Data: The popular or most common venues of a given neighborhood in Toronto. This information is stored in Foursquare Location Data, and we will use the Foursquare API to access it.
To sum up, we will use the 1st and 2nd data to analyze the pedestrian/vehicle volumes and crime rates. Then, we load the 3rd data to obtain the coordinates of each neighborhood based on its postal code, allowing us to explore and map the city. Finally, we will use the coordinates and Foursquare credentials to access the 4th data source through its API and retrieve the popular venues along with their details, especially coffee shops. The frequency of each venue category in each neighborhood will serve as the features of the clustering model.
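As a rough illustration of how venue frequencies can become model features, the sketch below one-hot encodes venue categories and averages them per neighborhood. The neighborhood names, the column names, and the venue rows are made up for this example and are not taken from the project's Foursquare results.

```python
import pandas as pd

# Hypothetical venue list: one row per venue returned for a neighborhood.
venues = pd.DataFrame(
    {
        "Neighborhood": ["Alpha", "Alpha", "Alpha", "Alpha", "Beta", "Beta"],
        "Venue Category": [
            "Coffee Shop", "Coffee Shop", "Park", "Bakery", "Park", "Gym",
        ],
    }
)

# One-hot encode the category of every venue (one column per category).
onehot = pd.get_dummies(venues["Venue Category"], dtype=float)
onehot.insert(0, "Neighborhood", venues["Neighborhood"])

# Averaging the indicator columns per neighborhood gives the share of
# venues in each category, i.e. the mean frequency of occurrence.
features = onehot.groupby("Neighborhood").mean()
print(features)
```

Each row of `features` sums to 1, so neighborhoods with very different venue counts remain comparable.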
B.1. Analytic Approach
We approach the problem using a clustering technique, namely k-means. This approach lets the audience see how similar neighborhoods are with respect to their demographics. We can then examine each cluster and determine the venue categories that distinguish it from the others. We will also display any statistics needed to answer questions concerning crime incidents and vehicle and foot traffic records.
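The following is a minimal sketch of this approach with scikit-learn, assuming a feature matrix of venue-category frequencies per neighborhood. The data here is synthetic (three artificial groups of rows), and comparing silhouette scores over a range of k is one common way to pick the number of clusters, not necessarily the exact procedure used in the project.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the neighborhood feature matrix: 30 rows drawn
# around three made-up "venue mix" profiles (values are illustrative).
rng = np.random.default_rng(0)
profiles = np.array([[0.8, 0.1, 0.1], [0.1, 0.8, 0.1], [0.1, 0.1, 0.8]])
X = np.vstack([p + rng.normal(scale=0.03, size=(10, 3)) for p in profiles])

# Fit k-means for several values of k and record the silhouette score.
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

# Keep the k with the highest silhouette score and refit the final model.
best_k = max(scores, key=scores.get)
model = KMeans(n_clusters=best_k, n_init=10, random_state=0).fit(X)

# The cluster centers show which features are high in each cluster.
print(best_k)
print(model.cluster_centers_.round(2))
```

Reading the cluster centers column by column is one simple way to see which venue categories separate one cluster from another.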
B.2. Exploratory Data Analysis
B.2.1. Vehicle and Foot Traffic
We begin by analyzing the pedestrian and vehicle volume data. The column Main contains the main street name, which appears several times because each street has several signalized intersections. We can group by the street name and aggregate the volumes either by summing the values or by averaging them. We choose to average them for simplicity. This returns 248 main roads.
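The grouping step can be sketched with pandas as follows. Only the column name Main comes from the dataset described above; the volume column names and all values are placeholders chosen for this example.

```python
import pandas as pd

# Toy stand-in for the traffic-signal volume data: one row per intersection.
traffic = pd.DataFrame(
    {
        "Main": ["YONGE ST", "YONGE ST", "BLOOR ST W", "BLOOR ST W", "KING ST W"],
        "vehicle_volume": [30000, 26000, 18000, 22000, 15000],
        "pedestrian_volume": [12000, 16000, 9000, 7000, 11000],
    }
)

# Collapse to one row per main street by averaging over its intersections,
# then rank the streets by average pedestrian volume.
per_street = (
    traffic.groupby("Main", as_index=False)[["vehicle_volume", "pedestrian_volume"]]
    .mean()
    .sort_values("pedestrian_volume", ascending=False)
    .reset_index(drop=True)
)
print(per_street)
```

Averaging keeps a street with many counted intersections from dominating only because it has more rows, whereas summing would favor long streets.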