As mentioned in this blogpost, I recently decided to spend some time learning the basics of data analysis using the Pandas library for Python. Having done just that, I thought it was time to put that knowledge to good use in a simple project and to publicise it on my blog.
Olympics dataset that I downloaded
Before starting to code, I thought for a short while on the questions that I wanted to see answered from my data analysis and how I could provide responses using data analysis.
- Which country has hosted the Olympics the most often?
- What are the most successful countries (by gender)?
- Who are the most successful athletes from the highest ranking country (by gender)?
- Who is the greatest Olympic athlete of all time?
- What year brought the largest medal haul for the top ten performing countries?
- Has there been an increase in the number of events since inception of the modern Olympics?
Let’s find out what secrets I was able to tease out from the data…
Checking Data Integrity and Data Cleaning
The first step was to import the necessary libraries into the Jupyter notebook – Pandas, Numpy, Matplotlib and Seaborn. Then it was a simple matter of opening the data with Pandas as a dataframe:
With everything in place, it’s time to check the integrity of the data and see whether there’s any missing information and whether rows contain any duplicates:
Given that there was only four rows with missing data and that this represented a small percentage of the whole dataframe (30K+ rows), I elected to drop these four rows. Looking at the duplicates, there was only two and both of these were dropped too:
Next, I thought I would inspect the data to see whether I could improve performance and memory usage with Categorization. Two columns seemed to fit the bill – Gender and Medals, as they only had a couple of values each per column. Both columns were converted from Dtype object to category to and memory usage was reduced from 478.1KB to 304.5KB, as illustrated below:
Looking at the dataframe, the first thing that struck me stylistically was that the names looked awkward and were difficult to read. To improve readability, I decided to change from SURNAME, First name format to First name SURNAME:
To achieve this, I wrote a small list comprehension that split the name at comma, then a second list comprehension that reversed the order of the surname & first name and then assigned the new variable to the ‘Athlete’ column.
Which country has hosted the Olympics the most often?
To answer this question using data analysis, I took the following approach.
Loop through a list of all of the years taken from the ‘summer’ dataset ‘Year’ column, take the name of the city where the event had taken place and the corresponding year add to separate lists. Use the two lists to create a Pandas dataframe showing the Location and Year of each event. There are probably more eloquent ways of doing this, but this way rendered the desired results.
Then it was simply a matter of creating a second dataframe, using the .value_counts() method to show the frequency that each city appeared in the Location/Year dataframe. Using this second dataframe, it was possible to plot the data and visualise the number of times each city has held the Olympics since 1896.
Evidently, London has hosted the Olympics three times, which is the most frequent. To establish exactly when, the following code filters the initial Location/Year dataframe using a Boolean conditional filter and the Pandas .loc method:
What are the most successful countries (by gender)?
The code to answer this question was quite simple and invovled the following steps.
Filter the ‘summer’ dataset to include only the male gender, using Boolean conditional filtering and then use the .value_counts() method to establish how many times each country appeared in the ‘Country’ column, thereby showing how many medals had been won.
The same process was used to establish the most successful ladies team of all time, but with the the Boolean conditional filter set to eliminate all males. Unsurprisingly, the results look vey similar, with the USA coming in tops:
Who are the most successful athletes from the highest ranking country (by gender)?
Given that the USA is significantly more successful than all other countries, I wanted to see who was the “best of the best” and to establish which American mean and women had brought home the most gold medals.
The first step was to use a Boolean conditional filter on the ‘summer’ dataset, so that it included only American men. Then, the same method outlined in Question 2 above was employed to establish the number of times each male athlete appeared in the dataframe.
Quite clearly from the below, Michael Phelps is the greatest Olympian of all time, with double the amount of his fellow male athletes and almost double the tally of the greatest female athletes.
Again, essentially the same code was used to establish the most successful women, the only difference being the conditional filtering set to exclude men from the ‘summer’ dataset:
Who is the greatest Olympic athlete of all time?
As evidenced above, Michael Phelps is the best Olympian and by a significant margin. With this question answered, I thought it would be interesting to breakdown what medals he won, where, when and whether they were Gold, Silver or Bronze.
This was achieved filtering the ‘summer’ dataset to include only Michael Phelps and then by using Pandas’ very handy .groupby() method to show the medal count per medal type at each location and year. The .unstack() method is then chained to make the table more readable and to fill in any NaN values with 0.
The resulting visualisation give a nice demonstration of Michael’s medal haul over the three Olympics where he competed.
What was the most successful year for each of the top 10 countries?
The first step was to use the .groupby() method by Country and Year and to show the medal count for each year per country. NaN values were replaced with a zero value, as detailed below:
The second step was to the create a new column that takes the highest value from each row, thereby showing what the best medal tally per country since 1896.
Finally, a third column is added, to establish the year in which the highest medal tally the year per country was achieved. This was a little fiddly (and, as always, there’s probably a more Pythonic solution…), but was done by
- Converting each row to a Numpy array
- Using the np.argmax() method to identify the index of the largest value per row
- Using that index number to pull the corresponding year from the list of year columns in the ‘medals’ dataset
- Adding the corresponding year to a list, which is then assigned to the new ‘Year’ column
The code for this and the and resultant dataframe showing the top ten countries and their respective largest medal hauls and the year in which said haul was achieved looks like this:
The graph below gives a nice visualisation of who won what and when.
The full Jupyter Notebook with all data analysis code, as well as the .csv file containing the Olympic data can be found at my Github HERE.