Inside #WorldCup2014

I completed this project with my classmates Nikhita Koul and William Lewis for the Storing and Retrieving Data and Data Visualization and Communication classes, both part of the Master of Data Science program at UC Berkeley. The website can be found here. The GitHub repo can be found here.


Social networks can capture the essence of the excitement around the World Cup. In this project we used social networking data to create visualizations that might encourage casual fans to learn more about the World Cup, and to give them tools that teach the basics of the tournament. To do this we aggregated, stored, cleaned, and analyzed data from each of the 30 days of the 2014 FIFA World Cup. The data consisted of over 68 million tweets.

Project Approach

The goal of this project was twofold: a) to get hands-on experience with core technologies used for information retrieval, storage, and data analysis, and b) to create visualizations that might encourage casual fans, or those with only a passing interest in the World Cup, to learn more about it.

Aggregation: Nearly 60 GB of data was processed and 17 GB was collected over a period of 32 days, using Python’s tweepy library and the Twitter Streaming API.

Data Processing: The collection script ran on an AWS EC2 instance for the full 32 days, using Python’s tweepy library to read from the Twitter Streaming API.

Storage: The data was stored as multiple CSV files in AWS S3, Amazon's cloud object storage service.
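To make the storage step concrete, here is a minimal sketch of how streamed tweets could be flattened into CSV text before upload. The column choice and the helper names `tweet_to_row` and `tweets_to_csv` are assumptions made for illustration, not the project's actual schema; the input is assumed to be tweet dictionaries as delivered in JSON by the Twitter Streaming API.

```python
import csv
import io

# Column order for each CSV row. This selection of fields is an assumption.
FIELDS = ["id", "created_at", "screen_name", "lang", "text"]


def tweet_to_row(tweet):
    """Flatten one tweet dictionary into a flat dict keyed by FIELDS."""
    user = tweet.get("user") or {}
    return {
        "id": tweet.get("id_str", ""),
        "created_at": tweet.get("created_at", ""),
        "screen_name": user.get("screen_name", ""),
        "lang": tweet.get("lang", ""),
        # Collapse line breaks and repeated spaces so each tweet stays on one line.
        "text": " ".join((tweet.get("text") or "").split()),
    }


def tweets_to_csv(tweets):
    """Serialize an iterable of tweet dictionaries to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    for tweet in tweets:
        writer.writerow(tweet_to_row(tweet))
    return buffer.getvalue()


if __name__ == "__main__":
    sample = {
        "id_str": "1",
        "created_at": "Thu Jun 12 20:00:00 +0000 2014",
        "lang": "en",
        "text": "Goal!\nWhat a start, #WorldCup2014",
        "user": {"screen_name": "fan"},
    }
    print(tweets_to_csv([sample]))
```

The resulting text could then be written to a file and uploaded to S3 in batches, for example one file per hour or per day of the tournament.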

Information Retrieval: We indexed the data and configured six AWS EC2 instances running Apache Solr. On the client side, HTTP GET requests were sharded across the Solr instances.
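One way to picture the client-side sharding is a fan-out: the same query goes to every Solr instance and the client merges the answers. The sketch below is an assumption about how that could look, not the project's code. The host names, the core name `tweets`, and the helpers `build_select_urls` and `merge_responses` are hypothetical; the `/select` handler and the `response.numFound` and `response.docs` fields are standard in Solr's JSON output.

```python
import json
from urllib.parse import urlencode

# Hypothetical host names and core name; the post does not list the real ones.
SOLR_HOSTS = ["solr{}.example.com:8983".format(i) for i in range(1, 7)]
CORE = "tweets"


def build_select_urls(query, rows=10, hosts=SOLR_HOSTS, core=CORE):
    """Return one Solr /select URL per instance, all carrying the same query."""
    params = urlencode({"q": query, "rows": rows, "wt": "json"})
    return ["http://{}/solr/{}/select?{}".format(host, core, params) for host in hosts]


def merge_responses(bodies, rows=10):
    """Combine the JSON response bodies from each instance into one result.

    Hit counts are summed. Documents are concatenated in instance order and
    truncated to `rows`; they are not re-ranked by score.
    """
    total = 0
    docs = []
    for body in bodies:
        response = json.loads(body)["response"]
        total += response["numFound"]
        docs.extend(response["docs"])
    return {"numFound": total, "docs": docs[:rows]}
```

Each URL could be fetched with `urllib.request.urlopen` and the decoded bodies passed to `merge_responses`. Solr also offers server-side distributed search through its `shards` request parameter, which would move the merge step off the client.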

Analysis and Visualization: The data was visualized using D3, Highcharts, and the plotting libraries in R.

Project Diagram

[Image: project diagram]

Twitter Activity Map

Tweet activity was collected via the Twitter Streaming API and visualized with D3.
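As a rough illustration of how tweet activity could be prepared for a D3 map, the sketch below counts geotagged tweets per grid cell and returns records ready to serialize as JSON. The one-degree grid and the helper names `grid_cell` and `activity_by_cell` are assumptions; the post does not describe how the map data was aggregated. The `coordinates` field follows the tweet JSON format, where a point is stored as longitude then latitude.

```python
import json
import math
from collections import Counter


def grid_cell(lon, lat, size=1.0):
    """Snap a longitude/latitude pair to the south-west corner of its grid cell."""
    return (math.floor(lon / size) * size, math.floor(lat / size) * size)


def activity_by_cell(tweets, size=1.0):
    """Count geotagged tweets per grid cell, sorted by cell coordinates."""
    counts = Counter()
    for tweet in tweets:
        point = tweet.get("coordinates")
        if not point:
            continue  # tweets without an exact location are skipped
        lon, lat = point["coordinates"]
        counts[grid_cell(lon, lat, size)] += 1
    return [
        {"lon": lon, "lat": lat, "count": n}
        for (lon, lat), n in sorted(counts.items())
    ]


if __name__ == "__main__":
    sample = [
        {"coordinates": {"type": "Point", "coordinates": [-43.2, -22.9]}},
        {"coordinates": None},
    ]
    # The JSON text can be loaded by D3 and drawn as circles sized by count.
    print(json.dumps(activity_by_cell(sample)))
```

A coarser or finer grid only requires changing `size`.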

[Images: three screenshots of the Twitter activity map]

Sentiment Analysis

[Image: sentiment analysis screenshot]

Social Graph

[Image: social graph screenshot]

