Baseball Salary Analysis

This analysis takes historical data on baseball players and examines the variation in their salaries. The project demonstrates my experience using SQL (MySQL) with R.

For the code of this project, please refer to Baseball Salary Analysis Code in the Code Samples Section.


For this analysis, I look at a database that contains information on baseball teams, players, and salaries. I examine how long each team name has existed, game attendance (which relates to team income), and player salaries and how they relate to player nicknames. I use MySQL and R to obtain the information needed for the analysis.


The original data come from The Baseball Data Bank, which is run by volunteers who collect baseball data. I extracted the information from the “baseball” database on the statistics department machines. The analysis uses three tables in “baseball”: Teams, Salaries, and Master. From these I used the columns “names” (the name of each team), “teamID” (each team’s ID), “yearID” (the years in which a team played), “attendance” (game attendance for a given year), and “salary” (player salaries). Master provides information about the players, such as first name, last name, and nickname.


Part 1: Teams

Counting the distinct names in the Teams table, I found 137 team names in all. The number of years each team name has existed (computed as “numyear”) shows that the Washington Nationals name lasted the longest, from 1872 to 2007, a total of 136 game years. Sorting attendance in descending order gives the top five teams by attendance in the most recent year (all in 2007):

Team Name                        Attendance
Colorado Rockies                 4483350
Arizona Diamondbacks             3610290
Los Angeles Angels of Anaheim    3404686
Florida Marlins                  3064847
Tampa Bay Devil Rays             2506293
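The three Part 1 queries can be sketched roughly as follows. This is only an illustration, not the project's actual code: it uses Python's built-in sqlite3 module in place of MySQL and R, the rows are made up, and the column is called "name" here as an assumption. It also assumes "numyear" is the inclusive span from a name's first year to its last, which matches 1872 to 2007 giving 136.

```python
import sqlite3

# Toy stand-in for the Teams table; every row below is made up for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Teams (name TEXT, teamID TEXT, yearID INTEGER, attendance INTEGER)"
)
conn.executemany(
    "INSERT INTO Teams VALUES (?, ?, ?, ?)",
    [
        ("Alpha City Aces", "ACA", 2005, 1500000),
        ("Alpha City Aces", "ACA", 2006, 1600000),
        ("Alpha City Aces", "ACA", 2007, 1700000),
        ("Beta Town Bears", "BTB", 2006, 2100000),
        ("Beta Town Bears", "BTB", 2007, 2200000),
        ("Gamma Bay Gulls", "GBG", 2007, 900000),
    ],
)

# 1. Number of distinct team names.
team_count = conn.execute("SELECT COUNT(DISTINCT name) FROM Teams").fetchone()[0]

# 2. Team name with the longest span of years (assumed definition of numyear).
longest = conn.execute(
    """
    SELECT name,
           MIN(yearID) AS first_year,
           MAX(yearID) AS last_year,
           MAX(yearID) - MIN(yearID) + 1 AS numyear
    FROM Teams
    GROUP BY name
    ORDER BY numyear DESC
    LIMIT 1
    """
).fetchone()

# 3. Top five teams by attendance in the most recent year in the table.
top_attendance = conn.execute(
    """
    SELECT name, attendance
    FROM Teams
    WHERE yearID = (SELECT MAX(yearID) FROM Teams)
    ORDER BY attendance DESC
    LIMIT 5
    """
).fetchall()

print(team_count)
print(longest)
print(top_attendance)
```

The same SELECT statements should carry over to MySQL largely unchanged, since they use only standard aggregates, a subquery, and LIMIT.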

Part 2: Player Salary

In part two of the assignment, I look at the salary information of baseball players. I first look at the maximum, minimum, and average salary each year from 1980 to 2007.
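A yearly summary like this comes down to one GROUP BY query. The sketch below is a hedged illustration using Python's sqlite3 module with made-up rows, not the original MySQL and R code. The "playerID" column is an assumed name, and the 2008 row is there only to show the year filter at work.

```python
import sqlite3

# Toy stand-in for the Salaries table; all values are made up for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Salaries (playerID TEXT, yearID INTEGER, salary INTEGER)")
conn.executemany(
    "INSERT INTO Salaries VALUES (?, ?, ?)",
    [
        ("p1", 1980, 50000),
        ("p2", 1980, 150000),
        ("p1", 1981, 0),       # some players can have a recorded salary of zero
        ("p2", 1981, 200000),
        ("p3", 1981, 400000),
        ("p3", 2008, 999999),  # outside the 1980-2007 window, so it is filtered out
    ],
)

# Maximum, minimum, and average salary for each year in the window.
yearly = conn.execute(
    """
    SELECT yearID,
           MAX(salary) AS max_salary,
           MIN(salary) AS min_salary,
           AVG(salary) AS avg_salary
    FROM Salaries
    WHERE yearID BETWEEN 1980 AND 2007
    GROUP BY yearID
    ORDER BY yearID
    """
).fetchall()

for year, max_salary, min_salary, avg_salary in yearly:
    print(year, max_salary, min_salary, avg_salary)
```

Each returned row holds one year's three values, which is the shape needed to draw one line per statistic.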

The plot of these values has three lines, showing the maximum, minimum, and average player salary. The minimum salary fluctuates, since some players can receive zero salary, but its line looks very flat in comparison to the maximum salary line. The maximum salary increases convexly over time.

I then look at the effect of nicknames. I separate players into two categories: players with nicknames and players without. This tests whether having a nickname makes a player more famous, and thus richer, or whether a nickname does not affect salary at all. In the plot below, two boxplots are placed side by side to compare the medians and spreads of the two groups.

The plot shows that players with nicknames don’t have higher salaries. In fact, players without nicknames have much higher salaries, with a median of around 7 million dollars, while players with nicknames have a median below 1 million dollars. Players with nicknames also have a much smaller interquartile range, with an upper quartile lower than the median of players without nicknames. Their upper adjacent value is also lower than the upper quartile of players without nicknames. This suggests that having a nickname does not cause a higher salary.
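The numbers behind such a boxplot comparison (median and quartiles per group) can be computed with a join and a small summary step. The sketch below is illustrative only, using Python's sqlite3 and statistics modules rather than the original MySQL and R code. The column names "playerID", "nameFirst", "nameLast", and "nameNick" are assumptions, the data is made up, and a missing or empty nickname is treated as "no nickname".

```python
import sqlite3
import statistics

# Toy stand-ins for the Master and Salaries tables; all rows are made up.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE Master (playerID TEXT, nameFirst TEXT, nameLast TEXT, nameNick TEXT);
    CREATE TABLE Salaries (playerID TEXT, yearID INTEGER, salary INTEGER);
    """
)
conn.executemany(
    "INSERT INTO Master VALUES (?, ?, ?, ?)",
    [
        ("p1", "Ann", "Example", "Ace"),  # has a nickname
        ("p2", "Bob", "Sample", None),    # nickname missing
        ("p3", "Cy", "Placeholder", ""),  # nickname empty
    ],
)
conn.executemany(
    "INSERT INTO Salaries VALUES (?, ?, ?)",
    [
        ("p1", 2005, 300000), ("p1", 2006, 500000), ("p1", 2007, 700000),
        ("p2", 2006, 1000000), ("p2", 2007, 2000000),
        ("p3", 2006, 3000000), ("p3", 2007, 4000000),
    ],
)

# Label each salary record by whether the player has a nickname.
rows = conn.execute(
    """
    SELECT CASE WHEN m.nameNick IS NULL OR m.nameNick = ''
                THEN 'no nickname' ELSE 'nickname' END AS grp,
           s.salary
    FROM Salaries AS s
    JOIN Master AS m ON m.playerID = s.playerID
    """
).fetchall()

# Collect salaries per group, then compute the quartiles a boxplot would draw.
groups = {}
for grp, salary in rows:
    groups.setdefault(grp, []).append(salary)

summary = {}
for grp, salaries in groups.items():
    q1, median, q3 = statistics.quantiles(salaries, n=4)
    summary[grp] = {"n": len(salaries), "q1": q1, "median": median, "q3": q3}

for grp, stats in sorted(summary.items()):
    print(grp, stats)
```

Quartile conventions differ between tools, so the values from statistics.quantiles may not match R's boxplot hinges exactly on small samples.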
