Baseball offensive hitting statistics were used to determine whether higher paid players are actually more offensively productive. Given that we are in the golden age of statistical analysis in baseball, what team owner wouldn't want to know if they are overpaying their players? This project first examined the distributions and variances of all numerical variables to determine what kinds of statistical analysis could be performed. Parametric tests, non-parametric tests, generalized linear models, and predictive analysis were used. These tools showed that yes, more productive offensive players were paid more in 2016. The old baseball saying, "the money lies in the RBIs," was the most significant conclusion of this project.
In the age of modern sabermetrics, there are ample baseball datasets to choose from. For this analysis I pulled data from the Lahman Baseball GitHub site, maintained by Chris Dalzell. The Lahman repository had the two main R data files needed for this project: the batting data and the salaries data. These two sets were joined so that salary and batting data could be viewed in the same data frame. The batting variables examined in this project are batting average (BA), runs batted in (RBI), and home runs (HR). These three variables are the basis for most offensive evaluation metrics in baseball.
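As a minimal sketch of the pull and join, using the Lahman R package cited in the references (the project itself loads the equivalent .RData files, see appendix section A; object names here are illustrative):

library(dplyr)
library(Lahman) # Sean Lahman baseball database: Batting and Salaries tables
bat16 <- Batting %>% filter(yearID == 2016)  # 2016 batting lines
sal16 <- Salaries %>% filter(yearID == 2016) # 2016 salaries
stats16 <- left_join(bat16, sal16, by = "playerID")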
Every major league team puts a high emphasis on efficiency: owners want to win as many games as possible without overpaying players for their productivity. Given this concern, the basis of this project is to evaluate players' productivity in relation to their earnings. Statistical analysis will be used to investigate the relationship between 2016 MLB players' offensive productivity and their earnings for the same year.
The goals of this project are to discover the individual relationships between earnings and BA, HRs, and RBIs. I also aim to find a model that can predict salary group from the variables BA, HR, and RBI. From this model we can see which variable is the strongest predictor, then return to the original relationship tests to see whether that particular relationship is statistically significant. From these tests a conclusion can be made about whether players' salaries are a good measure of offensive productivity (i.e., whether players earn their salaries).
Hypothesis
H0: There is no relationship between the salary of 2016 MLB players with over 200 at bats and the offensive productivity of 2016 MLB players with over 200 at bats.
Ha: There is a relationship between the salary of 2016 MLB players with over 200 at bats and the offensive productivity of 2016 MLB players with over 200 at bats.
From figure 1 it appears that, for both grouping variables (league and salary), RBI per 100 at bats and batting average are approximately normal. Home runs per 100 at bats does not look normal for the salary grouping and will be explored with further analysis in the next section, statistical methods. All R code for the plots in this section is located in appendix section B.
Manipulation of the data. As shown above in Fig 1, Fig 2, and Fig 3, home runs and RBIs are adjusted per 100 at bats. This adjustment was made because the number of at bats varies from 200 to 672; naturally a player with the same skills would hit more home runs in 672 at bats than in 200. The adjustment does not apply to batting average because it is inherently a rate. Salary was converted into a binary variable to make two groups: high (value = 1) for players paid above 1,000,000 US dollars, and low (value = 0) for those paid less than or equal to 1,000,000 US dollars. All code for the statistical analysis can be found in appendix section C. The three types of tests used in the statistical analysis were the two sample t-test, the Wilcoxon rank sum test, and a generalized linear model.
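A condensed sketch of these adjustments, mirroring the code in appendix sections A and C:

StatsSimple <- Stats16 %>%
  mutate(HRper100 = (HR/AB)*100,   # home runs per 100 at bats
         RBIper100 = (RBI/AB)*100, # RBIs per 100 at bats
         SalaryBin = ifelse(salary > 1000000, 1, 0)) # 1 = above $1,000,000, 0 = at or below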
Two Sample T-test
A two sample t-test will tell us if there is a significant difference between the means of a variable, such as batting average, in the high paid and low paid groups.
Assumptions of the two sample t-test:
Both samples are random samples
Both populations are normally distributed
Both populations have equal variance
Wilcoxon Rank Sum Test
The Wilcoxon rank sum test is a non-parametric test, meaning it is used when the assumptions of a parametric test, like the two sample t-test, cannot be met. This test compares the central tendencies of two groups using ranks.
Assumptions of the Wilcoxon test:
Both samples are random samples
Both samples have the same shape of distribution
Generalized Linear Model (Binomial Family)
A GLM is similar to a linear model: it is an approach to modeling the relationship between a response and explanatory variable(s). A GLM relaxes some of the assumptions a linear model makes; the response does not have to be normally distributed or have equal variances. A minimal fitting sketch follows the assumption list below.
Assumptions
The response variable should follow the distribution of the chosen random component (here, binomial)
For each x value there is a corresponding y
Y values are independent from one another
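As a minimal sketch, a binomial-family GLM of the kind used here is fit in R with glm(); the full set of candidate models is in appendix C part 3:

# logistic regression of salary group on the offensive variables
fit <- glm(SalaryBin ~ BA + RBIper100, family = "binomial", data = StatsSimple)
summary(fit) # coefficient estimates are on the log-odds scale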
Assumption Checks of Batting Average (BA) for two sample t-test
Both samples are random because they were filtered only by more than 200 at bats from the entire population
Both populations are normally distributed: as shown in figure 2 the spread appears normal, and as shown in figure 4 the p-value of the Shapiro test on the residuals is > 0.05, so we cannot reject H0 that the data are normally distributed. The qqPlot of the residuals in appendix section C part 1 shows the points falling within the 95% CI of the normal distribution. From these tests the data can be assumed normally distributed
Both populations have equal variance: the Levene's test p-value is > 0.05, so we fail to reject H0 that the variances are equal, indicating the groups do have equal variance. It can also be seen in the residuals vs fitted plot in appendix section C part 1 that there is no distinct pattern, consistent with equal variance (both checks are sketched in code below)
All assumptions met, a two-sample t-test can be run
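A minimal sketch of these checks, mirroring appendix C part 1 (leveneTest() and qqPlot() come from the car package):

library(car)
BAsalary <- lm(BA ~ SalaryCaliber, data = StatsSimple)
shapiro.test(BAsalary$residuals) # H0: residuals are normally distributed
leveneTest(StatsSimple$BA ~ StatsSimple$SalaryCaliber) # H0: group variances are equal
qqPlot(BAsalary$residuals) # points should fall within the 95% CI band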
full t-test can be seen in appendix C part 2
State hypothesis
\(H_0 : \mu_L = \mu_H\)
\(H_A : \mu_L \neq \mu_H\)
H0: The mean batting average in the low salary group is equal to the mean batting average of the high salary group
Ha: The mean batting average in the low salary group is not equal to the mean batting average of the high salary group
t.test(LowSalary$BA, HighSalary$BA, var.equal = TRUE)
Assumption Checks of RBI per 100 AB for two sample t-test
Both samples are random because they were filtered only by more than 200 at bats from the entire population
Both populations are normally distributed: as shown in figure 2 the spread appears normal, and as shown in figure 4 the p-value of the Shapiro test on the residuals is > 0.05, so we cannot reject H0 that the data are normally distributed. The qqPlot of the residuals in appendix section C part 1 shows the points falling within the 95% CI of the normal distribution. From these tests the data can be assumed normally distributed
Both populations have equal variance: the Levene's test p-value is > 0.05, so we fail to reject H0 that the variances are equal, indicating the groups do have equal variance. It can also be seen in the residuals vs fitted plot in appendix section C part 1 that there is no distinct pattern, consistent with equal variance
All assumptions met, a two-sample t-test can be run
full t-test can be seen in appendix C part 2
State hypothesis
\(H_0 : \mu_L = \mu_H\)
\(H_A : \mu_L \neq \mu_H\)
H0: The mean number of RBI per 100 at bats in the low salary group is equal to the mean number of RBI per 100 at bats in the high salary group
Ha: The mean number of RBI per 100 at bats in the low salary group is not equal to the mean number of RBI per 100 at bats in the high salary group
t.test(LowSalary$RBIper100, HighSalary$RBIper100, var.equal = TRUE)
Assumption Checks of HR per 100 AB for two sample t-test
Both samples are random because they were filtered only by more than 200 at bats from the entire population
As shown in figure 4, the Shapiro test (p-value < 0.05) shows that the residuals are not normally distributed
Not all assumptions met, check assumptions for Wilcox Rank Sum Test
Both samples are random, see above
As shown in figures 2 and 3, the two groups have roughly the same distribution shape
Assumptions met, Wilcox Rank Sum Test can be run
full test can be seen in appendix C part 2
State hypothesis
\(H_0 : \mu_L = \mu_H\)
\(H_A : \mu_L \neq \mu_H\)
H0: The number of HR per 100 at bats is the same in the low salary and high salary groups
Ha: The number of HR per 100 at bats is not the same in the low salary and high salary groups
wilcox.test(LowSalary$HRper100,HighSalary$HRper100)
Assumption checks to run a Generalized Linear Model
The response variable, salary group, is binary and so is distributed like the binomial random component
For each salary point there is a corresponding explanatory variable data entry
Salaries are independent from one another
All assumptions met, a Generalized Linear Model can be run
full model can be seen in appendix C part 3
As shown in figure 1, there is collinearity between HRper100 and RBIper100, so HRper100 is removed; this can be verified directly as sketched below.
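A quick check of the correlation behind this decision (the pairs.panels plot in appendix B reports r of roughly 0.8):

cor(StatsSimple$HRper100, StatsSimple$RBIper100) # ~0.8 per the pairs.panels plot, high enough to drop one predictor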
State hypothesis
\[H_0 : \beta = 0 \\ H_A: \beta \neq 0\]
H0: There is no relationship between the variables batting average and RBIper100 and the salary that a 2016 MLB player gets paid.
Ha: There is a relationship between the variables batting average and RBIper100 and the salary that a 2016 MLB player gets paid.
Five generalized linear models were created and then assessed using the Akaike information criterion (AIC) and the Bayesian information criterion (BIC). Both are tools for judging the goodness of fit of a model; BIC scores more complex models more harshly.
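Concretely, for a model with \(k\) estimated parameters, maximized likelihood \(\hat{L}\), and \(n\) observations, \[AIC = 2k - 2\ln\hat{L} \qquad BIC = k\ln(n) - 2\ln\hat{L}\] With \(n = 294\) players here, \(\ln(294) \approx 5.7 > 2\), so BIC charges more than AIC for each additional parameter.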
Prediction of data points using best model ranked by BIC
full prediction shown in appendix C part 4
Prediction process:
Figure 7, shown in the results, indicates that model 5 was the best fit.
The data were subsetted, splitting off 30 points to be used as test points.
The remaining 264 points were used as training points for the model of best fit.
The model of best fit was then used to predict where those 30 held-out salary points should fall: 1 (high salary) or 0 (low salary). A sketch of this procedure follows.
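A minimal sketch, mirroring appendix C part 4 (the set.seed() call is my addition for reproducibility; the original run did not fix a seed):

set.seed(1) # assumption: added for reproducibility only
splitter <- sample(1:nrow(StatsSimple), 30, replace = FALSE)
Salary_train <- StatsSimple[-splitter, ] # 264 training rows
Salary_test <- StatsSimple[splitter, ]   # 30 held-out rows
fit_train <- glm(SalaryBin ~ BA + RBIper100, family = "binomial", data = Salary_train)
predict(fit_train, newdata = Salary_test, type = "response") # predicted probability of the high salary group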
| | Shapiro P-value | Levene P-value |
| --- | --- | --- |
| BA | 0.7046396 | 0.3814 |
| RBIper100 | 0.1131881 | 0.4193 |
| HRper100 | 0.0055605 | 0.1927 |
Results of Batting Average T-test and RBI per 100 at bats T-test
T-tests were performed on both batting average and RBIper100 because both met all of the required assumptions. As shown in figure 5, the p-value of the BA t-test is < 0.05. This means we can reject our H0 that the means are equal, so from this test we can conclude that the means are not equal. As shown in figure 2, the central tendency of BA is larger in the high salary group; the Tukey-Kramer comparison in appendix C part 1 shows the same.
It can be concluded that high salary players do have a higher batting average than low salary players
The t-test for RBI per 100 at bats also has a p-value < 0.05, as shown in figure 5. From this we can reject H0 and conclude that there is a difference in the mean number of RBIs per 100 at bats. As shown in figure 2, and in the Tukey-Kramer comparison in appendix C part 1, RBIper100 is larger in the high salary group.
It can be concluded that high salary players do have a higher number of RBI per 100 at bats than low salary players
Results of Wilcox Rank sum test
HR per 100 at bats was not normally distributed and did not meet the assumptions for a parametric test, so a non-parametric test was performed. The results in figure 6 show a p-value > 0.05. This means we fail to reject H0, so we cannot conclude that there is any difference in home runs hit per 100 at bats between the high and low salary groups
It can be concluded that there is no difference between the number of home runs hit in 100 at bats between the high and low salary players
| | P-value | 95% Confidence Interval | Alpha Level | Degrees of Freedom | T-value |
| --- | --- | --- | --- | --- | --- |
| Batting Avg | 0.02605 | (-0.016112375, -0.001029543) | 0.05 | 292 | -2.2368 |
| RBI per 100 | 0.01219 | (-2.0043759, -0.2474258) | 0.05 | 292 | -2.5225 |
| | P-value | W-Value |
| --- | --- | --- |
| HR per 100 | 0.06403 | 8464 |
Results of Generalized Linear Model
There were five models made, as shown in figure 7. The model that I deemed the best fit is Salary5. Letting \(p_i\) be the probability that player \(i\) is in the high salary group, this model can be written as \[\log\!\left(\frac{p_i}{1-p_i}\right) = \beta_0 + \beta_1(BA)_i + \beta_2(RBIper100)_i\] All models can be seen in appendix C part 3. The model "Salary5" was chosen because it had the lowest AIC and BIC values of all the models, indicating that it is the best fit.
\(\beta_0 = -2.07589\)
\(\beta_1 = 6.97787\) (BA)
\(\beta_2 = 0.07253\) (RBIper100)
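To interpret these coefficients, consider a hypothetical player (values chosen purely for illustration) with a .270 batting average and 13 RBI per 100 at bats:

# hypothetical player: BA = .270, RBIper100 = 13 (illustrative values only)
logit <- -2.07589 + 6.97787 * 0.270 + 0.07253 * 13
plogis(logit) # inverse logit: about 0.68, i.e. roughly a 68% chance of the high salary group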
The Salary5 model had one significant explanatory variable, RBIper100. As shown in appendix C part 3, the Pr(>Chisq) of RBIper100 is 0.03878. This indicates that the best predictor of salary group is RBI per 100 at bats.
| | Degrees of Freedom | AIC | BIC |
| --- | --- | --- | --- |
| Salary1 | 4 | 375.3464 | 390.0807 |
| Salary2 | 5 | 376.9793 | 395.3972 |
| Salary3 | 8 | 381.4339 | 410.9025 |
| Salary4 | 4 | 376.6974 | 391.4317 |
| Salary5 | 3 | 375.0443 | 386.0950 |
Results of Salary Group Prediction. As shown in figure 9, the salary group prediction did not do a great job: 9 of the 30 points were left in the middle between the two groups, though the model did come almost exactly to 1 or 0 for some predictions. This is likely because predicting salary is very difficult, mainly because a salary is determined by human negotiation; owners do not use this predictive model to determine a player's next contract.
From the analysis it can be concluded that there is a significant difference in 2016 MLB batting average and RBIs per 100 at bats between players paid over versus under 1,000,000 US dollars. No differences were found in any of the variables between leagues; if interested, those results are in appendix D.
The variables with statistically significant tests were then used in finding a GLM, and the model with the lowest BIC score was chosen. This model did not do a great job predicting salary caliber, which was a limitation of the study. There are many more variables that predict a player's salary. For starters, defense is taken into account. Another important one is that salaries are negotiated before the player puts up that season's statistics. This can be somewhat ignored because ideally a player will perform at his average for the year, but aging and injuries all come into play. Another shortcoming is that young players who quickly blossom into stars are often stuck on their low paying rookie contracts, while conversely former stars, who signed long expensive contracts, are aging and no longer producing at the level that earned them those contracts.
These are some of the shortcomings of the study. If I had more time I would certainly overlay many years of data, build it into a multidimensional matrix dataset, and plot the analysis over time.
However, that is not to say this study lacked significant findings. There is an age old saying in baseball, "the money lies in the RBIs." That was undoubtedly shown in this study, as RBI was the most predictive variable.
Wickham et al., (2019). Welcome to the tidyverse. Journal of Open Source Software, 4(43), 1686, https://doi.org/10.21105/joss.01686
John Fox and Sanford Weisberg (2019). An {R} Companion to Applied Regression, Third Edition. Thousand Oaks CA: Sage. URL: https://socialsciences.mcmaster.ca/jfox/Books/Companion/
Revelle, W. (2020). psych: Procedures for Personality and Psychological Research. Northwestern University, Evanston, Illinois, USA. R package version 2.0.7. https://CRAN.R-project.org/package=psych
Yihui Xie (2020). knitr: A General-Purpose Package for Dynamic Report Generation in R. R package version 1.29.
Yihui Xie (2015) Dynamic Documents with R and knitr. 2nd edition. Chapman and Hall/CRC. ISBN 978-1498716963
Yihui Xie (2014) knitr: A Comprehensive Tool for Reproducible Research in R. In Victoria Stodden, Friedrich Leisch and Roger D. Peng, editors, Implementing Reproducible Computational Research. Chapman and Hall/CRC. ISBN 978-1466561595
Michael Friendly, Chris Dalzell, Martin Monkman and Dennis Murphy (2020). Lahman: Sean ‘Lahman’ Baseball Database. R package version 8.0-0. https://CRAN.R-project.org/package=Lahman
Russell Lenth (2020). emmeans: Estimated Marginal Means, aka Least-Squares Means. R package version 1.5.0. https://CRAN.R-project.org/package=emmeans
Section A: Data cleaning and manipulation
load('../data/Salaries.RData')
Salary16<- Salaries %>%
filter(yearID==2016)
load('../data/Batting.RData')
Stats16x<- Batting %>%
filter(yearID==2016) %>%
filter(AB > 200) %>%
mutate(BA = (H/AB))
Stats16 <- left_join(Stats16x, Salary16, by = "playerID") %>%
  na.omit() # drop players with no 2016 salary record (na.omit removes rows with any NA)
StatsSimple<- Stats16 %>%
  mutate(HRper100= (HR/AB)*100) %>%
  mutate(RBIper100= (RBI/AB)*100) %>%
  # SalaryCaliber must exist before the filters below can use it (it is also recreated in appendix D)
  mutate(SalaryCaliber = ifelse(salary > 1000000, "High", "Low")) %>%
  select(HRper100,RBIper100,BA,salary, lgID.y, SalaryCaliber)
LowSalary<- StatsSimple %>%
  filter(SalaryCaliber== "Low")
HighSalary<- StatsSimple %>%
  filter(SalaryCaliber== "High")
NL<- StatsSimple %>%
filter(lgID.y== "NL")
AL<- StatsSimple %>%
filter(lgID.y== "AL")
Section B: Exploratory Data Analysis
pairs.panels(StatsSimple[1:3], density = TRUE, cor = TRUE)
par(mfrow= c(2,3))
boxplot(HRper100~SalaryCaliber, data= StatsSimple, col="cyan", xlab="Salary Group", ylab="HR per 100 at bats")
boxplot(RBIper100~SalaryCaliber, data= StatsSimple, col="cyan", xlab="Salary Group", ylab="RBI per 100 at bats")
boxplot(BA~SalaryCaliber, data= StatsSimple, col="cyan", xlab="Salary Group", ylab="Batting Average")
#boxplot of HR,RBI,BA by league
boxplot(HRper100~lgID.y, data= StatsSimple, col="cornflowerblue", xlab= "League", ylab="HR per 100 at bats")
boxplot(RBIper100~lgID.y, data= StatsSimple, col="cornflowerblue", xlab= "League", ylab= "RBI per 100 at bats")
boxplot(BA~lgID.y, data= StatsSimple, col="cornflowerblue", xlab= "League", ylab= "Batting Average")
par(mfrow= c(2,1))
#HR Low
hist(LowSalary$HRper100, col="cyan", xlab="HR per 100 AB", ylab="Frequency (player count)", main="Histogram of 2016 Low Paid MLB HR")
#HR High
hist(HighSalary$HRper100, col="cyan", xlab="HR per 100 AB", ylab="Frequency (player count)", main="Histogram of 2016 High Paid MLB HR")
#NL HR
hist(NL$HRper100, col="cornflowerblue", xlab="HR per 100 AB", ylab="Frequency (player count)", main="Histogram of 2016 National League MLB HR")
#AL HR
hist(AL$HRper100, col="cornflowerblue", xlab="HR per 100 AB", ylab="Frequency (player count)", main="Histogram of 2016 American League MLB HR")
#BA of high and low
hist(LowSalary$BA, col = "cyan", xlab = "Batting Average", ylab = "Frequency (player count)", main ="Histogram of 2016 Low Paid MLB Batting Averages")
hist(HighSalary$BA, col = "cyan", xlab = "Batting Average", ylab = "Frequency (player count)", main ="Histogram of 2016 High Paid MLB Batting Averages")
#BA of AL NL
hist(NL$BA, col = "cornflowerblue", xlab = "Batting Average", ylab = "Frequency (player count)", main ="Histogram of 2016 National League MLB Batting Averages")
hist(AL$BA, col = "cornflowerblue", xlab = "Batting Average", ylab = "Frequency (player count)", main ="Histogram of 2016 American League MLB Batting Averages")
#RBI low and high
hist(LowSalary$RBIper100, col="cyan", xlab="RBI per 100 AB", ylab="Frequency (player count)", main="Histogram of 2016 Low Paid MLB RBI")
hist(HighSalary$RBIper100, col="cyan", xlab="RBI per 100 AB", ylab="Frequency (player count)", main="Histogram of 2016 High Paid MLB RBI")
#RBI NL and AL
hist(NL$RBIper100, col="cornflowerblue", xlab="RBI per 100 AB", ylab="Frequency (player count)", main="Histogram of 2016 National League MLB RBI")
hist(AL$RBIper100, col="cornflowerblue", xlab="RBI per 100 AB", ylab="Frequency (player count)", main="Histogram of 2016 American League MLB RBI")
Section C: Statistical Methods
Part 1
par(mfrow = c(2,2))
BAsalary<- lm(BA~SalaryCaliber, data = StatsSimple)
BAshap<- shapiro.test(BAsalary$residuals)
BAlev<- leveneTest(StatsSimple$BA~StatsSimple$SalaryCaliber)
plot(BAsalary)
RBIsalary<- lm(RBIper100~SalaryCaliber, data = StatsSimple)
RBIshap<- shapiro.test(RBIsalary$residuals)
RBIlev<- leveneTest(StatsSimple$RBIper100~StatsSimple$SalaryCaliber)
plot(RBIsalary)
HRsalary<-lm(HRper100~SalaryCaliber, data = StatsSimple)
HRshap<- shapiro.test(HRsalary$residuals)
HRlev<- leveneTest(StatsSimple$HRper100~StatsSimple$SalaryCaliber)
plot(HRsalary)
SalarysumShap<- c(BAshap$p.value,RBIshap$p.value,HRshap$p.value) %>%
setNames(c("BA ", "RBIper100 ", "HRper100 "))
SalarysumLev<- c(0.3814,0.4193,0.1927) # Levene p-values entered by hand from the leveneTest() output
SalTestdf<-data.frame(SalarysumShap, SalarysumLev)
knitr::kable(SalTestdf, caption= "Normality and Variance Table", col.names = c("Shapiro P-value", "Levene P-value"))
| | Shapiro P-value | Levene P-value |
| --- | --- | --- |
| BA | 0.7046396 | 0.3814 |
| RBIper100 | 0.1131881 | 0.4193 |
| HRper100 | 0.0055605 | 0.1927 |
#Tukey Kramer test
anova(BAsalary)
## Analysis of Variance Table
##
## Response: BA
## Df Sum Sq Mean Sq F value Pr(>F)
## SalaryCaliber 1 0.004871 0.0048707 5.0033 0.02605 *
## Residuals 292 0.284260 0.0009735
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
emmeans(BAsalary, pairwise~SalaryCaliber)
## $emmeans
## SalaryCaliber emmean SE df lower.CL upper.CL
## High 0.263 0.00225 292 0.258 0.267
## Low 0.254 0.00310 292 0.248 0.260
##
## Confidence level used: 0.95
##
## $contrasts
## contrast estimate SE df t.ratio p.value
## High - Low 0.00857 0.00383 292 2.237 0.0261
anova(RBIsalary)
## Analysis of Variance Table
##
## Response: RBIper100
## Df Sum Sq Mean Sq F value Pr(>F)
## SalaryCaliber 1 84.0 84.049 6.3628 0.01219 *
## Residuals 292 3857.2 13.210
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
emmeans(RBIsalary, pairwise~SalaryCaliber)
## $emmeans
## SalaryCaliber emmean SE df lower.CL upper.CL
## High 13.2 0.262 292 12.7 13.7
## Low 12.1 0.362 292 11.4 12.8
##
## Confidence level used: 0.95
##
## $contrasts
## contrast estimate SE df t.ratio p.value
## High - Low 1.13 0.446 292 2.522 0.0122
Appendix C: Part 2
#Two sample t-test of Batting average and salary
t.test(LowSalary$BA, HighSalary$BA, var.equal = TRUE)
##
## Two Sample t-test
##
## data: LowSalary$BA and HighSalary$BA
## t = -2.2368, df = 292, p-value = 0.02605
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.016112375 -0.001029543
## sample estimates:
## mean of x mean of y
## 0.2543452 0.2629161
#Two sample t-test of RBI per 100 ab and salary
t.test(LowSalary$RBIper100, HighSalary$RBIper100, var.equal = TRUE)
## 
##  Two Sample t-test
## 
## data:  LowSalary$RBIper100 and HighSalary$RBIper100
## t = -2.5225, df = 292, p-value = 0.01219
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -2.0043759 -0.2474258
## sample estimates:
## mean of x mean of y 
##      12.1      13.2
#wilcox rank sum test of HR per 100 ab and salary
wilcox.test(LowSalary$HRper100,HighSalary$HRper100)
##
## Wilcoxon rank sum test with continuity correction
##
## data: LowSalary$HRper100 and HighSalary$HRper100
## W = 8464, p-value = 0.06403
## alternative hypothesis: true location shift is not equal to 0
Appendix C: Part 3
#make salary high and low into binary 0 and 1 so the binomial family GLM can be fit
StatsSimple<-StatsSimple %>%
mutate(SalaryBin= ifelse(salary > 1000000, 1, 0))
Salary1<- glm(SalaryBin~BA+ RBIper100+lgID.y, family = "binomial", data = StatsSimple)
#residual checks like those for a linear model are not strictly required for a binomial GLM; model fit is compared via AIC/BIC below
Salary2<- glm(SalaryBin~+lgID.y+BA* RBIper100, family = "binomial", data = StatsSimple)
Salary3<- glm(SalaryBin~lgID.y*BA*RBIper100, family ="binomial", data = StatsSimple)
Salary4<- glm(SalaryBin~BA*RBIper100, family ="binomial", data = StatsSimple)
Salary5<- glm(SalaryBin~BA+RBIper100, family ="binomial", data = StatsSimple)
summary(Salary1)
##
## Call:
## glm(formula = SalaryBin ~ BA + RBIper100 + lgID.y, family = "binomial",
## data = StatsSimple)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.8634 -1.3429 0.7801 0.9240 1.2751
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.94616 1.07515 -1.810 0.0703 .
## BA 7.13659 4.12260 1.731 0.0834 .
## RBIper100 0.07230 0.03539 2.043 0.0410 *
## lgID.yNL -0.32606 0.25083 -1.300 0.1936
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 378.29 on 293 degrees of freedom
## Residual deviance: 367.35 on 290 degrees of freedom
## AIC: 375.35
##
## Number of Fisher Scoring iterations: 4
summary(Salary2)
##
## Call:
## glm(formula = SalaryBin ~ +lgID.y + BA * RBIper100, family = "binomial",
## data = StatsSimple)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.9420 -1.3263 0.7779 0.9412 1.2264
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.3684 3.9834 0.092 0.926
## lgID.yNL -0.3281 0.2509 -1.308 0.191
## BA -1.8379 15.4405 -0.119 0.905
## RBIper100 -0.1167 0.3160 -0.369 0.712
## BA:RBIper100 0.7283 1.2118 0.601 0.548
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 378.29 on 293 degrees of freedom
## Residual deviance: 366.98 on 289 degrees of freedom
## AIC: 376.98
##
## Number of Fisher Scoring iterations: 4
summary(Salary3)
##
## Call:
## glm(formula = SalaryBin ~ lgID.y * BA * RBIper100, family = "binomial",
## data = StatsSimple)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.8233 -1.2904 0.7513 0.9443 1.2176
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.9341 6.1823 -0.636 0.525
## lgID.yNL 7.2636 8.1432 0.892 0.372
## BA 13.3980 23.8842 0.561 0.575
## RBIper100 0.2595 0.4777 0.543 0.587
## lgID.yNL:BA -27.0470 31.5451 -0.857 0.391
## lgID.yNL:RBIper100 -0.6901 0.6476 -1.066 0.287
## BA:RBIper100 -0.6058 1.8204 -0.333 0.739
## lgID.yNL:BA:RBIper100 2.4693 2.4796 0.996 0.319
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 378.29 on 293 degrees of freedom
## Residual deviance: 365.43 on 286 degrees of freedom
## AIC: 381.43
##
## Number of Fisher Scoring iterations: 4
summary(Salary4)
##
## Call:
## glm(formula = SalaryBin ~ BA * RBIper100, family = "binomial",
## data = StatsSimple)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.8614 -1.3073 0.8017 0.9579 1.1859
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.1690 3.9729 0.043 0.966
## BA -1.7284 15.4075 -0.112 0.911
## RBIper100 -0.1108 0.3153 -0.351 0.725
## BA:RBIper100 0.7063 1.2087 0.584 0.559
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 378.29 on 293 degrees of freedom
## Residual deviance: 368.70 on 290 degrees of freedom
## AIC: 376.7
##
## Number of Fisher Scoring iterations: 4
summary(Salary5)
##
## Call:
## glm(formula = SalaryBin ~ BA + RBIper100, family = "binomial",
## data = StatsSimple)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.7850 -1.3271 0.7973 0.9385 1.2044
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.07589 1.06653 -1.946 0.0516 .
## BA 6.97787 4.10406 1.700 0.0891 .
## RBIper100 0.07253 0.03538 2.050 0.0404 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 378.29 on 293 degrees of freedom
## Residual deviance: 369.04 on 291 degrees of freedom
## AIC: 375.04
##
## Number of Fisher Scoring iterations: 4
Anova(Salary1)
## Analysis of Deviance Table (Type II tests)
##
## Response: SalaryBin
## LR Chisq Df Pr(>Chisq)
## BA 3.0353 1 0.08147 .
## RBIper100 4.2430 1 0.03941 *
## lgID.y 1.6979 1 0.19257
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Anova(Salary2)
## Analysis of Deviance Table (Type II tests)
##
## Response: SalaryBin
## LR Chisq Df Pr(>Chisq)
## lgID.y 1.7181 1 0.18994
## BA 3.0353 1 0.08147 .
## RBIper100 4.2430 1 0.03941 *
## BA:RBIper100 0.3671 1 0.54458
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Anova(Salary3)
## Analysis of Deviance Table (Type II tests)
##
## Response: SalaryBin
## LR Chisq Df Pr(>Chisq)
## lgID.y 1.7181 1 0.18994
## BA 2.9996 1 0.08329 .
## RBIper100 4.2257 1 0.03982 *
## lgID.y:BA 0.1564 1 0.69250
## lgID.y:RBIper100 0.4758 1 0.49032
## BA:RBIper100 0.4054 1 0.52430
## lgID.y:BA:RBIper100 0.9926 1 0.31911
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Anova(Salary4)
## Analysis of Deviance Table (Type II tests)
##
## Response: SalaryBin
## LR Chisq Df Pr(>Chisq)
## BA 2.9282 1 0.08704 .
## RBIper100 4.2707 1 0.03878 *
## BA:RBIper100 0.3469 1 0.55588
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Anova(Salary5)
## Analysis of Deviance Table (Type II tests)
##
## Response: SalaryBin
## LR Chisq Df Pr(>Chisq)
## BA 2.9282 1 0.08704 .
## RBIper100 4.2707 1 0.03878 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
AIC(Salary1,Salary2, Salary3,Salary4, Salary5)
## df AIC
## Salary1 4 375.3464
## Salary2 5 376.9793
## Salary3 8 381.4339
## Salary4 4 376.6974
## Salary5 3 375.0443
BIC(Salary1,Salary2, Salary3, Salary4, Salary5)
## df BIC
## Salary1 4 390.0807
## Salary2 5 395.3972
## Salary3 8 410.9025
## Salary4 4 391.4317
## Salary5 3 386.0950
ggplot(StatsSimple, aes(x=BA, y=SalaryBin, color= lgID.y))+
geom_point()+
geom_smooth(method = "glm", method.args= list(family="binomial"))
## `geom_smooth()` using formula 'y ~ x'
#observations represent a random sample: data picked from 2016, filtered to only those with over 200 at bats (aka starters), no biases in data selection
#co-linearity between predictors: as shown in the pairs.panels plot the r value between HR and RBI is 0.8, so I removed HR
#there does appear to be a linear relationship between RBIper100 and salary, as well as a slight linear relationship between BA and salary
#create the candidate models and compare their fit
Appendix C: Part 4
#predict data to see how well the model fits
splitter1<- sample(1:nrow(StatsSimple), 30, replace = F)
Salary_train<- StatsSimple[-splitter1,]
Salary_test<- StatsSimple[splitter1,]
model_all_train<- glm(SalaryBin~BA+RBIper100, family ="binomial", data = Salary_train) # fit on the 264 training rows only, not the full data
Salary_prediction<- predict(model_all_train, Salary_test, type = "response") # predicted probabilities of the high salary group
plot(Salary_test$SalaryBin, pch=1) # observed groups (0/1)
points(Salary_prediction, pch=20, col="red") # model predictions
Appendix D
# again I will use StatsSimple because it contains only the variables I am looking to analyze.
# split the data into high and low salary players, high >1,000,000$/yr low <= 1,000,000$/yr
StatsSimple<- StatsSimple %>%
mutate(SalaryCaliber = ifelse(salary > 1000000, "High", "Low")) %>%
mutate(logHR= log(HRper100+0.01))
par(mfrow= c(2,3))
boxplot(HRper100~SalaryCaliber, data= StatsSimple, col="cyan")
boxplot(RBIper100~SalaryCaliber, data= StatsSimple, col="cyan")
boxplot(BA~SalaryCaliber, data= StatsSimple, col="cyan")
#boxplot of HR,RBI,BA by league
boxplot(HRper100~lgID.y, data= StatsSimple, col="cornflowerblue", xlab= "League")
boxplot(RBIper100~lgID.y, data= StatsSimple, col="cornflowerblue", xlab= "League")
boxplot(BA~lgID.y, data= StatsSimple, col="cornflowerblue", xlab= "League")
# The low salary group does not look normally distributed, but this is because it covers a much
# smaller range on the scale compared to the high bin. When subsetting low onto its own axis you
# can see its spread is not terrible, although it does require some explaining. The base salary
# is $500,000, so there are no data values below that, and the majority of "low" salary players
# are on the base salary contract; that is why the distribution is so heavily centered close to
# $500,000 with some outliers toward 1 million.
LowSalary<- StatsSimple %>%
filter(SalaryCaliber== "Low")
HighSalary<- StatsSimple %>%
filter(SalaryCaliber== "High")
NL<- StatsSimple %>%
filter(lgID.y== "NL")
AL<- StatsSimple %>%
filter(lgID.y== "AL")
#note: the explanatory salary data does not need to be normally distributed for these models; only the residuals do
#Check for normality of all three variables in both categories that I will be testing
#league BA
qqPlot(NL$BA, main = "QQ-Plot of National League Batting Avg", ylab = "BA", col = "cornflowerblue")
## [1] 79 89
qqPlot(AL$BA, main = "QQ-Plot of American League Batting Avg", ylab = "BA", col = "cornflowerblue")
## [1] 54 118
#league- RBI
qqPlot(NL$RBI, main = "QQ-Plot of National League RBI", ylab = "RBI", col = "cornflowerblue")
## [1] 69 16
qqPlot(AL$RBI, main = "QQ-Plot of American League RBI", ylab = "RBI", col = "cornflowerblue")
## [1] 100 14
#league-HR
qqPlot(NL$HR, main = "QQ-Plot of National League HR", ylab = "HR", col = "cornflowerblue")
## [1] 56 20
qqPlot(AL$HR, main = "QQ-Plot of American League HR", ylab = "HR", col = "cornflowerblue")
## [1] 137 33
#Check normality of each with shapiro tests
#create linear models of BA,HR,RBI with league to check residuals for normality
BAleague<- lm(BA~lgID.y, data = StatsSimple)
ResBAleague<-BAleague$residuals
RBIleague<- lm(RBIper100~lgID.y, data = StatsSimple)
ResRBIleague<-RBIleague$residuals
HRleague<- lm(HRper100~lgID.y, data = StatsSimple)
ResHRleague<-HRleague$residuals
#check normality of league residuals
shapiro.test(ResBAleague)
##
## Shapiro-Wilk normality test
##
## data: ResBAleague
## W = 0.99619, p-value = 0.7023
shapiro.test(ResRBIleague)
##
## Shapiro-Wilk normality test
##
## data: ResRBIleague
## W = 0.99326, p-value = 0.2095
shapiro.test(ResHRleague) #not normal
##
## Shapiro-Wilk normality test
##
## data: ResHRleague
## W = 0.98376, p-value = 0.002062
# now test for variance of the normally distributed BA and RBI data variables
#BA
leveneTest(StatsSimple$BA~StatsSimple$SalaryCaliber)
## Warning in leveneTest.default(y = y, group = group, ...): group coerced to
## factor.
## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 1 0.7686 0.3814
## 292
leveneTest(StatsSimple$BA~StatsSimple$lgID.y)
## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 1 0.2382 0.6259
## 292
#RBI
leveneTest(StatsSimple$RBIper100~StatsSimple$SalaryCaliber)
## Warning in leveneTest.default(y = y, group = group, ...): group coerced to
## factor.
## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 1 0.6541 0.4193
## 292
leveneTest(StatsSimple$HRper100~StatsSimple$lgID.y)
## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 1 0.4656 0.4955
## 292
#levene tests showed that all normally distributed variables also have equal variances
#can run two sample t-tests on both BA and RBIper100
#two sample t-test of equal variance
#league
t.test(NL$BA, AL$BA, var.equal = TRUE) #cant reject Ho
##
## Two Sample t-test
##
## data: NL$BA and AL$BA
## t = 0.2461, df = 292, p-value = 0.8058
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.006319989 0.008126395
## sample estimates:
## mean of x mean of y
## 0.2604202 0.2595170
t.test(NL$RBIper100, AL$RBIper100, var.equal = TRUE) #cant reject Ho
##
## Two Sample t-test
##
## data: NL$RBIper100 and AL$RBIper100
## t = -0.04456, df = 292, p-value = 0.9645
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.8625100 0.8243184
## sample estimates:
## mean of x mean of y
## 12.83402 12.85311
#non parametric test comparing HR
#cant reject Ho
wilcox.test(NL$HRper100,AL$HRper100) #cant reject Ho
##
## Wilcoxon rank sum test with continuity correction
##
## data: NL$HRper100 and AL$HRper100
## W = 10008, p-value = 0.2751
## alternative hypothesis: true location shift is not equal to 0