Baseball offensive hitting statistics were used to determine whether higher paid players are actually more offensively productive. Given that we are in the golden age of statistical analysis in baseball, what team owner wouldn't want to know if they are overpaying their players? This project first examined the distributions and variances of all numerical variables to determine what kinds of statistical analysis could be performed. Parametric tests, non-parametric tests, generalized linear models, and predictive analysis were used. These tools showed that yes, more productive offensive players were paid more in 2016. The old baseball saying, "the money lies in the RBIs," was the most significant conclusion of this project.
In the age of modern sabermetrics, there are ample baseball datasets to choose from. For this analysis I pulled data from the Lahman Baseball GitHub site, maintained by Chris Dalzell. The Lahman repository had the two main R data files needed for this project: the batting data and the salaries data. These two sets were joined so that salary and batting data could be viewed in the same data frame. The batting variables examined in this project are batting average (BA), runs batted in (RBI), and home runs (HR). These three variables are the basis for most offensive evaluation metrics in baseball.
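As a minimal sketch of the pull and join, using the Lahman R package cited in the references (the project itself loads the equivalent .RData files, see appendix section A; object names here are illustrative):

library(dplyr)
library(Lahman) # Sean Lahman baseball database: Batting and Salaries tables
bat16 <- Batting %>% filter(yearID == 2016)  # 2016 batting lines
sal16 <- Salaries %>% filter(yearID == 2016) # 2016 salaries
stats16 <- left_join(bat16, sal16, by = "playerID")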
Every major league team puts a high emphasis on efficiency: owners want to win as many games as possible without overpaying players for their productivity. Given this concern, the basis of this project is to evaluate players' productivity in relation to their earnings. Statistical analysis will be used to investigate the relationship between 2016 MLB players' offensive productivity and their earnings for the same year.
The goals of this project are to discover the individual relationships between earnings and BA, HRs, and RBIs. I also aim to find a model that can predict salary group from the variables BA, HR, and RBI. From this model we can see which variable is the strongest predictor, then return to the original relationship tests to see whether that particular relationship is statistically significant. From these tests a conclusion can be made about whether players' salaries are a good measure of offensive productivity (i.e., whether players earn their salaries).
Hypothesis
H0: There is no relationship between the salary of 2016 MLB players with over 200 at bats and the offensive productivity of 2016 MLB players with over 200 at bats.
Ha: There is a relationship between the salary of 2016 MLB players with over 200 at bats and the offensive productivity of 2016 MLB players with over 200 at bats.
From figure 1 it appears that, for both grouping variables (league and salary), RBI per 100 at bats and batting average are approximately normal. Home runs per 100 at bats does not look normal for the salary grouping and will be explored with further analysis in the next section, statistical methods. All R code for the plots in this section is located in appendix section B.
Manipulation of the data. As shown above in Fig 1, Fig 2, and Fig 3, home runs and RBIs are adjusted per 100 at bats. This adjustment was made because the number of at bats varies from 200 to 672; naturally a player with the same skills would hit more home runs in 672 at bats than in 200. The adjustment does not apply to batting average because it is inherently a rate. Salary was converted into a binary variable to make two groups: high (value = 1) for players paid above 1,000,000 US dollars, and low (value = 0) for those paid less than or equal to 1,000,000 US dollars. All code for the statistical analysis can be found in appendix section C. The three types of tests used in the statistical analysis were the two sample t-test, the Wilcoxon rank sum test, and a generalized linear model.
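A condensed sketch of these adjustments, mirroring the code in appendix sections A and C:

StatsSimple <- Stats16 %>%
  mutate(HRper100 = (HR/AB)*100,   # home runs per 100 at bats
         RBIper100 = (RBI/AB)*100, # RBIs per 100 at bats
         SalaryBin = ifelse(salary > 1000000, 1, 0)) # 1 = above $1,000,000, 0 = at or below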
Two Sample T-test
A two sample t-test will tell us if there is a significant difference between the means of a variable, such as batting average, in the high paid and low paid groups.
Assumptions of the two sample t-test:
Both samples are random samples
Both populations are normally distributed
Both populations have equal variance
Wilcoxon Rank Sum Test
The Wilcoxon rank sum test is a non-parametric test, meaning it is used when the assumptions of a parametric test, like the two sample t-test, cannot be met. This test compares the central tendencies of two groups using ranks.
Assumptions of the Wilcoxon test:
Both samples are random samples
Both samples have the same shape of distribution
Generalized Linear Model (Binomial Family)
A GLM is similar to a linear model: it is an approach to modeling the relationship between a response and explanatory variable(s). A GLM relaxes some of the assumptions a linear model makes; the response does not have to be normally distributed or have equal variances. A minimal fitting sketch follows the assumption list below.
Assumptions
The response variable should follow the distribution of the chosen random component (here, binomial)
For each x value there is a corresponding y
Y values are independent from one another
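As a minimal sketch, a binomial-family GLM of the kind used here is fit in R with glm(); the full set of candidate models is in appendix C part 3:

# logistic regression of salary group on the offensive variables
fit <- glm(SalaryBin ~ BA + RBIper100, family = "binomial", data = StatsSimple)
summary(fit) # coefficient estimates are on the log-odds scale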
Assumption Checks of Batting Average (BA) for two sample t-test
Both samples are random because they were filtered only by more than 200 at bats from the entire population
Both populations are normally distributed: as shown in figure 2 the spread appears normal, and as shown in figure 4 the p-value of the Shapiro test on the residuals is > 0.05, so we cannot reject H0 that the data are normally distributed. The qqPlot of the residuals in appendix section C part 1 shows the points falling within the 95% CI of the normal distribution. From these tests the data can be assumed normally distributed
Both populations have equal variance: the Levene's test p-value is > 0.05, so we fail to reject H0 that the variances are equal, indicating the groups do have equal variance. It can also be seen in the residuals vs fitted plot in appendix section C part 1 that there is no distinct pattern, consistent with equal variance (both checks are sketched in code below)
All assumptions met, a two-sample t-test can be run
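A minimal sketch of these checks, mirroring appendix C part 1 (leveneTest() and qqPlot() come from the car package):

library(car)
BAsalary <- lm(BA ~ SalaryCaliber, data = StatsSimple)
shapiro.test(BAsalary$residuals) # H0: residuals are normally distributed
leveneTest(StatsSimple$BA ~ StatsSimple$SalaryCaliber) # H0: group variances are equal
qqPlot(BAsalary$residuals) # points should fall within the 95% CI band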
full t-test can be seen in appendix C part 2
State hypothesis
\(H_0 : \mu_L = \mu_H\)
\(H_A : \mu_L \neq \mu_H\)
H0: The mean batting average in the low salary group is equal to the mean batting average of the high salary group
Ha: The mean batting average in the low salary group is not equal to the mean batting average of the high salary group
t.test(LowSalary$BA, HighSalary$BA, var.equal = TRUE)
Assumption Checks of RBI per 100 AB for two sample t-test
Both samples are random because they were filtered only by more than 200 at bats from the entire population
Both populations are normally distributed: as shown in figure 2 the spread appears normal, and as shown in figure 4 the p-value of the Shapiro test on the residuals is > 0.05, so we cannot reject H0 that the data are normally distributed. The qqPlot of the residuals in appendix section C part 1 shows the points falling within the 95% CI of the normal distribution. From these tests the data can be assumed normally distributed
Both populations have equal variance: the Levene's test p-value is > 0.05, so we fail to reject H0 that the variances are equal, indicating the groups do have equal variance. It can also be seen in the residuals vs fitted plot in appendix section C part 1 that there is no distinct pattern, consistent with equal variance
All assumptions met, a two-sample t-test can be run
full t-test can be seen in appendix C part 2
State hypothesis
\(H_0 : \mu_L = \mu_H\)
\(H_A : \mu_L \neq \mu_H\)
H0: The mean number of RBI per 100 at bats in the low salary group is equal to the mean number of RBI per 100 at bats in the high salary group
Ha: The mean number of RBI per 100 at bats in the low salary group is not equal to the mean number of RBI per 100 at bats in the high salary group
t.test(LowSalary$RBIper100, HighSalary$RBIper100, var.equal = TRUE)
Assumption Checks of HR per 100 AB for two sample t-test
Both samples are random because they were filtered only by more than 200 at bats from the entire population
As shown in figure 4, the Shapiro test (p-value < 0.05) shows that the residuals are not normally distributed
Not all assumptions met, check assumptions for Wilcox Rank Sum Test
Both samples are random, see above
As shown in figures 2 and 3, the two groups have roughly the same distribution shape
Assumptions met, Wilcox Rank Sum Test can be run
full test can be seen in appendix C part 2
State hypothesis
\(H_0 : \mu_L = \mu_H\)
\(H_A : \mu_L \neq \mu_H\)
H0: The number of HR per 100 at bats is the same in the low salary and high salary groups
Ha: The number of HR per 100 at bats is not the same in the low salary and high salary groups
wilcox.test(LowSalary$HRper100,HighSalary$HRper100)
Assumption checks to run a Generalized Linear Model
The response variable, salary group, is binary and so is distributed like the binomial random component
For each salary point there is a corresponding explanatory variable data entry
Salaries are independent from one another
All assumptions met, a Generalized Linear Model can be run
full model can be seen in appendix C part 3
As shown in figure 1, there is collinearity between HRper100 and RBIper100, so HRper100 is removed; this can be verified directly as sketched below.
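A quick check of the correlation behind this decision (the pairs.panels plot in appendix B reports r of roughly 0.8):

cor(StatsSimple$HRper100, StatsSimple$RBIper100) # ~0.8 per the pairs.panels plot, high enough to drop one predictor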
State hypothesis
\[H_0 : \beta = 0 \\ H_A: \beta \neq 0\]
H0: There is no relationship between the variables batting average and RBIper100 and the salary that a 2016 MLB player gets paid.
Ha: There is a relationship between the variables batting average and RBIper100 and the salary that a 2016 MLB player gets paid.
Five generalized linear models were created and then assessed using the Akaike information criterion (AIC) and the Bayesian information criterion (BIC). Both are tools for judging the goodness of fit of a model; BIC scores more complex models more harshly.
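Concretely, for a model with \(k\) estimated parameters, maximized likelihood \(\hat{L}\), and \(n\) observations, \[AIC = 2k - 2\ln\hat{L} \qquad BIC = k\ln(n) - 2\ln\hat{L}\] With \(n = 294\) players here, \(\ln(294) \approx 5.7 > 2\), so BIC charges more than AIC for each additional parameter.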
Prediction of data points using best model ranked by BIC
full prediction shown in appendix C part 4
Prediction process:
Figure 7, shown in the results, indicates that model 5 was the best fit.
The data were subsetted, splitting off 30 points to be used as test points.
The remaining 264 points were used as training points for the model of best fit.
The model of best fit was then used to predict where those 30 held-out salary points should fall: 1 (high salary) or 0 (low salary). A sketch of this procedure follows.
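A minimal sketch, mirroring appendix C part 4 (the set.seed() call is my addition for reproducibility; the original run did not fix a seed):

set.seed(1) # assumption: added for reproducibility only
splitter <- sample(1:nrow(StatsSimple), 30, replace = FALSE)
Salary_train <- StatsSimple[-splitter, ] # 264 training rows
Salary_test <- StatsSimple[splitter, ]   # 30 held-out rows
fit_train <- glm(SalaryBin ~ BA + RBIper100, family = "binomial", data = Salary_train)
predict(fit_train, newdata = Salary_test, type = "response") # predicted probability of the high salary group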
| | Shapiro P-value | Levene P-value |
| --- | --- | --- |
| BA | 0.7046396 | 0.3814 |
| RBIper100 | 0.1131881 | 0.4193 |
| HRper100 | 0.0055605 | 0.1927 |
Results of Batting Average T-test and RBI per 100 at bats T-test
T-tests were performed on both batting average and RBIper100 because both met all of the required assumptions. As shown in figure 5, the p-value of the BA t-test is < 0.05. This means we can reject our H0 that the means are equal, so from this test we can conclude that the means are not equal. As shown in figure 2, the central tendency of BA is larger in the high salary group; the Tukey-Kramer comparison in appendix C part 1 shows the same.
It can be concluded that high salary players do have a higher batting average than low salary players
The t-test for RBI per 100 at bats also has a p-value < 0.05, as shown in figure 5. From this we can reject H0 and conclude that there is a difference in the mean number of RBIs per 100 at bats. As shown in figure 2, and in the Tukey-Kramer comparison in appendix C part 1, RBIper100 is larger in the high salary group.
It can be concluded that high salary players do have a higher number of RBI per 100 at bats than low salary players
Results of Wilcox Rank sum test
HR per 100 at bats was not normally distributed and did not meet the assumptions for a parametric test, so a non-parametric test was performed. The results in figure 6 show a p-value > 0.05. This means we fail to reject H0, so we cannot conclude that there is any difference in home runs hit per 100 at bats between the high and low salary groups
It can be concluded that there is no difference between the number of home runs hit in 100 at bats between the high and low salary players
| | P-value | 95% Confidence Interval | Alpha Level | Degrees of Freedom | T-value |
| --- | --- | --- | --- | --- | --- |
| Batting Avg | 0.02605 | (-0.016112375, -0.001029543) | 0.05 | 292 | -2.2368 |
| RBI per 100 | 0.01219 | (-2.0043759, -0.2474258) | 0.05 | 292 | -2.5225 |
| | P-value | W-Value |
| --- | --- | --- |
| HR per 100 | 0.06403 | 8464 |
Results of Generalized Linear Model
There were five models made, as shown in figure 7. The model that I deemed the best fit is Salary5. Letting \(p_i\) be the probability that player \(i\) is in the high salary group, this model can be written as \[\log\!\left(\frac{p_i}{1-p_i}\right) = \beta_0 + \beta_1(BA)_i + \beta_2(RBIper100)_i\] All models can be seen in appendix C part 3. The model "Salary5" was chosen because it had the lowest AIC and BIC values of all the models, indicating that it is the best fit.
\(\beta_0 = -2.07589\)
\(\beta_1 = 6.97787\) (BA)
\(\beta_2 = 0.07253\) (RBIper100)
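To interpret these coefficients, consider a hypothetical player (values chosen purely for illustration) with a .270 batting average and 13 RBI per 100 at bats:

# hypothetical player: BA = .270, RBIper100 = 13 (illustrative values only)
logit <- -2.07589 + 6.97787 * 0.270 + 0.07253 * 13
plogis(logit) # inverse logit: about 0.68, i.e. roughly a 68% chance of the high salary group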
The Salary5 model had one significant explanatory variable, RBIper100. As shown in appendix C part 3, the Pr(>Chisq) of RBIper100 is 0.03878. This indicates that the best predictor of salary group is RBI per 100 at bats.
| | Degrees of Freedom | AIC | BIC |
| --- | --- | --- | --- |
| Salary1 | 4 | 375.3464 | 390.0807 |
| Salary2 | 5 | 376.9793 | 395.3972 |
| Salary3 | 8 | 381.4339 | 410.9025 |
| Salary4 | 4 | 376.6974 | 391.4317 |
| Salary5 | 3 | 375.0443 | 386.0950 |
Results of Salary Group Prediction. As shown in figure 9, the salary group prediction did not do a great job: 9 of the 30 points were left in the middle between the two groups, though the model did come almost exactly to 1 or 0 for some predictions. This is likely because predicting salary is very difficult, mainly because a salary is determined by human negotiation; owners do not use this predictive model to determine a player's next contract.
From the analysis it can be concluded that there is a significant difference in 2016 MLB batting average and RBIs per 100 at bats between players paid over versus under 1,000,000 US dollars. No differences were found in any of the variables between leagues; if interested, those results are in appendix D.
The variables with statistically significant tests were then used in finding a GLM, and the model with the lowest BIC score was chosen. This model did not do a great job predicting salary caliber, which was a limitation of the study. There are many more variables that predict a player's salary. For starters, defense is taken into account. Another important one is that salaries are negotiated before the player puts up that season's statistics. This can be somewhat ignored because ideally a player will perform at his average for the year, but aging and injuries all come into play. Another shortcoming is that young players who quickly blossom into stars are often stuck on their low paying rookie contracts, while conversely former stars, who signed long expensive contracts, are aging and no longer producing at the level that earned them those contracts.
These are some of the shortcomings of the study. If I had more time I would certainly overlay many years of data, build it into a multidimensional matrix dataset, and plot the analysis over time.
However, that is not to say this study lacked significant findings. There is an age old saying in baseball, "the money lies in the RBIs." That was undoubtedly shown in this study, as RBI was the most predictive variable.
Wickham et al., (2019). Welcome to the tidyverse. Journal of Open Source Software, 4(43), 1686, https://doi.org/10.21105/joss.01686
John Fox and Sanford Weisberg (2019). An {R} Companion to Applied Regression, Third Edition. Thousand Oaks CA: Sage. URL: https://socialsciences.mcmaster.ca/jfox/Books/Companion/
Revelle, W. (2020). psych: Procedures for Personality and Psychological Research. Northwestern University, Evanston, Illinois, USA. R package version 2.0.7. https://CRAN.R-project.org/package=psych
Yihui Xie (2020). knitr: A General-Purpose Package for Dynamic Report Generation in R. R package version 1.29.
Yihui Xie (2015) Dynamic Documents with R and knitr. 2nd edition. Chapman and Hall/CRC. ISBN 978-1498716963
Yihui Xie (2014) knitr: A Comprehensive Tool for Reproducible Research in R. In Victoria Stodden, Friedrich Leisch and Roger D. Peng, editors, Implementing Reproducible Computational Research. Chapman and Hall/CRC. ISBN 978-1466561595
Michael Friendly, Chris Dalzell, Martin Monkman and Dennis Murphy (2020). Lahman: Sean ‘Lahman’ Baseball Database. R package version 8.0-0. https://CRAN.R-project.org/package=Lahman
Russell Lenth (2020). emmeans: Estimated Marginal Means, aka Least-Squares Means. R package version 1.5.0. https://CRAN.R-project.org/package=emmeans
Section A: Data cleaning and manipulation
load('../data/Salaries.RData')
Salary16<- Salaries %>%
filter(yearID==2016)
load('../data/Batting.RData')
Stats16x<- Batting %>%
filter(yearID==2016) %>%
filter(AB > 200) %>%
mutate(BA = (H/AB))
Stats16 <- left_join(Stats16x, Salary16, by = "playerID") %>%
  na.omit() # drop players with no 2016 salary record (na.omit removes rows with any NA)
StatsSimple<- Stats16 %>%
  mutate(HRper100= (HR/AB)*100) %>%
  mutate(RBIper100= (RBI/AB)*100) %>%
  # SalaryCaliber must exist before the filters below can use it (it is also recreated in appendix D)
  mutate(SalaryCaliber = ifelse(salary > 1000000, "High", "Low")) %>%
  select(HRper100,RBIper100,BA,salary, lgID.y, SalaryCaliber)
LowSalary<- StatsSimple %>%
  filter(SalaryCaliber== "Low")
HighSalary<- StatsSimple %>%
  filter(SalaryCaliber== "High")
NL<- StatsSimple %>%
filter(lgID.y== "NL")
AL<- StatsSimple %>%
filter(lgID.y== "AL")
Section B: Exploratory Data Analysis
pairs.panels(StatsSimple[1:3], density = TRUE, cor = TRUE)
par(mfrow= c(2,3))
boxplot(HRper100~SalaryCaliber, data= StatsSimple, col="cyan", xlab="Salary Group", ylab="HR per 100 at bats")
boxplot(RBIper100~SalaryCaliber, data= StatsSimple, col="cyan", xlab="Salary Group", ylab="RBI per 100 at bats")
boxplot(BA~SalaryCaliber, data= StatsSimple, col="cyan", xlab="Salary Group", ylab="Batting Average")
#boxplot of HR,RBI,BA by league
boxplot(HRper100~lgID.y, data= StatsSimple, col="cornflowerblue", xlab= "League", ylab="HR per 100 at bats")
boxplot(RBIper100~lgID.y, data= StatsSimple, col="cornflowerblue", xlab= "League", ylab= "RBI per 100 at bats")
boxplot(BA~lgID.y, data= StatsSimple, col="cornflowerblue", xlab= "League", ylab= "Batting Average")
par(mfrow= c(2,1))
#HR Low
hist(LowSalary$HRper100, col="cyan", xlab="HR per 100 AB", ylab="Frequency (player count)", main="Histogram of 2016 Low Paid MLB HR")
#HR High
hist(HighSalary$HRper100, col="cyan", xlab="HR per 100 AB", ylab="Frequency (player count)", main="Histogram of 2016 High Paid MLB HR")
#NL HR
hist(NL$HRper100, col="cornflowerblue", xlab="HR per 100 AB", ylab="Frequency (player count)", main="Histogram of 2016 National League MLB HR")
#AL HR
hist(AL$HRper100, col="cornflowerblue", xlab="HR per 100 AB", ylab="Frequency (player count)", main="Histogram of 2016 American League MLB HR")
#BA of high and low
hist(LowSalary$BA, col = "cyan", xlab = "Batting Average", ylab = "Frequency (player count)", main ="Histogram of 2016 Low Paid MLB Batting Averages")
hist(HighSalary$BA, col = "cyan", xlab = "Batting Average", ylab = "Frequency (player count)", main ="Histogram of 2016 High Paid MLB Batting Averages")
#BA of AL NL
hist(NL$BA, col = "cornflowerblue", xlab = "Batting Average", ylab = "Frequency (player count)", main ="Histogram of 2016 National League MLB Batting Averages")
hist(AL$BA, col = "cornflowerblue", xlab = "Batting Average", ylab = "Frequency (player count)", main ="Histogram of 2016 American League MLB Batting Averages")
#RBI low and high
hist(LowSalary$RBIper100, col="cyan", xlab="RBI per 100 AB", ylab="Frequency (player count)", main="Histogram of 2016 Low Paid MLB RBI")
hist(HighSalary$RBIper100, col="cyan", xlab="RBI per 100 AB", ylab="Frequency (player count)", main="Histogram of 2016 High Paid MLB RBI")
#RBI NL and AL
hist(NL$RBIper100, col="cornflowerblue", xlab="RBI per 100 AB", ylab="Frequency (player count)", main="Histogram of 2016 National League MLB RBI")
hist(AL$RBIper100, col="cornflowerblue", xlab="RBI per 100 AB", ylab="Frequency (player count)", main="Histogram of 2016 American League MLB RBI")
Section C: Statistical Methods
Part 1
par(mfrow = c(2,2))
BAsalary<- lm(BA~SalaryCaliber, data = StatsSimple)
BAshap<- shapiro.test(BAsalary$residuals)
BAlev<- leveneTest(StatsSimple$BA~StatsSimple$SalaryCaliber)
plot(BAsalary)
RBIsalary<- lm(RBIper100~SalaryCaliber, data = StatsSimple)
RBIshap<- shapiro.test(RBIsalary$residuals)
RBIlev<- leveneTest(StatsSimple$RBIper100~StatsSimple$SalaryCaliber)
plot(RBIsalary)
HRsalary<-lm(HRper100~SalaryCaliber, data = StatsSimple)
HRshap<- shapiro.test(HRsalary$residuals)
HRlev<- leveneTest(StatsSimple$HRper100~StatsSimple$SalaryCaliber)
plot(HRsalary)
SalarysumShap<- c(BAshap$p.value,RBIshap$p.value,HRshap$p.value) %>%
setNames(c("BA ", "RBIper100 ", "HRper100 "))
SalarysumLev<- c(0.3814,0.4193,0.1927) # Levene p-values entered by hand from the leveneTest() output
SalTestdf<-data.frame(SalarysumShap, SalarysumLev)
knitr::kable(SalTestdf, caption= "Normality and Variance Table", col.names = c("Shapiro P-value", "Levene P-value"))
| | Shapiro P-value | Levene P-value |
| --- | --- | --- |
| BA | 0.7046396 | 0.3814 |
| RBIper100 | 0.1131881 | 0.4193 |
| HRper100 | 0.0055605 | 0.1927 |
#Tukey Kramer test
anova(BAsalary)
## Analysis of Variance Table
##
## Response: BA
## Df Sum Sq Mean Sq F value Pr(>F)
## SalaryCaliber 1 0.004871 0.0048707 5.0033 0.02605 *
## Residuals 292 0.284260 0.0009735
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
emmeans(BAsalary, pairwise~SalaryCaliber)
## $emmeans
## SalaryCaliber emmean SE df lower.CL upper.CL
## High 0.263 0.00225 292 0.258 0.267
## Low 0.254 0.00310 292 0.248 0.260
##
## Confidence level used: 0.95
##
## $contrasts
## contrast estimate SE df t.ratio p.value
## High - Low 0.00857 0.00383 292 2.237 0.0261
anova(RBIsalary)
## Analysis of Variance Table
##
## Response: RBIper100
## Df Sum Sq Mean Sq F value Pr(>F)
## SalaryCaliber 1 84.0 84.049 6.3628 0.01219 *
## Residuals 292 3857.2 13.210
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
emmeans(RBIsalary, pairwise~SalaryCaliber)
## $emmeans
## SalaryCaliber emmean SE df lower.CL upper.CL
## High 13.2 0.262 292 12.7 13.7
## Low 12.1 0.362 292 11.4 12.8
##
## Confidence level used: 0.95
##
## $contrasts
## contrast estimate SE df t.ratio p.value
## High - Low 1.13 0.446 292 2.522 0.0122
Appendix C: Part 2
#Two sample t-test of Batting average and salary
t.test(LowSalary$BA, HighSalary$BA, var.equal = TRUE)
##
## Two Sample t-test
##
## data: LowSalary$BA and HighSalary$BA
## t = -2.2368, df = 292, p-value = 0.02605
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.016112375 -0.001029543
## sample estimates:
## mean of x mean of y
## 0.2543452 0.2629161
#Two sample t-test of RBI per 100 ab and salary
t.test(LowSalary$RBIper100, HighSalary$RBIper100, var.equal = TRUE)
## 
##  Two Sample t-test
## 
## data:  LowSalary$RBIper100 and HighSalary$RBIper100
## t = -2.5225, df = 292, p-value = 0.01219
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -2.0043759 -0.2474258
## sample estimates:
## mean of x mean of y 
##      12.1      13.2
#wilcox rank sum test of HR per 100 ab and salary
wilcox.test(LowSalary$HRper100,HighSalary$HRper100)
##
## Wilcoxon rank sum test with continuity correction
##
## data: LowSalary$HRper100 and HighSalary$HRper100
## W = 8464, p-value = 0.06403
## alternative hypothesis: true location shift is not equal to 0
Appendix C: Part 3
#make salary high and low into binary 0 and 1 so the binomial family GLM can be fit
StatsSimple<-StatsSimple %>%
mutate(SalaryBin= ifelse(salary > 1000000, 1, 0))
Salary1<- glm(SalaryBin~BA+ RBIper100+lgID.y, family = "binomial", data = StatsSimple)
#residual checks like those for a linear model are not strictly required for a binomial GLM; model fit is compared via AIC/BIC below
Salary2<- glm(SalaryBin~+lgID.y+BA* RBIper100, family = "binomial", data = StatsSimple)
Salary3<- glm(SalaryBin~lgID.y*BA*RBIper100, family ="binomial", data = StatsSimple)
Salary4<- glm(SalaryBin~BA*RBIper100, family ="binomial", data = StatsSimple)
Salary5<- glm(SalaryBin~BA+RBIper100, family ="binomial", data = StatsSimple)
summary(Salary1)
##
## Call:
## glm(formula = SalaryBin ~ BA + RBIper100 + lgID.y, family = "binomial",
## data = StatsSimple)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.8634 -1.3429 0.7801 0.9240 1.2751
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.94616 1.07515 -1.810 0.0703 .
## BA 7.13659 4.12260 1.731 0.0834 .
## RBIper100 0.07230 0.03539 2.043 0.0410 *
## lgID.yNL -0.32606 0.25083 -1.300 0.1936
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 378.29 on 293 degrees of freedom
## Residual deviance: 367.35 on 290 degrees of freedom
## AIC: 375.35
##
## Number of Fisher Scoring iterations: 4
summary(Salary2)
##
## Call:
## glm(formula = SalaryBin ~ +lgID.y + BA * RBIper100, family = "binomial",
## data = StatsSimple)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.9420 -1.3263 0.7779 0.9412 1.2264
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.3684 3.9834 0.092 0.926
## lgID.yNL -0.3281 0.2509 -1.308 0.191
## BA -1.8379 15.4405 -0.119 0.905
## RBIper100 -0.1167 0.3160 -0.369 0.712
## BA:RBIper100 0.7283 1.2118 0.601 0.548
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 378.29 on 293 degrees of freedom
## Residual deviance: 366.98 on 289 degrees of freedom
## AIC: 376.98
##
## Number of Fisher Scoring iterations: 4
summary(Salary3)
##
## Call:
## glm(formula = SalaryBin ~ lgID.y * BA * RBIper100, family = "binomial",
## data = StatsSimple)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.8233 -1.2904 0.7513 0.9443 1.2176
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.9341 6.1823 -0.636 0.525
## lgID.yNL 7.2636 8.1432 0.892 0.372
## BA 13.3980 23.8842 0.561 0.575
## RBIper100 0.2595 0.4777 0.543 0.587
## lgID.yNL:BA -27.0470 31.5451 -0.857 0.391
## lgID.yNL:RBIper100 -0.6901 0.6476 -1.066 0.287
## BA:RBIper100 -0.6058 1.8204 -0.333 0.739
## lgID.yNL:BA:RBIper100 2.4693 2.4796 0.996 0.319
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 378.29 on 293 degrees of freedom
## Residual deviance: 365.43 on 286 degrees of freedom
## AIC: 381.43
##
## Number of Fisher Scoring iterations: 4
summary(Salary4)
##
## Call:
## glm(formula = SalaryBin ~ BA * RBIper100, family = "binomial",
## data = StatsSimple)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.8614 -1.3073 0.8017 0.9579 1.1859
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.1690 3.9729 0.043 0.966
## BA -1.7284 15.4075 -0.112 0.911
## RBIper100 -0.1108 0.3153 -0.351 0.725
## BA:RBIper100 0.7063 1.2087 0.584 0.559
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 378.29 on 293 degrees of freedom
## Residual deviance: 368.70 on 290 degrees of freedom
## AIC: 376.7
##
## Number of Fisher Scoring iterations: 4
summary(Salary5)
##
## Call:
## glm(formula = SalaryBin ~ BA + RBIper100, family = "binomial",
## data = StatsSimple)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.7850 -1.3271 0.7973 0.9385 1.2044
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.07589 1.06653 -1.946 0.0516 .
## BA 6.97787 4.10406 1.700 0.0891 .
## RBIper100 0.07253 0.03538 2.050 0.0404 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 378.29 on 293 degrees of freedom
## Residual deviance: 369.04 on 291 degrees of freedom
## AIC: 375.04
##
## Number of Fisher Scoring iterations: 4
Anova(Salary1)
## Analysis of Deviance Table (Type II tests)
##
## Response: SalaryBin
## LR Chisq Df Pr(>Chisq)
## BA 3.0353 1 0.08147 .
## RBIper100 4.2430 1 0.03941 *
## lgID.y 1.6979 1 0.19257
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Anova(Salary2)
## Analysis of Deviance Table (Type II tests)
##
## Response: SalaryBin
## LR Chisq Df Pr(>Chisq)
## lgID.y 1.7181 1 0.18994
## BA 3.0353 1 0.08147 .
## RBIper100 4.2430 1 0.03941 *
## BA:RBIper100 0.3671 1 0.54458
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Anova(Salary3)
## Analysis of Deviance Table (Type II tests)
##
## Response: SalaryBin
## LR Chisq Df Pr(>Chisq)
## lgID.y 1.7181 1 0.18994
## BA 2.9996 1 0.08329 .
## RBIper100 4.2257 1 0.03982 *
## lgID.y:BA 0.1564 1 0.69250
## lgID.y:RBIper100 0.4758 1 0.49032
## BA:RBIper100 0.4054 1 0.52430
## lgID.y:BA:RBIper100 0.9926 1 0.31911
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Anova(Salary4)
## Analysis of Deviance Table (Type II tests)
##
## Response: SalaryBin
## LR Chisq Df Pr(>Chisq)
## BA 2.9282 1 0.08704 .
## RBIper100 4.2707 1 0.03878 *
## BA:RBIper100 0.3469 1 0.55588
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Anova(Salary5)
## Analysis of Deviance Table (Type II tests)
##
## Response: SalaryBin
## LR Chisq Df Pr(>Chisq)
## BA 2.9282 1 0.08704 .
## RBIper100 4.2707 1 0.03878 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
AIC(Salary1,Salary2, Salary3,Salary4, Salary5)
## df AIC
## Salary1 4 375.3464
## Salary2 5 376.9793
## Salary3 8 381.4339
## Salary4 4 376.6974
## Salary5 3 375.0443
BIC(Salary1,Salary2, Salary3, Salary4, Salary5)
## df BIC
## Salary1 4 390.0807
## Salary2 5 395.3972
## Salary3 8 410.9025
## Salary4 4 391.4317
## Salary5 3 386.0950
ggplot(StatsSimple, aes(x=BA, y=SalaryBin, color= lgID.y))+
geom_point()+
geom_smooth(method = "glm", method.args= list(family="binomial"))
## `geom_smooth()` using formula 'y ~ x'
#observations represent a random sample: data picked from 2016, filtered to only those with over 200 at bats (aka starters), no biases in data selection
#co-linearity between predictors: as shown in the pairs.panels plot the r value between HR and RBI is 0.8, so I removed HR
#there does appear to be a linear relationship between RBIper100 and salary, as well as a slight linear relationship between BA and salary
#create the candidate models and compare their fit
Appendix C: Part 4
#predict data to see how well the model fits
splitter1<- sample(1:nrow(StatsSimple), 30, replace = F)
Salary_train<- StatsSimple[-splitter1,]
Salary_test<- StatsSimple[splitter1,]
model_all_train<- glm(SalaryBin~BA+RBIper100, family ="binomial", data = Salary_train) # fit on the 264 training rows only, not the full data
Salary_prediction<- predict(model_all_train, Salary_test, type = "response") # predicted probabilities of the high salary group
plot(Salary_test$SalaryBin, pch=1) # observed groups (0/1)
points(Salary_prediction, pch=20, col="red") # model predictions
Appendix D
# again I will use StatsSimple because it contains only the variables I am looking to analyze.
# split the data into high and low salary players, high >1,000,000$/yr low <= 1,000,000$/yr
StatsSimple<- StatsSimple %>%
mutate(SalaryCaliber = ifelse(salary > 1000000, "High", "Low")) %>%
mutate(logHR= log(HRper100+0.01))
par(mfrow= c(2,3))
boxplot(HRper100~SalaryCaliber, data= StatsSimple, col="cyan")
boxplot(RBIper100~SalaryCaliber, data= StatsSimple, col="cyan")
boxplot(BA~SalaryCaliber, data= StatsSimple, col="cyan")
#boxplot of HR,RBI,BA by league
boxplot(HRper100~lgID.y, data= StatsSimple, col="cornflowerblue", xlab= "League")
boxplot(RBIper100~lgID.y, data= StatsSimple, col="cornflowerblue", xlab= "League")
boxplot(BA~lgID.y, data= StatsSimple, col="cornflowerblue", xlab= "League")
# The low salary group does not look normally distributed, but this is because it covers a much
# smaller range on the scale compared to the high bin. When subsetting low onto its own axis you
# can see its spread is not terrible, although it does require some explaining. The base salary
# is $500,000, so there are no data values below that, and the majority of "low" salary players
# are on the base salary contract; that is why the distribution is so heavily centered close to
# $500,000 with some outliers toward 1 million.
LowSalary<- StatsSimple %>%
filter(SalaryCaliber== "Low")
HighSalary<- StatsSimple %>%
filter(SalaryCaliber== "High")
NL<- StatsSimple %>%
filter(lgID.y== "NL")
AL<- StatsSimple %>%
filter(lgID.y== "AL")
#note: the explanatory salary data does not need to be normally distributed for these models; only the residuals do
#Check for normality of all three variables in both categories that I will be testing
#league BA
qqPlot(NL$BA, main = "QQ-Plot of National League Batting Avg", ylab = "BA", col = "cornflowerblue")
## [1] 79 89
qqPlot(AL$BA, main = "QQ-Plot of American League Batting Avg", ylab = "BA", col = "cornflowerblue")
## [1] 54 118
#league- RBI
qqPlot(NL$RBI, main = "QQ-Plot of National League RBI", ylab = "RBI", col = "cornflowerblue")
## [1] 69 16
qqPlot(AL$RBI, main = "QQ-Plot of American League RBI", ylab = "RBI", col = "cornflowerblue")
## [1] 100 14
#league-HR
qqPlot(NL$HR, main = "QQ-Plot of National League HR", ylab = "HR", col = "cornflowerblue")
## [1] 56 20
qqPlot(AL$HR, main = "QQ-Plot of American League HR", ylab = "HR", col = "cornflowerblue")
## [1] 137 33
#Check normality of each with shapiro tests
#create linear models of BA,HR,RBI with league to check residuals for normality
BAleague<- lm(BA~lgID.y, data = StatsSimple)
ResBAleague<-BAleague$residuals
RBIleague<- lm(RBIper100~lgID.y, data = StatsSimple)
ResRBIleague<-RBIleague$residuals
HRleague<- lm(HRper100~lgID.y, data = StatsSimple)
ResHRleague<-HRleague$residuals
#check normality of league residuals
shapiro.test(ResBAleague)
##
## Shapiro-Wilk normality test
##
## data: ResBAleague
## W = 0.99619, p-value = 0.7023
shapiro.test(ResRBIleague)
##
## Shapiro-Wilk normality test
##
## data: ResRBIleague
## W = 0.99326, p-value = 0.2095
shapiro.test(ResHRleague) #not normal
##
## Shapiro-Wilk normality test
##
## data: ResHRleague
## W = 0.98376, p-value = 0.002062
# now test for variance of the normally distributed BA and RBI data variables
#BA
leveneTest(StatsSimple$BA~StatsSimple$SalaryCaliber)
## Warning in leveneTest.default(y = y, group = group, ...): group coerced to
## factor.
## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 1 0.7686 0.3814
## 292
leveneTest(StatsSimple$BA~StatsSimple$lgID.y)
## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 1 0.2382 0.6259
## 292
#RBI
leveneTest(StatsSimple$RBIper100~StatsSimple$SalaryCaliber)
## Warning in leveneTest.default(y = y, group = group, ...): group coerced to
## factor.
## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 1 0.6541 0.4193
## 292
leveneTest(StatsSimple$HRper100~StatsSimple$lgID.y)
## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 1 0.4656 0.4955
## 292
#levene tests showed that all normally distributed variables also have equal variances
#can run two sample t-tests on both BA and RBIper100
#two sample t-test of equal variance
#league
t.test(NL$BA, AL$BA, var.equal = TRUE) #cant reject Ho
##
## Two Sample t-test
##
## data: NL$BA and AL$BA
## t = 0.2461, df = 292, p-value = 0.8058
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.006319989 0.008126395
## sample estimates:
## mean of x mean of y
## 0.2604202 0.2595170
t.test(NL$RBIper100, AL$RBIper100, var.equal = TRUE) #cant reject Ho
##
## Two Sample t-test
##
## data: NL$RBIper100 and AL$RBIper100
## t = -0.04456, df = 292, p-value = 0.9645
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.8625100 0.8243184
## sample estimates:
## mean of x mean of y
## 12.83402 12.85311
#non parametric test comparing HR
#cant reject Ho
wilcox.test(NL$HRper100,AL$HRper100) #cant reject Ho
##
## Wilcoxon rank sum test with continuity correction
##
## data: NL$HRper100 and AL$HRper100
## W = 10008, p-value = 0.2751
## alternative hypothesis: true location shift is not equal to 0