# data and some code pulled from BasketballAnalyzeR package: https://bodai.unibs.it/bdsports/basketballanalyzer/
#data(package="BasketballAnalyzeR")
# NCAA data pulled from: https://www.kaggle.com/datasets/andrewsundberg/college-basketball-dataset
# using 2019 season for completeness
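# packages used below (assumed; the report's setup chunk is not shown)
library(tidyverse); library(knitr); library(psych); library(BasketballAnalyzeR)
library(sjPlot); library(aod); library(ggrepel)
cbb <- read.csv("~/Dropbox/AltAc/Freelancing/BasketballDataScience/MBB19/rawdat/cbb/cbb19.csv")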
This report is my own analysis using a combination of BasketballAnalyzeR and other packages to analyze performance in the 2019 NCAA MBB season and tournament.
In Basketball Data Science, chapter 6 reviews aggregating game box scores at the team, opponent, and player level using the tidyverse. This 2019 CBB data is already aggregated within season at the team level, so I’ll be focusing on team-level analyses and visualizations.
TEAM: The Division I college basketball school
CONF: The Athletic Conference in which the school participates (A10 = Atlantic 10, ACC = Atlantic Coast Conference, AE = America East, Amer = American, ASun = ASUN, B10 = Big Ten, B12 = Big 12, BE = Big East, BSky = Big Sky, BSth = Big South, BW = Big West, CAA = Colonial Athletic Association, CUSA = Conference USA, Horz = Horizon League, Ivy = Ivy League, MAAC = Metro Atlantic Athletic Conference, MAC = Mid-American Conference, MEAC = Mid-Eastern Athletic Conference, MVC = Missouri Valley Conference, MWC = Mountain West, NEC = Northeast Conference, OVC = Ohio Valley Conference, P12 = Pac-12, Pat = Patriot League, SB = Sun Belt, SC = Southern Conference, SEC = Southeastern Conference, Slnd = Southland Conference, Sum = Summit League, SWAC = Southwestern Athletic Conference, WAC = Western Athletic Conference, WCC = West Coast Conference)
G: Number of games played
W: Number of games won
ADJOE: Adjusted Offensive Efficiency (An estimate of the offensive efficiency (points scored per 100 possessions) a team would have against the average Division I defense)
ADJDE: Adjusted Defensive Efficiency (An estimate of the defensive efficiency (points allowed per 100 possessions) a team would have against the average Division I offense)
BARTHAG: Power Rating (Chance of beating an average Division I team)
EFG_O: Effective Field Goal Percentage Shot
EFG_D: Effective Field Goal Percentage Allowed
TOR: Turnover Percentage Allowed (Turnover Rate)
TORD: Turnover Percentage Committed (Steal Rate)
ORB: Offensive Rebound Rate
DRB: Offensive Rebound Rate Allowed
FTR: Free Throw Rate (How often the given team shoots Free Throws)
FTRD: Free Throw Rate Allowed
2P_O: Two-Point Shooting Percentage
2P_D: Two-Point Shooting Percentage Allowed
3P_O: Three-Point Shooting Percentage
3P_D: Three-Point Shooting Percentage Allowed
ADJ_T: Adjusted Tempo (An estimate of the tempo (possessions per 40 minutes) a team would have against the team that wants to play at an average Division I tempo)
WAB: Wins Above Bubble (The bubble refers to the cut off between making the NCAA March Madness Tournament and not making it)
POSTSEASON: Round where the given team was eliminated or where their season ended (R68 = First Four, R64 = Round of 64, R32 = Round of 32, S16 = Sweet Sixteen, E8 = Elite Eight, F4 = Final Four, 2ND = Runner-up, Champions = Winner of the NCAA March Madness Tournament for that given year)
SEED: Seed in the NCAA March Madness Tournament
cbb<-mutate(cbb, Qual = ifelse(SEED <= 16, "Yes", "No"))
cbb$Qual<-cbb$Qual %>% replace_na("No")
kable(table(cbb$Qual))
| Var1 | Freq |
|---|---|
| No | 285 |
| Yes | 68 |
tourney<-subset(cbb, Qual=="Yes")
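The qualification flag relies on `SEED` being `NA` for teams that missed the tournament. A minimal base R sketch of the same idea, on a made-up data frame (the `toy` rows are hypothetical, not taken from the dataset):

```r
# Toy data: hypothetical rows, not taken from cbb19.csv
toy <- data.frame(TEAM = c("A", "B", "C", "D"),
                  SEED = c(1, NA, 16, NA))

# A team qualified if and only if it has a seed; is.na() handles the
# missing values directly, so no separate replace step is needed
toy$Qual <- ifelse(is.na(toy$SEED), "No", "Yes")

table(toy$Qual)
```

This gives the same flag as the two-step `ifelse()` plus `replace_na()` approach, provided every seeded team has a seed of 16 or lower.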
# subset to numeric vars and describe
numVARS<-c("G",
"W",
"ADJOE",
"ADJDE",
"BARTHAG",
"EFG_O",
"EFG_D",
"TOR",
"TORD",
"ORB",
"DRB",
"FTR",
"FTRD",
"X2P_O",
"X2P_D",
"X3P_O",
"X3P_D",
"ADJ_T",
"WAB"
)
numITEMS<-cbb[numVARS]
kable(describe(numITEMS),
format='markdown',
digits=2)
| | vars | n | mean | sd | median | trimmed | mad | min | max | range | skew | kurtosis | se |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| G | 1 | 353 | 31.75 | 2.51 | 31.00 | 31.64 | 2.97 | 26.00 | 39.00 | 13.00 | 0.34 | -0.02 | 0.13 |
| W | 2 | 353 | 17.11 | 6.37 | 17.00 | 16.91 | 7.41 | 3.00 | 35.00 | 32.00 | 0.26 | -0.35 | 0.34 |
| ADJOE | 3 | 353 | 103.34 | 7.02 | 103.10 | 103.26 | 6.82 | 83.70 | 123.40 | 39.70 | 0.12 | 0.21 | 0.37 |
| ADJDE | 4 | 353 | 103.34 | 6.45 | 104.00 | 103.46 | 7.12 | 85.20 | 119.20 | 34.00 | -0.17 | -0.34 | 0.34 |
| BARTHAG | 5 | 353 | 0.49 | 0.25 | 0.48 | 0.49 | 0.30 | 0.03 | 0.97 | 0.94 | 0.16 | -1.10 | 0.01 |
| EFG_O | 6 | 353 | 50.60 | 2.94 | 50.50 | 50.65 | 2.97 | 40.00 | 59.00 | 19.00 | -0.21 | 0.37 | 0.16 |
| EFG_D | 7 | 353 | 50.77 | 2.75 | 50.90 | 50.79 | 2.82 | 42.50 | 59.30 | 16.80 | -0.08 | 0.16 | 0.15 |
| TOR | 8 | 353 | 18.61 | 2.07 | 18.50 | 18.52 | 1.93 | 13.50 | 25.10 | 11.60 | 0.34 | 0.32 | 0.11 |
| TORD | 9 | 353 | 18.52 | 2.09 | 18.30 | 18.46 | 2.08 | 13.30 | 24.70 | 11.40 | 0.35 | -0.01 | 0.11 |
| ORB | 10 | 353 | 28.25 | 3.94 | 28.30 | 28.21 | 4.00 | 15.90 | 38.70 | 22.80 | 0.04 | -0.16 | 0.21 |
| DRB | 11 | 353 | 28.42 | 2.92 | 28.30 | 28.33 | 2.97 | 21.70 | 37.10 | 15.40 | 0.28 | -0.32 | 0.16 |
| FTR | 12 | 353 | 32.95 | 4.71 | 33.30 | 32.96 | 4.45 | 21.90 | 48.10 | 26.20 | 0.03 | -0.17 | 0.25 |
| FTRD | 13 | 353 | 33.20 | 5.08 | 32.70 | 32.95 | 4.89 | 21.80 | 54.00 | 32.20 | 0.63 | 0.85 | 0.27 |
| X2P_O | 14 | 353 | 50.06 | 3.36 | 50.30 | 50.06 | 3.26 | 37.70 | 61.40 | 23.70 | -0.12 | 0.70 | 0.18 |
| X2P_D | 15 | 353 | 50.23 | 3.12 | 50.20 | 50.23 | 2.97 | 40.70 | 61.20 | 20.50 | 0.03 | 0.26 | 0.17 |
| X3P_O | 16 | 353 | 34.29 | 2.54 | 34.20 | 34.27 | 2.67 | 27.90 | 42.40 | 14.50 | 0.09 | -0.09 | 0.14 |
| X3P_D | 17 | 353 | 34.42 | 2.34 | 34.40 | 34.44 | 2.22 | 27.90 | 41.80 | 13.90 | -0.07 | 0.00 | 0.12 |
| ADJ_T | 18 | 353 | 69.17 | 2.69 | 69.00 | 69.08 | 2.52 | 60.70 | 79.10 | 18.40 | 0.41 | 0.73 | 0.14 |
| WAB | 19 | 353 | -7.78 | 7.12 | -8.60 | -8.11 | 7.41 | -23.40 | 11.20 | 34.60 | 0.39 | -0.32 | 0.38 |
tnumITEMS<-tourney[numVARS]
kable(describe(tnumITEMS),
format='markdown',
digits=2)
| | vars | n | mean | sd | median | trimmed | mad | min | max | range | skew | kurtosis | se |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| G | 1 | 68 | 34.51 | 2.02 | 34.00 | 34.50 | 1.48 | 29.00 | 39.00 | 10.00 | 0.03 | 0.08 | 0.24 |
| W | 2 | 68 | 25.34 | 4.25 | 25.00 | 25.21 | 4.45 | 17.00 | 35.00 | 18.00 | 0.19 | -0.92 | 0.52 |
| ADJOE | 3 | 68 | 111.31 | 5.92 | 110.90 | 111.31 | 5.49 | 98.00 | 123.40 | 25.40 | 0.06 | -0.43 | 0.72 |
| ADJDE | 4 | 68 | 96.47 | 5.63 | 96.30 | 96.20 | 4.74 | 85.20 | 110.30 | 25.10 | 0.41 | -0.12 | 0.68 |
| BARTHAG | 5 | 68 | 0.79 | 0.17 | 0.85 | 0.82 | 0.11 | 0.24 | 0.97 | 0.73 | -1.32 | 1.03 | 0.02 |
| EFG_O | 6 | 68 | 52.69 | 2.52 | 52.95 | 52.69 | 2.59 | 46.70 | 59.00 | 12.30 | -0.03 | -0.33 | 0.31 |
| EFG_D | 7 | 68 | 48.18 | 2.45 | 48.45 | 48.24 | 2.52 | 42.50 | 53.60 | 11.10 | -0.28 | -0.47 | 0.30 |
| TOR | 8 | 68 | 17.42 | 1.57 | 17.40 | 17.43 | 1.56 | 13.90 | 22.70 | 8.80 | 0.22 | 0.74 | 0.19 |
| TORD | 9 | 68 | 19.09 | 2.35 | 18.70 | 19.01 | 2.08 | 14.10 | 24.70 | 10.60 | 0.41 | -0.02 | 0.29 |
| ORB | 10 | 68 | 30.06 | 4.00 | 30.20 | 30.13 | 4.00 | 20.70 | 37.90 | 17.20 | -0.15 | -0.45 | 0.48 |
| DRB | 11 | 68 | 27.70 | 2.97 | 27.00 | 27.56 | 3.19 | 22.20 | 34.90 | 12.70 | 0.40 | -0.34 | 0.36 |
| FTR | 12 | 68 | 33.74 | 4.22 | 33.40 | 33.39 | 3.71 | 26.70 | 45.30 | 18.60 | 0.70 | 0.10 | 0.51 |
| FTRD | 13 | 68 | 31.36 | 4.43 | 31.75 | 31.22 | 5.26 | 23.10 | 46.20 | 23.10 | 0.46 | 0.31 | 0.54 |
| X2P_O | 14 | 68 | 52.26 | 2.95 | 51.95 | 52.19 | 2.52 | 44.10 | 61.40 | 17.30 | 0.31 | 0.85 | 0.36 |
| X2P_D | 15 | 68 | 47.51 | 2.97 | 47.85 | 47.53 | 3.04 | 40.70 | 53.70 | 13.00 | -0.09 | -0.60 | 0.36 |
| X3P_O | 16 | 68 | 35.55 | 2.34 | 35.75 | 35.57 | 2.37 | 30.40 | 41.40 | 11.00 | -0.06 | -0.22 | 0.28 |
| X3P_D | 17 | 68 | 32.83 | 2.06 | 33.25 | 32.90 | 1.70 | 27.90 | 37.40 | 9.50 | -0.35 | -0.13 | 0.25 |
| ADJ_T | 18 | 68 | 68.46 | 2.82 | 68.00 | 68.41 | 2.52 | 60.70 | 76.00 | 15.30 | 0.16 | 0.09 | 0.34 |
| WAB | 19 | 68 | 1.86 | 5.23 | 1.90 | 2.12 | 4.67 | -10.90 | 11.20 | 22.10 | -0.38 | -0.20 | 0.63 |
The first table gives descriptives for all 353 teams in the 2019 MBB season; the second gives the same statistics for the 68 teams that qualified for the tournament.
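The `se` column in both tables is the standard error of the mean, `sd / sqrt(n)`. As a quick check, the rounded values from the wins (`W`) rows reproduce it:

```r
# Values copied from the W rows of the two tables (rounded to 2 digits)
sd_all <- 6.37; n_all <- 353   # all Division I teams
sd_trn <- 4.25; n_trn <- 68    # tournament teams

# Standard error of the mean
se_all <- sd_all / sqrt(n_all)
se_trn <- sd_trn / sqrt(n_trn)

round(c(all = se_all, tournament = se_trn), 2)
```

The tournament subset has the smaller standard deviation but the larger standard error, because it contains far fewer teams.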
ggplot(cbb, aes(x=W))+
geom_histogram(color="#FFFFFF", fill="#003C80")+
scale_x_continuous(breaks = seq(0, 40, by = 10))+
scale_y_continuous(breaks = seq(0, 90, len = 10))+
labs(title="2019 MBB Wins",x="Wins", y = "Count")+
theme_minimal()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
kable(table(cbb$CONF))
| Var1 | Freq |
|---|---|
| A10 | 14 |
| ACC | 15 |
| AE | 9 |
| Amer | 12 |
| ASun | 8 |
| B10 | 14 |
| B12 | 10 |
| BE | 10 |
| BSky | 12 |
| BSth | 12 |
| BW | 9 |
| CAA | 10 |
| CUSA | 14 |
| Horz | 10 |
| Ivy | 8 |
| MAAC | 11 |
| MAC | 12 |
| MEAC | 12 |
| MVC | 10 |
| MWC | 11 |
| NEC | 10 |
| OVC | 12 |
| P12 | 12 |
| Pat | 10 |
| SB | 12 |
| SC | 10 |
| SEC | 14 |
| Slnd | 13 |
| Sum | 8 |
| SWAC | 10 |
| WAC | 9 |
| WCC | 10 |
p<-ggplot(cbb, aes(x=CONF, y=W, fill=CONF)) +
geom_boxplot()+
theme_bw()
p+theme(legend.position = "bottom")
p<-ggplot(cbb, aes(x=CONF, y=W, fill=CONF)) +
geom_boxplot()+
facet_wrap(~Qual)+
theme_bw()
p+theme(legend.position = "bottom")
ggplot(tourney, aes(x = fct_infreq(CONF)))+
geom_bar(color="#FFFFFF", fill="#CCA600")+
labs(title="2019 Tournament by Conference",x="Conference", y = "Count")+
theme_minimal()+
theme(legend.position = "bottom", axis.text.x = element_text(angle=60, size=7, hjust = 1))
Power 5 conferences were well represented, but smaller conferences, like the American and Big East, also performed well this year.
numVARS<-c("W",
"ADJOE",
"ADJDE",
"BARTHAG",
"EFG_O",
"EFG_D",
"TOR",
"TORD",
"ORB",
"DRB",
"FTR",
"FTRD",
"X2P_O",
"X2P_D",
"X3P_O",
"X3P_D",
"ADJ_T")
numITEMS<-cbb[numVARS]
scatterplot(numITEMS, data.var =1:17, diag = list(continuous="blankDiag"))
numtVARS<-c("W",
"ADJOE",
"ADJDE",
"BARTHAG",
"EFG_O",
"EFG_D",
"TOR",
"TORD",
"ORB",
"DRB",
"FTR",
"FTRD",
"X2P_O",
"X2P_D",
"X3P_O",
"X3P_D",
"ADJ_T",
"Qual"
)
numtITEMS<-cbb[numtVARS]
scatterplot(numtITEMS, data.var =1:17, z.var="Qual", diag = list(continuous="blankDiag"))
Adjusted Offensive Efficiency (ADJOE) and Adjusted Defensive Efficiency (ADJDE) were both strongly correlated with total wins. These variables index points scored and points allowed per 100 possessions, respectively.
p<-ggplot(cbb, aes(x=Qual, y=ADJOE, fill=Qual)) +
geom_boxplot()+
labs(title="ADJOE by Qualifying Status")+
theme_bw()
p+theme(legend.position = "bottom")
t.test(ADJOE ~ Qual, data = cbb)
##
## Welch Two Sample t-test
##
## data: ADJOE by Qual
## t = -12.401, df = 100.3, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group No and group Yes is not equal to 0
## 95 percent confidence interval:
## -11.457124 -8.296797
## sample estimates:
## mean in group No mean in group Yes
## 101.4333 111.3103
p<-ggplot(cbb, aes(x=Qual, y=ADJDE, fill=Qual)) +
geom_boxplot()+
labs(title="ADJDE by Qualifying Status")+
theme_bw()
p+theme(legend.position = "bottom")
t.test(ADJDE ~ Qual, data = cbb)
##
## Welch Two Sample t-test
##
## data: ADJDE by Qual
## t = 11.249, df = 99.724, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group No and group Yes is not equal to 0
## 95 percent confidence interval:
## 7.002421 10.001531
## sample estimates:
## mean in group No mean in group Yes
## 104.97404 96.47206
There were fairly substantial differences in average offensive and defensive efficiency between teams that did and did not qualify for the tournament.
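`t.test()` defaults to Welch's test, which does not assume equal variances; that is why the degrees of freedom in the output are fractional. A sketch of the calculation on simulated data follows. The group means, standard deviations, and sizes are illustrative assumptions loosely based on the output, not the real data, and Cohen's d is added only as a rough effect size:

```r
set.seed(1)
# Simulated efficiency values; parameters are assumptions for illustration
no  <- rnorm(285, mean = 101.4, sd = 6)
yes <- rnorm(68,  mean = 111.3, sd = 6)

# Welch t statistic: mean difference over the unpooled standard error
v_no  <- var(no)  / length(no)
v_yes <- var(yes) / length(yes)
t_welch <- (mean(no) - mean(yes)) / sqrt(v_no + v_yes)

# Welch-Satterthwaite degrees of freedom
df_welch <- (v_no + v_yes)^2 /
  (v_no^2 / (length(no) - 1) + v_yes^2 / (length(yes) - 1))

# Cohen's d with a pooled SD, as a rough effect size
sd_pool <- sqrt(((length(no) - 1) * var(no) + (length(yes) - 1) * var(yes)) /
                  (length(no) + length(yes) - 2))
d <- (mean(yes) - mean(no)) / sd_pool

res <- t.test(no, yes)  # built-in Welch test for comparison
c(t = t_welch, df = df_welch, d = d)
```

The manual `t_welch` and `df_welch` should agree with `res$statistic` and `res$parameter`.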
cbb$QualFac <- ifelse(cbb$Qual == "Yes", 1, 0)  # 1 = qualified, 0 = did not qualify
quallogit <- glm(QualFac ~ ADJOE + ADJDE + TOR + TORD + ORB + DRB + ADJ_T, data = cbb, family = "binomial")
summary(quallogit)
##
## Call:
## glm(formula = QualFac ~ ADJOE + ADJDE + TOR + TORD + ORB + DRB +
## ADJ_T, family = "binomial", data = cbb)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.9912 -0.3911 -0.1649 -0.0545 3.1703
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -6.01010 10.46199 -0.574 0.56565
## ADJOE 0.25740 0.05528 4.656 3.22e-06 ***
## ADJDE -0.14201 0.04475 -3.174 0.00151 **
## TOR -0.11747 0.14427 -0.814 0.41551
## TORD 0.20635 0.11644 1.772 0.07635 .
## ORB 0.01531 0.06082 0.252 0.80126
## DRB -0.05702 0.07724 -0.738 0.46038
## ADJ_T -0.13250 0.07948 -1.667 0.09550 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 345.95 on 352 degrees of freedom
## Residual deviance: 182.11 on 345 degrees of freedom
## AIC: 198.11
##
## Number of Fisher Scoring iterations: 6
confint(quallogit)
## 2.5 % 97.5 %
## (Intercept) -26.92777883 14.29234233
## ADJOE 0.15467451 0.37261806
## ADJDE -0.23318617 -0.05684055
## TOR -0.40486518 0.16320295
## TORD -0.02016965 0.43876062
## ORB -0.10420604 0.13548867
## DRB -0.21114406 0.09343618
## ADJ_T -0.29282576 0.02029582
tab_model(quallogit)
| | Qual Fac | | |
|---|---|---|---|
| Predictors | Odds Ratios | CI | p |
| (Intercept) | 0.00 | 0.00 – 1610962.41 | 0.566 |
| ADJOE | 1.29 | 1.17 – 1.45 | <0.001 |
| ADJDE | 0.87 | 0.79 – 0.94 | 0.002 |
| TOR | 0.89 | 0.67 – 1.18 | 0.416 |
| TORD | 1.23 | 0.98 – 1.55 | 0.076 |
| ORB | 1.02 | 0.90 – 1.15 | 0.801 |
| DRB | 0.94 | 0.81 – 1.10 | 0.460 |
| ADJ T | 0.88 | 0.75 – 1.02 | 0.095 |
| Observations | 353 | ||
| R2 Tjur | 0.500 | ||
wald.test(b = coef(quallogit), Sigma = vcov(quallogit), Terms = 4:8)
## Wald test:
## ----------
##
## Chi-squared test:
## X2 = 6.9, df = 5, P(> X2) = 0.23
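The Wald test above asks whether coefficients 4 through 8 (TOR, TORD, ORB, DRB, ADJ_T) are jointly zero; with p = 0.23, there is little evidence that they add to the two efficiency measures. The statistic is a quadratic form in the selected coefficients. A base R sketch on simulated data, where the variables `x1` to `x3` and the data are made up for illustration:

```r
set.seed(2)
# Simulated data: only x1 truly affects the outcome
n  <- 300
x1 <- rnorm(n); x2 <- rnorm(n); x3 <- rnorm(n)
y  <- rbinom(n, 1, plogis(-0.5 + 1.2 * x1))
fit <- glm(y ~ x1 + x2 + x3, family = binomial)

# Jointly test the coefficients in positions 3 and 4 (x2 and x3)
idx <- 3:4
b   <- coef(fit)[idx]
V   <- vcov(fit)[idx, idx]

# Wald chi-squared: b' V^{-1} b, with df = number of tested terms
X2 <- as.numeric(t(b) %*% solve(V) %*% b)
p  <- pchisq(X2, df = length(idx), lower.tail = FALSE)
c(X2 = X2, df = length(idx), p = p)
```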
exp(coef(quallogit))
## (Intercept) ADJOE ADJDE TOR TORD ORB
## 0.00245384 1.29356306 0.86761547 0.88917003 1.22918772 1.01542722
## DRB ADJ_T
## 0.94457678 0.87589994
exp(cbind(OR = coef(quallogit), confint(quallogit)))
## OR 2.5 % 97.5 %
## (Intercept) 0.00245384 2.020292e-12 1.610962e+06
## ADJOE 1.29356306 1.167278e+00 1.451530e+00
## ADJDE 0.86761547 7.920061e-01 9.447447e-01
## TOR 0.88917003 6.670667e-01 1.177276e+00
## TORD 1.22918772 9.800324e-01 1.550784e+00
## ORB 1.01542722 9.010396e-01 1.145096e+00
## DRB 0.94457678 8.096574e-01 1.097941e+00
## ADJ_T 0.87589994 7.461521e-01 1.020503e+00
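Because the model is on the log-odds scale, exponentiating a coefficient gives the multiplicative change in the odds of qualifying per one-unit increase in that predictor, holding the others fixed. A small worked sketch using the ADJOE and ADJDE estimates printed above (the 5-point change is an arbitrary illustration):

```r
# Estimates copied from summary(quallogit)
b_adjoe <-  0.25740
b_adjde <- -0.14201

# Odds ratio for a one-point change
or_adjoe_1 <- exp(b_adjoe)   # about 1.29: odds rise ~29% per ADJOE point
or_adjde_1 <- exp(b_adjde)   # about 0.87: odds fall ~13% per ADJDE point

# Odds ratios compound multiplicatively, so a 5-point change is exp(5 * b)
or_adjoe_5 <- exp(5 * b_adjoe)
or_adjde_5 <- exp(5 * b_adjde)

round(c(or_adjoe_1, or_adjde_1, or_adjoe_5, or_adjde_5), 2)
```

Note that these are changes in odds, not in probability; the change in probability depends on where a team starts on the curve.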
labs <- c("Free Throw Rate","Two-Point Shooting Percentage","Three-Point Shooting Percentage")
barline(data=tourney, id="TEAM", bars=c("FTR","X2P_O","X3P_O"),
line="W", order.by="SEED", labels.bars=labs)
This bar-line plot displays three offensive statistics (free throw rate, two-point shooting percentage, and three-point shooting percentage) for qualifying teams in the 2019 NCAA MBB tournament. The bars are ordered by seed, and the line plots the total number of wins.
require("ggrepel")
p <- ggplot(tourney, aes(x=ADJ_T, y=SEED)) +
geom_point(color = 'red') +
theme_classic(base_size = 10)
p + geom_label_repel(aes(label = TEAM,
fill = factor(CONF)), color = 'white',
size = 3.5) +
theme(legend.position = "bottom")
p <- ggplot(tourney, aes(x=ADJOE, y=SEED)) +
geom_point(color = 'red') +
theme_classic(base_size = 10)
p + geom_label_repel(aes(label = TEAM,
fill = factor(CONF)), color = 'white',
size = 3.5) +
theme(legend.position = "bottom")
p <- ggplot(tourney, aes(x=ADJDE, y=SEED)) +
geom_point(color = 'red') +
theme_classic(base_size = 10)
p + geom_label_repel(aes(label = TEAM,
fill = factor(CONF)), color = 'white',
size = 3.5) +
theme(legend.position = "bottom")
These plots show the relationships between tournament seed and three adjusted metrics defined above: Adjusted Tempo (ADJ_T, possessions per 40 minutes), Adjusted Offensive Efficiency (ADJOE, points scored per 100 possessions), and Adjusted Defensive Efficiency (ADJDE, points allowed per 100 possessions).
Tempo shows little relationship with seed. Offensive and defensive efficiency, however, both appear strongly related to seed: more points scored corresponds to a lower (better) seed number, and more points allowed corresponds to a higher (worse) seed number.
require("ggrepel")
tourney$TFac<-ordered(tourney$POSTSEASON, levels = c("R68", "R64", "R32", "S16", "E8", "F4", "2ND", "Champions"))
p <- ggplot(tourney, aes(x=ADJ_T, y=W)) +
geom_point(color = 'red') +
theme_classic(base_size = 10)
p + geom_label_repel(aes(label = TEAM,
fill = factor(CONF)), color = 'white',
size = 3.5) +
theme(legend.position = "bottom")
p <- ggplot(tourney, aes(x=ADJOE, y=W)) +
geom_point(color = 'red') +
theme_classic(base_size = 10)
p + geom_label_repel(aes(label = TEAM,
fill = factor(CONF)), color = 'white',
size = 3.5) +
theme(legend.position = "bottom")
p <- ggplot(tourney, aes(x=ADJDE, y=W)) +
geom_point(color = 'red') +
theme_classic(base_size = 10)
p + geom_label_repel(aes(label = TEAM,
fill = factor(CONF)), color = 'white',
size = 3.5) +
theme(legend.position = "bottom")
ggplot(tourney, aes(x=ADJ_T, y=W)) +
geom_point(color = 'red') +
theme_classic(base_size = 10)+
geom_label_repel(aes(label = TEAM,
fill = factor(TFac)), color = 'white',
size = 3.5) +
labs(x = "Tempo", y = "Wins")+
theme(legend.position = "bottom")+
guides(fill=guide_legend(title="Post Season Finish"))
ggplot(tourney, aes(x=ADJOE, y=W)) +
geom_point(color = 'red') +
theme_classic(base_size = 10)+
geom_label_repel(aes(label = TEAM,
fill = factor(TFac)), color = 'white',
size = 3.5) +
labs(x = "Offensive Efficiency", y = "Wins")+
theme(legend.position = "bottom")+
guides(fill=guide_legend(title="Post Season Finish"))
ggplot(tourney, aes(x=ADJDE, y=W)) +
geom_point(color = 'red') +
theme_classic(base_size = 10)+
geom_label_repel(aes(label = TEAM,
fill = factor(TFac)), color = 'white',
size = 3.5) +
labs(x = "Defensive Efficiency", y = "Wins")+
theme(legend.position = "bottom")+
guides(fill=guide_legend(title="Post Season Finish"))
These plots show the relationships between total season wins and the same three adjusted metrics: Adjusted Tempo (ADJ_T), Adjusted Offensive Efficiency (ADJOE), and Adjusted Defensive Efficiency (ADJDE).
Tempo shows little relationship with wins. Consistent with the seed scatterplots above, however, offensive and defensive efficiency both appear strongly related to wins: more points scored corresponds to more wins (and therefore going farther in the tourney), and more points allowed corresponds to fewer wins.
Quick sanity check: season wins should increase with the tournament round reached (with the caveat that teams play different numbers of games because of preseason tournaments, cancelled games, etc.).
tourney$POSTSEASON<-factor(tourney$POSTSEASON, levels = c("R68", "R64", "R32", "S16", "E8", "F4", "2ND", "Champions"))
p<-ggplot(tourney, aes(x=POSTSEASON, y=W, fill=POSTSEASON)) +
geom_boxplot()+
theme_bw()
p+theme(legend.position = "bottom")
Consistent with the scatterplots above, there appears to be a strong relationship between offensive efficiency and going farther in the tourney.
p<-ggplot(tourney, aes(x=POSTSEASON, y=ADJOE, fill=POSTSEASON)) +
geom_boxplot()+
theme_bw()
p+theme(legend.position = "bottom")
Consistent with the scatterplots above, there appears to be a strong relationship between defensive efficiency and going farther in the tourney. However, the runners-up, Texas Tech, had a better overall defensive efficiency index over the course of the season than the champions, UVA.
p<-ggplot(tourney, aes(x=POSTSEASON, y=ADJDE, fill=POSTSEASON)) +
geom_boxplot()+
theme_bw()
p+theme(legend.position = "bottom")
Though tempo did not relate to total season wins, UVA played at a very slow tempo throughout the season (the lowest adjusted tempo among tournament teams).
p<-ggplot(tourney, aes(x=POSTSEASON, y=ADJ_T, fill=POSTSEASON)) +
geom_boxplot()+
theme_bw()
p+theme(legend.position = "bottom")
Here, I examined differences in rebounds and turnovers by postseason position. These variables did not relate as strongly to total wins as the points scored and allowed variables; however, these stats can be very important to a game’s outcome.
p<-ggplot(tourney, aes(x=POSTSEASON, y=ORB, fill=POSTSEASON)) +
geom_boxplot()+
theme_bw()
p+theme(legend.position = "bottom")
p<-ggplot(tourney, aes(x=POSTSEASON, y=DRB, fill=POSTSEASON)) +
geom_boxplot()+
theme_bw()
p+theme(legend.position = "bottom")
p<-ggplot(tourney, aes(x=POSTSEASON, y=TOR, fill=POSTSEASON)) +
geom_boxplot()+
theme_bw()
p+theme(legend.position = "bottom")
p<-ggplot(tourney, aes(x=POSTSEASON, y=TORD, fill=POSTSEASON)) +
geom_boxplot()+
theme_bw()
p+theme(legend.position = "bottom")
Above, I examined correlations between various statistics across all teams in the 2019 season and correlations split by tournament qualification. Let’s examine the relationships between different stats in just the tournament teams, using the correlation plot from BasketballAnalyzeR.
Interestingly, these stats range from weakly to moderately related to wins. The effective field goal percentage variables (shot and allowed) are highly related to the 2-point and 3-point percentages made and allowed, suggesting that using both in a model could lead to high collinearity. These high correlations make sense, as the effective field goal variables are likely calculated from the same made and attempted shots that underlie the 2-point and 3-point percentages.
tourney$BARTHAG<-NULL
corrmatrix<-corranalysis(tourney[,4:19], threshold = .5)
plot(corrmatrix)
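Effective field goal percentage is conventionally defined as (FGM + 0.5 × 3PM) / FGA, so it is a weighted blend of two-point and three-point shooting, which is why it tracks both so closely. A worked sketch with made-up shot counts (the counts are hypothetical; the dataset stores only the percentages):

```r
# Hypothetical season shot totals for one team
fg2_made <- 600; fg2_att <- 1200   # two-point makes / attempts
fg3_made <- 250; fg3_att <- 700    # three-point makes / attempts

p2 <- 100 * fg2_made / fg2_att     # two-point percentage
p3 <- 100 * fg3_made / fg3_att     # three-point percentage

# eFG% credits a made three as 1.5 field goals
fga <- fg2_att + fg3_att
efg <- 100 * (fg2_made + 1.5 * fg3_made) / fga

# Equivalent form: attempt-weighted mix of 2P% and 1.5 * 3P%
efg_alt <- (fg2_att / fga) * p2 + (fg3_att / fga) * 1.5 * p3

round(c(p2 = p2, p3 = p3, efg = efg), 1)
```

The second form makes the dependence explicit: given a team's shot mix, eFG% is fully determined by its 2P% and 3P%.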
tourney$TFac<-ordered(tourney$POSTSEASON, levels = c("R68", "R64", "R32", "S16", "E8", "F4", "2ND", "Champions"))
tourney$TFac<-as.numeric(tourney$TFac)
m1<- lm(TFac ~ ADJOE + ADJDE + TOR + TORD + ORB + DRB + ADJ_T, data=tourney)
summary(m1)
##
## Call:
## lm(formula = TFac ~ ADJOE + ADJDE + TOR + TORD + ORB + DRB +
## ADJ_T, data = tourney)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.88803 -0.54753 0.06408 0.54113 2.55633
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.50782 5.54203 -0.813 0.4192
## ADJOE 0.15888 0.02867 5.541 7.07e-07 ***
## ADJDE -0.06808 0.02561 -2.658 0.0101 *
## TOR 0.13289 0.10272 1.294 0.2007
## TORD 0.01598 0.06762 0.236 0.8140
## ORB -0.02384 0.03958 -0.602 0.5493
## DRB 0.03106 0.05299 0.586 0.5600
## ADJ_T -0.09501 0.04141 -2.294 0.0253 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9185 on 60 degrees of freedom
## Multiple R-squared: 0.6101, Adjusted R-squared: 0.5646
## F-statistic: 13.41 on 7 and 60 DF, p-value: 2.804e-10
tab_model(m1)
| | T Fac | | |
|---|---|---|---|
| Predictors | Estimates | CI | p |
| (Intercept) | -4.51 | -15.59 – 6.58 | 0.419 |
| ADJOE | 0.16 | 0.10 – 0.22 | <0.001 |
| ADJDE | -0.07 | -0.12 – -0.02 | 0.010 |
| TOR | 0.13 | -0.07 – 0.34 | 0.201 |
| TORD | 0.02 | -0.12 – 0.15 | 0.814 |
| ORB | -0.02 | -0.10 – 0.06 | 0.549 |
| DRB | 0.03 | -0.07 – 0.14 | 0.560 |
| ADJ T | -0.10 | -0.18 – -0.01 | 0.025 |
| Observations | 68 | ||
| R2 / R2 adjusted | 0.610 / 0.565 | ||
#check_model(m1)
#model_performance(m1)
Offensive efficiency, defensive efficiency, and tempo were statistically significant predictors of tournament placement. Higher offensive efficiency, lower defensive efficiency (fewer points allowed per 100 possessions, i.e., better defense), and slower tempo corresponded to remaining in the tournament longer.
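The outcome in this model is the integer position of each team's final round, so the linear model treats every step (for example R64 to R32, or F4 to 2ND) as equally spaced. A minimal sketch of that coding, using the same level order as the model code on a few example values:

```r
# Same round ordering as in the model code
rounds <- c("R68", "R64", "R32", "S16", "E8", "F4", "2ND", "Champions")

# A few example finishes (illustrative values)
finish <- c("R64", "Champions", "S16", "R68")

# ordered() fixes the level order; as.numeric() returns each level's position
tfac <- as.numeric(ordered(finish, levels = rounds))
tfac

# A value not in `levels` (e.g. a misspelled "Champion") silently becomes NA
as.numeric(ordered("Champion", levels = rounds))
```

If equal spacing seems too strong an assumption, an ordinal regression would be one alternative; that is a suggestion, not something done in this report.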
m2<- lm(W ~ ADJOE + ADJDE + TOR + TORD + ORB + DRB + ADJ_T, data=cbb)
summary(m2)
##
## Call:
## lm(formula = W ~ ADJOE + ADJDE + TOR + TORD + ORB + DRB + ADJ_T,
## data = cbb)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.9285 -2.5066 0.0583 2.3364 10.3360
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.67673 9.40723 -0.497 0.619406
## ADJOE 0.41997 0.04332 9.694 < 2e-16 ***
## ADJDE -0.23797 0.04060 -5.861 1.08e-08 ***
## TOR -0.43752 0.12258 -3.569 0.000409 ***
## TORD 0.73777 0.10573 6.978 1.54e-11 ***
## ORB 0.18087 0.05542 3.263 0.001211 **
## DRB -0.42026 0.07522 -5.587 4.69e-08 ***
## ADJ_T 0.06206 0.06917 0.897 0.370253
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.388 on 345 degrees of freedom
## Multiple R-squared: 0.7232, Adjusted R-squared: 0.7176
## F-statistic: 128.8 on 7 and 345 DF, p-value: < 2.2e-16
tab_model(m2)
| | W | | |
|---|---|---|---|
| Predictors | Estimates | CI | p |
| (Intercept) | -4.68 | -23.18 – 13.83 | 0.619 |
| ADJOE | 0.42 | 0.33 – 0.51 | <0.001 |
| ADJDE | -0.24 | -0.32 – -0.16 | <0.001 |
| TOR | -0.44 | -0.68 – -0.20 | <0.001 |
| TORD | 0.74 | 0.53 – 0.95 | <0.001 |
| ORB | 0.18 | 0.07 – 0.29 | 0.001 |
| DRB | -0.42 | -0.57 – -0.27 | <0.001 |
| ADJ T | 0.06 | -0.07 – 0.20 | 0.370 |
| Observations | 353 | ||
| R2 / R2 adjusted | 0.723 / 0.718 | ||
m3<- lm(W ~ ADJOE + ADJDE + TOR + TORD + ORB + DRB + ADJ_T, data=tourney)
summary(m3)
##
## Call:
## lm(formula = W ~ ADJOE + ADJDE + TOR + TORD + ORB + DRB + ADJ_T,
## data = tourney)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.4790 -2.4102 0.4107 1.8924 6.5918
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 16.81923 18.89224 0.890 0.3769
## ADJOE 0.20574 0.09774 2.105 0.0395 *
## ADJDE -0.10281 0.08732 -1.177 0.2437
## TOR -0.79951 0.35017 -2.283 0.0260 *
## TORD 0.53273 0.23051 2.311 0.0243 *
## ORB 0.30951 0.13494 2.294 0.0253 *
## DRB -0.43666 0.18064 -2.417 0.0187 *
## ADJ_T 0.03035 0.14116 0.215 0.8305
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.131 on 60 degrees of freedom
## Multiple R-squared: 0.5136, Adjusted R-squared: 0.4568
## F-statistic: 9.05 on 7 and 60 DF, p-value: 1.421e-07
tab_model(m3)
| | W | | |
|---|---|---|---|
| Predictors | Estimates | CI | p |
| (Intercept) | 16.82 | -20.97 – 54.61 | 0.377 |
| ADJOE | 0.21 | 0.01 – 0.40 | 0.039 |
| ADJDE | -0.10 | -0.28 – 0.07 | 0.244 |
| TOR | -0.80 | -1.50 – -0.10 | 0.026 |
| TORD | 0.53 | 0.07 – 0.99 | 0.024 |
| ORB | 0.31 | 0.04 – 0.58 | 0.025 |
| DRB | -0.44 | -0.80 – -0.08 | 0.019 |
| ADJ T | 0.03 | -0.25 – 0.31 | 0.830 |
| Observations | 68 | ||
| R2 / R2 adjusted | 0.514 / 0.457 | ||
In the full season sample, Offensive Efficiency, Defensive Efficiency, Turnover Rate, Steal Rate, Offensive Rebound Rate, and Offensive Rebound Rate Allowed were all statistically significant predictors of total wins.
In the subsample of teams that qualified for the tournament, Offensive Efficiency, Turnover Rate, Steal Rate, Offensive Rebound Rate, and Offensive Rebound Rate Allowed were all statistically significant predictors of total wins; Defensive Efficiency was not.
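The confidence intervals in the `tab_model()` tables can be recovered from the estimates and standard errors in the `summary()` output as estimate ± t × SE. A quick check for ADJOE in the tournament-only model (`m3`), which has 60 residual degrees of freedom:

```r
# Values copied from summary(m3)
est <- 0.20574
se  <- 0.09774
df  <- 60

# Two-sided 95% interval uses the 97.5th percentile of the t distribution
t_crit <- qt(0.975, df)
ci <- est + c(-1, 1) * t_crit * se

round(ci, 2)   # compare with the ADJOE row of the m3 table
```

The lower bound sits just above zero, in line with the p-value of 0.039 being only slightly under 0.05.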
clusVARS<-c(
"EFG_O",
"EFG_D",
"TOR",
"TORD",
"ORB",
"DRB",
"FTR",
"FTRD",
"X2P_O",
"X2P_D",
"X3P_O",
"X3P_D",
"ADJ_T"
)
clusITEMS<-tourney[clusVARS]
set.seed(13)
kclu1<-kclustering(clusITEMS)
plot(kclu1)
kclu2<-kclustering(clusITEMS, labels = tourney$TEAM, k=5)
plot(kclu2)
cluster <- as.factor(kclu2$Subjects$Cluster)
Xbubble <- data.frame(Team=tourney$TEAM, PTS=tourney$ADJOE,
PTS.Opp=tourney$ADJDE, cluster,
W=tourney$W)
labs <- c("PTS", "PTS.Opp", "cluster", "Wins")
bubbleplot(Xbubble, id="Team", x="PTS", y="PTS.Opp",
col="cluster", size="W", labels=labs)
Bubble plot of the teams that participated in the 2019 tournament for
offensive efficiency (PTS), defensive efficiency (PTS.Opp), number of
wins, and cluster.
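For readers without BasketballAnalyzeR, a rough base R analogue of the clustering step is `kmeans()` on standardized variables. This is a sketch on simulated data, not a reproduction of what `kclustering()` does internally; the two columns and the choice of 3 clusters are assumptions for illustration:

```r
set.seed(13)
# Simulated team profiles: three loose groups on two made-up stats
X <- rbind(
  cbind(rnorm(20, 48, 1.5), rnorm(20, 16, 1)),
  cbind(rnorm(20, 52, 1.5), rnorm(20, 19, 1)),
  cbind(rnorm(20, 56, 1.5), rnorm(20, 22, 1))
)
colnames(X) <- c("EFG_O", "TOR")

# Standardize so each variable contributes on the same scale
Z <- scale(X)

# k-means with several random starts to reduce sensitivity to initialization
km <- kmeans(Z, centers = 3, nstart = 25)

# Share of total variance explained by the clustering
explained <- km$betweenss / km$totss

table(km$cluster)
round(explained, 2)
```

Standardizing first matters here because the variables are on different scales; without it, the variable with the largest spread would dominate the distance calculation.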