NCAA Men’s Basketball

Basketball Data Science

Working through and adapting code from P. Zuccolotto and M. Manisera (2020) Basketball Data Science – With Applications in R, Chapman and Hall/CRC. ISBN 9781138600799.

Using BasketballAnalyzeR with NCAA Basketball Data

Data Import

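# data and some code pulled from BasketballAnalyzeR package: https://bodai.unibs.it/bdsports/basketballanalyzer/
library(tidyverse)          # mutate(), replace_na(), ggplot(), fct_infreq()
library(knitr)              # kable()
library(psych)              # describe()
library(BasketballAnalyzeR) # scatterplot(), barline(), corranalysis(), kclustering(), bubbleplot()
library(sjPlot)             # tab_model()
library(aod)                # wald.test()

#data(package="BasketballAnalyzeR")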

# NCAA data pulled from: https://www.kaggle.com/datasets/andrewsundberg/college-basketball-dataset
# using 2019 season for completeness 
cbb <- read.csv("~/Dropbox/AltAc/Freelancing/BasketballDataScience/MBB19/rawdat/cbb/cbb19.csv")

This report is my own analysis using a combination of BasketballAnalyzeR and other packages to analyze performance in the 2019 NCAA MBB season and tournament.

In Basketball Data Science, chapter 6 reviews aggregating game box scores at the team, opponent, and player level using the tidyverse. This 2019 CBB data is already aggregated within season at the team level, so I’ll be focusing on team-level analyses and visualizations.

2019 NCAA Men’s Basketball

CBB Variables

TEAM: The Division I college basketball school
CONF: The Athletic Conference in which the school participates (A10 = Atlantic 10, ACC = Atlantic Coast Conference, AE = America East, Amer = American, ASun = ASUN, B10 = Big Ten, B12 = Big 12, BE = Big East, BSky = Big Sky, BSth = Big South, BW = Big West, CAA = Colonial Athletic Association, CUSA = Conference USA, Horz = Horizon League, Ivy = Ivy League, MAAC = Metro Atlantic Athletic Conference, MAC = Mid-American Conference, MEAC = Mid-Eastern Athletic Conference, MVC = Missouri Valley Conference, MWC = Mountain West, NEC = Northeast Conference, OVC = Ohio Valley Conference, P12 = Pac-12, Pat = Patriot League, SB = Sun Belt, SC = Southern Conference, SEC = Southeastern Conference, Slnd = Southland Conference, Sum = Summit League, SWAC = Southwestern Athletic Conference, WAC = Western Athletic Conference, WCC = West Coast Conference)
G: Number of games played
W: Number of games won
ADJOE: Adjusted Offensive Efficiency (An estimate of the offensive efficiency (points scored per 100 possessions) a team would have against the average Division I defense)
ADJDE: Adjusted Defensive Efficiency (An estimate of the defensive efficiency (points allowed per 100 possessions) a team would have against the average Division I offense)
BARTHAG: Power Rating (Chance of beating an average Division I team)
EFG_O: Effective Field Goal Percentage Shot
EFG_D: Effective Field Goal Percentage Allowed
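TOR: Turnover Rate (turnovers committed by the team per 100 possessions; the data source lists this as Turnover Percentage Allowed)
TORD: Turnover Rate forced on defense (opponent turnovers per 100 possessions; the data source lists this as Turnover Percentage Committed, or Steal Rate)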
ORB: Offensive Rebound Rate
DRB: Offensive Rebound Rate Allowed
FTR: Free Throw Rate (How often the given team shoots Free Throws)
FTRD: Free Throw Rate Allowed
2P_O: Two-Point Shooting Percentage
2P_D: Two-Point Shooting Percentage Allowed
3P_O: Three-Point Shooting Percentage
3P_D: Three-Point Shooting Percentage Allowed
ADJ_T: Adjusted Tempo (An estimate of the tempo (possessions per 40 minutes) a team would have against the team that wants to play at an average Division I tempo)
WAB: Wins Above Bubble (The bubble refers to the cut off between making the NCAA March Madness Tournament and not making it)
POSTSEASON: Round where the given team was eliminated or where their season ended (R68 = First Four, R64 = Round of 64, R32 = Round of 32, S16 = Sweet Sixteen, E8 = Elite Eight, F4 = Final Four, 2ND = Runner-up, Champion = Winner of the NCAA March Madness Tournament for that given year)
SEED: Seed in the NCAA March Madness Tournament
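
A note on names: the list above uses the dataset's column names (2P_O, 2P_D, 3P_O, 3P_D), while the code in this report refers to X2P_O, X2P_D, X3P_O, and X3P_D. That is because read.csv() by default passes column names through make.names(), which prefixes an "X" to names starting with a digit. A minimal sketch with a made-up two-line CSV (the team name and values are hypothetical, not taken from the data):

```r
# Hypothetical CSV text, used only to show how read.csv() treats the header
csv_text <- "TEAM,2P_O,3P_O\nExample U,50.1,34.2"

# Default behaviour (check.names = TRUE): names are made syntactically valid
with_check <- read.csv(text = csv_text)

# check.names = FALSE keeps the header exactly as written in the file
without_check <- read.csv(text = csv_text, check.names = FALSE)

names(with_check)
names(without_check)
```

Keeping the default is convenient here, since names such as X2P_O can be used in formulas and aes() without backticks.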

Creating additional variables for analyses

cbb<-mutate(cbb, Qual = ifelse(SEED <= 16, "Yes", "No"))
cbb$Qual<-cbb$Qual %>% replace_na("No")

kable(table(cbb$Qual))

Var1	Freq
No	285
Yes	68

tourney<-subset(cbb, Qual=="Yes")
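
The replace_na() step above is needed because SEED <= 16 evaluates to NA when SEED is missing. Assuming, as that step suggests, that teams outside the tournament simply have a missing SEED, the same flag could be sketched in one pass with is.na(). The toy data frame below is hypothetical:

```r
# Hypothetical toy data: team B has no seed (did not qualify)
toy <- data.frame(TEAM = c("A", "B", "C"),
                  SEED = c(1, NA, 16))

# A missing seed means the team did not qualify; any recorded seed means it did
toy$Qual <- ifelse(is.na(toy$SEED), "No", "Yes")

table(toy$Qual)
```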

Basic Descriptives

# subset to numeric vars and describe
numVARS<-c("G",
          "W",
          "ADJOE",
          "ADJDE",
          "BARTHAG",
          "EFG_O",
          "EFG_D",
          "TOR",
          "TORD",
          "ORB",
          "DRB",
          "FTR",
          "FTRD",
          "X2P_O",
          "X2P_D",
          "X3P_O",
          "X3P_D",
          "ADJ_T",
          "WAB"
)
numITEMS<-cbb[numVARS]

kable(describe(numITEMS), 
      format='markdown', 
      digits=2)

	vars	n	mean	sd	median	trimmed	mad	min	max	range	skew	kurtosis	se
G	1	353	31.75	2.51	31.00	31.64	2.97	26.00	39.00	13.00	0.34	-0.02	0.13
W	2	353	17.11	6.37	17.00	16.91	7.41	3.00	35.00	32.00	0.26	-0.35	0.34
ADJOE	3	353	103.34	7.02	103.10	103.26	6.82	83.70	123.40	39.70	0.12	0.21	0.37
ADJDE	4	353	103.34	6.45	104.00	103.46	7.12	85.20	119.20	34.00	-0.17	-0.34	0.34
BARTHAG	5	353	0.49	0.25	0.48	0.49	0.30	0.03	0.97	0.94	0.16	-1.10	0.01
EFG_O	6	353	50.60	2.94	50.50	50.65	2.97	40.00	59.00	19.00	-0.21	0.37	0.16
EFG_D	7	353	50.77	2.75	50.90	50.79	2.82	42.50	59.30	16.80	-0.08	0.16	0.15
TOR	8	353	18.61	2.07	18.50	18.52	1.93	13.50	25.10	11.60	0.34	0.32	0.11
TORD	9	353	18.52	2.09	18.30	18.46	2.08	13.30	24.70	11.40	0.35	-0.01	0.11
ORB	10	353	28.25	3.94	28.30	28.21	4.00	15.90	38.70	22.80	0.04	-0.16	0.21
DRB	11	353	28.42	2.92	28.30	28.33	2.97	21.70	37.10	15.40	0.28	-0.32	0.16
FTR	12	353	32.95	4.71	33.30	32.96	4.45	21.90	48.10	26.20	0.03	-0.17	0.25
FTRD	13	353	33.20	5.08	32.70	32.95	4.89	21.80	54.00	32.20	0.63	0.85	0.27
X2P_O	14	353	50.06	3.36	50.30	50.06	3.26	37.70	61.40	23.70	-0.12	0.70	0.18
X2P_D	15	353	50.23	3.12	50.20	50.23	2.97	40.70	61.20	20.50	0.03	0.26	0.17
X3P_O	16	353	34.29	2.54	34.20	34.27	2.67	27.90	42.40	14.50	0.09	-0.09	0.14
X3P_D	17	353	34.42	2.34	34.40	34.44	2.22	27.90	41.80	13.90	-0.07	0.00	0.12
ADJ_T	18	353	69.17	2.69	69.00	69.08	2.52	60.70	79.10	18.40	0.41	0.73	0.14
WAB	19	353	-7.78	7.12	-8.60	-8.11	7.41	-23.40	11.20	34.60	0.39	-0.32	0.38

tnumITEMS<-tourney[numVARS]

kable(describe(tnumITEMS), 
      format='markdown', 
      digits=2)

	vars	n	mean	sd	median	trimmed	mad	min	max	range	skew	kurtosis	se
G	1	68	34.51	2.02	34.00	34.50	1.48	29.00	39.00	10.00	0.03	0.08	0.24
W	2	68	25.34	4.25	25.00	25.21	4.45	17.00	35.00	18.00	0.19	-0.92	0.52
ADJOE	3	68	111.31	5.92	110.90	111.31	5.49	98.00	123.40	25.40	0.06	-0.43	0.72
ADJDE	4	68	96.47	5.63	96.30	96.20	4.74	85.20	110.30	25.10	0.41	-0.12	0.68
BARTHAG	5	68	0.79	0.17	0.85	0.82	0.11	0.24	0.97	0.73	-1.32	1.03	0.02
EFG_O	6	68	52.69	2.52	52.95	52.69	2.59	46.70	59.00	12.30	-0.03	-0.33	0.31
EFG_D	7	68	48.18	2.45	48.45	48.24	2.52	42.50	53.60	11.10	-0.28	-0.47	0.30
TOR	8	68	17.42	1.57	17.40	17.43	1.56	13.90	22.70	8.80	0.22	0.74	0.19
TORD	9	68	19.09	2.35	18.70	19.01	2.08	14.10	24.70	10.60	0.41	-0.02	0.29
ORB	10	68	30.06	4.00	30.20	30.13	4.00	20.70	37.90	17.20	-0.15	-0.45	0.48
DRB	11	68	27.70	2.97	27.00	27.56	3.19	22.20	34.90	12.70	0.40	-0.34	0.36
FTR	12	68	33.74	4.22	33.40	33.39	3.71	26.70	45.30	18.60	0.70	0.10	0.51
FTRD	13	68	31.36	4.43	31.75	31.22	5.26	23.10	46.20	23.10	0.46	0.31	0.54
X2P_O	14	68	52.26	2.95	51.95	52.19	2.52	44.10	61.40	17.30	0.31	0.85	0.36
X2P_D	15	68	47.51	2.97	47.85	47.53	3.04	40.70	53.70	13.00	-0.09	-0.60	0.36
X3P_O	16	68	35.55	2.34	35.75	35.57	2.37	30.40	41.40	11.00	-0.06	-0.22	0.28
X3P_D	17	68	32.83	2.06	33.25	32.90	1.70	27.90	37.40	9.50	-0.35	-0.13	0.25
ADJ_T	18	68	68.46	2.82	68.00	68.41	2.52	60.70	76.00	15.30	0.16	0.09	0.34
WAB	19	68	1.86	5.23	1.90	2.12	4.67	-10.90	11.20	22.10	-0.38	-0.20	0.63

The first table gives descriptives for all 353 teams in the 2019 MBB season; the second gives descriptives for the subset of 68 teams that qualified for the tournament.

Research Questions:

1. What was the spread of wins in the 2019 season?

ggplot(cbb, aes(x=W))+
  geom_histogram(color="#FFFFFF", fill="#003C80")+
  scale_x_continuous(breaks = seq(0, 40, by = 10))+
  scale_y_continuous(breaks = seq(0, 90, length.out = 10))+
  labs(title="2019 MBB Wins",x="Wins", y = "Count")+
  theme_minimal()

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
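
The message above appears because geom_histogram() falls back to 30 bins when no bin specification is given. A minimal sketch of setting binwidth explicitly follows. The win totals are simulated stand-ins and the two-win bin width is an arbitrary choice for illustration:

```r
library(ggplot2)

# Hypothetical win totals, simulated only so the sketch runs without the CSV
set.seed(1)
toy <- data.frame(W = rbinom(353, size = 35, prob = 0.5))

# binwidth = 2 groups wins into two-win bins and silences the stat_bin() message
p <- ggplot(toy, aes(x = W)) +
  geom_histogram(binwidth = 2, color = "#FFFFFF", fill = "#003C80") +
  labs(title = "Toy win totals", x = "Wins", y = "Count") +
  theme_minimal()

# layer_data() returns the computed bins; the counts add up to the number of teams
bins <- layer_data(p, 1)
sum(bins$count)
```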

2a. What was the spread of wins in the 2019 season by conference?

kable(table(cbb$CONF))

Var1	Freq
A10	14
ACC	15
AE	9
Amer	12
ASun	8
B10	14
B12	10
BE	10
BSky	12
BSth	12
BW	9
CAA	10
CUSA	14
Horz	10
Ivy	8
MAAC	11
MAC	12
MEAC	12
MVC	10
MWC	11
NEC	10
OVC	12
P12	12
Pat	10
SB	12
SC	10
SEC	14
Slnd	13
Sum	8
SWAC	10
WAC	9
WCC	10

p<-ggplot(cbb, aes(x=CONF, y=W, fill=CONF)) + 
    geom_boxplot()+
  theme_bw()

p+theme(legend.position = "bottom")

2b. What was the spread of wins in the 2019 season by conference, grouped by tournament qualification?

p<-ggplot(cbb, aes(x=CONF, y=W, fill=CONF)) + 
    geom_boxplot()+
      facet_wrap(~Qual)+
  theme_bw()

p+theme(legend.position = "bottom")

3. Which conference appeared the most in the 2019 tournament?

ggplot(tourney, aes(x = fct_infreq(CONF)))+
  geom_bar(color="#FFFFFF", fill="#CCA600")+
  labs(title="2019 Tournament by Conference",x="Conference", y = "Count")+
  theme_minimal()+
  theme(legend.position = "bottom", axis.text.x = element_text(angle=60, size=7, hjust = 1))

Power 5 conferences were well represented, but smaller conferences, like the American and the Big East, also performed well this year.

4a. What are the relationships between different stats and wins?

numVARS<-c("W",
          "ADJOE",
          "ADJDE",
          "BARTHAG",
          "EFG_O",
          "EFG_D",
          "TOR",
          "TORD",
          "ORB",
          "DRB",
          "FTR",
          "FTRD",
          "X2P_O",
          "X2P_D",
          "X3P_O",
          "X3P_D",
          "ADJ_T")
numITEMS<-cbb[numVARS]

scatterplot(numITEMS, data.var =1:17,  diag = list(continuous="blankDiag"))

4b. What are the relationships between different stats and wins split by tournament qualification?

numtVARS<-c("W",
          "ADJOE",
          "ADJDE",
          "BARTHAG",
          "EFG_O",
          "EFG_D",
          "TOR",
          "TORD",
          "ORB",
          "DRB",
          "FTR",
          "FTRD",
          "X2P_O",
          "X2P_D",
          "X3P_O",
          "X3P_D",
          "ADJ_T",
          "Qual"
)
numtITEMS<-cbb[numtVARS]

scatterplot(numtITEMS, data.var =1:17, z.var="Qual", diag = list(continuous="blankDiag"))

Adjusted Offensive Efficiency (ADJOE) and Adjusted Defensive Efficiency (ADJDE) were both strongly correlated with total wins. ADJOE indexes points scored and ADJDE indexes points allowed, each per 100 possessions.

4c. What are the mean differences in Adjusted Offensive Efficiency (ADJOE) and Adjusted Defensive Efficiency (ADJDE) for qualifying vs. non-qualifying teams?

p<-ggplot(cbb, aes(x=Qual, y=ADJOE, fill=Qual)) + 
    geom_boxplot()+
    labs(title="ADJOE by Qualifying Status")+
  theme_bw()

p+theme(legend.position = "bottom")

t.test(ADJOE ~ Qual, data = cbb)

## 
##  Welch Two Sample t-test
## 
## data:  ADJOE by Qual
## t = -12.401, df = 100.3, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group No and group Yes is not equal to 0
## 95 percent confidence interval:
##  -11.457124  -8.296797
## sample estimates:
##  mean in group No mean in group Yes 
##          101.4333          111.3103

p<-ggplot(cbb, aes(x=Qual, y=ADJDE, fill=Qual)) + 
    geom_boxplot()+
    labs(title="ADJDE by Qualifying Status")+
  theme_bw()

p+theme(legend.position = "bottom")

t.test(ADJDE ~ Qual, data = cbb)

## 
##  Welch Two Sample t-test
## 
## data:  ADJDE by Qual
## t = 11.249, df = 99.724, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group No and group Yes is not equal to 0
## 95 percent confidence interval:
##   7.002421 10.001531
## sample estimates:
##  mean in group No mean in group Yes 
##         104.97404          96.47206

There were fairly substantial differences in average offensive and defensive efficiency between teams that did and did not qualify for the tournament.
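
To make the size of the offensive gap concrete, the reported Welch output for ADJOE can be reassembled by hand: the difference in means is about 9.9 points per 100 possessions, and the confidence interval follows from that difference, the t statistic, and the Welch degrees of freedom. The sketch below uses only the rounded numbers printed above, so it should land close to, but not exactly on, the reported interval:

```r
# Values copied from the t.test() output for ADJOE above
mean_no  <- 101.4333
mean_yes <- 111.3103
t_stat   <- -12.401
df_welch <- 100.3

# Difference in means (No minus Yes) and the standard error implied by t
diff_means <- mean_no - mean_yes
se_diff    <- diff_means / t_stat

# 95% interval: difference plus or minus the t quantile times the standard error
half_width <- qt(0.975, df = df_welch) * se_diff
ci <- diff_means + c(-1, 1) * half_width

round(diff_means, 2)
round(ci, 2)
```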

4d. Predicting qualification

cbb$QualFac <- ifelse(cbb$Qual == "Yes", 1, 0)

quallogit <- glm(QualFac ~ ADJOE + ADJDE + TOR + TORD + ORB + DRB + ADJ_T, data = cbb, family = "binomial")

summary(quallogit)

## 
## Call:
## glm(formula = QualFac ~ ADJOE + ADJDE + TOR + TORD + ORB + DRB + 
##     ADJ_T, family = "binomial", data = cbb)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.9912  -0.3911  -0.1649  -0.0545   3.1703  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -6.01010   10.46199  -0.574  0.56565    
## ADJOE        0.25740    0.05528   4.656 3.22e-06 ***
## ADJDE       -0.14201    0.04475  -3.174  0.00151 ** 
## TOR         -0.11747    0.14427  -0.814  0.41551    
## TORD         0.20635    0.11644   1.772  0.07635 .  
## ORB          0.01531    0.06082   0.252  0.80126    
## DRB         -0.05702    0.07724  -0.738  0.46038    
## ADJ_T       -0.13250    0.07948  -1.667  0.09550 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 345.95  on 352  degrees of freedom
## Residual deviance: 182.11  on 345  degrees of freedom
## AIC: 198.11
## 
## Number of Fisher Scoring iterations: 6

confint(quallogit)

##                    2.5 %      97.5 %
## (Intercept) -26.92777883 14.29234233
## ADJOE         0.15467451  0.37261806
## ADJDE        -0.23318617 -0.05684055
## TOR          -0.40486518  0.16320295
## TORD         -0.02016965  0.43876062
## ORB          -0.10420604  0.13548867
## DRB          -0.21114406  0.09343618
## ADJ_T        -0.29282576  0.02029582

tab_model(quallogit)

	Qual Fac
Predictors	Odds Ratios	CI	p
(Intercept)	0.00	0.00 – 1610962.41	0.566
ADJOE	1.29	1.17 – 1.45	<0.001
ADJDE	0.87	0.79 – 0.94	0.002
TOR	0.89	0.67 – 1.18	0.416
TORD	1.23	0.98 – 1.55	0.076
ORB	1.02	0.90 – 1.15	0.801
DRB	0.94	0.81 – 1.10	0.460
ADJ T	0.88	0.75 – 1.02	0.095
Observations	353
R² Tjur	0.500

wald.test(b = coef(quallogit), Sigma = vcov(quallogit), Terms = 4:8)

## Wald test:
## ----------
## 
## Chi-squared test:
## X2 = 6.9, df = 5, P(> X2) = 0.23

exp(coef(quallogit))

## (Intercept)       ADJOE       ADJDE         TOR        TORD         ORB 
##  0.00245384  1.29356306  0.86761547  0.88917003  1.22918772  1.01542722 
##         DRB       ADJ_T 
##  0.94457678  0.87589994

exp(cbind(OR = coef(quallogit), confint(quallogit)))

##                     OR        2.5 %       97.5 %
## (Intercept) 0.00245384 2.020292e-12 1.610962e+06
## ADJOE       1.29356306 1.167278e+00 1.451530e+00
## ADJDE       0.86761547 7.920061e-01 9.447447e-01
## TOR         0.88917003 6.670667e-01 1.177276e+00
## TORD        1.22918772 9.800324e-01 1.550784e+00
## ORB         1.01542722 9.010396e-01 1.145096e+00
## DRB         0.94457678 8.096574e-01 1.097941e+00
## ADJ_T       0.87589994 7.461521e-01 1.020503e+00

While accounting for the other variables in the model, every one-unit increase in ADJOE increases the log odds of qualifying for the tournament by 0.257 (odds ratio 1.29).
While accounting for the other variables in the model, every one-unit increase in ADJDE decreases the log odds of qualifying for the tournament by 0.142 (odds ratio 0.87).
The Wald test of the joint effect of the remaining predictors (TOR, TORD, ORB, DRB, and ADJ_T; terms 4 through 8) gives a chi-squared statistic of 6.9 with five degrees of freedom and a p-value of 0.23, indicating that their overall effect is not statistically significant.
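
As a quick check on the interpretation above, the log-odds coefficients can be converted to odds ratios with exp(), and the Wald p-value can be recovered from the chi-squared statistic and its degrees of freedom. This sketch uses only the rounded values printed in the output, so small rounding differences are expected:

```r
# Coefficients copied from summary(quallogit) above
b_adjoe <- 0.25740
b_adjde <- -0.14201

# Exponentiating a log-odds coefficient gives the odds ratio per one-unit increase
or_adjoe <- exp(b_adjoe)
or_adjde <- exp(b_adjde)

# Upper-tail chi-squared probability for the Wald statistic (X2 = 6.9, df = 5)
wald_p <- pchisq(6.9, df = 5, lower.tail = FALSE)

round(c(ADJOE = or_adjoe, ADJDE = or_adjde), 2)
round(wald_p, 2)
```

Read this way, each additional point of adjusted offensive efficiency multiplies the odds of qualifying by roughly 1.29, holding the other predictors fixed.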

5. Digging into Post Season Data

Some post season notes:
Virginia won the tournament this year.
Texas Tech were the runners-up.
Teams eliminated in the Final Four: Michigan St. and Auburn.
Teams eliminated in the Elite Eight: Gonzaga, Duke, Kentucky, and Purdue.

labs <- c("Free Throw Rate","Two-Point Shooting Percentage","Three-Point Shooting Percentage")
barline(data=tourney, id="TEAM", bars=c("FTR","X2P_O","X3P_O"),
        line="W", order.by="SEED", labels.bars=labs)

This bar-line plot displays some offensive stats for qualifying teams in the 2019 NCAA MBB Tournament. The bars are ordered by seed, and the line plots the total number of wins.

library(ggrepel)

p <- ggplot(tourney, aes(x=ADJ_T, y=SEED)) +
  geom_point(color = 'red') +
  theme_classic(base_size = 10)

p + geom_label_repel(aes(label = TEAM,
                    fill = factor(CONF)), color = 'white',
                    size = 3.5) +
   theme(legend.position = "bottom")

p <- ggplot(tourney, aes(x=ADJOE, y=SEED)) +
  geom_point(color = 'red') +
  theme_classic(base_size = 10)

p + geom_label_repel(aes(label = TEAM,
                    fill = factor(CONF)), color = 'white',
                    size = 3.5) +
   theme(legend.position = "bottom")

p <- ggplot(tourney, aes(x=ADJDE, y=SEED)) +
  geom_point(color = 'red') +
  theme_classic(base_size = 10)

p + geom_label_repel(aes(label = TEAM,
                    fill = factor(CONF)), color = 'white',
                    size = 3.5) +
   theme(legend.position = "bottom")

The three plots above show tournament seed against adjusted tempo (ADJ_T), adjusted offensive efficiency (ADJOE), and adjusted defensive efficiency (ADJDE), with each team labeled and colored by conference.

There is not much of a relationship between tempo and seed. However, there appear to be very strong relationships between offensive efficiency, defensive efficiency, and seed: more points scored per 100 possessions corresponds to a lower seed number (a better seed), and more points allowed corresponds to a higher seed number (a worse seed).

library(ggrepel)
tourney$TFac<-ordered(tourney$POSTSEASON, levels = c("R68", "R64", "R32", "S16", "E8", "F4", "2ND", "Champions"))

p <- ggplot(tourney, aes(x=ADJ_T, y=W)) +
  geom_point(color = 'red') +
  theme_classic(base_size = 10)

p + geom_label_repel(aes(label = TEAM,
                    fill = factor(CONF)), color = 'white',
                    size = 3.5) +
   theme(legend.position = "bottom")

p <- ggplot(tourney, aes(x=ADJOE, y=W)) +
  geom_point(color = 'red') +
  theme_classic(base_size = 10)

p + geom_label_repel(aes(label = TEAM,
                    fill = factor(CONF)), color = 'white',
                    size = 3.5) +
   theme(legend.position = "bottom")

p <- ggplot(tourney, aes(x=ADJDE, y=W)) +
  geom_point(color = 'red') +
  theme_classic(base_size = 10)

p + geom_label_repel(aes(label = TEAM,
                    fill = factor(CONF)), color = 'white',
                    size = 3.5) +
   theme(legend.position = "bottom")

ggplot(tourney, aes(x=ADJ_T, y=W)) +
  geom_point(color = 'red') +
  theme_classic(base_size = 10)+ 
  geom_label_repel(aes(label = TEAM,
                         fill = factor(TFac)), color = 'white',
                     size = 3.5) +
  labs(x = "Tempo", y = "Wins")+
  theme(legend.position = "bottom")+
  guides(fill=guide_legend(title="Post Season Finish"))

ggplot(tourney, aes(x=ADJOE, y=W)) +
  geom_point(color = 'red') +
  theme_classic(base_size = 10)+ 
  geom_label_repel(aes(label = TEAM,
                         fill = factor(TFac)), color = 'white',
                     size = 3.5) +
  labs(x = "Offensive Efficiency", y = "Wins")+
  theme(legend.position = "bottom")+
  guides(fill=guide_legend(title="Post Season Finish"))

ggplot(tourney, aes(x=ADJDE, y=W)) +
  geom_point(color = 'red') +
  theme_classic(base_size = 10)+ 
  geom_label_repel(aes(label = TEAM,
                         fill = factor(TFac)), color = 'white',
                     size = 3.5) +
  labs(x = "Defensive Efficiency", y = "Wins")+
  theme(legend.position = "bottom")+
  guides(fill=guide_legend(title="Post Season Finish"))

There is not much of a relationship between tempo and wins. However, consistent with the seed scatterplots above, there appear to be very strong relationships between offensive efficiency, defensive efficiency, and wins: more points scored corresponds to more wins (and therefore going farther in the tourney), and more points allowed corresponds to fewer wins.

Quick sanity check: season wins should increase with tournament position (with the caveat that teams play different numbers of games because of preseason tournaments, cancelled games, etc.).

tourney$POSTSEASON<-factor(tourney$POSTSEASON, levels = c("R68", "R64", "R32", "S16", "E8", "F4", "2ND", "Champions"))

p<-ggplot(tourney, aes(x=POSTSEASON, y=W, fill=POSTSEASON)) + 
    geom_boxplot()+
  theme_bw()

p+theme(legend.position = "bottom")

Consistent with the scatterplots above, there appears to be a strong relationship between offensive efficiency and going farther in the tourney.

p<-ggplot(tourney, aes(x=POSTSEASON, y=ADJOE, fill=POSTSEASON)) + 
    geom_boxplot()+
  theme_bw()

p+theme(legend.position = "bottom")

Consistent with the scatterplots above, there appears to be a strong relationship between defensive efficiency and going farther in the tourney. However, the runners-up, Texas Tech, had a better overall defensive efficiency index over the course of the season than the champions, UVA.

p<-ggplot(tourney, aes(x=POSTSEASON, y=ADJDE, fill=POSTSEASON)) + 
    geom_boxplot()+
  theme_bw()

p+theme(legend.position = "bottom")

Though tempo did not relate to total season wins, UVA played at a very slow tempo throughout the season (a low ADJ_T means fewer possessions per 40 minutes).

p<-ggplot(tourney, aes(x=POSTSEASON, y=ADJ_T, fill=POSTSEASON)) + 
    geom_boxplot()+
  theme_bw()

p+theme(legend.position = "bottom")

Here, I examined differences in rebounds and turnovers by post season position. These variables did not relate as strongly to total wins as the points scored and allowed variables; however, these stats can be very important to a game’s outcome.

p<-ggplot(tourney, aes(x=POSTSEASON, y=ORB, fill=POSTSEASON)) + 
    geom_boxplot()+
  theme_bw()

p+theme(legend.position = "bottom")

p<-ggplot(tourney, aes(x=POSTSEASON, y=DRB, fill=POSTSEASON)) + 
    geom_boxplot()+
  theme_bw()

p+theme(legend.position = "bottom")

p<-ggplot(tourney, aes(x=POSTSEASON, y=TOR, fill=POSTSEASON)) + 
    geom_boxplot()+
  theme_bw()

p+theme(legend.position = "bottom")

p<-ggplot(tourney, aes(x=POSTSEASON, y=TORD, fill=POSTSEASON)) + 
    geom_boxplot()+
  theme_bw()

p+theme(legend.position = "bottom")

Above, I examined correlations between various statistics in all teams in the 2019 season and correlations split by tournament qualification. Let’s examine the relationships between different stats just in the tournament teams using a correlation plot from BasketballAnalyzeR.

Interestingly, these stats range from weakly to moderately related to wins. The effective field goal percentage shot and allowed variables are highly related to the 2-point and 3-point percentages made and allowed, suggesting that using both in a model could lead to high collinearity. These high correlations make sense, as the effective field goal variables are likely calculated from the same made and attempted shots that produce the 2-point and 3-point percentages.

tourney$BARTHAG<-NULL
corrmatrix<-corranalysis(tourney[,4:19], threshold = .5)
plot(corrmatrix)
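
The overlap between the effective field goal variables and the 2-point and 3-point percentages can be seen from the standard definition, eFG% = (FGM + 0.5 × 3PM) / FGA, which can be rewritten as a weighted mix of the two shooting percentages. The sketch below uses made-up shot counts purely for illustration; whether the data source computes EFG_O in exactly this way is an assumption:

```r
# Hypothetical shot counts for one team (not taken from the dataset)
two_made   <- 20; two_att   <- 40
three_made <- 7;  three_att <- 20

fga <- two_att + three_att     # all field goal attempts
fgm <- two_made + three_made   # all field goals made

# Standard definition: a made three counts as 1.5 made field goals
efg_direct <- (fgm + 0.5 * three_made) / fga

# Same quantity written in terms of 2P% and 3P%, weighted by shot mix
share_two   <- two_att / fga
share_three <- three_att / fga
efg_from_pcts <- share_two * (two_made / two_att) +
  1.5 * share_three * (three_made / three_att)

all.equal(efg_direct, efg_from_pcts)
```

Because eFG% is built from the same ingredients, including it alongside 2P% and 3P% in one model adds little independent information.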

Post Season Analysis

tourney$TFac<-ordered(tourney$POSTSEASON, levels = c("R68", "R64", "R32", "S16", "E8", "F4", "2ND", "Champions"))
tourney$TFac<-as.numeric(tourney$TFac)

m1<- lm(TFac ~ ADJOE + ADJDE + TOR + TORD + ORB + DRB + ADJ_T, data=tourney)
summary(m1)

## 
## Call:
## lm(formula = TFac ~ ADJOE + ADJDE + TOR + TORD + ORB + DRB + 
##     ADJ_T, data = tourney)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.88803 -0.54753  0.06408  0.54113  2.55633 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -4.50782    5.54203  -0.813   0.4192    
## ADJOE        0.15888    0.02867   5.541 7.07e-07 ***
## ADJDE       -0.06808    0.02561  -2.658   0.0101 *  
## TOR          0.13289    0.10272   1.294   0.2007    
## TORD         0.01598    0.06762   0.236   0.8140    
## ORB         -0.02384    0.03958  -0.602   0.5493    
## DRB          0.03106    0.05299   0.586   0.5600    
## ADJ_T       -0.09501    0.04141  -2.294   0.0253 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9185 on 60 degrees of freedom
## Multiple R-squared:  0.6101, Adjusted R-squared:  0.5646 
## F-statistic: 13.41 on 7 and 60 DF,  p-value: 2.804e-10

tab_model(m1)

	T Fac
Predictors	Estimates	CI	p
(Intercept)	-4.51	-15.59 – 6.58	0.419
ADJOE	0.16	0.10 – 0.22	<0.001
ADJDE	-0.07	-0.12 – -0.02	0.010
TOR	0.13	-0.07 – 0.34	0.201
TORD	0.02	-0.12 – 0.15	0.814
ORB	-0.02	-0.10 – 0.06	0.549
DRB	0.03	-0.07 – 0.14	0.560
ADJ T	-0.10	-0.18 – -0.01	0.025
Observations	68
R² / R² adjusted	0.610 / 0.565

#check_model(m1)
#model_performance(m1)

Offensive efficiency, defensive efficiency, and tempo were statistically significant predictors of tournament placement. Higher offensive efficiency, better defensive efficiency (a lower ADJDE, meaning fewer points allowed per 100 possessions), and slower tempo corresponded to remaining in the tournament longer.
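
When reading the m1 coefficients, it helps to recall what the outcome is: TFac is the integer position of each team's finish in the ordered round list, so a one-unit change means advancing one round. A small sketch with a hypothetical vector of finishes shows the coding:

```r
# Round labels in the same order used for TFac above
rounds <- c("R68", "R64", "R32", "S16", "E8", "F4", "2ND", "Champions")

# Hypothetical finishes for four teams
finish <- c("R64", "Champions", "E8", "R68")

# ordered() fixes the level order; as.numeric() returns each level's position
tfac <- as.numeric(ordered(finish, levels = rounds))
tfac
```

Treating these codes as a numeric outcome in lm() assumes the rounds are equally spaced, which is a simplification; an ordinal model would be one alternative to consider.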

m2<- lm(W ~ ADJOE + ADJDE + TOR + TORD + ORB + DRB + ADJ_T, data=cbb)
summary(m2)

## 
## Call:
## lm(formula = W ~ ADJOE + ADJDE + TOR + TORD + ORB + DRB + ADJ_T, 
##     data = cbb)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -10.9285  -2.5066   0.0583   2.3364  10.3360 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -4.67673    9.40723  -0.497 0.619406    
## ADJOE        0.41997    0.04332   9.694  < 2e-16 ***
## ADJDE       -0.23797    0.04060  -5.861 1.08e-08 ***
## TOR         -0.43752    0.12258  -3.569 0.000409 ***
## TORD         0.73777    0.10573   6.978 1.54e-11 ***
## ORB          0.18087    0.05542   3.263 0.001211 ** 
## DRB         -0.42026    0.07522  -5.587 4.69e-08 ***
## ADJ_T        0.06206    0.06917   0.897 0.370253    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.388 on 345 degrees of freedom
## Multiple R-squared:  0.7232, Adjusted R-squared:  0.7176 
## F-statistic: 128.8 on 7 and 345 DF,  p-value: < 2.2e-16

tab_model(m2)

	W
Predictors	Estimates	CI	p
(Intercept)	-4.68	-23.18 – 13.83	0.619
ADJOE	0.42	0.33 – 0.51	<0.001
ADJDE	-0.24	-0.32 – -0.16	<0.001
TOR	-0.44	-0.68 – -0.20	<0.001
TORD	0.74	0.53 – 0.95	<0.001
ORB	0.18	0.07 – 0.29	0.001
DRB	-0.42	-0.57 – -0.27	<0.001
ADJ T	0.06	-0.07 – 0.20	0.370
Observations	353
R² / R² adjusted	0.723 / 0.718

m3<- lm(W ~ ADJOE + ADJDE + TOR + TORD + ORB + DRB + ADJ_T, data=tourney)
summary(m3)

## 
## Call:
## lm(formula = W ~ ADJOE + ADJDE + TOR + TORD + ORB + DRB + ADJ_T, 
##     data = tourney)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.4790 -2.4102  0.4107  1.8924  6.5918 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept) 16.81923   18.89224   0.890   0.3769  
## ADJOE        0.20574    0.09774   2.105   0.0395 *
## ADJDE       -0.10281    0.08732  -1.177   0.2437  
## TOR         -0.79951    0.35017  -2.283   0.0260 *
## TORD         0.53273    0.23051   2.311   0.0243 *
## ORB          0.30951    0.13494   2.294   0.0253 *
## DRB         -0.43666    0.18064  -2.417   0.0187 *
## ADJ_T        0.03035    0.14116   0.215   0.8305  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.131 on 60 degrees of freedom
## Multiple R-squared:  0.5136, Adjusted R-squared:  0.4568 
## F-statistic:  9.05 on 7 and 60 DF,  p-value: 1.421e-07

tab_model(m3)

	W
Predictors	Estimates	CI	p
(Intercept)	16.82	-20.97 – 54.61	0.377
ADJOE	0.21	0.01 – 0.40	0.039
ADJDE	-0.10	-0.28 – 0.07	0.244
TOR	-0.80	-1.50 – -0.10	0.026
TORD	0.53	0.07 – 0.99	0.024
ORB	0.31	0.04 – 0.58	0.025
DRB	-0.44	-0.80 – -0.08	0.019
ADJ T	0.03	-0.25 – 0.31	0.830
Observations	68
R² / R² adjusted	0.514 / 0.457

In the full season sample, Offensive Efficiency, Defensive Efficiency, Turnover Rate, Steal Rate, Offensive Rebound Rate, and Offensive Rebound Rate Allowed were all statistically significant predictors of total wins.

In the subsample of teams that qualified for the tournament, Offensive Efficiency, Turnover Rate, Steal Rate, Offensive Rebound Rate, and Offensive Rebound Rate Allowed were all statistically significant predictors of total wins.

clusVARS<-c(
          "EFG_O",
          "EFG_D",
          "TOR",
          "TORD",
          "ORB",
          "DRB",
          "FTR",
          "FTRD",
          "X2P_O",
          "X2P_D",
          "X3P_O",
          "X3P_D",
          "ADJ_T"
)
clusITEMS<-tourney[clusVARS]

set.seed(13)
kclu1<-kclustering(clusITEMS)
plot(kclu1)

kclu2<-kclustering(clusITEMS, labels = tourney$TEAM, k=5)
plot(kclu2)

cluster <- as.factor(kclu2$Subjects$Cluster)
Xbubble <- data.frame(Team=tourney$TEAM, PTS=tourney$ADJOE,
                      PTS.Opp=tourney$ADJDE, cluster,
                      W=tourney$W)
labs <- c("PTS", "PTS.Opp", "cluster", "Wins")
bubbleplot(Xbubble, id="Team", x="PTS", y="PTS.Opp",
           col="cluster", size="W", labels=labs)

Bubble plot of the teams that participated in the 2019 tournament for offensive efficiency (PTS), defensive efficiency (PTS.Opp), number of wins, and cluster.
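
For readers without BasketballAnalyzeR installed, the general idea behind the clustering step can be sketched with base R: standardize the variables, then run k-means. This is an analogue under stated assumptions, not the package's internal implementation; the data are simulated and the column names stat_a and stat_b are hypothetical:

```r
set.seed(13)

# Hypothetical data: three well-separated groups of 20 "teams" on two stats
toy <- rbind(
  matrix(rnorm(40, mean = 0),  ncol = 2),
  matrix(rnorm(40, mean = 6),  ncol = 2),
  matrix(rnorm(40, mean = 12), ncol = 2)
)
colnames(toy) <- c("stat_a", "stat_b")

# Standardize so that no variable dominates the distance calculation
toy_scaled <- scale(toy)

# k-means with several random starts to reduce dependence on initialization
fit <- kmeans(toy_scaled, centers = 3, nstart = 25)

table(fit$cluster)                        # cluster sizes
explained <- fit$betweenss / fit$totss    # share of variance between clusters
explained
```

The between-cluster share of variance is one simple way to compare different choices of k on the same data.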

NCAA Men’s Basketball - 2019

Sam Freis

2023-02-22