Basketball Data Science

Working through and adapting code from P. Zuccolotto and M. Manisera (2020) Basketball Data Science – With Applications in R, Chapman and Hall/CRC. ISBN 9781138600799.

Using BasketballAnalyzeR with NCAA Basketball Data

Data Import

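# data and some code pulled from BasketballAnalyzeR package: https://bodai.unibs.it/bdsports/basketballanalyzer/

# packages assumed from the functions called below (kable, describe, tab_model, wald.test, ...)
library(BasketballAnalyzeR); library(tidyverse); library(ggrepel)
library(knitr); library(psych); library(sjPlot); library(aod)

#data(package="BasketballAnalyzeR")

# NCAA data pulled from: https://www.kaggle.com/datasets/andrewsundberg/college-basketball-dataset
# using 2019 season for completeness
cbb <- read.csv("~/Dropbox/AltAc/Freelancing/BasketballDataScience/MBB19/rawdat/cbb/cbb19.csv")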

This report is my own analysis using a combination of BasketballAnalyzeR and other packages to analyze performance in the 2019 NCAA MBB season and tournament.

In Basketball Data Science, chapter 6 reviews aggregating game box scores at the team, opponent, and player level using the tidyverse. This 2019 CBB data is already aggregated within season at the team level, so I’ll be focusing on team-level analyses and visualizations.

2019 NCAA Men’s Basketball

CBB Variables

  • TEAM: The Division I college basketball school

  • CONF: The Athletic Conference in which the school participates (A10 = Atlantic 10, ACC = Atlantic Coast Conference, AE = America East, Amer = American, ASun = ASUN, B10 = Big Ten, B12 = Big 12, BE = Big East, BSky = Big Sky, BSth = Big South, BW = Big West, CAA = Colonial Athletic Association, CUSA = Conference USA, Horz = Horizon League, Ivy = Ivy League, MAAC = Metro Atlantic Athletic Conference, MAC = Mid-American Conference, MEAC = Mid-Eastern Athletic Conference, MVC = Missouri Valley Conference, MWC = Mountain West, NEC = Northeast Conference, OVC = Ohio Valley Conference, P12 = Pac-12, Pat = Patriot League, SB = Sun Belt, SC = Southern Conference, SEC = Southeastern Conference, Slnd = Southland Conference, Sum = Summit League, SWAC = Southwestern Athletic Conference, WAC = Western Athletic Conference, WCC = West Coast Conference)

  • G: Number of games played

  • W: Number of games won

  • ADJOE: Adjusted Offensive Efficiency (An estimate of the offensive efficiency (points scored per 100 possessions) a team would have against the average Division I defense)

  • ADJDE: Adjusted Defensive Efficiency (An estimate of the defensive efficiency (points allowed per 100 possessions) a team would have against the average Division I offense)

  • BARTHAG: Power Rating (Chance of beating an average Division I team)

  • EFG_O: Effective Field Goal Percentage Shot

  • EFG_D: Effective Field Goal Percentage Allowed

  • TOR: Turnover Percentage Allowed (Turnover Rate)

  • TORD: Turnover Percentage Committed (Steal Rate)

  • ORB: Offensive Rebound Rate

  • DRB: Offensive Rebound Rate Allowed

  • FTR: Free Throw Rate (How often the given team shoots Free Throws)

  • FTRD: Free Throw Rate Allowed

  • 2P_O: Two-Point Shooting Percentage

  • 2P_D: Two-Point Shooting Percentage Allowed

  • 3P_O: Three-Point Shooting Percentage

  • 3P_D: Three-Point Shooting Percentage Allowed

  • ADJ_T: Adjusted Tempo (An estimate of the tempo (possessions per 40 minutes) a team would have against the team that wants to play at an average Division I tempo)

  • WAB: Wins Above Bubble (The bubble refers to the cut off between making the NCAA March Madness Tournament and not making it)

  • POSTSEASON: Round where the given team was eliminated or where their season ended (R68 = First Four, R64 = Round of 64, R32 = Round of 32, S16 = Sweet Sixteen, E8 = Elite Eight, F4 = Final Four, 2ND = Runner-up, Champion = Winner of the NCAA March Madness Tournament for that given year)

  • SEED: Seed in the NCAA March Madness Tournament

Creating additional variables for analyses
cbb<-mutate(cbb, Qual = ifelse(SEED <= 16, "Yes", "No"))
cbb$Qual<-cbb$Qual %>% replace_na("No")

kable(table(cbb$Qual))
Var1 Freq
No 285
Yes 68
tourney<-subset(cbb, Qual=="Yes")
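
The qualification flag above relies on SEED being missing for teams that did not make the tournament. As a minimal sketch on a made-up data frame (the `toy` object and its values are hypothetical, not rows from cbb), this shows why the `replace_na()` step is needed and how a direct `is.na()` test gives the same flag when every recorded seed is 16 or lower:

```r
# Toy stand-in for cbb: SEED is NA for teams that missed the tournament
toy <- data.frame(
  TEAM = c("A", "B", "C", "D"),
  SEED = c(1, NA, 16, NA)
)

# Comparing NA with <= returns NA, so ifelse() alone leaves gaps
raw_flag <- ifelse(toy$SEED <= 16, "Yes", "No")

# Testing for a missing seed directly fills the flag in one step
toy$Qual <- ifelse(is.na(toy$SEED), "No", "Yes")

table(toy$Qual)
```

Either approach should yield the same Qual column here; the one-step version simply avoids the intermediate NA values.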

Basic Descriptives

# subset to numeric vars and describe
numVARS<-c("G",
          "W",
          "ADJOE",
          "ADJDE",
          "BARTHAG",
          "EFG_O",
          "EFG_D",
          "TOR",
          "TORD",
          "ORB",
          "DRB",
          "FTR",
          "FTRD",
          "X2P_O",
          "X2P_D",
          "X3P_O",
          "X3P_D",
          "ADJ_T",
          "WAB"
)
numITEMS<-cbb[numVARS]

kable(describe(numITEMS), 
      format='markdown', 
      digits=2)

| | vars | n | mean | sd | median | trimmed | mad | min | max | range | skew | kurtosis | se |
|:--|--:|--:|--:|--:|--:|--:|--:|--:|--:|--:|--:|--:|--:|
| G | 1 | 353 | 31.75 | 2.51 | 31.00 | 31.64 | 2.97 | 26.00 | 39.00 | 13.00 | 0.34 | -0.02 | 0.13 |
| W | 2 | 353 | 17.11 | 6.37 | 17.00 | 16.91 | 7.41 | 3.00 | 35.00 | 32.00 | 0.26 | -0.35 | 0.34 |
| ADJOE | 3 | 353 | 103.34 | 7.02 | 103.10 | 103.26 | 6.82 | 83.70 | 123.40 | 39.70 | 0.12 | 0.21 | 0.37 |
| ADJDE | 4 | 353 | 103.34 | 6.45 | 104.00 | 103.46 | 7.12 | 85.20 | 119.20 | 34.00 | -0.17 | -0.34 | 0.34 |
| BARTHAG | 5 | 353 | 0.49 | 0.25 | 0.48 | 0.49 | 0.30 | 0.03 | 0.97 | 0.94 | 0.16 | -1.10 | 0.01 |
| EFG_O | 6 | 353 | 50.60 | 2.94 | 50.50 | 50.65 | 2.97 | 40.00 | 59.00 | 19.00 | -0.21 | 0.37 | 0.16 |
| EFG_D | 7 | 353 | 50.77 | 2.75 | 50.90 | 50.79 | 2.82 | 42.50 | 59.30 | 16.80 | -0.08 | 0.16 | 0.15 |
| TOR | 8 | 353 | 18.61 | 2.07 | 18.50 | 18.52 | 1.93 | 13.50 | 25.10 | 11.60 | 0.34 | 0.32 | 0.11 |
| TORD | 9 | 353 | 18.52 | 2.09 | 18.30 | 18.46 | 2.08 | 13.30 | 24.70 | 11.40 | 0.35 | -0.01 | 0.11 |
| ORB | 10 | 353 | 28.25 | 3.94 | 28.30 | 28.21 | 4.00 | 15.90 | 38.70 | 22.80 | 0.04 | -0.16 | 0.21 |
| DRB | 11 | 353 | 28.42 | 2.92 | 28.30 | 28.33 | 2.97 | 21.70 | 37.10 | 15.40 | 0.28 | -0.32 | 0.16 |
| FTR | 12 | 353 | 32.95 | 4.71 | 33.30 | 32.96 | 4.45 | 21.90 | 48.10 | 26.20 | 0.03 | -0.17 | 0.25 |
| FTRD | 13 | 353 | 33.20 | 5.08 | 32.70 | 32.95 | 4.89 | 21.80 | 54.00 | 32.20 | 0.63 | 0.85 | 0.27 |
| X2P_O | 14 | 353 | 50.06 | 3.36 | 50.30 | 50.06 | 3.26 | 37.70 | 61.40 | 23.70 | -0.12 | 0.70 | 0.18 |
| X2P_D | 15 | 353 | 50.23 | 3.12 | 50.20 | 50.23 | 2.97 | 40.70 | 61.20 | 20.50 | 0.03 | 0.26 | 0.17 |
| X3P_O | 16 | 353 | 34.29 | 2.54 | 34.20 | 34.27 | 2.67 | 27.90 | 42.40 | 14.50 | 0.09 | -0.09 | 0.14 |
| X3P_D | 17 | 353 | 34.42 | 2.34 | 34.40 | 34.44 | 2.22 | 27.90 | 41.80 | 13.90 | -0.07 | 0.00 | 0.12 |
| ADJ_T | 18 | 353 | 69.17 | 2.69 | 69.00 | 69.08 | 2.52 | 60.70 | 79.10 | 18.40 | 0.41 | 0.73 | 0.14 |
| WAB | 19 | 353 | -7.78 | 7.12 | -8.60 | -8.11 | 7.41 | -23.40 | 11.20 | 34.60 | 0.39 | -0.32 | 0.38 |

tnumITEMS<-tourney[numVARS]

kable(describe(tnumITEMS), 
      format='markdown', 
      digits=2)

| | vars | n | mean | sd | median | trimmed | mad | min | max | range | skew | kurtosis | se |
|:--|--:|--:|--:|--:|--:|--:|--:|--:|--:|--:|--:|--:|--:|
| G | 1 | 68 | 34.51 | 2.02 | 34.00 | 34.50 | 1.48 | 29.00 | 39.00 | 10.00 | 0.03 | 0.08 | 0.24 |
| W | 2 | 68 | 25.34 | 4.25 | 25.00 | 25.21 | 4.45 | 17.00 | 35.00 | 18.00 | 0.19 | -0.92 | 0.52 |
| ADJOE | 3 | 68 | 111.31 | 5.92 | 110.90 | 111.31 | 5.49 | 98.00 | 123.40 | 25.40 | 0.06 | -0.43 | 0.72 |
| ADJDE | 4 | 68 | 96.47 | 5.63 | 96.30 | 96.20 | 4.74 | 85.20 | 110.30 | 25.10 | 0.41 | -0.12 | 0.68 |
| BARTHAG | 5 | 68 | 0.79 | 0.17 | 0.85 | 0.82 | 0.11 | 0.24 | 0.97 | 0.73 | -1.32 | 1.03 | 0.02 |
| EFG_O | 6 | 68 | 52.69 | 2.52 | 52.95 | 52.69 | 2.59 | 46.70 | 59.00 | 12.30 | -0.03 | -0.33 | 0.31 |
| EFG_D | 7 | 68 | 48.18 | 2.45 | 48.45 | 48.24 | 2.52 | 42.50 | 53.60 | 11.10 | -0.28 | -0.47 | 0.30 |
| TOR | 8 | 68 | 17.42 | 1.57 | 17.40 | 17.43 | 1.56 | 13.90 | 22.70 | 8.80 | 0.22 | 0.74 | 0.19 |
| TORD | 9 | 68 | 19.09 | 2.35 | 18.70 | 19.01 | 2.08 | 14.10 | 24.70 | 10.60 | 0.41 | -0.02 | 0.29 |
| ORB | 10 | 68 | 30.06 | 4.00 | 30.20 | 30.13 | 4.00 | 20.70 | 37.90 | 17.20 | -0.15 | -0.45 | 0.48 |
| DRB | 11 | 68 | 27.70 | 2.97 | 27.00 | 27.56 | 3.19 | 22.20 | 34.90 | 12.70 | 0.40 | -0.34 | 0.36 |
| FTR | 12 | 68 | 33.74 | 4.22 | 33.40 | 33.39 | 3.71 | 26.70 | 45.30 | 18.60 | 0.70 | 0.10 | 0.51 |
| FTRD | 13 | 68 | 31.36 | 4.43 | 31.75 | 31.22 | 5.26 | 23.10 | 46.20 | 23.10 | 0.46 | 0.31 | 0.54 |
| X2P_O | 14 | 68 | 52.26 | 2.95 | 51.95 | 52.19 | 2.52 | 44.10 | 61.40 | 17.30 | 0.31 | 0.85 | 0.36 |
| X2P_D | 15 | 68 | 47.51 | 2.97 | 47.85 | 47.53 | 3.04 | 40.70 | 53.70 | 13.00 | -0.09 | -0.60 | 0.36 |
| X3P_O | 16 | 68 | 35.55 | 2.34 | 35.75 | 35.57 | 2.37 | 30.40 | 41.40 | 11.00 | -0.06 | -0.22 | 0.28 |
| X3P_D | 17 | 68 | 32.83 | 2.06 | 33.25 | 32.90 | 1.70 | 27.90 | 37.40 | 9.50 | -0.35 | -0.13 | 0.25 |
| ADJ_T | 18 | 68 | 68.46 | 2.82 | 68.00 | 68.41 | 2.52 | 60.70 | 76.00 | 15.30 | 0.16 | 0.09 | 0.34 |
| WAB | 19 | 68 | 1.86 | 5.23 | 1.90 | 2.12 | 4.67 | -10.90 | 11.20 | 22.10 | -0.38 | -0.20 | 0.63 |

Descriptives for the entire MBB season and for the subset of teams that qualified for the tournament.

Research Questions:

1. What was the spread of wins in the 2019 season?

ggplot(cbb, aes(x=W))+
  geom_histogram(color="#FFFFFF", fill="#003C80")+
  scale_x_continuous(breaks = seq(0, 40, by = 10))+
  scale_y_continuous(breaks = seq(0, 90, len = 10))+
  labs(title="2019 MBB Wins",x="Wins", y = "Count")+
  theme_minimal()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

2a. What was the spread of wins in the 2019 season by conference?

kable(table(cbb$CONF))
Var1 Freq
A10 14
ACC 15
AE 9
Amer 12
ASun 8
B10 14
B12 10
BE 10
BSky 12
BSth 12
BW 9
CAA 10
CUSA 14
Horz 10
Ivy 8
MAAC 11
MAC 12
MEAC 12
MVC 10
MWC 11
NEC 10
OVC 12
P12 12
Pat 10
SB 12
SC 10
SEC 14
Slnd 13
Sum 8
SWAC 10
WAC 9
WCC 10
p<-ggplot(cbb, aes(x=CONF, y=W, fill=CONF)) + 
    geom_boxplot()+
  theme_bw()

p+theme(legend.position = "bottom")

2b. What was the spread of wins in the 2019 season by conference, grouped by tournament qualification?

p<-ggplot(cbb, aes(x=CONF, y=W, fill=CONF)) + 
    geom_boxplot()+
      facet_wrap(~Qual)+
  theme_bw()

p+theme(legend.position = "bottom")

3. What conference appeared the most in the 2019 tournament?

ggplot(tourney, aes(x = fct_infreq(CONF)))+
  geom_bar(color="#FFFFFF", fill="#CCA600")+
  labs(title="2019 Tournament by Conference",x="Conference", y = "Count")+
  theme_minimal()+
  theme(legend.position = "bottom", axis.text.x = element_text(angle=60, size=7, hjust = 1))

Power 5 conferences were well represented, but smaller conferences such as the American and the Big East also performed well this year.

4a. What are the relationships between different stats and wins?

numVARS<-c("W",
          "ADJOE",
          "ADJDE",
          "BARTHAG",
          "EFG_O",
          "EFG_D",
          "TOR",
          "TORD",
          "ORB",
          "DRB",
          "FTR",
          "FTRD",
          "X2P_O",
          "X2P_D",
          "X3P_O",
          "X3P_D",
          "ADJ_T")
numITEMS<-cbb[numVARS]

scatterplot(numITEMS, data.var =1:17,  diag = list(continuous="blankDiag"))

4b. What are the relationships between different stats and wins split by tournament qualification?

numtVARS<-c("W",
          "ADJOE",
          "ADJDE",
          "BARTHAG",
          "EFG_O",
          "EFG_D",
          "TOR",
          "TORD",
          "ORB",
          "DRB",
          "FTR",
          "FTRD",
          "X2P_O",
          "X2P_D",
          "X3P_O",
          "X3P_D",
          "ADJ_T",
          "Qual"
)
numtITEMS<-cbb[numtVARS]

scatterplot(numtITEMS, data.var =1:17, z.var="Qual", diag = list(continuous="blankDiag"))

Adjusted Offensive Efficiency (ADJOE) and Adjusted Defensive Efficiency (ADJDE) were both strongly correlated with total wins. ADJOE indexes points scored and ADJDE indexes points allowed, each per 100 possessions.

4c. What are the mean differences in Adjusted Offensive Efficiency (ADJOE) and Adjusted Defensive Efficiency (ADJDE) for qualifying vs. non-qualifying teams?

p<-ggplot(cbb, aes(x=Qual, y=ADJOE, fill=Qual)) + 
    geom_boxplot()+
    labs(title="ADJOE by Qualifying Status")+
  theme_bw()

p+theme(legend.position = "bottom")

t.test(ADJOE ~ Qual, data = cbb)
## 
##  Welch Two Sample t-test
## 
## data:  ADJOE by Qual
## t = -12.401, df = 100.3, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group No and group Yes is not equal to 0
## 95 percent confidence interval:
##  -11.457124  -8.296797
## sample estimates:
##  mean in group No mean in group Yes 
##          101.4333          111.3103
p<-ggplot(cbb, aes(x=Qual, y=ADJDE, fill=Qual)) + 
    geom_boxplot()+
    labs(title="ADJDE by Qualifying Status")+
  theme_bw()

p+theme(legend.position = "bottom")

t.test(ADJDE ~ Qual, data = cbb)
## 
##  Welch Two Sample t-test
## 
## data:  ADJDE by Qual
## t = 11.249, df = 99.724, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group No and group Yes is not equal to 0
## 95 percent confidence interval:
##   7.002421 10.001531
## sample estimates:
##  mean in group No mean in group Yes 
##         104.97404          96.47206

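Qualifying teams averaged about 9.9 more points scored per 100 possessions (111.3 vs. 101.4) and about 8.5 fewer points allowed per 100 possessions (96.5 vs. 105.0) than non-qualifying teams. Both differences were statistically significant in the Welch t-tests above.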
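
To put the group differences above on a standardized scale, an effect size such as Cohen's d could be reported alongside the t-tests. The following is a minimal base-R sketch, not part of the original analysis: `cohens_d` is a hypothetical helper, and the data are simulated stand-ins rather than the real cbb values.

```r
# Cohen's d with a pooled standard deviation (base R only)
cohens_d <- function(x, g) {
  g <- factor(g)
  stopifnot(nlevels(g) == 2)
  a <- x[g == levels(g)[1]]
  b <- x[g == levels(g)[2]]
  pooled_sd <- sqrt(((length(a) - 1) * var(a) + (length(b) - 1) * var(b)) /
                      (length(a) + length(b) - 2))
  # Positive values mean the second group level has the higher mean
  (mean(b) - mean(a)) / pooled_sd
}

# Simulated stand-in data (assumed means and SDs, not the real cbb values)
set.seed(1)
sim <- data.frame(
  Qual  = rep(c("No", "Yes"), times = c(285, 68)),
  ADJOE = c(rnorm(285, mean = 101, sd = 6), rnorm(68, mean = 111, sd = 6))
)
d_sim <- cohens_d(sim$ADJOE, sim$Qual)
d_sim
```

With the real data, the equivalent call would be `cohens_d(cbb$ADJOE, cbb$Qual)`.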

4d. Predicting qualification

cbb$QualFac <- ifelse(cbb$Qual == "Yes", 1, 0)

quallogit <- glm(QualFac ~ ADJOE + ADJDE + TOR + TORD + ORB + DRB + ADJ_T, data = cbb, family = "binomial")

summary(quallogit)
## 
## Call:
## glm(formula = QualFac ~ ADJOE + ADJDE + TOR + TORD + ORB + DRB + 
##     ADJ_T, family = "binomial", data = cbb)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.9912  -0.3911  -0.1649  -0.0545   3.1703  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -6.01010   10.46199  -0.574  0.56565    
## ADJOE        0.25740    0.05528   4.656 3.22e-06 ***
## ADJDE       -0.14201    0.04475  -3.174  0.00151 ** 
## TOR         -0.11747    0.14427  -0.814  0.41551    
## TORD         0.20635    0.11644   1.772  0.07635 .  
## ORB          0.01531    0.06082   0.252  0.80126    
## DRB         -0.05702    0.07724  -0.738  0.46038    
## ADJ_T       -0.13250    0.07948  -1.667  0.09550 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 345.95  on 352  degrees of freedom
## Residual deviance: 182.11  on 345  degrees of freedom
## AIC: 198.11
## 
## Number of Fisher Scoring iterations: 6
confint(quallogit)
##                    2.5 %      97.5 %
## (Intercept) -26.92777883 14.29234233
## ADJOE         0.15467451  0.37261806
## ADJDE        -0.23318617 -0.05684055
## TOR          -0.40486518  0.16320295
## TORD         -0.02016965  0.43876062
## ORB          -0.10420604  0.13548867
## DRB          -0.21114406  0.09343618
## ADJ_T        -0.29282576  0.02029582
tab_model(quallogit)

Qual Fac

| Predictors | Odds Ratios | CI | p |
|:--|--:|:--|--:|
| (Intercept) | 0.00 | 0.00 – 1610962.41 | 0.566 |
| ADJOE | 1.29 | 1.17 – 1.45 | <0.001 |
| ADJDE | 0.87 | 0.79 – 0.94 | 0.002 |
| TOR | 0.89 | 0.67 – 1.18 | 0.416 |
| TORD | 1.23 | 0.98 – 1.55 | 0.076 |
| ORB | 1.02 | 0.90 – 1.15 | 0.801 |
| DRB | 0.94 | 0.81 – 1.10 | 0.460 |
| ADJ T | 0.88 | 0.75 – 1.02 | 0.095 |

Observations: 353
R2 Tjur: 0.500

wald.test(b = coef(quallogit), Sigma = vcov(quallogit), Terms = 4:8)
## Wald test:
## ----------
## 
## Chi-squared test:
## X2 = 6.9, df = 5, P(> X2) = 0.23
exp(coef(quallogit))
## (Intercept)       ADJOE       ADJDE         TOR        TORD         ORB 
##  0.00245384  1.29356306  0.86761547  0.88917003  1.22918772  1.01542722 
##         DRB       ADJ_T 
##  0.94457678  0.87589994
exp(cbind(OR = coef(quallogit), confint(quallogit)))
##                     OR        2.5 %       97.5 %
## (Intercept) 0.00245384 2.020292e-12 1.610962e+06
## ADJOE       1.29356306 1.167278e+00 1.451530e+00
## ADJDE       0.86761547 7.920061e-01 9.447447e-01
## TOR         0.88917003 6.670667e-01 1.177276e+00
## TORD        1.22918772 9.800324e-01 1.550784e+00
## ORB         1.01542722 9.010396e-01 1.145096e+00
## DRB         0.94457678 8.096574e-01 1.097941e+00
## ADJ_T       0.87589994 7.461521e-01 1.020503e+00
  • While accounting for the other variables in the model, with every one-unit increase in ADJOE, the log odds of qualifying for the tournament increase by 0.257 (odds ratio 1.29).
  • While accounting for the other variables in the model, with every one-unit increase in ADJDE, the log odds of qualifying for the tournament decrease by 0.142 (odds ratio 0.87).
  • For the Wald test of the joint effect of the remaining predictors (TOR, TORD, ORB, DRB, and ADJ_T), the chi-squared statistic of 6.9 with five degrees of freedom has a p-value of 0.23, indicating that their joint effect is not statistically significant.
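
To connect the log-odds interpretation above to the odds ratios that tab_model() reports, the arithmetic can be sketched in base R. The coefficient values are copied from the printed model summary. The 5-point ADJOE change is an arbitrary illustration, not something estimated in this report.

```r
# Coefficients copied from the summary(quallogit) output
b_adjoe <- 0.25740
b_adjde <- -0.14201

# Odds ratios are the exponentiated log-odds coefficients
or_adjoe <- exp(b_adjoe)
or_adjde <- exp(b_adjde)

# A 5-point gain in ADJOE multiplies the odds by exp(5 * b)
or_adjoe_5 <- exp(5 * b_adjoe)

# Wald test p-value: upper tail of a chi-squared distribution with
# 5 degrees of freedom, one per coefficient selected by Terms = 4:8
p_wald <- pchisq(6.9, df = 5, lower.tail = FALSE)

round(c(or_adjoe, or_adjde, or_adjoe_5, p_wald), 3)
```

Because odds ratios multiply, effects compound across units: a multi-point change in efficiency shifts the odds far more than the per-unit ratio suggests.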

5. Digging into Post Season Data

Some post season notes:

  • Virginia won this year
  • Texas Tech were the runners-up
  • Teams eliminated in the Final Four: Michigan St. and Auburn
  • Teams eliminated in the Elite Eight: Gonzaga, Duke, Kentucky, and Purdue

labs <- c("Free Throw Rate","Two-Point Shooting Percentage","Three-Point Shooting Percentage")
barline(data=tourney, id="TEAM", bars=c("FTR","X2P_O","X3P_O"),
        line="W", order.by="SEED", labels.bars=labs)

This bar-line plot displays some offensive stats for qualifying teams in the 2019 NCAA MBB Tournament. The bars are ordered by seed, and the line plots the total number of wins.

require("ggrepel")

p <- ggplot(tourney, aes(x=ADJ_T, y=SEED)) +
  geom_point(color = 'red') +
  theme_classic(base_size = 10)

p + geom_label_repel(aes(label = TEAM,
                    fill = factor(CONF)), color = 'white',
                    size = 3.5) +
   theme(legend.position = "bottom")

p <- ggplot(tourney, aes(x=ADJOE, y=SEED)) +
  geom_point(color = 'red') +
  theme_classic(base_size = 10)

p + geom_label_repel(aes(label = TEAM,
                    fill = factor(CONF)), color = 'white',
                    size = 3.5) +
   theme(legend.position = "bottom")

p <- ggplot(tourney, aes(x=ADJDE, y=SEED)) +
  geom_point(color = 'red') +
  theme_classic(base_size = 10)

p + geom_label_repel(aes(label = TEAM,
                    fill = factor(CONF)), color = 'white',
                    size = 3.5) +
   theme(legend.position = "bottom")

Plotting relationships between tournament seed and Adjusted Tempo (ADJ_T), Adjusted Offensive Efficiency (ADJOE), and Adjusted Defensive Efficiency (ADJDE), as defined in the CBB Variables list.

There is not much of a relationship between tempo and seed. However, there appear to be very strong relationships between seed and both offensive and defensive efficiency: more points scored per 100 possessions corresponds to a numerically lower (better) seed, and more points allowed corresponds to a numerically higher (worse) seed.
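
One way to put a number on the seed relationships described above is a rank correlation, since seed is an ordinal ranking rather than a measurement. This is only a sketch on simulated data (the `sim_t` data frame and its slope and noise values are assumptions, not estimates from tourney):

```r
# Simulated stand-in for tourney: lower seed numbers mean stronger teams
set.seed(2019)
sim_t <- data.frame(SEED = rep(1:16, each = 4))
sim_t$ADJOE <- 120 - 1.2 * sim_t$SEED + rnorm(nrow(sim_t), sd = 3)

# Spearman's rho uses ranks, which suits an ordinal variable like SEED
rho <- cor(sim_t$ADJOE, sim_t$SEED, method = "spearman")
rho
```

With the real data, the equivalent call would be `cor(tourney$ADJOE, tourney$SEED, method = "spearman")`, with a negative value indicating that better offenses earn numerically lower seeds.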

require("ggrepel")
tourney$TFac<-ordered(tourney$POSTSEASON, levels = c("R68", "R64", "R32", "S16", "E8", "F4", "2ND", "Champions"))

p <- ggplot(tourney, aes(x=ADJ_T, y=W)) +
  geom_point(color = 'red') +
  theme_classic(base_size = 10)

p + geom_label_repel(aes(label = TEAM,
                    fill = factor(CONF)), color = 'white',
                    size = 3.5) +
   theme(legend.position = "bottom")

p <- ggplot(tourney, aes(x=ADJOE, y=W)) +
  geom_point(color = 'red') +
  theme_classic(base_size = 10)

p + geom_label_repel(aes(label = TEAM,
                    fill = factor(CONF)), color = 'white',
                    size = 3.5) +
   theme(legend.position = "bottom")

p <- ggplot(tourney, aes(x=ADJDE, y=W)) +
  geom_point(color = 'red') +
  theme_classic(base_size = 10)

p + geom_label_repel(aes(label = TEAM,
                    fill = factor(CONF)), color = 'white',
                    size = 3.5) +
   theme(legend.position = "bottom")

ggplot(tourney, aes(x=ADJ_T, y=W)) +
  geom_point(color = 'red') +
  theme_classic(base_size = 10)+ 
  geom_label_repel(aes(label = TEAM,
                         fill = factor(TFac)), color = 'white',
                     size = 3.5) +
  labs(x = "Tempo", y = "Wins")+
  theme(legend.position = "bottom")+
  guides(fill=guide_legend(title="Post Season Finish"))

ggplot(tourney, aes(x=ADJOE, y=W)) +
  geom_point(color = 'red') +
  theme_classic(base_size = 10)+ 
  geom_label_repel(aes(label = TEAM,
                         fill = factor(TFac)), color = 'white',
                     size = 3.5) +
  labs(x = "Offensive Efficiency", y = "Wins")+
  theme(legend.position = "bottom")+
  guides(fill=guide_legend(title="Post Season Finish"))

ggplot(tourney, aes(x=ADJDE, y=W)) +
  geom_point(color = 'red') +
  theme_classic(base_size = 10)+ 
  geom_label_repel(aes(label = TEAM,
                         fill = factor(TFac)), color = 'white',
                     size = 3.5) +
  labs(x = "Defensive Efficiency", y = "Wins")+
  theme(legend.position = "bottom")+
  guides(fill=guide_legend(title="Post Season Finish"))

Plotting relationships between total season wins and Adjusted Tempo (ADJ_T), Adjusted Offensive Efficiency (ADJOE), and Adjusted Defensive Efficiency (ADJDE), as defined in the CBB Variables list.

There is not much of a relationship between tempo and wins. However, consistent with the seed scatterplots above, there appear to be very strong relationships between wins and both offensive and defensive efficiency: more points scored corresponds to more wins (and therefore going farther in the tourney), and more points allowed corresponds to fewer wins.

Quick sanity check: Season wins should increase with tournament position (with the caveat that some teams play different numbers of games because of preseason tournaments, cancelled games, etc.).

tourney$POSTSEASON<-factor(tourney$POSTSEASON, levels = c("R68", "R64", "R32", "S16", "E8", "F4", "2ND", "Champions"))

p<-ggplot(tourney, aes(x=POSTSEASON, y=W, fill=POSTSEASON)) + 
    geom_boxplot()+
  theme_bw()

p+theme(legend.position = "bottom")

Consistent with the scatterplots above, there appears to be a strong relationship between offensive efficiency and going farther in the tourney.

p<-ggplot(tourney, aes(x=POSTSEASON, y=ADJOE, fill=POSTSEASON)) + 
    geom_boxplot()+
  theme_bw()

p+theme(legend.position = "bottom")

Consistent with the scatterplots above, there appears to be a strong relationship between defensive efficiency and going farther in the tourney. However, the runners-up, Texas Tech, had a better overall defensive efficiency index over the course of the season than the champions, UVA.

p<-ggplot(tourney, aes(x=POSTSEASON, y=ADJDE, fill=POSTSEASON)) + 
    geom_boxplot()+
  theme_bw()

p+theme(legend.position = "bottom")

Though tempo did not relate to total season wins, UVA played at a very slow tempo throughout the season.

p<-ggplot(tourney, aes(x=POSTSEASON, y=ADJ_T, fill=POSTSEASON)) + 
    geom_boxplot()+
  theme_bw()

p+theme(legend.position = "bottom")

Here, I examined differences in rebounds and turnovers by post season position. These variables did not relate as strongly to total wins as the points scored and points allowed variables; however, these stats can be very important to a game’s outcome.

p<-ggplot(tourney, aes(x=POSTSEASON, y=ORB, fill=POSTSEASON)) + 
    geom_boxplot()+
  theme_bw()

p+theme(legend.position = "bottom")

p<-ggplot(tourney, aes(x=POSTSEASON, y=DRB, fill=POSTSEASON)) + 
    geom_boxplot()+
  theme_bw()

p+theme(legend.position = "bottom")

p<-ggplot(tourney, aes(x=POSTSEASON, y=TOR, fill=POSTSEASON)) + 
    geom_boxplot()+
  theme_bw()

p+theme(legend.position = "bottom")

p<-ggplot(tourney, aes(x=POSTSEASON, y=TORD, fill=POSTSEASON)) + 
    geom_boxplot()+
  theme_bw()

p+theme(legend.position = "bottom")

Above, I examined correlations among various statistics for all teams in the 2019 season and correlations split by tournament qualification. Let’s examine the relationships between different stats just in the tournament teams using a corrplot from BasketballAnalyzeR.

Interestingly, these stats range from weakly to moderately related to wins. The effective field goal percentage variables (shot and allowed) are highly related to the 2-point and 3-point percentages made and allowed, suggesting that using both in a model could lead to high collinearity. These high correlations make sense, as the effective field goal variables are likely calculated from the same 2-point and 3-point makes and attempts.

tourney$BARTHAG<-NULL
corrmatrix<-corranalysis(tourney[,4:19], threshold = .5)
plot(corrmatrix)
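
The collinearity noted above follows from the standard definition of effective field goal percentage, eFG% = (FGM + 0.5 × 3PM) / FGA. The sketch below assumes that EFG_O in this dataset follows that standard definition, which the data source does not spell out. The shooting totals are hypothetical, not taken from cbb.

```r
# Hypothetical season shooting totals for one team (not from the cbb data)
fgm2 <- 600; fga2 <- 1200   # two-point makes and attempts
fgm3 <- 250; fga3 <- 700    # three-point makes and attempts

# Standard definition: eFG% = (FGM + 0.5 * 3PM) / FGA
fgm <- fgm2 + fgm3
fga <- fga2 + fga3
efg <- (fgm + 0.5 * fgm3) / fga

# Same quantity written as a weighted blend of 2P% and 3P%
p2 <- fgm2 / fga2
p3 <- fgm3 / fga3
share3 <- fga3 / fga        # share of attempts taken from three
efg_blend <- (1 - share3) * p2 + 1.5 * share3 * p3

all.equal(efg, efg_blend)
```

Under that definition, eFG% is an exact function of 2P%, 3P%, and shot mix, so including all three in one regression adds little independent information.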

Post Season Analysis

tourney$TFac<-ordered(tourney$POSTSEASON, levels = c("R68", "R64", "R32", "S16", "E8", "F4", "2ND", "Champions"))
tourney$TFac<-as.numeric(tourney$TFac)

m1<- lm(TFac ~ ADJOE + ADJDE + TOR + TORD + ORB + DRB + ADJ_T, data=tourney)
summary(m1)
## 
## Call:
## lm(formula = TFac ~ ADJOE + ADJDE + TOR + TORD + ORB + DRB + 
##     ADJ_T, data = tourney)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.88803 -0.54753  0.06408  0.54113  2.55633 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -4.50782    5.54203  -0.813   0.4192    
## ADJOE        0.15888    0.02867   5.541 7.07e-07 ***
## ADJDE       -0.06808    0.02561  -2.658   0.0101 *  
## TOR          0.13289    0.10272   1.294   0.2007    
## TORD         0.01598    0.06762   0.236   0.8140    
## ORB         -0.02384    0.03958  -0.602   0.5493    
## DRB          0.03106    0.05299   0.586   0.5600    
## ADJ_T       -0.09501    0.04141  -2.294   0.0253 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9185 on 60 degrees of freedom
## Multiple R-squared:  0.6101, Adjusted R-squared:  0.5646 
## F-statistic: 13.41 on 7 and 60 DF,  p-value: 2.804e-10
tab_model(m1)

T Fac

| Predictors | Estimates | CI | p |
|:--|--:|:--|--:|
| (Intercept) | -4.51 | -15.59 – 6.58 | 0.419 |
| ADJOE | 0.16 | 0.10 – 0.22 | <0.001 |
| ADJDE | -0.07 | -0.12 – -0.02 | 0.010 |
| TOR | 0.13 | -0.07 – 0.34 | 0.201 |
| TORD | 0.02 | -0.12 – 0.15 | 0.814 |
| ORB | -0.02 | -0.10 – 0.06 | 0.549 |
| DRB | 0.03 | -0.07 – 0.14 | 0.560 |
| ADJ T | -0.10 | -0.18 – -0.01 | 0.025 |

Observations: 68
R2 / R2 adjusted: 0.610 / 0.565

#check_model(m1)
#model_performance(m1)

Offensive efficiency, defensive efficiency, and tempo were statistically significant predictors of tournament placement. Better offensive efficiency (higher ADJOE), better defensive efficiency (lower ADJDE), and slower tempo (lower ADJ_T) corresponded to remaining in the tournament longer.
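
The outcome in this model is the numeric code of an ordered factor, so each round counts as one equally spaced step. That is a simplifying assumption, and an ordinal regression would be one alternative that is not fitted here. A small sketch of the coding, using three finishes mentioned in the post season notes:

```r
rounds <- c("R68", "R64", "R32", "S16", "E8", "F4", "2ND", "Champions")

# A few example finishes (Virginia won; Texas Tech was runner-up)
finish <- c(Virginia = "Champions", `Texas Tech` = "2ND", Duke = "E8")

# Ordered factor -> numeric codes, 1 (First Four) up to 8 (champion)
tfac <- as.numeric(ordered(finish, levels = rounds))
names(tfac) <- names(finish)
tfac
```

A coefficient in the model is therefore read as the expected change in rounds advanced per one-unit change in the predictor.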

m2<- lm(W ~ ADJOE + ADJDE + TOR + TORD + ORB + DRB + ADJ_T, data=cbb)
summary(m2)
## 
## Call:
## lm(formula = W ~ ADJOE + ADJDE + TOR + TORD + ORB + DRB + ADJ_T, 
##     data = cbb)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -10.9285  -2.5066   0.0583   2.3364  10.3360 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -4.67673    9.40723  -0.497 0.619406    
## ADJOE        0.41997    0.04332   9.694  < 2e-16 ***
## ADJDE       -0.23797    0.04060  -5.861 1.08e-08 ***
## TOR         -0.43752    0.12258  -3.569 0.000409 ***
## TORD         0.73777    0.10573   6.978 1.54e-11 ***
## ORB          0.18087    0.05542   3.263 0.001211 ** 
## DRB         -0.42026    0.07522  -5.587 4.69e-08 ***
## ADJ_T        0.06206    0.06917   0.897 0.370253    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.388 on 345 degrees of freedom
## Multiple R-squared:  0.7232, Adjusted R-squared:  0.7176 
## F-statistic: 128.8 on 7 and 345 DF,  p-value: < 2.2e-16
tab_model(m2)

W

| Predictors | Estimates | CI | p |
|:--|--:|:--|--:|
| (Intercept) | -4.68 | -23.18 – 13.83 | 0.619 |
| ADJOE | 0.42 | 0.33 – 0.51 | <0.001 |
| ADJDE | -0.24 | -0.32 – -0.16 | <0.001 |
| TOR | -0.44 | -0.68 – -0.20 | <0.001 |
| TORD | 0.74 | 0.53 – 0.95 | <0.001 |
| ORB | 0.18 | 0.07 – 0.29 | 0.001 |
| DRB | -0.42 | -0.57 – -0.27 | <0.001 |
| ADJ T | 0.06 | -0.07 – 0.20 | 0.370 |

Observations: 353
R2 / R2 adjusted: 0.723 / 0.718

m3<- lm(W ~ ADJOE + ADJDE + TOR + TORD + ORB + DRB + ADJ_T, data=tourney)
summary(m3)
## 
## Call:
## lm(formula = W ~ ADJOE + ADJDE + TOR + TORD + ORB + DRB + ADJ_T, 
##     data = tourney)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.4790 -2.4102  0.4107  1.8924  6.5918 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept) 16.81923   18.89224   0.890   0.3769  
## ADJOE        0.20574    0.09774   2.105   0.0395 *
## ADJDE       -0.10281    0.08732  -1.177   0.2437  
## TOR         -0.79951    0.35017  -2.283   0.0260 *
## TORD         0.53273    0.23051   2.311   0.0243 *
## ORB          0.30951    0.13494   2.294   0.0253 *
## DRB         -0.43666    0.18064  -2.417   0.0187 *
## ADJ_T        0.03035    0.14116   0.215   0.8305  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.131 on 60 degrees of freedom
## Multiple R-squared:  0.5136, Adjusted R-squared:  0.4568 
## F-statistic:  9.05 on 7 and 60 DF,  p-value: 1.421e-07
tab_model(m3)

W

| Predictors | Estimates | CI | p |
|:--|--:|:--|--:|
| (Intercept) | 16.82 | -20.97 – 54.61 | 0.377 |
| ADJOE | 0.21 | 0.01 – 0.40 | 0.039 |
| ADJDE | -0.10 | -0.28 – 0.07 | 0.244 |
| TOR | -0.80 | -1.50 – -0.10 | 0.026 |
| TORD | 0.53 | 0.07 – 0.99 | 0.024 |
| ORB | 0.31 | 0.04 – 0.58 | 0.025 |
| DRB | -0.44 | -0.80 – -0.08 | 0.019 |
| ADJ T | 0.03 | -0.25 – 0.31 | 0.830 |

Observations: 68
R2 / R2 adjusted: 0.514 / 0.457

In the full season sample, Offensive Efficiency, Defensive Efficiency, Turnover Rate, Steal Rate, Offensive Rebound Rate, and Offensive Rebound Rate Allowed were all statistically significant predictors of total wins.

In the subsample of teams that qualified for the tournament, Offensive Efficiency, Turnover Rate, Steal Rate, Offensive Rebound Rate, and Offensive Rebound Rate Allowed were all statistically significant predictors of total wins.

clusVARS<-c(
          "EFG_O",
          "EFG_D",
          "TOR",
          "TORD",
          "ORB",
          "DRB",
          "FTR",
          "FTRD",
          "X2P_O",
          "X2P_D",
          "X3P_O",
          "X3P_D",
          "ADJ_T"
)
clusITEMS<-tourney[clusVARS]

set.seed(13)
kclu1<-kclustering(clusITEMS)
plot(kclu1)

kclu2<-kclustering(clusITEMS, labels = tourney$TEAM, k=5)
plot(kclu2)

cluster <- as.factor(kclu2$Subjects$Cluster)
Xbubble <- data.frame(Team=tourney$TEAM, PTS=tourney$ADJOE,
                      PTS.Opp=tourney$ADJDE, cluster,
                      W=tourney$W)
labs <- c("PTS", "PTS.Opp", "cluster", "Wins")
bubbleplot(Xbubble, id="Team", x="PTS", y="PTS.Opp",
           col="cluster", size="W", labels=labs)

Bubble plot of the teams that participated in the 2019 tournament for offensive efficiency (PTS), defensive efficiency (PTS.Opp), number of wins, and cluster.
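
For readers without BasketballAnalyzeR installed, a rough base-R analogue of the clustering step is sketched below. It is not necessarily what kclustering() does internally: the standardization and the `nstart` value are assumptions, and the data are simulated stand-ins for three of the clustering variables.

```r
# Simulated stand-in for clusITEMS (68 teams x 3 stats), not the real data
set.seed(13)
sim_stats <- data.frame(
  EFG_O = rnorm(68, mean = 52.7, sd = 2.5),
  TOR   = rnorm(68, mean = 17.4, sd = 1.6),
  ADJ_T = rnorm(68, mean = 68.5, sd = 2.8)
)

# Standardize so no variable dominates the Euclidean distances
z <- scale(sim_stats)

# k-means with 5 clusters and several random starts for stability
km <- kmeans(z, centers = 5, nstart = 25)

# Share of total variance explained by the clustering
explained <- km$betweenss / km$totss

table(km$cluster)
round(explained, 2)
```

Comparing `explained` across several values of k is one common way to choose the number of clusters.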