Para auxiliar nessa tarefa algumas covariáveis foram disponibilizadas, na próxima seção algumas delas são apresentadas

Conhecendo os dados

library(latticeExtra)

A imensa maioria dos arremessos tentados por Kobe foram do tipo jump shot

barchart(sort(table(kobe$combined_shot_type))
         , col = "#0080ff"
         , border = "transparent"
         , xlim = c(0, 27000)
         , xlab = "Aremessos"
         , main = "Arremessos por categoria"
         , par.settings = settings
         , panel = function(...){
           panel.abline(v = seq(0, 26000, 1000), col = "gray90")
           args <- list(...)
           panel.text(
             args$x, args$y, paste0(args$x, " (", round(prop.table(args$x), 3)*100, "%)")
             , pos = 4)
           panel.barchart(...)})

Ali os tipos de arremesso foram reunidos em maiores categorias, mas temos uma informação mais detalhada dos tipos de arremessos tentados

table(kobe$action_type)


               Alley Oop Dunk Shot               Alley Oop Layup shot 
                               122                                 80 
    Cutting Finger Roll Layup Shot                 Cutting Layup Shot 
                                 1                                  6 
                 Driving Bank shot                  Driving Dunk Shot 
                                 5                                310 
    Driving Finger Roll Layup Shot           Driving Finger Roll Shot 
                                69                                 82 
   Driving Floating Bank Jump Shot         Driving Floating Jump Shot 
                                 1                                  5 
                 Driving Hook Shot                  Driving Jump shot 
                                14                                 28 
                Driving Layup Shot         Driving Reverse Layup Shot 
                              1978                                 97 
            Driving Slam Dunk Shot                          Dunk Shot 
                                48                                262 
                Fadeaway Bank shot                 Fadeaway Jump Shot 
                                31                               1048 
            Finger Roll Layup Shot                   Finger Roll Shot 
                                33                                 28 
                Floating Jump shot                Follow Up Dunk Shot 
                               114                                 15 
                    Hook Bank Shot                          Hook Shot 
                                 5                                 84 
                    Jump Bank Shot                     Jump Hook Shot 
                               333                                 24 
                         Jump Shot                         Layup Shot 
                             18880                               2567 
                  Pullup Bank shot                   Pullup Jump shot 
                                12                                476 
                 Putback Dunk Shot                 Putback Layup Shot 
                                 5                                 15 
            Putback Slam Dunk Shot                  Reverse Dunk Shot 
                                 2                                 75 
                Reverse Layup Shot             Reverse Slam Dunk Shot 
                               395                                 16 
                 Running Bank shot                  Running Dunk Shot 
                                48                                 19 
    Running Finger Roll Layup Shot           Running Finger Roll Shot 
                                 6                                  4 
                 Running Hook Shot                  Running Jump Shot 
                                41                                926 
                Running Layup Shot          Running Pull-Up Jump Shot 
                                72                                  4 
        Running Reverse Layup Shot             Running Slam Dunk Shot 
                                11                                  1 
                  Running Tip Shot                     Slam Dunk Shot 
                                 2                                411 
               Step Back Jump shot                     Tip Layup Shot 
                               118                                  2 
                          Tip Shot               Turnaround Bank shot 
                               182                                 71 
Turnaround Fadeaway Bank Jump Shot           Turnaround Fadeaway shot 
                                 1                                439 
       Turnaround Finger Roll Shot               Turnaround Hook Shot 
                                 2                                 14 
              Turnaround Jump Shot 
                              1057

Olhando para esse desmembramento a diferença de jump shots para os demais tipos de arremesso tentados se torna ainda mais evidente

26198 (85%) dos arremessos tentados por Kobe foram em temporadas regulares

devel1 <- as.data.frame(xtabs(~ playoffs + combined_shot_type, kobe))

devel1$combined_shot_type <- factor(
  devel1$combined_shot_type, levels(devel1$combined_shot_type)[c(1, 3, 6, 2, 5, 4)])

levels(devel1$playoffs) <- c("Temporada regular", "Playoffs")

levels(devel1$playoffs) <-
  paste0(levels(devel1$playoffs), ": "
         , table(kobe$playoffs), " (", round(prop.table(table(kobe$playoffs)), 4)*100, "%)")

barchart(combined_shot_type ~ Freq | playoffs
         , col = "#0080ff"
         , border = "transparent"
         , scales = list(x = "free")
         , xlim = list(c(0, 27000), c(0, 4500))
         , strip = strip.custom(bg = "white")
         , xlab = "Aremessos"
         , main = "Arremessos por temporada regular e por playoffs"
         , par.settings = settings
         , devel1
         , panel = function(...){
           args <- list(...)
           panel.text(
             args$x, args$y, paste0(args$x, " (", round(prop.table(args$x), 3)*100, "%)")
             , pos = 4)
           panel.barchart(...)})

Mais de 1/4 de todos os arremessos (27%) foram tentados no 3º quarto, com o 2º quarto tendo o segundo maior volume de arremessos. Menos de 400 arremessos foram tentados em prorrogações

devel2 <- as.data.frame(xtabs(~ period + combined_shot_type, kobe))

devel2$combined_shot_type <- factor(
  devel2$combined_shot_type, levels(devel2$combined_shot_type)[c(1, 3, 6, 2, 5, 4)])

levels(devel2$period) <- c("1º quarto", "2º quarto", "3º quarto", "4º quarto"
                           , "1ª prorrogação", "2ª prorrogação", "3ª prorrogação")

levels(devel2$period) <-
  paste0(levels(devel2$period), ": "
         , table(kobe$period), " (", round(prop.table(table(kobe$period)), 4)*100, "%)")

barchart(combined_shot_type ~ Freq | period
         , scales = list(x = "free")
         , strip = strip.custom(bg = "white")
         , xlab = "Arremessos"
         , main = "Arremessos por período do jogo"
         , par.settings = settings
         , col = "#0080ff"
         , border = "transparent"
         , xlim = list(
           c(0, 11000), c(0, 8000), c(0, 11000), c(0, 10000), c(0, 400), c(0, 50), c(0, 8))
         , layout = c(4, 2)
         , devel2
         , panel = function(...){
           args <- list(...)
           panel.text(
             args$x, args$y, paste0(args$x, " (", round(prop.table(args$x), 3)*100, "%)")
             , pos = 4)
           panel.barchart(...)})

As temporadas 2002/2003, 2005/2006, 2007/2008 e 2008/2009 foram as temporadas com os maiores números de arremesso tentados por Kobe

devel3 <- kobe

devel3$season <- factor(devel3$season, levels(devel3$season)[20:1])

barchart(devel3$season
         , col = "#0080ff"
         , border = "transparent"
         , xlim = c(0, 3000)
         , xlab = "Arremessos"
         , main = "Arremessos por temporada"
         , par.settings = settings
         , panel = function(...){
           panel.abline(v = seq(0, 3000, 100), col = "gray90")
           args <- list(...)
           panel.text(
             args$x, args$y, paste0(args$x, " (", round(prop.table(args$x), 3)*100, "%)")
             , pos = 4)
           panel.barchart(...)})

Dispersão dos arremessos

xyplot(loc_y ~ loc_x, groups = shot_type
       , type = c("p", "g")
       , pch = 19
       , col = c("#0080ff", "gray")
       , xlab = "Longitude"
       , ylab = "Latitude"
       , key = list(text = list(c("Arremesso de 2 pontos", "Arremesso de 3 pontos"))
                    , points = TRUE
                    , col = c("#0080ff", "gray") , pch = 19, columns = 2)
       , kobe)

Observa-se alguns erros na base de dados, i.e., arremessos de 2 pontos contabilizados como de 3, e um arremesso de 2 contabilizado como de 3 pontos

Dispersão dos arremessos pela área e zona da quadra e pela distância à cesta

print(xyplot(loc_y ~ loc_x, groups = shot_zone_area
             , type = c("p", "g")
             , pch = 19
             , xlab = "Longitude"
             , ylab = "Latitude"
             , kobe)
      , position = c(0, .5, .5, 1)
      , more = TRUE)

print(xyplot(loc_y ~ loc_x, groups = shot_zone_basic
             , type = c("p", "g")
             , pch = 19
             , xlab = "Longitude"
             , ylab = "Latitude"
             , kobe)
      , position = c(.5, .5, 1, 1)
      , more = TRUE)

print(xyplot(loc_y ~ loc_x, groups = shot_zone_range
       , type = c("p", "g")
       , pch = 19
       , xlab = "Longitude"
       , ylab = "Latitude"
       , kobe)
      , position = c(.25, 0, .75, .5))

Times da conferência oeste são os adversários contra os quais Kobe tentou mais arremessos

barchart(sort(table(kobe$opponent))
         , col = "#0080ff"
         , border = "transparent"
         , xlim = c(0, 2400)
         , xlab = "Aremessos"
         , main = "Arremessos por oponente"
         , par.settings = settings
         , panel = function(...){
           panel.abline(v = seq(0, 2400, 200), col = "gray90")
           args <- list(...)
           panel.text(
             args$x, args$y, paste0(args$x, " (", round(prop.table(args$x), 3)*100, "%)")
             , pos = 4)
           panel.barchart(...)})

Entre as outras variáveis disponíveis temos a informação de quantos minutos restavam em cada quarto de jogo (cada quarto tem duração de 12 minutos) e de quantos segundos restavam em cada um desses minutos, além da data de cada partida

A base de dados é grande e o volume de computações que algumas técnicas exigem é elevado, portanto nem todas as variáveis disponíveis foram utilizadas

Foi dado preferência as variáveis que mais logicamente podem influenciar na conversão, ou não, do aremesso. Por exemplo: adversário, tempo restante no relógio, tipo de arremesso, local da quadra em que o arremesso foi feito, se o jogo era de temporada regular ou de playoff, uma indicação de quais arremessos foram tentados no mesmo jogo

Modelos

levels(kobe$combined_shot_type) <- c("Bank", "Dunk", "Hook", "Jump", "Layup", "Tip")

kobe_train <- subset(kobe, !is.na(shot_made_flag))

kobe_train$shot_made_flag <- factor(kobe_train$shot_made_flag, labels = c("no", "yes"))

kobe_test <- subset(kobe, is.na(shot_made_flag))

Gradient Boosting Machine

library(gbm)

library(caret)

library(pROC)

objcontrol <- trainControl(method = "cv"
                           , number = "3"
                           , returnResamp = "none"
                           , summaryFunction = twoClassSummary
                           , classProbs = TRUE)

objmodel <- train(shot_made_flag
                  ~ combined_shot_type
                  + game_id
                  + minutes_remaining
                  + period
                  + playoffs
                  + season
                  + seconds_remaining
                  + shot_distance
                  + shot_type
                  + shot_zone_basic
                  + shot_zone_range
                  + opponent
                  , kobe_train
                  , method = "gbm"
                  , trControl = objcontrol
                  , metric = "ROC"
                  , preProc = c("center", "scale"))

Depois de mais de 30 horas de processamento a computação foi abortada, a metodologia utilizada se mostrou muito demorada

Random Forest

library(randomForest)

set.seed(68)

(model_rf <- randomForest(shot_made_flag
                          ~ combined_shot_type
                          + game_id
                          + minutes_remaining
                          + period
                          + playoffs
                          + season
                          + seconds_remaining
                          + shot_distance
                          + shot_type
                          + shot_zone_basic
                          + shot_zone_range
                          + opponent
                          , kobe_train
                          , importance = TRUE))


Call:
 randomForest(formula = shot_made_flag ~ combined_shot_type +      game_id + minutes_remaining + period + playoffs + season +      seconds_remaining + shot_distance + shot_type + shot_zone_basic +      shot_zone_range + opponent, data = kobe_train, importance = TRUE) 
               Type of random forest: classification
                     Number of trees: 500
No. of variables tried at each split: 3

        OOB estimate of  error rate: 40.88%
Confusion matrix:
       no  yes class.error
no  14021  211  0.01482574
yes 10293 1172  0.89777584

varImpPlot(model_rf, main = "Floresta aleatória", pch = 19)

A floresta se saiu bem na classificação dos arremessos não convertidos, com um erro de classificação de 1.5%. Já nos arremessos convertidos o erro de classificação obtido foi gigantesco, 90%!

Das duas medidas calculadas para representar a importância de cada variável na classificação, em geral, é difícil fazer afirmações. A variável que pra uma medida se mostra a mais importante, opponent, na outra medida é a menos importante

As variáveis game_id, season, shot_distance e combined_shot_type são as que se destacaram em ambas as medidas

print(xyplot(loc_y ~ loc_x
             , groups = shot_made_flag
             , col = 2:3
             , pch = 19
             , type = c("p", "g")
             , main = "Observado"
             , xlab = "Longitude"
             , ylab = "Latitude"
             , key = list(space = "top"
                          , text = list(c("Errou", "Acertou"))
                          , points = list(col = 2:3, pch = 19, cex = .8)
                          , columns = 2)
             , kobe_train)
      , position = c(0, 0, .5, 1), more = TRUE)

print(xyplot(loc_y ~ loc_x
             , groups = model_rf$predicted
             , col = 2:3
             , pch = 19
             , type = c("p", "g")
             , main = "Predito"
             , xlab = "Longitude"
             , ylab = "Latitude"
             , key = list(space = "top"
                          , text = list(c("Errou", "Acertou"))
                          , points = list(col = 2:3, pch = 19, cex = .8)
                          , columns = 2)
             , kobe_train)
      , position = c(.5, 0, 1, 1))

Regressão logística

model_glm <- glm(shot_made_flag
                 ~ combined_shot_type
                 + game_id
                 + minutes_remaining
                 + period
                 + playoffs
                 + season
                 + seconds_remaining
                 + shot_distance
                 + shot_type
                 + shot_zone_basic
                 + shot_zone_range
                 + opponent
                 , kobe_train
                 , family = binomial)

anova(model_glm, test = "Chisq")

Analysis of Deviance Table

Model: binomial, link: logit

Response: shot_made_flag

Terms added sequentially (first to last)

                   Df Deviance Resid. Df Resid. Dev  Pr(>Chi)    
NULL                               25696      35325              
combined_shot_type  5  1697.40     25691      33628 < 2.2e-16 ***
game_id             1     0.29     25690      33627 0.5917314    
minutes_remaining   1    23.04     25689      33604 1.584e-06 ***
period              1    19.81     25688      33585 8.575e-06 ***
playoffs            1     2.91     25687      33582 0.0881328 .  
season             19    50.02     25668      33532 0.0001303 ***
seconds_remaining   1    19.15     25667      33512 1.211e-05 ***
shot_distance       1   117.55     25666      33395 < 2.2e-16 ***
shot_type           1     6.67     25665      33388 0.0097822 ** 
shot_zone_basic     6    30.41     25659      33358 3.279e-05 ***
shot_zone_range     3    24.78     25656      33333 1.713e-05 ***
opponent           32    27.34     25624      33306 0.7013656    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Aqui as covariáveis são inseridas sequencialmente, seguindo esse esquema apenas duas variáveis não se mostraram significativas, game_id num modelo que já tinha combined_shot_type, e opponent num modelo com todas as outras variáveis presentes

print(xyplot(loc_y ~ loc_x
             , groups = shot_made_flag
             , type = c("p", "g")
             , pch = 19
             , col = 2:3
             , main = "Observado"
             , xlab = "Longitude"
             , ylab = "Latitude"
             , key = list(space = "top"
                          , text = list(c("Errou", "Acertou"))
                          , points = list(col = 2:3, pch = 19, cex = .8)
                          , columns = 2)
             , kobe_train)
      , position = c(0, 0, .5, 1)
      , more = TRUE)

print(xyplot(loc_y ~ loc_x
             , groups = ifelse(model_glm$fitted.values > .5, 1, 0)
             , type = c("p", "g")
             , pch = 19
             , col = 2:3
             , main = "Predito"
             , xlab = "Longitude"
             , ylab = "Latitude"
             , key = list(space = "top"
                          , text = list(c("Errou", "Acertou"))
                          , points = list(col = 2:3, pch = 19, cex = .8)
                          , columns = 2)
             , kobe_train)
      , position = c(.5, 0, 1, 1))

mean(ifelse(model_glm$fitted.values > .5, 1, 0) != model_glm$y)

[1] 0.3861151

table(model_glm$y, ifelse(model_glm$fitted.values > .5, 1, 0))

   
        0     1
  0 11980  2252
  1  7670  3795

A acurácia da regressão logística foi de apenas 39%

O erro de classificação dos arremessos errados é de 16% [ 2252 / (2252+11980) ], já o dos arremessos convertivos é de 67% [ 7670 / (3795+7670) ]

Discriminante de Fischer

Linear

library(MASS)

model_dfl <- lda(shot_made_flag
                 ~ combined_shot_type
                 + game_id
                 + minutes_remaining
                 + period
                 + playoffs
                 + season
                 + seconds_remaining
                 + shot_distance
                 + shot_type
                 + shot_zone_basic
                 + shot_zone_range
                 + opponent
                 , kobe_train)

pred_dfl <- predict(model_dfl, kobe_train)

print(xyplot(loc_y ~ loc_x
             , groups = shot_made_flag
             , type = c("p", "g")
             , pch = 19
             , col = 2:3
             , main = "Observado"
             , xlab = "Longitude"
             , ylab = "Latitude"
             , key = list(space = "top"
                          , text = list(c("Errou", "Acertou"))
                          , points = list(col = 2:3, pch = 19, cex = .8)
                          , columns = 2)
             , kobe_train)
      , position = c(0, 0, .5, 1)
      , more = TRUE)

print(xyplot(loc_y ~ loc_x
             , groups = pred_dfl$class
             , type = c("p", "g")
             , pch = 19
             , col = 2:3
             , main = "Predito"
             , xlab = "Longitude"
             , ylab = "Latitude"
             , key = list(space = "top"
                          , text = list(c("Errou", "Acertou"))
                          , points = list(col = 2:3, pch = 19, cex = .8)
                          , columns = 2)
             , kobe_train)
      , position = c(.5, 0, 1, 1))

table(kobe_train$shot_made_flag, pred_dfl$class)

     
         no   yes
  no  11985  2247
  yes  7661  3804

O erro de classificação dos arremessos errados é de 16% [ 2247 / (2247+11985) ], já o dos arremessos convertivos é de 67% [ 7661 / (3804 + 7661) ]

Regularizado

library(klaR)

model_dfr <- rda(shot_made_flag
                 ~ combined_shot_type
                 + game_id
                 + minutes_remaining
                 + period
                 + playoffs
                 + season
                 + seconds_remaining
                 + shot_distance
                 + shot_type
                 + shot_zone_basic
                 + shot_zone_range
                 + opponent
                 , kobe_train)

pred_dfr <- predict(model_dfr, kobe_train)

print(xyplot(loc_y ~ loc_x
             , groups = shot_made_flag
             , type = c("p", "g")
             , pch = 19
             , col = 2:3
             , main = "Observado"
             , xlab = "Longitude"
             , ylab = "Latitude"
             , key = list(space = "top"
                          , text = list(c("Errou", "Acertou"))
                          , points = list(col = 2:3, pch = 19, cex = .8)
                          , columns = 2)
             , kobe_train)
      , position = c(0, 0, .5, 1)
      , more = TRUE)

print(xyplot(loc_y ~ loc_x
             , groups = pred_dfr$class
             , type = c("p", "g")
             , pch = 19
             , col = 2:3
             , main = "Predito"
             , xlab = "Longitude"
             , ylab = "Latitude"
             , key = list(space = "top"
                          , text = list(c("Errou", "Acertou"))
                          , points = list(col = 2:3, pch = 19, cex = .8)
                          , columns = 2)
             , kobe_train)
      , position = c(.5, 0, 1, 1))

table(kobe_train$shot_made_flag, pred_dfr$class)

     
         no   yes
  no      0 14232
  yes     0 11465

Aqui todos os arremessos foram classificados como convertidos

Por problemas numéricos não foi possível utilizar o Discriminante de Fischer Quadrático

Predições, comparações e considerações

print(xyplot(loc_y ~ loc_x
             , groups = predict(model_rf, kobe_test)
             , type = c("p", "g")
             , pch = 19
             , col = 2:3
             , main = "Floresta aleatória"
             , xlab = "Longitude"
             , ylab = "Latitude"
             , key = list(space = "top"
                          , text = list(c("Errou", "Acertou"))
                          , points = list(col = 2:3, pch = 19, cex = .8)
                          , columns = 2)
             , kobe_train)
      , position = c(0, .5, .5, 1)
      , more = TRUE)

print(xyplot(loc_y ~ loc_x
             , groups = ifelse(predict(model_glm, kobe_test, type = "response") > .5, 1, 0)
             , type = c("p", "g")
             , pch = 19
             , col = 2:3
             , main = "Regressão logística"
             , xlab = "Longitude"
             , ylab = "Latitude"
             , key = list(space = "top"
                          , text = list(c("Errou", "Acertou"))
                          , points = list(col = 2:3, pch = 19, cex = .8)
                          , columns = 2)
             , kobe_train)
      , position = c(.5, .5, 1, 1)
      , more = TRUE)

print(xyplot(loc_y ~ loc_x
             , groups = predict(model_dfl, kobe_test)$class
             , type = c("p", "g")
             , pch = 19
             , col = 2:3
             , main = "Discriminante de Fischer Linear"
             , xlab = "Longitude"
             , ylab = "Latitude"
             , key = list(space = "top"
                          , text = list(c("Errou", "Acertou"))
                          , points = list(col = 2:3, pch = 19, cex = .8)
                          , columns = 2)
             , kobe_train)
      , position = c(0, 0, .5, .5)
      , more = TRUE)

print(xyplot(loc_y ~ loc_x
             , groups = predict(model_dfr, kobe_test)$class
             , type = c("p", "g")
             , pch = 19
             , col = 2:3
             , main = "Discriminante de Fischer Regularizado"
             , xlab = "Longitude"
             , ylab = "Latitude"
             , key = list(space = "top"
                          , text = list(c("Errou", "Acertou"))
                          , points = list(col = 2:3, pch = 19, cex = .8)
                          , columns = 2)
             , kobe_train)
      , position = c(.5, 0, 1, .5))

Nenhuma das metodologias utilizadas resultou em ótimas predições na base de treino. A Floresta Aleatória apresentou uma altíssima acurária na classificação dos aremessos errados e uma acurária baixíssima nos arremessos convertidos

A Regressão Logística e o Discriminante de Fischer Linear apresentaram resultados praticamente idênticos e em comparação à Floresta Aleatória tiveram uma maior acurária para os arremessos convertidos e uma menor acurácia para os arremessos errados

Como os resultados não foram bons na base de treino as classificações obtidas para a base de teste não são muito confiáveis

O Discriminante de Fischer Regularizado continua classificando todos os arremessos como convertidos, e a Regressão Logística e o Discriminante de Fischer Linear continuam tendo resultados praticamente idênticos

Essas más classficações podem simplesmente representar a dificuldade em predizer o resultado de um arremesso com base em covariáveis. Talvez o processo seja tão aleatório que uma predição razoável se torna inviável e impraticável.

Kobe Bryant Shot Selection

Emerson Rigoni
Henrique Aparecido Laureano [Lattes, GitHub]

Julho de 2016

Conhecendo os dados

Modelos

Gradient Boosting Machine

Random Forest

Regressão logística

Discriminante de Fischer

Linear

Regularizado

Predições, comparações e considerações

Kobe Bryant Shot Selection

Emerson Rigoni Henrique Aparecido Laureano [Lattes, GitHub]

Julho de 2016

Conhecendo os dados

Modelos

Gradient Boosting Machine

Random Forest

Regressão logística

Discriminante de Fischer

Linear

Regularizado

Predições, comparações e considerações

Emerson Rigoni
Henrique Aparecido Laureano [Lattes, GitHub]