Introduction

Two datasets are used in this project: Seasons_Stats.csv and All.NBA.1984-2018.csv. Both were originally scraped from the website Basketball Reference, but I chose to use the more ready-to-use versions provided by users on Kaggle.

In addition, I will not use all of the data from these two datasets, because some important advanced statistics are missing for the 1950s-1970s (several advanced metrics were only introduced in the 1980s). Moreover, the league and its players have evolved over the years, so data from earlier seasons would do a poor job of predicting whether a current player is chosen for an All-NBA team. Consequently, this project only considers the data from 2000 (the year I was born) to 2017 (the latest season available in these datasets).
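As a minimal sketch of this step, the two Kaggle files could be read and restricted to the 2000-2017 seasons as follows; the file names match the datasets above, but the column name Year is an assumption about the raw Seasons_Stats.csv export.

library(readr)
library(dplyr)

# Read the two Kaggle exports described above.
seasons <- read_csv("Seasons_Stats.csv")
all_nba <- read_csv("All.NBA.1984-2018.csv")

# Keep only the 2000-2017 seasons; `Year` is an assumed
# column name in the raw Seasons_Stats.csv file.
seasons_0017 <- seasons %>%
  filter(Year >= 2000, Year <= 2017)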

More than 450 NBA players appear in a single season, but only 15 of them are selected to the All-NBA teams for that season. This makes me wonder what sets these 15 players apart from the rest of the league. The question I am trying to answer is therefore: which characteristics of an NBA player are most indicative of selection to an All-NBA team, and to what extent can All-NBA selection be predicted from players' statistics?

After completing the data wrangling, the dataset contains a total of 7220 rows of player statistics spanning 2000-2017.

dim(players_pergame)
## [1] 7220   24

In addition, there are 273 rows of All-NBA players (All_NBA = 1), when in fact there should only be 270 (18 seasons * 15 players/season = 270 players). Dikembe Mutombo was counted two extra times (he switched teams twice during the 2000-2001 season), and Chauncey Billups was counted one extra time (he switched teams once during the 2008-2009 season). Nevertheless, I decided to keep these three extra entries because, though recorded on different teams, their stats still reflect those players' overall performances within those seasons.

## # A tibble: 18 × 2
##     year `length(Name)`
##    <dbl>          <int>
##  1  2000             15
##  2  2001             17
##  3  2002             15
##  4  2003             15
##  5  2004             15
##  6  2005             15
##  7  2006             15
##  8  2007             15
##  9  2008             15
## 10  2009             16
## 11  2010             15
## 12  2011             15
## 13  2012             15
## 14  2013             15
## 15  2014             15
## 16  2015             15
## 17  2016             15
## 18  2017             15
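For reference, the per-season counts above could come from a dplyr summary along these lines; the column names year, Name, and All_NBA follow the naming used elsewhere in this report, and this sketch is an assumption rather than the exact chunk used.

library(dplyr)

# Count All-NBA rows (All_NBA == 1) per season; the unnamed
# summary column prints as `length(Name)`, as in the tibble above.
players_pergame %>%
  filter(All_NBA == 1) %>%
  group_by(year) %>%
  summarise(length(Name))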

Let's compare the distributions of some basic features (points, rebounds, and assists) between All-NBA players and players who were not selected to any All-NBA team. The plots in the first row use data from all players, and the plots in the second row use only the players selected to the All-NBA teams. The vertical line in each plot marks the mean of the corresponding feature for that population.
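As a rough sketch of how such panels could be drawn with ggplot2 (the per-game column name PTS is an assumption about players_pergame, and plot_feature is a hypothetical helper, not the report's actual plotting code):

library(ggplot2)
library(dplyr)

# Hypothetical helper: histogram of one feature with a dashed
# vertical line at that group's mean.
plot_feature <- function(data, feature, title) {
  ggplot(data, aes(x = .data[[feature]])) +
    geom_histogram(bins = 30) +
    geom_vline(xintercept = mean(data[[feature]], na.rm = TRUE),
               linetype = "dashed") +
    labs(title = title, x = feature)
}

# First row: all players; second row: All-NBA selections only.
all_nba_only <- filter(players_pergame, All_NBA == 1)
p_all <- plot_feature(players_pergame, "PTS", "Points, all players")
p_sel <- plot_feature(all_nba_only, "PTS", "Points, All-NBA players")

The six panels (points, rebounds, and assists for each population) could then be arranged in a two-row grid with a package such as gridExtra or patchwork.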

Descriptions and justifications of the methodology:

Given that my dataset players_pergame, after pre-processing and data wrangling, has a binary outcome variable named All_NBA, I will first implement a multiple logistic regression on this dataset. Logistic regression is a natural first choice for a binary outcome and provides a baseline against which more advanced, computationally heavier models can be compared later on. In addition, all of my player-performance variables are continuous, which also makes a regression model a reasonable starting point.

Furthermore, I am considering a random forest model, a powerful method that generally performs well. It also offers many opportunities to tune parameters, such as mtry, to improve the model's performance and prediction results. Moreover, tree-based methods (and boosting) generally perform better on imbalanced datasets than many other classification algorithms. Consequently, I decided to implement multiple logistic regression and random forests on my dataset.

Lastly, I would like to justify not choosing some of the algorithms we covered in class. Linear regression was not chosen because the outcome variable is binary. LDA and QDA were not selected because, as the six plots above suggest, some of the independent variables do not follow Gaussian distributions, which violates the assumptions of LDA and QDA. A single decision tree was not selected because a random forest offers advantages over individual trees, such as greater stability and a lower tendency to overfit.
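A minimal sketch of the two fits follows, assuming players_pergame carries per-game feature columns such as PTS, TRB, and AST alongside the All_NBA flag; the feature set and the mtry value here are illustrative, not the final specification.

library(randomForest)

# Encode the binary outcome as a factor so randomForest
# performs classification rather than regression.
players_pergame$All_NBA <- factor(players_pergame$All_NBA)

# Baseline: logistic regression on a few per-game features
# (PTS, TRB, AST are assumed column names for points,
# rebounds, and assists).
logit_fit <- glm(All_NBA ~ PTS + TRB + AST,
                 data = players_pergame, family = binomial)

# Random forest; mtry here is illustrative and would be tuned
# (e.g., by cross-validation) in the actual analysis.
rf_fit <- randomForest(All_NBA ~ PTS + TRB + AST,
                       data = players_pergame,
                       mtry = 2, ntree = 500,
                       na.action = na.omit)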