Overall, I implemented two models, logistic regression and random forest, and trained both with 5-fold cross-validation. For the random forest, I also tuned the mtry parameter to obtain the best-performing forest for this dataset. In terms of performance, the logistic regression model did marginally better than the tuned random forest: it achieved an overall accuracy of 98.34% and an AUC of 0.831, while the random forest reached an accuracy of 98.28% and an AUC of 0.8307. While both models had high accuracy, they also both had relatively low specificity. This can most likely be attributed to the imbalance in the outcome variable (i.e. far more players were not selected to an All-NBA team than were selected).

The top three most important features from the logistic regression model were Games, ShootingPercentage, and BoxPlusMinus, while the top three from the random forest were EfficiencyRating, BoxPlusMinus, and Points. Although the two sets of features overlap only in BoxPlusMinus, they tell a similar story. ShootingPercentage is closely tied to EfficiencyRating, since a player who shoots well will post a higher efficiency rating than one with a low shooting percentage. Likewise, a player who earns a lot of playing time is usually the same player who helps his team score, which links Games to Points. Consequently, I conclude that a player who scores a lot of points while maintaining a high efficiency rating and a high Box Plus/Minus has a significantly better chance of being selected to an All-NBA team.
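For reference, the training and feature-importance workflow described above could be reproduced with caret along the following lines. This is only a sketch under assumptions: the data frame `nba_train`, the outcome column `AllNBA`, and the mtry grid are placeholders, not the exact names or values used in the project.

```r
# Minimal sketch of the 5-fold CV setup (assumed caret workflow; names are illustrative)
library(caret)

set.seed(2022)
ctrl <- trainControl(method = "cv", number = 5,
                     classProbs = TRUE, summaryFunction = twoClassSummary)

# Logistic regression with 5-fold cross-validation
log_fit <- train(AllNBA ~ ., data = nba_train,
                 method = "glm", trControl = ctrl, metric = "ROC")

# Random forest, tuning mtry over a small candidate grid
rf_fit <- train(AllNBA ~ ., data = nba_train,
                method = "rf",
                tuneGrid = expand.grid(mtry = c(2, 4, 6, 8)),
                trControl = ctrl, metric = "ROC")

# Variable importance, which is how the top-three features above would be identified
varImp(log_fit)
varImp(rf_fit)
```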
Overall, I would consider my analysis successful given the high accuracy and AUC scores that both models achieved on the test set. One thing worth noting is the relatively low specificity of the two models. This, however, is an unavoidable consequence of the fact that fewer than 4% of players are selected to the All-NBA teams each year, which leaves the dataset with fewer than 4% positive labels for the outcome variable.
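The test-set metrics discussed above (accuracy, specificity, AUC) could be computed roughly as follows. Again this is a hedged sketch: it assumes the caret fits from the earlier sketch, a held-out data frame `nba_test`, and a positive class level named `Selected`, all of which are illustrative names rather than the project's actual ones.

```r
# Evaluate one of the fitted models on the held-out test set (illustrative names)
library(caret)
library(pROC)

pred_class <- predict(log_fit, newdata = nba_test)                 # predicted class labels
pred_prob  <- predict(log_fit, newdata = nba_test, type = "prob")  # predicted probabilities

# Accuracy, sensitivity, and specificity; with <4% positive labels,
# whichever statistic tracks the minority class will tend to be low
confusionMatrix(pred_class, nba_test$AllNBA)

# AUC from the predicted probability of the positive level ("Selected" is a placeholder)
roc_obj <- roc(response = nba_test$AllNBA, predictor = pred_prob[["Selected"]])
auc(roc_obj)
```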
If I had more time to dig deeper into this research question, I would scrape the web for the 2018-2022 NBA season data and, after training my models on all of the data used in this project, test their performance on those seasons. I would also consider resampling techniques, such as SMOTE oversampling or simple downsampling, to balance the skewed training set. Moreover, I would try running separate analyses for Centers, Forwards, and Guards, since in reality every All-NBA team is composed of 2 Guards, 2 Forwards, and 1 Center. Lastly, since the three All-NBA teams each year are ranked (i.e. the 1st team is better than the 2nd team, and the 2nd team is better than the 3rd team), I could factor this into the modeling by giving more weight to players who made the 1st team than to those on the 2nd or 3rd teams.
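As an illustration of the SMOTE idea, one option would be the themis step for recipes. This is an assumption about tooling rather than something used in the project, and `nba_train` / `AllNBA` remain placeholder names; SMOTE also requires numeric, non-missing predictors.

```r
# Sketch: oversample the minority class with SMOTE before refitting the models
library(recipes)
library(themis)

balanced_rec <- recipe(AllNBA ~ ., data = nba_train) |>
  step_smote(AllNBA, over_ratio = 1)   # synthesize minority cases up to a 1:1 ratio

nba_train_balanced <- bake(prep(balanced_rec), new_data = NULL)
table(nba_train_balanced$AllNBA)       # confirm the classes are now balanced
```

The balanced data frame could then be passed back into the same cross-validation setup, which would be expected to trade a little overall accuracy for better performance on the rare "selected" class.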