Basketball Data Part III – BMI: Does it Matter?

For those of you who are just joining us, please refer back to the previous two posts on scraping XML data and on length of NBA career by position. The next idea I wanted to explore was whether BMI has any effect on the length of NBA careers.

Originally, I had expected centers to have relatively short careers (based on the premise that ridiculous height/weight -> shorter careers). In the previous post, I found that centers have normal careers, even longer than forwards on average. So now I want to see if larger players in general have shorter careers. Do players with higher BMIs last fewer years in the NBA?

I begin by looking at the BMI distribution for all retired NBA players:

Image

The distribution appears to be fairly normal and the center is around 24. (Note: normal BMIs range between 18.5 and 25, so a fair number of these athletes were “overweight” or had huge muscles.)
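For anyone who wants the arithmetic: the code at the end of this post computes BMI in imperial units as weight in pounds divided by height in inches squared, times 703. Here is a minimal sketch of that calculation on three made-up players. The helper names `height_to_inches` and `bmi_imperial` are only for illustration and are not part of the original script.

```r
# Hypothetical helper: convert a "feet-inches" string such as "6-6" to inches.
height_to_inches <- function(height) {
  parts <- strsplit(height, "-", fixed = TRUE)
  sapply(parts, function(x) as.numeric(x[1]) * 12 + as.numeric(x[2]))
}

# Hypothetical helper: BMI in imperial units, weight (lb) / height (in)^2 * 703.
bmi_imperial <- function(weight_lb, height_in) {
  weight_lb / height_in^2 * 703
}

# Made-up players, for illustration only (not taken from the scraped data).
players <- data.frame(
  Height = c("6-0", "6-6", "7-0"),
  Weight = c(180, 216, 260),
  stringsAsFactors = FALSE
)

players$HeightInches <- height_to_inches(players$Height)
players$BMI <- bmi_imperial(players$Weight, players$HeightInches)
print(players)
```

All three of these invented players land in the mid-20s, which is roughly where the histogram above is centered.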

Next, I plotted BMI by position:

Image

As one would expect, centers have the highest BMI, followed by forwards, followed by guards. Dual position G-Fs have BMIs between guards and forwards (as expected), but dual position C-Fs average lower BMIs than centers or forwards.

Finally, I plotted career length by BMI:

Image

It doesn’t look like there is much relationship between BMI and career length. I ran a simple linear regression model and confirmed that BMI is not a statistically significant predictor of career length.

Image
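As a rough illustration of how to read significance off a fit like this, the sketch below runs the same kind of simple linear regression on simulated data in which career length is unrelated to BMI by construction. The data frame `sim` and every value in it are invented; only the `lm()` and `summary()` steps mirror what I ran on the real data.

```r
set.seed(1)

# Simulated data, for illustration only: LEN is drawn independently of BMI.
n <- 500
sim <- data.frame(BMI = rnorm(n, mean = 24, sd = 2))
sim$LEN <- rpois(n, lambda = 5)

# Simple linear regression of career length on BMI.
fit <- lm(LEN ~ BMI, data = sim)

# The coefficient table has one row per term; the BMI row holds the slope
# estimate and the p-value for the test that the slope is zero.
coefs <- summary(fit)$coefficients
bmi_slope <- coefs["BMI", "Estimate"]
bmi_p <- coefs["BMI", "Pr(>|t|)"]

print(coefs)
cat("BMI slope:", bmi_slope, " p-value:", bmi_p, "\n")
```

A large p-value on the BMI row is what "not statistically significant" refers to: the data give no strong reason to think the slope differs from zero.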

It does appear that outliers at both ends of the BMI distribution have longer careers. These sample sizes are quite small, but my theory is that these players were so exceptional that they made it to the NBA despite their unusual body types (too big or too small). Their high level of skill led to longer than average careers.

Image
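The BMI groups are built with `cut()`, and it is worth knowing how the bin edges behave. This small sketch uses the same breaks and labels as the code below on a handful of made-up BMI values (the vector `bmi` is invented for illustration). By default `cut()` closes each interval on the right, so a BMI of exactly 24 lands in "22-24" and not in "24-26".

```r
# Made-up BMI values, chosen to sit on and around the bin edges.
bmi <- c(17.5, 18, 19.2, 23.9, 24, 24.1, 29.9, 31.3, NA)

# Same breaks and labels as the BMI_GROUP definition in the post's code.
bmi_group <- cut(bmi,
  breaks = c(0, 18, 20, 22, 24, 26, 28, 30, 9999),
  labels = c("<=18", "18-20", "20-22", "22-24", "24-26", "26-28", "28-30", "30+")
)

# Show each value next to the group it was assigned to.
print(data.frame(bmi = bmi, group = bmi_group))

# Count per group, including players with a missing BMI.
print(table(bmi_group, useNA = "ifany"))
```

Missing BMIs stay `NA` after grouping, which is why the code drops them before drawing the boxplot.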

###### Settings
library(XML)
library(RColorBrewer)
col.9<-brewer.pal(9,"Blues")
setwd("C:/Blog/Basketball")
 
###### URLs
url<-paste0("http://www.basketball-reference.com/players/",letters,"/")
len<-length(url)
 
###### Reading data
tbl<-readHTMLTable(url[1])[[1]]
 
for (i in 2:len)
	{tbl<-rbind(tbl,readHTMLTable(url[i])[[1]])}
 
###### Formatting data
colnames(tbl)<-c("Name","StartYear","EndYear","Position","Height","Weight","BirthDate","College")
tbl$BirthDate<-as.Date(tbl$BirthDate,format="%B %d, %Y")
 
tbl$StartYear<-as.numeric(as.character(tbl$StartYear))
tbl$EndYear<-as.numeric(as.character(tbl$EndYear))
 
tbl$Position[tbl$Position=="F-C"]<-"C-F"
tbl$Position[tbl$Position=="F-G"]<-"G-F"
tbl$Position<-factor(tbl$Position,levels=c("C","G","F","C-F","G-F"))
 
###### Career Length
tbl$LEN<-tbl$EndYear-tbl$StartYear
 
table(tbl$Position)
boxplot(tbl$LEN~tbl$Position,col="light blue",ylab="Years",xlab="Position",
	main="Length of Career by Position")
 
###### Age at Retirement
# Retirement age = last season minus birth year (year extracted from the Date)
tbl$RetireAge<-tbl$EndYear-as.numeric(format(tbl$BirthDate,"%Y"))
 
boxplot(tbl$RetireAge~tbl$Position,col="light blue",ylab="Retirement Age",xlab="Position",
	main="Retirement Age by Position")
 
###### Removing Currently Active Players
retired<-tbl[tbl$EndYear<2014,]
 
# Re-draw both plots using only retired players
boxplot(retired$LEN~retired$Position,col="light blue",ylab="Years",xlab="Position",
	main="Length of Career by Position")

boxplot(retired$RetireAge~retired$Position,col="light blue",ylab="Retirement Age",xlab="Position",
	main="Retirement Age by Position")
 
###### BMI Calculation
retired$Height<-as.character(retired$Height)
retired$Weight<-as.numeric(as.character(retired$Weight))
retired$HeightInches<-sapply(strsplit(retired$Height,"-"),function(x) as.numeric(x[1])*12+as.numeric(x[2]))
retired$BMI<-(retired$Weight/(retired$HeightInches^2))*703
 
hist(retired$BMI,col=col.9[4],xlim=c(18,30),xlab="BMI",main="Histogram of Retired NBA Players' BMI")
 
par(mar=c(6,5,5,3))
boxplot(retired$BMI~retired$Position,col=col.9[5],yaxt="n",ylab="BMI (Body Mass Index)",xlab="Position",
	main="BMI by Position")
axis(2,at=seq(18,30,by=2),labels=seq(18,30,by=2))
axis(4,at=seq(18,30,by=2),labels=seq(18,30,by=2))
abline(h=seq(16,34,by=1),lty=3,col="lightgray")
 
model1<-lm(LEN~BMI,data=retired)
summary(model1)
 
retired$BMI_GROUP<-cut(retired$BMI,breaks=c(0,18,20,22,24,26,28,30,9999),
	labels=c("<=18","18-20","20-22","22-24","24-26","26-28","28-30","30+"))
 
# Removing Players without Weight Info
retired1<-retired[!is.na(retired$BMI),]
 
boxplot(retired1$LEN~retired1$BMI_GROUP,col=col.9[7],xlab="BMI Group",ylab="Career Length (yrs)",
	main="Career Length by BMI")
axis(4,at=seq(0,20,by=5),labels=seq(0,20,by=5))
table(retired1$BMI_GROUP)
 
retired1[retired1$BMI_GROUP %in% c("<=18","18-20","30+"),c("Name","StartYear","EndYear",
	"Position","LEN","Height","Weight","BMI")]


Data Scientist Salaries in SF

GlassdoorSalaries

Evidence from Glassdoor that the supply of data scientists in the Bay Area doesn’t meet the current demand. This number has risen rapidly over the past couple of years (since I started tracking it). I predict that it will continue to rise for the foreseeable future until a couple thousand college graduates in math/stat/CS are absorbed into the market.

US Population by Ethnicity Visualization

US Census 2011 (ACS) – choroplethr

As a statistician, I’ve always had a soft spot in my heart for the US Census. I love the rich data sets that are made publicly available and I’ve often experimented with visualizing the results. A couple of months ago, Ari Lamstein (a data scientist at Trulia) released the choroplethr package on CRAN (a repository for R packages). I pulled it up a couple of days ago and found it to be simple and intuitive. Only a couple of simple commands are required to build plots like this:

USPop

1) Go to http://www.census.gov/developers/tos/key_request.html to get an ACS API key.
2) Visit http://factfinder2.census.gov/faces/affhelp/jsf/pages/metadata.xhtml?lang=en&type=survey&id=survey.en.ACS_ACS to find the appropriate ACS table ID for the attribute that you’re looking to explore.
3) Open up R, install the choroplethr package, and define your API key using the api.key.install() command.
4) Explore away!

I started looking at the US population split by ethnicity.
USPopWhite

USPopBlack

USPopAsian

We can see very clearly the heavier concentrations of African-Americans in the Southeastern states, the Eastern seaboard and Southern California. Asian-American population centers are concentrated on the West Coast and the Northeast coast.

The R code is shown below:

###### Settings
library(choroplethr)
library(acs)
library(ggplot2)
 
###### API key
# Need to go to http://www.census.gov/developers/tos/key_request.html to set API key
api.key.install("###############")
 
###### Basic ACS Table IDs 
# B19301 = Per Capita Income
# B01003 = Population
 
###### Plotting
## Basic by State
choroplethr_acs(tableId="B19301",lod="state")
choroplethr_acs(tableId="B19301",lod="state",showLabels=FALSE)
choroplethr_acs(tableId="B19301",lod="state",showLabels=FALSE,num_buckets=9)
choroplethr_acs(tableId="B19301",lod="state",showLabels=FALSE,num_buckets=9)+labs(title="US 2011 Per Capita Income by State")
 
## Per Capita Income by County
choroplethr_acs(tableId="B19301",lod="county")
choroplethr_acs(tableId="B19301",lod="county",num_buckets=9,states=c("CA"))
 
## Population by County by Ethnicity
choroplethr_acs(tableId="B01003",lod="county")+labs(title="Total US Population by County (2011)")
choroplethr_acs(tableId="B02008",lod="county")+labs(title="US Population by County (2011) - White")
choroplethr_acs(tableId="B02009",lod="county")+labs(title="US Population by County (2011) - Black")
choroplethr_acs(tableId="B02011",lod="county")+labs(title="US Population by County (2011) - Asian")
choroplethr_acs(tableId="B03001",lod="county")+labs(title="US Population by County (2011) - Hispanic")