Scraping XML Tables with R

A couple of my good friends also recently started a sports analytics blog. We’ve decided to collaborate on a couple of studies revolving around NBA data found at www.basketball-reference.com. This will be the first part of that project!

Data scientists need data. The internet has lots of data. How can I get that data into R? Scrape it!

People have been scraping websites for as long as there have been websites. It’s gotten pretty easy using R/Python/whatever other tool you want to use. This post shows how to use R to scrape the demographic information for all NBA and ABA players listed at www.basketball-reference.com.

Here’s the code:

###### Settings
library(XML)
 
###### URLs
url<-paste0("http://www.basketball-reference.com/players/",letters,"/")
len<-length(url)
 
###### Reading data
tbl<-readHTMLTable(url[1])[[1]]
 
for (i in 2:len)
	{tbl<-rbind(tbl,readHTMLTable(url[i])[[1]])}
 
###### Formatting data
colnames(tbl)<-c("Name","StartYear","EndYear","Position","Height","Weight","BirthDate","College")
tbl$BirthDate<-as.Date(tbl$BirthDate,format="%B %d, %Y")
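
A quick aside on the date step, since it is easy to get subtly wrong. The sketch below uses invented sample strings (not real scraped values) in the same "Month day, Year" layout as the BirthDate column, and it assumes an English locale so that %B matches English month names. It only shows that as.Date works on the whole vector at once and returns NA for anything it cannot parse.

```r
# Illustrative values only; the real column comes from the scraped table
birth <- c("January 5, 1990", "June 24, 1968", "")

# as.Date is vectorized: one parsed date per input string
parsed <- as.Date(birth, format = "%B %d, %Y")

# Entries that do not match the format (here the empty string) become NA
print(parsed)
print(sum(is.na(parsed)))
```

Passing the whole column rather than a single element is what keeps each player's own birth date.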


And here’s the result: [image: Result]

 

10 thoughts on “Scraping XML Tables with R”

  1. Pingback: Basketball Data Part II – Length of Career by Position | Analyst At Large

  2. Pingback: Basketball Data Part II – Length of Career by Position ← Patient 2 Earn

  3. The XML package in R is capable of ripping lots of different kinds of data off the web. Tables are particularly easy to pull (and useful for data analysis), but text can be ripped as well and then parsed and analyzed. It might make for an interesting future project!

  4. Pingback: Basketball Data Part III – BMI: Does it Matter? | Analyst At Large

  5. Pingback: Basketball Data Part III – BMI: Does it Matter? ← Patient 2 Earn

  6. Pingback: Chanyong's Data Analysis » Basketball Data Part ( R-bloggers 따라 잡기)

  7. Unfortunately, I’m getting an error message. I’ve got the most up to date version of R (3.1.2) and am on Windows Vista. The error is:

    Error in readHTMLTable(url[i])[[1]] : subscript out of bounds

    Am I doing something wrong? I have the XML package loaded, are there more that I may need? Thanks, like your post.

    • Sorry, I copied and pasted the script in and the error occurs right after the

      for (i in 2:len)
      + {tbl<-rbind(tbl,readHTMLTable(url[i])[[1]])}

      Command.

  8. Mike,
    The error is occurring because the code is trying to scrape a page that doesn’t exist. There are no players with last names starting with “x”, so when the code gets to http://www.basketball-reference.com/players/x/ it will throw an error. If you’d like to run the code without any errors, here is an update:

    ###### Settings
    library(XML)

    ###### URLs
    letters1<-letters[-which(letters=="x")]
    url<-paste0("http://www.basketball-reference.com/players/",letters1,"/")
    len<-length(url)

    ###### Reading data
    tbl<-readHTMLTable(url[1])[[1]]

    for (i in 2:len)
    {tbl<-rbind(tbl,readHTMLTable(url[i])[[1]])}

    ###### Formatting data
    colnames(tbl)<-c("Name","StartYear","EndYear","Position","Height","Weight","BirthDate","College")
    tbl$BirthDate<-as.Date(tbl$BirthDate[1],format="%B %d, %Y")
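
Dropping "x" by hand works today, but it would break again if another letter's page ever went missing. Here is one possible alternative, offered as a rough sketch rather than the definitive fix: wrap each read in tryCatch, skip the pages that fail, and stack the rest in one step. The names scrape_players and fetch are made up for this example; the default fetch simply repeats the readHTMLTable(url)[[1]] call from the post.

```r
# Hypothetical helper (not part of the XML package).
# `fetch` is any function that takes a URL and returns a data frame;
# the default mirrors the readHTMLTable(url)[[1]] call used in the post.
scrape_players <- function(urls,
                           fetch = function(u) XML::readHTMLTable(u)[[1]]) {
  tables <- lapply(urls, function(u) {
    tryCatch(
      fetch(u),
      error = function(e) {
        # Report the page that failed and carry on with the rest
        message("Skipping ", u, ": ", conditionMessage(e))
        NULL
      }
    )
  })
  # Drop the pages that failed, then bind the remaining tables once
  tables <- Filter(Negate(is.null), tables)
  do.call(rbind, tables)
}
```

With the url vector built from all 26 letters, tbl <- scrape_players(url) would then take the place of the for loop. Collecting the tables in a list and calling rbind once also avoids growing tbl on every pass.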
