Correlation Does Not Imply Causation

All statisticians repeat this mantra on a daily basis at work (or just to themselves). Here is a funny site that gives examples where correlation clearly doesn’t imply causation.


To get more check out


Scraping XML Tables with R

A couple of my good friends also recently started a sports analytics blog. We’ve decided to collaborate on a couple of studies revolving around NBA data found at This will be the first part of that project!

Data scientists need data. The internet has lots of data. How can I get that data into R? Scrape it!

People have been scraping websites for as long as there have been websites. It’s gotten pretty easy using R/Python/whatever other tool you want to use. This post shows how to use R to scrape the demographic information for all NBA and ABA players listed at

Here’s the code:

###### Settings
###### URLs
###### Reading data
for (i in 2:len)
###### Formatting data
# Convert every birth date string (e.g. "January 5, 1980") to a Date
tbl$BirthDate <- as.Date(tbl$BirthDate, format = "%B %d, %Y")


And here’s the result: [Figure: Result]
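
For anyone who wants something self-contained to experiment with, here is a minimal sketch of the same read-then-format idea. It assumes the XML package (htmlParse() and readHTMLTable()) is installed. The two inline HTML snippets, the player names, the dates, and the read_page() helper are all made up for illustration; they are not the code or data from this post.

```r
library(XML)  # provides htmlParse() and readHTMLTable()

# Two tiny inline "pages" stand in for the real player-index pages.
# All names and values here are invented for illustration.
pages <- c(
  "<table><tr><th>Player</th><th>BirthDate</th></tr>
   <tr><td>Player A</td><td>June 14, 1970</td></tr></table>",
  "<table><tr><th>Player</th><th>BirthDate</th></tr>
   <tr><td>Player B</td><td>March 2, 1985</td></tr></table>"
)

# Hypothetical helper: parse the first table on a page into a data frame
# whose columns are plain character vectors.
read_page <- function(html) {
  doc <- htmlParse(html, asText = TRUE)
  readHTMLTable(doc, which = 1, stringsAsFactors = FALSE)
}

# Read every page and stack the results into one table.
tbl <- do.call(rbind, lapply(pages, read_page))

# Convert the whole column to Date.
# %B matches full month names and assumes an English locale.
tbl$BirthDate <- as.Date(tbl$BirthDate, format = "%B %d, %Y")
```

With real pages, the strings in pages would be replaced by downloaded page content, and the column names would follow whatever the site's table uses.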


Data Scientist Salaries in SF


Glassdoor offers evidence that the supply of data scientists in the Bay Area doesn’t meet the current demand. The salary figure has risen rapidly over the past couple of years (since I started tracking it). I predict that it will continue to rise for the foreseeable future, until a couple thousand college graduates in math/stat/CS are absorbed into the market.

US Population by Ethnicity Visualization

US Census 2011 (ACS) – choroplethr

As a statistician, I’ve always had a soft spot in my heart for the US Census. I love the rich data sets that are made publicly available and I’ve often experimented with visualizing the results. A couple of months ago, Ari Lamstein (a data scientist at Trulia) released the choroplethr package on CRAN (a repository for R packages). I pulled it up a couple of days ago and found it to be simple and intuitive. Only a couple of simple commands are required to build plots like this: [Figure: USPop]

1) Go to to get an ACS API key.
2) Visit to find the appropriate ACS table ID for the attribute that you’re looking to explore.
3) Open up R, install the choroplethr package, and define your API key using the api.key.install() command.
4) Explore away!

I started looking at the US population split by ethnicity.



We can see very clearly the heavier concentrations of African-Americans in the Southeastern states, along the Eastern seaboard, and in Southern California. Asian-American population centers are concentrated on the West Coast and the Northeast coast.

The R code is shown below:

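###### Settings
library(choroplethr) # choroplethr_acs()
library(ggplot2)     # labs()
library(acs)         # api.key.install()
###### API key
# Need to go to to set API key
api.key.install(key="YOUR_API_KEY") # placeholder: substitute your own key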
###### Basic ACS Table IDs 
# B19301 = Per Capita Income
# B01003 = Population
###### Plotting
## Basic by State
choroplethr_acs(tableId="B19301",lod="state",showLabels=FALSE,num_buckets=9)+labs(title="US 2011 Per Capita Income by State")
## Per Capita Income by County
## Population by County by Ethnicity
choroplethr_acs(tableId="B01003",lod="county")+labs(title="Total US Population by County (2011)")
choroplethr_acs(tableId="B02008",lod="county")+labs(title="US Population by County (2011) - White")
choroplethr_acs(tableId="B02009",lod="county")+labs(title="US Population by County (2011) - Black")
choroplethr_acs(tableId="B02011",lod="county")+labs(title="US Population by County (2011) - Asian")
choroplethr_acs(tableId="B03001",lod="county")+labs(title="US Population by County (2011) - Hispanic")


Data Scientist/Statistician Job Market

I returned to the US from Shanghai almost a year ago and I’ve found the data scientist job markets in the two countries to be quite different. In general, the employment atmosphere for qualified data workers in the US is much friendlier than the atmosphere in China. The “Big Data” wave has hit the US (particularly the Bay Area) hard, and demand for people who know how to pull/extract/transform/analyze/visualize data has skyrocketed. China’s economy is substantially more focused on production, industrial productivity, and quality control and, consequently, the demand for data scientists is lower.

The Bay Area specifically is currently suffering from an imbalance between data scientist positions and qualified workers to fill these openings. Recruiters and friends alike contact me (and likely every other data scientist in the Bay Area) almost every day with data positions at tech companies large and small. I’m quite happy at my current job, so I’ve been passing these leads along to qualified friends graduating from school, but I also wanted to share a couple of resources that I think might be useful for budding data scientists looking for work:

1) LinkedIn – this seems to be recruiters’ primary platform. Everybody already knows this…
2) Burtch Works – these recruiters focus on the analytics market. Most positions here seem to be located in the Midwest. They are a little too SAS-oriented/marketing-oriented for my taste, but it is a valuable resource nonetheless.
3) Analytic Recruiting – these recruiters seem to have a wide geographic range. There is a dedicated section for Wall St. quants if that’s your thing.
4) DataJobs – a relative newcomer on the scene. I don’t know much about them, but there is a large variety of data science listings.
5) R Jobs – this listing site just started out of R Bloggers and only has listings focused on R.
6) Friends – I think this is always the best way.