Monday, March 31, 2014

The cost of studying Data Science

Simply Statistics has a ("non-comprehensive") breakdown of costs associated with various data science educational programs. The list includes MOOCs and traditional degree programs at the Masters and Doctoral levels. Check it out.

Full disclosure: The guys at Simply Statistics are teaching the JHU data science certificate course. There are many other programs that are ostensibly data science programs as well, whether they call themselves "data science" or otherwise. (They did say it was non-comprehensive, after all.)

Friday, March 28, 2014

Friday Links Mar 28, 2014

Set up your own VPN

Deep learning and NLP - This is a follow-on (for me) from Monday's DC2 deep learning lecture (which was excellent)

Clusters may be less separated in real life than in the classroom. - You don't say?

The Sunlight Foundation's If This Then That channel

A timeline of statistics

Jeff Leek's take on what my stat professors called the "principle of parsimony"

Krugman on Paul Pfleiderer on assumptions in modeling - This was written for economics, but in principle it's relevant to anyone doing quantitative modeling.

Inside the algorithms predicting the future!

A bit on Bayesian statistics

Some links on Thomas Piketty - (1, 2, 3) HT to Marginal Revolution for some of these. I have Piketty's book and am looking forward to reading it when things slow down. (Read: when I stop taking work home with me.)

Wednesday, March 26, 2014

Tyler Cowen on the Alleged Wage Suppression Scheme in Silicon Valley

There's been some consternation and indignation lately pointed towards Silicon Valley execs who may have conspired not to recruit from each other's firms.

Tyler Cowen weighs in:

"I would suggest caution in interpreting this event.  For one thing, we don’t know how effective this monopsonistic cartel turned out to be. [...] It is hard to find examples of persistently successful monopsonistic labor-buying cartels."

To me, it seems the way to think about this situation breaks down to two questions:

  1. Was the law broken, irrespective of the economics of the situation?
  2. Factoring in economics, should this be illegal?

The answers to 1 and 2 (and how they interact with each other) ought to drive one's thinking on the issue.

Update 5/23/2014 - looks like they settled.

Tuesday, March 25, 2014


I'm looking into upgrading my personal computing situation. I get to play with some fancy tech at work, but at home, I'm running Windows XP on a laptop that was mediocre when it was new... in 2008. My desktop was mediocre in 2006.

For surfing the web and watching Netflix, a tablet and my ghetto desktop hooked up to the TV are fine. But I occasionally do my own research outside of work. I've also got personal data on a handful of thumb drives, an external hard drive, and the last 3 laptops I've owned. I keep the latter in a box under the printer just in case I need something on them. This needs to change.

I am a fan of Linux, though I can't claim to be an expert. A colleague suggested making the change to Apple with a MacBook Air for that very reason. I was considering it for a while, but I just don't see the value for money. For $1,550 I can get 8 GB of RAM, 256 GB of flash storage, and a dual-core processor running at 1.7 GHz. Not really something to write home about considering the price. (For the record, I don't really care about graphics etc. I need to crunch data.)

I recently stumbled across System76, a maker of computers designed to run Linux (Ubuntu, specifically) as the primary OS. For $1,333 I can get 16 GB of RAM, 240 GB of flash storage, and a quad-core processor with hyperthreading (that's an additional 4 virtual cores) running at 2.0 GHz. I save $200 and get more power and memory? Ooh, baby. Sign me up.

I can spend that $200 on a 4 TB data store, though that would be a mechanical HD rather than flash storage.

I learned last night at DC2's event, A Short History of and Introduction to Deep Learning, that I can get a couple GTX 580 GPUs up and running for less than $600. Hamina hamina. It'll take me longer to teach myself deep learning than it'll take to buy the hardware to use it.

Friday, March 21, 2014

Friday Links Mar 21, 2014

IBM's Watson is looking into genetics and cancer

Get the right data scientists to ask the "wrong" questions - I've used similar methods myself to great success.

"Sloppy researchers beware. A new institute has you in its sights"

So you want to be a data scientist...

Talking about uncertainty in the scientific process

A Bayesian reading list

Economics and the foundations of artificial intelligence

Intelligence as skill vs innate quality (only loosely statistically-related)

Sparse updates the last couple of weeks. I've been busy at work and enjoying the (slow) return of spring here in the DC area in my time off. Hasn't left much time for blogging.

As I said in my last post, I've been reading The Second Machine Age. Stories about IBM's Watson and other learning machines got me thinking about how we view intelligence as a society. These days, I think, we tend to view intelligence as an innate quality, albeit one that's shaped by education.

I read an article (that I desperately tried to find so I could link to it here, sorry) stating that in ancient Greece, intelligence was viewed as a skill. One could look at the habits and practices of an intelligent person and emulate them to boost one's own intelligence.

Flash forward: I've met some really smart people over the years. They might wax poetic about intelligence in the abstract. However, if one were to ask them something concrete like, "How should I study for the upcoming midterm?" you'd get a pretty concrete answer about organizing information, tips and tricks for memorizing theorems, what to do the day of the test, etc. They'd be unlikely to say, "if you're smart, you'll do well."

I'd also add that when I was in the Marines, they had a similar view of preparation for battle. We called it the 7 P's. Prior Planning and Preparation Prevent Piss Poor Performance. Not much in there about being born a certain way.

The debate over intelligence being innate versus a skill rages on. I wonder what our journey to create "smart" machines will ultimately tell us about ourselves?

(For the record: when the singularity looks like it's getting close, I'm going to start carrying around a pocket full of magnets just in case...)

Monday, March 17, 2014

True statements

"The greatest shortcoming of the human race is our inability to understand the exponential function" - Albert A. Bartlett

I am reading The Second Machine Age by Erik Brynjolfsson and Andrew McAfee. Those who have already read it can guess what page I am on.

Monday, March 10, 2014

Will big data bring a return of sampling statistics? And a review of Aaron Strauss's talk at DSDC.

* Edit: 3/10/2014 - 2:45 PM: Added a sentence to the third paragraph of the section "In Practice: Political Polling in 2012 and Beyond" and changed the second section heading under "Some Background" from "Hand-wringing about surveys" to "Much ado about response rates".

Some Background

What is sampling statistics?

Sampling statistics concerns the planning, collection, and analysis of survey data. When most people take a statistics course, they are learning "model-based" statistics. (Model-based statistics is not the same as statistical modeling; stick with me here.) Model-based statistics uses a mathematical function to model the distribution of an infinitely-sized population in order to quantify uncertainty. Sampling statistics, by contrast, uses a priori knowledge of the size of the target population when quantifying uncertainty. The big lesson I learned after taking survey sampling is that if you assume the correct model, the two statistical philosophies agree. But if your assumed model is wrong, the two approaches give different results. (And one approach has fewer assumptions, bee tee dubs.)
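To make that distinction concrete, here is a minimal Python sketch (toy data, not from any real survey) comparing the variance of a sample mean under the model-based formula with the design-based formula, which uses the known population size via the finite population correction:

```python
import random

# Toy population of N = 1000 values (numbers invented for illustration)
random.seed(42)
population = [random.gauss(50, 10) for _ in range(1000)]
N = len(population)
n = 200
sample = random.sample(population, n)

mean = sum(sample) / n
s2 = sum((x - mean) ** 2 for x in sample) / (n - 1)  # sample variance

var_model = s2 / n                 # model-based: population treated as infinite
var_design = (1 - n / N) * s2 / n  # design-based: finite population correction

# Having sampled 20% of a finite population shrinks the design-based
# variance by that factor; as N grows, the two formulas converge.
print(var_model, var_design)
```

The point of the sketch is only the relationship between the two formulas: knowing you have sampled a fifth of the whole population buys you a 20% reduction in estimated variance for free.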

Sampling statistics also has a big bag of other tricks, too many to do justice here. But it provides frameworks for handling missing or biased data, combining data on subpopulations whose sample proportions differ from their proportions of the population, sampling when subpopulations have very different statistical characteristics, etc.

As I write this, it is entirely possible to earn a PhD in statistics and not take a single course in sampling or survey statistics. Many federal agencies hire statisticians and then send them immediately back to school to places like UMD's Joint Program in Survey Methodology. (The federal government conducts a LOT of surveys.)

I can't claim to be certain, but I think that sampling statistics became esoteric for two reasons. First, surveys (and data collection in general) have traditionally been expensive. Until recently, there weren't many organizations except for the government that had the budget to conduct surveys properly and regularly. (Obviously, there are exceptions.) Second, model-based statistics tend to work well and have broad applicability. You can do a lot with a laptop, a .csv file, and the right education. My guess is that these two factors have meant that the vast majority of statisticians and statistician-like researchers have become consumers of data sets, rather than producers. In an age of "big data" this seems to be changing, however.

Much ado about response rates

Response rates for surveys have been dropping for years, causing frustration among statisticians and skepticism from the public. Having a lower response rate doesn't just mean your confidence intervals get wider. Given the nature of many surveys, it's possible (if not likely) that the probability a person responds to the survey may be related to one or a combination of relevant variables. If unaddressed, such non-response can damage an analysis. Addressing the problem drives up the cost of a survey, however.

Consider measuring unemployment. A person is considered unemployed if they don't have a job and they are looking for one. Somebody who loses their job may be less likely to respond to the unemployment survey for a variety of reasons. They may be embarrassed, they may move back home, they may have lost their house! But if the government sends a survey or interviewer and doesn't hear back, how will it know if the respondent is employed, unemployed (and looking), or off the job market completely? So, they have to find out. Time spent tracking a respondent down is expensive!
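A quick simulation makes the danger plain. In this hypothetical sketch (all rates invented for illustration), unemployed people respond less often than employed people, so the naive estimate computed from respondents alone understates the true unemployment rate:

```python
import random

random.seed(0)
# 1 = unemployed, 0 = employed; true rate set to 10%
pop = [1 if random.random() < 0.10 else 0 for _ in range(100_000)]

def responds(unemployed):
    # Assumed response probabilities: 40% if unemployed, 70% if employed
    return random.random() < (0.40 if unemployed else 0.70)

respondents = [x for x in pop if responds(x)]

true_rate = sum(pop) / len(pop)
naive_est = sum(respondents) / len(respondents)
print(f"true rate {true_rate:.3f}, naive estimate {naive_est:.3f}")
```

Sampling statistics addresses this with non-response follow-up and weighting adjustments; the point here is only that ignoring differential non-response biases the estimate, in this case downward.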

So, if you are collecting data that requires a response, you must consider who isn't responding and why. Many people anecdotally chalk this effect up to survey fatigue. Aren't we all tired of being bombarded by websites and emails asking us for "just a couple minutes" of our time? (Businesses that send a satisfaction survey every time a customer contacts customer service take note; you may be your own worst data-collection enemy.)

In Practice: Political Polling in 2012 and Beyond

In the context of the above, Aaron Strauss's February 25th talk at DSDC was enlightening. Aaron's presentation was billed as covering "two things that people in [Washington D.C.] absolutely love. One of those things is political campaigns. The other thing is using data to estimate causal effects in subgroups of controlled experiments!" Woooooo! Controlled experiments! Causal effects! Subgroup analysis! Be still, my beating heart.

Aaron earned a PhD in political science from Princeton and has been involved in three of the last four presidential campaigns designing surveys, analyzing collected data, and providing actionable insights for the Democratic party. His blog is here. (For the record, I am strictly non-partisan and do not endorse anyone's politics though I will get in knife fights over statistical practices.)

In an hour-long presentation, Aaron laid a foundation for sampling and polling in the 21st century, revealing how political campaigns and businesses track our data, analyze it, and what the future of surveying may be. The most profound insight I got was to see how the traditional practices of sampling statistics were being blended with 21st century data collection methods, through apps and social media. Whether these changes will address the decline in response rates or only temporarily offset it remains to be seen.

Some highlights:

  • The number of households that have only wireless telephone service is reaching parity with the number having landline service. Among households with children (excluding older people with grown children and young adults without children), the figure sits at 45 percent.
  • Offering small savings on wireless bills may incentivize the taking of flash polls through smart phones.
  • Reducing the marginal cost of surveys allows political pollsters to design randomized controlled trials, to evaluate the efficacy of different campaign messages on voting outcomes. (As with all things statistics, there are tradeoffs and confounding variables with such approaches.)
  • Pollsters would love to get access to all of your Facebook data.

Sampling Statistics and "Big Data"

Today, businesses and other organizations are tracking people at unprecedented levels. One rationale for big data being a "revolution" is that, for the first time, organizations have access to the full population of interest. For example, Amazon can track the purchasing history of 100% of its customers.

I would challenge the above argument, but won't outright disagree with it. Your current customer base may or may not be your full population of interest. You may, for example, be interested in people who don't purchase your product. You may wish to analyze a sample of your market to figure out who isn't purchasing from you and why. You may have access to some data on the whole population, but you may not have all the variables you want.

More importantly, sampling statistics has tools that may allow organizations to design tracking schemes to gather the most relevant data to their questions of interest. To quote R.A. Fisher "To call in the statistician after the experiment is done may be no more than asking him to perform a post-mortem examination: He may be able to say what the experiment died of." The world (especially the social-science world) is not static; priorities and people's behavior are sure to change.

Data fusion, the process of pulling together data from heterogeneous sources into one analysis, is not a survey. But these sources may represent observations and variables in proportions or frequencies differing from the target population. Combining data from these sources with a simple merge may result in biased analyses. Sampling statistics has methods of using sample weights to combine strata of a stratified sample where some strata may be over- or under-sampled (and there are reasons to do this intentionally).
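As a minimal sketch of that weighting idea (stratum sizes and values are made up), compare a naive pooled mean against a design-based mean that weights each stratum by its population share rather than its sample size:

```python
strata = {
    # name: (population size, sampled values)
    "urban": (8000, [52.0, 55.0, 53.0, 54.0]),              # under-sampled
    "rural": (2000, [40.0, 42.0, 41.0, 43.0, 39.0, 41.0]),  # over-sampled
}
N = sum(size for size, _ in strata.values())

# Naive merge: pool all observations, ignoring the sampling design
pooled = [v for _, vals in strata.values() for v in vals]
naive = sum(pooled) / len(pooled)

# Design-based: weight each stratum mean by its population share
weighted = sum((size / N) * (sum(vals) / len(vals))
               for size, vals in strata.values())

print(naive, weighted)  # naive = 46.0, weighted = 51.0
```

The naive merge is pulled toward the over-sampled rural stratum; the weighted estimate restores each stratum to its true share of the population, which is exactly the adjustment a simple merge of fused data sources skips.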

I am not proposing that sampling statistics will become the new hottest thing. But I would not be surprised if sampling courses move from the esoteric fringes, to being a core course in many or most statistics graduate programs in the coming decades. (And we know it may take over a hundred years for something to become the new hotness anyway.)

The professor who taught the sampling statistics course I took a few years ago is the chief of the Statistical Research Division at the U.S. Census Bureau. When I last saw him at an alumni/prospective student mixer for Georgetown's math/stat program in 2013, he was wearing a button that said "ask me about big data." In a time when some consider statistics an old-school discipline relevant only to small data, seeing this button on a man whose field is considered so old school that even most statisticians have moved on made me chuckle. But it also made me think: things may be coming full circle for sampling statistics.

Links for further reading

A statistician's role in big data (my source for the R.A. Fisher quote, above)

Wednesday, March 5, 2014

The new hotness is an 18th century theorem

From Robot Economics:
"The ‘system’ behind the Google robotic cars that have driven themselves for hundreds of thousands of miles on the streets of several US states without being involved in an accident, or violating any traffic law, whilst analyzing enormous quantities of data fed to a central onboard computer from radar sensors, cameras and laser-range finders and taking the most optimal, efficient and cost effective route, is built upon the 18th-century math theorem known as Bayes’ rule."
Take that, machine learning! Statistics! But it would never have been possible without the computer scientists.
" Before the advent of increased computer power Bayes Theorem was overlooked by most statisticians, scientists and in most industries. Today, thanks to Professor Pearl, Bayes Theorem is used in robotics, artificial intelligence, machine learning, reinforcement learning and big data mining.  IBM’s Watson, perhaps the most well known AI system, in all its intricacies, ultimately relies on the deceivingly simple concept of Bayes’ Rule in negotiating the semantic complexities of natural language."
The article is a quick and interesting read for us math/stat/data science geeks. Check it out.