Friday, February 28, 2014

Friday Links - Feb 27, 2014

Math is still hard - Warning: it contains a curse word, if that kind of thing bothers you.

Model a continuous transformation of a discrete variable and Andrew Gelman's "favorite blog that nobody reads."

The ethics of data science and why some of us still prefer the term statistician. (Though, as previously discussed, I don't think there's a one-to-one mapping between the labels.)

Six types of Twitter conversations

Seven types of big data

"Statistics are the poetry of science." - An award for statistics

Strata 2014: focused on hardware at the expense of the analytics?

** Edit 3/3/2014 **
It was brought to my attention that Feb. 27, as this post is dated, is actually a Thursday. I did something similar last week too. Whoops. Well, if Max Tegmark is right, time is really just a variable in the gigantic mathematical object that is existence and our perception that time flows is an illusion. So, I guess this post always was and always will be... (Or maybe I'm reading too much pop-physics)

Wednesday, February 26, 2014

Machine-human interaction as a model for the future

I had some great conversations last night after Data Science DC's latest meet up. I admit that going to the bar after the talk is my favorite part because you meet so many smart people doing so many different things with data.

(Also, it feels weird to call an auditorium at GWU filled with 200+ data scientists listening to one of the leading statistical consultants in 3 of the last 4 presidential campaigns a "meet up." Also also, a longer post about last night's presentation is forthcoming. Also also also, three sentences in parentheses as an aside to this whole post; Strunk and White would be so disappointed.)

We got to talking about economic forecasting. I recall reading an IMF paper (that I cannot locate) that said something to the effect of "models don't make forecasts, economists do." That does not mean that economists should eschew models in favor of their gut. In fact, the broader point of the paper was about building a quantitative model for internally consistent forecasts.

The "economists make forecasts" message is actually an operationalization of a phenomena that Tyler Cowen writes about in his book Average is Over. Cowen cites the interaction between humans and various predictive algorithms in playing freestyle chess as an allegory for how the most productive sector of the labor market will be operating 50 years hence. He claims (and I choose not to verify) that the best chess algorithms routinely trounce the best chess players, but the best human-computer teams trounce everyone. And interestingly, the best humans for these teams are often mediocre chess players in their own right, but they understand the advantages and limitations of both the algorithms and themselves.

I'm inclined to agree with Dr. Cowen; for the foreseeable future we (in data science and beyond) are best served by arming humans with our best computer models, but leaving the humans empowered to make the final judgement.

human + computer > computer > human

For now, at least, the above generally holds. But who knows what will ultimately happen. After all, if economic forecasting has taught me anything, predicting the future is hard (like math).

Monday, February 24, 2014

Tell a story you believe and can justify.

Though this post is a bit of a follow-up to a previous post, I was guilted (er, motivated) into writing it by a tweet from Roger Peng of JHU and Simply Statistics.

For context, he is referring to something specific that has absolutely nothing to do with me or this blog. Nevertheless, after blogging a few rants about not being dogmatic (what not to do), it's time to be more constructive.

Tell a story you believe and can justify. I am no expert, but I believe that we humans are programmed to learn and digest information best when it's delivered as a story. In my opinion, this makes sense; like quantitative models, stories can condense a lot of information (data) into a few salient points (a model). Stories may also ascribe causality, giving the audience an ability to understand and perhaps shape similar events (as with a causal model). And if events are similar, perhaps the audience will know what's coming (as with a predictive model).

However stories, like models, have their pitfalls and can be misleading, whether through mistakes or malice. Tyler Cowen argues (rightly, IMHO) that we should be cautious when faced with stories. (Full video here.) Stories may be biased, based on partial information, give the illusion of certainty where there is little to none, conflate correlated events with causal ones, etc. These issues should be very familiar to us students of statistics and data science.

But how do you tell a story if you are suspicious of stories? This is where justifiability comes into play. If we, as professionals, are aware of the above issues and are ethical (i.e. we don't have an agenda beyond trying to be as objective as a human can be), then we must constrain our story so that a reasonable person who understands the issue won't immediately identify it as problematic. Easy, right?

Also, isn't this blog about statistics and such? Yes, I was just getting to that.

Most of my applied statistical background is in the social sciences and public policy. There, we are looking for the story, not a p-value or credible interval. The story may be a chart, a table, or literally a story (with all the fun statistics stuff in an appendix). In fact, too much information can muddy the waters, leaving the audience confused. But as I do my research, I try to keep one thing in mind: what if someone challenges our story on technical grounds? Can I justify what we've done?* Do I believe the story we're telling?

In practice, then, it's safe to put the chart with the ski-slope drop in the employment ratio and truncated axes up front as your bottom line, so long as you can produce other evidence that the drop you're showing is a historically big one. It's safe to put your parsimonious five-variable regression out there as "the" model when you've got a dozen other "reasonable" models in an appendix backing up your choice (preferably fit on subsamples of the data, or recursively if it's a time series).

But if a longer time series or a transformation makes that drop relatively minor, or if every model you've tried tells a different story, or if the Bayesian approach smashes your frequentist approach (or the other way around), then your story isn't justifiable yet. Rather than get disheartened, ask yourself, "why are these things different when I'd expect them to be the same?" Because then you might really be on to something interesting.

*Note that "justify" doesn't necessarily mean "win" in response to a challenge. I mean it as not doing something that obviously should have been done or doing something that obviously should not have been done.

Friday, February 21, 2014

I can haz buzzwords?

(Update 11/20/2014 - Sentences were updated to be more technically accurate with respect to bias and consistency.)

Catty title aside, this post takes a good swing at defining terms we hear thrown around about data these days, and it mostly does a good job.

I particularly like the definitions for big data (it's mostly about your tools) and data science (the whole enchilada).

But I vehemently disagree with its characterization of statistics as the discipline for small data. Ha.

One powerful method in statistics involves taking things to infinity and seeing what happens. (Yes, I'm talking about theoretical statistics now.) But these asymptotic properties and distributions are how statistics can quantify the uncertainty around a small dataset. We know what it should look like; we can compare it to what we see. And because such properties were derived at infinity, they should match your large dataset even better, so long as it's representative of the population of interest.*

But that last bit is key. When dealing with big data, experimental design really matters. If you collected data poorly (given your question), you'll be extremely confident in a wrong result. Bias doesn't go away by adding more observations; a biased sample doesn't become unbiased just because it's bigger.
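A quick simulation of that point (the data and the selection rule are invented for illustration): draw from a standard normal with mean 0, but only "observe" values above a cutoff, as if the collection process systematically missed part of the population. The confidence interval shrinks as n grows, but around the wrong number.

```python
import math
import random
import statistics

random.seed(42)
population_mean = 0.0  # true mean of the standard normal population

def biased_sample(n):
    """Collect n observations, but silently drop anything below -0.5."""
    out = []
    while len(out) < n:
        v = random.gauss(0, 1)
        if v > -0.5:
            out.append(v)
    return out

for n in (100, 10_000):
    s = biased_sample(n)
    m = statistics.fmean(s)
    half = 1.96 * statistics.stdev(s) / math.sqrt(n)  # ~95% CI half-width
    print(f"n={n:6d}: estimate {m:.3f} +/- {half:.3f} (true mean {population_mean})")
```

With the bigger sample, the interval is razor thin and confidently excludes the true mean entirely: uber confident, and wrong.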

And if you're using an estimator on "big" data, I hope you know its asymptotic properties, because you're operating in the asymptote. An inconsistent estimator is really bad when you have millions of observations.**

The author of the post does have statistics nailed for its culture of conservatism, however. I often say that my job is to "be the wet blanket in the room." Statisticians will keep you honest, even if it leaves you frustrated. But if you really need to pull the trigger on something fast, grab a computer scientist. They get the job done, and I tip my hat to them for it.

However, one of the keys to success, I think, is knowing when you're better served by the fast and workable solution or the slow but confident solution. That depends on context and the wrong choice can be embarrassing at best. C'est la vie.

* Caveats go with this sentence.

** One way to look at inconsistency is as asymptotic bias. If your estimator is inconsistent, then your estimates converge to the wrong answer as your sample gets larger.
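A toy illustration of that footnote (my own contrived example, not from the post): use the sample median to estimate the mean of a skewed exponential distribution. The estimates converge beautifully as n grows, just not to the mean; more data only makes you more certain of the wrong number.

```python
import random
import statistics

random.seed(1)
true_mean = 1.0  # exponential distribution with rate 1 has mean 1

# The median of an exponential(1) converges to ln(2) ~= 0.693, not to 1.0,
# so "median as a mean estimator" is inconsistent here.
for n in (100, 100_000):
    draws = [random.expovariate(1.0) for _ in range(n)]
    print(f"n={n:7d}: median-as-mean estimate = {statistics.median(draws):.3f}")
print(f"target (true mean): {true_mean}")
```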

*** Update for clarity: Win-victor is a blog I like very much, lest anyone think otherwise. Many of us in this thing we call data science come from different disciplines with different perspectives and limited understanding of other disciplines. (I knowingly open myself to criticism and admit my own fallibility in saying this.) As data science evolves (perhaps into its own independent discipline) we'll all benefit from cross-disciplinary exchanges.

Thursday, February 20, 2014

Friday Links

I'm hoping (planning, even) to make this a regular occurrence. Maybe every Friday?

Below is a list of links that I found interesting or helpful this week but that haven't found their way into a post.

The man who invented modern probability (Andrei Kolmogorov)

Whole bunch of math primers - Because who doesn't want to know more about Kolmogorov complexity after that last link?

The emergence of the chief data officer

Word Clouds in R - I gather word clouds can be controversial. But I'm starting to think they're a good way to represent "topics" in topic modeling as they convey the order of the top "n" words (the current standard) and the relative weights of the words.

Lots of Unix Colors - Takes me back to my days making charts at the Fed

Color Brewer - I wish this had been around when I was making charts at the Fed

Tuesday, February 18, 2014

Did I miss something?

Jeff Leek at Simply Statistics has a post, "On the scalability of statistical procedures: why the p-value bashers just don't get it," following last week's Nature op-ed by Regina Nuzzo.

I find the initial tone (especially the title) of Jeff Leek's post confusing. My reading of Regina Nuzzo's article wasn't that p-values are "bad," simply misused. And this seems to be the case that Jeff makes as his post continues; he even has a great section on the pros and cons of several alternatives to the p-value.

I have to admit that I'm new to this whole debate. But it seems to me that Regina and Jeff differ only in that they affiliate with different sides. Both argue that misapplied p-values can be deceiving. Both argue that there is no one-size-fits-all solution.

I am confused.

That said, I'll re-quote Steven Goodman. (The title of my first post on this subject quoted Goodman's quote in Regina Nuzzo's article, so meta.)

"The numbers are where the scientific discussion should start, not end."

It seems banal to say this (again and again), but... Don't be dogmatic; there is no final answer and there never will be. You have to think critically about your analysis and interpret within a wider context.

And while Regina Nuzzo and Jeff Leek approached the topic from different angles and raised different points, I got the same message from both: Don't be dogmatic.

** Edit 2/19/2014 **
After a brief Twitter conversation, Jeff Leek and Regina Nuzzo have confirmed that they do agree about a lot. This is a relief (for me and probably for all of us) because both pieces raise many good points about how determining statistical "significance" is not straightforward.

Monday, February 17, 2014

Good data analysis is being good at math. Great data analysis is being creative.

(But you still have to be good at math.)

I was clicking around Simply Statistics today when I came across a quote from a post by Jeff Leek that is a great analogy for how I think about statistical analysis. (And my apologies to Jeff Leek, because I'm probably taking my interpretation further than he intended.)

"But it also means we need to adapt our thinking about what it means to teach and perform statistics. We need to focus increasingly on interpretation and critique and away from formulas and memorization (think English composition versus grammar)."

The emphasis is mine. I think that most would agree that knowing lots about grammar, conjugation, and how to string words together into sentences are necessary but insufficient to being a good writer. And while there are specific skills to make something you write more readable, great writing involves creativity, thinking outside the box, or at least thinking like someone sitting in a different box.

I think statistics can be similar, at least in the social sciences (where most of my experience lies). Time and again, it's a little creative stroke that moves the analysis from "technically executed" to actually speaking to the phenomena I'm studying. Oftentimes, that creative stroke comes from having seen a problem or an approach in another discipline and/or context.

One may argue that, "we can teach you how to write, but we can't teach you to be a great writer." And so it goes with statistics. But I believe the key is breadth as well as depth. So, throw some history, political science, sociology, economics, etc. alongside your pile of math and comp. sci. books. At the least, the writing's probably better.

Wednesday, February 12, 2014

“The numbers are where the scientific discussion should start, not end.”

An editorial posted today in Nature, Scientific method: Statistical errors, summarizes a growing chorus of arguments against an overreliance on p-values in scientific research. Were the article only speaking to statisticians, I'd say it was preaching to the choir. However, most people who use statistics are not statisticians.

Also, over the course of my relatively short career, I've seen tiny p-values affixed to results I simply don't believe are real or robust, and larger p-values on results that I believe are both real and robust. I once received the comment, "This p-value is only 0.07. Why are we even talking about this?" The reason, of course, is that the p-value is only one measure of the validity of your results. At the end of the day, we have to think critically about all the evidence in front of us, which can include non-statistical inputs like past experience, intuition, or anecdotes.

To be clear, I am not saying that we can safely disregard statistical evidence in favor of gut feeling or whatever bias you may bring with you. Rather, critical thinking is required across the board. If we skewer the p-value without learning the real lesson, we'll find some other "gold standard" dogma of statistical significance later down the road and be back where we started.

Some of the quotes I enjoyed (in addition to the title of this post):
The irony is that when UK statistician Ronald Fisher introduced the P value in the 1920s, he did not mean it to be a definitive test. He intended it simply as an informal way to judge whether evidence was significant in the old-fashioned sense: worthy of a second look.
Others argue for a more ecumenical approach, encouraging researchers to try multiple methods on the same data set. [...] If the various methods come up with different answers, he says, 'that's a suggestion to be more creative and try to find out why', which should lead to a better understanding of the underlying reality.
But significance is no indicator of practical relevance, he says: "We should be asking, 'How much of an effect is there?', not 'Is there an effect?'"
And data miners/data scientists, take note of the second quote. Approaching the same problem from different directions and with different methods should yield similar or mutually supporting results. If not, why not? That may be an even more interesting question than the one you set out to answer.
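To make the "How much of an effect is there?" point concrete, here's a quick simulation (the numbers are invented for illustration, and the normal approximation via a hand-rolled z-test is just for self-containment): with a big enough sample, even a practically negligible difference in means earns a minuscule p-value.

```python
import math
import random
import statistics

random.seed(0)

def two_sample_p(a, b):
    """Two-sided z-test p-value for a difference in means (large samples)."""
    se = math.sqrt(statistics.variance(a) / len(a) + statistics.variance(b) / len(b))
    z = (statistics.fmean(a) - statistics.fmean(b)) / se
    # Normal CDF via erf: Phi(z) = 0.5 * (1 + erf(z / sqrt(2)))
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

# Two groups whose true means differ by a practically negligible 0.02 sd.
a = [random.gauss(0.00, 1) for _ in range(200_000)]
b = [random.gauss(0.02, 1) for _ in range(200_000)]

print(f"difference in means: {statistics.fmean(b) - statistics.fmean(a):.3f}")
print(f"p-value: {two_sample_p(a, b):.2e}")
```

The difference is "significant" by any conventional threshold, yet almost certainly irrelevant in practice, which is exactly why the numbers should start the discussion, not end it.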

Grad school

This is how I felt every day in grad school for a master's degree. I am sure that a PhD would only lead to more of the same.

Also, I realize that Chegg has succeeded in getting me to share their advertisement. Good job, guys. Also, math is hard.

Saturday, February 8, 2014

More on context and data visualization

Shortly after my earlier rant about not having some arbitrary requirement for axis ranges in data plots, I was dismayed to come across this Wikipedia entry, Don't draw misleading graphs,* which includes the following:
"The most commonly seen "sensationalization" of graphs in the popular media is probably when the graph is drawn with the vertical axis starting not at 0, but somewhere just below the low point in the data being graphed."
After I got over my initial reaction, I sat down to write another rant (er, blog post) on the subject: this one! However, an interesting link serendipitously crossed my Twitter feed. As often happens, someone beat me to the punch and said it better than I would have.

Mushon Zer-Aviv has a post: Disinformation Visualization: How to lie with datavis. He uses the attention-getting (i.e. controversial) topic of public opinion on abortion in the United States to illustrate his point.

I won't take away his thunder by summarizing the post here. (Click through yourself; it's a good read.) But I will point out that he covers distortions of both the x-axis and y-axis but does not feel compelled to include zero. Why?

It's the economy context, stupid

It is true that one can abuse the axes of their charts to tell a story that is very misleading. Professional ethics dictate that a data expert should be clear and honest, rather than clear or honest. But the fact remains, context gives the best guidance on how to represent your data. If your data have no reasonable expectation of obtaining a value, why include it in your chart? There is no escaping the fact that you have to think critically; a one-size-fits-all solution does not exist.
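One way to quantify that kind of axis abuse is Tufte's "lie factor": the size of the effect shown in the graphic divided by the size of the effect in the data. A back-of-the-envelope sketch, with invented employment-ratio numbers, for a bar chart whose bars start at the axis minimum:

```python
def lie_factor(v1, v2, axis_min):
    """Tufte's lie factor for two bars drawn from axis_min up to v1 and v2."""
    data_effect = (v2 - v1) / v1                      # relative change in the data
    graphic_effect = (v2 - v1) / (v1 - axis_min)      # relative change in bar heights
    return graphic_effect / data_effect

# Hypothetical: an employment-to-population ratio falling from 63% to 58.5%.
print(f"axis from 0:  lie factor {lie_factor(63.0, 58.5, 0.0):.1f}")   # faithful
print(f"axis from 55: lie factor {lie_factor(63.0, 58.5, 55.0):.1f}")  # exaggerated
```

A lie factor of 1 means the graphic shows the change at its true relative size; starting the axis at 55 inflates the visual drop severalfold. Whether that inflation is "misleading" or "appropriately zoomed" is, again, a question of context, not a rule about zero.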

On that note, I highly recommend this (very short) book.

* To be fair, this is actually a Wikipedia essay. And Wikipedia does give the disclaimer, "essays may represent widespread norms or minority viewpoints. Consider these views with discretion."

Tuesday, February 4, 2014

Data Science Central has 9 categories of data scientists

(The article is somewhat puzzlingly called "Six categories of Data Scientist.")

Those data scientists strong in statistics are likely to
"develop new statistical theories for big data, that even traditional statisticians are not aware of. They are expert in statistical modeling, experimental design, sampling, clustering, data reduction, confidence intervals, testing, modeling, predictive modeling and other related techniques."
A quote that particularly jumped out at me:
"Just like there are a few categories of statisticians (biostatisticians, statisticians, econometricians, operations research specialists, actuaries) or business analysts (marketing-oriented, product-oriented, finance-oriented, etc.) we have different categories of data scientists."
I was once ranting to a CS colleague of mine: "data science is a thing, but 'data scientist' is not," the justification being that we are a collection of (in no particular order) software engineers, statisticians, mathematicians, economists, etc. He quickly pointed out that those who referred to themselves as computer scientists around the middle of the 20th century were likely scoffed at, as they were really collections of logicians, mathematicians, electrical engineers, and so on.

Another reminder to never assume things will stay the way they are.