Friday, May 30, 2014

Friday links: May 30, 2014

Image via Simply Statistics

Explanation vs prediction as the goal of statistical models - H/T Majid alDosari@msdtechcode

Talking about uncertainty in science to lay audiences (possible repost)

Whole lotta slides about R and finance - H/T Revolutions

Big data is a social construct I

Big data is a social construct II - source of the image above. Note that (even using real, non-whiteboard data) we are on trend in terms of data size. We're above trend for data utilization.

What statistics teaches us about big data

Tuesday, May 27, 2014

Analogies from the past

BigData-Startups ponders the rise (or heretofore lack thereof) of the Chief Data Officer / Chief Analytics Officer:

A data and analytics aware culture is in most businesses not present. Why not? Because revenue and profits are still flowing in at the end of the month and there is no sense of urgency or importance. There is basically no fire to do things differently.

They go on to quote Jeff Jonas of the Wall St. Journal:

The biggest obstacle preventing companies from taking full advantage of their data is likely outdated information-sharing policies.

What's happening? Why would organizations fail to adopt successful strategies that are--to us data geeks, at least--self-evidently the right thing to do to achieve the organization's objectives?

Erik Brynjolfsson and Andrew McAfee offer a compelling analogy from history. When factories were powered by steam, a single power plant was placed in the center of the factory. That power plant turned an axle. Factories were built up so that as many machines as possible could sit close to that central axle. Machines were placed based on their power needs, regardless of the factory's workflow.

When electric power began to replace steam power, factory owners simply replaced the central steam power plant with a central electric power plant. They saved a few pennies on energy costs and then went about their business.

It took years, a new generation of managers and factory owners, and budgets to build new factories to truly reap the rewards of electric power. New factories were built flat, and every machine had its own small electric power plant. Machines were laid out according to the workflow of manufacturing a product. Huge efficiencies were gained in manufacturing. Profits went up while the price of manufactured goods went down. Society wins.

How does this apply today? We're still building our factories vertically. Even for those organizations and managers who have embraced a data-centric culture, it's not always obvious what the optimal data approach is. These things take time to work out, but the change is coming.

Friday, May 23, 2014

Friday links - May 23, 2014

An excellent overview of machine learning algorithms and techniques

More thoughts on statistics and data science (Personally, I think slides 35 onward hit the nail on the head.) H/T Data Science Weekly

Source of the above: "The term big data is going to disappear in the next 2 years. Statistics will be what remains." I've pondered this myself, though I am not sure I agree. (I am not sure I disagree either.)

Despite the title, another case of human + computer > computer

Ensemble methods in R: part 1, part 2, part 3

Repeated for the end of the week: "5 Monday reads in Robotics, Artificial Intelligence, and Economics"

Thursday, May 22, 2014

Late Career Moves to Analytics

I recently read this article: Planning a late career shift to Analytics / Big data? Better be prepared!

Kunal Jain gives a sober perspective on the issue that hit close to home for me:

Non technical experience will not count in your analytics jobs – the only benefit you might get is that the interviewer can expect you to be more mature with your thought process / decision.

I came to this painful realization several years back. As a former Marine and non-commissioned officer, I had 3 "direct reports" before turning 21 and had significant formal leadership training and experience. I knew that much of that wouldn't count toward getting my first job out of college at 27, but I expected to use those experiences to rise quickly in my career. Indeed, this is what several of my former military friends did. (One became a project manager 4 months after graduation, a feat that would normally have taken several years at best.)

This was my plan until I discovered economics--beautiful beautiful economics--at the end of my sophomore year. Suddenly, that past leadership experience counted for very little. Stata, SAS, and SQL did not care that I could turn objectives into plans and delegate. They cared only that my syntax was correct. My professors, and later employers, needed me to have solid foundations in calculus and probability. My public speaking skills would not matter if I couldn't produce analyses worth talking about. The learning curve was steep, especially since I failed algebra the first time, eked by with a D in geometry, and stopped taking high school math as soon as I could. Math is hard.

It took years of formal and self-guided education, coding, and real-world projects before my analytic capabilities were at a level where I could be trusted to design and lead analyses in the real world. It wasn't just a full-time job; it became a complete lifestyle. It has, however, paid off. Years of frustration and feeling as though I'd set myself back have given way to some of the most intellectually fulfilling work I've done. And at 32, I no longer feel as though becoming a Marine had "set back" my career. (Even if it had, I wouldn't have done anything differently.)

The learning curve can be steep. Those who would move into analytics later in their career should consider their motivation for doing so.

As Kunal points out:

Take this up only if you tick all the boxes below:
  • You are absolutely crazy about this industry. You can’t help but analyze any numbers you come across – I play with numbers on the number plate of any vehicle which passes me.
  • You have undergone a few courses on Coursera / eDX and have excelled at them. You have submitted all the assignments and have scored extremely well.
  • You have the perseverance and motivation to undergo 2 – 3 years of arduous work learning about a new knowledge intensive domain.
  • You are willing to spend a lot of time as Individual Contributor

Tuesday, May 20, 2014

More on deep learning

John Kaufhold posted to DC2's blog yesterday following up on his spectacular lecture on deep learning. No need to rehash others' material.

His talk can be viewed here. His post on DC2's blog is here.

Friday, May 16, 2014

Friday Links: May 16, 2014

Roboteconomics on Summers on Piketty

Another statistician's view on data science et al.

Data Visualization or Data Interaction?

Type I and Type II errors

So, when you see me with my notebook, know that I'm not a total Luddite. (But seriously, pen + paper > computer.)

Summers vs Taleb

Thursday, May 15, 2014

Follow up: Make the R go!

I posted this link to Hadley Wickham's e-book, Advanced R Programming, earlier.

Turns out the Data Community DC is hosting a workshop on that very topic.

For context: I have learned these things in the last year or so:
  1. Many operations are embarrassingly parallel.
  2. Some operations have to be performed in sequence; R does not do so well here. Proceed with your "for loop" with caution. (Though sometimes it's just fine, situation dictates.)
  3. The best way to address (2) is with some good old fashioned C, C++, or Fortran. ("I was coding in Fortran before it was cool." *Throws down latte, puts on Ray Bans, takes Macbook out of cafe*)
How could one go about learning to handle such things in R? I guess you could click here... or here. I won't judge. (I will, however, be at this training session.)
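
To make points (1) and (2) concrete, here is a minimal sketch. It is in Python rather than R, purely for illustration, and every name in it (`square`, `parallel_squares`, `running_balance`) is made up for this example. The first function handles work where each item is independent, so it can be farmed out to a pool of workers. The second needs the previous step's result at every step, so it has to run in order. A thread pool is used only to keep the sketch portable; for genuinely CPU-bound work a process pool would be the more usual choice.

```python
from concurrent.futures import ThreadPoolExecutor


def square(x):
    # Each call depends only on its own input, so calls can run in any order.
    return x * x


def parallel_squares(values, workers=4):
    # Embarrassingly parallel: independent tasks mapped across a worker pool.
    # pool.map returns results in the same order as the inputs.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(square, values))


def running_balance(deposits, rate=0.01):
    # Inherently sequential: each step needs the previous step's balance,
    # so this loop cannot simply be split across workers.
    balance = 0.0
    history = []
    for deposit in deposits:
        balance = balance * (1 + rate) + deposit
        history.append(balance)
    return history
```

The second kind of loop is where point (3) comes in: when the loop body is cheap but there are many iterations, pushing it down into a compiled language is the usual fix.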

Tuesday, May 13, 2014

Time series can be counterintuitive

Or maybe Nicolas Cage had better stop while he's ahead. Click through for more.

Thursday, May 8, 2014

Jeff Leek asks questions near and dear to my heart

In the latest post on Simply Statistics, Jeff Leek asks some good questions:

  1. Given the importance of statistical thinking why aren't statisticians involved in these initiatives?
  2. When thinking about the big data era, what are some statistical ideas we've already figured out?
I'd say that (1) is changing, if slowly. But (2) is a good message for non-statistical folks in the data science community. Statistics is a field that is both wide and deep. There are many pressing data science problems that have been addressed in some fashion by someone in the statistics community. In many cases, we don't need to reinvent the wheel.

One area that I see as being quite underdeveloped in data science is how it approaches time series data. For that, we should look to the econometricians as much as statisticians. (What's the difference between an econometrician and a statistician? About $15 K a year. Boom.) I am a fan of David Hendry's approach and I think the data science community would like it as well. He calls it "general to specific" modeling and I've seen a similar approach used to build machine-learning models.

Oh, but I'm off topic...

Anyway, the title of Leek's post, "Why big data is in trouble: they forgot about applied statistics," is a bit melodramatic. Big data isn't in trouble because big data isn't going anywhere. (By big data, I mean the concept of a data-driven world.) As I said in an earlier post:

It may be tempting to see [Google Flu] as justification that big data/data science is just media buzz. However, the technology that makes acquiring these data easy is here to stay. Reconciling statistical best-practices and big data is actively being discussed in the data science and big data communities.

I look forward to the day that "data science" applications become mainstream in the statistics community. Then we'll really be cookin' with Crisco!

Tuesday, May 6, 2014

Friday Links (from last Friday): May 2, 2014

I was traveling last Friday and just realized I hadn't set this up to automatically post while I was away. Whoops.

Big Data... Big Deal?

Forbes has a history of data science. (This is fantastic, by the way.)

Deep dive: R vs SAS - I was a respondent to this survey and was quoted in the flash release. “R. But isn’t the debate more between R and Python?”

Write your own R package - I plan to do just that. Currently I "source()" a handful of files full of functions that I use daily.

While we're on the topic, read Advanced R Development