<h2>Biased Estimates</h2>
My name is Tommy; I am a statistician, mathematician, or data scientist—depending on the problem or the audience—in Washington DC.
I am a graduate of Georgetown's MS program in mathematics and statistics. I am a PhD student in George Mason's Department of Computational and Data Sciences.
Opinions expressed are my own.
You can follow me on Twitter @thos_jones.
<h3>textmineR v3.0 is here (2018-11-16)</h3>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh7cAc9xQ4KjL105irGri2vMdY2KzPN7ng3wSbRWQc4b_kRaEO3p2juJnnTOEIdbDyID0XFqSWmaH8ofihIm77t39VflDfTPqD6xoCtaGeXr2VR0gmX3EDjK26n0QVUOxQk1677H_BNjCQ/s1600/textmineR6.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="900" data-original-width="796" height="320" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh7cAc9xQ4KjL105irGri2vMdY2KzPN7ng3wSbRWQc4b_kRaEO3p2juJnnTOEIdbDyID0XFqSWmaH8ofihIm77t39VflDfTPqD6xoCtaGeXr2VR0gmX3EDjK26n0QVUOxQk1677H_BNjCQ/s320/textmineR6.png" width="283" /></a></div>
<br />
<br />
textmineR version 3 (and up!) is here. This represents a major overhaul. The two most substantive changes are a native implementation of LDA and a more object-oriented take on topic models. The former allows for more flexibility in setting priors and a better Bayesian treatment of model fitting (e.g. averaging over the chain after a pre-determined burn-in period). The latter enables a predict method for models, giving textmineR's topic models syntax similar to more traditional models in R. A longer list of changes is below, followed by a short sketch of the new workflow.<br />
<br />
<ul style="box-sizing: border-box; color: #24292e; font-family: -apple-system, system-ui, "Segoe UI", Helvetica, Arial, sans-serif, "Apple Color Emoji", "Segoe UI Emoji", "Segoe UI Symbol"; font-size: 16px; margin-bottom: 16px; margin-top: 0px; padding-left: 2em;">
<li style="box-sizing: border-box;"><div style="box-sizing: border-box; margin-bottom: 16px; margin-top: 16px;">
Several functions that were slated for deletion in version 2.1.3 are now gone.</div>
<ul style="box-sizing: border-box; margin-bottom: 0px; margin-top: 0px; padding-left: 2em;">
<li style="box-sizing: border-box;">RecursiveRbind</li>
<li style="box-sizing: border-box; margin-top: 0.25em;">Vec2Dtm</li>
<li style="box-sizing: border-box; margin-top: 0.25em;">JSD</li>
<li style="box-sizing: border-box; margin-top: 0.25em;">HellDist</li>
<li style="box-sizing: border-box; margin-top: 0.25em;">GetPhiPrime</li>
<li style="box-sizing: border-box; margin-top: 0.25em;">FormatRawLdaOutput</li>
<li style="box-sizing: border-box; margin-top: 0.25em;">Files2Vec</li>
<li style="box-sizing: border-box; margin-top: 0.25em;">DepluralizeDtm</li>
<li style="box-sizing: border-box; margin-top: 0.25em;">CorrectS</li>
<li style="box-sizing: border-box; margin-top: 0.25em;">CalcPhiPrime</li>
</ul>
</li>
<li style="box-sizing: border-box; margin-top: 0.25em;"><div style="box-sizing: border-box; margin-bottom: 16px; margin-top: 16px;">
FitLdaModel has changed significantly.</div>
<ul style="box-sizing: border-box; margin-bottom: 0px; margin-top: 0px; padding-left: 2em;">
<li style="box-sizing: border-box;">Now only Gibbs sampling is a supported training method. The Gibbs sampler is no longer wrapping lda::lda_collapsed_gibbs_sampler. It is now native to textmineR. It's a little slower, but has additional features.</li>
<li style="box-sizing: border-box; margin-top: 0.25em;">Asymmetric priors are supported for both alpha and beta.</li>
<li style="box-sizing: border-box; margin-top: 0.25em;">There is an option, optimize_alpha, which updates alpha every 10 iterations based on the value of theta at the current iteration.</li>
<li style="box-sizing: border-box; margin-top: 0.25em;">The log likelihood of the data given estimates of phi and theta is optionally calculated every 10 iterations.</li>
<li style="box-sizing: border-box; margin-top: 0.25em;">Probabilistic coherence is optionally calculated at the time of model fit.</li>
<li style="box-sizing: border-box; margin-top: 0.25em;">R-squared is optionally calculated at the time of model fit.</li>
</ul>
</li>
<li style="box-sizing: border-box; margin-top: 0.25em;"><div style="box-sizing: border-box; margin-bottom: 16px; margin-top: 16px;">
Supported topic models (LDA, LSA, CTM) are now object-oriented, creating their own S3 classes. These classes have their own predict methods, meaning you do not have to do your own math to make predictions for new documents.</div>
</li>
<li style="box-sizing: border-box; margin-top: 0.25em;"><div style="box-sizing: border-box; margin-bottom: 16px; margin-top: 16px;">
A new function SummarizeTopics has been added.</div>
</li>
<li style="box-sizing: border-box; margin-top: 0.25em;"><div style="box-sizing: border-box; margin-bottom: 16px; margin-top: 16px;">
tm is no longer a dependency for stopwords. We now use the stopwords package. A knock-on benefit is that there is no longer <em style="box-sizing: border-box;">any</em> Java dependency.</div>
</li>
<li style="box-sizing: border-box; margin-top: 0.25em;"><div style="box-sizing: border-box; margin-bottom: 16px; margin-top: 16px;">
Several packages have been moved from "Imports" to "Suggests". The result is a faster install and a lower likelihood of install failure due to packages with system dependencies. (Looking at you, topicmodels!)</div>
</li>
<li style="box-sizing: border-box; margin-top: 0.25em;"><div style="box-sizing: border-box; margin-bottom: 16px; margin-top: 16px;">
Finally, I have changed the textmineR license to the MIT license. Note, however, that some dependencies may have more restrictive licenses. So if you're looking to use textmineR in a commercial project, you may want to dig deeper into what is/isn't permissible.</div>
</li>
</ul>
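<br />
To make the new object-oriented workflow concrete, here is a minimal sketch of fitting a model and predicting on new documents. It assumes the <code>nih_sample</code> data that ships with textmineR, and the argument names are from memory, so treat <code>?FitLdaModel</code> and the predict method's help page as authoritative.<br />
<pre><code>library(textmineR)

# document-term matrix from the NIH abstracts bundled with textmineR
dtm <- CreateDtm(doc_vec = nih_sample$ABSTRACT_TEXT,
                 doc_names = nih_sample$APPLICATION_ID)

# fit a 20-topic LDA model with the native Gibbs sampler
model <- FitLdaModel(dtm = dtm,
                     k = 20,
                     iterations = 500,
                     burnin = 400,        # average over the chain after burn-in
                     alpha = 0.1,         # prior on topics over documents
                     beta = 0.05,         # prior on words over topics
                     optimize_alpha = TRUE,
                     calc_likelihood = TRUE,
                     calc_coherence = TRUE,
                     calc_r2 = TRUE)

# the model is an S3 object, so new documents go through predict()
new_theta <- predict(model, dtm[1:5, ], iterations = 200, burnin = 150)

SummarizeTopics(model)
</code></pre>
<br />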
<h3>textmineR 2.1.0 is up (2018-02-23)</h3>
Over the weekend I released textmineR 2.1.0 to CRAN (<a href="https://cran.r-project.org/package=textmineR" target="_blank">current version here</a>). The current version contains a couple of minor updates and 5 vignettes to get you up and running with text mining.<br />
<br />
The vignettes cover the philosophy of textmineR, basic corpus statistics, document clustering, topic modeling, text embeddings (which is basically topic modeling of a term co-occurrence matrix), and building a basic document summarizer. That last vignette uses text embeddings plus a variation of the <a href="https://en.wikipedia.org/wiki/Automatic_summarization#Unsupervised_approach:_TextRank" target="_blank">TextRank</a> algorithm.<br />
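If you want to jump straight to them, the vignettes are bundled with the package, so something like this should pull them up:<br />
<pre><code>install.packages("textmineR")

# list and open the bundled vignettes in a browser
browseVignettes("textmineR")
</code></pre>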
<br />
The other updates are relatively minor. <a href="https://github.com/manuelbickel" target="_blank">@manuelbickel</a> discovered that my implementation of <code>CalcProbCoherence</code> was scaled differently from what I'd intended. That's fixed, though it shouldn't affect the qualitative use of probabilistic coherence. Second, I realized that my documentation for <code>CreateTcm</code> was misleading. So, that's now fixed.
<h3>A few things I'm working on... (2017-12-14)</h3>
I've got a few things in the pipe over the next 6 months or so that I want to get out of my brain and onto paper. Some of them will even end up on this blog!<br />
<br />
<br />
<ol>
<li><b>A proper vignette for textmineR</b><br /><br />It turns out that "here you go, just read the documentation" isn't the best way to get people to use your package. I am going to write a proper vignette that explains what problem(s) textmineR is trying to solve, lays out its framework, and gives lots of examples of text mining with textmineR.<br /><br /></li>
<li><b>A derivation of neural nets using matrix math, with example R code</b><br /><br />I took a machine learning class this past semester. One of our assignments was to code a neural network from (basically) scratch. Almost every example I found had the mathematical derivation written as if it were in the middle of a "for" loop. I think this makes the notation cumbersome, and it doesn't let you leverage a vectorized programming language (like R). So, I did the derivations myself (though I'm sure they exist somewhere else on the internet). A rough sketch of what I mean is below this list.<br /><br /></li>
<li><b>Calculating marginal effects for arbitrary predictive models</b><br /><br />I've had this idea kicking around in my head for quite some time. It builds off of ICE curves and my friend <a href="https://biodatamining.biomedcentral.com/articles/10.1186/1756-0381-7-2" target="_blank">Abhijit's work</a>. Abhijit has volunteered to work on it with me to turn it into a proper paper and R package. For now, <a href="https://github.com/TommyJones/marginal" target="_blank">the code is here</a>.<br /><br /></li>
<li><b>Updating textmineR's code base</b><br /><br />I want to write my own implementations of some topic models in C++. I'm planning to do this over the summer. The main push is to write a parallel Gibbs sampler for LDA and to allow for asymmetric priors. I am (still) doing topic model research for my dissertation. Implementing some topic models from scratch will be good practice for me and (hopefully) useful to the community. I may also implement DTM and TCM calculation on my own too. If I do all of that, I may be able to change textmineR's license to the more permissive MIT license. I'd like to do that.<br /><br /></li>
<li><b>Using topic models and/or word embeddings to track narrative arcs within (longer) documents</b><br /><br />So, I literally just thought about this last night when I was going to bed. The gist is, build a topic model (either traditionally or by using word embeddings) off of a corpus. Then predict topic distributions over a sliding window within a document. This should create several time series of topics. Then one can use regime detection and lagging to parameterize how the narrative changes and relates to itself throughout the document. I have no idea if this will work or if it's already been done.</li>
</ol>
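<div>
To illustrate what I mean in (2): with matrix notation, the whole training set moves through the network at once and the gradient updates are a handful of matrix products, no per-observation "for" loop required. The sketch below is not the class assignment, just a toy one-hidden-layer network with sigmoid activations and cross-entropy loss; the dimensions, learning rate, and iteration count are made up, and biases are omitted to keep it short.</div>
<pre><code>sigmoid <- function(z) 1 / (1 + exp(-z))

set.seed(90210)
n <- 200; p <- 5; h <- 10
X <- matrix(rnorm(n * p), n, p)
y <- matrix(rbinom(n, 1, sigmoid(X %*% rnorm(p))), n, 1)

W1 <- matrix(rnorm(p * h, sd = 0.1), p, h)  # input-to-hidden weights
W2 <- matrix(rnorm(h, sd = 0.1), h, 1)      # hidden-to-output weights
lr <- 0.1

for (iter in 1:2000) {
  # forward pass: all n observations at once
  A1   <- sigmoid(X %*% W1)    # n x h hidden activations
  yhat <- sigmoid(A1 %*% W2)   # n x 1 predicted probabilities

  # backward pass: the deltas are matrices too
  d2 <- yhat - y                        # from cross-entropy + sigmoid output
  d1 <- (d2 %*% t(W2)) * A1 * (1 - A1)  # n x h

  W2 <- W2 - lr * t(A1) %*% d2 / n
  W1 <- W1 - lr * t(X) %*% d1 / n
}

mean((yhat > 0.5) == y)  # training accuracy on the simulated data
</code></pre>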
<div>
<br /></div>
<div>
I'm hoping to get (1) and (2) out sometime between now and January 10 (basically before classes start again). I hope (3) will be done by JSM this August. (I guess that means I should submit an abstract?) And I hope (4) will be done by September (when Fall classes start, my last semester of classes). I have no idea if and when I'll tackle (5). </div>
<h3>textmineR has a logo (2017-10-16)</h3>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiHD4CaLKqshjEEcvbCtGERj31uNFCBpZH_t5SCZBfeMR7846A9EgAH5UZJ1ne3IACl07AtW2zepCcxzjCROD8kKFRi2-OpYS-KVTIeYKiIpesvBET3eYqNIMbc1gx-DEVyEy_pYuhV7v8/s1600/textmineR6.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="900" data-original-width="796" height="320" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiHD4CaLKqshjEEcvbCtGERj31uNFCBpZH_t5SCZBfeMR7846A9EgAH5UZJ1ne3IACl07AtW2zepCcxzjCROD8kKFRi2-OpYS-KVTIeYKiIpesvBET3eYqNIMbc1gx-DEVyEy_pYuhV7v8/s320/textmineR6.png" width="283" /></a></div>
<br />
<br />
I was at the EARL conference in San Francisco a couple of months ago and got inspiration from Airbnb. Airbnb has its own R package that it uses internally. To gin up interest and encourage employees to use it and contribute to it, they distributed swag.<br />
<br />
So, in that vein, I present the textmineR logo. I'll be getting stickers made and throwing them around wherever I am.<br />
<br />
I acknowledge this is the easy way out. I still need to write a vignette on textmineR. I'm probably being naive, but I expect to get several papers out when I'm done with classes and in dissertation land. So, maybe then?
<h3>Anyone else think implementing back propagation from scratch is kind of fun? (2017-10-12)</h3>
<blockquote class="twitter-tweet">
<p lang="en" dir="ltr" xml:lang="en">Anyone else think implementing back propagation from scratch is kind of fun?</p>
— Thomas Jones (@thos_jones) <a href="https://twitter.com/thos_jones/status/918651245170249728?ref_src=twsrc%5Etfw">October 13, 2017</a></blockquote>
<script async="" src="//platform.twitter.com/widgets.js" charset="utf-8" type="text/javascript">
</script><br />
<h3>Weird Error: fatal error in wrapper code (2016-06-01)</h3>
I suppose I'm publishing this so that I can save the next programmer the effort of tracking down the source of a weird error thrown by <code>mclapply</code>.<br />
<blockquote class="tr_bq">
<code>fatal error in wrapper code</code></blockquote>
What? The cause, according to <a href="http://watson.nci.nih.gov/bioc_mirror/packages/release/bioc/vignettes/variancePartition/inst/doc/variancePartition.pdf" target="_blank">this</a>, is that (I think) <code>mclapply</code> is using too many threads or taking up too much memory. This is pretty consistent with the code I was running at the time.<br />
<br />
The solution, then, would be to use fewer threads. I'll re-run the code tonight and see if that fixes it. I'll update this post when I confirm or reject the solution.
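<br />
If the diagnosis is right, the fix is just capping the number of child processes, something like the sketch below (the object names and core count are placeholders, not my actual code):<br />
<pre><code>library(parallel)

results <- mclapply(
  X = big_list_of_inputs,    # placeholder for the real workload
  FUN = expensive_function,  # placeholder
  mc.cores = 2               # fewer forks means less simultaneous memory use
)
</code></pre>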
<div dir="ltr" lang="en">
textmineR's number one concern is usability! Thanks for looking out for us <a href="https://twitter.com/thos_jones">@thos_jones</a> <a href="https://twitter.com/hashtag/datadc?src=hash">#datadc</a> <a href="https://twitter.com/hashtag/NLP?src=hash">#NLP</a> <a href="https://t.co/iIfFBNaALW">pic.twitter.com/iIfFBNaALW</a></div>
— Danielle Beaulieu (@andDunny) <a href="https://twitter.com/andDunny/status/725469252694634497">April 27, 2016</a></blockquote>
<br />
<br />
<script async="" charset="utf-8" src="//platform.twitter.com/widgets.js"></script>
I (quietly) released an R package back in January, <a href="https://cran.r-project.org/web/packages/textmineR/index.html" target="_blank">textmineR</a>. It's a text mining tool designed for usability following 3 basic principles:<br />
<br />
<br />
<ol>
<li>Syntax that is idiomatic to R (basically, useRs will find it intuitive)</li>
<li>Maximal integration with the wider R ecosystem (use base classes as much as possible, use widely-adopted classes in all other cases)</li>
<li>Scalable, because NLP data is pretty big</li>
</ol>
<div>
<br /></div>
<div>
I implemented this in textmineR by</div>
<div>
<ol>
<li>Making the document term matrix (DTM) and term co-occurrence matrix (TCM) the central mathematical objects in the workflow</li>
<li>Creating DTMs and TCMs in one step, with options given as arguments (a quick sketch follows this list).</li>
<li>Expecting documents to be character vectors (nothing more).</li>
<li>Letting you store your corpus metadata however you want. I like data frames (which is probably what you created when you read the data into R anyway)</li>
<li>Making DTMs and TCMs of class dgCMatrix from the Matrix library. These objects...</li>
<ol>
<li>are widely adopted</li>
<li>have methods and functions making their syntax familiar to base R dense matrices</li>
</ol>
<li>Writing wrappers for a bunch of topic models, so they take DTMs/TCMs as input and return similarly-formatted outputs</li>
<li>Adding a bunch of topic model utility functions, some of which come from my own research (R-squared, anyone?)</li>
<li>Building textmineR on top of <a href="https://cran.r-project.org/web/packages/text2vec/index.html" target="_blank">text2vec</a>, which is really, really, really wicked fast. (<a href="http://dsnotes.com/articles/text2vec-0-3" target="_blank">really</a>)</li>
</ol>
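<div>
Here's a quick sketch of what points 1 through 5 look like in practice. The function and argument names are as they stand in recent versions of the package (older releases called the DTM builder <code>Vec2Dtm</code>), and the object and column names are illustrative.</div>
<pre><code>library(textmineR)

# docs is just a character vector; metadata stays in whatever data frame
# you already have
dtm <- CreateDtm(doc_vec = docs,
                 doc_names = my_metadata$doc_id,
                 ngram_window = c(1, 2))  # unigrams and bigrams, one step

class(dtm)  # "dgCMatrix" from the Matrix package
dim(dtm)    # documents x tokens
head(sort(colSums(dtm), decreasing = TRUE))  # term counts, base-R style
</code></pre>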
<div>
I gave a <a href="http://tommyjones.github.io/StatProgDC_2014_04_27/assets/player/KeynoteDHTMLPlayer.html#0" target="_blank">talk at Statistical Programming DC</a> on Wednesday night announcing the package. The turnout was great (terrifying) and I received a bunch of great feedback.</div>
</div>
<div>
<br /></div>
<div>
I walked into work Friday morning and <a href="https://github.com/TommyJones/textmineR/issues/21" target="_blank">found a bug that affects all Windows users</a>. :(</div>
<div>
<br /></div>
<div>
Fortunately, the <a href="https://github.com/TommyJones/textmineR" target="_blank">development version</a> is patched and ready to rock!</div>
<div>
<br /></div>
<div>
Unfortunately, I blew my one chance to get a rapid patch up on CRAN by <a href="https://twitter.com/thos_jones/status/724274317677219840" target="_blank">pushing another patch the previous weekend.</a> We'll have to wait until the end of May for me to push the latest patch. I'm afraid that all the goodwill that came from the talk will be squandered when half (or more) of the people who try it end up getting errors.</div>
<div>
<br /></div>
<div>
Hopefully, that will give me time to finish <a href="https://github.com/TommyJones/textmineR_vignette" target="_blank">writing the vignette</a>.</div>
<h3>More on statisticians in data science (2015-11-05)</h3>
<div>
<br /></div>
<div>
<br /></div>
The November issue of <a href="http://magazine.amstat.org/">AMSTAT News</a> has published an opinion piece by yours truly on the identity of statisticians in data science. My piece starts on<a href="http://magazine.amstat.org/wp-content/uploads/2015/10/AMSTATNEWSNOV15.pdf"> page 25 of the print version</a>. The <a href="http://magazine.amstat.org/blog/2015/11/01/statnews2015/">online version is here</a>. A quote:<div>
<br /></div>
<blockquote class="tr_bq">
I am not convinced that statistics is data science. But I am convinced that the fundamentals of probability and mathematical statistics taught today add tremendous value and cement our identity as statisticians in data science.</blockquote>
Please read the <a href="http://magazine.amstat.org/blog/2015/11/01/statnews2015/">whole thing</a>.<br />
<br />
<a href="http://magazine.amstat.org/wp-content/themes/arthemia/scripts/timthumb.php?src=/wp-content/uploads/2015/10/statview.png&w=100&h=65&zc=1&q=100" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" src="http://magazine.amstat.org/wp-content/themes/arthemia/scripts/timthumb.php?src=/wp-content/uploads/2015/10/statview.png&w=100&h=65&zc=1&q=100" /></a><a href="http://magazine.amstat.org/wp-content/themes/arthemia/scripts/timthumb.php?src=/wp-content/uploads/2015/10/statview.png&w=100&h=65&zc=1&q=100" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><br /></a>Tommyhttp://www.blogger.com/profile/14573787040201086607noreply@blogger.com2tag:blogger.com,1999:blog-4093229006254491664.post-57379169749685875322015-05-08T09:30:00.000-04:002015-05-08T09:30:02.054-04:00Oops.
I made an accident yesterday.<br />
<br />
<blockquote class="twitter-tweet" lang="en">
<div dir="ltr" lang="en">
Oops. <a href="http://t.co/IDYmbdDwvP">pic.twitter.com/IDYmbdDwvP</a></div>
— 3e Labs (@3eLabs) <a href="https://twitter.com/3eLabs/status/596364721558790144">May 7, 2015</a></blockquote>
<br />
What happened? When creating a matrix of zeros I accidentally typed <code>matrix()</code> instead of <code>Matrix()</code>.<br />
<br />
What's the difference? 4.8 terabytes versus less than one GB. I was creating a document-term matrix of about 100,000 documents with a vocabulary of about 6,000,000 tokens. This is the thing with linguistic data: one little mistake is the difference between working on a MacBook Air with no fuss and something that would make a supercomputer choke. (Anyone want to get me a quote on hardware with 5 TB of RAM?)
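<br />
<br />
For anyone who wants to see the arithmetic, here's a sketch. The dimensions mirror the numbers above; the dense call is left commented out for obvious reasons.<br />
<pre><code>library(Matrix)

n_docs  <- 100000
n_vocab <- 6000000

# a dense base-R matrix stores 8 bytes per cell
n_docs * n_vocab * 8 / 1e12   # ~4.8 terabytes

# matrix(0, n_docs, n_vocab)  # don't run this one

# a sparse Matrix only stores the non-zero cells (plus some bookkeeping)
dtm <- Matrix(0, nrow = n_docs, ncol = n_vocab, sparse = TRUE)
format(object.size(dtm), units = "MB")  # a few dozen megabytes, not terabytes
</code></pre>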
<script async="" charset="utf-8" src="//platform.twitter.com/widgets.js"></script>
<h3>Are microeconomists data scientists? (2015-01-14)</h3>
From The Economist:<br />
<br />
<blockquote class="tr_bq">
Armed with vast data sets produced by tech firms, microeconomists can produce startlingly good forecasts of human behaviour. Silicon Valley firms have grown to love them: by bringing a cutting-edge economist in house, they are able to predict what customers or employees are likely to do next.</blockquote>
<br />
Sounds like data scientists to me. The article is <a href="http://www.economist.com/news/leaders/21638117-microeconomics-powered-data-shaping-tech-firms-trend-has-lessons-macroeconomics">here</a>. There's a related piece <a href="http://www.economist.com/news/finance-and-economics/21638142-consumers-reap-benefits-e-commerce-surprising-ways-hidden-long">here</a>.
<h3>Introducing R-squared for Topic Models (2015-01-09)</h3>
I have <a href="https://drive.google.com/file/d/0Bz2enPyUvnKIQmtDTEswbzdUMU0/view?usp=sharing">a new working paper</a> added to my <a href="http://www.biasedestimates.com/p/publications-and-working-papers.html">publications page</a>. The abstract reads:<br />
<blockquote class="tr_bq">
This document proposes a new (old) metric for evaluating goodness of fit in topic models, the coefficient of determination, or R<sup>2</sup>. Within the context of topic modeling, R<sup>2</sup> has the same interpretation that it does when used in a broader class of statistical models. Reporting R<sup>2</sup> with topic models addresses two current problems in topic modeling: a lack of standard cross-contextual evaluation metrics for topic modeling and ease of communication with lay audiences. This paper proposes that R<sup>2</sup> should be reported as a standard metric when constructing topic models.</blockquote>
In researching this paper, I came across a potentially significant (in the colloquial sense) finding.<br />
<br />
<blockquote class="tr_bq">
These properties of R<sup>2</sup> compel a change in focus for the topic modeling research community. Document length is an important factor in model fit whereas the number of documents is not. When choosing between fitting a topic model on short abstracts, full-page documents, or multi-page papers, it appears that more text is better. Second, since model fit is invariant for corpora over 1,000 documents (our lower bound for simulation), sampling from large corpora should yield reasonable estimates of population parameters. The topic modeling community has heretofore focused on scalability on large corpora of hundreds-of-thousands to millions of documents. Little attention has been paid to document length, however. These results, if robust, indicate that focus should move away from larger corpora and towards lengthier documents. </blockquote>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgRc41suJTpgV1oaMwlCl6-nG6lsL65VUGmac7FK3TZOCHOCzBJxGCwCQZ7_cyYcaTY79lfWGOjxNstURYNExgjfJDYhJ3nJciBuh6L5njMhDMk8DFm72gK_TXzTY3L34A9Ucz_tzh-Jqg/s1600/sim_d_range_theoretical.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgRc41suJTpgV1oaMwlCl6-nG6lsL65VUGmac7FK3TZOCHOCzBJxGCwCQZ7_cyYcaTY79lfWGOjxNstURYNExgjfJDYhJ3nJciBuh6L5njMhDMk8DFm72gK_TXzTY3L34A9Ucz_tzh-Jqg/s1600/sim_d_range_theoretical.png" height="320" width="320" /></a></div>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi9GIkPKE-KvZ0JiKqhCA99wkTdhIGP24QFsIITZYcTW3lGfknQUPTAUK6wKwSDYJe3QUG1Xg-gyiKkJOfpHjIbePDVcOsOL5nShZ83kQRfIsYKRDiMZb8pKO2_syIosS9LjBCXo8yzUR0/s1600/sim_lambda_range_theoretical.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi9GIkPKE-KvZ0JiKqhCA99wkTdhIGP24QFsIITZYcTW3lGfknQUPTAUK6wKwSDYJe3QUG1Xg-gyiKkJOfpHjIbePDVcOsOL5nShZ83kQRfIsYKRDiMZb8pKO2_syIosS9LjBCXo8yzUR0/s1600/sim_lambda_range_theoretical.png" height="320" width="320" /></a></div>
<br />
<br />
Basically:<br />
<br />
<ol>
<li>Stop building topic models on tweets and stop using the 20 Newsgroups data.</li>
<li>Your argument of needing gigantic corpora (and algorithms that scale to them) is invalid.</li>
<li>Citing points (1) and (2), I am prone to hyperbole.</li>
<li>Citing points (1) through (3), I like numbered lists.</li>
</ol>
<div>
<br /></div>
<div>
In all seriousness, if you take the time to read the paper and have any comments, please let me know. You can comment below, use the "contact me directly" tool on your right, or email me (if you know my email already). I'll be circulating this paper among friends and colleagues to get their opinions as well.</div>
<div>
</div>
<h3>P-values (2014-12-31)</h3>
Source: <a href="http://www.smbc-comics.com/?id=3590#comic">Saturday Morning Breakfast Cereal</a><br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="http://www.smbc-comics.com/comics/20141225after.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://www.smbc-comics.com/comics/20141225after.png" height="400" width="400" /></a></div>
<br />
<h3>Labeling topics from topic models (2014-12-23)</h3>
My friend <a href="https://twitter.com/greenbacker">Charlie Greenbacker</a> showed me <a href="https://github.com/cpsievert/LDAvis/tree/new">LDAvis</a>, which creates interactive apps for visualizing topics from topic models. As often happens, this led to a back and forth between Charlie and me covering a range of topic modeling issues. I've decided to share and expand on bits of the conversation over a few posts.<br />
<br />
<h4>
The old way</h4>
<div>
<br /></div>
<div>
The traditional method for reporting a topic (let's call it topic X) is to list the top 3 to 5 words in topic X. What do I mean by "top" words? In topic modeling a "topic" is a probability distribution over words. Specifically, "topic X" is really <i>P( words | topic X)</i>. </div>
<div>
<br /></div>
<div>
Here is an example from the empirical section of the R-squared paper I'm working on:</div>
<div>
<br /></div>
<blockquote class="tr_bq">
"mode" "model" "red" "predict" "tim" "data" </blockquote>
<br />
A few things pop out pretty quickly.<br />
<br />
<ol>
<li>These are all unigrams. The model includes bigrams, but they aren't at the top of the distribution. </li>
<li>A couple of the words seem to be truncated. Is "mode" supposed to be "model"? Is "tim" supposed to be "time"? It's really hard to tell without any context. (Even if these are truncated, it wouldn't greatly affect the fit of the model. It just makes it look ugly.)</li>
<li>From the information we have, a good guess is that this topic is about data modeling or prediction or something like that.</li>
</ol>
<div>
<br /></div>
<div>
The incoherence of these terms on their own requires topic modelers to spend massive amounts of time curating a dictionary for their final model. If you don't, you may end up with a topic that looks like topic 1 in this <a href="http://cpsievert.github.io/LDAvis/newsgroup/newsgroup.html">example</a>. Good luck interpreting that!</div>
<div>
<br /></div>
<div>
(As a complete aside, it looks like topic 1 is the result of using an asymmetric Dirichlet prior for topics over documents. Those that attended my <a href="http://www.biasedestimates.com/2014/11/lda-and-topic-models-reading-list.html">DC NLP talk</a> know that I have ambivalent feelings about this.)</div>
<div>
<br /></div>
<br />
<h4>
A new approach</h4>
<div>
<br /></div>
<div>
I'm going to get a little theoretical on you: <a href="http://en.wikipedia.org/wiki/Zipf%27s_law">Zipf's law</a> tells me that, in theory, the most probable terms in every topic should be stop words. Think about it. When I'm talking about cats, I still use words like "the", "this", "an", etc. waaaaay more than any cat-specific words. (That's why Zipf's law is, well...., a law.)</div>
<div>
<br /></div>
<div>
Even if we remove general stop words before modeling, I probably have a lot of corpus-specific stop words. Pulling those out, while trying to preserve the integrity of my data, is no easy task. (It's also a little like performing surgery with a chainsaw.) That's why so much time is spent on vocabulary curation.</div>
<div>
<br /></div>
<div>
My point is that I don't think <i>P(words | topic X)</i> is the right way to look at this. Zipf's law means that I <i>expect</i> the most probable words in that distribution to contain no contextual meaning. All that dictionary curation isn't just time consuming; it's perverting our data.</div>
<div>
<br /></div>
<div>
But what happens if we throw a little <a href="http://en.wikipedia.org/wiki/Bayes%27_theorem">Bayes' Theorem</a> at this problem? Instead of ordering words by <i>P( words | topic x)</i>, let's order them according to <i>P( topic x | words)</i>.</div>
<div>
<br /></div>
<blockquote class="tr_bq">
"prediction_model" "causal_inference" "force_field" "kinetic_model" "markov" "gaussian"</blockquote>
<br />
As I said to Charlie: "Big difference in interpretability, no?"<br />
<br />
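Mechanically, the flip is just Bayes' theorem applied to phi, with the topic proportions implied by theta acting as the prior. Here is a sketch, assuming phi is a K x V matrix of <i>P(words | topics)</i> and theta is a D x K matrix of <i>P(topics | documents)</i>; the "topic_x" row name is illustrative.<br />
<pre><code># marginal P(topic) across the corpus
p_topic <- colSums(theta) / sum(theta)

# Bayes: P(topic | word) is proportional to P(word | topic) * P(topic)
phi_prime <- phi * p_topic                          # scales row k of phi by p_topic[k]
phi_prime <- t(t(phi_prime) / colSums(phi_prime))   # each word's column now sums to 1

# top terms under each ordering
head(sort(phi["topic_x", ], decreasing = TRUE))        # P(words | topic): stop-word-ish
head(sort(phi_prime["topic_x", ], decreasing = TRUE))  # P(topic | words): contextual
</code></pre>
<br />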
<h4>
Full labels</h4>
<div>
<br /></div>
<div>
I think that all of our dictionary curation hurts us beyond being a time sink. I think it makes our models fit the data worse. This has two implications: we have less trust in the resulting analysis and our topics are actually more statistically muddled, not less. </div>
<div>
<br /></div>
<div>
We (as in I and some other folks who work with me) have come up with some automated ways to label topics. This method works by grouping documents together by topic and then extracting keywords from the documents. (The difference between my work and the other guys I work with is in the keyword extraction step.)</div>
<div>
<br /></div>
<div>
The method basically works like this:</div>
<div>
<ol>
<li>For each topic:</li>
<li>Grab a set of documents with high prevalence of that topic.</li>
<li>In a document term matrix of bigrams and trigrams, calculate <i>P( words | that set of documents) - P( words in the overall corpus )</i> (this scoring step is sketched in code a little further down)</li>
<li>Take the n-gram with the highest score as your label.</li>
<li>Next topic</li>
</ol>
<div>
My label for "topic X"?</div>
</div>
<div>
<br /></div>
<blockquote class="tr_bq">
"statistical_method"</blockquote>
<br />
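For the curious, here is roughly what that looks like in R. This is a sketch under stated assumptions, not our production code: theta is a D x K matrix of <i>P(topics | documents)</i>, ngram_dtm is a D x V document-term matrix of bigram/trigram counts, and the 0.25 prevalence cutoff is illustrative.<br />
<pre><code>label_topic <- function(k, theta, ngram_dtm, threshold = 0.25) {
  in_topic <- theta[, k] > threshold          # step 2: docs where topic k is prevalent
  if (!any(in_topic)) return(NA_character_)   # the "methods topic" failure mode
  p_topic_docs <- colSums(ngram_dtm[in_topic, , drop = FALSE]) /
    sum(ngram_dtm[in_topic, , drop = FALSE])
  p_corpus <- colSums(ngram_dtm) / sum(ngram_dtm)
  score <- p_topic_docs - p_corpus            # step 3
  names(which.max(score))                     # step 4: best-scoring n-gram
}

labels <- sapply(seq_len(ncol(theta)), label_topic,
                 theta = theta, ngram_dtm = ngram_dtm)
</code></pre>
<br />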
It's not perfect. I have noticed that similar topics tend to get identical labels. The labeling isn't so good at picking up on subtle differences. Some topics are what I call "methods" rather than "subjects". (This is because most of my topic modeling is on scientific research papers.) The "methods" rarely have a high proportion in any document. The document simply isn't "about" its methods; it's about its subject. When this happens, sometimes I don't get any documents to go with a methods topic. The labeling algorithm just returns "NA". No bueno.<br />
<div>
<br /></div>
<br />
<h4>
One last benefit</h4>
<div>
<br /></div>
<div>
By not butchering the statistical signals in our documents with heavy-handed dictionary curation, we get some nice properties in the resulting model. One, for example, is that we can cluster topics together cleanly. So, I can create a nice hierarchical dendrogram of all my topics. (I can also use the labeling algorithm to label groups higher up the tree if I want.)</div>
<div>
<br /></div>
<div>
You can check out one of the dendrograms I'm using for the R-squared paper by <a href="https://drive.google.com/file/d/0Bz2enPyUvnKIaVJpQUZqblhQdFE/view?usp=sharing">clicking here</a>. The boxes are clusters of topics based on linguistic similarity and document occurrence. (It's easier to see if you zoom in.) It's a model of 100 topics on 10,000 randomly-sampled NIH grant abstracts. (<a href="http://exporter.nih.gov/">You can get your own here.</a>)</div>
<br />
<br />
<br />
<br />
<br />
<h3>Notes on the culture of economics (2014-12-17)</h3>
I'm finally getting around to reading Piketty's <i><a href="http://en.wikipedia.org/wiki/Capital_in_the_Twenty-First_Century">Capital in the 21st Century</a></i>. That and a project at work have put economics back to the front of my brain. I found the posts below interesting.<br />
<br />
Paul Krugman says in "<a href="http://krugman.blogs.nytimes.com/2014/11/30/notes-on-the-floating-crap-game-economics-inside-baseball/">Notes on the Floating Crap Game (Economics Inside Baseball)</a>"<br />
<br />
<blockquote class="tr_bq">
So, academic economics is indeed very hierarchical; but I think it’s important to understand that it’s not a bureaucratic hierarchy, nor can status be conferred by crude patronage. The profession runs on reputation — basically the shared perception that you’re a smart guy. But how do you get reputation? [...] [R]eputation comes out of clever papers and snappy seminar presentations. </blockquote>
<blockquote class="tr_bq">
[...] Because everything runs on reputation, a lot of what you might imagine academic politics is like — what it may be like in other fields — doesn’t happen in econ. When young I would have relatives asking whether I was “in” with the department head or the senior faculty in my department, whether I was cultivating relationships, whatever; I thought it was funny, because all that mattered was your reputation, which was national if not global.</blockquote>
<br />
Not everything Krugman says is rosy for economists. Nevertheless, this is consistent with my experience when I was in economics. Econ has a hierarchical structure, but it's not based on patronage or solely "length of service." For example, when I was at the Fed, the internal structure was quite hierarchical in terms of both titles and managerial responsibility. (It kind of reminded me of the military.) However, it also had a paradoxically "flat" culture. Ideas were swapped and debated constantly. Though I was a lowly research assistant, my forecasts were respected and my input listened to. I was no exception; this was just how we operated.<br />
<br />
<a href="http://simplystatistics.org/2014/12/14/sunday-datastatistics-link-roundup-121414/">Simply statistics</a> brought another post to my attention. From Kevin Drum at Mother Jones: <i><a href="http://www.motherjones.com/kevin-drum/2014/12/economists-are-almost-inhumanly-impartial">Economists are Almost Inhumanly Impartial</a></i>.<br />
<br />
<blockquote class="tr_bq">
Over at 538, a team of researchers takes on the question of whether economists are biased. Given that economists are human beings, it would be pretty shocking if the answer turned out to be no, and sure enough, it's not. In fact, say the researchers, liberal economists tend to produce liberal results and conservative economists tend to produce conservative results. This is unsurprising, but oddly enough, I'm also not sure it's the real takeaway here. [...]</blockquote>
<blockquote class="tr_bq">
What I see is a nearly flat regression line with a ton of variance. [...] If these results are actually true, then congratulations economists! You guys are pretty damn evenhanded. The most committed Austrians and the most extreme socialists are apparently producing numerical results that are only slightly different. If there's another field this side of nuclear physics that does better, I'd be surprised.</blockquote>
<br />
(I'll leave it to you to check out the regression line in question.)<br />
<br />
Simply statistics's Jeff Leek has a <a href="http://simplystatistics.org/2014/12/14/sunday-datastatistics-link-roundup-121414/">different take</a>.<br />
<blockquote class="tr_bq">
I'm not sure the regression line says what they think it does, particularly if you pay attention to the variance around the line.</blockquote>
<br />
I don't know what Leek is getting at exactly; maybe we agree. What I see is a nearly flat line through a cloud of points. My take isn't that economists are unbiased. Rather, their bias is generally uncorrelated with their ideology. That's still a good thing, right? (Either way, I am not one for the philosophy that p < 0.05 means it's true and p > 0.05 means it's false.)<br />
<br />
Here's what I've told other people: microeconomics is about as close to a science as you're going to get. It's a lot like studying predator-prey systems in the wild. There's definitely stochastic variation, but the trends are pretty clear; not much to argue about. Macroeconomics, on the other hand, is a lot trickier. It's not that macroeconomists are any less objective than microeconomists. Rather, measurement and causality are much trickier. In the resulting vacuum, there's room for different assumptions and philosophies. This is what macroeconomists debate about.<br />
<br />
Nevertheless, my experience backs up a comment to Drum's article:<br />
<blockquote class="tr_bq">
Economists generally avoid and form consensus in regard to fringe theories. </blockquote>
<br />
Translation: the differences in philosophy between macroeconomists aren't as big as you'd think. And they're tiny compared to our political differences.
<h3>Empowering People with Machine Learning (2014-12-16)</h3>
From an <a href="http://www.wsj.com/articles/automation-makes-us-dumb-1416589342">article in the Wall Street Journal</a>:<br />
<br />
<blockquote class="tr_bq">
<blockquote class="tr_bq">
When system designers begin a project, they first consider the capabilities of computers, with an eye toward delegating as much of the work as possible to the software. The human operator is assigned whatever is left over, which usually consists of relatively passive chores such as entering data, following templates and monitoring displays.</blockquote>
<blockquote class="tr_bq">
This philosophy traps people in a vicious cycle of de-skilling. By isolating them from hard work, it dulls their skills and increases the odds that they will make mistakes. When those mistakes happen, designers respond by seeking to further restrict people’s responsibilities—spurring a new round of de-skilling.</blockquote>
<blockquote class="tr_bq">
Because the prevailing technique “emphasizes the needs of technology over those of humans,” it forces people “into a supporting role, one for which we are most unsuited,” writes the cognitive scientist and design researcher Donald Norman of the University of California, San Diego.</blockquote>
<blockquote class="tr_bq">
There is an alternative.</blockquote>
<blockquote class="tr_bq">
In “human-centered automation,” the talents of people take precedence. Systems are designed to keep the human operator in what engineers call “the decision loop”—the continuing process of action, feedback and judgment-making. That keeps workers attentive and engaged and promotes the kind of challenging practice that strengthens skills.</blockquote>
<blockquote class="tr_bq">
In this model, software plays an essential but secondary role. It takes over routine functions that a human operator has already mastered, issues alerts when unexpected situations arise, provides fresh information that expands the operator’s perspective and counters the biases that often distort human thinking. The technology becomes the expert’s partner, not the expert’s replacement.</blockquote>
<blockquote class="tr_bq">
Pushing automation in a more humane direction doesn't require any technical breakthroughs. It requires a shift in priorities and a renewed focus on human strengths and weaknesses</blockquote>
<div>
<br /></div>
</blockquote>
Two thoughts come to mind:<br />
<br />
First, there's Tyler Cowen's analogy of freestyle chess. He uses this analogy liberally in <i>Average is Over</i>. And the division of labor between human and computer in freestyle chess mirrors the above quote.<br />
<br />
Second, I was taught the dichotomy of these two philosophies in the Marine Corps. I enlisted just before 9/11; ten-plus years of war may have changed the budgetary environment. But at the time, Marine infantry units did not have much of a budget. As a result, we trained ourselves (and our minds) first, and supplemented with what technology we could afford. When we occasionally trained with some other (unnamed) branches of the military, we observed that these other units were awash in technology, helpless without it, and not any better than us with it. (Think fancy GPS versus old GPS + map & compass.)<br />
<br />
I believe that the latter thought is an example of another quote from the WSJ article:<br />
<br />
<blockquote class="tr_bq">
If we let our own skills fade by relying too much on automation, we are going to render ourselves less capable, less resilient and more subservient to our machines. </blockquote>
<br />
Something to keep in mind as you're implementing your decision support systems.
<h3>Ukraine? (2014-12-15)</h3>
So, I've noticed a trend over the last few months in the blog's traffic. The vast majority of hits seem to be coming from domains ending in ".ru".<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhvsBTxrBx7kPMfUv3FlEdKfvdkIC8Jrysgo_foyyiQSNtovDxGt178afOGK2Hzj6bvU3UckusvYIIJZxtYiNcHFB25l9XdAxxbQv_kiFELpgOIyenwbS3arfGmLAjTQitYQ8xsSgnqh-U/s1600/traffic_sources.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhvsBTxrBx7kPMfUv3FlEdKfvdkIC8Jrysgo_foyyiQSNtovDxGt178afOGK2Hzj6bvU3UckusvYIIJZxtYiNcHFB25l9XdAxxbQv_kiFELpgOIyenwbS3arfGmLAjTQitYQ8xsSgnqh-U/s1600/traffic_sources.png" height="561" width="640" /></a></div>
<br />
Of course, these are bots. (I am heartened to see that when you aggregate URLs to sites, twitter, meetup, and datasciencecentral are still close to the top.)<br />
<br />
When looking at the geography of the traffic sources, I'm seeing a whole lot of... Ukraine?<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjR0OwWUiW3XE8Z5K2pehaMqGABoVwmNDUTftnGf8ZvzHFjxi0hnpaq_w_sEHMIT9AMAyw1SyxwbDcVo8CsMOt8VKY8qMULlLh1842YQwQX9tk4BakF86tzajkeM731sAMGgMMfn7vkYbw/s1600/audience.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjR0OwWUiW3XE8Z5K2pehaMqGABoVwmNDUTftnGf8ZvzHFjxi0hnpaq_w_sEHMIT9AMAyw1SyxwbDcVo8CsMOt8VKY8qMULlLh1842YQwQX9tk4BakF86tzajkeM731sAMGgMMfn7vkYbw/s1600/audience.png" height="640" width="416" /></a></div>
<br />
Who knew stats were so popular in Ukraine? (Kidding.)<br />
<br />
But seriously, this only started a few months ago. I'm wondering if the conflict in Ukraine has anything to do with this. It's conceivable that computers and servers are getting hijacked over there as part of the war. Anyone have any thoughts?
<h3>Saved by plagiarism! (2014-12-11)</h3>
I am writing a paper on goodness-of-fit for topic models. (Specifically, I've derived an R-squared metric for use with topic models.) I came across this definition for goodness-of-fit in our friend, <a href="http://en.wikipedia.org/wiki/Goodness_of_fit">Wikipedia</a>.<br />
<br />
<blockquote class="tr_bq">
The goodness of fit of a statistical model describes how well it fits a set of observations. Measures of goodness of fit typically summarize the discrepancy between observed values and the values expected under the model in question.</blockquote>
<br />
I love it! It's concise and to the point. But do I really want to cite Wikipedia in an article for peer review?<br />
<br />
<a href="https://www.google.com/search?q=goodness+of+fit+of+a+statistical+model+describes+how+well+it+fits+a+set+of+observations.+Measures+of+goodness+of+fit+typically+summarize+the+discrepancy+between+observed+values+and+the+values+expected+under+the+model+in+question.&oq=goodness+of+fit+of+a+statistical+model+describes+how+well+it+fits+a+set+of+observations.+Measures+of+goodness+of+fit+typically+summarize+the+discrepancy+between+observed+values+and+the+values+expected+under+the+model+in+question.&aqs=chrome..69i57.767j0j7&sourceid=chrome&es_sm=93&ie=UTF-8#q=%22goodness+of+fit+of+a+statistical+model+describes+how+well+it+fits+a+set+of+observations.+Measures+of+goodness+of+fit+typically+summarize+the+discrepancy+between+observed+values+and+the+values+expected+under+the+model+in+question.%22&tbs=li:1&start=10">A Google search for the verbatim quote above reveals</a> that this definition appears in countless books, papers, and websites <i>without attribution</i>. Did these authors plagiarize Wikipedia? Did Wikipedia plagiarize these authors? Who knows.<br />
<br />
My solution: put the definition in quotes and attach a footnote. "This quote appears verbatim on Wikipedia and in countless books, papers, and websites."<br />
<br />
Done.
<h3>Simulating Realistic Data for Topic Modeling (2014-12-09)</h3>
<a href="https://stat.duke.edu/~bss18/">Brian</a> and I have finally submitted our paper to <a href="http://ieeexplore.ieee.org/xpl/RecentIssue.jsp?punumber=34">IEEE Transactions on Pattern Analysis and Machine Intelligence</a>. This is the culmination of a year of hard work. (There's more work yet to be done; I doubt we'll make it through peer review without having to revise.)<br />
<br />
I presented our preliminary results at JSM in August, as described in this <a href="http://www.biasedestimates.com/2014/04/its-all-about-beta.html">earlier post</a>.<br />
<br />
Here is the abstract.<br />
<br />
<blockquote class="tr_bq">
<blockquote class="tr_bq">
Latent Dirichlet Allocation (LDA) is a popular Bayesian methodology for topic modeling. However, the priors in LDA analysis are not reflective of natural language. In this paper we introduce a Monte Carlo method for generating documents that accurately reflect word frequencies in language by taking advantage of Zipf’s Law. In developing this method we see a result for incorporating the structure of natural language into the prior of the topic model. Technical issues with correctly assigning power law priors drove us to use ensemble estimation methods. The ensemble estimation technique has the additional benefit of improving the quality of topics and providing an approximation of the true number of topics.</blockquote>
</blockquote>
<br />
The rest of the paper can be read <a href="https://drive.google.com/file/d/0Bz2enPyUvnKIQjRzZU9TWFJNcVhobWlWV180TmdmendzZ2JJ/view?usp=sharing">here</a>.
<h3>Look up (2014-12-08)</h3>
<span style="font-family: Helvetica Neue, Arial, Helvetica, sans-serif;">I've added a couple pages to the blog here. The <a href="http://www.biasedestimates.com/p/about.html">about me</a> page has a quick bio. The <a href="http://www.biasedestimates.com/p/publications-and-working-papers.html">publications and presentations</a> page is where I'll be putting up my <strike>bragging rights</strike> research portfolio.</span>
<h3>Economics and Data Mining (2014-11-26)</h3>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="http://upload.wikimedia.org/wikipedia/commons/6/67/Miner_Emerging_From_Tunnel.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://upload.wikimedia.org/wikipedia/commons/6/67/Miner_Emerging_From_Tunnel.jpg" height="240" width="320" /></a></div>
<div class="separator" style="clear: both; text-align: center;">
<span style="font-size: x-small;">He's mining for data.</span></div>
<br />
<br />
I stumbled across <a href="http://ineteconomics.org/video/30-ways-be-economist/cosma-shalizi-why-economics-needs-data-mining">this video</a>.<br />
<br />
<a href="http://ineteconomics.org/video/30-ways-be-economist/cosma-shalizi-why-economics-needs-data-mining">Cosma Shalizi</a>, a stats professor at Carnegie Mellon, argues that economists should stop "fitting large complex models to a small set of highly correlated time series data. Once you add enough variables, parameters, bells and whistles, your model can fit past data very well, and yet fail miserably in the future."<br />
<br />
I think there's a bit of a conflation of problems here. Not all economic data sets are small. An economist friend of mine pointed out that he's been working with datasets that have millions of observations. I am told this is common in microeconomics.<br />
<br />
Nevertheless, my experience is that "acceptable" econometric methods are overly conservative. As stated in the video, an economist saying someone is "data mining" is tantamount to an accusation of academic dishonesty. I was indoctrinated early in the ways of David Hendry's <a href="http://www.federalreserve.gov/pubs/ifdp/2005/838/ifdp838.pdf">general-to-specific modeling</a>, which is basically data mining (but doing it intelligently). This, I think, made machine learning an intuitive move for me, and I've always thought that economics research would benefit greatly from machine learning methods.<br />
<br />
There are some important caveats to all this. First, I don't see anyone beating out economics the same way <a href="http://magazine.amstat.org/blog/2014/11/01/statistics-losing-ground-to-computer-science/">computer science is sticking it to statistics</a>. For "big data analytics" to live up to its hype, data scientists have to think a lot like economists, not the other way around. A big part of an economics education is economic thinking; this goes above and beyond statistical methods. Second, (and more importantly) you should take anything I say here with a grain of salt. Though I have a background in (and profound love for) economics, I never held a graduate degree in econ and I've been out of the field (and professional network) for several years. My knowledge may be dated.<br />
<br />
Even so, I'm happy to hear voices like Dr. Shalizi's. It adds to Hal Varian's paper on <a href="http://people.ischool.berkeley.edu/~hal/Papers/2013/ml.pdf">"big data" tricks for econometrics</a>. Maybe instead of worrying about the <a href="http://www.theguardian.com/technology/2014/oct/27/elon-musk-artificial-intelligence-ai-biggest-existential-threat">AI singularity</a>, we should be worrying about economists using machine learning and then taking all of our jobs. ;-)<br />
<br />
<h3>What do you do when you see a bad study? (2014-11-19)</h3>
Debate: <a href="http://blogs.ams.org/blogonmathblogs/2014/04/10/bad-statistics-ignore/">how should we respond in the face of a study using bad statistics</a>? This post has a bit of history to it, citing Andrew Gelman and Jeff Leek. I'd recommend clicking through and reading it in its totality.<br />
<br />
<h3>LDA and Topic Models Reading List (2014-11-14)</h3>
<blockquote class="twitter-tweet" lang="en">
Good crowd tonight to hear <a href="https://twitter.com/thos_jones">@thos_jones</a> talk about topic modeling <a href="https://twitter.com/hashtag/NLProc?src=hash">#NLProc</a> <a href="https://twitter.com/hashtag/datadc?src=hash">#datadc</a> cc: <a href="https://twitter.com/DataCommunityDC">@DataCommunityDC</a> <a href="https://twitter.com/YourGirlK">@YourGirlK</a> <a href="http://t.co/dthE7z1lRB">pic.twitter.com/dthE7z1lRB</a><br />
— DC NLP Meetup (@DCNLP) <a href="https://twitter.com/DCNLP/status/532694568094150656">November 13, 2014</a></blockquote>
<script async="" charset="utf-8" src="//platform.twitter.com/widgets.js"></script>
<br />
A big thank you to everyone who came to see me <a href="http://www.meetup.com/DC-NLP/events/203956152/">talk about topic models</a> at <a href="http://www.meetup.com/DC-NLP/">DC-NLP</a> on Wednesday. I am grateful for the feedback that I received. I'd also like to give a big shout-out to my co-author, <a href="https://stat.duke.edu/~bss18/">Brian St. Thomas</a>. Not only has his hard work made our research shine, but he is also the one who came up with the "ball and urns" graphic to explain topic models. Many people came up to me afterward saying how intuitive that was; I wish I could take the credit, but it was all Brian.<br />
<br />
While I wait on approval from work to release my slides, I thought I'd put together an LDA-related reading list of many of my sources. I've done a bit of that before <a href="http://www.biasedestimates.com/2014/04/its-all-about-beta.html">here</a>. Some of those papers are also below, as well as others.<br />
<br />
<h4>
LDA Basics</h4>
<div>
<ol>
<li><a href="http://en.wikipedia.org/wiki/Dirichlet-multinomial_distribution#A_combined_example:_LDA_topic_models">The clearest statement of LDA I've seen is on Wikipedia.</a></li>
<li><a href="http://machinelearning.wustl.edu/mlpapers/paper_files/BleiNJ03.pdf">Here is David Blei et. al's original paper.</a></li>
<li><a href="http://psiexp.ss.uci.edu/research/papers/sciencetopics.pdf">This paper introduces Gibbs sampling for LDA.</a></li>
</ol>
<h4>
On Priors and Zipf's Law</h4>
</div>
<div>
<ol>
<li><a href="https://people.cs.umass.edu/~wallach/publications/wallach09rethinking.pdf">Rethinking LDA: Why Priors Matter</a> (This is a good paper, though I am skeptical of the conclusion.)</li>
<li><a href="http://machinelearning202.pbworks.com/w/file/fetch/45880249/asuncion10.1.1.157.6861.pdf">Comparison of topic models, their estimation algorithms, and priors.</a> (Very underrated, MUST READ.)</li>
<li><a href="https://cocosci.berkeley.edu/tom/papers/goldwater11a.pdf">Incorporating Zipf's law in language models</a></li>
<li><a href="http://akuz.me/wp-content/uploads/2014/01/akuz_lda_asym.pdf">A note on estimating LDA with asymmetric priors</a></li>
</ol>
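<div>
<br />
As a rough, homemade illustration of why the prior matters (mine, not from any of the papers above), here is what symmetric versus asymmetric Dirichlet priors over document-topic proportions look like on average. The asymmetric prior yields a few dominant topics and a long tail of rare ones, which is the same qualitative rank-frequency shape Zipf's law describes for words.</div>
<pre>
# Symmetric vs. asymmetric Dirichlet priors over document-topic proportions.
# Toy illustration; the rdirichlet() helper is the same one defined in the
# earlier sketch, repeated here so this chunk stands alone.
rdirichlet <- function(alpha) {
  g <- rgamma(length(alpha), shape = alpha, rate = 1)
  g / sum(g)
}

set.seed(42)
K <- 10
sym  <- replicate(5000, rdirichlet(rep(0.1, K)))            # symmetric prior
asym <- replicate(5000, rdirichlet(0.1 * 2 ^ -(0:(K - 1)))) # decaying, asymmetric prior

round(rowMeans(sym), 3)   # roughly flat across all 10 topics
round(rowMeans(asym), 3)  # a few big topics, then a long tail
</pre>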
<h4>
Evaluating LDA/Issues With LDA</h4>
</div>
<div>
<ol>
<li><a href="http://www.stat.ufl.edu/~ajwomack/WMC-LDA.pdf">LDA is an inconsistent estimator</a></li>
<li><a href="http://www.umiacs.umd.edu/~jbg/docs/nips2009-rtl.pdf">Reading Tea Leaves: How humans interpret topic models</a> (Also, MUST READ.)</li>
<li><a href="http://www.ics.uci.edu/~newman/pubs/rtm_nips.pdf">A coherence (cohesion?) metric for topic models.</a> (Note: This metric has the issue of "liking" topics full of statistically-independent words. It is still useful though.)</li>
</ol>
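<div>
<br />
To show the flavor of these co-occurrence-based evaluation metrics, here is a quick sketch of an average pairwise PMI "coherence" for a topic's top words. To be clear, this is my own generic version for illustration; it is not necessarily the exact statistic defined in the paper above, and the function and variable names are made up.</div>
<pre>
# Average pairwise PMI of a topic's top words, estimated from a binary
# document-term matrix (1 = the word appears in the document at least once).
coherence <- function(top_words, dtm01, eps = 1e-12) {
  pairs <- combn(top_words, 2)  # every pair of top words
  n <- nrow(dtm01)
  pmi <- apply(pairs, 2, function(p) {
    p1  <- sum(dtm01[, p[1]]) / n
    p2  <- sum(dtm01[, p[2]]) / n
    p12 <- sum(dtm01[, p[1]] & dtm01[, p[2]]) / n
    log((p12 + eps) / (p1 * p2 + eps))
  })
  mean(pmi)
}

# tiny fake corpus of 100 documents over 5 words
set.seed(42)
dtm <- matrix(rbinom(100 * 5, 1, 0.3), nrow = 100,
              dimnames = list(NULL, c("ball", "urn", "draw", "tax", "senate")))
coherence(c("ball", "urn", "draw"), dtm)  # near zero: these fake words are independent
</pre>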
<h4>
Other Topic Models</h4>
<div>
<ol>
<li><a href="http://www.icml2010.org/papers/45.pdf">Spherical topic models</a>. (<strike>My co-author assures me that these are consistent estimators; we've not yet implemented them though. Know anyone that has?</strike>) (Update 2:48: I was wrong, this model is *not* consistent but it could be. See Brian's note, below.)</li>
<li><a href="http://pdf.aminer.org/000/334/521/dynamic_topic_models.pdf">Dynamic topic models</a></li>
<li><a href="http://www.cs.colorado.edu/~jbg/docs/2014_emnlp_howto_gibbs.pdf">Ensembles of topic models</a> (not our stuff, but from Jordan Boyd-Graber who is super smart and a friend of DC-NLP)</li>
</ol>
</div>
<h4>
Other Stuff</h4>
</div>
<div>
<ol>
<li><a href="http://arxiv.org/abs/1308.2359">KERA keyword extraction</a> used to label topics in one of my examples. (The paper applying it to LDA is forthcoming, however.)</li>
<li><a href="http://www.pnas.org/content/108/10/3825.full">Rethinking Language: How probabilities shape the words we use</a> (MUST READ, though not about topic modeling specifically.)</li>
<li><a href="http://www.cs.princeton.edu/~blei/topicmodeling.html">David Blei's topic modeling website</a></li>
</ol>
<div>
<br /></div>
</div>
<div>
<b>From Brian on spherical topic models:</b> "A small note on spherical topic models - the basic spherical topic model that is out there (SAM) is *not* a consistent estimator, but we have a framework to make a consistent estimator from my work on estimating mixtures of linear subspaces by tweaking the prior."</div>
Tommyhttp://www.blogger.com/profile/14573787040201086607noreply@blogger.com0tag:blogger.com,1999:blog-4093229006254491664.post-29679694073229891712014-11-14T09:30:00.000-05:002014-11-14T09:30:02.725-05:00Statistics, Computer Science, and How to Move ForwardI'm still here! Took a break from blogging/Twitter/etc. over the last couple of months. My brain needed the rest, and I picked up a real hobby. But this blog isn't dead yet!<br />
<br />
This month's <a href="http://magazine.amstat.org/wp-content/uploads/2014an/November2014.pdf">issue</a> of <a href="http://magazine.amstat.org/">Amstat News</a> features an editorial by Norman Matloff titled "<a href="http://magazine.amstat.org/blog/2014/11/01/statistics-losing-ground-to-computer-science/">Statistics Losing Ground to Computer Science</a>." Provocative title, no?<br />
<br />
I was expecting yet another article whose argument could be summed up as "get off of my lawn, you punk computer scientists!" When I read/hear these kinds of arguments from statisticians, I usually <a href="http://24.media.tumblr.com/tumblr_m44xssSoJp1qjgyuwo1_500.gif">roll my eyes</a> and move on with my life. But this time... I agreed.<br />
<br />
<a href="http://heather.cs.ucdavis.edu/matloff.html">Dr. Matloff's</a> article is quite critical of CS research involving statistics. And maybe I'm getting crotchety, but I've run into many of these issues myself in my topic modeling research. An exemplar quote is below.<br />
<br />
<blockquote class="tr_bq">
Due in part to the pressure for rapid publication and the lack of long-term commitment to research topics, most CS researchers in statistical issues have little knowledge of the statistics literature, and they seldom cite it. There is much “reinventing the wheel,” and many missed opportunities.</blockquote>
<br />
The fact of the matter is, CS and statistics come from very different places culturally. This doesn't always lend itself to clear communication and cross-disciplinary respect. Dr. Matloff touches on this mismatch. At one end...<br />
<br />
<blockquote class="tr_bq">
CS people tend to have grand—and sometimes starry-eyed—ambitions. On the one hand, this is a huge plus, leading to highly impressive feats such as recognizing faces in a large crowd. But this mentality leads to an oversimplified view, with everything being viewed as a paradigm shift.</blockquote>
<br />
And at the other...<br />
<br />
<blockquote class="tr_bq">
Statistics researchers should be much more aggressive in working on complex, large-scale, “messy” problems, such as the face recognition example cited earlier.</blockquote>
<br />
I 100% agree with the above. CS didn't start "overshadowing statistics researchers in their own field" simply because computer scientists "<a href="http://xkcd.com/1428/">move fast and break things</a>." Our (statisticians') own conservatism also stifled the creativity and ambition to solve grand problems, like facial recognition (or text analysis).<br />
<br />
Dr. Matloff recommends several changes for statistics to make. I particularly like the suggestion that more CS and statistics professors have joint appointments. A criticism that I regularly hear from my CS colleagues is that many statisticians are mediocre programmers and that they lack pragmatism about the tradeoff between mathematical rigor and useful applications. We've covered CS's sometimes cavalier attitude toward modeling above. Perhaps more joint appointments will not only influence faculty, but also educate students early about the needs and advantages of both approaches.<br />
<br />
<br />Tommyhttp://www.blogger.com/profile/14573787040201086607noreply@blogger.com0tag:blogger.com,1999:blog-4093229006254491664.post-51159318874572490002014-09-02T13:01:00.000-04:002014-09-02T13:01:08.330-04:00Recommendation systems on my mind<div class="separator" style="clear: both; text-align: center;">
<a href="http://info.cunyba.gc.cuny.edu/Portals/212300/images/recommendation-stamp-photo.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://info.cunyba.gc.cuny.edu/Portals/212300/images/recommendation-stamp-photo.jpg" height="292" width="320" /></a></div>
<div class="separator" style="clear: both; text-align: center;">
<span style="font-size: xx-small;"><b>Source:</b> <a href="http://info.cunyba.gc.cuny.edu/blog/bid/296342/10-Tips-for-Requesting-an-Academic-Letter-of-Recommendation">http://info.cunyba.gc.cuny.edu/blog/bid/296342/10-Tips-for-Requesting-an-Academic-Letter-of-Recommendation</a></span></div>
<br />
<br />
I've got<a href="http://en.wikipedia.org/wiki/Recommender_system"> recommendation systems</a> on the brain today. Here are links to the sources I've been finding most helpful.<br />
<br />
<br />
<ul>
<li><a href="http://datacommunitydc.org/blog/2013/05/recommendation-engines-why-you-shouldnt-build-one/">You probably shouldn't build a recommendation system at all</a>, actually.</li>
</ul>
<div>
<br /></div>
<ul>
<li>If you insist, here's <a href="http://infolab.stanford.edu/~ullman/mmds/ch9.pdf">chapter 9</a> from <a href="http://www.mmds.org/">Mining Massive Datasets</a></li>
<ul>
<li>Actually, Jeff Ullman has some <a href="http://infolab.stanford.edu/~ullman/ullman-books.html">pretty awesome stuff</a> up in general.</li>
</ul>
</ul>
<div>
<br /></div>
<ul>
<li>A good lit review on <a href="http://dl.acm.org/citation.cfm?id=2556270&bnc=1">collaborative filtering algorithms</a> (a bare-bones item-item example follows this list)</li>
</ul>
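<div>
<br />
If you just want to see the basic mechanics, here is a bare-bones item-item collaborative filter on a fake ratings matrix. This is my own toy sketch (with the usual crude shortcut of treating missing ratings as zeros), not an implementation from the survey above.</div>
<pre>
# Item-item collaborative filtering in a few lines: score a user's unseen
# items by cosine similarity between item rating columns. Fake data, for
# illustration only.
set.seed(42)
ratings <- matrix(sample(c(NA, 1:5), 20 * 6, replace = TRUE), nrow = 20,
                  dimnames = list(paste0("user", 1:20), paste0("item", 1:6)))
ratings["user1", "item3"] <- NA   # guarantee user1 has at least one unseen item

R <- ratings
R[is.na(R)] <- 0                  # crude: treat missing ratings as zero

norms <- sqrt(colSums(R ^ 2))
item_sim <- crossprod(R) / (norms %o% norms)  # cosine similarity between items

u <- R["user1", ]
scores <- as.vector(item_sim %*% u)           # similarity-weighted score per item
names(scores) <- colnames(R)
sort(scores[is.na(ratings["user1", ])], decreasing = TRUE)  # rank user1's unseen items
</pre>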
Tommyhttp://www.blogger.com/profile/14573787040201086607noreply@blogger.com0tag:blogger.com,1999:blog-4093229006254491664.post-1316679950007393642014-08-22T09:35:00.000-04:002014-09-02T13:17:32.569-04:00Friday links: August 22, 2014<div class="separator" style="clear: both; text-align: center;">
<a href="http://revolution-computing.typepad.com/.a/6a010534b1db25970b014e607b1377970c-800wi" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://revolution-computing.typepad.com/.a/6a010534b1db25970b014e607b1377970c-800wi" /></a></div>
<div class="separator" style="clear: both; text-align: center;">
<span style="font-size: x-small;">Source: <a href="http://xkcd.com/">xkcd</a>, retreived from <a href="http://revolution-computing.typepad.com/">Revolution Analytics</a></span></div>
<br />
<br />
<a href="http://magazine.amstat.org/blog/2014/08/01/science-review-panel/">Scince establishes a statistical review panel</a> (hopefully to avoid <a href="http://xkcd.com/882/">this</a>)<br />
<br />
<a href="http://simplystatistics.org/2014/08/19/swiftkey-and-johns-hopkins-partner-for-data-science-specialization-capstone/">JHU/Coursera Data Science Track gets an NLP-based capstone</a><br />
<br />
<a href="http://www.fastcompany.com/3034307/hit-the-ground-running/ready-to-launch-a-startup-try-this-unpopular-advice-first">Before working at a startup, get a job at a big company</a><br />
<br />
Related: <a href="http://www.entrepreneur.com/article/236512">How to make employees happy</a> (Number one is my favorite)<br />
<br />
<a href="http://www.nytimes.com/2014/08/18/technology/for-big-data-scientists-hurdle-to-insights-is-janitor-work.html">Being a data scientist means lots of data curation</a><br />
<br />
<a href="https://twitter.com/paulisci/status/501130967575044099">How to determine the order of authors for your next paper</a><br />
<br />
<br />Tommyhttp://www.blogger.com/profile/14573787040201086607noreply@blogger.com2