Embrace Big Data and Simple Analysis

I’ve been seeing the buzzword “big data” cropping up a lot lately.  As best I can tell, in business-droid speak it refers to any very large collection of non-uniform or hard to work with data.  For example, all the videos on YouTube.  The large part should be obvious.  The hard to work with part stems from the fact that you can’t really extract any value from the videos without doing something difficult (or at least time consuming) – watching them.  There’s lots of big data out there – census data, social media, credit card data, military intelligence, etc.

While I’m always loath to jump on trendy business bandwagons, I think this one is bringing to the public an idea long overdue, and which I strongly believe in:

Given a choice between having better analysis or more data, 99% of the time you’d be way better off having more data.

I can’t take credit for this statement – I read it somewhere long ago and initially believed it to be false.  In fact, all my scientific/engineering education subtly taught me it was false (I’ll explain how in a bit).  But I eventually realized that, at least on this point, my education was full of crap.  Data really is king and analysis is overvalued.  This is of direct application to traders.

Now, depending on your biases, you may be inclined to disagree with me.  After all, good analysis is the mark of an intelligent mind.  Analysis is what separates us from the chimps – it gives you an opportunity to flex the old frontal lobe.  More data is just more of the same.

Maybe I can convince you otherwise, with an example taken from the field of marketing.  Suppose you have the following marketing task:

Market theory: poorer customers prefer sugared soda, while wealthier customers prefer sugar-free soda.  Task: determine if this is true.

The “classic” approach to solving this problem might go as follows:

  • Convene a focus group
  • Ask each participant about their income and soda buying habits
  • Discretize those written or verbal responses to numeric values via some means
  • Perform statistical analysis to determine the relationship between income and soda preference.  Use regression, confidence testing etc. as needed.

This approach is full of analytic goodness – confidence tests, regression, sampling etc.  It requires lots of brain horsepower and statistics education.  But the methodology is nonetheless pretty close to garbage – useful only when there are no other options.  Regardless of the nominal statistical confidence achieved at the end, it’s quite likely you will end up knowing nothing of use about soda preferences.  Why?  Too little data.  There’s a practical limit to how big your focus group is going to be, and thus factors outside your control can easily play havoc with the results.

I can illustrate how your result might be completely wrong even with high statistical confidence: what happens if you convene your focus group in a particularly granola-heavy college town?  In that case, it’s quite likely that your “poor” focus group members will in fact be college students.  While they may have little or no income, they also are likely mostly from middle and upper class families, and have eating habits to match.  You could easily conclude that the soda preferences of poor people look pretty much the same as rich people, without ever having encountered an actual poor person.  Sadly, statistics does nothing to address problems of this sort.  The result could have a 99% statistical confidence attached to it and still be pure bullshit.

You could mess up in the opposite direction if you convened your focus group in an oil boom town where there are lots of high-income manual laborers.  Only more data (including college towns, oil towns, and lots of regular towns) could save you, and more data is exactly what you don’t have.

Now consider an alternate methodology of solving the same problem:

  • Get addresses and per store/restaurant purchasing receipts for every grocery store, convenience store and restaurant in the country.
  • Get census data – the average income and population for every zip code in the country
  • Take the top 20% of zip codes by income, and add up the total amounts of regular and diet soda consumed in those zip codes last year.  Divide by the number of people in those zip codes.  The result is consumption in gallons/person or whatever.
  • Do the same for the bottom 20% of zip codes by income.
  • Compare.
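
The steps above can be sketched in a few lines of Python.  This is only an illustration under stated assumptions: the function name, the input shapes (a dict of zip code to average income and population, and a list of per-store soda totals already converted to gallons) and the sample figures in the usage note are all made up, and getting real receipts into this shape is the hard part.

```python
from collections import defaultdict


def per_capita_by_income_band(zip_stats, store_sales, fraction=0.2):
    """Per-person soda consumption for the poorest and richest zip codes.

    zip_stats:   {zip_code: (average_income, population)}  (hypothetical shape)
    store_sales: iterable of (zip_code, regular_gallons, diet_gallons)
    fraction:    share of zip codes in each band (0.2 = top/bottom 20%)
    """
    # Add up regular and diet soda sold by every store in each zip code.
    totals = defaultdict(lambda: [0.0, 0.0])
    for zip_code, regular, diet in store_sales:
        totals[zip_code][0] += regular
        totals[zip_code][1] += diet

    # Rank zip codes from poorest to richest by average income.
    ranked = sorted(zip_stats, key=lambda z: zip_stats[z][0])
    n = max(1, int(len(ranked) * fraction))
    bands = {"bottom": ranked[:n], "top": ranked[-n:]}

    result = {}
    for name, zips in bands.items():
        # Assumes each band contains at least one resident.
        people = sum(zip_stats[z][1] for z in zips)
        regular = sum(totals[z][0] for z in zips)
        diet = sum(totals[z][1] for z in zips)
        result[name] = {"regular": regular / people, "diet": diet / people}
    return result
```

Calling it with, say, five zip codes and a handful of store totals returns a dict with a "bottom" and a "top" entry, each holding regular and diet gallons per person.  Comparing those four numbers is the whole analysis.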

This method is the exact opposite – it requires huge amounts of data (census tables and hundreds of thousands of data points for various stores).  But the analysis is absurdly simple – add up the soda and divide by the number of people.  A motivated 3rd grader could do the math if you were willing to wait a while.  A computer could do it in minutes if not seconds.  What’s interesting about this method is that, despite the lack of sophisticated analysis, it basically can’t fail.  At the end, assuming you can add, you will have accounted for all the soda consumed by a bunch of rich people and a bunch of poor people.  Whatever the results tell you is going to be the right answer.  It’s not even a matter of statistical confidence really – you’ve looked at the whole population.  You’re going to be right, guaranteed.  What you needed was more (of the right) data, not better analysis.

Interestingly, college education in both the hard and soft sciences teaches the junk methodology but not the good one.  Every school has classes on how to design small experiments and do mind numbingly complex statistical analysis on the data.  I’ve never seen a class on how to go out and get ALL the available data, and then do dirt simple analysis and get the right answer guaranteed.  Academia is painfully tied to small data – probably why so many nominally scientific studies are unrepeatable crap (strangely all with the obligatory 95%+ confidence number).

Cool story, huh?  Ok, maybe not.  But this is directly applicable to trading.  Trading education is full of analysis.  Trading software is designed to highlight analytic tools – hell, my trading package supports 4 or 5 different kinds of moving averages.  There are probably 200-300 analytic tools included in all – oscillators, market profile studies, averages, various kinds of trend and breakout detectors etc.  Go to any trading forum, and there’s enough analysis discussion to last you a lifetime.

Now, I’m not saying analysis is bad.  I do lots of analysis.  But analysis is weak compared to more data.  And that’s what’s missing from all these tools and discussions – the data.  You should just assume, when you get in the trading business, that you’re really in the data collection business.  You want to record (or buy/lease access to, or find on the web) as much data about as many financial matters as possible.  Ideally you’d like to have tick-level price and DOM data for every conceivable instrument since the invention of the electronic market.  Now, that’s a lot of data (they don’t call it “big data” for nothing) so you’re going to have to be selective to keep costs down.  But the idea is to collect tons of data, even if you don’t know what to do with it just yet.  Hard disk space is cheap – use that to your advantage.

Don’t just constrain yourself to market data.  Collect news releases, economic data, your own trade results, links to websites with trading information, anything you can get your hands on that might be interesting.  Be selective when you have to, but get it all whenever you can.

Then, when you formulate (or read) some theory about how the market works, don’t test it on some piddly little sample of 23 trades you did last week.  That’s the trading equivalent of holding your marketing focus group in a clown college parking lot.  Your results are going to be humorous but useless.  Test your theories against ALL the data, or as much as you can possibly get your hands on.  This is hard work – you’ll have to learn how to program a computer and make it do what you want.  So learn.  You’re going to spend a lot of time looking at more data than you think you really need.  Spend the time.
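
One rough way to see why a 23-trade sample is dangerous is to cut a long trade history into consecutive 23-trade chunks and compare each chunk's average result with the average over everything.  The sketch below assumes you already have a plain list of per-trade profit/loss numbers; the function names and the window size default are illustrative choices, not anything taken from a trading package.

```python
def mean(values):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)


def window_means(pnl, window=23):
    """Average P&L per trade for each consecutive, non-overlapping window.

    pnl:    list of per-trade profit/loss values, oldest first (hypothetical input)
    window: number of trades per chunk; a trailing partial chunk is dropped
    """
    return [
        mean(pnl[i:i + window])
        for i in range(0, len(pnl) - window + 1, window)
    ]


def summarize(pnl, window=23):
    """Return (worst chunk average, best chunk average, overall average)."""
    chunks = window_means(pnl, window)
    return min(chunks), max(chunks), mean(pnl)
```

If the worst and best chunk averages sit far apart on either side of the overall average, then any single 23-trade sample could have told you almost anything.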

Don’t get too excited about statistical confidence testing – when you’ve got all the data, or at least an excessive amount, the answer is usually painfully obvious.  Frequently what you’ll find when you look at all the data is that popularly held theories hold at some times, but not at others.  These dependencies are often left out of the original theory.  Why?  Because you can only find them with huge amounts of data.  So now you know something the rest of the world doesn’t know – and that way lies money.
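
Finding those dependencies can be as simple as tagging every historical trade with the condition it occurred under and averaging each group separately.  The sketch below assumes each trade has already been labeled; the labels ("high_vol", "low_vol") and the function name are hypothetical, and how you assign the labels is up to you.

```python
from collections import defaultdict


def results_by_condition(trades):
    """Group trade results by condition and average each group.

    trades: iterable of (condition_label, pnl) pairs (hypothetical shape)
    Returns {condition_label: (number_of_trades, average_pnl)}.
    """
    groups = defaultdict(list)
    for label, pnl in trades:
        groups[label].append(pnl)
    return {
        label: (len(values), sum(values) / len(values))
        for label, values in groups.items()
    }


# Made-up numbers purely to show the call; not real results.
example = [
    ("high_vol", 2.0), ("high_vol", 4.0),
    ("low_vol", -1.0), ("low_vol", -3.0),
]
breakdown = results_by_condition(example)
```

A theory that looks mediocre on average but splits cleanly into a winning group and a losing group is exactly the kind of dependency the paragraph above describes.  The trade count per group matters too: a group with only a handful of trades is back to the 23-trade problem.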

Keep asking yourself how you’d solve this problem if you had 100x more data.  Then do it. Stop when the answer is “There isn’t any more data – I’ve got it all.”

 

2 thoughts on “Embrace Big Data and Simple Analysis”

  1. Where does one start this process? Some data feeds such as DTN IQ offer tick data historically for 3 months I believe. But how do you store it? Some type of database? And then how do you manipulate and test it? Do you personally just use Excel, or something more advanced like Matlab or R?

    It seems this whole statistical side of trading is very different from what I initially learned, which was discretionary TA methods of reading charts. But luckily gambling/poker/sportsbetting gives me enough of a background to at least understand some of the things you say, even if I don’t really know where to go with them.

  2. I use NinjaTrader’s internal historic data format for tick and bar data. It’s just a text file, so it’s easy to work with using other tools. For analysis I use Excel, SciPy, and Minitab. The Minitab license was free from another job – I wouldn’t pay for it but it does do stats fast and easy. The industry standard is Matlab, but I like SciPy better just because I’m fluent in Python but not Matlab scripting. Plus SciPy is free.
