US Insights

Random sampling's enduring value

J. Walker Smith

Executive Chairman, Kantar Futures

Brands 08.07.2013 / 00:00


Reports of random sampling's demise have been greatly exaggerated

There is a view in brand marketing research these days that random sampling is on its last legs. With tons of data in hand and real-time testing on the rise, many claim there is no need for random sampling anymore.

This view was even given an imprimatur of historical legitimacy recently in the bestselling book Big Data: A Revolution That Will Transform How We Live, Work, and Think, by Oxford professor Viktor Mayer-Schönberger and Economist reporter Kenneth Cukier. Early on in their book, the authors assert that "[s]ampling was a solution to the problem of information overload in an earlier age," presumably a problem that will no longer plague us in the post-revolutionary age of Big Data. The authors then go on to detail all the problems of, well, truth be told, bad samples, not proper random samples.

This conflation of bad sampling with random samples by Mayer-Schönberger and Cukier is yet another instance of the sort of confusion that has long characterized untutored discussions of sampling. Muddles like this are sure to get worse as Big Data, mobile devices, digital footprints and cloud computing come together in the very near future.

The presumption of those now eulogizing random samples is that having data on the full population ensures a clear view of what's going on. This is just not so, for at least three reasons.

Big Data can be too big.

The first reason is an important bit of statistical nuance that is almost always overlooked. It's called power. Most of us are familiar with the concept of sampling error, which is the plus-or-minus error range attached to statistical estimates such as election polls. The mirror image of sampling error is power.

Sampling error ranges keep us from calling something real when it's actually random chance. Power helps us detect what's real when we might otherwise mistake it for random chance.

Sampling error ranges and power trade off. When a sample has a wide margin of error (or a big plus-or-minus range), the sample itself has little power, hence, a poor ability to detect real differences. On the other hand, when a sample has a small margin of error, it has a lot of power, as you would expect from more precision.
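The trade-off can be made concrete with a rough sketch. It assumes a simple normal approximation for a proportion and a two-sided z-test comparing two groups; the rates of 50% and 53% and the function names are illustrative assumptions, not figures from any study.

```python
from math import sqrt
from statistics import NormalDist

Z = NormalDist()  # standard normal distribution


def margin_of_error(p, n, confidence=0.95):
    """Half-width of the normal-approximation interval for a proportion."""
    z = Z.inv_cdf(0.5 + confidence / 2)
    return z * sqrt(p * (1 - p) / n)


def power_two_proportions(p1, p2, n_per_group, alpha=0.05):
    """Approximate power of a two-sided z-test comparing two proportions."""
    z_crit = Z.inv_cdf(1 - alpha / 2)
    se = sqrt(p1 * (1 - p1) / n_per_group + p2 * (1 - p2) / n_per_group)
    shift = abs(p1 - p2) / se
    # Chance that the test statistic falls beyond either critical value.
    return (1 - Z.cdf(z_crit - shift)) + Z.cdf(-z_crit - shift)


# Hypothetical true rates of 50% and 53% in two groups of equal size.
for n in (100, 1_000, 10_000):
    moe = margin_of_error(0.5, n)
    power = power_two_proportions(0.50, 0.53, n)
    print(f"n={n:>6}  margin of error=+/-{moe:.3f}  power={power:.2f}")
```

As the sample grows, the margin of error shrinks and the power to detect the same three-point difference rises.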

But there's a catch. You don't want too much power because then every difference, no matter how minuscule, will be statistically significant, including differences that are actually random. But you don't want too little power, either, because then the differences that are real will not show up as statistically significant. This is where the science of statistics becomes the art of research. One of the most important tasks in every study is figuring out the right balance of sampling error and power. In practical terms, this means determining the ideal sample size. Too big a sample means too much power; too small a sample means too much sampling error.
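One common way to find that balance, sketched here under the assumption of a two-sided z-test on two proportions, is to fix the smallest difference worth acting on and solve for the sample size that detects it with a chosen power. The 5% significance level, 80% power and the 10% versus 12% rates are conventional or illustrative choices, not values from the article.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_group(p1, p2, alpha=0.05, power=0.80):
    """Approximate n per group needed to detect p1 vs p2 (two-sided z-test)."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # guards against false alarms
    z_power = z.inv_cdf(power)          # guards against missed real effects
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_power) ** 2 * variance / (p1 - p2) ** 2)


# Hypothetical example: the smallest lift worth acting on is 10% -> 12%.
print(sample_size_per_group(0.10, 0.12))
# A smaller difference of interest demands a much larger sample.
print(sample_size_per_group(0.10, 0.11))
```

The sample is sized to the difference that matters for the decision, rather than simply made as large as the data allow.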

To put it another way, there is, indeed, such a thing as a dataset that is too big, a statistical reality that Big Data enthusiasts typically overlook. With too much data, every difference is statistically significant. Analysis of the data won't separate real differences from chance differences because every difference will look to be real.

Statistically speaking, data on an entire population can be the very same thing as a sample that is so big it has too much power. When too much data makes every difference statistically significant, statistical testing is of no help, so we are forced back on our own judgment. But this encourages us to indulge our inborn tendency to see patterns where none really exist.
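A small illustration of the point, assuming a standard pooled two-proportion z-test: the same observed gap of one-tenth of a percentage point (10.0% versus 10.1%, hypothetical figures) is unremarkable in a modest sample and overwhelmingly "significant" in a very large one.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_p_value(successes_a, n_a, successes_b, n_b):
    """Two-sided p-value from a pooled two-proportion z-test."""
    p_a, p_b = successes_a / n_a, successes_b / n_b
    pooled = (successes_a + successes_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


# Same observed rates every time; only the amount of data changes.
for n in (10_000, 1_000_000, 100_000_000):
    p_value = two_proportion_p_value(round(0.100 * n), n, round(0.101 * n), n)
    print(f"n per group={n:>11,}  p-value={p_value:.4f}")
```

The gap never changes in size or practical importance; only the verdict of the significance test does.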

Random sampling is better than reasoning.

This takes us to the second reason that Big Data does not ensure a clear view of what's really going on. Random samples are often better than entire populations for figuring out what's going on.

Rule number one of probability is that random events occur in clumps. What looks like a pattern to the naked eye is almost always just a random distribution. Unfortunately, we have a built-in bias for seeing structure and order where none exists. Human beings specialize in interpreting chance, and then telling very compelling stories that make these misinterpretations seem true. Even when we go out looking for evidence to put our beliefs to the test, we usually fall prey to confirmation bias. 
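Random clumping is easy to see in a simulation. The sketch below flips a fair coin 100 times, records the longest streak of identical outcomes, and averages that over many repetitions. The flip count, trial count and seed are arbitrary assumptions chosen for illustration.

```python
import random


def longest_run(flips):
    """Length of the longest streak of identical consecutive outcomes."""
    if not flips:
        return 0
    best = current = 1
    for previous, outcome in zip(flips, flips[1:]):
        current = current + 1 if outcome == previous else 1
        best = max(best, current)
    return best


def average_longest_run(n_flips=100, trials=2_000, seed=0):
    """Average longest streak across many simulated fair-coin sequences."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        flips = [rng.random() < 0.5 for _ in range(n_flips)]
        total += longest_run(flips)
    return total / trials


print(f"average longest streak in 100 flips: {average_longest_run():.1f}")
```

Streaks of five or more in a row are the norm in purely random sequences of this length, yet a streak that long is exactly the kind of thing the eye reads as a pattern.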

Having all the data can't keep us from misinterpreting what we see. Time and again, research has shown that common sense fails us when it comes to analyzing large-scale phenomena. We need a process that protects us from ourselves.

The process to follow is one that begins with a look at all the data before us in the context of our past experience and prior knowledge. From this, we come up with hypotheses about what's going on. Then we put these hypotheses to the test.

In today's digital marketplace, Google A/B testing is often cited as best practice for junking hifalutin' theory and just looking at the data to see what works. Even Big Data enthusiasts believe in A/B testing. But here's the thing: the A and the B in A/B testing are both samples, and if they're not random samples, then you can't have confidence that the test results are reliable enough for significant brand marketing investments.

A sample is a subset of the population. Any division of the population is sampling (though not always random sampling). Splitting the population in two means two samples, even if those two parts are very large. There's no rule that says a sample has to be small. The defining characteristic of a sample is that it is partial, and the defining characteristic of a good sample is that it is random.
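The random split behind an A/B test can be sketched in a few lines. This is a minimal illustration, not a description of any particular testing platform; the customer IDs and the `random_split` helper are hypothetical.

```python
import random


def random_split(population, seed=None):
    """Shuffle the population and cut it into two random halves."""
    rng = random.Random(seed)
    shuffled = list(population)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]


# Hypothetical population of 1,000 customer IDs.
customer_ids = range(1, 1001)
group_a, group_b = random_split(customer_ids, seed=42)
print(len(group_a), len(group_b))
```

Because chance alone decides who lands in each group, any systematic difference between A and B after the test can be credited to the treatment rather than to how the groups were formed. Splitting by something convenient, such as sign-up date or region, would not give that assurance.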

Big Data and random sampling work together.

The importance of random sampling for interpreting results brings us to the final reason for not yet counting out random sampling. Big Data analytics require random sampling. The choice posed between random sampling and Big Data is a false dichotomy. Even with Big Data, random sampling remains essential.

When working with samples, only results from random samples can be said with any degree of confidence to be true of the entire population. This is where Big Data enthusiasts jump in to proclaim that having all the data eliminates the need for sample projection, and thus the need for random samples. If you can see the entire population (at an affordable cost and in reasonable time), then, it is asked, why use only a small piece that generates results with a big error range around them?

This is where the logic starts to get circular. The reason that we can't "see the entire population," so to speak, is that what we see are patterns that often turn out to be nothing but random clumping. The only way to be sure we are not "seeing" structure and order in a random distribution is to put what we see to the test. Whether that test is an A/B field test or a laboratory simulation or a structural equation model, the population dataset is going to have to be split apart and the parts compared. Sampling is an inherent part of good explanation, not an alternative way of developing explanations that stands in contrast to explanations built on entire populations.

In almost every way possible, Big Data analytics are rife with random sampling. Big Data draws on established statistical methodologies, and random sampling is a cornerstone of statistics.

But to note that random sampling is alive and well despite the obituaries being written about it is not to say that less data is better than more data. More data means more options and bigger brand marketing possibilities, particularly in execution and delivery. But figuring out what those options are and how to capitalize on those possibilities in a more competitive, more rapidly changing marketplace will take every bit of smarts and savvy we can bring to the table, of which random sampling is part and parcel.

Source: Kantar Futures

Editor's Notes

The original post from which this article is excerpted can be found at Branding Strategy Insider.
