There is a view in brand marketing research these days that
random sampling is on its last legs. With tons of data in hand and
real-time testing on the rise, many claim there is no need for
random sampling anymore.
This view was even given an imprimatur of historical legitimacy
recently in the bestselling book, Big Data: A Revolution That Will
Transform How We Live, Work, and Think, by Oxford
professor Viktor Mayer-Schonberger and Economist reporter
Kenneth Cukier. Early on in their book, the authors assert that
"[s]ampling was a solution to the problem of information overload
in an earlier age," presumably a problem that will no longer plague
us in the post-revolutionary age of Big Data. The authors then go
on to detail all the problems of, well, truth be told, bad samples,
not proper random samples.
This conflation of bad sampling with random samples by
Mayer-Schonberger and Cukier is yet another instance of the sort of
confusion that has long characterized untutored discussions of
sampling. Muddles like this are sure to get worse as Big Data,
mobile devices, digital footprints and cloud computing come
together in the very near future.
The presumption of those now eulogizing random samples is that
having data on the full population ensures a clear view of what's
going on. This is just not so, for at least three reasons.
Big Data can be too big.
The first reason is an important bit of statistical nuance that
is almost always overlooked. It's called power. Most of us are
familiar with the concept of sampling error, which is the
plus-or-minus error range attached to statistical estimates such as
election polls. The mirror image of sampling error is power.
Sampling error ranges keep us from calling something real when
it's actually random chance. Power helps us detect what's real when
we might otherwise mistake it for random chance.
Sampling error ranges and power trade off. When a sample has a
wide margin of error (or a big plus-or-minus range), the sample
itself has little power, hence, a poor ability to detect real
differences. On the other hand, when a sample has a small margin of
error, it has a lot of power, as you would expect from more data.
But there's a catch. You don't want too much power because then
every difference, no matter how minuscule, will be statistically
significant, including differences that are actually random. But
you don't want too little power, either, because then the
differences that are real will not show up as statistically
significant. This is where the science of statistics becomes the
art of research. One of the most important tasks in every study is
figuring out the right balance of sampling error and power. In
practical terms, this means determining the ideal sample size. Too
big a sample means too much power; too small a sample means too
much sampling error.
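To make the trade-off concrete, here is a minimal Python sketch. It assumes the textbook normal-approximation formula for the margin of error of a proportion drawn from a simple random sample. The function names, the 95 percent confidence level (z = 1.96) and the worst-case 50/50 split are illustrative assumptions, not figures from any particular study.

```python
import math

def margin_of_error(n, p=0.5, z=1.96):
    """Approximate plus-or-minus range for a proportion p
    estimated from a simple random sample of size n."""
    return z * math.sqrt(p * (1 - p) / n)

def sample_size_for_margin(moe, p=0.5, z=1.96):
    """Smallest sample size whose margin of error is at most moe."""
    return math.ceil(z ** 2 * p * (1 - p) / moe ** 2)

# The margin of error shrinks as the sample grows, but with
# diminishing returns: 100 times more data buys only 10 times
# more precision.
for n in (100, 1_000, 10_000, 1_000_000):
    print(f"n = {n:>9,}   margin of error = +/- {margin_of_error(n):.4f}")

# Working backwards from the precision a study actually needs.
print("Sample needed for +/- 3 points:", sample_size_for_margin(0.03))
```

Working backwards from the precision a decision requires, rather than using every record available, is one simple way to settle on a sample size.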
To put it another way, there is, indeed, such a thing as a
dataset that is too big, a statistical reality that Big Data
enthusiasts typically overlook. With too much data, every
difference is statistically significant. Analysis of the data won't
separate real differences from chance differences because every
difference will look to be real.
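A rough illustration of this point, sketched in Python under stated assumptions: a standard two-proportion z-test with a pooled standard error, applied to two made-up conversion rates (10.1 percent versus 10.0 percent) that differ by a commercially trivial amount. The rates, group sizes and function names are hypothetical.

```python
import math

def two_proportion_z(p_a, p_b, n_a, n_b):
    """z statistic for the difference between two observed proportions,
    using the pooled standard error."""
    pooled = (p_a * n_a + p_b * n_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return (p_a - p_b) / se

def two_sided_p_value(z):
    """Two-sided p-value under the standard normal distribution."""
    return math.erfc(abs(z) / math.sqrt(2))

# Same tiny gap in rates, very different verdicts depending on size.
for n in (1_000, 100_000, 10_000_000):
    z = two_proportion_z(0.101, 0.100, n, n)
    p = two_sided_p_value(z)
    verdict = "significant" if p < 0.05 else "not significant"
    print(f"n per group = {n:>10,}   z = {z:6.2f}   p = {p:.3g}   {verdict}")
```

The observed gap never changes. Only the amount of data does, and with enough of it the test flags a gap that no marketer would act on.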
Statistically speaking, data on an entire population can be the
very same thing as a sample that is so big it has too much power.
When too much data makes every difference statistically
significant, statistical testing is of no help, so we are forced
back on our own judgment. But this encourages us to indulge our
inborn tendency to see patterns where none really exist.
Random sampling is better than reasoning.
This takes us to the second reason that Big Data does not ensure
a clear view of what's really going on. Random samples are often
better than entire populations for figuring out what's going on.
Rule number one of probability is that random events occur in
clumps. What looks like a pattern to the naked eye is almost
always just a random distribution. Unfortunately, we have a
built-in bias for seeing structure and order where none exists.
Human beings specialize in interpreting chance, and then telling
very compelling stories that make these misinterpretations seem
true. Even when we go out looking for evidence to put our beliefs
to the test, we usually fall prey to confirmation bias.
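The clumping of random events is easy to check for yourself. The following Python sketch simulates fair coin flips and measures the longest streak of identical outcomes. The number of flips, the number of trials and the random seed are arbitrary choices made for illustration.

```python
import random

def longest_run(flips):
    """Length of the longest streak of identical consecutive outcomes."""
    if not flips:
        return 0
    best = current = 1
    for previous, this in zip(flips, flips[1:]):
        current = current + 1 if this == previous else 1
        best = max(best, current)
    return best

def average_longest_run(n_flips=200, n_trials=1000, seed=42):
    """Average longest streak across many sequences of fair coin flips."""
    rng = random.Random(seed)
    total = 0
    for _ in range(n_trials):
        flips = [rng.random() < 0.5 for _ in range(n_flips)]
        total += longest_run(flips)
    return total / n_trials

print("Average longest streak in 200 fair flips:",
      round(average_longest_run(), 2))
```

Streaks of several heads or tails in a row turn up routinely in sequences this long, even though nothing but chance is at work. To the naked eye, a streak like that looks like a trend.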
Having all the data can't keep us from misinterpreting what we
see. Time and again, research has shown that common sense fails us
when it comes to analyzing large-scale phenomena. We need a process
that protects us from ourselves.
The process to follow is one that begins with a look at all the
data before us in the context of our past experience and prior
knowledge. From this, we come up with hypotheses about what's going
on. Then we put these hypotheses to the test.
In today's digital marketplace, Google A/B testing is often
cited as best practice for junking hifalutin' theory and just
looking at the data to see what works. Even Big Data enthusiasts
believe in A/B testing. But here's the thing: the A and the B in
A/B testing are both samples, and if they're not random samples,
then you can't have confidence that the test results are reliable
enough for significant brand marketing investments.
A sample is a subset of the population. Any division of the
population is sampling (though not always random sampling).
Splitting the population in two means two samples, even if those
two parts are very large. There's no rule that says a sample has to
be small. The defining characteristic of a sample is that it is
partial, and the defining characteristic of a good sample is that
it is random.
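The difference between a random split and a merely convenient one can be sketched in a few lines of Python. The population here is entirely made up: it assumes 10,000 hypothetical customers whose spending happens to drift upward with sign-up order. All names and numbers are illustrative.

```python
import random
from statistics import mean

def make_population(n=10_000, seed=7):
    """Hypothetical customers: spend drifts upward with sign-up order,
    plus some random noise."""
    rng = random.Random(seed)
    return [0.01 * i + rng.gauss(0, 5) for i in range(n)]

def split_in_order(population):
    """Convenient but non-random: first half is A, second half is B."""
    half = len(population) // 2
    return population[:half], population[half:]

def split_at_random(population, seed=7):
    """Random assignment: shuffle a copy, then cut it in half."""
    rng = random.Random(seed)
    shuffled = population[:]
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]

def gap(group_a, group_b):
    """Absolute difference in average spend between the two groups."""
    return abs(mean(group_a) - mean(group_b))

population = make_population()
ordered_gap = gap(*split_in_order(population))
random_gap = gap(*split_at_random(population))
print(f"Gap before any test, ordered split: {ordered_gap:.2f}")
print(f"Gap before any test, random split:  {random_gap:.2f}")
```

In the ordered split, A and B differ sharply before any treatment is applied, so any later difference in results cannot be credited to the treatment. In the random split, the two groups start out nearly alike, which is what makes the comparison trustworthy.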
Big Data and random sampling work together.
The importance of random sampling for interpreting results
brings us to the final reason for not yet counting out random sampling.
Big Data analytics require random sampling. The choice posed
between random sampling and Big Data is a false dichotomy. Even
with Big Data, random sampling remains essential.
When working with samples, only results from random samples can
be said with any degree of confidence to be true of the entire
population. This is where Big Data enthusiasts jump in to proclaim
that having all the data eliminates the need for sample projection,
and thus the need for random samples. If you can see the entire
population (at an affordable cost and in reasonable time), then, it
is asked, why use only a small piece that generates results with a
big error range around them?
This is where the logic starts to get circular. The reason that
we can't "see the entire population," so to speak, is that what we
see are patterns that often turn out to be nothing but random
clumping. The only way to be sure we are not "seeing" structure and
order in a random distribution is to put what we see to the test.
Whether that test is an A/B field test or a laboratory simulation
or a structural equation model, the population dataset is going to
have to be split apart and the parts compared. Sampling is an
inherent part of good explanation, not an alternative way of
developing explanations that stands in contrast to explanations
built on entire populations.
In almost every way possible, Big Data analytics are rife with
random sampling. Big Data draws on established statistical
methodologies, and random sampling is a cornerstone of them.
But to note that random sampling is alive and well despite the
obituaries being written about it is not to say that less data is
better than more data. More data means more options and bigger
brand marketing possibilities, particularly in execution and delivery.
But figuring out what those options are and how to capitalize on
those possibilities in a more competitive, more rapidly changing
marketplace will take every bit of smarts and savvy we can bring to
the table, of which random sampling is part and parcel.
Source: The Futures Company