Bits of Books - Books by Title

Big Data

A revolution that will transform the way we live, work and think

Viktor Mayer-Schonberger and Kenneth Cukier

If you buy fuel for your car at 4pm, in the hour that follows you are likely to spend £20 to £30 on groceries. If your groceries include medication, you are more likely to complete the course than someone who doesn't own a car. And if your car is orange, then it should break down less.

We might well have theories for why the above statements are true. We might also have theories for why people buy more Pop Tarts before hurricanes, or why an analysis of where people go and who they call can diagnose their flu infection before they themselves know they have it. But we are entering a world in which those theories really don't matter.

What matters, argue Viktor Mayer-Schonberger and Kenneth Cukier, is not "why" but only "what". "Society will need to shed some of its obsession for causality in exchange for simple correlations," they say. The reason for this is data.

In the third century BC Ptolemy II resolved to capture all the world's information in a grand library in Alexandria. Today, if every human being were given their own individual library of Alexandria we would be storing only 0.3 per cent of the world's information. Even in the past decade, electronic storage has gone from being a scarce resource, where old e-mails were deleted to make room for new ones, to being virtually limitless. The result is that patterns can be spotted amid these terabytes humming on the world's servers that before would have been passed over or wiped away for ever.

When Walmart, for instance, discovered in those patterns the culinary importance of Pop Tarts in high winds, it rearranged its shelves accordingly - the resulting hurricane's profits blowing away any mundane concerns about working out the reason for the link. The authors argue that in the 21st century data is the "oil of the information economy". Nowhere is this more evident than at Amazon, one of the first organisations to spot the money in megabytes. It is the data it gathers on each browse - which pages we visit, what books we click on - that provides the company's chief competitive advantage, enabling it to produce the book recommendations that now drive most of its sales.

With the arrival of e-readers, the intrusion of these algorithms goes even farther: data can now be gathered on reading habits past the point of sale. Thus Barnes and Noble learnt that users of its Nook e-reader tended to quit non-fiction books halfway through, prompting the company to make the brave move of introducing non-fiction shorts.

One can't help but feel that it was even braver of Mayer-Schonberger and Cukier to introduce their own non-fiction readers to this fact roughly halfway through the book. Because by this stage, with 100 pages to go, it does seem as if their point has already been made - and made well. In collecting vast amounts of imperfect information we have a hugely powerful tool for decision-making.


Covering everything that's happening today with information technology in one book is a monumental challenge. As if to acknowledge that difficulty, Viktor Mayer-Schonberger and Kenneth Cukier, authors of Big Data, begin by describing the data's magnitude. They note, for instance, that the amount of data now stored around the world is an estimated 1,200 exabytes (itself an already dated and debatable number), which can be expressed as an equally incomprehensible 1.2 zettabytes. "If it were all printed in books, they would cover the entire surface of the United States some 52 layers thick."
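The equivalence between the two figures quoted above is simple unit arithmetic. A minimal sketch, assuming decimal (SI) prefixes, in which one zettabyte is 1,000 exabytes; the function and constant names are illustrative only:

```python
# Decimal (SI) storage units: 1 zettabyte = 1,000 exabytes.
# (Assumption: the review uses SI prefixes, not binary ones.)
EXABYTES_PER_ZETTABYTE = 1_000


def exabytes_to_zettabytes(exabytes: float) -> float:
    """Convert a quantity in exabytes to zettabytes (SI prefixes)."""
    return exabytes / EXABYTES_PER_ZETTABYTE


stored_exabytes = 1_200  # estimate quoted in the review
print(exabytes_to_zettabytes(stored_exabytes))  # 1.2
```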

Big Data's authors observe that humanity is marching into unfamiliar territory: "Ultimately big data marks the moment when the 'information society' finally fulfills the promise implied by its name. The data takes center stage. All those digital bits that we have gathered can now be harnessed in novel ways to serve new purposes and unlock new forms of value." Put more simply, the emergence of Big Data - whatever we think we mean by that term - marks the pivot in history when computing will finally become useful for nearly everyone and everything. In the end, what makes data useful is software - and truth be told, Big Data is really just about software. But if Mayer-Schonberger and Cukier had used the word software - now a tired term, by tech-media standards - in the title, their book might not have generated any excitement. Nonetheless, what they explore in fact is the next emergent era of software.

When the first microprocessor was invented in 1971, the world was in the mainframe age, wherein software comprised a $1 billion industry. The second software era emerged with the rise of distributed (personal) computing, enabled by the microprocessor. Software today has grown to a $350 billion industry. As we enter software's third era, dominated by big-data analytics, we will see another 100-fold growth, as software becomes a multi-trillion dollar industry. One bellwether is the amount of venture investing in big data, with over 100 start-ups funded in the past few years. In a hot IPO last year, Splunk was the first new company purely anchored in big data to go public. Odds are good that in years to come Splunk, or some yet-to-be-launched company, will become as recognizable as Microsoft or Google.

To understand what the new software - that is, analytics - can do that's different from more familiar software like spreadsheets, word processing, and graphics, consider the lowly photograph. Here the relevant facts aren't how many bytes constitute a digital photograph, or a billion of them. That's about as instructive as counting the silver halide molecules used to form a single old-fashioned print photo. The important feature of a digital image's bytes is that, unlike crystalline molecules, they are uniquely easy to store, transport, and manipulate with software. In the first era of digital images, people were fascinated by the convenience and malleability (think Photoshop) of capturing, storing, and sharing pictures. Now, instead of using software to manage photos, we can mine features of the bytes that make up the digital image. Facebook can, without privacy invasion, track where and when, for example, vacationing is trending, since digital images reveal at least that much. But more importantly, those data can be cross-correlated, even in real time, with seemingly unrelated data such as local weather, interest rates, crime figures, and so on. Such correlations associated with just one photograph aren't revealing. But imagine looking at billions of photos over weeks, months, years, then correlating them with dozens of directly related data sets (vacation bookings, air traffic), tangential information (weather, interest rates, unemployment), or orthogonal information (social or political trends). With essentially free super-computing, we can mine and usefully associate massive, formerly unrelated data sets and unveil all manner of economic, cultural, and social realities. (This is precisely what the National Security Agency is doing with just the data from phone and e-mail records.)

For science fiction aficionados, Isaac Asimov anticipated the idea of using massive data sets to predict human behavior, coining the term 'psychohistory' in his 1951 Foundation trilogy. The bigger the data set, Asimov said then, the more predictable the future. With big-data analytics, one can finally see the forest, instead of just the capillaries in the tree leaves. Or to put it in more accurate terms, one can see beyond the apparently random motion of a few thousand molecules of air inside a balloon; one can see the balloon itself, and beyond that, that it is inflating, that it is yellow, and that it is part of a bunch of balloons en route to a birthday party. The data/software world has, until now, been largely about looking at the molecules inside one balloon.

Mayer-Schonberger and Cukier begin their exploration of analytics with an oft-cited example: Google data about the location and frequency of searches for 'the flu' are already more effective in tracking the rise and vector of an epidemic than anything the Centers for Disease Control can do. By analyzing Google requests about mortgages, the Federal Reserve has made a similar discovery about tracking mortgage-market trends. No personal information is needed. This is true for traffic and equipment efficiency and safety, for disease research, and perhaps soon, for financial market forecasts and much more. The data speak volumes - when they're in sufficient volumes to matter.

Amazon has long used analytics to predict and personalize purchasing behaviors. Facebook's analytics about the trending behaviors and interests of its 1 billion users are perhaps its most valuable asset. But the implications go far beyond using data streams about Instagram posts, Amazon purchases, and Web clicks from e-commerce and consumer behavior, though these practices alone spook some people. The new era will involve data collected from all manner of human and machine activities - from exercise bands to heart monitors, from car and aircraft engines and tires to crops in farmers' fields and manufacturing machinery.

All of these data have value. Sometimes the data associated with an object, activity, or transaction have more value than the thing they measure. Experts in supply-chain logistics long ago figured out that the information about a shipping container's location is worth more than the physical container. Thus, one of Mayer-Schonberger and Cukier's most important insights: Unlike material things - the food we eat, a candle that burns - data's value does not diminish when it is used; it can be processed again and again and used many times for multiple purposes. One could nitpick here and note that a variety of physical, not just virtual, things meet the same metrics - notably gold. But the authors' essential point is correct.

Soon big-data analytics will cross a Rubicon: we won't have to guess or approximate what's going on with many activities, we will know. Until now, given the scale and complexities of commerce, industry, society, and life, you couldn't measure everything; you approximated by statistical sampling and estimation. That era is almost over. We won't have to, for example, estimate how many cars are on a road, we will count each and every one in real time as well as hundreds of related facts about each car. Ditto soon for such things as your heartbeat or blood glucose, and much more.
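The contrast drawn above, between estimating from a sample and counting everything, can be sketched in a few lines. This is only an illustration under stated assumptions: the traffic figures are randomly generated stand-ins, and the names `estimate_from_sample` and `exact_count` are hypothetical rather than taken from the book.

```python
import random


def estimate_from_sample(counts, sample_size, rng):
    """Estimate the total by scaling up the mean of a random sample."""
    sample = rng.sample(counts, sample_size)
    return sum(sample) / sample_size * len(counts)


def exact_count(counts):
    """Count everything: no sampling, no estimation error."""
    return sum(counts)


rng = random.Random(0)  # fixed seed so the run is repeatable

# Hypothetical data: cars seen in each of 10,000 one-minute intervals.
cars_per_minute = [rng.randint(0, 50) for _ in range(10_000)]

estimate = estimate_from_sample(cars_per_minute, 100, rng)
total = exact_count(cars_per_minute)
print(f"sampled estimate: {estimate:.0f}, exact count: {total}")
```

The sampled figure will usually land near the true total but rarely on it; the exact count removes that gap, at the price of having to collect every observation.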

The fascinating thing about the scale of massive data sets is that, as Asimov predicted, they can reveal trends, even behaviors, that tell us what will happen without the need to know the 'why.' (That was the trope in the movie Minority Report, based on a 1956 story by another great sci-fi writer, Philip K. Dick.) With robust correlations, you don't need a theory to predict; you just know. This is where Mayer-Schonberger and Cukier quite properly devote much of their attention: Big data may not spell the 'end of theory,' but it does fundamentally transform the way we make sense of the world.

The authors are off-base, though, in their claim that it is somehow weird to know something will happen without having an explanatory theory. In much of science, it has long been the case that theory follows robust correlations. Observational data can yield enormously predictive tools. I note just one iconic example: gravity. We know not only that Newton's apple will fall but exactly how fast. We can calculate trajectories and predict when and where the Mars Rover lands. But we still have no idea what gravity really is or why it works. (Yes, I know about gravitons, gravitational waves, curved-space theories - all fascinating but unproven.)

The 'why' of many things that we observe, from entropy to evolution, has eluded physicists, philosophers, and theologians. What's new about big data is the extension of our observational powers into so much, from the profound to the trivial. Big data may not so much change the way we make sense of the world as amplify our ability to make sense of nearly everything in it - from terrorism to disease to restaurant preferences to subatomic particles.

But big data is about more than mining existing information. The depth of the revolution has yet to unfold. Soon, data concerning trillions of objects and activities will be available; self-powered and self-networked sensors will track everything from our blood sugar to how many steps we take. Big Data regards this massive information stream as a given and doesn't explore the trajectory of the underlying technologies, but it's worth noting that the revolution is being propelled by the convergence of three technology domains: staggeringly powerful but cheap information engines (computing at scale), ubiquitous wireless broadband, and smart sensors. This kind of technology-infrastructure convergence is the hallmark of revolutions. Nearly a century ago, for instance, air travel was enabled by the convergent maturation of powerful propulsion engines, modern aluminum metallurgy, and the emerging petroleum industry.

Universities are now in a race to create not just course curricula, but also degree programs around big data. Deloitte forecasts a shortfall in trained specialists measuring in the tens of thousands. Data scientist may be the "Sexiest Job of the 21st Century," as a Harvard Business Review article claimed. And Big Data's authors join other pundits in suggesting that analytics is now so important that it must become, in effect, part of K–12 education for everyone: "Mathematics and statistics, perhaps with a sprinkle of programming and network science, will be as foundational to the modern workplace as numeracy was a century ago and literacy before that." While true, this assertion misses a larger point. The revolutionary utility of information technology will come from its ubiquity, which ultimately demands not specialization, but general ease of use. The job of the specialists is to make the data tools disappear into the background of everyday business and life.

The real depth and breadth of the emergent revolution can be seen in the scale of infrastructure expansion: business surveys show $3 trillion in the global information-communications technology (ICT) infrastructure spending planned for the next decade. This level of spending puts Big Data infrastructure in the same league as Big Oil, projected to spend $5 trillion over the same decade. The ICT dollars hide an even greater expansion because of the rapid and continuing decline in the cost per byte of ICT hardware: an iPad's computing power would have cost $10,000 just 12 years ago.

All of this is bullish for the future of the global economy. Though they don't set out to do so, Mayer-Schonberger and Cukier provide some of the ammunition to counter the arguments of economic thinkers like Robert Gordon at Northwestern University, Niall Ferguson at Harvard, and Tyler Cowen at George Mason, all of whom see recent tech innovation as small change when held against the grand sweep of history. They suggest that transformational innovation is over. Some skepticism is warranted: the promise that information technologies will radically improve everything has yet to be fulfilled. In Big Data, though, you begin to see why the end-of-innovation economists are wrong.

There are important, even troublesome, public-policy and social implications. The ongoing controversy over government and NSA investigation methods using analytics is just one example. In addition to challenging privacy, these uses of big data raise another troubling concern: the risk that we may judge people not just for their actual behavior but for the propensities the data suggest they have. Of course, angst over computing's potentially onerous outcomes is hardly new. As MIT mathematician Norbert Wiener wrote in 1949 regarding the advent of mainframes: "These new [electronic computing] machines have a great capacity for upsetting the present basis of industry." Indeed, we may not be sufficiently worried about the poorly understood and necessarily opaque big-data methods at the NSA. Mayer-Schonberger and Cukier suggest a way to regulate this new territory that is, or should be, controversial: "We envision external algorithmists acting as impartial auditors to review the accuracy or validity of big-data predictions whenever the government requires it, such as under court order or regulation" (emphasis added by me). Yikes.

Big Data will be well received in Silicon Valley. But I suspect some people there, and elsewhere, unhappy with Washington’s attempts to regulate tech in general, will reject the regulatory remedy the authors suggest. Already, the 2010 Affordable Care Act has foisted innovation-suppressing taxes and regulations on medical technologies, including software - and big data, again, is nothing if not software. Perhaps big data's next challenge will be its intersection with Big Government.
