Big Data: Mayer-Schoenberger & Cukier
Second only to rampant plagiarism, students’ use of Wikipedia as a source in their essays is viewed as “pleasant to the eyes, and a tree to be desired to make one wise”. Wikipedia allows students to cover a topic without too much actual reading of tedious academic texts; the brighter students realise that the footnotes are not just typographical decoration, but can lead to deeper sources, while the bright but lazy students will quote the referenced articles without actually reading them. None of this fits with a romantic view of independent scholarship; did all these elephants die to build a tower in which academics and students uncritically recycle articles off the internet?
To fight against this dumbing down, teachers encourage students to read academic papers or at least books with a veneer of academic credibility, with references to sources, some passing acknowledgement of the contributions of past writers and maybe even a smidgen of theory. Business school students writing essays on the use of large-scale databases will not take long to come across Big Data by Mayer-Schoenberger and Cukier.
Mayer-Schoenberger and Cukier’s book has been very successful in the US: it is currently at 883 on the Amazon sales list of books in the US, but only at 4001 in the UK. The book exhibits all the signs of being an academic text: co-authored by a professor at Oxford University, with seventeen pages of notes, a nine-page bibliography and a ten-page index. The book takes a broad view of what constitutes “big data”, and uses a wide range of clearly written examples to explain its growing significance. It argues that “big data” represents a break with conventional theorising about causation and that we should see a movement from “privacy by consent” to “privacy by accountability”. Both of these arguments are contentious. Nicolo Tempini, a PhD student at the London School of Economics, published an excellent critical review on the LSE website, with a summary on Amazon. However, it is not directly the validity of the book’s arguments that bothered me while reading it. The main flaw of this book is that, in the process of producing elegant cases to support the argument, the authors have been very selective in their sources and, in polishing the cases, have generally over-interpreted very sketchy source material and in some instances have invented new “facts”. These facts, now published in a widely read and critically acclaimed book, can then break out to infect academic essays and public discussions. To put it bluntly, the book demonstrates the primacy of style over substance in many business books. As is often the case, the authors seem to have spent more time polishing their prose than finding a wide range of sources or checking their facts.
At its heart, this text is based on a very broadly trawled catch of published examples of innovative use of data. The underlying process seems to have been that an organisation did something with data, someone then wrote this up as a newspaper or magazine article, and finally Cukier and Mayer-Schoenberger found the article through Google and rewrote it more elegantly to fit their theses. Stylistically this leads to very flat descriptions of the cases, lacking much sense of local colour, but more importantly it becomes what in less politically correct times would have been described as “Chinese whispers”. At each stage cases are subtly reinterpreted, with little evidence that in the final text facts are checked or alternative sources sought out.
The book includes an uncritical re-statement of Charles Duhigg’s claim that Target used customer data to send relevant coupons to customers who were pregnant. In his book “The Power of Habit” (p 195), based on interviews with Target employees, Duhigg says that Target’s analysis of customer purchases allowed them to “assign almost any regular shopper a pregnancy prediction score”. For Cukier and Mayer-Schoenberger this claim morphs into “Target knows when a woman is pregnant without the mother-to-be explicitly telling it so”. What was described by Duhigg as a probabilistic prediction has become a certainty.
On page 155 Cukier and Mayer-Schoenberger address how the release of anonymised data may enable individuals to be identified. The section says: “The company [Netflix] released 100 million rental records from nearly half a million users – and offered a bounty of a million dollars to any team that could improve its film recommendation system by at least 10 per-cent. Again personal identifiers had been carefully removed from the data. And yet again, a user was re-identified: a mother and closeted lesbian in America’s Midwest, who because of this later sued Netflix under the pseudonym ‘Jane Doe’”. As an authority they cite a 2009 article published in Wired by Ryan Singel. Crucially, Singel’s article does not say that a user was identified – instead it says that “an in-the-closet lesbian mother is suing Netflix for privacy invasion, alleging the movie rental company made it possible for her to be outed” (my emphasis). Helpfully, the web version of Singel’s article includes a reference to the class action claim. Cukier and Mayer-Schoenberger also fail to mention that the class action was quickly dropped. So an original untested claim that the release of the data might make it possible to identify customers elides into a claim, unsupported by references, that the release of data led to a specific customer being identified.
In both cases a single source has been massaged, and in the process new details have been introduced out of thin air to strengthen the argument. In many of the cases cited in Big Data, trivial details have been changed to make the text read well; so when Walmart were stocking extra Pop-Tarts in preparation for Hurricane Frances, they are now stacked at the entrance, which is a detail not in the original source, and when the father of the pregnant Minnesotan complains to the Target manager he now “shouts” when in Duhigg’s account he “said”. Academics often take a dim view of students using Wikipedia, but they also take a dim view of them embellishing secondary sources. The one strength of Wikipedia is that inaccuracies, whether significant or trivial, are challenged and corrected; one weakness of printed books is that once inaccuracies creep in they effectively become facts.