Like I’m in the Twilight Zone
Both the BBC and Guardian today reported a paper catchily titled “Private traits and attributes are predictable from digital records of human behavior” by Kosinski, Stillwell and Graepel, published in PNAS. The BBC article summarises the paper as “Sexuality, political leanings and even intelligence can be gleaned from the things you choose to ‘like’ on Facebook…”. As is often the case with quantitative research, the simplification involved in writing newspaper articles obscures the meaning of the numbers.
The BBC article continues: “The study used 58,000 volunteers who alongside their Facebook ‘likes’ and demographic information also provided psychometric testing results – designed to highlight personality traits. The Facebook likes were fed into algorithms and matched with the information from the personality tests. The algorithms proved 88% accurate for determining male sexuality, 95% accurate in distinguishing African-American from Caucasian-American and 85% for differentiating Republican from Democrat.”
A reader might infer from this that the algorithm can take a set of a male Facebook user’s ‘likes’ and, after several turns of the handle, correctly spit out 88 times out of 100 whether that person is gay or straight. The methodology section of the paper describes the figures for discriminating between gay/straight, Christian/Muslim etc. as “the prediction accuracy of dichotomous variables expressed in terms of the area under the receiver-operating characteristic curve (AUC), which is equivalent to the probability of correctly classifying two randomly selected users one from each class (e.g., male and female).” An area under the curve of 0.88 for predicting male sexuality can be visualised as a meter driven by a subject’s ‘likes’: presented with two randomly chosen subjects, one gay and one straight, the one giving the higher reading will be the gay subject 88% of the time. If the meter were to be used to classify a single subject you would have to set a threshold level above which you would label a subject ‘gay’, and where this line is drawn becomes a trade-off between the costs of false positives and false negatives. As only 4.3% of male research subjects were labelled gay, based on their Facebook “interested in” statement, simply categorising everyone as straight would be correct 95.7% of the time, but the attraction of these methodologies is to winnow out minority groups.
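The distinction between AUC and classification accuracy can be made concrete with a small sketch using synthetic data (the scores and their distributions below are invented for illustration; they are not the paper’s model or data, only the 4.3% base rate is taken from it):

```python
import random

random.seed(0)

# A hypothetical "meter" assigns each subject a score. AUC is the
# probability that a randomly chosen subject from the minority class
# scores higher than a randomly chosen subject from the majority class.

N = 10_000
BASE_RATE = 0.043  # 4.3% of male subjects labelled gay, as in the paper

# Synthetic scores: minority-class subjects drawn from a distribution
# shifted upwards, chosen so the resulting AUC lands near 0.88.
subjects = []
for _ in range(N):
    is_gay = random.random() < BASE_RATE
    score = random.gauss(1.7 if is_gay else 0.0, 1.0)
    subjects.append((score, is_gay))

pos = [s for s, g in subjects if g]
neg = [s for s, g in subjects if not g]

# AUC computed directly as the pairwise ranking probability
# (ties count as half a win).
wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
auc = wins / (len(pos) * len(neg))

# The trivial classifier: label everyone straight.
trivial_accuracy = len(neg) / N

print(f"AUC (pairwise ranking probability): {auc:.2f}")
print(f"'Everyone is straight' accuracy:    {trivial_accuracy:.3f}")
```

The point the sketch makes is that a meter with an AUC of roughly 0.88 coexists with a do-nothing classifier that is right about 95.7% of the time; the two numbers measure different things, and neither tells you how the meter performs once a threshold is fixed.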
The PNAS paper’s penultimate paragraph ends with a dystopian prediction of rampant social-sorting: “One can imagine situations in which such predictions, even if incorrect, could pose a threat to an individual’s well-being, freedom, or even life. Importantly, given the ever-increasing amount of digital traces people leave behind, it becomes difficult for individuals to control which of their attributes are being revealed. For example, merely avoiding explicitly homosexual content may be insufficient to prevent others from discovering one’s sexual orientation.” The final paragraph ends on a more redemptive note, suggesting users should be cautious in sharing their data to avoid being socially sorted: “It is our hope, however, that the trust and goodwill among parties interacting in the digital environment can be maintained by providing users with transparency and control over their information, leading to an individually controlled balance between the promises and perils of the Digital Age.”
The raw data for the analysis was harvested from a Facebook app, myPersonality, allowing psychometric data to be matched with the app user’s Facebook data. I then wondered how the developers of an app that lets Facebook users self-test came to share their data with researchers so concerned about the risks of data analysis. Although it is not stated in the PNAS paper, the app was developed by David Stillwell, one of the paper’s authors. This is described in the ‘history’ section of their website as: “myPersonality started out in June 2007 as David’s personal side project, between finishing undergraduate and starting postgraduate studies. After becoming popular it was only then that we considered its research possibilities. As such, David’s PhD research (which was already proposed in February 2007) is in the area of decision-making. This is why myPersonality has close academic links but is a standalone business.” Stillwell is willing to share the data with other researchers wanting to cross-analyse Facebook and psychometric data, which is in principle a good thing. But, to pick up on the paper’s own concern to avoid “situations in which such predictions, even if incorrect, could pose a threat to an individual’s well-being, freedom, or even life”, the data includes gender, birthday, age and zip code, so it would be easy to rehydrate back to identifiable individuals. The data is restricted to individuals resident in the USA and is held on a server in the USA, keeping it outside the reach of UK data protection legislation. The myPersonality app is no longer available on Facebook and the online consent form returns a blank page, so it is not clear whether the users of the app who consented to their data being analysed gave informed consent to how the data would be stored and shared.
The main point of the paper, that ‘likes’ on Facebook are indicators, albeit imperfect ones, of a range of personal traits, especially ones related to enacted identity such as sexual orientation, gender and religion, is not surprising. The most interesting aspect of this paper is the mismatch between its ostensible warnings about the dangers of this form of social sorting and the researchers’ own active engagement in the business of social sorting and sharing personal information.