Conducting User Research for Security and Privacy

I’ve gathered some tips from my experiences in conducting user studies for security and privacy research, using Mechanical Turk as a tool for recruiting subjects. If you’re interested in trying something like this, here are some things that may be useful as you start out: http://cse.poly.edu/~tehila/mTurkTips.html

An evolutionary game-theoretic model of privacy versus utility

I first thought about this while savouring a coffee-soaked madeleine with a few good friends at Tous Les Jours on a Sunday evening a few moons ago (September 29). I personally think it is a question worth investigating, and I did not want to write about it until I had worked out more details, but I have not found a peaceful stretch of time to do so. In the meantime, I will scribble about it here. Perhaps you could tell me whether it would be useful?

The seed was planted two days earlier when Seda Gurses organized a delightful meeting on the technical implications of the NSA and GCHQ revelations. In the meeting, a very intelligent acquaintance asked, “Why don’t we build a browser add-on or extension to encrypt everything we do on Facebook?”, to which I said, “But won’t this mean the end of Facebook?”

Having recently read Part 1 of Schneier’s insightful book Liars & Outliers, it occurred to me (while enjoying coffee and madeleine) that we could probably use game theory to model this relationship between privacy and utility. If Facebook is the host, then we are the little organisms that both sustain and need Facebook, just as Facebook needs us to survive at all. Life has seen many such examples of mutualism. Furthermore, we could study, as Maynard Smith did, evolutionarily stable strategies that sustain both Facebook and ourselves despite our conflicting interests.

Let me explain what I mean by all of this a little better. This is not a profound observation, but I believe there is a deep relationship between privacy, utility and machine learning. Suppose we could somehow measure the utility that we provide Facebook (perhaps by how well Facebook can extract meaningful statistics via machine learning), and then consider how different privacy-preserving strategies would affect that utility (which would, in turn, ultimately affect us). If we encrypt absolutely everything such that Facebook is unable to discern at all what we say and do, then Facebook the host will probably not survive, which deprives us of its useful services. On the other hand, if Facebook is able to discern everything we say and do, then we have no privacy left, which detracts substantially from the value of its services. Our model must be able to capture both of these extremes, and study how different privacy-preserving strategies will affect both Facebook and ourselves in the long term.
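
To make these two extremes a bit more concrete, here is a toy sketch in Python. The functions facebook_utility, host_survives and user_payoff, and every constant in them, are assumptions I made up purely for illustration; nothing here is a worked-out model.

```python
# Toy illustration of the two extremes described above. Every functional
# form and constant here is an assumption made up for illustration.

def facebook_utility(encrypted_fraction: float) -> float:
    """Utility the host can extract, assumed proportional to the
    fraction of our data it can still read and mine."""
    return 1.0 - encrypted_fraction

def host_survives(encrypted_fraction: float, viability_threshold: float = 0.3) -> bool:
    """Assume the host stays in business only if it extracts enough utility."""
    return facebook_utility(encrypted_fraction) >= viability_threshold

def user_payoff(encrypted_fraction: float) -> float:
    """Our payoff: the service's value (only if the host survives)
    plus the privacy we retain by encrypting."""
    service_value = 1.0 if host_survives(encrypted_fraction) else 0.0
    privacy_value = encrypted_fraction
    return service_value + privacy_value

for p in (0.0, 0.5, 0.7, 1.0):
    print(f"encrypt {p:.0%}: host utility={facebook_utility(p):.2f}, "
          f"survives={host_survives(p)}, our payoff={user_payoff(p):.2f}")
```

Under these toy assumptions, the best we can do is encrypt as much as possible while still keeping the host viable, which is exactly the kind of “encryption budget” question I list below.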

So I hope you can see why I think an evolutionary game-theoretic model would capture how different privacy-preserving strategies would sustain both Facebook and ourselves despite conflicting interests. What follows are some rough ideas to explore:

  • Measure utility to Facebook by studying how different privacy-preserving strategies would affect, say, the effectiveness of targeted advertising.
  • Would a budget for encryption, or only partial encryption of data, result in a sustainable relationship? (A toy simulation along these lines is sketched just after this list.)
  • Study different probability distributions of the risks of being identified (privacy loss). So far I have mostly considered the utility to Facebook; studying these distributions might help quantify the privacy loss to ourselves, although I don’t yet see exactly how.
  • Impact of memory (a.k.a. “right to be forgotten”): if Facebook was forced to remember only a limited window of data, would Facebook be able to sustain itself? Consider how memory affects the evolution of cooperation.
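
Here is the kind of toy simulation I have in mind for the sustainability question: a minimal replicator-dynamics sketch of the mix between “encrypt everything” and “post in the clear” users, coupled to a host whose health I simply assume tracks the fraction of readable data. The payoffs, the host-health rule and all constants are illustrative assumptions, not a claim about how Facebook actually behaves.

```python
# Minimal replicator-dynamics sketch: x is the fraction of users playing
# "encrypt everything"; the rest post in the clear. All payoffs, the
# host-health rule, and constants are illustrative assumptions.

def simulate(steps=300, dt=0.05, x=0.05,
             privacy_gain=0.5,   # assumed per-step benefit of keeping data private
             service_gain=1.0):  # assumed per-step benefit of a healthy host's services
    trajectory = []
    for _ in range(steps):
        host_health = 1.0 - x  # assumption: health tracks the readable fraction
        payoff_encrypt = service_gain * host_health + privacy_gain
        payoff_clear = service_gain * host_health
        mean_payoff = x * payoff_encrypt + (1.0 - x) * payoff_clear
        # Replicator equation: strategies doing better than average spread.
        x += dt * x * (payoff_encrypt - mean_payoff)
        x = min(max(x, 0.0), 1.0)
        trajectory.append((x, host_health))
    return trajectory

for step, (x, health) in enumerate(simulate()):
    if step % 50 == 0:
        print(f"step {step:3d}: encrypting fraction={x:.2f}, host health={health:.2f}")
```

Under these assumed payoffs, encrypting is always individually better, so everyone eventually encrypts and the host’s health collapses toward zero: exactly the “end of Facebook” extreme. The interesting question is which payoff structures (encryption budgets, costs of encrypting, limited memory) admit an interior evolutionarily stable mix.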

If something does not make sense here, it is because the whole thing has not yet been carefully thought out. I would be interested to hear corrections or other feedback from you.

Human Mobility Modeling from Cellular Network Data

Today, the CSE department hosted Ramón Cáceres of AT&T Labs Research who presented a talk on “Human mobility modeling from cellular network data”. Here is what I remember of the talk:

  • Cell phone data and privacy: his group has access to anonymized Call Detail Records (CDRs), which do not directly contain personally identifiable information (PII) but do contain things like a random universal identifier (UID), call duration, time of day, the closest set of antennas, and so on. These CDR records are usually spatially inaccurate on the order of a few miles because of the coarse-grained distribution of antennas.
  • Characteristics of mobility records: what can you learn from the CDRs? Quite a few aggregate statistics that are important for societal tasks such as city planning. His group has developed fairly accurate algorithms for determining where people live and work and what their daily commute patterns look like (validated by comparison to census data). From these primary characteristics, you can derive secondary characteristics such as estimating a person’s carbon footprint from their daily commute pattern, or identifying which regions contribute to the workforce of a city (its laborshed).
  • Synthetic data modelling: this is great for AT&T, but how would other people (such as city planners or scientific researchers) benefit from this aggregate data without sacrificing individual privacy? We run into the classic dilemma of societal utility versus individual privacy. One way is not to release the original CDR database itself, but rather a synthetic data model derived from it, which presents records of synthetic “people” that nevertheless realistically (though with some inaccuracy) sample the original database. He presented a visual comparison of the original database and the synthetic data model: the latter is, as expected, not as accurate as the former, but it is good enough.
  • Differential privacy data modelling: his group wanted stronger privacy guarantees than synthetic data modeling alone provides, so they used the best definition of privacy we currently have, differential privacy (DP), to further protect the synthetic data model. He then showed how his group added DP noise to the synthetic data model, and showed that the resulting accuracy falls between the best case (the original CDR database) and the worst case (publicly available data). (A minimal sketch of the standard Laplace mechanism follows this list.)
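
For readers unfamiliar with DP, here is a minimal sketch of the standard Laplace mechanism applied to a single counting query. The example aggregate, the epsilon values, and the counting-query framing are all mine; the actual noise placement in Ramón’s system may well differ.

```python
import math
import random

def laplace_noise(scale: float) -> float:
    """Sample Laplace(0, scale) noise by inverse-transform sampling."""
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def dp_count(true_count: int, epsilon: float, sensitivity: float = 1.0) -> float:
    """Release a count with epsilon-differential privacy via the Laplace
    mechanism: one person's presence changes the count by at most
    `sensitivity`, so noise of scale sensitivity/epsilon suffices."""
    return true_count + laplace_noise(sensitivity / epsilon)

# Hypothetical aggregate from a synthetic mobility model: the number of
# synthetic "residents" whose modeled home falls in some region.
true_residents = 1234
for eps in (0.1, 1.0):
    print(f"epsilon={eps}: noisy count = {dp_count(true_residents, eps):.1f}")
```

Smaller epsilon means more noise and stronger privacy; that is the usual privacy-versus-accuracy dial his comparison was turning.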

While this work addresses different questions than the Unique in the Crowd paper, I like Ramón’s work better because it is more useful than work that simply attacks privacy without defending it. (To paraphrase Stanislaw Ulam, “Everybody attacks but nobody defends.”)

People asked many questions, but I will only talk about mine. I asked him how he plans to handle an online database with differential privacy, which remains a bit of an open problem as far as I know. He answered that, yes, this would be a problem when trying to protect the original CDR database, because there is a limited privacy budget, but that the problem does not apply to his application of DP to the synthetic data model (as opposed to the original CDR database).
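
To see why an online (continually updated and queried) database is tricky, recall that under sequential composition the epsilons of successive DP releases add up against a fixed privacy budget. Here is a toy accountant, with made-up numbers, to illustrate the bookkeeping; releasing a single DP-protected synthetic model is a one-time charge, which, as I understood his answer, is why the problem bites much less in his setting.

```python
class PrivacyBudget:
    """Toy sequential-composition accountant: each epsilon-DP release
    spends epsilon out of a fixed total budget."""

    def __init__(self, total_epsilon: float):
        self.total = total_epsilon
        self.spent = 0.0

    def charge(self, epsilon: float) -> bool:
        """Record the charge and return True only if the release is affordable."""
        if self.spent + epsilon > self.total:
            return False  # budget exhausted; further releases are unsafe
        self.spent += epsilon
        return True

budget = PrivacyBudget(total_epsilon=1.0)
for query in range(5):
    released = budget.charge(epsilon=0.3)
    print(f"query {query}: {'released' if released else 'refused (budget exhausted)'}")
```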

Why are we losing privacy on the Internet?

Perhaps computers have changed the landscape so radically that our minds are not adequately prepared for the task. Much of our evolution as a species probably adapted us for circumstances that do not quite match our very unnatural modern settings. Consider why so many of us are unnecessarily harmed over and over again, every day, because automobiles demand that we adapt to speeds we are not naturally prepared to handle. Consider why we find the general theory of relativity and much of quantum physics to be against our intuitive understanding of the world around us. Consider how very unnatural much of technology is.

It is to be expected that computer security and privacy are similarly confounding problems. How many of us are aware that electronic book readers watch what we read? Do we know who is watching us on web sites? How many of us have thought about how people with smartphones might accidentally tell others where we are? Did you know that your smartphone sensors could give you away? People are so surprised by these findings, and more, that there have been congressional hearings about what computing giants can see about us. I could inundate you with seemingly endless examples, but I think you understand my point: it is a terrible cognitive burden to place on others when we expect them to understand every single thing that could go wrong in computing.

As David Brin and others have observed, we have always been able to watch what other people around us are doing. If we had expended the effort to connect the dots, many hitherto obscure signals would have been revealed. So what is different today? Perhaps the biggest differences are due to two new pieces of technology: computers and networks.

Memory is crucial to any computing machine. In fact, you get computers with different powers depending on, for example, how much memory a computer has, or how it accesses that memory. The most powerful model of computation we know of, the Turing machine, assumes unbounded memory. Now, no single man-made computer has unbounded memory, but one way to approximate it is to share memory over a network.

There are then two properties that emerge from this interaction. One is that we have written software that is very good at persisting data in durable memory, so much so that the European Union is trying to counteract this with the Right to be Forgotten. (As intellectuals love to point out, Thamus warned Theuth about the effects of writing on memory and wisdom.) The other is that networks help link otherwise disparate pieces of data, which, as we know from endless research, leads to all sorts of privacy conundrums.

The other thing computers are very good at is making some (but not all) problems tractable or scalable. As Mitch Ratcliffe is supposed to have said, “A computer lets you make more mistakes faster than any invention in human history…with the possible exception of handguns and tequila.” Not only do security and privacy attacks happen much faster with computers, but the profits and losses induced by networked computers are also more extreme. Additionally, networked computers permit cowards to be attackers: this asymmetry means not only that you may not know what attackers know about you, but also that you may not know who knows these things about you.

Furthermore, privacy is a problem of the abstract versus the visceral: you do not literally see that information about you is seeping away from the machines around you.

Finally, one must also consider the feudal model of security and privacy on the Internet when thinking about this problem.

Why do you think we are losing privacy on the Internet?

Sloan Cybersecurity Lecture: Recap

As part of the FTC’s “Reclaim Your Name” initiative, FTC Commissioner Julie Brill delivered the Sloan Cybersecurity Lecture at NYU-Poly. Her talk focused on the rise of big data as a social force, the historical role of the FTC in privacy protection, and the roles that different parties (i.e. engineers, lawyers, policymakers, and advertising industry members) can play in ensuring both privacy and utility in the era of big data.

The lecture was followed by a lively and enlightening panel discussion, chaired by Katherine Strandburg (NYU). The panel members were Julie Brill (FTC), Jennifer Barrett Glasgow (Acxiom), Julia Angwin (WSJ), and Daniel Weitzner (MIT). The discussion centered on issues attending big data, with panelists discussing transparency, accountability, anonymity, and potential harm or discrimination that large-scale machine learning can facilitate. Finally, the panelists presented their views on the potential for privacy protection via legal or industry directives.

To find out more, read the lecture notes or the panel notes.
