
An evolutionary game-theoretic model of privacy versus utility

I first thought about this while savouring a coffee-soaked madeleine with a few good friends at Tous Les Jours on a Sunday evening a few moons ago (September 29). I personally think it is a question worth investigating, and I did not want to write about it until I had worked out more details, but I have not found a peaceful stretch of time to do so. In the meantime, I will scribble about it here. Perhaps you could tell me whether it would be useful?

The seed was planted two days earlier when Seda Gurses organized a delightful meeting on the technical implications of the NSA and GCHQ revelations. In the meeting, a very intelligent acquaintance asked, “Why don’t we build a browser add-on or extension to encrypt everything we do on Facebook?”, to which I said, “But won’t this mean the end of Facebook?”

I had recently read Part 1 of Schneier’s insightful book Liars & Outliers, and it occurred to me (while enjoying the coffee and madeleine) that we could probably use game theory to model this relationship between privacy and utility. If Facebook is the host, then we are the little organisms that both sustain Facebook and need it, just as Facebook needs us to survive. Life has seen many such examples of mutualism. Furthermore, we could study, as Maynard Smith did, evolutionarily stable strategies that sustain both Facebook and ourselves despite conflicting interests.

Let me explain what I mean by all of this. It is not a profound observation, but I believe there is a deep relationship between privacy, utility and machine learning. If we could somehow measure the utility that we provide Facebook (perhaps by how well Facebook can extract meaningful statistics via machine learning), then we could study how different privacy-preserving strategies affect the utility we provide Facebook (which would, in turn, ultimately affect us). If we encrypt absolutely everything, such that Facebook cannot discern anything about what we say and do, then Facebook the host will probably not survive, which deprives us of its useful services. On the other hand, if Facebook can discern everything we say and do, then we have no privacy left, which detracts substantially from the value of its services. Our model must capture both of these extremes and let us study how different privacy-preserving strategies affect both Facebook and ourselves in the long term.

So I hope you can see why I think an evolutionary game-theoretic model could capture how different privacy-preserving strategies would sustain both Facebook and ourselves despite conflicting interests. What follows are some rough ideas to explore (with a toy sketch of the basic dynamics after the list):

  • Measure utility to Facebook by studying how different privacy-preserving strategies would affect, say, the effectiveness of targeted advertising.
  • Would a budget for encryption or only partial encryption of data result in a sustainable relationship?
  • Study different probability distributions of the risks of being identified (privacy loss). So far I have considered only the utility to Facebook, and I think this might be useful for studying the privacy loss to ourselves, although I don’t yet see exactly how.
  • Impact of memory (a.k.a. the “right to be forgotten”): if Facebook were forced to remember only a limited window of data, would it be able to sustain itself? Consider how memory affects the evolution of cooperation.
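Here is that toy sketch: a deliberately crude model of a population of users choosing between an "encrypt" and a "plaintext" strategy, evolving under replicator dynamics, with the host's viability depending on the share of users whose data it can read. Every constant, payoff and functional form below is an assumption of mine, chosen only to make the dynamics visible; it is not a worked-out proposal.

```python
# A minimal sketch of the kind of model I have in mind, not a worked-out
# proposal: replicator dynamics for a population of users choosing between
# "plaintext" and "encrypt" strategies on a host platform whose viability
# depends on how much readable data it gets. All constants are illustrative.

import math

SERVICE_VALUE = 2.0        # hypothetical value users get when the host thrives
PRIVACY_COST = 1.0         # hypothetical cost of exposing one's data in plaintext
VIABILITY_THRESHOLD = 0.3  # plaintext share the host needs to stay afloat

def host_viability(p_plain: float) -> float:
    """Smooth proxy for the host's chance of survival, rising steeply once
    the share of readable (plaintext) users passes the threshold."""
    return 1.0 / (1.0 + math.exp(-20.0 * (p_plain - VIABILITY_THRESHOLD)))

def payoffs(p_plain: float):
    """Expected payoff of each strategy given the current population mix."""
    v = host_viability(p_plain)
    u_plain = SERVICE_VALUE * v - PRIVACY_COST  # enjoys the service, pays in privacy
    u_encrypt = SERVICE_VALUE * v               # enjoys the service, keeps privacy
    return u_plain, u_encrypt

def replicator_step(p_plain: float, dt: float = 0.01) -> float:
    """One Euler step of the replicator dynamics: a strategy grows in
    frequency when its payoff exceeds the population average."""
    u_plain, u_encrypt = payoffs(p_plain)
    u_avg = p_plain * u_plain + (1.0 - p_plain) * u_encrypt
    return p_plain + dt * p_plain * (u_plain - u_avg)

p = 0.9  # start with most users posting in plaintext
for _ in range(20000):
    p = replicator_step(p)
print(f"long-run plaintext share: {p:.3f}, host viability: {host_viability(p):.3f}")
```

In this toy payoff structure, encrypting strictly dominates (same service, no privacy cost), so the plaintext share decays to zero and the host's viability collapses with it, a little tragedy of the commons that restates the tension above. The interesting modelling work, I think, is in finding payoff structures (encryption budgets, memory limits, identification risks) that admit an evolutionarily stable mix which keeps the host alive.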

If something does not make sense here, it is because the whole thing has not yet been carefully thought out. I would be interested to hear corrections or other feedback from you.

Human Mobility Modeling from Cellular Network Data

Today, the CSE department hosted Ramón Cáceres of AT&T Labs Research, who presented a talk on “Human mobility modeling from cellular network data”. Here is what I remember of the talk:

  • Cell phone data and privacy: his group has access to anonymized Call Detail Records (CDRs), which do not directly contain personally identifiable information (PII) but do contain things like a random unique identifier (UID), call duration, time of day, the closest set of antennae, and so on. These records are spatially inaccurate on the order of a few miles because of the coarse-grained antenna distribution.
  • Characteristics of mobility records: what can you learn from the CDRs? Quite a few aggregate statistics that are important for societal tasks such as city planning. His group has developed fairly accurate algorithms for determining where people live and work and what their daily commute patterns look like (validated by comparison to census data). From these primary characteristics, you can derive secondary ones, such as estimating carbon footprint from daily commute patterns, or determining which regions contribute to the workforce of a city (its laborshed).
  • Synthetic data modeling: this is great for AT&T, but how would other people (such as city planners or scientific researchers) benefit from this aggregate data without sacrificing individual privacy? We run into the classic dilemma of societal utility versus individual privacy. One way is not to release the original CDR database itself, but rather a synthetic data model derived from it: synthetic records of synthetic “people” that nevertheless realistically (though with some inaccuracy) reflect the original database. He presented a visual comparison of the original database and the synthetic data model: the latter is, as expected, not as accurate as the former, but it is good enough.
  • Differential privacy data modeling: his group wanted privacy guarantees stronger than those of synthetic data modeling alone, so they turned to the best definition of privacy we currently have, differential privacy (DP), to further protect the synthetic data model. He showed how his group added DP noise to the synthetic data model, and that the resulting accuracy falls between the best case (the original CDR database) and the worst case (publicly available data alone). A toy illustration of the standard DP mechanism follows this list.
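I will not try to reproduce how his group perturbed the synthetic model, but for readers who have not seen DP in action, here is that toy illustration: the standard Laplace mechanism applied to a made-up histogram of visit counts. The function, the sensitivity assumption and the numbers are mine, not from the talk.

```python
# A toy illustration of the Laplace mechanism, not the mechanism from the
# talk: add Laplace noise to a histogram of visit counts so that the
# released counts satisfy epsilon-differential privacy, assuming each person
# contributes at most one count per cell (sensitivity 1).

import numpy as np

def dp_counts(counts, epsilon, sensitivity=1.0, rng=None):
    """Release an epsilon-DP version of a vector of counts."""
    rng = rng or np.random.default_rng()
    scale = sensitivity / epsilon  # Laplace scale grows as epsilon shrinks
    noisy = counts + rng.laplace(loc=0.0, scale=scale, size=len(counts))
    return np.clip(noisy, 0.0, None)  # post-processing: counts cannot be negative

true_counts = np.array([120.0, 45.0, 300.0, 8.0, 0.0])  # made-up visits per antenna region
print(dp_counts(true_counts, epsilon=0.5))
```

Smaller values of epsilon mean more noise and stronger privacy, and repeating such releases on the same underlying data consumes the privacy budget additively, which is exactly the issue behind my question below.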

While this work addresses different questions than the Unique in the Crowd paper, I like Ramón’s work better because it is more useful than work that simply attacks privacy without defending it. (To paraphrase Stanislaw Ulam, “Everybody attacks but nobody defends”.)

People asked many questions, but I will only talk about mine. I asked him how he plans to handle an online database with differential privacy, which remains a bit of an open problem as far as I know. He answered that, yes, this would be a problem when trying to protect the original CDR database, because the privacy budget is limited and every additional release consumes part of it; but the problem does not apply to his application of DP to the synthetic data model (as opposed to the original CDR database).

Relevant papers: