Human Mobility Modeling from Cellular Network Data

Today, the CSE department hosted Ramón Cáceres of AT&T Labs Research, who presented a talk on “Human mobility modeling from cellular network data”. Here is what I remember of the talk:

  • Cell phone data and privacy: his group has access to anonymized Call Detail Records (CDRs), which do not directly contain personally identifiable information (PII) but do contain fields such as a random unique identifier (UID), call duration, time of day, the set of closest antennae, and so on. These CDR records are spatially coarse, accurate only to within a few miles, because of the sparse distribution of antennae.
  • Characteristics of mobility records: what can you learn from the CDRs? Quite a few aggregate statistics that matter for societal tasks such as city planning. His group has developed fairly accurate algorithms for determining where people live and work and what their daily commute patterns look like (validated by comparison to census data). From these primary characteristics, you can derive secondary characteristics, such as estimating a person’s carbon footprint from their daily commute pattern, or identifying which regions contribute to the workforce of a city (its laborshed).
  • Synthetic data modeling: this is great for AT&T, but how can others, such as city planners or scientific researchers, benefit from this aggregate data without sacrificing individual privacy? We run into the classic dilemma of societal utility versus individual privacy. One approach is not to release the original CDR database itself, but instead a synthetic data model derived from it, which presents records of synthetic “people” that nevertheless realistically (though with some inaccuracy) sample the original database. He presented a visual comparison of the original database and the synthetic data model: the latter is, as expected, less accurate than the former, but good enough.
  • Differential privacy data modeling: his group wanted privacy guarantees stronger than those of synthetic data modeling alone, so they used the strongest formal definition of privacy we currently have, differential privacy (DP), to further protect the synthetic data model. He then showed how his group added DP noise to the synthetic data model, and showed that the resulting accuracy falls between that of the original CDR database (best) and that of publicly available data (worst).
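The talk did not spell out the home/work algorithm, but a common heuristic for this kind of inference (and a reasonable mental model for what the validated algorithms do) is to label the antenna a UID contacts most often at night as “home” and the one contacted most often during weekday working hours as “work”. A minimal sketch, with hypothetical record fields and hour cutoffs of my own choosing:

```python
from collections import Counter

# Hypothetical CDR shape: (uid, hour_of_day, antenna_id).
# Assumption: night = 22:00-06:00 approximates "home" hours,
# and 09:00-17:00 approximates "work" hours.
def infer_home_work(records):
    night = Counter()  # antennae seen during night hours
    day = Counter()    # antennae seen during working hours
    for _uid, hour, antenna in records:
        if hour >= 22 or hour < 6:
            night[antenna] += 1
        elif 9 <= hour < 17:
            day[antenna] += 1
    home = night.most_common(1)[0][0] if night else None
    work = day.most_common(1)[0][0] if day else None
    return home, work

records = [
    ("u1", 23, "A"), ("u1", 2, "A"), ("u1", 5, "A"),
    ("u1", 10, "B"), ("u1", 14, "B"), ("u1", 16, "B"),
]
print(infer_home_work(records))  # ('A', 'B')
```

Note that even this crude version only needs the coarse, mile-level antenna locations mentioned above, which is why the census-level validation is plausible.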
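The talk did not give the exact mechanism his group used, but the standard way to add DP noise to aggregate counts (such as the commuter counts a synthetic model is built from) is the Laplace mechanism: add noise drawn from a Laplace distribution with scale sensitivity/epsilon. A sketch under that assumption, using only the standard library:

```python
import math
import random

def laplace_noise(scale):
    """Draw one sample from Laplace(0, scale) by inverse-CDF sampling."""
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def dp_count(true_count, epsilon):
    # A counting query has sensitivity 1 (adding or removing one
    # person changes the count by at most 1), so Laplace(1/epsilon)
    # noise gives epsilon-differential privacy for this one query.
    return true_count + laplace_noise(1.0 / epsilon)

random.seed(0)
commuters = 12_480  # hypothetical: people commuting into one region
print(dp_count(commuters, epsilon=0.5))
```

Smaller epsilon means stronger privacy but noisier counts, which matches the accuracy ordering he showed: noised model below the raw CDR database, but above public data.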

While this work addresses different questions than the Unique in the Crowd paper, I like Ramón’s work better because it is more useful than work that simply attacks privacy without defending it. (To paraphrase Stanislaw Ulam, “Everybody attacks but nobody defends”.)

People asked many questions, but I will only talk about mine. I asked him how he plans to handle an online database with differential privacy, which remains a bit of an open problem as far as I know. He answered that, yes, this would be a problem when trying to protect the original CDR database, because there is a limited privacy budget, but the problem does not apply to his application of DP to the synthetic data model (as opposed to the original CDR database).

Relevant papers:
