Infoscience

research article

Where You Are is Who You Are: User Identification by Matching Statistics

Movahedi Naini, Farid  
•
Unnikrishnan, Jayakrishnan  
•
Thiran, Patrick  
2016
IEEE Transactions on Information Forensics and Security

Most users of online services have unique behavioral or usage patterns. These patterns can be used to identify and track users from observations of their behavior alone. We study the task of identifying users from statistics of their behavioral patterns. Specifically, we focus on the setting in which we are given histograms of users’ data collected in two different experiments. In the first dataset, we assume that the users’ identities are anonymized or hidden, and in the second dataset we assume that their identities are known. We study the task of identifying the users in the first dataset by matching the histograms of their data with the histograms from the second dataset. In recent work [1], [2], the optimal algorithm for this user identification task was introduced. In this paper, we evaluate the effectiveness of this method on a wide range of datasets with up to 50,000 users, and in a wide range of scenarios. Using datasets such as call data records, web browsing histories, and GPS trajectories, we demonstrate that a large fraction of users can be easily identified given only histograms of their data, and hence that these histograms can act as users’ fingerprints. We also show that identifying all users simultaneously achieves better performance than identifying them one by one. Furthermore, we show that the optimal identification method does indeed give higher identification accuracies than heuristics-based approaches in such practical scenarios. The accuracies obtained under this optimal method can thus be used to quantify the maximum level of user identification that is possible in such settings. We show that the key factors affecting the accuracy of the optimal identification algorithm are the duration of the data collection, the number of users in the anonymized dataset, and the resolution of the dataset. We also analyze the effectiveness of k-anonymization in resisting user identification attacks on these datasets.
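To make the contrast between one-by-one and simultaneous identification concrete, the following is a minimal Python sketch on made-up toy histograms. It is not the algorithm of [1], [2]: the Jensen-Shannon divergence used as the matching cost, the function names, and the data are all illustrative assumptions, and the brute-force search over permutations is only feasible for a handful of users. For realistic dataset sizes one would use an assignment solver instead, for example SciPy's `linear_sum_assignment`.

```python
import math
from itertools import permutations


def normalize(hist):
    """Turn raw counts into an empirical distribution."""
    total = sum(hist)
    return [count / total for count in hist]


def js_divergence(p, q):
    """Jensen-Shannon divergence (natural log); an assumed, illustrative cost."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        # Terms with x_i == 0 contribute nothing to the sum.
        return sum(a * math.log(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def cost_matrix(anonymized, known):
    """cost[i][j] = divergence between anonymized histogram i and known user j."""
    return [
        [js_divergence(normalize(a), normalize(k)) for k in known]
        for a in anonymized
    ]


def match_one_by_one(cost):
    """Assign each anonymized histogram to its closest known user independently."""
    return [min(range(len(row)), key=row.__getitem__) for row in cost]


def match_simultaneously(cost):
    """Pick the one-to-one assignment with minimum total cost (brute force)."""
    n = len(cost)
    best = min(
        permutations(range(n)),
        key=lambda perm: sum(cost[i][perm[i]] for i in range(n)),
    )
    return list(best)


# Toy data (hypothetical): visit counts over three location cells.
known = [[8, 1, 1], [5, 4, 1], [1, 1, 8]]        # identities known
anonymized = [[4, 5, 1], [6, 3, 1], [1, 2, 7]]   # true users: 1, 0, 2

cost = cost_matrix(anonymized, known)
print(match_one_by_one(cost))      # [1, 1, 2]
print(match_simultaneously(cost))  # [1, 0, 2]
```

In this toy case the one-by-one rule assigns two anonymized histograms to the same known user, whereas the one-to-one constraint of the simultaneous match resolves the collision and recovers the intended assignment.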

Name: 07321027.pdf
Type: Publisher's Version
Version: http://purl.org/coar/version/c_970fb48d4fbd8a85
Access type: openaccess
Size: 2.01 MB
Format: Adobe PDF
Checksum (MD5): 46aabdea4672d677750f9b56def49e81

Name: histMatching_codes_RR.zip
Access type: openaccess
Size: 26.16 KB
Format: ZIP
Checksum (MD5): cc0c53dbd209de94b21b663fe9fac2a4
