Hash-Combs: A Data Representation Scheme for Privacy-Aware Analytics

  • Abdelrahman AlMahmoud

Student thesis: Doctoral Thesis


In recent years, the exponential growth in the rate of data collection has raised serious concerns about data privacy. The general public is becoming increasingly aware and wary of how their data is shared and used, which has encouraged the introduction of laws and regulations governing the privacy of personal information. Such regulations can limit the value of shared data for analysis, which can degrade the end results. In this thesis, we address the challenge of performing privacy-preserving analysis over shared data by developing multi-hash structures (Hash-Combs) that apply standardized cryptographic techniques. Hash-Combs represent data as a multi-level hash generated by quantizing the dataset at multiple granularities. Compared to the literature, the Hash-Combs scheme allows: 1) the use of standard cryptographic techniques to perform various types of analytics, including distance-based ones; 2) the use of multiple levels of quantization to control accuracy; and 3) the use of privacy-preservation techniques such as differentially private noise on quantized data. To validate the performance of Hash-Combs, we use two real datasets that cover different data types (numerical sensor data and IP network data). We observe through experiments the effects of different numbers of dimensions, noise injection locations, and other parameters on the performance of our scheme. We show that it is possible to obtain over 98% accuracy. Finally, we study the use of Hash-Combs in specific applications such as analysing e-mail data for spam detection. This study led us to develop SPAMdoop, a tool designed to run efficiently on MapReduce platforms by routing data encoded by the same Hash-Combs to the same physical machines. We validate SPAMdoop by simulating spam campaigns and showing that it can detect parent spam templates with high efficiency.
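The multi-level idea described in the abstract can be illustrated with a minimal sketch. It is not the thesis's implementation: the function names (`hash_comb`, `matching_levels`), the bin widths, the shared key, and the choice of HMAC-SHA256 as the "standard cryptographic technique" are all assumptions made for illustration. Each value is quantized at several granularities and each bin index is hashed, so two parties can compare hashes level by level without exchanging raw values. Differentially private noise is not covered here.

```python
import hashlib
import hmac
import math

def hash_comb(value, widths, key):
    """Return one keyed hash per quantization level (hypothetical helper).

    widths: bin widths ordered from coarse to fine (assumed parameters).
    key: shared secret for HMAC-SHA256 (an assumed choice of primitive).
    """
    comb = []
    for level, width in enumerate(widths):
        # Quantize: map the value to the index of the bin it falls into.
        bin_index = math.floor(value / width)
        # Hash the level together with the bin index so that equal
        # indices at different levels produce different hashes.
        message = f"{level}:{bin_index}".encode()
        comb.append(hmac.new(key, message, hashlib.sha256).hexdigest())
    return comb

def matching_levels(comb_a, comb_b):
    """Count the levels at which two combs carry identical hashes."""
    return sum(a == b for a, b in zip(comb_a, comb_b))

KEY = b"shared-secret"       # illustrative key only
WIDTHS = [10.0, 1.0, 0.1]    # coarse, medium, fine (illustrative)

a = hash_comb(21.34, WIDTHS, KEY)
b = hash_comb(21.38, WIDTHS, KEY)
c = hash_comb(27.90, WIDTHS, KEY)

print(matching_levels(a, b))  # nearby values agree at more levels
print(matching_levels(a, c))  # distant values agree only at coarse levels
```

In this sketch the number of matching levels acts as a rough proxy for distance: more matches means the underlying values share finer bins. Two values that sit close together but on opposite sides of a bin boundary will not match at that level, which is a limitation of this simple version.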
Date of Award: Jul 2019
Original language: American English
Supervisor: Ernesto Damiani (Supervisor)


  • Privacy
  • Distance-Preserving Hashing
  • Spam
  • Privacy-Preserving Collaborative Analysis
