In recent years, the exponential growth in the rate of data collection has raised serious concerns about data privacy. The general public is increasingly aware and wary of how their data is shared and used, which has encouraged the introduction of laws and regulations governing the privacy of personal information. Such regulations can limit the value of shared data for analysis, which can degrade the end results. In this thesis, we address the challenge of performing privacy-preserving analysis over shared data by developing multi-hash structures (Hash-Combs) that apply standardized cryptographic techniques. Hash-Combs represent data as a multi-level hash that is generated by quantizing the dataset at multiple granularities. Compared to the literature, the Hash-Combs scheme allows: 1) the use of standard cryptographic techniques to perform various types of analytics, including distance-based ones, 2) the use of multiple levels of quantization to control accuracy, and 3) the use of privacy-preservation techniques such as differentially private noise on quantized data. To validate the performance of Hash-Combs, we use two real datasets that cover different data types (numerical sensor data and IP network data). We observe through experiments the effects of different numbers of dimensions, noise injection locations, and other parameters on the performance of our scheme. We show that it is possible to obtain over 98% accuracy. Finally, we study the use of Hash-Combs in specific applications, such as analyzing e-mail data for spam detection. This study led us to develop SPAMdoop, a tool designed to run efficiently on MapReduce platforms by routing data encoded by the same Hash-Combs to the same physical machines. We validate SPAMdoop by simulating spam campaigns and showing that it can detect parent spam templates with high efficiency.
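The idea of a multi-level hash built from several quantization granularities can be illustrated with a minimal sketch. This is not the construction used in the thesis: the bin widths, the shared salt, the choice of SHA-256, and the helper names `hash_comb`, `matching_levels`, and `route` are all assumptions made for illustration. The sketch quantizes a single numeric value at coarse-to-fine bin widths, hashes each bucket, and uses the number of leading matching levels as a rough proximity signal without revealing the raw value.

```python
import hashlib

def hash_comb(value, bin_widths, salt=b"shared-secret"):
    """Return one SHA-256 digest per quantization level (coarse to fine).

    Illustrative only: widths, salt and hash choice are assumptions.
    """
    comb = []
    for width in bin_widths:
        bucket = int(value // width)  # quantize at this granularity
        msg = salt + f"{width}:{bucket}".encode()
        comb.append(hashlib.sha256(msg).hexdigest())
    return comb

def matching_levels(comb_a, comb_b):
    """Count the leading levels on which two combs agree."""
    count = 0
    for a, b in zip(comb_a, comb_b):
        if a != b:
            break
        count += 1
    return count

def route(comb, n_workers, level=0):
    """Hypothetical partitioner: records sharing a hash at `level`
    are assigned to the same worker index."""
    return int(comb[level], 16) % n_workers

WIDTHS = [10.0, 1.0, 0.1]          # coarse -> fine (assumed values)
a = hash_comb(21.37, WIDTHS)
b = hash_comb(21.84, WIDTHS)       # close to a
c = hash_comb(58.02, WIDTHS)       # far from a
```

Here `a` and `b` share their two coarsest hashes while `a` and `c` share none, so more matching levels suggests closer values. The `route` helper hints at how records with the same coarse hash could be sent to the same machine in a MapReduce-style job, in the spirit of the SPAMdoop description above.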
| Field | Value |
|---|---|
| Date of Award | Jul 2019 |
| Original language | American English |
| Supervisor | Ernesto Damiani (Supervisor) |
- Privacy
- Distance-Preserving Hashing
- Spam
- Privacy-Preserving Collaborative Analysis
Hash-Combs: A Data Representation Scheme for Privacy-Aware Analytics
AlMahmoud, A. (Author). Jul 2019
Student thesis: Doctoral Thesis