Electronic Text Stylometry

  • Mahmoud Khonji

Student thesis: Doctoral Thesis

Abstract

Mahmoud Khonji, 'Electronic Text Stylometry', Ph.D. Thesis, Ph.D. in Engineering, Department of Electrical and Computer Engineering, Khalifa University of Science and Technology, United Arab Emirates, April 2017.

Electronic text stylometry problems are a subset of classification tasks in which the objective is to statistically infer information about the authors of input electronic texts solely by analysing the texts. The inferred information could be the identity of the authors or their profile attributes, such as age group or ethnicity group. Enhancing the accuracy of electronic text stylometry problem solvers can benefit various application domains such as forensics, market analysis, and biometric authentication. This is especially attractive given the increase in textual publications via social networking websites [1, 2].

The key contributions of this thesis fall under three categories: analyses that quantify various aspects of the current state of the literature on stylometry, improvements to state-of-the-art author identification methods that enhance their classification accuracy, and the release of our key software libraries under permissive open-source licenses.

The first category is important because numerous stylometry methods are proposed and evaluated only in isolation, without adequate comparisons against competing methods. Therefore, in order to draw correct conclusions about the effectiveness of the proposed methods, we had to re-implement and re-evaluate key methods and, in some cases, construct novel classification datasets (i.e. an author identification dataset of Emirati social media texts).
Examples of our contributions that fall under this category are:

  • A generalization of all stylometry problems, including domain adaptation, in probabilistic terms under a unified notation, which significantly enhanced our understanding of the problems as well as our proposed improvements. To the best of our knowledge, this is the most comprehensive formal description of stylometry problems to date.
  • The evaluation of over 23,000 feature extraction functions. To the best of our knowledge, this is, by far, the most extensive feature evaluation of its kind in the literature of electronic text stylometry. This showed that classical n-grams allow for the identification of better features than k-skip and syntactic n-grams.
  • The construction of the first author identification evaluation dataset for the domain of Emirati social media texts. This is important because the performance of existing stylometry problem solvers in this domain was previously unknown. It also led to the first evaluation of such solvers in this domain, which showed that achieving high classification accuracy (0.9833) is possible even with a space of 30 suspect authors. This accuracy is higher than what is generally reported in the literature for suspect spaces of similar size, which suggests that the scalability of author attribution models is not as severely affected as implied in the literature.
  • An evaluation of the performance of author verification problem solvers when tested against texts that contain manual stylistic deception (e.g. texts written by adversaries who deliberately obfuscate their writing style or imitate other authors). While such evaluations have been carried out against author attribution models, this is the first evaluation of its kind against author verification problem solvers.

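The classical and k-skip n-gram feature families compared in the feature evaluation above can be illustrated with a minimal sketch. This is not the thesis's feature library: the function names `ngrams` and `skip_ngrams` are hypothetical, and the k-skip definition used here (n tokens kept in order, with at most k tokens skipped in total) is one common convention that is assumed rather than taken from the thesis.

```python
from itertools import combinations
from typing import List, Sequence, Tuple


def ngrams(tokens: Sequence[str], n: int) -> List[Tuple[str, ...]]:
    """Classical n-grams: every run of n adjacent tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def skip_ngrams(tokens: Sequence[str], n: int, k: int) -> List[Tuple[str, ...]]:
    """k-skip n-grams (assumed definition): n tokens in their original
    order, with at most k tokens skipped in total between them."""
    grams = []
    for start in range(len(tokens)):
        # Positions that may follow the first token without exceeding k skips.
        window = range(start + 1, min(start + n + k, len(tokens)))
        for rest in combinations(window, n - 1):
            grams.append((tokens[start],) + tuple(tokens[j] for j in rest))
    return grams


tokens = "the quick brown fox".split()
bigrams = ngrams(tokens, 2)
skip_bigrams = skip_ngrams(tokens, 2, 1)
```

With k = 0 the skip variant reduces to classical n-grams, so the two families can share one code path. Passing `list(text)` instead of a word list gives character-level features under the same functions.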
On the other hand, the importance of the second category of our contributions, namely enhancing the classification accuracy of existing methods, stems from the fact that the classification accuracy of some key existing methods is sub-optimal and, in some cases, even approaches random guessing in certain challenging scenarios, such as the cross-domain author verification problem. Therefore, we believe that it is critical to first enhance the classification accuracy of the methods before improving their run-time computational complexity. Examples of such contributions are:

  • Improvements to the score aggregation methods that enhance the accuracy of the state-of-the-art Author Verification (AV) solver. When using optimum decision thresholds, all statistically significant gains in classification accuracy indicate that our modifications achieve higher classification accuracy than the state-of-the-art solver. This includes significant increases in classification accuracy from 0.53 to 0.72.
  • A domain adaptation method that enhances the accuracy of the state-of-the-art in cross-domain problem scenarios. This includes significant increases in classification accuracy from 0.54 to 0.62. To the best of our knowledge, this is the first domain adaptation method that demonstrates statistically significant gains in the classification accuracy of author verification problem solvers.

Another key challenge facing the current literature on electronic text stylometry is that implementations of most of the proposed methods are not readily available. Since re-implementing them often comes at a significant cost in time and effort, we believe that this is a key reason for the lack of adequate evaluations of the proposed methods. For example, to the best of our knowledge, the literature lacks any evaluation that compares syntactic n-grams against k-skip n-grams on a unified test bed.
Therefore, one of our contributions in this thesis is the release of our electronic text feature library, which, to the best of our knowledge, is the most extensive feature extraction library in the literature of electronic text stylometry. Additionally, since this library implements novel generalized forms of known feature extraction methods, it is capable of extracting features beyond those evaluated in the literature. Thanks to this, we have also identified successful novel variants of n-gram-based feature extraction methods.

Indexing Terms: Stylometry, Author attribution, Author verification, Author profiling, Stylistic inconsistency, Forensics, Anti-forensics.

Date of Award: Apr 2017
Original language: American English
Supervisor: Youssef Iraqi

Keywords

  • Stylometry
  • Author attribution
  • Author verification
  • Author profiling
  • Stylistic inconsistency
  • Forensics
  • Anti-forensics
