Arabic reCAPTCHA Service for Web Security and Old Arabic Manuscripts Digitization

  • Hanin B. Abubaker

Student thesis: Master's Thesis

Abstract

reCAPTCHA is a security measure that guards web applications against automated abuse by presenting users with a randomly generated challenge to solve. These challenges must be hard for computers to solve, yet easy for humans. Although reCAPTCHA has been developed in many languages, there is no available reCAPTCHA in Arabic. There is an immense need for an Arabic reCAPTCHA service to secure millions of Arabic websites and to aid in digitizing old printed Arabic manuscripts, which in turn increases the digital Arabic content. This thesis surveys the different CAPTCHA systems and the development of reCAPTCHA. We highlight the need for an Arabic reCAPTCHA service, and then present an original cloud-based architecture, design, and implementation of such a service. We also propose solutions and algorithms for a number of design and implementation challenges. First, we devise a scheme to properly extract word images from scanned pages to form reCAPTCHA challenges to be solved by humans. Second, we propose a mechanism for classifying the extracted word images into known and unknown word sets. Third, we explore and propose two algorithms for processing user input to a reCAPTCHA challenge, which prepare the service response for user verification and, at the same time, store the user guess for the digitization process. Fourth, we propose a solution to maintain data integrity while handling multiple user requests for reCAPTCHA challenges. Fifth, we design a relational database schema to provide storage efficiency, data sharing, and data integrity for word images and user input. Moreover, we show how the different components and subservices of our proposed Arabic reCAPTCHA system can be deployed on a public cloud such as Amazon Web Services (AWS).
To validate the efficacy of the service, we conduct an experimental study to measure digitization accuracy and user experience satisfaction. The study shows that a digitization accuracy of 96.3% was attained and that 72.2% of the audience preferred solving Arabic reCAPTCHA challenges over English reCAPTCHA for Arabic websites.

Keywords: reCAPTCHA; digitization; cloud-based services; crowdsourcing
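
The abstract does not spell out the two input-processing algorithms, so the following is only a minimal sketch of the general reCAPTCHA idea it describes: the user is verified against a word from the known set, and the guess for a word from the unknown set is stored as a vote toward digitization. The names `UnknownWord`, `process_answer`, and the `min_votes` threshold are assumptions made for illustration and are not taken from the thesis.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class UnknownWord:
    """A word image whose text has not been digitized yet (hypothetical type)."""
    image_id: str
    votes: Counter = field(default_factory=Counter)


def process_answer(known_text, unknown, control_guess, unknown_guess, min_votes=3):
    """Verify the user on the known word and record the guess for the unknown word.

    Returns (verified, promoted_text). promoted_text stays None until one
    guess has collected at least min_votes votes (an assumed threshold).
    """
    # The user passes only if the known (control) word is typed correctly.
    if control_guess.strip() != known_text:
        return False, None

    # A verified user's guess for the unknown word counts as one vote.
    guess = unknown_guess.strip()
    if guess:
        unknown.votes[guess] += 1
    if not unknown.votes:
        return True, None

    # Promote the leading guess once enough users agree on it.
    text, count = unknown.votes.most_common(1)[0]
    return True, (text if count >= min_votes else None)
```

In this sketch a failed control word discards the accompanying guess, so automated or careless submissions do not pollute the vote counts. A production service would likely also normalize Arabic input (for example diacritics) before comparison, which is left out here.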
Date of Award: Jun 2016
Original language: American English
Supervisor: Khaled Salah (Supervisor)

Keywords

  • reCAPTCHA; digitization; cloud-based services; crowdsourcing
