Automatic Arabic Summarization

  • Lamees Al Qassem

Student thesis: Master's Thesis

Abstract

Text summarization has been a field of intensive research over the last 50 years, especially for widely used languages with relatively simple grammar, such as English. Meanwhile, the unprecedented growth in the amount of online information available to users and businesses in many languages, including news articles and social media, has made it difficult and time-consuming for users to identify and consume sought-after content. Hence, automatic text summarization that generates accurate and relevant summaries from this huge amount of information, in various languages, is essential nowadays. Techniques and methodologies for Arabic text summarization are still immature due to the inherent complexity of the Arabic language in terms of both structure and morphology.

This thesis attempts to improve the performance of Arabic text summarization. After highlighting the need for automatic text summarization systems, we first survey the existing Arabic text summarization approaches and discuss their limitations. We then propose a new Arabic text summarization approach based on a new noun extraction method and fuzzy logic. We believe that our system is the first Arabic text summarizer that uses fuzzy logic to improve summarization accuracy. Given that other researchers have already shown that fuzzy logic improves the performance of text summarization in English, we show that these benefits can be replicated for Arabic text summarization.

The summarizer consists of four main modules: pre-processing, noun extraction, feature extraction and fuzzy logic. Pre-processing prepares the text before sentences are further processed and summarized. Noun extraction extracts the nouns from the text output by the pre-processor; the extracted nouns are used to score the importance of sentences. For this module, we developed a rule-based noun extraction system that extracts nouns according to Arabic grammar rules. The noun extraction system is evaluated against the widely used Stanford Arabic Part of Speech (POS) tagger. The results show that our proposed method is more efficient and achieves comparable accuracy. Next, the feature extraction module extracts key features representing the importance of the sentences. Finally, the extracted features/scores are input into the fuzzy logic module to generate the final scores for the sentences. Each sentence's score indicates how important that sentence is within the whole article, and the sentences with the highest scores are selected to generate the final summary.

The summarization system is evaluated on the EASC corpus using the Recall-Oriented Understudy for Gisting Evaluation (ROUGE) metric and compared against popular state-of-the-art Arabic text summarization systems. The results indicate that our fuzzy logic approach with noun extraction outperforms existing systems.
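
The scoring and selection steps described above can be illustrated with a minimal Python sketch. It is not the thesis implementation: the two features (keyword ratio and sentence position), the four fuzzy rules, their output levels, and all function names are assumptions chosen only for illustration, and a plain set of keywords stands in for the output of the noun extraction module.

```python
import re


def split_sentences(text):
    # Naive splitter on Latin and Arabic end-of-sentence punctuation
    # (an assumption; a real pre-processor would do far more).
    parts = re.split(r"(?<=[.!?؟])\s+", text.strip())
    return [p for p in parts if p]


def tokenize(sentence):
    # Unicode-aware word tokens, lower-cased.
    return re.findall(r"\w+", sentence.lower())


def extract_features(sentences, keywords):
    # Two hypothetical features per sentence, each in [0, 1]:
    # the share of tokens that are keywords, and a position score
    # that favours earlier sentences.
    n = len(sentences)
    feats = []
    for i, sentence in enumerate(sentences):
        tokens = tokenize(sentence)
        hits = sum(1 for t in tokens if t in keywords)
        keyword_ratio = hits / len(tokens) if tokens else 0.0
        position = 1.0 - i / n
        feats.append((keyword_ratio, position))
    return feats


def fuzzy_score(keyword_ratio, position):
    # Linear "high"/"low" memberships for each feature.
    high_k, low_k = keyword_ratio, 1.0 - keyword_ratio
    high_p, low_p = position, 1.0 - position
    # Illustrative rule base: (firing strength, output importance).
    # AND is modelled with min; output levels are assumed values.
    rules = [
        (min(high_k, high_p), 1.0),
        (min(high_k, low_p), 0.7),
        (min(low_k, high_p), 0.4),
        (min(low_k, low_p), 0.0),
    ]
    # Weighted-average defuzzification over the fired rules.
    total = sum(w for w, _ in rules)
    return sum(w * z for w, z in rules) / total


def summarize(text, keywords, k):
    # Score every sentence, keep the k best, restore document order.
    sentences = split_sentences(text)
    scores = [fuzzy_score(kr, pos)
              for kr, pos in extract_features(sentences, keywords)]
    ranked = sorted(range(len(sentences)),
                    key=lambda i: scores[i], reverse=True)[:k]
    return [sentences[i] for i in sorted(ranked)]
```

Calling `summarize(text, keywords, k)` returns the `k` highest-scoring sentences in their original order. In a fuller system the keyword set would come from the noun extractor, and the feature set and rule base would be considerably richer.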
Date of Award: Dec 2017
Original language: American English
Supervisor: Hassan Barada

Keywords

  • Arabic Natural language processing
  • text summarization
  • Fuzzy Logic
  • Information Retrieval
  • corpus
