Predicting the Functions of Un-Annotated Proteins Associated With Cancer by Extracting Information from Biomedical Literature

Student thesis: Master's Thesis


In the post-genomic era, researchers have realized the importance of studying protein/gene interactions to understand biochemical pathways. The most reliable way to determine these interactions is through experimental methods. However, most of these methods cannot keep up with the rapid growth of the biological knowledge to be studied. One of the most important alternative approaches to handling the massive growth of the biomedical literature is text mining, which has produced a large number of valuable computational algorithms in bioinformatics.

We present a text mining system called Gene Interaction Rare Event Miner (GIREM) that constructs a gene-gene interaction network for the human genome using information extracted from the biomedical literature. It identifies functionally related genes based on their co-occurrences in the abstracts of biomedical literature. GIREM extracts pairs of genes found to be associated with each other at three different levels of the text. It counts the number of times a pair of genes co-occurs in the abstracts (i.e., abstract level) and in the sentences (i.e., sentence level) separately. GIREM also aims to enhance biomedical literature text mining approaches by recognizing the semantic relationship in each co-occurrence of a pair of genes in the text, using the syntactic structures of sentences and linguistic theories (i.e., semantic level). It uses a novel rare-event classifier to classify gene pairs and construct the gene-gene interaction network. In this study, we utilize a linear rare-event classifier (Weighted Logistic Regression) and a non-linear alternative (Weighted Kernel Logistic Regression).

Understanding genetic networks and their role in chronic diseases (e.g., cancer) is one of the essential objectives of biological researchers. We analyze the constructed network of genes using different network centrality measures to decide on the importance of each gene. Specifically, we apply betweenness, closeness, eigenvector and degree centrality metrics to rank the central genes of the network and to identify possible cancer-related genes. We evaluated the top 15 ranked genes for different cancer types. The results show that GIREM has the potential to improve the prediction accuracy of identifying gene-gene interactions and disease-gene associations.

Indexing Terms: Text Mining, Biological NLP, Biomedical Literature, Gene-Gene Interaction, Network Analysis, Disease-Gene Association.
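The abstract-level and sentence-level counting and the centrality-based ranking described in the abstract can be illustrated with a minimal Python sketch. It is not GIREM's implementation: the gene symbols, the toy abstracts, and the function names are hypothetical; sentences are split with a simple regular expression; the rare-event classifier is replaced by a naive rule that links two genes if they share at least one sentence; and only degree centrality is computed.

```python
import re
from collections import Counter
from itertools import combinations

# Hypothetical gene symbols and toy abstracts, for illustration only.
GENES = {"TP53", "MDM2", "BRCA1", "EGFR"}
ABSTRACTS = [
    "TP53 is regulated by MDM2. BRCA1 was also measured.",
    "MDM2 binds TP53 in tumor cells. EGFR signaling was unchanged.",
    "BRCA1 and TP53 respond to DNA damage.",
]

def genes_in(text):
    """Return the sorted list of known gene symbols mentioned in text."""
    tokens = set(re.findall(r"[A-Za-z0-9]+", text))
    return sorted(tokens & GENES)

def count_cooccurrences(abstracts):
    """Count gene pairs once per abstract and once per sentence."""
    abstract_counts, sentence_counts = Counter(), Counter()
    for abstract in abstracts:
        abstract_counts.update(combinations(genes_in(abstract), 2))
        # Naive sentence split: break after ., ! or ? followed by whitespace.
        for sentence in re.split(r"(?<=[.!?])\s+", abstract):
            sentence_counts.update(combinations(genes_in(sentence), 2))
    return abstract_counts, sentence_counts

def degree_centrality(edges, nodes):
    """Fraction of the other nodes each node is directly connected to."""
    neighbors = {n: set() for n in nodes}
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)
    return {n: len(v) / (len(nodes) - 1) for n, v in neighbors.items()}

abstract_counts, sentence_counts = count_cooccurrences(ABSTRACTS)

# Assumption: an edge exists if a pair shares at least one sentence.
# GIREM instead decides edges with a rare-event classifier.
centrality = degree_centrality(sentence_counts.keys(), GENES)
ranking = sorted(centrality, key=lambda g: (-centrality[g], g))

print(abstract_counts.most_common())
print(sentence_counts.most_common())
print(ranking)
```

In this toy corpus, a pair such as BRCA1 and MDM2 co-occurs at the abstract level but never within a single sentence, which shows why the two counts are kept separately.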
Date of Award: Dec 2017
Original language: American English
Supervisor: Kamal Taha


  • Text Mining
  • Biological NLP
  • Biomedical Literature
  • Gene-Gene Interaction
  • Network Analysis
  • Disease-Gene Association
