Machine Learning Approaches for Oil-Contamination Feature Detection in Metagenomic Datasets

  • Mary Krystelle Taniegra Catacutan

Student thesis: Master's Thesis

Abstract

Oil exploitation often leads to environmental perturbations that negatively affect ecosystems and human health. Timely diagnosis of such contamination is crucial, especially for small-scale spills, which can escalate into large-scale disasters if left undetected. Microbial communities are known to shift rapidly following changes in their environment, making them ideal biomarkers. Machine learning (ML) is well known for extracting meaningful biological patterns from complex datasets such as metagenomic data. However, little research has been conducted to demonstrate the performance of ML in diagnosing environmental contamination. As an initial study, the processing of microbial input data must be explored to enhance predictive power, a key step towards diagnostic and predictive ML tools for environmental monitoring. Publicly available datasets were collected from various geographical locations and environments to obtain a robust ML model. Here, new methods for data normalization and differential abundance (DA) analysis were explored, and different ML classification models were compared. The findings show that ANCOM-BC was a powerful tool for DA analysis, with a low standard error compared to DESeq2. However, when used for ML input normalization, DESeq2 (variance-stabilizing transformation) outperformed ANCOM-BC (log-transformed, bias-adjusted). In contrast, the type of feature selection (domain-knowledge-based or embedded methods) had little effect on ML performance. The Synthetic Minority Oversampling Technique (SMOTE) was also used to address data imbalance. The best ML models were Random Forest and Quadratic Support Vector Machine, which both achieved an AUC of 0.96, followed by Neural Network (AUC 0.95), KNN (AUC 0.94), and Decision Tree (AUC 0.84). The proposed pipeline was compared with a state-of-the-art pipeline, 'SIAMCAT', which achieved an AUC of 0.95.
This study demonstrates the successful prediction of oil contamination using global microbial community information. Possible microbial fingerprints of oil contamination were also identified, serving as a successful proof of principle.
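The abstract names two ideas that a short example can make concrete: SMOTE-style oversampling of the minority class, and AUC as the metric for comparing classifiers. The following is a minimal, standard-library-only sketch of both, not the thesis pipeline, whose implementation is not described here. The function names `smote_like` and `roc_auc` and the toy data are hypothetical and chosen only for illustration.

```python
import math
import random


def smote_like(minority, n_new, k=3, seed=0):
    """Create n_new synthetic samples, SMOTE-style: pick a minority sample,
    pick one of its k nearest minority neighbours, and interpolate at a
    random position on the segment between the two."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Nearest minority neighbours of `base`, excluding `base` itself.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: math.dist(base, p))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # 0 <= gap < 1, position along the segment
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


def roc_auc(labels, scores):
    """AUC as the probability that a randomly chosen positive sample
    receives a higher score than a randomly chosen negative one
    (ties count as one half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Toy minority class (hypothetical 2-feature abundance profiles).
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
synthetic = smote_like(minority, n_new=6)

# Toy classifier scores: 1 = contaminated, 0 = uncontaminated.
auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(len(synthetic), auc)
```

Because each synthetic sample lies on a segment between two existing minority samples, oversampling adds no points outside the region the minority class already spans. In practice, oversampling is usually applied to the training split only, so that evaluation data contain no synthetic samples.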
Date of Award: Dec 2021
Original language: American English

Keywords

  • Metagenomics
  • machine learning
  • oil contamination
  • environmental monitoring
  • feature detection
