FalconProtein: Finetuning Falcon Foundation Model for Protein Engineering

  • Wenjun Yang
  • Yiqing Shen
  • Zehong Wang
  • Rui Zhao
  • Qitong Lu
  • Xinsheng Liu
  • Yungeng Liu
  • Merouane Debbah
  • Yu Guang Wang
  • Shir Li Wang

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

Large Language Models (LLMs) have demonstrated zero-shot generalization capabilities in analyzing and predicting protein properties through natural language interactions. However, existing protein-focused datasets for LLM fine-tuning, such as ProteinLMDataset with its 17.46 billion tokens for pre-training and 893,000 instructions for fine-tuning, have limitations. Specifically, they offer insufficient coverage of protein functional properties, inadequate protein-protein interaction data, and limited integration of contextual information from biomedical literature. To overcome these challenges, we present ProteinPFAIDataset, which integrates data from UniProt and PubMed and comprises 72.8 million tokens for Supervised Fine-Tuning (SFT). ProteinPFAIDataset covers important protein characteristics including enzyme activities, molecular functions, pH dependence, tissue specificity, temperature sensitivity, subunit structure, and disease associations. Additionally, we propose a novel knowledge graph-based approach that incorporates over 300,000 biomedical literature entries, providing rich contextual information about protein functions and interactions. To validate the effectiveness of our dataset, we fine-tuned the Falcon2-11B LLM, resulting in a model we call Falcon2-11B-PFAI. The fine-tuned model achieved state-of-the-art performance on ProteinLMBench, improving accuracy from 47.10% to 58.37%. The dataset is available at https://huggingface.co/datasets/xiaorui1/PFAI. The fine-tuned model is available at https://huggingface.co/xiaorui1/PFAI/tree/main.
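
As a reading aid, the ProteinLMBench figures quoted in the abstract can be restated as an absolute and a relative gain. The short Python sketch below does only that arithmetic; the function name `accuracy_gain` is illustrative and not taken from the paper, and the two input values are the only numbers drawn from the abstract.

```python
def accuracy_gain(baseline_pct: float, finetuned_pct: float) -> tuple[float, float]:
    """Return (absolute gain in percentage points, relative gain in percent)."""
    absolute = finetuned_pct - baseline_pct
    relative = absolute / baseline_pct * 100.0
    return absolute, relative


# Accuracy values as reported in the abstract (ProteinLMBench).
absolute, relative = accuracy_gain(47.10, 58.37)
print(f"absolute gain: {absolute:.2f} percentage points")
print(f"relative gain: {relative:.1f}%")
```

Under these inputs the gain works out to about 11.27 percentage points, or roughly 23.9% relative to the 47.10% baseline.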

Original language: British English
Title of host publication: Proceedings - 2024 IEEE International Conference on Bioinformatics and Biomedicine, BIBM 2024
Editors: Mario Cannataro, Huiru Zheng, Lin Gao, Jianlin Cheng, Joao Luis de Miranda, Ester Zumpano, Xiaohua Hu, Young-Rae Cho, Taesung Park
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 5409-5417
Number of pages: 9
ISBN (Electronic): 9798350386226
State: Published - 2024
Event: 2024 IEEE International Conference on Bioinformatics and Biomedicine, BIBM 2024 - Lisbon, Portugal
Duration: 3 Dec 2024 to 6 Dec 2024

Publication series

Name: Proceedings - 2024 IEEE International Conference on Bioinformatics and Biomedicine, BIBM 2024

Conference

Conference: 2024 IEEE International Conference on Bioinformatics and Biomedicine, BIBM 2024
Country/Territory: Portugal
City: Lisbon
Period: 3/12/24 to 6/12/24
