TY - GEN
T1 - FalconProtein
T2 - 2024 IEEE International Conference on Bioinformatics and Biomedicine, BIBM 2024
AU - Yang, Wenjun
AU - Shen, Yiqing
AU - Wang, Zehong
AU - Zhao, Rui
AU - Lu, Qitong
AU - Liu, Xinsheng
AU - Liu, Yungeng
AU - Debbah, Merouane
AU - Wang, Yu Guang
AU - Wang, Shir Li
N1 - Publisher Copyright:
© 2024 IEEE.
PY - 2024
Y1 - 2024
N2 - Large Language Models (LLMs) have demonstrated zero-shot generalization capabilities in analyzing and predicting protein properties through natural language interactions. However, existing protein-focused datasets for LLM fine-tuning, such as ProteinLMDataset with its 17.46 billion tokens for pre-training and 893,000 instructions for fine-tuning, face limitations. Specifically, they provide insufficient coverage of protein functional properties, inadequate protein-protein interaction data, and limited integration of contextual information from biomedical literature. To overcome these challenges, we present ProteinPFAIDataset, which integrates data from UniProt and PubMed, comprising 72.8 million tokens for Supervised Fine-Tuning (SFT). ProteinPFAIDataset encompasses important protein characteristics including enzyme activities, molecular functions, pH dependence, tissue specificity, temperature sensitivity, subunit structure, and disease associations. Additionally, we propose a novel knowledge graph-based approach that incorporates over 300,000 biomedical literature entries, providing rich contextual information about protein functions and interactions. To validate the effectiveness of our dataset, we fine-tuned the Falcon2-11B LLM, resulting in a model we call Falcon2-11B-PFAI. The fine-tuned model achieved state-of-the-art performance on ProteinLMBench, improving accuracy from 47.10% to 58.37%. The dataset is available at https://huggingface.co/datasets/xiaorui1/PFAI. The fine-tuned model is available at https://huggingface.co/xiaorui1/PFAI/tree/main.
UR - https://www.scopus.com/pages/publications/85217280232
U2 - 10.1109/BIBM62325.2024.10822514
DO - 10.1109/BIBM62325.2024.10822514
M3 - Conference contribution
AN - SCOPUS:85217280232
T3 - Proceedings - 2024 IEEE International Conference on Bioinformatics and Biomedicine, BIBM 2024
SP - 5409
EP - 5417
BT - Proceedings - 2024 IEEE International Conference on Bioinformatics and Biomedicine, BIBM 2024
A2 - Cannataro, Mario
A2 - Zheng, Huiru
A2 - Gao, Lin
A2 - Cheng, Jianlin
A2 - de Miranda, Joao Luis
A2 - Zumpano, Ester
A2 - Hu, Xiaohua
A2 - Cho, Young-Rae
A2 - Park, Taesung
PB - Institute of Electrical and Electronics Engineers Inc.
Y2 - 3 December 2024 through 6 December 2024
ER -