TY - JOUR
T1 - Domain adaptation of a SMILES chemical transformer to SELFIES with limited computational resources
AU - Alhmoudi, Obaid Khaleifah
AU - Aboushanab, Mahmoud
AU - Veetil, Muhammed Thameem Unnichiram
AU - Elkamel, Ali
AU - AlHammadi, Ali A.
N1 - Publisher Copyright:
© The Author(s) 2025.
PY - 2025/12
Y1 - 2025/12
AB - Accurate molecular property prediction requires input representations that preserve substructural details and maintain syntactic consistency. SMILES (Simplified Molecular Input Line Entry System), while widely used, does not guarantee validity and allows multiple representations of the same compound. SELFIES (Self-Referencing Embedded Strings) addresses these limitations through a robust grammar that ensures structural validity. This study investigates whether a SMILES-pretrained transformer, ChemBERTa-zinc-base-v1, can be adapted to SELFIES using domain-adaptive pretraining without modifying the tokenizer or model architecture. Approximately 700,000 SELFIES-formatted molecules from PubChem were used for adaptation, completed within 12 h on a single NVIDIA A100 GPU. Embedding-level evaluation included t-distributed stochastic neighbor embedding (t-SNE), cosine similarity, and regression on twelve QM9 properties using frozen transformer weights. The domain-adapted model outperformed the original SMILES baseline and slightly outperformed ChemBERTa-77M-MLM across most targets, despite a 100-fold difference in pretraining scale. For downstream evaluation, the model was fine-tuned end-to-end on ESOL, FreeSolv, and Lipophilicity, achieving root mean squared error (RMSE) values of 0.944, 2.511, and 0.746, respectively. These results demonstrate that SELFIES-based adaptation offers a cost-efficient alternative for molecular property prediction without relying on molecular descriptors, 3D features, or large-scale infrastructure.
UR - https://www.scopus.com/pages/publications/105010025491
DO - 10.1038/s41598-025-05017-w
M3 - Article
C2 - 40603899
AN - SCOPUS:105010025491
SN - 2045-2322
VL - 15
JO - Scientific Reports
JF - Scientific Reports
IS - 1
M1 - 23627
ER -