AI-Based Drug Discovery DB: Entry Details
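This DB provides information on the original data of the paper below and on preprocessed data from other papers that have used it.
Rights to the original work belong to the respective researchers and institutions.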
Computing binding affinities is of great importance in the drug discovery pipeline, and predicting them with advanced machine learning methods remains a major challenge because existing datasets and models do not consider the dynamic features of protein-ligand interactions. To this end, we have developed the PLAS-20k dataset, an extension of the previously developed PLAS-5k, with 97,500 independent simulations on a total of 19,500 different protein-ligand complexes (five simulations per complex). Our results show good correlation with the available experimental values and perform better than docking scores. This holds true even for the subset of ligands that follow Lipinski’s rule and for diverse clusters of complex structures, highlighting the value of the PLAS-20k dataset for developing new ML models. The dataset is also more useful than docking for classifying strong and weak binders. Further, the OnionNet model has been retrained on the PLAS-20k dataset and is provided as a baseline for predicting binding affinities. We believe that large-scale MD-based datasets, together with their trajectories, will create new synergy and pave the way for accelerating drug discovery.
Data Curation
In this article, we chose a set of 14,500 complexes from the Protein Data Bank (PDB) [17], expanding upon our previous PLAS-5k dataset [27]. The selection criteria focused on proteins in complex with small molecules (ligands) or peptides.
Dataset Preparation
In the current study, we followed a preprocessing and calculation protocol similar to that of our previous work [27]. A brief account of the methods is given here. The initial structures of the complexes were taken from the PDB [17]. Missing residues in protein chains were modelled as loop regions using UCSF Chimera [28, 29]. The protein chains were then protonated at physiological pH (7.4) using the H++ server [30]. The tleap program of AmberTools [31, 32] was used to build the input files required for the MD simulations of each complex system (protein-ligand, cofactors and crystal water molecules). The crystal waters were modelled using the TIP3P force field [33]. The proteins were modelled using the Amber ff14SB force field [34] in the all-atom model, and the parameters of the ligand and cofactors were taken from the General AMBER Force Field (GAFF2) [35] using the antechamber program [36]. Each complex was solvated in an orthorhombic TIP3P water box with a 10 Å extension from the protein surface. Counter ions were added to maintain the charge neutrality of the system. More detailed information on the dataset preparation is given in our earlier work with 5k complexes [27], and the flowchart for data preparation is shown in Fig. 1.
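The system-building step described above can be sketched as a small Python helper that writes a tleap input script. This is only an illustration under stated assumptions, not the authors' actual script: the function name, the file names, the variable names inside the script, and the choice of Na+/Cl- as counter ions are all hypothetical, and the ligand is assumed to have already been parameterized into a mol2 file and a frcmod file.

```python
def build_tleap_input(protein_pdb, ligand_mol2, ligand_frcmod,
                      out_prefix, buffer_angstrom=10.0):
    """Return the text of a tleap input script for one protein-ligand complex.

    All file names are placeholders supplied by the caller (assumption).
    """
    lines = [
        "source leaprc.protein.ff14SB",    # Amber ff14SB for the protein
        "source leaprc.gaff2",             # GAFF2 for the ligand and cofactors
        "source leaprc.water.tip3p",       # TIP3P water and matching ion parameters
        f"LIG = loadmol2 {ligand_mol2}",   # ligand with charges and atom types
        f"loadamberparams {ligand_frcmod}",  # extra ligand parameters
        f"PROT = loadpdb {protein_pdb}",   # protonated protein, crystal waters included
        "COMPLEX = combine {PROT LIG}",
        # Orthorhombic TIP3P box extending buffer_angstrom beyond the solute
        f"solvateBox COMPLEX TIP3PBOX {buffer_angstrom:.1f}",
        # A count of 0 asks tleap to add just enough ions to neutralize the system;
        # the ion species here are an assumption.
        "addIons COMPLEX Na+ 0",
        "addIons COMPLEX Cl- 0",
        f"saveamberparm COMPLEX {out_prefix}.prmtop {out_prefix}.inpcrd",
        "quit",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    print(build_tleap_input("protein.pdb", "ligand.mol2", "ligand.frcmod", "complex"))
```

The generated text would typically be saved to a file and passed to tleap with its `-f` option. Generating the script from a function makes it straightforward to repeat the same build for thousands of complexes.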
In addition to the dataset release, PLAS-20k is also publicly available at https://healthcare.iiit.ac.in/d4/plas20k/plas20k.html. The list of PDB IDs that are part of PLAS-20k can be downloaded from the website. The PDB ID search icon in the database opens the 3D structure of the selected complex along with its energy components (van der Waals interaction energy, electrostatic energy, and polar and non-polar solvation free energies, together with the binding affinity) computed from the MD trajectories using the MMPBSA method. An example for the HIV-1 protease complex (PDB ID: 1hxw) is shown in Supplementary Figure S1.
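To illustrate how the listed energy components relate to a binding affinity, the following minimal sketch sums the four terms per frame and averages over frames. It assumes the common MMPBSA convention in which the binding affinity is the sum of these four terms with no entropy correction; the text above lists the components but does not state the formula. The class name, function name and all numerical values are hypothetical.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class FrameEnergies:
    """Per-frame MMPBSA energy components in kcal/mol (illustrative values only)."""
    vdw: float                 # van der Waals interaction energy
    electrostatic: float       # electrostatic interaction energy
    polar_solvation: float     # polar solvation free energy
    nonpolar_solvation: float  # non-polar solvation free energy

    def binding(self) -> float:
        # Assumption: binding affinity = sum of the four components.
        return (self.vdw + self.electrostatic
                + self.polar_solvation + self.nonpolar_solvation)


def average_binding(frames):
    """Average the per-frame binding estimate over all supplied frames."""
    return mean(frame.binding() for frame in frames)


if __name__ == "__main__":
    # Made-up numbers, not taken from PLAS-20k.
    frames = [
        FrameEnergies(-40.0, -20.0, 35.0, -5.0),
        FrameEnergies(-42.0, -18.0, 33.0, -5.0),
    ]
    print(average_binding(frames))
```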
Molecular Heterogeneity of PLAS-20k
To characterize how much more diverse PLAS-20k is than PLAS-5k in terms of key molecular properties, we performed a t-SNE (t-distributed stochastic neighbor embedding) analysis of the PLAS-5k and PLAS-20k datasets (Figure S2). The molecular properties, including those used in Lipinski’s rule of five, were derived from the SMILES strings of the ligands. The t-SNE distribution covers more of the sample space for PLAS-20k than for PLAS-5k. This shows that the current results are based on a dataset that is more diverse than its predecessor, PLAS-5k.
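As a small illustration of the Lipinski properties mentioned above, the sketch below counts rule-of-five violations from precomputed descriptors. It does not reproduce the t-SNE analysis itself. The descriptor values are assumed to come from a cheminformatics toolkit applied to the SMILES strings; the function names, the example values and the strict default of zero allowed violations are assumptions (a single violation is often tolerated in practice).

```python
def lipinski_violations(mol_weight, logp, h_donors, h_acceptors):
    """Count how many of the four rule-of-five criteria a ligand violates."""
    checks = [
        mol_weight > 500.0,  # molecular weight above 500 Da
        logp > 5.0,          # octanol-water partition coefficient above 5
        h_donors > 5,        # more than 5 hydrogen-bond donors
        h_acceptors > 10,    # more than 10 hydrogen-bond acceptors
    ]
    return sum(checks)


def follows_lipinski(mol_weight, logp, h_donors, h_acceptors, max_violations=0):
    """Return True if the ligand has at most max_violations violations.

    The threshold is an assumption; the source does not state which one was used.
    """
    count = lipinski_violations(mol_weight, logp, h_donors, h_acceptors)
    return count <= max_violations


if __name__ == "__main__":
    # Hypothetical descriptor values for two ligands.
    print(follows_lipinski(350.0, 2.1, 2, 5))
    print(follows_lipinski(720.0, 6.3, 6, 12))
```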
Overall Structures of the Protein-Ligand Complexes
Although there have been many advances in predicting protein-ligand (PL) binding affinity with machine learning methods, incorporating receptor flexibility remains a major bottleneck. In the present work, we propose a novel dataset of binding affinities of PL complexes obtained from MD simulations. The binding affinities were calculated by considering the flexibility of both protein and ligand. The simulated complexes were validated by calculating the RMSD with respect to the experimental structure. The protein structures were superimposed before calculating the RMSD of the protein and of the ligand. These calculations were performed over 200 frames (40 from each simulation trajectory), and the corresponding distributions are shown in Supplementary Figure S3. The long tails of the protein and ligand RMSD distributions reflect the flexibility of the complex during the simulations.
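The RMSD validation can be sketched in plain Python as follows. The sketch assumes the frame has already been superimposed on the experimental protein structure, so no fitting is done here, and it assumes the 40 frames per trajectory are evenly spaced; the source does not say how they were chosen. Function names and the example coordinates are hypothetical.

```python
import math


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two equally sized lists of (x, y, z).

    Assumes both coordinate sets are already in the same reference frame.
    """
    if not coords_a or len(coords_a) != len(coords_b):
        raise ValueError("coordinate lists must be non-empty and of equal length")
    squared = sum(
        (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
        for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b)
    )
    return math.sqrt(squared / len(coords_a))


def evenly_spaced_frames(n_frames, n_pick=40):
    """Pick n_pick frame indices spread evenly over a trajectory (assumption)."""
    if n_pick > n_frames:
        raise ValueError("cannot pick more frames than the trajectory contains")
    step = n_frames / n_pick
    return [int(i * step) for i in range(n_pick)]


if __name__ == "__main__":
    # Toy coordinates: every atom is shifted by 1 Å along z.
    reference = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
    frame = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)]
    print(rmsd(reference, frame))
    print(len(evenly_spaced_frames(1000)))
```

With five trajectories per complex, picking 40 frames from each gives the 200 frames mentioned above.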