AI-Assisted Drug Discovery DB: Details
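This DB provides information on the original data of the paper concerned and on preprocessed data from other papers that have used it.
Rights to the original work belong to the respective researchers and institutions.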
Background
Three main problems are associated with the virtual screening of bioassay data. The first is access to freely available, curated data; the second is the number of false positives that occur in the physical primary screening process; and the third is that the data are highly imbalanced, with a low ratio of Active to Inactive compounds. This paper first discusses these three problems and then applies a selection of Weka cost-sensitive classifiers (Naive Bayes, SVM, C4.5 and Random Forest) to a variety of bioassay datasets.
Results
Pharmaceutical bioassay data is not readily available to the academic community. The data held at PubChem is not curated, and there is a lack of detailed cross-referencing between Primary and Confirmatory screening assays. Because of this lack of cross-referencing, the analysis of false positives in the primary screening process could only be shallow. In the six cases found, false positives from the High-Throughput Primary screen averaged 64%. For the cost-sensitive classification, Weka's implementations of the Support Vector Machine and the C4.5 decision tree learner performed relatively well. It was also found that the appropriate setting of the Weka cost matrix depends on the base classifier used and not solely on the ratio of class imbalance.
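To make the role of a cost matrix concrete, the following is a minimal Python sketch of a minimum-expected-cost decision for a two-class (Inactive/Active) problem. It is not Weka's code and not the paper's experimental setup: the function name, the cost values, and the convention that rows are the actual class and columns the predicted class are assumptions made for illustration only.

```python
INACTIVE, ACTIVE = 0, 1  # hypothetical class indices used only in this sketch


def min_expected_cost_class(class_probs, cost_matrix):
    """Return the predicted class index with the lowest expected cost.

    class_probs[a]     -- estimated probability that the true class is a
    cost_matrix[a][p]  -- cost of predicting p when the true class is a
                          (rows = actual, columns = predicted; an assumption)
    """
    n = len(class_probs)
    expected = [
        sum(class_probs[a] * cost_matrix[a][p] for a in range(n))
        for p in range(n)
    ]
    return min(range(n), key=lambda p: expected[p])


# Equal costs: every misclassification costs 1.
uniform_costs = [[0, 1],
                 [1, 0]]

# Illustrative skewed costs: missing an Active compound costs 10,
# wrongly flagging an Inactive compound costs 1.
skewed_costs = [[0, 1],
                [10, 0]]

# A compound the base classifier considers only 15% likely to be Active.
probs = [0.85, 0.15]

prediction_uniform = min_expected_cost_class(probs, uniform_costs)
prediction_skewed = min_expected_cost_class(probs, skewed_costs)
```

With equal costs the compound is predicted Inactive, while the skewed matrix flips the prediction to Active because the expected cost of missing an Active (0.15 x 10) exceeds the expected cost of a false alarm (0.85 x 1).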
Conclusions
Understandably, pharmaceutical data is hard to obtain. However, it would benefit both the pharmaceutical industry and academics if curated primary screening data and the corresponding confirmatory data were provided. Two benefits could be gained by applying virtual screening techniques to bioassay data: first, the search space of compounds to be screened could be reduced; second, analysing the false positives that occur in the primary screening process could help improve the technology. The number of false positives arising from primary screening raises the question of whether this type of data should be used for virtual screening. Care is needed when using Weka's cost-sensitive classifiers: across-the-board misclassification costs based on class ratios should not be used when comparing different classifiers on the same dataset.
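The caution about across-the-board costs can be illustrated with a small sketch. It assumes a two-class problem in which a false positive costs 1, so that a false-negative cost c is equivalent to predicting Active whenever the estimated probability exceeds 1 / (1 + c). The two probability lists are invented values standing in for two hypothetical base classifiers scoring the same ten compounds; they are not results from the paper.

```python
def threshold_from_costs(fn_cost, fp_cost=1.0):
    """Probability above which predicting Active has the lower expected cost."""
    return fp_cost / (fp_cost + fn_cost)


def true_positive_rate(active_probs, labels, fn_cost):
    """Fraction of truly Active compounds predicted Active at this cost."""
    t = threshold_from_costs(fn_cost)
    hits = sum(1 for p, y in zip(active_probs, labels) if y and p > t)
    return hits / sum(labels)


def smallest_cost_meeting_tpr(active_probs, labels, target_tpr, candidates):
    """Smallest candidate false-negative cost reaching the target TPR, or None."""
    for cost in sorted(candidates):
        if true_positive_rate(active_probs, labels, cost) >= target_tpr:
            return cost
    return None


# Ten compounds: 2 Active, 8 Inactive (an inactive:active ratio of 4:1).
labels = [True, True] + [False] * 8

# Invented probability-of-Active outputs from two hypothetical classifiers.
probs_a = [0.60, 0.40, 0.30, 0.10, 0.10, 0.05, 0.05, 0.02, 0.02, 0.01]
probs_b = [0.15, 0.10, 0.05, 0.02, 0.02, 0.01, 0.01, 0.01, 0.01, 0.01]

candidates = [1, 2, 4, 8, 16]
cost_a = smallest_cost_meeting_tpr(probs_a, labels, 1.0, candidates)
cost_b = smallest_cost_meeting_tpr(probs_b, labels, 1.0, candidates)

# A ratio-based cost of 4 applied to both classifiers:
tpr_a_ratio = true_positive_rate(probs_a, labels, 4)
tpr_b_ratio = true_positive_rate(probs_b, labels, 4)
```

In this toy setting the first classifier recovers both Active compounds at a cost of 2, whereas the second, which outputs much lower probabilities, needs a cost of 16. Applying the class-ratio cost of 4 to both would make the second classifier look far worse than it is when tuned on its own terms.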
Bioassay Datasets
A variety of datasets were chosen for this study. Unfortunately, due to computer memory limitations (Weka can only utilise 2 gigabytes of heap space on Windows systems), only small to medium datasets were selected. However, the datasets come from the different types of screening that can be performed using HTS technology (both primary and confirmatory screening), and they vary in size and in the proportion of the minority class. In total, 21 datasets were created from the screening data. Table 2 shows a summary of the datasets used for this study. For four of the primary screening bioassays that have corresponding confirmatory results, additional datasets were created in which the false positives from the primary screen are relabelled as Inactive. For the smaller confirmatory bioassay datasets, two types of data representation are used in order to see whether adding more information improves the classification results.
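The relabelling step described above can be sketched as follows. This is an illustration under stated assumptions, not the procedure used to build the published datasets: the compound identifiers, the dictionary layout, and the definition of the false-positive percentage (primary Actives that fail confirmation, as a share of the primary Actives that were retested) are all hypothetical.

```python
def relabel_false_positives(primary, confirmatory):
    """Return a copy of the primary labels with false positives set to Inactive.

    primary      -- dict: compound id -> "Active" or "Inactive" (primary screen)
    confirmatory -- dict: compound id -> label, only for retested compounds
    """
    relabelled = dict(primary)
    for cid, label in primary.items():
        if label == "Active" and confirmatory.get(cid) == "Inactive":
            relabelled[cid] = "Inactive"
    return relabelled


def false_positive_percentage(primary, confirmatory):
    """Percentage of retested primary Actives that were not confirmed."""
    retested = [cid for cid, label in primary.items()
                if label == "Active" and cid in confirmatory]
    if not retested:
        return 0.0
    failed = sum(1 for cid in retested if confirmatory[cid] == "Inactive")
    return 100.0 * failed / len(retested)


# Hypothetical toy data: four primary hits were retested, one was confirmed.
primary = {"c1": "Active", "c2": "Active", "c3": "Active",
           "c4": "Inactive", "c5": "Active"}
confirmatory = {"c1": "Active", "c2": "Inactive",
                "c3": "Inactive", "c5": "Inactive"}

relabelled = relabel_false_positives(primary, confirmatory)
fp_percent = false_positive_percentage(primary, confirmatory)
```

Compounds that were never retested keep their primary label in this sketch; whether that is appropriate depends on how the confirmatory assay selected its compounds.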