Cross-validation Test (cross-validation + test)
Selected Abstracts

Prediction of protein structural class by amino acid and polypeptide composition
FEBS JOURNAL, Issue 17 2002
Rui-yan Luo

A new approach to predicting the structural classes of protein domain sequences is presented in this paper. Besides the amino acid composition, the compositions of several dipeptides, tripeptides, tetrapeptides, pentapeptides and hexapeptides are taken into account, based on stepwise discriminant analysis. The jackknife test shows that this new approach can lead to higher predictive sensitivity and specificity for datasets with reduced sequence similarity. For the dataset PDB40-B constructed by Brenner and colleagues, 75.2% of protein domain sequences are correctly assigned in the jackknife test to the four structural classes (all-α, all-β, α/β and α + β), an improvement of 19.4% in the jackknife test and 25.5% in the resubstitution test over the component-coupled algorithm using amino acid composition alone (AAC approach) on the same dataset. In the cross-validation test with dataset PDB40-J constructed by Park and colleagues, more than 80% predictive accuracy is obtained. Furthermore, for the dataset constructed by Chou and Maggiona, accuracies of 100% and 99.7% can be easily achieved in the resubstitution test and the jackknife test, respectively, taking only the composition of dipeptides into account. Therefore, this new method provides an effective tool to extract valuable information from protein sequences, which can be used for the systematic analysis of small or medium size protein sequences. The computer programs used in this paper are available on request.

Winter diatom blooms in a regulated river in South Korea: explanations based on evolutionary computation
FRESHWATER BIOLOGY, Issue 10 2007
DONG-KYUN KIM

Summary
1.
An ecological model was developed using genetic programming (GP) to predict the time-series dynamics of the diatom Stephanodiscus hantzschii in the lower Nakdong River, South Korea. Eight years of weekly data showed the river to be hypertrophic (chl. a, 45.1 ± 4.19 µg L⁻¹, mean ± SE, n = 427), and S. hantzschii annually formed blooms during the winter to spring flow period (late November to March).
2. A simple non-linear equation was created to produce a 3-day sequential forecast of the species biovolume, by means of time series optimization genetic programming (TSOGP). Training data were used in conjunction with a GP algorithm utilizing 7 years of limnological variables (1995-2001). The model was validated by comparing its output with measurements for a specific year with severe blooms (1994). The model accurately predicted the timing of the blooms although it slightly underestimated biovolume (training r2 = 0.70, test r2 = 0.78). The model consisted of the following variables: dam discharge and storage, water temperature, Secchi transparency, dissolved oxygen (DO), pH, evaporation and silica concentration.
3. The application of a five-way cross-validation test suggested that GP was capable of developing models whose input variables were similar, even though the data used for training were selected at random. The similarity of input variable selection was approximately 51% between the best model and the top 20 candidate models out of 150 in total (based on both root mean squared error and the coefficients of determination for the test data).
4. Genetic programming was able to determine the ecological importance of different environmental variables affecting the diatoms. A series of sensitivity analyses showed that water temperature was the most sensitive parameter. In addition, the optimal equation was sensitive to DO, Secchi transparency, dam discharge and silica concentration.
The analyses thus identified likely causes of the proliferation of diatoms in 'river-reservoir hybrids' (i.e. rivers which have the characteristics of a reservoir during the dry season). This result provides specific information about the bloom of S. hantzschii in river systems, as well as about the applicability of inductive methods, such as evolutionary computation, to river-reservoir hybrid systems.

Applying pattern recognition methods plus quantum and physico-chemical molecular descriptors to analyze the anabolic activity of structurally diverse steroids
JOURNAL OF COMPUTATIONAL CHEMISTRY, Issue 3 2008
Yoanna María Alvarez-Ginarte

Abstract
The great cost associated with the development of new anabolic-androgenic steroids (AASs) makes necessary the development of computational methods that shorten the drug discovery pipeline. Toward this end, quantum and physicochemical molecular descriptors, plus linear discriminant analysis (LDA), were used to analyze the anabolic/androgenic activity of structurally diverse steroids, to discover novel AASs, and to give a structural interpretation of their anabolic-androgenic ratio (AAR). The obtained models correctly classify 91.67% of the AASs in the training set and 86.27% in the test set. The results of predictions on the 10% full-out cross-validation test also provide evidence of the robustness of the obtained model. Moreover, these classification functions are applied to an "in house" library of chemicals to find novel AASs. Two new AASs are synthesized and tested for in vivo activity. Although both AASs are less active than some commercial AASs, this result leaves a door open to a virtual variational study of the structure of the two compounds to improve their biological activity. The LDA-assisted QSAR models presented here could significantly reduce the number of synthesized and tested AASs and could increase the chance of finding new chemical entities with higher AAR.
© 2007 Wiley Periodicals, Inc. J Comput Chem, 2008

Predicting P-glycoprotein substrates by a quantitative structure-activity relationship model
JOURNAL OF PHARMACEUTICAL SCIENCES, Issue 4 2004
Vijay K. Gombar

Abstract
A quantitative structure-activity relationship (QSAR) model has been developed to predict whether a given compound is a P-glycoprotein (Pgp) substrate or not. The training set consisted of 95 compounds classified as substrates or non-substrates based on the results from in vitro monolayer efflux assays. The two-group linear discriminant model uses 27 statistically significant, information-rich structure quantifiers to compute the probability that a given structure is a Pgp substrate. Analysis of the descriptors revealed that the ability to partition into membranes, molecular bulk, and the counts and electrotopological values of certain isolated and bonded hydrides are important structural attributes of substrates. The model fits the data with a sensitivity of 100% and a specificity of 90.6% in the jackknifed cross-validation test. A prediction accuracy of 86.2% was obtained on a test set of 58 compounds. Examination of the eight "mispredicted" compounds revealed two distinct categories. Five mispredictions were explained by experimental limitations of the efflux assay; these compounds had high permeability and/or were inhibitors of calcein-AM transport. Three mispredictions were due to limitations of the chemical space covered by the current model. The Pgp QSAR model provides an in silico screen to aid in compound selection and in vitro efflux assay prioritization. © 2004 Wiley-Liss, Inc.
and the American Pharmacists Association J Pharm Sci 93: 957-968, 2004

Identification of protein-coding genes in the genome of Vibrio cholerae with more than 98% accuracy using occurrence frequencies of single nucleotides
FEBS JOURNAL, Issue 15 2001
Ju Wang

The published sequence of the Vibrio cholerae genome indicates that, in addition to the genes that encode proteins of known and unknown function, there are 1577 ORFs identified as conserved hypothetical or hypothetical gene candidates. Because the annotation is not 100% accurate, it is not known which of the 1577 ORFs are true protein-coding genes. In this paper, an algorithm based on the Z curve method, with sensitivity, specificity and accuracy greater than 98%, is used to solve this problem. Twenty-fold cross-validation tests show that the accuracy of the algorithm is 98.8%. A detailed discussion of the mechanism of the algorithm is also presented. It was found that 172 of the 1577 ORFs are unlikely to be protein-coding genes. The number of protein-coding genes in the V. cholerae genome was re-estimated and found to be ~3716. This result should be of use in microarray analysis of gene expression in the genome, because the cost of preparing chips may be somewhat decreased. A computer program was written to calculate a coding score called VCZ for gene identification in the genome. An ORF is classified as coding if VCZ > 0 and as noncoding if VCZ < 0. The program is freely available on request for academic use.

Using support vector machines for prediction of protein structural classes based on discrete wavelet transform
JOURNAL OF COMPUTATIONAL CHEMISTRY, Issue 8 2009
Jian-Ding Qiu

Abstract
The prediction of secondary structure is a fundamental and important component in the analytical study of protein structure and functions. How to improve the predictive accuracy of protein structural classification by effectively incorporating the sequence-order effects is an important and challenging problem.
In this study, a new method, in which a support vector machine is combined with the discrete wavelet transform, is developed to predict protein structural classes. Its performance is assessed by cross-validation tests. The predicted results show that the proposed approach can remarkably improve the success rates, and might become a useful tool for predicting other attributes of proteins as well. © 2008 Wiley Periodicals, Inc. J Comput Chem 2009

Satyrinae butterflies from Sardinia and Corsica show a kaleidoscopic intraspecific biogeography (Lepidoptera, Nymphalidae)
BIOLOGICAL JOURNAL OF THE LINNEAN SOCIETY, Issue 1 2010
LEONARDO DAPPORTO

The Mediterranean islands of Sardinia and Corsica are known for their multitude of endemics. Butterflies in particular have received much attention. However, no comprehensive studies aiming to compare populations of butterflies from Sardinia and Corsica with those from the neighbouring mainland and Sicily have been carried out. In the present study, the eleven Satyrinae species inhabiting Sardinia and Corsica were examined and compared with continental and Sicilian populations by means of geometric morphometrics of male genitalia. Relative warp computation, discriminant analyses, hierarchical clustering, and cross-validation tests were used to identify coherent distributional patterns including both island and mainland populations. The eleven species showed multifaceted distributional patterns, although three main conclusions can be drawn: (1) populations from North Africa and Spain are generally different from those belonging to the Italian Peninsula; (2) populations from Sardinia and Sicily often resemble the North Africa/Spain ones; Corsica shows transitional populations similar to those from France; and (3) sea barriers represent filters to dispersal, although their efficacy appears to be unrelated to their extension.
Indeed, the short sea straits between Sardinia and Corsica and between Sicily and the Italian Peninsula proved highly effective at preventing faunal exchanges, whereas populations facing the sea channels between Corsica and Northern Italy and between Sicily and Tunisia showed a higher similarity. A comparison of island and mainland distributions of the eleven taxa has helped to unravel the complex co-occurrence of historical factors, refugial dynamics, and recent (post-glacial) dispersal with respect to shaping the populations of Mediterranean island butterflies. © 2010 The Linnean Society of London, Biological Journal of the Linnean Society, 2010, 100, 195-212.
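
The abstracts on this page report k-fold cross-validation tests (five-way, twenty-fold) and jackknife tests, often together with sensitivity and specificity. The following is a minimal Python sketch of how such a test can be organised; it is not the method of any of the papers listed. The nearest-centroid classifier, the function names (k_fold_indices, fit_centroids, predict, cross_validate) and the demo data are all hypothetical stand-ins chosen for illustration. The sketch assumes two classes labelled 1 (positive) and 0 (negative), the common definitions sensitivity = TP / (TP + FN) and specificity = TN / (TN + FP), and that the jackknife test means leave-one-out, i.e. k equal to the number of samples.

```python
import random
from typing import Dict, List, Sequence, Tuple

# (feature vector, label), with label 1 = positive and 0 = negative (assumed encoding)
Sample = Tuple[Sequence[float], int]


def k_fold_indices(n: int, k: int, seed: int = 0) -> List[List[int]]:
    """Shuffle the indices 0..n-1 and deal them into k nearly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def fit_centroids(train: List[Sample]) -> Dict[int, List[float]]:
    """Toy stand-in classifier: store the mean feature vector of each class."""
    sums: Dict[int, List[float]] = {}
    counts: Dict[int, int] = {}
    for x, y in train:
        acc = sums.setdefault(y, [0.0] * len(x))
        for j, v in enumerate(x):
            acc[j] += v
        counts[y] = counts.get(y, 0) + 1
    return {y: [s / counts[y] for s in acc] for y, acc in sums.items()}


def predict(model: Dict[int, List[float]], x: Sequence[float]) -> int:
    """Assign x to the class whose centroid is nearest (squared Euclidean distance)."""
    return min(model, key=lambda y: sum((a - b) ** 2 for a, b in zip(model[y], x)))


def cross_validate(data: List[Sample], k: int, seed: int = 0) -> Dict[str, float]:
    """k-fold cross-validation; k == len(data) gives the jackknife (leave-one-out) test."""
    tp = tn = fp = fn = 0
    for fold in k_fold_indices(len(data), k, seed):
        held_out = set(fold)
        # Train only on samples outside the current fold.
        train = [s for i, s in enumerate(data) if i not in held_out]
        model = fit_centroids(train)
        # Score only the held-out samples.
        for i in fold:
            x, y = data[i]
            p = predict(model, x)
            if y == 1:
                if p == 1:
                    tp += 1
                else:
                    fn += 1
            else:
                if p == 0:
                    tn += 1
                else:
                    fp += 1
    return {
        "accuracy": (tp + tn) / len(data),
        "sensitivity": tp / (tp + fn) if tp + fn else float("nan"),
        "specificity": tn / (tn + fp) if tn + fp else float("nan"),
    }


# Hypothetical, well-separated demo data: five positives and five negatives.
demo = [([1.0 + 0.1 * i, 1.0], 1) for i in range(5)] + \
       [([-1.0 - 0.1 * i, -1.0], 0) for i in range(5)]

print("five-fold:", cross_validate(demo, k=5))
print("jackknife:", cross_validate(demo, k=len(demo)))
```

A resubstitution test, by contrast, trains and scores on the same samples, which is why the abstracts above report it separately from the jackknife figure: it tends to be the more optimistic of the two.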