Machine Learning Methods

Selected Abstracts


Modeling and predicting binding affinity of phencyclidine-like compounds using machine learning methods

JOURNAL OF CHEMOMETRICS, Issue 1 2010
Ozlem Erdas
Abstract Machine learning methods have long shown promise in science and engineering, and their use in chemistry and drug design has advanced markedly since the 1990s. In this study, molecular electrostatic potential (MEP) surfaces of phencyclidine-like (PCP-like) compounds are modeled and visualized in order to extract features that are useful in predicting binding affinities. In the modeling step, the Cartesian coordinates of MEP surface points are mapped onto a spherical self-organizing map (SSOM). The resulting maps are visualized using electrostatic potential (ESP) values, and these values also provide the features for a prediction system. Support vector machines and the partial least-squares method are used for predicting the binding affinities of compounds. Copyright © 2009 John Wiley & Sons, Ltd. [source]
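
As a rough Python sketch of the prediction stage only (not the authors' code), assume X holds ESP feature vectors extracted from the SSOM and y the measured binding affinities; both are random placeholders here:

    # Prediction stage only: support vector regression and partial least
    # squares fitted on ESP-derived features. The spherical SOM feature
    # extraction is assumed to have produced X already.
    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.svm import SVR
    from sklearn.cross_decomposition import PLSRegression
    from sklearn.metrics import r2_score

    rng = np.random.default_rng(0)
    X = rng.normal(size=(60, 128))    # stand-in for ESP feature vectors
    y = rng.normal(size=60)           # stand-in for binding affinities

    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

    svr = SVR(kernel="rbf", C=10.0).fit(X_tr, y_tr)
    pls = PLSRegression(n_components=5).fit(X_tr, y_tr)

    print("SVR R^2:", r2_score(y_te, svr.predict(X_te)))
    print("PLS R^2:", r2_score(y_te, pls.predict(X_te).ravel()))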


Sortal anaphora resolution in MEDLINE abstracts

COMPUTATIONAL INTELLIGENCE, Issue 1 2007
Manabu Torii
This paper reports our investigation of machine learning methods applied to anaphora resolution for biology texts, particularly paper abstracts. Our primary concern is the investigation of features and their combinations for effective anaphora resolution. In this paper, we focus on the resolution of demonstrative phrases and definite determiner phrases, the two most prevalent forms of anaphoric expressions that we find in biology research articles. Different resolution models are developed for demonstrative and definite determiner phrases. Our work shows that models may be optimized differently for each of the phrase types. Also, because a significant number of definite determiner phrases are not anaphoric, we induce a model to detect anaphoricity, i.e., a model that classifies phrases as either anaphoric or nonanaphoric. We propose several novel features that we call highlighting features, and consider their utility particularly for processing paper abstracts. The system using the highlighting features achieved accuracies of 78% and 71% for demonstrative phrases and definite determiner phrases, respectively. The use of the highlighting features reduced the error rate by about 10%. [source]
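
A minimal sketch of the anaphoricity-detection step as a binary classifier; the features and values are illustrative stand-ins, not the paper's feature set:

    # Classify definite determiner phrases as anaphoric vs. non-anaphoric.
    from sklearn.feature_extraction import DictVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    train = [
        ({"head_noun": "protein", "has_prior_mention": True,  "sentence_pos": 3}, 1),
        ({"head_noun": "cell",    "has_prior_mention": False, "sentence_pos": 1}, 0),
        ({"head_noun": "gene",    "has_prior_mention": True,  "sentence_pos": 5}, 1),
        ({"head_noun": "results", "has_prior_mention": False, "sentence_pos": 1}, 0),
    ]
    X, y = zip(*train)

    clf = make_pipeline(DictVectorizer(), LogisticRegression())
    clf.fit(X, y)

    phrase = {"head_noun": "protein", "has_prior_mention": True, "sentence_pos": 4}
    print("anaphoric" if clf.predict([phrase])[0] else "non-anaphoric")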


Implicit Surface Modelling with a Globally Regularised Basis of Compact Support

COMPUTER GRAPHICS FORUM, Issue 3 2006
C. Walder
We consider the problem of constructing a globally smooth analytic function that represents a surface implicitly by way of its zero set, given sample points with surface normal vectors. The contributions of the paper include a novel means of regularising multi-scale compactly supported basis functions that leads to the desirable interpolation properties previously only associated with fully supported bases. We also provide a regularisation framework for simpler and more direct treatment of surface normals, along with a corresponding generalisation of the representer theorem lying at the core of kernel-based machine learning methods. We demonstrate the techniques on 3D problems of up to 14 million data points, as well as 4D time series data and four-dimensional interpolation between three-dimensional shapes. Categories and Subject Descriptors (according to ACM CCS): I.3.5 [Computer Graphics]: Curve, surface, solid, and object representations [source]
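
For orientation, a much-simplified sketch of implicit surface fitting with a compactly supported radial basis (a Wendland C2 kernel on a toy 2D "surface"), not the paper's regularised multi-scale scheme; the off-surface constraint construction shown here is a standard one:

    # f = 0 at surface samples and f = +/- eps at points offset along the
    # normals, solved as a (ridge-stabilised) linear system.
    import numpy as np

    def wendland_c2(r, support=1.0):
        # Compactly supported Wendland C2 kernel, zero for r >= support.
        s = np.clip(r / support, 0.0, 1.0)
        return (1.0 - s) ** 4 * (4.0 * s + 1.0)

    rng = np.random.default_rng(0)
    theta = rng.uniform(0, 2 * np.pi, 40)      # toy 2D "surface": unit circle
    pts = np.c_[np.cos(theta), np.sin(theta)]
    normals = pts.copy()                       # circle normals point outward

    eps = 0.05
    centers = np.vstack([pts, pts + eps * normals, pts - eps * normals])
    targets = np.hstack([np.zeros(len(pts)), np.full(len(pts), eps),
                         np.full(len(pts), -eps)])

    dists = np.linalg.norm(centers[:, None] - centers[None, :], axis=-1)
    K = wendland_c2(dists, support=1.5)
    alpha = np.linalg.solve(K + 1e-8 * np.eye(len(K)), targets)

    def f(x):
        # Implicit function; its zero set approximates the surface.
        r = np.linalg.norm(centers - x, axis=-1)
        return wendland_c2(r, support=1.5) @ alpha

    print(f(np.array([1.0, 0.0])))   # close to 0 near the surface
    print(f(np.array([0.0, 0.0])))   # nonzero away from the surface

The points offset along the normals pin the function to signed values on either side of the samples, which is what makes the zero set a surface rather than the trivial solution f = 0.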


Acquiring knowledge with limited experience

EXPERT SYSTEMS, Issue 3 2007
Der-Chiang Li
Abstract: Computational learning theory shows that sample size affects learning performance. Because only a few samples can be obtained in the early stages of a system, and fewer exemplars usually lead to lower learning accuracy, this research compares the classification accuracies of different machine learning methods in order to improve small-data-set learning. The techniques examined include the mega-trend diffusion technique, a backpropagation neural network, a support vector machine, and decision trees, applied to two real medical data sets concerning cancer. The experimental results show that the mega-trend diffusion technique and the backpropagation approach are effective methods for small-data-set learning. [source]
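
A small cross-validation harness in the spirit of the comparison, using scikit-learn stand-ins; the mega-trend diffusion step is not reproduced, and the data set and sample cap are illustrative:

    # Compare learners by cross-validated accuracy on a deliberately tiny
    # sample, mimicking the early-stage, small-data-set setting.
    from sklearn.datasets import load_breast_cancer
    from sklearn.model_selection import cross_val_score
    from sklearn.neural_network import MLPClassifier
    from sklearn.svm import SVC
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.preprocessing import StandardScaler
    from sklearn.pipeline import make_pipeline
    from sklearn.utils import shuffle

    X, y = load_breast_cancer(return_X_y=True)
    X, y = shuffle(X, y, random_state=0)
    X, y = X[:40], y[:40]              # cap the training pool at 40 samples

    models = {
        "backprop net":  make_pipeline(StandardScaler(),
                                       MLPClassifier(max_iter=2000, random_state=0)),
        "SVM":           make_pipeline(StandardScaler(), SVC()),
        "decision tree": DecisionTreeClassifier(random_state=0),
    }
    for name, model in models.items():
        scores = cross_val_score(model, X, y, cv=5)
        print(f"{name}: {scores.mean():.3f}")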


Flexible constraints for regularization in learning from data

INTERNATIONAL JOURNAL OF INTELLIGENT SYSTEMS, Issue 6 2004
Eyke Hüllermeier
By its very nature, inductive inference performed by machine learning methods is mainly data driven. Still, the incorporation of background knowledge, if available, can make inductive inference more efficient and improve the quality of induced models. Fuzzy set-based modeling techniques provide a convenient tool for making expert knowledge accessible to computational methods. In this article, we exploit such techniques within the regularization (penalization) framework of inductive learning. The basic idea is to express knowledge about an underlying data-generating process in terms of flexible constraints and to penalize those models violating these constraints. An optimal model is one that achieves an optimal trade-off between fitting the data and satisfying the constraints. © 2004 Wiley Periodicals, Inc. [source]
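
The penalization idea can be made concrete with a toy example: fit a curve to data while softly penalizing violations of a background-knowledge constraint (here, "the underlying function is non-decreasing"); the constraint, model class, and lambda are illustrative choices, not taken from the article:

    # Minimise: squared fitting error + lambda * constraint violation.
    import numpy as np
    from scipy.optimize import minimize

    rng = np.random.default_rng(0)
    x = np.linspace(0, 1, 30)
    y = x + 0.3 * rng.normal(size=30)   # noisy samples of an increasing trend

    def model(coeffs, x):
        return np.polyval(coeffs, x)

    def objective(coeffs, lam=10.0):
        fit = np.sum((y - model(coeffs, x)) ** 2)
        grid = np.linspace(0, 1, 100)
        slope = np.gradient(model(coeffs, grid), grid)
        violation = np.sum(np.minimum(slope, 0.0) ** 2)  # penalise decreasing parts
        return fit + lam * violation

    res = minimize(objective, x0=np.zeros(6))   # degree-5 polynomial
    print(res.x)

With lam = 0 this reduces to an ordinary least-squares fit; increasing lam trades data fidelity for constraint satisfaction, exactly the trade-off the abstract describes.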


In silico prediction and screening of γ-secretase inhibitors by molecular descriptors and machine learning methods

JOURNAL OF COMPUTATIONAL CHEMISTRY, Issue 6 2010
Xue-Gang Yang
Abstract γ-Secretase inhibitors have been explored for the prevention and treatment of Alzheimer's disease (AD). Methods for the prediction and screening of γ-secretase inhibitors are highly desirable for facilitating the design of novel therapeutic agents against AD, especially given the incomplete knowledge of the mechanism and three-dimensional structure of γ-secretase. We explored two machine learning methods, support vector machine (SVM) and random forest (RF), to develop models for predicting γ-secretase inhibitors of diverse structures. Quantitative analysis of the receiver operating characteristic (ROC) curve was performed to further examine and optimize the models. In particular, the Youden index (YI) was introduced into the ROC analysis of the RF model to obtain an optimal probability threshold for prediction. The developed models were validated on an external testing set, with prediction accuracies of 96.48% (SVM) and 98.83% (RF) for γ-secretase inhibitors, and 98.18% (SVM) and 99.27% (RF) for noninhibitors. Different feature selection methods were used to extract the physicochemical features most relevant to γ-secretase inhibition. To the best of our knowledge, the RF model developed in this work is the first with a broad applicability domain; based on it, virtual screening of γ-secretase inhibitors against the ZINC database was performed, resulting in 368 potential hit candidates. © 2009 Wiley Periodicals, Inc. J Comput Chem, 2010 [source]
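
The Youden-index step has a direct expression in code; a sketch on synthetic data (the classifier settings and data are placeholders):

    # Pick the probability threshold maximising J = sensitivity + specificity - 1
    # on the ROC curve of a random forest's predicted probabilities.
    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import roc_curve

    X, y = make_classification(n_samples=600, weights=[0.7], random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

    rf = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
    proba = rf.predict_proba(X_te)[:, 1]

    fpr, tpr, thresholds = roc_curve(y_te, proba)
    j = tpr - fpr                       # Youden index at each candidate threshold
    best = thresholds[np.argmax(j)]
    print(f"optimal threshold: {best:.2f}  (J = {j.max():.2f})")

The threshold maximizing J balances sensitivity and specificity rather than defaulting to 0.5, which matters on imbalanced screening data.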


Identification of small molecule aggregators from large compound libraries by support vector machines

JOURNAL OF COMPUTATIONAL CHEMISTRY, Issue 4 2010
Hanbing Rao
Abstract Small molecule aggregators non-specifically inhibit multiple unrelated proteins, rendering them therapeutically useless. They frequently appear as false hits and thus need to be eliminated in high-throughput screening campaigns. Computational methods have been explored for identifying aggregators, but they have not been tested in screening large compound libraries. We used 1319 aggregators and 128,325 non-aggregators to develop a support vector machines (SVM) aggregator identification model, which was tested by four methods. The first is fivefold cross-validation, which showed a comparable aggregator identification rate and a significantly improved non-aggregator identification rate compared with earlier studies. The second is an independent test on 17 aggregators discovered separately from the training aggregators, 71% of which were correctly identified. The third is retrospective screening of 13M PUBCHEM and 168K MDDR compounds, which predicted 97.9% and 98.7% of the PUBCHEM and MDDR compounds, respectively, as non-aggregators. The fourth is retrospective screening of 5527 MDDR compounds similar to the known aggregators, 1.14% of which were predicted as aggregators. SVM showed slightly better overall performance than two other machine learning methods in fivefold cross-validation studies with the same settings. Molecular features of aggregation, extracted by a feature selection method, are consistent with published profiles. SVM showed substantial capability in identifying aggregators from large libraries at low false-hit rates. © 2009 Wiley Periodicals, Inc. J Comput Chem, 2010 [source]
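
A sketch of the imbalanced-screening setting with an SVM, where the roughly 1:50 class ratio echoes the aggregator/non-aggregator imbalance; the "descriptors" are random placeholders, not molecular features:

    import numpy as np
    from sklearn.svm import SVC
    from sklearn.model_selection import cross_val_predict
    from sklearn.metrics import confusion_matrix

    rng = np.random.default_rng(0)
    n_agg, n_non = 100, 5000           # rare positives, as in aggregator data
    X = np.vstack([rng.normal(0.5, 1.0, (n_agg, 20)),
                   rng.normal(0.0, 1.0, (n_non, 20))])
    y = np.hstack([np.ones(n_agg), np.zeros(n_non)])

    # class_weight='balanced' keeps the rare aggregator class from being ignored
    svm = SVC(class_weight="balanced")
    pred = cross_val_predict(svm, X, y, cv=5)
    tn, fp, fn, tp = confusion_matrix(y, pred).ravel()
    print(f"aggregator recall: {tp/(tp+fn):.2f}, false-hit rate: {fp/(fp+tn):.4f}")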


Machine learning approaches for predicting compounds that interact with therapeutic and ADMET related proteins

JOURNAL OF PHARMACEUTICAL SCIENCES, Issue 11 2007
H. Li
Abstract Computational methods for predicting compounds with specific pharmacodynamic and ADMET (absorption, distribution, metabolism, excretion and toxicity) properties are useful for facilitating drug discovery and evaluation. Recently, machine learning methods such as neural networks and support vector machines have been explored for predicting inhibitors, antagonists, blockers, agonists, activators and substrates of proteins related to specific therapeutic and ADMET properties. These methods are particularly useful for compounds of diverse structures, complementing QSAR methods, and for cases where the receptor's 3D structure is unavailable, complementing structure-based methods. A number of studies have demonstrated the potential of these methods for predicting such compounds as substrates of P-glycoprotein and cytochrome P450 CYP isoenzymes, inhibitors of protein kinases and CYP isoenzymes, and agonists of the serotonin receptor and estrogen receptor. This article reviews the strategies, current progress and underlying difficulties in using machine learning methods for predicting these protein binders and as potential virtual screening tools. Algorithms for proper representation of the structural and physicochemical properties of compounds are also evaluated. © 2007 Wiley-Liss, Inc. and the American Pharmacists Association J Pharm Sci 96: 2838-2860, 2007 [source]
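
On compound representation, a sketch of a physicochemical descriptor vector computed with the open-source RDKit toolkit; RDKit is an assumption here, as the reviewed studies used a variety of descriptor packages:

    from rdkit import Chem
    from rdkit.Chem import Descriptors, Crippen

    def featurize(smiles):
        # Physicochemical descriptor vector for one compound.
        mol = Chem.MolFromSmiles(smiles)
        return [
            Descriptors.MolWt(mol),             # molecular weight
            Crippen.MolLogP(mol),               # lipophilicity estimate
            Descriptors.TPSA(mol),              # topological polar surface area
            Descriptors.NumRotatableBonds(mol), # flexibility
        ]

    print(featurize("CC(=O)Oc1ccccc1C(=O)O"))   # aspirin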


Predicting project delivery rates using the Naive-Bayes classifier

JOURNAL OF SOFTWARE MAINTENANCE AND EVOLUTION: RESEARCH AND PRACTICE, Issue 3 2002
B. Stewart
Abstract The importance of accurate estimation of software development effort is well recognized in software engineering. In recent years, machine learning approaches have been studied as possible alternatives to more traditional software cost estimation methods. The objective of this paper is to investigate the utility of the machine learning algorithm known as the Naive-Bayes classifier for estimating software project effort. We present empirical experiments with the Benchmark 6 data set from the International Software Benchmarking Standards Group to estimate project delivery rates, and compare the performance of the Naive-Bayes approach to two other machine learning methods: model trees and neural networks. A project delivery rate is defined as the number of effort hours per function point. The approach described is general and can be used to analyse not only software development data but also software maintenance and other types of software engineering data. The paper demonstrates that the Naive-Bayes classifier has the potential to be used as an alternative machine learning tool for software development effort estimation. Copyright © 2002 John Wiley & Sons, Ltd. [source]
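
A sketch of the paper's setup with scikit-learn's Gaussian Naive Bayes: the delivery rate (effort hours per function point) is discretised into classes and predicted from project features; all values below are invented placeholders, not Benchmark 6 data:

    import numpy as np
    from sklearn.naive_bayes import GaussianNB

    # columns: team size, function points, platform code (illustrative features)
    X = np.array([[4, 120, 0], [12, 900, 1], [6, 300, 0],
                  [20, 1500, 1], [3, 80, 0], [9, 600, 1]])
    effort_hours = np.array([600, 8100, 1800, 18000, 320, 4800])
    function_pts = X[:, 1]

    rate = effort_hours / function_pts            # delivery rate per project
    classes = np.digitize(rate, bins=[5.0, 9.0])  # low / medium / high rate

    clf = GaussianNB().fit(X, classes)
    print(clf.predict([[8, 400, 1]]))             # predicted rate class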


Positional effects on citation and readership in arXiv

JOURNAL OF THE AMERICAN SOCIETY FOR INFORMATION SCIENCE AND TECHNOLOGY, Issue 11 2009
Asif-ul Haque
arXiv.org mediates contact with the literature for entire scholarly communities, providing both archival access and daily email and web announcements of new materials. We confirm and extend a surprising correlation between an article's position in these initial announcements and its later citation impact, due primarily to intentional "self-promotion" by authors. There is, however, also a pure "visibility" effect: the subset of articles accidentally placed in early positions fared measurably better in the long-term citation record. Articles in astrophysics (astro-ph) and two large subcommunities of theoretical high energy physics (hep-th and hep-ph) announced in position 1, for example, received median numbers of citations 83%, 50%, and 100% higher than those lower down, while the subsets accidentally in those positions had 44%, 38%, and 71% visibility boosts. We also consider the positional effects on early readership: the median numbers of early full-text downloads for astro-ph, hep-th, and hep-ph articles announced in position 1 were 82%, 61%, and 58% higher than for lower positions, respectively, and those placed there accidentally had medians boosted by 53%, 44%, and 46%. Finally, we correlate a variety of readership features with long-term citations, using machine learning methods, and conclude with some observations on impact metrics and the dangers of recommender mechanisms. [source]


Computational methods in authorship attribution

JOURNAL OF THE AMERICAN SOCIETY FOR INFORMATION SCIENCE AND TECHNOLOGY, Issue 1 2009
Moshe Koppel
Statistical authorship attribution has a long history, culminating in the use of modern machine learning classification methods. Nevertheless, most of this work suffers from the limitation of assuming a small closed set of candidate authors and essentially unlimited training text for each. Real-life authorship attribution problems, however, typically fall short of this ideal. Thus, following detailed discussion of previous work, three scenarios are considered here for which solutions to the basic attribution problem are inadequate. In the first variant, the profiling problem, there is no candidate set at all; in this case, the challenge is to provide as much demographic or psychological information as possible about the author. In the second variant, the needle-in-a-haystack problem, there are many thousands of candidates for each of whom we might have a very limited writing sample. In the third variant, the verification problem, there is no closed candidate set but there is one suspect; in this case, the challenge is to determine if the suspect is or is not the author. For each variant, it is shown how machine learning methods can be adapted to handle the special challenges of that variant. [source]
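
For reference, the basic closed-set attribution setting the paper starts from can be sketched with character n-gram features and a linear classifier; the texts are toy placeholders:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.svm import LinearSVC
    from sklearn.pipeline import make_pipeline

    texts = ["the results clearly indicate that ...",
             "we argue, in what follows, that ...",
             "the results further indicate that ...",
             "we will argue, briefly, that ..."]
    authors = ["A", "B", "A", "B"]

    clf = make_pipeline(
        TfidfVectorizer(analyzer="char", ngram_range=(2, 4)),  # style-bearing n-grams
        LinearSVC(),
    )
    clf.fit(texts, authors)
    print(clf.predict(["the data clearly indicate that ..."]))

The three variants the paper treats (profiling, needle-in-a-haystack, verification) all relax assumptions of this basic setup: the candidate set, the per-author training volume, or both.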


Selection criteria for drug-like compounds

MEDICINAL RESEARCH REVIEWS, Issue 3 2003
Ingo Muegge
Abstract The fast identification of quality lead compounds in the pharmaceutical industry through a combination of high throughput synthesis and screening has become more challenging in recent years. Although the number of compounds available for high throughput screening (HTS) has dramatically increased, large-scale random combinatorial libraries have contributed proportionally less to identifying novel leads for drug discovery projects. The concept of 'drug-likeness' of compound selections has therefore become a focus in recent years. In parallel, the low success rate of converting lead compounds into drugs, often due to unfavorable pharmacokinetic parameters, has sparked a renewed interest in understanding more clearly what makes a compound drug-like. Various approaches have been devised to address the drug-likeness of molecules, employing retrospective analyses of known drug collections as well as attempts to capture 'chemical wisdom' in algorithms. For example, simple property counting schemes, machine learning methods, regression models, and clustering methods have been employed to distinguish between drugs and non-drugs. Here we review computational techniques to address the drug-likeness of compound selections and offer an outlook for the further development of the field. © 2003 Wiley Periodicals, Inc. Med Res Rev, 23, No. 3, 302-321, 2003 [source]
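
As an example of a "simple property counting scheme", a sketch of Lipinski's rule of five implemented with RDKit; the toolkit choice and the one-violation tolerance are conventional assumptions, not prescriptions from the review:

    from rdkit import Chem
    from rdkit.Chem import Descriptors, Lipinski, Crippen

    def passes_rule_of_five(smiles):
        # Count violations of the four rule-of-five property limits.
        mol = Chem.MolFromSmiles(smiles)
        violations = sum([
            Descriptors.MolWt(mol) > 500,
            Crippen.MolLogP(mol) > 5,
            Lipinski.NumHDonors(mol) > 5,
            Lipinski.NumHAcceptors(mol) > 10,
        ])
        return violations <= 1          # one violation is commonly tolerated

    print(passes_rule_of_five("CC(=O)Oc1ccccc1C(=O)O"))   # aspirin -> True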


Status of HTS Data Mining Approaches

MOLECULAR INFORMATICS, Issue 4 2004
Alexander Böcker
Abstract High-throughput screening of large compound collections results in large sets of data. This review gives an overview of the computational techniques most frequently employed for the analysis of such data and the establishment of initial QSAR models. Various methods for descriptor selection, classification and data mining are discussed. Recent trends include the application of kernel-based machine learning methods for the design of focused libraries and the compilation of target-family-biased compound collections. [source]