Predictive Performance (predictive + performance)
Selected Abstracts

Quantifying the Predictive Performance of Prognostic Models for Censored Survival Data with Time-Dependent Covariates
BIOMETRICS, Issue 2 2008. R. Schoop.
Summary: Prognostic models in survival analysis typically aim to describe the association between patient covariates and future outcomes. More recently, efforts have been made to include covariate information that is updated over time. However, there exists as yet no standard approach to assess the predictive accuracy of such updated predictions. In this article, proposals from the literature are discussed and a conditional loss function approach is suggested, illustrated by a publicly available data set. [source]

Modelling species distributions in Britain: a hierarchical integration of climate and land-cover data
ECOGRAPHY, Issue 3 2004. Richard G. Pearson.
A modelling framework for studying the combined effects of climate and land-cover changes on the distribution of species is presented. The model integrates land-cover data into a correlative bioclimatic model in a scale-dependent hierarchical manner, whereby Artificial Neural Networks are used to characterise species' climatic requirements at the European scale and land-cover requirements at the British scale. The model has been tested against an alternative non-hierarchical approach and has been applied to four plant species in Britain: Rhynchospora alba, Erica tetralix, Salix herbacea and Geranium sylvaticum. Predictive performance has been evaluated using Cohen's Kappa statistic and the area under the Receiver Operating Characteristic curve, and a novel approach to identifying thresholds of occurrence which utilises three levels of confidence has been applied. Results demonstrate reasonable to good predictive performance for each species, with the main patterns of distribution simulated at both 10 km and 1 km resolutions. The incorporation of land-cover data was found to significantly improve purely climate-driven predictions for R. alba and E. tetralix, enabling regions with suitable climate but unsuitable land-cover to be identified. The study thus provides an insight into the roles of climate and land-cover as determinants of species' distributions and it is demonstrated that the modelling approach presented can provide a useful framework for making predictions of distributions under scenarios of changing climate and land-cover type. The paper confirms the potential utility of multi-scale approaches for understanding environmental limitations to species' distributions, and demonstrates that the search for environmental correlates with species' distributions must be addressed at an appropriate spatial scale. Our study contributes to the mounting evidence that hierarchical schemes are characteristic of ecological systems. [source]

Predictive Ability of Pretransplant Comorbidities to Predict Long-Term Graft Loss and Death
AMERICAN JOURNAL OF TRANSPLANTATION, Issue 3 2009. G. Machnicki.
Whether to include additional comorbidities beyond diabetes in future kidney allocation schemes is controversial. We investigated the predictive ability of multiple pretransplant comorbidities for graft and patient survival. We included first-kidney transplant deceased donor recipients if Medicare was the primary payer for at least one year pretransplant. We extracted pretransplant comorbidities from Medicare claims with the Clinical Classifications Software (CCS), Charlson and Elixhauser comorbidities and used Cox regressions for graft loss, death with function (DWF) and death. Four models were compared: (1) Organ Procurement Transplant Network (OPTN) recipient and donor factors, (2) OPTN + CCS, (3) OPTN + Charlson and (4) OPTN + Elixhauser. Patients were censored at 9 years or loss to follow-up. Predictive performance was evaluated with the c-statistic. We examined 25 270 transplants between 1995 and 2002. For graft loss, the predictive value of all models was statistically and practically similar (Model 1: 0.61 [0.60–0.62], Model 2: 0.63 [0.62–0.64], Models 3 and 4: 0.62 [0.61–0.63]). For DWF and death, performance improved to 0.70 and was slightly better with the CCS. Pretransplant comorbidities derived from administrative claims did not identify factors not collected on OPTN that had a significant impact on graft outcome predictions. This has important implications for the revisions to the kidney allocation scheme. [source]
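As an illustration of the kind of c-statistic comparison described in the abstract above, here is a minimal sketch using the lifelines library; the synthetic data and column names are hypothetical stand-ins for the OPTN and claims-derived factors, not the study's variables.

```python
# Sketch (not the authors' code): compare a base and an extended Cox model
# by Harrell's c-statistic, as in the Machnicki et al. abstract.
import numpy as np
import pandas as pd
from lifelines import CoxPHFitter

rng = np.random.default_rng(0)
n = 500
df = pd.DataFrame({
    "age": rng.normal(50, 12, n),        # OPTN-style recipient factor
    "diabetes": rng.integers(0, 2, n),   # claims-derived comorbidity
    "time": rng.exponential(5, n),       # years to graft loss or censoring
    "event": rng.integers(0, 2, n),      # 1 = graft loss observed
})

base = CoxPHFitter().fit(df[["age", "time", "event"]], "time", "event")
full = CoxPHFitter().fit(df, "time", "event")

# c-statistic: probability that, of two comparable patients, the one
# predicted to be at higher risk experiences the event first.
print("base model c:", round(base.concordance_index_, 3))
print("full model c:", round(full.concordance_index_, 3))
```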
Rule Quality Measures for Rule Induction Systems: Description and Evaluation
COMPUTATIONAL INTELLIGENCE, Issue 3 2001. Aijun An.
A rule quality measure is important to a rule induction system for determining when to stop generalization or specialization. Such measures are also important to a rule-based classification procedure for resolving conflicts among rules. We describe a number of statistical and empirical rule quality formulas and present an experimental comparison of these formulas on a number of standard machine learning datasets. We also present a meta-learning method for generating a set of formula-behavior rules from the experimental results which show the relationships between a formula's performance and the characteristics of a dataset. These formula-behavior rules are combined into formula-selection rules that can be used in a rule induction system to select a rule quality formula before rule induction. We will report the experimental results showing the effects of formula-selection on the predictive performance of a rule induction system. [source]

Predicting species distributions from museum and herbarium records using multiresponse models fitted with multivariate adaptive regression splines
DIVERSITY AND DISTRIBUTIONS, Issue 3 2007. Jane Elith.
ABSTRACT: Current circumstances – that the majority of species distribution records exist as presence-only data (e.g. from museums and herbaria), and that there is an established need for predictions of species distributions – mean that scientists and conservation managers seek to develop robust methods for using these data. Such methods must, in particular, accommodate the difficulties caused by lack of reliable information about sites where species are absent. Here we test two approaches for overcoming these difficulties, analysing a range of data sets using the technique of multivariate adaptive regression splines (MARS). MARS is closely related to regression techniques such as generalized additive models (GAMs) that are commonly and successfully used in modelling species distributions, but has particular advantages in its analytical speed and the ease of transfer of analysis results to other computational environments such as a Geographic Information System. MARS also has the advantage that it can model multiple responses, meaning that it can combine information from a set of species to determine the dominant environmental drivers of variation in species composition. We use data from 226 species from six regions of the world, and demonstrate the use of MARS for distribution modelling using presence-only data. We test whether (1) the type of data used to represent absence or background and (2) the signal from multiple species affect predictive performance, by evaluating predictions at completely independent sites where genuine presence–absence data were recorded. Models developed with absences inferred from the total set of presence-only sites for a biological group, and using simultaneous analysis of multiple species to inform the choice of predictor variables, performed better than models in which species were analysed singly, or in which pseudo-absences were drawn randomly from the study area. The methods are fast, relatively simple to understand, and useful for situations where data are limited. A tutorial is included. [source]
Random forest can predict 30-day mortality of spontaneous intracerebral hemorrhage with remarkable discrimination
EUROPEAN JOURNAL OF NEUROLOGY, Issue 7 2010. S.-Y.
Background and purpose: Risk-stratification models based on patient and disease characteristics are useful for aiding clinical decisions and for comparing the quality of care between different physicians or hospitals. In addition, prediction of mortality is beneficial for optimizing resource utilization. We evaluated the accuracy and discriminating power of the random forest (RF) to predict 30-day mortality of spontaneous intracerebral hemorrhage (SICH).
Methods: We retrospectively studied 423 patients admitted to the Taichung Veterans General Hospital who were diagnosed with SICH within 24 h of stroke onset. The initial evaluation data of the patients were used to train the RF model. Areas under the receiver operating characteristic curves (AUC) were used to quantify the predictive performance. The performance of the RF model was compared to that of an artificial neural network (ANN), support vector machine (SVM), logistic regression model, and the ICH score.
Results: The RF had an overall accuracy of 78.5% for predicting the mortality of patients with SICH. The sensitivity was 79.0%, and the specificity was 78.4%. The AUCs were as follows: RF, 0.87 (0.84–0.90); ANN, 0.81 (0.77–0.85); SVM, 0.79 (0.75–0.83); logistic regression, 0.78 (0.74–0.82); and ICH score, 0.72 (0.68–0.76). The discriminatory power of RF was superior to that of the other prediction models.
Conclusions: The RF provided the best predictive performance amongst all of the tested models. We believe that the RF is a suitable tool for clinicians to use in predicting the 30-day mortality of patients after SICH. [source]
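A minimal sketch of the model comparison in the SICH abstract above, using scikit-learn; the features and outcomes are simulated stand-ins for the admission variables used in the study.

```python
# Sketch: random forest vs. logistic regression compared by test-set AUC.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(2)
X = rng.normal(size=(423, 8))                   # e.g. GCS, ICH volume, age...
y = (X[:, 0] + 0.5 * X[:, 1] ** 2 + rng.normal(size=423) > 1).astype(int)

Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.3, random_state=0)
for model in (RandomForestClassifier(n_estimators=500, random_state=0),
              LogisticRegression(max_iter=1000)):
    model.fit(Xtr, ytr)
    auc = roc_auc_score(yte, model.predict_proba(Xte)[:, 1])
    print(type(model).__name__, "AUC:", round(auc, 3))
```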
Sensitivity analysis of prior model probabilities and the value of prior knowledge in the assessment of conceptual model uncertainty in groundwater modelling
HYDROLOGICAL PROCESSES, Issue 8 2009. Rodrigo Rojas.
Abstract: A key point in the application of multi-model Bayesian averaging techniques to assess the predictive uncertainty in groundwater modelling applications is the definition of prior model probabilities, which reflect the prior perception about the plausibility of alternative models. In this work the influence of prior knowledge and prior model probabilities on posterior model probabilities, multi-model predictions, and conceptual model uncertainty estimations is analysed. The sensitivity to prior model probabilities is assessed using an extensive numerical analysis in which the prior probability space of a set of plausible conceptualizations is discretized to obtain a large ensemble of possible combinations of prior model probabilities. Additionally, the value of prior knowledge about alternative models in reducing conceptual model uncertainty is assessed by considering three example knowledge states, expressed as quantitative relations among the alternative models. A constrained maximum entropy approach is used to find the set of prior model probabilities that correspond to the different prior knowledge states. For illustrative purposes, a three-dimensional hypothetical setup approximated by seven alternative conceptual models is employed. Results show that posterior model probabilities, leading moments of the predictive distributions and estimations of conceptual model uncertainty are very sensitive to prior model probabilities, indicating the relevance of selecting proper prior probabilities. Additionally, including proper prior knowledge improves the predictive performance of the multi-model approach, expressed by reductions of the multi-model prediction variances by up to 60% compared with a non-informative case. However, the ratio of between-model to total variance does not substantially decrease. This suggests that the contribution of conceptual model uncertainty to the total variance cannot be further reduced based only on prior knowledge about the plausibility of alternative models. These results advocate including proper prior knowledge about alternative conceptualizations in combination with extra conditioning data to further reduce conceptual model uncertainty in groundwater modelling predictions. Copyright © 2009 John Wiley & Sons, Ltd. [source]
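The constrained maximum-entropy step described above can be sketched as a small optimization problem; the seven models and the example constraint ("model 0 is at least twice as plausible as model 1") are illustrative, not the paper's knowledge states.

```python
# Sketch: maximum-entropy prior model probabilities under a stated relation,
# then Bayesian model averaging weights from hypothetical model likelihoods.
import numpy as np
from scipy.optimize import minimize

K = 7  # number of alternative conceptual models
neg_entropy = lambda p: np.sum(p * np.log(p + 1e-12))
cons = [
    {"type": "eq",   "fun": lambda p: p.sum() - 1.0},      # probabilities sum to 1
    {"type": "ineq", "fun": lambda p: p[0] - 2.0 * p[1]},  # prior knowledge state
]
res = minimize(neg_entropy, np.full(K, 1.0 / K), bounds=[(1e-9, 1)] * K,
               constraints=cons, method="SLSQP")
prior = res.x
print("max-entropy prior:", prior.round(3))

# BMA then weights each model's likelihood by this prior.
likelihood = np.array([0.8, 0.5, 0.9, 0.2, 0.6, 0.4, 0.7])  # hypothetical
posterior = prior * likelihood / np.sum(prior * likelihood)
print("posterior model probabilities:", posterior.round(3))
```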
MHC Class II epitope predictive algorithms
IMMUNOLOGY, Issue 3 2010. Morten Nielsen.
Summary: Major histocompatibility complex class II (MHC-II) molecules sample peptides from the extracellular space, allowing the immune system to detect the presence of foreign microbes from this compartment. To be able to predict the immune response to given pathogens, a number of methods have been developed to predict peptide–MHC binding. However, few methods other than the pioneering TEPITOPE/ProPred method have been developed for MHC-II. Despite recent progress in method development, the predictive performance for MHC-II remains significantly lower than what can be obtained for MHC-I. One reason for this is that the MHC-II molecule is open at both ends, allowing binding of peptides extending out of the groove. The binding core of MHC-II-bound peptides is therefore not known a priori and the binding motif is hence not readily discernible. Recent progress has been obtained by including the flanking residues in the predictions. All attempts to make ab initio predictions based on protein structure have failed to reach predictive performances similar to those that can be obtained by data-driven methods. Thousands of different MHC-II alleles exist in humans. Recently developed pan-specific methods have been able to make reasonably accurate predictions for alleles that were not included in the training data. These methods can be used to define supertypes (clusters) of MHC-II alleles where alleles within each supertype have similar binding specificities. Furthermore, the pan-specific methods have been used to make a graphical atlas such as the MHCMotifviewer, which allows for visual comparison of specificities of different alleles. [source]

Assessing the predictive performance of artificial neural network-based classifiers based on different data preprocessing methods, distributions and training mechanisms
INTELLIGENT SYSTEMS IN ACCOUNTING, FINANCE & MANAGEMENT, Issue 4 2005. Adrian Costea.
We analyse the implications of three different factors (preprocessing method, data distribution and training mechanism) on the classification performance of artificial neural networks (ANNs). We use three preprocessing approaches: no preprocessing, division by the maximum absolute values and normalization. We study the implications of input data distributions by using five datasets with different distributions: the real data, uniform, normal, logistic and Laplace distributions. We test two training mechanisms: one belonging to the gradient-descent techniques, improved by a retraining procedure, and the other is a genetic algorithm (GA), which is based on the principles of natural evolution. The results show statistically significant influences of all individual and combined factors on both training and testing performances. A major difference with other related studies is the fact that for both training mechanisms we train the network using as starting solution the one obtained when constructing the network architecture. In other words, we use a hybrid approach by refining a previously obtained solution. We found that when the starting solution had relatively low accuracy rates (80–90%) the GA clearly outperformed the retraining procedure, whereas the difference was smaller to non-existent when the starting solution had relatively high accuracy rates (95–98%). As reported in other studies, we found little to no evidence of crossover operator influence on the GA performance. Copyright © 2005 John Wiley & Sons, Ltd. [source]
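The three preprocessing factors compared in the Costea abstract can be sketched directly with scikit-learn scalers; the skewed synthetic dataset below is a placeholder for the financial data used in the study.

```python
# Sketch: no preprocessing vs. max-abs division vs. normalization,
# evaluated with the same ANN by cross-validated accuracy.
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import MaxAbsScaler, StandardScaler
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(3)
X = rng.lognormal(size=(300, 6))                # skewed financial-style inputs
y = (X[:, 0] > np.median(X[:, 0])).astype(int)

for name, Xp in [("raw", X),
                 ("max-abs", MaxAbsScaler().fit_transform(X)),
                 ("normalized", StandardScaler().fit_transform(X))]:
    ann = MLPClassifier(hidden_layer_sizes=(10,), max_iter=2000, random_state=0)
    score = cross_val_score(ann, Xp, y, cv=5).mean()
    print(f"{name:>10}: mean CV accuracy {score:.3f}")
```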
Computation of strongly swirling confined flows with cubic eddy-viscosity turbulence models
INTERNATIONAL JOURNAL FOR NUMERICAL METHODS IN FLUIDS, Issue 12 2003. Xiaodong Yang.
Abstract: An investigation on the predictive performance of four cubic eddy-viscosity turbulence models for two strongly swirling confined flows is presented. Comparisons of the prediction with the experiments show clearly the superiority of cubic models over the linear k–ε model. The linear k–ε model does not contain any mechanism to describe the stabilizing effects of swirling motion and as a consequence it performs poorly. Cubic models return a lower level of Reynolds stresses and the combined forced-free vortex profiles of tangential velocity close to the measurements in response to the interaction between swirl-induced curvature and stresses. However, a fully developed rotating pipe flow is too simple to contain enough flow physics, so the calibration of cubic terms is still a topic of investigation. It is shown that explicit algebraic stress models require fewer calibrations and contain more flow physics. Copyright © 2003 John Wiley & Sons, Ltd. [source]

Forecasting financial volatility of the Athens stock exchange daily returns: an application of the asymmetric normal mixture GARCH model
INTERNATIONAL JOURNAL OF FINANCE & ECONOMICS, Issue 4 2010. Anastassios A. Drakos.
Abstract: In this paper we model the return volatility of stocks traded in the Athens Stock Exchange using alternative GARCH models. We employ daily data for the period January 1998 to November 2008, allowing us to capture possible positive and negative effects that may be due to either contagion or idiosyncratic sources. The econometric analysis is based on the estimation of a class of five GARCH models under alternative assumptions with respect to the error distribution. The main findings of our analysis are: first, based on a battery of diagnostic tests it is shown that the normal mixture asymmetric GARCH (NM-AGARCH) models perform better in modeling the volatility of stock returns. Second, it is shown that with the use of the Kupiec's tests for in-sample and out-of-sample forecasting performance the evidence is mixed, as the choice of the appropriate volatility model depends on the trading position under consideration. Third, at the 99% confidence interval the NM-AGARCH model with skewed Student's t-distribution outperforms all other competing models both for in-sample and out-of-sample forecasting performance. This increase in predictive performance for higher confidence intervals of the NM-AGARCH model with skewed Student's t-distribution makes this specification consistent with the requirements of the Basel II agreement. Copyright © 2010 John Wiley & Sons, Ltd. [source]
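The Kupiec unconditional-coverage test mentioned above can be written in a few lines: under a correct VaR model, exceedances are Bernoulli(alpha), and the likelihood ratio below is asymptotically chi-squared with one degree of freedom. The backtest data here are simulated, not the Athens returns.

```python
# Sketch of the Kupiec test for VaR backtesting.
import numpy as np
from scipy.stats import chi2

def kupiec_lr(exceed, alpha):
    """exceed: boolean array, True where the loss exceeded the VaR forecast.
    Assumes at least one exceedance and at least one non-exceedance."""
    n, x = len(exceed), int(np.sum(exceed))
    pi = x / n                                   # observed exceedance rate
    lr = -2 * (x * np.log(alpha) + (n - x) * np.log(1 - alpha)
               - x * np.log(pi) - (n - x) * np.log(1 - pi))
    return lr, chi2.sf(lr, df=1)

rng = np.random.default_rng(4)
hits = rng.random(1000) < 0.018                  # hypothetical 1% VaR backtest
lr, pval = kupiec_lr(hits, alpha=0.01)
print(f"LR = {lr:.2f}, p-value = {pval:.3f}")    # small p-value: coverage rejected
```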
Cost-sensitive learning and decision making for Massachusetts PIP claim fraud data
INTERNATIONAL JOURNAL OF INTELLIGENT SYSTEMS, Issue 12 2004. Stijn Viaene.
In this article, we investigate the issue of cost-sensitive classification for a data set of Massachusetts closed personal injury protection (PIP) automobile insurance claims that were previously investigated for suspicion of fraud by domain experts and for which we obtained cost information. After a theoretical exposition on cost-sensitive learning and decision-making methods, we then apply these methods to the claims data at hand to contrast the predictive performance of the documented methods for a selection of decision tree and rule learners. We use standard logistic regression and (smoothed) naive Bayes as benchmarks. © 2004 Wiley Periodicals, Inc. Int J Int Syst 19: 1197–1215, 2004. [source]
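The core of cost-sensitive decision making described above is a simple expected-cost rule; the cost figures below are hypothetical, not those of the Massachusetts data.

```python
# Sketch: flag a claim for investigation when the expected cost of ignoring
# it exceeds the expected cost of investigating it.
C_FP = 200.0    # cost of investigating an honest claim
C_FN = 2500.0   # cost of paying out an undetected fraudulent claim

def investigate(p_fraud):
    """Expected-cost decision rule given a classifier's fraud probability."""
    cost_flag = (1 - p_fraud) * C_FP     # expected cost of investigating
    cost_pass = p_fraud * C_FN           # expected cost of paying out
    return cost_pass > cost_flag

# The decision threshold is C_FP / (C_FP + C_FN) ~ 0.074, far below the
# usual 0.5, because missed fraud is so much more expensive.
for p in (0.05, 0.08, 0.30):
    print(f"P(fraud) = {p:.2f} -> investigate: {investigate(p)}")
```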
Regression by L1 regularization of smart contrasts and sums (ROSCAS) beats PLS and elastic net in latent variable model
JOURNAL OF CHEMOMETRICS, Issue 5 2009. Cajo J. F. ter Braak.
Abstract: This paper proposes a regression method, ROSCAS, which regularizes smart contrasts and sums of regression coefficients by an L1 penalty. The contrasts and sums are based on the sample correlation matrix of the predictors and are suggested by a latent variable regression model. The contrasts express the idea that a priori correlated predictors should have similar coefficients. The method has excellent predictive performance in situations where there are groups of predictors, with each group representing an independent feature that influences the response. In particular, when the groups differ in size, ROSCAS can outperform LASSO, elastic net, partial least squares (PLS) and ridge regression by a factor of two or three in terms of mean squared error. In other simulation setups and on real data, ROSCAS performs competitively. Copyright © 2009 John Wiley & Sons, Ltd. [source]

Application of artificial neural network modelling to identify severely ill patients whose aminoglycoside concentrations are likely to fall below therapeutic concentrations
JOURNAL OF CLINICAL PHARMACY & THERAPEUTICS, Issue 5 2003. S. Yamamura PhD.
Summary.
Objective: Identification of ICU patients whose concentrations are likely to fall below therapeutic concentrations using artificial neural network (ANN) modelling and individual patient physiologic data.
Method: Data on indicators of disease severity and some physiologic data were collected from 89 ICU patients who received arbekacin (ABK) and 61 who received amikacin (AMK). Three-layer ANN modelling and multivariate logistic regression analysis were used to predict the plasma concentrations of the aminoglycosides (ABK and AMK) in the severely ill patients.
Results: Predictive performance analysis showed that the sensitivity and specificity of ANN modelling were superior to those of multivariate logistic regression analysis. For accurate modelling, a predictable range should be inferred from the data structure before the analysis. Restriction of the predictable region, based on the data structure, increased predictive performance.
Conclusion: ANN analysis was superior to multivariate logistic regression analysis in predicting which patients would have plasma concentrations lower than the minimum therapeutic concentration. To improve predictive performance, the predictable range should be inferred from the data structure before prediction. When applying ANN modelling in clinical settings, the predictive performance and predictable region should be investigated in detail to avoid the risk of harm to severely ill patients. [source]

Incorporating structural characteristics for identification of protein methylation sites
JOURNAL OF COMPUTATIONAL CHEMISTRY, Issue 9 2009. Dray-Ming Shien.
Abstract: Studies over the last few years have identified protein methylation on histones and other proteins that are involved in the regulation of gene transcription. Several works have developed approaches to identify computationally the potential methylation sites on lysine and arginine. Studies of protein tertiary structure have demonstrated that the sites of protein methylation are preferentially in regions that are easily accessible. However, previous studies have not taken into account the solvent-accessible surface area (ASA) that surrounds the methylation sites. This work presents a method named MASA that combines the support vector machine with the sequence and structural characteristics of proteins to identify methylation sites on lysine, arginine, glutamate, and asparagine. Since most experimental methylation sites are not associated with corresponding protein tertiary structures in the Protein Data Bank, effective solvent-accessibility prediction tools have been adopted to determine the potential ASA values of amino acids in proteins. Evaluation of predictive performance by cross-validation indicates that the ASA values around the methylation sites can improve the accuracy of prediction. Additionally, an independent test reveals that the prediction accuracies for methylated lysine and arginine are 80.8 and 85.0%, respectively. Finally, the proposed method is implemented as an effective system for identifying protein methylation sites. The developed web server is freely available at http://MASA.mbc.nctu.edu.tw/. © 2009 Wiley Periodicals, Inc. J Comput Chem, 2009 [source]

Soft energy function and generic evolutionary method for discriminating native from nonnative protein conformations
JOURNAL OF COMPUTATIONAL CHEMISTRY, Issue 9 2008. Yi-yuan Chiu.
Abstract: We have developed a soft energy function, termed GEMSCORE, for protein structure prediction, which is one of the emergent issues in computational biology. The GEMSCORE consists of the van der Waals, the hydrogen-bonding and the solvent potentials, with 12 parameters which are optimized by using a generic evolutionary method. The GEMSCORE is able to successfully identify 86 native proteins among 96 target proteins on six decoy sets from more than 70,000 near-native structures. For these six benchmark datasets, the predictive performance of the GEMSCORE, based on native structure ranking and Z-scores, was superior to eight other energy functions. Our method is based solely on a simple and linear function and thus is considerably faster than other methods that rely on additional complex calculations. In addition, the GEMSCORE recognized 17 and 2 native structures as the first and the second rank, respectively, among 21 targets in CASP6 (Critical Assessment of Techniques for Protein Structure Prediction). These results suggest that the GEMSCORE is fast and performs well in discriminating between native and nonnative structures from thousands of protein structure candidates. We believe that GEMSCORE is robust and should be a useful energy function for protein structure prediction. © 2008 Wiley Periodicals, Inc. J Comput Chem 2008 [source]
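The GEMSCORE abstract above combines a linear energy function with an evolutionary optimizer; a minimal sketch of that idea follows, with synthetic energy terms and decoys standing in for the real potentials and structures.

```python
# Sketch: fit weights of a linear energy function (vdW, H-bond, solvation
# terms) with an evolutionary optimizer so that natives rank lowest.
import numpy as np
from scipy.optimize import differential_evolution

rng = np.random.default_rng(5)
# terms[i, j, k]: k-th energy term of the j-th conformation of protein i;
# conformation 0 is the native structure (given slightly lower raw terms).
terms = rng.normal(size=(20, 50, 3))
terms[:, 0, :] -= 1.0

def mean_decoys_below_native(w):
    scores = terms @ w                            # linear energy per conformation
    return (scores < scores[:, :1]).sum(axis=1).mean()  # decoys beating native

res = differential_evolution(mean_decoys_below_native,
                             bounds=[(0, 5)] * 3, seed=0)
print("fitted weights:", res.x.round(2),
      "| mean decoys below native:", res.fun)
```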
Evaluating predictive performance of value-at-risk models in emerging markets: a reality check
JOURNAL OF FORECASTING, Issue 2 2006. Yong Bao.
Abstract: We investigate the predictive performance of various classes of value-at-risk (VaR) models in several dimensions: unfiltered versus filtered VaR models, parametric versus nonparametric distributions, conventional versus extreme value distributions, and quantile regression versus inverting the conditional distribution function. By using the reality check test of White (2000), we compare the predictive power of alternative VaR models in terms of the empirical coverage probability and the predictive quantile loss for the stock markets of five Asian economies that suffered from the 1997–1998 financial crisis. The results based on these two criteria are largely compatible and indicate some empirical regularities of risk forecasts. The Riskmetrics model behaves reasonably well in tranquil periods, while some extreme value theory (EVT)-based models do better in the crisis period. Filtering often appears to be useful for some models, particularly for the EVT models, though it could be harmful for some other models. The CaViaR quantile regression models of Engle and Manganelli (2004) have shown some success in predicting the VaR risk measure for various periods, generally more stable than those that invert a distribution function. Overall, the forecasting performance of the VaR models considered varies over the three periods before, during and after the crisis. Copyright © 2006 John Wiley & Sons, Ltd. [source]

Quick prediction of the retention of solutes in 13 thin layer chromatographic screening systems on silica gel by classification and regression trees
JOURNAL OF SEPARATION SCIENCE, Issue 15 2008. Łukasz Komsta.
Abstract: The use of classification and regression trees (CART) was studied in a quantitative structure–retention relationship (QSRR) context to predict the retention in 13 thin layer chromatographic screening systems on a silica gel, where large datasets of interlaboratory determined retention are available. The response (dependent variable) was the rate mobility (RM) factor, while a set of atomic contributions and functional substituent counts was used as an explanatory dataset. The trees were investigated against optimal complexity (number of the leaves) by external validation and internal crossvalidation. Their predictive performance is slightly lower than that of the full atomic contribution model, but the main advantage is simplicity. The retention prediction with the proposed trees can be done without a computer or even a pocket calculator. [source]

EVALUATION OF A STREAM AQUIFER ANALYSIS TEST USING ANALYTICAL SOLUTIONS AND FIELD DATA
JOURNAL OF THE AMERICAN WATER RESOURCES ASSOCIATION, Issue 3 2004. Garey A. Fox.
ABSTRACT: Considerable advancements have been made in the development of analytical solutions for predicting the effects of pumping wells on adjacent streams and rivers. However, these solutions have not been sufficiently evaluated against field data. The objective of this research is to evaluate the predictive performance of recently proposed analytical solutions for unsteady stream depletion using field data collected during a stream/aquifer analysis test at the Tamarack State Wildlife Area in eastern Colorado. Two primary stream/aquifer interactions exist at the Tamarack site: (1) between the South Platte River and the alluvial aquifer and (2) between a backwater stream and the alluvial aquifer. A pumping test is performed next to the backwater stream channel. Drawdown measured in observation wells is matched to predictions by recently proposed analytical solutions to derive estimates of aquifer and streambed parameters. These estimates are compared to documented aquifer properties and field-measured streambed conductivity. The analytical solutions are capable of estimating reasonable values of both aquifer and streambed parameters, with one solution capable of simultaneously estimating delayed aquifer yield and stream flow recharge. However, for long-term water management, it is reasonable to use simplified analytical solutions not concerned with early-time delayed yield effects. For this site, changes in the water level in the stream during the test and a varying water level profile at the beginning of the pumping test influence the application of the analytical solutions. [source]
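As a concrete member of the family of analytical stream-depletion solutions evaluated above, here is the classical Glover-Balmer (1954) solution for a fully penetrating stream; the parameter values are illustrative, not the Tamarack site estimates.

```python
# Sketch: Glover-Balmer analytical stream-depletion solution.
import numpy as np
from scipy.special import erfc

def stream_depletion_fraction(t_days, d=150.0, T=1000.0, S=0.2):
    """Fraction of the pumping rate supplied by the stream at time t.

    d: well-to-stream distance (m); T: transmissivity (m^2/day);
    S: specific yield (dimensionless).
    """
    return erfc(np.sqrt(S * d**2 / (4.0 * T * t_days)))

for t in (1, 10, 100, 1000):
    print(f"t = {t:4d} d: stream supplies {stream_depletion_fraction(t):.1%}")
```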
Sparse partial least squares regression for simultaneous dimension reduction and variable selection
JOURNAL OF THE ROYAL STATISTICAL SOCIETY: SERIES B (STATISTICAL METHODOLOGY), Issue 1 2010. Hyonho Chun.
Summary. Partial least squares regression has been an alternative to ordinary least squares for handling multicollinearity in several areas of scientific research since the 1960s. It has recently gained much attention in the analysis of high dimensional genomic data. We show that known asymptotic consistency of the partial least squares estimator for a univariate response does not hold with the very large p and small n paradigm. We derive a similar result for a multivariate response regression with partial least squares. We then propose a sparse partial least squares formulation which aims simultaneously to achieve good predictive performance and variable selection by producing sparse linear combinations of the original predictors. We provide an efficient implementation of sparse partial least squares regression and compare it with well-known variable selection and dimension reduction approaches via simulation experiments. We illustrate the practical utility of sparse partial least squares regression in a joint analysis of gene expression and genomewide binding data. [source]
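A toy illustration of the sparse-PLS idea above: soft-threshold the first PLS direction vector so it loads on only a subset of predictors. This sketches the penalty's effect, not the authors' full algorithm.

```python
# Sketch: sparsify the first PLS direction (proportional to X'y for a
# univariate response) by soft thresholding.
import numpy as np

rng = np.random.default_rng(6)
n, p = 100, 50
X = rng.normal(size=(n, p))
y = X[:, :5] @ np.ones(5) + rng.normal(size=n)   # only 5 informative predictors

Xc, yc = X - X.mean(0), y - y.mean()
w = Xc.T @ yc                                    # first PLS direction
w /= np.linalg.norm(w)

lam = 0.8 * np.abs(w).max()                      # sparsity tuning parameter
w_sparse = np.sign(w) * np.maximum(np.abs(w) - lam, 0.0)  # soft threshold
print("predictors kept:", np.flatnonzero(w_sparse))
```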
Modelling concurrency of events in on-line auctions via spatiotemporal semiparametric models
JOURNAL OF THE ROYAL STATISTICAL SOCIETY: SERIES C (APPLIED STATISTICS), Issue 1 2007. Wolfgang Jank.
Summary. We introduce a semiparametric approach for modelling the effect of concurrent events on an outcome of interest. Concurrency manifests itself as temporal and spatial dependences. By temporal dependence we mean the effect of an event in the past. Modelling this effect is challenging since events arrive at irregularly spaced time intervals. For the spatial part we use an abstract notion of 'feature space' to conceptualize distances among a set of item features. We motivate our model in the context of on-line auctions by modelling the effect of concurrent auctions on an auction's price. Our concurrency model consists of three components: a transaction-related component that accounts for auction design and bidding competition, a spatial component that takes into account similarity between item features and a temporal component that accounts for recently closed auctions. To construct each of these we borrow ideas from spatial and mixed model methodology. The power of this model is illustrated on a large and diverse set of laptop auctions on eBay.com. We show that our model results in superior predictive performance compared with a set of competitor models. The model also allows for new insight into the factors that drive price in on-line auctions and their relationship to bidding competition, auction design, product variety and temporal learning effects. [source]

A hierarchical Bayesian model for predicting the functional consequences of amino-acid polymorphisms
JOURNAL OF THE ROYAL STATISTICAL SOCIETY: SERIES C (APPLIED STATISTICS), Issue 1 2005. Claudio J. Verzilli.
Summary. Genetic polymorphisms in deoxyribonucleic acid coding regions may have a phenotypic effect on the carrier, e.g. by influencing susceptibility to disease. Detection of deleterious mutations via association studies is hampered by the large number of candidate sites; therefore methods are needed to narrow down the search to the most promising sites. For this, a possible approach is to use structural and sequence-based information about the encoded protein to predict whether a mutation at a particular site is likely to disrupt the functionality of the protein itself. We propose a hierarchical Bayesian multivariate adaptive regression spline (BMARS) model for supervised learning in this context and assess its predictive performance by using data from mutagenesis experiments on lac repressor and lysozyme proteins. In these experiments, about 12 amino-acid substitutions were performed at each native amino-acid position and the effect on protein functionality was assessed. The training data thus consist of repeated observations at each position, which the hierarchical framework is needed to account for. The model is trained on the lac repressor data and tested on the lysozyme mutations and vice versa. In particular, we show that the hierarchical BMARS model, by allowing for the clustered nature of the data, yields lower out-of-sample misclassification rates compared with both a BMARS and a frequentist MARS model, a support vector machine classifier and an optimally pruned classification tree. [source]

Continental speciation in the tropics: contrasting biogeographic patterns of divergence in the Uroplatus leaf-tailed gecko radiation of Madagascar
JOURNAL OF ZOOLOGY, Issue 4 2008. C. J. Raxworthy.
Abstract: A fundamental expectation of vicariance biogeography is for contemporary cladogenesis to produce spatial congruence between speciating sympatric clades. The Uroplatus leaf-tailed geckos represent one of the most spectacular reptile radiations endemic to the continental island of Madagascar, and thus serve as an excellent group for examining patterns of continental speciation within this large and comparatively isolated tropical system. Here we present the first phylogeny that includes complete taxonomic sampling for the group, based on morphology and molecular (mitochondrial and nuclear DNA) data. This study includes all described species, and we also include data for eight new species. We find novel outgroup relationships for Uroplatus and find strongest support for Paroedura as its sister taxon. Uroplatus is estimated to have initially diverged during the mid-Tertiary in Madagascar, and includes two major speciose radiations exhibiting extensive spatial overlap and estimated contemporary periods of speciation. All sister species are either allopatric or parapatric. However, we found no evidence for biogeographic congruence between these sympatric clades, and dispersal events are prevalent in the dispersal–vicariance biogeographic analyses, which we estimate to date to the Miocene. One sister-species pair exhibits isolated distributions that we interpret as biogeographic relicts, and two sister-species pairs have parapatric distributions separated by elevation. Integrating ecological niche models with our phylogenetic results finds both conserved and divergent niches between sister species. We also found substantial intra-specific genetic variation, and for the three most widespread species, poor intra-specific predictive performance for ecological niche models across the latitudinal span of Madagascar. These latter results indicate the potential for intra-specific niche specialization along environmental gradients; more generally, this study suggests a complex speciation history for this group in Madagascar, which appears to include multiple speciation processes. [source]
Reliable prediction of T-cell epitopes using neural networks with novel sequence representations
PROTEIN SCIENCE, Issue 5 2003. Morten Nielsen.
Abstract: In this paper we describe an improved neural network method to predict T-cell class I epitopes. A novel input representation has been developed, consisting of a combination of sparse encoding, Blosum encoding, and input derived from hidden Markov models. We demonstrate that the combination of several neural networks derived using different sequence-encoding schemes has a performance superior to neural networks derived using a single sequence-encoding scheme. The new method is shown to have a performance that is substantially higher than that of other methods. By use of mutual information calculations we show that peptides that bind to the HLA A*0204 complex display a signal of higher order sequence correlations. Neural networks are ideally suited to integrate such higher order correlations when predicting the binding affinity. It is this feature, combined with the use of several neural networks derived from different and novel sequence-encoding schemes and the ability of the neural network to be trained on data consisting of continuous binding affinities, that gives the new method an improved performance. The difference in predictive performance between the neural network methods and that of the matrix-driven methods is found to be most significant for peptides that bind strongly to the HLA molecule, confirming that the signal of higher order sequence correlation is most strongly present in high-binding peptides. Finally, we use the method to predict T-cell epitopes for the genome of hepatitis C virus and discuss possible applications of the prediction method to guide the process of rational vaccine design. [source]
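The two sequence encodings combined in the abstract above can be sketched briefly: "sparse" one-hot encoding and Blosum encoding, in which each residue is replaced by its row of a BLOSUM substitution matrix. Biopython is assumed for the matrix, and the peptide is a toy example.

```python
# Sketch: build the combined sparse + Blosum input vector for a 9-mer.
import numpy as np
from Bio.Align import substitution_matrices

AA = "ACDEFGHIKLMNPQRSTVWY"
BLOSUM = substitution_matrices.load("BLOSUM62")

def sparse_encode(pep):
    """One-hot ('sparse') encoding: 20 features per residue."""
    out = np.zeros((len(pep), 20))
    for i, aa in enumerate(pep):
        out[i, AA.index(aa)] = 1.0
    return out.ravel()

def blosum_encode(pep):
    """Each residue replaced by its (scaled) BLOSUM62 row."""
    return np.array([[BLOSUM[aa, b] for b in AA] for aa in pep]).ravel() / 10.0

pep = "LLFGYPVYV"                                # an example 9-mer peptide
x = np.concatenate([sparse_encode(pep), blosum_encode(pep)])
print("combined input length:", x.size)          # 2 x 9 x 20 = 360 features
# An ensemble then averages predictions of networks trained on each encoding.
```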
Discrete-time survival trees
THE CANADIAN JOURNAL OF STATISTICS, Issue 1 2009. Imad Bou-hamad. MSC 2000: Primary 62N99; secondary 62G08.
Abstract: Tree-based methods are frequently used in studies with censored survival time. Their structure and ease of interpretability make them useful to identify prognostic factors and to predict conditional survival probabilities given an individual's covariates. The existing methods are tailor-made to deal with a survival time variable that is measured continuously. However, survival variables measured on a discrete scale are often encountered in practice. The authors propose a new tree construction method specifically adapted to such discrete-time survival variables. The splitting procedure can be seen as an extension, to the case of right-censored data, of the entropy criterion for a categorical outcome. The selection of the final tree is made through a pruning algorithm combined with a bootstrap correction. The authors also present a simple way of potentially improving the predictive performance of a single tree through bagging. A simulation study shows that single trees and bagged trees perform well compared to a parametric model. A real data example investigating the usefulness of personality dimensions in predicting early onset of cigarette smoking is presented. The Canadian Journal of Statistics 37: 17-32; 2009 © 2009 Statistical Society of Canada. [source]

The use of propofol and remifentanil for the anaesthetic management of a super-obese patient
ANAESTHESIA, Issue 8 2007. L. La Colla.
Summary: Morbid obesity is defined as body mass index (BMI) > 35 kg.m-2, and super-obesity as BMI > 55 kg.m-2. We report the case of a 290-kg super-obese patient scheduled for open bariatric surgery. A propofol–remifentanil TCI (target controlled infusion) was chosen as the anaesthetic technique, both for sedation during awake fibreoptic nasotracheal intubation and for maintenance of anaesthesia during surgery. Servin's weight correction formula was used for propofol. Arterial blood samples were taken at fixed time points to assess the predictive performance of the TCI system. A significant difference between measured and predicted plasma propofol concentrations was found. After performing a computer simulation, we found that predictive performance would have improved significantly if we had used an unadjusted pharmacokinetic set. However, in conclusion (despite the differences between measured and predicted plasma propofol concentrations), the use of a propofol–remifentanil TCI technique both for sedation during awake fibreoptic intubation and for Bispectral Index-guided propofol–remifentanil anaesthesia resulted in a rapid and effective induction, operative stability, and a rapid emergence, allowing rapid extubation in the operating room and an uneventful recovery. [source]
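The "predictive performance" assessed for the TCI system above is conventionally summarized by the Varvel metrics: performance error (PE), its median (bias) and its median absolute value (precision). The concentration values below are illustrative, not the case report's data.

```python
# Sketch: Varvel-style predictive-performance metrics for a TCI system.
import numpy as np

measured  = np.array([3.9, 4.4, 5.1, 4.8, 5.6])   # ug/ml, arterial samples
predicted = np.array([3.0, 3.2, 3.8, 3.7, 4.1])   # ug/ml, TCI pump estimates

pe = 100.0 * (measured - predicted) / predicted   # performance error, %
print(f"MDPE  (bias):      {np.median(pe):.1f}%")
print(f"MDAPE (precision): {np.median(np.abs(pe)):.1f}%")
```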
The Impact of Injury Coding Schemes on Predicting Hospital Mortality After Pediatric Injury
ACADEMIC EMERGENCY MEDICINE, Issue 7 2009. Randall S. Burd MD.
Abstract.
Objectives: Accurate adjustment for injury severity is needed to evaluate the effectiveness of trauma management. While the choice of injury coding scheme used for modeling affects performance, the impact of combining coding schemes on performance has not been evaluated. The purpose of this study was to use Bayesian logistic regression to develop models predicting hospital mortality in injured children and to compare the performance of models developed using different injury coding schemes.
Methods: Records of children (age < 15 years) admitted after injury were obtained from the National Trauma Data Bank (NTDB) and the National Pediatric Trauma Registry (NPTR) and used to train Bayesian logistic regression models predicting mortality using three injury coding schemes (International Classification of Diseases, 9th revision [ICD-9] injury codes, the Abbreviated Injury Scale [AIS] severity scores, and the Barell matrix) and their combinations. Model performance was evaluated using independent data from the NTDB and the Kids' Inpatient Database 2003 (KID).
Results: Discrimination was optimal when modeling both ICD-9 and AIS severity codes (area under the receiver operating curve [AUC] = 0.921 [NTDB] and 0.967 [KID], Hosmer–Lemeshow [HL] h-statistic = 115 [NTDB] and 147 [KID]), while calibration was optimal when modeling coding based on the Barell matrix (AUC = 0.882 [NTDB] and 0.936 [KID], HL h-statistic = 19 [NTDB] and 69 [KID]). When compared to models based on ICD-9 codes alone, models that also included AIS severity scores and coding from the Barell matrix showed improved discrimination and calibration.
Conclusions: Mortality models that incorporate additional injury coding schemes perform better than those based on ICD-9 codes alone in the setting of pediatric trauma. Combining injury coding schemes may be an effective approach for improving the predictive performance of empirically derived estimates of injury mortality. [source]

The Performance of Risk Prediction Models
BIOMETRICAL JOURNAL, Issue 4 2008. Thomas A. Gerds.
Abstract: For medical decision making and patient information, predictions of future status variables play an important role. Risk prediction models can be derived with many different statistical approaches. To compare them, measures of predictive performance are derived from ROC methodology and from probability forecasting theory. These tools can be applied to assess single markers, multivariable regression models and complex model selection algorithms. This article provides a systematic review of the modern way of assessing risk prediction models. Particular attention is put on proper benchmarks and resampling techniques that are important for the interpretation of measured performance. All methods are illustrated with data from a clinical study in head and neck cancer patients. (© 2008 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim) [source]
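A minimal sketch of the benchmarking and resampling ideas reviewed above: the Brier score of a risk model, estimated on bootstrap out-of-bag samples rather than as the optimistic apparent error, and compared against the trivial prevalence forecast. Data and model are synthetic placeholders.

```python
# Sketch: bootstrap out-of-bag Brier score vs. a prevalence benchmark.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import brier_score_loss

rng = np.random.default_rng(7)
X = rng.normal(size=(400, 5))
y = (X[:, 0] + rng.normal(size=400) > 0).astype(int)

scores = []
for _ in range(100):
    idx = rng.integers(0, 400, 400)                 # bootstrap training sample
    oob = np.setdiff1d(np.arange(400), idx)         # out-of-bag test sample
    m = LogisticRegression().fit(X[idx], y[idx])
    scores.append(brier_score_loss(y[oob], m.predict_proba(X[oob])[:, 1]))

print("model Brier (bootstrap):", round(float(np.mean(scores)), 3))
print("benchmark Brier (prevalence forecast):",
      round(brier_score_loss(y, np.full(400, y.mean())), 3))
```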
Population pharmacokinetic modelling of gentamicin and vancomycin in patients with unstable renal function following cardiothoracic surgery
BRITISH JOURNAL OF CLINICAL PHARMACOLOGY, Issue 2 2006. Christine E. Staatz.
Aims: To describe the population pharmacokinetics of gentamicin and vancomycin in cardiothoracic surgery patients with unstable renal function.
Methods: Data collected during routine care were analyzed using NONMEM. Linear relationships between creatinine clearance (CLcr) and drug clearance (CL) were identified, and two approaches to modelling changing CLcr were examined. The first included baseline (BCOV) and difference-from-baseline (DCOV) effects, and the second allowed the influence of CLcr to vary between individuals. Final model predictive performance was evaluated using independent data. The data sets were then combined and parameters re-estimated.
Results: Model building was performed using data from 96 (gentamicin) and 102 (vancomycin) patients, aged 17–87 years. CLcr ranged from 9 to 172 ml min-1 and changes varied from -76 to 58 ml min-1 (gentamicin) and -86 to 93 ml min-1 (vancomycin). Inclusion of BCOV and DCOV improved the fit of the gentamicin data but had little effect on that for vancomycin. Inclusion of interindividual variability (IIV) in the influence of CLcr resulted in a poorly characterized model for gentamicin and had no effect on vancomycin modelling. No bias was seen in population compared with individual CL estimates in independent data from 39 (gentamicin) and 37 (vancomycin) patients. Mean (95% CI) differences were 4% (-3, 11%) and 2% (-2, 6%), respectively. Final estimates were:

CL_gent (l h-1) = 2.81 × [1 + 0.015 × (BCOV_CLcr - median BCOV_CLcr) + 0.0174 × DCOV_CLcr]
CL_vanc (l h-1) = 2.97 × [1 + 0.0205 × (CLcr - median CLcr)]

IIV in CL was 27% for both drugs.
Conclusions: A parameter describing individual changes in CLcr with time improves population pharmacokinetic modelling of gentamicin but not vancomycin in clinically unstable patients. [source]
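A worked example of the final covariate models quoted above. The median creatinine-clearance value was not reported in the abstract, so the centring value used here (60 ml/min) is a placeholder for illustration only.

```python
# Sketch: evaluate the quoted covariate models for a hypothetical patient.
CLCR_MEDIAN = 60.0          # ml/min, placeholder centring value (not reported)

def cl_gentamicin(bcov_clcr, dcov_clcr):
    """Gentamicin CL (l/h): baseline CLcr effect plus change from baseline."""
    return 2.81 * (1 + 0.015 * (bcov_clcr - CLCR_MEDIAN) + 0.0174 * dcov_clcr)

def cl_vancomycin(clcr):
    """Vancomycin CL (l/h): current CLcr effect only."""
    return 2.97 * (1 + 0.0205 * (clcr - CLCR_MEDIAN))

# A patient admitted at CLcr 80 ml/min whose renal function falls by 30 ml/min:
print(f"gentamicin CL: {cl_gentamicin(80, -30):.2f} l/h")   # ~2.19 l/h
print(f"vancomycin CL: {cl_vancomycin(50):.2f} l/h")        # ~2.36 l/h
```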