Classification Trees (classification + tree)
Selected Abstracts

The Statistical Analysis of Judicial Decisions and Legal Rules with Classification Trees
JOURNAL OF EMPIRICAL LEGAL STUDIES, Issue 2 2010. Jonathan P. Kastellec
A key question in the quantitative study of legal rules and judicial decision making is the structure of the relationship between case facts and case outcomes. Legal doctrine and legal rules are general attempts to define this relationship. This article summarizes and utilizes a statistical method relatively unexplored in political science and legal scholarship, classification trees, that offers a flexible way to study legal doctrine. I argue that this method, while not replacing traditional statistical tools for studying judicial decisions, can better capture many aspects of the relationship between case facts and case outcomes. To illustrate the method's advantages, I conduct classification tree analyses of search and seizure cases decided by the U.S. Supreme Court and confession cases decided by the courts of appeals. These analyses illustrate the ability of classification trees to increase our understanding of legal rules and legal doctrine. [source]

New computational algorithm for the prediction of protein folding types
INTERNATIONAL JOURNAL OF QUANTUM CHEMISTRY, Issue 1 2001. Nikola Štambuk
Abstract. We present a new computational algorithm for the prediction of a secondary protein structure. The method enables the evaluation of α- and β-protein folding types from the nucleotide sequences. The procedure is based on the reflected Gray code algorithm of nucleotide–amino acid relationships, and represents the extension of Swanson's procedure in Ref. 4. It is shown that six-digit binary notation of each codon enables the prediction of α- and β-protein folds by means of the error-correcting linear block triple-check code. We tested the validity of the method on the test set of 140 proteins (70 α- and 70 β-folds).
The test set consisted of standard α- and β-protein classes from the Jpred and SCOP databases, with nucleotide sequences available in the GenBank database. 100% accurate classification of α- and β-protein folds, based on 39 dipeptide addresses derived by the error-correcting coding procedure, was obtained by means of logistic regression analysis (p < 0.00000001). A classification tree and a machine learning sequential minimal optimization (SMO) classifier confirmed the results by means of 97.1% and 90% accurate classification, respectively. Protein fold prediction quality tested by means of leave-one-out cross-validation was a satisfactory 82.1% for the logistic regression and 81.4% for the SMO classifier. The presented procedure of computational analysis can be helpful in detecting the type of protein folding from newly sequenced exon regions. The method enables quick, simple, and accurate prediction of α- and β-protein folds from the nucleotide sequence on a personal computer. © 2001 John Wiley & Sons, Inc. Int J Quant Chem 84: 13–22, 2001 [source]

The drought tolerance limit of Fagus sylvatica forest on limestone in southwestern Germany
JOURNAL OF VEGETATION SCIENCE, Issue 6 2008. Stefanie Gärtner
Abstract. Question: What components of drought influence the drought limit of Fagus sylvatica forests? This study contributes to the ongoing discussion regarding the future of Fagus as a major component of central European forests. Location: The drought limit of F. sylvatica at its ecotone with forest dominated by Quercus pubescens, Q. petraea and their hybrids in two limestone regions (Klettgau, Schwäbische Alb) in southwestern Germany was compared. Methods: Vegetation relevés were classified and a gradient analysis was performed. The vegetation pattern was analysed with several drought-relevant variables. Classification trees were used to determine the drought limits of the Fagus forest. Results: The Fagus, Quercus and the ecotone forests were floristically characterized.
The lower humidity in the submontane Klettgau, compared to the montane Schwäbische Alb, was compensated for by greater soil moisture (ASWSC). Therefore, Fagus forest in the Schwäbische Alb grew on sites with ASWSC values similar to those of ecotone forest in Klettgau. Conclusions: The interaction between climatic and edaphic drought-related factors demonstrates that drought is a complex edaphic-climatic factor. Both components contribute to limiting the distribution of Fagus. For the two regions in southwestern Germany, and under the existing climatic conditions, it could be shown that Fagus is able to dominate forests on soils with very low ASWSC (≥ 68 l m⁻²). [source]

Animal behaviour and marine protected areas: incorporating behavioural data into the selection of marine protected areas for an endangered killer whale population
ANIMAL CONSERVATION, Issue 2 2010. E. Ashe
Abstract. Like many endangered wildlife populations, the viability and conservation status of 'southern resident' killer whales Orcinus orca in the north-east Pacific may be affected by prey limitation and repeated disturbance by human activities. Marine protected areas (MPAs) present an attractive option to mitigate impacts of anthropogenic activities, but they run the risk of tokenism if placed arbitrarily. Notwithstanding recreational and industrial marine traffic, the number of commercial vessels in the local whalewatching fleet is approaching the number of killer whales to be watched. Resident killer whales have been shown to be more vulnerable to vessel disturbance while feeding than during resting, travelling or socializing activities; therefore, protected-areas management strategies that target feeding 'hotspots' should confer greater conservation benefit than those that protect habitat generically.
Classification trees and spatially explicit generalized additive models were used to model killer whale habitat use and whale behaviour in inshore waters of Washington State (USA) and British Columbia (BC, Canada). Here we propose a candidate MPA that is small (i.e. a few square miles), but seemingly important. Killer whales were predicted to be 2.7 times as likely to be engaged in feeding activity in this site as they were in adjacent waters. A recurring challenge for cetacean MPAs is the need to identify areas that are large enough to be biologically meaningful while being small enough to allow effective management of human activities within those boundaries. Our approach prioritizes habitat that animals use primarily for the activity in which they are most responsive to anthropogenic disturbance. [source]

Assessing burn severity and comparing soil water repellency, Hayman Fire, Colorado
HYDROLOGICAL PROCESSES, Issue 1 2006. Sarah A. Lewis
Abstract. An important element of evaluating a large wildfire is to assess its effects on the soil in order to predict the potential watershed response. After the 55 000 ha Hayman Fire on the Colorado Front Range, 24 soil and vegetation variables were measured to determine the key variables that could be used for a rapid field assessment of burn severity. The percentage of exposed mineral soil and litter cover proved to be the best predictors of burn severity in this environment. Two burn severity classifications, one from a statistical classification tree and the other a Burned Area Emergency Response (BAER) burn severity map, were compared with measured 'ground truth' burn severity at 183 plots and were 56% and 69% accurate, respectively. This study also compared water repellency measurements made with the water drop penetration time (WDPT) test and a mini-disk infiltrometer (MDI) test.
At the soil surface, the moderate and highly burned sites had the strongest water repellency, yet were not significantly different from each other. Areas burned at moderate severity had 1.5 times more plots that were strongly water repellent at the surface than the areas burned at high severity. However, the high severity plots most likely had a deeper water repellent layer that was not detected with our surface tests. The WDPT and MDI values had an overall correlation of r = −0.64 (p < 0.0001) and appeared to be compatible methods for assessing soil water repellency in the field. Both tests represent point measurements of a soil characteristic that has large spatial variability; hence, results from both tests reflect that variability, accounting for much of the remaining variance. The MDI is easier to use, takes about 1 min to assess a strongly water repellent soil and provides two indicators of water repellency: the time to start of infiltration and a relative infiltration rate. Copyright © 2005 John Wiley & Sons, Ltd. [source]

Pilot study of latewood-width of conifers as an indicator of variability of summer rainfall in the North American monsoon region
INTERNATIONAL JOURNAL OF CLIMATOLOGY, Issue 6 2001. David M. Meko
Abstract. The variability of the North American Monsoon System (NAMS) is important to the precipitation climatology of Mexico and the southwestern United States. Tree-ring studies have been widely applied to climatic reconstruction in western North America, but as yet, have not addressed the NAMS. One reason is the need for highly resolved seasonal dendroclimatic information. Latewood-width, the portion of the annual tree ring laid down late in the growing season, can potentially yield such information. This paper describes a pilot study of the regional summer precipitation signal in latewood-width from a network of five Pseudotsuga menziesii chronologies developed in the heart of the region of NAMS influence in Arizona, USA.
Exploratory data analysis reveals that the summer precipitation signal in latewood is strong, but not equally so over the full range of summer precipitation. Scatter in the relationship increases toward higher levels of precipitation. Adjustment for removal of inter-correlation with earlywood-width appears to strengthen the summer precipitation signal in latewood-width. To demonstrate a possible application to climatic reconstruction, the latewood precipitation signal is modelled using a nonlinear model, a binary recursive classification tree (CT) that attempts to classify summers as dry or not dry from threshold values of latewood-width. The model identifies narrow latewood-width as an effective predictor of a dry summer. Of 14 summers classified dry by latewood-width for the period 1868–1992, 13 are actually dry by the instrumental precipitation record. The results suggest that geographical expansion of coverage by latewood-width chronologies and further development of statistical methods may lead to successful reconstruction of variability of the NAMS on century time scales. Copyright © 2001 Royal Meteorological Society [source]

Patients with Hip Fracture: Subgroups and Their Outcomes
JOURNAL OF AMERICAN GERIATRICS SOCIETY, Issue 7 2002. Elizabeth A. Eastwood PhD
OBJECTIVES: To present several alternative approaches to describing the range and functional outcomes of patients with hip fracture. DESIGN: Prospective study with concurrent medical records data collection and patient and proxy interviews at the time of hospitalization and 6 months later. SETTING: Four hospitals in the New York metropolitan area. PARTICIPANTS: Five hundred seventy-one hospitalized adults aged 50 and older with hip fracture between July 1997 and August 1998. MEASUREMENTS: Rates of return to function in four physical domains, mortality, and nursing home residence at 6 months.
Cluster analysis was used to describe the heterogeneity among the sample and identify variations in 6-month mortality, nursing home residence, and level of functioning and to develop a patient classification tree with associated patient outcomes at 6 months postfracture. RESULTS: In locomotion, transfers, and self-care, 33% to 37% of patients returned to their prior level of function by 6 months, including those needing assistance, but only 24% were independent in locomotion at 6 months. Cluster analysis identified eight patient subgroups that had distinct baseline features and variable outcomes at 6 months. The patient classification tree used four variables: atypical functional status (independent in locomotion but dependent in other domains); nursing home residence; independence/dependence in self-care; and age (younger than 85 versus 85 and older). The tree identified five subgroups with variable 6-month outcomes that clinicians may use to predict likely outcomes for their patients. CONCLUSION: Patients with hip fracture are heterogeneous with respect to baseline and outcome characteristics. Clinicians may be better able to give patients and caregivers information on expected outcomes based on presenting characteristics used in the classification tree. [source]

Hybrid Bayesian networks: making the hybrid Bayesian classifier robust to missing training data
JOURNAL OF CHEMOMETRICS, Issue 5 2003. Nathaniel A. Woody
Abstract. Many standard classification methods are incapable of handling missing values in a sample. Instead, these methods must rely on external filling methods in order to estimate the missing values. The hybrid network proposed in this paper is an extension of the hybrid classifier that is robust to missing values. The hybrid network is produced by performing empirical Bayesian network structure learning to create a Bayesian network that retains its classification ability in the presence of missing data in both training and test cases.
The performance of the hybrid network is measured by calculating a misclassification rate when data are removed from a dataset. These misclassification curves are then compared against similar curves produced from the hybrid classifier and from a classification tree. Copyright © 2003 John Wiley & Sons, Ltd. [source]

The Interaction of Reward Genes With Environmental Factors in Contribution to Alcoholism in Mexican Americans
ALCOHOLISM, Issue 12 2009. Yanlei Du
Background: Alcoholism is a polygenic disorder resulting from reward deficiency; polymorphisms in reward genes including serotonin transporter (5-HTT)-linked polymorphic region (5-HTTLPR), A118G in opioid receptor mu1 (OPRM1), and −141C Insertion/Deletion (Ins/Del) in dopamine receptor D2 (DRD2), as well as environmental factors (education and marital status), might affect the risk of alcoholism. The objective of the current study was to examine the main and interacting effects of these 3 polymorphisms and 2 environmental factors in contribution to alcoholism in Mexican Americans. Methods: Genotyping of 5-HTTLPR, OPRM1 A118G, and DRD2 −141C Ins/Del was performed in 365 alcoholics and 338 nonalcoholic controls of Mexican Americans who were gender- and age-matched. Alcoholics were stratified according to tertiles of MAXDRINKS, which denotes the largest number of drinks consumed in one 24-hour period. Data analysis was done in the entire data set and in each alcoholic stratum. Multinomial logistic regression was conducted to explore the main effect of the 3 polymorphisms and 2 environmental factors (education and marital status); classification tree, generalized multifactor dimensionality reduction (GMDR) analysis, and the polymorphism interaction analysis version 2.0 (PIA 2) program were used to study factor interaction. Results: A main effect of education, OPRM1, and DRD2 was detected in the alcoholic strata of moderate and/or largest MAXDRINKS, with education ≤12 years, OPRM1 118 A/A, and DRD2 −141C Ins/Ins being risk factors.
Classification tree analysis, GMDR analysis, and the PIA 2 program all supported an education*OPRM1 interaction in alcoholics of largest MAXDRINKS, with education ≤12 years coupled with OPRM1 A/A being a high risk factor; the dendrogram showed synergistic interaction between these 2 factors; a dosage-effect response was also observed for the education*OPRM1 interaction. No definite effect of marital status and 5-HTTLPR in pathogenesis of alcoholism was observed. Conclusions: Our results suggest a main effect of education background, OPRM1 A118G, and DRD2 −141C Ins/Del as well as an education*OPRM1 interaction in contribution to moderate and/or severe alcoholism in Mexican Americans. Functional relevance of these findings still needs to be explored. [source]

A hierarchical Bayesian model for predicting the functional consequences of amino-acid polymorphisms
JOURNAL OF THE ROYAL STATISTICAL SOCIETY: SERIES C (APPLIED STATISTICS), Issue 1 2005. Claudio J. Verzilli
Summary. Genetic polymorphisms in deoxyribonucleic acid coding regions may have a phenotypic effect on the carrier, e.g. by influencing susceptibility to disease. Detection of deleterious mutations via association studies is hampered by the large number of candidate sites; therefore methods are needed to narrow down the search to the most promising sites. For this, a possible approach is to use structural and sequence-based information of the encoded protein to predict whether a mutation at a particular site is likely to disrupt the functionality of the protein itself. We propose a hierarchical Bayesian multivariate adaptive regression spline (BMARS) model for supervised learning in this context and assess its predictive performance by using data from mutagenesis experiments on lac repressor and lysozyme proteins. In these experiments, about 12 amino-acid substitutions were performed at each native amino-acid position and the effect on protein functionality was assessed.
The training data thus consist of repeated observations at each position, which the hierarchical framework is needed to account for. The model is trained on the lac repressor data and tested on the lysozyme mutations and vice versa. In particular, we show that the hierarchical BMARS model, by allowing for the clustered nature of the data, yields lower out-of-sample misclassification rates compared with both a BMARS and a frequentist MARS model, a support vector machine classifier and an optimally pruned classification tree. [source]

Enhancing a regional vegetation map with predictive models of dominant plant species in chaparral
APPLIED VEGETATION SCIENCE, Issue 1 2002. Janet Franklin
Abstract. Data from more than 900 vegetation plots surveyed in the evergreen shrublands of southern California were used to develop predictions of the distributions of eight dominant shrub species for a 3880 km² region. The predictions, based on classification tree (CT) models, were validated using independent field data collected during a vegetation survey conducted in the 1930s. Presence and absence were correctly predicted an average of 75% of the time for the eight species. At the same time, these models minimized false positives, so that presence was predicted in the correct proportion of the cases for most species. The areal proportion of the landscape on which the species were predicted to occur was in the same rank order, and of the same magnitude, as their frequency (proportion of plots in which they occurred) within the field data sets. Predictive maps of species presence were overlaid and combined with an existing regional vegetation map. The shrub species 'assemblages' that resulted from this procedure had analogs with vegetation series defined using field data in previous studies. The resulting multiple species map will be used in a landscape simulation model of fire disturbance and succession.
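The kind of validation described in the chaparral abstract (share of plots predicted correctly, plus a check that presence is predicted in roughly the right proportion) can be illustrated with a few lines of Python. This is only a sketch: the observed and predicted vectors are hypothetical toy values, not data from the study, and the function name `validation_summary` is an assumption made for illustration.

```python
def validation_summary(observed, predicted):
    """Compare predicted presence/absence (1/0) with independent observations."""
    pairs = list(zip(observed, predicted))
    n = len(pairs)
    tp = sum(1 for o, p in pairs if o == 1 and p == 1)  # presence correctly predicted
    tn = sum(1 for o, p in pairs if o == 0 and p == 0)  # absence correctly predicted
    fp = sum(1 for o, p in pairs if o == 0 and p == 1)  # false positives
    fn = sum(1 for o, p in pairs if o == 1 and p == 0)  # false negatives
    return {
        "accuracy": (tp + tn) / n,
        "false_positives": fp,
        "false_negatives": fn,
        "observed_prevalence": (tp + fn) / n,
        "predicted_prevalence": (tp + fp) / n,
    }

# Hypothetical toy plots: 1 = species present, 0 = species absent.
observed = [1, 1, 1, 0, 0, 0, 0, 0, 1, 0]
predicted = [1, 1, 0, 0, 0, 0, 1, 0, 1, 0]
summary = validation_summary(observed, predicted)
```

When predicted and observed prevalence agree, as in this toy case, presence is being predicted in the correct proportion even though individual plots may still be misclassified.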
[source]

Application and evaluation of classification trees for screening unwanted plants
AUSTRAL ECOLOGY, Issue 5 2006. Peter Caley
Abstract. Risk assessment systems for introduced species are being developed and applied globally, but methods for rigorously evaluating them are still in their infancy. We explore classification and regression tree models as an alternative to the current Australian Weed Risk Assessment system, and demonstrate how the performance of screening tests for unwanted alien species may be quantitatively compared using receiver operating characteristic (ROC) curve analysis. The optimal classification tree model for predicting weediness included just four out of a possible 44 attributes of introduced plants examined, namely: (i) intentional human dispersal of propagules; (ii) evidence of naturalization beyond native range; (iii) evidence of being a weed elsewhere; and (iv) a high level of domestication. Intentional human dispersal of propagules in combination with evidence of naturalization beyond a plant's native range led to the strongest prediction of weediness. A high level of domestication in combination with no evidence of naturalization mitigated the likelihood of an introduced plant becoming a weed resulting from intentional human dispersal of propagules. Unlikely intentional human dispersal of propagules combined with no evidence of being a weed elsewhere led to the lowest predicted probability of weediness. The failure to include intrinsic plant attributes in the model suggests that either these attributes are not useful general predictors of weediness, or data and analysis were inadequate to elucidate the underlying relationship(s). This concurs with the historical pessimism about whether we will ever be able to accurately predict invasive plants.
Given the apparent importance of propagule pressure (the number of individuals of a species released), future attempts at evaluating screening model performance for identifying unwanted plants need to account for propagule pressure when collating and/or analysing datasets. The classification tree had a cross-validated sensitivity of 93.6% and specificity of 36.7%. Based on the area under the ROC curve, the performance of the classification tree in correctly classifying plants as weeds or non-weeds was slightly inferior (area under ROC curve = 0.83 ± 0.021 (±SE)) to that of the current risk assessment system in use (area under ROC curve = 0.89 ± 0.018 (±SE)), although it requires many fewer questions to be answered. [source]

Differences in spatial predictions among species distribution modeling methods vary with species traits and environmental predictors
ECOGRAPHY, Issue 6 2009. Alexandra D. Syphard
Prediction maps produced by species distribution models (SDMs) influence decision-making in resource management or designation of land in conservation planning. Many studies have compared the prediction accuracy of different SDM modeling methods, but few have quantified the similarity among prediction maps. There has also been little systematic exploration of how the relative importance of different predictor variables varies among model types and affects map similarity. Our objective was to expand the evaluation of SDM performance for 45 plant species in southern California to better understand how map predictions vary among model types, and to explain what factors may affect spatial correspondence, including the selection and relative importance of different environmental variables. Four types of models were tested. Correlation among maps was highest between generalized linear models (GLMs) and generalized additive models (GAMs) and lowest between classification trees and GAMs or GLMs.
Correlation between Random Forests (RFs) and GAMs was the same as between RFs and classification trees. Spatial correspondence among maps was influenced the most by model prediction accuracy (AUC) and species prevalence; map correspondence was highest when accuracy was high and prevalence was intermediate (average prevalence for all species was 0.124). Species functional type and the selection of climate variables also influenced map correspondence. For most (but not all) species, climate variables were more important than terrain or soil in predicting their distributions. Environmental variable selection varied according to modeling method, but the largest differences were between RFs and GLMs or GAMs. Although prediction accuracy was equal for GLMs, GAMs, and RFs, the differences in spatial predictions suggest that it may be important to evaluate the results of more than one model to estimate the range of spatial uncertainty before making planning decisions based on map outputs. This may be particularly important if models have low accuracy or if species prevalence is not intermediate. [source]

Forecasting daily high ozone concentrations by classification trees
ENVIRONMETRICS, Issue 2 2004. F. Bruno
Abstract. This article proposes the use of classification trees (CART) as a suitable technique for forecasting the daily exceedance of ozone standards established by Italian law. A model is formulated for predicting, 1 and 2 days beforehand, the most probable class of the maximum daily urban ozone concentration in the city of Bologna. The standard employed is the so-called 'warning level' (180 µg/m³). Meteorological forecasted variables are considered as predictors. Pollution data show a considerable discrepancy between the dimensions of the two classes of events. The first class includes those days when the observed maximum value exceeds the established standard, while the second class contains those when the observed maximum value does not exceed the said standard.
Due to this peculiarity, model selection procedures using cross-validation usually lead to overpruning. We can overcome this drawback by means of techniques which replicate observations, through the modification of their inclusion probabilities in the cross-validation sets. Copyright © 2004 John Wiley & Sons, Ltd. [source]

Effects of sample and grid size on the accuracy and stability of regression-based snow interpolation methods
HYDROLOGICAL PROCESSES, Issue 14 2010. J. Ignacio López Moreno
Abstract. This work analyses the responses of four regression-based interpolation methods for predicting snowpack distribution to changes in the number of data points (sample size) and resolution of the employed digital elevation model (DEM). For this purpose, we used data obtained from intensive and random sampling of snow depth (991 measurements) in a small catchment (6 km²) in the Pyrenees, Spain. Linear regression, classification trees, generalized additive models (GAMs), and a recent method based on a correction made by applying tree classification to GAM residuals were used to calculate snow-depth distribution based on terrain characteristics under different combinations of sample size and DEM spatial resolution (grid size). The application of a tree classification to GAM residuals yielded the highest accuracy scores and the most stable models. The other tested methods yielded scores with slightly lower accuracy and varying levels of robustness under different conditions of grid and sample size. The accuracy of the model predictions declined with decreasing resolution of DEMs and sample size; however, the sensitivities of the models to the number of data points showed threshold values, which has implications (when planning fieldwork) for optimizing the relation between the effort expended in gathering data and the quality of the results. Copyright © 2009 John Wiley & Sons, Ltd.
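The idea of correcting a regression model with a tree fitted to its residuals, as mentioned in the snow-interpolation abstract, can be sketched in plain Python. This is a simplified illustration under stated assumptions, not the authors' method: a straight-line fit stands in for the GAM, a single-split regression stump stands in for the tree, and the elevation, northness and depth values are hypothetical.

```python
def fit_line(x, y):
    """Ordinary least-squares fit y ~ a + b*x; returns (a, b)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx
    return my - b * mx, b

def fit_stump(z, resid):
    """One-split tree on residuals; returns (threshold, left_mean, right_mean)."""
    pairs = sorted(zip(z, resid))
    best = None
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # identical predictor values cannot be separated
        left = [r for _, r in pairs[:i]]
        right = [r for _, r in pairs[i:]]
        ml, mr = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - ml) ** 2 for r in left) + sum((r - mr) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, (pairs[i - 1][0] + pairs[i][0]) / 2.0, ml, mr)
    return best[1], best[2], best[3]

# Hypothetical toy data: elevation (m), northness (0..1), snow depth (m).
elev = [1800, 1900, 2000, 2100, 2200, 2300, 2400, 2500]
northness = [0.1, 0.9, 0.2, 0.8, 0.1, 0.9, 0.3, 0.7]
depth = [0.4, 1.0, 0.8, 1.4, 1.2, 1.8, 1.6, 2.2]

# Step 1: base model on elevation only (stand-in for the GAM).
a, b = fit_line(elev, depth)
resid = [d - (a + b * e) for e, d in zip(elev, depth)]

# Step 2: stump on the residuals using a second terrain variable.
threshold, left_mean, right_mean = fit_stump(northness, resid)

def predict(e, z):
    """Base prediction plus the residual correction from the stump."""
    correction = left_mean if z <= threshold else right_mean
    return a + b * e + correction

sse_linear = sum(r ** 2 for r in resid)
sse_corrected = sum((d - predict(e, z)) ** 2
                    for e, z, d in zip(elev, northness, depth))
```

On these toy values the stump splits on northness and the corrected predictions have a smaller in-sample squared error than the base fit alone; whether that gain carries over to new data would need out-of-sample checks.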
[source]

Factors explaining the abundance of rodents in the city of Luang Prabang, Lao PDR, as revealed by field and household surveys
INTEGRATIVE ZOOLOGY (ELECTRONIC), Issue 1 2008. Prasartthong Promkerd
Abstract. A field and a household survey, the latter of which included inspections and interviews with the residents of a total of 1370 properties, were conducted in 2004 in 30 villages of the city of Luang Prabang, Lao PDR, in order to assess the degree of rodent infestation and to identify potential factors influencing infestations. Roof rats, Rattus rattus, and the Polynesian rat, Rattus exulans, were the only rodents found in the city, and trapping results showed a clear dominance of roof rats (80–90% of all individuals). Measurements of rodent activity using tracking patches correlated positively with the trapping data, and revealed a significantly higher degree of rat infestation during the rainy season (September) than during the dry season (November). If households in the vicinity of the sampling locations were considered, villagers' accounts of indoor rodent infestations recorded during the household survey correlated positively with measurements of rodent activity. At least every second household reported indoor infestations. Using explorative statistical analyses (classification trees, factor analysis) we checked the predictive or explanatory value of up to 28 variables assessed during household inspections for villagers' observations on rodent infestation as the dependent variable. Trophic factors such as exposed food (indoors) and garbage (outdoors), and structural features such as open ceilings (indoors) and rat harborage in gardens (outdoors) ranked highest as explanatory variables. Assessment of a small sample of roof rat droppings collected inside houses revealed the presence of the potential disease agents Salmonella javiana, Cryptosporidium parvum, Giardia duodenalis and the parasitic nematode Calodium hepaticum (syn. Capillaria hepatica).
These results underline the need for an appropriate rodent management strategy for the city, whereby simple sanitation and rodent-proofing measures could be cheap means of reducing rat infestation rates. [source]

Digital soil mapping in Germany: a review
JOURNAL OF PLANT NUTRITION AND SOIL SCIENCE, Issue 3 2006. Thorsten Behrens
Abstract. Digital soil mapping as a tool to generate spatial soil information provides solutions for the growing demand for high-resolution soil maps worldwide. Even in highly developed countries like Germany, digital soil mapping becomes essential due to the decreasing, time-consuming, and expensive field surveys which are no longer affordable by the soil surveys of the individual federal states.
This article summarizes the present state of soil survey in Germany in terms of digitally available soil data, applied digital soil mapping, and research in the broader field of pedometrics and discusses future perspectives. Based on the geomorphologic conditions in Germany, relief is a major driving force in soil genesis. This is expressed by the digital soil mapping research which highlights the great importance of digital terrain attributes in combination with information on parent material in soil prediction. An example of digital soil mapping using classification trees in Thuringia is given as an introduction to digital soil-class mapping based on correlations to environmental covariates within the scope of the German classification system. [source]

Modelling fault-proneness statistically over a sequence of releases: a case study
JOURNAL OF SOFTWARE MAINTENANCE AND EVOLUTION: RESEARCH AND PRACTICE, Issue 3 2001. Magnus C. Ohlsson
Abstract. Many of today's software systems evolve through a series of releases that add new functionality and features, in addition to the results of corrective maintenance. As the systems evolve over time it is necessary to keep track of and manage their problematic components. Our focus is to track system evolution and to react before the systems become difficult to maintain. To do the tracking, we use a method based on a selection of statistical techniques. In the case study reported here, which had historical data available primarily on corrective maintenance, we apply the method to four releases of a system consisting of 130 components. In each release, components are classified as fault-prone if the number of defect reports written against them is above a certain threshold. The outcome from the case study shows stabilizing principal components over the releases, and classification trees with lower thresholds in their decision nodes.
Also, the variables used in the classification trees' decision nodes are related to changes in the same files. The discriminant functions use more variables than the classification trees and are more difficult to interpret. Box plots highlight the findings from the other analyses. The results show that for a context of corrective maintenance, principal components analysis together with classification trees are good descriptors for tracking software evolution. Copyright © 2001 John Wiley & Sons, Ltd. [source]
Bibliomining for automated collection development in a digital library setting: Using data mining to discover Web-based scholarly research works
JOURNAL OF THE AMERICAN SOCIETY FOR INFORMATION SCIENCE AND TECHNOLOGY, Issue 12 2003
Scott Nicholson
This research creates an intelligent agent for automated collection development in a digital library setting. It uses a predictive model based on facets of each Web page to select scholarly works. The criteria came from the academic library selection literature, and a Delphi study was used to refine the list to 41 criteria. A Perl program was designed to analyze a Web page for each criterion and applied to a large collection of scholarly and nonscholarly Web pages. Bibliomining, or data mining for libraries, was then used to create different classification models. Four techniques were used: logistic regression, nonparametric discriminant analysis, classification trees, and neural networks. Accuracy and return were used to judge the effectiveness of each model on test datasets. In addition, a set of problematic pages that were difficult to classify because of their similarity to scholarly research was gathered and classified using the models. The resulting models could be used in the selection process to automatically create a digital library of Web-based scholarly research works. In addition, the technique can be extended to create a digital library of any type of structured electronic information.
[source]
Application and evaluation of classification trees for screening unwanted plants
AUSTRAL ECOLOGY, Issue 5 2006
PETER CALEY
Abstract Risk assessment systems for introduced species are being developed and applied globally, but methods for rigorously evaluating them are still in their infancy. We explore classification and regression tree models as an alternative to the current Australian Weed Risk Assessment system, and demonstrate how the performance of screening tests for unwanted alien species may be quantitatively compared using receiver operating characteristic (ROC) curve analysis. The optimal classification tree model for predicting weediness included just four out of a possible 44 attributes of introduced plants examined, namely: (i) intentional human dispersal of propagules; (ii) evidence of naturalization beyond native range; (iii) evidence of being a weed elsewhere; and (iv) a high level of domestication. Intentional human dispersal of propagules in combination with evidence of naturalization beyond a plant's native range led to the strongest prediction of weediness. A high level of domestication in combination with no evidence of naturalization mitigated the likelihood of an introduced plant becoming a weed resulting from intentional human dispersal of propagules. Unlikely intentional human dispersal of propagules combined with no evidence of being a weed elsewhere led to the lowest predicted probability of weediness. The failure to include intrinsic plant attributes in the model suggests that either these attributes are not useful general predictors of weediness, or data and analysis were inadequate to elucidate the underlying relationship(s). This concurs with the historical pessimism that we will ever be able to accurately predict invasive plants.
Given the apparent importance of propagule pressure (the number of individuals of a species released), future attempts at evaluating screening model performance for identifying unwanted plants need to account for propagule pressure when collating and/or analysing datasets. The classification tree had a cross-validated sensitivity of 93.6% and specificity of 36.7%. Based on the area under the ROC curve, the performance of the classification tree in correctly classifying plants as weeds or non-weeds was slightly inferior (area under ROC curve = 0.83 ± 0.021 (±SE)) to that of the current risk assessment system in use (area under ROC curve = 0.89 ± 0.018 (±SE)), although it requires many fewer questions to be answered. [source]
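The screening metrics quoted in the abstract above (sensitivity, specificity, and area under the ROC curve) can be illustrated with a minimal Python sketch. The labels and scores below are invented for illustration and are not data from the study; the function names `sensitivity_specificity` and `roc_auc` are hypothetical helpers, and the AUC is computed with the standard pairwise (Mann-Whitney) formulation rather than whatever procedure the authors used.

```python
def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for binary labels (1 = weed, 0 = non-weed)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return tp / (tp + fn), tn / (tn + fp)


def roc_auc(y_true, scores):
    """Area under the ROC curve: the fraction of (positive, negative) pairs
    in which the positive case scores higher, with ties counted as half."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Invented example: 4 weeds and 4 non-weeds with predicted weediness scores.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.6, 0.3, 0.7, 0.4, 0.2, 0.1]

# Classify as a weed when the score reaches an assumed cut-off of 0.5.
y_pred = [1 if s >= 0.5 else 0 for s in scores]

sens, spec = sensitivity_specificity(y_true, y_pred)
auc = roc_auc(y_true, scores)
print(f"sensitivity={sens:.2f} specificity={spec:.2f} AUC={auc:.4f}")
```

Lowering the cut-off in a sketch like this raises sensitivity at the expense of specificity, which is the kind of trade-off reflected in the 93.6% and 36.7% figures reported for the tree; the AUC summarizes performance across all possible cut-offs.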