Classification Error

Selected Abstracts

The use of indicator taxa as representatives of communities in bioassessment

Summary
1. Sampling and processing of benthic macroinvertebrate samples are time-consuming and expensive. Although a number of cost-cutting options exist, a frequently asked question is how representative a subset of data is of the whole community, particularly in areas where habitat diversity is high (such as Dutch surface-water habitats).
2. Weighted averaging was used to reassign 650 samples to a typology of 40 community types, testing the representativeness of different subsets of data: (i) four types of data (presence/absence, raw, log2- and ln-transformed abundance), (ii) three subsets of 'indicator' taxa (taxa with indicator weights 4–12, 7–12, and 10–12) and (iii) single taxonomic groups (n = 14), by determining the classification error.
3. log2- and ln-transformed abundances resulted in the lowest classification error, whilst the use of qualitative data resulted in a 10% reduction in the samples assigned to their original community type compared with ln-transformed abundance data.
4. Samples from community types with a high number of unique indicator taxa had the lowest classification error, and classification error increased as similarity among community types increased. Using a subset of indicator taxa resulted in a maximum increase in classification error of 15% when only taxa with an indicator weight of 10–12 were included (error = 49.1%).
5. Use of single taxonomic groups resulted in high classification error; the lowest (68%) was found using Trichoptera, and error was related to the frequency of the taxonomic group among samples and the indicator weights of the taxa.
6. Our finding that the use of qualitative data, subsets of indicator taxa or single taxonomic groups resulted in high classification error implies low taxonomic redundancy, and supports the use of all taxa in characterising a macroinvertebrate community, particularly in areas where habitat diversity is high. [source]
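The reassignment idea above can be sketched in a few lines: assign each sample to the community type whose indicator taxa best match the taxa present, then count the fraction of samples that land outside their original type. This is only an illustrative stand-in for the study's weighted-averaging procedure; the typology, taxa, and weights below are hypothetical.

```python
# Sketch of indicator-weight-based reassignment and classification error.
# Type profiles and indicator weights are hypothetical, not the study's.

def assign_type(sample_taxa, type_weights):
    """Assign a sample to the community type whose indicator taxa
    (summed weights of those present) best match the sample."""
    def score(weights):
        return sum(w for taxon, w in weights.items() if taxon in sample_taxa)
    return max(type_weights, key=lambda t: score(type_weights[t]))

def classification_error(samples, type_weights):
    """Fraction of samples not reassigned to their original type."""
    wrong = sum(assign_type(taxa, type_weights) != original
                for original, taxa in samples)
    return wrong / len(samples)

# Hypothetical typology: two community types with indicator weights.
type_weights = {
    "A": {"Baetis": 10, "Gammarus": 7, "Chironomus": 4},
    "B": {"Asellus": 11, "Chironomus": 9, "Erpobdella": 6},
}
samples = [
    ("A", {"Baetis", "Gammarus"}),
    ("B", {"Asellus", "Chironomus"}),
    ("A", {"Chironomus"}),  # ambiguous sample: shared taxon pulls it to B
]
error = classification_error(samples, type_weights)
print(error)  # one of the three samples is misassigned
```

The third sample illustrates point 4 above: a sample whose taxa are shared between similar community types is the kind most likely to be misassigned.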

Power of Tests for a Dichotomous Independent Variable Measured with Error

Daniel F. McCaffrey
Objective. To examine the implications for statistical power of using predicted probabilities for a dichotomous independent variable, rather than the actual variable. Data Sources/Study Setting. An application uses 271,479 observations from the 2000 to 2002 CAHPS Medicare Fee-for-Service surveys. Study Design and Data. A methodological study with simulation results and a substantive application to previously collected data. Principal Findings. Researchers often must employ key dichotomous predictors that are unobserved but for which predictions exist. We consider three approaches to such data: the classification estimator (1); the direct substitution estimator (2); and the partial information maximum likelihood estimator (3, PIMLE). The efficiency of (1) (its power relative to testing with the true variable) roughly scales with the square of one minus the classification error. The efficiency of (2) roughly scales with the R2 for predicting the unobserved dichotomous variable, and (2) is usually more powerful than (1). Approach (3) is the most powerful, but for testing differences in means of 0.2–0.5 standard deviations, (2) is typically more than 95 percent as efficient as (3). Conclusions. The information loss from not observing actual values of dichotomous predictors can be quite large. Direct substitution is easy to implement and interpret and nearly as efficient as the PIMLE. [source]
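The two rules of thumb in the findings are simple enough to tabulate directly. The sketch below encodes them as stated in the abstract (efficiency of the classification estimator roughly (1 − error)², efficiency of direct substitution roughly the prediction R²); the 20% error rate and R² = 0.7 plugged in are hypothetical values for illustration.

```python
# Back-of-the-envelope efficiency comparison using the abstract's two
# approximations. The numeric inputs are hypothetical.

def classification_efficiency(error_rate):
    """Approximate efficiency of the classification estimator: power
    relative to testing with the true variable scales roughly with the
    square of one minus the classification error."""
    return (1 - error_rate) ** 2

def substitution_efficiency(r_squared):
    """Approximate efficiency of the direct substitution estimator:
    scales roughly with the R^2 for predicting the unobserved variable."""
    return r_squared

eff_classify = classification_efficiency(0.20)   # ~0.64
eff_substitute = substitution_efficiency(0.70)   # 0.70
print(eff_classify, eff_substitute)
```

With these inputs direct substitution comes out ahead, consistent with the abstract's observation that (2) is usually more powerful than (1).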

Regressive Interest Rate Expectations and Mortgage Instrument Choice in the United Kingdom Housing Market

David Leece
The paper considers the choice of mortgage instrument when the rate of interest is fixed for a short duration, with reversion to a variable (bullet) rate mortgage contract. The research is the first direct test for regressive interest rate expectations using United Kingdom data while testing for wealth and portfolio effects. The econometric modeling uses a variety of nonparametric and parametric techniques to control for classification error in the dependent variable. There is evidence that households adopt regressive interest rate expectations. The lack of statistical significance of wealth and portfolio effects confirms the short run cash flow perspective of United Kingdom mortgage choices. [source]

Evaluating the Ability of Tree-Based Methods and Logistic Regression for the Detection of SNP-SNP Interaction

M. García-Magariños
Summary Most common human diseases are likely to have complex etiologies. Methods of analysis that allow for the phenomenon of epistasis are of growing interest in the genetic dissection of complex diseases. By allowing for epistatic interactions between potential disease loci, we may succeed in identifying genetic variants that might otherwise have remained undetected. Here we aimed to analyze the ability of logistic regression (LR) and two tree-based supervised learning methods, classification and regression trees (CART) and random forest (RF), to detect epistasis. Multifactor-dimensionality reduction (MDR) was also used for comparison. Our approach involves first the simulation of datasets of autosomal biallelic unphased and unlinked single nucleotide polymorphisms (SNPs), each containing a two-locus interaction (causal SNPs) and 98 'noise' SNPs. We modelled interactions under different scenarios of sample size, missing data, minor allele frequencies (MAF) and several penetrance models: three involving both (indistinguishable) marginal effects and interaction, and two simulating pure interaction effects. In total, we simulated 99 different scenarios. Although CART, RF, and LR yield similar results in terms of detection of true association, CART and RF perform better than LR with respect to classification error. MAF, penetrance model, and sample size are greater determining factors than percentage of missing data in the ability of the different techniques to detect true association. In pure interaction models, only RF detects association. In conclusion, tree-based methods and LR are important statistical tools for the detection of unknown interactions among true risk-associated SNPs with marginal effects and in the presence of a significant number of noise SNPs. In pure interaction models, RF performs reasonably well in the presence of large sample sizes and low percentages of missing data. However, when the study design is suboptimal (unfavourable to detecting interaction in terms of, e.g., sample size and MAF), there is a high chance of detecting false, spurious associations. [source]
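A "pure interaction" penetrance model of the kind described above can be simulated in a few lines: disease risk depends on two SNPs jointly while each SNP alone shows essentially no marginal effect, which is why single-locus methods (and LR without interaction terms) miss it. The penetrance values, MAF, and sample size below are invented for illustration and are not the paper's; the MAF is chosen so the carrier probability is exactly one half, which makes the marginal effects of the XOR-like pattern cancel.

```python
import random

# Toy simulation of a pure two-locus interaction (epistasis) with no
# marginal effects. All parameter values are hypothetical.

random.seed(0)

MAF = 1 - 0.5 ** 0.5  # P(carrier) = 1 - (1 - MAF)^2 = 0.5, zeroing the marginals

def genotype(maf=MAF):
    """Copies (0/1/2) of the minor allele under Hardy-Weinberg sampling."""
    return sum(random.random() < maf for _ in range(2))

def penetrance(g1, g2):
    """XOR-like pure interaction: risk is elevated only when exactly
    one of the two loci carries a minor allele."""
    return 0.5 if (g1 > 0) != (g2 > 0) else 0.1

data = []
for _ in range(20000):
    g1, g2 = genotype(), genotype()
    data.append((g1, g2, random.random() < penetrance(g1, g2)))

def risk(subset):
    return sum(case for *_, case in subset) / len(subset)

# Marginal view: carriers and non-carriers at locus 1 have ~equal risk,
# so a single-SNP scan sees nothing.
carriers = [row for row in data if row[0] > 0]
noncarriers = [row for row in data if row[0] == 0]
print(round(risk(carriers), 3), round(risk(noncarriers), 3))

# Joint view: the interaction is plainly visible.
xor_rows = [row for row in data if (row[0] > 0) != (row[1] > 0)]
rest_rows = [row for row in data if (row[0] > 0) == (row[1] > 0)]
print(round(risk(xor_rows), 3), round(risk(rest_rows), 3))
```

Methods that partition on joint genotype configurations, as RF effectively can, exploit exactly the contrast the second pair of numbers exposes.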

Automated Test Assembly for Cognitive Diagnosis Models Using a Genetic Algorithm

Matthew Finkelman
Much recent psychometric literature has focused on cognitive diagnosis models (CDMs), a promising class of instruments used to measure the strengths and weaknesses of examinees. This article introduces a genetic algorithm to perform automated test assembly alongside CDMs. The algorithm is flexible in that it can be applied whether the goal is to minimize the average number of classification errors, minimize the maximum error rate across all attributes being measured, hit a target set of error rates, or optimize any other prescribed objective function. Under multiple simulation conditions, the algorithm compared favorably with a standard method of automated test assembly, successfully finding solutions that were appropriate for each stated goal. [source]
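A stripped-down version of such an evolutionary search can be sketched as follows. The item "information" values per attribute, the error model (attribute error shrinking as pooled item information grows), and the simple keep-the-best-half loop are all invented stand-ins for the article's CDM machinery and genetic-algorithm details; the point is only the flexibility noted above, since the objective function is pluggable (minimize the maximum per-attribute error, the average, or anything else).

```python
import random

# Toy genetic-algorithm-style search for test assembly. Item information
# values and the error model are hypothetical.

random.seed(1)

N_ITEMS, N_ATTRS, TEST_LEN = 30, 4, 10
# Hypothetical per-attribute information carried by each item.
info = [[random.uniform(0, 1) for _ in range(N_ATTRS)] for _ in range(N_ITEMS)]

def attr_errors(test):
    """Per-attribute classification-error proxy: more pooled item
    information -> lower error (a stand-in for CDM error rates)."""
    return [1.0 / (1.0 + sum(info[i][a] for i in test)) for a in range(N_ATTRS)]

def fitness(test, objective=max):
    """Objective to minimize: the max error rate across attributes.
    Pass objective=lambda e: sum(e) / len(e) to minimize the average."""
    return objective(attr_errors(test))

def random_test():
    return tuple(sorted(random.sample(range(N_ITEMS), TEST_LEN)))

def mutate(test):
    """Swap one selected item for one currently unselected item."""
    test = list(test)
    out = random.randrange(TEST_LEN)
    candidates = [i for i in range(N_ITEMS) if i not in test]
    test[out] = random.choice(candidates)
    return tuple(sorted(test))

# Plain evolutionary loop: keep the best half, refill by mutation.
pop = [random_test() for _ in range(40)]
for _ in range(60):
    pop.sort(key=fitness)
    pop = pop[:20] + [mutate(random.choice(pop[:20])) for _ in range(20)]

best = min(pop, key=fitness)
print(fitness(best))
```

Swapping the `objective` argument is all it takes to retarget the search, which mirrors the flexibility the article claims for its algorithm.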

Analysis of Misclassified Correlated Binary Data Using a Multivariate Probit Model when Covariates are Subject to Measurement Error

Surupa Roy
Abstract A multivariate probit model for correlated binary responses given the predictors of interest has been considered. Some of the responses are subject to classification errors and hence are not directly observable. Also, measurements on some of the predictors are not available; instead, measurements on their surrogates are available. However, the conditional distribution of the unobservable predictors given the surrogates is completely specified. Models are proposed taking into account either or both of these sources of error. Likelihood-based methodologies are proposed to fit these models. To ascertain the effect of ignoring classification error and/or measurement error on the estimates of the regression and correlation parameters, a sensitivity study is carried out through simulation. Finally, the proposed methodology is illustrated through an example. [source]
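The way classification error enters such a likelihood can be shown in a univariate sketch: the observed binary response equals the true one except with known false-positive and false-negative rates, so the Bernoulli probability in the likelihood is the probit probability attenuated by those rates. This is a deliberately simplified stand-in for the paper's multivariate probit with measurement error, and the 5%/10% misclassification rates are made up.

```python
import math

# Univariate sketch of a misclassification-adjusted probit likelihood.
# The false-positive (fp) and false-negative (fn) rates are hypothetical.

def probit(x):
    """Standard normal CDF, Phi(x)."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2)))

def observed_prob(x, beta, fp=0.05, fn=0.10):
    """P(observed response = 1): the true success probability Phi(x*beta)
    attenuated by the misclassification rates."""
    p_true = probit(x * beta)
    return (1 - fn) * p_true + fp * (1 - p_true)

def log_likelihood(data, beta, fp=0.05, fn=0.10):
    """Misclassification-adjusted Bernoulli log-likelihood over (x, y) pairs."""
    ll = 0.0
    for x, y in data:
        p = observed_prob(x, beta, fp, fn)
        ll += math.log(p) if y else math.log(1 - p)
    return ll

# Misclassification caps the observed probability inside [fp, 1 - fn]:
print(observed_prob(10, 1))   # ~0.90, not 1.0, because of the fn rate
print(observed_prob(-10, 1))  # ~0.05, not 0.0, because of the fp rate
```

Ignoring the rates (setting fp = fn = 0) recovers the naive probit likelihood, which is exactly the comparison the paper's sensitivity study makes.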