Outlying Observations

Selected Abstracts


OUTLYING OBSERVATIONS AND MISSING VALUES: HOW SHOULD THEY BE HANDLED?

CLINICAL AND EXPERIMENTAL PHARMACOLOGY AND PHYSIOLOGY, Issue 5-6 2008
John Ludbrook
SUMMARY
1. The problems of, and best solutions for, outlying observations and missing values depend strongly on the sizes of the experimental groups. In original articles published in Clinical and Experimental Pharmacology and Physiology during 2006-2007, group sizes ranged from three to 44 ('small groups'). In surveys, epidemiological studies and clinical trials, group sizes range from hundreds to thousands ('large groups').
2. How can one detect outlying (extreme) observations? The best methods are graphical, for instance: (i) a scatterplot, often with mean ± 2 s; and (ii) a box-and-whisker plot. Even with these, it is a matter of judgement whether observations are truly outlying.
3. It is permissible to delete or replace outlying observations if an independent explanation for them can be found, for instance failure of a piece of measuring equipment or human error in operating it. If the observation is deleted, it can then be treated as a missing value. Rarely, the appropriate portion of the study can be repeated.
4. It is decidedly not permissible to delete unexplained extreme values. Acceptable strategies for handling them include: (i) transform the data and proceed with conventional statistical analyses; (ii) use the mean for location, but use permutation (randomization) tests for comparing means; and (iii) use robust methods for describing location (e.g. median, geometric mean, trimmed mean), for indicating dispersion (range, percentiles), for comparing locations and for regression analysis.
5. What can be done about missing values? Some strategies are: (i) ignore them; (ii) replace them by hand if the data set is small; and (iii) use computerized imputation techniques to replace them if the data set is large (e.g. regression or EM (conditional Expectation, Maximum likelihood estimation) methods).
6. If the missing values are ignored, or even if they are replaced, it is essential to test whether the individuals with missing values are otherwise indistinguishable from the remainder of the group. If the missing values have not occurred at random, but are associated with some property of the individuals being studied, the subsequent analysis may be biased. [source]
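The box-and-whisker criterion in point 2 and the robust summaries in point 4 can be sketched in a few lines. This is a minimal illustration with invented data; the 1.5 × IQR fences (Tukey's convention) and the trimming fraction are assumptions, since the summary does not fix a specific rule:

```python
# Flag candidate outliers with the box-and-whisker (1.5 * IQR) rule,
# then summarize with robust statistics (median, trimmed mean).
from statistics import mean, median

def quartiles(xs):
    """Crude quartile estimate: medians of the lower and upper halves."""
    s = sorted(xs)
    half = len(s) // 2
    return median(s[:half]), median(s[-half:])

def flag_outliers(xs, k=1.5):
    """Values outside [Q1 - k*IQR, Q3 + k*IQR] (Tukey's fences)."""
    q1, q3 = quartiles(xs)
    iqr = q3 - q1
    lo, hi = q1 - k * iqr, q3 + k * iqr
    return [x for x in xs if x < lo or x > hi]

def trimmed_mean(xs, frac=0.2):
    """Mean after dropping the extreme frac of values at each end."""
    s = sorted(xs)
    cut = int(len(s) * frac)
    return mean(s[cut:len(s) - cut] if cut else s)

data = [4.1, 4.3, 3.9, 4.0, 4.2, 4.4, 9.8]  # one suspicious value
print(flag_outliers(data))                   # → [9.8]
print(median(data), trimmed_mean(data))      # robust summaries, barely moved by 9.8
```

As point 2 warns, the fences only nominate candidates; whether 9.8 is truly outlying remains a judgement call unless an independent explanation is found.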


How well can animals navigate?

ENVIRONMETRICS, Issue 4 2006
Estimating the circle of confusion from tracking data
Abstract State-space models have recently been shown to effectively model animal movement. In this paper we illustrate how such models can be used to improve our knowledge of animal navigation ability, something which is poorly understood. This work is of great interest when modeling the behavior of animals that are migrating, often over tremendously large distances. We use the term circle of confusion, first proposed by Kendall (1974), to describe the general inability of an animal to know its location precisely. Our modeling strategy enables us to statistically describe the circle of confusion associated with any animal movements where departure and destination points are known. For illustration, we use ARGOS satellite telemetry of leatherback turtles migrating over a distance of approximately 4000 km in the Atlantic Ocean. Robust features of the model enable one to deal with outlying observations, highly characteristic of these types of data. Although specifically designed for data obtained using satellite telemetry, our approach is generalizable to other common kinds of movement data such as archival tag data. Copyright © 2005 John Wiley & Sons, Ltd. [source]
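The state-space idea can be sketched with a deliberately simplified one-dimensional random-walk filter. Everything here (the gating rule, noise levels, and track) is an invented stand-in for the paper's much richer model of two-dimensional ARGOS fixes; the only point is how a state-space filter absorbs an outlying observation:

```python
# A minimal random-walk Kalman filter with a crude robustification:
# observations whose innovation is implausibly large (a "bad fix")
# have their measurement variance inflated, so they barely move the state.
import random

random.seed(2)

def robust_kalman_1d(obs, q=0.1, r=1.0, gate=3.0):
    """Filter a 1-D track; down-weight gated (outlying) innovations."""
    x, p = obs[0], 1.0
    states = []
    for z in obs:
        p_pred = p + q                        # predict: position persists, uncertainty grows
        innov = z - x
        # Inflate R when the innovation exceeds `gate` predictive std devs.
        r_eff = r if innov ** 2 <= gate ** 2 * (p_pred + r) else 100.0 * r
        k = p_pred / (p_pred + r_eff)         # Kalman gain (tiny for outliers)
        x = x + k * innov
        p = (1 - k) * p_pred
        states.append(x)
    return states

# A slowly drifting track with one gross outlier (e.g. a bad satellite fix).
track = [0.1 * t + random.gauss(0, 0.3) for t in range(50)]
track[25] += 20.0
smoothed = robust_kalman_1d(track)
print(abs(smoothed[25] - 2.5), abs(track[25] - 2.5))  # filter pulls the fix back
```

A fully robust treatment would instead use heavy-tailed observation errors inside the state-space model, but the down-weighting effect is the same in spirit.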


A robust PCR method for high-dimensional regressors

JOURNAL OF CHEMOMETRICS, Issue 8-9 2003
Mia Hubert
Abstract We consider the multivariate calibration model which assumes that the concentrations of several constituents of a sample are linearly related to its spectrum. Principal component regression (PCR) is widely used for the estimation of the regression parameters in this model. In the classical approach it combines principal component analysis (PCA) on the regressors with least squares regression. However, both stages yield very unreliable results when the data set contains outlying observations. We present a robust PCR (RPCR) method which also consists of two parts. First we apply a robust PCA method for high-dimensional data on the regressors, then we regress the response variables on the scores using a robust regression method. A robust RMSECV value and a robust R2 value are proposed as exploratory tools to select the number of principal components. The prediction error is also estimated in a robust way. Moreover, we introduce several diagnostic plots which are helpful to visualize and classify the outliers. The robustness of RPCR is demonstrated through simulations and the analysis of a real data set. Copyright © 2003 John Wiley & Sons, Ltd. [source]
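The two-stage structure that RPCR robustifies can be sketched as follows. The "robust" ingredients here (median centring, ordinary least squares on the scores) are crude stand-ins, not the ROBPCA and robust regression estimators the paper actually uses, and the simulated low-rank "spectra" are invented:

```python
# Schematic PCR pipeline: (1) reduce the regressors to a few scores,
# (2) regress the response on those scores. RPCR replaces both stages
# with robust counterparts; this sketch only marks where they slot in.
import numpy as np

rng = np.random.default_rng(0)
# Simulate 30 samples x 50 spectral channels driven by 3 latent factors.
latent = rng.normal(size=(30, 3))
loadings = rng.normal(size=(3, 50))
X = latent @ loadings + 0.01 * rng.normal(size=(30, 50))
y = latent @ np.array([1.0, -2.0, 0.5]) + 0.01 * rng.normal(size=30)

# Stage 1: dimension reduction of the regressors.
# (RPCR uses a robust PCA for high-dimensional data here; median
#  centring plus a classical SVD is only a rough stand-in.)
Xc = X - np.median(X, axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
k = 3                      # in RPCR, chosen via robust RMSECV / robust R^2
T = Xc @ Vt[:k].T          # scores

# Stage 2: regress the response on the scores.
# (RPCR uses a robust regression estimator; OLS is the classical choice.)
coef, *_ = np.linalg.lstsq(T, y - y.mean(), rcond=None)
y_hat = T @ coef + y.mean()
print("RMSE:", float(np.sqrt(np.mean((y - y_hat) ** 2))))
```

With clean simulated data both classical and robust versions agree; the paper's point is that only the robust pipeline survives when outlying spectra or responses contaminate either stage.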


Toward robust QSPR models: Synergistic utilization of robust regression and variable elimination

JOURNAL OF COMPUTATIONAL CHEMISTRY, Issue 6 2008
Rainer Grohmann
Abstract Widely used regression approaches in modeling quantitative structure-property relationships, such as PLS regression, are highly susceptible to outlying observations, which impair the prognostic value of a model. Our aim is to compile homogeneous datasets as the basis for regression modeling by removing outlying compounds and applying variable selection. We investigate different approaches to creating robust, outlier-resistant regression models for predicting the permeability of drug molecules. The objective is to combine the strengths of outlier detection and variable elimination to increase the predictive power of prognostic regression models. In conclusion, outlier detection is employed to identify multiple, homogeneous data subsets for regression modeling. © 2007 Wiley Periodicals, Inc. J Comput Chem 2008 [source]


Robustness of alternative non-linearity tests for SETAR models

JOURNAL OF FORECASTING, Issue 3 2004
Wai-Sum Chan
Abstract In recent years there has been a growing interest in exploiting potential forecast gains from the non-linear structure of self-exciting threshold autoregressive (SETAR) models. Statistical tests have been proposed in the literature to help analysts check for the presence of SETAR-type non-linearities in an observed time series. It is important to study the power and robustness properties of these tests, since erroneous test results might lead to misspecified prediction problems. In this paper we investigate the robustness properties of several commonly used non-linearity tests, considering both robustness with respect to outlying observations and robustness with respect to model specification. The power comparison of these testing procedures is carried out using Monte Carlo simulation. The results indicate that none of the existing tests is robust to outliers and model misspecification. Finally, the tests are applied to stock market returns of the four little dragons (Hong Kong, South Korea, Singapore and Taiwan) in East Asia, where they fail to provide consistent conclusions most of the time. The results in this article stress the need for a more robust test for SETAR-type non-linearity in time series analysis and forecasting. Copyright © 2004 John Wiley & Sons, Ltd. [source]
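For readers unfamiliar with SETAR models, here is a minimal simulation of the two-regime structure such tests look for. The coefficients, threshold, and the crude per-regime slope diagnostic are all invented for illustration; they are not the tests studied in the paper:

```python
# Simulate a SETAR(2; 1, 1) series: an AR(1) whose coefficient switches
# depending on whether the previous value is above or below a threshold.
import random

random.seed(0)

def simulate_setar(n, threshold=0.0, phi_low=0.6, phi_high=-0.4, sigma=1.0):
    """y_t = phi_low * y_{t-1} + e_t if y_{t-1} <= threshold, else phi_high * y_{t-1} + e_t."""
    y = [0.0]
    for _ in range(n - 1):
        phi = phi_low if y[-1] <= threshold else phi_high
        y.append(phi * y[-1] + random.gauss(0, sigma))
    return y

def regime_slope(y, low, threshold=0.0):
    """No-intercept lag-1 slope estimated within one regime."""
    pairs = [(y[t - 1], y[t]) for t in range(1, len(y))
             if (y[t - 1] <= threshold) == low]
    sxx = sum(x * x for x, _ in pairs)
    sxy = sum(x * z for x, z in pairs)
    return sxy / sxx

series = simulate_setar(500)
# For SETAR data the two regime slopes differ markedly; for a linear
# AR(1) they would agree (up to sampling error) -- the contrast that
# SETAR-type non-linearity tests formalize.
print(regime_slope(series, True), regime_slope(series, False))
```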


Selecting explanatory variables with the modified version of the Bayesian information criterion

QUALITY AND RELIABILITY ENGINEERING INTERNATIONAL, Issue 6 2008
Małgorzata Bogdan
Abstract We consider the situation in which a large database needs to be analyzed to identify a few important predictors of a given quantitative response variable. There is a lot of evidence that in this case classical model selection criteria, such as the Akaike information criterion or the Bayesian information criterion (BIC), have a strong tendency to overestimate the number of regressors. In our earlier papers, we developed the modified version of BIC (mBIC), which enables the incorporation of prior knowledge on a number of regressors and prevents overestimation. In this article, we review earlier results on mBIC and discuss the relationship of this criterion to the well-known Bonferroni correction for multiple testing and the Bayes oracle, which minimizes the expected costs of inference. We use computer simulations and a real data analysis to illustrate the performance of the original mBIC and its rank version, which is designed to deal with data that contain some outlying observations. Copyright © 2008 John Wiley & Sons, Ltd. [source]
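The intuition behind the over-selection problem and the extra mBIC penalty can be sketched numerically. The 2k log p term below is a Bonferroni-style illustration of the idea, not the authors' exact criterion (whose constants differ), and the data are pure noise by construction:

```python
# With p candidate regressors and none truly relevant, the best of the
# p spurious fits looks good enough for plain BIC to accept; an extra
# penalty of order 2k*log(p) charges for having searched p candidates.
import math
import random

random.seed(1)
n, p = 100, 200
y = [random.gauss(0, 1) for _ in range(n)]
X = [[random.gauss(0, 1) for _ in range(p)] for _ in range(n)]

def rss_single(j):
    """RSS after a no-intercept regression of y on column j alone."""
    xj = [row[j] for row in X]
    b = sum(a * c for a, c in zip(xj, y)) / sum(a * a for a in xj)
    return sum((yi - b * xi) ** 2 for xi, yi in zip(xj, y))

rss_null = sum(yi ** 2 for yi in y)
best = min(range(p), key=rss_single)          # cherry-picked over 200 candidates
rss_best = rss_single(best)

def bic(rss, k):
    return n * math.log(rss / n) + k * math.log(n)

def mbic_like(rss, k):
    # Bonferroni-style surcharge for searching p candidate regressors.
    return bic(rss, k) + 2 * k * math.log(p)

print("BIC:        null =", round(bic(rss_null, 0), 1),
      " best single =", round(bic(rss_best, 1), 1))
print("mBIC-like:  null =", round(mbic_like(rss_null, 0), 1),
      " best single =", round(mbic_like(rss_best, 1), 1))
```

The cherry-picked regressor always lowers the RSS, so the criteria differ only in how much they charge for it; the surcharge is what prevents the overestimation the abstract describes.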

