Real Data Sets (real + data_set)

Distribution by Scientific Domains


Selected Abstracts


Fitting copulas to bivariate earthquake data: the seismic gap hypothesis revisited

ENVIRONMETRICS, Issue 3 2008
Aristidis K. Nikoloulopoulos
Abstract The seismic gap hypothesis assumes that the intensity of an earthquake and the time elapsed since the previous one are positively related. Previous work on this topic was based on particular assumptions for the joint distribution, implying a specific type of dependence. We investigate this hypothesis using copulas. Copulas model the dependence structure flexibly, without assuming a simple linear correlation structure, and thus allow a better examination of this controversial aspect of geophysical research. In fact, via copulas, marginal properties and the dependence structure can be separated. We propose a model averaging approach in order to allow for model uncertainty and to diminish the effect of the choice of a particular copula. This enlarges the range of potential dependence structures that can be investigated. An application to a real data set is provided. Copyright © 2007 John Wiley & Sons, Ltd. [source]
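
To make the model-averaging idea concrete, the following minimal sketch fits two candidate copula families to rank-transformed bivariate data and combines them with AIC weights. The Gaussian and Clayton families, the simulated (magnitude, elapsed-time) pairs, and the pseudo-observation step are illustrative assumptions; the paper averages over a different and larger candidate set.

```python
# Minimal sketch: fit two copula families by maximum pseudo-likelihood and
# average them with AIC weights.  Data and family choice are stand-ins.
import numpy as np
from scipy import optimize, stats

def gaussian_copula_loglik(rho, u, v):
    x, y = stats.norm.ppf(u), stats.norm.ppf(v)
    return np.sum(-0.5 * np.log(1 - rho**2)
                  - (rho**2 * (x**2 + y**2) - 2 * rho * x * y)
                  / (2 * (1 - rho**2)))

def clayton_copula_loglik(theta, u, v):
    return np.sum(np.log(1 + theta) - (1 + theta) * np.log(u * v)
                  - (2 + 1 / theta) * np.log(u**-theta + v**-theta - 1))

def fit(loglik, bounds, u, v):
    res = optimize.minimize_scalar(lambda p: -loglik(p, u, v),
                                   bounds=bounds, method="bounded")
    return res.x, -res.fun

rng = np.random.default_rng(0)
# stand-ins for (magnitude, elapsed time) pairs
z = rng.multivariate_normal([0, 0], [[1, 0.4], [0.4, 1]], size=300)
u = stats.rankdata(z[:, 0]) / (len(z) + 1)      # pseudo-observations
v = stats.rankdata(z[:, 1]) / (len(z) + 1)

fits = {"gaussian": fit(gaussian_copula_loglik, (-0.99, 0.99), u, v),
        "clayton":  fit(clayton_copula_loglik, (1e-3, 20.0), u, v)}
aic = {k: 2 * 1 - 2 * ll for k, (par, ll) in fits.items()}
delta = {k: a - min(aic.values()) for k, a in aic.items()}
weights = {k: np.exp(-0.5 * d) for k, d in delta.items()}
total = sum(weights.values())
for k in fits:
    print(k, "param=%.3f" % fits[k][0], "AIC weight=%.3f" % (weights[k] / total))
```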


Nonparametric prediction intervals for the future rainfall records

ENVIRONMETRICS, Issue 5 2006
Mohammad Z. Raqab
Abstract Prediction of records plays an important role in environmental applications, especially the prediction of rainfall extremes, highest water levels, and record sea-surface and air temperatures. In this paper, based on the observed records drawn from a sample of independent and identically distributed random variables, we develop prediction intervals as well as upper and lower prediction bounds for records from another independent sequence. We extend the prediction problem to include prediction regions for joint upper records from a future sample. Bonferroni's inequality is used to choose appropriate prediction coefficients for the joint prediction. A real data set representing the records of the annual (January 1–December 31) rainfall at the Los Angeles Civic Center is used to illustrate the proposed prediction procedures in environmental applications. Copyright © 2005 John Wiley & Sons, Ltd. [source]
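
The sketch below illustrates two ingredients named in the abstract: extracting the upper records from an observed sequence, and splitting an overall error rate across several future records with Bonferroni's inequality. The rainfall values are invented, and the nonparametric interval construction itself follows the paper and is not reproduced here.

```python
# Extract upper records from a sequence and show the Bonferroni split of an
# overall error rate over k future records.  Toy data only.
import numpy as np

def upper_records(x):
    """Return (indices, values) of the upper records of a sequence."""
    idx, rec, best = [], [], -np.inf
    for i, xi in enumerate(x):
        if xi > best:
            best = xi
            idx.append(i)
            rec.append(xi)
    return np.array(idx), np.array(rec)

rainfall = np.array([9.2, 14.8, 11.3, 21.0, 13.1, 26.3, 18.0, 30.5])  # toy data
years, records = upper_records(rainfall)
print("record years:", years, "record values:", records)

# Bonferroni: to cover k future records jointly at level 1 - alpha,
# build each marginal prediction interval at level 1 - alpha / k.
alpha, k = 0.10, 3
print("per-record level:", 1 - alpha / k)
```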


Haplotype analysis in the presence of informatively missing genotype data

GENETIC EPIDEMIOLOGY, Issue 4 2006
Nianjun Liu
Abstract It is common to have missing genotypes in practical genetic studies, but the exact underlying missing-data mechanism is generally unknown to the investigators. Although some statistical methods can handle missing data, they usually assume that genotypes are missing at random, that is, at a given marker, different genotypes and different alleles are missing with the same probability. These include methods for haplotype frequency estimation and haplotype association analysis. However, it is likely that this simple assumption does not hold in practice, yet few studies to date have examined the magnitude of the effects when this simplifying assumption is violated. In this study, we demonstrate that violation of this assumption may lead to serious bias in haplotype frequency estimates, and that haplotype association analysis based on this assumption can induce both false-positive and false-negative evidence of association. To address this limitation of current methods, we propose a general missing-data model to characterize missing-data patterns across a set of two or more markers simultaneously. We prove that haplotype frequencies and missing-data probabilities are identifiable if and only if there is linkage disequilibrium between these markers under our general missing-data model. Simulation studies on the analysis of haplotypes consisting of two single nucleotide polymorphisms illustrate that our proposed model can reduce the bias in both haplotype frequency estimates and association analysis caused by an incorrect assumption about the missing-data mechanism. Finally, we illustrate the utility of our method through its application to a real data set. Genet. Epidemiol. 2006. © 2006 Wiley-Liss, Inc. [source]
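
For orientation, the following sketch implements the standard EM algorithm for two-SNP haplotype frequencies under the usual missing-at-random/complete-genotype assumptions, i.e., the baseline that the abstract argues can be biased when genotypes are informatively missing. The 0/1/2 genotype coding and the toy data are assumptions for illustration.

```python
# Standard EM for two-SNP haplotype frequencies (phase-ambiguous only for
# double heterozygotes).  Baseline method, not the paper's missing-data model.
import numpy as np

def em_two_snp_haplotypes(genotypes, n_iter=200):
    """genotypes: (n, 2) array of 0/1/2 codes.  Returns freqs of 00, 01, 10, 11."""
    f = np.full(4, 0.25)                       # haplotypes 00, 01, 10, 11
    idx = lambda a, b: 2 * a + b
    for _ in range(n_iter):
        counts = np.zeros(4)
        for g1, g2 in genotypes:
            if g1 == 1 and g2 == 1:
                # double heterozygote: phase {00,11} vs {01,10}, split by current freqs
                p_cis = f[idx(0, 0)] * f[idx(1, 1)]
                p_trans = f[idx(0, 1)] * f[idx(1, 0)]
                w = p_cis / (p_cis + p_trans)
                counts[[0, 3]] += w
                counts[[1, 2]] += 1 - w
            else:
                # phase is determined: resolve each SNP into its two alleles
                a = [0, 0] if g1 == 0 else ([1, 1] if g1 == 2 else [0, 1])
                b = [0, 0] if g2 == 0 else ([1, 1] if g2 == 2 else [0, 1])
                counts[idx(a[0], b[0])] += 1
                counts[idx(a[1], b[1])] += 1
        f = counts / counts.sum()
    return f

geno = np.array([[0, 0], [1, 1], [2, 2], [1, 0], [2, 1], [1, 1], [0, 1]])
print(dict(zip(["00", "01", "10", "11"], em_two_snp_haplotypes(geno).round(3))))
```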


Reorganizing web sites based on user access patterns

INTELLIGENT SYSTEMS IN ACCOUNTING, FINANCE & MANAGEMENT, Issue 1 2002
Yongjian Fu
In this paper, an approach for reorganizing Web sites based on user access patterns is proposed. Our goal is to build adaptive Web sites by evolving site structure to facilitate user access. The approach consists of three steps: preprocessing, page classification, and site reorganization. In preprocessing, pages on a Web site are processed to create an internal representation of the site. Page access information of its users is extracted from the Web server log. In page classification, the Web pages on the site are classified into two categories, index pages and content pages, based on the page access information. After the pages are classified, in site reorganization, the Web site is examined to find better ways to organize and arrange the pages on the site. An algorithm for reorganizing Web sites has been developed. Our experiments on a large real data set show that the approach is efficient and practical for adaptive Web sites. Copyright © 2002 John Wiley & Sons, Ltd. [source]


Skew-normal linear calibration: a Bayesian perspective

JOURNAL OF CHEMOMETRICS, Issue 8 2008
Cléber da Costa Figueiredo
Abstract In this paper, we present a Bayesian approach for estimation in the skew-normal calibration model, as well as the conditional posterior distributions which are useful for implementing the Gibbs sampler. Data transformation is thus avoided by using the methodology proposed. Model fitting is implemented by proposing the asymmetric deviance information criterion, ADIC, a modification of the ordinary DIC. We also report an application of the model studied by using a real data set, related to the relationship between the resistance and the elasticity of a sample of concrete beams. Copyright © 2008 John Wiley & Sons, Ltd. [source]


Correlation optimized warping and dynamic time warping as preprocessing methods for chromatographic data

JOURNAL OF CHEMOMETRICS, Issue 5 2004
Giorgio Tomasi
Abstract Two different algorithms for time alignment as a preprocessing step in linear factor models are studied. Correlation optimized warping and dynamic time warping are both presented in the literature as methods that can eliminate shift-related artifacts from measurements by correcting a sample vector towards a reference. In this study both the theoretical properties and the practical implications of using signal warping as preprocessing for chromatographic data are investigated. The connection between the two algorithms is also discussed. The findings are illustrated by means of a case study of principal component analysis on a real data set, including manifest retention time artifacts, of extracts from coffee samples stored under different packaging conditions for varying storage times. We concluded that, for the data presented here, dynamic time warping with rigid slope constraints and correlation optimized warping are superior to unconstrained dynamic time warping; both considerably simplify interpretation of the factor model results. Unconstrained dynamic time warping was found to be too flexible for this chromatographic data set, resulting in an overcompensation of the observed shifts and suggesting the unsuitability of this preprocessing method for this type of signal. Copyright © 2004 John Wiley & Sons, Ltd. [source]
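
As a point of reference, here is a generic, unconstrained dynamic time warping routine with a backtracked warping path. It is not the slope-constrained variant or the correlation optimized warping that the study recommends; the two Gaussian peaks stand in for chromatographic signals.

```python
# Plain unconstrained DTW distance and warping path between two 1-D signals.
import numpy as np

def dtw(x, y):
    """Return (cost, path) aligning 1-D signals x and y."""
    n, m = len(x), len(y)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = (x[i - 1] - y[j - 1]) ** 2
            D[i, j] = d + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    # backtrack the optimal warping path
    i, j, path = n, m, []
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = np.argmin([D[i - 1, j - 1], D[i - 1, j], D[i, j - 1]])
        i, j = (i - 1, j - 1) if step == 0 else ((i - 1, j) if step == 1 else (i, j - 1))
    return D[n, m], path[::-1]

t = np.linspace(0, 1, 80)
reference = np.exp(-((t - 0.40) / 0.05) ** 2)        # reference peak
sample = np.exp(-((t - 0.48) / 0.05) ** 2)           # same peak, shifted
cost, path = dtw(sample, reference)
print("alignment cost: %.4f, path length: %d" % (cost, len(path)))
```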


A robust PCR method for high-dimensional regressors

JOURNAL OF CHEMOMETRICS, Issue 8-9 2003
Mia Hubert
Abstract We consider the multivariate calibration model which assumes that the concentrations of several constituents of a sample are linearly related to its spectrum. Principal component regression (PCR) is widely used for the estimation of the regression parameters in this model. In the classical approach it combines principal component analysis (PCA) on the regressors with least squares regression. However, both stages yield very unreliable results when the data set contains outlying observations. We present a robust PCR (RPCR) method which also consists of two parts. First we apply a robust PCA method for high-dimensional data on the regressors, then we regress the response variables on the scores using a robust regression method. A robust RMSECV value and a robust R² value are proposed as exploratory tools to select the number of principal components. The prediction error is also estimated in a robust way. Moreover, we introduce several diagnostic plots which are helpful to visualize and classify the outliers. The robustness of RPCR is demonstrated through simulations and the analysis of a real data set. Copyright © 2003 John Wiley & Sons, Ltd. [source]
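
The following is a deliberately simplified stand-in for the two-stage structure of RPCR: ordinary PCA on the regressors followed by a Huber-type robust regression of the response on the scores. The paper's method uses a robust PCA (ROBPCA) step and robust validation criteria that are not reproduced here; the simulated spectra and outliers are assumptions.

```python
# Simplified two-stage sketch: PCA scores + robust (Huber) regression.
# Not ROBPCA-based RPCR; illustration of the structure only.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import HuberRegressor

rng = np.random.default_rng(1)
n, p, k = 60, 200, 3                          # samples, wavelengths, components
scores_true = rng.normal(size=(n, k))
loadings = rng.normal(size=(k, p))
X = scores_true @ loadings + 0.05 * rng.normal(size=(n, p))   # "spectra"
y = scores_true @ np.array([1.0, -0.5, 0.3]) + 0.1 * rng.normal(size=n)
y[:3] += 8.0                                  # a few outlying responses

pca = PCA(n_components=k).fit(X)
T = pca.transform(X)                          # scores
robust = HuberRegressor().fit(T, y)
print("coefficients in score space:", robust.coef_.round(3))
```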


Combining univariate calibration information through a mixed-effects model

JOURNAL OF CHEMOMETRICS, Issue 2 2003
Jason J. Z. Liao
Abstract It is common practice to calibrate a common value by combining information from different sources such as days, people, instruments and laboratories. Under each individual source a univariate calibration can be used to calibrate the unknown. Then the common unknown can be estimated by combining the estimates from each source as a weighted mean (Johnson DJ, Krishnamoorthy K. J. Am. Statist. Assoc. 1996; 91: 1707–1715) or through a multivariate calibration setting by combining information first and then estimating the common value (Liao JJZ. J. Chemometrics 2001; 15: 789–794). In this paper a mixed-effects model approach is proposed to combine good characteristics from both approaches. Simulations show that the mixed-effects model has better bias and mean squared error (MSE) performance than the univariate and multivariate approaches. A real data set is used to demonstrate the good characteristics of the mixed-effects model approach. Copyright © 2003 John Wiley & Sons, Ltd. [source]
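
For contrast with the mixed-effects approach, the simplest combination rule mentioned above, a weighted mean of the per-source calibration estimates, can be sketched as follows; the estimates and variances are invented numbers.

```python
# Inverse-variance weighted mean of per-source univariate calibration estimates.
import numpy as np

estimates = np.array([10.2, 9.8, 10.5, 10.1])      # x0 estimated under each source
variances = np.array([0.04, 0.09, 0.06, 0.05])     # their estimated variances

w = 1.0 / variances
combined = np.sum(w * estimates) / np.sum(w)
combined_se = np.sqrt(1.0 / np.sum(w))
print("combined estimate: %.3f (SE %.3f)" % (combined, combined_se))
```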


Modeling Randomness in Judging Rating Scales with a Random-Effects Rating Scale Model

JOURNAL OF EDUCATIONAL MEASUREMENT, Issue 4 2006
Wen-Chung Wang
This study presents the random-effects rating scale model (RE-RSM), which takes into account randomness in the thresholds over persons by treating them as random effects and adding a random variable for each threshold in the rating scale model (RSM) (Andrich, 1978). The RE-RSM turns out to be a special case of the multidimensional random coefficients multinomial logit model (MRCMLM) (Adams, Wilson, & Wang, 1997), so that the estimation procedures for the MRCMLM can be directly applied. The results of the simulation indicated that when the data were generated from the RSM, using the RSM and the RE-RSM to fit the data made little difference: both resulted in accurate parameter recovery. When the data were generated from the RE-RSM, using the RE-RSM to fit the data resulted in unbiased estimates, whereas using the RSM resulted in biased estimates, large fit statistics for the thresholds, and inflated test reliability. An empirical example of 10 items with four-point rating scales is presented in which four models are compared: the RSM, the RE-RSM, the partial credit model (Masters, 1982), and the constrained random-effects partial credit model. In this real data set, the need for a random-effects formulation becomes clear. [source]


Attribution of tumour lethality and estimation of the time to onset of occult tumours in the absence of cause-of-death information

JOURNAL OF THE ROYAL STATISTICAL SOCIETY: SERIES C (APPLIED STATISTICS), Issue 2 2000
H. Ahn
A new statistical approach is developed for estimating the carcinogenic potential of drugs and other chemical substances used by humans. Improved statistical methods are developed for rodent tumorigenicity assays that have interval sacrifices but not cause-of-death data. For such experiments, this paper proposes a nonparametric maximum likelihood estimation method for estimating the distributions of the time to onset of and the time to death from the tumour. The log-likelihood function is optimized using a constrained direct search procedure. Using the maximum likelihood estimators, the number of fatal tumours in an experiment can be imputed. By applying the procedure proposed to a real data set, the effect of calorie restriction is investigated. In this study, we found that calorie restriction delays the tumour onset time significantly for pituitary tumours. The present method can result in substantial economic savings by relieving the need for a case-by-case assignment of the cause of death or context of observation by pathologists. The ultimate goal of the method proposed is to use the imputed number of fatal tumours to modify Peto's International Agency for Research on Cancer test for application to tumorigenicity assays that lack cause-of-death data. [source]


Semiparametric inference on a class of Wiener processes

JOURNAL OF TIME SERIES ANALYSIS, Issue 2 2009
Xiao Wang
Abstract. This article studies the estimation of a nonhomogeneous Wiener process model for degradation data. A pseudo-likelihood method is proposed to estimate the unknown parameters. An attractive algorithm is established to compute the estimator under this pseudo-likelihood formulation. We establish the asymptotic properties of the estimator, including consistency, convergence rate and asymptotic distribution. Random effects can be incorporated into the model to represent the heterogeneity of degradation paths by letting the mean function be random. The Wiener process model is extended naturally to a normal inverse Gaussian process model and similar pseudo-likelihood inference is developed. A score test is used to test for the presence of the random effects. Simulation studies are conducted to validate the method, and we apply our method to a real data set in the area of structural health monitoring. [source]
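
As a simple reference point, for a homogeneous Wiener degradation process the independent-increment likelihood gives closed-form estimates of the drift and diffusion parameters, as sketched below. The paper's nonhomogeneous model, random effects and pseudo-likelihood machinery are not reproduced; the simulated path is an assumption.

```python
# Closed-form MLEs for a homogeneous Wiener process Y(t) = mu*t + sigma*B(t):
# increments dY ~ N(mu*dt, sigma^2*dt) observed at irregular times.
import numpy as np

rng = np.random.default_rng(2)
mu_true, sigma_true = 0.8, 0.3
t = np.sort(rng.uniform(0, 10, size=50))
dt = np.diff(np.concatenate(([0.0], t)))
dy = mu_true * dt + sigma_true * np.sqrt(dt) * rng.normal(size=dt.size)

mu_hat = dy.sum() / dt.sum()                        # drift estimate
sigma2_hat = np.mean((dy - mu_hat * dt) ** 2 / dt)  # diffusion estimate
print("mu_hat=%.3f  sigma_hat=%.3f" % (mu_hat, np.sqrt(sigma2_hat)))
```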


Nonparametric Estimation and Testing in Panels of Intercorrelated Time Series

JOURNAL OF TIME SERIES ANALYSIS, Issue 6 2004
Vidar Hjellvik
Abstract. We consider nonparametric estimation and testing of linearity in a panel of intercorrelated time series. We place the emphasis on the situation where there are many time series in the panel but few observations for each of the series. The intercorrelation is described by a latent process, and a conditioning argument involving this process plays an important role in deriving the asymptotic theory. The asymptotic distribution of the linearity test functional requires a very large number of observations to be accurate, and bootstrapping gives much better finite-sample results. A number of simulation experiments and an illustration on a real data set are included. [source]


A Combinatorial Searching Method for Detecting a Set of Interacting Loci Associated with Complex Traits

ANNALS OF HUMAN GENETICS, Issue 5 2006
Qiuying Sha
Summary Complex diseases are presumed to be the result of interactions among several genes and environmental factors, with each gene having only a small effect on the disease. Mapping complex disease genes therefore becomes one of the greatest challenges facing geneticists. Most current approaches to association studies essentially evaluate one marker or one gene (haplotype approach) at a time. These approaches ignore the possibility that effects of multilocus functional genetic units may play a larger role than a single-locus effect in determining trait variability. In this article, we propose a Combinatorial Searching Method (CSM) to detect a set of interacting loci (which may be unlinked) that predicts the complex trait. In the application of the CSM, a simple filter is used to screen all the possible locus-sets and retain the candidate locus-sets; a new objective function based on cross-validation and partitions of the multilocus genotypes is then used to evaluate the retained locus-sets. The locus-set with the largest value of the objective function is the final locus-set, and a permutation procedure is performed to evaluate the overall p-value of the test for association between the final locus-set and the trait. The performance of the method is evaluated by simulation studies as well as by application to a real data set. The simulation studies show that the CSM has reasonable power to detect high-order interactions. When the CSM is applied to a real data set to detect the locus-set (among the 13 loci in the ACE gene) that predicts systolic blood pressure (SBP) or diastolic blood pressure (DBP), we find that a four-locus gene-gene interaction model best predicts SBP with an overall p-value = 0.033, and similarly a two-locus gene-gene interaction model best predicts DBP with an overall p-value = 0.045. [source]
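
A bare-bones version of the combinatorial search idea is sketched below: exhaustively score small locus subsets by cross-validated prediction of the trait and keep the best subset. The paper's filter step, genotype-partition objective and permutation p-value are not reproduced; logistic regression and the simulated genotypes are stand-ins.

```python
# Exhaustive search over locus pairs scored by cross-validated prediction.
import itertools
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(3)
n, n_loci = 300, 8
G = rng.integers(0, 3, size=(n, n_loci)).astype(float)   # 0/1/2 genotype codes
# binary trait driven by an interaction between loci 1 and 4 (plus noise)
logit = 1.2 * (G[:, 1] * G[:, 4]) - 1.5
y = rng.binomial(1, 1 / (1 + np.exp(-logit)))

best = None
for subset in itertools.combinations(range(n_loci), 2):
    score = cross_val_score(LogisticRegression(max_iter=500),
                            G[:, list(subset)], y, cv=5).mean()
    if best is None or score > best[1]:
        best = (subset, score)
print("best locus pair:", best[0], "CV accuracy: %.3f" % best[1])
```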


NONPARAMETRIC ESTIMATION OF CONDITIONAL CUMULATIVE HAZARDS FOR MISSING POPULATION MARKS

AUSTRALIAN & NEW ZEALAND JOURNAL OF STATISTICS, Issue 1 2010
Dipankar Bandyopadhyay
Summary A new function for the competing risks model, the conditional cumulative hazard function, is introduced, from which the conditional distribution of failure times of individuals failing due to cause j can be studied. The standard Nelson–Aalen estimator is not appropriate in this setting, as population membership (mark) information may be missing for some individuals owing to random right-censoring. We propose the use of imputed population marks for the censored individuals through fractional risk sets. Some asymptotic properties, including uniform strong consistency, are established. We study the practical performance of this estimator through simulation studies and apply it to a real data set for illustration. [source]
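
To fix ideas, the complete-mark version of the cause-specific Nelson–Aalen estimator can be written in a few lines, as below. The abstract's contribution, imputing missing marks through fractional risk sets, is not reproduced, and the toy data (no ties) are invented.

```python
# Cause-specific Nelson-Aalen cumulative hazard with fully observed marks.
import numpy as np

def nelson_aalen_cause(times, events, cause):
    """events: 0 = censored, j > 0 = failure from cause j."""
    order = np.argsort(times)
    times, events = times[order], events[order]
    at_risk = len(times)
    t_out, H, cum = [], [], 0.0
    for t, e in zip(times, events):
        if e == cause:
            cum += 1.0 / at_risk       # increment d_j(t) / Y(t)
            t_out.append(t)
            H.append(cum)
        at_risk -= 1
    return np.array(t_out), np.array(H)

times = np.array([2.1, 3.5, 3.9, 5.0, 6.2, 7.7, 8.4, 9.9])
events = np.array([1, 0, 2, 1, 0, 1, 2, 0])     # cause labels (0 = censored)
t, H = nelson_aalen_cause(times, events, cause=1)
print(np.column_stack([t, H]).round(3))
```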


ESTIMATION IN RICKER'S TWO-RELEASE METHOD: A BAYESIAN APPROACH

AUSTRALIAN & NEW ZEALAND JOURNAL OF STATISTICS, Issue 2 2006
Shen-Ming Lee
Summary Ricker's two-release method is a simplified version of the Jolly-Seber method (see Seber's Estimation of Animal Abundance, 1982) used to estimate survival rate and abundance in animal populations. The method assumes that there is only a single recapture sample and no immigration, emigration or recruitment. In this paper, we propose a Bayesian analysis of this method to estimate the survival rate and the capture probability, employing Markov chain Monte Carlo methods and a latent variable analysis. The performance of the proposed method is illustrated with a simulation study as well as a real data set. The results show that the proposed method provides favourable inference for the survival rate when compared with the modified maximum likelihood method. [source]


Score Tests for Exploring Complex Models: Application to HIV Dynamics Models

BIOMETRICAL JOURNAL, Issue 1 2010
Julia Drylewicz
Abstract In biostatistics, more and more complex models are being developed. This is particularly the case in systems biology. Fitting complex models can be very time-consuming, since many models often have to be explored. Among the possibilities are the introduction of explanatory variables and the determination of random effects. Score tests are attractive for exploring such extensions because they require fitting only the model under the null hypothesis. The particularity of this use of the score test is that the null hypothesis is not itself very simple; typically, some random effects may be present under the null hypothesis. Moreover, the information matrix cannot be computed exactly, but only approximated based on the score. This article examines this situation with the specific example of HIV dynamics models. We examine score test statistics for testing the effect of explanatory variables and the variance of a random effect in this complex situation. We study the type I error and statistical power of these score test statistics, and we apply the score test approach to a real data set of HIV-infected patients. [source]


Sequential designs for ordinal phase I clinical trials

BIOMETRICAL JOURNAL, Issue 2 2009
Guohui Liu
Abstract Sequential designs for phase I clinical trials which incorporate maximum likelihood estimates (MLE) as data accrue are inherently problematic because of limited data for estimation early on. We address this problem for small phase I clinical trials with ordinal responses. In particular, we explore the problem of the nonexistence of the MLE of the logistic parameters under a proportional odds model with one predictor. We incorporate the probability of an undetermined MLE as a restriction, as well as ethical considerations, into a proposed sequential optimal approach, which consists of a start-up design, a follow-on design and a sequential dose-finding design. Comparisons with nonparametric sequential designs are also performed based on simulation studies with parameters drawn from a real data set. [source]


Variable Selection for Semiparametric Mixed Models in Longitudinal Studies

BIOMETRICS, Issue 1 2010
Xiao Ni
Summary We propose a double-penalized likelihood approach for simultaneous model selection and estimation in semiparametric mixed models for longitudinal data. Two types of penalties are jointly imposed on the ordinary log-likelihood: a roughness penalty on the nonparametric baseline function and a nonconcave shrinkage penalty on the linear coefficients to achieve model sparsity. Compared to existing estimating-equation-based approaches, our procedure provides valid inference for data that are missing at random, and is more efficient if the specified model is correct. Another advantage of the new procedure is its easy computation for both regression components and variance parameters. We show that the double-penalized problem can be conveniently reformulated into a linear mixed model framework, so that existing software can be directly used to implement our method. For the purpose of model inference, we derive both frequentist and Bayesian variance estimates for the estimated parametric and nonparametric components. Simulation is used to evaluate and compare the performance of our method to existing ones. We then apply the new method to a real data set from a lactation study. [source]


Flexible Designs for Genomewide Association Studies

BIOMETRICS, Issue 3 2009
André Scherag
Summary Genomewide association studies attempting to unravel the genetic etiology of complex traits have recently gained attention. Frequently, these studies employ a sequential genotyping strategy: A large panel of markers is examined in a subsample of subjects, and the most promising markers are genotyped in the remaining subjects. In this article, we introduce a novel method for such designs enabling investigators to, for example, modify marker densities and sample proportions while strongly controlling the family-wise type I error rate. Loss of efficiency is avoided by redistributing conditional type I error rates of discarded markers. Our approach can be combined with cost optimal designs and entails a greater flexibility than all previously suggested designs. Among other features, it allows for marker selections based upon biological criteria instead of statistical criteria alone, or the option to modify the sample size at any time during the course of the project. For practical applicability, we develop a new algorithm, subsequently evaluate it by simulations, and illustrate it using a real data set. [source]


Mixture Generalized Linear Models for Multiple Interval Mapping of Quantitative Trait Loci in Experimental Crosses

BIOMETRICS, Issue 2 2009
Zehua Chen
Summary Quantitative trait loci mapping in experimental organisms is of great scientific and economic importance. There has been rapid advancement in statistical methods for quantitative trait loci mapping. Various methods for normally distributed traits have been well established. Some of them have also been adapted for other types of traits such as binary, count, and categorical traits. In this article, we consider a unified mixture generalized linear model (GLIM) for multiple interval mapping in experimental crosses. The multiple interval mapping approach was proposed by Kao, Zeng, and Teasdale (1999, Genetics 152, 1203–1216) for normally distributed traits. However, its application to nonnormally distributed traits has been hindered largely by the lack of an efficient computation algorithm and an appropriate mapping procedure. In this article, an effective expectation–maximization algorithm for the computation of the mixture GLIM and an epistasis-effect-adjusted multiple interval mapping procedure are developed. A real data set, the Radiata Pine data, is analyzed, and its data structure is used in simulation studies to demonstrate the desirable features of the developed method. [source]


Robustified Maximum Likelihood Estimation in Generalized Partial Linear Mixed Model for Longitudinal Data

BIOMETRICS, Issue 1 2009
Guo You Qin
Summary In this article, we study the robust estimation of both mean and variance components in generalized partial linear mixed models based on the construction of a robustified likelihood function. Under some regularity conditions, the asymptotic properties of the proposed robust estimators are shown. Simulations are carried out to investigate the performance of the proposed robust estimators. Just as expected, the proposed robust estimators perform better than those resulting from robust estimating equations involving conditional expectation, such as Sinha (2004, Journal of the American Statistical Association 99, 451–460) and Qin and Zhu (2007, Journal of Multivariate Analysis 98, 1658–1683). Finally, the proposed robust method is illustrated by the analysis of a real data set. [source]


On Estimation and Prediction for Spatial Generalized Linear Mixed Models

BIOMETRICS, Issue 1 2002
Hao Zhang
Summary. We use spatial generalized linear mixed models (GLMM) to model non-Gaussian spatial variables that are observed at sampling locations in a continuous area. In many applications, prediction of random effects in a spatial GLMM is of great practical interest. We show that the minimum mean-squared error (MMSE) prediction can be done in a linear fashion in spatial GLMMs analogous to linear kriging. We develop a Monte Carlo version of the EM gradient algorithm for maximum likelihood estimation of model parameters. A by-product of this approach is that it also produces the MMSE estimates for the realized random effects at the sampled sites. This method is illustrated through a simulation study and is also applied to a real data set on plant root diseases to obtain a map of disease severity that can facilitate the practice of precision agriculture. [source]


Model Selection in Estimating Equations

BIOMETRICS, Issue 2 2001
Wei Pan
Summary. Model selection is a necessary step in many practical regression analyses. But for methods based on estimating equations, such as the quasi-likelihood and generalized estimating equation (GEE) approaches, there seem to be few well-studied model selection techniques. In this article, we propose a new model selection criterion that minimizes the expected predictive bias (EPB) of estimating equations. A bootstrap smoothed cross-validation (BCV) estimate of EPB is presented and its performance is assessed via simulation for overdispersed generalized linear models. For illustration, the method is applied to a real data set taken from a study of the development of ewe embryos. [source]


Akaike's Information Criterion in Generalized Estimating Equations

BIOMETRICS, Issue 1 2001
Wei Pan
Summary. Correlated response data are common in biomedical studies. Regression analysis based on the generalized estimating equations (GEE) is an increasingly important method for such data. However, there seem to be few model-selection criteria available in GEE. The well-known Akaike Information Criterion (AIC) cannot be directly applied since AIC is based on maximum likelihood estimation while GEE is nonlikelihood based. We propose a modification to AIC, where the likelihood is replaced by the quasi-likelihood and a proper adjustment is made for the penalty term. Its performance is investigated through simulation studies. For illustration, the method is applied to a real data set. [source]
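
For reference, the criterion this abstract describes is commonly written in the following form (a sketch of the standard statement, not quoted from the paper), with Q the quasi-likelihood evaluated under the independence working correlation I, Omega_I the corresponding model-based information, and V_R the robust sandwich covariance estimate obtained under working correlation R:

```latex
% Quasi-likelihood information criterion for a GEE fit with working correlation R.
\[
  \mathrm{QIC}(R) \;=\; -2\, Q\!\bigl(\hat{\beta}(R);\, I\bigr)
  \;+\; 2\, \operatorname{trace}\!\bigl(\hat{\Omega}_I \, \hat{V}_R\bigr)
\]
```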


Palaeomorphology: fossils and the inference of cladistic relationships

ACTA ZOOLOGICA, Issue 1 2010
Gregory D. Edgecombe
Abstract Edgecombe, G.D. 2010. Palaeomorphology: fossils and the inference of cladistic relationships. Acta Zoologica (Stockholm) 91: 72–80. Twenty years have passed since it was empirically demonstrated that inclusion of extinct taxa could overturn a phylogenetic hypothesis formulated upon extant taxa alone, challenging Colin Patterson's bold conjecture that this phenomenon 'may be non-existent'. Suppositions and misconceptions about missing data, often couched in terms of 'wildcard taxa' and 'the missing data problem', continue to cloud the literature on the topic of fossils and phylogenetics. Comparisons of real data sets show that no a priori (or indeed a posteriori) decisions can be made about amounts of missing data and most properties of cladograms, and both simulated and real data sets demonstrate that even highly incomplete taxa can impact on relationships. The exclusion of fossils from phylogenetic analyses is neither theoretically nor empirically defensible. [source]


Variable smoothing in Bayesian intrinsic autoregressions

ENVIRONMETRICS, Issue 8 2007
Mark J. Brewer
Abstract We introduce an adapted form of the Markov random field (MRF) for Bayesian spatial smoothing with small-area data. This new scheme allows the amount of smoothing to vary in different parts of a map by employing area-specific smoothing parameters, related to the variance of the MRF. We take an empirical Bayes approach, using variance information from a standard MRF analysis to provide prior information for the smoothing parameters of the adapted MRF. The scheme is shown to produce proper posterior distributions for a broad class of models. We test our method on both simulated and real data sets, and for the simulated data sets, the new scheme is found to improve modelling of both slowly-varying levels of smoothness and discontinuities in the response surface. Copyright © 2007 John Wiley & Sons, Ltd. [source]


An empirical method for inferring species richness from samples

ENVIRONMETRICS, Issue 2 2006
Paul A. Murtaugh
Abstract We introduce an empirical method of estimating the number of species in a community based on a random sample. The numbers of sampled individuals of different species are modeled as a multinomial random vector, with cell probabilities estimated by the relative abundances of species in the sample and, for hypothetical species missing from the sample, by linear extrapolation from the abundance of the rarest observed species. Inference is then based on likelihoods derived from the multinomial distribution, conditioning on a range of possible values of the true richness in the community. The method is shown to work well in simulations based on a variety of real data sets. Copyright © 2005 John Wiley & Sons, Ltd. [source]


Case-control association testing in the presence of unknown relationships

GENETIC EPIDEMIOLOGY, Issue 8 2009
Yoonha Choi
Abstract Genome-wide association studies result in inflated false-positive results when unrecognized cryptic relatedness exists. A number of methods have been proposed for testing association between markers and disease with a correction for known pedigree-based relationships. However, in most case-control studies, relationships are generally unknown, yet the design is predicated on the assumption of at least ancestral relatedness among cases. Here, we focus on adjusting for cryptic relatedness when the genealogy of the sample is unknown, particularly in the context of samples from isolated populations where cryptic relatedness may be problematic. We estimate cryptic relatedness using maximum-likelihood methods and use a corrected χ2 test with estimated kinship coefficients for testing in the context of unknown cryptic relatedness. Estimated kinship coefficients characterize precisely the relatedness between truly related people, but are biased for unrelated pairs. The proposed test substantially reduces spurious positive results, producing a uniform null distribution of P-values. Especially with missing pedigree information, estimated kinship coefficients can still be used to correct for non-independence among individuals. The corrected test was applied to real data sets from genetic isolates and produced a distribution of P-values that was close to uniform. Thus, the proposed test corrects the non-uniform distribution of P-values obtained with the uncorrected test and illustrates the advantage of the approach on real data. Genet. Epidemiol. 33:668–678, 2009. © 2009 Wiley-Liss, Inc. [source]


Inferences for Selected Location Quotients with Applications to Health Outcomes

GEOGRAPHICAL ANALYSIS, Issue 3 2010
Gemechis Dilba Djira
The location quotient (LQ) is an index frequently used in geography and economics to measure the relative concentration of activities. This quotient is calculated in a variety of ways depending on which group is used as a reference. Here, we focus on a simultaneous inference for the ratios of the individual proportions to the overall proportion based on binomial data. This is a multiple comparison problem and inferences for LQs with adjustments for multiplicity have not been addressed before. The comparisons are negatively correlated. The quotients can be simultaneously tested against unity, and simultaneous confidence intervals can be constructed for the LQs based on existing probability inequalities and by directly using the asymptotic joint distribution of the associated test statistics. The proposed inferences are appropriate for analysis based on sample surveys. Two real data sets are used to demonstrate the application of multiplicity-adjusted LQs. A simulation study is also carried out to assess the performance of the proposed methods to achieve a nominal coverage probability. For the LQs considered, the coverage of the simple Bonferroni-adjusted Fieller intervals for LQs is observed to be almost as good as the coverage of the method that directly takes the correlations into account. [source]
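
A small numerical sketch of the quantity being discussed: location quotients with crude simultaneous intervals from a normal approximation on the log scale and a Bonferroni adjustment. This ignores the correlation between each region and the overall total that the paper's Fieller-based intervals account for, so treat it purely as an illustration; the counts are made up.

```python
# Location quotients with Bonferroni-adjusted log-scale Wald intervals.
import numpy as np
from scipy import stats

cases = np.array([30, 12, 45, 8])        # events per region
totals = np.array([400, 300, 500, 250])  # population per region
p_i = cases / totals
p_all = cases.sum() / totals.sum()
lq = p_i / p_all                         # location quotients

alpha = 0.05
z = stats.norm.ppf(1 - alpha / (2 * len(lq)))          # Bonferroni adjustment
se_log = np.sqrt((1 - p_i) / (p_i * totals)
                 + (1 - p_all) / (p_all * totals.sum()))   # crude, ignores correlation
lower, upper = lq * np.exp(-z * se_log), lq * np.exp(z * se_log)
for i in range(len(lq)):
    print("region %d: LQ=%.2f  [%.2f, %.2f]" % (i, lq[i], lower[i], upper[i]))
```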


Unsupervised separation of seismic waves using the watershed algorithm on time-scale images

GEOPHYSICAL PROSPECTING, Issue 4 2004
Antoine Roueff
ABSTRACT This paper illustrates the use of image processing techniques for separating seismic waves. Because of the non-stationarity of seismic signals, the continuous wavelet transform is more suitable than the conventional Fourier transform for the representation, and thus the analysis, of seismic processes. It provides a 2D representation, called a scalogram, of a 1D signal in which seismic events are well localized and isolated. Supervised methods based on this time-scale representation have already been used to separate seismic events, but they require strong interaction with the geophysicist. This paper focuses on the use of the watershed algorithm to segment time-scale representations of seismic signals, which leads to an automatic estimation of the wavelet representation of each wave separately. Computing the inverse wavelet transform then reconstructs the different waves. This segmentation, tracked over the different traces of the seismic profile, enables an accurate separation of the different wavefields. The method has been successfully validated on several real data sets. [source]
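
The pipeline can be sketched as follows: build a time-scale magnitude image with a hand-rolled Ricker-wavelet transform and segment it with the watershed transform so that each catchment basin isolates one event. The authors work on real multitrace profiles and reconstruct each wave by inverting its masked scalogram; neither step is reproduced here, and the synthetic trace is an assumption.

```python
# Time-scale image of a synthetic trace, segmented with the watershed transform.
import numpy as np
from skimage.feature import peak_local_max
from skimage.segmentation import watershed

def ricker(n_points, a):
    """Ricker (Mexican hat) wavelet of width a, sampled at n_points points."""
    x = np.arange(n_points) - (n_points - 1) / 2.0
    return (1 - (x / a) ** 2) * np.exp(-0.5 * (x / a) ** 2)

def scalogram(signal, widths):
    """Magnitude of a simple wavelet transform: one convolution per width."""
    out = np.empty((len(widths), len(signal)))
    for k, a in enumerate(widths):
        w = ricker(min(10 * int(a), len(signal)), a)
        out[k] = np.convolve(signal, w, mode="same")
    return np.abs(out)

t = np.linspace(0, 1, 600)
trace = (np.exp(-((t - 0.30) / 0.02) ** 2) * np.sin(2 * np.pi * 40 * t)
         + np.exp(-((t - 0.65) / 0.04) ** 2) * np.sin(2 * np.pi * 15 * t))

img = scalogram(trace, widths=np.arange(1, 40))

# markers at the two strongest local maxima, then watershed on the inverted image
peaks = peak_local_max(img, min_distance=20, num_peaks=2)
markers = np.zeros(img.shape, dtype=int)
markers[tuple(peaks.T)] = np.arange(1, len(peaks) + 1)
labels = watershed(-img, markers)

for lab in range(1, labels.max() + 1):
    mask = labels == lab          # time-scale support of one separated event
    print("event %d covers %.1f%% of the time-scale plane"
          % (lab, 100 * mask.mean()))
```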