Sampling Distribution



Selected Abstracts


Geostatistical Prediction and Simulation of Point Values from Areal Data

GEOGRAPHICAL ANALYSIS, Issue 2 2005
Phaedon C. Kyriakidis
The spatial prediction and simulation of point values from areal data are addressed within the general geostatistical framework of change of support (the term support referring to the domain informed by each measurement or unknown value). It is shown that the geostatistical framework (i) can explicitly and consistently account for the support differences between the available areal data and the sought-after point predictions, (ii) yields coherent (mass-preserving or pycnophylactic) predictions, and (iii) provides a measure of reliability (standard error) associated with each prediction. In the case of stochastic simulation, alternative point-support simulated realizations of a spatial attribute reproduce (i) a point-support histogram (Gaussian in this work), (ii) a point-support semivariogram model (possibly including anisotropic nested structures), and (iii) when upscaled, the available areal data. Such point-support-simulated realizations can be used in a Monte Carlo framework to assess the uncertainty in spatially distributed model outputs operating at a fine spatial resolution because of uncertain input parameters inferred from coarser spatial resolution data. Alternatively, such simulated realizations can be used in a model-based hypothesis-testing context to approximate the sampling distribution of, say, the correlation coefficient between two spatial data sets, when one is available at a point support and the other at an areal support. A case study using synthetic data illustrates the application of the proposed methodology in a remote sensing context, whereby areal data are available on a regular pixel support. It is demonstrated that point-support (sub-pixel scale) predictions and simulated realizations can be readily obtained, and that such predictions and realizations are consistent with the available information at the coarser (pixel-level) spatial resolution. [source]
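
As a toy illustration of the hypothesis-testing use mentioned above, the sketch below approximates the sampling distribution of the correlation coefficient between an areal data set and upscaled point-support realizations. It is a minimal Monte Carlo skeleton, not the paper's area-to-point geostatistical simulator: the fields, the 4×4 pixel support, and the plain Gaussian "realizations" are hypothetical stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)

def upscale(point_field, block=4):
    """Average a 2-D point-support field over non-overlapping block x block pixels."""
    n = point_field.shape[0] // block
    return point_field[:n*block, :n*block].reshape(n, block, n, block).mean(axis=(1, 3))

# Hypothetical stand-ins: a point-support attribute z on a 32x32 grid and a second
# variable y observed on a coarser 8x8 pixel (areal) support.
z_true = rng.normal(size=(32, 32))
y_areal = 0.6 * upscale(z_true) + rng.normal(scale=0.5, size=(8, 8))

# Observed statistic: correlation between the upscaled z and the areal y.
r_obs = np.corrcoef(upscale(z_true).ravel(), y_areal.ravel())[0, 1]

# Monte Carlo approximation of the sampling distribution of the correlation
# coefficient under the null of no association; each draw stands in for one
# point-support simulated realization that would honour the areal data.
n_real = 999
r_null = np.empty(n_real)
for i in range(n_real):
    z_sim = rng.normal(size=(32, 32))   # placeholder for a geostatistical realization
    r_null[i] = np.corrcoef(upscale(z_sim).ravel(), y_areal.ravel())[0, 1]

p_value = (np.sum(np.abs(r_null) >= abs(r_obs)) + 1) / (n_real + 1)
print(f"observed r = {r_obs:.3f}, Monte Carlo p-value = {p_value:.3f}")
```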


Sampling bias and logistic models

JOURNAL OF THE ROYAL STATISTICAL SOCIETY: SERIES B (STATISTICAL METHODOLOGY), Issue 4 2008
Peter McCullagh
Summary. In a regression model, the joint distribution for each finite sample of units is determined by a function px(y) depending only on the list of covariate values x = (x(u1), ..., x(un)) on the sampled units. No random sampling of units is involved. In biological work, random sampling is frequently unavoidable, in which case the joint distribution p(y, x) depends on the sampling scheme. Regression models can be used for the study of dependence provided that the conditional distribution p(y|x) for random samples agrees with px(y) as determined by the regression model for a fixed sample having a non-random configuration x. The paper develops a model that avoids the concept of a fixed population of units, thereby forcing the sampling plan to be incorporated into the sampling distribution. For a quota sample having a predetermined covariate configuration x, the sampling distribution agrees with the standard logistic regression model with correlated components. For most natural sampling plans such as sequential or simple random sampling, the conditional distribution p(y|x) is not the same as the regression distribution unless px(y) has independent components. In this sense, most natural sampling schemes involving binary random-effects models are biased. The implications of this formulation for subject-specific and population-averaged procedures are explored. [source]
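
The contrast between subject-specific and population-averaged quantities under random sampling can be illustrated numerically. The sketch below is not McCullagh's formal construction, and the parameter values are hypothetical: it averages a random-intercept logistic model over the random-effects distribution and shows the attenuation of the marginal log-odds slope relative to the subject-specific slope.

```python
import numpy as np
from scipy.special import expit, logit

rng = np.random.default_rng(1)

# Subject-specific (conditional) logistic model with a random intercept:
#   P(Y=1 | x, u) = expit(beta0 + beta1*x + u),  u ~ N(0, sigma^2).
beta0, beta1, sigma = -0.5, 1.0, 2.0
u = rng.normal(scale=sigma, size=200_000)   # Monte Carlo draws of the random effect

xs = np.linspace(-2, 2, 9)
# Marginal success probability seen under random sampling of units,
# obtained by averaging over the random-effects distribution.
p_marginal = np.array([expit(beta0 + beta1 * x + u).mean() for x in xs])

# Slope of the marginal log-odds: attenuated relative to beta1.
marginal_slope = np.polyfit(xs, logit(p_marginal), 1)[0]
attenuation = 1 / np.sqrt(1 + 0.346 * sigma**2)   # approximation with c^2 = (16*sqrt(3)/(15*pi))^2
print(f"subject-specific slope: {beta1:.2f}")
print(f"population-averaged slope: {marginal_slope:.2f} (approx. {beta1 * attenuation:.2f} expected)")
```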


Small confidence sets for the mean of a spherically symmetric distribution

JOURNAL OF THE ROYAL STATISTICAL SOCIETY: SERIES B (STATISTICAL METHODOLOGY), Issue 3 2005
Richard Samworth
Summary. Suppose that X has a k-variate spherically symmetric distribution with mean vector θ and identity covariance matrix. We present two spherical confidence sets for θ, both centred at a positive-part Stein estimator. In the first, we obtain the radius by approximating the upper α-point of the sampling distribution of the estimator's distance from θ by the first two non-zero terms of its Taylor series about the origin. We can analyse some of the properties of this confidence set and see that it performs well in terms of coverage probability, volume and conditional behaviour. In the second method, we find the radius by using a parametric bootstrap procedure. Here, even greater improvement in terms of volume over the usual confidence set is possible, at the expense of having a less explicit radius function. A real data example is provided, and extensions to the unknown covariance matrix and elliptically symmetric cases are discussed. [source]
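
A minimal sketch of the second (parametric bootstrap) approach, under assumptions not taken from the paper: a positive-part James–Stein centre, resampling from N_k(centre, I), and the (1 − α) quantile of the recentred estimator's distance used as the radius.

```python
import numpy as np

rng = np.random.default_rng(2)

def stein_plus(x):
    """Positive-part (James-)Stein estimator for the mean of N_k(theta, I), k >= 3."""
    k = x.size
    shrink = max(0.0, 1.0 - (k - 2) / np.dot(x, x))
    return shrink * x

# Hypothetical observation: k = 8, true mean fairly close to the origin.
k, alpha = 8, 0.05
theta_true = np.full(k, 0.5)
x_obs = theta_true + rng.normal(size=k)
center = stein_plus(x_obs)

# Parametric bootstrap: resample X* ~ N_k(center, I), recompute the estimator,
# and take the (1 - alpha) quantile of its distance from the centre as the radius.
B = 5000
dists = np.empty(B)
for b in range(B):
    x_star = center + rng.normal(size=k)
    dists[b] = np.linalg.norm(stein_plus(x_star) - center)
radius = np.quantile(dists, 1 - alpha)

covered = np.linalg.norm(center - theta_true) <= radius
print(f"radius = {radius:.3f}, covers true mean: {covered}")
```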


NORTH ATLANTIC RIGHT WHALE DISTRIBUTION IN RELATION TO SEA-SURFACE TEMPERATURE IN THE SOUTHEASTERN UNITED STATES CALVING GROUNDS

MARINE MAMMAL SCIENCE, Issue 2 2006
Chérie A. Keller
Abstract Standardized aerial surveys were used to document the winter (December–March) distribution of North Atlantic right whales in their calving area off the coasts of Georgia and northeastern Florida (1991–1998). Survey data were collected within four survey zones in and adjacent to federally designated critical habitat. These data, including whale-sighting locations and sampling effort, were used to describe right whale distribution in relation to sea-surface temperature (SST) from satellite-derived images. Locations where whales were sighted (n = 609) had an overall mean SST of 14.3°C ± 2.1° (range 8°–22°C). Data from the two survey zones having sufficient data (the "early warning system" (EWS) zone and the Florida nearshore) were pooled by season and stratified by month to investigate changes in monthly ambient SST and fine-scale distribution patterns of right whales in relation to SST within spatially explicit search areas. Using Monte Carlo techniques, SSTs and latitudes (means and standard deviations) of locations where whales were sighted were compared to a sampling distribution of each variable derived from daily-search areas. Overall, results support a nonrandom distribution of right whales in relation to SST: during resident months (January and February), whales exhibited low variability in observed SST and a suggested southward shift in distribution toward warmer SSTs in the EWS zone, while in the relatively warmer and southernmost survey zone (Florida nearshore), right whales were concentrated in the northern, cooler portion. Our results suggest that warm Gulf Stream waters, generally found south and east of the delineated critical habitat, represent a thermal limit for right whales and play an important role in their distribution within the calving grounds. These results affirm the inclusion of SST in a multivariate predictive model for right whale distribution in their southeastern habitat. [source]
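
A stripped-down version of the Monte Carlo comparison could look like the sketch below; the SST values and daily search-area pools are simulated stand-ins, not the survey data.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical stand-ins: SST (deg C) at whale-sighting locations and, for each
# sighting day, the pool of SST values available within that day's searched area.
sst_sightings = rng.normal(14.3, 2.1, size=120)
search_pools = [rng.normal(15.5, 2.5, size=400) for _ in range(120)]

obs_mean = sst_sightings.mean()

# Monte Carlo sampling distribution: draw one SST at random from each daily
# search area, matching the sighting effort, and recompute the mean.
n_sims = 5000
sim_means = np.array([
    np.mean([rng.choice(pool) for pool in search_pools]) for _ in range(n_sims)
])

p_lower = np.mean(sim_means <= obs_mean)   # one-sided: whales in cooler-than-available water
print(f"observed mean SST {obs_mean:.2f} C; Monte Carlo P(mean <= observed) = {p_lower:.3f}")
```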


Portfolio Value-at-Risk with Heavy-Tailed Risk Factors

MATHEMATICAL FINANCE, Issue 3 2002
Paul Glasserman
This paper develops efficient methods for computing portfolio value-at-risk (VAR) when the underlying risk factors have a heavy-tailed distribution. In modeling heavy tails, we focus on multivariate t distributions and some extensions thereof. We develop two methods for VAR calculation that exploit a quadratic approximation to the portfolio loss, such as the delta-gamma approximation. In the first method, we derive the characteristic function of the quadratic approximation and then use numerical transform inversion to approximate the portfolio loss distribution. Because the quadratic approximation may not always yield accurate VAR estimates, we also develop a low variance Monte Carlo method. This method uses the quadratic approximation to guide the selection of an effective importance sampling distribution that samples risk factors so that large losses occur more often. Variance is further reduced by combining the importance sampling with stratified sampling. Numerical results on a variety of test portfolios indicate that large variance reductions are typically obtained. Both methods developed in this paper overcome difficulties associated with VAR calculation with heavy-tailed risk factors. The Monte Carlo method also extends to the problem of estimating the conditional excess, sometimes known as the conditional VAR. [source]
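
For orientation, the sketch below skips the paper's transform inversion and importance-sampling machinery; it only sets up a delta-gamma loss with multivariate-t risk factors and reads VAR and the conditional excess off a plain Monte Carlo sample (all parameter values are hypothetical).

```python
import numpy as np

rng = np.random.default_rng(4)

# Delta-gamma approximation to the portfolio loss:  L ~ -(delta' X + 0.5 X' Gamma X),
# with multivariate-t risk-factor changes X (heavy tails).  Values below are hypothetical.
d = 3
nu = 5                                    # degrees of freedom of the t distribution
Sigma = np.array([[1.0, 0.3, 0.1],
                  [0.3, 1.0, 0.2],
                  [0.1, 0.2, 1.0]]) * 0.01
delta = np.array([120.0, -80.0, 50.0])
Gamma = np.diag([-400.0, -250.0, -300.0])

# Plain Monte Carlo (the paper's importance-sampling scheme would instead tilt this
# sampling distribution toward the large-loss region and add stratification).
n = 200_000
Z = rng.multivariate_normal(np.zeros(d), Sigma, size=n)
W = rng.chisquare(nu, size=n)
X = Z / np.sqrt(W / nu)[:, None]          # multivariate t_nu risk-factor changes

loss = -(X @ delta + 0.5 * np.einsum('ij,jk,ik->i', X, Gamma, X))
var_99 = np.quantile(loss, 0.99)
cvar_99 = loss[loss >= var_99].mean()     # conditional excess (conditional VAR)
print(f"99% VAR = {var_99:.2f}, 99% conditional VAR = {cvar_99:.2f}")
```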


Statistical hypothesis testing in intraspecific phylogeography: nested clade phylogeographical analysis vs. approximate Bayesian computation

MOLECULAR ECOLOGY, Issue 2 2009
ALAN R. TEMPLETON
Abstract Nested clade phylogeographical analysis (NCPA) and approximate Bayesian computation (ABC) have been used to test phylogeographical hypotheses. Multilocus NCPA tests null hypotheses, whereas ABC discriminates among a finite set of alternatives. The interpretive criteria of NCPA are explicit and allow complex models to be built from simple components. The interpretive criteria of ABC are ad hoc and require the specification of a complete phylogeographical model. The conclusions from ABC are often influenced by implicit assumptions arising from the many parameters needed to specify a complex model. These complex models confound many assumptions so that biological interpretations are difficult. Sampling error is accounted for in NCPA, but ABC ignores important sources of sampling error that create pseudo-statistical power. NCPA generates the full sampling distribution of its statistics, but ABC only yields local probabilities, which in turn make it impossible to distinguish between a good-fitting model, a non-informative model, and an over-determined model. Both NCPA and ABC use approximations, but the convergences of the approximations used in NCPA are well defined whereas those in ABC are not. NCPA can analyse a large number of locations, but ABC cannot. Finally, the dimensionality of the tested hypothesis is known in NCPA, but not for ABC. As a consequence, the 'probabilities' generated by ABC are not true probabilities and are statistically non-interpretable. Accordingly, ABC should not be used for hypothesis testing, but simulation approaches are valuable when used in conjunction with NCPA or other methods that do not rely on highly parameterized models. [source]
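
For readers unfamiliar with ABC, the toy rejection sampler below shows the kind of posterior model probabilities at issue; it is a generic illustration with made-up models and summary statistics, not a phylogeographical analysis.

```python
import numpy as np

rng = np.random.default_rng(5)

# Toy ABC model choice between two hypothetical models:
# model 0 draws counts from Poisson(lam), model 1 from a negative binomial (overdispersed).
obs = rng.poisson(4.0, size=50)                        # stand-in "observed" data
s_obs = np.array([obs.mean(), obs.var()])              # summary statistics

def simulate(model, rng):
    lam = rng.uniform(1, 10)                           # prior on the rate
    if model == 0:
        x = rng.poisson(lam, size=50)
    else:
        x = rng.negative_binomial(n=2, p=2 / (2 + lam), size=50)   # mean lam, extra variance
    return np.array([x.mean(), x.var()])

# ABC rejection: keep the model label whenever simulated summaries land near s_obs.
n_sims, eps = 100_000, 1.0
kept = []
for _ in range(n_sims):
    m = rng.integers(2)                                # equal prior model probabilities
    if np.linalg.norm(simulate(m, rng) - s_obs) < eps:
        kept.append(m)

kept = np.array(kept)
print(f"accepted: {kept.size}, ABC posterior P(model 0) = {np.mean(kept == 0):.2f}")
```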


Lower confidence limits for process capability indices Cp and Cpk when data are autocorrelated

QUALITY AND RELIABILITY ENGINEERING INTERNATIONAL, Issue 6 2009
Cynthia R. Lovelace
Abstract Many organizations use a single estimate of Cp and/or Cpk for process benchmarking, without considering the sampling variability of the estimators and how that impacts the probability of meeting minimum index requirements. Lower confidence limits have previously been determined for the Cp and Cpk indices under the standard assumption of independent data, which are based on the sampling distributions of the index estimators. In this paper, lower 100(1−α)% confidence limits for Cp and Cpk were developed for autocorrelated processes. Simulation was used to generate the empirical sampling distribution of each estimator for various combinations of sample size (n), autoregressive parameter (φ), true index value (Cp or Cpk), and confidence level. In addition, the minimum values of the estimators required in order to meet quality requirements with 100(1−α)% certainty were also determined from these empirical sampling distributions. These tables may be used by practitioners to set minimum capability requirements for index estimators, rather than true values, for the autocorrelated case. The implications of these results for practitioners will be discussed. Copyright © 2008 John Wiley & Sons, Ltd. [source]
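
A simplified version of the simulation step might look like the following sketch, which assumes an AR(1) process with unit stationary variance and centred specification limits; the notation (φ for the autoregressive parameter) and the settings are illustrative, not the paper's.

```python
import numpy as np

rng = np.random.default_rng(6)

def simulate_cp_hat(n, phi, cp_true, n_sims, rng):
    """Empirical sampling distribution of the Cp estimator for AR(1) data.

    Data are AR(1) with parameter phi and stationary variance 1, so the true
    Cp fixes the spec width at USL - LSL = 6 * cp_true.
    """
    spec_width = 6.0 * cp_true
    cp_hats = np.empty(n_sims)
    for i in range(n_sims):
        e = rng.normal(scale=np.sqrt(1 - phi**2), size=n)   # innovations for unit variance
        x = np.empty(n)
        x[0] = rng.normal()
        for t in range(1, n):
            x[t] = phi * x[t - 1] + e[t]
        cp_hats[i] = spec_width / (6.0 * x.std(ddof=1))
    return cp_hats

# Empirical sampling distribution for one hypothetical combination of settings.
n, phi, cp_true, alpha = 50, 0.5, 1.33, 0.05
cp_hats = simulate_cp_hat(n, phi, cp_true, 10_000, rng)

# Minimum estimate needed to claim Cp >= cp_true with 100(1 - alpha)% certainty:
# the upper-alpha point of the estimator's sampling distribution when Cp = cp_true.
min_required = np.quantile(cp_hats, 1 - alpha)
print(f"AR(1) phi={phi}: require Cp-hat >= {min_required:.2f} to claim Cp >= {cp_true} at 95%")
```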


Improving robust model selection tests for dynamic models

THE ECONOMETRICS JOURNAL, Issue 2 2010
Hwan-sik Choi
Summary. We propose an improved model selection test for dynamic models using a new asymptotic approximation to the sampling distribution of a new test statistic. The model selection test is applicable to dynamic models with very general selection criteria and estimation methods. Since our test statistic does not assume the exact form of a true model, the test is essentially non-parametric once competing models are estimated. For the unknown serial correlation in data, we use a heteroscedasticity/autocorrelation-consistent (HAC) variance estimator, and the sampling distribution of the test statistic is approximated by the fixed-b asymptotic approximation. The asymptotic approximation depends on the kernel functions and bandwidth parameters used in HAC estimators. We compare the finite-sample performance of the new test with the bootstrap methods as well as with the standard normal approximation, and show that the fixed-b asymptotics and the bootstrap methods are markedly superior to the standard normal approximation for a moderate sample size for time series data. An empirical application to foreign exchange rate forecasting models is presented, and the results show that the normal approximation to the distribution of the test statistic appears to overstate the data's ability to distinguish between two competing models. [source]
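
To make the ingredients concrete, the sketch below computes a Bartlett-kernel (Newey–West) HAC-studentized mean of a serially correlated loss differential with a fixed bandwidth ratio b = M/T. It is a generic stand-in, not the authors' statistic; under fixed-b asymptotics the result would be compared with a nonstandard reference distribution rather than N(0, 1).

```python
import numpy as np

rng = np.random.default_rng(7)

def hac_variance(u, bandwidth):
    """Bartlett-kernel (Newey-West) HAC estimate of the long-run variance of u."""
    u = u - u.mean()
    T = u.size
    lrv = np.dot(u, u) / T
    for j in range(1, bandwidth + 1):
        w = 1.0 - j / (bandwidth + 1)                  # Bartlett kernel weight
        gamma_j = np.dot(u[j:], u[:-j]) / T
        lrv += 2.0 * w * gamma_j
    return lrv

# Hypothetical loss differential between two competing forecasting models,
# serially correlated (AR(1)) as is typical for overlapping forecast errors.
T, phi = 200, 0.6
e = rng.normal(size=T)
d = np.empty(T)
d[0] = e[0]
for t in range(1, T):
    d[t] = phi * d[t - 1] + e[t]                       # mean zero: the models fit equally well

# HAC-studentized model selection statistic with the bandwidth ratio held fixed.
M = int(0.3 * T)                                       # b = M/T = 0.3
t_stat = np.sqrt(T) * d.mean() / np.sqrt(hac_variance(d, M))
print(f"HAC t-statistic with b = M/T = {M/T:.1f}: {t_stat:.3f}")
```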


Boreal winter predictions with the GEOS-2 GCM: The role of boundary forcing and initial conditions

THE QUARTERLY JOURNAL OF THE ROYAL METEOROLOGICAL SOCIETY, Issue 567 2000
Yehui Chang
Abstract Ensembles of atmospheric General Circulation Model (GCM) seasonal forecasts and long-term simulations are analysed to assess the controlling influences of boundary forcing and memory of the initial conditions. Both the forecasts and simulations are carried out with version 2 of the Goddard Earth Observing System (GEOS-2) GCM forced with observed sea surface temperatures (SSTs). While much of the focus is on the seasonal time-scale (January–March; 1981–95) and the Pacific North American (PNA) region, we also present results for other regions, shorter time-scales, and other known modes of variability in the northern hemisphere extratropics. Forecasts of indices of some of the key large-scale modes of variability show that there is considerable variability in skill between different regions of the northern hemisphere. The eastern North Atlantic region has the poorest long-lead forecast skill, showing no skill beyond about 10 days. Skilful seasonal forecasts are primarily confined to the wave-like El Niño Southern Oscillation (ENSO) response emanating from the tropical Pacific. In the northern hemisphere, this is similar to the well-known PNA pattern. Memory of the initial conditions is the major factor leading to skilful extratropical forecasts at lead times of less than one month, while boundary forcing is the dominant factor at the seasonal time-scale. Boundary forcing contributes to skilful forecasts at sub-seasonal time-scales only over the PNA region. The GEOS-2 GCM produces average signal-to-noise ratios which are less than 1.0 everywhere in the extra-tropics, except for the subtropical Pacific where they approach 1.5. An assessment of the sampling distribution of the forecasts suggests the model's ENSO response is very likely too weak. These results show some sensitivity to the uncertainties in the estimates of the SST forcing fields. In the North Pacific region, the sensitivity to SST forcing manifests itself primarily as changes in the variability of the PNA response, underscoring the need for an ensemble approach to the seasonal-prediction problem. [source]
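
A back-of-the-envelope version of the signal-to-noise and sampling-uncertainty calculations, using a synthetic ensemble rather than GEOS-2 output, might look like this:

```python
import numpy as np

rng = np.random.default_rng(8)

# Hypothetical ensemble of seasonal-mean anomalies: 15 winters x 9 members,
# with a weak boundary-forced (SST-driven) signal plus internal atmospheric noise.
n_years, n_members = 15, 9
signal = rng.normal(scale=0.6, size=(n_years, 1))        # common to all members in a year
noise = rng.normal(scale=1.0, size=(n_years, n_members))
anom = signal + noise

# Signal-to-noise ratio: variance of the forced (ensemble-mean) component over the
# variance of deviations about the ensemble mean, correcting for the noise that
# leaks into a finite-member ensemble mean.
ens_mean = anom.mean(axis=1)
noise_var = anom.var(axis=1, ddof=1).mean()
signal_var = max(ens_mean.var(ddof=1) - noise_var / n_members, 0.0)
print(f"signal-to-noise ratio: {np.sqrt(signal_var / noise_var):.2f}")

# Bootstrap sampling distribution of the ratio (resampling years) to judge whether
# an apparently weak response could be a sampling artefact.
boot = []
for _ in range(2000):
    yrs = rng.integers(n_years, size=n_years)
    em = anom[yrs].mean(axis=1)
    nv = anom[yrs].var(axis=1, ddof=1).mean()
    sv = max(em.var(ddof=1) - nv / n_members, 0.0)
    boot.append(np.sqrt(sv / nv))
print(f"bootstrap 90% interval: {np.percentile(boot, [5, 95]).round(2)}")
```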


Statistical Inference for Familial Disease Clusters

BIOMETRICS, Issue 3 2002
Chang Yu
Summary. In many epidemiologic studies, the first indication of an environmental or genetic contribution to the disease is the way in which the diseased cases cluster within the same family units. The concept of clustering is contrasted with incidence. We assume that all individuals are exchangeable except for their disease status. This assumption is used to provide an exact test of the initial hypothesis of no familial link with the disease, conditional on the number of diseased cases and the distribution of the sizes of the various family units. New parametric generalizations of binomial sampling models are described to provide measures of the effect size of the disease clustering. We consider models and an example that takes covariates into account. Ascertainment bias is described and the appropriate sampling distribution is demonstrated. Four numerical examples with real data illustrate these methods. [source]
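
The conditional (exchangeability-based) test can be sketched as a Monte Carlo version of the exact test: hold the family sizes and the total number of cases fixed and scatter the cases at random. The family sizes, case counts, and clustering statistic below are hypothetical choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(9)

# Hypothetical family units (sizes) and observed disease counts per family.
family_sizes = np.array([4, 3, 5, 2, 6, 4, 3, 5, 4, 4])
cases_per_family = np.array([3, 0, 0, 0, 1, 0, 0, 2, 0, 0])   # looks clustered
n_cases = cases_per_family.sum()

def within_family_pairs(counts):
    """Clustering statistic: number of case pairs falling in the same family."""
    return int(sum(c * (c - 1) // 2 for c in counts))

obs_stat = within_family_pairs(cases_per_family)

# Exchangeability: condition on the family sizes and the total number of cases,
# and place the cases on individuals at random (Monte Carlo version of the exact test).
labels = np.repeat(np.arange(family_sizes.size), family_sizes)   # family id per individual
n_sims = 20_000
null_stats = np.empty(n_sims, dtype=int)
for i in range(n_sims):
    hit = rng.choice(labels, size=n_cases, replace=False)        # which individuals are cases
    counts = np.bincount(hit, minlength=family_sizes.size)
    null_stats[i] = within_family_pairs(counts)

p_value = (np.sum(null_stats >= obs_stat) + 1) / (n_sims + 1)
print(f"observed within-family case pairs: {obs_stat}, test p-value ~ {p_value:.4f}")
```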


Effectiveness of Conservation Targets in Capturing Genetic Diversity

CONSERVATION BIOLOGY, Issue 1 2003
Maile C. Neel
Any conservation action that preserves some populations and not others will have genetic consequences. We used empirical data from four rare plant taxa to assess these consequences in terms of how well allele numbers (all alleles and alleles occurring at a frequency >0.05 in any population) and expected heterozygosity are represented when different numbers of populations are conserved. We determined sampling distributions for these three measures of genetic diversity using Monte Carlo methods. We assessed the proportion of alleles included in the number of populations considered adequate for conservation, needed to capture all alleles, and needed to meet an accepted standard of genetic-diversity conservation of having a 90–95% probability of including all common alleles. We also assessed the number of populations necessary to obtain values of heterozygosity within ±10% of the value obtained from all populations. Numbers of alleles were strongly affected by the number of populations sampled. Heterozygosity was only slightly less sensitive to numbers of populations than were alleles. On average, currently advocated conservation intensities represented 67–83% of all alleles and 85–93% of common alleles. The smallest number of populations to include all alleles ranged from 6 to 17 (42–57%), but <0.2% of 1000 samples of these numbers of populations included them all. It was necessary to conserve 16–29 (53–93%) of the sampled populations to meet the standard for common alleles. Between 20% and 64% of populations were needed to reliably represent species-level heterozygosity. Thus, higher percentages of populations are needed than are currently considered adequate to conserve genetic diversity if populations are selected without genetic data. [source]
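
The Monte Carlo sampling distributions used here can be mimicked with a small sketch: sample k of N populations at random and record the fraction of alleles captured. The presence matrix below is simulated, not the study's data.

```python
import numpy as np

rng = np.random.default_rng(10)

# Hypothetical allele-by-population presence matrix for one taxon:
# 25 populations x 40 alleles, with some alleles rare (present in few populations).
n_pops, n_alleles = 25, 40
presence = rng.random((n_pops, n_alleles)) < rng.uniform(0.05, 0.9, size=n_alleles)
present_anywhere = presence.any(axis=0)          # restrict to alleles actually observed

def capture_fraction(pop_idx):
    """Fraction of observed alleles present in at least one conserved population."""
    captured = presence[pop_idx].any(axis=0)
    return captured[present_anywhere].mean()

# Monte Carlo sampling distribution of allele capture when k populations are
# conserved without genetic information.
for k in (5, 10, 15):
    fracs = np.array([
        capture_fraction(rng.choice(n_pops, size=k, replace=False)) for _ in range(2000)
    ])
    p_all = np.mean(fracs == 1.0)
    print(f"k={k:2d}: mean capture {fracs.mean():.2f}, P(all alleles) = {p_all:.3f}")
```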

