Simple Random Sampling



Selected Abstracts


Bootstrap Methods for Markov Processes

ECONOMETRICA, Issue 4 2003
Joel L. Horowitz
The block bootstrap is the best known bootstrap method for time-series data when the analyst does not have a parametric model that reduces the data generation process to simple random sampling. However, the errors made by the block bootstrap converge to zero only slightly faster than those made by first-order asymptotic approximations. This paper describes a bootstrap procedure for data that are generated by a Markov process or a process that can be approximated by a Markov process with sufficient accuracy. The procedure is based on estimating the Markov transition density nonparametrically. Bootstrap samples are obtained by sampling the process implied by the estimated transition density. Conditions are given under which the errors made by the Markov bootstrap converge to zero more rapidly than those made by the block bootstrap. [source]
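The resampling step can be sketched with a kernel-weighted "local" bootstrap, in which each bootstrap successor is drawn from the observed transitions whose starting states lie near the current bootstrap state. This is a minimal illustration of the idea, not Horowitz's exact procedure; the Gaussian kernel and bandwidth `h` are illustrative choices.

```python
import math
import random

def markov_bootstrap(series, n_boot, h=0.5, seed=0):
    """Local (Markov) bootstrap sketch: resample a path by drawing each
    successor from observed transitions whose starting state is close to
    the current bootstrap state, using Gaussian kernel weights."""
    rng = random.Random(seed)
    starts = series[:-1]   # observed states x_t
    succ = series[1:]      # their observed successors x_{t+1}
    path = [rng.choice(series)]
    for _ in range(n_boot - 1):
        x = path[-1]
        # weight each observed transition by its kernel distance to x
        w = [math.exp(-0.5 * ((x - s) / h) ** 2) for s in starts]
        path.append(rng.choices(succ, weights=w, k=1)[0])
    return path
```

Bootstrap statistics (means, autocorrelations, and so on) are then computed on many such resampled paths.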


Ratio estimators in adaptive cluster sampling

ENVIRONMETRICS, Issue 6 2007
Arthur L. Dryver
Abstract In most surveys, data are collected on many items rather than just the one variable of primary interest, and making the most of the information collected is an issue of both practical and theoretical interest. Ratio estimators of the population mean or total, which exploit an auxiliary variable, are often more efficient than estimators that ignore it. Ratio estimation is straightforward under simple random sampling, but often not under more complicated sampling designs, such as adaptive cluster sampling. A serious concern with ratio estimators under many complicated designs is the lack of independence that the usual theory assumes. In this article, we propose two new ratio estimators under adaptive cluster sampling, one of which is unbiased for adaptive cluster sampling designs. We compare the efficiency of the new estimators with that of existing unbiased estimators for adaptive cluster sampling, which do not utilize the auxiliary information, and with conventional ratio estimation under simple random sampling without replacement. The results show that the proposed estimators can be considered a robust alternative to the conventional ratio estimator, especially when the correlation between the variable of interest and the auxiliary variable is not high enough for the conventional ratio estimator to perform satisfactorily. Copyright © 2007 John Wiley & Sons, Ltd. [source]
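Under SRS, the conventional ratio estimator against which the new estimators are compared takes the familiar form: the sample ratio ȳ/x̄ scaled by the known population mean of the auxiliary variable. A minimal sketch (the function name and interface are our own):

```python
def ratio_estimate_mean(y_sample, x_sample, x_pop_mean):
    """Conventional ratio estimator of the population mean of y under
    simple random sampling, given a known population mean of the
    auxiliary variable x."""
    r = sum(y_sample) / sum(x_sample)  # sample ratio y-bar / x-bar
    return r * x_pop_mean
```

When y and x are strongly and positively correlated through the origin, this estimator is typically much more precise than the plain sample mean of y.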


Parametric estimation for the location parameter for symmetric distributions using moving extremes ranked set sampling with application to trees data

ENVIRONMETRICS, Issue 7 2003
Mohammad Fraiwan Al-Saleh
Abstract A modification of ranked set sampling (RSS) called moving extremes ranked set sampling (MERSS) is considered parametrically for the location parameter of symmetric distributions. A maximum likelihood estimator (MLE) and a modified MLE are considered and their properties are studied. Their efficiency with respect to the corresponding estimators based on simple random sampling (SRS) is compared for the case of the normal distribution. The method is studied under both perfect and imperfect ranking (with errors in ranking). It appears that these estimators can be real competitors to the MLE based on SRS. The procedure is illustrated using tree data. Copyright © 2003 John Wiley & Sons, Ltd. [source]
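MERSS modifies classical ranked set sampling; the classical scheme it builds on can be sketched as follows, assuming perfect ranking (units are ranked by their true values, which is exactly the "perfect ranking" case the abstract contrasts with ranking error):

```python
import random

def ranked_set_sample(population, set_size, cycles, seed=0):
    """Balanced ranked set sampling sketch: in each cycle, draw
    set_size sets of set_size units, rank each set, and measure only
    the i-th order statistic of the i-th set."""
    rng = random.Random(seed)
    measured = []
    for _ in range(cycles):
        for i in range(set_size):
            s = sorted(rng.sample(population, set_size))  # perfect ranking
            measured.append(s[i])  # measure the i-th ranked unit only
    return measured
```

Each cycle yields set_size measured values whose mean is typically less variable than an SRS mean of the same size, because the measured units are spread across the order statistics.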


Online end-to-end quality of service monitoring for service level agreement management

INTERNATIONAL JOURNAL OF COMMUNICATION SYSTEMS, Issue 4 2008
Xiaoyuan Ta
Abstract A major challenge in network and service level agreement (SLA) management is to provide the Quality of Service (QoS) demanded by heterogeneous network applications. Online QoS monitoring plays an important role in this process by providing objective measurements that can be used to improve network design, troubleshooting and management. Online QoS monitoring becomes increasingly difficult and complex with the rapid expansion of the Internet and the dramatic increase in network speeds. Sampling techniques have been explored as a means to reduce the difficulty and complexity of measurement. In this paper, we investigate several major sampling techniques, i.e. systematic sampling, simple random sampling and stratified sampling, and conduct a performance analysis of each. It is shown that stratified sampling with optimum allocation has the best performance; however, it requires additional statistics usually not available for real-time applications. An adaptive stratified sampling algorithm is proposed to solve this problem. Both theoretical analysis and simulation show that the proposed adaptive stratified sampling algorithm outperforms the other sampling techniques and achieves performance comparable to stratified sampling with optimum allocation. QoS monitoring software using the aforementioned sampling techniques has been designed and tested in various real networks. Copyright © 2007 John Wiley & Sons, Ltd. [source]
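Optimum (Neyman) allocation assigns stratum h a sample share proportional to N_h·S_h, which is precisely why it needs the stratum standard deviations S_h in advance; those are the statistics the adaptive algorithm must estimate online. A sketch of the standard allocation formula:

```python
def optimum_allocation(strata_sizes, strata_sds, n):
    """Neyman (optimum) allocation sketch: the sample size for stratum h
    is proportional to N_h * S_h, i.e. big or highly variable strata
    get more of the total sample n."""
    weights = [N * s for N, s in zip(strata_sizes, strata_sds)]
    total = sum(weights)
    return [round(n * w / total) for w in weights]
```

An adaptive scheme would replace the fixed `strata_sds` with running estimates updated as packets are observed.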


Selective sampling for approximate clustering of very large data sets

INTERNATIONAL JOURNAL OF INTELLIGENT SYSTEMS, Issue 3 2008
Liang Wang
A key challenge in pattern recognition is how to scale the computational efficiency of clustering algorithms on large data sets. The extension of non-Euclidean relational fuzzy c-means (NERF) clustering to very large (VL = unloadable) relational data is called the extended NERF (eNERF) clustering algorithm, which comprises four phases: (i) finding distinguished features that monitor progressive sampling; (ii) progressively sampling from an N × N relational matrix RN to obtain an n × n sample matrix Rn; (iii) clustering Rn with literal NERF; and (iv) extending the clusters in Rn to the remainder of the relational data. Previously published examples on several fairly small data sets suggest that eNERF is feasible for truly large data sets. However, it seems that phases (i) and (ii), i.e., finding Rn, are not very practical because the sample size n often turns out to be roughly 50% of N, and this over-sampling defeats the whole purpose of eNERF. In this paper, we examine the performance of the sampling scheme of eNERF with respect to different parameters. We propose a modified sampling scheme for use with eNERF that combines simple random sampling with (parts of) the sampling procedures used by eNERF and a related algorithm sVAT (scalable visual assessment of clustering tendency). We demonstrate that our modified sampling scheme can eliminate the over-sampling of the original progressive sampling scheme, thus enabling the processing of truly VL data. Numerical experiments on a distance matrix of a set of 3,000,000 vectors drawn from a mixture of 5 bivariate normal distributions demonstrate the feasibility and effectiveness of the proposed sampling method. We also find that actually running eNERF on a data set of this size is very costly in terms of computation time. Thus, our results demonstrate that further modification of eNERF, especially the extension stage, will be needed before it is truly practical for VL data. © 2008 Wiley Periodicals, Inc. [source]
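The goal of phase (ii) is simply an n × n sample Rn of the full N × N matrix RN. The simple random sampling ingredient that the modified scheme mixes in amounts to choosing n objects at random and keeping their pairwise dissimilarities (a sketch; eNERF's progressive scheme adds stopping and acceptance criteria on top of this):

```python
import random

def sample_relational(R, n, seed=0):
    """Simple random sampling of a relational (dissimilarity) matrix:
    choose n of the N objects and keep the n x n submatrix of their
    pairwise dissimilarities -- the role Rn plays in eNERF."""
    rng = random.Random(seed)
    idx = sorted(rng.sample(range(len(R)), n))
    return [[R[i][j] for j in idx] for i in idx]
```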


Backbone Diversity Analysis in Catalyst Design

ADVANCED SYNTHESIS & CATALYSIS (PREVIOUSLY: JOURNAL FUER PRAKTISCHE CHEMIE), Issue 3 2009

Abstract We present a computer-based heuristic framework for designing libraries of homogeneous catalysts. In this approach, a set of given bidentate ligand-metal complexes is disassembled into key substructures ("building blocks"). These include metal atoms, ligating groups, backbone groups, and residue groups. The computer then rearranges these building blocks into a new library of virtual catalysts. We then tackle the practical problem of choosing a diverse subset of catalysts from this library for actual synthesis and testing. This is not trivial, since 'catalyst diversity' itself is a vague concept. Thus, we first define and quantify this diversity as the difference between key structural parameters (descriptors) of the catalysts, for the specific reaction at hand. Subsequently, we propose a method for choosing diverse sets of catalysts based on catalyst backbone selection, using weighted D-optimal design. The computer selects catalysts with different backbones, where the difference is measured as a distance in descriptor space. We show that choosing such a D-optimal subset of backbones gives more diversity than simple random sampling. The results are demonstrated experimentally in the nickel-catalysed hydrocyanation of 3-pentenenitrile to adiponitrile. Finally, the connection between backbone diversity and catalyst diversity, and the implications for in silico catalysis design, are discussed. [source]
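Weighted D-optimal design maximizes a determinant criterion over the descriptor matrix of the chosen subset. A simpler greedy max-min rule conveys the same intent of spreading the chosen backbones out in descriptor space; this stand-in is ours, not the authors' method:

```python
import math

def diverse_subset(descriptors, k):
    """Greedy max-min selection sketch: repeatedly pick the candidate
    whose nearest already-chosen point (Euclidean distance in
    descriptor space) is farthest away."""
    # seed with the point farthest from the centroid
    centroid = [sum(c) / len(descriptors) for c in zip(*descriptors)]
    chosen = [max(range(len(descriptors)),
                  key=lambda i: math.dist(descriptors[i], centroid))]
    while len(chosen) < k:
        chosen.append(max(
            (i for i in range(len(descriptors)) if i not in chosen),
            key=lambda i: min(math.dist(descriptors[i], descriptors[j])
                              for j in chosen),
        ))
    return sorted(chosen)
```

Like D-optimal selection, this avoids the clumping that simple random sampling produces when descriptors are correlated.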


Epidemiological study of symptomatic gastroesophageal reflux disease in China: Beijing and Shanghai

JOURNAL OF DIGESTIVE DISEASES, Issue 1 2000
Pan Guozong
OBJECTIVE: To explore the 1-year point prevalences (July–September 1996) of symptomatic gastroesophageal reflux (GER), gastroesophageal reflux disease (GERD) and reflux esophagitis (RE) in the adult population of two Chinese city-regions (Beijing and Shanghai) and to identify the conditions that predispose patients to GERD. METHODS: Phase I: 5000 residents of the two regions aged between 18 and 70 years were studied via a questionnaire. The study was carried out by cluster sampling from city, suburban and rural areas, using simple random sampling within clusters. Symptom scores (Sc) of the intensity and frequency of heartburn, acid reflux and regurgitation within 1 year of the time of study were taken as indices of acid reflux (highest score, Sc = 18), and Sc ≥ 6 indicated the presence of symptomatic GER. Phase II: a small number of patients who were identified as having symptomatic GER in the survey were enrolled in a case–control study using gastroscopy and 24-h pH monitoring to obtain correct diagnostic rates of GERD and RE. Estimates of the prevalence of GERD and RE were then adjusted according to the rates of correct diagnosis. RESULTS: A total of 4992 subjects completed the survey; 2.5% had heartburn once daily, 8.97% had symptomatic GER (Sc ≥ 6) and the male to female ratio was 1:1.11. Point prevalences for the year for GERD and RE were 5.77 and 1.92%, respectively. Stratified analysis indicated that the prevalence of symptomatic GER in Beijing (10.19%) was higher than that in Shanghai (7.76%), and there was also a higher prevalence of GER in males, manual laborers, people from rural areas and people older than 40 years of age in Beijing as compared with Shanghai. Stepwise logistic analysis indicated that GER had a close relationship with dental and pharyngolaryngeal disorders and respiratory diseases.
The conditions that predispose patients to GERD are (OR, odds ratio): age > 40 (OR = 1.01), eating greasy/oily food (OR = 6.56), overeating (OR = 1.99), tiredness (OR = 2.35), emotional stress (OR = 2.22), pregnancy (OR = 6.80) and constipation (OR = 1.65). CONCLUSIONS: Gastroesophageal reflux disease is a common disease in the adult Chinese population and it is more common in Beijing than in Shanghai. [source]


Sampling bias and logistic models

JOURNAL OF THE ROYAL STATISTICAL SOCIETY: SERIES B (STATISTICAL METHODOLOGY), Issue 4 2008
Peter McCullagh
Summary. In a regression model, the joint distribution for each finite sample of units is determined by a function p_x(y) depending only on the list of covariate values x = (x(u_1), ..., x(u_n)) on the sampled units. No random sampling of units is involved. In biological work, random sampling is frequently unavoidable, in which case the joint distribution p(y, x) depends on the sampling scheme. Regression models can be used for the study of dependence provided that the conditional distribution p(y | x) for random samples agrees with p_x(y) as determined by the regression model for a fixed sample having a non-random configuration x. The paper develops a model that avoids the concept of a fixed population of units, thereby forcing the sampling plan to be incorporated into the sampling distribution. For a quota sample having a predetermined covariate configuration x, the sampling distribution agrees with the standard logistic regression model with correlated components. For most natural sampling plans such as sequential or simple random sampling, the conditional distribution p(y | x) is not the same as the regression distribution unless p_x(y) has independent components. In this sense, most natural sampling schemes involving binary random-effects models are biased. The implications of this formulation for subject-specific and population-averaged procedures are explored. [source]


Modeling multiple-response categorical data from complex surveys

THE CANADIAN JOURNAL OF STATISTICS, Issue 4 2009
Christopher R. Bilder
Abstract Although "choose all that apply" questions are common in modern surveys, methods for analyzing associations among responses to such questions have only recently been developed. These methods are generally valid only for simple random sampling, but these types of questions often appear in surveys conducted under more complex sampling plans. The purpose of this article is to provide statistical analysis methods that can be applied to "choose all that apply" questions in complex survey sampling situations. Loglinear models are developed to incorporate the multiple responses inherent in these types of questions. Statistics to compare models and to measure association are proposed and their asymptotic distributions are derived. Monte Carlo simulations show that tests based on adjusted Pearson statistics generally hold their correct size when comparing models. These simulations also show that confidence intervals for odds ratios estimated from loglinear models have good coverage properties, while being shorter than those constructed using empirical estimates. Furthermore, the methods are shown to be applicable to more general problems of modeling associations between elements of two or more binary vectors. The proposed analysis methods are applied to data from the National Health and Nutrition Examination Survey. The Canadian Journal of Statistics © 2009 Statistical Society of Canada [source]


Estimating population size and habitat associations of two federally endangered mussels in the St. Croix River, Minnesota and Wisconsin, USA,

AQUATIC CONSERVATION: MARINE AND FRESHWATER ECOSYSTEMS, Issue 3 2010
Daniel J. Hornbach
Abstract 1. North America is a globally important centre of freshwater mussel biodiversity. Accurate population estimates and descriptions of critical habitat for endangered species of mussels are needed but are hindered by their patchy distribution and the dynamic nature of their habitat. Adaptive cluster sampling (ACS) was used to estimate population size and habitat associations of two federally endangered species, Higgins eye (Lampsilis higginsii) and winged mapleleaf (Quadrula fragosa), in the St. Croix River. 2. This river holds the largest known winged mapleleaf population in the upper Mississippi River and contains Essential Habitat Areas for Higgins eye. Winged mapleleaf density ranged from 0.008 to 0.020 individuals m-2 (coefficient of variation = 50-66%), yielding an estimate of 13 000 winged mapleleaf in this reach of the river. Higgins eye density varied from 0.008 to 0.015 individuals m-2 (coefficient of variation = 66-167%), giving an estimate of 14 400 individuals in this area. 3. Higgins eye and winged mapleleaf were associated with areas of the overall highest mussel density and species richness, suggesting these endangered species occur in 'premier' mussel habitat. There were no differences in many microhabitat factors for sites with and without either endangered species. Select hydraulic measures (such as shear velocity and shear stress) showed significant differences in areas with and without the winged mapleleaf but not for Higgins eye. Areas that are less depositional support dense and diverse mussel assemblages that include both endangered species, with winged mapleleaf having a narrower habitat range than Higgins eye. 4. This study suggests that ACS can provide statistically robust estimates of density with 2-3 times more efficiency than simple random sampling. ACS, however, was quite time consuming.
This work confirms earlier findings that larger-scale hydraulic parameters may be better predictors of prime mussel habitat than fine-scale microhabitat factors. Using hydraulic measures may allow improved identification of potentially critical mussel habitat. Copyright © 2009 John Wiley & Sons, Ltd. [source]
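The ACS design itself is easy to sketch on a grid of quadrats: an initial simple random sample of cells is expanded through neighbouring cells wherever the observed count meets a condition, which is what concentrates effort in the dense patches where these mussels occur. The threshold, 4-neighbourhood, and grid below are illustrative choices, not the authors' field protocol:

```python
import random

def adaptive_cluster_sample(counts, n_initial, threshold=1, seed=0):
    """Adaptive cluster sampling sketch on a grid: start with a simple
    random sample of cells; whenever a sampled cell's count meets the
    threshold, add its 4-neighbours, repeating until no cell qualifies."""
    rng = random.Random(seed)
    rows, cols = len(counts), len(counts[0])
    all_cells = [(r, c) for r in range(rows) for c in range(cols)]
    cells = set(rng.sample(all_cells, n_initial))
    frontier = set(cells)
    while frontier:
        new = set()
        for (r, c) in frontier:
            if counts[r][c] >= threshold:  # condition triggers expansion
                for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    nr, nc = r + dr, c + dc
                    if 0 <= nr < rows and 0 <= nc < cols and (nr, nc) not in cells:
                        new.add((nr, nc))
        cells |= new
        frontier = new
    return cells
```

Unbiased estimation then requires design-based estimators (e.g. Horvitz-Thompson type) that account for the unequal inclusion probabilities the expansion creates.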


ACCOUNTING FOR NON-COMPLIANCE IN THE ANALYSIS OF RANDOMIZED RESPONSE DATA

AUSTRALIAN & NEW ZEALAND JOURNAL OF STATISTICS, Issue 3 2009
Ardo Van Den Hout
Summary The randomized response model is a misclassification design used to protect the privacy of respondents with respect to sensitive questions. Conditional misclassification probabilities are specified by the researcher and are therefore considered known. Some respondents can be expected not to comply with the misclassification design; these respondents induce extra perturbation that is not accounted for in the standard randomized response model. An extension of the randomized response model is presented that takes into account assumptions about non-compliance under simple random sampling. The extended model is investigated using Bayesian inference. The research is motivated by randomized response data concerning violations of social benefit regulations. [source]
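The standard model being extended is easiest to see in Warner's original design: each respondent answers the sensitive question truthfully with probability p and answers its negation otherwise, so the observed "yes" rate can be unscrambled as below. This is a sketch of the baseline model only, without the paper's non-compliance terms:

```python
def warner_estimate(yes_fraction, p):
    """Warner's randomized-response estimator sketch.  With
    P(yes) = p*pi + (1 - p)*(1 - pi), the sensitive prevalence pi is
    recovered as (yes_fraction - (1 - p)) / (2p - 1)."""
    if p == 0.5:
        raise ValueError("p = 0.5 makes the prevalence unidentifiable")
    return (yes_fraction - (1 - p)) / (2 * p - 1)
```

Non-compliance (e.g. respondents who always answer "no") biases `yes_fraction`, which is exactly the perturbation the extended model accounts for.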


Ranked Set Sampling: Cost and Optimal Set Size

BIOMETRICS, Issue 4 2002
Ramzi W. Nahhas
Summary. McIntyre (1952, Australian Journal of Agricultural Research 3, 385-390) introduced ranked set sampling (RSS) as a method for improving estimation of a population mean in settings where sampling and ranking of units from the population are inexpensive when compared with actual measurement of the units. Two of the major factors in the usefulness of RSS are the set size and the relative costs of the various operations of sampling, ranking, and measurement. In this article, we consider ranking error models and cost models that enable us to assess the effect of different cost structures on the optimal set size for RSS. For reasonable cost structures, we find that the optimal RSS set sizes are generally larger than had been anticipated previously. These results will provide a useful tool for determining whether RSS is likely to lead to an improvement over simple random sampling in a given setting and, if so, what RSS set size is best to use in this case. [source]
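The trade-off such cost models capture can be sketched numerically: a larger set size lowers the variance per measured unit, but each measurement then requires more sampling and ranking, so a fixed budget supports fewer measurements. Everything below, the cost structure and the illustrative variance model passed in by the caller, is our own simplification, not the authors' model:

```python
def optimal_set_size(budget, c_sample, c_rank, c_measure, var_of_mean, max_m=10):
    """Sketch of choosing the RSS set size m that minimizes the variance
    of the sample mean achievable within a fixed budget.  One measured
    unit with set size m costs m*c_sample + m*c_rank + c_measure;
    var_of_mean(m, n) is a caller-supplied variance model."""
    best = None
    for m in range(1, max_m + 1):
        cost_per_unit = m * c_sample + m * c_rank + c_measure
        n = int(budget // cost_per_unit)  # measured units affordable
        if n < 1:
            continue
        v = var_of_mean(m, n)
        if best is None or v < best[1]:
            best = (m, v)
    return best[0]
```

With a toy variance model such as `1 / (n * (1 + 0.4*(m - 1)))`, cheap ranking pushes the optimum toward large set sizes, while expensive ranking pushes it back toward m = 1, i.e. plain SRS.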