Binary Data (binary + data)


Kinds of Binary Data

  • longitudinal binary data


  • Selected Abstracts


    Methods for Generating Longitudinally Correlated Binary Data

    INTERNATIONAL STATISTICAL REVIEW, Issue 1 2008
    Patrick J. Farrell
    Summary The analysis of longitudinally correlated binary data has attracted considerable attention of late. Since the estimation of parameters in models for such data is based on asymptotic theory, it is necessary to investigate the small-sample properties of estimators by simulation. In this paper, we review the mechanisms that have been proposed for generating longitudinally correlated binary data. We compare and contrast these models with regard to various features, including computational efficiency, flexibility and the range restrictions that they impose on the longitudinal association parameters. Some extensions to the data generation mechanism originally suggested by Kanter (1975) are proposed. [source]
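The review above concerns mechanisms for generating binary series with a prescribed marginal probability and serial correlation. As a minimal sketch (not one of the specific mechanisms surveyed in the paper, and with illustrative values p = 0.3, rho = 0.4), a stationary two-state Markov chain can be tuned so that its marginal probability is p and its lag-1 correlation is rho:

```python
import random

def correlated_binary_series(p, rho, length, rng):
    """Generate one binary series with marginal P(Y=1)=p and
    lag-1 correlation rho via a stationary two-state Markov chain."""
    # Transition probabilities chosen so that the chain is stationary at p
    # and Corr(Y_t, Y_{t-1}) = rho (valid when both stay inside [0, 1]).
    p11 = p + rho * (1.0 - p)   # P(Y_t = 1 | Y_{t-1} = 1)
    p01 = p * (1.0 - rho)       # P(Y_t = 1 | Y_{t-1} = 0)
    y = [1 if rng.random() < p else 0]
    for _ in range(length - 1):
        prob = p11 if y[-1] == 1 else p01
        y.append(1 if rng.random() < prob else 0)
    return y

rng = random.Random(42)
series = [correlated_binary_series(0.3, 0.4, 10, rng) for _ in range(20000)]

# Check the marginal probability and lag-1 correlation by simulation.
flat = [y for s in series for y in s]
pairs = [(s[t - 1], s[t]) for s in series for t in range(1, len(s))]
p_hat = sum(flat) / len(flat)
mean_prod = sum(a * b for a, b in pairs) / len(pairs)
rho_hat = (mean_prod - p_hat * p_hat) / (p_hat * (1.0 - p_hat))
```

This is exactly the kind of small-sample simulation setup the review motivates: the generator hits its targets only approximately in finite samples, which is why range restrictions on the association parameters matter.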


    Modelling Binary Data (Second Edition) Collett D (2003) ISBN 1584883243; 387 pages; CRC Press; http://www.crcpress.com/shopping_cart/products/product_detail.asp?sku=C3243

    PHARMACEUTICAL STATISTICS: THE JOURNAL OF APPLIED STATISTICS IN THE PHARMACEUTICAL INDUSTRY, Issue 1 2004
    Richard Kay
    No abstract is available for this article. [source]


    Analysis of Misclassified Correlated Binary Data Using a Multivariate Probit Model when Covariates are Subject to Measurement Error

    BIOMETRICAL JOURNAL, Issue 3 2009
    Surupa Roy
    Abstract A multivariate probit model for correlated binary responses given the predictors of interest has been considered. Some of the responses are subject to classification errors and hence are not directly observable. Also, measurements on some of the predictors are not available; instead, measurements on their surrogates are available. However, the conditional distribution of the unobservable predictors given the surrogates is completely specified. Models are proposed taking into account either or both of these sources of errors. Likelihood-based methodologies are proposed to fit these models. To ascertain the effect of ignoring classification errors and/or measurement error on the estimates of the regression and correlation parameters, a sensitivity study is carried out through simulation. Finally, the proposed methodology is illustrated through an example. [source]
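To see why classification errors matter even in the simplest setting, consider how they distort a single prevalence estimate. The arithmetic below is the classical Rogan-Gladen correction, not the multivariate probit machinery of the paper, and the sensitivity/specificity values are assumed for illustration:

```python
# True prevalence and assumed (known) misclassification rates -- illustrative values.
p_true = 0.30
sens = 0.90   # P(observed 1 | true 1)
spec = 0.95   # P(observed 0 | true 0)

# Prevalence of the *observed* (error-prone) response:
p_obs = p_true * sens + (1.0 - p_true) * (1.0 - spec)

# A naive analysis uses p_obs directly; the classical correction inverts
# the misclassification relation (Rogan-Gladen estimator):
p_corrected = (p_obs - (1.0 - spec)) / (sens + spec - 1.0)
```

Even modest error rates shift the naive estimate (here from 0.30 to about 0.305); in regression settings the same mechanism attenuates coefficient estimates, which is what the paper's sensitivity study quantifies.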


    Meta-Analysis of Binary Data Using Profile Likelihood by BÖHNING, D., KUHNERT, R., and RATTANASIRI, S.

    BIOMETRICS, Issue 2 2009
    Eloise Kaizar
    No abstract is available for this article. [source]


    A Two-Part Joint Model for the Analysis of Survival and Longitudinal Binary Data with Excess Zeros

    BIOMETRICS, Issue 2 2008
    Dimitris Rizopoulos
    Summary Many longitudinal studies generate both the time to some event of interest and repeated measures data. This article is motivated by a study on patients with a renal allograft, in which interest lies in the association between longitudinal proteinuria (a dichotomous variable) measurements and the time to renal graft failure. An interesting feature of the sample at hand is that nearly half of the patients never tested positive for proteinuria (≥1 g/day) during follow-up, which introduces a degenerate part in the random-effects density for the longitudinal process. In this article we propose a two-part shared parameter model framework that effectively takes this feature into account, and we investigate sensitivity to the various dependence structures used to describe the association between the longitudinal measurements of proteinuria and the time to renal graft failure. [source]


    Test of Marginal Compatibility and Smoothing Methods for Exchangeable Binary Data with Unequal Cluster Sizes

    BIOMETRICS, Issue 1 2007
    Zhen Pang
    Summary Exchangeable binary data are often collected in developmental toxicity and other studies, and a whole host of parametric distributions for fitting this kind of data have been proposed in the literature. While these distributions can be matched to have the same marginal probability and intra-cluster correlation, they can be quite different in terms of shape and higher-order quantities of interest such as the litter-level risk of having at least one malformed fetus. A sensible alternative is to fit a saturated model (Bowman and George, 1995, Journal of the American Statistical Association 90, 871-879) using the expectation-maximization (EM) algorithm proposed by Stefanescu and Turnbull (2003, Biometrics 59, 18-24). The assumption of compatibility of marginal distributions is often made to link up the distributions for different cluster sizes so that estimation can be based on the combined data. Stefanescu and Turnbull proposed a modified trend test to test this assumption. Their test, however, fails to take into account the variability of an estimated null expectation and as a result leads to inaccurate p-values. This drawback is rectified in this article. When the data are sparse, the probability function estimated using a saturated model can be very jagged and some kind of smoothing is needed. We extend the penalized likelihood method (Simonoff, 1983, Annals of Statistics 11, 208-218) to the present case of unequal cluster sizes and implement the method using an EM-type algorithm. In the presence of covariates, we propose a penalized kernel method that performs smoothing in both the covariate and response space. The proposed methods are illustrated using several data sets and the sampling and robustness properties of the resulting estimators are evaluated by simulations. [source]


    Marginal Analysis of Incomplete Longitudinal Binary Data: A Cautionary Note on LOCF Imputation

    BIOMETRICS, Issue 3 2004
    Richard J. Cook
    Summary In recent years there has been considerable research devoted to the development of methods for the analysis of incomplete data in longitudinal studies. Despite these advances, the methods used in practice have changed relatively little, particularly in the reporting of pharmaceutical trials. In this setting, perhaps the most widely adopted strategy for dealing with incomplete longitudinal data is imputation by the "last observation carried forward" (LOCF) approach, in which values for missing responses are imputed using observations from the most recently completed assessment. We examine the asymptotic and empirical bias, the empirical type I error rate, and the empirical coverage probability associated with estimators and tests of treatment effect based on the LOCF imputation strategy. We consider a setting involving longitudinal binary data with longitudinal analyses based on generalized estimating equations, and an analysis based simply on the response at the end of the scheduled follow-up. We find that for both of these approaches, imputation by LOCF can lead to substantial biases in estimators of treatment effects, the type I error rates of associated tests can be greatly inflated, and the coverage probability can be far from the nominal level. Alternative analyses based on all available data lead to estimators with comparatively small bias, and inverse probability weighted analyses yield consistent estimators subject to correct specification of the missing data process. We illustrate the differences between various methods of dealing with drop-outs using data from a study of smoking behavior. [source]
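The mechanics of LOCF are trivial, which is part of why it is so widely used despite the biases documented above. A minimal sketch (missing responses encoded as `None`; names are illustrative):

```python
def locf(responses):
    """Impute missing values (None) with the last observed response."""
    out, last = [], None
    for r in responses:
        if r is None:
            out.append(last)   # carry the most recent completed assessment forward
        else:
            out.append(r)
            last = r
    return out

# A subject observed at 5 scheduled visits who drops out after visit 3:
observed = [1, 1, 0, None, None]
imputed = locf(observed)   # the visit-3 response is frozen to the end of follow-up
```

The frozen tail is exactly the problem: a subject who drops out while symptom-free is treated as symptom-free forever, regardless of why follow-up ended, which is how the biases and inflated type I error rates described above arise.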


    Testing for Spatial Correlation in Nonstationary Binary Data, with Application to Aberrant Crypt Foci in Colon Carcinogenesis

    BIOMETRICS, Issue 4 2003
    Tatiyana V. Apanasovich
    Summary. In an experiment to understand colon carcinogenesis, all animals were exposed to a carcinogen, with half the animals also being exposed to radiation. Spatially, we measured the existence of what are referred to as aberrant crypt foci (ACF), namely, morphologically changed colonic crypts that are known to be precursors of colon cancer development. The biological question of interest is whether the locations of these ACFs are spatially correlated: if so, this indicates that damage to the colon due to carcinogens and radiation is localized. Statistically, the data take the form of binary outcomes (corresponding to the existence of an ACF) on a regular grid. We develop score-type methods based upon the Matérn and conditionally autoregressive (CAR) correlation models to test for spatial correlation in such data, while allowing for nonstationarity. Because of a technical peculiarity of the score-type test, we also develop robust versions of the method. The methods are compared to a generalization of Moran's test for continuous outcomes, and are shown via simulation to have the potential for increased power. When applied to our data, the methods indicate the existence of spatial correlation, and hence indicate localization of damage. [source]
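A much simpler ancestor of such tests, useful for intuition about binary outcomes on a regular grid, is the join-count statistic: clustered presences produce many adjacent 1-1 pairs, scattered presences few. This sketch is not the paper's Matérn/CAR score test, just the basic counting idea behind Moran-type comparisons:

```python
def bb_join_count(grid):
    """Count 1-1 ('black-black') joins among rook neighbours on a regular
    grid of binary outcomes; large counts suggest spatial clustering."""
    rows, cols = len(grid), len(grid[0])
    count = 0
    for i in range(rows):
        for j in range(cols):
            if grid[i][j] != 1:
                continue
            if i + 1 < rows and grid[i + 1][j] == 1:  # neighbour below
                count += 1
            if j + 1 < cols and grid[i][j + 1] == 1:  # neighbour to the right
                count += 1
    return count

clustered = [[1, 1, 0],
             [1, 1, 0],
             [0, 0, 0]]
scattered = [[1, 0, 1],
             [0, 1, 0],
             [1, 0, 0]]
```

Both grids contain four presences, but the clustered grid has four 1-1 joins and the scattered grid none. A formal test would compare the observed count to its permutation distribution; the score-type tests above go further by allowing nonstationary marginal probabilities.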


    Variable Selection in High-Dimensional Multivariate Binary Data with Application to the Analysis of Microbial Community DNA Fingerprints

    BIOMETRICS, Issue 2 2002
    J. D. Wilbur
    Summary. In order to understand the relevance of microbial communities on crop productivity, the identification and characterization of the rhizosphere soil microbial community is necessary. Characteristic profiles of the microbial communities are obtained by denaturing gradient gel electrophoresis (DGGE) of polymerase chain reaction (PCR) amplified 16S rDNA from soil-extracted DNA. These characteristic profiles, commonly called community DNA fingerprints, can be represented in the form of high-dimensional binary vectors. We address the problem of modeling and variable selection in high-dimensional multivariate binary data and present an application of our methodology in the context of a controlled agricultural experiment. [source]


    Tests for Order Restrictions in Binary Data

    BIOMETRICS, Issue 4 2001
    Shyamal D. Peddada
    Summary. In this article, a general procedure is presented for testing for equality of k independent binary response probabilities against any given ordered alternative. The proposed methodology is based on an estimation procedure developed in Hwang and Peddada (1994, Annals of Statistics 22, 67-93) and can be used for a very broad class of order restrictions. The procedure is illustrated through application to two data sets that correspond to three commonly encountered order restrictions: simple tree order, simple order, and down turn order. [source]


    Assessment of short-term association between health outcomes and ozone concentrations using a Markov regression model

    ENVIRONMETRICS, Issue 3 2003
    Abdelkrim Zeghnoun
    Abstract Longitudinal binary data are often used in panel studies where short-term associations between air pollutants and respiratory health outcomes are investigated. A Markov regression model in which the transition probabilities depend on the covariates, as well as the past responses, was used to study the short-term association between daily ozone (O3) concentrations and respiratory health outcomes in a panel of schoolchildren in Armentières, Northern France. The results suggest that there was a small but statistically significant association between O3 and children's cough episodes. A 10 µg/m3 increase in O3 concentrations was associated with a 13.9% increase in cough symptoms (95% CI: 1.2-28.1%). The use of a Markov regression model can be useful as it permits one to address easily both the regression objective and the stochastic dependence between successive observations. However, it is important to verify the sensitivity of the Markov regression parameters to the time-dependence structure. In this study, it was found that, although what happened on the previous day was a strong predictor of what happened on the current day, this did not contradict the O3-respiratory symptom associations. Compared to the Markov regression model, the signs of the parameter estimates of marginal and random-intercept models remain the same. The magnitudes of the O3 effects were also essentially the same in the three models, whose confidence intervals overlapped. Copyright © 2003 John Wiley & Sons, Ltd. [source]
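The core of such a Markov regression model is that today's transition probability is conditioned on yesterday's response. A stripped-down simulation (no ozone covariate; the transition probabilities are assumed, illustrative values, not estimates from the study) shows how the first-order dependence can be recovered empirically from panel data:

```python
import random

rng = random.Random(7)
# Simulate daily cough indicators for a panel of children where today's
# symptom depends only on yesterday's (first-order Markov dependence).
P_AFTER_COUGH = 0.6      # P(cough today | cough yesterday) -- assumed value
P_AFTER_NO_COUGH = 0.1   # P(cough today | no cough yesterday) -- assumed value

def simulate_child(days):
    y = [1 if rng.random() < 0.2 else 0]
    for _ in range(days - 1):
        prob = P_AFTER_COUGH if y[-1] else P_AFTER_NO_COUGH
        y.append(1 if rng.random() < prob else 0)
    return y

panel = [simulate_child(60) for _ in range(500)]
transitions = [(s[t - 1], s[t]) for s in panel for t in range(1, len(s))]

# Empirical transition probabilities pooled across the panel:
p11 = (sum(b for a, b in transitions if a == 1)
       / sum(1 for a, _ in transitions if a == 1))
p01 = (sum(b for a, b in transitions if a == 0)
       / sum(1 for a, _ in transitions if a == 0))
```

A full Markov regression would put covariates such as daily O3 on the logit of these transition probabilities; the sensitivity check described above amounts to asking whether the covariate effects survive when this strong serial dependence is modelled differently.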




    Linkage Disequilibrium Mapping of Disease Susceptibility Genes in Human Populations

    INTERNATIONAL STATISTICAL REVIEW, Issue 1 2000
    David Clayton
    Summary The paper reviews recent work on statistical methods for using linkage disequilibrium to locate disease susceptibility genes, given a set of marker genes at known positions in the genome. The paper starts by considering a simple deterministic model for linkage disequilibrium and discusses recent attempts to elaborate it to include the effects of stochastic influences, of "drift", by the use of either Wright-Fisher models or approaches based on the coalescence of the genealogy of the sample of disease chromosomes. Most of this first part of the paper concerns a series of diallelic markers and, in this case, the models so far proposed are hierarchical probability models for multivariate binary data. Likelihoods are intractable and most approaches to linkage disequilibrium mapping amount to marginal models for pairwise associations between individual markers and the disease susceptibility locus. Approaches to evaluation of a full likelihood require Monte Carlo methods in order to integrate over the large number of unknowns. The fact that the initial state of the stochastic process which has led to present-day allele frequencies is unknown is noted and its implications for the hierarchical probability model are discussed. Difficulties and opportunities arising as a result of more polymorphic markers and extended marker haplotypes are indicated. Connections between the hierarchical modelling approach and methods based upon identity by descent and haplotype sharing by seemingly unrelated cases are explored. Finally, problems resulting from unknown modes of inheritance, incomplete penetrance, and "phenocopies" are briefly reviewed. [source]


    Generalized marker regression and interval QTL mapping methods for binary traits in half-sib family designs

    JOURNAL OF ANIMAL BREEDING AND GENETICS, Issue 5 2001
    H. N. Kadarmideen
    A Generalized Marker Regression Mapping (GMR) approach was developed for mapping Quantitative Trait Loci (QTL) affecting binary polygenic traits in a single-family half-sib design. The GMR is based on threshold-liability model theory and regression of offspring phenotype on expected marker genotypes at flanking marker loci. Using simulation, statistical power and bias of QTL mapping for binary traits by GMR were compared with full QTL interval mapping based on a threshold model (GIM) and with a linear marker regression mapping method (LMR). Empirical significance threshold values, power and estimates of QTL location and effect were identical for GIM and GMR when QTL mapping was restricted to within the marker interval. These results show that the theory of the marker regression method for QTL mapping is also applicable to binary traits and possibly for traits with other non-normal distributions. The linear and threshold models based on marker regression (LMR and GMR) also resulted in similar estimates and power for large progeny group sizes, indicating that LMR can be used for binary data for balanced designs with large families, as this method is computationally simpler than GMR. GMR may have a greater potential than LMR for QTL mapping for binary traits in complex situations such as QTL mapping with complex pedigrees, random models and models with interactions. [source]


    Regression modelling of correlated data in ecology: subject-specific and population averaged response patterns

    JOURNAL OF APPLIED ECOLOGY, Issue 5 2009
    John Fieberg
    Summary 1. Statistical methods that assume independence among observations result in optimistic estimates of uncertainty when applied to correlated data, which are ubiquitous in applied ecological research. Mixed effects models offer a potential solution and rely on the assumption that latent or unobserved characteristics of individuals (i.e. random effects) induce correlation among repeated measurements. However, careful consideration must be given to the interpretation of parameters when using a nonlinear link function (e.g. logit). Mixed model regression parameters reflect the change in the expected response within an individual associated with a change in that individual's covariates [i.e. a subject-specific (SS) interpretation], which may not address a relevant scientific question. In particular, a SS interpretation is not natural for covariates that do not vary within individuals (e.g. gender). 2. An alternative approach combines the solution to an unbiased estimating equation with robust measures of uncertainty to make inferences regarding predictor-outcome relationships. Regression parameters describe changes in the average response among groups of individuals differing in their covariates [i.e. a population-averaged (PA) interpretation]. 3. We compare these two approaches [mixed models and generalized estimating equations (GEE)] with illustrative examples from a 3-year study of mallard (Anas platyrhynchos) nest structures. We observe that PA and SS responses differ when modelling binary data, with PA parameters behaving like attenuated versions of SS parameters. Differences between SS and PA parameters increase with the size of among-subject heterogeneity captured by the random effects variance component. Lastly, we illustrate how PA inferences can be derived (post hoc) from fitted generalized and nonlinear mixed models. 4. Synthesis and applications. Mixed effects models and GEE offer two viable approaches to modelling correlated data. The preferred method should depend primarily on the research question (i.e. desired parameter interpretation), although operating characteristics of the associated estimation procedures should also be considered. Many applied questions in ecology, wildlife management and conservation biology (including the current illustrative examples) focus on population performance measures (e.g. mean survival or nest success rates) as a function of general landscape features, for which the PA interpretation, rather than the more commonly used SS interpretation, may be more natural. [source]
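The attenuation of PA relative to SS parameters noted in point 3 can be demonstrated numerically: averaging a logistic curve over a random-intercept distribution flattens it. A small sketch under assumed parameter values (random-intercept logistic model, Monte Carlo integration over the random effect):

```python
import math
import random

rng = random.Random(1)
b0, b1 = -1.0, 1.5   # subject-specific (conditional) intercept and slope -- assumed
sigma = 2.0          # SD of the random intercept -- assumed

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

def logit(p):
    return math.log(p / (1.0 - p))

def marginal_prob(x, draws=100000):
    # Average the subject-specific success probability over the random-intercept
    # distribution to obtain the population-averaged response at covariate x.
    total = 0.0
    for _ in range(draws):
        u = rng.gauss(0.0, sigma)
        total += logistic(b0 + b1 * x + u)
    return total / draws

# Implied population-averaged slope on the logit scale between x=0 and x=1:
pa_slope = logit(marginal_prob(1.0)) - logit(marginal_prob(0.0))
```

With these values the PA slope comes out near 0.9, well below the SS slope of 1.5, matching the rule of thumb that the attenuation grows with the random-effects variance (roughly a factor of 1/sqrt(1 + 0.346 sigma^2) for the logit link).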


    Antisolvent crystallization of anhydrous sodium carbonate at atmospherical conditions

    AICHE JOURNAL, Issue 3 2001
    Harald Oosterhof
    When antisolvents are applied to crystallize sodium carbonate from aqueous solutions, the transition temperature at which the hydrates are in equilibrium is decreased. Two models proposed can predict the influence of the amount and type of antisolvent on the transition temperature. Only binary data of the water/sodium carbonate system and measured vapor pressures over ternary soda-saturated mixtures of water and antisolvent are needed. To validate the two models, continuous crystallization experiments were carried out at various temperatures using ethylene glycol (EG) and diethylene glycol (DEG) as antisolvent, in varying concentrations. Both models predict the influence of the antisolvent on the transition temperature with good accuracy. Anhydrous soda with bulk densities of up to 950 kg/m3 was crystallized at temperatures as low as 80°C. [source]


    Causal inference with generalized structural mean models

    JOURNAL OF THE ROYAL STATISTICAL SOCIETY: SERIES B (STATISTICAL METHODOLOGY), Issue 4 2003
    S. Vansteelandt
    Summary. We estimate cause-effect relationships in empirical research where exposures are not completely controlled, as in observational studies or with patient non-compliance and self-selected treatment switches in randomized clinical trials. Additive and multiplicative structural mean models have proved useful for this but suffer from the classical limitations of linear and log-linear models when accommodating binary data. We propose the generalized structural mean model to overcome these limitations. This is a semiparametric two-stage model which extends the structural mean model to handle non-linear average exposure effects. The first-stage structural model describes the causal effect of received exposure by contrasting the means of observed and potential exposure-free outcomes in exposed subsets of the population. For identification of the structural parameters, a second-stage 'nuisance' model is introduced. This takes the form of a classical association model for expected outcomes given observed exposure. Under the model, we derive estimating equations which yield consistent, asymptotically normal and efficient estimators of the structural effects. We examine their robustness to model misspecification and construct robust estimators in the absence of any exposure effect. The double-logistic structural mean model is developed in more detail to estimate the effect of observed exposure on the success of treatment in a randomized controlled blood pressure reduction trial with self-selected non-compliance. [source]


    A multifaceted sensitivity analysis of the Slovenian public opinion survey data

    JOURNAL OF THE ROYAL STATISTICAL SOCIETY: SERIES C (APPLIED STATISTICS), Issue 2 2009
    Caroline Beunckens
    Summary. Many models to analyse incomplete data have been developed that allow the missing data to be missing not at random. Awareness has grown that such models are based on unverifiable assumptions, in the sense that they rest on the (incomplete) data only in part, but that inferences nevertheless depend on what the model predicts about the unobserved data, given the observed data. This explains why, nowadays, considerable work is being devoted to assessing how sensitive models for incomplete data are to the particular model chosen, the family of models chosen and the effect of (a group of) influential subjects. For each of these categories, several proposals have been formulated, studied theoretically and/or by simulations, and applied to sets of data. It is, however, uncommon to explore various sensitivity analysis avenues simultaneously. We apply a collection of such tools, some after extension, to incomplete counts arising from cross-classified binary data from the so-called Slovenian public opinion survey. By bringing together, for the first time, a variety of sensitivity analysis tools on the same set of data, we can sketch a comprehensive sensitivity analysis picture. We show that missingness at random estimates of the proportion voting in favour of independence are insensitive to the precise choice of missingness at random model and close to the actual plebiscite results, whereas the missingness not at random models that are furthest from the plebiscite results are vulnerable to the influence of outlying cases. Our approach helps to illustrate the value of comprehensive sensitivity analysis. Ideas are formulated on the methodology's use beyond the data analysis that we consider. [source]


    Estimating herd-specific force of infection by using random-effects models for clustered binary data and monotone fractional polynomials

    JOURNAL OF THE ROYAL STATISTICAL SOCIETY: SERIES C (APPLIED STATISTICS), Issue 5 2006
    Christel Faes
    Summary. In veterinary epidemiology, we are often confronted with hierarchical or clustered data. Typically animals are grouped within herds, and consequently we cannot ignore the possibility of animals within herds being more alike than between herds. Based on a serological survey of bovine herpes virus type 1 in cattle, we describe a method for the estimation of herd-specific rates at which susceptible animals acquire the infection at different ages. In contrast with the population-averaged force of infection, this method allows us to model the herd-specific force of infection, allowing investigation of the variability between herds. A random-effects approach is used to account for the correlation in the data, allowing us to study both the population-averaged and the herd-specific force of infection. In contrast, generalized estimating equations can be used when interest is only in the population-averaged force of infection. Further, a flexible predictor model is needed to describe the dependence on covariates appropriately. Fractional polynomials as proposed by Royston and Altman offer such flexibility. However, the flexibility of this model should be restricted, since only positive forces of infection have a meaningful interpretation. [source]
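The force of infection is the age-specific rate at which susceptibles become infected. The simplest version of the idea (a constant-rate catalytic model, not the paper's random-effects fractional-polynomial model; all numerical values assumed for illustration) links seroprevalence at age a to the rate lam by pi(a) = 1 - exp(-lam * a), which can be inverted directly:

```python
import math

def seroprevalence(lam, age):
    """P(seropositive by a given age) under a constant force of infection lam."""
    return 1.0 - math.exp(-lam * age)

def foi_from_prevalence(pi, age):
    """Invert pi(a) = 1 - exp(-lam * a) to recover the force of infection."""
    return -math.log(1.0 - pi) / age

lam_true = 0.15                             # assumed infections per animal-year
pi_at_5 = seroprevalence(lam_true, 5.0)     # expected prevalence at age 5
lam_hat = foi_from_prevalence(pi_at_5, 5.0) # recovers lam exactly here
```

The paper generalizes this in two directions: the rate varies smoothly with age (fractional polynomials, constrained so the implied force stays positive) and varies between herds (random effects), rather than being a single constant.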


    High dimensional multivariate mixed models for binary questionnaire data

    JOURNAL OF THE ROYAL STATISTICAL SOCIETY: SERIES C (APPLIED STATISTICS), Issue 4 2006
    Steffen Fieuws
    Summary. Questionnaires that are used to measure the effect of an intervention often consist of different sets of items, each set possibly measuring another concept. Mixed models with set-specific random effects are a flexible tool to model the different sets of items jointly. However, computational problems typically arise as the number of sets increases. This is especially true when the random-effects distribution cannot be integrated out analytically, as with mixed models for binary data. A pairwise modelling strategy, in which all possible bivariate mixed models are fitted and where inference follows from pseudolikelihood theory, has been proposed as a solution. This approach has been applied to assess the effect of physical activity on psychocognitive functioning, the latter measured by a battery of questionnaires. [source]


    Analysis of longitudinal multiple-source binary data using generalized estimating equations

    JOURNAL OF THE ROYAL STATISTICAL SOCIETY: SERIES C (APPLIED STATISTICS), Issue 1 2004
    Liam M. O'Brien
    Summary. We present a multivariate logistic regression model for the joint analysis of longitudinal multiple-source binary data. Longitudinal multiple-source binary data arise when repeated binary measurements are obtained from two or more sources, with each source providing a measure of the same underlying variable. Since the number of responses on each subject is relatively large, the empirical variance estimator performs poorly and cannot be relied on in this setting. Two methods for obtaining a parsimonious within-subject association structure are considered. An additional complication arises with estimation, since maximum likelihood estimation may not be feasible without making unrealistically strong assumptions about third- and higher order moments. To circumvent this, we propose the use of a generalized estimating equations approach. Finally, we present an analysis of multiple-informant data obtained longitudinally from a psychiatric interventional trial that motivated the model developed in the paper. [source]


    Comparison of T-RFLP and DGGE techniques to assess denitrifier community composition in soil

    LETTERS IN APPLIED MICROBIOLOGY, Issue 1 2009
    K. Enwall
    Abstract Terminal restriction fragment length polymorphism (T-RFLP) and denaturing gradient gel electrophoresis (DGGE), together with subsequent statistical analysis, were compared to assess denitrifier community composition in agricultural soil based on the nosZ gene, encoding the nitrous oxide reductase. Analysis of binary or relative abundance-based metric and semi-metric distance matrices provided similar results for DGGE, but not for T-RFLP. Moreover, DGGE had a higher resolution than T-RFLP, and binary data were better for discriminating between samples. [source]
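
The contrast between binary and relative abundance-based distance matrices can be illustrated with two standard choices: Jaccard (presence/absence) and Bray-Curtis (abundance-based, semi-metric). A small generic sketch, not tied to the authors' software:

```python
import numpy as np

def jaccard(a, b):
    """Binary (presence/absence) distance: 1 - |intersection| / |union|."""
    a, b = np.asarray(a) > 0, np.asarray(b) > 0
    union = np.sum(a | b)
    return 0.0 if union == 0 else 1.0 - np.sum(a & b) / union

def bray_curtis(a, b):
    """Abundance-based semi-metric distance on relative band intensities."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    denom = np.sum(a + b)
    return 0.0 if denom == 0 else np.sum(np.abs(a - b)) / denom
```

Two fingerprint profiles with identical band patterns but different intensities have Jaccard distance 0 yet a positive Bray-Curtis distance, which is why the two kinds of matrices can order samples differently.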


    Statistical analysis of amplified fragment length polymorphism data: a toolbox for molecular ecologists and evolutionists

    MOLECULAR ECOLOGY, Issue 18 2007
    A. Bonin
    Abstract Recently, the amplified fragment length polymorphism (AFLP) technique has gained a lot of popularity, and is now frequently applied to a wide variety of organisms. Technical specificities of the AFLP procedure have been well documented over the years, but information about the statistical analysis of AFLPs remains scarce and scattered. In this review, we describe the various methods available to handle AFLP data, focusing on four research topics at the population or individual level of analysis: (i) assessment of genetic diversity; (ii) identification of population structure; (iii) identification of hybrid individuals; and (iv) detection of markers associated with phenotypes. Two kinds of analysis methods can be distinguished, depending on whether they are based on the direct study of band presences or absences in AFLP profiles ('band-based' methods), or on allelic frequencies estimated at each locus from these profiles ('allele frequency-based' methods). We investigate the characteristics and limitations of these statistical tools; finally, we appeal for a wider adoption of methodologies borrowed from other research fields, such as those specifically designed to deal with binary data. [source]
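
The distinction between 'band-based' and 'allele frequency-based' methods can be made concrete for a dominant marker: under Hardy-Weinberg equilibrium the band-absent phenotype has frequency q^2, so the null-allele frequency is estimable from band counts alone. A hedged sketch (the square-root estimator shown here is only the simplest of the estimators used for AFLPs, and the function names are hypothetical):

```python
import math

def band_frequency(profiles, locus):
    """Band-based summary: fraction of individuals showing the band."""
    present = sum(ind[locus] for ind in profiles)
    return present / len(profiles)

def allele_frequency(profiles, locus):
    """Allele frequency-based summary for a dominant marker: under
    Hardy-Weinberg equilibrium, band absence has frequency q^2,
    so q = sqrt(1 - f_band)."""
    q = math.sqrt(1.0 - band_frequency(profiles, locus))
    return 1.0 - q  # frequency of the band-present allele
```

The band frequency and the estimated allele frequency can differ substantially (75% band presence corresponds to an allele frequency of only 0.5), which is one reason the two families of methods can disagree.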


    The combined effect of SNP-marker and phenotype attributes in genome-wide association studies

    ANIMAL GENETICS, Issue 2 2009
    E. K. F. Chan
    Summary The last decade has seen rapid improvements in high-throughput single nucleotide polymorphism (SNP) genotyping technologies that have consequently made genome-wide association studies (GWAS) possible. With tens to hundreds of thousands of SNP markers being tested simultaneously in GWAS, it is imperative to appropriately pre-process, or filter out, those SNPs that may lead to false associations. This paper explores the relationships between various SNP genotype and phenotype attributes and their effects on false associations. We show that (i) uniformly distributed ordinal data as well as binary data are more easily influenced, though not necessarily negatively, by differences in various SNP attributes compared with normally distributed data; (ii) filtering SNPs on minor allele frequency (MAF) and extent of Hardy–Weinberg equilibrium (HWE) deviation has little effect on the overall false positive rate; (iii) in some cases, filtering on MAF only serves to exclude SNPs from the analysis without reduction of the overall proportion of false associations; and (iv) HWE, MAF and heterozygosity are all dependent on minor genotype frequency, a newly proposed measure for genotype integrity. [source]
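
The attributes the abstract relates (MAF, HWE deviation, heterozygosity and minor genotype frequency) can all be computed from the three genotype counts of a biallelic SNP, which makes their mutual dependence easy to inspect. A sketch with a hypothetical function name:

```python
def snp_summaries(n_aa, n_ab, n_bb):
    """Per-SNP attributes from genotype counts (AA, AB, BB)."""
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)      # frequency of allele A
    maf = min(p, 1 - p)                  # minor allele frequency
    het = n_ab / n                       # observed heterozygosity
    mgf = min(n_aa, n_ab, n_bb) / n      # minor genotype frequency
    # Hardy-Weinberg chi-square: observed vs expected genotype counts
    expected = [n * p * p, 2 * n * p * (1 - p), n * (1 - p) ** 2]
    observed = [n_aa, n_ab, n_bb]
    hwe_chi2 = sum((o - e) ** 2 / e
                   for o, e in zip(observed, expected) if e > 0)
    return {"maf": maf, "het": het, "mgf": mgf, "hwe_chi2": hwe_chi2}
```

For counts exactly at Hardy-Weinberg proportions (e.g. 25/50/25) the chi-square is zero while MAF, heterozygosity and minor genotype frequency all remain informative, illustrating that the four measures capture different aspects of a marker.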


    The Log Multinomial Regression Model for Nominal Outcomes with More than Two Attributes

    BIOMETRICAL JOURNAL, Issue 6 2007
    L. Blizzard
    Abstract An estimate of the risk or prevalence ratio, adjusted for confounders, can be obtained from a log binomial model (binomial errors, log link) fitted to binary outcome data. We propose a modification of the log binomial model to obtain relative risk estimates for nominal outcomes with more than two attributes (the "log multinomial model"). Extensive data simulations were undertaken to compare the performance of the log multinomial model with that of an expanded data multinomial logistic regression method based on the approach proposed by Schouten et al. (1993) for binary data, and with that of separate fits of a Poisson regression model based on the approach proposed by Zou (2004) and Carter, Lipsitz and Tilley (2005) for binary data. Log multinomial regression resulted in "inadmissible" solutions (out-of-bounds probabilities) exceeding 50% in some data settings. Coefficient estimates by the alternative methods produced out-of-bounds probabilities for the log multinomial model in up to 27% of samples to which a log multinomial model had been successfully fitted. The log multinomial coefficient estimates generally had lesser relative bias and mean squared error than the alternative methods. The practical utility of the log multinomial regression model was demonstrated with a real data example. The log multinomial model offers a practical solution to the problem of obtaining adjusted estimates of the risk ratio in the multinomial setting, but must be used with some care and attention to detail. (© 2007 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim) [source]
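
The "inadmissible" solutions are easy to picture: under a log link the coefficients exponentiate to risk ratios, but nothing forces the fitted value exp(x'beta) to stay at or below 1. A small sketch with illustrative coefficients, not the authors' fitting procedure:

```python
import math

def fitted_risks(X, beta):
    """Fitted probabilities under a log link: p_i = exp(x_i . beta)."""
    return [math.exp(sum(x * b for x, b in zip(row, beta))) for row in X]

def inadmissible(X, beta):
    """Indices of covariate patterns whose fitted 'probability' exceeds 1,
    i.e. the out-of-bounds solutions discussed in the abstract."""
    return [i for i, p in enumerate(fitted_risks(X, beta)) if p > 1.0]

def risk_ratio(beta_j):
    """Under the log link, a coefficient exponentiates to a risk ratio."""
    return math.exp(beta_j)
```

With an intercept of -1.0 and a slope of 0.6, the covariate pattern x = 2 already yields exp(0.2) > 1, so checking the fitted probabilities after estimation is essential.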


    Test of Marginal Compatibility and Smoothing Methods for Exchangeable Binary Data with Unequal Cluster Sizes

    BIOMETRICS, Issue 1 2007
    Zhen Pang
    Summary Exchangeable binary data are often collected in developmental toxicity and other studies, and a whole host of parametric distributions for fitting this kind of data have been proposed in the literature. While these distributions can be matched to have the same marginal probability and intra-cluster correlation, they can be quite different in terms of shape and higher-order quantities of interest such as the litter-level risk of having at least one malformed fetus. A sensible alternative is to fit a saturated model (Bowman and George, 1995, Journal of the American Statistical Association 90, 871–879) using the expectation-maximization (EM) algorithm proposed by Stefanescu and Turnbull (2003, Biometrics 59, 18–24). The assumption of compatibility of marginal distributions is often made to link up the distributions for different cluster sizes so that estimation can be based on the combined data. Stefanescu and Turnbull proposed a modified trend test to test this assumption. Their test, however, fails to take into account the variability of an estimated null expectation and as a result leads to inaccurate p-values. This drawback is rectified in this article. When the data are sparse, the probability function estimated using a saturated model can be very jagged and some kind of smoothing is needed. We extend the penalized likelihood method (Simonoff, 1983, Annals of Statistics 11, 208–218) to the present case of unequal cluster sizes and implement the method using an EM-type algorithm. In the presence of covariates, we propose a penalized kernel method that performs smoothing in both the covariate and response space. The proposed methods are illustrated using several data sets and the sampling and robustness properties of the resulting estimators are evaluated by simulations. [source]
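
Marginal compatibility has a concrete form for exchangeable binary data: the success-count distribution for clusters of size n must arise from the size-(n+1) distribution by dropping one cluster member at random. A sketch of that marginalisation; independent Bernoulli responses, whose counts are binomial, are trivially compatible and serve as the check:

```python
from math import comb

def marginalise(p_next):
    """Given the success-count distribution for clusters of size n+1
    (a list of length n+2), return the compatible distribution for
    size n obtained by dropping one member at random."""
    n1 = len(p_next) - 1  # cluster size n+1
    return [p_next[k] * (n1 - k) / n1 + p_next[k + 1] * (k + 1) / n1
            for k in range(n1)]

def binomial_pmf(n, p):
    """Success-count pmf for n independent Bernoulli(p) responses."""
    return [comb(n, k) * p ** k * (1 - p) ** (n - k) for k in range(n + 1)]
```

A fitted saturated model violates compatibility exactly when its size-n estimates differ from the marginalisation of its size-(n+1) estimates, which is what the trend test discussed in the abstract probes.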


    Marginal Analysis of Incomplete Longitudinal Binary Data: A Cautionary Note on LOCF Imputation

    BIOMETRICS, Issue 3 2004
    Richard J. Cook
    Summary In recent years there has been considerable research devoted to the development of methods for the analysis of incomplete data in longitudinal studies. Despite these advances, the methods used in practice have changed relatively little, particularly in the reporting of pharmaceutical trials. In this setting, perhaps the most widely adopted strategy for dealing with incomplete longitudinal data is imputation by the "last observation carried forward" (LOCF) approach, in which values for missing responses are imputed using observations from the most recently completed assessment. We examine the asymptotic and empirical bias, the empirical type I error rate, and the empirical coverage probability associated with estimators and tests of treatment effect based on the LOCF imputation strategy. We consider a setting involving longitudinal binary data with longitudinal analyses based on generalized estimating equations, and an analysis based simply on the response at the end of the scheduled follow-up. We find that for both of these approaches, imputation by LOCF can lead to substantial biases in estimators of treatment effects, the type I error rates of associated tests can be greatly inflated, and the coverage probability can be far from the nominal level. Alternative analyses based on all available data lead to estimators with comparatively small bias, and inverse probability weighted analyses yield consistent estimators subject to correct specification of the missing data process. We illustrate the differences between various methods of dealing with drop-outs using data from a study of smoking behavior. [source]
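
The mechanics of LOCF, and how it can diverge from an available-data analysis once drop-out is related to response, can be sketched in a few lines (illustrative helper names, not the estimators studied in the paper):

```python
def locf(series):
    """Impute missing (None) responses with the last observed value."""
    out, last = [], None
    for v in series:
        if v is not None:
            last = v
        out.append(last)
    return out

def available_case_mean(series):
    """Analysis using only the observed responses."""
    observed = [v for v in series if v is not None]
    return sum(observed) / len(observed)
```

For a subject whose last observed response was 1 before dropping out, LOCF fills every later visit with 1, so a treatment-arm mean based on imputed data can drift away from the available-case mean; this is the mechanism behind the biases the abstract quantifies.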


    Variable Selection in High-Dimensional Multivariate Binary Data with Application to the Analysis of Microbial Community DNA Fingerprints

    BIOMETRICS, Issue 2 2002
    J. D. Wilbur
    Summary. In order to understand the relevance of microbial communities on crop productivity, the identification and characterization of the rhizosphere soil microbial community is necessary. Characteristic profiles of the microbial communities are obtained by denaturing gradient gel electrophoresis (DGGE) of polymerase chain reaction (PCR) amplified 16S rDNA from soil extracted DNA. These characteristic profiles, commonly called community DNA fingerprints, can be represented in the form of high-dimensional binary vectors. We address the problem of modeling and variable selection in high-dimensional multivariate binary data and present an application of our methodology in the context of a controlled agricultural experiment. [source]
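
A common first step with high-dimensional binary predictors is marginal screening, ranking each binary feature by a 2x2 chi-square association with the outcome. This is a generic sketch of that idea, not the variable-selection method proposed in the paper:

```python
def chi2_2x2(xs, ys):
    """Chi-square statistic for two binary vectors (feature vs outcome)."""
    n = len(xs)
    counts = [[0, 0], [0, 0]]
    for x, y in zip(xs, ys):
        counts[x][y] += 1
    row = [sum(counts[0]), sum(counts[1])]
    col = [counts[0][0] + counts[1][0], counts[0][1] + counts[1][1]]
    stat = 0.0
    for i in range(2):
        for j in range(2):
            e = row[i] * col[j] / n
            if e > 0:
                stat += (counts[i][j] - e) ** 2 / e
    return stat

def screen(feature_columns, ys, top):
    """Keep the `top` binary features most associated with the outcome."""
    scored = [(chi2_2x2(col, ys), j)
              for j, col in enumerate(feature_columns)]
    return [j for _, j in sorted(scored, reverse=True)[:top]]
```

Marginal screening ignores joint effects between bands, which is precisely the limitation that motivates multivariate modeling of fingerprint vectors.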