Data Applications
Kinds of Data Applications
Selected Abstracts

Validation of Group Domain Score Estimates Using a Test of Domain
JOURNAL OF EDUCATIONAL MEASUREMENT, Issue 2 2006
Mary Pommerich
Domain scores have been proposed as a user-friendly way of providing instructional feedback about examinees' skills. Domain performance typically cannot be measured directly; instead, scores must be estimated using available information. Simulation studies suggest that IRT-based methods yield accurate group domain score estimates. Because simulations can represent best-case scenarios for methodology, it is important to verify results with a real data application. This study administered a domain of elementary algebra (EA) items created from operational test forms. An IRT-based group-level domain score was estimated from responses to a subset of taken items (comprised of EA items from a single operational form) and compared to the actual observed domain score. Domain item parameters were calibrated using item responses both from the special study and from national operational administrations of the items. The accuracy of the domain score estimates was evaluated within schools and across school sizes for each set of parameters. The IRT-based domain score estimates typically were closer to the actual domain score than observed performance on the EA items from the single form. Previously simulated findings for the IRT-based domain score estimation procedure were supported by the results of the real data application. [source]

Time Deformation, Continuous Euler Processes and Forecasting
JOURNAL OF TIME SERIES ANALYSIS, Issue 6 2006
Chu-Ping C. Vijverberg
A continuous Euler model has time-varying coefficients. Through a logarithmic time transformation, a continuous Euler model can be transformed to a continuous autoregressive (AR) model. By using continuous Kalman filtering through the Laplace method, this article explores the data application of a continuous Euler process. This time deformation of an Euler process deforms specific time-variant (non-stationary) behaviour to time-invariant (stationary) data on the deformed time scale. With these time-invariant data on the transformed time scale, one may use traditional tools to conduct parameter estimation and forecasts. The obtained results can then be transformed back to the original time scale. Simulated data and actual data, such as bat echolocation and US residential investment growth, are used to demonstrate the usefulness of time deformation in forecasting. The results indicate that fitting a traditional autoregressive moving-average (ARMA) model to an Euler data set without imposing the time transformation leads to forecasts that are out of phase, while the forecasts of an Euler model stay mostly in phase. [source]
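As a rough illustration of the time-deformation idea in the Vijverberg abstract, the Python sketch below simulates a series whose oscillation slows down in ordinary time, resamples it onto an equally spaced grid in s = log(t), fits an ordinary AR(2) model there by least squares, and maps the forecast epochs back to the original time scale. This is a simplified discrete analogue, not the continuous Kalman-filter estimation used in the article, and every numerical setting is invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate an "Euler-like" process: an oscillation whose frequency decays with t,
# so it is non-stationary in ordinary time but regular in s = log(t).
t = np.linspace(1.0, 200.0, 4000)                 # original (calendar) time
x = np.sin(8.0 * np.log(t)) + 0.1 * rng.standard_normal(t.size)

# Deform time: s = log(t), then resample onto an equally spaced grid in s.
s = np.log(t)
s_grid = np.linspace(s[0], s[-1], t.size)
y = np.interp(s_grid, s, x)                       # series indexed by deformed time

def fit_ar(series, p=2):
    """Least-squares AR(p) fit; returns lag coefficients (no intercept for brevity)."""
    Y = series[p:]
    X = np.column_stack([series[p - k:-k] for k in range(1, p + 1)])
    coef, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return coef

coef = fit_ar(y, p=2)

# Forecast a few steps ahead on the deformed scale, then map back to calendar time.
history = list(y[-2:][::-1])                      # most recent values first
forecasts = []
for _ in range(20):
    nxt = coef @ np.array(history[:2])
    forecasts.append(nxt)
    history.insert(0, nxt)

ds = s_grid[1] - s_grid[0]
future_s = s_grid[-1] + ds * np.arange(1, 21)
future_t = np.exp(future_s)                       # forecast epochs in original time
print("AR(2) coefficients on log-time scale:", np.round(coef, 3))
print("first forecast epochs (original time):", np.round(future_t[:3], 2))
```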
A Bayesian Hierarchical Model for Classification with Selection of Functional Predictors
BIOMETRICS, Issue 2 2010
Hongxiao Zhu
In functional data classification, functional observations are often contaminated by various systematic effects, such as random batch effects caused by device artifacts, or fixed effects caused by sample-related factors. These effects may lead to classification bias and thus should not be neglected. Another issue of concern is the selection of functions when predictors consist of multiple functions, some of which may be redundant. The above issues arise in a real data application where we use fluorescence spectroscopy to detect cervical precancer. In this article, we propose a Bayesian hierarchical model that takes into account random batch effects and selects effective functions among multiple functional predictors. Fixed effects or predictors in nonfunctional form are also included in the model. The dimension of the functional data is reduced through orthonormal basis expansion or functional principal components. For posterior sampling, we use a hybrid Metropolis-Hastings/Gibbs sampler, which suffers from slow mixing. An evolutionary Monte Carlo algorithm is applied to improve the mixing. Simulation and real data application show that the proposed model provides accurate selection of functional predictors as well as good classification. [source]

Using the Optimal Robust Receiver Operating Characteristic (ROC) Curve for Predictive Genetic Tests
BIOMETRICS, Issue 2 2010
Qing Lu
Current genome-wide association (GWA) studies represent a powerful approach to uncovering common unknown genetic variants causing common complex diseases. The discovery of these genetic variants offers an important opportunity for early disease prediction, prevention, and individualized treatment. We describe here a method of combining multiple genetic variants for early disease prediction, based on the optimality theory of the likelihood ratio (LR). This theory shows that the receiver operating characteristic (ROC) curve based on the LR has maximum performance at each cutoff point and that the area under the ROC curve so obtained is the highest among all approaches. Through simulations and a real data application, we compared it with the commonly used logistic regression and classification tree approaches. The three approaches show similar performance if we know the underlying disease model. However, for most common diseases we have little prior knowledge of the disease model, and in this situation the new method has an advantage over the logistic regression and classification tree approaches. We applied the new method to the type 1 diabetes GWA data from the Wellcome Trust Case Control Consortium. Based on five single nucleotide polymorphisms, the test reaches a medium level of classification accuracy. With more genetic findings to be discovered in the future, we believe a predictive genetic test for type 1 diabetes can be successfully constructed and eventually implemented for clinical use. [source]
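The Lu abstract builds its classifier from the likelihood ratio of the genotype data, which is ROC-optimal when the disease model is known. The sketch below mimics that comparison on simulated data: genotypes at a handful of independent SNPs are scored by an estimated likelihood ratio and by logistic regression, and the two AUCs are compared. The allele frequencies, odds ratios, and sample size are invented for illustration and are unrelated to the Wellcome Trust type 1 diabetes data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(1)
n, n_snp = 4000, 5

# Assumed control allele frequencies and per-allele log odds ratios (illustrative values).
p0 = np.array([0.30, 0.25, 0.40, 0.20, 0.35])
log_or = np.array([0.40, 0.35, 0.25, 0.50, 0.30])

# Simulate genotypes (0/1/2 risk alleles) and case status from a logistic disease model.
G = rng.binomial(2, p0, size=(n, n_snp))
logit = -1.0 + G @ log_or
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-logit)))

# Likelihood-ratio score assuming independent SNPs: estimate genotype frequencies
# separately in cases and controls, then score log LR(g) = sum_j log P(g_j|case)/P(g_j|control).
def genotype_freqs(Gsub):
    # frequency of genotypes 0, 1, 2 for each SNP, with a small continuity correction
    return np.stack([(np.sum(Gsub == g, axis=0) + 0.5) / (Gsub.shape[0] + 1.5)
                     for g in range(3)], axis=0)      # shape (3, n_snp)

f_case, f_ctrl = genotype_freqs(G[y == 1]), genotype_freqs(G[y == 0])
cols = np.arange(n_snp)
log_lr = np.sum(np.log(f_case[G, cols]) - np.log(f_ctrl[G, cols]), axis=1)

# Logistic regression on the same genotypes for comparison.
lr_model = LogisticRegression().fit(G, y)
p_hat = lr_model.predict_proba(G)[:, 1]

print("AUC, likelihood-ratio score:", round(roc_auc_score(y, log_lr), 3))
print("AUC, logistic regression:  ", round(roc_auc_score(y, p_hat), 3))
```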
Mixture Modeling for Genome-Wide Localization of Transcription Factors
BIOMETRICS, Issue 1 2007
Sündüz Keleş
Chromatin immunoprecipitation followed by DNA microarray analysis (ChIP-chip methodology) is an efficient way of mapping genome-wide protein-DNA interactions. Data from tiling arrays encompass DNA-protein interaction measurements on thousands or millions of short oligonucleotides (probes) tiling a whole chromosome or genome. We propose a new model-based method for analyzing ChIP-chip data. The proposed model is motivated by the widely used two-component multinomial mixture model of de novo motif finding. It utilizes a hierarchical gamma mixture model of binding intensities while incorporating the inherent spatial structure of the data. In this model, genomic regions belong to one of two general groups: regions with a local protein-DNA interaction (peak) and regions lacking this interaction. Individual probes within a genomic region are allowed to have different localization rates, accommodating different binding affinities. A novel feature of this model is the incorporation of a distribution for the peak size derived from the experimental design and parameters. This leads to the relaxation of the fixed peak size assumption that is commonly employed when computing a test statistic for these types of spatial data. Simulation studies and a real data application demonstrate good operating characteristics of the method, including high sensitivity with small sample sizes when compared to available alternative methods. [source]

Lithology and hydrocarbon mapping from multicomponent seismic data
GEOPHYSICAL PROSPECTING, Issue 2 2010
Hüseyin Özdemir
Elastic rock properties can be estimated from prestack seismic data using amplitude variation with offset analysis. P-wave, S-wave and density 'reflectivities', or contrasts, can be inverted from angle-band stacks. The 'reflectivities' are then inverted to absolute acoustic impedance, shear impedance and density. These rock properties can be used to map reservoir parameters through all stages of field development and production. When P-wave contrast is small, or gas clouds obscure reservoir zones, multicomponent ocean-bottom recording of converted-wave (P to S, or Ps) data provides reliable mapping of reservoir boundaries. Angle-band stacks of multicomponent P-wave (Pz) and Ps data can also be inverted jointly. In this paper the Aki-Richards equations are used without simplifications to invert angle-band stacks to 'reflectivities'. This enables the use of reflection seismic data beyond 30° of incidence angle, compared to conventional amplitude variation with offset analysis. It, in turn, provides better shear impedance and density estimates. An important input to amplitude variation with offset analysis is the Vs/Vp ratio. Conventional methods use a constant or a time-varying Vs/Vp model. Here, a time- and space-varying model is used during the computation of the 'reflectivities'. The Vs/Vp model is generated using well log data and picked horizons. For multicomponent data applications, the latter model can also be generated from processing Vs/Vp models and available well data. Reservoir rock properties such as λρ, μρ, Poisson's ratio and bulk modulus can be computed from acoustic impedance, shear impedance and density for pore fill and lithology identification. λ and μ are the Lamé constants and ρ is density. These estimates can also be used for more efficient log property mapping. The Vp/Vs ratio or Poisson's ratio, λρ, and weighted stacks, such as one computed from λρ and λ/μ, are good gas/oil and oil/water contact indicators, i.e., pore fill indicators, while μρ mainly indicates lithology. λρ is also affected by pressure changes. Results from a multicomponent data set are used to illustrate mapping of gas, oil and water saturation and lithology in a Tertiary sand/shale setting. Whilst initial log crossplot analysis suggested that pore fill discrimination may be possible, the inversion was not successful in revealing fluid effects. However, rock properties computed from acoustic impedance, shear impedance and density estimates provided good lithology indicators; pore fill identification was less successful. Neural network analysis using computed rock properties provided good indication of sand/shale distribution away from the existing wells and complemented the results depicted from the individual rock property inversions. [source]
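The Özdemir abstract inverts angle-band stacks for P-wave, S-wave, and density reflectivities via the Aki-Richards equations. The sketch below illustrates the core idea on a single interface using the familiar three-term linearized Aki-Richards approximation and a least-squares fit over angles out to 45°; the paper itself works without these linearizations and on full angle-band stacks, so this is only a textbook-style stand-in with invented reflectivity values.

```python
import numpy as np

def aki_richards(theta, r_vp, r_vs, r_rho, vs_vp):
    """Three-term linearized Aki-Richards reflection coefficient.

    theta : incidence angle(s) in radians
    r_vp  : dVp / (2*Vp)   -- P-velocity reflectivity
    r_vs  : dVs / (2*Vs)   -- S-velocity reflectivity
    r_rho : drho / (2*rho) -- density reflectivity
    vs_vp : background Vs/Vp ratio
    """
    s2 = np.sin(theta) ** 2
    t2 = np.tan(theta) ** 2
    k = vs_vp ** 2
    return ((r_vp + r_rho)
            + (r_vp - 8.0 * k * r_vs - 4.0 * k * r_rho) * s2
            + r_vp * (t2 - s2))

# Forward-model "observed" amplitudes for an illustrative interface, add noise,
# then invert the three reflectivities by least squares.
rng = np.random.default_rng(2)
theta = np.deg2rad(np.arange(5, 46, 5))          # 5..45 degrees, beyond 30 included
true = np.array([-0.06, 0.08, -0.03])            # r_vp, r_vs, r_rho (made up)
vs_vp = 0.5

obs = aki_richards(theta, *true, vs_vp) + 0.005 * rng.standard_normal(theta.size)

# Design matrix: each column is the sensitivity of R(theta) to one reflectivity.
s2, t2, k = np.sin(theta) ** 2, np.tan(theta) ** 2, vs_vp ** 2
A = np.column_stack([1.0 + s2 + (t2 - s2),       # dR / d r_vp
                     -8.0 * k * s2,              # dR / d r_vs
                     1.0 - 4.0 * k * s2])        # dR / d r_rho
est, *_ = np.linalg.lstsq(A, obs, rcond=None)

print("true  [r_vp, r_vs, r_rho]:", true)
print("estimated                :", np.round(est, 3))
```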
Monitoring and controlling QoS network domains
INTERNATIONAL JOURNAL OF NETWORK MANAGEMENT, Issue 1 2005
Ahsan Habib
Increased performance, fairness, and security remain important goals for service providers. In this work, we design an integrated distributed monitoring, traffic conditioning, and flow control system for higher performance and security of network domains. Edge routers monitor (using tomography techniques) a network domain to detect quality of service (QoS) violations, possibly caused by underprovisioning, as well as bandwidth theft attacks. To bound the monitoring overhead, a router only verifies service level agreement (SLA) parameters such as delay, loss, and throughput when anomalies are detected. The marking component of the edge router uses TCP flow characteristics to protect 'fragile' flows. Edge routers may also regulate unresponsive flows, and may propagate congestion information to upstream domains. Simulation results indicate that this design increases application-level throughput of data applications such as large FTP transfers; achieves low packet delays and response times for Telnet and WWW traffic; and detects bandwidth theft attacks and service violations. Copyright © 2004 John Wiley & Sons, Ltd. [source]

Idiot's Bayes – Not So Stupid After All?
INTERNATIONAL STATISTICAL REVIEW, Issue 3 2001
David J. Hand
Folklore has it that a very simple supervised classification rule, based on the typically false assumption that the predictor variables are independent, can be highly effective, and often more effective than sophisticated rules. We examine the evidence for this, both empirical, as observed in real data applications, and theoretical, summarising explanations for why this simple rule might be effective. Résumé: Folklore has it that a very simple rule assuming independence of the predictor variables, an assumption false in most cases, can be very effective in assigning classes to a set of objects, often more effective than a more sophisticated method. We examine the empirical and theoretical evidence, that is, the reasons why this simple rule can work well. [source]
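Hand's "Idiot's Bayes" question is easy to poke at numerically: even when the independence assumption is clearly violated, naive Bayes often classifies nearly as well as more flexible models. The sketch below compares Gaussian naive Bayes with logistic regression on simulated equicorrelated predictors; the data-generating settings are arbitrary and the comparison is illustrative rather than a reproduction of the evidence surveyed in the paper.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(3)
n, d, rho = 2000, 6, 0.6

# Correlated Gaussian predictors (equicorrelation rho) with class means shifted apart:
# the independence assumption of naive Bayes is deliberately violated.
cov = rho * np.ones((d, d)) + (1 - rho) * np.eye(d)
y = rng.binomial(1, 0.5, size=n)
X = rng.multivariate_normal(np.zeros(d), cov, size=n) + 0.8 * y[:, None]

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

nb = GaussianNB().fit(X_tr, y_tr)
lr = LogisticRegression().fit(X_tr, y_tr)

print("naive Bayes accuracy        :", round(accuracy_score(y_te, nb.predict(X_te)), 3))
print("logistic regression accuracy:", round(accuracy_score(y_te, lr.predict(X_te)), 3))
```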
An Empirically Based Method of Q-Matrix Validation for the DINA Model: Development and Applications
JOURNAL OF EDUCATIONAL MEASUREMENT, Issue 4 2008
Jimmy De La Torre
Most model fit analyses in cognitive diagnosis assume that a Q-matrix is correct after it has been constructed, without verifying its appropriateness. Consequently, any model misfit attributable to the Q-matrix cannot be addressed and remedied. To address this concern, this paper proposes an empirically based method of validating a Q-matrix used in conjunction with the DINA model. The proposed method can be implemented with other considerations, such as substantive information about the items or expert knowledge about the domain, to produce a more integrative framework of Q-matrix validation. The paper presents the theoretical foundation for the proposed method, develops an algorithm for its practical implementation, and provides real and simulated data applications to examine its viability. Relevant issues regarding the implementation of the method are discussed. [source]

Skills Diagnosis Using IRT-Based Latent Class Models
JOURNAL OF EDUCATIONAL MEASUREMENT, Issue 4 2007
Louis A. Roussos
This article describes a latent trait approach to skills diagnosis based on a particular variety of latent class models that employ item response functions (IRFs) as in typical item response theory (IRT) models. To enable and encourage comparisons with other approaches, this description is provided in terms of the main components of any psychometric approach: the ability model and the IRF structure; a review of research on estimation, model checking, reliability, validity, equating, and scoring; and a brief review of real data applications. In this manner the article demonstrates that this approach to skills diagnosis has built a strong initial foundation of research and resources available to potential users. The outlook for future research and applications is discussed, with special emphasis on a call for pilot studies and concomitant increased validity research. [source]

Forecasting with panel data
JOURNAL OF FORECASTING, Issue 2 2008
Badi H. Baltagi
This paper gives a brief survey of forecasting with panel data. It begins with a simple error component regression model and surveys the best linear unbiased prediction under various assumptions on the disturbance term. This includes various ARMA models as well as spatial autoregressive models. The paper also surveys how these forecasts have been used in panel data applications, running horse races between heterogeneous and homogeneous panel data models using out-of-sample forecasts. Copyright © 2008 John Wiley & Sons, Ltd. [source]

Carrying out an optimal experiment
ACTA CRYSTALLOGRAPHICA SECTION D, Issue 4 2010
Zbigniew Dauter
Diffraction data collection is the last experimental stage in structural crystallography. It has several technical and theoretical aspects, and a compromise usually has to be found between various parameters in order to achieve optimal data quality. The influence and importance of various experimental parameters and their consequences are discussed in the context of different data applications, such as molecular replacement, anomalous phasing, high-resolution refinement or searching for ligands. [source]

User-level QoS and traffic engineering for 3G wireless 1xEV-DO systems
BELL LABS TECHNICAL JOURNAL, Issue 2 2003
Simon C. Borst
Third-generation (3G) wireless systems such as 3G1X, 1xEV-DO, and 1xEV-DV provide support for a variety of high-speed data applications. The success of these services critically relies on the capability to ensure an adequate quality of service (QoS) experience to users at an affordable price. With wireless bandwidth at a premium, traffic engineering and network planning play a vital role in addressing these challenges. We present models and techniques that we have developed for quantifying the QoS perception of 1xEV-DO users generating file transfer protocol (FTP) or Web browsing sessions. We show how user-level QoS measures may be evaluated by means of a Processor-Sharing model that explicitly accounts for the throughput gains from multi-user scheduling. The model provides simple analytical formulas for key performance metrics such as response times, blocking probabilities, and throughput. Analytical models are especially useful for network deployment and in-service tuning purposes due to the intrinsic difficulties associated with simulation-based optimization approaches. © 2003 Lucent Technologies Inc. [source]
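The Borst abstract obtains user-level QoS metrics from a Processor-Sharing model with closed-form expressions for response times, blocking probabilities, and throughput. A minimal sketch in the same spirit is shown below for an M/G/1/K processor-sharing queue, whose queue-length distribution depends on the offered load only; the capacity, flow-size, and admission-limit values are invented, and the sketch omits the multi-user scheduling gain that the paper folds into the model.

```python
import numpy as np

def ps_queue_metrics(lam, mean_size, capacity_bps, K):
    """Performance of an M/G/1/K processor-sharing queue.

    lam          : flow arrival rate (flows/s)
    mean_size    : mean flow size (bits)
    capacity_bps : sector/link capacity (bits/s)
    K            : maximum number of concurrent flows admitted
    The queue-length distribution is geometric in the load rho and is
    insensitive to the flow-size distribution beyond its mean.
    """
    rho = lam * mean_size / capacity_bps
    n = np.arange(K + 1)
    pi = rho ** n
    pi = pi / pi.sum()                      # stationary distribution P(N = n)
    blocking = pi[K]                        # arriving flow finds K flows in progress
    mean_n = np.sum(n * pi)
    accepted = lam * (1.0 - blocking)
    mean_response = mean_n / accepted       # Little's law applied to accepted flows
    throughput = mean_size / mean_response  # per-flow throughput actually received
    return rho, blocking, mean_response, throughput

rho, b, t, thr = ps_queue_metrics(lam=2.0, mean_size=1e6,
                                  capacity_bps=2.4e6, K=16)
print(f"load rho            : {rho:.2f}")
print(f"blocking probability: {b:.4f}")
print(f"mean response time  : {t:.2f} s")
print(f"mean flow throughput: {thr/1e3:.0f} kbit/s")
```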
Evolution of UMTS toward high-speed downlink packet access
BELL LABS TECHNICAL JOURNAL, Issue 3 2002
Arnab Das
An expanded effort is under way to support the evolution of the Universal Mobile Telecommunications System (UMTS) standard to meet the rapidly developing needs associated with wireless data applications. A new shared channel, the high-speed downlink shared channel (HS-DSCH), provides support to packet-switched high-speed data users. A number of performance-enhancing technologies are included in the high-speed downlink packet access (HSDPA) system to ensure high peak and average packet data rates while supporting circuit-switched voice and packet data on the same carrier. Lucent Technologies took a pivotal role in specifying many of these techniques, including adaptive modulation and coding (AMC), hybrid automatic repeat request (HARQ), and fat-pipe scheduling. In this paper, we provide system-level simulation results to indicate the achievable performance and capacity with these advanced technologies. We also discuss the HSDPA protocol architecture along with the uplink and downlink control channel design and performance. We conclude with a discussion of potential enhancements for the future. © 2003 Lucent Technologies Inc. [source]

Evolution of the reverse link of CDMA-based systems to support high-speed data
BELL LABS TECHNICAL JOURNAL, Issue 3 2002
Nandu Gopalakrishnan
Development of an upcoming release of the CDMA2000 family of standards is expected to focus on enhancing the reverse link (RL) operation to support high-speed packet data applications. The challenge is to design a system that yields substantial throughput gain while causing only minimal perturbations to the existing standard. We are proposing a system that evolves features already present in the CDMA2000 Release B and IS-856 (1xEV-DO) standards and reuses concepts and capabilities that have been introduced for high-speed packet data support on the forward link (FL) in Release C of the CDMA2000 standard. The RL of Release C of the CDMA2000 standard supports a relatively slow scheduled operation of this link using signaling messages. Scheduling with shorter latencies can be achieved by moving this functionality to the physical layer. Concurrently, both the FL and RL channel conditions may be tracked, and users may be scheduled based on this knowledge. To further manage the power and bandwidth cost on the FL of scheduling users' transmissions on the RL, the mobile station (MS) is permitted to operate in either a scheduled mode or an autonomous mode. A capability is provided for the MS to switch the mode of operation. The performance impact of, and gain from, some of the system features is characterized through simulation results. © 2003 Lucent Technologies Inc. [source]
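Both Bell Labs abstracts above revolve around channel-aware scheduling of a shared high-speed data channel ("fat-pipe" scheduling on the HSDPA downlink, physical-layer scheduling on the CDMA2000 reverse link). A common concrete example of such a scheduler is proportional fair, sketched below on simulated per-slot feasible rates; it is a generic illustration, not the specific scheduler designs evaluated in either paper.

```python
import numpy as np

rng = np.random.default_rng(4)
n_users, n_slots, tc = 8, 5000, 1000.0   # tc: throughput-averaging window (slots)

# Simulated per-slot feasible rates under fading: users have different mean channels.
mean_rate = np.linspace(0.5, 2.5, n_users)            # Mbit/s, illustrative
rates = rng.exponential(mean_rate, size=(n_slots, n_users))

avg = np.full(n_users, 1e-3)          # smoothed throughput estimate per user
served = np.zeros(n_users)            # total rate delivered per user
slots_won = np.zeros(n_users, dtype=int)

for t in range(n_slots):
    r = rates[t]
    user = int(np.argmax(r / avg))    # proportional fair metric: rate / average throughput
    served[user] += r[user]
    slots_won[user] += 1
    # Exponentially smoothed throughput update (only the scheduled user receives rate).
    delivered = np.zeros(n_users)
    delivered[user] = r[user]
    avg = (1.0 - 1.0 / tc) * avg + (1.0 / tc) * delivered

print("mean channel rate (Mbit/s)  :", np.round(mean_rate, 2))
print("share of slots won          :", np.round(slots_won / n_slots, 2))
print("throughput (Mbit/s per slot):", np.round(served / n_slots, 2))
```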
High-Dimensional Cox Models: The Choice of Penalty as Part of the Model Building Process
BIOMETRICAL JOURNAL, Issue 1 2010
Axel Benner
The Cox proportional hazards regression model is the most popular approach to model covariate information for survival times. In this context, the development of high-dimensional models, where the number of covariates is much larger than the number of observations, is an ongoing challenge. A practicable approach is to use ridge-penalized Cox regression in such situations. Besides focusing on finding the best prediction rule, one is often interested in determining a subset of covariates that are the most important ones for prognosis. This could be a gene set in the biostatistical analysis of microarray data. Covariate selection can then, for example, be done by L1-penalized Cox regression using the lasso (Tibshirani (1997), Statistics in Medicine 16, 385–395). Several approaches beyond the lasso that incorporate covariate selection have been developed in recent years. This includes modifications of the lasso as well as nonconvex variants such as smoothly clipped absolute deviation (SCAD) (Fan and Li (2001), Journal of the American Statistical Association 96, 1348–1360; Fan and Li (2002), The Annals of Statistics 30, 74–99). The purpose of this article is to implement them practically into the model building process when analyzing high-dimensional data with the Cox proportional hazards model. To evaluate penalized regression models beyond the lasso, we included SCAD variants and the adaptive lasso (Zou (2006), Journal of the American Statistical Association 101, 1418–1429). We compare them with "standard" applications such as ridge regression, the lasso, and the elastic net. Predictive accuracy, features of variable selection, and estimation bias will be studied to assess the practical use of these methods. We observed that the performance of SCAD and the adaptive lasso is highly dependent on nontrivial preselection procedures. A practical solution to this problem does not yet exist. Since there is a high risk of missing relevant covariates when SCAD or the adaptive lasso is applied after an inappropriate initial selection step, we recommend staying with the lasso or the elastic net in actual data applications. But with respect to the promising results for truly sparse models, we see some advantage of SCAD and the adaptive lasso if better preselection procedures become available. This requires further methodological research. [source]
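The Benner abstract compares ridge, lasso, elastic net, SCAD, and adaptive-lasso penalties for Cox models with many covariates. As a small, self-contained illustration of just the ridge case, the sketch below maximizes an L2-penalized Cox partial log-likelihood by plain gradient ascent on simulated data without tied event times; in practice one would use an established penalized-Cox implementation, and the simulation settings here are arbitrary.

```python
import numpy as np

def ridge_cox_fit(X, time, event, lam=1.0, lr=0.05, n_iter=3000):
    """Ridge-penalized Cox regression via gradient ascent on the partial likelihood.

    Assumes no tied event times (Breslow-type tie handling is omitted for brevity).
    """
    order = np.argsort(-time)            # sort by decreasing time: risk sets are prefixes
    X, event = X[order], event[order]
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(n_iter):
        eta = X @ beta
        w = np.exp(eta - eta.max())                # stabilised relative risks
        cum_w = np.cumsum(w)                       # sum of w over each subject's risk set
        cum_wx = np.cumsum(w[:, None] * X, axis=0)
        risk_mean = cum_wx / cum_w[:, None]        # weighted mean of X over each risk set
        grad = (X - risk_mean)[event == 1].sum(axis=0) - lam * beta
        beta += lr * grad / max(event.sum(), 1)
    return beta

# Simulated survival data with many noise covariates (illustrative sizes and effects).
rng = np.random.default_rng(5)
n, p = 150, 40
beta_true = np.zeros(p)
beta_true[:3] = [1.0, -1.0, 0.5]                  # only three informative covariates
X = rng.standard_normal((n, p))
t_event = rng.exponential(np.exp(-X @ beta_true))
t_cens = rng.exponential(2.0, size=n)
time = np.minimum(t_event, t_cens)
event = (t_event <= t_cens).astype(int)

beta_hat = ridge_cox_fit(X, time, event, lam=5.0)
print("largest |coefficients| at indices:", np.argsort(-np.abs(beta_hat))[:5])
print("estimates for the 3 informative covariates:", np.round(beta_hat[:3], 2))
```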
A Latent Model to Detect Multiple Clusters of Varying Sizes
BIOMETRICS, Issue 4 2009
Minge Xie
This article develops a latent model and likelihood-based inference to detect temporal clustering of events. The model mimics typical processes generating the observed data. We apply model selection techniques to determine the number of clusters, and develop likelihood inference and a Monte Carlo expectation-maximization algorithm to estimate model parameters, detect clusters, and identify cluster locations. Our method differs from the classical scan statistic in that we can simultaneously detect multiple clusters of varying sizes. We illustrate the methodology with two real data applications and evaluate its efficiency through simulation studies. [source]

Assessment of Agreement under Nonstandard Conditions Using Regression Models for Mean and Variance
BIOMETRICS, Issue 1 2006
Pankaj K. Choudhary
The total deviation index of Lin (2000, Statistics in Medicine 19, 255–270) and Lin et al. (2002, Journal of the American Statistical Association 97, 257–270) is an intuitive approach for the assessment of agreement between two methods of measurement. It assumes that the differences of the paired measurements are a random sample from a normal distribution and works essentially by constructing a probability content tolerance interval for this distribution. We generalize this approach to the case when differences may not have identical distributions, a common scenario in applications. In particular, we use the regression approach to model the mean and the variance of differences as functions of observed values of the average of the paired measurements, and describe two methods based on the asymptotic theory of maximum likelihood estimators for constructing a simultaneous probability content tolerance band. The first method uses the bootstrap to approximate the critical point and the second method is an analytical approximation. Simulation shows that the first method works well for sample sizes as small as 30 and the second method is preferable for large sample sizes. We also extend the methodology to the case when the mean function is modeled using penalized splines via a mixed model representation. Two real data applications are presented. [source]
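The Choudhary abstract models the mean and variance of the paired differences as functions of the average measurement and then constructs a probability-content tolerance band. The sketch below fits such a heteroscedastic normal model (linear mean, log-linear variance) by maximum likelihood and reports a pointwise 90% limit on |difference| at a few magnitudes; the simultaneous band and the bootstrap/analytical calibration described in the abstract are not reproduced, and the simulated data are purely illustrative.

```python
import numpy as np
from scipy.optimize import minimize, brentq
from scipy.stats import norm

rng = np.random.default_rng(6)

# Simulated paired measurements whose disagreement grows with magnitude.
n = 200
a = rng.uniform(10, 100, n)                                       # average of the two methods
d = 0.5 + 0.02 * a + (0.5 + 0.01 * a) * rng.standard_normal(n)    # differences

# Model: D | a ~ Normal(b0 + b1*a, exp(g0 + g1*a)) -- mean and log-variance linear in a.
def negloglik(par):
    b0, b1, g0, g1 = par
    mu = b0 + b1 * a
    var = np.exp(g0 + g1 * a)
    return 0.5 * np.sum(np.log(2 * np.pi * var) + (d - mu) ** 2 / var)

start = np.array([d.mean(), 0.0, np.log(d.var()), 0.0])
fit = minimize(negloglik, x0=start, method="Nelder-Mead",
               options={"maxiter": 5000, "xatol": 1e-6, "fatol": 1e-6})
b0, b1, g0, g1 = fit.x

def tdi(a0, p=0.90):
    """p-th quantile of |D| at average value a0 under the fitted model."""
    mu = b0 + b1 * a0
    sd = np.exp(0.5 * (g0 + g1 * a0))
    f = lambda k: norm.cdf((k - mu) / sd) - norm.cdf((-k - mu) / sd) - p
    return brentq(f, 1e-9, abs(mu) + 10 * sd)

for a0 in (20, 50, 90):
    print(f"estimated 90% limit on |difference| at average={a0}: {tdi(a0):.2f}")
```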