
Variable Selection (variable + selection)

Distribution by Scientific Domains

Mathematics and Statistics	56%
Chemistry	22%
Life Sciences	12%
Medical Sciences	2%
3 Other Domains	8%

Kinds of Variable Selection

bayesian variable selection

Terms modified by Variable Selection

variable selection method

variable selection methods

Selected Abstracts

A Combinatorial Approach to the Variable Selection in Multiple Linear Regression: Analysis of Selwood et al. Data Set, A Case Study

MOLECULAR INFORMATICS, Issue 6 2003
Abstract A combinatorial protocol (CP) is introduced here and interfaced with multiple linear regression (MLR) for variable selection. The efficiency of CP-MLR is primarily based on restricting the entry of correlated variables at the model development stage. It has been used for the analysis of the Selwood et al. data set [16], and the obtained models are compared with those reported from the GFA [8] and MUSEUM [9] approaches. For this data set CP-MLR could identify three highly independent models (27, 28 and 31) with Q2 values in the range 0.632–0.518. Also, these models are divergent and unique. Even though the present study shares no models with the GFA [8] and MUSEUM [9] results, there are several descriptors common to all these studies, including the present one. A simulation is also carried out on the same data set to explain the model formation in CP-MLR. The results demonstrate that the proposed method should be able to offer solutions to data sets with 50 to 60 descriptors in a reasonable time frame. By carefully selecting the inter-parameter correlation cutoff values in CP-MLR one can identify divergent models and handle data sets larger than the present one without involving excessive computer time. [source]
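
The correlation restriction described in this abstract can be illustrated with a minimal sketch. It is not the authors' CP-MLR code: the function names, the toy descriptors and the cutoff of 0.5 are assumptions made for illustration. It enumerates descriptor subsets and keeps only those in which no pair is correlated above the cutoff, which is the step that would precede any regression fit.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (neither may be constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def admissible_subsets(descriptors, size, cutoff):
    """Yield descriptor-name tuples of `size` whose pairwise |r| never exceeds `cutoff`."""
    names = sorted(descriptors)
    for subset in combinations(names, size):
        pairs = combinations(subset, 2)
        if all(abs(pearson(descriptors[a], descriptors[b])) <= cutoff for a, b in pairs):
            yield subset


# Hypothetical toy descriptors: d2 is nearly collinear with d1.
descriptors = {
    "d1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "d2": [2.1, 3.9, 6.2, 8.0, 9.9],
    "d3": [5.0, 1.0, 4.0, 2.0, 3.0],
}
subsets = list(admissible_subsets(descriptors, size=2, cutoff=0.5))
print(subsets)  # the (d1, d2) pair is filtered out before any model is fitted
```

Lowering the cutoff prunes more combinations, which is one plausible reading of how the cutoff trades model diversity against computing time.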

Joint Identification of Multiple Genetic Variants via Elastic-Net Variable Selection in a Genome-Wide Association Analysis

ANNALS OF HUMAN GENETICS, Issue 5 2010
Seoae Cho
Summary Unraveling the genetic background of common complex traits is a major goal in modern genetics. In recent years, genome-wide association (GWA) studies have been conducted with large-scale data sets of genetic variants. Most of those studies have relied on single-marker approaches that identify single genetic factors individually and can be limited in considering fully the joint effects of multiple genetic factors on complex traits. Joint identification of multiple genetic factors would be more powerful and would provide better prediction on complex traits since it utilizes combined information across variants. Here we propose a multi-stage approach for GWA analysis: (1) prescreening, (2) joint identification of putative SNPs based on elastic-net variable selection, and (3) empirical replication using bootstrap samples. Our approach enables an efficient joint search for genetic associations in GWA analysis. The suggested empirical replication method can be beneficial in GWA studies because one can avoid a costly, independent replication study while eliminating false-positive associations and focusing on a smaller number of replicable variants. We applied the proposed approach to a GWA analysis, and jointly identified 129 genetic variants having an association with adult height in a Korean population. [source]
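
The third stage, empirical replication with bootstrap samples, can be sketched as counting how often each variant is selected across resamples. This is only an illustration under stated assumptions: `select` is a hypothetical callback standing in for the prescreening and elastic-net steps, and the toy selector below is not an elastic net.

```python
import random
from collections import Counter


def bootstrap_selection_frequency(rows, select, n_boot, seed=0):
    """Fraction of bootstrap resamples in which each variable is selected.

    `rows` is a list of observations; `select` maps a resampled list of
    observations to an iterable of selected variable names (hypothetical API).
    """
    rng = random.Random(seed)
    counts = Counter()
    n = len(rows)
    for _ in range(n_boot):
        # Draw n observations with replacement.
        sample = [rows[rng.randrange(n)] for _ in range(n)]
        counts.update(set(select(sample)))
    return {name: c / n_boot for name, c in counts.items()}


def toy_selector(sample):
    """Placeholder selector: keep variables whose sample mean exceeds 0.5."""
    names = sample[0].keys()
    return [v for v in names if sum(r[v] for r in sample) / len(sample) > 0.5]


rows = [{"snp1": 1, "snp2": 0} for _ in range(20)]
freq = bootstrap_selection_frequency(rows, toy_selector, n_boot=50)
print(freq)
```

Variants with a selection frequency near 1 would be the replicable ones in this scheme; the threshold for "replicable" is a design choice the abstract does not specify.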

Pairwise Variable Selection for High-Dimensional Model-Based Clustering

BIOMETRICS, Issue 3 2010
Jian Guo
Summary Variable selection for clustering is an important and challenging problem in high-dimensional data analysis. Existing variable selection methods for model-based clustering select informative variables in a "one-in-all-out" manner; that is, a variable is selected if at least one pair of clusters is separable by this variable and removed if it cannot separate any of the clusters. In many applications, however, it is of interest to further establish exactly which clusters are separable by each informative variable. To address this question, we propose a pairwise variable selection method for high-dimensional model-based clustering. The method is based on a new pairwise penalty. Results on simulated and real data show that the new method performs better than alternative approaches that use L1 and L∞ penalties and offers better interpretation. [source]

Variable Selection for Semiparametric Mixed Models in Longitudinal Studies

BIOMETRICS, Issue 1 2010
Xiao Ni
Summary We propose a double-penalized likelihood approach for simultaneous model selection and estimation in semiparametric mixed models for longitudinal data. Two types of penalties are jointly imposed on the ordinary log-likelihood: the roughness penalty on the nonparametric baseline function and a nonconcave shrinkage penalty on linear coefficients to achieve model sparsity. Compared to existing estimating-equation-based approaches, our procedure provides valid inference for data missing at random, and will be more efficient if the specified model is correct. Another advantage of the new procedure is its easy computation for both regression components and variance parameters. We show that the double-penalized problem can be conveniently reformulated into a linear mixed model framework, so that existing software can be directly used to implement our method. For the purpose of model inference, we derive both frequentist and Bayesian variance estimation for estimated parametric and nonparametric components. Simulation is used to evaluate and compare the performance of our method to the existing ones. We then apply the new method to a real data set from a lactation study. [source]

Variable Selection in the Cox Regression Model with Covariates Missing at Random

BIOMETRICS, Issue 1 2010
Ramon I. Garcia
Summary We consider variable selection in the Cox regression model (Cox, 1975, Biometrika 62, 269–276) with covariates missing at random. We investigate the smoothly clipped absolute deviation penalty and adaptive least absolute shrinkage and selection operator (LASSO) penalty, and propose a unified model selection and estimation procedure. A computationally attractive algorithm is developed, which simultaneously optimizes the penalized likelihood function and penalty parameters. We also optimize a model selection criterion, called the ICQ statistic (Ibrahim, Zhu, and Tang, 2008, Journal of the American Statistical Association 103, 1648–1658), to estimate the penalty parameters and show that it consistently selects all important covariates. Simulations are performed to evaluate the finite sample performance of the penalty estimates. Also, two lung cancer data sets are analyzed to demonstrate the proposed methodology. [source]
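
For readers unfamiliar with the two penalties named here, the sketch below writes them out in their commonly cited textbook forms (SCAD with the conventional default a = 3.7, and the adaptive LASSO with weights from an initial estimate). It is not the article's estimation algorithm; the function names and arguments are illustrative assumptions.

```python
def scad_penalty(theta, lam, a=3.7):
    """Smoothly clipped absolute deviation (SCAD) penalty for one coefficient.

    Linear like the LASSO near zero, quadratic in between, and constant
    beyond a * lam, so large coefficients are not shrunk further.
    """
    t = abs(theta)
    if t <= lam:
        return lam * t
    if t <= a * lam:
        return (2 * a * lam * t - t * t - lam * lam) / (2 * (a - 1))
    return (a + 1) * lam * lam / 2


def adaptive_lasso_penalty(beta, initial, lam, gamma=1.0):
    """Adaptive LASSO penalty: lam * sum_j |beta_j| / |initial_j| ** gamma.

    `initial` is a preliminary estimate with no zero entries (assumption).
    """
    return lam * sum(abs(b) / abs(b0) ** gamma for b, b0 in zip(beta, initial))


for value in (0.5, 2.0, 10.0):
    print(value, scad_penalty(value, lam=1.0))
```

The flat region of SCAD for large coefficients is what distinguishes it from the plain L1 penalty, which keeps growing linearly.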

Fast FSR Variable Selection with Applications to Clinical Trials

BIOMETRICS, Issue 3 2009
Dennis D. Boos
Summary A new version of the false selection rate variable selection method of Wu, Boos, and Stefanski (2007, Journal of the American Statistical Association 102, 235–243) is developed that requires no simulation. This version allows the tuning parameter in forward selection to be estimated simply by hand calculation from a summary table of output even for situations where the number of explanatory variables is larger than the sample size. Because of the computational simplicity, the method can be used in permutation tests and inside bagging loops for improved prediction. Illustration is provided in clinical trials for linear regression, logistic regression, and Cox proportional hazards regression. [source]

Variable Selection for Clustering with Gaussian Mixture Models

BIOMETRICS, Issue 3 2009
Cathy Maugis
Summary This article is concerned with variable selection for cluster analysis. The problem is regarded as a model selection problem in the model-based cluster analysis context. A model generalizing the model of Raftery and Dean (2006, Journal of the American Statistical Association 101, 168–178) is proposed to specify the role of each variable. This model does not need any prior assumptions about the linear link between the selected and discarded variables. Models are compared with the Bayesian information criterion. Variable role is obtained through an algorithm embedding two backward stepwise algorithms for variable selection for clustering and linear regression. The model identifiability is established and the consistency of the resulting criterion is proved under regularity conditions. Numerical experiments on simulated datasets and a genomic application highlight the interest of the procedure. [source]

Variable Selection and Model Choice in Geoadditive Regression Models

BIOMETRICS, Issue 2 2009
Thomas Kneib
Summary Model choice and variable selection are issues of major concern in practical regression analyses, arising in many biometric applications such as habitat suitability analyses, where the aim is to identify the influence of potentially many environmental conditions on certain species. We describe regression models for breeding bird communities that facilitate both model choice and variable selection, by a boosting algorithm that works within a class of geoadditive regression models comprising spatial effects, nonparametric effects of continuous covariates, interaction surfaces, and varying coefficients. The major modeling components are penalized splines and their bivariate tensor product extensions. All smooth model terms are represented as the sum of a parametric component and a smooth component with one degree of freedom to obtain a fair comparison between the model terms. A generic representation of the geoadditive model allows us to devise a general boosting algorithm that automatically performs model choice and variable selection. [source]

Variable Selection for Model-Based High-Dimensional Clustering and Its Application to Microarray Data

BIOMETRICS, Issue 2 2008
Sijian Wang
Summary Variable selection in high-dimensional clustering analysis is an important yet challenging problem. In this article, we propose two methods that simultaneously separate data points into similar clusters and select informative variables that contribute to the clustering. Our methods are in the framework of penalized model-based clustering. Unlike the classical L1-norm penalization, the penalty terms that we propose make use of the fact that parameters belonging to one variable should be treated as a natural "group." Numerical results indicate that the two new methods tend to remove noninformative variables more effectively and provide better clustering results than the L1-norm approach. [source]

Generalized Additive Modeling with Implicit Variable Selection by Likelihood-Based Boosting

BIOMETRICS, Issue 4 2006
Gerhard Tutz
Summary The use of generalized additive models in statistical data analysis suffers from the restriction to few explanatory variables and the problems of selection of smoothing parameters. Generalized additive model boosting circumvents these problems by means of stagewise fitting of weak learners. A fitting procedure is derived which works for all simple exponential family distributions, including binomial, Poisson, and normal response variables. The procedure combines the selection of variables and the determination of the appropriate amount of smoothing. Penalized regression splines and the newly introduced penalized stumps are considered as weak learners. Estimates of standard deviations and stopping criteria, which are notorious problems in iterative procedures, are based on an approximate hat matrix. The method is shown to be a strong competitor to common procedures for the fitting of generalized additive models. In particular, in high-dimensional settings with many nuisance predictor variables it performs very well. [source]

Variable Selection for Logistic Regression Using a Prediction-Focused Information Criterion

BIOMETRICS, Issue 4 2006
Gerda Claeskens
Summary In biostatistical practice, it is common to use information criteria as a guide for model selection. We propose new versions of the focused information criterion (FIC) for variable selection in logistic regression. The FIC gives, depending on the quantity to be estimated, possibly different sets of selected variables. The standard version of the FIC measures the mean squared error of the estimator of the quantity of interest in the selected model. In this article, we propose more general versions of the FIC, allowing other risk measures such as the one based on Lp error. When prediction of an event is important, as is often the case in medical applications, we construct an FIC using the error rate as a natural risk measure. The advantages of using an information criterion which depends on both the quantity of interest and the selected risk measure are illustrated by means of a simulation study and application to a study on diabetic retinopathy. [source]

Imputation and Variable Selection in Linear Regression Models with Missing Covariates

BIOMETRICS, Issue 2 2005
Xiaowei Yang
Summary Across multiply imputed data sets, variable selection methods such as stepwise regression and other criterion-based strategies that include or exclude particular variables typically result in models with different selected predictors, thus presenting a problem for combining the results from separate complete-data analyses. Here, drawing on a Bayesian framework, we propose two alternative strategies to address the problem of choosing among linear regression models when there are missing covariates. One approach, which we call "impute, then select" (ITS) involves initially performing multiple imputation and then applying Bayesian variable selection to the multiply imputed data sets. A second strategy is to conduct Bayesian variable selection and missing data imputation simultaneously within one Gibbs sampling process, which we call "simultaneously impute and select" (SIAS). The methods are implemented and evaluated using the Bayesian procedure known as stochastic search variable selection for multivariate normal data sets, but both strategies offer general frameworks within which different Bayesian variable selection algorithms could be used for other types of data sets. A study of mental health services utilization among children in foster care programs is used to illustrate the techniques. Simulation studies show that both ITS and SIAS outperform complete-case analysis with stepwise variable selection and that SIAS slightly outperforms ITS. [source]

Variable Selection for Marginal Longitudinal Generalized Linear Models

BIOMETRICS, Issue 2 2005
Eva Cantoni
Summary Variable selection is an essential part of any statistical analysis and yet has been somewhat neglected in the context of longitudinal data analysis. In this article, we propose a generalized version of Mallows's Cp (GCp) suitable for use with both parametric and nonparametric models. GCp provides an estimate of a measure of a model's adequacy for prediction. We examine its performance with popular marginal longitudinal models (fitted using GEE) and contrast results with what is typically done in practice: variable selection based on Wald-type or score-type tests. An application to real data further demonstrates the merits of our approach while at the same time emphasizing some important robust features inherent to GCp. [source]
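
For context, the classical Mallows's Cp that GCp generalizes can be written in its standard textbook form for a linear model. This is not the GCp formula of the article, which the abstract does not give. Here RSS_p is the residual sum of squares of a candidate model with p fitted parameters, \hat{\sigma}^2 is an error variance estimate (usually from the full model), and n is the sample size.

```latex
C_p = \frac{\mathrm{RSS}_p}{\hat{\sigma}^2} - n + 2p
```

Candidate models with small C_p, close to p, are preferred, since for a model with negligible bias the expected value of C_p is approximately p.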

Bayesian Variable Selection in Multinomial Probit Models to Identify Molecular Signatures of Disease Stage

BIOMETRICS, Issue 3 2004
Naijun Sha
Summary Here we focus on discrimination problems where the number of predictors substantially exceeds the sample size and we propose a Bayesian variable selection approach to multinomial probit models. Our method makes use of mixture priors and Markov chain Monte Carlo techniques to select sets of variables that differ among the classes. We apply our methodology to a problem in functional genomics using gene expression profiling data. The aim of the analysis is to identify molecular signatures that characterize two different stages of rheumatoid arthritis. [source]

Variable Selection in High-Dimensional Multivariate Binary Data with Application to the Analysis of Microbial Community DNA Fingerprints

BIOMETRICS, Issue 2 2002
J. D. Wilbur
Summary In order to understand the relevance of microbial communities to crop productivity, the identification and characterization of the rhizosphere soil microbial community is necessary. Characteristic profiles of the microbial communities are obtained by denaturing gradient gel electrophoresis (DGGE) of polymerase chain reaction (PCR) amplified 16S rDNA from soil extracted DNA. These characteristic profiles, commonly called community DNA fingerprints, can be represented in the form of high-dimensional binary vectors. We address the problem of modeling and variable selection in high-dimensional multivariate binary data and present an application of our methodology in the context of a controlled agricultural experiment. [source]

Variable selection in random calibration of near-infrared instruments: ridge regression and partial least squares regression settings

JOURNAL OF CHEMOMETRICS, Issue 3 2003
Arief Gusnanto
Abstract Standard methods for calibration of near-infrared instruments, such as partial least-squares (PLS) and ridge regression (RR), typically use the full set of wavelengths in the model. In this paper we investigate the effect of variable (wavelength) selection for these two methods on the model prediction. For RR the selection is optimized with respect to the ridge parameter, the number of variables and the configuration of the variables in the model. A fast iterative computational algorithm is developed for the purpose of this optimization. For PLS the selection is optimized with respect to the number of components, the number of variables and the configuration of the variables. We use three real data sets in this study: processed milk from the market, milk from a dairy farm and milk from the production line of a milk processing factory. The quantity of interest is the concentration of fat in the milk. The observations are randomly split into estimation and validation sets. Optimization is based on the mean square prediction error computed on the validation set. The results indicate that the wavelength selection will not always give better prediction than using all of the available wavelengths. Investigation of the information in the spectra is necessary to determine whether all of them are relevant to the objective of the model. Copyright © 2003 John Wiley & Sons, Ltd. [source]

A systematic evaluation of the benefits and hazards of variable selection in latent variable regression. Part I. Search algorithm, simulations, theory

JOURNAL OF CHEMOMETRICS, Issue 7 2002
Abstract Variable selection is an extensively studied problem in chemometrics and in the area of quantitative structure–activity relationships (QSARs). Many search algorithms have been compared so far. Less well studied is the influence of different objective functions on the prediction quality of the selected models. This paper investigates the performance of different cross-validation techniques as objective function for variable selection in latent variable regression. The results are compared in terms of predictive ability, model size (number of variables) and model complexity (number of latent variables). It will be shown that leave-multiple-out cross-validation with a large percentage of data left out performs best. Since leave-multiple-out cross-validation is computationally expensive, a very efficient tabu search algorithm is introduced to lower the computational burden. The tabu search algorithm needs no user-defined operational parameters and optimizes the variable subset and the number of latent variables simultaneously. Copyright © 2002 John Wiley & Sons, Ltd. [source]
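
Leave-multiple-out cross-validation, as referred to in this abstract, repeatedly holds out a random fraction of the observations. The sketch below only generates the index splits; the function name, the 50% fraction and the number of repeats are illustrative assumptions, and the tabu search and latent variable regression are not reproduced.

```python
import random


def leave_multiple_out_splits(n_samples, left_out_fraction, n_repeats, seed=0):
    """Yield (training_indices, held_out_indices) for repeated random hold-outs."""
    rng = random.Random(seed)
    n_out = max(1, round(n_samples * left_out_fraction))
    indices = list(range(n_samples))
    for _ in range(n_repeats):
        held_out = sorted(rng.sample(indices, n_out))
        held = set(held_out)
        training = [i for i in indices if i not in held]
        yield training, held_out


splits = list(leave_multiple_out_splits(n_samples=10, left_out_fraction=0.5, n_repeats=3))
for training, held_out in splits:
    print(training, held_out)
```

In a variable selection loop, each candidate subset would be scored by its average prediction error over the held-out parts, which is why the cost grows with the number of repeats.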

Variable selection and oversampling in the use of smooth support vector machines for predicting the default risk of companies

JOURNAL OF FORECASTING, Issue 6 2009
Wolfgang Härdle
Abstract In the era of Basel II a powerful tool for bankruptcy prognosis is vital for banks. The tool must be precise but also easily adaptable to the bank's objectives regarding the relation of false acceptances (Type I error) and false rejections (Type II error). We explore the suitability of smooth support vector machines (SSVM), and investigate how important factors such as the selection of appropriate accounting ratios (predictors), length of training period and structure of the training sample influence the precision of prediction. Moreover, we show that oversampling can be employed to control the trade-off between error types, and we compare SSVM with both logistic and discriminant analysis. Finally, we illustrate graphically how different models can be used jointly to support the decision-making process of loan officers. Copyright © 2008 John Wiley & Sons, Ltd. [source]
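
The oversampling idea mentioned in this abstract can be sketched as enlarging one class of the training sample by drawing extra copies with replacement. The abstract does not describe the exact scheme used, so the function, the label coding and the factor below are assumptions for illustration only.

```python
import random


def oversample_class(samples, labels, target_label, factor, seed=0):
    """Return (samples, labels) with the target class enlarged to `factor` times its size.

    Extra members are drawn with replacement from the existing target-class samples.
    """
    rng = random.Random(seed)
    target = [s for s, lab in zip(samples, labels) if lab == target_label]
    n_extra = int(round(len(target) * (factor - 1)))
    extra = [rng.choice(target) for _ in range(n_extra)]
    return list(samples) + extra, list(labels) + [target_label] * n_extra


# Hypothetical training sample: 2 defaulting firms (label 1), 8 solvent firms (label 0).
samples = ["firm%d" % i for i in range(10)]
labels = [1, 1] + [0] * 8
new_samples, new_labels = oversample_class(samples, labels, target_label=1, factor=4)
print(new_labels.count(1), new_labels.count(0))
```

Raising the factor makes a classifier trained on the result more willing to predict the oversampled class, which is one way to shift the balance between the two error types.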

Building neural network models for time series: a statistical approach

JOURNAL OF FORECASTING, Issue 1 2006
Marcelo C. Medeiros
Abstract This paper is concerned with modelling time series by single hidden layer feedforward neural network models. A coherent modelling strategy based on statistical inference is presented. Variable selection is carried out using simple existing techniques. The problem of selecting the number of hidden units is solved by sequentially applying Lagrange multiplier type tests, with the aim of avoiding the estimation of unidentified models. Misspecification tests are derived for evaluating an estimated neural network model. All the tests are entirely based on auxiliary regressions and are easily implemented. A small-sample simulation experiment is carried out to show how the proposed modelling strategy works and how the misspecification tests behave in small samples. Two applications to real time series, one univariate and the other multivariate, are considered as well. Sets of one-step-ahead forecasts are constructed and forecast accuracy is compared with that of other nonlinear models applied to the same series. Copyright © 2006 John Wiley & Sons, Ltd. [source]

Sure independence screening for ultrahigh dimensional feature space

JOURNAL OF THE ROYAL STATISTICAL SOCIETY: SERIES B (STATISTICAL METHODOLOGY), Issue 5 2008
Jianqing Fan
Summary Variable selection plays an important role in high dimensional statistical modelling which nowadays appears in many areas and is key to various scientific discoveries. For problems of large scale or dimensionality p, accuracy of estimation and computational cost are two top concerns. Recently, Candes and Tao have proposed the Dantzig selector using L1-regularization and showed that it achieves the ideal risk up to a logarithmic factor log(p). Their innovative procedure and remarkable result are challenged when the dimensionality is ultrahigh as the factor log(p) can be large and their uniform uncertainty principle can fail. Motivated by these concerns, we introduce the concept of sure screening and propose a sure screening method that is based on correlation learning, called sure independence screening, to reduce dimensionality from high to a moderate scale that is below the sample size. In a fairly general asymptotic framework, correlation learning is shown to have the sure screening property for even exponentially growing dimensionality. As a methodological extension, iterative sure independence screening is also proposed to enhance its finite sample performance. With dimension reduced accurately from high to below sample size, variable selection can be improved on both speed and accuracy, and can then be accomplished by a well-developed method such as smoothly clipped absolute deviation, the Dantzig selector, lasso or adaptive lasso. The connections between these penalized least squares methods are also elucidated. [source]
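
The correlation-learning step behind sure independence screening can be sketched as ranking predictors by the absolute value of their marginal correlation with the response and keeping the top d. This is a minimal illustration, not the authors' implementation: the function names, the toy data and the choice d = 2 are assumptions, and the iterative variant is not shown.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (neither may be constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def screen_by_correlation(columns, y, d):
    """Keep the d predictor names with the largest |marginal correlation| with y."""
    scores = {name: abs(pearson(col, y)) for name, col in columns.items()}
    return sorted(scores, key=scores.get, reverse=True)[:d]


y = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
columns = {
    "x1": [1.1, 1.9, 3.2, 3.8, 5.1, 6.0],   # strongly positively related to y
    "x2": [3.0, 1.0, 4.0, 1.0, 5.0, 2.0],   # weakly related to y
    "x3": [6.0, 5.2, 3.9, 3.1, 2.0, 0.8],   # strongly negatively related to y
}
kept = screen_by_correlation(columns, y, d=2)
print(sorted(kept))
```

After such a screen, d is below the sample size and a penalized method such as the lasso or SCAD can be run on the retained predictors only.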

An Empirical Bayes Method for Estimating Epistatic Effects of Quantitative Trait Loci

BIOMETRICS, Issue 2 2007
Shizhong Xu
Summary The genetic variance of a quantitative trait is often controlled by the segregation of multiple interacting loci. Linear model regression analysis is usually applied to estimating and testing effects of these quantitative trait loci (QTL). Including all the main effects and the effects of interaction (epistatic effects), the dimension of the linear model can be extremely high. Variable selection via stepwise regression or stochastic search variable selection (SSVS) is the common procedure for epistatic effect QTL analysis. These methods are computationally intensive, yet they may not be optimal. The LASSO (least absolute shrinkage and selection operator) method is computationally more efficient than the above methods. As a result, it has been widely used in regression analysis for large models. However, LASSO has never been applied to genetic mapping for epistatic QTL, where the number of model effects is typically many times larger than the sample size. In this study, we developed an empirical Bayes method (E-BAYES) to map epistatic QTL under the mixed model framework. We also tested the feasibility of using LASSO to estimate epistatic effects, examined the fully Bayesian SSVS, and reevaluated the penalized likelihood (PENAL) methods in mapping epistatic QTL. Simulation studies showed that all the above methods performed satisfactorily well. However, E-BAYES appears to outperform all other methods in terms of minimizing the mean-squared error (MSE) with relatively short computing time. Application of the new method to real data was demonstrated using a barley dataset. [source]
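
As background on why the LASSO is computationally cheap, its solution has a closed form in the special case of an orthonormal design: each least-squares coefficient is soft-thresholded. The sketch below shows that operator for the objective (1/2)||y - Xb||^2 + lam * ||b||_1; it is a textbook special case and not the mapping procedure of the article, where the number of effects exceeds the sample size.

```python
def soft_threshold(z, lam):
    """Shrink z toward zero by lam, setting it exactly to zero when |z| <= lam."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


# Hypothetical least-squares estimates of four effects under an orthonormal design.
ols_estimates = [3.0, -0.4, -2.5, 0.9]
lasso_estimates = [soft_threshold(z, lam=1.0) for z in ols_estimates]
print(lasso_estimates)
```

Small effects are set exactly to zero, which is how the penalty performs variable selection and estimation in one step.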

Differences in spatial predictions among species distribution modeling methods vary with species traits and environmental predictors

ECOGRAPHY, Issue 6 2009
Alexandra D. Syphard
Prediction maps produced by species distribution models (SDMs) influence decision-making in resource management or designation of land in conservation planning. Many studies have compared the prediction accuracy of different SDM modeling methods, but few have quantified the similarity among prediction maps. There has also been little systematic exploration of how the relative importance of different predictor variables varies among model types and affects map similarity. Our objective was to expand the evaluation of SDM performance for 45 plant species in southern California to better understand how map predictions vary among model types, and to explain what factors may affect spatial correspondence, including the selection and relative importance of different environmental variables. Four types of models were tested. Correlation among maps was highest between generalized linear models (GLMs) and generalized additive models (GAMs) and lowest between classification trees and GAMs or GLMs. Correlation between Random Forests (RFs) and GAMs was the same as between RFs and classification trees. Spatial correspondence among maps was influenced the most by model prediction accuracy (AUC) and species prevalence; map correspondence was highest when accuracy was high and prevalence was intermediate (average prevalence for all species was 0.124). Species functional type and the selection of climate variables also influenced map correspondence. For most (but not all) species, climate variables were more important than terrain or soil in predicting their distributions. Environmental variable selection varied according to modeling method, but the largest differences were between RFs and GLMs or GAMs. Although prediction accuracy was equal for GLMs, GAMs, and RFs, the differences in spatial predictions suggest that it may be important to evaluate the results of more than one model to estimate the range of spatial uncertainty before making planning decisions based on map outputs. This may be particularly important if models have low accuracy or if species prevalence is not intermediate. [source]

Prevention programs in the 21st century: what we do not discuss in public

ADDICTION, Issue 4 2010
Harold Holder
ABSTRACT Prevention research concerning alcohol, tobacco and other drugs faces a number of challenges as the scientific foundation is strengthened for the future. Seven issues which the prevention research field should address are discussed: lack of transparency in analyses of prevention program outcomes, lack of disclosure of copyright and potential for profit/income during publication, post-hoc outcome variable selection and reporting only outcomes which show positive and statistical significance at any follow-up point, tendency to evaluate statistical significance only rather than practical significance as well, problem of selection bias in terms of selecting subjects and limited generalizability, the need for confirmation of outcomes in which only self-report data are used and selection of appropriate statistical distributions in conducting significance testing. In order to establish a solid scientific base for alcohol, tobacco and drug prevention, this paper calls for discussions, disclosures and debates about the above issues (and others) as essential. In summary, the best approach is always transparency. [source]

NATURAL SELECTION ON A POLYMORPHIC DISEASE-RESISTANCE LOCUS IN IPOMOEA PURPUREA

EVOLUTION, Issue 2 2007
Joel M. Kniskern
Although disease-resistance polymorphisms are common in natural plant populations, the mechanisms responsible for this variation are not well understood. Theoretical models predict that balancing selection can maintain polymorphism within a population if the fitness effects of a resistance allele vary from a net cost to a net benefit, depending upon the extent of pathogen damage. However, there have been few attempts to determine how commonly this mechanism operates in natural plant–pathogen interactions. Ipomoea purpurea populations are often polymorphic for resistance and susceptibility alleles at a locus that influences resistance to the fungal pathogen, Coleosporium ipomoeae. We measured the fitness effects of resistance over three consecutive years at natural and manipulated levels of damage to characterize the type of selection acting on this locus. Costs of resistance varied in magnitude from undetectable to 15.5%, whereas benefits of resistance sometimes equaled, but never exceeded, these costs. In the absence of net benefits of resistance at natural or elevated levels of disease, we conclude that selection within individual populations of I. purpurea probably does not account completely for maintenance of this polymorphism. Rather, the persistence of this polymorphism is probably best explained by a combination of variable selection and meta-population processes. [source]

TEMPORAL VARIATION IN DIVERGENT SELECTION ON SPINE NUMBER IN THREESPINE STICKLEBACK

EVOLUTION, Issue 12 2002
T. E. Reimchen
Abstract., Short-term temporal cycles in ecological pressures, such as shifts in predation regime, are widespread in nature yet estimates of temporal variation in the direction and intensity of natural selection are few. Previous work on threespine stickleback (Gasterosteus aculeatus) has revealed that dorsal and pelvic spines are a defense against gape-limited predators but may be detrimental against grappling insect predators. In this study, we examined a 15-year database from an endemic population of threespine stickleback to look for evidence of temporal shifts in exposure to these divergent predation regimes and correlated shifts in selection on spine number. For juveniles, we detected selection for increased spine number during winter when gape-limited avian piscivores were most common but selection for decreased spine number during summer when odonate predation was more common. For subadults and adults, which are taken primarily by avian piscivores, we predicted selection should generally be for increased spine number in all seasons. Among 59 comparisons, four selection differentials were significant (Bonferroni corrected) and in the predicted direction. However, there was also substantial variability in remaining differentials, including two examples with strong selection for spine reduction. These reversals were associated with increased tendency of the fish to shift to a benthic niche, as determined from examination of stomach contents. These dietary data suggest that increased encounter rates with odonate predation select for spine reduction. Strong selection on spine number was followed by changes in mean spine number during subsequent years and a standard quantitative genetic formula revealed that spine number has a heritable component. Our results provide evidence of rapid morphological responses to selection from predators and suggest that temporal variation in selection may help maintain variation within populations. 
Furthermore, our findings indicate that variable selection can be predicted if the agents of selection are known. [source]
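
The two quantitative steps mentioned in this abstract can be illustrated with a minimal sketch. It assumes that the "standard quantitative genetic formula" is the breeder's equation, R = h²S (the abstract does not name the formula), and it uses a conventional family-wise alpha of 0.05 for the Bonferroni correction across the 59 comparisons. The function names and the example numbers are hypothetical and are not taken from the study.

```python
def bonferroni_alpha(alpha, n_tests):
    """Per-comparison significance threshold under a Bonferroni correction."""
    if n_tests <= 0:
        raise ValueError("n_tests must be positive")
    return alpha / n_tests


def realized_heritability(response, selection_differential):
    """Rearranged breeder's equation: h^2 = R / S.

    response: change in the trait mean between generations (R)
    selection_differential: difference between the mean of the survivors
        and the mean of the whole population before selection (S)
    """
    if selection_differential == 0:
        raise ValueError("selection differential must be non-zero")
    return response / selection_differential


# 59 comparisons are reported in the abstract; alpha = 0.05 is an assumption.
threshold = bonferroni_alpha(0.05, 59)

# Hypothetical values: a selection differential of 0.4 spines followed by
# a 0.1-spine shift in the mean of the next cohort.
h2 = realized_heritability(0.1, 0.4)
print(f"per-test alpha = {threshold:.5f}, realized h^2 = {h2:.2f}")
```

Under these assumptions, a selection differential counts as significant only if its p-value falls below roughly 0.00085, which helps explain why only a few of the 59 comparisons pass the corrected threshold.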

Winter diatom blooms in a regulated river in South Korea: explanations based on evolutionary computation

FRESHWATER BIOLOGY, Issue 10 2007
DONG-KYUN KIM
Summary 1. An ecological model was developed using genetic programming (GP) to predict the time-series dynamics of the diatom Stephanodiscus hantzschii for the lower Nakdong River, South Korea. Eight years of weekly data showed the river to be hypertrophic (chl. a, 45.1 ± 4.19 µg L−1, mean ± SE, n = 427), and S. hantzschii annually formed blooms during the winter to spring flow period (late November to March). 2. A simple non-linear equation was created to produce a 3-day sequential forecast of the species biovolume, by means of time series optimization genetic programming (TSOGP). Training data were used in conjunction with a GP algorithm utilizing 7 years of limnological variables (1995–2001). The model was validated by comparing its output with measurements for a specific year with severe blooms (1994). The model accurately predicted timing of the blooms although it slightly underestimated biovolume (training r² = 0.70, test r² = 0.78). The model consisted of the following variables: dam discharge and storage, water temperature, Secchi transparency, dissolved oxygen (DO), pH, evaporation and silica concentration. 3. The application of a five-way cross-validation test suggested that GP was capable of developing models whose input variables were similar, although the data are randomly used for training. The similarity of input variable selection was approximately 51% between the best model and the top 20 candidate models out of 150 in total (based on both Root Mean Squared Error and the determination coefficients for the test data). 4. Genetic programming was able to determine the ecological importance of different environmental variables affecting the diatoms. A series of sensitivity analyses showed that water temperature was the most sensitive parameter. In addition, the optimal equation was sensitive to DO, Secchi transparency, dam discharge and silica concentration.
The analyses thus identified likely causes of the proliferation of diatoms in 'river-reservoir hybrids' (i.e. rivers which have the characteristics of a reservoir during the dry season). This result provides specific information about the bloom of S. hantzschii in river systems, as well as the applicability of inductive methods, such as evolutionary computation, to river-reservoir hybrid systems. [source]
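
The two criteria used above to rank candidate models, root mean squared error and the coefficient of determination, can be sketched as follows. This is only an illustration: the abstract does not say how r² was computed, so the definition below (1 minus the ratio of residual to total sum of squares) is an assumption, and the biovolume values are hypothetical.

```python
import math


def rmse(observed, predicted):
    """Root mean squared error between two equally long sequences."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("sequences must be non-empty and of equal length")
    squared_errors = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def r_squared(observed, predicted):
    """Coefficient of determination, 1 - SS_res / SS_tot (assumed definition)."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("sequences must be non-empty and of equal length")
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    if ss_tot == 0:
        raise ValueError("observed values have zero variance")
    return 1.0 - ss_res / ss_tot


# Hypothetical weekly biovolume measurements and model forecasts.
measured = [1.0, 3.5, 8.0, 6.0, 2.0]
forecast = [1.2, 3.0, 7.0, 5.5, 2.5]
print(f"RMSE = {rmse(measured, forecast):.3f}, r2 = {r_squared(measured, forecast):.3f}")
```

A model that tracks bloom timing but underestimates peak biovolume, as described above, would typically show a high r² alongside a non-trivial RMSE, which is one reason to report both.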

Linkage mapping methods applied to the COGA data set: Presentation Group 4 of Genetic Analysis Workshop 14

GENETIC EPIDEMIOLOGY, Issue S1 2005
E. Warwick Daw
Abstract Presentation Group 4 participants analyzed the Collaborative Study on the Genetics of Alcoholism data provided for Genetic Analysis Workshop 14. This group examined various aspects of linkage analysis and related issues. Seven papers included linkage analyses, while the eighth calculated identity-by-descent (IBD) probabilities. Six papers analyzed linkage to an alcoholism phenotype: ALDX1 (four papers), ALDX2 (one paper), or a combination of both (one paper). Methods used included Bayesian variable selection coupled with Haseman-Elston regression, recursive partitioning to identify phenotype and covariate groupings that interact with evidence for linkage, nonparametric linkage regression modeling, affected sib-pair linkage analysis with discordant sib-pair controls, simulation-based homozygosity mapping in a single pedigree, and application of a propensity score to collapse covariates in a general conditional logistic model. Alcoholism linkage was found with ≥2 of these approaches on chromosomes 2, 4, 6, 7, 9, 14, and 21. The remaining linkage paper compared the utility of several single-nucleotide polymorphism (SNP) and microsatellite marker maps for Monte Carlo Markov chain combined oligogenic segregation and linkage analysis, and analyzed one of the electrophysiological endophenotypes, ttth1, on chromosome 7. Linkage was found with all marker sets. The last paper compared the multipoint IBD information content of several SNP sets and the microsatellite set, and found that while all SNP sets examined contained more information than the microsatellite set, most of the information contained in the SNP sets was captured by a subset of the SNP markers with ~1-cM marker spacing.
From these papers, we highlight three points: a 1-cM SNP map seems to capture most of the linkage information, so denser maps do not appear necessary; careful and appropriate use of covariates can aid linkage analysis; and sources of increased gene-sharing between relatives should be accounted for in analyses. Genet. Epidemiol. 29(Suppl. 1):S29–S34, 2005. © 2005 Wiley-Liss, Inc. [source]
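
For readers unfamiliar with Haseman-Elston regression, which is named among the methods above, a minimal sketch of the classic form may help: the squared trait difference of each sib pair is regressed on the pair's estimated proportion of alleles shared IBD at a marker, and a negative slope is taken as evidence for linkage. This is not the Bayesian variable selection variant used in the workshop paper, and the function names and data below are hypothetical.

```python
def ols_slope_intercept(x, y):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need at least two paired observations")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("x values have zero variance")
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def haseman_elston(trait_pairs, ibd_proportions):
    """Regress squared sib-pair trait differences on IBD sharing.

    trait_pairs: list of (trait_sib1, trait_sib2) tuples
    ibd_proportions: estimated proportion of alleles shared IBD per pair
    Returns (slope, intercept); a negative slope suggests linkage.
    """
    squared_diffs = [(a - b) ** 2 for a, b in trait_pairs]
    return ols_slope_intercept(ibd_proportions, squared_diffs)


# Hypothetical sib pairs: pairs sharing more alleles IBD have more similar traits.
pairs = [(5.0, 2.0), (4.0, 2.0), (3.0, 2.0)]
ibd = [0.0, 0.5, 1.0]
slope, intercept = haseman_elston(pairs, ibd)
print(f"slope = {slope:.2f}, intercept = {intercept:.2f}")
```

In practice the slope would be tested against zero with a one-sided test, and covariates could enter the regression as additional predictors, which is where a variable selection step becomes relevant.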