Larger Dataset (larger + dataset)

Selected Abstracts


Spatially autocorrelated sampling falsely inflates measures of accuracy for presence-only niche models

JOURNAL OF BIOGEOGRAPHY, Issue 12 2009
Samuel D. Veloz
Abstract Aim: Environmental niche models that utilize presence-only data have been increasingly employed to model species distributions and to test ecological and evolutionary predictions. The ideal method for evaluating the accuracy of a niche model is to train a model with one dataset and then test model predictions against an independent dataset. However, a truly independent dataset is often not available, and instead random subsets of the total data are used for 'training' and 'testing' purposes. The goal of this study was to determine how spatially autocorrelated sampling affects measures of niche model accuracy when subsets of a larger dataset are used for accuracy evaluation. Location: The distribution of Centaurea maculosa (spotted knapweed; Asteraceae) was modelled in six states in the western United States: California, Oregon, Washington, Idaho, Wyoming and Montana. Methods: Two types of niche modelling algorithm, the genetic algorithm for rule-set prediction (GARP) and maximum entropy modelling (as implemented in Maxent), were used to model the potential distribution of C. maculosa across the region. The effect of spatially autocorrelated sampling was examined by applying a spatial filter to the presence-only data (to reduce autocorrelation) and then comparing predictions made using the spatially filtered data with those made using a random subset of the data, equal in sample size to the filtered data. Results: The accuracy of predictions from both algorithms was sensitive to the spatial autocorrelation of sampling effort in the occurrence data. Spatial filtering led to lower values of the area under the receiver operating characteristic (ROC) curve but higher similarity statistic (I) values when compared with predictions from models built with random subsets of the total data, meaning that spatial autocorrelation of sampling effort between training and test data led to inflated measures of accuracy. Main conclusions: The findings indicate that care should be taken when interpreting results from presence-only niche models when training and test data have been randomly partitioned but the occurrence data were non-randomly sampled (in a spatially autocorrelated manner). The higher accuracies obtained without the spatial filter result from spatial autocorrelation of sampling effort between training and test data inflating measures of prediction accuracy. If independently surveyed data for testing predictions are unavailable, it may be necessary to account explicitly for the spatial autocorrelation of sampling effort between randomly partitioned training and test subsets when evaluating niche model predictions. [source]
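
To make the comparison concrete, the sketch below applies a simple distance-based spatial filter (thinning) to a set of presence coordinates and then draws a random subset of equal size, mirroring the two subsampling designs compared in the study. It is a minimal illustration only: the coordinates and the 10 km threshold are hypothetical, and the actual analysis used GARP and Maxent models evaluated with AUC and the similarity statistic I rather than this toy workflow.

```python
import numpy as np

def spatial_thin(coords, min_dist_km, rng):
    """Greedy spatial filter: keep points that lie at least min_dist_km
    from every previously kept point. coords is an (n, 2) array of
    projected x/y coordinates in km (an assumption for this sketch)."""
    order = rng.permutation(len(coords))  # visit points in random order
    kept = []
    for i in order:
        p = coords[i]
        if all(np.linalg.norm(p - coords[j]) >= min_dist_km for j in kept):
            kept.append(i)
    return np.array(kept)

rng = np.random.default_rng(0)
presences = rng.uniform(0, 500, size=(300, 2))  # hypothetical occurrence points (km)

# Design 1: spatially filtered subset; Design 2: random subset of equal size.
filtered_idx = spatial_thin(presences, min_dist_km=10, rng=rng)
random_idx = rng.choice(len(presences), size=len(filtered_idx), replace=False)

# The study then trains and tests niche models (GARP, Maxent) on each subsample
# and compares AUC and the similarity statistic I between the two designs.
print(f"filtered subset: {len(filtered_idx)} points, random subset: {len(random_idx)} points")
```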


The adipokinetic hormones of Heteroptera: a comparative study

PHYSIOLOGICAL ENTOMOLOGY, Issue 2 2010
DALIBOR KODRÍK
The adipokinetic hormones (AKHs) from 15 species of heteropteran Hemiptera (encompassing eight families, six superfamilies and three infraorders) have been isolated and structurally identified using liquid chromatography coupled with mass spectrometry. None of the structures is novel and all are octapeptides. These peptide sequence data are used, together with the previously available AKH sequence data on Heteroptera, to create a larger dataset for comparative analyses. This results, in total, in AKH sequences from 30 species (spanning 13 families), which are used in a matrix confronted with the current hypotheses on the phylogeny of Heteroptera. The expanded dataset shows that all heteropterans have octapeptide AKHs; three species have two AKHs, whereas the overwhelming majority have only one AKH. From a total of 11 different AKH peptides known from Heteroptera to date, three AKHs occur frequently: Panbo-red pigment-concentrating hormone (RPCH) (×10), Schgr-AKH-II (×6) and Anaim-AKH (×4). The heteropteran database also suggests that particular AKH variants are family-specific. The AKHs of Heteroptera: Pentatomomorpha (all terrestrial) are not present in Nepomorpha (aquatic) and Gerromorpha: Gerridae (semiaquatic); AKHs with a Val in position 2 are absent in the Pentatomomorpha (only AKHs with Leu2 are present), whereas Val2 predominates in the nonterrestrial species. An unexpected diversity of AKH sequences is found in Nepomorpha: Nepoidea: Nepidae: Nepinae, whereas Panbo-RPCH (which has been identified in all infraorders of decapod crustaceans) is present in all analysed species of Pentatomidae and also in the only species of Tessaratomidae investigated. The molecular evolution of Heteroptera with respect to other insect groups and to crustaceans is discussed. [source]


Hypothetical Intertemporal Consumption Choices*

THE ECONOMIC JOURNAL, Issue 486 2003
Arie Kapteyn
The paper extends and replicates part of the analysis by Barsky et al. (1997), which exploits hypothetical choices among different consumption streams to infer intertemporal substitution elasticities and rates of time preference. We use a new and much larger dataset than Barsky et al. Furthermore, we estimate structural models of intertemporal choice, parameterising the parameters of interest as functions of relevant individual characteristics. We also consider 'behavioural' extensions, such as habit formation. Models with habit formation appear to be superior to models with intertemporally additive preferences. [source]
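
For readers unfamiliar with the quantities being estimated, the block below sketches a standard textbook specification of such preferences: a discount factor (beta) capturing time preference, an elasticity of intertemporal substitution (sigma), and an internal habit-formation variant with habit parameter (gamma). This is a generic illustration of the class of models involved, not necessarily the exact parameterisation estimated in the paper.

```latex
% Intertemporally additive CRRA preferences over a consumption stream c_t
U \;=\; \sum_{t=0}^{T} \beta^{t}\, \frac{c_t^{\,1-1/\sigma}}{1-1/\sigma}

% Internal habit formation: utility depends on consumption relative to last period's level
U^{\mathrm{habit}} \;=\; \sum_{t=0}^{T} \beta^{t}\,
\frac{\left(c_t - \gamma\, c_{t-1}\right)^{1-1/\sigma}}{1-1/\sigma},
\qquad 0 \le \gamma < 1 .
```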


Reliable computing in estimation of variance components

JOURNAL OF ANIMAL BREEDING AND GENETICS, Issue 6 2008
I. Misztal
Summary The purpose of this study is to present guidelines for the selection of statistical and computing algorithms for variance components estimation when computing involves software packages. For this purpose two major methods are considered: residual maximum likelihood (REML) and Bayesian estimation via Gibbs sampling. Expectation-Maximization (EM) REML is regarded as a very stable algorithm that is able to converge when covariance matrices are close to singular; however, it is slow. Convergence problems can also occur with random regression models, especially if the starting values are much lower than those at convergence. Average Information (AI) REML is much faster for common problems, but it relies on heuristics for convergence and may be very slow or even diverge for complex models. REML algorithms for general models become unstable with a larger number of traits. REML by canonical transformation is stable in such cases but can support only a limited class of models. In general, REML algorithms are difficult to program. Bayesian methods via Gibbs sampling are much easier to program than REML, especially for complex models, and they can support much larger datasets; however, the termination criterion can be hard to determine, and the quality of estimates depends on a number of details. Computing speed varies with computing optimizations, with which some large datasets and complex models can be supported in a reasonable time; however, optimizations increase the complexity of programming and restrict the types of models applicable. Several examples from past research are discussed to illustrate the fact that different problems required different methods. [source]
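
As a minimal illustration of the Gibbs-sampling route described above, the sketch below estimates the two variance components of a one-way random effects model, y_ij = mu + a_i + e_ij, from simulated data using weak inverse-gamma priors. All data values, prior settings and chain lengths are assumptions made for the example; production animal-breeding software handles far more general models, pedigrees and convergence diagnostics.

```python
import numpy as np

rng = np.random.default_rng(42)

# --- simulate a small one-way random effects dataset (hypothetical values) ---
n_groups, n_per_group = 50, 20
true_mu, true_sigma_a2, true_sigma_e2 = 10.0, 2.0, 5.0
a_true = rng.normal(0.0, np.sqrt(true_sigma_a2), n_groups)
group = np.repeat(np.arange(n_groups), n_per_group)
y = true_mu + a_true[group] + rng.normal(0.0, np.sqrt(true_sigma_e2), len(group))

def inv_gamma(shape, scale):
    """Draw from an inverse-gamma distribution via a gamma draw."""
    return 1.0 / rng.gamma(shape, 1.0 / scale)

# --- Gibbs sampler ---
n_iter, burn_in = 2000, 500
mu, sigma_a2, sigma_e2 = y.mean(), 1.0, 1.0
a = np.zeros(n_groups)
draws = []

for it in range(n_iter):
    # group effects a_i | everything else
    resid_sums = np.bincount(group, weights=y - mu, minlength=n_groups)
    prec = n_per_group / sigma_e2 + 1.0 / sigma_a2
    a = rng.normal(resid_sums / sigma_e2 / prec, np.sqrt(1.0 / prec))
    # overall mean mu | everything else (flat prior)
    mu = rng.normal(np.mean(y - a[group]), np.sqrt(sigma_e2 / len(y)))
    # variance components | effects (weak inverse-gamma priors, an assumption)
    sigma_a2 = inv_gamma(0.001 + n_groups / 2.0, 0.001 + np.sum(a**2) / 2.0)
    resid = y - mu - a[group]
    sigma_e2 = inv_gamma(0.001 + len(y) / 2.0, 0.001 + np.sum(resid**2) / 2.0)
    if it >= burn_in:
        draws.append((sigma_a2, sigma_e2))

post = np.array(draws)
print(f"posterior means: sigma_a^2 = {post[:, 0].mean():.2f}, sigma_e^2 = {post[:, 1].mean():.2f}")
```

The termination issue mentioned in the abstract shows up here as the choice of chain length and burn-in, which in this sketch are simply fixed in advance rather than assessed with convergence diagnostics.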


Active learning support vector machines for optimal sample selection in classification

JOURNAL OF CHEMOMETRICS, Issue 6 2004
Simeone Zomer
Abstract Labelling samples is a procedure that may result in significant delays, particularly when dealing with larger datasets and/or when labelling implies prolonged analysis. In such cases a strategy that allows the construction of a reliable classifier from a minimal-sized training set, by labelling only a minor fraction of the samples, can be advantageous. Support vector machines (SVMs) are ideal for such an approach because the classifier relies on only a small subset of samples, namely the support vectors, while being independent of the remaining ones, which typically form the majority of the dataset. This paper describes a procedure in which an SVM classifier is constructed with support vectors systematically retrieved from the pool of unlabelled samples. The procedure is termed 'active' because the algorithm interacts with the samples prior to their labelling rather than waiting passively for the input. The learning behaviour on simulated datasets is analysed, and a practical application to the detection of hydrocarbons in soils using mass spectrometry is described. Results on simulations show that the active learning SVM performs optimally on datasets where the classes display an intermediate level of separation. On the real case study, the classifier correctly assesses the membership of all samples in the original dataset while requiring labelling of only around 14% of the data. Its subsequent application to a second dataset of analogous nature also provides perfect classification without further labelling, giving the same outcome as most classical techniques based on the entirely labelled original dataset. Copyright © 2004 John Wiley & Sons, Ltd. [source]
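
The paper's procedure retrieves prospective support vectors directly from the unlabelled pool; the sketch below shows the closely related, widely used margin-based variant, in which the unlabelled sample closest to the current decision boundary is queried for labelling at each step. The data are synthetic and the budget and kernel settings are assumptions, so this is an illustration of the general active-learning loop rather than a reproduction of the paper's algorithm.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Synthetic two-class pool standing in for the (mostly unlabelled) samples.
X, y = make_classification(n_samples=400, n_features=10, n_informative=5,
                           class_sep=1.0, random_state=0)

labelled = [int(np.where(y == c)[0][0]) for c in (0, 1)]  # one seed sample per class
unlabelled = [i for i in range(len(X)) if i not in labelled]

budget = 40                      # maximum number of labels we are willing to request
clf = SVC(kernel="linear", C=1.0)

while len(labelled) < budget:
    clf.fit(X[labelled], y[labelled])
    # Margin-based query: pick the unlabelled sample closest to the hyperplane,
    # i.e. the one most likely to become a support vector once labelled.
    margins = np.abs(clf.decision_function(X[unlabelled]))
    query = unlabelled[int(np.argmin(margins))]
    labelled.append(query)       # in practice, the analyst labels this sample here
    unlabelled.remove(query)

accuracy = clf.fit(X[labelled], y[labelled]).score(X, y)
print(f"labelled {len(labelled)} of {len(X)} samples "
      f"({100 * len(labelled) / len(X):.0f}%), accuracy on the full pool: {accuracy:.3f}")
```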


A review of standards for data exchange within systems biology

PROTEINS: STRUCTURE, FUNCTION AND BIOINFORMATICS, Issue 6 2007
Lena Strömbäck Dr.
Abstract The rapid increase in experimental data within systems biology has increased the need for data exchange to allow analysis and comparison of larger datasets. This has resulted in a need for standardized formats for representing such results, and many formats for data representation have been developed or are under development. In this paper we give an overview of the current state of available standards and ontologies within systems biology. We focus on XML-based standards for data exchange and give a thorough description of the similarities and differences between currently available formats. For each of these, we discuss how important concepts such as substances, interactions, and experimental data can be represented. In particular, we note that the purpose of a standard is often visible in the structures it provides for the representation of data, and that a clear purpose is also crucial for the success of a standard. Moreover, we note that the development of representation formats parallels the development of ontologies, and the recent trend is for representation formats to make increasing use of available ontologies. [source]
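
As a small illustration of what such XML-based exchange formats look like in practice, the sketch below encodes one reaction in an SBML-flavoured fragment and reads back the substances and interactions with Python's standard library. The element names follow SBML conventions but the fragment is simplified and not valid against any particular schema version, so treat it as purely illustrative.

```python
import xml.etree.ElementTree as ET

# A simplified, SBML-flavoured fragment (illustrative only, not schema-valid SBML).
SBML_LIKE = """
<model id="toy_glycolysis_step">
  <listOfSpecies>
    <species id="glucose" compartment="cytosol"/>
    <species id="g6p" compartment="cytosol"/>
  </listOfSpecies>
  <listOfReactions>
    <reaction id="hexokinase" reversible="false">
      <listOfReactants><speciesReference species="glucose"/></listOfReactants>
      <listOfProducts><speciesReference species="g6p"/></listOfProducts>
    </reaction>
  </listOfReactions>
</model>
"""

root = ET.fromstring(SBML_LIKE)
species = [s.get("id") for s in root.iter("species")]
reactions = {r.get("id"): [sr.get("species") for sr in r.iter("speciesReference")]
             for r in root.iter("reaction")}
print("species:", species)        # substances the standard has to represent
print("reactions:", reactions)    # interactions between those substances
```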