Large Data Sets
Kinds of Large Data Sets
Selected Abstracts

Clustering revealed in high-resolution simulations and visualization of multi-resolution features in fluid-particle models. CONCURRENCY AND COMPUTATION: PRACTICE & EXPERIENCE, Issue 2 2003. Krzysztof Boryczko.
Abstract: Simulating natural phenomena at greater accuracy results in an explosive growth of data. Large-scale simulations with particles currently involve ensembles consisting of between 10^6 and 10^9 particles, which cover 10^5-10^6 time steps. Thus, the data files produced in a single run can reach from tens of gigabytes to hundreds of terabytes. This data bank allows one to reconstruct the spatio-temporal evolution of both the particle system as a whole and each particle separately. Realistically, for one to look at a large data set at full resolution at all times is not possible and, in fact, not necessary. We have developed an agglomerative clustering technique, based on the concept of a mutual nearest neighbor (MNN). This procedure can be easily adapted for efficient visualization of extremely large data sets from simulations with particles at various resolution levels. We present the parallel algorithm for MNN clustering and its timings on the IBM SP and SGI/Origin 3800 multiprocessor systems for up to 16 million fluid particles. The high efficiency obtained is mainly due to the similarity in the algorithmic structure of MNN clustering and particle methods. We show various examples drawn from MNN applications in visualization and analysis of the order of a few hundred gigabytes of data from discrete particle simulations, using dissipative particle dynamics and fluid particle models. Because data clustering is the first step in this concept extraction procedure, we may employ this clustering procedure to many other fields such as data mining, earthquake events and stellar populations in nebula clusters. Copyright © 2003 John Wiley & Sons, Ltd. [source]

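The mutual-nearest-neighbour idea behind the clustering described above can be sketched in a few lines. The sequential O(N^2) merging pass and the toy three-blob data below are illustrative assumptions; they stand in for, and do not reproduce, the parallel multi-resolution algorithm reported in the abstract.

```python
import numpy as np

def mnn_cluster(points, n_clusters):
    """Toy agglomerative clustering that merges mutual nearest neighbours (MNN).

    Each pass finds pairs of clusters that are each other's nearest neighbour
    (by centroid distance) and merges them, until n_clusters remain or no
    mutual pair can be found.
    """
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > n_clusters:
        centroids = np.array([points[c].mean(axis=0) for c in clusters])
        d = np.linalg.norm(centroids[:, None, :] - centroids[None, :, :], axis=-1)
        np.fill_diagonal(d, np.inf)
        nn = d.argmin(axis=1)              # nearest neighbour of each cluster
        merged, keep = set(), []
        max_merges = len(clusters) - n_clusters
        n_merges = 0
        for i, j in enumerate(nn):
            if n_merges == max_merges:
                break
            if i < j and nn[j] == i and i not in merged and j not in merged:
                keep.append(clusters[i] + clusters[j])   # i and j are mutual NNs
                merged.update((i, j))
                n_merges += 1
        keep.extend(c for k, c in enumerate(clusters) if k not in merged)
        if len(keep) == len(clusters):     # no mutual pair found; stop early
            break
        clusters = keep
    return clusters

# usage: three well-separated blobs of "particles"
rng = np.random.default_rng(0)
pts = np.vstack([rng.normal(c, 0.1, size=(50, 3)) for c in (0.0, 1.0, 2.0)])
print(sorted(len(c) for c in mnn_cluster(pts, 3)))
```
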
Missing data assumptions and methods in a smoking cessation study. ADDICTION, Issue 3 2010. Sunni A. Barnes.
Abstract: Aim: A sizable percentage of subjects do not respond to follow-up attempts in smoking cessation studies. The usual procedure in the smoking cessation literature is to assume that non-respondents have resumed smoking. This study used data from a study with a high follow-up rate to assess the degree of bias that may be caused by different methods of imputing missing data. Design and methods: Based on a large data set with very little missing follow-up information at 12 months, a simulation study was undertaken to compare and contrast missing data imputation methods (assuming smoking, propensity score matching and optimal matching) under various assumptions as to how the missing data arose (randomly generated missing values, increased non-response from smokers and a hybrid of the two). Findings: Missing data imputation methods all resulted in some degree of bias which increased with the amount of missing data. Conclusion: None of the missing data imputation methods currently available can compensate for bias when there are substantial amounts of missing data. [source]

Topology and Dependency Tests in Spatial and Network Autoregressive Models. GEOGRAPHICAL ANALYSIS, Issue 2 2009. Steven Farber.
Social network analysis has been identified as a promising direction for further applications of spatial statistical and econometric models. The type of network analysis envisioned is formally identical to the analysis of geographical systems, in that both involve the measurement of dependence between observations connected by edges that constitute a system. An important item, which has not been investigated in this context, is the potential relationship between the topology properties of networks (or network descriptions of geographical systems) and the properties of spatial models and tests. The objective of this article is to investigate, within a simulation setting, the ability of spatial dependency tests to identify a spatial/network autoregressive model when two network topology measures, namely degree distribution and clustering, are controlled. Drawing on a large data set of synthetically controlled social networks, the impact of network topology on dependency tests is investigated under a hierarchy of topology factors, sample size, and autocorrelation strength. In addition, topology factors are related to known properties of empirical systems. [source]

Improved imaging with phase-weighted common conversion point stacks of receiver functions. GEOPHYSICAL JOURNAL INTERNATIONAL, Issue 1 2010. A. Frassetto.
Summary: Broad-band array studies frequently stack receiver functions to improve their signal-to-noise ratio while mapping structures in the crust and upper mantle. Noise may produce spurious secondary arrivals that obscure or mimic arrivals produced by P-to-S conversions at large contrasts in seismic impedance such as the Moho. We use a Hilbert transform to calculate phase-weights, which minimize the constructive stacking of erroneous signal in receiver function data sets. We outline this approach and demonstrate its application through synthetic data combined with different types of noise, a previously published example of signal-generated noise, and a large data set from the Sierra Nevada EarthScope Project. These examples show that phase-weighting reduces the presence of signal-generated noise in receiver functions and improves stacked data sets. [source]

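The phase-weighting step described in the receiver-function summary above can be illustrated with a small sketch: the analytic signal from a Hilbert transform gives each trace an instantaneous phase, and time samples where the phases agree across traces are up-weighted in the stack. The sharpness exponent and the synthetic pulse-plus-noise traces are assumptions for illustration, not the paper's processing parameters.

```python
import numpy as np
from scipy.signal import hilbert

def phase_weighted_stack(traces, nu=2.0):
    """Stack traces with a coherence (phase) weight.

    traces: 2-D array, one row per receiver function (same length and sampling).
    nu: sharpness exponent; nu = 0 reduces to a plain linear stack.
    """
    analytic = hilbert(traces, axis=1)                 # analytic signal per trace
    unit_phasors = analytic / (np.abs(analytic) + 1e-12)
    coherence = np.abs(unit_phasors.mean(axis=0))      # 1 where phases agree
    return traces.mean(axis=0) * coherence ** nu

# toy example: a common pulse plus incoherent noise on 50 traces
t = np.linspace(0, 10, 1001)
pulse = np.exp(-((t - 4.0) ** 2) / 0.05)
rng = np.random.default_rng(1)
traces = pulse + 0.5 * rng.standard_normal((50, t.size))
stacked = phase_weighted_stack(traces)
print("peak of weighted stack:", round(float(stacked.max()), 2))
```
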
Redox Processes and Water Quality of Selected Principal Aquifer Systems. GROUND WATER, Issue 2 2008. P.B. McMahon.
Reduction/oxidation (redox) conditions in 15 principal aquifer (PA) systems of the United States, and their impact on several water quality issues, were assessed from a large data base collected by the National Water-Quality Assessment Program of the USGS. The logic of these assessments was based on the observed ecological succession of electron acceptors such as dissolved oxygen, nitrate, and sulfate and threshold concentrations of these substrates needed to support active microbial metabolism. Similarly, the utilization of solid-phase electron acceptors such as Mn(IV) and Fe(III) is indicated by the production of dissolved manganese and iron. An internally consistent set of threshold concentration criteria was developed and applied to a large data set of 1692 water samples from the PAs to assess ambient redox conditions. The indicated redox conditions then were related to the occurrence of selected natural (arsenic) and anthropogenic (nitrate and volatile organic compounds) contaminants in ground water. For the natural and anthropogenic contaminants assessed in this study, considering redox conditions as defined by this framework of redox indicator species and threshold concentrations explained many water quality trends observed at a regional scale. An important finding of this study was that samples indicating mixed redox processes provide information on redox heterogeneity that is useful for assessing common water quality issues. Given the interpretive power of the redox framework and given that it is relatively inexpensive and easy to measure the chemical parameters included in the framework, those parameters should be included in routine water quality monitoring programs whenever possible. [source]

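A minimal sketch of the kind of threshold-based redox classification described above is given below. The threshold values and category names are placeholders chosen for illustration; they are not the internally consistent criteria calibrated in the study.

```python
# Illustrative thresholds in mg/L; placeholders, not the study's calibrated values.
THRESHOLDS = {"O2": 0.5, "NO3": 0.5, "Mn": 0.05, "Fe": 0.1}

def redox_category(sample):
    """Assign a crude redox category to one water sample (dict of mg/L values).

    Each electron acceptor is flagged as active or not; disagreement between
    the indicators is reported as a mixed condition.
    """
    oxic = sample["O2"] >= THRESHOLDS["O2"]
    anoxic_markers = [
        sample["Mn"] >= THRESHOLDS["Mn"],   # Mn(IV) reduction produces dissolved Mn
        sample["Fe"] >= THRESHOLDS["Fe"],   # Fe(III) reduction produces dissolved Fe
    ]
    if oxic and not any(anoxic_markers):
        return "oxic"
    if not oxic and any(anoxic_markers):
        return "anoxic (Mn/Fe reducing)"
    if not oxic and sample["NO3"] >= THRESHOLDS["NO3"]:
        return "suboxic (nitrate still present)"
    return "mixed"

print(redox_category({"O2": 6.2, "NO3": 1.1, "Mn": 0.01, "Fe": 0.02}))  # oxic
print(redox_category({"O2": 0.1, "NO3": 0.0, "Mn": 0.30, "Fe": 1.50}))  # anoxic
print(redox_category({"O2": 2.0, "NO3": 0.2, "Mn": 0.20, "Fe": 0.05}))  # mixed
```
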
Preparing a large data set for analysis: using the Minimum Data Set to study perineal dermatitis. JOURNAL OF ADVANCED NURSING, Issue 4 2005. Kay Savik MS.
Abstract: Aim: The aim of this paper is to present a practical example of preparing a large set of Minimum Data Set records for analysis, operationalizing Minimum Data Set items that defined risk factors for perineal dermatitis, our outcome variable. Background: Research with nursing home elders remains a vital need as 'baby boomers' age. Conducting research in nursing homes is a daunting task. The Minimum Data Set is a standardized instrument used to assess many aspects of a nursing home resident's functional capability. United States Federal Regulations require a Minimum Data Set assessment of all nursing home residents. These large data sets would be a useful resource for research studies, but need to be extensively refined for use in most statistical analyses. Although fairly comprehensive, the Minimum Data Set does not provide direct measures of all clinical outcomes and variables of interest. Method: Perineal dermatitis is not directly measured in the Minimum Data Set. Additional information from prescribers' (physician and nurse) orders was used to identify cases of perineal dermatitis. The following steps were followed to produce Minimum Data Set records appropriate for analysis: (1) identification of a subset of Minimum Data Set records specific to the research, (2) identification of perineal dermatitis cases from the prescribers' orders, (3) merging of the perineal dermatitis cases with the Minimum Data Set data set, (4) identification of Minimum Data Set items used to operationalize the variables in our model of perineal dermatitis, (5) determination of the appropriate way to aggregate individual Minimum Data Set items into composite measures of the variables, (6) refinement of these composites using item analysis and (7) assessment of the distribution of the composite variables and need for transformations to use in statistical analysis. Results: Cases of perineal dermatitis were successfully identified and composites were created that operationalized a model of perineal dermatitis. Conclusion: Following these steps resulted in a data set where data analysis could be pursued with confidence. Incorporating other sources of data, such as prescribers' orders, extends the usefulness of the Minimum Data Set for research use. [source]

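The data-preparation steps in the Minimum Data Set abstract above (identify cases from prescribers' orders, merge them with the assessment records, aggregate items into composites, inspect distributions) can be sketched with pandas. All column names and the tiny in-line tables are hypothetical stand-ins for the real MDS items and order files.

```python
import pandas as pd

# Made-up stand-ins for the real files; actual MDS items are far more numerous.
mds = pd.DataFrame({
    "resident_id": [1, 2, 3],
    "bed_mobility": [2, 4, 1],      # assumed 0-4 dependence scores
    "transfer":     [3, 4, 0],
    "locomotion":   [2, 3, 1],
})
orders = pd.DataFrame({
    "resident_id": [1, 1, 3],
    "order_text": ["apply barrier cream to perineal area", "vitamin D daily", "turn q2h"],
})

# identify cases from the prescribers' orders (step 2)
orders["pd_case"] = orders["order_text"].str.contains("perineal", case=False)
cases = orders.groupby("resident_id")["pd_case"].any()

# merge the cases with the MDS records (step 3)
merged = mds.join(cases, on="resident_id").fillna({"pd_case": False})

# aggregate individual items into a composite measure (steps 4-6)
merged["mobility_composite"] = merged[["bed_mobility", "transfer", "locomotion"]].mean(axis=1)

# check distributions before deciding on transformations (step 7)
print(merged[["pd_case", "mobility_composite"]])
```
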
Evidences of non-additive effects of multiple parasitoids on Diatraea saccharalis Fabr. (Lep., Crambidae) populations in sugarcane fields in Brazil. JOURNAL OF APPLIED ENTOMOLOGY, Issue 2 2004. M. N. Rossi.
Abstract: Biological control is a relatively benign method of pest control. However, considerable debate exists over whether multiple natural enemies often interact to produce additive or non-additive effects on their prey or host populations. Based on the large data set stored in the São João and Barra sugarcane mills (state of São Paulo, Brazil) regarding the programme of biological control of Diatraea saccharalis using the parasitoids Cotesia flavipes and tachinid flies, in the present study the author investigated whether the parasitoids released into sugarcane fields interfered significantly with the rate of parasitized D. saccharalis hosts. The author also observed whether there was an additive effect of releasing C. flavipes and tachinids on the rate of parasitized hosts, and looked for evidence of possible negative effects of the use of multiple parasitoid species in this biological control programme. Results showed that C. flavipes and the tachinids were concomitantly released in the Barra Mill, but not in the São João Mill. Furthermore, in the Barra Mill there was evidence that the parasitoids interacted because the percentage of parasitism did not increase after the release of either C. flavipes or tachinids. In the São João Mill, when both parasitoid species were released out of synchrony, both the percentage of parasitism by C. flavipes as well as that of the tachinids increased. When large numbers of tachinids were released in the Barra Mill, they caused a significantly lower percentage of parasitism imposed by C. flavipes. The implications of the results as evidence of non-additive effects of C. flavipes plus tachinids on D. saccharalis populations are discussed. [source]

Discovery of ten new specimens of large-billed reed warbler Acrocephalus orinus, and new insights into its distributional range. JOURNAL OF AVIAN BIOLOGY, Issue 6 2008. Lars Svensson.
We here report the finding of ten new specimens of the poorly known large-billed reed warbler Acrocephalus orinus. Preliminary identifications were made on the basis of bill, tarsus and claw measurements, and their specific identity was then confirmed by comparison of partial sequences of the cytochrome b gene with a large data set containing nearly all other species in the genus Acrocephalus, including the type specimen of A. orinus. Five of the new specimens were collected in summer in Afghanistan and Kazakhstan, indicating that the species probably breeds in Central Asia, and the data and moult of the others suggest that the species migrates along the Himalayas to winter in N India and SE Asia. The population structure suggests a stable or shrinking population. [source]

Are parametric models suitable for estimating avian growth rates? JOURNAL OF AVIAN BIOLOGY, Issue 4 2007. William P. Brown.
For many bird species, growth is negative or equivocal during development. Traditional, parametric growth curves assume growth follows a sigmoidal form with prescribed inflection points and is positive until asymptotic size. Accordingly, these curves will not accurately capture the variable, sometimes considerable, fluctuations in avian growth over the course of the trajectory. We evaluated the fit of three traditional growth curves (logistic, Gompertz, and von Bertalanffy) and a nonparametric spline estimator to simulated growth data of six different specified forms over a range of sample sizes. For all sample sizes, the spline best fit the simulated model that exhibited negative growth during a portion of the trajectory. The Gompertz curve was the most flexible for fitting simulated models that were strictly sigmoidal in form, yet the fit of the spline was comparable to that of the Gompertz curve as sample size increased. Importantly, confidence intervals for all of the fitted, traditional growth curves were wholly inaccurate, negating the apparent robustness of the Gompertz curve, while confidence intervals of the spline were acceptable. We further evaluated the fit of traditional growth curves and the spline to a large data set of wood thrush Hylocichla mustelina mass and wing chord observations. The spline fit the wood thrush data better than the traditional growth curves, produced estimates that did not differ from known observations, and described negative growth rates at relevant life history stages that were not detected by the growth curves. The common rationale for using parametric growth curves, which compress growth information into a few parameters, is to predict an expected size or growth rate at some age or to compare estimated growth with other published estimates. The suitability of these traditional growth curves may be compromised by several factors, however, including variability in the true growth trajectory. Nonparametric methods, such as the spline, provide a precise description of empirical growth yet do not produce such parameter estimates. Selection of a growth descriptor is best determined by the question being asked but may be constrained by inherent patterns in the growth data. [source]

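The comparison described above, a parametric sigmoidal curve against a nonparametric spline on growth data that dips, can be reproduced in miniature. The simulated "mass" trajectory, the starting values and the smoothing factor are assumptions for illustration, not the wood thrush data or the paper's fitting procedure.

```python
import numpy as np
from scipy.optimize import curve_fit
from scipy.interpolate import UnivariateSpline

def gompertz(t, A, b, k):
    """Sigmoidal Gompertz curve: asymptote A, displacement b, rate k."""
    return A * np.exp(-b * np.exp(-k * t))

# simulated nestling masses with a dip (negative growth) late in development
rng = np.random.default_rng(2)
age = np.linspace(0, 14, 60)
true_mass = 45 / (1 + np.exp(-(age - 5))) - 3 * np.exp(-((age - 11) ** 2))
mass = true_mass + rng.normal(0, 1.0, age.size)

# parametric fit: constrained to a monotonic sigmoid, so it smooths over the dip
(A, b, k), _ = curve_fit(gompertz, age, mass, p0=(45, 3, 0.5))

# nonparametric fit: a smoothing spline is free to follow the dip
spline = UnivariateSpline(age, mass, s=len(age))

grid = np.linspace(0, 14, 5)
print("age      :", np.round(grid, 1))
print("Gompertz :", np.round(gompertz(grid, A, b, k), 1))
print("spline   :", np.round(spline(grid), 1))
```
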
Leichhardt's maps: 100 years of change in vegetation structure in inland Queensland. JOURNAL OF BIOGEOGRAPHY, Issue 1 2008. R. J. Fensham.
Abstract: Aim: To address the hypothesis that there has been a substantial increase in woody vegetation cover ('vegetation thickening') during the 100 years after the burning practices of aboriginal hunter-gatherers were abruptly replaced by the management activities associated with pastoralism in north-east Australia. Location: Three hundred and eighty-three sites on a 3000 km transect, inland Queensland, Australia. Methods: Vegetation structure descriptions from the route notes of the first European exploration of the location by Ludwig Leichhardt in 1844-45 were georeferenced and compiled. Leichhardt's application of structural descriptors (e.g. 'scrub', 'open forest', 'plain') was interpreted as domains within a matrix of tall stratum and low stratum woody cover. Woody cover was also interpreted for the same locations using aerial photography that largely pre-dates extensive land clearing (1940s-1970s) and compared with their structural domain in 1844-45. The fire-sensitive tree, cypress-pine (Callitris glaucophylla), was singled out for case study because it has been widely proposed that the density of this tree has substantially increased under European pastoral management. Results: The coarse resolution of this analysis indicates that the structure of the vegetation has been stable over the first 100 years of pastoralism. For example, treeless or sparsely treed plains described by Leichhardt (1844-45) had the same character on the aerial photography (1945-78). Leichhardt typically described vegetation that includes cypress-pine as having a 'thicket' structure suggesting dense regenerating stands of small trees, consistent with the signature typical on the aerial photography. Main conclusions: A large data set of geographically located descriptions of vegetation structure from the first European traverse of inland Australia compared with vegetation structure determined from aerial photography does not support the hypothesis that vegetation thickening has been extensive and substantial. On the contrary, the study suggests that the structure of the vegetation has been relatively stable for the first 100 years of European settlement and pastoralism except for those areas that have been affected by broad-scale clearing. [source]

Genome size and recombination in angiosperms: a second look. JOURNAL OF EVOLUTIONARY BIOLOGY, Issue 2 2007. J. ROSS-IBARRA.
Abstract: Despite dramatic differences in genome size, and thus space for recombination to occur, previous workers found no correlation between recombination rate and genome size in flowering plants. Here I re-investigate these claims using phylogenetic comparative methods to test a large data set of recombination data in angiosperms. I show that genome size is significantly correlated with recombination rate across a wide sampling of species and that change in genome size explains a meaningful proportion (~20%) of variation in recombination rate. I show that the strength of this correlation is comparable with that of several characters previously linked to evolutionary change in recombination rate, but argue that consideration of processes of genome size change likely make the observed correlation a conservative estimate. And finally, although I find that recombination rate increases less than proportionally to change in genome size, several mechanistic and theoretical arguments suggest that this result is not unexpected. [source]

Using classification tree analysis to reveal causes of mortality in an insect population. AGRICULTURAL AND FOREST ENTOMOLOGY, Issue 2 2010. Chris J. K. MacQuarrie.
1. Invasive species pose significant threats to native and managed ecosystems. However, it may not always be possible to perform rigorous, long-term studies on invaders to determine the factors that influence their population dynamics, particularly when time and resources are limited. We applied a novel approach to determine factors associated with mortality in larvae of the sawfly Profenusa thomsoni Konow, a leafminer of birch, and a relatively recent invader of urban and rural birch forests in Alaska. Classification tree analysis was applied to reveal relationships between qualitative and quantitative predictor variables and categorical response variables in a large data set of larval mortality observations. 2. We determined the state (living or dead) of sawfly larvae in samples of individual leaves. Each leaf was scored for variables reflecting the intensity of intra-specific competition and leaf quality for leafminers; year of collection and degree-days accumulated were recorded for each sample. We explored the association of these variables with larval state using classification tree analysis. 3. Leafminer mortality was best explained by a combination of competition and resource exhaustion, and our analysis revealed a possible advantage to group feeding in young larvae that may explain previously observed patterns of resource overexploitation in this species. Dead larvae were disproportionately found in smaller leaves, which highlights the potential effect of competition on mortality and suggests that smaller-leaved species of birch will be better able to resist leafminer damage. 4. We show that classification tree analysis may be useful in situations where urgency and/or limited resources prohibit traditional life-table studies. [source]

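A small sketch of classification tree analysis applied to leaf-level mortality records is shown below. The predictor names, the simulated mortality rule and the tree settings are assumptions for illustration, not the study's data or fitted model.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Made-up leaf-level observations: leaf area (cm^2), number of larvae sharing
# the leaf (competition), and accumulated degree-days.
rng = np.random.default_rng(3)
n = 400
leaf_area = rng.uniform(5, 60, n)
larvae_per_leaf = rng.integers(1, 6, n)
degree_days = rng.uniform(200, 900, n)

# assumed rule for the toy data: crowding on small leaves kills larvae
p_dead = 1 / (1 + np.exp(-(20.0 * larvae_per_leaf / leaf_area - 1.5)))
dead = rng.random(n) < p_dead

X = np.column_stack([leaf_area, larvae_per_leaf, degree_days])
tree = DecisionTreeClassifier(max_depth=3, min_samples_leaf=20).fit(X, dead)

# print the recovered splits; competition and leaf size should dominate
print(export_text(tree, feature_names=["leaf_area", "larvae_per_leaf", "degree_days"]))
```
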
Alcoholism Susceptibility Loci: Confirmation Studies in a Replicate Sample and Further Mapping. ALCOHOLISM, Issue 7 2000. Tatiana Foroud.
Background: There is substantial evidence for a significant genetic component to the risk for alcoholism. A previous study reported linkage to chromosomes 1, 2, and 7 in a large data set that consisted of 105 families, each with at least three alcoholic members. Methods: Additional genotyping in the 105 families has been completed in the chromosomal regions identified in the initial analyses, and a replication sample of 157 alcoholic families ascertained under identical criteria has been genotyped. Two hierarchical definitions of alcoholism were employed in the linkage analyses: (1) individuals who met both Feighner and DSM-III-R criteria for alcohol dependence represented a broad definition of disease; and (2) individuals who met ICD-10 criteria for alcoholism were considered affected under a more severe definition of disease. Results: Genetic analyses of affected sibling pairs supported linkage to chromosome 1 (LOD = 1.6) in the replication data set as well as in a combined analysis of the two samples (LOD = 2.6). Evidence of linkage to chromosome 7 increased in the combined data (LOD = 2.9). The LOD score on chromosome 2 in the initial data set increased after genotyping of additional markers; however, combined analyses of the two data sets resulted in overall lower LOD scores (LOD = 1.8) on chromosome 2. A new finding of linkage to chromosome 3 was identified in the replication data set (LOD = 3.4). Conclusions: Analyses of a second large sample of alcoholic families provided further evidence of genetic susceptibility loci on chromosomes 1 and 7. Genetic analyses also have identified susceptibility loci on chromosomes 2 and 3 that may act only in one of the two data sets. [source]

Dialogue act recognition using maximum entropy. JOURNAL OF THE AMERICAN SOCIETY FOR INFORMATION SCIENCE AND TECHNOLOGY, Issue 6 2008. Kwok Cheung Lan.
A dialogue-based interface for information systems is considered a potentially very useful approach to information access. A key step in computer processing of natural-language dialogues is dialogue-act (DA) recognition. In this paper, we apply a feature-based classification approach for DA recognition, by using the maximum entropy (ME) method to build a classifier for labeling utterances with DA tags. The ME method has the advantage that a large number of heterogeneous features can be flexibly combined in one classifier, which can facilitate feature selection. A unique characteristic of our approach is that it does not need to model the prior probability of DAs directly, and thus avoids the use of a discourse grammar. This simplifies the implementation of the classifier and improves the efficiency of DA recognition, without sacrificing the classification accuracy. We evaluate the classifier using a large data set based on the Switchboard corpus. Encouraging performance is observed; the highest classification accuracy achieved is 75.03%. We also propose a heuristic to address the problem of sparseness of the data set. This problem has resulted in poor classification accuracies of some DA types that have very low occurrence frequencies in the data set. Preliminary evaluation shows that the method is effective in improving the macroaverage classification accuracy of the ME classifier. [source]

Fixed rank kriging for very large spatial data sets. JOURNAL OF THE ROYAL STATISTICAL SOCIETY: SERIES B (STATISTICAL METHODOLOGY), Issue 1 2008. Noel Cressie.
Summary: Spatial statistics for very large spatial data sets is challenging. The size of the data set, n, causes problems in computing optimal spatial predictors such as kriging, since its computational cost is of order n^3. In addition, a large data set is often defined on a large spatial domain, so the spatial process of interest typically exhibits non-stationary behaviour over that domain. A flexible family of non-stationary covariance functions is defined by using a set of basis functions that is fixed in number, which leads to a spatial prediction method that we call fixed rank kriging. Specifically, fixed rank kriging is kriging within this class of non-stationary covariance functions. It relies on computational simplifications when n is very large, for obtaining the spatial best linear unbiased predictor and its mean-squared prediction error for a hidden spatial process. A method based on minimizing a weighted Frobenius norm yields best estimators of the covariance function parameters, which are then substituted into the fixed rank kriging equations. The new methodology is applied to a very large data set of total column ozone data, observed over the entire globe, where n is of the order of hundreds of thousands. [source]

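The computational trick behind fixed rank kriging, replacing the full n x n covariance with a fixed number r of basis functions so that the Sherman-Morrison-Woodbury identity reduces the expensive inverse to an r x r one, can be sketched as follows. The Gaussian basis functions are an assumption for illustration, and the basis-coefficient covariance K is fixed to the identity here rather than estimated by the weighted-Frobenius-norm step the abstract describes.

```python
import numpy as np

def fixed_rank_krige(coords, z, pred_coords, centres, scale=1.0, sigma2=0.1):
    """Toy fixed-rank-style predictor: Cov(z) = S K S' + sigma2 * I.

    S is the n x r matrix of basis functions evaluated at the data locations.
    Woodbury turns the n x n inverse into an r x r one, so cost grows with n,
    not n^3.  K is set to the identity; the paper estimates it from the data.
    """
    def basis(x):
        d2 = ((x[:, None, :] - centres[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2 * scale ** 2))

    S, S0 = basis(coords), basis(pred_coords)              # n x r and m x r
    K = np.eye(centres.shape[0])
    # (S K S' + s2 I)^-1 z = (z - S (s2 K^-1 + S'S)^-1 S'z) / s2
    small = np.linalg.inv(sigma2 * np.linalg.inv(K) + S.T @ S)
    Sigma_inv_z = (z - S @ (small @ (S.T @ z))) / sigma2
    return S0 @ K @ (S.T @ Sigma_inv_z)                    # c0' Sigma^-1 z

# usage on a small synthetic surface
rng = np.random.default_rng(4)
coords = rng.uniform(0, 10, (2000, 2))
z = np.sin(coords[:, 0]) + 0.3 * rng.standard_normal(2000)
centres = np.array([[x, y] for x in range(0, 11, 2) for y in range(0, 11, 2)], float)
pred = fixed_rank_krige(coords, z, np.array([[2.5, 5.0]]), centres)
print(round(float(pred[0]), 2), "vs underlying signal", round(np.sin(2.5), 2))
```
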
Genetic variability is unrelated to growth and parasite infestation in natural populations of the European eel (Anguilla anguilla). MOLECULAR ECOLOGY, Issue 22 2009. J. M. PUJOLAR.
Abstract: Positive correlations between individual genetic heterozygosity and fitness-related traits (HFCs) have been observed in organisms as diverse as plants, marine bivalves, fish or mammals. HFCs are not universal and the strength and stability of HFCs seem to be variable across species, populations and ages. We analysed the relationship between individual genetic variability and two different estimators of fitness in natural samples of European eel, growth rate (using back-calculated length-at-age 1, 2 and 3) and parasite infestation by the swimbladder nematode Anguillicola crassus. Despite using a large data set of 22 expressed sequence tags-derived microsatellite loci and a large sample size of 346 individuals, no heterozygote advantage was observed in terms of growth rate or parasite load. The lack of association was evidenced by (i) nonsignificant global HFCs, (ii) a Multivariate General Linear Model showing no effect of heterozygosity on fitness components, (iii) single-locus analysis showing a lower number of significant tests than the expected false discovery rate, (iv) sign tests showing only a significant departure from expectations at one component, and (v) a random distribution of significant single-locus HFCs that was not consistent across fitness components or sampling sites. This contrasts with the positive association observed in farmed eels in a previous study using allozymes, which can be explained by the nature of the markers used, with the allozyme study including many loci involved in metabolic energy pathways, while the expressed sequence tags-linked microsatellites might be located in genes or in the proximity of genes uncoupled with metabolism/growth. [source]

Characterization of population structure from the mitochondrial DNA vis-à-vis language and geography in Papua New Guinea. AMERICAN JOURNAL OF PHYSICAL ANTHROPOLOGY, Issue 4 2010. Esther J. Lee.
Abstract: Situated along a corridor linking the Asian continent with the outer islands of the Pacific, Papua New Guinea has long played a key role in understanding the initial peopling of Oceania. The vast diversity in languages and unique geographical environments in the region have been central to the debates on human migration and the degree of interaction between the Pleistocene settlers and newer migrants. To better understand the role of Papua New Guinea in shaping the region's prehistory, we sequenced the mitochondrial DNA (mtDNA) control region of three populations, a total of 94 individuals, located in the East Sepik Province of Papua New Guinea. We analyzed these samples with a large data set of Oceania populations to examine the role of geography and language in shaping population structure within New Guinea and between the region and Island Melanesia. Our results from median-joining networks, star-cluster age estimates, and population genetic analyses show that while highland New Guinea populations seem to be the oldest settlers, there has been significant gene flow within New Guinea with little influence from geography or language. The highest genetic division is between Papuan speakers of New Guinea versus East Papuan speakers located outside of mainland New Guinea. Our study supports the weak language barriers to genetic structuring among populations in close contact and highlights the complexity of understanding the genetic histories of Papua New Guinea in association with language and geography. Am J Phys Anthropol 142:613-624, 2010. © 2010 Wiley-Liss, Inc. [source]

The Effect of Problem Severity, Managerial and Organizational Capacity, and Agency Structure on Intergovernmental Collaboration: Evidence from Local Emergency Management. PUBLIC ADMINISTRATION REVIEW, Issue 2 2010. Michael McGuire.
Like most public managers nowadays, local emergency managers operate within complex, uncertain environments. Rapid changes in the scope and severity of the issues increase the extent of intergovernmental collaboration necessary to address such challenges. Using a large data set of county emergency management agency directors, variations in intergovernmental collaboration reflect influences from problem severity, managerial capacity, and structural factors. The results demonstrate that public managers who perceive problems as severe, possess specific managerial skills, lead high-capacity organizations, and operate in less complex agency structures collaborate more often and more effectively across governmental boundaries. [source]

Estimation of breeding values from large-sized routine carcass data in Japanese Black cattle using Bayesian analysis. ANIMAL SCIENCE JOURNAL, Issue 6 2009. Aisaku ARAKAWA.
Abstract: Volumes of official data sets have been increasing rapidly in the genetic evaluation using the Japanese Black routine carcass field data. Therefore, an alternative approach with a smaller memory requirement than the current one, which uses restricted maximum likelihood (REML) and empirical best linear unbiased prediction (EBLUP), is desired. This study applied a Bayesian analysis using Gibbs sampling (GS) to a large data set of the routine carcass field data and practically verified its validity in the estimation of breeding values. A Bayesian analysis like REML-EBLUP was implemented, and the posterior means were calculated using every 10th of 90 000 samples, after the first 10 000 samples were discarded. Moment and rank correlations between breeding values estimated by GS and REML-EBLUP were very close to one, and the linear regression coefficients and the intercepts of the GS on the REML-EBLUP estimates were substantially one and zero, respectively, showing a very good agreement between breeding value estimation by the current GS and the REML-EBLUP. The current GS required only one-sixth of the memory space needed by REML-EBLUP. It is confirmed that the current GS approach with relatively small memory requirement is valid as a genetic evaluation procedure using large routine carcass data. [source]

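A much-simplified sketch of breeding-value estimation by Gibbs sampling is given below: a one-way sire model with flat priors rather than the full animal model with pedigree relationships used for the routine carcass data, but it shows the mechanics of the sampler and the burn-in and thinning scheme (every 10th sample kept after an initial batch is discarded) that the abstract describes. All parameter values and the simulated records are assumptions.

```python
import numpy as np

rng = np.random.default_rng(5)

# simulate records for q sires with known variance components
q, n_per = 50, 20
true_su2, true_se2 = 4.0, 16.0
sire = np.repeat(np.arange(q), n_per)
u_true = rng.normal(0, np.sqrt(true_su2), q)
y = 300 + u_true[sire] + rng.normal(0, np.sqrt(true_se2), sire.size)

mu, u, su2, se2 = y.mean(), np.zeros(q), 1.0, 1.0
keep_u, keep_var = [], []
n_iter, burn_in, thin = 10_000, 1_000, 10
counts = np.bincount(sire, minlength=q)

for it in range(n_iter):
    # sire (breeding-value-like) effects: normal full conditional
    prec = counts / se2 + 1.0 / su2
    means = np.bincount(sire, weights=y - mu, minlength=q) / se2 / prec
    u = rng.normal(means, np.sqrt(1.0 / prec))
    # overall mean: normal full conditional (flat prior)
    resid = y - u[sire]
    mu = rng.normal(resid.mean(), np.sqrt(se2 / y.size))
    # variance components: scaled inverse chi-square full conditionals
    se2 = np.sum((y - mu - u[sire]) ** 2) / rng.chisquare(y.size - 2)
    su2 = np.sum(u ** 2) / rng.chisquare(q - 2)
    if it >= burn_in and it % thin == 0:
        keep_u.append(u.copy())
        keep_var.append((su2, se2))

post_u = np.mean(keep_u, axis=0)      # posterior-mean "breeding values"
print("posterior mean variance components:", np.round(np.mean(keep_var, axis=0), 2))
print("corr(true, estimated sire effects):", round(np.corrcoef(u_true, post_u)[0, 1], 2))
```
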
Angiosperm phylogeny inferred from 18S rDNA, rbcL, and atpB sequences. BOTANICAL JOURNAL OF THE LINNEAN SOCIETY, Issue 4 2000. DOUGLAS E. SOLTIS.
A phylogenetic analysis of a combined data set for 560 angiosperms and seven outgroups based on three genes, 18S rDNA (1855 bp), rbcL (1428 bp), and atpB (1450 bp), representing a total of 4733 bp, is presented. Parsimony analysis was expedited by use of a new computer program, the RATCHET. Parsimony jackknifing was performed to assess the support of clades. The combination of three data sets for numerous species has resulted in the most highly resolved and strongly supported topology yet obtained for angiosperms. In contrast to previous analyses based on single genes, much of the spine of the tree and most of the larger clades receive jackknife support ≥50%. Some of the noneudicots form a grade followed by a strongly supported eudicot clade. The early-branching angiosperms are Amborellaceae, Nymphaeaceae, and a clade of Austrobaileyaceae, Illiciaceae, and Schisandraceae. The remaining noneudicots, except Ceratophyllaceae, form a weakly supported core eumagnoliid clade comprising six well-supported subclades: Chloranthaceae, monocots, Winteraceae/Canellaceae, Piperales, Laurales, and Magnoliales. Ceratophyllaceae are sister to the eudicots. Within the well-supported eudicot clade, the early-diverging eudicots (e.g. Proteales, Ranunculales, Trochodendraceae, Sabiaceae) form a grade, followed by the core eudicots, the monophyly of which is also strongly supported. The core eudicots comprise six well-supported subclades: (1) Berberidopsidaceae/Aextoxicaceae; (2) Myrothamnaceae/Gunneraceae; (3) Saxifragales, which are the sister to Vitaceae (including Leea) plus a strongly supported eurosid clade; (4) Santalales; (5) Caryophyllales, to which Dilleniaceae are sister; and (6) an asterid clade. The relationships among these six subclades of core eudicots do not receive strong support. This large data set has also helped place a number of enigmatic angiosperm families, including Podostemaceae, Aphloiaceae, and Ixerbaceae. This analysis further illustrates the tractability of large data sets and supports a recent, phylogenetically based, ordinal-level reclassification of the angiosperms based largely, but not exclusively, on molecular (DNA sequence) data. [source]

Model-based design of chemotherapeutic regimens that account for heterogeneity in leucopoenia. BRITISH JOURNAL OF HAEMATOLOGY, Issue 6 2006. Markus Scholz.
Summary: Patients treated with multicycle chemotherapy can exhibit large interindividual heterogeneity of haematotoxicity. We describe how a biomathematical model of human granulopoiesis can be used to design risk-adapted dose-dense chemotherapies, leading to more similar leucopoenias in the population. Calculations were performed on a large data set for cyclophosphamide/doxorubicin/vincristine/prednisone (CHOP)-like chemotherapies for aggressive non-Hodgkin lymphoma. Age, gender, Eastern Cooperative Oncology Group performance status, lactate dehydrogenase and the degree of leucopoenia within the first therapy cycle were used to stratify patients into groups with different expected severity of leucopoenia. We estimated risk-specific bone marrow toxicities depending on the drug doses administered. These toxicities were used to derive risk-adapted therapy schedules. We determined different doses of cyclophosphamide and additional etoposide for patients treated with CHOP-14. Alternatively, the model predicted that further reductions of cycle duration were feasible in groups with low toxicity. We also used the model to identify appropriate granulocyte colony-stimulating factor (G-CSF) schedules. In conclusion, we present a method to estimate the potential of risk-specific dose adaptation of different cytotoxic drugs in order to design chemotherapy protocols that result in decreased diversity of leucopoenia between patients, to develop dose-escalation strategies in cases of low leucopoenic reaction and to determine optimal G-CSF support. [source]

Colour constancy based on texture similarity for natural images. COLORATION TECHNOLOGY, Issue 6 2009. Bing Li.
Colour constancy is a classical problem in computer vision. Although there are a number of colour constancy algorithms based on different assumptions, none of them can be considered as universal. How to select or combine these available methods for different natural image characteristics is an important problem. Recent studies have shown that the texture feature is an important factor to consider when selecting the best colour constancy algorithm for a certain image. In this paper, Weibull parameterisation is used to identify the texture characteristics of colour images. According to the texture similarity, the best colour constancy method (or best combination of methods) is selected out for a specific image. The experiments were carried out on a large data set and the results show that this new approach outperforms current state-of-the-art single algorithms, as well as some combined algorithms. [source]

An Exploratory Technique for Coherent Visualization of Time-varying Volume Data. COMPUTER GRAPHICS FORUM, Issue 3 2010. A. Tikhonova.
Abstract: The selection of an appropriate global transfer function is essential for visualizing time-varying simulation data. This is especially challenging when the global data range is not known in advance, as is often the case in remote and in-situ visualization settings. Since the data range may vary dramatically as the simulation progresses, volume rendering using local transfer functions may not be coherent for all time steps. We present an exploratory technique that enables coherent classification of time-varying volume data. Unlike previous approaches, which require pre-processing of all time steps, our approach lets the user explore the transfer function space without accessing the original 3D data. This is useful for interactive visualization, and absolutely essential for in-situ visualization, where the entire simulation data range is not known in advance. Our approach generates a compact representation of each time step at rendering time in the form of ray attenuation functions, which are used for subsequent operations on the opacity and color mappings. The presented approach offers interactive exploration of time-varying simulation data that alleviates the cost associated with reloading and caching large data sets. [source]

Streaming Surface Reconstruction Using Wavelets. COMPUTER GRAPHICS FORUM, Issue 5 2008. J. Manson.
Abstract: We present a streaming method for reconstructing surfaces from large data sets generated by a laser range scanner using wavelets. Wavelets provide a localized, multiresolution representation of functions and this makes them ideal candidates for streaming surface reconstruction algorithms. We show how wavelets can be used to reconstruct the indicator function of a shape from a cloud of points with associated normals. Our method proceeds in several steps. We first compute a low-resolution approximation of the indicator function using an octree followed by a second pass that incrementally adds fine resolution details. The indicator function is then smoothed using a modified octree convolution step and contoured to produce the final surface. Due to the local, multiresolution nature of wavelets, our approach results in an algorithm over 10 times faster than previous methods and can process extremely large data sets in the order of several hundred million points in only an hour. [source]

A Screen Space Quality Method for Data Abstraction. COMPUTER GRAPHICS FORUM, Issue 3 2008. J. Johansson.
Abstract: The rendering of large data sets can result in cluttered displays and non-interactive update rates, leading to time consuming analyses. A straightforward solution is to reduce the number of items, thereby producing an abstraction of the data set. For the visual analysis to remain accurate, the graphical representation of the abstraction must preserve the significant features present in the original data. This paper presents a screen space quality method, based on distance transforms, that measures the visual quality of a data abstraction. This screen space measure is shown to better capture significant visual structures in data, compared with data space measures. The presented method is implemented on the GPU, allowing interactive creation of high quality graphical representations of multivariate data sets containing tens of thousands of items. [source]

Interactive Visualization with Programmable Graphics Hardware. COMPUTER GRAPHICS FORUM, Issue 3 2002. Thomas Ertl.
One of the main scientific goals of visualization is the development of algorithms and appropriate data models which facilitate interactive visual analysis and direct manipulation of the increasingly large data sets which result from simulations running on massive parallel computer systems, from measurements employing fast high-resolution sensors, or from large databases and hierarchical information spaces. This task can only be achieved with the optimization of all stages of the visualization pipeline: filtering, compression, and feature extraction of the raw data sets, adaptive visualization mappings which allow the users to choose between speed and accuracy, and exploiting new graphics hardware features for fast and high-quality rendering. The recent introduction of advanced programmability in widely available graphics hardware has already led to impressive progress in the area of volume visualization. However, besides the acceleration of the final rendering, flexible graphics hardware is increasingly being used also for the mapping and filtering stages of the visualization pipeline, thus giving rise to new levels of interactivity in visualization applications. The talk will present recent results of applying programmable graphics hardware in various visualization algorithms covering volume data, flow data, terrains, NPR rendering, and distributed and remote applications. [source]

Managing very large distributed data sets on a data grid. CONCURRENCY AND COMPUTATION: PRACTICE & EXPERIENCE, Issue 11 2010. Miguel Branco.
Abstract: In this work we address the management of very large data sets, which need to be stored and processed across many computing sites. The motivation for our work is the ATLAS experiment for the Large Hadron Collider (LHC), where the authors have been involved in the development of the data management middleware. This middleware, called DQ2, has been used for the last several years by the ATLAS experiment for shipping petabytes of data to research centres and universities worldwide. We describe our experience in developing and deploying DQ2 on the Worldwide LHC computing Grid, a production Grid infrastructure formed of hundreds of computing sites. From this operational experience, we have identified an important degree of uncertainty that underlies the behaviour of large Grid infrastructures. This uncertainty is subjected to a detailed analysis, leading us to present novel modelling and simulation techniques for Data Grids. In addition, we discuss what we perceive as practical limits to the development of data distribution algorithms for Data Grids given the underlying infrastructure uncertainty, and propose future research directions. Copyright © 2009 John Wiley & Sons, Ltd. [source]

Measuring and modelling the performance of a parallel ODMG compliant object database server. CONCURRENCY AND COMPUTATION: PRACTICE & EXPERIENCE, Issue 1 2006. Sandra de F. Mendes Sampaio.
Abstract: Object database management systems (ODBMSs) are now established as the database management technology of choice for a range of challenging data intensive applications. Furthermore, the applications associated with object databases typically have stringent performance requirements, and some are associated with very large data sets. An important feature for the performance of object databases is the speed at which relationships can be explored. In queries, this depends on the effectiveness of different join algorithms into which queries that follow relationships can be compiled. This paper presents a performance evaluation of the Polar parallel object database system, focusing in particular on the performance of parallel join algorithms. Polar is a parallel, shared-nothing implementation of the Object Database Management Group (ODMG) standard for object databases. The paper presents an empirical evaluation of queries expressed in the ODMG Query Language (OQL), as well as a cost model for the parallel algebra that is used to evaluate OQL queries. The cost model is validated against the empirical results for a collection of queries using four different join algorithms, one that is value based and three that are pointer based. Copyright © 2005 John Wiley & Sons, Ltd. [source]

A new dimension in combining data? The use of morphology and phylogenomic data in metazoan systematics. ACTA ZOOLOGICA, Issue 1 2010. G. Giribet.
Abstract: Giribet, G. 2010. A new dimension in combining data? The use of morphology and phylogenomic data in metazoan systematics. Acta Zoologica (Stockholm) 91: 11-19. Animal phylogenies have been traditionally inferred by using the character state information derived from the observation of a diverse array of morphological and anatomical features, but the incorporation of molecular data into the toolkit of phylogenetic characters has shifted drastically the way researchers infer phylogenies. A main reason for this is the ease with which molecular data can be obtained, compared to, e.g., traditional histological and microscopical techniques. Researchers now routinely use genomic data for reconstructing relationships among animal phyla (using whole genomes or Expressed Sequence Tags) but the amount of morphological data available to study the same phylogenetic patterns has not grown accordingly. Given the disparity between the amounts of molecular and morphological data, some authors have questioned entire morphological programs. In this review I discuss issues related to the combinability of genomic and morphological data, the informativeness of each set of characters, and conclude with a discussion of how morphology could be made scalable by utilizing new techniques that allow for non-intrusive examination of large amounts of preserved museum specimens. Morphology should therefore remain a strong field in evolutionary and comparative biology, as it continues to provide information for inferring phylogenetic patterns, is an important complement for the patterns derived from the molecular data, and it is the common nexus that allows studying fossil taxa with large data sets of molecular data. [source]