Small Data Sets (small + data_set)

Selected Abstracts


Decision-making method using a visual approach for cluster analysis problems; indicative classification algorithms and grouping scope

EXPERT SYSTEMS, Issue 3 2007
Ran M. Bittmann
Abstract: Current methods for classifying samples into a fixed number of clusters (i.e. supervised cluster analysis), as well as unsupervised cluster analysis, are limited in their ability to support 'cross-algorithms' analysis. It is well known that each cluster analysis algorithm yields different results (i.e. a different classification); even running the same algorithm with two different similarity measures commonly yields different results. Researchers usually choose the preferred algorithm and similarity measure according to the analysis objectives and data set features, but they have neither a formal method nor a tool that supports comparison and evaluation of the different classifications that result from the diverse algorithms. The current research and prototype decision support system provide a methodology based upon formal quantitative measures and a visual approach, enabling presentation, comparison and evaluation of multiple classification suggestions resulting from diverse algorithms. This methodology and tool were used in two basic scenarios: (I) a classification problem in which a 'true result' is known, using the Fisher iris data set; (II) a classification problem in which there is no 'true result' to compare with. In this case, we used a small data set from a user profile study (a study that tries to relate users to a set of stereotypes based on sociological aspects and interests). In each scenario, ten diverse algorithms were executed. The suggested methodology and decision support system produced a cross-algorithms presentation; all ten resultant classifications are presented together in a 'Tetris-like' format. Each column represents a specific classification algorithm, each line represents a specific sample, and formal quantitative measures analyse the 'Tetris blocks', arranging them according to their best structures, i.e. the best classification. [source]
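
A minimal sketch of the cross-algorithm idea, assuming scikit-learn and the Fisher iris data set; the paper's decision support system and its formal measures are not available here, so the adjusted Rand index (against the known species labels and between algorithm pairs) stands in for them:

```python
# Sketch: run several clustering algorithms on the Fisher iris data and compare
# their classifications side by side (the "columns" of a Tetris-like view).
# The algorithm choices below are illustrative, not the paper's ten algorithms.
from sklearn import datasets
from sklearn.cluster import AgglomerativeClustering, KMeans, SpectralClustering
from sklearn.metrics import adjusted_rand_score

iris = datasets.load_iris()
X, true_labels = iris.data, iris.target

algorithms = {
    "kmeans": KMeans(n_clusters=3, n_init=10, random_state=0),
    "ward": AgglomerativeClustering(n_clusters=3, linkage="ward"),
    "average": AgglomerativeClustering(n_clusters=3, linkage="average"),
    "spectral": SpectralClustering(n_clusters=3, random_state=0),
}

# One "column" per algorithm: its cluster label for every sample.
labels = {name: algo.fit_predict(X) for name, algo in algorithms.items()}

# Scenario (I): a true result is known, so each classification can be scored.
for name, lab in labels.items():
    print(f"{name:>9}: ARI vs. true species = {adjusted_rand_score(true_labels, lab):.3f}")

# Cross-algorithm agreement, usable even when no true result exists (scenario II).
names = list(labels)
for i in range(len(names)):
    for j in range(i + 1, len(names)):
        ari = adjusted_rand_score(labels[names[i]], labels[names[j]])
        print(f"{names[i]:>9} vs {names[j]:<9}: ARI = {ari:.3f}")
```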


Estimation of an optimal mixed-phase inverse filter

GEOPHYSICAL PROSPECTING, Issue 4 2000
Bjørn Ursin
Inverse filtering is applied to seismic data to remove the effect of the wavelet and to obtain an estimate of the reflectivity series. In many cases the wavelet is not known, and only an estimate of its autocorrelation function (ACF) can be computed. Solving the Yule-Walker equations gives the inverse filter which corresponds to a minimum-delay wavelet. When the wavelet is mixed delay, this inverse filter produces a poor result. By solving the extended Yule-Walker equations, with the ACF of a given lag on the main diagonal of the filter equations, it is possible to decompose the inverse filter into a finite-length filter convolved with an infinite-length filter. In a previous paper we proposed a mixed-delay inverse filter where the finite-length filter is maximum delay and the infinite-length filter is minimum delay. Here, we refine this technique by analysing the roots of the Z-transform polynomial of the finite-length filter. By varying the number of roots which are placed inside the unit circle of the mixed-delay inverse filter, at most 2^m different filters are obtained, where m is the number of roots. Applying each filter to a small data set (say a CMP gather), we choose the optimal filter to be the one for which the output has the largest Lp-norm, with p = 5. This is done for increasing values of the lag to obtain a final optimal filter. From this optimal filter it is easy to construct the inverse wavelet, which may be used as an estimate of the seismic wavelet. The new procedure has been applied to a synthetic wavelet and to an airgun wavelet to test its performance, and also to verify that the reconstructed wavelet is close to the original wavelet. The algorithm has also been applied to prestack marine seismic data, resulting in an improved stacked section compared with the one obtained by using a minimum-delay filter. [source]
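
A rough numerical sketch of the selection step, assuming NumPy/SciPy: a minimum-delay inverse filter is obtained from the ACF via a Toeplitz (Yule-Walker-type) system, and candidate filters are ranked by the L5-norm of their normalised output. The wavelet, the filter lengths and the L2-normalisation are illustrative assumptions; the extended Yule-Walker decomposition and the root analysis are not reproduced here:

```python
# Sketch: spiking-deconvolution inverse filters from a wavelet ACF, ranked by an
# L5-norm criterion on a small data set (here a single synthetic trace).
import numpy as np
from scipy.linalg import solve_toeplitz
from scipy.signal import lfilter

rng = np.random.default_rng(0)

# Toy mixed-delay wavelet and a sparse reflectivity series (illustration only).
wavelet = np.array([0.4, 1.0, -0.7, 0.3, -0.1])
reflectivity = rng.laplace(scale=0.2, size=500)
trace = np.convolve(reflectivity, wavelet)[: reflectivity.size]

def sample_acf(x, maxlag):
    """Sample autocorrelation function of x up to maxlag."""
    full = np.correlate(x, x, mode="full")
    return full[full.size // 2 : full.size // 2 + maxlag + 1]

def inverse_filter(acf, length):
    """Minimum-delay spiking filter: solve the Toeplitz system R f = (1, 0, ..., 0)."""
    rhs = np.zeros(length)
    rhs[0] = 1.0
    return solve_toeplitz(acf[:length], rhs)

def l5_of_normalised(y, p=5):
    """L5-norm of the L2-normalised output, so the criterion measures spikiness."""
    y = y / np.linalg.norm(y)
    return (np.abs(y) ** p).sum() ** (1.0 / p)

# Candidate filters of different lengths stand in for the paper's family of
# mixed-delay filters generated by moving roots across the unit circle.
acf = sample_acf(trace, maxlag=40)
candidates = {L: inverse_filter(acf, L) for L in (10, 20, 30, 40)}
scores = {L: l5_of_normalised(lfilter(f, [1.0], trace)) for L, f in candidates.items()}

best = max(scores, key=scores.get)
print({L: round(s, 4) for L, s in scores.items()})
print("selected filter length:", best)
```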


Dialogic inquiry in life science conversations of family groups in a museum

JOURNAL OF RESEARCH IN SCIENCE TEACHING, Issue 2 2003
Doris Ash
This research illustrates the efficacy of a new approach for collecting and analyzing family conversational data at museums and other informal settings. This article offers a detailed examination of a small data set (three families) that informs a larger body of work that focuses on conversation as methodology. The dialogic content of this work centers on biological themes, specifically adaptation. The biological principle becomes visible when families talk about survival strategies such as breeding or protection from predators. These themes arise from both the family members and the museum exhibit. This study also analyzes the inquiry skills families use as they make sense of science content. I assume that children and adults offer different interest areas or expertise for dialogic negotiation and that family members use inquiry skills in dialogue to explore matters of importance. This analysis offers educators methodological tools for investigating families' scientific sense-making in informal settings. © 2003 Wiley Periodicals, Inc. J Res Sci Teach 40: 138–162, 2003 [source]


Measuring beta-diversity from taxonomic similarity

JOURNAL OF VEGETATION SCIENCE, Issue 6 2007
Giovanni Bacaro
Abstract Question: The utility of beta (β-)diversity measures that incorporate information about the degree of taxonomic (dis)similarity between species plots is becoming increasingly recognized. In this framework, the question for this study is: can we define an ecologically meaningful index of β-diversity that, besides indicating simple species turnover, is able to account for taxonomic similarity amongst species in plots? Methods: First, the properties of existing taxonomic similarity measures are briefly reviewed. Next, a new measure of plot-to-plot taxonomic similarity is presented that is based on the maximal common subgraph of two taxonomic trees. The proposed measure is computed from species presences and absences and includes information about the degree of higher-level taxonomic similarity between species plots. The performance of the proposed measure with respect to existing coefficients of taxonomic similarity and the coefficient of Jaccard is discussed using a small data set of heath plant communities. Finally, a method to quantify β-diversity from taxonomic dissimilarities is discussed. Results: The proposed measure of taxonomic β-diversity incorporates not only species richness, but also information about the degree of higher-order taxonomic structure between species plots. In this view, it comes closer to a modern notion of biological diversity than more traditional measures of β-diversity. Regression analysis between the new coefficient and existing measures of taxonomic similarity shows an evident nonlinearity between the coefficients. This nonlinearity demonstrates that the new coefficient measures similarity in a conceptually different way from previous indices. Also, in good agreement with the findings of previous authors, the regression between the new index and the Jaccard coefficient of similarity shows that more than 80% of the variance of the former is explained by the community structure at the species level, while only the residual variance is explained by differences in the higher-order taxonomic structure of the species plots. This means that a genuine taxonomic approach to the quantification of plot-to-plot similarity is only needed if we are interested in the residual variation of the system that is related to the higher-order taxonomic structure of a pair of species plots. [source]
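
A toy sketch of the contrast being drawn, assuming a presence/absence species list per plot and a hypothetical genus/family lookup; the taxonomy-aware index below simply credits shared higher taxa and is not the authors' maximal-common-subgraph coefficient:

```python
# Sketch: Jaccard similarity vs. a toy taxonomy-aware similarity between two plots.
# The taxonomy-aware index also credits species that differ but share a genus/family.
plot_a = {"Calluna vulgaris", "Erica cinerea", "Molinia caerulea"}
plot_b = {"Erica tetralix", "Erica cinerea", "Festuca ovina"}

# Hypothetical species -> (genus, family) lookup, just for this example.
taxonomy = {
    "Calluna vulgaris": ("Calluna", "Ericaceae"),
    "Erica cinerea":    ("Erica",   "Ericaceae"),
    "Erica tetralix":   ("Erica",   "Ericaceae"),
    "Molinia caerulea": ("Molinia", "Poaceae"),
    "Festuca ovina":    ("Festuca", "Poaceae"),
}

def jaccard(a, b):
    return len(a & b) / len(a | b)

def taxonomic_similarity(a, b, taxonomy, weights=(1.0, 0.5, 0.25)):
    """Weighted overlap at species, genus and family level (toy index)."""
    w_sp, w_gen, w_fam = weights
    genera = lambda s: {taxonomy[x][0] for x in s}
    families = lambda s: {taxonomy[x][1] for x in s}
    num = (w_sp * len(a & b)
           + w_gen * len(genera(a) & genera(b))
           + w_fam * len(families(a) & families(b)))
    den = (w_sp * len(a | b)
           + w_gen * len(genera(a) | genera(b))
           + w_fam * len(families(a) | families(b)))
    return num / den

print("Jaccard:        ", round(jaccard(plot_a, plot_b), 3))
print("taxonomy-aware: ", round(taxonomic_similarity(plot_a, plot_b, taxonomy), 3))
# A beta-diversity (dissimilarity) value is then 1 - similarity for either index.
```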


Application of the Levenshtein Distance Metric for the Construction of Longitudinal Data Files

EDUCATIONAL MEASUREMENT: ISSUES AND PRACTICE, Issue 2 2010
Harold C. Doran
The analysis of longitudinal data in education is becoming more prevalent given the nature of the testing systems constructed for the No Child Left Behind Act (NCLB). However, constructing the longitudinal data files remains a significant challenge. Students move into new schools, but in many cases the unique identifiers (IDs) that should remain constant for each student change. As a result, different students frequently share the same ID, and merging records for an ID that is erroneously assigned to different students clearly becomes problematic. In small data sets, quality assurance of the merge can proceed through human review of the data to ensure all merged records are properly joined. However, in data sets with hundreds of thousands of cases, quality assurance via human review is impossible. While the record linkage literature has many applications in other disciplines, the educational measurement literature lacks formal protocols that can be used for quality assurance of longitudinal data files. This article presents an empirical quality assurance procedure that may be used to verify the integrity of the merges performed for longitudinal analysis. We also discuss possible extensions that would permit merges to occur even when unique identifiers are not available. [source]
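
A minimal sketch of the metric itself, assuming Python; the article's quality-assurance protocol is not reproduced, but an edit distance like this is the piece used to flag records whose identifying fields disagree too much to belong to one student (the names and IDs below are hypothetical):

```python
# Sketch: Levenshtein (edit) distance and a simple merge check for student records.
def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

# Hypothetical records sharing the same student ID in successive yearly files.
year1 = {"id": "0123456", "name": "JOHNSON, MARIA"}
year2 = {"id": "0123456", "name": "JOHNSEN, MARIA"}   # plausibly the same student
year3 = {"id": "0123456", "name": "WILLIAMS, DEREK"}  # likely a re-used ID

def plausible_match(r1, r2, max_distance=3):
    return levenshtein(r1["name"], r2["name"]) <= max_distance

print(plausible_match(year1, year2))  # True  -> accept the merge
print(plausible_match(year1, year3))  # False -> flag for review
```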


Estimating common trends in multivariate time series using dynamic factor analysis

ENVIRONMETRICS, Issue 7 2003
A. F. Zuur
Abstract This article discusses dynamic factor analysis, a technique for estimating common trends in multivariate time series. Unlike more common time series techniques such as spectral analysis and ARIMA models, dynamic factor analysis can analyse short, non-stationary time series containing missing values. Typically, the parameters in dynamic factor analysis are estimated by direct optimization, which means that only small data sets can be analysed if computing time is not to become prohibitively long and the chances of obtaining sub-optimal estimates are to be avoided. This article shows how the parameters of dynamic factor analysis can be estimated using the EM algorithm, allowing larger data sets to be analysed. The technique is illustrated on a marine environmental data set. Copyright © 2003 John Wiley & Sons, Ltd. [source]
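
A minimal sketch, assuming statsmodels as a stand-in: its DynamicFactor model is cast in state-space form and fitted by numerical maximum likelihood rather than the article's EM algorithm, but it illustrates extracting a common trend from short multivariate series containing missing values:

```python
# Sketch: extract one common trend from a short multivariate series with gaps,
# using a state-space dynamic factor model as a stand-in for the article's EM fit.
import numpy as np
import pandas as pd
from statsmodels.tsa.statespace.dynamic_factor import DynamicFactor

rng = np.random.default_rng(1)

# Simulated data (illustration only): 4 short series driven by one slow common trend.
n_obs = 60
trend = np.cumsum(rng.normal(scale=0.3, size=n_obs))
loadings = np.array([1.0, 0.8, -0.5, 0.6])
data = trend[:, None] * loadings + rng.normal(scale=0.5, size=(n_obs, 4))

df = pd.DataFrame(data, columns=["s1", "s2", "s3", "s4"])
df.iloc[10:15, 2] = np.nan          # missing values are handled by the Kalman filter

model = DynamicFactor(df, k_factors=1, factor_order=1)
result = model.fit(disp=False)      # maximum likelihood via numerical optimization

common_trend = result.factors.smoothed   # smoothed estimate of the common trend
print(result.summary())
```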


Selective sampling for approximate clustering of very large data sets

INTERNATIONAL JOURNAL OF INTELLIGENT SYSTEMS, Issue 3 2008
Liang Wang
A key challenge in pattern recognition is how to scale the computational efficiency of clustering algorithms on large data sets. The extension of non-Euclidean relational fuzzy c-means (NERF) clustering to very large (VL = unloadable) relational data is called the extended NERF (eNERF) clustering algorithm, which comprises four phases: (i) finding distinguished features that monitor progressive sampling; (ii) progressively sampling from an N × N relational matrix R_N to obtain an n × n sample matrix R_n; (iii) clustering R_n with literal NERF; and (iv) extending the clusters in R_n to the remainder of the relational data. Previously published examples on several fairly small data sets suggest that eNERF is feasible for truly large data sets. However, it seems that phases (i) and (ii), i.e., finding R_n, are not very practical because the sample size n often turns out to be roughly 50% of N, and this over-sampling defeats the whole purpose of eNERF. In this paper, we examine the performance of the sampling scheme of eNERF with respect to different parameters. We propose a modified sampling scheme for use with eNERF that combines simple random sampling with (parts of) the sampling procedures used by eNERF and by a related algorithm, sVAT (scalable visual assessment of clustering tendency). We demonstrate that our modified sampling scheme can eliminate the over-sampling of the original progressive sampling scheme, thus enabling the processing of truly VL data. Numerical experiments on a distance matrix of a set of 3,000,000 vectors drawn from a mixture of 5 bivariate normal distributions demonstrate the feasibility and effectiveness of the proposed sampling method. We also find that actually running eNERF on a data set of this size is very costly in terms of computation time. Thus, our results demonstrate that further modification of eNERF, especially the extension stage, will be needed before it is truly practical for VL data. © 2008 Wiley Periodicals, Inc. [source]
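
A small sketch of the kind of sampling involved, assuming NumPy: maximin selection of distinguished objects (in the spirit of sVAT) combined with simple random sampling within the groups they induce. This mirrors the ingredients of the modified scheme without reproducing eNERF or its acceleration phases:

```python
# Sketch: pick k "distinguished" objects by maximin selection on a distance matrix,
# assign every object to its nearest distinguished object, then draw a simple random
# sample from each group -- a small stand-in for the modified eNERF/sVAT sampling.
import numpy as np

rng = np.random.default_rng(2)

# Toy data: 3 Gaussian clusters in the plane, then their full distance matrix D.
points = np.vstack([rng.normal(loc=c, scale=0.5, size=(200, 2))
                    for c in ((0, 0), (5, 0), (0, 5))])
D = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)

def maximin_indices(D, k):
    """Greedy farthest-point (maximin) selection of k distinguished objects."""
    chosen = [int(np.argmax(D.sum(axis=1)))]          # start from an outlying object
    min_dist = D[chosen[0]].copy()
    while len(chosen) < k:
        nxt = int(np.argmax(min_dist))
        chosen.append(nxt)
        min_dist = np.minimum(min_dist, D[nxt])
    return chosen

def stratified_sample(D, k=3, fraction=0.1, rng=rng):
    distinguished = maximin_indices(D, k)
    nearest = np.argmin(D[:, distinguished], axis=1)  # group = nearest distinguished object
    sample = []
    for g in range(k):
        members = np.flatnonzero(nearest == g)
        size = max(1, int(fraction * members.size))
        sample.extend(rng.choice(members, size=size, replace=False))
    return np.array(sorted(set(sample)))

idx = stratified_sample(D, k=3, fraction=0.1)
Dn = D[np.ix_(idx, idx)]   # the n x n sample matrix handed on to literal NERF
print(f"sampled {idx.size} of {D.shape[0]} objects ({idx.size / D.shape[0]:.0%})")
```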


atetra, a new software program to analyse tetraploid microsatellite data: comparison with tetra and tetrasat

MOLECULAR ECOLOGY RESOURCES, Issue 2 2010
K. VAN PUYVELDE
Abstract Despite the importance of tetraploid species, most population genetic studies deal with diploid ones because of the difficulties of analysing codominant microsatellite data in tetraploid species. We developed a new software program, atetra, which combines the rigorous method of enumeration for small data sets with Monte Carlo simulations for large ones. We discuss the added value of atetra by comparing its precision, stability and calculation time for different population sizes with those obtained with the earlier software programs tetrasat and tetra. The influence of the number of simulations on calculation stability is also investigated. atetra and tetrasat proved to be more precise than tetra, which, however, remains faster. atetra has the same precision as tetrasat, but is much faster, can handle an infinite number of partial heterozygotes and calculates more genetic variables. The more user-friendly interface of atetra also reduces possible mistakes. [source]
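
A toy sketch of the enumeration branch, assuming only the Python standard library: a tetraploid "partial heterozygote" shows fewer than four distinct alleles at a microsatellite locus, so the unknown allele dosages can be enumerated. The Monte Carlo branch and atetra's genetic statistics are not reproduced; the allele sizes are made up for the example:

```python
# Sketch: enumerate every tetraploid genotype consistent with an observed set of
# alleles (the partial-heterozygote ambiguity handled by enumeration for small
# data sets). Allele sizes below are purely illustrative.
from itertools import combinations_with_replacement
from collections import Counter

PLOIDY = 4

def possible_genotypes(observed_alleles):
    """All multisets of 4 alleles that contain every observed allele at least once."""
    observed = set(observed_alleles)
    return [combo
            for combo in combinations_with_replacement(sorted(observed), PLOIDY)
            if set(combo) == observed]

# Example: a locus where the electropherogram shows alleles 152, 156 and 160.
observed = [152, 156, 160]
genotypes = possible_genotypes(observed)
for g in genotypes:
    print(g)
# -> (152, 152, 156, 160), (152, 156, 156, 160), (152, 156, 160, 160)

# Averaging allele counts over the equally weighted genotypes gives one simple
# (naive) way to handle the unknown dosages when the true genotype is ambiguous.
counts = Counter()
for g in genotypes:
    counts.update(g)
expected_dosage = {allele: counts[allele] / len(genotypes) for allele in sorted(set(observed))}
print(expected_dosage)
```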


Bayesian Optimal Design for Phase II Screening Trials

BIOMETRICS, Issue 3 2008
Meichun Ding
Summary Most phase II screening designs available in the literature consider one treatment at a time. Each study is considered in isolation. We propose a more systematic decision-making approach to the phase II screening process. The sequential design allows for more efficiency and greater learning about treatments. The approach incorporates a Bayesian hierarchical model that allows information to be combined across several related studies in a formal way and improves estimation in small data sets by borrowing strength from other treatments. The design incorporates a utility function that includes sampling costs and possible future payoff. Computer simulations show that this method has a high probability of discarding treatments with low success rates and of moving treatments with high success rates to a phase III trial. [source]
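
A compact sketch of the borrowing-strength idea, assuming SciPy and a beta-binomial model whose prior is centred on the pooled response rate; the paper's hierarchical model, utility function and sequential design are richer than this, and the data, prior strength and decision threshold below are illustrative:

```python
# Sketch: hierarchical-flavoured screening of several phase II treatments.
# Each treatment's response rate gets a Beta posterior; the prior is centred on the
# pooled rate across treatments, which is the "borrowing strength" ingredient.
from scipy import stats

# Hypothetical small phase II data: (responses, patients) per treatment.
data = {"A": (2, 15), "B": (6, 14), "C": (1, 16), "D": (5, 13)}

# Empirical-Bayes-style prior: centre a Beta prior on the pooled response rate.
pooled_rate = sum(r for r, n in data.values()) / sum(n for r, n in data.values())
prior_strength = 10.0                              # pseudo-patients; an assumption
a0, b0 = pooled_rate * prior_strength, (1 - pooled_rate) * prior_strength

TARGET = 0.30      # response rate considered worth taking to phase III
THRESHOLD = 0.50   # required posterior probability of exceeding TARGET

for name, (responses, patients) in data.items():
    post = stats.beta(a0 + responses, b0 + patients - responses)
    p_promising = 1.0 - post.cdf(TARGET)
    decision = "advance to phase III" if p_promising > THRESHOLD else "discard"
    print(f"treatment {name}: P(rate > {TARGET}) = {p_promising:.2f} -> {decision}")
```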