Clustering Algorithm


Selected Abstracts

Fast and automated functional classification with MED-SuMo: An application on purine-binding proteins

Olivia Doppelt-Azeroual
Abstract Ligand–protein interactions are essential for biological processes, and precise characterization of protein binding sites is crucial to understanding protein functions. MED-SuMo is a powerful technology for localizing similar local regions on protein surfaces. Its heuristic is based on a 3D representation of macromolecules using specific surface chemical features that associate chemical characteristics with geometrical properties. MED-SMA is a fast, automated method for classifying binding sites: it uses MED-SuMo technology to build a similarity graph and partitions it with the Markov Clustering algorithm. Purine binding sites are well studied as drug targets. Here, purine binding sites of the Protein Data Bank (PDB) are classified. Proteins potentially inhibited or activated through the same mechanism are gathered. Results are analyzed according to PROSITE annotations and to carefully refined functional annotations extracted from the PDB. As expected, binding sites associated with related mechanisms are gathered, for example, the small GTPases. Nevertheless, protein kinases from different kinome families are also found together, for example, the Aurora-A and CDK2 proteins, which are inhibited by the same drugs. Representative examples of different clusters are presented. The effectiveness of the MED-SMA approach is demonstrated, as it gathers binding sites of proteins with similar structure-activity relationships. Moreover, an efficient new protocol associates structures lacking cocrystallized ligands with the purine clusters, enabling those structures to be associated with a specific binding mechanism. Applications of this classification by binding-mode similarity include target-based drug design and prediction of cross-reactivity, and therefore of potential toxic side effects. [source]
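The Markov Clustering step at the heart of MED-SMA alternates "expansion" (matrix powers, spreading flow along paths) and "inflation" (elementwise powers, sharpening strong flows) on a column-stochastic similarity graph until clusters emerge. A minimal pure-Python sketch on a toy graph, not MED-SMA's actual implementation (graph, parameters and threshold are illustrative):

```python
def mcl(adj, expansion=2, inflation=2.0, iters=50):
    """Markov Clustering on a symmetric adjacency/similarity matrix (lists of lists)."""
    n = len(adj)

    def normalize(m):
        # make every column sum to 1 (column-stochastic flow matrix)
        for j in range(n):
            s = sum(m[i][j] for i in range(n))
            for i in range(n):
                m[i][j] /= s
        return m

    def matmul(a, b):
        return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
                for i in range(n)]

    # self-loops keep the underlying random walk well behaved
    m = normalize([[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
                   for i in range(n)])
    for _ in range(iters):
        e = m
        for _ in range(expansion - 1):   # expansion: flow spreads along paths
            e = matmul(e, m)
        m = normalize([[v ** inflation for v in row] for row in e])  # inflation
    # read clusters off the converged matrix: the nonzero columns of each
    # surviving (attractor) row form one cluster
    clusters = set()
    for i in range(n):
        members = frozenset(j for j in range(n) if m[i][j] > 1e-6)
        if members:
            clusters.add(members)
    return clusters
```

On two disconnected triangles the flow settles into two blocks, so two clusters are recovered.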


Silke Jänichen
Case-based object recognition requires a general case of the object to be detected. Real-world applications, such as the recognition of biological objects in images, cannot be solved with a single general case; a case base is necessary to handle the great natural variation in the appearance of these objects. In this paper, we present how to learn a hierarchical case base of general cases. We present our conceptual clustering algorithm, which learns groups of similar cases from a set of acquired structural cases of fungal spores. Thanks to its concept description, it explicitly supplies for each cluster a generalized case and a measure of its degree of generalization. The resulting hierarchical case base is used for applications in the field of case-based object recognition. We present results based on our application for health monitoring of biologically hazardous material. [source]

Using computer vision to simulate the motion of virtual agents

Soraia R. Musse
Abstract In this paper, we propose a new model to simulate the movement of virtual humans based on trajectories captured automatically from filmed video sequences. These trajectories are grouped into similar classes using an unsupervised clustering algorithm, and an extrapolated velocity field is generated for each class. A physically based simulator is then used to animate virtual humans, aiming to reproduce the trajectories fed to the algorithm while avoiding collisions with other agents. The proposed approach provides an automatic way to reproduce the motion of real people in a virtual environment, allowing the user to change the number of simulated agents while keeping the same goals observed in the filmed video. Copyright © 2007 John Wiley & Sons, Ltd. [source]

Algorithm for Spatial Clustering of Pavement Segments

Chientai Yang
This article formulates a new spatial search model for determining appropriate pavement preservation project termini. A spatial clustering algorithm using fuzzy c-means clustering is developed to minimize the rating variation within each cluster (project) of pavement segments while considering minimal project scope (i.e., length) and cost, initial setup cost, and barriers such as bridges. A case study using actual roadway and pavement condition data for fiscal year 2005 on Georgia State Route 10 shows that the proposed algorithm can identify a more appropriate segment clustering scheme than the historical project termini. The benefits of using the developed algorithm are summarized, and recommendations for future research are discussed. [source]
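Fuzzy c-means, the core of the spatial clustering algorithm above, alternates between membership-weighted center updates and inverse-relative-distance membership updates. A minimal one-dimensional sketch (the rating values, fuzzifier and other parameters are illustrative, not the article's case-study data):

```python
import random

def fuzzy_cmeans(xs, c=2, m=2.0, iters=100, seed=0):
    """Fuzzy c-means on 1-D data: returns cluster centers and memberships."""
    rng = random.Random(seed)
    n = len(xs)
    # random initial memberships, each row normalized over the c clusters
    u = []
    for _ in range(n):
        row = [rng.random() for _ in range(c)]
        s = sum(row)
        u.append([v / s for v in row])
    centers = [0.0] * c
    for _ in range(iters):
        # centers: membership-weighted means (fuzzifier m)
        for k in range(c):
            num = sum((u[i][k] ** m) * xs[i] for i in range(n))
            den = sum(u[i][k] ** m for i in range(n))
            centers[k] = num / den
        # memberships: inverse relative distances to the current centers
        for i in range(n):
            d = [abs(xs[i] - centers[k]) + 1e-12 for k in range(c)]
            for k in range(c):
                u[i][k] = 1.0 / sum((d[k] / d[j]) ** (2.0 / (m - 1.0))
                                    for j in range(c))
    return centers, u
```

Run on two well-separated groups of condition ratings, the centers settle near the two group means; the article's algorithm additionally constrains clusters to be spatially contiguous segments, which this sketch omits.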

Sequence-related amplified polymorphism, an effective molecular approach for studying genetic variation in Fasciola spp. of human and animal health significance

Qiao-Yan Li
Abstract In the present study, a recently described molecular approach, sequence-related amplified polymorphism (SRAP), which preferentially amplifies ORFs, was evaluated for studying genetic variation among Fasciola hepatica, Fasciola gigantica and the "intermediate" Fasciola from different host species and geographical locations in mainland China. After ten SRAP primer combinations were evaluated, five were used to amplify 120 Fasciola samples. The number of fragments amplified from Fasciola samples using each primer combination ranged from 12 to 20, with an average of 15 polymorphic bands per primer pair. Fifty-nine main polymorphic bands were observed, ranging in size from 100 to 2,000 bp, and SRAP bands specific to F. hepatica or F. gigantica were observed. SRAP fragments common to F. hepatica and the "intermediate" Fasciola, or common to F. gigantica and the "intermediate" Fasciola, were identified, excised and confirmed by PCR amplification of genomic DNA using primers designed from the sequences of these SRAP fragments. Based on SRAP profiles, the unweighted pair-group method with arithmetic averages (UPGMA) clustering algorithm categorized all of the examined representative Fasciola samples into three groups, representing F. hepatica, the "intermediate" Fasciola, and F. gigantica. These results demonstrate the usefulness of the SRAP technique for revealing genetic variability between F. hepatica, F. gigantica and the "intermediate" Fasciola, and also provide genomic evidence for the existence of the "intermediate" Fasciola between F. hepatica and F. gigantica. The technique provides an alternative, useful tool for genetic characterization and studies of genetic variability in parasites. [source]
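UPGMA, the clustering algorithm applied to the SRAP profiles, repeatedly merges the pair of clusters with the smallest average pairwise distance. A minimal sketch that records the merge order from a precomputed distance matrix (the matrix is a toy example, not SRAP band data):

```python
def upgma(dist):
    """Average-linkage (UPGMA) agglomeration; returns the list of merges."""
    clusters = {i: [i] for i in range(len(dist))}  # cluster id -> leaf indices
    merges = []
    while len(clusters) > 1:
        # closest pair under average linkage: mean of all leaf-to-leaf distances
        a, b = min(
            ((p, q) for p in clusters for q in clusters if p < q),
            key=lambda pq: sum(dist[i][j]
                               for i in clusters[pq[0]]
                               for j in clusters[pq[1]])
                           / (len(clusters[pq[0]]) * len(clusters[pq[1]])))
        merges.append((sorted(clusters[a]), sorted(clusters[b])))
        clusters[a] = clusters[a] + clusters[b]  # absorb b into a
        del clusters[b]
    return merges
```

The first merge always joins the two most similar profiles; a dendrogram (and hence the three Fasciola groups) can be read off the merge sequence.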

Augmentation of a nearest neighbour clustering algorithm with a partial supervision strategy for biomedical data classification

EXPERT SYSTEMS, Issue 1 2009
Sameh A. Salem
Abstract: In this paper, a partial supervision strategy for a recently developed clustering algorithm, the nearest neighbour clustering algorithm (NNCA), is proposed. The proposed method (NNCA-PS) offers classification capability with a smaller amount of a priori knowledge, where a small number of data objects from the entire data set are used as labelled objects to guide the clustering process towards a better search space. Experimental results show that NNCA-PS gives promising results of 89% sensitivity at 95% specificity when used to segment retinal blood vessels, and a maximum classification accuracy of 99.5% with 97.2% average accuracy when applied to a breast cancer data set. Comparisons with other methods indicate the robustness of the proposed method in classification. Additionally, experiments on parallel environments indicate the suitability and scalability of NNCA-PS in handling larger data sets. [source]

Genome-wide association studies using haplotype clustering with a new haplotype similarity

Lina Jin
Abstract Association analysis, which aims to investigate genetic variation, is designed to detect genetic associations with observable traits and has played an increasing part in understanding the genetic basis of diseases. Among these methods, haplotype-based association studies are believed to possess prominent advantages, especially for rare diseases in case-control studies. However, when modeling haplotypes, such studies are subject to statistical problems caused by rare haplotypes. Fortunately, haplotype clustering offers an appealing solution. In this research, we have developed a new, befitting haplotype similarity for the "affinity propagation" clustering algorithm, which accounts well for rare haplotypes and thereby controls the issue of degrees of freedom. The new similarity can incorporate haplotype structure information, which is believed to enhance power and provide high resolution for identifying associations between genetic variants and disease. Our simulation studies show that the proposed approach offers advantages in detecting disease-marker associations in comparison with the cladistic haplotype clustering method CLADHC. We also illustrate an application of our method to cystic fibrosis, which shows quite accurate estimates during fine mapping. Genet. Epidemiol. 34: 633–641, 2010. © 2010 Wiley-Liss, Inc. [source]

A novel clustering algorithm using hypergraph-based granular computing

Qun Liu
Clustering is an important technique in data mining. In this paper, we introduce a new clustering algorithm. This algorithm, based on granular computing, constructs a hypergraph (simplicial complex) using the hypergraph bisection algorithm and discovers the similarities and associations among documents. Experiments on Web data show that the proposed algorithm produces quite satisfactory results. © 2009 Wiley Periodicals, Inc. [source]

Entropy-based metrics in swarm clustering

Bo Liu
Ant-based clustering methods have received significant attention as robust methods for clustering. Most ant-based algorithms use local density as a metric for determining the ants' propensities to pick up or deposit a data item; however, a number of authors in classical clustering methods have pointed out the advantages of entropy-based metrics for clustering. We introduced an entropy metric into an ant-based clustering algorithm and compared it with other closely related algorithms using local density. The results strongly support the value of entropy metrics, obtaining faster and more accurate results. Entropy governs the pickup and drop behaviors, while movement is guided by the density gradient. Entropy measures also require fewer training parameters than density-based clustering. The remaining parameters are subjected to robustness studies, and a detailed analysis is performed. In the second phase of the study, we further investigated Ramos and Abraham's (In: Proc 2003 IEEE Congr Evol Comput, Hoboken, NJ: IEEE Press; 2003. pp 1370–1375) contention that ant-based methods are particularly suited to incremental clustering. Contrary to expectations, we did not find substantial differences between the efficiencies of incremental and nonincremental approaches to data clustering. © 2009 Wiley Periodicals, Inc. [source]
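The entropy metric discussed above can be illustrated with a Shannon-entropy function over the class labels in an ant's local neighbourhood: a mixed neighbourhood (high entropy) suggests the item is misplaced and raises the pickup propensity. This is a hedged sketch of the idea, not the authors' exact governing equations (`pickup_probability` and its `k` parameter are assumptions):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (bits) of the class labels in a local neighbourhood."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def pickup_probability(neighbourhood, k=1.0):
    # high local entropy (mixed neighbourhood) -> the item is likely misplaced,
    # so the ant is more inclined to pick it up; k shapes the response curve
    h = entropy(neighbourhood)
    return h / (h + k)
```

A perfectly mixed two-class neighbourhood scores 1 bit of entropy; a pure neighbourhood scores 0, so the ant leaves well-sorted items in place.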

Selective sampling for approximate clustering of very large data sets

Liang Wang
A key challenge in pattern recognition is how to scale the computational efficiency of clustering algorithms on large data sets. The extension of non-Euclidean relational fuzzy c-means (NERF) clustering to very large (VL = unloadable) relational data is called the extended NERF (eNERF) clustering algorithm, which comprises four phases: (i) finding distinguished features that monitor progressive sampling; (ii) progressively sampling from an N × N relational matrix RN to obtain an n × n sample matrix Rn; (iii) clustering Rn with literal NERF; and (iv) extending the clusters in Rn to the remainder of the relational data. Previously published examples on several fairly small data sets suggest that eNERF is feasible for truly large data sets. However, it seems that phases (i) and (ii), i.e., finding Rn, are not very practical because the sample size n often turns out to be roughly 50% of N, and this over-sampling defeats the whole purpose of eNERF. In this paper, we examine the performance of the sampling scheme of eNERF with respect to different parameters. We propose a modified sampling scheme for use with eNERF that combines simple random sampling with (parts of) the sampling procedures used by eNERF and a related algorithm, sVAT (scalable visual assessment of clustering tendency). We demonstrate that our modified sampling scheme can eliminate the over-sampling of the original progressive sampling scheme, thus enabling the processing of truly VL data. Numerical experiments on a distance matrix of a set of 3,000,000 vectors drawn from a mixture of 5 bivariate normal distributions demonstrate the feasibility and effectiveness of the proposed sampling method. We also find that actually running eNERF on a data set of this size is very costly in terms of computation time. Thus, our results demonstrate that further modification of eNERF, especially the extension stage, will be needed before it is truly practical for VL data. © 2008 Wiley Periodicals, Inc. [source]
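Phase (iv), extending the sample clusters to the rest of the relational data, can be sketched as a nearest-sampled-object assignment: each out-of-sample object inherits the label of its closest sampled object. This is an illustrative simplification of the extension idea, not the literal eNERF extension step:

```python
def extend_clusters(dist, sample_idx, sample_labels):
    """Give every object the cluster label of its nearest sampled object.

    dist: full N x N distance matrix (lists of lists)
    sample_idx: indices of the n sampled objects
    sample_labels: cluster label of each sampled object, same order
    """
    labels = []
    for i in range(len(dist)):
        nearest = min(sample_idx, key=lambda j: dist[i][j])
        labels.append(sample_labels[sample_idx.index(nearest)])
    return labels
```

Note that only the n columns of the distance matrix belonging to the sample are touched, which is what makes sample-and-extend schemes attractive for unloadable data.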

Phenotypic study by numerical taxonomy of strains belonging to the genus Aeromonas

L. Valera
Aims: This study was undertaken to cluster and identify a large collection of Aeromonas strains. Methods and Results: Numerical taxonomy was used to analyse phenotypic data obtained on 54 new isolates taken from water, fish, snails and sputum, and on 99 type and reference strains. Each strain was tested for 121 characters, but only the data for 71 were analysed, using the SSM and SJ coefficients and the UPGMA clustering algorithm. At SJ values of ≥81.6%, the strains clustered into 22 phenons, which were identified as Aer. jandaei, Aer. hydrophila, Aer. encheleia, Aer. veronii biogroup veronii, Aer. trota, Aer. caviae, Aer. eucrenophila, Aer. ichthiosmia, Aer. sobria, Aer. allosaccharophila, Aer. media, Aer. schubertii and Aer. salmonicida. The species Aer. veronii biogroup sobria was represented by several clusters, which formed two phenotypic cores, the first related to reference strain CECT 4246 and the second to CECT 4835. A good correlation was generally observed between this phenotypic clustering and previous genomic and phylogenetic data. In addition, three new phenotypic groups were found, which may represent new Aeromonas species. Conclusions: The phenetic approach was found to be a necessary tool to delimit and identify the Aeromonas species. Significance and Impact of the Study: Valuable traits for identifying Aeromonas, as well as the possible existence of new Aeromonas species or biotypes, are indicated. [source]
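The SSM (simple matching) and SJ (Jaccard) coefficients used in the numerical-taxonomy analysis score the agreement between two strains' binary character profiles; SSM counts negative matches, SJ ignores them. A minimal sketch (the toy profiles are illustrative, not Aeromonas characters):

```python
def simple_matching(a, b):
    """S_SM: fraction of characters scored identically (negative matches count)."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def jaccard(a, b):
    """S_J: like S_SM but ignoring pairs where both strains score negative."""
    both = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    return both / either if either else 0.0
```

Because SJ discards shared absences, it is the stricter of the two for sparse character tables, which is why phenon boundaries are often quoted at an SJ threshold, as above.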

Multi-component analysis: blind extraction of pure components mass spectra using sparse component analysis

Ivica Kopriva
Abstract The paper presents sparse component analysis (SCA)-based blind decomposition of mixtures of mass spectra into pure components, wherein the number of mixtures is less than the number of pure components. Standard solutions of the related blind source separation (BSS) problem published in the open literature require the number of mixtures to be greater than or equal to the unknown number of pure components. Specifically, we have demonstrated experimentally the capability of SCA to blindly extract five pure-component mass spectra from only two mixtures. Two approaches to SCA are tested: the first based on ℓ1-norm minimization implemented through linear programming, and the second implemented through multilayer hierarchical alternating least squares nonnegative matrix factorization with sparseness constraints imposed on the pure-component spectra. In contrast to many existing blind decomposition methods, no a priori information about the number of pure components is required; it is estimated from the mixtures using a robust data clustering algorithm, together with the pure-component concentration matrix. The proposed methodology can be implemented as part of software packages used for the analysis of mass spectra and identification of chemical compounds. Copyright © 2009 John Wiley & Sons, Ltd. [source]

Experimental and statistical analysis methods for peptide detection using surface-enhanced Raman spectroscopy

Breeana L. Mitchell
Abstract Surface-enhanced Raman spectroscopy (SERS) has the potential to make a significant impact in biology research due to its ability to provide information orthogonal to that obtained by traditional techniques such as mass spectrometry (MS). While SERS has been well studied for its use in chemical applications, detailed investigations with biological molecules are less common. In addition, a clear understanding of how methodology and molecular characteristics impact the intensity, the number of peaks, and the signal-to-noise ratio of SERS spectra is largely missing. By varying the concentration and order of addition of the SERS-enhancer salt (LiCl) with colloidal silver, we were able to evaluate the impact of these variables on peptide spectra using a quantitative measure of spectral quality based on the number of peaks and peak intensity. The LiCl concentration and order of addition that produced the best SERS spectra were applied to a panel of synthetic peptides with a range of charges and isoelectric points (pIs), where the pI was directly correlated with higher spectral quality. Those peptides with moderate to high pIs and spectral quality scores were differentiated from each other using the improved method and a hierarchical clustering algorithm. In addition, the same method and algorithm were applied to a set of highly similar phosphorylated peptides, and it was possible to successfully classify the majority of peptides on the basis of species-specific peak differences. Copyright © 2008 John Wiley & Sons, Ltd. [source]

Identifying similar pages in Web applications using a competitive clustering algorithm

Andrea De Lucia
Abstract We present an approach based on Winner Takes All (WTA), a competitive clustering algorithm, to support the comprehension of static and dynamic Web applications during Web application reengineering. This approach adopts a process that first computes the distance between Web pages and then identifies and groups similar pages using the considered clustering algorithm. We present an instance of the clustering process applied to identifying similar pages at the structural level. The page structure is encoded into a string of HTML tags, and the distance between Web pages at the structural level is then computed using the Levenshtein string edit distance algorithm. A prototype to automate the clustering process has been implemented; it can be extended to other instances of the process, such as the identification of groups of similar pages at the content level. The approach and the tool have been evaluated in two case studies. The results show that the WTA clustering algorithm suggests heuristics to easily identify the best partition of Web pages into clusters among the possible partitions. Copyright © 2007 John Wiley & Sons, Ltd. [source]
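The structural distance described above rests on the Levenshtein edit distance between the pages' encoded tag strings. A standard two-row dynamic-programming implementation, which works equally on strings of characters or on lists of HTML tag names (the tag sequences shown are hypothetical, not from the case studies):

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    # classic DP over prefixes; O(len(a)*len(b)) time, O(len(b)) space
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution (0 if equal)
        prev = cur
    return prev[-1]
```

For page comparison one would call it on tag sequences, e.g. `levenshtein(["html", "body", "div"], ["html", "body", "table"])`, and normalize by the longer sequence length before clustering.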

High-speed rough clustering for very large document collections

Kazuaki Kishida
Abstract Document clustering is an important tool, but it is not yet widely used in practice, probably because of its high computational complexity. This article explores techniques of high-speed rough clustering of documents, assuming that it is sometimes necessary to obtain a clustering result in a shorter time, although the result is just an approximate outline of document clusters. A promising approach for such clustering is to reduce the number of documents to be checked for generating cluster vectors in the leader–follower clustering algorithm. Based on this idea, the present article proposes a modified Crouch algorithm and an incomplete single-pass leader–follower algorithm. Also, a two-stage grouping technique, in which the first stage attempts to decrease the number of documents to be processed in the second stage by applying a quick merging technique, is developed. An experiment using a part of the Reuters corpus RCV1 showed empirically that both the modified Crouch and the incomplete single-pass leader–follower algorithms achieve clustering results more efficiently than the original methods, and also improved the effectiveness of clustering results. On the other hand, the two-stage grouping technique did not reduce the processing time in this experiment. [source]
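The leader–follower family of algorithms referenced above makes a single pass over the documents, assigning each one to the first existing cluster leader within a similarity threshold, or founding a new cluster otherwise. A minimal one-dimensional sketch (real document clustering would use cosine similarity on term vectors; the radius and data here are illustrative):

```python
def leader_follower(xs, radius):
    """Single-pass clustering: join the first leader within `radius`, else found one."""
    leaders, labels = [], []
    for x in xs:
        for k, c in enumerate(leaders):
            if abs(x - c) <= radius:   # close enough to an existing leader
                labels.append(k)
                break
        else:                          # no leader matched: x becomes a new leader
            leaders.append(x)
            labels.append(len(leaders) - 1)
    return leaders, labels
```

Each item is compared only against the current leaders, which is why single-pass variants scale so much better than repeated-iteration methods, at the cost of order sensitivity.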

Clustering work and family trajectories by using a divisive algorithm

Raffaella Piccarreta
Summary: We present an approach to the construction of clusters of life course trajectories and use it to obtain ideal types of trajectories that can be interpreted and analysed meaningfully. We represent life courses as sequences on a monthly timescale and apply optimal matching analysis to compute dissimilarities between individuals. We introduce a new divisive clustering algorithm which shares features with both Ward's agglomerative algorithm and classification and regression trees. We analyse British Household Panel Survey data on the employment and family trajectories of women. Our method produces clusters of sequences for which it is straightforward to determine who belongs to each cluster, making it easier to interpret the relative importance of life course factors in distinguishing subgroups of the population. Moreover, our method gives guidance on selecting the number of clusters. [source]

Structuring Chemical Space: Similarity-Based Characterization of the PubChem Database

Giovanni Cincilla
Abstract The ensemble of conceivable molecules is referred to as the Chemical Space. In this article we describe a hierarchical version of the Affinity Propagation (AP) clustering algorithm and apply it to analyze the LINGO-based similarity matrix of a 500,000-molecule subset of the PubChem database, which contains more than 19 million compounds. The combination of two highly efficient methods, namely the AP clustering algorithm and LINGO-based molecular similarity calculations, allows the unbiased analysis of large databases. Hierarchical clustering generates a numerical diagonalization of the similarity matrix. The target-independent, intrinsic structure of the database, derived without any prior information on the physical or biological properties of the compounds, maps together molecules experimentally shown to bind the same biological target or to have similar physical properties. [source]
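Affinity Propagation, the base algorithm hierarchized in this work, exchanges "responsibility" and "availability" messages over a similarity matrix until exemplar molecules emerge. A minimal, damped pure-Python sketch on four toy points (the similarities and diagonal preferences are illustrative, not LINGO similarities):

```python
def affinity_propagation(s, damping=0.5, iters=200):
    """Cluster by message passing over similarity matrix s (larger = more similar).

    The diagonal s[k][k] holds each point's 'preference' to be an exemplar.
    Returns, for each point, the index of the exemplar it follows.
    """
    n = len(s)
    r = [[0.0] * n for _ in range(n)]  # responsibilities
    a = [[0.0] * n for _ in range(n)]  # availabilities
    for _ in range(iters):
        # r(i,k) <- s(i,k) - max over k' != k of (a(i,k') + s(i,k'))
        for i in range(n):
            vals = [a[i][k] + s[i][k] for k in range(n)]
            order = sorted(range(n), key=lambda k: vals[k], reverse=True)
            first, second = vals[order[0]], vals[order[1]]
            for k in range(n):
                best_other = second if k == order[0] else first
                r[i][k] = damping * r[i][k] + (1 - damping) * (s[i][k] - best_other)
        # a(i,k) <- min(0, r(k,k) + sum over i' not in {i,k} of max(0, r(i',k)))
        for k in range(n):
            pos = [max(0.0, r[i][k]) for i in range(n)]
            total = sum(pos)
            for i in range(n):
                if i == k:
                    new = total - pos[k]  # a(k,k): accumulated support for k
                else:
                    new = min(0.0, r[k][k] + total - pos[i] - pos[k])
                a[i][k] = damping * a[i][k] + (1 - damping) * new
    # each point follows the k that maximizes a(i,k) + r(i,k)
    return [max(range(n), key=lambda k: a[i][k] + r[i][k]) for i in range(n)]
```

With two tight pairs of points and slightly staggered preferences, each pair elects one of its members as exemplar; the hierarchical variant described in the article then re-clusters the exemplars themselves.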

Combining idiographic and nomothetic methods in the study of internal working models

Attachment theory's notion of the internal working model refers to an affective-cognitive structure that guides how individuals experience, and act within, their close relationships. Understanding working models in general (i.e., nomothetically) can be greatly enhanced by attending to the unique (i.e., idiographic) properties of individuals' data. A general method is described for eliciting and empirically representing both the common and unique properties of individuals' descriptions of self and others. This approach is illustrated by two studies in which participants described self and others in a variety of significant roles and relationships by choosing from a list of attachment-related descriptive terms. A hierarchical clustering algorithm, HICLAS (De Boeck & Rosenberg, 1988), is used to generate a unique graphical representation of each individual's responses. We illustrate the use of HICLAS to (a) assess nomothetic properties of the structures and relate those properties to other variables such as attachment style, and (b) link aspects of any individual's structure with other idiographic data such as interview narratives. Data from HICLAS enhance the interpretation of other, more qualitative idiographic information, and help to produce new constructs, variables, and propositions amenable to rigorous hypothesis tests in future research. [source]

Toward responsive visualization services for scatter/gather browsing

Weimao Ke
As a type of relevance feedback, Scatter/Gather demonstrates an interactive approach to relevance mapping and reinforcement. The Scatter/Gather model, proposed by Cutting, Karger, Pedersen, and Tukey (1992), is well known for its effectiveness in situations where it is difficult to precisely specify a query. However, online clustering on a large data corpus is computationally complex and extremely time consuming, which has prohibited the method's real-world application in responsive services. In this paper, we propose and evaluate a new clustering algorithm called LAIR2, which has linear worst-case time complexity and constant average running time for Scatter/Gather browsing. Our experiment showed that, when running on a single processor, the LAIR2 online clustering algorithm is several hundred times faster than a classic parallel algorithm running on multiple processors. The efficiency of the LAIR2 algorithm promises real-time Scatter/Gather browsing services. We have implemented an online visualization prototype, the LAIR2 Scatter/Gather browser, to demonstrate its utility and usability. [source]

In-process Control of Design Inspection Effectiveness

Tzvi Raz
Abstract We present a methodology for the in-process control of design inspection, focusing on escaped defects. The methodology estimates the defect escape probability at each phase of the process using the information available at the beginning of that phase. The development of the models is illustrated by a case involving data collected from the design inspections of software components. The data include the size of the product component, as well as the time invested in preparing for the inspection and actually carrying it out. After smoothing the original data with a clustering algorithm to compensate for its excessive variability, we obtained a series of regression models exhibiting increasingly better fits to the data as more information becomes available. We discuss how management can use such models to reduce escape risk as the inspection process evolves. Copyright © 2003 John Wiley & Sons, Ltd. [source]

The PTPN22 C1858T polymorphism is associated with skewing of cytokine profiles toward high interferon-α activity and low tumor necrosis factor α levels in patients with lupus

Silvia N. Kariuki
Objective The C1858T polymorphism in PTPN22 has been associated with the risk of systemic lupus erythematosus (SLE) as well as multiple other autoimmune diseases. We have previously shown that high serum interferon-α (IFNα) activity is a heritable risk factor for SLE. The aim of this study was to determine whether the PTPN22 risk variant may shift serum cytokine profiles to higher IFNα activity, resulting in risk of disease. Methods IFNα was measured in 143 patients with SLE, using a functional reporter cell assay, and tumor necrosis factor α (TNFα) was measured by enzyme-linked immunosorbent assay. The rs2476601 single-nucleotide polymorphism in PTPN22 (C1858T) was genotyped in the same patients. Patients were grouped, using a clustering algorithm, into 4 cytokine groups (IFNα predominant, IFNα and TNFα correlated, TNFα predominant, and both IFNα and TNFα low). Results SLE patients carrying the risk allele of PTPN22 had higher serum IFNα activity than patients lacking the risk allele (P = 0.027). TNFα levels were lower in carriers of the risk allele (P = 0.030), and the risk allele was more common in patients in the IFNα-predominant and IFNα and TNFα-correlated groups as compared with patients in the TNFα-predominant and both IFNα and TNFα-low groups (P = 0.001). Twenty-five percent of male patients carried the risk allele, compared with 10% of female patients (P = 0.024); however, cytokine skewing was similar in both sexes. Conclusion The autoimmune disease risk allele of PTPN22 is associated with skewing of serum cytokine profiles toward higher IFNα activity and lower TNFα levels in vivo in patients with SLE. This serum cytokine pattern may be relevant in other autoimmune diseases associated with the PTPN22 risk allele. [source]