Implementations
Kinds of Implementations

Selected Abstracts

Parallel computation of a highly nonlinear Boussinesq equation model through domain decomposition
INTERNATIONAL JOURNAL FOR NUMERICAL METHODS IN FLUIDS, Issue 1 2005
Khairil Irfan Sitanggang

Abstract Implementations of the Boussinesq wave model to calculate free surface wave evolution in large basins are, in general, computationally very expensive, requiring huge amounts of CPU time and memory. For large-scale problems, it is neither affordable nor practical to run on a single PC. To facilitate such extensive computations, a parallel Boussinesq wave model is developed using the domain decomposition technique in conjunction with the message passing interface (MPI). The published and well-tested numerical scheme used by the serial model, a high-order finite difference method, is identical to that employed in the parallel model. Parallelization of the tridiagonal matrix systems included in the serial scheme is the most challenging aspect of the work, and is accomplished using a parallel matrix solver combined with an efficient data transfer scheme. Numerical tests on a distributed-memory supercomputer show that the performance of the current parallel model in simulating wave evolution is very satisfactory. A linear speedup is gained as the number of processors increases. These tests showed that the CPU time efficiency of the model is about 75–90%. Copyright © 2005 John Wiley & Sons, Ltd. [source]

Language Awareness: A History and Implementations
JOURNAL OF LINGUISTIC ANTHROPOLOGY, Issue 2 2006
Alicia Pousada [source]

Power and sample size for nested analysis of molecular variance
MOLECULAR ECOLOGY, Issue 19 2009
BENJAMIN M. FITZPATRICK

Abstract Analysis of molecular variance (amova) is a widely used tool for quantifying the contribution of various levels of population structure to patterns of genetic variation.
Implementations of amova use permutation tests to evaluate null hypotheses of no population structure within groups and between groups. With few populations per group, between-group structure might be impossible to detect because only a few permutations of the sampled populations are possible. In fact, with fewer than six total populations, permutation tests will never result in P-values < 0.05 for higher-level population structure. I present minimum numbers of replicates calculated from multinomial coefficients and an R script that can be used to evaluate the minimum P-value for any sampling scheme. While it might seem counterintuitive that a large sample of individuals is uninformative about hierarchical structure, the power to detect between-group differences depends on the number of populations per group, and investigators should sample appropriately. [source]

Flexible and Robust Implementations of Multivariate Adaptive Regression Splines Within a Wastewater Treatment Stochastic Dynamic Program
QUALITY AND RELIABILITY ENGINEERING INTERNATIONAL, Issue 7 2005
Julia C. C. Tsai

Abstract This paper presents an automatic and more robust implementation of multivariate adaptive regression splines (MARS) within the orthogonal array (OA)/MARS continuous-state stochastic dynamic programming (SDP) method. MARS is used to estimate the future value functions in each SDP level. The default stopping rule of MARS employs the maximum number of basis functions Mmax, specified by the user. To reduce the computational effort and improve the MARS fit for the wastewater treatment SDP model, two automatic stopping rules that determine an appropriate value for Mmax, together with a robust version of MARS that prefers lower-order terms over higher-order terms, are developed. Computational results demonstrate the success of these approaches. Copyright © 2005 John Wiley & Sons, Ltd.
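The minimum P-value argument in the amova abstract above comes down to counting how many distinct assignments of populations to groups a permutation test can ever visit. A minimal Python sketch of that multinomial-coefficient count (the cited paper supplies an R script; the function names here are illustrative, not taken from it):

```python
from math import comb

def distinct_assignments(group_sizes):
    """Multinomial coefficient: the number of distinct ways to assign
    the sampled populations to groups of the given sizes."""
    total, count = 0, 1
    for size in group_sizes:
        total += size
        count *= comb(total, size)
    return count

def min_p_value(group_sizes):
    """Smallest P-value a permutation test over these groups can produce."""
    return 1.0 / distinct_assignments(group_sizes)

# Five populations split 2 + 3: only 10 distinct permutations,
# so the P-value can never fall below 0.1 (never < 0.05).
print(min_p_value([2, 3]))  # 0.1
# Six populations split 3 + 3: exactly 20 permutations, minimum P = 0.05.
print(min_p_value([3, 3]))  # 0.05
```

This matches the abstract's claim that fewer than six total populations can never yield P < 0.05: the largest multinomial coefficient reachable with five populations is 10.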
[source]

Computer-based management environment for an assembly language programming laboratory
COMPUTER APPLICATIONS IN ENGINEERING EDUCATION, Issue 1 2007
Santiago Rodríguez

Abstract This article describes the environment used in the Computer Architecture Department of the Technical University of Madrid (UPM) for managing small laboratory work projects and a specific application for an Assembly Language Programming Laboratory. The approach is based on a chain of tools that a small team of teachers can use to efficiently manage a course with a large number of students (400 per year). Students use this tool chain to complete their assignments using an MC88110 CPU simulator also developed by the Department. Students use a Delivery Agent tool to send files containing their implementations. These files are stored in one of the Department servers. Every student laboratory assignment is tested by an Automatic Project Evaluator that executes a set of previously designed and configured tests. These tools are used by teachers to manage mass courses, thereby avoiding restrictions on students working on the same assignment. This procedure may encourage students to copy others' laboratory work, and we have therefore developed a complementary tool to help teachers find "replicated" laboratory assignment implementations. This tool is a plagiarism detection assistant that completes the tool-chain functionality. Jointly, these tools have demonstrated over the last decade that important benefits can be gained from the exploitation of a global laboratory work management system. Some of the benefits may be transferable to an area of growing importance that we have not directly explored, i.e. distance learning environments for technical subjects. © 2007 Wiley Periodicals, Inc. Comput Appl Eng Educ 15: 41–54, 2007; Published online in Wiley InterScience (www.interscience.wiley.com); DOI 10.1002/cae.20094 [source]

Fast BVH Construction on GPUs
COMPUTER GRAPHICS FORUM, Issue 2 2009
C.
Lauterbach

We present two novel parallel algorithms for rapidly constructing bounding volume hierarchies on manycore GPUs. The first uses a linear ordering derived from spatial Morton codes to build hierarchies extremely quickly and with high parallel scalability. The second is a top-down approach that uses the surface area heuristic (SAH) to build hierarchies optimized for fast ray tracing. Both algorithms are combined into a hybrid algorithm that removes existing bottlenecks in GPU construction performance and scalability, leading to significantly decreased build time. The resulting hierarchies are close in quality to optimized SAH hierarchies, but the construction process is substantially faster, leading to a significant net benefit when both construction and traversal cost are accounted for. Our preliminary results show that current GPU architectures can compete with CPU implementations of hierarchy construction running on multicore systems. In practice, we can construct hierarchies of models with up to several million triangles and use them for fast ray tracing or other applications. [source]

SIMD Optimization of Linear Expressions for Programmable Graphics Hardware
COMPUTER GRAPHICS FORUM, Issue 4 2004
Chandrajit Bajaj

Abstract The increased programmability of graphics hardware allows efficient graphical processing unit (GPU) implementations of a wide range of general computations on commodity PCs. An important factor in such implementations is how to fully exploit the SIMD computing capacities offered by modern graphics processors. Linear expressions of the form y = Ax + b, where A is a matrix and x and b are vectors, constitute one of the most basic operations in many scientific computations. In this paper, we propose a SIMD code optimization technique that enables efficient shader codes to be generated for evaluating linear expressions.
It is shown that performance can be improved considerably by efficiently packing arithmetic operations into four-wide SIMD instructions through reordering of the operations in linear expressions. We demonstrate that the presented technique can be used effectively for programming both vertex and pixel shaders for a variety of mathematical applications, including integrating differential equations and solving a sparse linear system of equations using iterative methods. [source]

Dye Advection Without the Blur: A Level-Set Approach for Texture-Based Visualization of Unsteady Flow
COMPUTER GRAPHICS FORUM, Issue 3 2004
D. Weiskopf

Dye advection is an intuitive and versatile technique to visualize both steady and unsteady flow. Dye can be easily combined with noise-based dense vector field representations and is an important element in user-centric visual exploration processes. However, fast texture-based implementations of dye advection rely on linear interpolation operations that lead to severe diffusion artifacts. In this paper, a novel approach for dye advection is proposed to avoid this blurring and to achieve long and clearly defined streaklines or extended streak-like patterns. The interface between dye and background is modeled as a level-set within a signed distance field. The level-set evolution is governed by the underlying flow field and is computed by a semi-Lagrangian method. A reinitialization technique is used to counteract the distortions introduced by the level-set evolution and to maintain a level-set function that represents a local distance field. This approach works for 2D and 3D flow fields alike. It is demonstrated how the texture-based level-set representation lends itself to an efficient GPU implementation and therefore facilitates interactive visualization.
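The semi-Lagrangian level-set update described in the dye advection abstract above can be sketched in a few lines of NumPy. This is a simplified CPU illustration of the general scheme (uniform grid, dx equal to dy, clamped boundaries), not the paper's GPU implementation, and it omits the reinitialization step:

```python
import numpy as np

def semi_lagrangian_step(phi, u, v, dt, dx):
    """Advect the level-set field phi one time step through velocity
    (u, v) by tracing each grid point backwards along the flow and
    bilinearly interpolating phi at the departure point."""
    ny, nx = phi.shape
    j, i = np.meshgrid(np.arange(ny), np.arange(nx), indexing="ij")
    # backtrace to the departure point, clamped to the domain
    x = np.clip(i - u * dt / dx, 0, nx - 1)
    y = np.clip(j - v * dt / dx, 0, ny - 1)
    x0 = np.floor(x).astype(int)
    y0 = np.floor(y).astype(int)
    x1 = np.minimum(x0 + 1, nx - 1)
    y1 = np.minimum(y0 + 1, ny - 1)
    fx, fy = x - x0, y - y0
    # bilinear interpolation of phi at the departure points
    return ((1 - fy) * ((1 - fx) * phi[y0, x0] + fx * phi[y0, x1])
            + fy * ((1 - fx) * phi[y1, x0] + fx * phi[y1, x1]))

# a signed-distance-like ramp advected one cell to the right
phi = np.tile(np.arange(8.0), (8, 1))
shifted = semi_lagrangian_step(phi, u=1.0, v=0.0, dt=1.0, dx=1.0)
```

Repeating such steps distorts the distance property of phi, which is exactly why the paper pairs the evolution with periodic reinitialization back to a local distance field.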
Categories and Subject Descriptors (according to ACM CCS): I.3.3 [Computer Graphics]: Picture/Image Generation; I.3.7 [Computer Graphics]: Three-Dimensional Graphics and Realism [source]

Phase-rotation in in-vivo localized spectroscopy
CONCEPTS IN MAGNETIC RESONANCE, Issue 3 2007
Saadallah Ramadan

Abstract Phase-rotation is an alternative method to phase-cycling in acquisition of magnetic resonance spectroscopic data. However, there have been only two papers to date describing its implementation in point resolved spectroscopy (PRESS) and stimulated echo acquisition mode (STEAM). This article aims to introduce and explain the principles of phase-rotation, describe the implementations that have been carried out so far in the current literature, compare phase-rotation and phase-cycling experimentally, and introduce the application of phase-rotation in double-quantum filtered (DQF) spectroscopy. © 2007 Wiley Periodicals, Inc. Concepts Magn Reson Part A 30A: 147–153, 2007 [source]

Geometric algebra and transition-selective implementations of the controlled-NOT gate
CONCEPTS IN MAGNETIC RESONANCE, Issue 1 2004
Timothy F. Havel

Geometric algebra provides a complete set of simple rules for the manipulation of product operator expressions at a symbolic level, without any explicit use of matrices. This approach can be used not only to describe the state and evolution of a spin system, but also to derive the effective Hamiltonian and associated propagator in full generality. In this article, we illustrate the use of geometric algebra via a detailed analysis of transition-selective implementations of the controlled-NOT gate, which plays a key role in NMR-based quantum information processing. In the appendices, we show how one can also use geometric algebra to derive tight bounds on the magnitudes of the errors associated with these implementations of the controlled-NOT. © 2004 Wiley Periodicals, Inc.
Concepts Magn Reson Part A 23A: 49–62, 2004 [source]

Accurate garbage collection in uncooperative environments revisited
CONCURRENCY AND COMPUTATION: PRACTICE & EXPERIENCE, Issue 12 2009
J. Baker

Abstract Implementing a concurrent programming language such as Java by means of a translator to an existing language is attractive as it provides portability over all platforms supported by the host language and reduces development time, as many low-level tasks can be delegated to the host compiler. The C and C++ programming languages are popular choices for many language implementations due to the availability of efficient compilers on a wide range of platforms. For garbage-collected languages, however, they are not a perfect match as no support is provided for accurately discovering pointers to heap-allocated data on thread stacks. We evaluate several previously published techniques and propose a new mechanism, lazy pointer stacks, for performing accurate garbage collection in such uncooperative environments. We implemented the new technique in the Ovm Java virtual machine with our own Java-to-C/C++ compiler using GCC as a back-end compiler. Our extensive experimental results confirm that lazy pointer stacks outperform existing approaches: we provide a speedup of 4.5% over Henderson's accurate collector with a 17% increase in code size. Accurate collection is essential in the context of real-time systems; we thus validate our approach with the implementation of a real-time concurrent garbage collection algorithm. Copyright © 2009 John Wiley & Sons, Ltd. [source]

Factors affecting the performance of parallel mining of minimal unique itemsets on diverse architectures
CONCURRENCY AND COMPUTATION: PRACTICE & EXPERIENCE, Issue 9 2009
D. J. Haglin

Abstract Three parallel implementations of a divide-and-conquer search algorithm (called SUDA2) for finding minimal unique itemsets (MUIs) are compared in this paper.
The identification of MUIs is used by national statistics agencies for statistical disclosure assessment. The first parallel implementation adapts SUDA2 to a symmetric multi-processor cluster using the message passing interface (MPI), which we call an MPI cluster; the second optimizes the code for the Cray MTA2 (a shared-memory, multi-threaded architecture); and the third uses a heterogeneous 'group' of workstations connected by LAN. Each implementation considers the parallel structure of SUDA2, and how the subsearch computation times and sequence of subsearches affect load balancing. All three approaches scale with the number of processors, enabling SUDA2 to handle larger problems than before. For example, the MPI implementation is able to achieve nearly two orders of magnitude improvement with 132 processors. Performance results are given for a number of data sets. Copyright © 2009 John Wiley & Sons, Ltd. [source]

Design and implementation of a high-performance CCA event service
CONCURRENCY AND COMPUTATION: PRACTICE & EXPERIENCE, Issue 9 2009
Ian Gorton

Abstract Event services based on publish–subscribe architectures are well-established components of distributed computing applications. Recently, an event service has been proposed as part of the common component architecture (CCA) for high-performance computing (HPC) applications. In this paper we describe our implementation, experimental evaluation, and initial experience with a high-performance CCA event service that exploits efficient communications mechanisms commonly used on HPC platforms. We describe the CCA event service model and briefly discuss the possible implementation strategies of the model. We then present the design and implementation of the event service using the aggregate remote memory copy interface as an underlying communication layer for this mechanism. Two alternative implementations are presented and evaluated on a Cray XD-1 platform.
The performance results demonstrate that event delivery latencies are low and that the event service is able to achieve high-throughput levels. Finally, we describe the use of the event service in an application for high-speed processing of data from a mass spectrometer and conclude by discussing some possible extensions to the event service for other HPC applications. Published in 2009 by John Wiley & Sons, Ltd. [source]

On the effectiveness of runtime techniques to reduce memory sharing overheads in distributed Java implementations
CONCURRENCY AND COMPUTATION: PRACTICE & EXPERIENCE, Issue 13 2008
Marcelo Lobosco

Abstract Distributed Java virtual machine (dJVM) systems enable concurrent Java applications to transparently run on clusters of commodity computers. This is achieved by supporting Java's shared-memory model over multiple JVMs distributed across the cluster's computer nodes. In this work, we describe and evaluate selective dynamic diffing and lazy home allocation, two new runtime techniques that enable dJVMs to efficiently support memory sharing across the cluster. Specifically, the two proposed techniques can contribute, either in isolation or in combination, to reduce the overheads due to message traffic, extra memory space, and high latency of remote memory accesses that such dJVM systems require for implementing their memory-coherence protocol. In order to evaluate the performance-related benefits of dynamic diffing and lazy home allocation, we implemented both techniques in Cooperative JVM (CoJVM), a basic dJVM system we developed in previous work. In subsequent work, we carried out performance comparisons between the basic CoJVM and modified CoJVM versions for five representative concurrent Java applications (matrix multiply, LU, Radix, fast Fourier transform, and SOR) using our proposed techniques. Our experimental results showed that dynamic diffing and lazy home allocation significantly reduced memory sharing overheads.
The reduction resulted in considerable gains in the CoJVM system's performance, ranging from 9% up to 20%, in four out of the five applications, with resulting speedups varying from 6.5 up to 8.1 for an 8-node cluster of computers. Copyright © 2007 John Wiley & Sons, Ltd. [source]

Parallel processing of remotely sensed hyperspectral imagery: full-pixel versus mixed-pixel classification
CONCURRENCY AND COMPUTATION: PRACTICE & EXPERIENCE, Issue 13 2008
Antonio J. Plaza

Abstract The rapid development of space and computer technologies allows for the possibility to store huge amounts of remotely sensed image data, collected using airborne and satellite instruments. In particular, NASA is continuously gathering high-dimensional image data with Earth observing hyperspectral sensors such as the Jet Propulsion Laboratory's airborne visible–infrared imaging spectrometer (AVIRIS), which measures reflected radiation in hundreds of narrow spectral bands at different wavelength channels for the same area on the surface of the Earth. The development of fast techniques for transforming massive amounts of hyperspectral data into scientific understanding is critical for space-based Earth science and planetary exploration. Despite the growing interest in hyperspectral imaging research, only a few efforts have been devoted to the design of parallel implementations in the literature, and detailed comparisons of standardized parallel hyperspectral algorithms are currently unavailable. This paper compares several existing and new parallel processing techniques for pure and mixed-pixel classification in hyperspectral imagery. The distinction of pure versus mixed-pixel analysis is linked to the considered application domain, and results from the very rich spectral information available from hyperspectral instruments. In some cases, such information allows image analysts to overcome the constraints imposed by limited spatial resolution.
In most cases, however, the spectral bands collected by hyperspectral instruments have high statistical correlation, and efficient parallel techniques are required to reduce the dimensionality of the data while retaining the spectral information that allows for the separation of the classes. In order to address this issue, this paper also develops a new parallel feature extraction algorithm that integrates the spatial and spectral information. The proposed technique is evaluated (from the viewpoint of both classification accuracy and parallel performance) and compared with other parallel techniques for dimensionality reduction and classification in the context of three representative application case studies: urban characterization, land-cover classification in agriculture, and mapping of geological features, using AVIRIS data sets with detailed ground-truth. Parallel performance is assessed using Thunderhead, a massively parallel Beowulf cluster at NASA's Goddard Space Flight Center. The detailed cross-validation of parallel algorithms conducted in this work may specifically help image analysts in the selection of parallel algorithms for specific applications. Copyright © 2008 John Wiley & Sons, Ltd. [source]

Parallel tiled QR factorization for multicore architectures
CONCURRENCY AND COMPUTATION: PRACTICE & EXPERIENCE, Issue 13 2008
Alfredo Buttari

Abstract As multicore systems continue to gain ground in the high-performance computing world, linear algebra algorithms have to be reformulated, or new algorithms have to be developed, in order to take advantage of the architectural features of these new processors. Fine-grain parallelism becomes a major requirement and introduces the necessity of loose synchronization in the parallel execution of an operation. This paper presents an algorithm for the QR factorization where the operations can be represented as a sequence of small tasks that operate on square blocks of data (referred to as 'tiles').
These tasks can be dynamically scheduled for execution based on the dependencies among them and on the availability of computational resources. This may result in an out-of-order execution of the tasks that will completely hide the presence of intrinsically sequential tasks in the factorization. Performance comparisons are presented with the LAPACK algorithm for QR factorization, where parallelism can be exploited only at the level of the BLAS operations, and with vendor implementations. Copyright © 2008 John Wiley & Sons, Ltd. [source]

Seine: a dynamic geometry-based shared-space interaction framework for parallel scientific applications
CONCURRENCY AND COMPUTATION: PRACTICE & EXPERIENCE, Issue 15 2006
L. Zhang

Abstract While large-scale parallel/distributed simulations are rapidly becoming critical research modalities in academia and industry, their efficient and scalable implementations continue to present many challenges. A key challenge is that the dynamic and complex communication/coordination required by these applications (dependent on the state of the phenomenon being modeled) are determined by the specific numerical formulation, the domain decomposition and/or sub-domain refinement algorithms used, etc., and are known only at runtime. This paper presents Seine, a dynamic geometry-based shared-space interaction framework for scientific applications. The framework provides the flexibility of shared-space-based models and supports extremely dynamic communication/coordination patterns, while still enabling scalable implementations. The design and prototype implementation of Seine are presented. Seine complements and can be used in conjunction with existing parallel programming systems such as MPI and OpenMP. An experimental evaluation using an adaptive multi-block oil-reservoir simulation is used to demonstrate the performance and scalability of applications using Seine. Copyright © 2006 John Wiley & Sons, Ltd.
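The dependency-driven, out-of-order task execution described in the tiled QR abstract above can be modeled with a toy ready-queue scheduler. This is a single-threaded sketch of the scheduling idea only, not the paper's runtime; the tile-task names below are illustrative placeholders, loosely styled after kernel names common in the tiled QR literature:

```python
from collections import deque

def dynamic_schedule(deps):
    """Run tasks as soon as their dependencies are satisfied.
    `deps` maps each task to the set of tasks it waits for;
    returns one valid execution order."""
    waiting = {task: set(d) for task, d in deps.items()}
    successors = {task: [] for task in deps}
    for task, ds in deps.items():
        for dep in ds:
            successors[dep].append(task)
    ready = deque(task for task, ds in waiting.items() if not ds)
    order = []
    while ready:
        task = ready.popleft()
        order.append(task)  # a real runtime would dispatch this to a core
        for succ in successors[task]:
            waiting[succ].discard(task)
            if not waiting[succ]:
                ready.append(succ)  # becomes runnable out of program order
    return order

# hypothetical tile tasks for a tiny 2x2 tile problem
deps = {
    "GEQRT(0,0)": set(),
    "ORMQR(0,1)": {"GEQRT(0,0)"},
    "TSQRT(1,0)": {"GEQRT(0,0)"},
    "SSMQR(1,1)": {"ORMQR(0,1)", "TSQRT(1,0)"},
}
print(dynamic_schedule(deps)[0])  # GEQRT(0,0) always runs first
```

Because readiness, not program order, drives dispatch, independent tasks such as the two middle ones can run concurrently on real hardware, which is what hides the intrinsically sequential panel factorizations.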
[source]

Task Pool Teams: a hybrid programming environment for irregular algorithms on SMP clusters
CONCURRENCY AND COMPUTATION: PRACTICE & EXPERIENCE, Issue 12 2006
Judith Hippold

Abstract Clusters of symmetric multiprocessors (SMPs) are popular platforms for parallel programming since they provide large computational power for a reasonable price. For irregular application programs with dynamically changing computation and data access behavior, a flexible programming model is needed to achieve efficiency. In this paper we propose Task Pool Teams as a hybrid parallel programming environment to realize irregular algorithms on clusters of SMPs. Task Pool Teams combine task pools on single cluster nodes by an explicit message passing layer. They offer load balance together with multi-threaded, asynchronous communication. Appropriate communication protocols and task pool implementations are provided and accessible by an easy-to-use application programmer interface. As application examples we present a branch and bound algorithm and the hierarchical radiosity algorithm. Copyright © 2006 John Wiley & Sons, Ltd. [source]

Middleware benchmarking: approaches, results, experiences
CONCURRENCY AND COMPUTATION: PRACTICE & EXPERIENCE, Issue 15 2005
Paul Brebner

Abstract The report summarizes the results of the Workshop on Middleware Benchmarking held during OOPSLA 2003. The goal of the workshop was to help advance the current practice of gathering performance characteristics of middleware implementations through benchmarking. The participants of the workshop focused on identifying requirements of and obstacles to middleware benchmarking and forming a position on the related issues. Selected requirements and obstacles are presented, together with guidelines to adhere to when benchmarking, open issues of current practice, and perspectives on further research. Copyright © 2005 John Wiley & Sons, Ltd.
[source]

Ibis: a flexible and efficient Java-based Grid programming environment
CONCURRENCY AND COMPUTATION: PRACTICE & EXPERIENCE, Issue 7-8 2005
Rob V. van Nieuwpoort

Abstract In computational Grids, performance-hungry applications need to simultaneously tap the computational power of multiple, dynamically available sites. The crux of designing Grid programming environments stems exactly from the dynamic availability of compute cycles: Grid programming environments (a) need to be portable to run on as many sites as possible, (b) they need to be flexible to cope with different network protocols and dynamically changing groups of compute nodes, while (c) they need to provide efficient (local) communication that enables high-performance computing in the first place. Existing programming environments are either portable (Java), or flexible (Jini, Java Remote Method Invocation (RMI)), or they are highly efficient (Message Passing Interface). No system combines all three properties that are necessary for Grid computing. In this paper, we present Ibis, a new programming environment that combines Java's 'run everywhere' portability both with flexible treatment of dynamically available networks and processor pools, and with highly efficient, object-based communication. Ibis can transfer Java objects very efficiently by combining streaming object serialization with a zero-copy protocol. Using RMI as a simple test case, we show that Ibis outperforms existing RMI implementations, achieving up to nine times higher throughputs with trees of objects. Copyright © 2005 John Wiley & Sons, Ltd. [source]

Adding tuples to Java: a study in lightweight data structures
CONCURRENCY AND COMPUTATION: PRACTICE & EXPERIENCE, Issue 5-6 2005
C. van Reeuwijk

Abstract Java classes are very flexible, but this comes at a price. The main cost is that every class instance must be dynamically allocated. Also, their access by reference introduces pointer de-references and complicates program analysis.
These costs are particularly burdensome for small, ubiquitous data structures such as coordinates and state vectors. For such data structures a lightweight representation is desirable, allowing such data to be handled directly, similar to primitive types. A number of proposals introduce restricted or mutated variants of standard Java classes that could serve as a lightweight representation, but the impact of these proposals has never been studied. Since we have implemented a Java compiler with lightweight data structures, we are in a good position to do this evaluation. Our lightweight data structures are tuples. As we will show, using tuples can result in significant performance gains: for a number of existing benchmark programs we gain more than 50% in performance relative to our own compiler, and more than 20% relative to Sun's Hotspot 1.4 compiler. We expect similar performance gains for other implementations of lightweight data structures. With respect to the expressiveness of Java, lightweight variants of standard Java classes have little impact. In contrast, tuples add a different language construct that, as we will show, can lead to substantially more concise program code. Copyright © 2005 John Wiley & Sons, Ltd. [source]

A cache-efficient implementation of the lattice Boltzmann method for the two-dimensional diffusion equation
CONCURRENCY AND COMPUTATION: PRACTICE & EXPERIENCE, Issue 14 2004
A. C. Velivelli

Abstract The lattice Boltzmann method is an important technique for the numerical solution of partial differential equations because it has nearly ideal scalability on parallel computers for many applications. However, to achieve the scalability and speed potential of the lattice Boltzmann technique, the issues of data reusability in cache-based computer architectures must be addressed. Utilizing the two-dimensional diffusion equation, this paper examines cache optimization for the lattice Boltzmann method in both serial and parallel implementations.
In this study, speedups due to cache optimization were found to be 1.9–2.5 for the serial implementation and 3.6–3.8 for the parallel case in which the domain decomposition was optimized for stride-one access. In the parallel non-cached implementation, the method of domain decomposition (horizontal or vertical) used for parallelization did not significantly affect the compute time. In contrast, the cache-based implementation of the lattice Boltzmann method was significantly faster when the domain decomposition was optimized for stride-one access. Additionally, the cache-optimized lattice Boltzmann method in which the domain decomposition was optimized for stride-one access displayed superlinear scalability on all problem sizes as the number of processors was increased. Copyright © 2004 John Wiley & Sons, Ltd. [source]

Identification and authentication of integrated circuits
CONCURRENCY AND COMPUTATION: PRACTICE & EXPERIENCE, Issue 11 2004
Blaise Gassend

Abstract This paper describes a technique to reliably and securely identify individual integrated circuits (ICs) based on the precise measurement of circuit delays and a simple challenge–response protocol. This technique could be used to produce key-cards that are more difficult to clone than ones involving digital keys on the IC. We consider potential avenues of attack against our system, and present candidate implementations. Experiments on Field Programmable Gate Arrays show that the technique is viable, but that our current implementations could require some strengthening before they can be considered secure. Copyright © 2004 John Wiley & Sons, Ltd. [source]

Linda implementations in Java for concurrent systems
CONCURRENCY AND COMPUTATION: PRACTICE & EXPERIENCE, Issue 10 2004
G. C. Wells

Abstract This paper surveys a number of the implementations of Linda that are available in Java.
It provides some discussion of their strengths and weaknesses, and presents the results from benchmarking experiments using a network of commodity workstations. Some extensions to the original Linda programming model are also presented and discussed, together with examples of their application to parallel processing problems. Copyright © 2004 John Wiley & Sons, Ltd. [source]

Impact of mixed-parallelism on parallel implementations of the Strassen and Winograd matrix multiplication algorithms
CONCURRENCY AND COMPUTATION: PRACTICE & EXPERIENCE, Issue 8 2004
F. Desprez

Abstract In this paper we study the impact of the simultaneous exploitation of data- and task-parallelism, so-called mixed-parallelism, on the Strassen and Winograd matrix multiplication algorithms. This work takes place in the context of Grid computing and, in particular, in the Client–Agent(s)–Server(s) model, where data can already be distributed on the platform. For each of those algorithms, we propose two mixed-parallel implementations. The former follows the phases of the original algorithms while the latter has been designed as the result of a list scheduling algorithm. We give a theoretical comparison, in terms of memory usage and execution time, between our algorithms and classical data-parallel implementations. This analysis is corroborated by experiments. Finally, we give some hints about heterogeneous and recursive versions of our algorithms. Copyright © 2004 John Wiley & Sons, Ltd. [source]

A comparison of concurrent programming and cooperative multithreading under load balancing applications
CONCURRENCY AND COMPUTATION: PRACTICE & EXPERIENCE, Issue 4 2004
Justin T. Maris

Abstract Two models of thread execution are the general concurrent programming execution model (CP) and the cooperative multithreading execution model (CM). CP provides nondeterministic thread execution where context switches occur arbitrarily.
CM provides threads that execute one at a time until they explicitly choose to yield the processor. This paper focuses on a classic application to reveal the advantages and disadvantages of load balancing during thread execution under the CP and CM styles; results from a second classic application were similar. These applications are programmed in two different languages (SR and Dynamic C) on different hardware (standard PCs and embedded system controllers). An SR-like run-time system, DesCaRTeS, was developed to provide interprocess communication for the Dynamic C implementations. This paper compares load-balancing and non-load-balancing implementations; it also compares CP-style and CM-style implementations. The results show that in cases of very high or very low workloads, load balancing slightly hindered performance, whereas in cases of moderate workload, both the SR and Dynamic C implementations of load balancing generally performed well. Further, for these applications, CM-style programs outperform CP-style programs in some cases, but the opposite occurs in others. This paper also discusses qualitative tradeoffs between CM-style and CP-style programming for these applications. Copyright © 2004 John Wiley & Sons, Ltd. [source]

Data partitioning-based parallel irregular reductions CONCURRENCY AND COMPUTATION: PRACTICE & EXPERIENCE, Issue 2-3 2004 Eladio Gutiérrez Abstract Different parallelization methods for irregular reductions on shared-memory multiprocessors have been proposed in the literature in recent years. We have classified all these methods and analyzed them in terms of a set of properties: data locality, memory overhead, exploited parallelism, and workload balancing. In this paper we propose several techniques to increase the amount of exploited parallelism and to introduce load balancing into an important class of these methods. Regarding parallelism, the proposed solution is based on the partial expansion of the reduction array.
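The general idea behind such expansion techniques can be sketched in a few lines (an illustrative sketch only, not the paper's shared-memory implementation; all names are invented for this example): each worker accumulates into its own private copy of the reduction array, so concurrent writes to the same element never conflict, and the private copies are combined in a final step.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def irregular_reduction_expanded(values, indices, size, n_workers=4):
    """Irregular reduction via array expansion: one private copy of the
    reduction array per worker removes write conflicts; a final combine
    step sums the copies."""
    private = np.zeros((n_workers, size))          # expanded reduction array
    chunks = np.array_split(np.arange(len(values)), n_workers)

    def work(w):
        for i in chunks[w]:
            private[w, indices[i]] += values[i]    # race-free: row w is private

    with ThreadPoolExecutor(n_workers) as pool:
        list(pool.map(work, range(n_workers)))

    return private.sum(axis=0)                     # combine step

# Example: histogram-style reduction with repeated (irregular) indices.
vals = np.array([1.0, 2.0, 3.0, 4.0])
idx = np.array([0, 2, 0, 1])
print(irregular_reduction_expanded(vals, idx, size=3))  # [4. 4. 2.]
```

The memory overhead of full expansion (one copy per worker) is what motivates the *partial* expansion studied in the paper.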
Load balancing is discussed in terms of two techniques. The first technique is a generic one, as it deals with any kind of load imbalance present in the problem domain. The second technique handles a special case of load imbalance which occurs whenever a large number of write operations are concentrated on small regions of the reduction arrays. Efficient implementations of the proposed optimizing solutions for a particular method are presented, experimentally tested on static and dynamic kernel codes, and compared with other parallel reduction methods. Copyright © 2004 John Wiley & Sons, Ltd. [source]

An analysis of VI Architecture primitives in support of parallel and distributed communication CONCURRENCY AND COMPUTATION: PRACTICE & EXPERIENCE, Issue 1 2002 Andrew Begel Abstract We present the results of a detailed study of the Virtual Interface (VI) paradigm as a communication foundation for a distributed computing environment. Using Active Messages and the Split-C global memory model, we analyze the inherent costs of using VI primitives to implement these high-level communication abstractions. We demonstrate a minimum mapping cost (i.e. the host processing required to map one abstraction to a lower abstraction) of 5.4 µs for both Active Messages and Split-C using four-way 550 MHz Pentium III SMPs and the Myrinet network. We break down this cost to the use of individual VI primitives in supporting flow control, buffer management and event processing, and identify the completion queue as the source of the highest overhead. Bulk transfer performance plateaus at 44 Mbytes/s for both implementations, due to the added fragmentation requirements. Based on this analysis, we present the implications for the VI successor, InfiniBand. Copyright © 2002 John Wiley & Sons, Ltd. [source]

Providers' Beliefs, Attitudes, and Behaviors before Implementing a Computerized Pneumococcal Vaccination Reminder ACADEMIC EMERGENCY MEDICINE, Issue 12 2006 Judith W.
Dexheimer MS Abstract Background The emergency department (ED) has been recommended as a suitable setting for offering pneumococcal vaccination; however, implementations of ED vaccination programs remain scarce. Objectives To understand the beliefs, attitudes, and behaviors of ED providers before implementing a computerized reminder system. Methods An anonymous, five-point Likert-scale, 46-item survey was administered to emergency physicians and nurses at an academic medical center. The survey covered ordering patterns, implementation strategies, barriers, and factors considered important for an ED-based vaccination initiative, as well as aspects of implementing a computerized vaccine-reminder system. Results Among 160 eligible ED providers, the survey was returned by 64 of 67 physicians (96%) and all 93 nurses (100%). The vaccine was considered cost-effective by 71% of physicians, but only 2% recommended it to their patients. Although 98% of physicians accessed the computerized problem list before examining the patient, only 28% reviewed the patient's health-maintenance section. A computerized vaccination-reminder system was preferred by 93% of physicians and 82% of nurses. Physicians' preferred implementation approach was a nurse standing order combined with physician notification; nurses, however, favored a physician order. Factors considered important for improving vaccination rates included improved computerized documentation, whereas increasing the number of ED staff was rated less important. Relevant implementation barriers for physicians were not remembering to offer vaccination, time constraints, and insufficient time to counsel patients. The ED was believed to be an appropriate setting in which to offer vaccination. Conclusions Emergency department staff had favorable attitudes toward an ED-based pneumococcal vaccination program; however, considerable barriers inherent to the ED setting may challenge such a program.
Applying information technology may overcome some of these barriers and facilitate an ED-based vaccination initiative. [source]

Grid-induced biases in connectivity metric implementations that use regular grids ECOGRAPHY, Issue 3 2010 Adam G. Dunn Graph-theoretic connectivity analyses provide opportunities to solve problems related to the management, design and maintenance of fragmented landscapes. However, several modern connectivity metrics are implemented using algorithms that are affected by a grid-induced bias. When paths through a regular grid are calculated, distance errors are introduced into the metric outputs, with patterns based on the shape and orientation of the underlying grid structure. The bias is significant in the proposed implementations of the conditional minimum transit cost method introduced by Pinto and Keitt, and the effective resistance method introduced by McRae, Dickson, Keitt and Shah. One solution for ameliorating the bias that affects regular grids is to use an irregular lattice to represent the landscape. The purpose of this paper is to serve as a timely reminder of the grid-induced bias and to provide a demonstration of the irregular grid as a simple solution to the problem. [source]
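The grid-induced bias described in the last abstract can be illustrated with a few lines of code (a generic sketch of the effect, not the implementations of Pinto and Keitt or McRae et al.): shortest-path lengths measured through a 4- or 8-connected regular grid overestimate the true Euclidean distance by an amount that depends on the direction of travel, which is exactly the orientation-dependent error pattern the paper warns about.

```python
import math

def grid_path_length(dx, dy, neighbors=8):
    """Length of the shortest lattice path between two cells of a regular
    grid with unit spacing, for 4-connected or 8-connected moves."""
    dx, dy = abs(dx), abs(dy)
    if neighbors == 4:
        return dx + dy                          # Manhattan distance
    diag = min(dx, dy)                          # octile distance
    return diag * math.sqrt(2) + abs(dx - dy)

# Relative error versus the true Euclidean distance varies with direction:
# axis-aligned paths are exact, while other directions are overestimated
# by up to ~41% (4-connected) or ~8% (8-connected).
for dx, dy in [(10, 0), (10, 10), (10, 5)]:
    true = math.hypot(dx, dy)
    for n in (4, 8):
        err = grid_path_length(dx, dy, n) / true - 1
        print(f"direction ({dx},{dy}), {n}-connected: +{err:.1%}")
```

Because the error depends on direction relative to the grid axes, it cannot be removed by a constant correction factor, which is why the paper proposes irregular lattices instead.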