ANNOUNCEMENT – COE MS THESIS DEFENSE
Mr. Allam Abdalgani Fatayer, Part-Time COE MS Student, will defend his MS thesis on Thursday, December 11, 2014 at 1:00 p.m. in 22-130. His MS thesis title is “EXPERIMENTAL EVALUATION OF PARALLEL PROGRAM SCALABILITY ON THE XEON PHI SMP”. His thesis advisor is Dr. Mayez Al-Mouhamed, Professor, COE Department.
You are cordially invited to attend.
Abstract: As the era of Moore’s Law and increasing CPU clock rates nears its stopping point, the focus of chip and hardware design has shifted to increasing the number of computation cores present on the chip. This increase can be seen most clearly in the rise of Many Integrated Core (MIC) processors. Programming for these chips produces a new set of challenges and concerns. In this context, we present an experimental evaluation of parallel program scalability on the MIC Shared Memory Multiprocessor (SMP) using the OpenMP programming paradigm. We address two classes of applications: (1) static and (2) semi-static. For the first class, we select applications from basic linear algebra and numerical algorithms: Matrix-Matrix Multiplication (MM) and the Jacobi Solver (JS). In particular, we analyze, optimize, and implement these applications. For MM, we used the Strassen matrix multiplication algorithm. The basic Strassen-MM (S-MM) algorithm has a time complexity of O(n^2.807) instead of the O(n^3) of the standard MM algorithm. Our optimization is based on a reordering approach to reduce storage, the use of a depth-first walk (DFW), and invocation of the optimized Math Kernel Library (MKL) for smaller matrix-matrix multiplications. For large matrices, MM using Strassen outperforms MKL by 8% to 24%. For JS, we noticed that it does not scale well because of the excessive synchronization overhead, which must be performed across all working threads. To improve JS scalability, we explored (1) Synchronous Jacobi (SJ), (2) Asynchronous Jacobi (AJ), and (3) Relaxed Jacobi (RJ). In SJ, we used explicit barrier synchronization. In AJ, a non-exact solution is computed because completing threads start the next iteration using current data, which is a mix of new and old values. AJ slows down the convergence rate. In RJ, threads completing iteration k start the next iteration (k+1) using newly computed data. RJ provides overlap between two iterations at the cost of managing the availability of intermediate results. Experiments show that SJ synchronization accounts for 50% of the execution time for a matrix of size 4096. If a non-exact solution is acceptable, AJ produces the best results of the three because it eliminates barriers. For exact solutions, our evaluation shows a performance gain of 24.4%, 32.6%, 38.9%, and 57.16% for RJ over SJ for matrices of size 3840, 7680, 15360, and 30720, respectively, using 60 cores.

For the second class, we select a classical semi-static problem (N-Body simulation). In this application, an approximate solution using the Barnes-Hut (BH) algorithm is implemented. BH uses an oct-tree in which each node stores the aggregate mass of all of its child nodes (sub-tree) at their center of mass. One difficulty is that the thread load changes moderately from one iteration to the next due to body motion in space. To improve scalability, we use a Dynamic Load Balancing (DLB) approach combined with data locality, which we call Iterative Cost Zone Load Balancing (ICZB). Our implementation on MIC shows that the execution time and aggregate load scale linearly with the problem size when using 60 cores for problem sizes ranging from 1 million to 4 million. In addition, our DLB-BH provides an increased speedup of 42% and 36% on problem sizes of 1 million and 4 million, respectively, compared to the traditional S-BH. We recommend DLB as a compiler optimization strategy for semi-static applications.
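
Illustrative note: the synchronous Jacobi scheme mentioned in the abstract can be sketched in a few lines. The following is a minimal, sequential Python sketch and not the thesis implementation, which uses OpenMP on the Xeon Phi. The function names (jacobi_sweep, jacobi_solve), the tolerance, and the 3x3 test system are assumptions chosen for illustration only. It shows that every new value is computed from the previous iterate alone, which is why a parallel version would need a barrier between sweeps.

```python
def jacobi_sweep(A, b, x_old):
    """One synchronous Jacobi sweep: every x_new[i] reads only x_old."""
    n = len(b)
    x_new = [0.0] * n
    for i in range(n):
        # Sum of off-diagonal terms, using values from the previous iterate.
        s = sum(A[i][j] * x_old[j] for j in range(n) if j != i)
        x_new[i] = (b[i] - s) / A[i][i]
    return x_new


def jacobi_solve(A, b, tol=1e-10, max_iters=1000):
    """Repeat sweeps until successive iterates differ by less than tol."""
    x = [0.0] * len(b)
    for k in range(max_iters):
        x_next = jacobi_sweep(A, b, x)
        # In a threaded version, this is where all threads would have to
        # wait (a barrier) before any of them may read x_next as "old" data.
        if max(abs(p - q) for p, q in zip(x_next, x)) < tol:
            return x_next, k + 1
        x = x_next
    return x, max_iters


# Hypothetical diagonally dominant system whose exact solution is [1, 1, 1].
A = [[4.0, 1.0, 0.0],
     [1.0, 5.0, 2.0],
     [0.0, 2.0, 6.0]]
b = [5.0, 8.0, 8.0]

x, iters = jacobi_solve(A, b)
print(x, iters)
```

The loop over i inside jacobi_sweep carries no dependence between rows, so it is the natural candidate for parallelization; the per-sweep wait is the synchronization cost that the asynchronous and relaxed variants described above aim to reduce.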
Refreshments (tea, cold drinks, water, and cakes) will be served.
Dr. Ahmad Almulhem
Chairman, COE Dept.