Performance prediction and race detection in message-passing parallel applications

The combination of low cost clusters and multicore processors lowers the barrier for accessing massive amounts of computing power. As computational sciences advance, the use of in silico simulations to complement in vivo experiments promises parallel programming a bright future in multiple scientific fields. It is therefore increasingly important to develop tools helping developers to write efficient and bug-free parallel applications. This thesis focuses on performance prediction and advanced testing tools for distributed memory message-passing parallel applications. The tools have been implemented within the Dynamic Parallel Schedules (DPS) parallelization framework. They have also been partly adapted to applications written using the Message Passing Interface (MPI) standard. The first part presents a parallel application simulator which has been integrated into the DPS framework. We identified a small set of processing and networking parameters that characterize the hardware platform on which the application is running. After parameterizing the hardware platform, the running time of parallel applications can be predicted using direct execution without requiring any change to the application source code. We propose a partial direct execution technique that reduces the execution time and memory consumption of the simulation. Using partial direct execution, the simulation is no longer tied to the platform to be simulated. Simulations may thus run on a desktop computer rather than on the target parallel machine. The proposed parameterization of the application and of the hardware properties enable using the simulator to study the sensitivity of a parallel application to various operating conditions such as the data subdivision granularity, the adopted parallelization strategy and the underlying hardware platform properties. The proposed simulator helps developers identifying the factors having the largest impact on their application's performance, and determining the most suitable cluster hardware configuration. Speed should not come at the expense of correctness. Since improving parallelization efficiency often requires loosening synchronizations or implementing more complex communication patterns, developers need to ensure that their changes do not introduce potential message races or deadlocks. Deadlocks and message races are common sources of problems in parallel applications and stem from the fact that the delivery of messages from different sources is not deterministically ordered. This non-determinism makes such synchronization errors hard to reproduce and debug. The second part of the thesis presents methods for uncovering potential deadlocks and message races by taking advantage of the flow graph structures and checkpointing capabilities of DPS. We developed a debugger for DPS applications that displays an instantaneous graphical view of the global computation state and is able to control the ordering of message delivery in order to explicitly test specific orderings. The number of possible orderings explodes when the number of messages sent by the application increases. Manual testing can only cover a tiny fraction of possible executions. Therefore, we use the simulator's ability to control the execution of a parallel application, in order to automatically detect deadlocks and message races. A first method for reducing the number of orderings to be tested relies on a partial-order reduction of the search space and on the decomposition of the application execution into independently testable subparts. This method relies on a static analysis of an execution trace of the application, and can therefore only be applied to parallel applications that produce a fixed set of messages, i.e. applications producing the same messages for all delivery orderings. In order to overcome this limitation, we propose an approach relying on the dynamic construction of a state graph expressing possible executions. Both methods reduce the testing costs by several orders of magnitude, and can be combined to further improve the results. Nevertheless, testing durations may remain prohibitive for longer running applications. We therefore also define algorithms generating subsets of possible orderings that are likely to reveal erroneous executions. In the recent years, the MPI standard has emerged as the de facto standard for writing message-passing parallel applications. The final part of this thesis therefore focuses on adapting the aforementioned parallel application testing concepts to MPI applications. We first describe the extension of our work on visualizing the execution of parallel applications. We then discuss the limits and the benefits of using partial-order execution graphs to describe MPI application executions, and show that our dynamic message-passing state graph construction approach can be successfully applied.

Related material