Abstract

Performance monitoring of computer systems is a complex affair, made even more challenging by the widening gap between hardware and software. Methods that collect data for performance analysis can usually be classified into one of two groups. Methods in the first group (e.g., software instrumentation) can be accurate, but carry unacceptable overheads and offer no insight into the interaction between software and hardware. Methods in the second group (e.g., hardware-supported sampling) run in real time and connect code to the architecture's response, but at a considerable cost in accuracy, with a high degree of hardware specificity, and therefore with limited actionable information.

In this work, we focus on the latter group: the collection and analysis of raw performance data sourced from hardware Performance Monitoring Units (PMUs), built into the majority of modern processors. With Event Based Sampling, the periodic collection of samples containing, for instance, code locations can be triggered on events such as retired instructions or cache misses. We quantify accuracy improvements to existing methods, propose a better-performing method of performance data collection, and introduce a new architecture-agnostic method of performance data analysis. In our experiments, we analyze compiled, large-scale production workloads, industry-standard benchmarks, and micro-benchmarks reproducing specific architectural behaviors.

First, we focus on performance data collection methods and their accuracy when establishing instruction retirement rates. These methods vary from least to most advanced, and we examine the low-level factors influencing their accuracy. In particular, we employ a method based on Last Branch Records (LBRs), a hardware facility present on some Intel architectures that allows the sampling of short records of the last taken branches leading up to a sample. We compare the most commonly used method, its improved derivatives, and the LBR-based method.
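The advantage of LBR-based attribution can be illustrated with a small sketch. If each sample carries a short, ordered list of (source, target) branch address pairs, then the instructions between one branch's target and the next branch's source form a straight-line run that was executed exactly once, so basic blocks along that path can be credited without the sampling skid of single-address methods. The record format and function below are hypothetical, not the thesis's implementation:

```python
# Hypothetical sketch: crediting basic blocks from one Last Branch
# Record (LBR) sample. Each record is a (branch_source, branch_target)
# address pair, ordered oldest to newest. The run from one record's
# target up to the next record's source executed sequentially exactly
# once, so every basic block starting inside it can be credited.

from collections import Counter

def credit_lbr_sample(lbr_records, block_starts):
    """Credit basic blocks covered by one LBR sample.

    lbr_records  -- list of (source, target) branch address pairs,
                    ordered oldest to newest (assumed format).
    block_starts -- list of basic-block start addresses.
    Returns a Counter mapping block start address -> execution count.
    """
    counts = Counter()
    # Walk consecutive records: the instructions between the older
    # record's target and the newer record's source all retired once.
    for (_, run_start), (run_end, _) in zip(lbr_records, lbr_records[1:]):
        for block in block_starts:
            if run_start <= block <= run_end:
                counts[block] += 1
    return counts

# Toy example: three branch records, basic blocks at 0x10, 0x40, 0x80.
sample = [(0x08, 0x10), (0x30, 0x40), (0x70, 0x80)]
blocks = [0x10, 0x40, 0x80]
print(credit_lbr_sample(sample, blocks))
```

In this toy sample, the blocks at 0x10 and 0x40 each lie on a recorded straight-line run and are credited once, while 0x80 is only the final branch target and yields no completed run.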
With respect to the first method, we observe accuracy improvements of up to 18x on synthetic benchmarks and 15x on real applications.

Second, to further improve accuracy, we propose a new collection method named Hybrid Basic Block Profiling, which fuses the LBR-based method with the widely used Event Based Sampling. We apply simple machine learning techniques to determine the operating criteria. As a demonstration of capability, our method and a profiling tool are used to generate instruction mixes, which traditionally require good-quality data at the most detailed level of granularity: basic blocks. Compared to software instrumentation, we observe an improvement in runtime of up to 76x, while keeping instruction attribution errors at 2.1% on average.

In our third contribution, we describe an architecture-agnostic performance analysis methodology called Hierarchical Cycle Accounting. The user is presented with a navigable hierarchy of microarchitectural issues, expressed in a metric that is universal and simple to compare: core cycles. We develop a reference implementation for the Intel Ivy Bridge microarchitecture and provide pointers for implementations on other processor architectures. Our approach and tool have been successfully used by non-experts to improve and vectorize complex scientific code. Experts benefit as well, being able to pinpoint software issues undetected by other methods and to identify a bug in a family of Intel processors.
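The core idea of such a cycle-accounting hierarchy can be sketched as a tree in which every node carries the same unit, core cycles, so the cost of any microarchitectural issue is directly comparable to any other and to the total. The event names and the numbers below are illustrative assumptions, not the thesis's actual Ivy Bridge breakdown:

```python
# Hypothetical sketch of a cycle-accounting hierarchy. Every node is
# expressed in core cycles; children never account for more cycles
# than their parent, and any remainder stays visible as unattributed.
# Node names and numbers are illustrative only.

class CycleNode:
    def __init__(self, name, cycles, children=()):
        self.name = name
        self.cycles = cycles
        self.children = list(children)
        accounted = sum(c.cycles for c in self.children)
        # Enforce hierarchical consistency: a child breakdown may be
        # incomplete, but it must never exceed its parent.
        assert accounted <= cycles, f"{name}: children exceed parent"
        self.unattributed = cycles - accounted

    def report(self, indent=0):
        """Return the navigable hierarchy as indented text lines."""
        lines = [f"{'  ' * indent}{self.name}: {self.cycles} cycles"]
        for child in self.children:
            lines.extend(child.report(indent + 1))
        return lines

# Illustrative numbers for a hypothetical 1000-cycle interval.
root = CycleNode("total", 1000, [
    CycleNode("retiring", 420),
    CycleNode("stalled", 580, [
        CycleNode("memory", 380, [
            CycleNode("L3 miss", 250),
            CycleNode("L2 miss", 90),
        ]),
        CycleNode("frontend", 120),
    ]),
])
print("\n".join(root.report()))
```

Because every level uses the same unit, a user can drill down from "stalled" to "memory" to "L3 miss" and read off, at each step, how many of the interval's cycles that issue costs; the 80 cycles that the sketch's "stalled" children do not cover remain exposed as unattributed rather than silently disappearing.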
