This dissertation is about the construction of idempotent regions and their applications in architecture design. The author discusses static analysis of idempotent regions, program transformation, optimizing for dynamic behavior, and code generation of idempotent regions. The document also includes acknowledgments to the author's advisor and committee members, as well as fellow students. The dissertation was submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy (Computer Sciences) at the University of Wisconsin-Madison in 2012.
By Marc A. de Kruijf
A dissertation submitted in partial fulfillment of the requirements for the degree of
Doctor of Philosophy (Computer Sciences)
at the UNIVERSITY OF WISCONSIN–MADISON 2012
Date of final oral examination: 07/20/
The dissertation is approved by the following members of the Final Oral Committee:
Karthikeyan Sankaralingam, Assistant Professor, Computer Sciences
Mark Hill, Professor, Computer Sciences
Gurindar Sohi, Professor, Computer Sciences
Somesh Jha, Professor, Computer Sciences
Mikko Lipasti, Professor, Electrical and Computer Engineering
This research product owes many things to many people. First is my advisor, Karu, who was instrumental in many ways. More than simply mentoring me in research, he has helped me to frame my own life: to strive to live happily, peacefully, and positively. His tireless work ethic in combination with his profound respect for life-balance is something that I admire greatly. Thanks almost entirely to him, my graduate career was rarely frustrating, full of interesting and exciting work, and very rewarding. I do not know what the future holds, but thanks to you, Karu, I feel more prepared than ever before.
The other committee member who deserves special thanks is Mark, my other professional role model. When I arrived at UW-Madison, Mark was my initial mentor, my CS 552 instructor, and he was the one to invite me to my first Computer Architecture Affiliates meeting only a month after my arrival. Mark never wavered in his support or his willingness to offer guidance. He may not know it, but Mark made me a computer architect. Thank you, Mark.
Among the other members of the committee, Guri forced me to think critically about my own ideas while remaining always supportive; Somesh was a crucial resource in developing key pieces of my research (he taught me to think in very precise terms, and his jovial and spirited nature was always an inspiration to me); and Mikko was an excellent resource for technical discussions in addition to being just an all-around great person. Thank you, Guri, Somesh, and Mikko.
There is no shortage of fellow students to thank. First, the Vertical group. From the days when I was the only student of the group (with an office all to myself), the group is now over ten students strong. Thanks to everyone, with special thanks to my officemates, Venkat and Tony, who always provided great fuel for discussion and distraction. Among the other members, Emily, Chen-han, Jai, Raghu, Zach, Ryan, and Chris also deserve special mention for their support and camaraderie.
In the field of computer architecture today, out-of-order execution is important to maximize architectural efficiency, the shadow of unreliable hardware is ever-looming, and, with the emergence of mainstream parallel hardware, programmability is once again an important and fundamental challenge. Traditionally, hardware checkpointing and buffering techniques are used to assist with each of these problems. However, these techniques introduce overheads, add complexity to the hardware, and often save more state than necessary. With today’s renewed focus on energy efficiency, and with the commercial importance of reduced hardware complexity in today’s processor market, the efficacy of these techniques is no longer absolute.
This thesis develops a novel compiler-based technique to efficiently support a range of hardware features without the need for checkpoints or buffers. The technique breaks programs into idempotent regions (regions that can be freely re-executed) to enable recovery by simple re-execution. The thesis observes that programs can be executed entirely as sequences of idempotent regions, and builds a classification framework to concretely reason about different interpretations of idempotence that apply in the context of computer architecture. It develops static analysis and compiler code generation algorithms and techniques to construct idempotent regions and subsequently demonstrates low overheads and potentially large region sizes for an LLVM-based compiler implementation. Finally, it demonstrates applicability across a range of modern architecture designs in addressing a variety of problems.
The thesis presents several findings. First, it finds that inherently large idempotent regions, in the range of tens to hundreds of instructions, exist across entire programs.
It also finds that a compiler algorithm for constructing the largest possible regions, through careful allocation of function-local state, is capable of constructing regions close to these sizes. Various algorithms are demonstrated that are able to sub-divide these regions into smaller regions to optimize for specific constraints. In the end, however, code generation of small idempotent regions forces relatively high compiler-induced run-time overheads in the range of 10-20% (often increasing register pressure by over 50%), while, for larger regions, this overhead quickly approaches zero as region size grows beyond a few tens of instructions. Thus, the compiler-induced costs of constructing small regions often outweigh any benefits, and optimization trade-offs generally favor constructing regions that are a few tens of instructions or more. This optimization goal tailors the suitability of idempotence-based recovery to specific architecture domains; this thesis considers specifically architecture design and evaluation for general exception support in GPUs, out-of-order retirement in general-purpose processors, and hardware fault tolerance in emerging processor designs.
Correctness: As transistor technology continues to scale to lower feature sizes, hardware is becoming increasingly unreliable [19]. To allow programs to continue to operate correctly even in the face of transient or permanent hardware faults, some form of recovery support is increasingly needed. To reconcile the need to reduce processor overheads with the desire to support program recovery, this thesis develops idempotence to support efficient recovery by re-execution. Idempotence is the property that re-execution has no side-effects; that is, an operation can be executed multiple times with the same effect as executing it only once. At the coarsest granularity, any application whose inputs do not change during execution is idempotent. At the finest granularity, every instruction that does not modify its source operands is also idempotent. In both cases, re-executing the operation does not change the effect of the initial execution.
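The instruction-level case can be illustrated with a minimal C sketch. The function names `set_from` and `increment` are hypothetical and chosen only for this illustration; the first operation leaves its source operand untouched, while the second overwrites it.

```c
#include <assert.h>

/* Idempotent: the source operand *y is never overwritten, so executing
 * the operation again recomputes the same value for *x.
 * Assumption for this sketch: x and y do not alias. */
void set_from(int *x, const int *y)
{
    assert(x != y);
    *x = *y + 1;
}

/* Not idempotent: the operation overwrites its own source operand, so
 * every re-execution starts from a different input value. */
void increment(int *x)
{
    *x = *x + 1;
}
```

Calling `set_from` twice with the same arguments leaves the same result as calling it once, whereas calling `increment` twice adds two.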
Why Idempotence?
An operation is idempotent if its inputs do not change over the course of its execution. Hence, idempotence can be thought of as implicitly forming a checkpoint with respect to the inputs of the operation. In this manner, idempotence over a region of code can render traditional hardware checkpointing techniques unnecessary; in the event of failure, idempotence can be used to correct the state of the system by simple re-execution. Moving from explicit hardware checkpointing to an implicit checkpointing model built upon idempotence can benefit computer systems at multiple levels. First, at the microarchitecture level, the absence of hardware buffering and/or checkpointing reduces interdependencies between processor structures, reduces power and area, and allows existing hardware resources to be used more efficiently in the absence of contention. Second, at the circuit level, lower threshold voltages and tighter noise margins on transistors make hardware design and verification increasingly difficult; hence, less functionality in hardware implies substantially lower hardware design and verification effort. Finally, at the program level, checkpoints can be inflexible. This inflexibility is not only inconvenient, but it can also hurt overall efficiency if the checkpoints are overly conservative.
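Recovery by re-execution can be sketched in a few lines of C. This is only an illustration under stated assumptions, not the mechanism developed in the thesis: the fault detector is modeled as a hypothetical callback (`fault_check_fn`), and `sum_region`, `run_with_recovery`, and `fail_n_times` are names invented for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical fault detector: returns true if the last execution of
 * the region must be considered corrupted. */
typedef bool (*fault_check_fn)(void *ctx);

/* Example region. It reads only src[0..n) and writes only *out, and it
 * writes *out before reading it, so *out is not an input. A partially
 * completed run therefore leaves all inputs intact.
 * Assumption for this sketch: out does not alias src. */
void sum_region(const int *src, size_t n, int *out)
{
    *out = 0;
    for (size_t i = 0; i < n; i++)
        *out += src[i];
}

/* Recovery without a checkpoint: when a fault is reported, simply run
 * the region again from its entry. Returns the number of executions. */
int run_with_recovery(const int *src, size_t n, int *out,
                      fault_check_fn faulted, void *ctx)
{
    assert(src != NULL && out != NULL && faulted != NULL);
    int attempts = 0;
    do {
        sum_region(src, n, out);
        attempts++;
    } while (faulted(ctx));
    return attempts;
}

/* Stand-in detector for experimentation: reports a fault until the
 * counter that ctx points to reaches zero. */
bool fail_n_times(void *ctx)
{
    int *remaining = ctx;
    if (*remaining > 0) {
        (*remaining)--;
        return true;
    }
    return false;
}
```

With a counter of 2, the region executes three times and still produces the correct sum, because no execution modified the inputs that the next one reads.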
This thesis observes that applications can fully decompose into idempotent regions of code, and that these regions can be used to recover from a range of failure scenarios. The size and arrangement of idempotent regions are configurable during compilation, and a compiler can construct idempotent regions that are usefully large at the expense of only a small amount of run-time overhead. This thesis makes the following specific contributions:
Idempotence in computer architecture: It presents the first comprehensive analysis of idempotence and its implications for architecture and compiler design. In particular:
Compiler design – code generation: Code generation of small (semantically-constrained) idempotent regions commonly forces performance overheads of over 10%. For larger regions, this overhead approaches zero in the limit as region size grows beyond a few tens of instructions.
Compiler design – ISA sensitivity: Among three ways in which the ISA could affect the run-time overheads of idempotence-based compilation, none appear significant. Independent of the ISA, small (semantically-constrained) idempotent regions increase register pressure by approximately 60%. For larger regions, register pressure effects approach zero in the limit as region size grows beyond a few tens of instructions.
Architecture design: GPUs can support general exceptions cleanly using idempotence with run-time overheads of less than 2% (for traditional GPU workloads). CPUs can be simplified to support exceptions with out-of-order retirement with typical run-time overheads of 10%. Adding support for efficient branch-misprediction recovery using idempotence on CPUs increases the typical run-time overheads to 20%. Finally, architectures can use idempotence to support hardware fault recovery with run-time overheads of roughly 10%, assuming low-latency fault detection capability.
The core of the dissertation is organized into three parts: idempotence models in computer architecture, compiler design & evaluation, and architecture design & evaluation. These three parts span Chapters 2-6, with the closing Chapters 7-8 presenting related work and conclusions.
Idempotence Models in Computer Architecture
Chapter 2 explores and analyzes the concept of idempotence as it applies to computer architecture. As background, it presents examples of idempotence applied in computer science and subsequently develops a taxonomy to reason about idempotence specifically as it applies to computer architecture. Leveraging this taxonomy, it performs an empirical study of the sizes of idempotent regions that could be attained for different idempotence models arising from the taxonomy given semantic program constraints. Finally, it identifies the two idempotence models, architectural and contextual idempotence, that are developed in the remainder of the dissertation.
Compiler Design & Evaluation
Chapters 3-5 present the static analysis, code generation, and evaluation of a compiler design that constructs idempotent regions in programs, optimizing for architectural and contextual idempotence, across a range of application and environmental constraints. Chapter 3 develops a static analysis for identifying the largest idempotent regions given semantic program constraints. Chapter 4 develops support for sub-dividing regions and preserving the idempotence property of these regions as they are compiled down to machine instructions. Finally, Chapter 5 presents a comprehensive evaluation of a full, end-to-end compiler implementation.
Architecture Design & Evaluation
Chapter 6 motivates and develops the architecture support to utilize idempotence for recovery across a range of architecture designs. The overall architectural vision is one where the analysis of idempotence occurs in software (e.g., in a compiler), and the hardware consumes the output of this analysis to enable hardware design simplification and flexibility. Specifically, the applications to GPU, CPU, and emerging fault-tolerant architecture designs are explored and evaluated. In contrast to the rigorous compiler implementation evaluation of Chapter 5, the individual architecture evaluations are more abstract, using simulation-based evaluation. Detailed microarchitecture design and implementation is left as a topic for follow-on work.
All three parts of the dissertation are empirically grounded with a largely common experimental methodology used throughout. However, there are differences as the experimental purpose varies. Table 1.1 highlights the primary differences. Regarding benchmarks, the benchmark suites we study throughout are SPEC 2006 [99], a suite targeted at conventional single-threaded workloads, PARSEC [16], a suite targeted at emerging
Prior Work        Topic                                 Chapters
PLDI 2012 [29]    Static analysis and compiler design   3, 5, 6
ISCA 2012 [70]    Application to GPU architecture       6
MICRO 2011 [28]   Application to CPU architecture       5, 6
Table 1.2: The relation of the author’s prior work to the dissertation material.
compiler implementation that balances the execution overheads associated with smaller idempotent regions against those potentially associated with larger regions. Namely, Chapter 2, Chapter 4, and parts of Chapter 5 are largely unique to this thesis and are not part of previously published work.
This chapter analyzes idempotence and idempotence-based recovery specifically in the context of application programs executed as sequences of instructions. It develops a framework for the analysis of idempotence in this context and develops a taxonomy to reason about a spectrum of idempotence models. It subsequently offers empirical and qualitative analysis to identify two specific models, architectural and contextual idempotence, that are deemed meaningful for exploration in subsequent chapters.
Parts of this chapter are heavy on formalism; with an understanding of certain specific characteristics of architectural and contextual idempotence, the impatient reader is free to skip this chapter and continue on to the remaining chapters of this dissertation. The relevant characteristics are as follows. Both models allow the construction of idempotent regions of maximal size with respect to the common case of data-race-free multi-threaded execution. Importantly, both models specifically assume invariable control flow semantics upon re-execution with respect to non-local memory state. Where the two models differ is in what they assume with respect to other (local) state: while architectural idempotence again assumes invariable control flow, contextual idempotence allows for variable control flow semantics.
The chapter is organized as follows. Section 2.1 presents the intuition behind the taxonomy it develops, presenting example idempotence models over sequences of instructions. Section 2. then formally defines key terms and Section 2.3 presents the taxonomy, identifying three axes of variation within the taxonomy. A permutation of the points along these axes forms an idempotence model. Section 2.4 analyzes the space of idempotence models and then distills the space to two models, architectural and contextual idempotence, that are deemed most meaningful. Section 2. presents a summary and conclusions.
processing state of switches used to deliver the request over the network. Finally, a load or store instruction causing an exception may invoke an operating system service routine that updates some system-internal state (e.g., page table entries). From this discussion, it is evident that the power of idempotence applied over a system lies in part with how that system is defined. Considering the architecture underlying the execution of an application program as the system, there are multiple definitions, or models, of idempotence that are meaningful. This chapter develops a formal taxonomy to concretely reason about these different models as they emerge from assumptions about the architecture environment. The discussion below presents intuition by presenting example models.
Example Idempotence Models
As stated earlier, an operation is idempotent if the effect of executing it multiple times is the same as the effect of executing it only once. This property is achieved if the operation’s inputs are preserved throughout its execution; with the same inputs, the operation will produce the same outputs each time it executes. However, what it means to “preserve an input” is subject to interpretation, and many different interpretations make sense depending on the context. This section presents four different example interpretations (models) that are all meaningful in the context of programs executed as sequences of instructions. A region is considered the unit of operation, and the following definitions are assumed:
Region: A region is defined as a collection of instructions uniquely identified by the single instruction that forms its entry point. A region contains the set of instructions reachable by control flow from its entry point up to its exit points.
Live-in: A variable is live-in to a region if the variable may hold a value that is (a) defined (written) before entry to the region and (b) potentially used (read) after entry to the region.
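For straight-line code, the live-in definition can be made concrete with a small C sketch. It rests on several assumptions that are not part of the original text: the instruction format `insn_t` and the function `live_in_mask` are hypothetical, variables are limited to registers numbered 0 to 31, the region has no branches, and every register is assumed to have been defined before region entry, so condition (a) always holds and only condition (b) is computed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical three-address instruction over registers R0..R31.
 * A register number of -1 means "operand not used". */
typedef struct {
    int dst;
    int src1;
    int src2;
} insn_t;

/* Returns a mask with bit r set when register r is read inside the
 * region before the region itself writes it. Under the assumptions
 * above, these are the live-in registers of the region. */
uint32_t live_in_mask(const insn_t *code, size_t n)
{
    uint32_t written = 0;
    uint32_t live_in = 0;
    for (size_t i = 0; i < n; i++) {
        const int srcs[2] = { code[i].src1, code[i].src2 };
        for (int s = 0; s < 2; s++) {
            int r = srcs[s];
            assert(r < 32);
            if (r >= 0 && !(written & (UINT32_C(1) << r)))
                live_in |= UINT32_C(1) << r;
        }
        assert(code[i].dst < 32);
        if (code[i].dst >= 0)
            written |= UINT32_C(1) << code[i].dst;
    }
    return live_in;
}
```

For the sequence `R2 = R0 + R1; R3 = R2; R0 = R3 + R1`, the result has bits 0 and 1 set: R0 and R1 are live-in, while R2 and R3 are written before they are read. The last instruction overwrites the live-in R0, which is the kind of write that prevents the three instructions from forming a single idempotent region.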
The code of the function shown in Figure 2.1, written in the C programming language, is used as an example of inherently non-idempotent code that can be divided into idempotent regions. The function, list_push, checks a list for overflow and then pushes an integer element onto the end of the list.
Figure 2.1: Example source code.
The left side of Figure 2.2 shows the function compiled to a stylized assembly code organized into basic blocks, with arrows connecting the control flow between basic blocks. The code assumes four registers are available, R0-R3, with function arguments held in registers R0 and R1, and R0 also the return register. In the discussion that follows, the effect of a given idempotence model is measured by forming the set of maximally-sized idempotent regions found by greedily scanning and incrementally adding instructions to a region until doing so would render the region non-idempotent, at which point a new idempotent region is formed starting at the next instruction that is itself idempotent. In practice, identifying idempotent regions (in particular, semantically idempotent regions) requires a more sophisticated analysis (see Chapter 3); this algorithm is assumed for illustration purposes only.
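The greedy scan described above can be sketched in C for the simplest possible setting. This is an illustration under stated assumptions rather than the analysis used in the thesis: it handles only straight-line code over registers (no memory, no control flow), treats a region as non-idempotent as soon as an instruction overwrites a register that the region read before writing, and uses the hypothetical names `insn_t`, `reg_bit`, and `form_regions`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical three-address instruction; -1 marks an unused operand. */
typedef struct { int dst, src1, src2; } insn_t;

/* Mask bit for register r, or 0 for an unused operand. */
static uint32_t reg_bit(int r)
{
    assert(r < 32);
    return r >= 0 ? UINT32_C(1) << r : 0;
}

/* Greedy scan over straight-line code. region_of[i] receives the index
 * of the region holding instruction i, or -1 when the instruction
 * overwrites its own source and so is not idempotent by itself.
 * Returns the number of regions formed. */
int form_regions(const insn_t *code, size_t n, int *region_of)
{
    int regions = 0;
    int open = 0;           /* is a region currently being extended? */
    uint32_t written = 0;   /* registers written so far in the region */
    uint32_t live_in = 0;   /* registers read before being written */

    for (size_t i = 0; i < n; i++) {
        uint32_t reads = reg_bit(code[i].src1) | reg_bit(code[i].src2);
        uint32_t write = reg_bit(code[i].dst);

        if (reads & write) {            /* e.g. R1 = R1 + 1 */
            region_of[i] = -1;
            open = 0;                   /* next region starts later */
            continue;
        }
        uint32_t grown = live_in | (reads & ~written);
        if (!open || (write & grown)) {
            /* Adding i would overwrite a live-in (or no region is
             * open): start a new region at this instruction. */
            regions++;
            open = 1;
            written = 0;
            live_in = reads;
        } else {
            live_in = grown;
        }
        written |= write;
        region_of[i] = regions - 1;
    }
    return regions;
}
```

For the sequence `R2 = R0 + R1; R3 = R2; R0 = R3 + R1; R1 = R1; R2 = R0`, the scan places the first two instructions in one region, cuts before the write to R0, skips the self-overwriting instruction, and opens a third region at the last instruction.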