
REDUCED INSTRUCTION SET COMPUTERS

Prof. Vojin G. Oklobdzija
Integration
Berkeley, CA 94708

Keywords: IBM 801; RISC; computer architecture; Load/Store Architecture; instruction sets; pipelining; super-scalar machines; super-pipeline machines; optimizing compiler; Branch and Execute; Delayed Branch; Cache; Harvard Architecture; Delayed Load; Super-Scalar; Super-Pipelined.

Fall 1999

1. ARCHITECTURE

The term Computer Architecture was first defined in the paper by Amdahl, Blaauw and Brooks of International Business Machines (IBM) Corporation announcing the IBM System/360 computer family on April 7, 1964 [1,17]. On that day IBM introduced, in the words of an IBM spokesman, "the most important product announcement that this corporation has made in its history".

Computer architecture was defined as the attributes of a computer seen by the machine language programmer, as described in the Principles of Operation. IBM referred to the Principles of Operation as a definition of the machine that enables the machine language programmer to write functionally correct, time-independent programs that would run across a number of implementations of that particular architecture.

The architecture specification covers all functions of the machine that are observable by the program [2]. The Principles of Operation, on the other hand, are used to define the functions that the implementation should provide. In order to be functionally correct, the implementation must conform to the Principles of Operation. The Principles of Operation document defines the computer architecture, which includes:

  • Instruction set
  • Instruction format
  • Operation codes
  • Addressing modes
  • All registers and memory locations that may be directly manipulated or tested by a machine language program
  • Formats for data representation

Machine Implementation was defined as the actual system organization and hardware structure encompassing the major functional units, data paths, and control.

Machine Realization includes issues such as logic technology, packaging and interconnections.

Separation of the machine architecture from the implementation enabled several embodiments of the same architecture to be built. Operational evidence proved that architecture and implementation could be separated and that one need not imply the other. This separation made it possible to transfer programs routinely from one model to another and expect them to produce the same results, which defined the notion of architectural compatibility. Implementation of a whole line of computers according to a common architecture requires unusual attention to detail and some new procedures, which are described in the Architecture Control Procedure. The design and control of system architecture is an

1.2. RISC Performance

From the very beginning, the quest for higher performance has been present in every computer model and architecture. It has been the driving force behind the introduction of every new architecture and system organization. There are several ways to achieve performance: technology advances, better machine organization, better architecture, and optimization and improvements in compiler technology. Technology alone can enhance machine performance only in proportion to the technology improvement, and this is, more or less, available to everyone. It is in the machine organization and the machine architecture that the skills and experience of computer design are shown. RISC deals with these two levels, more precisely with their interaction and trade-offs.

The work that each instruction of a RISC machine performs is simple and straightforward. Thus, the time required to execute each instruction can be shortened and the number of cycles reduced. Typically, instruction execution is divided into five stages (machine cycles); as soon as processing in one stage is finished, the instruction proceeds to the next stage, and as soon as a stage becomes free it performs the same operation for the next instruction. Instructions therefore execute in a pipelined fashion, similar to an assembly line in a factory. Typically those five pipeline stages are:

  • IF – Instruction Fetch
  • ID – Instruction Decode
  • EX – Execute
  • MA – Memory Access
  • WB – Write Back

By overlapping the execution of several instructions in a pipeline fashion (as shown in Fig. 1.), RISC achieves its inherent execution parallelism which is responsible for the performance advantage over the Complex Instruction Set Architectures (CISC).

Fig. 1. Typical five-stage RISC pipeline: at any given time there are five instructions in different stages of execution.

I1:  IF  ID  EX  MA  WB
I2:      IF  ID  EX  MA  WB
I3:          IF  ID  EX  MA  WB
I4:              IF  ID  EX  MA  WB
I5:                  IF  ID  EX  MA  WB

The goal of RISC is to achieve an execution rate of one Cycle Per Instruction (CPI = 1.0), which would be the case if there were no interruptions in the pipeline. In practice, however, this is not the case.
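As a back-of-the-envelope illustration of this point, the following minimal sketch (not taken from the text; the cycle accounting assumes an ideal pipeline with no stalls) counts the cycles needed to run a stream of instructions through a five-stage pipeline and shows the CPI approaching 1.0 as the pipeline stays filled:

```python
# Minimal sketch: cycle count and CPI for an idealized five-stage RISC
# pipeline with no stalls.  Stage names follow the text (IF, ID, EX, MA, WB);
# the accounting below is illustrative, not taken from the source.

STAGES = ["IF", "ID", "EX", "MA", "WB"]

def pipelined_cycles(n_instructions: int, depth: int = len(STAGES)) -> int:
    """The first instruction needs `depth` cycles; each following one
    retires one cycle later, so total = depth + (n - 1)."""
    return depth + (n_instructions - 1)

def cpi(n_instructions: int, depth: int = len(STAGES)) -> float:
    return pipelined_cycles(n_instructions, depth) / n_instructions

for n in (5, 100, 10_000):
    print(f"{n:>6} instructions: {pipelined_cycles(n):>6} cycles, CPI = {cpi(n):.3f}")
# 5 instructions take 9 cycles (CPI = 1.8); 10,000 take 10,004 (CPI ~ 1.0004),
# i.e. CPI approaches 1.0 only while the pipeline stays filled.
```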

The instructions and addressing modes in a RISC architecture are carefully selected and tailored to the most frequently used operations, so as to result in the most efficient execution of the RISC pipeline.

The simplicity of the RISC instruction set is traded for more parallelism in execution. On average, code written for RISC will consist of more instructions than code written for CISC. The typical trade-off between RISC and CISC can be expressed in the total time required to execute a certain task:

Time(task) = I x C x P x T0

where:
  I  = number of instructions per task
  C  = number of cycles per instruction
  P  = number of clock periods per cycle (usually P = 1)
  T0 = clock period (ns)

While a CISC program will typically contain fewer instructions for the same task, the execution of its complex operations will require more cycles and more clock ticks within a cycle than RISC [19]. RISC, on the other hand, requires more instructions for the same task. However, RISC executes its instructions at the rate of one instruction per cycle, and its machine cycle typically requires only one clock tick. In addition, given the simplicity of the instruction set, as reflected in a simpler machine implementation, the clock period T0 of a RISC can be shorter, allowing the RISC machine to run at a higher speed than CISC. As of today, RISC machines typically run at clock rates in excess of 667 MHz, reaching 1 GHz, while CISC machines barely reach a 500 MHz clock rate.
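The formula above can be made concrete with a small worked example. The RISC and CISC parameter values below are hypothetical, chosen only to illustrate the trade-off described in the text (more instructions, but CPI near 1 and a shorter clock period on the RISC side), not measured data:

```python
# Worked example of Time(task) = I x C x P x T0.  The RISC and CISC
# parameter values are hypothetical, chosen only to illustrate the
# trade-off, not measurements from the source.

def task_time_ns(I: int, C: float, P: float, T0_ns: float) -> float:
    """I = instructions/task, C = cycles/instruction,
    P = clock periods/cycle, T0 = clock period in ns."""
    return I * C * P * T0_ns

# Hypothetical CISC: fewer instructions, but more cycles per instruction
# and a slower clock (500 MHz -> 2 ns period).
cisc = task_time_ns(I=100, C=4.0, P=1, T0_ns=2.0)
# Hypothetical RISC: ~30% more instructions, CPI close to 1, faster clock
# (1 GHz -> 1 ns period).
risc = task_time_ns(I=130, C=1.3, P=1, T0_ns=1.0)

print(f"CISC: {cisc:.0f} ns   RISC: {risc:.0f} ns   ratio: {cisc / risc:.2f}x")
# CISC: 800 ns   RISC: 169 ns   ratio: 4.73x
```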

The trade-off between RISC and CISC can be summarized as follows:

a. CISC achieves its performance advantage through a denser program consisting of a smaller number of powerful instructions.
b. RISC achieves its performance advantage by having simpler instructions, resulting in a simpler and therefore faster implementation that allows more parallelism and runs at a higher speed.

2. RISC MACHINE IMPLEMENTATION

The main feature of RISC is the architectural support for the exploitation of parallelism at the instruction level. Therefore, all the distinguishing features of RISC architecture should be considered in light of their support for the RISC pipeline. In addition, RISC takes advantage of the principle of locality, spatial and temporal: data that was used recently is likely to be used again. This justifies the

Memory access is accomplished through Load and Store instructions only; thus the term "Load/Store Architecture" is often used when referring to RISC. The RISC pipeline is specified so that it accommodates both operations and memory accesses with equal efficiency. The pipeline stages of the Load and Store operations in RISC are shown in Fig. 3.

Fig. 3. The Operation of Load/Store Pipeline
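The following small sketch (purely illustrative; the mnemonics, register names and the emitting function are invented, not taken from any particular instruction set) shows what the Load/Store restriction means in practice: an operation that a CISC might encode as a single memory-to-memory instruction becomes explicit loads, a register-to-register operation, and a store.

```python
# Illustrative sketch: in a Load/Store architecture, memory is touched only
# by load and store instructions, so a statement such as
#     mem[c] = mem[a] + mem[b]
# becomes explicit load / operate / store steps.  The mnemonics, register
# names and the tiny "emitter" below are invented for illustration.

def emit_load_store_add(a: str, b: str, c: str) -> list:
    return [
        f"ld   r1, {a}      ; load first operand into a register",
        f"ld   r2, {b}      ; load second operand",
        "add  r3, r1, r2   ; register-to-register operation (EX stage)",
        f"st   r3, {c}      ; store the result back to memory",
    ]

for line in emit_load_store_add("a", "b", "c"):
    print(line)
# A CISC could instead encode this as a single memory-to-memory instruction,
# which needs several memory accesses per instruction and does not map
# cleanly onto a fixed five-stage pipeline.
```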

2.2. Carefully Selected Set of Instructions

The principle of locality is applied throughout RISC. The fact that only a small set of instructions is used most frequently guided the choice of the most efficient pipeline organization, with the goal of exploiting instruction-level parallelism in the most efficient way. The pipeline is "tailored" to the most frequently used instructions. Such a pipeline must efficiently serve three main instruction classes:

  • Access to Cache: Load/Store
  • Operation: Arithmetic/Logical
  • Branch

Given the simplicity of the pipeline, the control part of RISC is implemented in hardware, unlike its CISC counterpart, which relies heavily on micro-coding.

[Figure: Load/Store pipeline datapath. Instruction fetch through the Instruction Address Register (IAR) and instruction cache into the IR; decode and register-file read; effective-address calculation in the ALU (E-Address = Base + Displacement); data-cache access (RD/WR); and write-back of the data from the cache to the register file.]

However, this is the most misunderstood aspect of RISC architecture, one that has even resulted in the inappropriate name: RISC. "Reduced Instruction Set Computer" implies that the number of instructions in RISC is small. This has created a widespread misunderstanding that the main feature characterizing RISC is a small instruction set. This is not true. The number of instructions in a RISC instruction set can be substantial, and it can grow until the complexity of the control logic begins to impose an increase in the clock period. In practice this point lies well beyond the number of instructions commonly used. We have therefore reached the possibly paradoxical situation that several representative RISC machines known today have an instruction set larger than that of CISC.

For example, the IBM PC-RT instruction architecture contains 118 instructions, while the IBM RS/6000 (PowerPC) contains 184. This should be contrasted with the IBM System/360, containing 143 instructions, and the IBM System/370, containing 208. The first two are representatives of RISC architecture while the latter two are not.


2.3. Fixed format instructions

What really matters for RISC is that the instructions have a fixed and predetermined format, which facilitates decoding in one cycle and simplifies the control hardware. Usually the size of RISC instructions is also fixed to the size of the word (32 bits); however, there are

Fig. 4. Branch Instruction. [Datapath sketch: instruction fetch through the Instruction Address Register (IAR) and instruction cache into the IR; decode and register-file read; the next instruction address is selected (MUX) between IAR + 4 and IAR + Offset; a condition test (e.g. Ra = Rb) during the φ0/φ1 clock phases decides whether the branch is taken.]

In the case of the cache, this locality can be spatial and temporal. Spatial locality means that the location most likely to be referenced next is in the neighborhood of the location that was just referenced. Temporal locality, on the other hand, means that the location most likely to be referenced next is one of the memory locations referenced recently. The cache operates on this principle.
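A minimal cache model makes both forms of locality visible. The sketch below (hypothetical parameters; a direct-mapped cache is assumed purely for simplicity) shows that a sequential scan hits within cached blocks (spatial locality) and that a small, repeatedly used working set hits once it is resident (temporal locality):

```python
# Minimal direct-mapped cache sketch with hypothetical parameters,
# illustrating the two forms of locality described above.

class DirectMappedCache:
    def __init__(self, n_lines: int = 64, block_size: int = 16):
        self.n_lines = n_lines
        self.block_size = block_size
        self.tags = [None] * n_lines      # one tag per cache line

    def access(self, addr: int) -> bool:
        """Return True on a hit, False on a miss (and fill the line)."""
        block = addr // self.block_size
        index = block % self.n_lines
        tag = block // self.n_lines
        if self.tags[index] == tag:
            return True
        self.tags[index] = tag
        return False

def hit_rate(cache, addresses):
    hits = sum(cache.access(a) for a in addresses)
    return hits / len(addresses)

# Spatial locality: a sequential scan mostly hits inside each cached block.
print(f"sequential scan:     {hit_rate(DirectMappedCache(), range(4096)):.2%}")
# Temporal locality: a small working set re-used many times hits after warm-up.
working_set = list(range(0, 256, 16)) * 100
print(f"re-used working set: {hit_rate(DirectMappedCache(), working_set):.2%}")
# Prints roughly 93.75% and 99.00% with these parameters.
```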


The RISC machines are based on the exploitation of that principle as well. The first level in the memory hierarchy is the general-purpose register file (GPR), where we expect to find the operands most of the time; otherwise the register-to-register operation feature would not be very effective. However, if the operands are not found in the GPR, the time to fetch them should not be excessive. This requires the existence of a fast memory next to the CPU: the cache. The cache access must also be fast so that the time allocated for Memory Access in the pipeline is not exceeded. A one-cycle cache is a requirement for a RISC machine, and performance is seriously degraded if the cache access requires two or more CPU cycles. In order to maintain the required one-cycle cache bandwidth, data and instruction accesses should not collide. Hence the separation of instruction and data caches, the so-called Harvard Architecture, is a must for RISC.

2.6. Branch and Execute Instruction

Branch and Execute, or Delayed Branch, is a new feature of the instruction architecture that was introduced and fully exploited in RISC. When a Branch instruction is encountered in the pipeline, one cycle is inevitably lost. This is illustrated in Fig. 5.

Fig. 5. Pipeline Flow of the Branch Instruction

breq:     IF  ID  EX  MA  WB
inst+1:       IF
target:           IF  ID  EX  MA  WB
(the target's IF begins at the earliest available target instruction address; the fetch slot taken by inst+1 is lost)

RISC architecture solves the lost-cycle problem by introducing the Branch and Execute instruction [5,7] (also known as the Delayed Branch instruction), which consists of an instruction pair: the Branch and the Branch Subject instruction, which is always executed. It is the task of the compiler to find an instruction that can be placed in that otherwise wasted pipeline cycle.

The subject instruction can be found in the instruction stream preceding the branch instruction, in the target instruction stream, or in the fall-through instruction stream. It is the task of the compiler to find such an instruction and fill in this execution cycle [8].

Given that the frequency of branch instructions varies from one in five to one in fifteen instructions (depending on the nature of the code), the number of those otherwise lost cycles can be substantial. Fortunately, a good compiler can fill in about 70% of those cycles, which amounts to up to a 15% performance improvement [11]. This is the single most performance-contributing instruction in the RISC instruction architecture.
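These figures can be checked with a quick calculation, assuming that every unfilled branch delay slot costs exactly one pipeline cycle (an assumption made here for the estimate, not a statement from the text):

```python
# Quick check of the figures above, assuming every unfilled branch delay
# slot costs exactly one pipeline cycle (an assumption, not a statement
# from the text).

def cpi_with_branches(branch_freq, fill_rate, base_cpi=1.0):
    """Each branch adds one stall cycle unless the compiler fills its slot."""
    return base_cpi + branch_freq * (1.0 - fill_rate)

for freq in (1 / 5, 1 / 15):        # one branch out of 5 ... 15 instructions
    unfilled = cpi_with_branches(freq, fill_rate=0.0)
    filled = cpi_with_branches(freq, fill_rate=0.7)   # compiler fills ~70% of slots
    gain = unfilled / filled - 1.0
    print(f"branch freq 1/{round(1 / freq):>2}: CPI {unfilled:.2f} -> {filled:.2f}"
          f"  ({gain:.0%} faster)")
# branch freq 1/ 5: CPI 1.20 -> 1.06  (13% faster)
# branch freq 1/15: CPI 1.07 -> 1.02  (5% faster)
```

With a branch every five instructions, filling 70% of the slots lowers the CPI from 1.20 to 1.06, roughly the "up to 15%" improvement quoted above; with rarer branches the gain is correspondingly smaller.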

However, in later generations of super-scalar RISC machines (which execute more than one instruction per pipeline cycle), the Branch and Execute instruction has been abandoned in favor of Branch Prediction [13,21].

The Load instruction can also exhibit this lost pipeline cycle as shown in Fig.6.

Fig. 6. Lost cycle during the execution of the Load Instruction

The same principle of scheduling an independent instruction into the otherwise lost cycle that was applied to Branch and Execute can be applied to the Load instruction. This is also known as Delayed Load. An example of what the compiler can do to schedule instructions and utilize those otherwise lost cycles is shown in Fig. 7 [8,11].

[Fig. 6 detail: the load (ld r5, r3, d) proceeds through IF, decode, address calculation, cache access and register write, while the dependent add (add r7, r5, r) needs the loaded value one cycle before it is available from the cache; the data is available from the register file only after it has been written, so one pipeline cycle is lost to the dependency.]
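The kind of reordering the compiler performs can be sketched as follows. The instruction representation, the dependence sets, and the helper below are invented for illustration; a real instruction scheduler must also respect dependences against every instruction that the moved instruction hops over, which is omitted here for brevity:

```python
# Sketch of delay-slot filling for a delayed load.  The instruction
# representation and dependence sets are invented for illustration; this is
# not the algorithm used by any particular compiler.  An instruction that is
# independent of the load and of its first use is moved into the slot so the
# otherwise wasted cycle does useful work.  (A real scheduler must also check
# dependences against every instruction being hopped over; omitted here.)

from dataclasses import dataclass

@dataclass
class Instr:
    text: str
    writes: set
    reads: set

def fill_load_delay_slot(code):
    code = list(code)
    for i, ins in enumerate(code):
        if not ins.text.startswith("ld"):
            continue
        use = i + 1
        if use >= len(code) or ins.writes.isdisjoint(code[use].reads):
            continue                      # no immediate use -> no stall to fill
        for j in range(use + 1, len(code)):
            cand = code[j]
            independent = (cand.reads.isdisjoint(ins.writes | code[use].writes)
                           and cand.writes.isdisjoint(ins.reads | code[use].reads
                                                      | code[use].writes))
            if independent:
                code.insert(use, code.pop(j))   # schedule it into the delay slot
                break
    return code

prog = [
    Instr("ld  r5, 0(r3)",  {"r5"}, {"r3"}),
    Instr("add r7, r5, r6", {"r7"}, {"r5", "r6"}),   # uses r5 right away -> stall
    Instr("sub r8, r1, r2", {"r8"}, {"r1", "r2"}),   # independent -> fills the slot
]
for ins in fill_load_delay_slot(prog):
    print(ins.text)
# ld  r5, 0(r3)
# sub r8, r1, r2
# add r7, r5, r6
```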

varies, and a CPI of 1.3 is considered quite good, while a CPI between 1.4 and 1.5 is more common in single-instruction-issue implementations of the RISC architecture.

However, once the CPI was brought close to one, the next goal in implementing RISC machines was to bring the CPI below one in order for the architecture to deliver more performance. This goal requires an implementation that can execute more than one instruction per pipeline cycle, a so-called Super-Scalar implementation [13,16]. A substantial effort has been made by the leading RISC machine designers to build such machines. Machines that execute up to four instructions in one cycle are common today, and a machine that executes up to six instructions in one cycle was introduced last year.

2.9. Pipelining

Finally, the single most important feature of RISC is pipelining. The degree of parallelism in a RISC machine is determined by the depth of the pipeline. It could be stated that all the features of RISC listed in this article could easily be derived from the requirements of pipelining and of maintaining an efficient execution model; the sole purpose of many of those features is to support efficient execution of the RISC pipeline. It is clear that without pipelining the goal of CPI = 1 is not achievable. An example of instruction execution in the absence of pipelining is shown in Fig. 8.

Fig. 8. Instruction execution in the absence of pipelining

I1:  IF  ID  EX  MA  WB
I2:                      IF  ID  EX  MA  WB
(a total of 10 cycles for two instructions)

We may be led to think that by increasing the number of pipeline stages (the pipeline depth), and thus introducing more parallelism, we can increase the RISC machine's performance further. However, this idea does not lead to a simple and straightforward realization. Increasing the number of pipeline stages introduces overhead not only in hardware (the additional pipeline registers), but also in time, due to the delay of the latches used to implement the pipeline stages and the cycle time lost to clock skew and clock jitter. This can quickly bring us to the point of diminishing returns, where a further increase in pipeline depth results in less performance. An additional side effect of deeply pipelined systems is the hardware complexity necessary to resolve all the possible conflicts that can occur among the increased number of instructions residing in the pipeline at one time. The number of pipeline stages is determined mainly by the type of the instruction core (the most frequent instructions) and the operations required by those instructions. The pipeline depth also depends on the technology used: if the machine is implemented in a very high-speed technology characterized by a very small number of gate levels (such as GaAs or ECL) and very good control of clock skew, it makes sense to pipeline the machine more deeply. RISC machines that achieve performance through the use of many pipeline stages are known as super-pipelined machines.
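A simple throughput model (entirely hypothetical numbers, used only to illustrate the argument above) shows this effect: as the depth grows, the logic per stage shrinks, but the latch delay and clock skew/jitter do not, and stalls cost more cycles, so throughput flattens and eventually drops:

```python
# Illustrative model with hypothetical numbers (not from the source) of why
# deeper pipelining hits diminishing returns: logic per stage shrinks with
# depth, but latch delay and clock skew/jitter do not, and each stall costs
# more cycles in a deeper pipeline.

LOGIC_DELAY_NS = 10.0     # total combinational logic delay for one instruction
LATCH_SKEW_NS = 0.3       # latch delay + clock skew/jitter paid per stage
STALL_FREQ = 0.15         # fraction of instructions causing a pipeline stall

def million_instr_per_sec(depth):
    cycle_ns = LOGIC_DELAY_NS / depth + LATCH_SKEW_NS
    cpi = 1.0 + STALL_FREQ * (depth / 2)   # assume a stall costs ~half the depth
    return 1e3 / (cycle_ns * cpi)          # instructions per microsecond

for depth in (2, 5, 8, 12, 20, 30):
    print(f"depth {depth:>2}: {million_instr_per_sec(depth):6.0f} M instr/s")
# Throughput rises quickly at first (about 164 -> 316 -> 403 M instr/s for
# depths 2, 5 and 8), then flattens and eventually declines for very deep pipelines.
```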

Today the most common number of pipeline stages encountered is five (as in the examples given in this text). However, twelve or more pipeline stages are encountered in some machine implementations.

The features of RISC architecture that support pipelining are listed in Table 1.

Table 1. Features of RISC Architecture

Feature | Characteristic
Load/Store Architecture | All of the operations are register to register; in this way operation is decoupled from access to memory.
Carefully selected sub-set of instructions | Control is implemented in hardware; there is no microcoding in RISC. This set of instructions is not necessarily small.*
Simple Addressing Modes | Only the most frequently used addressing modes are included; it is also important that they fit into the existing pipeline.
Fixed size and fixed fields instructions | Necessary to decode the instruction and access the operands in one cycle (though there are architectures using two sizes for the instruction format, e.g. IBM PC-RT).
Delayed Branch Instruction (also known as Branch and Execute) | The most important performance improvement obtained through the instruction architecture (no longer true in new designs).
One Instruction Per Cycle execution rate, CPI = 1 | Possible only through the use of pipelining.
Optimizing Compiler | Close coupling between the architecture and the compiler; the compiler "knows" about the pipeline.
Harvard Architecture | Separation of instruction and data caches, resulting in increased memory bandwidth.

* The IBM PC-RT instruction architecture contains 118 instructions, while the IBM RS/6000 (PowerPC) contains 184. This should be contrasted with the IBM System/360, containing 143 instructions, and the IBM System/370, containing 208. The first two are representatives of RISC architecture while the latter two are not.

3.1. History of RISC

The RISC project started in 1975 at the IBM T. J. Watson Research Center under the name 801. The original intent of the 801 project was to develop an emulator for System/360 code [5]. The IBM 801 was built in ECL technology and was completed by the early 1980s [5-6]. The project was not known outside IBM until the early 1980s, and the results of that work are largely unpublished. The idea of a simpler computer, especially one that could be implemented on a single chip in a university environment, was appealing, and two other projects with similar objectives started in the early 1980s at the University of California, Berkeley and at Stanford University [9,10]. These two academic projects had much more influence on the industry than the IBM 801 project. Sun Microsystems developed its own architecture, currently known as SPARC, as a result of the University of California, Berkeley work. Similarly, the Stanford University work was directly transferred to MIPS [20]. The chronology of RISC development is illustrated in Fig. 10.

Fig. 10. History of RISC development: CDC 6600 (1963), Cyber, Cray-I (1976); IBM ASC (1970), IBM 801 (1975), IBM PC/RT (1986), IBM RS/6000 (1990), PowerPC (1993); RISC Berkeley (1981), SPARC v.8 (1987), SPARC v.9 (1994); MIPS Stanford (1982), MIPS-1 (1986), MIPS-2 (1989), MIPS-3 (1992), MIPS-4 (1994); HP-PA (1986); DEC Alpha (1992).

The features of some contemporary RISC processors are shown in Table 2.

Table 2. Contemporary RISC processors features

Feature | Digital 21164 | MIPS | PowerPC 620 | HP 8000 | Sun UltraSparc
Frequency | 500 MHz | 200 MHz | 200 MHz | 180 MHz | 250 MHz
Pipeline Stages | 7 | 5-7 | 5 | 7-9 | 6-
Issue Rate | 4 | 4 | 4 | 4 | 4
Out-of-Order Exec. | 6 loads | 32 | 16 | 56 | none
Register Renam. (int/FP) | none/8 | 32/32 | 8/8 | 56 | none
Transistors / Logic transistors | 9.3M / 1.8M | 5.9M / 2.3M | 6.9M / 2.2M | 3.9M* / 3.9M | 3.8M / 2.0M
SPEC (Intg/FlPt) | | | | |
Perform. / Log-trn (Intg/FP) | | | | |

* no cache