Application Specific Architectures, Lecture notes of Computer Architecture and Organization

These lecture notes introduce application-specific architectures and their importance in overcoming the diminishing returns of general-purpose architectures. They cover Moore's Law, dark silicon, Amdahl's Law, and the benefits of specialization, and include a case study of CryptoManiac, a highly specialized and efficient crypto-processor design. The notes are from the University of Michigan's EECS 573 course, Fall 2016.

Typology: Lecture notes

2015/2016

Uploaded on 05/11/2023

Application Specific Architectures

Introduction and Motivation

Todd Austin

EECS 573

Fall 2016 University of Michigan

Advanced Computer Architecture Laboratory University of Michigan Application Specific Architectures Todd Austin

Architecture's Diminishing Return

  • Staples of value we strive for…
    • High Speed
    • Low Power
    • Low Cost
  • Tricks of the trade
    • Faster clock rates, via pipelining
    • Higher instruction throughput, via ILP extraction
    • Homogeneous parallel systems
  • Strong evidence of diminishing return, PIII vs. P4
    • P4 delivers 22% less throughput per clock than PIII (0.35 vs. 0.45 SPECInt/MHz)
    • Parallel resources not fully harnessed by today's software
  • Less return ⇒ less value
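
The 22% figure follows from the two per-clock numbers quoted on the slide. A minimal check, using only those two values (the variable names are illustrative, not from the lecture):

```python
# Per-clock throughput figures quoted on the slide (SPECInt/MHz).
p3_specint_per_mhz = 0.45
p4_specint_per_mhz = 0.35

# Fractional loss in per-clock throughput going from PIII to P4.
loss = 1.0 - p4_specint_per_mhz / p3_specint_per_mhz
print(f"P4 per-clock throughput is {loss:.0%} lower than PIII")
```

Note that this compares work done per MHz only; it says nothing about absolute performance at each part's shipping clock rate.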

Moore's Law Performance Gap

[Figure: Moore's Law performance gap. Annotations: diminished ILP; dark silicon; lack of perceived value; today, gap is cresting 10x.]

Is Density Still Scaling?

[Figure: Street dates for Intel's lead-generation products by technology node (nm): 180, 90, 45, 32, 22, 14, 10, 7; vertical axis ticks 1, 10, 100, 1000. Compiled with David Brooks @ Harvard.]

  • 14nm slips by 2 quarters
  • 10nm slips by 5-6 quarters
  • 7nm by end 2020?

The Dark Silicon Dilemma

[Figures: three slides, not preserved in this text. Courtesy Michael Taylor @ UCSD.]

The Tyranny of Amdahl's Law

[Figure: Amdahl's Law plot with labels (P), (N), and (S). Annotation: "Where we need to be today! (10x)"]
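
The slide's figure did not survive, so the following is only a sketch under an assumption: that (P), (N), and (S) carry their usual Amdahl's Law meanings (parallel fraction, number of processors, speedup). Under that reading, it shows how much of a program must be parallel to reach the 10x target; the function names are illustrative.

```python
def amdahl_speedup(p, n):
    """Speedup S for parallel fraction p on n processors (standard Amdahl's Law)."""
    return 1.0 / ((1.0 - p) + p / n)


def required_parallel_fraction(target_s, n):
    """Parallel fraction p needed to reach speedup target_s on n processors.

    Obtained by solving target_s = 1 / ((1 - p) + p / n) for p.
    A result above 1.0 means the target is unreachable with n processors.
    """
    return (1.0 - 1.0 / target_s) / (1.0 - 1.0 / n)


for n in (16, 64, 1024):
    p = required_parallel_fraction(10.0, n)
    print(f"N={n:5d}: need P >= {p:.3f} for S = 10x")
```

Even with an unbounded number of processors the serial fraction caps the speedup at 1 / (1 - P), so a 10x gain requires at least 90% of the work to be parallel.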


Crypto-Specific Instructions

  • Frequent SBOX substitutions
    • X = sbox[(y >> c) & 0xff]
  • SBOX instruction
    • Incorporates byte extract
    • Speeds address generation through alignment restrictions
    • 4-cycle Alpha code sequence becomes a single CryptoManiac instruction
  • SBOX caches provide a high-bandwidth substitution capability (4 SBOXes/cycle)

[Figure: SBOX instruction operation. Labels: opcode; SBOX Table; Table Index; 00. Bit positions shown: 31, 10, 0 and 24, 16, 8, 0.]
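
The substitution on this slide can be sketched in software as follows. The first function is the slide's own expression. The second is an assumption about what "alignment restrictions" buys: if a table of 256 four-byte entries is aligned on a 1 KB boundary, the entry address can be formed by OR-ing the base with the shifted index, with no adder. The function names, the `byte_sel` parameter, and the 1 KB alignment are illustrative, not taken from the lecture.

```python
def sbox_lookup(sbox, y, c):
    """The slide's substitution: X = sbox[(y >> c) & 0xff]."""
    return sbox[(y >> c) & 0xFF]


def sbox_address(table_base, y, byte_sel):
    """Hypothetical address generation for an aligned SBOX table.

    Assumes 256 entries of 4 bytes each (1 KB) with table_base aligned
    on a 1 KB boundary, so the low 10 bits of the base are zero.
    """
    assert table_base % 1024 == 0, "table must be 1 KB aligned"
    index = (y >> (8 * byte_sel)) & 0xFF   # byte extract
    return table_base | (index << 2)       # concatenate: base, index, 00


# Illustrative table contents; a real cipher supplies its own SBOX.
sbox = [(i * 7 + 3) & 0xFFFFFFFF for i in range(256)]
print(hex(sbox_lookup(sbox, 0x12345678, 8)))
print(hex(sbox_address(0x4000, 0x12345678, 1)))
```

Because the low bits of an aligned base are zero, the OR produces the same address as an add would, which is the kind of shortcut a dedicated instruction can exploit.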


Crypto-Specific Functional Unit

[Figure: functional-unit block diagram. Blocks: pipelined 32-bit MUL; 1K-byte SBOX cache; 32-bit adder; 32-bit rotator; two logical units (XOR, AND). Labels: {tiny}, {short}, {tiny}, {long}.]

Crypto-Specific Circuits

  • Overclock design until decryption check fails
    • Demonstrated approach with dual SA-1110 iPAQs
  • 26% performance increase at room temperature
    • Chill for more improvements, ~10% per 30 °C
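
The "overclock until the decryption check fails" idea can be sketched as a simple search loop. This is not the lecture's implementation: `roundtrip_ok` is a hypothetical hook standing in for running encrypt-then-decrypt on real hardware at a given clock, and all frequencies below are made-up illustration values.

```python
def find_max_clock(roundtrip_ok, start_mhz, step_mhz, limit_mhz):
    """Raise the clock in steps until the decryption check fails.

    roundtrip_ok(freq) is a hypothetical hook: it should run
    encrypt-then-decrypt at freq MHz and return True when the
    plaintext is recovered intact.
    Returns the highest passing frequency, or None if start_mhz fails.
    """
    best = None
    freq = start_mhz
    while freq <= limit_mhz and roundtrip_ok(freq):
        best = freq
        freq += step_mhz
    return best


# Stand-in for real hardware: pretend the part fails above 250 MHz.
def fake_check(freq_mhz):
    return freq_mhz <= 250


print(find_max_clock(fake_check, 200, 10, 400))
```

The check works for ciphers because a timing failure almost certainly corrupts the recovered plaintext, so correctness of the round trip doubles as a timing test.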


CryptoManiac Results

  • Design implemented in 0.25um physical design flow
    • All components synthesized with Synopsys tools
    • Evaluated with timing analysis and high-level simulation
  • Encryption Speed
    • Nearly 1.5x faster than a 600 MHz Alpha 21264 (both 0.25um)
    • 2.25x faster for the AES encryption standard
  • Design Cost
    • 2 mm² total area for a single CryptoManiac processor
    • Less than 1/100th the size of an Alpha 21264 (205 mm²)
  • Power Characteristics
    • Less than 750 mW total power dissipation
    • Nearly 1/100th the power dissipation of an Alpha 21264 (72 W)
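
The two "1/100th" claims can be checked directly from the figures on the slide. A short sketch, with the 750 mW figure treated as an upper bound since the slide says "less than" (variable names are illustrative):

```python
# Figures quoted on the results slide.
cryptomaniac_area_mm2 = 2.0
alpha_area_mm2 = 205.0
cryptomaniac_power_w = 0.750   # upper bound: "less than 750 mW"
alpha_power_w = 72.0

area_ratio = alpha_area_mm2 / cryptomaniac_area_mm2
power_ratio = alpha_power_w / cryptomaniac_power_w

print(f"Area:  Alpha 21264 is {area_ratio:.1f}x larger")
print(f"Power: Alpha 21264 draws at least {power_ratio:.0f}x more")
```

The area ratio is slightly above 100 and the power ratio slightly below it, which matches the slide's wording of "less than 1/100th" for size and "nearly 1/100th" for power.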

Performance of Various Platforms

[Chart: surviving data labels (truncated in the source): 2965.01, 3943., 8036.77, 8296.]

Platform      ARM 720T  ARM 7TDMI  ARM 920T  ARM 1020T  1st-gen  1st-gen  1st-gen
Voltage (V)   1.2       1.2        1.2       1.2        1.2      0.5      0.
Speed (Hz)    100M      133M       250M      325M       114M     9M       168k

xRT rating: how many times faster than real time the processor can handle the worst-case data stream rate on the most computationally intensive sensor benchmark
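
Read literally, the xRT definition is a ratio of the rate a processor can sustain to the rate the sensor stream demands. A minimal sketch under that reading; the function name, units, and sample numbers are hypothetical and are not measurements from the lecture:

```python
def xrt_rating(sustained_rate, worst_case_stream_rate):
    """How many times faster than real time the processor runs.

    Both rates must use the same unit (for example samples per second)
    and refer to the most computationally intensive benchmark.
    """
    return sustained_rate / worst_case_stream_rate


# Hypothetical example: a core that can process 5000 samples/s
# against a sensor that produces at most 50 samples/s.
print(xrt_rating(5000.0, 50.0))
```

A rating above 1 means the processor keeps up with the sensor; the excess is headroom that could be traded for a lower voltage and clock.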

The Basics of Subthreshold Circuit Operation

A Short Animation by Leyla Nazhandali

Episode 1: Inverter operation in superthreshold domain

[Animation frames, November 2, 2016: inverter schematic with P and N transistors and IN/OUT nodes. Frames labeled "Superthreshold" switch between 0V and 1.2V; frames labeled "Subthreshold" switch between 0V and 0.2V.]

Summary from Architecture Study

  • We studied 21 different processors, experimenting with the following options:
    • Number of stages
    • w/ vs. w/o instruction prefetch buffer
    • w/ vs. w/o explicit register file
    • Harvard vs. Von Neumann architecture
  • To minimize energy at subthreshold voltages, architects must:
    • Minimize area ⇒ to reduce leakage energy per cycle
    • Maximize transistor utility ⇒ to reduce Vmin and energy per cycle
    • Minimize CPI ⇒ to reduce energy per instruction
  • The memory comprises the single largest factor of leakage energy; as such, efficient designs must reduce memory storage requirements.
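
One way to see why area, CPI, and memory size all matter is a first-order energy model. This model is an assumption added for illustration, not a formula from the lecture: energy per instruction is CPI times the energy spent per cycle, and the per-cycle energy is switching energy plus leakage power integrated over one cycle. All numbers are hypothetical.

```python
def energy_per_instruction(cpi, dynamic_energy_per_cycle, leakage_power, cycle_time):
    """First-order energy per instruction, in joules.

    cpi                      : average cycles per instruction
    dynamic_energy_per_cycle : switching energy per cycle (J)
    leakage_power            : static power (W); grows with transistor
                               count, so with area and memory size
    cycle_time               : clock period (s); long at subthreshold voltages
    """
    leakage_energy_per_cycle = leakage_power * cycle_time
    return cpi * (dynamic_energy_per_cycle + leakage_energy_per_cycle)


# Hypothetical baseline: 1 MHz clock, 2 pJ switching, 1 uW leakage.
base = energy_per_instruction(1.5, 2e-12, 1e-6, 1e-6)
# Same core with half the leaking area (for example, less memory).
small = energy_per_instruction(1.5, 2e-12, 0.5e-6, 1e-6)
# Same core with a lower CPI.
fast = energy_per_instruction(1.0, 2e-12, 1e-6, 1e-6)

print(f"baseline {base:.2e} J, half leakage {small:.2e} J, CPI 1.0 {fast:.2e} J")
```

Because the clock period is long at subthreshold voltages, the leakage term is a large share of each cycle's energy, which is consistent with the slide's emphasis on shrinking area and memory.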