Computer Structure: Pipeline (lecture slides)

Contents

Slide 2

A Basic Processor

[Datapath diagram: the PC supplies the address to the Inst. Cache; decode extracts the opcode, src1/src2/dst register ids, and imm (sign-extended by Sign Ext.); the Register File provides src1 data and src2 data; the ALU (driven by the ALU source and ALU control signals) executes; the Data Cache handles Mem Rd/Wr; the write-back value is written to the dst reg. Stages: Fetch, Decode, Execute, Memory, Write back.]
Slide 3

Pipelined Car Assembly

[Diagram: three assembly stations, chassis (1 hour), engine (2 hours), finish (1 hour), with Car 1, Car 2, and Car 3 overlapping in the pipeline.]

Slide 4

Pipelining Instructions

Ideal speedup is the number of stages in the pipeline. Do we achieve this?

[Timing diagram: non-pipelined, each instruction (Inst Fetch, Reg, ALU, Data Access, Reg) completes in 8 ns before the next one starts; pipelined, every stage takes 2 ns and a new instruction starts every 2 ns.]

Slide 5

Pipelining

Pipelining does not reduce the latency of a single task; it increases the throughput of the entire workload
Potential speedup = number of pipe stages
Pipeline rate is limited by the slowest pipeline stage
  Partition the pipe into many pipe stages
  Make the longest pipe stage as short as possible
  Balance the work across the pipe stages
Pipeline adds overhead (e.g., latches)
Time to "fill" the pipeline and time to "drain" it reduces speedup
Stall for dependencies
Too many pipe stages starts to lose performance
IPC of an ideal pipelined machine is 1
  Every clock cycle one instruction finishes
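The fill/drain effect above is easy to quantify. A minimal sketch, using the slide's numbers (8 ns per unpipelined instruction, 5 stages of 2 ns each); the function names are illustrative:

```python
def unpipelined_time(n_instr, instr_latency_ns):
    # Each instruction must finish before the next one starts.
    return n_instr * instr_latency_ns

def pipelined_time(n_instr, n_stages, stage_ns):
    # Fill the pipe (n_stages cycles for the first instruction),
    # then one instruction completes per cycle.
    return (n_stages + n_instr - 1) * stage_ns

n = 1000
t_seq = unpipelined_time(n, 8)     # 8000 ns
t_pipe = pipelined_time(n, 5, 2)   # (5 + 999) * 2 = 2008 ns
print(t_seq, t_pipe, round(t_seq / t_pipe, 2))  # 8000 2008 3.98
```

The speedup approaches 8/2 = 4 only as the instruction count grows; the pipeline fill time keeps it below the ideal.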
Slide 6

Pipelined CPU

[Datapath diagram of the pipelined CPU with inter-stage latches; the dst register id is carried down the pipe with the instruction.]

Slide 7

Structural Hazard

Different instructions using the same resource at the same time
Register File:
  Accessed in 2 stages:
    Read during stage 2 (ID)
    Write during stage 5 (WB)
  Solution: 2 read ports, 1 write port
Memory:
  Accessed in 2 stages:
    Instruction fetch during stage 1 (IF)
    Data read/write during stage 4 (MEM)
  Solution: separate instruction cache and data cache
Each functional unit can only be used once per instruction
Each functional unit must be used at the same stage for all instructions
Slide 8

Pipeline Example: cycle 1

0  lw  R10,9(R1)
4  sub R11,R2,R3
8  and R12,R4,R5
12 or  R13,R6,R7

[Datapath diagram: pipeline state at cycle 1.]

Slide 9

Pipeline Example: cycle 2

0  lw  R10,9(R1)
4  sub R11,R2,R3
8  and R12,R4,R5
12 or  R13,R6,R7

[Datapath diagram: pipeline state at cycle 2.]

Slide 10

Pipeline Example: cycle 3

0  lw  R10,9(R1)
4  sub R11,R2,R3
8  and R12,R4,R5
12 or  R13,R6,R7

[Datapath diagram: pipeline state at cycle 3.]

Slide 11

Pipeline Example: cycle 4

0  lw  R10,9(R1)
4  sub R11,R2,R3
8  and R12,R4,R5
12 or  R13,R6,R7

[Datapath diagram: pipeline state at cycle 4.]

Slide 12

Pipeline Example: cycle 5

0  lw  R10,9(R1)
4  sub R11,R2,R3
8  and R12,R4,R5
12 or  R13,R6,R7

[Datapath diagram: pipeline state at cycle 5.]

Slide 13

RAW Dependency

sub R2, R1, R3
and R12,R2, R5
or  R13,R6, R2
add R14,R2, R2
sw  R15,100(R2)

[Diagram: program execution order top to bottom; every later instruction reads R2, which sub writes.]

Slide 14

Using Bypass to Solve RAW Dependency

sub R2, R1, R3
and R12,R2, R5
or  R13,R6, R2
add R14,R2, R2
sw  R15,100(R2)

[Diagram: program execution order top to bottom.]

Bypass the result directly from the EXE output to the EXE input

Slide 15

RAW Dependency

0  lw  R4,9(R1)
4  sub R5,R2,R3
8  and R12,R4,R5
12 or  R13,R6,R7

[Datapath diagram: the dst register id is carried through the pipe latches.]

Slide 16

Forwarding Hardware

0  lw  R4,9(R1)
4  sub R5,R2,R3
8  and R12,R4,R5
12 or  R13,R6,R7

[Datapath diagram: bypass paths from the EXE and MEM latches back to the ALU inputs.]

Slide 17

Forwarding Control

Forwarding from EXE (L3):
if (L3.RegWrite and (L3.dst == L2.src1)) ALUSelA = 1
if (L3.RegWrite and (L3.dst == L2.src2)) ALUSelB = 1

Forwarding from MEM (L4):
if (L4.RegWrite and
    ((not L3.RegWrite) or (L3.dst ≠ L2.src1)) and
    (L4.dst == L2.src1)) ALUSelA = 2
if (L4.RegWrite and
    ((not L3.RegWrite) or (L3.dst ≠ L2.src2)) and
    (L4.dst == L2.src2)) ALUSelB = 2
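The control equations above can be sketched in Python; latches are modeled as dicts with the slide's field names, and the EXE-over-MEM priority is expressed with elif (equivalent to the "(not L3.RegWrite) or (L3.dst ≠ src)" guard):

```python
def forward_selects(L2, L3, L4):
    """Return (ALUSelA, ALUSelB): 0 = register file, 1 = EXE bypass, 2 = MEM bypass."""
    sel_a, sel_b = 0, 0
    # EXE (L3) has priority: it holds the youngest result for that register.
    if L3["RegWrite"] and L3["dst"] == L2["src1"]:
        sel_a = 1
    elif L4["RegWrite"] and L4["dst"] == L2["src1"]:
        sel_a = 2
    if L3["RegWrite"] and L3["dst"] == L2["src2"]:
        sel_b = 1
    elif L4["RegWrite"] and L4["dst"] == L2["src2"]:
        sel_b = 2
    return sel_a, sel_b

# sub R2,R1,R3 is now in MEM; and R12,R2,R5 is at the EXE input (latch L2).
L2 = {"src1": 2, "src2": 5}
L3 = {"RegWrite": False, "dst": 0}
L4 = {"RegWrite": True, "dst": 2}
print(forward_selects(L2, L3, L4))  # (2, 0): src1 comes from the MEM bypass
```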
Slide 18

Register File Split

The register file is written during the first half of the cycle
The register file is read during the second half of the cycle
The register file is written before it is read ⇒ returns the correct data
Slide 19

Can't Always Forward

A load word can still cause a hazard: an instruction tries to read a register following a load instruction that writes to the same register
A hazard detection unit is needed to "stall" the dependent instruction

Slide 20

Stall If Cannot Forward

De-assert the enable of the L1 latch and of the IP
The dependent instruction (and) stays another cycle in L1
Issue a NOP into the L2 latch (instead of the stalled inst.)
Allow the stalling instruction (lw) to move on

if (L2.RegWrite and (L2.opcode == lw) and
    ((L2.dst == L1.src1) or (L2.dst == L1.src2))) then stall
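The stall condition above as a small Python sketch; L1 holds the instruction in Decode and L2 the one in Execute, with field names taken from the slide:

```python
def must_stall(L1, L2):
    # Stall when the instruction in EXE is a load whose destination
    # register is read by the instruction currently in Decode.
    return (L2["RegWrite"] and L2["opcode"] == "lw"
            and L2["dst"] in (L1["src1"], L1["src2"]))

# lw R4,9(R1) in EXE; and R12,R4,R5 right behind it in Decode.
L2 = {"RegWrite": True, "opcode": "lw", "dst": 4}
L1 = {"src1": 4, "src2": 5}
print(must_stall(L1, L2))  # True: insert one bubble, then forwarding can supply R4
```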

Slide 21

Software Scheduling to Avoid Load Hazards

Example: code for (assume all variables are in memory):
a = b + c;
d = e - f;

Slow code
LW  Rb,b
LW  Rc,c
(stall)
ADD Ra,Rb,Rc
SW  a,Ra
LW  Re,e
LW  Rf,f
(stall)
SUB Rd,Re,Rf
SW  d,Rd

Fast code
LW  Rb,b
LW  Rc,c
LW  Re,e
ADD Ra,Rb,Rc
LW  Rf,f
SW  a,Ra
SUB Rd,Re,Rf
SW  d,Rd

Instruction order can be changed as long as correctness is kept

Slide 22

Control Hazards

Slide 23

Control Hazard on Branches

Slide 24

Control Hazard on Branches

Slide 25

Control Hazard on Branches

Slide 26

Control Hazard on Branches

Slide 27

Control Hazard on Branches

Slide 28

Control Hazard on Branches

[Pipeline diagram: and, beq, sub, mul flow through the pipe, followed by the inst from the target.]

The 3 instructions following the branch get into the pipe even if the branch is taken

Slide 29

Control Hazard: Stall

Stall the pipe when a branch is encountered until it is resolved
Stall impact, under these assumptions:
  CPI = 1
  20% of instructions are branches
  Stall 3 cycles on every branch
⇒ CPI_new = 1 + 0.2 × 3 = 1.6
(CPI_new = CPI_ideal + avg. stall cycles / instr.)
We lose 60% of the performance
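The CPI model used on this and the following slides is a one-liner; a sketch with illustrative names:

```python
def cpi_with_branch_stalls(cpi_ideal, branch_frac, mispredict_frac, penalty):
    # CPI_new = CPI_ideal + avg. stall cycles per instruction.
    # mispredict_frac: fraction of branches that actually pay the penalty
    # (1.0 here, since every branch stalls until resolved).
    return cpi_ideal + branch_frac * mispredict_frac * penalty

print(cpi_with_branch_stalls(1.0, 0.20, 1.0, 3))  # 1.6
```

The same function reproduces the later slides' numbers: with mispredict_frac = 0.5 (predict not-taken) it gives 1.3, and with mispredict_frac = 1 − P and P = 0.7 it gives 1.18.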
Slide 30

Control Hazard: Predict Not Taken

Execute instructions from the fall-through (not-taken) path, as if there were no branch
If the branch is not taken (~50%), no penalty is paid
If the branch is actually taken:
  Flush the fall-through-path instructions before they change the machine state (memory / registers)
  Fetch the instructions from the correct (taken) path
Assuming ~50% of branches are not taken on average:
CPI_new = 1 + (0.2 × 0.5) × 3 = 1.3
Slide 31

Dynamic Branch Prediction

Add a Branch Target Buffer (BTB) that predicts (at fetch):
  That the instruction is a branch
  Branch taken / not-taken
  The taken-branch target
The BTB is allocated at execute, after all branch info is known
The BTB is looked up at instruction fetch
Slide 32

BTB

Allocation
  Allocate instructions identified as branches (after decode)
  Both conditional and unconditional branches are allocated
  Not-taken branches need not be allocated
    A BTB miss implicitly predicts not-taken
Prediction
  The BTB lookup is done in parallel with the IC lookup
  The BTB provides:
    An indication that the instruction is a branch (BTB hit)
    The predicted branch target
    The predicted branch direction
    The predicted branch type (e.g., conditional, unconditional)
Update (when the branch outcome is known)
  Branch target
  Branch history (taken / not-taken)
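The allocate/lookup/update behavior above can be sketched as a toy Python class; this is an illustration of the policy, not a real hardware structure (real BTBs are set-associative tables indexed by PC bits):

```python
class BTB:
    def __init__(self):
        self.entries = {}  # branch PC -> (predicted_taken, predicted_target)

    def lookup(self, pc):
        # At fetch: a miss implicitly predicts "not a taken branch",
        # so the front end falls through to pc + 4.
        return self.entries.get(pc, (False, pc + 4))

    def update(self, pc, taken, target):
        # At execute, once the outcome is known: allocate taken branches;
        # never-taken branches need not be allocated.
        if taken or pc in self.entries:
            self.entries[pc] = (taken, target)

btb = BTB()
print(btb.lookup(4))      # (False, 8): miss, predict fall-through
btb.update(4, True, 50)   # the branch at 4 was taken to 50
print(btb.lookup(4))      # (True, 50): now predicted taken to 50
```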
Slide 33

BTB (cont.)

Wrong prediction:
  Predict not-taken, actually taken
  Predict taken, actually not-taken, or actually taken but to the wrong target
In case of a wrong prediction, flush the pipeline:
  Reset the latches (same as turning all in-flight instructions into NOPs)
  Select the PC source from the correct path
    Need to carry the fall-through address along with the branch
  Start fetching instructions from the correct path
Assuming a correct-prediction rate of P:
CPI_new = 1 + (0.2 × (1−P)) × 3
For example, if P = 0.7:
CPI_new = 1 + (0.2 × 0.3) × 3 = 1.18
Slide 34

Adding a BTB to the Pipeline

0  or
4  jcc 50
8  and ...
50 sub
54 mul
58 add

[Diagram: the current IP (4) is looked up in the IC and in the BTB in parallel; the IC provides the instruction bytes (jcc), the BTB provides the predicted direction (taken) and predicted target (50).]

Slide 35

Adding a BTB to the Pipeline

[Diagram, same code as slide 34: the jcc moves to Decode carrying its prediction (taken, target 50) and fall-through address (8); sub is fetched from the predicted target 50, and 54 is the next fetch IP.]

Slide 36

Adding a BTB to the Pipeline

[Diagram, same code as slide 34: or and jcc advance down the pipe; sub (from 50) and mul (from 54) follow on the predicted path, with 58 next. When the branch executes:]

Verify the direction
Verify the target (if taken)
Issue a flush in case of mismatch
  Along with the repair IP

Slide 37

Using The BTB

[Fetch-stage flowchart: the Inst Mem gets the PC and fetches the new inst while, in parallel, the BTB gets the PC and looks it up. If the BTB hits and predicts taken, PC ← pred addr and the IF/ID latch is loaded with the predicted inst; otherwise PC ← PC + 4 and the IF/ID latch is loaded with the sequential inst. The PC then moves to the next instruction, and the prediction is verified at EXE.]

Slide 38

Using The BTB (cont.)

[Back-end flowchart (ID / EXE / MEM / WB): at EXE, calculate the branch condition & target; if they mismatch the prediction, flush the pipe & update the PC so the IF/ID latch is loaded with the correct inst, otherwise continue; finally, update the BTB with the branch outcome.]

Slide 39

Backup

Slide 40

MIPS Instruction Formats

R-type (register insts)
I-type (Load, Store, Branch, insts w/ imm data)
J-type (Jump)

op: operation of the instruction
rs, rt, rd: the source and destination register specifiers
shamt: shift amount
funct: selects the variant of the operation in the "op" field
address / immediate: address offset or immediate value
target address: target address of the jump instruction

Slide 41

The Memory Space

Each memory location:
  is 8 bits = 1 byte wide
  has an address
We assume a 32-bit address
  An address space of 2^32 bytes
Memory stores both instructions and data
Each instruction is 32 bits wide ⇒ stored in 4 consecutive bytes in memory
Various data types have different widths

Slide 42

Register File

The Register File holds 32 registers
Each register is 32 bits wide
The RF supports reading any two registers and writing any one register in parallel
Inputs:
  Read reg 1/2: number of the register whose value will be output on Read data 1/2
  RegWrite: write enable
  Write reg (relevant when RegWrite=1): number of the register to which the value in Write data is written
  Write data (relevant when RegWrite=1): data written to Write reg
Outputs:
  Read data 1/2: data read from Read reg 1/2

Slide 43

Memory Components

Inputs:
  Address: address of the memory location we wish to access
  Read: read data from the location
  Write: write data into the location
  Write data (relevant when Write=1): data to be written into the specified location
Outputs:
  Read data (relevant when Read=1): data read from the specified location

Cache
Memory components are slow relative to the CPU
A cache is a fast memory which contains only a small part of the memory
The instruction cache stores the parts of the memory space that hold code
The data cache stores the parts of the memory space that hold data

Slide 44

The Program Counter (PC)

Holds the address (in memory) of the next instruction to be executed
After each instruction, it is advanced to point to the next instruction
If the current instruction is not a taken branch, the next instruction resides right after the current instruction:
  PC ← PC + 4
If the current instruction is a taken branch, the next instruction resides at the branch target:
  PC ← target (absolute jump)
  PC ← PC + 4 + offset×4 (relative jump)
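The next-PC rules above, transcribed directly into Python (word offsets, 4-byte instructions); the "kind" labels are illustrative:

```python
def next_pc(pc, kind, taken=False, target=None, offset=None):
    if kind == "jump":                  # absolute jump: PC <- target
        return target
    if kind == "branch" and taken:      # relative jump: PC <- PC + 4 + offset*4
        return pc + 4 + offset * 4
    return pc + 4                       # sequential: PC <- PC + 4

print(next_pc(100, "alu"))                            # 104
print(next_pc(100, "branch", taken=True, offset=27))  # 104 + 27*4 = 212
print(next_pc(100, "jump", target=400))               # 400
```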
Slide 45

Instruction Execution Stages

Fetch
  Fetch the instruction pointed to by the PC from the I-Cache
Decode
  Decode the instruction (generate control signals)
  Fetch operands from the register file
Execute
  For a memory access: calculate the effective address
  For an ALU operation: execute the operation in the ALU
  For a branch: calculate the condition and target
Memory Access
  For a load: read data from memory
  For a store: write data into memory
Write Back
  Write the result back to the register file
  Update the program counter

[Stage diagram: Instruction Fetch → Instruction Decode → Execute → Memory → Result Store]

Slide 46

The MIPS CPU

[Datapath diagram with five stages: Instruction fetch; Instruction decode / register fetch; Execute / address calculation; Memory access; Write back.]

Slide 47

Executing an Add Instruction

Add R2, R3, R5 ; R2 ← R3+R5

[Datapath diagram: register ids 3 and 5 read R3 and R5 from the register file; the ALU (ADD) computes R3 + R5; the result is written into register 2; the PC becomes [PC]+4.]

Slide 48

Executing a Load Instruction

LW R1, 30(R2) ; R1 ← Mem[R2+30]

Slide 49

Executing a Store Instruction

SW R1, 30(R2) ; Mem[R2+30] ← R1

Slide 50

Executing a BEQ Instruction

BEQ R4, R5, 27 ; if (R4-R5 == 0) then PC ← PC+4+SignExt(27)×4
               ; else PC ← PC+4

Encoding (I-type): op (31:26) = BEQ, rs (25:21) = 4, rt (20:16) = 5, immediate (15:0) = 27

Slide 51

Control Signals

Slide 52

Pipelined CPU: Load (cycle 1 – Fetch)

LW R1, 30(R2) ; R1 ← Mem[R2+30]

Encoding (I-type): op (31:26) = LW, rs (25:21) = 2, rt (20:16) = 1, immediate (15:0) = 30

[Datapath diagram: the instruction is fetched from the I-Cache and PC+4 is computed.]

Slide 53

Pipelined CPU: Load (cycle 2 – Dec)

LW R1, 30(R2) ; R1 ← Mem[R2+30]

[Datapath diagram: the register file is read (R2) and the immediate 30 is sign-extended; PC+4 is carried along.]

Slide 54

Pipelined CPU: Load (cycle 3 – Exe)

LW R1, 30(R2) ; R1 ← Mem[R2+30]

[Datapath diagram: the ALU computes the effective address R2+30.]

Slide 55

Pipelined CPU: Load (cycle 4 – Mem)

LW R1, 30(R2) ; R1 ← Mem[R2+30]

[Datapath diagram: the data cache is read at address R2+30, returning data D.]

Slide 56

Pipelined CPU: Load (cycle 5 – WB)

LW R1, 30(R2) ; R1 ← Mem[R2+30]

[Datapath diagram: the loaded data D is written back into R1.]

Slide 57

Datapath with Control

Slide 58

Multi-Cycle Control

Pass control signals along just like the data

Slide 59

Five Execution Steps

Instruction Fetch
  Use the PC to get the instruction and put it in the Instruction Register
  Increment the PC by 4 and put the result back in the PC
  IR = Memory[PC]; PC = PC + 4;
Instruction Decode and Register Fetch
  Read registers rs and rt
  Compute the branch address
  A = Reg[IR[25-21]]; B = Reg[IR[20-16]]; ALUOut = PC + (sign-extend(IR[15-0]) << 2);
  We aren't setting any control lines based on the instruction type (we are busy "decoding" it in our control logic)
Slide 60

Five Execution Steps (cont.)

Execution
  The ALU performs one of three functions, based on instruction type:
  Memory reference: effective address calculation: ALUOut = A + sign-extend(IR[15-0]);
  R-type: ALUOut = A op B;
  Branch: if (A==B) PC = ALUOut;
Memory Access or R-type instruction completion
Write-back step
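The Execution step above can be sketched as plain Python; sign extension and IR field slicing are simplified to integers, and the "kind" labels are illustrative:

```python
def execute(kind, A, B, imm, pc, alu_out, op=None):
    """Return (new_ALUOut, new_PC) after the Execution step."""
    if kind == "mem":       # effective address: ALUOut = A + sign-extend(imm)
        return A + imm, pc
    if kind == "rtype":     # ALUOut = A op B
        return op(A, B), pc
    if kind == "branch":    # if (A == B) PC = ALUOut (computed during Decode)
        return alu_out, (alu_out if A == B else pc)
    raise ValueError(kind)

print(execute("mem", A=1000, B=0, imm=30, pc=104, alu_out=0))    # (1030, 104)
print(execute("rtype", 7, 5, 0, 104, 0, op=lambda a, b: a - b))  # (2, 104)
print(execute("branch", 5, 5, 0, 104, alu_out=212))              # (212, 212)
```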
Slide 61

The Store Instruction

Example: sw rt, rs, imm16 ; Mem[R[rs] + SignExt(imm16)] ← R[rt]

mem[PC]                          Fetch the instruction from memory
Addr ← R[rs] + SignExt(imm16)    Calculate the memory address
Mem[Addr] ← R[rt]                Store the register into memory
PC ← PC + 4                      Calculate the next instruction's address

Slide 62

RAW Hazard: SW Solution

Have the compiler avoid hazards by adding NOP instructions:

sub R2, R1, R3
NOP
NOP
NOP
and R12,R2, R5
or  R13,R6, R2
add R14,R2, R2
sw  R15,100(R2)

[Pipeline timing diagram over clock cycles CC1–CC9: R2 reads 10 until sub writes back its result (−20); the three NOPs delay the and until the new value of R2 is available.]

Problem: this really slows us down!

Slide 63

Delayed Branch

Define the branch to take place AFTER the n following instructions
HW executes the n instructions following the branch regardless of whether the branch is taken or not
SW puts in the n slots following the branch instructions that need to be executed regardless of the branch resolution:
  Instructions from before the branch instruction, or
  Instructions from the converged path after the branch
If it cannot find independent instructions, it puts NOPs

Original Code
r3 = 23
R4 = R3 + R5
If (r1==r2) goto x
R1 = R4 + R5
X: R7 = R1

New Code
If (r1==r2) goto x
r3 = 23
R4 = R3 + R5
NOP
R1 = R4 + R5
X: R7 = R1

