### PIPELINING AND VECTOR PROCESSING

- Parallel Processing
- Pipelining
- Arithmetic Pipeline
- Instruction Pipeline
- RISC Pipeline
- Vector Processing
- Array Processors

## PARALLEL PROCESSING

Execution of *Concurrent Events* in the computing process to achieve faster *Computational Speed* 

### **Levels of Parallel Processing**

- Job or Program level
- Task or Procedure level
- Inter-Instruction level
- Intra-Instruction level

### PARALLEL COMPUTERS

### **Architectural Classification**

- Flynn's classification
  - » Based on the multiplicity of *Instruction Streams* and Data Streams
  - » Instruction Stream
    - Sequence of Instructions read from memory
  - » Data Stream
    - Operations performed on the data in the processor

|                                     |          | Number of Data Streams |          |  |  |
|-------------------------------------|----------|------------------------|----------|--|--|
|                                     |          | Single                 | Multiple |  |  |
| Number of<br>Instruction<br>Streams | Single   | SISD                   | SIMD     |  |  |
|                                     | Multiple | MISD                   | MIMD     |  |  |



## SISD COMPUTER SYSTEMS



#### **Characteristics**

- Standard von Neumann machine
- Instructions and data are stored in memory
- One operation at a time

#### Limitations

**Von Neumann bottleneck** 

Maximum speed of the system is limited by the Memory Bandwidth (bits/sec or bytes/sec)

- Limitation on *Memory Bandwidth*
- Memory is shared by CPU and I/O

## SISD PERFORMANCE IMPROVEMENTS

- Multiprogramming
- Spooling
- Multifunction processor
- Pipelining
- Exploiting instruction-level parallelism
  - Superscalar
  - Superpipelining
  - VLIW (Very Long Instruction Word)

## MISD COMPUTER SYSTEMS



### **Characteristics**

- There is no computer at present that can be classified as MISD

## SIMD COMPUTER SYSTEMS



#### **Characteristics**

- Only one copy of the program exists
- A single controller executes one instruction at a time

## MIMD COMPUTER SYSTEMS



### **Characteristics**

- Multiple processing units
- Execution of multiple instructions on multiple data

### Types of MIMD computer systems

- Shared memory multiprocessors
- Message-passing multicomputers

## **PIPELINING**

A technique of decomposing a sequential process into suboperations, with each subprocess being executed in a partial dedicated segment that operates concurrently with all other segments.



 $R1 \leftarrow A_{i}, R2 \leftarrow B_{i} \qquad \text{Load } A_{i} \text{ and } B_{i}$   $R3 \leftarrow R1 * R2, R4 \leftarrow C_{i} \qquad \text{Multiply and load } C_{i}$   $R5 \leftarrow R3 + R4 \qquad \qquad \text{Add}$ 

# **OPERATIONS IN EACH PIPELINE STAGE**

| Clock<br>Pulse | Segment 1  |             | Segmei  | nt 2      | Segment 3    |  |  |
|----------------|------------|-------------|---------|-----------|--------------|--|--|
| Number         | R1         | R1 R2 R3 R4 |         | R5        |              |  |  |
| 1              | <b>A</b> 1 | B1          |         |           |              |  |  |
| 2              | <b>A2</b>  | <b>B2</b>   | A1 * B1 | C1        |              |  |  |
| 3              | <b>A3</b>  | <b>B3</b>   | A2 * B2 | C2        | A1 * B1 + C1 |  |  |
| 4              | <b>A4</b>  | B4          | A3 * B3 | C3        | A2 * B2 + C2 |  |  |
| 5              | A5         | <b>B5</b>   | A4 * B4 | C4        | A3 * B3 + C3 |  |  |
| 6              | <b>A6</b>  | <b>B6</b>   | A5 * B5 | <b>C5</b> | A4 * B4 + C4 |  |  |
| 7              | <b>A7</b>  | B7          | A6 * B6 | C6        | A5 * B5 + C5 |  |  |
| 8              |            |             | A7 * B7 | <b>C7</b> | A6 * B6 + C6 |  |  |
| 9              |            |             |         |           | A7 * B7 + C7 |  |  |

## **GENERAL PIPELINE**

### **General Structure of a 4-Segment Pipeline**



### **Space-Time Diagram**



## PIPELINE SPEEDUP

13

n: Number of tasks to be performed

**Conventional Machine (Non-Pipelined)** 

t<sub>n</sub>: Clock cycle

 $\tau_1$ : Time required to complete the n tasks

$$\tau_1 = n * t_n$$

Pipelined Machine (k stages)

t<sub>p</sub>: Clock cycle (time to complete each suboperation)

 $\tau_{\kappa}$ : Time required to complete the n tasks

$$\tau_{\kappa} = (k + n - 1) * t_{p}$$

Speedup

S<sub>k</sub>: Speedup

$$S_k = n^*t_n / (k + n - 1)^*t_p$$

$$\lim_{n \to \infty} S_k = \frac{t_n}{t_p}$$

### PIPELINE AND MULTIPLE FUNCTION UNITS

### Example

- 4-stage pipeline
- subopertion in each stage;  $t_p = 20$ nS
- 100 tasks to be executed
- 1 task in non-pipelined system; 20\*4 = 80nS

Pipelined System 
$$(k + n - 1)^*t_p = (4 + 99) * 20 = 2060nS$$

Non-Pipelined System  

$$n*k*t_p = 100 * 80 = 8000nS$$

Speedup 
$$S_k = 8000 / 2060 = 3.88$$

4-Stage Pipeline is basically identical to the system



## ARITHMETIC PIPELINE

# Floating-point adder

 $X = A \times 2^a$   $Y = B \times 2^b$ 

- [1] Compare the exponents
- [2] Align the mantissa
- [3] Add/sub the mantissa
- [4] Normalize the result



Computer Architectures Lab

## INSTRUCTION CYCLE

### Six Phases\* in an Instruction Cycle

- [1] Fetch an instruction from memory
- [2] Decode the instruction
- [3] Calculate the effective address of the operand
- [4] Fetch the operands from memory
- [5] Execute the operation
- [6] Store the result in the proper place
- \* Some instructions skip some phases
- \* Effective address calculation can be done in the part of the decoding phase
- \* Storage of the operation result into a register is done automatically in the execution phase

### ==> 4-Stage Pipeline

- [1] FI: Fetch an instruction from memory
- [2] DA: Decode the instruction and calculate the effective address of the operand
- [3] FO: Fetch the operand
- [4] EX: Execute the operation

# **INSTRUCTION PIPELINE**

**Execution of Three Instructions in a 4-Stage Pipeline** 

Conventional

### **Pipelined**

```
i FI DA FO EX i+1 FI DA FO EX i+2 FI DA FO EX
```

### INSTRUCTION EXECUTION IN A 4-STAGE PIPELINE



## MAJOR HAZARDS IN PIPELINED EXECUTION

### **Structural hazards(Resource Conflicts)**

Hardware Resources required by the instructions in simultaneous overlapped execution cannot be met

### **Data hazards (Data Dependency Conflicts)**

An instruction scheduled to be executed in the pipeline requires the result of a previous instruction, which is not yet available

#### **Control hazards**

Branches and other instructions that change the PC make the fetch of the next instruction to be delayed



Hazards in pipelines may make it necessary to *stall* the pipeline



Pipeline Interlock:

Detect Hazards Stall until it is cleared

## RISC PIPELINE

20

#### **RISC**

- Machine with a very fast clock cycle that executes at the rate of one instruction per cycle
  - <- Simple Instruction Set Fixed Length Instruction Format Register-to-Register Operations

**Instruction Cycles of Three-Stage Instruction Pipeline** 

**Data Manipulation Instructions** 

- I: Instruction Fetch
- A: Decode, Read Registers, ALU Operations
- E: Write a Register

Load and Store Instructions

- I: Instruction Fetch
- A: Decode, Evaluate Effective Address
- E: Register-to-Memory or Memory-to-Register

**Program Control Instructions** 

- I: Instruction Fetch
- A: Decode, Evaluate Branch Address
- E: Write Register(PC)

# **DELAYED LOAD**

LOAD:  $R1 \leftarrow M[address 1]$ LOAD:  $R2 \leftarrow M[address 2]$ 

 $R3 \leftarrow R1 + R2$ ADD:

STORE: M[address 3]  $\leftarrow$  R3

Three-segment pipeline timing

Pipeline timing with data conflict

1 2 3 4 5 6 I A E clock cycle Load R1 I A E Load R2 Add R1+R2

Pipeline timing with delayed load

clock cycle 1 2 3 4 5 6 7 Load R1 I A E

Store R3

IAE Load R2 I A E NOP

Add R1+R2

A E Store R3

The data dependency is taken care by the compiler rather than the hardware

### **DELAYED BRANCH**

Compiler analyzes the instructions before and after the branch and rearranges the program sequence by inserting useful instructions in the delay steps

### **Using no-operation instructions**

| Clock cycles:  | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|----------------|---|---|---|---|---|---|---|---|---|----|
| 1. Load        | I | Α | Ε |   |   |   |   |   |   |    |
| 2. Increment   |   | I | Α | Е |   |   |   |   |   |    |
| 3. Add         |   |   | I | Α | Е |   |   |   |   |    |
| 4. Subtract    |   |   |   | I | Α | Е |   |   |   |    |
| 5. Branch to X |   |   |   |   | I | Α | Ш |   |   |    |
| 6. NOP         |   |   |   |   |   | _ | Α | Е |   |    |
| 7. NOP         |   |   |   |   |   |   | _ | Α | Ш |    |
| 8. Instr. in X |   |   |   |   |   |   |   | Ī | A | Ε  |

### **Rearranging the instructions**

| Clock cycles:  |  | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
|----------------|--|---|---|---|---|---|---|---|
| 1. Load        |  | Α | Е |   |   |   |   |   |
| 2. Increment   |  | I | Α | Е |   |   |   |   |
| 3. Branch to X |  |   | ı | Α | Е |   |   |   |
| 4. Add         |  |   |   | ı | Α | Е |   |   |
| 5. Subtract    |  |   |   |   | ı | Α | Е |   |
| 6. Instr. in X |  |   |   |   |   | I | A | Ε |

### **VECTOR PROCESSING**

#### **Vector Processing Applications**

- Problems that can be efficiently formulated in terms of vectors
  - Long-range weather forecasting
  - Petroleum explorations
  - Seismic data analysis
  - Medical diagnosis
  - Aerodynamics and space flight simulations
  - Artificial intelligence and expert systems
  - Mapping the human genome
  - Image processing

#### **Vector Processor (computer)**

Ability to process vectors, and related data structures such as matrices and multi-dimensional arrays, much faster than conventional computers

**Vector Processors may also be pipelined** 

## **VECTOR PROGRAMMING**

24

### **Conventional computer**

```
Initialize I = 0
20 Read A(I)
Read B(I)
Store C(I) = A(I) + B(I)
Increment I = i + 1
If I ≤ 100 goto 20
```

### **Vector computer**

$$C(1:100) = A(1:100) + B(1:100)$$

# **VECTOR INSTRUCTIONS**

f1: V \* V f2: V \* S f3: V x V \* V f4: V x S \* V V: Vector operand S: Scalar operand

| Туре | Mnemonic | Description (I = 1,, n) |                          |  |  |  |  |
|------|----------|-------------------------|--------------------------|--|--|--|--|
| f1   | VSQR     | Vector square root      | B(I) * SQR(A(I))         |  |  |  |  |
|      | VSIN     | Vector sine             | B(I) * sin(A(I))         |  |  |  |  |
|      | VCOM     | Vector complement       | $A(I) * \overline{A(I)}$ |  |  |  |  |
| f2   | VSUM     | <b>Vector summation</b> | $S * \Sigma A(I)$        |  |  |  |  |
|      | VMAX     | Vector maximum          | S * max{A(I)}            |  |  |  |  |
| f3   | VADD     | Vector add              | C(I) * A(I) + B(I)       |  |  |  |  |
|      | VMPY     | Vector multiply         | C(I) * A(I) * B(I)       |  |  |  |  |
|      | VAND     | Vector AND              | C(I) * A(I) . B(I)       |  |  |  |  |
|      | VLAR     | Vector larger           | C(I) * max(A(I),B(I))    |  |  |  |  |
|      | VTGE     | Vector test >           | C(I) * 0 if A(I) < B(I)  |  |  |  |  |
|      |          |                         | C(I) * 1 if A(I) > B(I)  |  |  |  |  |
| f4   | SADD     | Vector-scalar add       | B(I) * S + A(I)          |  |  |  |  |
|      | SDIV     | Vector-scalar divide    | B(I) * A(I) / S          |  |  |  |  |

## **VECTOR INSTRUCTION FORMAT**

#### **Vector Instruction Format**

| Operation | Base address | Base address | Base address | Vector |
|-----------|--------------|--------------|--------------|--------|
| code      | source 1     | source 2     | destination  | length |

### **Pipeline for Inner Product**



### MULTIPLE MEMORY MODULE AND INTERLEAVING

### **Multiple Module Memory**





#### **Address Interleaving**

Different sets of addresses are assigned to different memory modules