Chapter Three

Pipelined/Overlap Computers

In Chapters 1 and 2, we have talked about the uni-processor in terms of CPU design, I/O interface, VM, memory organisation, etc. In this chapter, we will start to look at multi-processor which is formed by grouping a number of processors together. In the following three chapters, Pipelined architecture which splits a large computer into a number of functional units, overlapped architecture which combines CPU execution and I/O together, array architecture which group a number of similar smaller processors together under a central unit and true parallel computer which allows individual computers to share data/loading through a sophiscated internal network. Additional reference information can be found from the following reference books.  

The architecture of pipelined computers by Peter M. Kogge ( QA 76.5 .K587)

Tutorial, Computer Architecture (QA 76 .9 .A73 C649 ) Page 38- 80

Upon completion of this chapter, the student should

Understand the evolution, organization and features of pipeline computer
overlapped computer and
CICS and RISC

Terminology

Term

Description

CISC

Complex Instruction Set Computer. A processor where each instruction can perform several low-level operations such as memory access,    arithmetic operations or address calculations. The term was coined in contrast to RISC.

SISD

Single Instruction and Single Data (Uniprocessor)

SIMD

Single instruction and Multiple Data

MISD

Multiple Instruction and Single Data

MIMD

Multiple Instruction and Multiple Data

Stage

Pipelining takes the approach of splitting the function to be performed into smaller pieces and allocating separate hardware to each piece is termed a stage

Overlap

The concurrent use of many different stages by different items is called overlap

RISC

Reduced Instruction Set Computer. A processor whose design is based on the rapid execution of a sequence of simple instructions rather than on the provision of a large variety of complex instructions.

Introduction

In the world of high-speed computers, there are two approaches, namely, Parallelism and Pipelining, to achieve high speed.

Type

Description

Parallelism

It achieves high speed by replicating some basic function many times and providing each replication with one piece of the input data. This is equivalent to have 5 workers to perform the same task such as laying brick.

Pipelining

It takes the same function and partitions it into many autonomous but interconnected subfunctions. This is equivalent to the assembly work in an electronics factory where a task will be broken into a number of sub-tasks performed by 5 workers in sequence.

Pipeline architecture in this chapter is best described as follows:

It is a processing technique aiming for a steady throughput. (steady means smooth and is not the initial transient stage)
The processing power is decentralized (decentralised means split) with numbers of processing stations, called pipeling stages or segments.
The processing distribution is more or less uniform over the processing path. (if the processing distribution is non-uniform, the performance is undesirable.)
The number of datapaths is relatively small, and the data movement between segments is systematic.

Based on Flynn’s taxonomy, SISD refers to traditional Von Neumann architecture, SIMD refers to array processor, MISD refers to pipeline processor and MIMD refers to multiprocessor.

Historic background

The following quotes four high speed computers which makes use of pipelined approach. Note that for high speed computer, we are talking about 32-64 GB for real memory and they might use a popular operating system such as Unix.

Machine type

Description

UNIVAC I

It is supposed to be one of the earliest machines to use the concept of pipelining. It overlapped program execution with certain limited I/O activities.

IBM 7094

It used the pipelining and overlap concept within the computing facility to speed up instruction fetch times. It had a 72-bit wide memory and 36-bit instruction. Each time a memory access was made to fetch an instruction, the unused 36 bits were saved to eliminate half of the instruction fetches.

CDC 6000 series

They were among the first to attempt to overcome the dependency hold time problem which is caused by an instruction at one stage waiting for the completion of processing at another stage.

IBM 360/91

It used the pipeline concept was designed to upgrade computational performance by one or two orders of magnitude compared to the 7090.

Classification of pipelines

Broadly speaking, pipelines according to their capabilities and how they are actually used are classified into unifunction pipeline and multifunction pipeline.

Type

Description

Unifunction pipeline

Is one that is capable of only one basic kind of function evaluation. The pipelining system performs the same operations on every set of inputs given it with no variations.

Multifunction pipeline

Is one that is capable of several different kinds of function evaluations.

Pipeline Characterization

A pipe line architecture can be distinguished by its design configurations and control strategies. It can be applied at more than one level such as top-down or level-by-level. There are two major points to describe the characteristics of pipeline architecture as follows.

Item

Description

Levels of pipelining

Pipelining at the same system level is exemplified in the design of the instruction processing unit. The processing unit can be decomposed into various functional segments such as instruction fetch, decode and address generation.


Figure depicts the operation of a pipeline concept in which a task is passed through each segment. It will start to output one task per minor cycle after a stream of tasks has entered this pipeline. The space diagram demonstrating the steps for achieving the steady state is shown in Figure . Here, an instruction is broken into four sub-instructions, namely, Exec (execution), OF (Operand fetch), ID (Instruction decode) and IF (Instruction fetch). The instructions are labelled by number such as 1, 2, 3 or 4. In this diagram, during the second slot, when 1 is in ID, 2 is in IF. This means that when the first instruction is in the Instruction Decode stage, the following instruction (second) is only in the Instruction Fetch. The steady state refers to the time when the machine produces one completion instruction (IF, ID, OF and Exec).

Subsystem level

This is about the subsystem pipeline level. The typical examples include pipelined arithmetic units with diagram given below .

Pipeline configurations

It is further classified into static or dynamic pipe based on different design and control strategies as follows.

Static

Dynamic

At any instant, only one configuration is active

Permits overlapped processing among several active configurations simultaneously

Easier to control in multifunctional pipes in arithmetic units

More elaborate control and sequencing are required

Steady-state behavior

A simple pipeline is a time-synchronized model similar to an industrial assembly line without any feedback sequence. Consider a pipeline model with M microprocessing stages (S0. S1.....) with different processing duration (t0......) as given in Figure .

The jth stage can

accept input of aj
Process local work of wj within the time interval tj
Produce the output of bj

The M processing stages can process M units together, one after the other. Using this model, bj = aj+1 for all j. The processing time for each stage is assumed to be constant T( in reality, this value is unequal).

This model has following characteristics:

It has a processing duration of MT
At every cycle, it can receive one set of inputs, generate one set of outputs, and perform an amount of work equaling to the summation of Work
The steady-stage work rate is the summation per cycle.
The steady-state is reached at time of t0 + MT (where M is the number of stages and T is the time unit for the longest stage)
The global input at time (t0 + kT) is a task of Tk, which moves through the pipeline at steady rate
The joining of two pipelines with equal cycle is again a pipeline, with a steady-state throughput equaling the sum of the two original steady-state throughputs.
In terms of the rate of task handling, the pipeline at the steady-state finishes one task per cycle.

Draw a timing diagram showing the relationship between the individual tasks and elapsed time.

Example of finding out the steady state.

A pipelined computer with different processing times is given below. Based on these figures, calculate the throughput during the steady state and find out how long it takes (duration) to reach the steady state. (Past Examination question)

Stage 1

Stage 2

Stage 3

Stage 4

Stage 5

1.1 ms

1.2 ms

0.9 ms

1.3 ms

0.8 ms

5*1.3ms (where 1.3 ms is the longest time in this pipelined system).

Performance considerations

This is to review the advantages, requirements and limitations of pipelining architecture in terms of:

Throughput considerations
Efficiency considerations
Maximum speed limitation
Design optimization
Bounds on execution time and efficiency

Here, only the throughput and efficiency are considered.

Throughput considerations

It is defined as the number of outputs per unit time and is a measure of system’s performance. For instance, if a task takes 500 seconds to complete, the through means within one second, it can complete 1/500.  The characteristics/ features are given below:

Directly reflects the processing power of a processor system
The higher its throughput rate, the more powerful the system is
It is determined by its slowest facility (component) within the system.
Say tb = max {t1, t2....} = speed of the slowest facility in the pipeline.
The maximum throughput rate is 1/tb
The execution time for a non-pipeling system is tp = t1 + t2 ........
The throughput is therefor 1/tp, much less than a pipelined system
In case there is a bottleneck, the throughput can be improved by subdivision of the bottleneck or putting parallel facilities as shown in Figure . Here, a unit which takes 3 time units to complete is split into three units each of which takes only one time unit. Before splitting, the throughput is 1/3t. After splitting the throughput is 1/t. This shows that the throughput has been improved by three times.

Efficiency considerations

This is another important performance measure for a system and is also called utilization factor. It:

is a direct reflection of how effective a processing scheme is;
can be used to indicate how future improvements;
can be evaluated both analytically and experimentally by measurements
To evaluate the efficiency of the processor system as an entity, It was proposed to use a uniform space-time span index:
efficiency of pipeline = (Total space-time span of tasks)/( Total space-time span of facilities )
Term task and process are used to fit the loose definition of a pipeline.

Example of finding out the efficiency

A pipelined computer with different processing times is given below. Based on these figures, find out the efficiency of this pipelined system. (Past examination question)

Stage 1

Stage 2

Stage 3

Stage 4

Stage 5

1.1 ms

0.9 ms

0.8 ms

1.2 ms

1 ms

The efficiency is {1.1 + 0.9 + 0.8 + 1.2 + 1}/{5x1.2}

Comparison between Parallelism and Pipeline

Parallelism and pipeline as mentioned are two major approaches in high speed computer design. Below summarizes major differences between them. (No need to memorise)

Parameter

Parallel architecture

Pipelined architecture

Design

Independent execution of subprograms on separate hardware within the system

Partition the function into N interrelated N subfunctions

Performance

N results for every T seconds

One result for every T/N seconds

System clock

Time for one function evaluation

Time for one stage

Usual supported architecture

SIMD,MIMD

SISD,SIMD

Nature of problem to be solved

Array problems with lengths multiples of number of processor

Speedup of conventional instruction sets

Timing diagram

Two-dimensional processor # vs. time

Two-dimensional stage # vs. time

Memory organization

Multiple, independent memory modules

Single, multiway-interleaved memory

Detailed control

Handled by user

Handled largely by hardware

Overlap processing

Overlap is the phenomenon of concurrent processing. In a computer system, the operating speed can be upgraded by overlapping some of the units such as I/O unit. To demonstrate the benefits, two types of overlaps, namely, I/O overlap and I/E Overlap are chosen.

I/O Overlap

It is quite natural to overlap the I/O operations with processor operations, as two operations can be separated with additional buffers. Almost all the current minicomputer design has a separate communication controller specifically responsible for handling all the I/O units.

The I/O units still require CPU resource.

I/E Overlap

The instruction fetch time (I) and execution time (E) can be overlapped to reduce the cycling time required to compete a few instructions. A diagram showing the difference between no overlap and I, E overlap is given in Figure . Using this approach, the resultant I/E times are nearly half the conventional design. Note that the hardware design becomes complicated and mechanism to reduce contention is required.

Example

If overlap principle is applied to the following fetch (I) and execution (E) time diagram, the total time required in terms of machine cycles to complete the instructions is reduced to:

I

E

I

E

I

E

I

E

3

4

2

3

3

2

4

2

The machine cycles is (3 + {max 4, 2} + {max 3,3} + {max 2, 4} + 2) = 16 cycles.

RISC vs. CISC

The term, RISC, was first used to name a research architecture at Berkeley:  the RISC microprocessor.  It has come to (loosely) mean a single chip processor that has the following qualities:

load/store architecture
very few addressing modes
simple instructions
pipelined implementation
small instruction set — easily decoded instructions
fixed-size instructions

This term, CISC - Complex Instruction Set Computer, was coined to distinguish computers that were not RISC. It generally is applied to computers that have the following qualities:

complex instructions
large instruction set
many addressing modes

It impose difficulties with these terms as they are not precisely defined and these terms introduced/applied to earlier machines. However, “RISC” has become a marketing tool to show that the machine is more efficient.

Reduced - Reduced Instruction Set Computer

The earliest computer were extremely simple. Before Wilkes invented micro-programming, all computers were RISC machines, with simple instructions directly executed by the hardware. With the microprogramming, computers became more complex and less efficient. Figure shows how a low-level languages can directly drive the hardware. In this diagram, it shows that RISC is simpler and faster.

CICS - Complex Instruction Set Computer

The widespread use of high-level languages, such as COBOL, more translations, complex compilers. Figure shows how a high level language drives the hardware through procedures call interpreted by library routines.

The IBM 360, other mainframe, DEC VAX, Intel 80386, Motorola 68030 are examples of CICS machines. High-level languages need library routines reside in main memory, this approach is slow, it’s better to put these routines into fast-speed ROM inside the CPU. For instance, to change these routines into micro-programmes, making the micro-programme complex.

RISC in Modern Computer Architecture

As the machine language gets larger and more complicated, its interpreter (the micro-programme) gets bigger and slower; more instructions mean more time spent in decoding the opcodes. Figure shows how a high level language drives the hardware after compilation through micro-codes in ROM. Machines using RISC have a small number of vertical micro-instructions. user programmes are compiled into sequences of these micro-instructions, and then directly executed by the hardware, with no intervening interpreter. It results in simple instructions, such as adding two registers, can now be completed in a single micro-instruction. The following shows examples of CICS and RISC machines.

 

IBM 310

Xerox Dorado

IBM 801

Standford MIPS

The reason why RISC can be made feasible again is that it is now possible to let the compiler produce micro-code directly, skipping the interpreter. Below shows the comparison between CICS and RISC.

CICS

RISC

Not pipelined or less pipeline

Pipelined

Many instructions and modes

Few instructions and modes

Complex instructions taking multiple cycles

Simple instruction taking 1 cycle

Complexity is in the microprogram

Complexity is in the compiler

Instructions interpreted by the micro-program

Instructions directly executed by hardware

Speed Factors

Fewer micro-instructions, few time spent in decoding
RISC instructions are completed in a single data path cycle. Few data path cycle means saving the time spent in moving data back-and-forth between registers and ALU.
All RISC are pipelined: the CPU contains several independent units that work in parallel
No micro-instructions interpretation, as all programs are compiled into micro-codes.

Summary

This is the first chapter on the high speed computer using pipelined and overlapped approaches. Pipelined machine is ideal for those tasks that can be equally segmented (be broken). Otherwise, the performance does not outperform than a non-pipelined machine. Two performance indictors, namely, throughput and efficiency were introduced. Throughput measures the completion of task per unit time while efficiency measures how well the Pipelined system performs. 100% efficiency means that there is no idle time between two stages. Overlap can only improve the system up to 50% (theoretically less than 50%) only. It is to perform one thing while doing the other to achieve parallel execution. RISC and CICS which deal with the instruction sets are also included in this chapter.

Self-test Question

It it takes 3 seconds to complete a task, the throughput is _____(3, 1/3) tasks/second.
All the computer system can be segmented into a number of equal sub-systems. (True/False)
Using IE overlap can achieve double performance. (True/False)
A pipelined computer with different processing times is given below. Based on these figures, calculate the throughput during the steady state and find out how long it takes (duration) to reach the steady state. (Past Examination Question)
 

Stage 1

Stage 2

Stage 3

Stage 4

Stage 5

1.1 ms

1.2 ms

1.3 ms

0.8 ms

0.9 ms