 SU.HARDW.PC.CPU (2:5020/299)  SU.HARDW.PC.CPU 
 From : Aleksandr Konosevich                2:5004/9.7      Mon 28 Nov 94 16:54
 Subj : M1 ARCHITECTURAL OVERVIEW (ADVANCE INFORMATION)


[Poster's introduction, originally in Russian and lost to encoding damage.
Recoverable details: it introduces the subj (11 pages), which was OCR'd with
WinFax. The first part follows:]

----------------------------------------------------------------------------

* SUPERSCALAR, SUPERPIPELINED
   ARCHITECTURE
- Dual 7-stage integer pipelines
- High performance on-chip FPU
- 100 MHz and greater operating frequency

* X86 INSTRUCTION SET COMPATIBLE
- Runs Windows, DOS, UNIX, Novell and others

* OPTIMUM PERFORMANCE WITHOUT
  RECOMPILATION
- Intelligent instruction dispatch
- Out-of-order instruction completion
- Register renaming
- Data forwarding
- Branch prediction
- Speculative execution

The Cyrix superscalar, superpipelined M1 architecture provides next
generation performance to IBM PC-compatible software. Because the M1
is fully compatible with the 486 instruction set, it is capable of
executing a wide range of existing and future operating systems and applications
including Windows, DOS, UNIX, Windows NT, Novell, OS/2, and Solaris.

The M1 achieves unsurpassed performance levels through the use of two
superpipelined integer units and an on-chip floating point unit. The
superpipelined architecture reduces timing constraints and allows the M1 to
operate at core frequencies of 100 MHz and above. Additionally, the M1's integer
and floating point units are optimized for maximum instruction throughput by
using advanced architectural techniques including register renaming,
out-of-order completion, data forwarding, branch prediction, and speculative
execution. These design innovations eliminate many data dependencies and
resource conflicts that would otherwise degrade performance of existing
non-optimized software programs.

1.0     OVERVIEW

The M1 architecture achieves performance by incorporating both superscalar and
superpipelined features. The superscalar architecture enables the M1 to execute
multiple instructions in parallel. Traditionally, the disadvantage of a
superscalar architecture is that the circuit complexity prohibits high
frequency of operation. In contrast, the M1 architecture
divides the most complex stages of operation into simpler sub-stages. This
technique is referred to as superpipelining and allows the superscalar M1
architecture to operate at very high core frequencies (100 MHz and above).

The M1 architecture consists of five major functional blocks as shown in the
high-level block diagram:

*  Integer Unit (IU)
*  Floating Point Unit (FPU)
*  Cache
*  Memory Management Unit  (MMU)
*  Bus Interface Unit (BIU)

The IU, FPU and Cache are discussed in more
detail in the following sections.

2.0     Integer Unit

2.1     Pipeline Description

The M1 integer unit contains dual 7-stage integer pipelines, referred to as the
X and Y pipelines, that provide parallel instruction execution capability. The 7
pipeline stages include:

*  Prefetch (PF)
*  Instruction Decode 1  (ID1)
*  Instruction Decode 2  (ID2)
*  Address Calculation 1 (AC1)
*  Address Calculation 2 (AC2)
*  Execute (EX)
*  Write-back (WB)

Figure 2-1 illustrates the X and Y pipeline
stages.
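
As a rough illustration of how instructions advance through the seven stages
listed above, the Python sketch below builds an idealized timing chart for one
pipeline. It assumes one instruction enters per clock and nothing ever stalls,
which the real M1 does not guarantee. The function names are made up for this
example and do not come from the Cyrix document.

```python
# Idealized stage timing for a single 7-stage pipeline (no stalls assumed).
STAGES = ["PF", "ID1", "ID2", "AC1", "AC2", "EX", "WB"]


def stage_at(issue_clock, clock):
    """Return the stage occupied at `clock` by an instruction that entered
    PF at `issue_clock`, or None if it is not in the pipeline."""
    index = clock - issue_clock
    if 0 <= index < len(STAGES):
        return STAGES[index]
    return None


def timeline(num_instructions):
    """One row per instruction, one column per clock; '--' marks clocks
    in which the instruction is not in the pipeline."""
    total_clocks = num_instructions + len(STAGES) - 1
    return [
        [stage_at(issue, clock) or "--" for clock in range(total_clocks)]
        for issue in range(num_instructions)
    ]


if __name__ == "__main__":
    for row in timeline(3):
        print(" ".join(f"{stage:>3}" for stage in row))
```

Reading a column of the chart top to bottom shows which stages are busy in a
given clock; with the X and Y pipes side by side there would be two such charts.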

The Prefetch (PF) stage is common to both the X and Y pipe. During this stage,
16 bytes of code are fetched per core clock from the memory subsystem.
Additionally, the code stream is checked to identify the presence of
instructions that modify the normal sequential execution of the program. These
instructions are referred to as branch instructions. Two types of branch
instructions exist: (1) unconditional branches that always modify the
instruction flow, and (2) conditional branches that modify the instruction flow
based on a variable. If either type of branch instruction is detected, the
branch prediction logic provides the predicted target address for the
instruction. The prefetch stage then begins fetching at the predicted address.

The Instruction Decode stage is superpipelined and consists of two sub-stages,
ID1 and ID2. The ID1 stage evaluates the code stream provided by the prefetch
stage and determines the number of instruction bytes for up to two
instructions per clock. The ID2 stage then decodes the two instructions and
selects either the X or Y pipeline for further execution. A load balancing
algorithm is used for pipeline selection. This algorithm determines which
pipeline is least likely to delay instruction completion due to interaction with
previously dispatched instructions.

--------------------------------------------
[Block diagram, lost in OCR. Legible labels: Instruction Fetch;
Instruction Decode 1; Instruction Decode 2, Address Calc. 1 and
Address Calc. 2 (one of each per pipeline); X Pipeline; Y Pipeline;
In-Order Processing; Out-of-Order Completion.]
--------------------------------------

Figure 2-1. Integer Unit Pipelines

The Address Calculation stage is also superpipelined and consists of the two
sub-stages AC1 and AC2. If the current instructions require memory operands, the
AC1 stages calculate up to two linear memory addresses per clock (one per
pipeline) and AC2 then performs the associated memory management functions and
cache accesses. For register operands, register renaming occurs during AC1, and
AC2 then accesses the register file. Additionally, floating point instructions
are dispatched to the FPU during the AC2 stage.

---------------------------------------------------------------------------

To be continued ... :)

                    With best wishes, Aleksandr
P.S. [Russian text lost to encoding damage; it refers to the January 1994
issue of BYTE.]
---
 * Origin:  aleks@sibkom.omsk.su  (2:5004/9.7)

 SU.HARDW.PC.CPU (2:5020/299)  SU.HARDW.PC.CPU 
 From : Aleksandr Konosevich                2:5004/9.7      Mon 05 Dec 94 19:43
 Subj : M1 ARCHITECTURAL OVERVIEW (ADVANCE INFORMATION)


So, the continuation. Pages 4 through 6:

---------------------------------------------------------------------------

All instructions are kept in program order up to and during the AC1 and AC2
stages.

The Execute (EX) stage actually performs the instruction operation using the
operands provided by the address calculation stage. The operation results are
written to the register file and write buffers during the Write-Back (WB)
stage. Once instructions have entered the EX stage, instructions in one pipeline
may complete independently of the second pipeline. In other words, instructions
may complete in a different order than they were dispatched.
This is referred to as out-of-order completion. However, any resulting bus
cycles are always issued in program order.

2.2 Optimized Pipeline Utilization

The M1 architecture optimizes parallel use of the X and Y pipelines by allowing
the majority of instructions to be dispatched in pairs, and
by allowing the two pipelines to operate in a relatively independent fashion.
These techniques maximize performance by reducing the number of clocks in which
pipeline stages are idle.

2.2.1 Instruction Dispatch

The M1 architecture enforces very few instruction pairing constraints.
The most commonly used instructions in the x86 instruction set may be dispatched
in pairs to either pipeline, regardless of dependencies that may exist between
the two instructions. However, there are three categories of instructions that
must be dispatched only in the X pipeline:

(1) branch instructions,
(2) floating point instructions, and
(3) exclusive instructions.

The first two X-pipe only instruction types, branch and floating point, may be
paired with another instruction in the Y pipeline. Exclusive instructions may
not be paired. Instructions are classified as exclusive if they may fault in the
EX pipe stage and are typically instructions that require multiple memory
accesses. Although exclusive instructions may not be paired, hardware from both
pipelines is used to accelerate instruction completion. The M1 exclusive
instruction types are listed below:

* Protected mode segment loads
* Special register accesses
   (Control, Debug and Test registers)
* String instructions
* Multiply and divide
* I/O port accesses
* Push all (PUSHA) and pop all (POPA)
* Task switches
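
The pairing rules above can be sketched as a small Python check. This is only
an illustration: the category labels ("branch", "float", "string" and so on)
are names invented for this example, and the rule that two X-only instructions
cannot share a clock is inferred from the statement that both must use the
X pipeline. For two ordinary instructions the real M1 picks pipes with its load
balancing algorithm; the sketch simply puts the first one in X.

```python
# Hypothetical category labels; the M1 document defines no such identifiers.
X_ONLY = {"branch", "float"}
EXCLUSIVE = {
    "segment_load", "special_reg", "string", "muldiv",
    "io", "pusha_popa", "task_switch",
}


def can_pair(first, second):
    """True if two instruction categories may be dispatched in one clock."""
    if first in EXCLUSIVE or second in EXCLUSIVE:
        return False          # exclusive instructions are never paired
    if first in X_ONLY and second in X_ONLY:
        return False          # inferred: both would need the X pipeline
    return True


def assign_pipes(first, second):
    """Return a pipe assignment for a pairable couple of categories."""
    if not can_pair(first, second):
        raise ValueError("instructions cannot be paired")
    if second in X_ONLY:
        return {"X": second, "Y": first}
    # Simplification: the first instruction takes X when either pipe would do.
    return {"X": first, "Y": second}


if __name__ == "__main__":
    print(can_pair("alu", "alu"))
    print(assign_pipes("alu", "branch"))
    print(can_pair("string", "alu"))
```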

2.2.2 Out-of-Order Completion

Out-of-order completion occurs in the EX and WB stages when an instruction in
one pipeline completes prior to a previously dispatched instruction in the
adjacent pipeline that requires multiple clocks to complete. This type
of processing is primarily used when an instruction in one pipeline is stalled
waiting for a memory access to complete. Under this condition, the current and
subsequent instructions in the EX stage of the adjacent pipe can be completed
without waiting for the pending access to complete, assuming no
inter-instruction dependencies.

The M1 architecture always supplies instructions in program order to the EX
stage, and allows instructions to complete out-of-order only from that point on.
In conjunction with exclusive instructions, this ensures that exceptions occur
in program order. Also, writes resulting from instructions completed
out-of-order are always issued to the cache or external bus in program order.
Thus, x86 software compatibility is maintained.
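
One way to picture "complete out of order, write in program order" is a buffer
that holds finished writes until every earlier write has been released. The
Python sketch below is only a model of that behaviour. The text does not say how
the M1 write buffers are built, so the class, its methods and the use of
sequence numbers are all assumptions made for illustration.

```python
import heapq


class WriteOrderer:
    """Releases writes in program order even when they complete out of order.

    `seq` is the position of a write among all writes in program order,
    starting at 0 (an assumption of this sketch).
    """

    def __init__(self):
        self._pending = []   # min-heap of (seq, address, value)
        self._next_seq = 0   # next write allowed to leave the buffer
        self.issued = []     # writes released to cache/bus, in order

    def complete(self, seq, address, value):
        """Record a finished write and release every write that is now due."""
        heapq.heappush(self._pending, (seq, address, value))
        while self._pending and self._pending[0][0] == self._next_seq:
            self.issued.append(heapq.heappop(self._pending))
            self._next_seq += 1


if __name__ == "__main__":
    buf = WriteOrderer()
    buf.complete(1, 0x10, 5)   # younger write finishes first: held back
    buf.complete(0, 0x20, 7)   # older write finishes: both are released
    print(buf.issued)
```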

2.3 Data Dependency Removal

M1 incorporates key architectural features that eliminate idle pipeline stages
resulting from inter-instruction data dependencies. A combination of register
renaming, data forwarding and data bypassing techniques is used to eliminate
write-after-write (WAW), write-after-read (WAR) and read-after-write (RAW)
data dependencies.
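
For readers who want the three dependency types spelled out, here is a minimal
Python sketch that classifies them for a pair of instructions from the
registers each one reads and writes. The dictionary layout and the function
name are assumptions of this example, not anything defined by Cyrix.

```python
def hazards(first, second):
    """Classify register dependencies of `second` on the earlier `first`.

    Each instruction is a dict with 'reads' and 'writes' sets of
    register names (a made-up representation for this sketch).
    """
    found = set()
    if first["writes"] & second["reads"]:
        found.add("RAW")   # second reads what first writes
    if first["reads"] & second["writes"]:
        found.add("WAR")   # second overwrites what first still has to read
    if first["writes"] & second["writes"]:
        found.add("WAW")   # both write the same register
    return found


if __name__ == "__main__":
    mov_bx_ax = {"reads": {"AX"}, "writes": {"BX"}}
    add_ax_cx = {"reads": {"AX", "CX"}, "writes": {"AX"}}
    print(hazards(mov_bx_ax, add_ax_cx))
```

The pair used here, MOV BX, AX followed by ADD AX, CX, has only a WAR
dependency, on AX.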

2.3.1 Register Renaming

The M1 architecture contains 32 physical general purpose registers. Any of
these 32 registers can be mapped, or renamed, to any one of the 8 logical
general purpose registers defined by the x86 architecture (EAX, EBX, ECX, EDX,
ESI, EDI, EBP, ESP). This renaming is controlled entirely by on-chip hardware
and is therefore transparent to software.

Each time a write to a logical register occurs, a new physical register is
assigned to the logical register. This prevents overwriting the previous data in
the logical register and thus eliminates write-after-write (WAW) and
write-after-read (WAR) dependencies as illustrated in the following examples.

WAR Dependency Removal Example

Assume the following instructions are executing simultaneously in the X and Y
pipelines:

(1) MOV BX, AX
(2) ADD AX, CX

X-PIPE            Y-PIPE

(1) BX <- AX      (2) AX <- AX + CX

A WAR dependency exists with the AX register because the Y pipe must wait for
the X pipe to read AX before the add instruction in the Y pipe updates the value
of AX. This causes the Y pipe to stall in an architecture where register
renaming is not used and out-of-order completion is allowed.

In the M1, physical registers are substituted for the logical registers. The
operations are completed in parallel with no Y pipeline stall
as shown below:

Initial assignments:
                 AX = reg0
                 BX = reg1
                 CX = reg2

X-PIPE            Y-PIPE

(1) reg3 <- reg0  (2) reg4 <- reg0 + reg2

Final assignments:

               AX = reg4
               BX = reg3
               CX = reg2

WAW Dependency Removal Example

Assume the following instructions are
executing simultaneously in the X and Y pipelines:

(1) MOV  AX,[MEM]
(2) ADD  AX, BX

X-pipe            Y-pipe

(1) AX <- [mem]   (2) AX <- AX + BX

The X pipe issues a memory access. The Y pipe is waiting for the same memory
data as the X pipe to be used in the ADD calculation. Using data forwarding (see
Data Forwarding), the memory operand is available to both pipelines at the same
time. A WAW dependency is created with AX because the Y pipe must wait
for the X pipe to update AX before the Y pipe can write the result of the ADD
instruction to AX. This causes the Y pipe to stall in an architecture where
register renaming is not used.

Using register renaming, the M1 substitutes physical registers for the logical
registers. The operations are completed in parallel with no Y pipeline stall as
shown below:

Initial assignments:

    AX = reg0
    BX = reg1

X-pipe               Y-pipe
(1) reg2 <- mem      (2) reg3 <- mem + reg1

Final assignments:
AX = reg3
BX = reg1

2.3.2 Data Forwarding

In addition to register renaming, the M1 architecture incorporates a technique
called Data Forwarding that is used to eliminate read-after-write register and
memory dependencies. Data forwarding allows pairs of instructions with a RAW
register dependency to execute simultaneously, thus eliminating pipeline stalls.
The M1 implements two types of data forwarding: (1) operand forwarding,
and (2) result forwarding.

Operand forwarding occurs when a MOV instruction is used to load data into a
register or memory location. The register or memory location is then used in a
subsequent instruction as an operand, creating a RAW dependency on the operand
register or memory location. Using operand forwarding, the load data is
immediately made available to the subsequent instruction without waiting for the
completion of the MOV instruction. Operand forwarding is illustrated in the
following example.

Operand Forwarding Example

Assume the following instructions are executing simultaneously in the X and Y
pipelines:

        (1) MOV AX, [MEM]
        (2) ADD BX, AX

X-pipe               Y-pipe

(1) AX <- [mem]      (2) BX <- AX + BX
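
The effect of operand forwarding on this pair can be modelled in a few lines of
Python. This is a behavioural sketch only: registers are treated as plain
16-bit values in a dict, and the function name and arguments are invented for
the example. It shows the ADD using the data returned by the memory access
rather than the stale contents of AX.

```python
MASK16 = 0xFFFF  # AX and BX are 16-bit registers


def run_pair_with_forwarding(regs, memory, address):
    """Model MOV AX,[MEM] and ADD BX,AX completing together.

    The loaded data is handed to both instructions at once, so the ADD
    does not wait for the MOV to write AX.
    """
    loaded = memory[address]                      # data from the memory access
    result = dict(regs)
    result["AX"] = loaded & MASK16                # (1) MOV AX, [MEM]
    result["BX"] = (regs["BX"] + loaded) & MASK16  # (2) ADD BX, AX (forwarded)
    return result


if __name__ == "__main__":
    before = {"AX": 1, "BX": 10}
    after = run_pair_with_forwarding(before, {0x100: 5}, 0x100)
    print(after)
```

Without forwarding, running both instructions in the same clock would add the
old AX value (1 in this run) instead of the loaded value, so the Y pipe would
have to stall until the MOV completed.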

------------------------------------------------------------------------------


To be continued ... ;)

                    With best wishes, Aleksandr
---
 * Origin:  aleks@sibkom.omsk.su  (2:5004/9.7)
