/* CS 341 Spring 2002 */
/* Note segment 2 */
/* 22-Jan-2002 */
/* Taken by Apostolos Paul Pantazis */

Last Time: Basic CPU Architecture.

Memoryless Logic: An input is applied to a device, and shortly after
                  the device generates an output. The catch is that the
                  device does not remember anything about the input after
                  the output is generated. A good example is an adder.

    (1) -->|-------------|
	   |		 | --> (16 bit out)
    (2) -->|-------------|

    (1) & (2) are 16-bit inputs.
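
    A minimal sketch of the adder as memoryless logic, written as a pure
    function: the output depends only on the two current inputs. The name
    add16 is made up for illustration, and dropping the carry out of the
    top bit is an assumption (the notes do not say what happens on overflow).

```python
MASK16 = 0xFFFF  # a 16-bit output can only hold 0..65535


def add16(a: int, b: int) -> int:
    """Memoryless 16-bit adder: no state survives between calls."""
    # Assumption: the carry out of bit 15 is simply dropped.
    return (a + b) & MASK16


print(add16(17, 40))      # 57
print(add16(0xFFFF, 1))   # 0 (the sum wraps around at 16 bits)
```

    Calling add16 twice with the same inputs always gives the same output,
    which is exactly why a latch is needed to hold a result.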

    Remember that in building CPUs the presence of a clock and a latch
    is vital.

    There are 2 things that matter when designing faster CPUs:
    1. We care about the amount of time (picoseconds) it takes for the
       output to be produced.
    2. We need a latch to hold the output, since memoryless logic does
       not remember it.

    Last time we talked about the one-address accumulator machine.
    It works in a very simple way: it holds one value in the accumulator
    and one in the PC (recall PC = Program Counter). It is also important
    to know that a 1-address machine accesses memory at least twice per
    instruction.

    There are 2 ways of building a computer:
    1. Synchronous --> every part of the computer produces its values
                       based on a clock.
    2. Asynchronous --> No clock is present; a signal is generated to
                        denote that a value is ready. Easier to understand.
                        The division circuits on a Pentium are implemented
                        in this way. The rest of the chip is synchronous.

    Back to the 1-address accumulator machine:
    ADD 17
    MUL 21
    * How does that work? *
    * See class notes Memory Diagram for values *
    --> PC will initially have a value of 100. ACC (for accumulator)
        will start at 0. At the first clock cycle, 100 is placed
        on the address bus. We get back the instruction ADD 17, so next
        we place 17 on the address bus. A value of 40 comes back and it
        is added to the ACC, which is now 40, and PC is now 101. The next
        instruction is fetched and the process is repeated.
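
    The walkthrough above can be sketched as a tiny simulator. This is
    only an illustration: the function name run and the dictionary layout
    of memory are invented here, and the value stored at address 21 is an
    assumption, since the class memory diagram is not reproduced in these
    notes.

```python
def run(memory, pc, steps):
    """Execute `steps` instructions; return (acc, pc, memory accesses)."""
    acc = 0
    accesses = 0
    for _ in range(steps):
        opcode, address = memory[pc]   # access 1: fetch the instruction
        accesses += 1
        operand = memory[address]      # access 2: fetch the operand
        accesses += 1
        if opcode == "ADD":
            acc += operand
        elif opcode == "MUL":
            acc *= operand
        else:
            raise ValueError(f"unknown opcode {opcode!r}")
        pc += 1
    return acc, pc, accesses


memory = {
    100: ("ADD", 17),
    101: ("MUL", 21),
    17: 40,   # value given in the notes
    21: 3,    # ASSUMED value, not from the notes
}

print(run(memory, 100, 1))   # (40, 101, 2)
print(run(memory, 100, 2))   # (120, 102, 4)
```

    After one instruction the ACC is 40 and the PC is 101, matching the
    walkthrough, and every instruction costs two memory accesses.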

    * Each cycle of the clock performs 1 memory access, so each
      instruction takes 2 cycles of the clock. *

      (2 memory accesses per instruction) * (1 cycle per memory access)
      == [2 cycles per instruction]
      |--> If ALU is really slow the above would not hold.

    Modern Electronics: The cycle time is pretty fast, so 1 cycle is 1/2
                        a nanosec (500 picosec).
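
    Putting the two numbers together gives the time per instruction.
    A short worked calculation, under the notes' simplifying assumption
    that a memory access completes in one cycle (the variable names are
    just for illustration):

```python
CYCLE_TIME_PS = 500            # from the notes: 1 cycle = 500 picoseconds
ACCESSES_PER_INSTRUCTION = 2   # fetch the instruction, then the operand
CYCLES_PER_ACCESS = 1          # the notes' simplifying assumption

cycles_per_instruction = ACCESSES_PER_INSTRUCTION * CYCLES_PER_ACCESS
time_per_instruction_ps = cycles_per_instruction * CYCLE_TIME_PS
instructions_per_second = 10**12 // time_per_instruction_ps

print(cycles_per_instruction)    # 2
print(time_per_instruction_ps)   # 1000, i.e. 1 nanosecond
print(instructions_per_second)   # 1000000000
```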

    One of the MAJOR concerns in CPU design is accessing memory less and
    hiding the memory latency that exists.
    --> How do we hide this latency?

        (1). Do more things inside the CPU.
            |--> Use multiple registers instead of 1 accumulator.

        So... multiple registers. What is that?
        Inside the CPU you have a register bank, r0...r7 let's say.
        Instructions will not just operate on the ACC; instead they
        name the registers they use, like:
        ADD  r0, r1  /* result in r1 */
        MUL  r2, r3
        LOAD r3, r5
        |--> 2-address machine. This is a LOAD/STORE 2-address architecture
             with 8 registers (r0...r7).

        An instruction would look like:
        |  (1)  |  (2)  |  (3)  |

        (1) --> Opcode
        (2) --> 1st register, 3 bits.
        (3) --> 2nd register, 3 bits.

        * Recall there are 8 registers, so each register number is
          3 bits long.
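
        A hedged sketch of packing and unpacking such an instruction.
        The notes only fix the two 3-bit register fields; the opcode
        numbering and placing the opcode in the bits above the register
        fields are assumptions made for this example.

```python
# Hypothetical opcode numbers; the notes do not assign any.
OPCODES = {"ADD": 0, "MUL": 1, "LOAD": 2, "STORE": 3}
NAMES = {number: name for name, number in OPCODES.items()}


def encode(op: str, reg1: int, reg2: int) -> int:
    """Pack opcode | 3-bit register | 3-bit register into one integer."""
    if not (0 <= reg1 <= 7 and 0 <= reg2 <= 7):
        raise ValueError("register numbers must fit in 3 bits (0..7)")
    return (OPCODES[op] << 6) | (reg1 << 3) | reg2


def decode(word: int):
    """Unpack an instruction word into (opcode name, reg1, reg2)."""
    return NAMES[word >> 6], (word >> 3) & 0b111, word & 0b111


word = encode("MUL", 2, 3)
print(word)           # 83
print(decode(word))   # ('MUL', 2, 3)
```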

        Opcode Classes:
        1. Arithmetic, like ADD, MUL.
        2. Memory access, like STORE, LOAD.
        3. Control opcodes, for branching (see *1 below).
        4. Boolean opcodes.
        (*1) : like JUMP, which gives an absolute address. The JUMP
               instruction carries an address; the CPU gets that address
               and jumps to it.
               JUMP conditionally: will only branch if the most recent
               value calculated by the CPU is nonzero.

       Often the ALU will produce an output plus some extra information
       about the operation, like overflow info. These are called the
       CONDITION CODES (CC). Is it negative? Was there an overflow?
       Is it equal to zero? The branch instruction will look at the CC.
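
       A minimal sketch of an ALU add that also produces condition
       codes. It assumes 16-bit two's-complement values; the function
       name and the dictionary of flags are invented for illustration.

```python
def add16_with_cc(a: int, b: int):
    """Add two 16-bit patterns (0..0xFFFF); return (result, flags)."""
    sign_bit = 0x8000
    result = (a + b) & 0xFFFF
    cc = {
        "zero": result == 0,
        "negative": bool(result & sign_bit),
        # Signed overflow: the inputs have the same sign, but the
        # result's sign differs from theirs.
        "overflow": bool(~(a ^ b) & (a ^ result) & sign_bit),
    }
    return result, cc


result, cc = add16_with_cc(0x7FFF, 1)
print(hex(result), cc)
# 0x8000 {'zero': False, 'negative': True, 'overflow': True}

# A conditional JUMP as described in the notes: branch if nonzero.
take_branch = not cc["zero"]
print(take_branch)   # True
```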

       Back to hiding latency --> Solution #2...
       First let us note 2 related notions:
       A) Latency: Time from initiating a request until the response is
          received.
       B) Bandwidth: Amount of info you get per second.
       Memory has a big latency. You can hide this by using bandwidth:
       on a memory read, instead of returning just the next byte, return
       512 bytes (a sector) or even 16 sectors.
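
       A rough worked example of why a bigger transfer hides latency.
       The latency and bandwidth figures below are made-up round numbers,
       not measurements; only the 512-byte sector size and the 16-sector
       case come from the notes.

```python
LATENCY_NS = 100.0             # ASSUMED: delay before the first byte
BANDWIDTH_BYTES_PER_NS = 1.0   # ASSUMED: transfer rate once data flows


def transfer_time_ns(num_bytes: int) -> float:
    """Total time for one request: pay latency once, then stream."""
    return LATENCY_NS + num_bytes / BANDWIDTH_BYTES_PER_NS


for size in (1, 512, 16 * 512):
    total = transfer_time_ns(size)
    print(size, total, total / size)   # bytes, total ns, ns per byte
```

       With these numbers a 1-byte read costs 101 ns for that byte, while
       a 512-byte read costs 612 ns in total, so the latency is spread
       over many bytes.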

       Cache is the answer to #2. (L1, L2, L3)

       Registers are sometimes wrongly thought of as another level in the
       memory hierarchy. Registers are managed by the programmer; cache is
       not visible to the programmer, and is only managed by us under
       special circumstances.

       3-address machine.
       --> Specify both the source registers and the destination register.
           --> ADD r0, r1, r2
           On the 2-address machine the same thing takes 2 instructions:
               --> MOV r1, r2
               --> ADD r0, r2
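
       A small sketch checking that the 3-address ADD and the 2-address
       MOV + ADD pair leave the registers in the same state. It assumes
       the convention used earlier in these notes (the result goes in
       the last register named); the helper names are invented.

```python
def add3(regs, src1, src2, dst):
    """3-address: ADD src1, src2, dst."""
    regs[dst] = regs[src1] + regs[src2]


def mov2(regs, src, dst):
    """2-address: MOV src, dst."""
    regs[dst] = regs[src]


def add2(regs, src, dst):
    """2-address: ADD src, dst (result in dst)."""
    regs[dst] = regs[src] + regs[dst]


three_addr = [5, 7, 0, 0, 0, 0, 0, 0]   # r0..r7
two_addr = list(three_addr)

add3(three_addr, 0, 1, 2)   # ADD r0, r1, r2
mov2(two_addr, 1, 2)        # MOV r1, r2
add2(two_addr, 0, 2)        # ADD r0, r2

print(three_addr[2], two_addr[2])   # 12 12
```

       Note that the 2-instruction form only matches when the destination
       is not also the first source (r0 here), since the MOV would
       overwrite it before the ADD reads it.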

      0-address machine, AKA stack machine.
      --> Both sources and destination are implicit.
          --> A stack machine solves the problem of not knowing how
              many registers to have by just having a stack. ALU inputs are
              always the 2 top values on the stack: Pop(val_1, val_2) and
              Push(new_val). Pretty fast.

      /* Stack machine sample code */
      /* implement A = B*C+D*E */

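     LOAD B   --> push B
     LOAD C   --> push C
     MUL      --> pop 2, push B*C
     LOAD D   --> push D
     LOAD E   --> push E
     MUL      --> pop 2, push D*E
     ADD      --> pop 2, push B*C+D*E
     STORE A  --> pop into A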
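
     A minimal interpreter sketch for a stack machine of this kind. A
     complete sequence for A = B*C+D*E needs a second MUL and an ADD
     before the STORE, so the program list below includes them. The
     function name and the sample values of B, C, D, E are assumptions.

```python
def run_stack(program, variables):
    """Run 0-address code; operands come from the top of the stack."""
    stack = []
    for instruction in program:
        op, *args = instruction.split()
        if op == "LOAD":
            stack.append(variables[args[0]])     # push
        elif op == "STORE":
            variables[args[0]] = stack.pop()     # pop into memory
        elif op in ("ADD", "MUL"):
            right = stack.pop()                  # Pop(val_1, val_2)
            left = stack.pop()
            stack.append(left + right if op == "ADD" else left * right)
        else:
            raise ValueError(f"unknown opcode {op!r}")
    return variables


program = ["LOAD B", "LOAD C", "MUL",
           "LOAD D", "LOAD E", "MUL",
           "ADD", "STORE A"]
variables = {"B": 2, "C": 3, "D": 4, "E": 5}   # sample values

run_stack(program, variables)
print(variables["A"])   # 26
```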

    Virtual Memory --> Treating memory as a cache for disk
                       (MSRM) --> MASS STORAGE ROTATING MEDIA.
    Virtual memory --> Virtual-to-physical address translation.
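
    One common way to do the translation is a page table. The sketch
    below is an illustration under assumptions, not something taken from
    the lecture: the 4096-byte page size, the table contents, and the
    names translate and PageFault are all hypothetical.

```python
PAGE_SIZE = 4096   # ASSUMED page size


class PageFault(Exception):
    """The page is not in memory and must be brought in from disk."""


def translate(page_table, virtual_address):
    """Map a virtual address to a physical one via the page table."""
    page, offset = divmod(virtual_address, PAGE_SIZE)
    if page not in page_table:
        raise PageFault(page)
    return page_table[page] * PAGE_SIZE + offset


# Hypothetical mapping: virtual page -> physical frame
page_table = {0: 5, 1: 2}

print(translate(page_table, 4100))   # 8196 (page 1, offset 4 -> frame 2)
```

    A missing entry plays the role of a cache miss: memory is acting as
    a cache for the disk, so the page has to be fetched from disk first.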

    Branch Delay Slot
    ADD r0, r1
    BRZ  L31
    SUB r3, r4
    |-->BDS: The branch does not take effect until the instruction
             following the branch is executed; then it JUMPs to the
             label and does what is specified there.

    L31: ( a label to branch to)
        (do some things ..)
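
    A small simulation sketch of the delay slot. Several things here are
    assumptions: BRZ is taken to mean "branch if the most recent result
    is zero", SUB r3, r4 is taken to mean r4 = r4 - r3, and the two
    instructions after the SUB are placeholders invented so there is
    something to skip and something at L31.

```python
def run(program, labels, regs):
    """Return the list of instruction indices in execution order."""
    pc, last, pending, trace = 0, None, None, []
    while pc < len(program):
        op, a, b = program[pc]
        trace.append(pc)
        next_pc = pc + 1
        if pending is not None:          # we are in the delay slot
            next_pc, pending = pending, None
        if op == "ADD":
            regs[b] = last = regs[a] + regs[b]
        elif op == "SUB":
            regs[b] = last = regs[b] - regs[a]
        elif op == "BRZ" and last == 0:
            pending = labels[a]          # jump AFTER the next instruction
        pc = next_pc
    return trace


program = [
    ("ADD", 0, 1),         # 0: ADD r0, r1
    ("BRZ", "L31", None),  # 1: BRZ L31
    ("SUB", 3, 4),         # 2: SUB r3, r4  (the delay slot)
    ("ADD", 5, 5),         # 3: placeholder, skipped if branch is taken
    ("ADD", 6, 6),         # 4: L31: placeholder
]
labels = {"L31": 4}

regs = [0, 0, 0, 2, 10, 0, 0, 0]   # r0 + r1 == 0, so the branch is taken
print(run(program, labels, regs))   # [0, 1, 2, 4]
print(regs[4])                      # 8: the SUB still executed
```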