





## **REVIEW: Single-instruction, multiple-data (SIMD)**

- Multimedia instructions exploit the fact that many registers, adders etc. are quite wide (32/64 bit), whereas most multimedia data types are narrow (e.g. 8 bit per color, 16 bit per audio sample per channel).
- 2-8 values can be stored per register and added in parallel.
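The packing idea can be sketched in software. The following is a minimal illustration, assuming four 8-bit lanes in one 32-bit word with wrap-around arithmetic; the function names are made up for this example and do not correspond to any particular instruction set.

```python
LANES = 4            # assumption: four 8-bit values per 32-bit register
LANE_BITS = 8
LANE_MASK = 0xFF


def pack(values):
    """Pack four 8-bit values into one 32-bit word, lane 0 in the low byte."""
    word = 0
    for i, v in enumerate(values):
        word |= (v & LANE_MASK) << (i * LANE_BITS)
    return word


def unpack(word):
    """Split a 32-bit word back into its four 8-bit lanes."""
    return [(word >> (i * LANE_BITS)) & LANE_MASK for i in range(LANES)]


def packed_add(a, b):
    """Lane-wise wrap-around add of two packed words.

    The low 7 bits of every lane are added with the ordinary wide adder;
    the top bit of each lane is then fixed up with XOR, so no carry can
    cross a lane boundary.
    """
    low = (a & 0x7F7F7F7F) + (b & 0x7F7F7F7F)
    return (low ^ ((a ^ b) & 0x80808080)) & 0xFFFFFFFF


if __name__ == "__main__":
    s = packed_add(pack([250, 10, 100, 255]), pack([10, 20, 100, 1]))
    print(unpack(s))
```

A real SIMD unit achieves the same effect in hardware by cutting the carry chain of the wide adder at the lane boundaries.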







## EPIC: TMS 320C6xx as an example

1 bit per instruction encodes the end of parallel execution.



| Cycle | Instructions |
|-------|--------------|
| 1     | A            |
| 2     | B C D        |
| 3     | E F G        |

Instructions B, C and D use disjoint functional units, cross paths and other data path resources. The same is also true for E, F and G.

Parallel execution cannot span several packets.
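How the per-instruction bit yields the schedule in the table can be sketched as follows. This assumes the convention that a set bit means "the next instruction executes in parallel with this one" and a cleared bit ends the group; the bit values in `fetch_packet` are assumptions chosen to reproduce the table, and the function name is hypothetical.

```python
def execute_packets(instructions):
    """Group (name, p_bit) pairs into sets of instructions issued together.

    p_bit == 1: the next instruction runs in parallel with this one.
    p_bit == 0: this instruction ends the current parallel group.
    """
    packets, current = [], []
    for name, p_bit in instructions:
        current.append(name)
        if p_bit == 0:
            packets.append(current)
            current = []
    if current:  # trailing instructions without a closing 0 bit
        packets.append(current)
    return packets


# Assumed bit values, chosen to match the table above.
fetch_packet = [("A", 0), ("B", 1), ("C", 1), ("D", 0),
                ("E", 1), ("F", 1), ("G", 0)]

if __name__ == "__main__":
    for cycle, group in enumerate(execute_packets(fetch_packet), start=1):
        print(cycle, " ".join(group))
```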

BF - ES - 7 -

## Partitioned register files

- Many memory ports are required to supply enough operands per cycle.
- Memories with many ports are expensive.
- Registers are partitioned into (typically 2) sets,

e.g. for TI C60x:




## **REVIEW: Branch delay penalty**



The execution of many instructions has been started before it is realized that a branch was required.

Nullifying those instructions would waste compute power.

- Executing those instructions is declared a feature, not a bug.
- How to fill all "delay slots" with useful instructions?
- Avoid branches wherever possible.


## Predicated execution: Implementing IF-statements "branch-free"

Conditional Instruction "[c] I" consists of:

- condition c
- instruction I

- c = true => I executed
- c = false => NOP


## Predicated execution: Implementing IF-statements "branch-free": TI C6x

Source:

```
if (c)
{ a = x + y;
  b = x + z;
}
else
{ a = x - y;
  b = x - z;
}
```

Conditional branch (max. 12 cycles):

```
    [c] B L1
        NOP 5
        B L2
        NOP 4
        SUB x,y,a
     || SUB x,z,b
L1:     ADD x,y,a
     || ADD x,z,b
L2:
```

Predicated execution (1 cycle):

```
    [c]  ADD x,y,a
 || [c]  ADD x,z,b
 || [!c] SUB x,y,a
 || [!c] SUB x,z,b
```
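The semantics of the predicated version can be modelled in a few lines. This is only a sketch under assumptions: registers are a dictionary, all four instructions issue in the same cycle, and every source is read before any destination is written. The tuple layout and the names `run_packet` and `PACKET` are invented for this illustration.

```python
def run_packet(packet, regs):
    """Execute one packet of predicated instructions issued in parallel.

    Each entry is (pred, expected, op, src1, src2, dst). The instruction
    takes effect only if bool(regs[pred]) == expected; otherwise it acts
    as a NOP. Results are written back after all sources were read.
    """
    ops = {"ADD": lambda u, v: u + v, "SUB": lambda u, v: u - v}
    results = {}
    for pred, expected, op, src1, src2, dst in packet:
        if bool(regs[pred]) == expected:
            results[dst] = ops[op](regs[src1], regs[src2])
    regs.update(results)
    return regs


# [c] ADD x,y,a || [c] ADD x,z,b || [!c] SUB x,y,a || [!c] SUB x,z,b
PACKET = [
    ("c", True, "ADD", "x", "y", "a"),
    ("c", True, "ADD", "x", "z", "b"),
    ("c", False, "SUB", "x", "y", "a"),
    ("c", False, "SUB", "x", "z", "b"),
]

if __name__ == "__main__":
    print(run_packet(PACKET, {"c": 1, "x": 10, "y": 3, "z": 4}))
    print(run_packet(PACKET, {"c": 0, "x": 10, "y": 3, "z": 4}))
```

Both outcomes of the condition take the same single issue slot, which is why no delay slots have to be filled.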



## **Reconfigurable Logic**

- Full custom chips may be too expensive, software too slow.
- Combine the speed of HW with the flexibility of SW
  - HW with programmable functions and interconnect.
  - Use of configurable hardware; common form: field programmable gate arrays (FPGAs)
- Applications: bit-oriented algorithms like
  - encryption,
  - fast "object recognition" (medical and military)
  - Adapting mobile phones to different standards.
- Popular devices from
  - XILINX (XILINX Virtex 6 are recent devices)
  - Actel, Altera and others












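## **Code-size efficiency**

Compression techniques (continued):

- 2nd instruction set, e.g. ARM Thumb instruction set: 16-bit Thumb instructions are dynamically decoded at run-time into 32-bit ARM instructions. Example for `ADD Rd #constant`:

| Format            | Encoding                                    |
|-------------------|---------------------------------------------|
| 16-bit Thumb instr. | `001 10 Rd Constant`                      |
| 32-bit ARM instr. | `1110 001 01001 0 Rd 0 Rd 0000 Constant`    |

  In the Thumb encoding, `001` is the major opcode and `10` the minor opcode. Source and destination are the same register (source = destination), and the constant is zero extended.

- Reduction to 65-70 % of original code size
- 130% of ARM performance with 8/16 bit memory [ARM, R. Gupta]
- 85% of ARM performance with 32-bit memory
- Same approach for LSI TinyRisc, ...
- Requires support by compiler, assembler etc.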
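The expansion of the 16-bit `ADD Rd #constant` encoding into its 32-bit counterpart can be written out bit by bit. The sketch below follows the two bit patterns given on the slide; the function name is hypothetical, and only this one instruction format is handled.

```python
def expand_thumb_add_imm(thumb):
    """Expand 16-bit Thumb 'ADD Rd, #constant' into the 32-bit ARM word.

    Thumb layout: 001 | 10 | Rd (3 bits) | constant (8 bits)
    ARM layout:   1110 | 001 | 01001 | 0 Rd | 0 Rd | 0000 | constant
    """
    if (thumb >> 11) != 0b00110:  # major opcode 001, minor opcode 10
        raise ValueError("not a Thumb ADD Rd, #constant instruction")
    rd = (thumb >> 8) & 0x7       # 3-bit register number
    constant = thumb & 0xFF       # 8-bit constant, zero extended
    return ((0b1110 << 28) | (0b001 << 25) | (0b01001 << 20)
            | (rd << 16)          # source register: 0 Rd
            | (rd << 12)          # destination register: 0 Rd
            | constant)


if __name__ == "__main__":
    # 0x3105 encodes ADD r1, #5 in the 16-bit format.
    print(hex(expand_thumb_add_imm(0x3105)))
```

In hardware this mapping is pure wiring plus a few constant bits, which is why decoding at run-time is cheap.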















## **Communication requirements**

- Real-time behavior
- Efficient, economical (e.g. centralized power supply)
- Appropriate bandwidth and communication delay
- Robustness
- Fault tolerance
- Diagnosability
- Maintainability
- Security
- Safety


## Basic techniques: Electrical robustness

Single-ended vs. differential signals



Voltage at input of Op-Amp positive  $\rightarrow$  '1'; otherwise  $\rightarrow$  '0'



Combined with twisted pairs: most noise is added to both wires.
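Why noise that hits both wires is harmless can be shown with a small numeric sketch. The signal amplitude of ±1.0, the noise samples, and the single-ended threshold of 0 are assumptions made up for this illustration.

```python
def transmit(bit, amplitude=1.0):
    """Drive the two wires with opposite polarity (assumed +/- amplitude)."""
    v = amplitude if bit else -amplitude
    return v, -v


def receive_differential(v_plus, v_minus):
    """The receiver sees v_plus - v_minus; a positive difference means '1'."""
    return 1 if (v_plus - v_minus) > 0 else 0


def receive_single_ended(v_signal, threshold=0.0):
    """Single-ended receiver: compare one wire against a fixed threshold."""
    return 1 if v_signal > threshold else 0


def decode(bits, noise):
    """Send bits over a noisy line; the same noise sample hits both wires."""
    differential, single_ended = [], []
    for bit, n in zip(bits, noise):
        v_plus, v_minus = transmit(bit)
        differential.append(receive_differential(v_plus + n, v_minus + n))
        single_ended.append(receive_single_ended(v_plus + n))
    return differential, single_ended


if __name__ == "__main__":
    bits = [1, 0, 1, 1, 0]
    noise = [2.5, -3.0, 0.4, -1.8, 3.2]  # assumed common-mode noise samples
    print(decode(bits, noise))
```

The subtraction cancels the common noise term exactly in this model, whereas the single-ended receiver misreads every bit whose noise sample exceeds the signal amplitude in the opposite direction.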


## **Evaluation**

### Advantages:

- Subtraction removes most of the noise
- Changes of voltage levels have no effect
- Reduced importance of ground wiring
- Higher speed

### Disadvantages:

- Requires negative voltages
- Increased number of wires and connectors

### Applications:

- USB, FireWire, ISDN
- Ethernet (STP/UTP CAT 5/6 cables)
- differential SCSI
- High-quality analog audio signals (XLR)


## Priority-based arbitration of communication media

For example, consider a bus



- Bus arbitration (allocation) is frequently priority-based
- Communication delay depends on communication traffic of other partners
- No tight real-time guarantees, except for highest priority partner
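The effect of the other partners' traffic on the delay can be seen in a toy simulation. It assumes that all requests are pending at time 0, that each transfer occupies the bus for one cycle, and that a larger number means a higher priority; the partner names are invented.

```python
def completion_times(priorities, pending):
    """Simulate priority-based arbitration, one transfer per cycle.

    priorities: partner -> priority (larger wins, an assumption)
    pending:    partner -> number of transfers waiting at time 0
    Returns partner -> cycle in which its last transfer completes.
    """
    pending = dict(pending)
    done = {}
    cycle = 0
    while any(pending.values()):
        requesters = {p: priorities[p] for p, n in pending.items() if n > 0}
        winner = max(requesters, key=requesters.get)  # highest priority wins
        cycle += 1
        pending[winner] -= 1
        if pending[winner] == 0:
            done[winner] = cycle
    return done


if __name__ == "__main__":
    prio = {"cpu": 3, "dma": 2, "io": 1}
    print(completion_times(prio, {"cpu": 2, "dma": 1, "io": 1}))
    print(completion_times(prio, {"cpu": 5, "dma": 1, "io": 1}))
```

Only the completion time of the highest-priority partner is independent of the others; the low-priority partner finishes later whenever anyone above it has more traffic.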


### **Ethernet**

- Carrier-sense multiple-access/collision-detection (CSMA/CD, Standard Ethernet): no guaranteed response time.
- Alternatives:
  - token rings, token busses
  - Carrier-sense multiple-access/collision-avoidance (CSMA/CA)
    - WLAN techniques with request preceding transmission
    - Each partner gets an ID (priority). After bus transfer: partners try setting their ID on the bus; partners detecting a higher ID disconnect themselves. Highest priority partner gets guaranteed response time; others only if they are given a chance.
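The ID-based scheme can be sketched as a bit-serial arbitration. The model below assumes a wired-OR bus on which the IDs are sent most significant bit first, so that a partner sending 0 while observing 1 knows that a higher ID is present; the slide does not specify these details, and the function name is hypothetical.

```python
def id_arbitration(ids, width=4):
    """Return the ID that wins arbitration on an assumed wired-OR bus.

    All contenders put their ID on the bus bit by bit, MSB first.
    A partner that sends 0 but reads 1 detects a higher ID and disconnects.
    IDs are assumed to be unique and to fit into `width` bits.
    """
    if not ids:
        raise ValueError("no partner wants the bus")
    active = set(ids)
    for bit in reversed(range(width)):
        bus = any((i >> bit) & 1 for i in active)  # wired-OR of driven bits
        if bus:
            # everyone who sent 0 in this position withdraws
            active = {i for i in active if (i >> bit) & 1}
    (winner,) = active
    return winner


if __name__ == "__main__":
    print(id_arbitration([5, 9, 12, 3]))
```

No transfer is destroyed by the contention: the winner continues undisturbed, which is what gives the highest ID its guaranteed response time.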


## Time division multiple access (TDMA) busses

 Each communication partner is assigned a fixed time slot. Example:



- Some waiting time
- Each slave transmits in its time slot
- TDMA resources have a deterministic timing behavior
- TDMA provides QoS guarantees in networks on chips
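The deterministic timing can be made concrete with a small model. It assumes equal slot lengths, a cyclic schedule in list order, integer time, and that a partner can only start transmitting at the beginning of its own slot; the function names are invented for this sketch.

```python
def slot_owner(t, partners, slot_len):
    """Partner that owns the bus at time t in a cyclic TDMA schedule."""
    return partners[(t // slot_len) % len(partners)]


def next_slot_start(t, partner, partners, slot_len):
    """Earliest time >= t at which the given partner's slot begins."""
    cycle_len = len(partners) * slot_len
    offset = partners.index(partner) * slot_len
    # whole cycles to skip so that the slot start is not before t
    k = max(0, -(-(t - offset) // cycle_len))  # ceiling division
    return offset + k * cycle_len


def worst_case_wait(partner, partners, slot_len):
    """Longest wait until the partner's next slot start, over one cycle."""
    cycle_len = len(partners) * slot_len
    return max(next_slot_start(t, partner, partners, slot_len) - t
               for t in range(cycle_len))


if __name__ == "__main__":
    partners = ["A", "B", "C"]
    print(slot_owner(25, partners, 10))
    print(worst_case_wait("B", partners, 10))
```

The bound depends only on the schedule, never on how much the other partners transmit, which is the property that makes TDMA attractive for real-time communication.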
