# **Embedded Systems**



#### REVIEW

### TMS320C6x Datapath



## **Overview XILINX FPGA**

#### REVIEW

- All Xilinx FPGAs contain the same basic resources
  - Slices grouped into Configurable Logic Blocks (CLBs)
    - Contain combinatorial logic and register resources
  - IOBs
    - Interface between the FPGA and the outside world
  - Programmable interconnect
  - Other resources
    - Memory
    - Multipliers
    - Global clock buffers
    - Boundary scan logic

CS - ES

### REVIEW Embedded Processors in FPGAs

#### Hard Core

- EP is a dedicated physical component of the chip separate from the programmable logic
- E.g. Xilinx Virtex families (PowerPC 405)





#### Soft Core

- EP is synthesized into the programmable logic on the chip
- E.g. Altera (NIOS), Xilinx (MicroBlaze)



# **Partial Reconfiguration**

Technology and Benefits



- Partial Reconfiguration enables:
  - System Flexibility
    - Perform more functions while maintaining communication links
  - Size and Cost Reduction
    - Time-multiplex the hardware
      to require a smaller FPGA
  - Power Reduction
    - Shut down power-hungry tasks when not needed



#### **Embedded System Hardware**



 Embedded system hardware is frequently used in a loop ("hardware in a loop"):



## **Communication** - Requirements -



- Real-time behavior
- Efficient, economical (e.g. centralized power supply)
- Appropriate bandwidth and communication delay
- Robustness
- Fault tolerance
- Maintainability
- Diagnosability



#### Memory



- For the memory, efficiency is again a concern:
  - speed (latency and throughput); predictable timing
  - energy efficiency
  - size
  - cost
  - other attributes (volatile vs. persistent, etc.)

#### **REVIEW**

#### Memory hierarchy



# Static Timing Analysis

producing the input to schedulability analysis





Schedulability analysis has assumed the knowledge of the execution time of tasks.

So, the problem to solve:

- Given
  - 1 a software task to produce some reaction,
  - 2 a hardware platform, on which to execute the software,
  - 3 a required reaction time, e.g. the period of the task.

Derive:

a reliable (and precise) upper bound on the execution times.

# **Timing Analysis**







- Architecture Synthesis
  - HW/SW Codesign
  - Power Aware Computing

3.2.2011 Lecture by Bernd Finkbeiner, Head of Reactive Systems Group at Saarland University (<u>http://react.cs.uni-sb.de/</u>)

## **Architecture Synthesis**

Design a hardware architecture that efficiently executes a given algorithm.

- tasks:
  - allocation (determine the necessary hardware resources)
  - scheduling (determine the timing of individual operations)
  - binding (determine relation between individual operations of the algorithm and hardware resources)

**Classification** of synthesis algorithms  $\rightarrow$ 

 Synthesis methods can often be applied *independently of granularity*



## **Synthesis in Temporal Domain**

- Scheduling and binding can be done in different orders or together
- Schedule:
  - Mapping of operations to time slots (cycles)
  - A scheduled sequencing graph is a labeled graph




## **Schedule in Spatial Domain**

- Resource sharing
  - More than one operation bound to same resource
  - Serialized operations



[©Gupta]



Source: Teich: Dig. HW/SW Systeme; Thiele ETHZ

## Models

- Sequence graph  $G_S = (V_S, E_S)$ where  $V_S$  denotes the operations of the algorithm and  $E_S$ the dependence relations.
- Resource graph  $G_R = (V_R, E_R)$ ,  $V_R = V_S \cup V_T$ where  $V_T$  denotes the resource types of the architecture and  $G_R$  is a bipartite graph. An edge  $(v_s, v_t) \in E_R$ represents the availability of a resource type  $v_t$  for an operation  $v_s$ .
- Cost function  $c: V_T \to \mathbb{Z}$
- Execution times  $w: E_R \to \mathbb{Z}^{\geq 0}$ are assigned to each edge  $(v_s, v_t) \in E_R$ and denote the execution time of operation  $v_s \in V_S$ on resource type  $v_t \in V_T$ .






## Scheduling

The latency L of a schedule is the time difference between start node  $v_0$  and end node  $v_n$ :  $L = \tau(v_n) - \tau(v_0)$ .



$$L = \tau(v_n) - \tau(v_0) = 4$$



#### **Binding**

Example ( $\alpha(r_1) = 4$ ,  $\alpha(r_2) = 2$ ):

- $\beta(v_1) = r_1,\ \gamma(v_1) = 1$
- $\beta(v_2) = r_1,\ \gamma(v_2) = 2$
- $\beta(v_3) = r_1,\ \gamma(v_3) = 2$



. . .

. . .



## As soon as possible (ASAP) scheduling

ASAP: All tasks are scheduled as early as possible

- Loop over (integer) time steps:
  - Compute the set of unscheduled tasks for which all predecessors have finished their computation
  - Schedule these tasks to start at the current time step.
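The loop above can be sketched in a few lines of Python. This is a minimal illustration, not the lecture's implementation: the function name `asap_schedule`, the dictionary-based graph encoding, and the four-task example graph are all assumptions, and the task graph is assumed to be acyclic.

```python
def asap_schedule(preds, delay):
    """Return {task: start step} with every task started as early as possible.

    preds maps each task to the list of its predecessors,
    delay maps each task to its execution time in steps.
    """
    start, finish = {}, {}
    step = 1
    while len(start) < len(preds):
        # Unscheduled tasks whose predecessors have all finished by this step.
        ready = [v for v in preds
                 if v not in start
                 and all(p in finish and finish[p] <= step for p in preds[v])]
        for v in ready:
            start[v] = step
            finish[v] = step + delay[v]
        step += 1
    return start


# Hypothetical task graph: v3 depends on v1 and v2, v4 depends on v3.
preds = {"v1": [], "v2": [], "v3": ["v1", "v2"], "v4": ["v3"]}
delay = {"v1": 2, "v2": 1, "v3": 1, "v4": 1}
asap = asap_schedule(preds, delay)
```

In this example `v3` cannot start before step 3 because `v1` occupies steps 1 and 2.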





## As late as possible (ALAP) scheduling

ALAP: All tasks are scheduled as late as possible

Start at last time step\*:



Schedule tasks with no successors and tasks for which all successors have already been scheduled.

<sup>\*</sup> Generate a list, starting at its end
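A backward pass of this kind might look as follows in Python. The sketch is illustrative only: `alap_schedule`, the parameter `latest_finish` (the step by which everything must be done), and the example graph are assumptions, and the graph is assumed to be acyclic.

```python
def alap_schedule(succs, delay, latest_finish):
    """Return {task: start step} with every task started as late as possible.

    succs maps each task to the list of its successors; all tasks must have
    finished by step latest_finish.
    """
    start = {}
    while len(start) < len(succs):
        for v in succs:
            # Skip tasks already placed or with a successor not yet placed.
            if v in start or any(s not in start for s in succs[v]):
                continue
            # Finish no later than the earliest-starting successor begins.
            deadline = min((start[s] for s in succs[v]), default=latest_finish)
            start[v] = deadline - delay[v]
    return start


# Hypothetical task graph: v3 depends on v1 and v2, v4 depends on v3.
succs = {"v1": ["v3"], "v2": ["v3"], "v3": ["v4"], "v4": []}
delay = {"v1": 2, "v2": 1, "v3": 1, "v4": 1}
alap = alap_schedule(succs, delay, latest_finish=5)
```

Here only `v2` moves relative to an as-early-as-possible placement: it can wait until step 2 because `v3` cannot start before step 3 anyway.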



## **Scheduling under Detailed Timing Constraints**

- Motivation
  - Interface design.
  - Control over operation start time.
- Constraints
  - Upper/lower bounds on start-time difference of any operation pair.
- Minimum timing constraints between two operations
  - An operation follows another by <u>at least a prescribed number of time steps</u>
- Maximum timing constraints between two operations
  - An operation follows another by <u>at most a prescribed number of time steps</u>

## **Scheduling under Detailed Timing Constraints**

- Example
  - Circuit reads data from a bus, performs computation, writes result back on the bus.
  - Bus interface constraint: data written three cycles after read.
  - Minimum and maximum constraint of 3 cycles between read and write operations.

### **Constraint graph model**

- Start from a sequencing graph
- Model delays as weights on edges
- Add forward edges for minimum constraints
- Add backward edges for maximum constraints



Add this edge for min constraint

#### Weighted Constraint Graph

In order to represent a *feasible schedule*, we have one edge corresponding to each precedence constraint with

$$d(v_i, v_j) = w(v_i)$$

where  $w(v_i)$  denotes the execution time of  $v_i$ .

- A consistent assignment of starting times $\tau(v_i)$ to all operations can be done by solving a single-source longest path problem.
- A possible algorithm (*Bellman-Ford*) has complexity $O(|V_C| \cdot |E_C|)$:

Iteratively set

$$\tau(v_j) := \max\{\tau(v_j),\ \tau(v_i) + d(v_i, v_j) : (v_i, v_j) \in E_C\}$$

for all $v_j \in V_C$, starting from $\tau(v_i) = -\infty$ for $v_i \in V_C \setminus \{v_0\}$ and $\tau(v_0) = 1$.
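The iteration can be sketched in Python as a Bellman-Ford style longest-path relaxation. The function name, the edge-list encoding, and the example graph are assumptions made for illustration; the graph is hypothetical and not necessarily the one drawn in the lecture. Backward edges for maximum constraints carry negative weights, and a positive cycle means the constraints cannot be met.

```python
import math


def longest_path_start_times(vertices, edges, source):
    """edges is a list of (vi, vj, d). Returns {vertex: start time},
    or None if the constraint graph contains a positive cycle."""
    tau = {v: -math.inf for v in vertices}
    tau[source] = 1
    for _ in range(len(vertices)):
        changed = False
        for vi, vj, d in edges:
            if tau[vi] + d > tau[vj]:
                tau[vj] = tau[vi] + d
                changed = True
        if not changed:
            return tau
    return None  # still changing after |V| passes: infeasible constraints


vertices = ["v0", "v1", "v2", "v3", "v4", "vn"]
edges = [
    ("v0", "v1", 0), ("v0", "v3", 0),   # source has zero delay
    ("v1", "v2", 2), ("v3", "v4", 2),   # multiplications, delay 2
    ("v2", "vn", 1), ("v4", "vn", 1),   # additions, delay 1
    ("v0", "v4", 4),                    # minimum constraint (forward edge)
    ("v2", "v1", -3),                   # maximum constraint (backward edge)
]
tau = longest_path_start_times(vertices, edges, "v0")

# Adding a maximum constraint of 3 between v0 and v4 contradicts the
# minimum constraint of 4 and creates a positive cycle.
infeasible = longest_path_start_times(vertices, edges + [("v4", "v0", -3)], "v0")
```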



#### **Solution - Constraint Graph Model**

MUL delay = 2, ADD delay = 1

| Vertex | Start time |
|--------|------------|
| $v_0$  | 1          |
| $v_1$  | 1          |
| $v_2$  | 3          |
| $v_3$  | 1          |
| $v_4$  | 5          |
| $v_n$  | 6          |



#### List scheduling: extension of ALAP/ASAP method

Preparation:

- Greedy strategy (does NOT guarantee optimum solution)
- Topological sort of task graph G=(V,E)
- Computation of priority of each task:

Possible priorities *u*:

- Number of successors
- Longest path
- **Mobility** =  $\tau$  (ALAP schedule)-  $\tau$  (ASAP schedule)
  - Defined for each operation
  - Zero mobility implies that an operation can be started only at one given time step
  - Mobility greater than 0 measures the span of the time interval in which an operation may start  $\rightarrow$  slack on the start time.
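As a small worked example of the mobility definition, the snippet below subtracts ASAP from ALAP start steps. The start steps are made-up values for a hypothetical four-task graph, not results from the lecture.

```python
# Hypothetical ASAP and ALAP start steps for four tasks.
asap = {"v1": 1, "v2": 1, "v3": 3, "v4": 4}
alap = {"v1": 1, "v2": 2, "v3": 3, "v4": 4}

# Mobility = tau(ALAP schedule) - tau(ASAP schedule), per operation.
mobility = {v: alap[v] - asap[v] for v in asap}

# Operations with zero mobility have exactly one possible start step.
critical = [v for v in mobility if mobility[v] == 0]
```

Only `v2` has slack here; the other three operations lie on the critical path.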


#### **Mobility as a priority function**

*Mobility* is not very precise



# Algorithm



may be repeated for different task/ processor classes
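A greedy list scheduler for a single resource class could be sketched as below. This is an assumption-laden illustration rather than the algorithm from the slide: the function name, the convention that a lower priority value means more urgent (as with mobility), and the example graph are all hypothetical. For several task or processor classes, the same loop would be repeated per class.

```python
def list_schedule(preds, delay, priority, num_units):
    """Greedy list scheduling on num_units identical units.

    A lower priority value means more urgent. Returns {task: start step}.
    """
    start, finish = {}, {}
    step = 1
    while len(start) < len(preds):
        # Units still occupied by tasks started in earlier steps.
        busy = sum(1 for v in start if start[v] <= step < finish[v])
        # Ready tasks, most urgent first.
        ready = sorted(
            (v for v in preds
             if v not in start
             and all(p in finish and finish[p] <= step for p in preds[v])),
            key=lambda v: priority[v])
        for v in ready[:max(0, num_units - busy)]:
            start[v] = step
            finish[v] = step + delay[v]
        step += 1
    return start


# Hypothetical graph: chain a -> d -> e plus independent tasks b and c.
preds = {"a": [], "b": [], "c": [], "d": ["a"], "e": ["d"]}
delay = {v: 1 for v in preds}
mobility = {"a": 0, "b": 2, "c": 2, "d": 0, "e": 0}
schedule = list_schedule(preds, delay, mobility, num_units=2)
```

With two units, the chain `a`, `d`, `e` is never delayed because its tasks always sort ahead of `b` and `c`.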

#### Example







#### List scheduling does NOT guarantee an optimum solution





#### **Integer linear programming models**

Ingredients:



Constraints:

$$\forall j \in J : \sum_{x_i \in X} b_{i,j} \, x_i \ge c_j \quad \text{with } b_{i,j}, c_j \in \mathbb{R} \qquad (2)$$

**Def**.: The problem of minimizing (1) subject to the constraints (2) is called an **integer linear programming (ILP) problem**. If all  $x_i$  are constrained to be either 0 or 1, the ILP problem is said to be a **0/1 integer linear programming problem**.
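To make the definition concrete, the sketch below solves a tiny 0/1 ILP by enumerating all assignments. It assumes that the cost function (1) has the linear form $C = \sum_i a_i x_i$; the function name, the instance, and the brute-force approach are illustrative choices, and real solvers use far more efficient methods.

```python
from itertools import product


def solve_01_ilp(cost, rows, rhs):
    """Minimize sum(cost[i] * x[i]) subject to
    sum(rows[j][i] * x[i]) >= rhs[j] for all j, with x[i] in {0, 1}."""
    best_x, best_cost = None, None
    for x in product((0, 1), repeat=len(cost)):
        feasible = all(sum(b * xi for b, xi in zip(row, x)) >= c
                       for row, c in zip(rows, rhs))
        if not feasible:
            continue
        value = sum(a * xi for a, xi in zip(cost, x))
        if best_cost is None or value < best_cost:
            best_x, best_cost = x, value
    return best_x, best_cost


# Illustrative instance: minimize 5*x1 + 6*x2 + 4*x3
# subject to x1 + x2 + x3 >= 2.
best_x, best_cost = solve_01_ilp([5, 6, 4], [[1, 1, 1]], [2])
```

The enumeration visits $2^n$ assignments, which is only practical for very small $n$.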

#### Example



# **Remarks on integer programming**

- Integer programming is NP-complete
- Running times depend exponentially on problem size, but problems of >1000 vars solvable with good solver (depending on the size and structure of the problem)

 ILP/LP models good starting point for modeling, even if heuristics are used in the end.

Solvers: lp\_solve (public), CPLEX (commercial), …

# **ILP Formulation of ML-RCS**

- Minimize latency given constraints on area or the resources (ML-RCS)
- Use binary decision variables
  - *i* = 0, 1, ..., *n*
  - $l = 1, 2, \dots, \overline{\lambda} + 1$, where $\overline{\lambda}$ is a given upper bound on latency
  - $x_{il} = 1$  if operation *i* starts at step *l*, 0 otherwise.
- Set of linear inequalities (constraints), and an objective function (min latency)

# **ILP Formulation of ML-RCS**

Observation

$$x_{il} = 0 \quad \text{for} \quad l < t_i^S \quad \text{and} \quad l > t_i^L$$
$$(t_i^S = \mathrm{ASAP}(v_i),\ t_i^L = \mathrm{ALAP}(v_i))$$

# **Start Time vs. Execution Time**

- For each operation  $v_i$ , only one start time
- If d<sub>i</sub>=1, then the following questions are the same:
  - Does operation v<sub>i</sub> start at step l?
  - Is operation v<sub>i</sub> running at step l?
- But if <u>d<sub>i</sub> > 1</u>, then the two questions should be formulated as:
  - Does operation v<sub>i</sub> start at step l?
    - Does  $x_{il} = 1$  hold?
  - Is operation v<sub>i</sub> running at step l?
    - Does the following hold?

$$\sum_{m=l-d_i+1}^{l} x_{im} = 1 \ ?$$

# **Operation** *v<sub>i</sub>* **Still Running at Step** *l* **?**

Is v<sub>9</sub> running at step 6?

- Is $x_{9,6} + x_{9,5} + x_{9,4} = 1$ ?



- Note:
  - Only one (if any) of the above three cases can happen
- To meet resource constraints, we have to ask the same question for <u>ALL steps</u> and <u>ALL operations</u> of that type

#### **Operation** *v<sub>i</sub>* **Still Running at Step** *l* **?**

Is v<sub>i</sub> running at step l?

- Is $x_{i,l} + x_{i,l-1} + \dots + x_{i,l-d_i+1} = 1$ ?
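The test above translates directly into a sum over the start indicators. In this sketch the function name and the sparse dictionary encoding of $x_{i,\cdot}$ are assumptions, and the operation with delay 3 starting at step 4 is a hypothetical example.

```python
def is_running(x_i, l, d_i):
    """x_i maps step -> 0/1 start indicator of one operation, d_i is its delay.

    The operation runs at step l iff it started at one of the steps
    l - d_i + 1, ..., l.
    """
    return sum(x_i.get(m, 0) for m in range(l - d_i + 1, l + 1)) == 1


# Hypothetical operation with delay 3 that starts at step 4,
# so it occupies steps 4, 5 and 6.
x9 = {4: 1}
running_at_6 = is_running(x9, 6, 3)
running_at_7 = is_running(x9, 7, 3)
running_at_3 = is_running(x9, 3, 3)
```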



# ILP Formulation of ML-RCS (cont.)

- Constraints:
  - Unique start times:

$$\sum_{l} x_{il} = 1, \quad i = 0, 1, \dots, n$$

  - Sequencing (dependency) relations must be satisfied:

$$t_i \ge t_j + d_j \quad \forall (v_j, v_i) \in E \Longrightarrow \sum_l l \cdot x_{il} \ge \sum_l l \cdot x_{jl} + d_j$$

  - Resource constraints:

$$\sum_{i:T(v_i)=k} \sum_{m=l-d_i+1}^{l} x_{im} \leq a_k, \quad k = 1, \dots, n_{res}, \quad l = 1, \dots, \overline{\lambda} + 1$$

- Objective:  $\min c^T t$ .
  - *t* =start times vector, *c* =cost weight
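One way to get a feel for these constraints is to check a candidate schedule against them. The sketch below does not solve the ILP; it only verifies the sequencing and resource constraints for given start times. The function name, argument layout, and the three-operation example are assumptions for illustration.

```python
def check_ml_rcs(t, d, edges, op_type, units, max_step):
    """t: start step per operation, d: delay per operation,
    edges: (vj, vi) pairs meaning vj must finish before vi starts,
    op_type: resource type per operation, units: available units a_k per type.
    """
    # Unique start times hold by construction: t has one step per operation.
    # Sequencing: t_i >= t_j + d_j for every edge (v_j, v_i).
    if any(t[vi] < t[vj] + d[vj] for vj, vi in edges):
        return False
    # Resources: at each step l, running operations of type k must not exceed a_k.
    for l in range(1, max_step + 1):
        for k, a_k in units.items():
            running = sum(1 for v in t
                          if op_type[v] == k and t[v] <= l < t[v] + d[v])
            if running > a_k:
                return False
    return True


# Hypothetical example: two multiplications feeding one addition,
# with one multiplier and one adder available.
d = {"m1": 2, "m2": 2, "a1": 1}
edges = [("m1", "a1"), ("m2", "a1")]
op_type = {"m1": "mul", "m2": "mul", "a1": "add"}
units = {"mul": 1, "add": 1}

ok = check_ml_rcs({"m1": 1, "m2": 3, "a1": 5}, d, edges, op_type, units, 6)
too_many_muls = check_ml_rcs({"m1": 1, "m2": 1, "a1": 3}, d, edges, op_type, units, 6)
too_early_add = check_ml_rcs({"m1": 1, "m2": 3, "a1": 4}, d, edges, op_type, units, 6)
```

The condition `t[v] <= l < t[v] + d[v]` is the start-time form of the inner sum $\sum_{m=l-d_i+1}^{l} x_{im}$.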


#### **ILP Example**

- Assume  $\overline{\lambda} = 4$
- First, perform ASAP and ALAP
  - (we can write the ILP without ASAP and ALAP, but using ASAP and ALAP will simplify the inequalities)



# **ILP Example: Unique Start Times Constraint**

Without using ASAP and ALAP values:

$$\begin{aligned} x_{1,1} + x_{1,2} + x_{1,3} + x_{1,4} &= 1 \\ x_{2,1} + x_{2,2} + x_{2,3} + x_{2,4} &= 1 \\ &\;\;\vdots \end{aligned}$$

Using ASAP and ALAP: every $x_{il}$ with $l < t_i^S$ or $l > t_i^L$ is fixed to 0 and drops out of these sums.



# **ILP Example: Dependency Constraints**

 Using ASAP and ALAP, the non-trivial inequalities are: (assuming unit delay for + and \*)



#### **ILP Example: Resource Constraints**


- Resource constraints (assuming 2 adders and 2 multipliers)
  - $x_{1,1} + x_{2,1} + x_{6,1} + x_{8,1} \le 2$
  - $x_{3,2} + x_{6,2} + x_{7,2} + x_{8,2} \le 2$



  - $x_{10,1} \le 2$
  - $x_{9,2} + x_{10,2} + x_{11,2} \le 2$
  - $x_{4,3} + x_{9,3} + x_{10,3} + x_{11,3} \le 2$
  - $x_{5,4} + x_{9,4} + x_{11,4} \le 2$
- Objective:  $\min\; x_{n,1} + 2x_{n,2} + 3x_{n,3} + 4x_{n,4}$




# (Time constrained) Force-directed scheduling

- Goal: balanced utilization of resources
- Based on spring model
- Originally proposed for high-level synthesis
- Force
  - Used as a priority function
  - Related to concurrency: sort operations for least force
  - Mechanical analogy: Force = constant × displacement
    - Constant = operation-type distribution
    - Displacement = change in probability

[Pierre G. Paulin, J.P. Knight, Force-directed scheduling in automatic data path synthesis, Design Automation Conference (DAC), 1987, pp. 195-202]



# **Force-Directed Scheduling**

The Force-Directed Scheduling approach reduces the amount of:

- Functional Units
- Registers
- Interconnect

This is achieved by balancing the concurrency of operations to ensure a high utilization of each unit.

#### Next: computation of "forces"

- Direct forces push each task into the direction of lower values of *D*(*i*).
- Impact of direct forces on dependent tasks taken into account by indirect forces
- Balanced resource usage ≈ smallest forces
- For our simple example and time constraint=6: result = ALAP schedule





#### 1. Compute time frames *R*(*j*)

#### 2. Compute "probability" *P*(*j*,*i*) of assignment $j \rightarrow i$



#### 3. Compute "distribution" *D*(*i*) (# Operations in control step *i*)
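The three steps can be sketched together in Python. The sketch assumes that the time frame *R*(*j*) spans the ASAP to ALAP start steps of operation *j* and that *P*(*j*,*i*) is uniform over that frame; both are common choices but are assumptions here, as are the function name and the example values.

```python
def distribution(asap, alap, num_steps):
    """Return D as {step: expected number of operations in that step}."""
    D = {i: 0.0 for i in range(1, num_steps + 1)}
    for j in asap:
        frame = range(asap[j], alap[j] + 1)   # time frame R(j)
        for i in frame:
            D[i] += 1.0 / len(frame)          # P(j, i) = 1 / |R(j)|
    return D


# Hypothetical ASAP/ALAP start steps for three operations of one type.
asap = {"a": 1, "b": 1, "c": 2}
alap = {"a": 1, "b": 2, "c": 3}
D = distribution(asap, alap, num_steps=3)
```

The values of `D` sum to the number of operations, since each operation contributes a total probability of 1.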



#### Example



#### **Scheduling – An example**

#### Step 3 : Calculate the *force* (a new metric)

A metric called *force* is introduced. The force is used to optimize the utilization of units. A high positive force value indicates a poor utilization.

$$Force(\mathbf{j}) = DG(\mathbf{j}) - \sum_{i=t}^{b} \frac{DG(i)}{(b-t+1)}$$
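Read literally, the formula compares the distribution value at the candidate step *j* with the average over the operation's time frame from *t* to *b*. The snippet below evaluates it for made-up numbers; the function name and the distribution values are hypothetical.

```python
def force(DG, j, t, b):
    """Force of placing an operation with time frame [t, b] into control step j."""
    average = sum(DG[i] for i in range(t, b + 1)) / (b - t + 1)
    return DG[j] - average


# Hypothetical distribution graph over three control steps.
DG = {1: 1.5, 2: 1.0, 3: 0.5}

# An operation whose time frame is steps 2..3:
force_step_2 = force(DG, 2, 2, 3)
force_step_3 = force(DG, 3, 2, 3)
```

Step 3 has the lower force in this example, so placing the operation there moves it toward the less crowded control step.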

#### **Scheduling – An example**





# **Scheduling – An example**

By repeatedly assigning operations to various control steps and calculating the force associated with each choice, several force values become available.

The Force-directed scheduling algorithm chooses the assignment with the <u>lowest force value</u>, which also <u>balances the concurrency</u> of operations most efficiently.

# **Overall approach**

    procedure forceDirectedScheduling;
    begin
      AsapScheduling;
      AlapScheduling;
      while not all tasks scheduled do
      begin
        select task T with smallest total force;
        schedule task T at time step minimizing forces;
        recompute forces;
      end;
    end

- May be repeated for different task/processor classes
- Not sufficient for today's complex, heterogeneous hardware platforms

#### **Force-Directed Scheduling**

The Force-Directed Scheduling approach reduces the amount of:

- Functional Units
- Registers
- Interconnect

By introducing Registers and Interconnect as storage operations, the force is calculated for these as well.

#### **Force-Directed Scheduling**



- Architecture Synthesis
- HW/SW Codesign
  - Power Aware Computing
  - 3.2.2011 Lecture by Bernd Finkbeiner, Head of Reactive Systems Group at Saarland University (<u>http://react.cs.uni-sb.de/</u>)

#### **Codesign Definition and Key Concepts**

- Codesign
  - The meeting of system-level objectives by exploiting the trade-offs between hardware and software in a system through their concurrent design
- Key concepts
  - Concurrent: hardware and software developed at the same time on parallel paths
  - Integrated: interaction between hardware and software development to produce a design meeting performance criteria and functional specs

#### **Typical Codesign Process**

