

leti Ceatech





# **ENABLING ARTIFICIAL INTELLIGENCE TECHNOLOGIES**

# Entering the era of human-machine collaboration





# ENABLED BY ARTIFICIAL INTELLIGENCE (AND DEEP LEARNING)

- Artificial Intelligence is changing man-machine interaction: natural interfaces, "intelligent" behavior
  - Image and situation understanding
  - Voice recognition and synthesis
  - Data analysis
  - Decision making







## **COMPUTING DISTRIBUTION FOR "COGNITIVE" SYSTEMS**






# **EMBEDDED INTELLIGENCE NEEDS LOCAL HIGH-END COMPUTING**

[Illustration text: "Should I brake?" / "Transmission error, please retry later."]

The system should be autonomous so that it can make good decisions in all conditions.

Safety requires that basic autonomous functions do not rely on being "always connected" or "always available".



# **EMBEDDED INTELLIGENCE NEEDS LOCAL HIGH-END COMPUTING**



# Bandwidth limits will require more local processing



# **EMBEDDED INTELLIGENCE NEEDS LOCAL HIGH-END COMPUTING**



Example: detecting elderly people falling in their home

# *Privacy* requires that some processing be done locally rather than sent to the cloud.

# **CURRENT CONTRIBUTIONS OF LETI TO DEEP LEARNING**

1) Provide tools and IPs (Hardware and software) for *fast and efficient development* of deep-learning techniques (mainly inference and data fusion) *at the edge* under constraints of:

- Performance
- Speed
- Cost
- Power consumption
- Choice of hardware
- Size

2) Provide *innovative technologies* for tomorrow's unsupervised learning systems



#### N2D2: NEURAL NETWORK DESIGN & DEPLOYMENT




#### FAST AND ACCURATE DNN EXPLORATION



Leti Innovation Days | June 28-29, 2017 | 11





#### **ΣFUSION: SENSOR DATA PREPROCESSING AND FUSION**



# ΣFusion technology:

- Bayesian Fusion with only integer arithmetic
- Compatible with ASIL-D processors
- Real-time performance on a microcontroller (Cortex-M7 at 200 MHz)
- **Fully certifiable** (deterministic and predictable)
- **Power efficiency** increased by a factor of **100**
- Suitable for multi-modal sensor fusion



- Perception for autonomous vehicles
  - 2x Velodyne VLP 16 Lidars
    - 600000 data points per sec
    - 16 Mbits/sec data bandwidth (Ethernet)
  - 1x Tara stereo system
    - Disparity map computed on a Nvidia TX1
    - ~150000 data points per sec
    - ~4 Mbits/sec
  - Environment model
    - 272x480 cells (130560 cells)
    - computed in real-time (40 ms)
    - Spatial accuracy of ~4cm
    - Scanning horizon ~8m
  - Hardware used:
    - STM32F7 @200 MHz
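
The ΣFusion bullets above mention Bayesian fusion using only integer arithmetic on a 272x480-cell environment model. The slides do not describe the algorithm itself, so what follows is only a generic sketch of one well-known way to perform Bayesian occupancy updates with integers: each cell stores its belief as a fixed-point log-odds value, so fusing an independent measurement becomes one saturating integer addition. The constants (`SCALE`, `L_HIT`, `L_MISS`, the clamp bounds) and the function names are illustrative assumptions, not ΣFusion parameters.

```python
# Generic fixed-point log-odds occupancy grid (illustrative; NOT the ΣFusion algorithm).
ROWS, COLS = 272, 480          # grid size quoted on the slide (130560 cells)
SCALE = 256                    # assumed fixed-point scale: 1.0 log-odds unit == 256
L_HIT = 217                    # assumed: round(ln(0.7 / 0.3) * SCALE), "occupied" reading
L_MISS = -217                  # assumed: "free" reading
L_MIN, L_MAX = -2048, 2048     # assumed saturation bounds so cells stay responsive


def new_grid():
    """All cells start at log-odds 0, i.e. probability 0.5 (unknown)."""
    return [[0] * COLS for _ in range(ROWS)]


def fuse(grid, row, col, occupied):
    """Bayesian update of one cell: add the measurement's log-odds, then clamp."""
    value = grid[row][col] + (L_HIT if occupied else L_MISS)
    grid[row][col] = max(L_MIN, min(L_MAX, value))
    return grid[row][col]


grid = new_grid()
# Two sensors (for example lidar and stereo) report an obstacle in the same cell,
# and one sensor reports a different cell as free.
fuse(grid, 10, 20, occupied=True)
fuse(grid, 10, 20, occupied=True)
fuse(grid, 50, 60, occupied=False)
```

Because every step is an integer add, compare and store, the worst-case execution time of an update is easy to bound, which is the kind of property the "deterministic and predictable" bullet refers to.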

# See the stand on the right

- Obstacle detection
- Vision glare with external light source

#### Featured in EEtimes, Embedded Computing, Eenews, ...

- http://www.eenews.net/stories/1060048190
- http://embedded-computing.com/31082-ces-2017-leti-the-biggest-little-organization-you-never-should-have-heard-of/
- http://www.eetimes.com/document.asp?doc\_id=1331148&page\_number=7






#### **APPLICATION: DEFECT DETECTION**

- Defect identification on metal after rolling
  - Constraints:
    - Real-time with extremely high throughput
    - Tiny, low-contrast defects
  - Solutions:
    - Database labeling and pre-processing
    - Fast NN topology exploration
    - Performance vs. complexity analysis

#### 1) Defect labeling and visualization







#### 2) NN Exploration and benchmarking

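[Figure: benchmark of 12 explored NN topologies. The axis labels list a parameter taking the values 40 or 60, kernel sizes of 3x3, 3x5 and 5x5, and a parameter taking the values 8, 16 or 32; the parameter names are not recoverable.]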





From scratch exploration (database and NN construction) to industrial application

→ 50,000-MAC NN synthesized on FPGA: 100 cycles per inference @ 100 MHz (500 MACs/cycle)
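
The figures quoted above can be cross-checked with a few lines of arithmetic. The derived latency and rates use only the quoted numbers (50,000 MACs, 100 cycles, 100 MHz); they assume that one inference occupies the datapath for the full 100 cycles with no overlap between inferences, which the slide does not state.

```python
macs_per_inference = 50_000
cycles_per_inference = 100
clock_hz = 100_000_000  # 100 MHz

# Parallelism needed to finish 50,000 MACs in 100 cycles.
macs_per_cycle = macs_per_inference // cycles_per_inference

# Derived quantities (assumption: inferences run back to back, no pipelining overlap).
latency_s = cycles_per_inference / clock_hz
inferences_per_s = clock_hz // cycles_per_inference
macs_per_s = macs_per_cycle * clock_hz

print(macs_per_cycle, latency_s, inferences_per_s, macs_per_s)
```

Under that assumption the design would sustain 50 GMAC/s with a latency of one microsecond per inference.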



#### **APPLICATION: REAL-TIME FACE DETECTION WITH GENDER & EMOTION**





#### **APPLICATION: Q-LEARNING BASED SOC ENERGY MANAGEMENT**

Energy-saving reinforcement learning



- Dynamic software applications with performance constraints, e.g., throughput
- Standard Linux-based operating system

android

Multi/many core SoCs



Source: NXP i.MX6 Source: ST/CEA

- Q-learning energy manager
  - Gradually learns, on-line, the SoC operating points that respect the performance constraints while reducing energy consumption
  - No need to model the dynamics of the system
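
The idea of a Q-learning energy manager can be illustrated with a minimal tabular sketch. Everything in it is a toy assumption made for illustration: the three operating points, their energy costs, the workload levels, the penalty for missing the throughput constraint and the hyperparameters are hypothetical and are not taken from the CEA work. The agent only observes rewards; it is never given a model of how the system behaves.

```python
import random

# Hypothetical operating points: (throughput capacity, energy cost per step).
OPERATING_POINTS = [(1, 1.0), (2, 3.0), (3, 6.0)]
WORKLOADS = [1, 2, 3]                 # required throughput per step (toy states)
ALPHA, GAMMA, EPSILON = 0.05, 0.5, 0.1
PENALTY = 20.0                        # cost of violating the performance constraint
ACTIONS = range(len(OPERATING_POINTS))


def reward(workload, action):
    """Negative energy, with an extra penalty if the constraint is missed."""
    capacity, energy = OPERATING_POINTS[action]
    return -energy - (PENALTY if capacity < workload else 0.0)


def train(steps=50_000, seed=0):
    rng = random.Random(seed)
    q = {(w, a): 0.0 for w in WORKLOADS for a in ACTIONS}
    state = rng.choice(WORKLOADS)
    for _ in range(steps):
        # Epsilon-greedy: mostly exploit the best known operating point.
        if rng.random() < EPSILON:
            action = rng.randrange(len(OPERATING_POINTS))
        else:
            action = max(ACTIONS, key=lambda a: q[(state, a)])
        r = reward(state, action)
        next_state = rng.choice(WORKLOADS)   # toy workload, independent of the action
        best_next = max(q[(next_state, a)] for a in ACTIONS)
        # Standard Q-learning update; no model of the system dynamics is used.
        q[(state, action)] += ALPHA * (r + GAMMA * best_next - q[(state, action)])
        state = next_state
    return q


def policy(q):
    """Greedy operating point for each workload level."""
    return {w: max(ACTIONS, key=lambda a: q[(w, a)]) for w in WORKLOADS}


learned = policy(train())
print(learned)
```

In this toy setting the greedy policy should settle on the cheapest operating point that still meets each workload, which mirrors the slide's goal of respecting performance constraints while reducing energy.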



Up to 44% energy reduction relative to state-of-the-art (proportional-integral and non-linear) controllers



#### HARDWARE ACCELERATION FOR DEEP NEURAL NETWORKS

Dedicated computing IPs with high TOPS/Watt performance



PNeuro programmable

DNeuro dedicated IP / High level synthesis

3D stacked architectures



PNeuro Engine





Advanced architectures (Spike-based, mixed-signal, NVMs...)



#### PNEURO BENCHMARKING

- Benchmark application:
  - Face extraction on a database of 18,000 images
  - 60 neurons on the hidden layer, 450 Kops
  - Recognition rate 97%
- Optimized code for 5 architectures:
  - Embedded CPU: Quad Arm A7 & A15
  - Embedded GPU: NVidia Tegra K1
  - PNeuro Quad Neuro-Cores / DNeuro

| Target                   | Type         | Performance (images/s) | Energy efficiency (images/s/W) | Performance vs Tegra K1 | Energy efficiency vs Tegra K1 |
|--------------------------|--------------|------------------------|--------------------------------|-------------------------|-------------------------------|
| Quad ARM A7, 900 MHz     | Embedded CPU | 480                    | 380                            |                         |                               |
| Quad ARM A15, 2 GHz      | Embedded CPU | 870                    | 350                            |                         |                               |
| Tegra K1, 850 MHz        | Embedded GPU | 3 550                  | 600                            | reference               | reference                     |
| PNeuroV2 (FPGA), 100 MHz | CEA PNeuro   | 7 000                  | 2 800                          | x 2                     | x 4.5                         |
| PNeuroV2 (ASIC), 500 MHz | CEA PNeuro   | 25 000                 | 125 000                        | x 7                     | x 200                         |
| DNeuro (FPGA), 100 MHz   | CEA DNeuro   | 45 000                 | 18 000                         | x 12.5                  | x 30                          |
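
The comparison factors against the Tegra K1 can be recomputed directly from the performance and energy-efficiency figures in the table. The short script below does only that; the dictionary layout and variable names are illustrative choices.

```python
# Figures copied from the benchmark table: (images/s, images/s/W).
results = {
    "Quad ARM A7": (480, 380),
    "Quad ARM A15": (870, 350),
    "Tegra K1": (3_550, 600),
    "PNeuroV2 (FPGA)": (7_000, 2_800),
    "PNeuroV2 (ASIC)": (25_000, 125_000),
    "DNeuro (FPGA)": (45_000, 18_000),
}

ref_perf, ref_eff = results["Tegra K1"]

# Ratio of each target to the embedded GPU reference.
ratios = {
    name: (perf / ref_perf, eff / ref_eff)
    for name, (perf, eff) in results.items()
}

for name, (speedup, eff_gain) in ratios.items():
    print(f"{name:16s} performance x{speedup:6.2f}   efficiency x{eff_gain:7.2f}")
```

The computed ratios land close to the rounded factors shown on the slide (x 2, x 7, x 12.5 for performance and x 4.5, x 200, x 30 for energy efficiency).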




#### PNEURO OVERVIEW





# **3D STACKED RETINA WITH SPIKING NEURAL NETWORKS**

#### **RETINE:** image sensor + 3D stacked SIMD processors

- Image sensor: 70% fill factor, 12 μm pixel, >1000 fps
- SIMD processors: 3072 units, distributed memory, 11.7 MOPS/mW
- Feed SNN with Asynchronous Event Representation (AER) after pre-processing

[Figure: 3D stack diagram. Labels: lens, sensor layer, 130nm SOI, processor array die, neural layer 2, N1, N2, SNN chip, passive interposer or PCB, preprocessing, synchronous AER coding. Retine chip: ALTIS 130nm, Cu-Cu bonding.]

#### Pre-processing performances: (L1+L2 stacked retina)

|                                        | RETINE | ARM Cortex A9 + NEON | STxP70 |
|----------------------------------------|--------|----------------------|--------|
| Frequency (MHz)                        | 150    | 400                  | 350    |
| Performance (GOPS)                     | 72     | 0.67                 | 0.28   |
| Power consumption (W)                  | 4.8    | 0.25                 | 0.08   |
| Energy / frame (mJ)                    | 2.74   | 0.68                 | 5.6    |
| Energy efficiency (normalized, GOPS/W) | 45     | 2.68                 | 5.25   |

→ 100x the computing power, 10x the energy efficiency and 1/15 the processing latency of competing solutions



# **3D SPIKING NEURAL NETWORK**

[Figure: rate-based coding, with spiking frequency plotted against pixel brightness; axis markers f<sub>MIN</sub> and TMAX.]

# **NEMESIS 3D two-layer SNN test chip**

- 1<sup>st</sup> layer: 48 macro-block neurons, 1024 synapses per neuron (49 152 total)
- 2<sup>nd</sup> layer: 50 fully connected neurons, 2 400 synapses
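
The synapse totals follow from the layer sizes. The check below assumes that "fully connected" means each of the 50 second-layer neurons receives one synapse from each of the 48 first-layer macro-block neurons; the slide does not spell this out, but the assumption reproduces the quoted 2 400.

```python
layer1_neurons = 48
synapses_per_layer1_neuron = 1024
layer2_neurons = 50

# First layer: every macro-block neuron has its own 1024 synapses.
layer1_synapses = layer1_neurons * synapses_per_layer1_neuron

# Second layer (assumption): one synapse per (layer-2 neuron, layer-1 neuron) pair.
layer2_synapses = layer2_neurons * layer1_neurons

print(layer1_synapses, layer2_synapses)
```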





#### LEARNING FROM NEUROSCIENCE: AN STDP (SPIKE-TIMING-DEPENDENT PLASTICITY) PRIMER





## **PRINCIPLE: CROSSBARS OF MEMRISTORS**





#### **NVM SYNAPSE IMPLEMENTATIONS**

- 2-PCM synapses for unsupervised extraction of car trajectories
- CBRAM binary synapses for unsupervised classification of MNIST handwritten digits with stochastic learning

[Figure: traffic-lane visualization; two-layer network (1<sup>st</sup> layer, 2<sup>nd</sup> layer) with lateral inhibition.]



#### **PUTTING IT ALL TOGETHER: NEURAM3**

NEURAL COMPUTING ARCHITECTURES IN ADVANCED MONOLITHIC 3D-VLSI NANO-TECHNOLOGIES

- EU collaborative project (ICT)
- Objective
  - Fabricate a chip implementing a neuromorphic architecture with state-of-the-art machine learning and spike-based learning

#### Features

- Ultra low power, scalable and configurable NN architecture
- 50x lower power consumption than conventional digital solutions
- 3D FDSOI at 28nm integrating RRAM synaptic elements
- TFT device technology to interconnect multiple processor chips

#### Consortium



| Participant no. | Organization name                                                                                        | Short name | Country     |
|-----------------|----------------------------------------------------------------------------------------------------------|------------|-------------|
| 1 (Coordinator) | Commissariat à l'énergie atomique et aux énergies alternatives                                           | CEA        | France      |
| 2               | Interuniversitair Micro-Electronica Centrum IMEC VZW                                                     | IMEC       | Belgium     |
| 3               | Stichting IMEC Nederland                                                                                 | IMEC-NL    | Netherlands |
| 4               | IBM Research GmbH                                                                                        | IBM        | Switzerland |
| 5               | University of Zurich, Institute of Neuroinformatics                                                      | UZH        | Switzerland |
| 6               | Agencia Estatal Consejo Superior de Investigaciones Científicas, Instituto de Microelectrónica de Sevilla | CSIC       | Spain       |
| 7               | Consiglio Nazionale delle Ricerche                                                                       | CNR        | Italy       |
| 8               | Jacobs University Bremen                                                                                 | JAC        | Germany     |
| 9               | ST-Microelectronics S.A.                                                                                 | STM        | France      |

NeuRAM<sup>3</sup>



# **1<sup>ST</sup> DIGITAL CHIP EXPECTED FOR SUMMER 2017:**

|                                         | NeuRAM3 1<sup>st</sup> chip | IBM TrueNorth         |
|-----------------------------------------|-----------------------------|-----------------------|
| Technology                              | 28 nm FDSOI                 | 28 nm CMOS            |
| Supply voltage                          | 1 V                         | 0.7 V                 |
| Neuron type                             | Analog                      | Digital               |
| Neurons per core                        | 256                         | 256                   |
| Core area                               | 0.36 mm<sup>2</sup>         | 0.094 mm<sup>2</sup>  |
| Computation                             | Parallel processing         | Time multiplexing     |
| Fan in/out                              | 2k/8k                       | 256/256               |
| Synaptic operations per second per watt | 300 GSOPS/W <sup>*1</sup>   | 46 GSOPS/W            |
| Energy per synaptic event               | <2 pJ <sup>*2</sup>         | 10 pJ                 |
| Energy per spike                        | <0.375 nJ <sup>*3</sup>     | 3.9 nJ                |

\*1 At a 100 Hz mean firing rate, with 4 local-core destinations appended per spike, 400 k events are broadcast to 4 cores with 25% connectivity per event: 400 k × 1 k × 25% / 300 µW ≈ 300 GSOPS/W

\*2 With a 25% match in each core, energy per synaptic event = energy per broadcast / (256 × 25%) = 120 pJ / 64 ≈ 2 pJ

\*3 Energy per spike = total power consumption / number of spikes = 300 µW / 800 k = 0.375 nJ
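
The three footnotes are short calculations, reproduced below as a sanity check. The inputs are the numbers quoted in the footnotes; the variable names are illustrative, and the sketch assumes that the "400 k" and "800 k" counts are per second, as the use of a power figure in watts implies.

```python
power_w = 300e-6                 # 300 µW total power (footnotes 1 and 3)
connectivity = 0.25              # 25% connectivity / match per core

# Footnote 1: synaptic operations per second per watt.
events_per_s = 400e3             # broadcast events per second (assumed per-second)
fanout_per_event = 1e3           # the "1 k" factor in the footnote
sops = events_per_s * fanout_per_event * connectivity
gsops_per_w = sops / power_w / 1e9

# Footnote 2: energy per synaptic event.
energy_per_broadcast_j = 120e-12
energy_per_synaptic_event_j = energy_per_broadcast_j / (256 * connectivity)

# Footnote 3: energy per spike.
spikes_per_s = 800e3             # assumed per-second
energy_per_spike_j = power_w / spikes_per_s

print(gsops_per_w, energy_per_synaptic_event_j, energy_per_spike_j)
```

The exact results are about 333 GSOPS/W, 1.875 pJ and 0.375 nJ, which the table rounds to 300 GSOPS/W, <2 pJ and <0.375 nJ.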



- Identification of single neurons by the characteristics of their spike shapes
- Output neurons can classify spikes after a sufficient learning period
- Spike extraction enables decoding of brain activity (BCI)...
- ... opening new applications, such as brain-controlled prosthetics










Leti, technology research institute Commissariat à l'énergie atomique et aux énergies alternatives Minatec Campus | 17 rue des Martyrs | 38054 Grenoble Cedex | France www.leti-cea.com


