The NXP eIQ ML (edge intelligence machine learning) software environment provides tools to perform inference on embedded systems using neural network models. The software includes optimizations that leverage the hardware capabilities of the i.MX93 family for improved performance. Examples of applications that typically use neural network inference include object/pattern recognition, gesture control, voice processing, and sound monitoring.

eIQ includes support for the following inference engine:

  • TensorFlow Lite

Include eIQ packages in Digi Embedded Yocto

Edit your conf/local.conf file to include the eIQ package group in your Digi Embedded Yocto image:

conf/local.conf
IMAGE_INSTALL:append = " packagegroup-imx-ml"

This package group contains all of NXP’s eIQ packages compatible with the ConnectCore 93. Digi has extended packagegroup-imx-ml to customize the eiq-examples package to:

  1. Automatically download and convert the models to the Ethos-U NPU format using the vela tool.

  2. Autostart the dms application on boot (eiqdemo service).

  3. Set the default eiqdemo models to use the NPU.

Including this package group increases the size of the rootfs image significantly. To minimize the increase in image size, select a subset of its packages depending on your needs. See the package group’s recipe for more information on the packages it contains.
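
For example, a smaller image could install only the TensorFlow Lite runtime, the Ethos-U delegate, and the Vela compiler instead of the full package group. The following is a minimal sketch; the package names are illustrative assumptions, so verify them against the packagegroup-imx-ml recipe of your release:

conf/local.conf
IMAGE_INSTALL:append = " tensorflow-lite tflite-ethosu-delegate ethos-u-vela"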

NPU: Ethos-U software architecture

The software for Ethos-U support includes three main components:

  • Vela model compiler: an offline tool that compiles the TFLite model graph for the Ethos-U. The compiler replaces supported operators in the model with a custom "ethos-u" operator that contains the command stream for the Ethos-U NPU. The output of the compiler is a modified TFLite model graph for the TFLite/TFLite-Micro inference engines.

  • Cortex-A software stack for Linux: contains the MPU inference engine (TensorFlow Lite), the driver library, and the kernel-side device driver for the Linux kernel.

  • Cortex-M software stack: contains the MCU inference engine software (TFLite-Micro, CMSIS-NN) and the NPU driver.

The typical inference workflow is as follows:

  1. The Vela model compiler converts the TFLite model into a Vela model and generates the optimized version for Ethos-U NPU.

  2. The optimized model is fed into one of the following:

    1. The TFLite inference engine, which recognizes the custom "ethos-u" operator, allocates the buffers for the input/output feature maps (IFM/OFM), and executes the operator via the Ethos-U Linux driver.

    2. The Inference API, which allocates the buffers for the input/output feature maps and sends the entire model via the Ethos-U driver.

  3. The Ethos-U driver composes the inference task message and sends it to the Cortex-M over RPMsg.

  4. The Ethos-U Runner on the Cortex-M dispatches the task either to TFLite-Micro or directly to the Ethos-U driver, depending on the task type:

    1. If the task accelerates the "ethos-u" operator (the TFLite case), the runner calls the Ethos-U driver directly.

    2. If the task accelerates the entire model (the Inference API case), the runner dispatches the model to TFLite-Micro, which in turn calls the Ethos-U driver for processing.

  5. After the Ethos-U driver completes the inference task, it writes the result into the OFM buffer and sends the response back to the Cortex-A over RPMsg.

TensorFlow

TensorFlow support

TensorFlow Lite is a set of tools that enables on-device machine learning by helping developers run their models on mobile, embedded, and edge devices. TensorFlow Lite supports computation on the following hardware units:

  • CPU: Arm Cortex-A cores

  • NPU: hardware acceleration using the Ethos-U Delegate

Ethos-U Delegate is an external delegate on i.MX93 Linux platforms. It enables the inference to be accelerated via the on-chip hardware accelerator. Ethos-U Delegate directly uses the hardware accelerator driver (Ethos-U driver stack) to fully utilize the capabilities of the accelerator.

TensorFlow example with CPU

The following example shows how to use TensorFlow Lite by performing the inference on the CPU.

# cd /usr/bin/tensorflow-lite-2.12.1/examples
# ./label_image -i grace_hopper.bmp
INFO: Loaded model ./mobilenet_v1_1.0_224_quant.tflite
INFO: resolved reporter
INFO: invoked
INFO: average time: 142.956 ms
INFO: 0.764706: 653 military uniform
INFO: 0.121569: 907 Windsor tie
INFO: 0.0156863: 458 bow tie
INFO: 0.0117647: 466 bulletproof vest
INFO: 0.00784314: 835 suit

The output displays the time taken to process the sample, with an average time of 142.956 ms.

TensorFlow example with NPU

First, compile the existing model for the Ethos-U NPU using the Vela tool:

# vela /usr/bin/tensorflow-lite-2.12.1/examples/mobilenet_v1_1.0_224_quant.tflite \
     --output-dir /usr/bin/tensorflow-lite-2.12.1/examples/

Then run the example, specifying the converted model (-m option) and the NPU delegate library (--external_delegate_path option):

# cd /usr/bin/tensorflow-lite-2.12.1/examples
# ./label_image -i grace_hopper.bmp -m mobilenet_v1_1.0_224_quant_vela.tflite \
     --external_delegate_path=/usr/lib/libethosu_delegate.so
INFO: Loaded model mobilenet_v1_1.0_224_quant_vela.tflite
INFO: resolved reporter
INFO: Ethosu delegate: device_name set to /dev/ethosu0.
INFO: Ethosu delegate: cache_file_path set to .
INFO: Ethosu delegate: timeout set to 60000000000.
INFO: Ethosu delegate: enable_cycle_counter set to 0.
INFO: Ethosu delegate: enable_profiling set to 0.
INFO: Ethosu delegate: profiling_buffer_size set to 2048.
INFO: Ethosu delegate: pmu_event0 set to 0.
INFO: Ethosu delegate: pmu_event1 set to 0.
INFO: Ethosu delegate: pmu_event2 set to 0.
INFO: Ethosu delegate: pmu_event3 set to 0.
EXTERNAL delegate created.
INFO: EthosuDelegate: 1 nodes delegated out of 1 nodes with 1 partitions.
INFO: Applied EXTERNAL delegate.
INFO: invoked
INFO: average time: 3.842 ms
INFO: 0.780392: 653 military uniform
INFO: 0.105882: 907 Windsor tie
INFO: 0.0156863: 458 bow tie
INFO: 0.0117647: 466 bulletproof vest
INFO: 0.00784314: 835 suit

The output displays the time taken to process the sample, with an average time of 3.842 ms.
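
The same NPU-accelerated inference can also be run from a Python application using the TensorFlow Lite interpreter and the Ethos-U external delegate. The following is a minimal sketch, assuming the tflite_runtime and NumPy Python packages are available on the image:

# npu_inference.py: run the Vela-converted model on the NPU from Python
# (illustrative sketch; the model and delegate paths match the example above)
import numpy as np
from tflite_runtime.interpreter import Interpreter, load_delegate

MODEL = "/usr/bin/tensorflow-lite-2.12.1/examples/mobilenet_v1_1.0_224_quant_vela.tflite"
DELEGATE = "/usr/lib/libethosu_delegate.so"

# Load the Ethos-U external delegate so the custom "ethos-u" operator runs on the NPU
interpreter = Interpreter(model_path=MODEL,
                          experimental_delegates=[load_delegate(DELEGATE)])
interpreter.allocate_tensors()

input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

# Feed dummy data with the model's input shape and type (1x224x224x3 uint8);
# a real application would feed a preprocessed image instead
data = np.zeros(input_details[0]["shape"], dtype=input_details[0]["dtype"])
interpreter.set_tensor(input_details[0]["index"], data)
interpreter.invoke()

scores = interpreter.get_tensor(output_details[0]["index"])[0]
print("Top class index:", int(np.argmax(scores)))

The eiq-examples demos described later on this page follow a similar pattern when they load the Vela-converted models.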

TensorFlow NPU vs CPU performance

Make sure no other process is using the NPU or overloading the CPU. To stop the eiqdemo service, execute systemctl stop eiqdemo.

The TensorFlow Lite examples folder includes a benchmark tool that measures the inference time. If your model has been converted for the Ethos-U, the NPU is used automatically. To compare performance, use the same models as in the previous examples for both the CPU and the NPU. Begin by running the benchmark without Ethos-U support:

# cd /usr/bin/tensorflow-lite-2.12.1/examples
# time ./benchmark_model --graph=mobilenet_v1_1.0_224_quant.tflite --num_runs=1000 --num_threads=2
STARTING!
Log parameter values verbosely: [0]
Min num runs: [1000]
Num threads: [2]
Graph: [mobilenet_v1_1.0_224_quant.tflite]
#threads used for CPU inference: [2]
#threads used for CPU inference: [2]
Loaded model mobilenet_v1_1.0_224_quant.tflite
The input model file size (MB): 4.27635
Initialized session in 11.997ms.
Running benchmark for at least 1 iterations and at least 0.5 seconds but terminate if exceeding 150 seconds.
count=15 first=57921 curr=33998 min=33586 max=57921 avg=35436.3 std=6014

Running benchmark for at least 1000 iterations and at least 1 seconds but terminate if exceeding 150 seconds.
count=1000 first=33801 curr=33601 min=33422 max=47372 avg=33759.5 std=682

Inference timings in us: Init: 11997, First inference: 57921, Warmup (avg): 35436.3, Inference (avg): 33759.5
Note: as the benchmark tool itself affects memory footprint, the following is only APPROXIMATE to the actual memory footprint of the model at runtime. Take the information at your discretion.
Memory footprint delta from the start of the tool (MB): init=4.45703 overall=12.7148

real    0m34.424s
user    1m7.911s
sys     0m0.036s

The test results indicate that the benchmark took approximately 34.4 seconds to complete, with an average inference time of 33759.5 microseconds when using two CPU cores, and that CPU utilization remained close to 100%. To compare performance, repeat the test with the converted Vela model so the inference runs on the Ethos-U NPU:

# cd /usr/bin/tensorflow-lite-2.12.1/examples
# time ./benchmark_model --graph=mobilenet_v1_1.0_224_quant_vela.tflite --num_runs=1000 \
     --external_delegate_path=/usr/lib/libethosu_delegate.so
STARTING!
Log parameter values verbosely: [0]
Min num runs: [1000]
Num threads: [2]
Graph: [mobilenet_v1_1.0_224_quant_vela.tflite]
#threads used for CPU inference: [2]
#threads used for CPU inference: [2]
External delegate path: [/usr/lib/libethosu_delegate.so]
Loaded model mobilenet_v1_1.0_224_quant_vela.tflite
INFO: Ethosu delegate: device_name set to /dev/ethosu0.
INFO: Ethosu delegate: cache_file_path set to .
INFO: Ethosu delegate: timeout set to 60000000000.
INFO: Ethosu delegate: enable_cycle_counter set to 0.
INFO: Ethosu delegate: enable_profiling set to 0.
INFO: Ethosu delegate: profiling_buffer_size set to 2048.
INFO: Ethosu delegate: pmu_event0 set to 0.
INFO: Ethosu delegate: pmu_event1 set to 0.
INFO: Ethosu delegate: pmu_event2 set to 0.
INFO: Ethosu delegate: pmu_event3 set to 0.
EXTERNAL delegate created.
INFO: EthosuDelegate: 1 nodes delegated out of 1 nodes with 1 partitions.
Explicitly applied EXTERNAL delegate, and the model graph will be completely executed by the delegate.
The input model file size (MB): 3.35866
Initialized session in 20.346ms.
Running benchmark for at least 1 iterations and at least 0.5 seconds but terminate if exceeding 150 seconds.
count=130 first=3934 curr=3818 min=3805 max=3934 avg=3819.06 std=16

Running benchmark for at least 1000 iterations and at least 1 seconds but terminate if exceeding 150 seconds.
count=1000 first=3819 curr=3818 min=3803 max=3952 avg=3817.89 std=11

Inference timings in us: Init: 20346, First inference: 3934, Warmup (avg): 3819.06, Inference (avg): 3817.89
Note: as the benchmark tool itself affects memory footprint, the following is only APPROXIMATE to the actual memory footprint of the model at runtime. Take the information at your discretion.
Memory footprint delta from the start of the tool (MB): init=8.04297 overall=8.40625

real    0m4.417s
user    0m0.153s
sys     0m0.101s

The test results indicate that the benchmark took approximately 4.4 seconds to complete, with an average inference time of 3817.89 microseconds when using the NPU, and that CPU utilization remained at approximately 1%.

NXP eIQ examples

Overview

An image generated with packagegroup-imx-ml contains the eIQ demos provided by NXP in the eiq-examples package. In addition, Digi has customized the packagegroup-imx-ml recipe to provide ready-to-use converted models.

The eIQ examples available in the image are inside the /usr/bin/eiq-examples-git folder:

#  ls -l /usr/bin/eiq-examples-git/
drwxr-xr-x    2 root     root          4096 Mar  9  2018 dms
drwxr-xr-x    2 root     root          4096 Mar  9  2018 face_recognition
drwxr-xr-x    2 root     root          4096 Mar  9  2018 gesture_detection
drwxr-xr-x    2 root     root          4096 Mar  9  2018 models
drwxr-xr-x    2 root     root          4096 Mar  9  2018 object_detection
drwxr-xr-x    2 root     root          4096 Mar  9  2018 vela_models

That folder contains:

  • models and vela_models: two folders containing, respectively, the original models and the Vela-converted models to use with the NPU.

  • Demo directories: one folder per demo, each containing a Python script that runs it.

Requirements

  • Camera: use the OV5640 or a USB camera.

  • Qt environment: the demos run on top of Qt; they do not work on a core-image-base image.

Running an example (using the NPU)

By default, the eiqdemo service starts the dms demo script. The following scripts are available to launch the demos:

#  ls -l /etc/demos/scripts/
-rwxr-xr-x    1 root     root          1279 Mar  9  2018 launch_eiq_demo.sh
lrwxrwxrwx    1 root     root            18 Mar  9  2018 launch_eiq_demo_dms.sh -> launch_eiq_demo.sh
lrwxrwxrwx    1 root     root            18 Mar  9  2018 launch_eiq_demo_face_recognition.sh -> launch_eiq_demo.sh
lrwxrwxrwx    1 root     root            18 Mar  9  2018 launch_eiq_demo_gesture_detection.sh -> launch_eiq_demo.sh
lrwxrwxrwx    1 root     root            18 Mar  9  2018 launch_eiq_demo_object_detection.sh -> launch_eiq_demo.sh

To run any of these examples, first stop the eiqdemo service and then start the demo. For instance, to run the face recognition demo:

# systemctl stop eiqdemo
# /etc/demos/scripts/launch_eiq_demo_face_recognition.sh
INFO: Ethosu delegate: device_name set to /dev/ethosu0.
INFO: Ethosu delegate: cache_file_path set to .
INFO: Ethosu delegate: timeout set to 60000000000.
INFO: Ethosu delegate: enable_cycle_counter set to 0.
INFO: Ethosu delegate: enable_profiling set to 0.
INFO: Ethosu delegate: profiling_buffer_size set to 2048.
INFO: Ethosu delegate: pmu_event0 set to 0.
INFO: Ethosu delegate: pmu_event1 set to 0.
INFO: Ethosu delegate: pmu_event2 set to 0.
INFO: Ethosu delegate: pmu_event3 set to 0.
INFO: EthosuDelegate: 1 nodes delegated out of 1 nodes with 1 partitions.
INFO: EthosuDelegate: 1 nodes delegated out of 3 nodes with 1 partitions.
INFO: Created TensorFlow Lite XNNPACK delegate for CPU.
[ WARN:0@0.278] global cap_gstreamer.cpp:2784 handleMessage OpenCV | GStreamer warning: Embedded video playback halted; module source reported: Could not read from resource.
[ WARN:0@0.284] global cap_gstreamer.cpp:1679 open OpenCV | GStreamer warning: unable to start pipeline
[ WARN:0@0.284] global cap_gstreamer.cpp:1164 isPipelinePlaying OpenCV | GStreamer warning: GStreamer: pipeline have not been created

If you use the regular models for inference on the NPU instead of the ones converted with Vela, the demo automatically converts the model before running the inference, adding a significant delay to the demo’s execution time.

More information

See NXP’s i.MX Machine Learning User’s Guide for more information on eIQ.