Edge Impulse Brings Nvidia’s Tao Toolkit To TinyML Hardware
By Sally Ward-Foxton | 03.18.2024
Edge Impulse and Nvidia have collaborated to bring Nvidia’s Tao training toolkit to tiny hardware from other silicon vendors, including microcontrollers from NXP, STMicro, Alif and Renesas, with more hardware to follow. Embedded design teams can now easily train and optimize models on Nvidia GPUs in the cloud or on-premises using Tao, then deploy on embedded hardware using Edge Impulse.
“We realized AI is expanding into edge opportunities like IoT, where Nvidia doesn’t have silicon. So we said, why not?” Deepu Talla, VP and GM for robotics and edge computing at Nvidia, told EE Times. “We’re a platform, we have no problem enabling all of the ecosystem. [What we are able] to deploy on Nvidia Jetson, we can now deploy on a CPU, on an FPGA, on an accelerator—whatever custom accelerator you have—you can even deploy on a microcontroller.”
As part of the collaboration, Edge Impulse optimized almost 88 models from Nvidia’s model zoo for resource-constrained hardware at the edge. These models are available from Nvidia free of charge. The company has also added an extension to Nvidia Omniverse Replicator that allows users to create additional synthetic training data from existing datasets.
Tao toolkit
Tao is Nvidia’s toolkit for training and optimizing AI models for edge devices. In the latest release of Tao, model export in ONNX format is now supported, which makes it possible to deploy a Tao-trained model on any computing platform.
Integration with Edge Impulse’s platform means Edge Impulse users get access to the latest research from Nvidia, including new types of models like vision transformers. Edge Impulse’s integrated development environment can handle data collection, training on your own dataset, evaluation and comparison of models for different devices, and deployment of Tao models to any hardware. Training is run on Nvidia GPUs in Edge Impulse’s cloud via API.
Nvidia Tao toolkit now has ONNX support so that models can be deployed on any hardware. (Source: Nvidia)
Why would Nvidia make tools and models it has invested heavily in available to other types of hardware?
“Nvidia doesn’t participate in all of the AI inference market,” Talla said, noting that Nvidia’s edge AI offerings, including Jetson, are built for autonomous machines and industrial robotics where heavy duty inference is required.
Beyond that, in smartphones and IoT devices: “We will not participate in that market,” he said. “Our strategy is to play in autonomous machines, where there’s multiple sensors and sensor fusion, and that’s a strategic choice we made. The tens or hundreds of companies developing products from mobile to IoT, you could say they are competitors, but it’s overall speeding up the adoption of AI, which is good.”
Making Tao available for smaller AI chips than Jetson isn’t an altruistic move, Talla said.
“A rising tide lifts all boats,” he said. “There’s a gain for Nvidia because…IoT devices will go into billions if not tens of billions of units annually. Jetson is not targeting that market. As AI adoption grows at the edge, we want to monetize it on the data center side. If somebody’s going to use our GPUs in the cloud to train their AI, we have monetized that.”
Users will save money, he said, because Tao will make it easier to train on GPUs in the cloud, shortening time to market for products.
“It’s beneficial to everyone in the ecosystem,” he said. “I think this is a win-win for all of our partners in the middle, and end customers.”
Nvidia went through many of the same challenges facing embedded developers today when it created and optimized models for Jetson hardware seven to eight years ago. For example, Talla said, gathering data is difficult because you can’t cover all the corner cases; there are many open-source models to choose from, and they change frequently; and AI frameworks themselves are continuously changing.
“Even if you master all of that, how can you create a performance model that is going to be the right size, meaning the memory footprint, especially when it comes to running at the edge?”
Tao was developed for this purpose five to eight years ago and most of it was open sourced last year.
“We want to give full control for anybody to take as many pieces as they want, to control their destiny, that’s why it’s not a closed piece of software,” Talla said.
The technical collaboration between Nvidia and Edge Impulse had several facets, Talla said. First, the teams needed to make sure models trained in Tao were in the right format for silicon vendors’ runtime tools (edge hardware platforms typically have their own runtime compilers to optimize further). Second, Nvidia regularly updates its model zoo with state-of-the-art models, but backporting those models to older frameworks is extremely difficult. The open question, he said, is “whether we can keep the old models with the old frameworks despite adding newer models, something we’re still trying to figure out together.”
Model zoo
As part of the collaboration, Edge Impulse has optimized almost 88 models for the edge from Nvidia’s model zoo, Daniel Situnayake, director of ML at Edge Impulse, told EE Times.
“We’ve selected specific computer vision models from Nvidia’s Tao library that are appropriate for embedded constraints based on their trade-offs between latency, memory use and task performance,” he said.
Models like RetinaNet, YOLOv3, YOLOv4 and SSD were ideal options with slightly different strengths, he said. Because these models previously required Nvidia hardware to run, a certain amount of adaptation was required.
“To make them universal, we’ve performed model surgery to create custom versions of the models that will run on any C++ target, and we’ve created target-optimized implementations of any custom operations that are required,” Situnayake said. “For example, we’ve written fast versions of the decoding and non-maximum suppression algorithms used to create bounding boxes for object detection models.”
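Non-maximum suppression is the standard step that prunes overlapping candidate boxes after an object detector runs. Edge Impulse’s actual C++ implementations are not public, but the underlying greedy IoU-based algorithm can be sketched in a few lines (the box format and threshold below are illustrative assumptions, not Edge Impulse’s code):

```python
import numpy as np

def nms(boxes, scores, iou_threshold=0.5):
    """Greedy non-maximum suppression (illustrative sketch).

    boxes: (N, 4) array of [x1, y1, x2, y2]; scores: (N,) confidences.
    Returns indices of kept boxes, highest confidence first.
    """
    order = np.argsort(scores)[::-1]  # visit highest-confidence boxes first
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        rest = order[1:]
        # intersection of box i with every remaining box
        x1 = np.maximum(boxes[i, 0], boxes[rest, 0])
        y1 = np.maximum(boxes[i, 1], boxes[rest, 1])
        x2 = np.minimum(boxes[i, 2], boxes[rest, 2])
        y2 = np.minimum(boxes[i, 3], boxes[rest, 3])
        inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
        iou = inter / (areas[i] + areas[rest] - inter)
        # discard boxes that overlap the kept box too heavily
        order = rest[iou <= iou_threshold]
    return keep
```

On a microcontroller this loop would be written in C++ with fixed-size buffers rather than NumPy arrays, but the logic is the same: keep the most confident box, suppress its heavy overlaps, repeat.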
Further optimizations include quantization, scaling models down to run on mid-range microcontrollers like those based on Arm Cortex-M4 cores, and pre-training them to support input resolutions that are appropriate for embedded vision sensors.
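Quantization is the main lever here: storing weights as 8-bit integers instead of 32-bit floats cuts memory four-fold, which is what lets a model fit in a Cortex-M4’s flash and RAM. As a rough sketch of the idea (symmetric per-tensor quantization, a simplification of what toolchains actually do):

```python
import numpy as np

def quantize_int8(weights):
    """Symmetric per-tensor int8 quantization (simplified sketch).

    Maps float32 weights onto [-127, 127] with a single scale factor.
    Assumes the tensor is not all zeros. Production toolchains use
    per-channel scales and calibrated activation ranges on top of this.
    """
    scale = np.max(np.abs(weights)) / 127.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights for inspection."""
    return q.astype(np.float32) * scale
```

Each weight is now one byte plus a shared scale, and the reconstruction error is bounded by half the quantization step.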
“This results in seriously tiny models, for example, a YOLOv3 object detection model that uses 500 kB RAM and 1.2 MB ROM,” he said.
Models can be deployed via Edge Impulse’s EON compiler or using the silicon vendor’s toolchain. Edge Impulse’s EON Tuner hyperparameter optimization system can help users choose the optimum combination of model and hyperparameters for the user’s data set and target device.
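The EON Tuner’s internals are proprietary, but the selection problem it solves can be illustrated simply: among candidate model configurations, pick the most accurate one that still fits the target device’s memory and latency budget. The candidates and figures below are invented for illustration only:

```python
# Hypothetical candidate models with made-up benchmark figures,
# illustrating constraint-aware selection in the spirit of a tuner.
candidates = [
    {"name": "yolov3-96px", "accuracy": 0.71, "ram_kb": 500, "latency_ms": 210},
    {"name": "ssd-160px",   "accuracy": 0.78, "ram_kb": 900, "latency_ms": 480},
    {"name": "fomo-96px",   "accuracy": 0.64, "ram_kb": 250, "latency_ms": 90},
]

def best_fit(candidates, ram_budget_kb, latency_budget_ms):
    """Return the most accurate candidate within the device budget, or None."""
    feasible = [c for c in candidates
                if c["ram_kb"] <= ram_budget_kb
                and c["latency_ms"] <= latency_budget_ms]
    return max(feasible, key=lambda c: c["accuracy"]) if feasible else None
```

A real tuner also searches the hyperparameters that produce each candidate, but the core trade-off is the same: accuracy against a hard RAM and latency ceiling.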
Nvidia Omniverse Replicator integration with Edge Impulse allows users to generate synthetic data to address any gaps in their datasets. (Source: Nvidia)
Edge Impulse has also been working with Nvidia on integration with Omniverse Replicator, Nvidia’s tool for synthetic data generation. Edge Impulse users can now use Omniverse Replicator to generate synthetic image data based on their existing data for training—perhaps to address certain gaps in the dataset to ensure accurate and versatile trained models.
Edge Impulse’s integration with Nvidia Tao is currently available for hardware targets including NXP, STMicro, Alif and Renesas, with Nordic devices next in line for onboarding, the company said.