Real-Time Deep Learning at the Edge



One of the major driving forces behind the push to run artificial intelligence models on-device is the reduction in latency this approach can offer. When inference relies on remote data centers, network latency is unavoidable and often unpredictable, which keeps applications from running in real time.

Of course, this move is not as simple as taking the same model that runs on a cluster of GPUs and deploying it to a microcontroller with a few tens of kilobytes of memory. The model must first be shrunk and optimized for the less powerful platform. But trim too aggressively and accuracy becomes unacceptable, so there is a limit to how far this can go. Often it is not far enough, and the constrained hardware ends up spending so many processing cycles on each inference that excessive latency creeps right back into the picture.

That lands us right back at the problem we started with, so it just won’t do. In response, researchers have proposed techniques known as patch-based layer fusion to speed up deep learning algorithms on resource-constrained hardware. These methods operate on small windows (or patches) of the input data at any given time, and they fuse together operations from multiple layers of a neural network so that intermediate results never need to be stored in full. Taken together, these optimizations speed up inference and reduce memory utilization.
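
To make the idea concrete, here is a minimal sketch of patch-based fusion (a toy illustration, not the researchers’ code): two small convolutions are fused and evaluated one output pixel at a time, so only a kernel-sized intermediate buffer ever exists, at the price of recomputing overlapping values. All of the function names and shapes here are made up for the example.

```python
# Hypothetical sketch of patch-based layer fusion: two 3x3 "valid" convolutions
# are evaluated one output pixel at a time, so only a tiny intermediate buffer
# exists instead of the full intermediate feature map.
import numpy as np

def conv2d_valid(x, k):
    """Naive single-channel 2D convolution (cross-correlation), 'valid' padding."""
    kh, kw = k.shape
    oh, ow = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    out = np.empty((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * k)
    return out

def fused_patch_inference(x, k1, k2):
    """Compute conv(conv(x, k1), k2) without materializing the full intermediate map."""
    oh = x.shape[0] - k1.shape[0] - k2.shape[0] + 2
    ow = x.shape[1] - k1.shape[1] - k2.shape[1] + 2
    rh = k1.shape[0] + k2.shape[0] - 1      # receptive-field height of one output pixel
    rw = k1.shape[1] + k2.shape[1] - 1
    out = np.empty((oh, ow))
    for i in range(oh):
        for j in range(ow):
            window = x[i:i + rh, j:j + rw]            # small input patch
            mid = conv2d_valid(window, k1)            # kernel-sized intermediate only
            out[i, j] = conv2d_valid(mid, k2)[0, 0]   # one fused output value
    return out

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x, k1, k2 = rng.normal(size=(16, 16)), rng.normal(size=(3, 3)), rng.normal(size=(3, 3))
    layer_by_layer = conv2d_valid(conv2d_valid(x, k1), k2)
    assert np.allclose(fused_patch_inference(x, k1, k2), layer_by_layer)
```

The recomputation across overlapping patches is the classic trade-off of this approach: memory goes down, but compute can go up, and striking the right balance is exactly the problem the work described next sets out to solve.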

Improving on this approach, a pair of researchers at Freie Universität Berlin and Inria have developed what they call msf-CNN. Using this method, convolutional neural networks can be tuned for optimal processing speed and memory utilization, making real-time execution of accurate models possible on even highly constrained hardware.

The msf-CNN technique builds on patch-based fusion by applying a graph-based search algorithm to determine the best way to fuse layers in a convolutional neural network. By modeling the network’s structure as a directed acyclic graph, the researchers can explore the entire fusion solution space, identifying configurations that minimize either peak RAM usage or compute cost. This graph-based search strategy enables msf-CNN to outperform previous solutions like MCUNetV2 and StreamNet in both flexibility and efficiency.
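
The shape of that search can be sketched with a toy dynamic program. Everything below is an assumption for illustration, not the paper’s cost model: layer names and cost formulas are made up, and the real msf-CNN search handles far richer network structures. Nodes are cut points between layers, an edge stands for fusing all layers between two cuts, and the best path through the graph is the fusion configuration with the lowest peak RAM, with compute as a tie-breaker.

```python
# Hypothetical sketch of searching fusion configurations as a graph.
# An edge (i, j) means "fuse layers i..j-1 into one patch-based segment".
# Costs are stand-ins; a real tool derives them from tensor shapes and recomputation.
from functools import lru_cache

LAYERS = ["conv1", "conv2", "conv3", "conv4", "dense"]   # toy network, made up

def segment_cost(i, j):
    """Placeholder cost of fusing layers i..j-1: (peak_ram_bytes, macs)."""
    n = j - i
    peak_ram = 1000 // n + 50 * n       # fusing more layers shrinks buffers...
    macs = 10_000 * n * n               # ...but re-computes overlapping patches
    return peak_ram, macs

@lru_cache(maxsize=None)
def best_partition(i=0):
    """Return (peak_ram, total_macs, cuts) for layers i..end,
    minimizing peak RAM and breaking ties on total compute."""
    if i == len(LAYERS):
        return 0, 0, ()
    best = None
    for j in range(i + 1, len(LAYERS) + 1):           # try every possible next segment
        seg_ram, seg_macs = segment_cost(i, j)
        rest_ram, rest_macs, rest_cuts = best_partition(j)
        cand = (max(seg_ram, rest_ram), seg_macs + rest_macs, (j,) + rest_cuts)
        if best is None or cand[:2] < best[:2]:
            best = cand
    return best

if __name__ == "__main__":
    ram, macs, cuts = best_partition()
    print(f"peak RAM ~{ram} B, ~{macs} MACs, segment boundaries at {list(cuts)}")
```

Swapping the objective (minimize compute instead of peak RAM) is just a change to the comparison key, which hints at the flexibility a graph-based formulation buys.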

To make this technology practical for real-world applications, the team implemented msf-CNN on a range of commercially available microcontrollers, including Arm Cortex-M, RISC-V, and ESP32 platforms. They also introduced enhancements to global pooling and dense layer operations, further reducing RAM consumption without adding compute overhead. Testing revealed that RAM utilization could be reduced by as much as 50% when compared with previous techniques.
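
As an illustration of the kind of pooling enhancement described (an assumption about the general idea, not the team’s actual implementation), global average pooling can be computed as a streaming per-channel sum so the classifier never needs the full feature map held in RAM:

```python
# Minimal sketch (assumed, not the paper's code) of streaming global average
# pooling: accumulate a per-channel running sum as rows of the feature map are
# produced, then apply the dense classifier to the tiny pooled vector.
import numpy as np

def streaming_gap_dense(row_producer, weights, bias, height):
    """row_producer(i) yields row i with shape (W, C); weights has shape (C, classes)."""
    channel_sum, width = None, None
    for i in range(height):
        row = row_producer(i)                          # only one row in RAM at a time
        width = row.shape[0]
        partial = row.sum(axis=0)                      # (C,) sum over this row's width
        channel_sum = partial if channel_sum is None else channel_sum + partial
    pooled = channel_sum / (height * width)            # global average, shape (C,)
    return pooled @ weights + bias                     # dense layer on the pooled vector

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    H, W, C, K = 8, 8, 16, 4
    fmap = rng.normal(size=(H, W, C))                  # stand-in for a conv output
    w, b = rng.normal(size=(C, K)), rng.normal(size=K)
    streamed = streaming_gap_dense(lambda i: fmap[i], w, b, H)
    reference = fmap.mean(axis=(0, 1)) @ w + b         # conventional buffered version
    assert np.allclose(streamed, reference)
```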

The source code for msf-CNN is publicly available on GitHub. Given the number of platforms that are already supported, and the wide range of applications that msf-CNN can be applied to, this work could make a big impact in the world of tiny hardware.
