BRN Discussion Ongoing

Frangipani

Top 20
Guten Tag, AkidaTag! 😊





There are two different videos on https://brainchip.com/akida-tag-lp/.

A promotional “Concept Video” as well as a “Demo Video” featuring Nikunj Kotecha…

Here are some still pictures taken from the “Concept Video”:

[Ten still images attached]

Frangipani

Top 20
(Quoted: the Concept Video stills from the previous post.)

Now some impressions from the “Demo Video”:

[Ten stills from the Demo Video attached]
 

jrp173

Regular
(Quoted: Frangipani's AkidaTag post above.)
Great videos, and perhaps the most professional and polished we have seen from BrainChip, but the question still remains: why do they never do anything to advocate for themselves? Why have these videos not been posted on BrainChip's LinkedIn page...

Don't even get me started on the lack of an ASX announcement for what looks like a great product.

I guess you can lead a horse to water, but you cannot make it drink.
 

HopalongPetrovski

I'm Spartacus!
Nice to see new products being demonstrated to interested parties, but IMO we need more engaging demonstrations to capture imaginations.
There is so much competition these days for our eyes, ears and attention that we need a bit of pizzazz to even be noticed in a world filled with TikTok influencers and Marvel superheroes.

No offence meant towards Nikunj and I’m sure he is a very capable Solutions Architect, but he is no Brad Pitt.
We need someone who presents well and is engaging in all the shallow and ridiculous ways that have currency in an overloaded sea of information.

And please spend a few bucks on getting someone in who can improve the production values.
This looks like it was done by a primary school kid on their summer holidays.
Frankly it's pretty cringeworthy, and that is not the impression we want to leave viewers with.

The concept video is better but really only junior high school student level.

I get that they want to save some dollars and produce "in house", but having great products poorly presented is not only a false economy but an actual negative for the brand.
 

TheDrooben

Pretty Pretty Pretty Pretty Good
(Quoted: HopalongPetrovski's post above.)



Happy as Larry
 
TI just announced the MSPM0G5187 and AM13Ex MCU families at Embedded World 2026, both featuring their new "TinyEngine" NPU. The performance claims are impressive — 90x lower latency and 120x lower energy per inference versus a software-only MCU baseline, all on a sub-$1 Cortex-M0+ part. But after digging through the available documentation, I think there's something unusual going on with this product, and I want to lay out what I've found.

The Documentation Gap
If you go look at TI's previous NPU product — the C2000 F28P55x, announced November 2024 — the datasheet is what you'd expect from TI. It gives you concrete architectural specs: 600–1200 MOPS at 75MHz, specific bit-width configurations (8bWx8bD and 4bWx8bD), 10x inferencing improvement vs software, CNN-optimized architecture, and it sits on a well-documented C28x DSP core with CLA, FPU, and TMU. This is thorough, detailed documentation.

Now go look at what TI has published for the TinyEngine NPU in the MSPM0G5187. The datasheet describes it as "an integrated accelerator module used to enhance fast, secure AI at the edge." Every architectural detail is withheld: no MOPS figure, no MAC-array size, no memory-bandwidth spec, no block diagram. Nothing that tells an engineer what the silicon actually is.

The 90x and 120x numbers are relative benchmarks against a non-accelerated MCU, not absolute architectural metrics. This is unusual for TI — they are one of the most documentation-heavy semiconductor companies in the world.

What's Interesting About the Architecture

The TinyEngine sits on an 80MHz Cortex-M0+ — the most basic Arm core available. No DSP. No hardware FPU. No CLA coprocessor. Yet TI claims performance improvements that dramatically exceed what their C2000 NPU achieves on a much more powerful DSP-based platform (90x vs 10x). The NPU apparently runs full neural network inference autonomously, in parallel with the CPU.

That's a significant departure from TI's historical approach, where AI/ML workloads have always been paired with their C28x DSP cores. Running aggressive neural network inference without any DSP support suggests the NPU is doing something fundamentally different from a conventional MAC array.

The Circumstantial Case for Licensed IP
Several things stack up when you look at this more closely:

1. Architecture secrecy is atypical for TI. Every other accelerator block TI ships — CLA, TMU, VCRC, the C2000 NPU — comes with detailed architectural documentation. The TinyEngine is the exception.

2. BrainChip's Akida IP is designed for exactly this use case. Akida runs quantized CNNs with 1/2/4/8-bit weights, operates autonomously without a host CPU for inference, and is explicitly designed to be licensed into SoCs. Their Akida Pico variant (announced October 2024) targets MCU-class integration at sub-milliwatt power with a 0.18mm² die area.

3. BrainChip deliberately withholds conventional performance metrics for Akida Pico. They've said performance is "scalable" and that they avoid MOPS/TOPS comparisons because their event-based architecture doesn't map cleanly to those metrics. This mirrors TI's non-disclosure pattern exactly.

4. A retired TI Senior Fellow sits on BrainChip's board. Duy-Loan Le spent 35 years at TI, became the company's first Asian American Senior Fellow, and led TI's multi-billion-dollar memory and DSP product lines — the exact technical domains relevant to NPU integration. She joined BrainChip's board in October 2022.

5. No DSP required. Akida's entire pitch is that it replaces the need for a host DSP/CPU during inference. The TinyEngine NPU operating on a bare M0+ without DSP support is consistent with this.

6. The performance profile fits neuromorphic/sparse compute. 90x latency and 120x energy improvements on a sub-$1, 80MHz M0+ MCU are very aggressive numbers. They're more consistent with a purpose-built event-based sparse compute engine (which only processes when data changes) than a conventional MAC array (which processes everything uniformly); a toy sketch of the contrast follows this list.

7. BrainChip has confirmed it holds back some partnership announcements. Their investor relations materials note that disclosure decisions are made carefully and not all commercial relationships are announced publicly. Their existing license with Renesas is structured as a royalty-bearing IP license with NDA-like provisions.
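
To make point 6 concrete, here is a toy Python sketch (entirely my own illustration, not TI's or BrainChip's design) of why an engine that only processes non-zero "events" does far less work than a dense MAC array on sparse input:

import numpy as np

# Toy contrast: a dense MAC array touches every element; an event-based
# engine skips the zeros and processes only what changed.
rng = np.random.default_rng(0)
x = rng.standard_normal(1024)
x[rng.random(1024) < 0.9] = 0.0       # ~90% sparse, typical of event-style data
w = rng.standard_normal(1024)

dense_macs = x.size                   # dense engine: one MAC per element
events = np.flatnonzero(x)            # event engine: only the non-zero inputs
sparse_macs = events.size
y = np.dot(x[events], w[events])      # identical dot product either way

print(f"dense MACs: {dense_macs}, event-driven MACs: {sparse_macs}")

On input this sparse, the event-driven path does roughly a tenth of the work, which is the kind of lever that could plausibly sit behind headline latency/energy multipliers.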

What Doesn't Fit
-"TinyEngine" is also the name of an MIT open-source inference framework from the MCUNet project. TI could be borrowing the naming convention, though the MIT version is software, not a hardware NPU.
The 90x/120x numbers could theoretically come from even a conventional accelerator if the baseline is unoptimized software inference on a bare M0+. That's a very slow baseline.
TI says they're rolling TinyEngine across their entire MCU portfolio. If this is Akida IP, that's an big licensing commitment, and BrainChip's financials haven't reflected that scale of revenue yet
- Renesas licensed Akida IP in 2020 for MCU integration, but their subsequent NPU-equipped MCUs (RA8P1) ended up using Arm's Ethos-U55 with full architectural disclosure. So having an Akida license doesn't always mean shipping Akida.

What Would Confirm or Deny This
- BrainChip's next licensing announcements. Watch for any language about "MCU-class," "high-volume embedded," or "general-purpose microcontroller" deployments.
- TI's Technical Reference Manual. If/when they release detailed NPU documentation, the architecture will either look like a conventional MAC array (disproving the theory) or remain opaque (consistent with licensed IP under NDA).
- Die analysis. Once MSPM0G5187 is in volume production and someone decaps it, the NPU block layout would reveal a lot.
- BrainChip financials. A deal at TI's scale would eventually show up in licensing revenue and royalty streams. Watch the next few quarterly reports.

Bottom Line
None of this is proof. But the combination of deliberate architecture non-disclosure (extremely atypical for TI), the board-level personnel connection, the technical alignment with Akida's capabilities, the DSP-free architecture departure, and the performance profile that fits sparse/event-based compute better than conventional MAC arrays — it adds up to a circumstantial case that's hard to dismiss.

The architecture non-disclosure is probably the single strongest signal, because it's not a gap — it's a deliberate choice by a company that documents everything else exhaustively. The most straightforward explanation for that choice is a licensing agreement with confidentiality provisions.

Worth watching closely.



Other sources: TI MSPM0G5187 datasheet, TI TMS320F28P55x datasheet, TI press release (March 10, 2026), BrainChip Akida documentation, BrainChip Akida Pico announcement (October 2024), BrainChip/Renesas IP license agreement, BrainChip board of directors disclosures, Electronic Design, EE Times, Tom's Hardware, IEEE Spectrum.
 

ChrisBRN

Member
Vecow, HaiLa & GlobalFoundries showing BrainChip products


[Two photos attached]
 

Bravo

Meow Meow 🐾
(Quoted: HopalongPetrovski's post above.)



If BrainChip wheeled out Brad Pitt to demo AkidaTag, it'd be the first neuromorphic chip to prove it can detect mass ECG anomalies all around the globe.



 

TheDrooben

Pretty Pretty Pretty Pretty Good
At least we know what AkidaTag is now. There was a lot of vagueness about it a few weeks back when it was first mentioned. I thought it may have been a cybersecurity product which "tags" and traces any anomalies in a network.

Happy as Larry
 

Diogenese

Top 20
(Quoted: the TI TinyEngine analysis from the post above.)
Hi IDD,

That's an impressive thesis-grade analysis.

TI have over 100 patents which mention "neural network" and "accelerator".

https://worldwide.espacenet.com/patent/search/family/097231435/publication/US2025315500A1?q=pa = "Texas Instruments" AND nftxt = "neural network" AND nftxt = "accelerator"

This one uses the CPU to synchronize NN processing units (contrast with Akida, in which the signaling between NPUs is asynchronous):

US2025190746A1 SYNCHRONIZED EXECUTION OF NEURAL NETWORK LAYERS IN MULTI-CORE ENVIRONMENTS 20231207




This is one that mentions matrix multiplication:

US2025315500A1 LAYER NORMALIZATION TECHNIQUES FOR NEURAL NETWORKS 20240403



[0030] In an implementation, to perform the matrix multiplication, the processing circuitry is configured to instruct an associated hardware accelerator to perform the matrix multiplication operations. For example, the processing circuitry may be coupled to a matrix multiplication accelerator (MMA), and configured to instruct the matrix multiplication accelerator to matrix multiply a first input matrix with a second input matrix to generate an output matrix storing a plurality of result values. In an implementation, after instructing the associated hardware accelerator to perform the matrix multiplication operations, the processing circuitry is further configured to instruct the associated hardware accelerator to process the plurality of result values to generate normalization parameters for the feature vector.

[0041] Layer 110 represents a processing block of a neural network. For example, if inference engine 109 is a transformer network, then layer 110 may be a multi-headed attention block (MHAB) which is configured to compute the scaled dot-product attention of a feature matrix. In an implementation, layer 110 is configured to provide its output data to data management module 111 . For example, layer 110 may store its output within memory 101 , for access by data management module 111 .

[0045] Layer 115 is representative of a processing block which is configured to form the output of inference engine 109. For example, if inference engine 109 is configured to perform image classification, then layer 115 may be configured to output a classification for an input image.
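
Paragraph [0030] is essentially layer normalization with the heavy lifting pushed onto a matrix-multiply accelerator. As a hedged reconstruction of the idea (mine, not TI's implementation), note that even the normalization statistics can be phrased as matmuls, so the same MMA hardware can produce them:

import numpy as np

def layernorm_via_matmul(x, w, eps=1e-5):
    """x: [batch, d_in] activations, w: [d_in, d_out] weights."""
    y = x @ w                          # step 1: the big matmul, offloaded to the MMA
    d = y.shape[1]
    ones = np.ones((d, 1)) / d
    mean = y @ ones                    # step 2: row means, expressed as another matmul
    var = ((y - mean) ** 2) @ ones     # squared deviations, reduced the same way
    return (y - mean) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
out = layernorm_via_matmul(rng.standard_normal((2, 4)), rng.standard_normal((4, 8)))
print(out.mean(axis=1), out.std(axis=1))   # ~0 and ~1 per row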

As recently as 2023, TI were playing with 1-bit:

US2025004762A1 BINARY CONVOLUTION INSTRUCTIONS FOR BINARY NEURAL NETWORK COMPUTATIONS 20230629




A system for accelerating binary convolution operations of a neural network includes a set of destination registers, binary convolution circuitry, a decoder coupled to the binary convolution circuitry, and instruction fetch circuitry coupled to the decoder and configured to fetch a binary convolution instruction from an associated memory. The binary convolution instruction specifies input data, weight data, and the set of destination registers for performing a binary convolution operation. The decoder receives the binary convolution instruction from the instruction fetch circuitry and causes the input data and the weight data to be provided to the binary convolution circuitry. In response, the binary convolution circuitry performs the binary convolution operation on the input data and the weight data to produce output data and stores the output data in the set of destination registers.

So maybe their remarkable performance is based on their 1-bit configuration.
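
For anyone wondering why 1-bit would buy remarkable performance: with weights and activations constrained to ±1, every multiply-accumulate collapses into an XNOR plus a popcount. A quick Python sketch of that trick (my illustration of the general technique, not code from the patent):

# Encode +1 as bit 1 and -1 as bit 0; then sign agreement is XNOR,
# and the dot product is (agreements - disagreements).
def bin_dot(a_bits, w_bits, n):
    xnor = ~(a_bits ^ w_bits) & ((1 << n) - 1)   # 1 wherever signs agree
    matches = bin(xnor).count("1")
    return 2 * matches - n

a = 0b10110011   # 8 activations
w = 0b11010001   # 8 weights
print(bin_dot(a, w, 8))   # 2, same as the equivalent +/-1 dot product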
 

Diogenese

Top 20
(Quoted: the patent round-up post above.)
Well, that's wrong. The TI TinyEngine NPU supports 2-, 4- and 8-bit precision:

https://www.ti.com/lit/po/sprt822a/sprt822a.pdf?ts=1773198830839&ref_url=https%3A%2F%2Fwww.bing.com%2F

Key benefits of the TinyEngine NPU

The TinyEngine NPU addresses key design constraints that have traditionally prevented widespread adoption of embedded AI by delivering:
• 120 times less energy per inference and 90 times lower latency compared to software-based AI
• 2.56 GOPS of computation performance for real-time edge AI inference for deep learning models
• Support for 8-bit, 4-bit and 2-bit and mixed precision configurations for quantization and in-place computation to solve memory footprint limitations
• Support for a wide range of neural network layer types like convolutional layers (generic, depthwise, pointwise, transposed), fully connected layers and pooling layers (average and max) with batch normalization
• Less development
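
For a feel of what that 8/4/2-bit bullet means in practice, here is a minimal Python sketch of symmetric n-bit weight quantization; the rounding and clamping policy is my own assumption, not anything TI documents:

import numpy as np

def quantize_symmetric(w, bits):
    qmax = 2 ** (bits - 1) - 1                 # 127 / 7 / 1 for 8 / 4 / 2 bits
    scale = np.abs(w).max() / qmax             # map the largest weight to qmax
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return q.astype(np.int8), scale

w = np.random.default_rng(1).standard_normal(8)
for bits in (8, 4, 2):
    q, s = quantize_symmetric(w, bits)
    print(bits, q, float(np.abs(w - q * s).max()))   # error grows as bits shrink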
 

Diogenese

Top 20
(Quoted: the TinyEngine datasheet post above.)
In fact, it looks like they use the sequencer patent:

US2025190746A1 SYNCHRONIZED EXECUTION OF NEURAL NETWORK LAYERS IN MULTI-CORE ENVIRONMENTS 20231207

https://github.com/Leonui/tiny-npu?tab=readme-ov-file#npu

Microcode Controller

The sequencing brain of the NPU. It fetches 128-bit microcode instructions from SRAM, decodes them into engine-specific commands, and dispatches them to the appropriate hardware engine.

A scoreboard tracks which of the 6 engines are currently busy, ensuring instructions only dispatch when their target engine is free. A barrier instruction forces the controller to stall until all engines complete, which is needed between dependent operations (e.g., wait for GEMM before starting Softmax).
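
That scoreboard-plus-barrier scheme is simple enough to model in a few lines of Python. A toy simulation (engine names, cycle costs and the program are invented; only the dispatch rules follow the README):

from collections import deque

program = deque([("GEMM", 3), ("BARRIER", 0), ("SOFTMAX", 1)])
busy = {}                                # scoreboard: engine -> cycles remaining
cycle = 0

while program or busy:
    if program:
        op, cost = program[0]
        if op == "BARRIER":
            if not busy:                 # barrier clears only once all engines drain
                program.popleft()
        elif op not in busy:             # dispatch only to a free engine
            program.popleft()
            busy[op] = cost
    for e in list(busy):                 # one cycle of engine progress
        busy[e] -= 1
        if busy[e] == 0:
            del busy[e]
    cycle += 1

print(f"program retired in {cycle} cycles")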





For matrices larger than 16x16, the GEMM controller tiles the computation automatically. A [16][64] * [64][256] GEMM is broken into (1)(4)(16) = 64 tiles of 16x16, with partial sums accumulated across the K dimension.

After accumulation, the post-processing unit applies requantization: result_i8 = clamp(round((acc * scale) >> shift)). This keeps all intermediate activations in INT8, minimizing SRAM bandwidth requirements.
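
That tiling and requantization flow reproduces directly in NumPy. A sketch of the same [16][64] x [64][256] case (the scale/shift values are placeholders, and the round step is simplified to a truncating shift):

import numpy as np

T = 16
rng = np.random.default_rng(0)
A = rng.integers(-128, 128, (16, 64), dtype=np.int8)
B = rng.integers(-128, 128, (64, 256), dtype=np.int8)
acc = np.zeros((16, 256), dtype=np.int32)     # int32 partial-sum accumulator

for j in range(0, 256, T):                    # 16 tiles along N...
    for k in range(0, 64, T):                 # ...times 4 along K (and 1 along M) = 64 tiles
        acc[:, j:j+T] += A[:, k:k+T].astype(np.int32) @ B[k:k+T, j:j+T].astype(np.int32)

scale, shift = 3, 10                          # placeholder requant parameters
out = np.clip((acc * scale) >> shift, -128, 127).astype(np.int8)
print(out.shape, out.dtype)                   # (16, 256) int8, activations stay INT8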
 
(Quoted: the tiny-npu sequencer post above.)
Is that good or bad? :LOL:

[GIF: "English please, I don't speak Spanish"]
 