This looks pretty decent! And then esp-dsp can be used as example code in some cases.
Personally I want to have accelerated Neural Networks for ESP32-S3 using this. Ideally for
https://github.com/sipeed/TinyMaix which is one of the smallest and most hackable CNN implementations.
The Xtensa processor comes from Cadence and for some reason they like to keep everything under NDA, even information which would help people use their processors. I find it hard to understand why the instruction set should be kept secret; a CPU vendor should make it as easy as possible for engineers to use their CPUs.
FWIW this bit in the article is a little confused. The SIMD instructions being detailed appear to be Espressif-custom things implemented using Cadence's "TIE" facility.
Cadence does indeed have their own SIMD architecture ("HiFi", really it's a family of similar but binary-incompatible ISAs). And indeed docs for that don't appear in public (though if you look carefully, details for how to emit the instructions are part of the GNU toolchain integration).
But that isn't this. If you want docs for this, talk to Espressif, not Cadence.
The P4 doesn't have a built-in radio though, so if you want those beefy RISC-V cores you will need to integrate a second ESP32 just to handle WiFi/BT :(
It will have USB-OTG and an LCD driver at least, which so far have been missing from all of their RISC-V parts.
> I think that’s totally fine. Might actually be the future direction.
Using a second board just for WiFi is definitely not totally fine for most applications. Having everything integrated into a single package is important for everything from reducing BOM count to lowering power consumption to simplifying development.
Cost optimization aside, that has always been the best way to use an ESP chip. Just go with one of their barebone models wired up as a peripheral to an ARM or RISC mcu.
We use the ESP32-S3 at my company (smart speaker), but we don't do anything fancy.
Can you explain this? Why use esp as a peripheral if you already have an ARM chip?
We were considering moving off of ESP to something that would make it easier to do CPU-bound AI inference on-device or to enable more advanced audio DSP algos.
Based on cost and development time, it’s usually just easier to add an ESP and communicate to it using a generic SPI library or something than to add a radio to your PCB and get vendor libraries working on an arbitrary platform.
Without the builtin radio it's really hard to justify the use of an ESP32 over, say, an STM32. The integrated small package with "everything" to make a fun project is the whole appeal.
Espressif has a huge advantage in lead times over ST recently. I migrated a few projects over because ST couldn't or wouldn't give us supply in under a month when you could buy ESP chips and have them on your doorstep practically overnight.
Depends on whether you want to condone IP theft (compatible independent development is of course different, but I'm not sure that is the case here). R&D, support, good documentation in English, and accurate specs come at a price.
If they didn't copy the peripheral blocks, then all they did was implement the ARM IP just like ST Micro did in a way to be pin to pin compatible with the STM32 chips. Happy to learn otherwise. I wouldn't put them in a real product for many reasons but this is not one of them.
Espressif is a Chinese company, just like GigaDevice.
I don't think they do IP theft from STM32 (if it's even possible at current node size). They have very thorough datasheets different from STM32 ones and their own SDK with unique code (although it seems to be inspired by STM32 libraries, but absolutely not theft).
That's the whole appeal to hobbyists, sure. But I'm guessing Espressif wants to be considered for more serious applications as well. Currently, there are good reasons to choose, say, an STM32 over an ESP32 for a commercial product if you don't need RF (or if RF is handled by another part of the product, such as a SoM running Linux). I'm guessing they wanna change that.
Seems likely they'll continue releasing more models, further integrating the features of the P4 and C6 for example. Maybe we'll even get some risc-v SIMD instructions and support for off-chip SRAM.
Xtensa is an unusual beast because its USP (at least, back when it was owned by Tensilica) was that you could easily add extensions. Not just off-the-shelf ones, but ones you defined yourself. They had some automation that would generate a toolchain for you to use with your shiny new instructions. Most CPU architectures exist to allow programs written on one implementation to work on another; with Xtensa it's kind of the opposite: it exists to allow each chip to have its own special sauce.
Honestly I was a bit surprised that Espressif used it without defining their own extension of some kind. If you're not doing that, then you might as well use something better known.
Edit: ajross* points out that this SIMD extension is such a one, not an off the shelf one. So I guess that explains it.
It is possible there is some licensing issue around the SIMD; after all, it is an optional component. It was available for the LX6 as well, but not included. It's been a good run, but it's great they are going to RISC-V, at the very least for the vibe. I have used both architectures, more recently using their ESP-IDF, and it is surprisingly uneventful to switch between them. The only issue I had is the different high/low speed timer devices between chips. In fact, it is surprising how compatible the on-chip peripheral hardware is across chips under their IDF. Sure, they have a layer for some calls, but a lot is just issuing commands to I/O devices directly, and those are the same between the RISC-V and Tensilica cores.
They also make a set of modules per chip, so you can get a particular chip in an easier to use package with e.g a built-in PCB antenna or antenna mounting ports or no antenna, various onboard flash sizes, that sort of stuff: https://www.espressif.com/en/products/modules
S3 also has USB support that I've come to hugely appreciate on dev boards... tho I just got some oddball single-port boards that used a CH340 anyways. Grrr.
S3 does not have real USB support (in the same way that any STM32 with "USB support" does) - it has a built-in USB-UART/JTAG device that you cannot redefine.
(edit: apparently it might? I can't find the documents on how to actually use the OTG device over the fixed IP)
I'm pretty confused by all this, i.e. why they are set up like this. Example: All STM32 dev boards have two USB ports: one connected to the USB peripheral, one to a built-in ST-Link (JTAG). You flash and debug/print to CLI off JTAG, and use the USB one if your device needs to communicate with a PC etc. during operation, or if you want to use DFU flashing for production boards or firmware updates.
If you are designing a board, you will probably always have the JTAG pins broken out to a port of your choice, and use an external debugger. Wire USB A/R.
The USB dev boards for C3 seem to all have only a UART bridge USB, and no JTAG! That is confusing because A: I'm not sure if this is a full-up USB peripheral for use as serial (But maybe not HID? Maybe it presents as USB-serial to the PC, but you program it on the MCU like UART?), and B: Why JTAG isn't table-stakes for a dev board.
I ended up buying a "C3-Rust" devboard, because it was the only one I found that had JTAG USB! (I am coincidentally using Rust, but that's only superficially relevant)
C3 boards are a mess, almost all of them are single-port with a USB-UART chip despite supporting USB-CDC & JTAG. Lots of C6 boards are dual-port but sometimes they're single-port, typically without a USB-UART chip. And to make it more confusing, sometimes a single-port C6 board will have a USB-UART chip that's hanging off a USB hub chip.
S3 has full USB-OTG support and most boards are either dual-port or single-port without a USB-UART chip. I dig 'em because I can put the TinyUF2 bootloader on them and get that Pi Pico experience of having them come up as mass storage.
Except for these S3 boards I bought in Arduino Uno format, where the designer made every decision as wrongly as possible: single USB, USB-UART chip, and the USB pins broken out to Arduino pins instead of using one of the optional headers they added.
Was it a secret? You could have guessed that something advertised [0] for "AI" had some kind of SIMD. Even ChatGPT 3.5 can give relevant code to use "AI" features [1].
True - load and store mask off the bottom 4 bits of the address. They try to help the situation by including an instruction which can shift a pair of 128-bit registers by bytes.
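To make the alignment point concrete, here is a rough portable-C model of the trick, not the actual S3 instructions. The type `vec128` and the helpers `load_aligned`, `shift_pair_bytes`, and `load_unaligned` are made-up names for illustration; the only things taken from the comment above are that loads ignore the low 4 address bits and that a byte shift across a register pair exists. An unaligned 16-byte read becomes two aligned loads plus one pair shift:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Stand-in for one 128-bit vector register (hypothetical type). */
typedef struct { uint8_t b[16]; } vec128;

/* Models an aligned load: the low 4 bits of the address are masked off. */
static vec128 load_aligned(const uint8_t *p)
{
    vec128 v;
    const uint8_t *base = (const uint8_t *)((uintptr_t)p & ~(uintptr_t)15);
    memcpy(v.b, base, 16);
    return v;
}

/* Models the pair shift: view lo:hi as 32 bytes and take 16 bytes
 * starting at byte offset `shift` (0..15). */
static vec128 shift_pair_bytes(vec128 lo, vec128 hi, unsigned shift)
{
    uint8_t tmp[32];
    vec128 out;
    assert(shift < 16);
    memcpy(tmp, lo.b, 16);
    memcpy(tmp + 16, hi.b, 16);
    memcpy(out.b, tmp + shift, 16);
    return out;
}

/* Unaligned 16-byte read built from the two primitives above.
 * Note: this touches the whole aligned block after p as well, so the
 * buffer must be readable up to the next 16-byte boundary past p + 16. */
static vec128 load_unaligned(const uint8_t *p)
{
    unsigned shift = (unsigned)((uintptr_t)p & 15);
    vec128 lo = load_aligned(p);
    vec128 hi = load_aligned(p + 16);
    return shift_pair_bytes(lo, hi, shift);
}
```

In a streaming loop you would keep `hi` from the previous iteration as the next `lo`, so the steady-state cost is one load and one shift per 16 bytes.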
And the author is not documenting them either, just announcing his new niche library. It is not like disassembling a few functions to prove that they exist is dark magic. I just don't see any value in the article.
Maybe I am missing something, but isn't it barely faster than the official ESP32_JPG? But fair enough, I didn't know that JPEG decoding on MCUs is a widespread thing.
You need to go back and read it again. I provide links to the relevant Espressif documents and in my next article I provide a simple example to get started. Would you rather have me copy the hundreds of pages of PDF into my blog post instead of providing a link?
I've also definitely seen it reference invented methods on APIs (that would have been very nice if they existed) - that no past or future version implemented.
I had plans to use this SIMD support for some DSP algorithms on camera video feeds.... But looking at how badly documented it is, I may reconsider...
Without scatter/gather I don't think I'm gonna be able to meet my timing requirements (I need to distort images through warping, which is tricky to do without scatter/gather)
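For anyone wondering why warping wants gather: a minimal scalar sketch in C, assuming a precomputed per-pixel lookup map (the function name `warp_nearest` and the `map_x`/`map_y` layout are my own illustration, not from any library). Every output pixel reads from an arbitrary source address, so a contiguous 16-byte vector load does not help with the inner read:

```c
#include <assert.h>
#include <stdint.h>

/* Nearest-neighbour warp: dst[y][x] = src[map_y[y][x]][map_x[y][x]].
 * The maps hold one source coordinate per destination pixel. */
void warp_nearest(const uint8_t *src, int src_w, int src_h,
                  uint8_t *dst, int dst_w, int dst_h,
                  const uint16_t *map_x, const uint16_t *map_y)
{
    for (int y = 0; y < dst_h; y++) {
        for (int x = 0; x < dst_w; x++) {
            int i = y * dst_w + x;
            int sx = map_x[i];
            int sy = map_y[i];
            assert(sx < src_w && sy < src_h);
            /* This indexed read is the "gather": neighbouring output
             * pixels may pull from unrelated source addresses. */
            dst[i] = src[sy * src_w + sx];
        }
    }
}
```

With hardware gather, the 16 indexed reads for one output vector could be issued as a single instruction; without it they stay scalar loads, which is where the timing budget goes.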
It does not. There are RISC-V chips out there (T-Head 906) that have a pre-1.0 vector extension, but these are 64-bit application processors. I'm sure we'll see ESP32 RISC-V chips with SIMD in the next few years.
Could save a couple of cycles per iteration by preloading the shift amounts into several GPRs before entering the loop, instead of initializing them just before use.
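The suggestion is ordinary loop-invariant hoisting. Here is a sketch in C rather than the article's assembly, with made-up function names and a made-up four-lane layout, just to show the shape of the change. A C compiler will usually do this for you; in hand-written assembly you have to do it yourself:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Baseline: the shift amounts are fetched again on every iteration. */
void shift_lanes_naive(const uint32_t *in, uint32_t *out, size_t n,
                       const uint8_t shifts[4])
{
    assert(n % 4 == 0);
    for (size_t i = 0; i < n; i += 4) {
        for (size_t k = 0; k < 4; k++) {
            out[i + k] = in[i + k] >> shifts[k];
        }
    }
}

/* Hoisted: the shift amounts are loaded once into locals (registers)
 * before the loop, so the loop body only does the data work. */
void shift_lanes_hoisted(const uint32_t *in, uint32_t *out, size_t n,
                         const uint8_t shifts[4])
{
    const unsigned s0 = shifts[0], s1 = shifts[1];
    const unsigned s2 = shifts[2], s3 = shifts[3];
    assert(n % 4 == 0);
    for (size_t i = 0; i < n; i += 4) {
        out[i + 0] = in[i + 0] >> s0;
        out[i + 1] = in[i + 1] >> s1;
        out[i + 2] = in[i + 2] >> s2;
        out[i + 3] = in[i + 3] >> s3;
    }
}
```

Both versions assume every shift amount is below 32 and produce identical output; the only difference is where the constants get set up.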
Yeah I did see that the S3 was meant to have SIMD but with basically no developer support for it (I guess till now?)
I was looking at doing some hardware image processing for a project (which I still haven't properly started) and went from an S3 (too weak even with SIMD) to a Gowin FPGA (cheaper than the competition, but FPGAs still seemed like a time sink, having to learn an HDL like Verilog or VHDL), and then I ended up picking a Pi Zero 2, whose VideoCore IV has great SIMD which is now fully documented.
Only problem is that I _still_ haven't started the project even though I have parts sitting there. Was definitely tough trying to consider what the best option might be for processing high speed (90-120fps) video frames.
Looks like back in 2021 they had an intention to document these, but never quite got round to it:
https://esp32.com/viewtopic.php?p=88114&sid=f7f25776d9cfc6b6...
They do publish a bunch of opensource code that uses the SIMD stuff, and an assembler, so it isn't secret, just very badly documented.
Pages 37-301 of the reference manual seem to have all you'd need, including binary instruction encodings, details on instruction timings, etc.
https://www.espressif.com/sites/default/files/documentation/...
The P4 can't come soon enough to get off Xtensa.
E.g. if you want 5 GHz then use the C5, or if you want Wi-Fi 6 then the C6, etc.
Also here’s a talk by them on how to use esp as a wifi coprocessor https://youtu.be/g14aEjnjRLw?si=TgkEyJJ2_L_Shuom
There’s also adafruit airlift https://www.adafruit.com/product/4201
* https://news.ycombinator.com/item?id=40267977
0: https://bitbanksoftware.blogspot.com/2024/01/esp32-s3-simd-m...
C3 if you don't, and are OK with RISC-V
PICO 3v2 otherwise.
0: https://www.espressif.com/en/products/socs/esp32-s3
1: https://chat.openai.com/share/3e1f990d-e8eb-4e56-acbb-ad5a33...
We all knew there were SIMD instructions, but if there’s no information on how to use them or what they do…
There are some numbers here on the performance improvements he’s managed to make.
https://atomic14.substack.com/p/even-faster-jpeg-decoding
I've seen ChatGPT invent its own functions and commands ...
There, problem solved!