ESP32-S3 has a few SIMD instructions

(bitbanksoftware.blogspot.com)

189 points | by _Microft 424 days ago

9 comments

londons_explore 424 days ago
ESP_Sprite, former opensource-projects-guy, now Espressif employee, is the best source of knowledge on this stuff.
Looks like back in 2021 they had an intention to document these, but never quite got round to it:
https://esp32.com/viewtopic.php?p=88114&sid=f7f25776d9cfc6b6...
They do publish a bunch of opensource code that uses the SIMD stuff, and an assembler, so it isn't secret, just very badly documented.
- londons_explore 424 days ago
  Upon further inspection, it now seems like it is much better documented...
  Pages 37-301 of the reference manual seem to have all you'd need, including binary instruction encodings, details on instruction timings, etc.
  https://www.espressif.com/sites/default/files/documentation/...
  - jononor 423 days ago
    This looks pretty decent! And then esp-dsp can be used as example code in some cases. Personally I want to have accelerated Neural Networks for ESP32-S3 using this. Ideally for https://github.com/sipeed/TinyMaix which is one of the smallest and most hackable CNN implementations.
adolph 424 days ago
The Xtensa processor comes from Cadence and for some reason they like to keep everything under NDA, even information which would help people use their processors. I find it hard to understand why the instruction set should be kept secret; a CPU vendor should make it as easy as possible for engineers to use their CPUs.
The P4 can't come soon enough to get off Xtensa.
- ajross 424 days ago
  FWIW this bit in the article is a little confused. The SIMD instructions being detailed appear to be Espressif-custom things implemented using Cadence's "TIE" facility.
  Cadence does indeed have their own SIMD architecture ("HiFi", really it's a family of similar but binary-incompatible ISAs). And indeed docs for that don't appear in public (though if you look carefully, details for how to emit the instructions are part of the GNU toolchain integration).
  But that isn't this. If you want docs for this, talk to Espressif, not Cadence.
- jsheard 424 days ago
  The P4 doesn't have a built-in radio though, so if you want those beefy RISC-V cores you will need to integrate a second ESP32 just to handle WiFi/BT :(
  It will have USB-OTG and an LCD driver at least, which so far have been missing from all of their RISC-V parts.
  - antoniuschan99 424 days ago
    I think that’s totally fine. Might actually be the future direction. But yea would’ve been nice to have wifi/bt integrated.
    E.g. if you want 5 GHz then use the C5, or if you want Wi-Fi 6 then use the C6, etc.
    Also here’s a talk by them on how to use esp as a wifi coprocessor https://youtu.be/g14aEjnjRLw?si=TgkEyJJ2_L_Shuom
    There’s also adafruit airlift https://www.adafruit.com/product/4201
    - Aurornis 424 days ago
      > I think that’s totally fine. Might actually be the future direction.
      Using a second board just for WiFi is definitely not totally fine for most applications. Having everything integrated into a single package is important for everything from reducing BOM count to lowering power consumption to development simplicity.
  - ComputerGuru 424 days ago
    Cost optimization aside, that has always been the best way to use an ESP chip. Just go with one of their barebone models wired up as a peripheral to an ARM or RISC mcu.
    - clbrmbr 424 days ago
      Yet it’s possible to build some incredible applications on top of just ESP32, especially with extra RAM.
    - devmunchies 424 days ago
      We use esp32-s3 at my company (smart speaker) but we don't do anything fancy.
      Can you explain this? Why use esp as a peripheral if you already have an ARM chip?
      We were considering moving off of esp to something that would make it easier to do CPU-bound AI inference on-device or to enable more advanced audio DSP algos.
      - throwup238 423 days ago
        Based on cost and development time, it’s usually just easier to add an ESP and communicate to it using a generic SPI library or something than to add a radio to your PCB and get vendor libraries working on an arbitrary platform.
  - yau8edq12i 424 days ago
    Without the builtin radio it's really hard to justify the use of an ESP32 over, say, an STM32. The integrated small package with "everything" to make a fun project is the whole appeal.
    - AlotOfReading 424 days ago
      Espressif has a huge advantage in lead times over ST recently. I migrated a few projects over because ST couldn't or wouldn't give us supply in under a month when you could buy ESP chips and have them on your doorstep practically overnight.
      - vbezhenar 424 days ago
        Did you look at chinese STM clones? We used gd32, I liked it.
        makapuf 423 days ago
        Depends on whether you want to condone IP theft (a compatible independent development is of course different, but I'm not sure this is the case). R&D, support, good documentation in English and accuracy of specs come at a price.
        dbuder 422 days ago
        If they didn't copy the peripheral blocks, then all they did was implement the ARM IP just like ST Micro did in a way to be pin to pin compatible with the STM32 chips. Happy to learn otherwise. I wouldn't put them in a real product for many reasons but this is not one of them.
        vbezhenar 423 days ago
        Espressif is a Chinese company, just like GigaDevice.
        I don't think they do IP theft from STM32 (if it's even possible at current node size). They have very thorough datasheets different from STM32 ones and their own SDK with unique code (although it seems to be inspired by STM32 libraries, but absolutely not theft).
        makapuf 422 days ago
        I don't think Espressif steals from ST. I'm quite sure GD steals from ST (source: ST themselves, see answer from ST in https://www.th3dstudio.com/2021/08/03/gd32-cpu-license-issue...)
        yau8edq12i 423 days ago
        Nobody said that espressif steals IP from ST.
        pantalaimon 423 days ago
        Is implementing the same peripheral register API really IP theft?
        makapuf 423 days ago
        No, this was my remark about being compatible.
        6SixTy 424 days ago
        Wide product range. ARM and RISC-V all called GD32 with an extra letter for the exact line.
    - mort96 423 days ago
      That's the whole appeal to hobbyists, sure. But I'm guessing Espressif wants to be considered for more serious applications as well. Currently, there are good reasons to choose, say, an STM32 over an ESP32 for a commercial product if you don't need RF (or if RF is handled by another part of the product, such as a SoM running Linux). I'm guessing they wanna change that.
    - the__alchemist 424 days ago
      That's a big differentiator - it's surprising that there is no STM32 with Wi-Fi.
  - sitkack 424 days ago
    The ESP8684H2 is 1.20 qty 1, more than enough to handle BT and WiFi, then you can use any MCU you want as your application processor.
  - timschmidt 424 days ago
    Seems likely they'll continue releasing more models, further integrating the features of the P4 and C6 for example. Maybe we'll even get some risc-v SIMD instructions and support for off-chip SRAM.
- ajb 424 days ago
  Xtensa is an unusual beast because its USP (at least, back when it was owned by Tensilica) was that you could easily add extensions. Not just off the shelf ones - ones you defined yourself. They had some automation that would generate a toolchain for you to use with your shiny new instructions. Most CPU architectures exist to allow programs written on one implementation to work on another; with Xtensa it's kind of the opposite - it exists to allow each chip to have its own special sauce.
  Honestly I was a bit surprised that espressif used it without defining their own extension of some kind, if you're not doing that then you might as well use something better known.
  Edit: ajross* points out that this SIMD extension is such a one, not an off the shelf one. So I guess that explains it.
  * https://news.ycombinator.com/item?id=40267977
- mianos 424 days ago
  It is possible there is some licensing issue around the SIMD, after all it is an optional component. It was available for the LX6 as well, but not included. It's been a good run but it's great they are going to RISC-V, at the very least for the vibe. I have used both architectures, more recently using their esp-idf, and it is surprisingly uneventful to switch between them. The only issue I had is the different high/low speed timer devices between chips. In fact it is a surprise that the on-chip peripheral hardware is so compatible with their idf. Sure, they have a layer for some calls, but a lot is just issuing commands to io devices directly, and is the same between RISC-V and Tensilica cores.
tzmlab 424 days ago
There's also a follow-up blog post "ESP32-S3 SIMD Minimal Example" [0].
0: https://bitbanksoftware.blogspot.com/2024/01/esp32-s3-simd-m...
amelius 424 days ago
Where is a good overview of the various ESP32 chips available and their features?
- mort96 424 days ago
  Espressif has a pretty decent overview on their website: https://www.espressif.com/en/products/socs
  They also make a set of modules per chip, so you can get a particular chip in an easier to use package with e.g a built-in PCB antenna or antenna mounting ports or no antenna, various onboard flash sizes, that sort of stuff: https://www.espressif.com/en/products/modules
- the__alchemist 424 days ago
  S3 if you want more pins and fast
  C3 if you don't, and are OK with RISC-V
  PICO 3v2 otherwise.
  - tbyehl 424 days ago
    S3 also has USB support that I've come to hugely appreciate on dev boards... tho I just got some oddball single-port boards that used a CH340 anyways. Grrr.
    - 15155 423 days ago
      S3 does not have real USB support (in the same way that any STM32 with "USB support" does) - it has a USB-UART/JTAG device that you cannot redefine built-in.
      (edit: apparently it might? I can't find the documents on how to actually use the OTG device over the fixed IP)
      - bitbank 423 days ago
        I think you're thinking of the ESP32-C3. The S3 does have a fully programmable USB port that can do things like HID, mass storage, etc.
        the__alchemist 423 days ago
        I'm pretty confused by all this! Ie why they are set up like this. Example: All STM32 dev boards have two USB ports: One connected to the USB periph; one to a built-in ST-Link (JTAG). You flash and debug/print to CLI off JTAG, and use the USB one if your device needs to communicate with a PC etc during operation, or if you want to use DFU flashing for production boards or firmware updates.
        If you are designing a board, you will probably always have the JTAG pins broken out to a port of your choice, and use an external debugger. Wire USB A/R.
        The USB dev boards for C3 seem to all have only a UART bridge USB, and no JTAG! That is confusing because A: I'm not sure if this is a full-up USB peripheral for use as serial (But maybe not HID? Maybe it presents as USB-serial to the PC, but you program it on the MCU like UART?), and B: Why JTAG isn't table-stakes for a dev board.
        I ended up buying a "C3-Rust" devboard, because it was the only one I found that had JTAG USB! (I am coincidentally using Rust, but that's only superficially relevant)
        tbyehl 423 days ago
        C3 boards are a mess, almost all of them are single-port with a USB-UART chip despite supporting USB-CDC & JTAG. Lots of C6 boards are dual-port but sometimes they're single-port, typically without a USB-UART chip. And to make it more confusing, sometimes a single-port C6 board will have a USB-UART chip that's hanging off a USB hub chip.
        S3 has full USB-OTG support and most boards are either dual-port or single-port without a USB-UART chip. I dig 'em because I can put the TinyUF2 bootloader on them and get that Pi Pico experience of having them come up as mass storage.
        Except for these S3 boards I bought in Arduino Uno format where the designer made every decision as wrongly as possible: single USB, USB-UART chip, and the USB pins broken out to Arduino pins instead of using one of the optional headers they added.
lunfard000 424 days ago
Was it a secret? You could have guessed that something advertised [0] for "AI" had some kind of SIMD. Even ChatGPT 3.5 can give relevant code to use "AI" features [1].
0: https://www.espressif.com/en/products/socs/esp32-s3
1: https://chat.openai.com/share/3e1f990d-e8eb-4e56-acbb-ad5a33...
- iamflimflam1 424 days ago
  Not a secret - just not documented very well if at all.
  We all knew there were SIMD instructions, but if there’s no information on how to use them or what they do…
  - bobmcnamara 424 days ago
    IIRC, they have 128bit alignment requirements, so tricky to autovectorize.
    - bitbank 423 days ago
      True - load and store mask off the bottom 4 bits of the address. They try to help the situation by including an instruction which can shift a pair of 128-bit registers by bytes.
      - bobmcnamara 420 days ago
        That sounds really familiar. Maybe Altivec did that? I remember it did something like that but I wish that it would just fault.
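The alignment behaviour described in this sub-thread (loads ignoring the low 4 address bits, plus an instruction that byte-shifts a pair of 128-bit registers) can be modelled in portable C. This is only a host-side sketch of the idea, not Espressif's instructions or API: `vec128`, `load_aligned_masked`, `shift_pair_bytes` and `load_unaligned` are hypothetical names invented for illustration, and the real instruction semantics should be checked against the reference manual.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* 16 bytes standing in for one 128-bit SIMD register (hypothetical type). */
typedef struct { uint8_t b[16]; } vec128;

/* Models a load that silently masks off the low 4 bits of the address:
   whatever offset is requested, the enclosing 16-byte block is returned. */
static vec128 load_aligned_masked(const uint8_t *base, size_t offset)
{
    vec128 v;
    memcpy(v.b, base + (offset & ~(size_t)15), 16);
    return v;
}

/* Models a byte shift across a register pair: treat lo:hi as 32 bytes
   and extract the 16 bytes starting at byte `shift` (0..15). */
static vec128 shift_pair_bytes(vec128 lo, vec128 hi, unsigned shift)
{
    uint8_t tmp[32];
    vec128 out;
    memcpy(tmp, lo.b, 16);
    memcpy(tmp + 16, hi.b, 16);
    memcpy(out.b, tmp + (shift & 15u), 16);
    return out;
}

/* Unaligned 16-byte load built from two aligned loads and one pair shift.
   The buffer must extend to the end of the second aligned block. */
static vec128 load_unaligned(const uint8_t *base, size_t offset)
{
    vec128 lo = load_aligned_masked(base, offset);
    vec128 hi = load_aligned_masked(base, offset + 16);
    return shift_pair_bytes(lo, hi, (unsigned)(offset & 15u));
}
```

The two-loads-plus-shift pattern is why such code is awkward to autovectorize: a compiler has to prove alignment or emit the extra load and shift for every access.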
  - lunfard000 424 days ago
    And the author is not documenting them either, just announcing his new niche library. It is not like disassembling a few functions to prove that they exist is dark magic. I just don't see any value in the article.
    - iamflimflam1 423 days ago
      I’m not sure I’d call a JPEG decoding library “niche”.
      There are some numbers here on the performance improvements he’s managed to make.
      https://atomic14.substack.com/p/even-faster-jpeg-decoding
      - lunfard000 419 days ago
        Maybe I am missing something, but isn't it barely faster than the official ESP32_JPG? But fair enough, I didn't know that JPEG decoding on MCUs is a widespread thing.
        iamflimflam1 410 days ago
        The "official" version used in that blog post decodes the JPG all in one go - so it's pretty memory hungry. With JPEG decoders that decode sections of the image at a time you can minimise the amount of RAM that needs to be allocated. It's also possible to stream the display data out to screen using DMA while the next chunk of image data is being decoded.
        It's explained in a bit more detail in this original blog post written before the library was optimised: https://atomic14.substack.com/p/the-fastest-esp32-jpeg-decod....
        It's very easy to forget what a range of MCUs there are, from very puny, to very capable. For example the Espressif range of MCUs - which you'll find in all sorts of consumer products - are very powerful. Couple that with a lot of cheap SPI based display modules and you very quickly start wanting to show images.
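The strip-at-a-time approach described in this comment can be sketched as a ping-pong (double) buffer loop. This is a minimal host-side illustration under stated assumptions, not the library's actual API: the strip size, the callback types, and every function name here (`render_in_strips`, `demo_decode`, `demo_send`, `demo_wait`, `demo_run`) are hypothetical, and the "DMA" is simulated by ordinary function calls.

```c
#include <stddef.h>
#include <stdint.h>

enum { STRIP_PIXELS = 16 * 8 };  /* hypothetical strip: 16 x 8 RGB565 pixels */

typedef void (*decode_fn)(int strip, uint16_t *dst, void *ctx);      /* fill one strip */
typedef void (*send_fn)(int strip, const uint16_t *src, void *ctx);  /* start a transfer */
typedef void (*wait_fn)(void *ctx);                                  /* wait for the last transfer */

/* Only two strip buffers are ever allocated, regardless of image height.
   Strip s is decoded into one buffer while strip s-1 is still being sent
   from the other one. */
static void render_in_strips(int strips, decode_fn decode, send_fn send,
                             wait_fn wait_done, void *ctx)
{
    static uint16_t buf[2][STRIP_PIXELS];
    for (int s = 0; s < strips; s++) {
        uint16_t *cur = buf[s & 1];
        decode(s, cur, ctx);   /* overlaps with the in-flight transfer of strip s-1 */
        if (s > 0)
            wait_done(ctx);    /* previous transfer must finish before the next starts */
        send(s, cur, ctx);
    }
    if (strips > 0)
        wait_done(ctx);        /* drain the final transfer */
}

/* Host-side stand-ins so the sketch can run without a display. */
struct demo_state { int sent; unsigned sum; };

static void demo_decode(int strip, uint16_t *dst, void *ctx)
{
    (void)ctx;
    for (int i = 0; i < STRIP_PIXELS; i++)
        dst[i] = (uint16_t)(strip + 1);  /* fake pixel data */
}

static void demo_send(int strip, const uint16_t *src, void *ctx)
{
    struct demo_state *st = ctx;
    (void)strip;
    st->sent++;
    st->sum += src[0];
}

static void demo_wait(void *ctx) { (void)ctx; }

/* Returns the sum of the first pixel of every strip that was sent. */
static unsigned demo_run(int strips)
{
    struct demo_state st = {0, 0};
    render_in_strips(strips, demo_decode, demo_send, demo_wait, &st);
    return st.sum;
}
```

On real hardware the `send` callback would queue a DMA transfer and return immediately, which is what lets decoding and display output overlap.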
    - bitbank 423 days ago
      You need to go back and read it again. I provide links to the relevant Espressif documents and in my next article I provide a simple example to get started. Would you rather have me copy the hundreds of pages of PDF into my blog post instead of providing a link?
- amelius 424 days ago
  > Even ChatGPT 3.5 can give relevant code to use "AI" features
  I've seen ChatGPT invent its own functions and commands ...
  - makapuf 423 days ago
    I've also definitely seen it reference invented methods on APIs (that would have been very nice if they existed) - that no past or future version implemented.
  - exe34 424 days ago
    if problem: solve_problem()
    There, problem solved!
    - ssl-3 424 days ago
      # rest of problem-solving code goes here
- relaxing 424 days ago
  I love doing engineering based off of advertising material…
londons_explore 424 days ago
I had plans to use this SIMD support for some DSP algorithms on camera video feeds.... But looking at how badly documented it is, I may reconsider...
Without scatter/gather I don't think I'm gonna be able to meet my timing requirements (I need to distort images through warping, which is tricky to do without scatter/gather)
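To illustrate why warping leans on gather: in the usual inverse-mapping formulation, every output pixel reads one source pixel at a data-dependent address, so neighbouring outputs do not read neighbouring inputs. The scalar sketch below assumes a precomputed per-pixel coordinate map and nearest-neighbour sampling; `warp_nearest` and its parameters are hypothetical names, not anything from the article or the commenter's project.

```c
#include <stdint.h>

/* Inverse-mapping warp with nearest-neighbour sampling.
   map_x/map_y hold, for each destination pixel, the source coordinate to
   fetch. The src[] read is the "gather": its address depends on map data,
   so it cannot be replaced by one contiguous 128-bit load. */
static void warp_nearest(const uint8_t *src, int sw, int sh,
                         uint8_t *dst, int dw, int dh,
                         const int16_t *map_x, const int16_t *map_y)
{
    for (int y = 0; y < dh; y++) {
        for (int x = 0; x < dw; x++) {
            int i = y * dw + x;
            int sx = map_x[i];
            int sy = map_y[i];
            /* Clamp out-of-range coordinates to the image border. */
            if (sx < 0) sx = 0;
            if (sx >= sw) sx = sw - 1;
            if (sy < 0) sy = 0;
            if (sy >= sh) sy = sh - 1;
            dst[i] = src[sy * sw + sx];
        }
    }
}
```

For example, a 2x2 image with `map_x = {1, 0, 1, 0}` and `map_y = {0, 0, 1, 1}` is flipped horizontally.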
DeathArrow 423 days ago
What about ESP32-C3, using RISC-V architecture, does it also have SIMD instructions?
- guntars 423 days ago
  It does not. There are RISC-V chips out there (T-Head 906) that have a pre-1.0 vector extension, but these are 64-bit application processors. I'm sure we'll see ESP32 RISC-V chips with SIMD in the next few years.
hrydgard 423 days ago
Nice!
Could save a couple of cycles per iteration by preloading the shift amounts into several GPRs before entering the loop, instead of initializing them just before use.
fennecfoxy 422 days ago
Yeah I did see that the S3 was meant to have SIMD but with basically no developer support for it (I guess till now?)
I was looking at doing some hardware image processing for a project (which I still haven't properly started) and looked at everything from an S3 (too weak even with SIMD) to a Gowin FPGA (cheaper than the competition, but FPGAs still seemed like a time sink learning HDL or VHDL), and then I ended up picking a Pi Zero 2, whose VideoCore IV has great SIMD which is now fully documented.
Only problem is that I _still_ haven't started the project even though I have parts sitting there. Was definitely tough trying to consider what the best option might be for processing high speed (90-120fps) video frames.