ESP32-S3 has a few SIMD instructions

(bitbanksoftware.blogspot.com)

189 points | by _Microft 13 days ago

9 comments

  • londons_explore 13 days ago
    ESP_Sprite, former opensource-projects-guy, now Espressif employee, is the best source of knowledge on this stuff.

    Looks like back in 2021 they had an intention to document these, but never quite got round to it:

    https://esp32.com/viewtopic.php?p=88114&sid=f7f25776d9cfc6b6...

    They do publish a bunch of opensource code that uses the SIMD stuff, and an assembler, so it isn't secret, just very badly documented.

  • adolph 13 days ago
    The Xtensa processor comes from Cadence and for some reason they like to keep everything under NDA, even information which would help people use their processors. I find it hard to understand why the instruction set should be kept secret; a CPU vendor should make it as easy as possible for engineers to use their CPUs.

    The P4 can't come soon enough to get off Xtensa.

    • ajross 13 days ago
      FWIW this bit in the article is a little confused. The SIMD instructions being detailed appear to be Espressif-custom things implemented using Cadence's "TIE" facility.

      Cadence does indeed have their own SIMD architecture ("HiFi", really it's a family of similar but binary-incompatible ISAs). And indeed docs for that don't appear in public (though if you look carefully, details for how to emit the instructions are part of the GNU toolchain integration).

      But that isn't this. If you want docs for this, talk to Espressif, not Cadence.

    • jsheard 13 days ago
      The P4 doesn't have a built-in radio though, so if you want those beefy RISC-V cores you will need to integrate a second ESP32 just to handle WiFi/BT :(

      It will have USB-OTG and an LCD driver at least, which so far have been missing from all of their RISC-V parts.

      • antoniuschan99 13 days ago
        I think that’s totally fine. Might actually be the future direction. But yea would’ve been nice to have wifi/bt integrated.

        E.g. if you want 5 GHz then use the C5, or if you want Wi-Fi 6 then the C6, etc.

        Also here’s a talk by them on how to use esp as a wifi coprocessor https://youtu.be/g14aEjnjRLw?si=TgkEyJJ2_L_Shuom

        There’s also adafruit airlift https://www.adafruit.com/product/4201

        • Aurornis 13 days ago
          > I think that’s totally fine. Might actually be the future direction.

          Using a second board just for WiFi is definitely not totally fine for most applications. Having everything integrated into a single package is important for everything from reducing BOM count to lower power consumption to development simplicity.

      • ComputerGuru 13 days ago
        Cost optimization aside, that has always been the best way to use an ESP chip. Just go with one of their barebone models wired up as a peripheral to an ARM or RISC mcu.
        • clbrmbr 13 days ago
          Yet it’s possible to build some incredible applications on top of just ESP32, especially with extra RAM.
        • devmunchies 13 days ago
          We use esp32-s3 at my company (smart speaker) but we don't do anything fancy.

          Can you explain this? Why use esp as a peripheral if you already have an ARM chip?

          We were considering moving off of esp to something that would make it easier to do cpu-bound AI inference on-device or to enable more advanced audio DSP algos.

          • throwup238 13 days ago
            Based on cost and development time, it’s usually just easier to add an ESP and communicate to it using a generic SPI library or something than to add a radio to your PCB and get vendor libraries working on an arbitrary platform.
      • yau8edq12i 13 days ago
        Without the builtin radio it's really hard to justify the use of an ESP32 over, say, an STM32. The integrated small package with "everything" to make a fun project is the whole appeal.
        • AlotOfReading 13 days ago
          Espressif has a huge advantage in lead times over ST recently. I migrated a few projects over because ST couldn't or wouldn't give us supply in under a month when you could buy ESP chips and have them on your doorstep practically overnight.
          • vbezhenar 13 days ago
            Did you look at chinese STM clones? We used gd32, I liked it.
            • makapuf 13 days ago
              Depends if you want to condone IP theft (compatible independent developments is of course different but I'm not sure this is the case). R&D, Support, good documentation in English and accuracy of specs come at a price.
              • dbuder 11 days ago
                If they didn't copy the peripheral blocks, then all they did was implement the ARM IP just like ST Micro did in a way to be pin to pin compatible with the STM32 chips. Happy to learn otherwise. I wouldn't put them in a real product for many reasons but this is not one of them.
              • vbezhenar 12 days ago
                Espressif is a Chinese company, just like GigaDevice.

                I don't think they do IP theft from STM32 (if it's even possible at current node size). They have very thorough datasheets different from STM32 ones and their own SDK with unique code (although it seems to be inspired by STM32 libraries, but absolutely not theft).

              • pantalaimon 12 days ago
                Is implementing the same peripheral register API really IP theft?
                • makapuf 12 days ago
                  No, this was my remark about being compatible.
            • 6SixTy 13 days ago
              Wide product range. ARM and RISC-V all called GD32 with an extra letter for the exact line.
        • mort96 13 days ago
          That's the whole appeal to hobbyists, sure. But I'm guessing Espressif wants to be considered for more serious applications as well. Currently, there are good reasons to choose, say, an STM32 over an ESP32 for a commercial product if you don't need RF (or if RF is handled by another part of the product, such as a SoM running Linux). I'm guessing they wanna change that.
        • the__alchemist 13 days ago
          That's a big differentiator - it's surprising that there is no STM32 with Wi-Fi.
      • sitkack 13 days ago
        The ESP8684H2 is 1.20 qty 1, more than enough to handle BT and WiFi, then you can use any MCU you want as your application processor.
      • timschmidt 13 days ago
        Seems likely they'll continue releasing more models, further integrating the features of the P4 and C6 for example. Maybe we'll even get some risc-v SIMD instructions and support for off-chip SRAM.
    • ajb 13 days ago
      Xtensa is an unusual beast because its USP (at least, back when it was owned by Tensilica) was that you could easily add extensions. Not just off the shelf ones - ones you defined yourself. They had some automation that would generate a toolchain for you to use with your shiny new instructions. Most CPU architectures exist to allow programs written on one implementation to work on another; with Xtensa it's kind of the opposite - it exists to allow each chip to have its own special sauce.

      Honestly I was a bit surprised that espressif used it without defining their own extension of some kind, if you're not doing that then you might as well use something better known.

      Edit: ajross* points out that this SIMD extension is such a one, not an off the shelf one. So I guess that explains it.

      * https://news.ycombinator.com/item?id=40267977

    • mianos 13 days ago
      It is possible there is some licensing issue around the SIMD, after all it is an optional component. It was available for the LX6 as well, but not included. It's been a good run but it's great they are going to RISC-V, at the very least for the vibe. I have used both architectures, more recently using their esp-idf, and it is surprisingly uneventful to switch between them. The only issue I had is the different high/low speed timer devices between chips. In fact it is a surprise how compatible the on-chip peripheral hardware is with their idf. Sure, they have a layer for some calls but a lot is just issuing commands to io devices directly, and it is the same between RISC-V and Tensilica cores.
  • tzmlab 13 days ago
    There's also a follow-up blog post "ESP32-S3 SIMD Minimal Example" [0].

    0: https://bitbanksoftware.blogspot.com/2024/01/esp32-s3-simd-m...

  • amelius 13 days ago
    Where is a good overview of the various ESP32 chips available and their features?
    • mort96 13 days ago
      Espressif has a pretty decent overview on their website: https://www.espressif.com/en/products/socs

      They also make a set of modules per chip, so you can get a particular chip in an easier to use package with e.g a built-in PCB antenna or antenna mounting ports or no antenna, various onboard flash sizes, that sort of stuff: https://www.espressif.com/en/products/modules

    • the__alchemist 13 days ago
      S3 if you want more pins and fast

      C3 if you don't, and are OK with RISC-V

      PICO 3v2 otherwise.

      • tbyehl 13 days ago
        S3 also has USB support that I've come to hugely appreciate on dev boards... tho I just got some oddball single-port boards that used a CH340 anyways. Grrr.
        • 15155 12 days ago
          S3 does not have real USB support (in the same way that any STM32 with "USB support" does) - it has a built-in USB-UART/JTAG device that you cannot redefine.

          (edit: apparently it might? I can't find the documents on how to actually use the OTG device over the fixed IP)

          • bitbank 12 days ago
            I think you're thinking of the ESP32-C3. The S3 does have a fully programmable USB port that can do things like HID, mass storage, etc.
            • the__alchemist 12 days ago
              I'm pretty confused by all this! Ie why they are set up like this. Example: All STM32 dev boards have two USB ports: One connected to the USB periph; one to a built-in ST-Link (JTAG). You flash and debug/print to CLI off JTAG, and use the USB one if your device needs to communicate with a PC etc during operation, or if you want to use DFU flashing for production boards or firmware updates.

              If you are designing a board, you will probably always have the JTAG pins broken out to a port of your choice, and use an external debugger. Wire USB A/R.

              The USB dev boards for C3 seem to all have only a UART bridge USB, and no JTAG! That is confusing because A: I'm not sure if this is a full-up USB peripheral for use as serial (But maybe not HID? Maybe it presents as USB-serial to the PC, but you program it on the MCU like UART?), and B: Why JTAG isn't table-stakes for a dev board.

              I ended up buying a "C3-Rust" devboard, because it was the only one I found that had JTAG USB! (I am coincidentally using Rust, but that's only superficially relevant)

              • tbyehl 12 days ago
                C3 boards are a mess, almost all of them are single-port with a USB-UART chip despite supporting USB-CDC & JTAG. Lots of C6 boards are dual-port but sometimes they're single-port, typically without a USB-UART chip. And to make it more confusing, sometimes a single-port C6 board will have a USB-UART chip that's hanging off a USB hub chip.

                S3 has full USB-OTG support and most boards are either dual-port or single-port without a USB-UART chip. I dig 'em because I can put the TinyUF2 bootloader on them and get that Pi Pico experience of having them come up as mass storage.

                Except for these S3 boards I bought in Arduino Uno format where the designer made every decision as wrongly as possible: single USB, USB-UART chip, and the USB pins broken out to Arduino pins instead of using one of the optional headers they added.

  • lunfard000 13 days ago
    Was it a secret? You could have guessed that something advertised [0] for "AI" had some kind of SIMD. Even ChatGPT 3.5 can give relevant code to use "AI" features [1].

    0: https://www.espressif.com/en/products/socs/esp32-s3

    1: https://chat.openai.com/share/3e1f990d-e8eb-4e56-acbb-ad5a33...

    • iamflimflam1 13 days ago
      Not a secret - just not documented very well if at all.

      We all knew there were SIMD instructions, but if there’s no information on how to use them or what they do…

      • bobmcnamara 13 days ago
        IIRC, they have 128bit alignment requirements, so tricky to autovectorize.
        • bitbank 12 days ago
          True - load and store mask off the bottom 4 bits of the address. They try to help the situation by including an instruction which can shift a pair of 128-bit registers by bytes.
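
          The masking and byte-shift behaviour described above can be modelled in a few lines. This is only a Python sketch of the idea, not ESP32-S3 code: the function names (aligned_load, shift_pair, unaligned_load) are made up for illustration and do not correspond to real instruction mnemonics. It assumes a 16-byte vector, a load that silently clears the low 4 address bits, and a shift that treats two registers as one 32-byte value.

```python
VEC_BYTES = 16  # one 128-bit vector register


def aligned_load(mem: bytes, addr: int) -> bytes:
    """Model of a vector load that ignores the low 4 bits of the address."""
    base = addr & ~(VEC_BYTES - 1)  # mask off the bottom 4 bits
    return mem[base:base + VEC_BYTES]


def shift_pair(lo: bytes, hi: bytes, nbytes: int) -> bytes:
    """Treat lo followed by hi as one 32-byte value; return 16 bytes at offset nbytes."""
    if not 0 <= nbytes < VEC_BYTES:
        raise ValueError("shift amount must be 0..15 bytes")
    return (lo + hi)[nbytes:nbytes + VEC_BYTES]


def unaligned_load(mem: bytes, addr: int) -> bytes:
    """Emulate an unaligned 16-byte read using two aligned loads and one shift."""
    lo = aligned_load(mem, addr)              # block containing the first byte
    hi = aligned_load(mem, addr + VEC_BYTES)  # the following block
    return shift_pair(lo, hi, addr & (VEC_BYTES - 1))


# Hypothetical test memory: byte i holds the value i.
mem = bytes(range(64))
```

          In this model a load from address 21 quietly returns the block starting at 16, which is why code that assumes arbitrary addresses (including auto-vectorized code) needs the extra load-and-shift step.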
          • bobmcnamara 9 days ago
            That sounds really familiar. Maybe Altivec did that? I remember it did something like that but I wish that it would just fault.
      • lunfard000 13 days ago
        And the author is not documenting them either, just announcing his new niche library. It is not like disassembling a few functions to prove that they exist is dark magic. I just don't see any value in the article.
        • iamflimflam1 13 days ago
          I’m not sure I’d call a JPEG decoding library “niche”.

          There are some numbers here on the performance improvements he’s managed to make.

          https://atomic14.substack.com/p/even-faster-jpeg-decoding

          • lunfard000 8 days ago
            Maybe I am missing something, but isn't it barely faster than the official ESP32_JPG? But fair enough, I didn't know that JPEG decoding on MCUs is a widespread thing.
        • bitbank 12 days ago
          You need to go back and read it again. I provide links to the relevant Espressif documents and in my next article I provide a simple example to get started. Would you rather have me copy the hundreds of pages of PDF into my blog post instead of providing a link?
    • amelius 13 days ago
      > Even ChatGPT 3.5 can give relevant code to use "AI" features

      I've seen ChatGPT invent its own functions and commands ...

      • makapuf 13 days ago
        I've also definitely seen it reference invented methods on APIs (that would have been very nice if they existed) - that no past or future version implemented.
      • exe34 13 days ago
        if problem: solve_problem()

        There, problem solved!

        • ssl-3 13 days ago
          # rest of problem-solving code goes here
    • relaxing 13 days ago
      I love doing engineering based off of advertising material…
  • londons_explore 13 days ago
    I had plans to use this SIMD support for some DSP algorithms on camera video feeds.... But looking at how badly documented it is, I may reconsider...

    Without scatter/gather I don't think I'm gonna be able to meet my timing requirements (I need to distort images through warping, which is tricky to do without scatter/gather)
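
    To make the scatter/gather point concrete, here is a rough Python sketch of a backward-mapped warp. It is a generic nearest-neighbour remap, not the algorithm from my project, and the names (warp_nearest, map_x, map_y) are invented for this example. Each output pixel reads from a source address computed from a lookup map, so neighbouring outputs can read from unrelated source locations. That per-lane addressing is exactly what a contiguous 16-byte vector load cannot do.

```python
def warp_nearest(src, width, height, map_x, map_y, fill=0):
    """Backward-map warp: out[y][x] = src[map_y[y][x]][map_x[y][x]]."""
    out = [[fill] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            sx = map_x[y][x]
            sy = map_y[y][x]
            # Data-dependent source address: this read is the "gather".
            if 0 <= sx < width and 0 <= sy < height:
                out[y][x] = src[sy][sx]
    return out


# Tiny example: a horizontal flip expressed as a coordinate map.
src = [[1, 2, 3],
       [4, 5, 6]]
map_x = [[2, 1, 0],
         [2, 1, 0]]
map_y = [[0, 0, 0],
         [1, 1, 1]]
flipped = warp_nearest(src, 3, 2, map_x, map_y)
```

    A flip or a pure translation still reads contiguous runs and could be vectorized with plain loads, but a lens-style distortion map generally does not have that structure.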

  • DeathArrow 12 days ago
    What about ESP32-C3, using RISC-V architecture, does it also have SIMD instructions?
    • guntars 12 days ago
      It does not. There are RISC-V chips out there (T-Head 906) that have a pre-1.0 vector extension, but these are 64-bit application processors. I'm sure we'll see ESP32 RISC-V chips with SIMD in the next few years.
  • hrydgard 13 days ago
    Nice!

    Could save a couple of cycles per iteration by preloading the shift amounts into several GPRs before entering the loop, instead of initializing them just before use.

  • fennecfoxy 11 days ago
    Yeah I did see that the S3 was meant to have SIMD but with basically no developer support for it (I guess till now?)

    I was looking at doing some hardware image processing for a project (which I still haven't properly started) and looked at options ranging from an S3 (too weak even with SIMD) to a Gowin FPGA (cheaper than the competition, but FPGAs still seemed like a time sink learning HDL or VHDL), and then I ended up picking a Pi Zero 2, whose VideoCore IV has great SIMD which is now fully documented.

    Only problem is that I _still_ haven't started the project even though I have parts sitting there. Was definitely tough trying to consider what the best option might be for processing high speed (90-120fps) video frames.