PyTorch Library for Running LLM on Intel CPU and GPU


308 points | by ebalit 11 days ago


  • vegabook 11 days ago
    The company that did 4-cores-forever, has the opportunity to redeem itself, in its next consumer GPU release, by disrupting the "8-16GB VRAM forever" that AMD and Nvidia have been imposing on us for a decade. It would be poetic to see 32-48GB at a non-eye-watering price point.

    Intel definitely seems to be doing all the right things on software support.

    • riskable 11 days ago
      No kidding... Intel is playing catch-up with Nvidia in the AI space and a big reason for that is their offerings aren't competitive. You can get an Intel Arc A770 with 16GB of VRAM (which was released in October, 2022) for about $300 or an Nvidia 4060 Ti with 16GB of VRAM for ~$500 which is twice as fast for AI workloads in reality.

      This is a huge problem because in theory the Arc A770 is faster! Its theoretical performance (TFLOPS) is more than twice that of an Nvidia 4060. So why does it perform so poorly? Because everything AI-related has been developed and optimized to run on Nvidia's CUDA.

      Mostly, this is a mindshare issue. If Intel offered a workstation GPU (i.e. not a ridiculously expensive "enterprise" monster) that developers could use that had something like 32GB or 64GB of VRAM it would sell! They'd sell zillions of them! In fact, I'd wager that they'd be so popular it'd be hard for consumers to even get their hands on one because it would sell out everywhere.

      It doesn't even need to be the fastest card. It just needs to offer more VRAM than the competition. Right now, if you want to do things like training or video generation the lack of VRAM is a bigger bottleneck than the speed of the GPU. How does Intel not see this‽ They have the power to step up and take over a huge section of the market but instead they're just copying (poorly) what everyone else is doing.

      • Workaccount2 11 days ago
        Based on leaks, it looks like Intel somehow missed an easy opportunity here. There is an insane demand for high VRAM cards now, and it seems the next Intel cards will be 12GB.

        Intel, screw everything else, just pack as much VRAM in those as you can. Build it and they will come.

        • dheera 11 days ago
          Exactly, I'd love to have 1TB of RAM that can be accessed at 6000 MT/s.
          • talldayo 11 days ago
            Optane is crying and punching the walls right now.
            • yjftsjthsd-h 11 days ago
              Does optane have an advantage over RAM here?
              • watersb 11 days ago
                  Optane products were sold as DIMMs with single-DIMM capacity as high as 512 GB, with an Intel memory controller that could make it look like DRAM.

                512 GB.

                It was slower than conventional DRAM.

                But for AI models, Optane may have an advantage: it's bit-addressable.

                I'm not aware of any memory controllers that exposed that single-bit granularity; Optane was fighting to create a niche for itself, between DRAM and NAND Flash: pretending to be both, when it was neither.

                Bit-level operations, computational units in the same device as massive storage, is an architecture that has yet to be developed.

                AI GPUs try to be such an architecture by plopping 16GB of HBM next to a sea of little dot-product engines.

                • wtallis 11 days ago
                  > But for AI models, Optane may have an advantage: it's bit-addressable.

                  That's an advantage over NAND but not over DRAM. Fundamentally, DRAM is also bit-addressable, but everybody uses DRAM parts with memory cells organized into a hierarchy of groupings for reasons that mostly apply to 3D XPoint memory.

      • ponector 11 days ago
        I don't agree. Who will buy it? A few enthusiasts who want to run LLMs locally but cannot afford an M3 or a 4090?

        It will be a niche product with poor sales.

        • bt1a 11 days ago
          I think there are more than a few enthusiasts who would be very interested in buying 1 or more of these cards (if they had 32+ GB of memory), but I don't have any data to back that opinion up. It is not only those who can't afford a 4090 though.

          While the 4090 can run models that use less than 24GB of memory at blistering speeds, models are going to continue to scale up and 24GB is fairly limiting. Because LLM inference can take advantage of splitting the layers among multiple GPUs, high memory GPUs that aren't super expensive are desirable.

          To share a personal perspective, I have a desktop with a 3090 and an M1 Max Studio with 64GB of memory. I use the M1 for local LLMs because I can use up to ~57GB of memory, even though the output (in terms of tok/s) is much slower than for models I can fit on a 3090.
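A minimal sketch of the layer-splitting idea mentioned above: walk the model's layers in order and place each on the first GPU that still has room. The function name `assign_layers`, the per-layer sizes, and the per-device budgets are all hypothetical and chosen for illustration; real inference frameworks also have to account for activations and KV cache, which this ignores.

```python
def assign_layers(layer_sizes_gb, device_budgets_gb):
    """Greedily place consecutive layers on devices, in order.

    Returns a list with one device index per layer. Layers stay contiguous,
    so each device holds one slice of the model.
    """
    placement = []
    device = 0
    used = 0.0
    for size in layer_sizes_gb:
        # Move to the next device while the current one cannot hold this layer.
        while device < len(device_budgets_gb) and used + size > device_budgets_gb[device]:
            device += 1
            used = 0.0
        if device == len(device_budgets_gb):
            raise MemoryError("model does not fit in the given devices")
        placement.append(device)
        used += size
    return placement


# Ten hypothetical 2 GB layers across an 8 GB card and a 16 GB card.
print(assign_layers([2.0] * 10, [8.0, 16.0]))
```

With these made-up numbers the first four layers land on device 0 and the remaining six on device 1, which is why several cheaper high-memory cards can stand in for one large one.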

          • michaelbrave 10 days ago
            Right now I have a 3090TI so it's not worth it for me to upgrade to a 4090, but I do run into Vram constraints a lot, especially with merging stable diffusion models, especially as the models get larger (XL-Cascade-etc). As I move toward running multiple LLMs at a time I run into similar problems.

            I would gladly buy a card that ran a touch slower but had massive Vram, especially if it was affordable, but I guess that puts me into that camp of enthusiasts you mentioned.

          • Dalewyn 11 days ago
            >models are going to continue to scale up and 24GB is fairly limiting

            >24GB is fairly limiting

            Can I take a moment to suggest that maybe we're very spoiled?

            24GB of VRAM is more than most people's system RAM, and that is "fairly limiting"?

            To think Bill once said 640KB would be enough.

            • hnfong 11 days ago
              It doesn't matter whether anyone is "spoiled" or not.

              The fact is large language models require a lot of VRAM, and the more interesting ones need more than 24GB to run.

              The people who are able to afford systems with more than 24GB VRAM will go buy hardware that gives them that, and when GPU vendors release products with insufficient VRAM they limit their market.

              I mean inequality is definitely increasing at a worrying rate these days, but let's keep the discussion on topic...

              • Dalewyn 11 days ago
                I'm just fascinated that the response/demand to running out of RAM is "Just sell us more RAM, god damn!" instead of engineering a solution to make do with what is practically (and realistically) available.
                • dekhn 11 days ago
                  I would say that increasing RAM to avoid engineering a solution has long been a successful strategy.

                  I learned my RAM lesson when I bought my first real Linux PC. It had 4MB of RAM, which was enough to run X, bash, xterm, and emacs. But once I ran all that and also wanted to compile with g++, it would start swapping, which in the days of slow hard drives, was death to productivity.

                  I spent $200 to double to 8MB, and then another $200 to double to 16MB, and then finally, $200 to max out the RAM on my machine-- 32MB! And once I did that everything flew.

                  Rather than attempting to solve the problem by making emacs (eight megs and constantly swapping) use less RAM, or find a way to hack without X, I deployed money to max out my machine (which was practical, but not realistically available to me unless I gave up other things in life for the short term). Not only was I more productive, I used that time to work on other engineering problems which helped build my career, while also learning an important lesson about swapping/paging.

                  People demand RAM and what was not practically available is often available 2 years later as standard. Seems like a great approach to me, especially if you don't have enough smart engineers to work around problems like that (see "How would you sort 4M integers in 2M of RAM?")
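For readers unfamiliar with the interview question quoted above, one textbook answer is an external merge sort: sort chunks that fit in memory, spill each sorted run to disk, then stream a k-way merge over the runs. The sketch below is only an illustration; `external_sort`, `chunk_size`, and the temp-file layout are hypothetical names and choices, and it is not the only valid answer.

```python
import heapq
import os
import tempfile


def _write_run(sorted_chunk):
    """Spill one sorted chunk to a temporary file, one integer per line."""
    f = tempfile.NamedTemporaryFile("w", delete=False, suffix=".run")
    with f:
        f.writelines(f"{n}\n" for n in sorted_chunk)
    return f.name


def _read_run(path):
    """Stream integers back from a run file without loading it whole."""
    with open(path) as f:
        for line in f:
            yield int(line)


def external_sort(numbers, chunk_size):
    """Yield the integers in sorted order, holding about chunk_size in memory."""
    run_paths = []
    chunk = []
    try:
        for n in numbers:
            chunk.append(n)
            if len(chunk) >= chunk_size:
                chunk.sort()
                run_paths.append(_write_run(chunk))
                chunk = []
        if chunk:
            chunk.sort()
            run_paths.append(_write_run(chunk))
        # heapq.merge keeps only one pending item per run in memory.
        yield from heapq.merge(*(_read_run(p) for p in run_paths))
    finally:
        for p in run_paths:
            os.remove(p)


print(list(external_sort([5, 3, 9, 1, 4, 8, 2], chunk_size=3)))
```

The trade is the same one the comment describes: engineering effort and disk I/O in exchange for not buying more RAM.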

                  • watersb 11 days ago
                    > I spent $200 to double to 8MB, and then another $200 to double to 16MB, and then finally, $200 to max out the RAM on my machine-- 32MB!

                    Thank you. Now I feel a lot better for dropping $700 on the 32MB of RAM when I built my first rig.

                • nl 11 days ago
                  While saying "we want more efficiency" is great there is a trade off between size and accuracy here.

                  It is possible that compressing and using all of human knowledge takes a lot of memory and in some cases the accuracy is more important than reducing memory usage.

                  For example [1] shows how Gemma 2B using AVX512 instructions could solve problems it couldn't solve using AVX2 because of rounding issues with the lower-memory instructions. It's likely that most quantization (and other memory reduction schemes) have similar problems.

                  As we develop more multi-modal models that can do things like understand 3D video in better than real time it's likely memory requirements will increase, not decrease.


                • xoranth 11 days ago
                  People have engineered solutions to make what is available practical (see all the various quantization schemes that have come out).

                  It is just that there's a limit to how much you can compress the models.
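To make the compression limit concrete, here is a small sketch of symmetric round-to-nearest quantization: each weight is snapped to one of a few evenly spaced levels, and the worst-case error grows as the bit width shrinks. The scheme, the function names, and the sample weights are illustrative assumptions; production quantizers use more elaborate methods than this.

```python
def quantize_roundtrip(values, bits):
    """Quantize to signed integers of the given width, then map back to floats."""
    qmax = 2 ** (bits - 1) - 1              # e.g. 127 for 8 bits, 7 for 4 bits
    scale = max(abs(v) for v in values) / qmax
    if scale == 0:
        return list(values)
    return [round(v / scale) * scale for v in values]


def max_error(values, bits):
    """Largest absolute difference between original and restored values."""
    restored = quantize_roundtrip(values, bits)
    return max(abs(a - b) for a, b in zip(values, restored))


# Made-up weights, purely for illustration.
weights = [0.013, -0.42, 0.27, 0.9, -0.73, 0.051]
for bits in (8, 4, 2):
    print(f"{bits}-bit: max error {max_error(weights, bits):.4f}")
```

Each halving of the bit width saves memory but widens the gap between the stored and the original weights, which is the trade-off the comments above are arguing about.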

                • michaelt 11 days ago
                  There has in fact been a great deal of careful engineering to allow 70 billion parameter models to run on just 48GB of VRAM

                  The people training 70B parameter models from scratch need ~600GB of VRAM to do it!
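A rough back-of-the-envelope sketch of why bit width decides whether a 70B model fits: weight memory is simply parameter count times bytes per parameter. The helper name `weights_gb` and the choice of decimal gigabytes are assumptions for illustration, and the calculation covers weights only, not KV cache, activations, or training state.

```python
def weights_gb(num_params: int, bits_per_param: int) -> float:
    """Memory for the weights alone, in decimal gigabytes (1 GB = 1e9 bytes)."""
    total_bytes = num_params * bits_per_param / 8
    return total_bytes / 1e9


PARAMS_70B = 70_000_000_000

for bits in (16, 8, 4):
    print(f"{bits:>2}-bit weights: {weights_gb(PARAMS_70B, bits):.0f} GB")
```

At 16 bits the weights alone come to 140 GB, while at 4 bits they come to 35 GB, which leaves headroom under 48GB. Training additionally has to hold gradients and optimizer state, which this sketch does not model.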

                • dragonwriter 10 days ago
                  Quantization and CPU mode and hybrid mode where the model is split between CPU and GPU exist and work well for LLMs, but in the end more VRAM is a massive quality of life improvement for running (and probably more for training, which has higher RAM needs and for which quantization isn't useful, AFAIK) them, even if you technically can do them on CPU alone or hybrid with no/lower VRAM requirements.
                • whiplash451 11 days ago
                  By the same logic, we’d still be writing assembly code on 640KB RAM machines in 2024.
                • hnfong 10 days ago
                  What makes you think people aren't trying to engineer a solution that uses less RAM?

                  There are millions (billions?) of dollars at stake here, and obviously the best minds are already tackling the problem. Only plebs like us who don't have the skills to do so bicker on an internet forum... It's not like we could realistically spend the time inventing ways to run inference with fewer resources and make significant headway.

        • loudmax 11 days ago
          I tend to agree that it would be niche. The machine learning enthusiast market is far smaller than the gamer market.

          But selling to machine learning enthusiasts is not a bad place to be. A lot of these enthusiasts are going to go on to work at places that are deploying enterprise AI at scale. Right now, almost all of their experience is CUDA and they're likely to recommend hardware they're familiar with. By making consumer Intel GPUs attractive to ML enthusiasts, Intel would make their enterprise GPUs much more interesting for enterprise.

          • mysteria 11 days ago
            The problem is that this now becomes a long term investment, which doesn't work out when we have CEOs chasing quarterly profits and all that. Meanwhile Nvidia stuck with CUDA all those years back (while ensuring that it worked well on both the consumer and enterprise line) and now they reap the rewards.
            • Wytwwww 11 days ago
              Current Intel and its leadership seem to be much more focused on long term goals/growth than before, or so they claim.
          • resource_waste 11 days ago
            I need offline LLMs for work.

            It doesn't need to be consumer grade, it doesn't need to be ultra high either.

            It needs to be cheap enough for my department to expense it via petty cash.

          • antupis 11 days ago
            It would be the same playbook NVIDIA used with CUDA around 2010, when the market was research labs and hobbyists doing vector calculations.
        • Aerroon 11 days ago
          It's about mindshare. Random people using your product to do AI means that the tooling is going to improve because people will try to use them. But as it stands right now if you think there's any chance you want to use AI in the next 5 years, then why would you buy anything other than Nvidia?

          It doesn't even matter if that's your primary goal or not.

        • talldayo 11 days ago
          > Who will buy it?

          Frustrated AMD customers willing to put their money where their mouth is?

        • resource_waste 11 days ago


          These are noob hardware. A6000 is my choice.

          Which really only further emphasizes your point.

          >CPU based is a waste of everyone's time/effort

          >GPU based is 100% limited by VRAM, and is what you are realistically going to use.

        • jmward01 11 days ago
          Microsoft got where they are because they developed tools that everyone used. They got the developers and the consumers followed. Intel (or AMD) could do the same thing. Get a big card with lots of RAM so that the developers get used to your ecosystem and then sell the enterprise GPUs to make the $$$. It is a clear path with a lot of history and it blows my mind Intel and AMD aren't doing it.
          • zoobab 10 days ago
            "Microsoft got where they are because they developed tools that everyone used."

            It's not like they don't have a monopoly on pre-installed OSes.

        • alecco 11 days ago
          AFAIK, unless you are a huge American corp with orders above $100m Nvidia will only sell you old and expensive server cards like the crappy A40 PCIe 4.0 48GB GDDR6 at $5,000. Good luck getting SXM H100s or GH200.

          If Intel sells a stackable kit with a lot of RAM and a reasonable interconnect a lot of corporate customers will buy. It doesn't even have to be that good, just half way between PCIe 5.0 and NVLink.

          But it seems they are still too stuck in their old ways. I wouldn't count on them waking up. Nor AMD. It's sad.

          • ponector 11 days ago
            Parent comment requested non-enterprise, consumer grade GPU with tons of memory. I'm sure there is no market for this.

            However, server solutions could have some traction.

            • alecco 10 days ago
              Hobbyists are stacking 3090s with NVLink.
      • glitchc 11 days ago
        I think the answer to that is fairly straightforward. Intel isn't in the business of producing RAM. They would have to buy and integrate a third-party product which is likely not something their business side has ever contemplated as a viable strategy.
        • monocasa 11 days ago
          Their GPUs as sold already include RAM.
          • glitchc 11 days ago
            Yes, but they don't fab their own RAM. It's a cost center for them.
            • monocasa 11 days ago
              If they can sell the board with more RAM for more than their extra RAM costs, or can sell more GPUs total but the RAM itself is priced essentially at cost, then it's not a cost center.
            • RussianCow 11 days ago
              That's not what a cost center is. There is an opportunity for them to make more money by putting more RAM into their GPUs and exposing themselves to a different market. Whether they physically manufacture that RAM doesn't matter in the slightest.
              • glitchc 11 days ago
                I agree with you, it makes good business sense. No doubt. I'm merely positing that Intel executives are not used to doing business as integrators. Their core business has always been that of a supplier.
        • actionfromafar 9 days ago
          I wonder what the market dynamics for NVidia would be if Intel bought as much VRAM as it could.
    • chessgecko 11 days ago
      Going above 24GB is probably not going to be cheap until GDDR7 is out, and even that will only push it to 36GB. The fancier stacked GDDR6 stuff is probably pretty expensive and you can’t just add more dies because of signal integrity issues.
      • frognumber 11 days ago
        Assuming you want to maintain full bandwidth.

        Which I don't care too much about.

        However, even 16->24GB is a big step, since a lot of the models are developed for 3090/4090-class hardware. 36GB would place it close to the class of the fancy 40GB data center cards.

        If Intel decides to push VRAM, it will definitely have a market. Critically, a lot of folks will also be incentivized to make software compatible, since it will be the cheapest way to run models.

        • 0cf8612b2e1e 11 days ago
          At this point, I cannot run an entire class of models without OOM. I will take a performance hit if it lets me run it at all.

          I want a consumer card that can do some number of tokens per second. I do not need a monster that can serve as the basis for a startup.

          • hnfong 11 days ago
            A maxed out Mac Studio probably fits your requirements as stated.
            • 0cf8612b2e1e 11 days ago
              If I were willing to drop $4k on that setup, I might as well get the real NVidia offering.

              The hobbyist market needs something priced well under $1k to make it accessible.

        • rnewme 11 days ago
          How come you don't care about full bandwidth?
          • frognumber 11 days ago
            Mostly because I use this for development.

            If a model takes twice as long to run.... I'll live. Worst-case, it will be mildly annoying.

            If I can't run a model, that's a critical failure.

            There's a huge step up CPU->GPU which I need, but 3060 versus 4090 isn't a big deal at all. Indeed, the 24GB versus 16GB is a bigger difference than the number of CUDA cores.

          • Dalewyn 11 days ago
            The thing about RAM speed (aka bandwidth) is that it becomes irrelevant if you run out and have to page out to slower tiers of storage.
    • sitkack 11 days ago
      What is obvious to us, is an industry standard to Product Managers. When is the last time you have seen an industry player upset the status quo? Intel has not changed that much.
    • zoobab 11 days ago
      "It would be poetic to see 32-48GB at a non-eye-watering price point."

      I heard some Asrock motherboard BIOSes could set the VRAM up to 64GB on Ryzen5.

      Doing some investigations with different AMD hardware atm.

      • stefanka 11 days ago
        That would be interesting information. Which MB works with which APU with 32 or more GB of VRAM? Can you post your findings please?
      • LoganDark 11 days ago
        When has an APU ever been as fast as a GPU? How much cache does it have, a few hundred megabytes? That can't possibly be enough for matmul, no matter how much slow DDR4/5 is technically addressable.
        • zoobab 10 days ago
          "APU ever been as fast as a GPU"

          Ryzen5 has both CPU+GPU on one chip, and the BIOS allows you to set the amount of VRAM. They share the same RAM bank, so you can set 16GB of VRAM and 16GB for the OS if you use a 32GB RAM bank.

          • LoganDark 10 days ago
            What I'm saying is that GPUs rely on having the memory close to the die so that it actually has enough bandwidth to saturate the cores. System memory is not very close to the CPU (compared to GPUs), so I have doubts about whether an APU would be able to reach GPU levels of performance over gigabytes of model weights.
    • belter 11 days ago
      AMD making drivers of high quality? I would pay to see that :-)
    • haunter 11 days ago
      First crypto then AI, I wish GPUs were left alone for gaming.
      • talldayo 11 days ago
        Are there actually gamers out there that are still struggling to source GPUs? Even at the height of the mining craze, it was still possible to backorder cards at MSRP if you were patient.

        The serious crypto and AI nuts are all using custom hardware. Crypto moved onto ASICs for anything power-efficient, and Nvidia's DGX systems aren't being cannibalized from the gaming market.

      • azinman2 11 days ago
        Didn’t nvidia try to block this in software by slowing down mining?

        Seems like we just need consumer matrix math cards with literally no video out, and then a different set of requirements for those with a video out.

        • wongarsu 11 days ago
          But Nvidia doesn't want to make consumer compute cards because those might steal market share from the datacenter compute cards they are selling at 5x markup.
          • talldayo 11 days ago
            Nvidia does sell consumer compute cards, they're just sold at datacenter compute prices.

            Nvidia's approach to software certainly deserves scrutiny, but their hardware lineup is so robust that I find it hard to complain. Jetson already exists for low-wattage solutions, and gaming cards can run Nvidia datacenter drivers on headless Linux without issue. The consumer compute cards are already here, you just aren't using them.

          • dragonwriter 10 days ago
            What we need is more real competition for NVidia cards at all levels, so rather than avoiding competing with themselves, they are worried about actually competing with the competition.

            But capital doesn't want to invest in competition when it can instead invest in a chance at a moat that allows charging monopoly rents.

      • baq 11 days ago
        They were.

        But then those pesky researchers and hackers figured out how to use the matmul hardware for non-gaming.

    • OkayPhysicist 11 days ago
      The issue from the manufacturer's perspective is that they've got two different customer bases with wildly different willingness to pay, but not substantially different needs from their product. If Nvidia and AMD didn't split the two markets somehow, then there would be no cards available to the PC market, since the AI companies with much deeper pockets would buy up the lot. This is undesirable from the manufacturer's perspective for a couple reasons, but I suspect a big one is worries that the next AI winter would cause their entire business to crater out, whereas the PC market is pretty reliable for the foreseeable future.

      Right now, the best discriminator they have is that PC users are willing to put up with much smaller amounts of VRAM.

    • UncleOxidant 11 days ago
      > Intel definitely seems to be doing all the right things on software support.

      Can you elaborate on this? Intel's reputation for software support hasn't been stellar, what's changed?

    • whalesalad 11 days ago
      still wondering why we can't have gpu's with sodimm slots so you can crank the vram
      • amir_karbasi 11 days ago
        I believe that the issue is that graphics cards require really fast memory. This requires close memory placement (that's why the memory is so close to the core on the board). Expandable memory will not be able to provide the required bandwidth here.
        • frognumber 11 days ago
          The universe used to have hierarchies. Fast memory close, slow memory far. Registers. L1. L2. L3. RAM. Swap.

          The same thing would make a lot of sense here. Super-fast memory close, with overflow into classic DDR slots.

          As a footnote, going parallel also helps. 8 sticks of RAM at 1/8 the bandwidth each is the same as one stick of RAM at 8x the bandwidth, if you don't multiplex onto the same traces.

          • riskable 11 days ago
            It's not so simple... The way GPU architecture works is that it needs as-fast-as-possible access to its VRAM. The concept of "overflow memory" for a GPU is your PC's RAM. Adding a secondary memory controller and equivalent DRAM to the card itself would only provide a trivial improvement over, "just using the PC RAM".

            Point of fact: GPUs don't even use all the PCI Express lanes they have available to them! Most GPUs (even top of the line ones like Nvidia's 4090) only use about 8 lanes of bandwidth. This is why some newer GPUs are being offered with M.2 slots so you can add an SSD.

          • wongarsu 11 days ago
            GPUs have memory hierarchies too. A 4090 has about 16MB of L1 cache and 72MB of L2 cache, followed by the 24GB of GDDR6 RAM, followed by host ram that can be accessed via PCIe.

            The issue is that GPUs are massively parallel. A 4090 has 128 streaming multiprocessors, each executing 128 "threads" or "lanes" in parallel. If each "thread" works on a different part of memory that leaves you with 1kB of L1 cache per thread, and 4.5kB of L2 cache each. For each clock cycle you might be issuing thousands of requests to your memory controller for cache misses and prefetching. That's why you want insanely fast RAM.

            You can write CUDA code that directly accesses your host memory as a layer beyond that, but usually you want to transfer that data in bigger chunks. You probably could make a card that adds DDR4 slots as an additional level of hierarchy. It's the kind of weird stuff Intel might do (the Phi had some interesting memory layout ideas).
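The per-thread cache arithmetic in the comment above can be checked in a few lines. The sketch takes the comment's figures at face value (128 SMs, 128 lanes each, 16MB of L1, 72MB of L2) rather than verifying them against a spec sheet, and the constant and function names are hypothetical.

```python
KIB = 1024
MIB = 1024 * KIB

# Figures as quoted in the comment, not independently verified.
SM_COUNT = 128
LANES_PER_SM = 128
L1_BYTES = 16 * MIB
L2_BYTES = 72 * MIB


def cache_per_thread(cache_bytes, sm_count=SM_COUNT, lanes_per_sm=LANES_PER_SM):
    """Bytes of cache available to each lane if all lanes touch distinct data."""
    return cache_bytes / (sm_count * lanes_per_sm)


l1_share = cache_per_thread(L1_BYTES)
l2_share = cache_per_thread(L2_BYTES)
print(f"threads in flight: {SM_COUNT * LANES_PER_SM}")
print(f"L1 per thread: {l1_share / KIB:.1f} KiB")
print(f"L2 per thread: {l2_share / KIB:.1f} KiB")
```

With 16,384 lanes in flight the shares work out to 1 KiB of L1 and 4.5 KiB of L2 per lane, matching the comment and showing how little cache stands between each lane and main memory.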

        • magicalhippo 11 days ago
          Isn't part of the problem that the connectors add too much inductance, making the lines difficult to drive at high speed? Similar issue to distance I suppose but more severe.
      • riskable 11 days ago
        You can do this sort of thing but you can't use SODIMM slots because that places the actual memory chips too far away from the GPU. Instead what you need is something like BGA sockets, which are stupidly expensive (e.g. $600 per socket).
      • justsomehnguy 11 days ago
        Look at the motherboards with >2 Memory channels. That would require a lot of physical space, which is quite restricted on a 50 y/o standard for the expansion cards.
      • chessgecko 11 days ago
        You could, but the memory bandwidth wouldn’t be amazing unless you had a lot of sticks and it would end up getting pretty expensive
  • Hugsun 11 days ago
    I'd be interested in seeing benchmark data. The speed seemed pretty good in those examples.
  • captaindiego 11 days ago
    Are there any Intel GPUs with a lot of vRAM that someone could recommend that would work with this?
    • Aromasin 11 days ago
      There's the Max GPU (Ponte Vecchio), their datacentre offering, with 128GB of HBM2e memory, 408 MB of L2 cache, and 64 MB of L1 cache. Then there's Gaudi, which has similar numbers but with cores specific for AI workloads (as far as I know from the marketing).

      You can pick them up in prebuilds from Dell and Supermicro.

    • goosedragons 11 days ago
      For consumer stuff there's the Intel Arc A770 with 16GB VRAM. More than that and you start moving into enterprise stuff.
      • ZeroCool2u 11 days ago
        Which seems like their biggest mistake. If they would just release a card with more than 24GB VRAM, people would be clamoring for their cards, even if they were marginally slower. It's the same reason that 3090's are still in high demand compared to the 4090's.
  • DrNosferatu 11 days ago
    Any performance benchmark against 'llamafile'[0] or others?

    [0] -

  • donnygreenberg 11 days ago
    Would be nice if this came with scripts which could launch the examples on compatible GPUs on cloud providers (rather than trying to guess?). Would anyone else be interested in that? Considering putting it together.
  • antonp 11 days ago
    Hm, no major cloud provider offers intel gpus.
    • belthesar 11 days ago
      Intel GPUs got quite a bit of penetration in the SE Asian market, and Intel is close to releasing a new generation. In addition, Intel's allowing for GPU virtualization without additional license fees (unlike Nvidia and GRID licenses), allowing hosting operators to carve up these cards. I have a feeling we're going to see a lot more Intel offerings available.
    • VHRanger 11 days ago
      No, but for consumers they're a great offering.

      16GB RAM and performance around a 4060ti or so, but for 65% of the price

      • _joel 11 days ago
        and 65% of the software support, or less, I'm inclined to believe? Although having more players in the fold is definitely a good thing.
        • VHRanger 11 days ago
          Intel is historically really good at the software side, though.

          For all their hardware research hiccups in the last 10 years, they've been delivering on open source machine learning libraries.

          It's apparently the same on driver improvements and gaming GPU features in the last year.

          • frognumber 11 days ago
            I'm optimistic Intel will get the software right in due course. Last I looked, it wasn't all there yet, but it was on the right track.

            Right now, I have a nice NVidia card, but if things stay on track, I think it's very likely my next GPU might be Intel. Open-source, not to mention better value.

          • HarHarVeryFunny 11 days ago
            But even if Intel have stable optimized drivers and ML support, it'd still need to be supported by PyTorch/etc for most developers to want to use it. People want to write at high level, not at CUDA-type level.
            • VHRanger 11 days ago
              Intel is supported in PyTorch, though. It's supported from their own branch, which is presumably a big annoyance to install, but it does work.
              • HarHarVeryFunny 11 days ago
                I just tried googling for Intel's PyTorch, and it's clear as mud as to exactly what's run on the GPU and what is not. I assume they'd be bragging about it if this ran everything on their GPU the same as it would on NVIDIA, so I'm guessing it just accelerates some operations.
    • anentropic 11 days ago
      Lots offer Intel CPUs though...
  • Valerie_Wilson 11 days ago
  • tomrod 11 days ago
    Looking forward to reviewing!