> On the desktop scene, AMD’s Ryzen 7950X3D mixes cache configurations and has mild clock speed differences. For portable devices, the Ryzen Z1 here mixes cores with different physical designs and has larger clock speed deltas across the chip. AMD’s other consumer offerings do not use hybrid configurations.
There are also desktop APUs using a combination of Zen 4 and Zen 4c cores (Ryzen 3 8300G, Ryzen 5 8500G), as well as other mobile CPUs doing the same (Ryzen 3 7440U, Ryzen 5 7545U).
I hate these hybrid configurations. I recently rebuilt my desktop and jumped from Intel to AMD (7950X, not 7950X3D) specifically because I don't want a hybrid desktop CPU. All of Intel's 14th-gen desktop offerings are hybrid. I hope AMD doesn't get on that bandwagon.
I do a lot of highly parallelized image processing and preprocessing for machine learning and I don't want any cores holding other cores back.
As your workload grows in thread count, processors become limited by memory bandwidth per socket, as well as by power and thermal constraints. The 'big' cores can only hit their peak frequency, power draw, and IPC when they're not in a heavy multicore workload, so an all-big-core design ends up wasting silicon area and power. If instead you make the last several cores to be turned on smaller and optimized for a lower power state, you can fit more of them in and achieve higher overall thread throughput for a given die size or power/thermal envelope.
So imagine you set a given budget for silicon die size and chip power. You might then have three design possibilities: say 8 big cores, 16 little, or 12 in a hybrid mix. The hybrid design should be able to match the all-big design in single/lightly-threaded loads while beating it in highly multithreaded ones. These tradeoffs of course work best in form factors constrained by cost or battery power, which is where you see AMD using them in consumer parts, or by space/power, which is where you see AMD offering small/dense core options in EPYC.
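The area-budget argument above can be sketched as a toy model. All numbers here are invented for illustration (they are not real core sizes or benchmark figures), but they reproduce the comment's three hypothetical designs: 8 big, 16 little, 12 hybrid.

```python
# Toy model of the die-area tradeoff: big cores are faster per core,
# small cores are cheaper per unit of throughput. Numbers invented.

AREA_BUDGET = 16.0                          # arbitrary area units per die

BIG = {"area": 2.0, "throughput": 1.0}      # full-size core
SMALL = {"area": 1.0, "throughput": 0.6}    # density-optimized core

def config_throughput(n_big, n_small):
    """Aggregate multithreaded throughput, if the config fits the budget."""
    area = n_big * BIG["area"] + n_small * SMALL["area"]
    assert area <= AREA_BUDGET, "config exceeds area budget"
    return n_big * BIG["throughput"] + n_small * SMALL["throughput"]

# The three designs from the comment: all-big, all-little, hybrid.
print(config_throughput(8, 0))   # 8 big:            8.0
print(config_throughput(0, 16))  # 16 little:       ~9.6, but no fast cores
print(config_throughput(4, 8))   # 12-core hybrid:  ~8.8, keeps 4 fast cores
```

With these made-up numbers the hybrid beats the all-big design on multithreaded throughput while still having big cores for lightly-threaded work, which is the tradeoff the comment describes.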
And what happens if you run something that's embarrassingly parallel and you need all the juice you can get from your cores? With hybrids you have to rely on the OS to decide when it will utilize everything and when it won't, which in some heavy loads could cost you performance-wise.
The OS doesn't decide how many threads the application spawns. Is there a hybrid system you're aware of where the OS will refuse to schedule N threads across N cores due to the cores not all being the same type?
The only such example I'm aware of is Intel's recently launched Meteor Lake laptop chips, which include an extra two low-power E cores on a separate chiplet from the rest of the CPU cores, running at half the performance of the regular E cores and with no L3 cache. Whether they're used or not only makes a few percent difference to embarrassingly parallel workloads.
I didn’t say that the OS refuses to utilize cores. But I’ve seen examples of embarrassingly parallel tasks that utilize 80% of the CPU running faster on CPUs with only performance cores when compared to hybrids with the same number of threads. There might be some other reason for that, but my gut feeling is to stick with all-performance designs.
> I’ve seen examples of embarrassingly parallel tasks that utilize 80% of the CPU running faster on CPUs with only performance cores when compared to hybrids with the same number of threads
The "same number of threads" bit is throwing me off. A hybrid config will generally have more CPU cores allowing it to handle more threads for a chip of the same size, so testing with a fixed thread count can only tell you that the performance of each small core is less than the performance of a large core. But maybe the examples you have in mind are complicated by SMT/HyperThreading (which Intel has on P cores but not E cores, while AMD has it on both)?
> The 'big' cores can only perform at their peak frequency, power draw and IPC when not in a high multicore workload.
This is really not true. With a decent motherboard and sufficient cooling, you can force the big cores to stay virtually locked at peak clock speeds. Der8auer does it for a Ryzen 8000 APU here just as an example: https://youtu.be/VNYx72Elgss?t=335
Putting a laptop chip into a desktop platform and giving it water cooling and unlimited power delivery isn't a particularly strong counterexample. That's nowhere near the operating point the silicon was designed for, and the fact that AMD shipped a few SKUs that enable such usage as an afterthought doesn't disprove much.
It is hybrid but the only difference is that the little cores are clocked lower.
It is pretty common for a CPU to have different clock speeds depending on how many cores are active (not to mention AVX clocks, so clock speed is just a thing that bounces around). This doesn’t seem drastically different.
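The per-active-core clock behavior mentioned above is usually expressed as a turbo table: the more cores are active, the lower the allowed boost clock. A toy sketch (all frequencies invented, not any real SKU's table):

```python
# Hypothetical turbo table: max boost clock (GHz) allowed for up to
# N active cores. Numbers are made up for illustration.
BOOST_TABLE = {1: 5.7, 2: 5.7, 4: 5.5, 8: 5.2, 16: 4.9}

def boost_ghz(active_cores):
    """Highest clock allowed for a given number of active cores."""
    eligible = [ghz for n, ghz in BOOST_TABLE.items() if active_cores <= n]
    return max(eligible)

print(boost_ghz(1))   # -> 5.7  (single-core boost)
print(boost_ghz(6))   # -> 5.2  (falls into the 8-core bin)
print(boost_ghz(16))  # -> 4.9  (all-core clock)
```

Seen this way, a hybrid chip's little cores are just another, steeper step in a clock curve that already varies with load.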
I concur with this and did the same. I think the schedulers are still not mature enough to properly handle these slower efficiency cores, and then you wind up getting frustrated when one of your tasks gets offloaded to one or more of these unexpectedly.
The interesting thing is that even on highly dense workloads, the benchmarks and real world tests are showing that these hybrid core architectures are doing surprisingly great things.
I have more reading to do on the AMD side, so if you have more info than I do, take this with a grain of salt, but at least on the Intel side of the camp, the performance of a fully saturated E core is only slightly worse than its P core buddy, with 2x E cores meeting or exceeding the work output of a single P core + its hyperthread.
Even if the only purpose for the E cores is to run OS and other background workloads while your big processing job is getting unfettered access to the P core array, I imagine you'd see pretty great gains here.
I've owned a bunch of expensive ultrabooks, like a $2500 Samsung (whatever it was) with an aircraft-aluminum frame, and Dell XPS 13zs, all with some form of Intel low-power laptop chip.
They were all horrible to use. They constantly throttled, to the point that I eventually ripped the Dell open and repasted everything inside of it, including de-lidding the CPU. I had to refoil and do a bunch of other stuff you wouldn't have to do on a desktop just to repaste it. It was honestly nerve-wracking, and I've done all sorts of custom desktop PC water-loop pasting stuff. Point being, it was an awful thing to have to do, but the XPS ran SO bad I did it anyway. I gave the Dell to my mom and I think I gave that Samsung away too; I definitely don't have it anymore. And then I was done with Intel ULV procs.
That Dell XPS 13z was the prettiest, nicest-form-factor laptop I've used, and it SUCKED, and it was so sad.
2 months ago I bought an Ally and I was absolutely terrified that it would run the same as those.
It runs GREAT. It feels almost like using my AMD Ryzen 7700X, obviously not the same power, but it's snappy and I don't mind using it as a laptop in bed at all. They did a great job. I have mine with a case on it that has a laptop keyboard/trackpad and it holds the screen up in bed, and I just use a travel mouse with it. I do wish the screen was larger though, so I wonder if I'd be happier with the Legion.
> I think they should not have marketed the device for gaming.
What do you mean by this? Like you think it's better as a productivity or media consumption device? I never use my Ally for watching anything because the screen's so small and I have an iPad. I kind of figured I'd use it more if I got the Legion instead.
I have plenty other computers that are better suited for the type of gaming I do.
I imagined using the device, with its stand, on my tummy. Mostly reading but also hosting stuff. It's about 650 grams without the controllers, so not ideal, but better than my 1.4 kg laptop. Better than a phone.
Minisforum is about to release a high end Ryzen 8000 14 inch tablet real soon. Still a kilo, though.
I don't remember exactly what I did, but I had to pry apart a bunch of copper heat-spreader tubes and things like that, and I have a circle of copper heat-spreader material and other weird stuff I haven't needed on other computers. I may not have delidded it, but it was a very harrowing job. As far as I know, according to Reddit and wherever I was reading Dell XPS forum stuff, I was actually the first person to take it apart and do it. I wrote a tutorial about it, I THINK on Reddit. IIRC I had to remove the entire main board from the chassis and stuff.
edit: Oh, I found a 2022 post about it here on HN. I guess I didn't de-lid it. I took pictures of everything for the tutorial, but no idea where that'd be; Reddit's search for "dell 13z" is worthless. Wow, I didn't have that laptop very long. I absolutely hated using it, I wouldn't even use it in bed, and I DID get it replaced at one point hoping it was just a bad unit.
This is giving me terrible flashbacks. I even broke a few things; there were hidden screws holding the board down that you had to somehow remove from the underside, or you'd break everything when you lifted the board up. I had NO idea they were there, and everything was super fragile.
Like the post says near the end, this seems like at least partly the result of AMD not being the size of Intel or Apple, so not being able to have the staffing to create, validate, etc. two radically different large and small designs. So instead they tweak the synthesis of one design to lower max freq in exchange for area. There's less perf downside to using these cores, but less area/power upside.
The perf impact of this or any hybrid design is also blunted by the power/cooling limits on non-hybrid designs: you can't afford to blast all the full-size cores on the chip at the max theoretical frequency at once anyway. Then, at the other end, lightly-threaded workloads can run mostly or only on the larger cores, so they're not heavily impacted either.
If there's a way this ends up fun for consumer folks that follow this stuff, it'd probably be by allowing products w/more cores. In server chips using Zen 4c (Bergamo), more cores is explicitly the focus. On their desktop platform, one CCD of full-sized cores and one of smaller cores would allow a 24C two-CCD package. That would add another "flavor" of CPU, the way X3D is another "flavor", but I suspect there are applications where the extra throughput could help. No particular hints that this is part of their plans, to be clear.
Another part of it is that AMD's Zen core is a much more area-efficient design than Intel's Raptor Cove (et al.) core. Raptor Cove is 7.43 mm^2 (Intel 7, excluding L2/L3 cache) while Zen 4 is 3.84 mm^2 (TSMC N5, excluding L3 cache). Intel 7 and TSMC N5 have comparable densities, though the numbers have changed over time and depend on the transistor libraries used, etc. AMD's core is close to a clean-sheet redesign and, as Gelsinger has admitted, Intel has lost design leadership.
The Zen4c core has a comparable size to the Gracemont E-cores, so they don't really need a separate design.
Right, more bandwidth would really help mobile iGPUs, maybe more than it'd help many-core CPUs. I'm not sure if they want to sell super-beefy consumer iGPUs, or if AM5 is physically capable of having more channels. But it would be shiny, and seems like it's worked out in the large Apple SoCs.
Would've been cool if Asus had really tried to optimize the Ally around the Z1 with faster RAM, higher GPU clocks, and better battery life instead of offering the Z1 Extreme in an upsell. It's going for $400 now, which would've been real competition a year ago. Hard to recommend it now over the OLED Deck.
Will be interesting to see AMD try this in consumer processors, with frame time graphs.
Whilst in some ways I'd prefer OLED, the 120Hz display with high nits output on the Ally has been phenomenal for retro titles using BFI. In fact it's been so good I don't think I could give it up now. It's a pity, because in every other way it's not really that ideal for me. I'm a sucker for that motion resolution, though.
You need 120Hz minimum to be able to do BFI for a title running at 60fps.
BFI stands for "Black Frame Insertion". It enhances motion resolution on LCD/OLED/sample-and-hold displays by inserting a black frame or blackout between regular frames. This reduces motion blur by shortening the time each frame is visible, making fast-moving content appear sharper and clearer.
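The mechanism is easy to see as a refresh schedule. This is a purely illustrative sketch (not a real display driver): 60 fps content on a 120 Hz panel, with every other refresh replaced by black.

```python
# Toy BFI schedule: which content frame (or black) each panel
# refresh shows, for 60 fps content on a 120 Hz display.

REFRESH_HZ = 120
CONTENT_FPS = 60

def bfi_schedule(n_refreshes):
    """Return what the panel shows on each refresh: frame index or 'black'."""
    schedule = []
    for r in range(n_refreshes):
        if r % (REFRESH_HZ // CONTENT_FPS) == 0:
            schedule.append(r * CONTENT_FPS // REFRESH_HZ)  # content frame
        else:
            schedule.append("black")                        # inserted black frame
    return schedule

print(bfi_schedule(6))  # [0, 'black', 1, 'black', 2, 'black']

# Each frame is lit for one refresh (1/120 s) instead of two (1/60 s),
# halving persistence and therefore perceived motion blur.
print(f"persistence per frame: {1000 / REFRESH_HZ:.2f} ms")  # ~8.33 ms vs ~16.67 ms
```

This is also why the 120Hz minimum mentioned above exists: with a 60Hz panel there is no spare refresh slot to blank.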
It effectively mimics older CRT monitors, improving the perception of motion. Which for me is huge, because I still game on CRTs regularly, and LCDs' abysmal handling of motion is a major detriment to my enjoyment of such things, especially for titles that were designed and produced for displays with much clearer motion.
It's a niche use case, I'll admit, and I'd never claim otherwise; it's not for everybody. But it's funny that without that capability, I'd trade it in for an OLED Deck tomorrow.
That’s simply not an issue. I’m not trying to simulate a CRT, I’m trying to avoid nausea inducing sample and hold blur. Which BFI does a good enough job of. Sample and hold displays including OLED are to me just both smeary messes at anything below 240hz.
I also use BFI with my OLED and sure, BFI at 240hz is better but it’s not available to me in a handheld.
I don’t perceive any flicker at all, not even on my CRTs, and never have. But oddly, I find the motion blur induced by sample-and-hold displays nauseating. I’m really sensitive to it for some reason.
I think the best way to put it is that currently, despite their drawbacks, BFI/display strobing is about as close as we can get, but it's not perfect. Both come with drawbacks, especially the BFI I'm using (which is built into RetroArch).
It gets you really close, but not all the way to the motion clarity of a CRT. It's not at the point yet where I would remove the CRTs from my desk, but give it another few years of brightness and resolution increases in 27"-and-under OLEDs and I could almost be convinced to finally retire them.
Funnily enough, VR headsets might be some of the best displays for retro gaming due to their ability to do really good low-persistence strobing without crosstalk.
Another drawback I forgot to mention: if you couldn't tolerate 60Hz on a CRT, you may find the same with BFI/strobing, so again, that rules it out for a lot of people.
This is interesting. The article mentions that Zen 4c has the same architecture as Zen 4 but is optimized for density, running at a lower frequency.
A question, if anyone knows the answer: does high frequency require significantly more transistors? And does optimizing for density also mean less power consumption (assuming both Zen 4 and Zen 4c run at the same frequency)?
For a given process design kit (PDK), the synthesis tool has a few different types of transistors available. They correspond to different trade-offs between size and power on one hand, and speed on the other. The faster the transistor, the bigger it is and the more it leaks (a lower threshold voltage means faster switching, but more leakage).
For a given target frequency, the synthesis tool will always use the most efficient transistors it can, and the result is a mix of the few available types. But the higher the frequency, the higher the proportion of faster and bigger transistors in the mix.
This is the bird's eye view and very simplified, but hopefully enough to get the idea ;)
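A crude sketch of that selection process, with two cell flavors and invented delay/area/leakage numbers (real synthesis is path-based static timing analysis, far more involved than this):

```python
# Toy multi-Vt cell selection: use the small/frugal cell wherever it
# still meets timing, and fall back to the fast/leaky cell only on
# paths that would otherwise miss the clock. Numbers invented.

CELLS = {
    "lvt_fast": {"delay": 1.0, "area": 1.6, "leakage": 4.0},  # low Vt: fast, leaky
    "hvt_slow": {"delay": 1.6, "area": 1.0, "leakage": 1.0},  # high Vt: slow, frugal
}

def pick_cells(path_delays, clock_period):
    """For each logic path (its delay when built from slow cells),
    pick the cheapest cell flavor that fits in the clock period."""
    choice = []
    for slow_delay in path_delays:
        if slow_delay <= clock_period:
            choice.append("hvt_slow")   # slow cell still meets timing
        else:
            choice.append("lvt_fast")   # critical path: pay area and leakage
    return choice

paths = [0.8, 1.2, 1.9, 2.4]
print(pick_cells(paths, clock_period=1.5))  # tight clock: some fast cells needed
print(pick_cells(paths, clock_period=2.5))  # relaxed clock: all slow cells
```

Relaxing the clock target (the Zen 4c approach, roughly) lets more of the design use the small, low-leakage cells, which is where the area and power savings come from.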
It may require more transistors in some places, e.g. in longer buffer chains needed to drive greater capacitances at higher frequencies, but it requires mostly bigger transistors in many places.
According to AMD, both the big core and the small core use the same RTL design, but a different physical design, i.e. they use different libraries of standard cells (optimized either for high speed or for low area and low power consumption) and different layouts in the custom parts.
My understanding is that AMD approaches the challenge of high core counts for multithreaded tasks, plus high frequency for single- or lightly-threaded tasks, in a very different way from Intel.
Intel goes with: here are some real beefy cores that can do anything, and here are some weaker cores that can only do some tasks.
AMD goes: here is half the cores, which can go real fast, and here is the other half, which must stay slower, but every core can do everything.
In theory, Intel could have better perf if software is optimized for it, while AMD could have better perf with any generic random app out there... as long as the OS has enough hints to put the right app on the right core, and bothers to do it.
I think it's much less a philosophical difference and much more about what they had lying around. Intel had Atom core designs available to pair up with their desktop cores, and combining them into one chip was clearly a rush job rather than the plan from the start.
On the other hand, AMD only really has their Zen series of cores to use, but they rely more than Intel on automated layout tools so they can more easily port designs to a different fab process or do a second physical layout of the same architecture on the same process.
They don't require more transistors, they require bigger transistors. Ideally, if transistor A is driving a line with twice the capacitance attached compared to transistor B, transistor A would be twice as wide and so have twice the drive current of transistor B. But of course making transistors bigger increases the capacitive load of driving them, so you solve for an equilibrium, trading off the current-to-capacitance ratio against total chip size. Zen 4 and Zen 4c simply choose different ratios to optimize for.
Back in the day this was due to the intrinsic capacitance of the transistors themselves; these days it's more because bigger transistors are further apart, leading to more wire capacitance.
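The sizing equilibrium can be illustrated with a toy RC delay model (arbitrary units, all constants invented): a stage's delay is roughly load capacitance over drive current, and widening a transistor by k multiplies both its drive and its own input capacitance by k, so the previous stage sees a heavier load.

```python
# Toy buffer-chain delay model. Delay of one stage ~ C_load / I_drive;
# a width-w stage has drive ~ w and presents input capacitance ~ w
# to whatever drives it.

def stage_delay(width, load_cap):
    """Delay of one stage driving load_cap (arbitrary units)."""
    return load_cap / width

def chain_delay(widths, final_load, cin_per_width=1.0):
    """Total delay of a buffer chain: each stage drives the next
    stage's input capacitance; the last stage drives the external load."""
    total = 0.0
    for i, w in enumerate(widths):
        load = final_load if i == len(widths) - 1 else cin_per_width * widths[i + 1]
        total += stage_delay(w, load)
    return total

# Driving a big load (cap = 64) from a width-1 gate:
print(chain_delay([1.0], 64.0))             # one weak stage: 64.0
print(chain_delay([1.0, 4.0, 16.0], 64.0))  # geometric upsizing: 4 + 4 + 4 = 12.0
```

Geometrically upsizing the chain minimizes total delay at the cost of area, the classic result behind the "solve for an equilibrium" point above; a lower-frequency design like Zen 4c can simply afford fewer and smaller buffers.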
It is very rare to be able to fuse off part of a CPU core. Fusing off part of its cache is common, but other than that the only example that comes to mind is some server CPUs where Intel fused off the third vector unit.
High clock rates mean less logic can settle in each clock cycle, so the same logic has to be split into shorter pipeline stages, with extra registers and repeater buffers to make timing. Zen 4c significantly dropped the max frequency target, so timing is much easier to close and far less chip area is spent on buffering transistors.
Otoh, modern power management involves clock gating: turning off the clock in specific regions that aren't being used at the moment. Having fewer independently gated regions makes that less granular and potentially less effective.
Others' points about individual transistors being smaller in a lower-frequency design also apply. There may be other complementary benefits from lowering the frequency target too.
But note, it's not magic. The Zen4c server parts, where design area had been most disclosed, use a lot less space per core, and for L1 cache, but L2 and L3 cache take about the same area per byte as on Zen4.
Yeah, I think high frequency requires more transistors for buffering signals. Also, reducing cache speeds and sizes allows simpler, smaller designs to be used. Finally, reduced frequencies mean you don't need as high a voltage to force signals to 0 or 1 quickly, so you need less power. All of this gives Zen 4c lower power consumption at the same frequency.
Apple has simply designed the CPU cores with the highest IPC of any known so far.
The highest IPC is the best choice for minimum energy consumption in a CPU with few cores, like a mobile phone or light laptop CPU, because it allows the same performance at a lower clock frequency.
A higher IPC requires more transistors and more money spent for the design of the core. Apple has been rich enough to afford both the design cost and the cost of being able to use exclusively more advanced manufacturing processes.
When the IPC is increased to very high values, the area and the power consumption are increased by greater factors than the increase in performance.
Therefore, for a CPU with many cores, like a server CPU, the highest IPC is not the best. There will be a lower IPC that is better, because it allows packing more cores in a given chip area and a given thermal design power, leading to a higher aggregate performance for multi-threaded tasks.
The optimum IPC depends on the characteristics of the CMOS manufacturing process. Unfortunately, the design rules of the up-to-date CMOS processes are secret and in recent years the CPU manufacturers have published much less information about their designs than it was customary decades ago.
Therefore, based on the publicly available information it is impossible to estimate which would be the optimum IPC for a CPU with many cores.
Nevertheless, it is likely that the IPC of the big Apple cores is too high and even the IPC of the big Intel and AMD cores is likely to be too high.
The IPC of the Zen 4c cores may be closer to optimal in the current technology, but the companies which design CPU cores probably cannot afford to explore enough of the design space to determine which would be a really optimum IPC.
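The argument above, that per-core area grows faster than per-core performance so aggregate throughput peaks at a middling IPC, can be sketched with a Pollack's-rule-flavored toy model. Every constant here is invented for illustration; only the shape of the tradeoff matters.

```python
# Toy "optimum IPC" model: assume core area grows with the square of
# single-thread performance (Pollack's rule, inverted), plus a fixed
# per-core overhead (private cache, interconnect stop, etc.).
# Under a fixed die budget, aggregate throughput then peaks at a
# middling IPC rather than at the maximum achievable IPC.

DIE_BUDGET = 100.0   # arbitrary area units
OVERHEAD = 1.0       # fixed area per core
K = 0.25             # area cost coefficient for IPC

def core_area(ipc):
    return OVERHEAD + K * ipc ** 2

def aggregate_throughput(ipc):
    n_cores = DIE_BUDGET / core_area(ipc)   # how many cores fit
    return n_cores * ipc                    # total multithreaded perf

# Scan IPC from 0.5 to 8.0; the peak lands at sqrt(OVERHEAD / K) = 2.0.
best = max((aggregate_throughput(i / 10), i / 10) for i in range(5, 81))
print(best)  # -> (100.0, 2.0)
```

Pushing IPC past the peak still helps single-thread performance, which is the comment's point: the best IPC for a phone or laptop chip is not the best IPC for a many-core server chip.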
Even among single-threaded tasks, some workloads offer high instruction level parallelism that can be exploited by a wide and slow core while others consist of long dependency chains that minimize instruction level parallelism and are better served by a more narrow core that can run at much higher clock speeds.
You are right, but in the context of CPU cores the normal optimization criteria are the ratio between performance and energy consumption, and the ratio between performance and price. The price is supposed to be determined by the manufacturing cost, which in turn, for a given manufacturing process, is determined by die area, so the second ratio becomes the ratio between performance and die area.
If economic criteria like money spent on the design are used for optimization, then it becomes pretty much impossible to compare competing companies, because almost every CPU design could be considered to have an optimum IPC, no matter how bad it is, on the grounds that it is the best that could be achieved within that budget, by that design team.
Because Apple can sell expensive products, the die area and the cost of the manufacturing process had little importance for them. Because their products use only few cores, the area of one core was not constrained, because the total die area remained below the process limits, when the core area is multiplied by the number of cores.
So their main optimization criterion has been the ratio between performance and energy consumption for a single core, without constraints on area or power dissipation per core. In this case the IPC is not constrained and there is no optimization problem for determining it: the higher the IPC, the better. Therefore, unlike the designer of cores for a server CPU, Apple has designed the cores with the highest IPC they could achieve within the project schedule and budget, in a given TSMC process.
Cracking answer, thanks. Just a not-so-small question though: increasing IPC makes complexity, and therefore price, grow far faster than linearly. Current x86 CPUs already have enormous out-of-order scheduling windows (something like 200 instructions) and similarly sized retirement windows (the ROB, I think). In your view, is throwing loads of money at the problem a good solution? I feel it may come back to bite them later.
The problem is that what really counts for someone who purchases a CPU are the 2 optimization criteria that I have mentioned above, which determine how much money will be spent initially for buying it and how much money will be spent over its lifetime for its energy consumption, both directly and indirectly (for cooling, occupied volume, carrying weight).
On the other hand, the customers, except for the biggest ones who might receive samples for in-house testing before purchasing, cannot estimate those costs accurately, because the vendors avoid providing adequate information, so the customers must depend on things like published benchmarks and extrapolations from older systems.
So the CPU vendors have an incentive to skew their designs to get good results in some popular benchmarks, even when this policy results in designs that are inferior by better criteria.
For example, for many years Intel included the secure hash extensions only in their Atom CPUs, even though they would have been more useful in their other CPUs. This was because Geekbench included a SHA subtest, the ARM CPUs already had SHA instructions, and Intel wanted their Atom CPUs to reach Geekbench scores competitive with ARM. Only after Zen also added SHA did the Intel Core and Xeon CPUs implement it.
The desktop CPUs spend a lot of resources on obtaining the best scores in the most popular single-threaded benchmarks.
While a responsive computer is very desirable, the truth is that whenever a computer with a 5-GHz CPU reacts slowly to user input, that is guaranteed to be caused by badly designed software.
The multi-threaded performance is much more fundamental, because it has physical limits determined by the current CMOS technology and it is also the true limit for productivity when the computer is used with really optimized software.
What I mean is that targeting popular single-threaded benchmarks in order to promote sales can result in spending more resources, and in much more complex CPU cores, than if the cores were optimized only for objective criteria and exploited by better software.
Namely, a wide decode stage design. My numbers may not be up to date, but most x86 cores from AMD and Intel only have 3-4-wide decode. Some of the recent *mont cores from Intel were retrofitted to 6-wide (I think, and you could count Zen 3 doing 8 with micro-ops, but I don't think that's a fair comparison).
Apple has been on this design for at least 3 iterations, or 6 years. Zen 5 will be a ground-up x86 design with wide decode by default (i.e. not something bolted on). It will probably involve some learning curve for them as well. The only other wide-decode design I'm aware of is POWER10, but we never had enough information about it to read through. And it isn't clear how well the latest x86 ISA fits compared to AArch64, which is fairly clean without the AArch32 baggage.
The AnandTech article on the M1 goes over this. But again, none of this is new if you follow the Apple CPU core design trend, and there is an insane amount of smaller details that matter. The industry is all trending toward similar wide-decode designs for high-performance cores, from the Cortex-X4 and the upcoming X5 to Qualcomm's Oryon core. I would envision that in 4-5 years, unless Apple has another breakthrough, all performance CPU core designs will converge into something fairly similar at a high level. This also echoes what was observed in a Chips and Cheese article (which I don't have time to find the link for, so people will have to do some digging themselves :))
x86 at least has being more CISCy on its side, with a read/write potentially being folded into another instruction (perhaps that's the only significant thing, but it's still relatively meaningful).
I'm sure Intel and AMD know full well that there's lots of benefit to decoding wider; it's just significantly more expensive to do on x86 than on AArch64, so they spend silicon where it's more effective.
The article sidesteps the question of whether the OS scheduler can operate the differing cores effectively. The 7900X3D and 7950X3D became infamous for having less practical performance than their lower-tier sibling, because the scheduler couldn't grasp which cores to schedule threads on.
Does Zen4+Zen4c have a "Ryzen Thread Director" of sorts? Everything I've heard up to this point suggests it does not, relying completely on the OS scheduler to just figure it out (spoiler: it won't).
AMD might make great hardware, but they consistently seem to give nary a damn about software.
Since the cores have the same microarchitecture and thus nearly identical performance characteristics aside from maximum frequency, the scheduling decisions are far simpler than for the 7950X3D or Intel's heterogeneous parts. It's only slightly more complex than Intel's more recent iterations of Turbo Boost, where 8-10 nominally identical cores have different maximum frequencies and one or two are identified as the fastest cores.
The optimal scheduler behavior is: if a task needs more performance than a Zen 4c core can provide, run it on a full-sized Zen4 core. If it doesn't need that extra performance, run it on a Zen4c core to save power.
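That rule fits in a few lines. This is a toy decision function, not Linux or Windows scheduler code: the capacity number and names are invented, and real schedulers track utilization continuously (e.g. PELT on Linux) rather than taking a single figure.

```python
# Toy placement rule for a Zen 4 + Zen 4c system. The capacity value
# is a made-up stand-in for "relative perf of a 4c core at its Fmax".

ZEN4C_CAPACITY = 0.75

def pick_core(task_utilization):
    """Place a task on a dense core unless it demands more than a
    dense core can deliver (utilization as a fraction of a big core)."""
    if task_utilization > ZEN4C_CAPACITY:
        return "zen4"    # needs the frequency headroom of a big core
    return "zen4c"       # fits on a dense core; save power

print(pick_core(0.9))   # heavy thread      -> 'zen4'
print(pick_core(0.3))   # background thread -> 'zen4c'
```

Because both core types have the same microarchitecture, getting this wrong costs only some frequency, unlike a P/E-core mix where a misplacement can also hit IPC or lose SMT.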
A properly authored multi-core workload will balance across the cores even if some of them are slower, via mechanisms like work-stealing. So while having a "thread director" to move the heaviest threads onto the fastest cores would be nice, this architecture probably still works out pretty well even if scheduling is less than perfect.
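The dynamic-balancing effect is easy to demonstrate. In this sketch, greedy list scheduling stands in for work-stealing (the outcome is the same for independent tasks: whichever core goes idle grabs the next task, so fast cores naturally take more of the work). Speeds and task costs are made up.

```python
import heapq

def finish_time(task_costs, core_speeds):
    """Greedy dynamic scheduling: each time a core goes idle it grabs
    the next task. Returns (makespan, tasks handled per core)."""
    # Min-heap of (time this core becomes free, core index).
    free_at = [(0.0, c) for c in range(len(core_speeds))]
    heapq.heapify(free_at)
    tasks_done = [0] * len(core_speeds)
    for cost in task_costs:
        t, c = heapq.heappop(free_at)            # earliest-idle core
        tasks_done[c] += 1
        heapq.heappush(free_at, (t + cost / core_speeds[c], c))
    return max(t for t, _ in free_at), tasks_done

# 12 equal tasks on two full-speed cores plus two half-speed cores:
makespan, per_core = finish_time([1.0] * 12, [1.0, 1.0, 0.5, 0.5])
print(makespan, per_core)  # 4.0 [4, 4, 2, 2]
```

The fast cores end up doing twice as many tasks and nobody sits idle, so the slow cores stretch the total time far less than a naive equal split (which would finish at 6.0, gated by the slow cores).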
Not certain if it's related, but the recent P-state handling for AMD in the Linux scheduler has gotten smarter over the last several releases. It can pinpoint which cores are the fastest out of similarly specced cores, based on manufacturing variation and the like, and prefer those cores for tasks since they can run just a little bit faster.