Lies, Damn lies, and It’s the ring bus, stupid
How Intel made a rookie mistake that was easy to spot a mile away
It’s that time again, where I casually remind Intel ($INTC) that they’re a bunch of fuckups who can’t hack it anymore without their visionary leadership team, Rob Noyce and Gordon Moore, long since dead and buried; the company left to iterate on iterative work that was 10 years out of date by the time they got to it.
Last we looked at Intel we discussed their questionable geopolitical foresight and their technical expertise at the same time in recalling their recent product history (Pentium -> Core/Core i series to the 14th generation).
The same disclosures and prerequisites from that article apply, and as always take this with a huge dose of salt if you wish.
The MHz wars and the Ring Bus
Since the dawn of the personal computing era, silicon chip manufacturers like Intel and AMD, the since-departed like Motorola, and myriad others have been engaged in a scientific arms race of megahertz (MHz).
This was obvious to even the casual consumer in the 1980s and 90s, as semiconductor home computing processor availability widened globally. Home computers went from singular MHz (Apple II, MOS/1.023MHz) into the hundreds of MHz (Pentium II, 233-450 MHz). With this increase came raw computational power, something necessary for the future - 3D rendering.
Megahertz led to gigahertz by the 2000s (GHz; 1 GHz is 1,000 MHz), and you began measuring your technological epeen in “gigs”, though some would often confuse this moniker with storage capacity or memory amount, something that would later become another metric to measure an epeen by.
These megahertz and subsequent gigahertz of computational power come at a cost - switch-generated heat and power consumption. Silicon semiconductors are akin to vast arrays of small gate switches, mechanical-electric relays, if you will. These open and close through electric pulses to make onlyfans appear on your phone, billions of times per second. As time goes on we are able to reduce their size and multiply their number, all while increasing the speed at which they switch open and closed. The shrinking and multiplying part is the basis of Moore’s Law.
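For a rough feel of why clock speed and voltage cost heat, here is a minimal sketch using the textbook CMOS switching-power approximation, P = a * C * V^2 * f. The capacitance, voltage and frequency values are hypothetical and picked only to show the scaling; they are not measurements of any real chip.

```python
def dynamic_power_watts(capacitance_farads, voltage_volts, frequency_hz, activity=1.0):
    """Textbook CMOS switching-power approximation: P = a * C * V^2 * f."""
    return activity * capacitance_farads * voltage_volts ** 2 * frequency_hz


# Hypothetical values, chosen only to illustrate the scaling.
base = dynamic_power_watts(1e-9, 1.0, 3e9)    # 1 nF switched at 1.0 V and 3 GHz
faster = dynamic_power_watts(1e-9, 1.0, 6e9)  # same voltage, double the clock
hotter = dynamic_power_watts(1e-9, 1.2, 6e9)  # double the clock and 20% more voltage

print(faster / base)  # power scales linearly with frequency
print(hotter / base)  # and with the square of voltage
```

The takeaway is that frequency costs you linearly but voltage costs you quadratically, which is why "just add voltage" gets expensive fast.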
This is also why Intel knew by 2004 that splitting 1 core into 2 threads was never ever going to be enough. A future processor would need to be multi-core/multi-thread, having multiple CPU cores in one semiconductor die, all tied together. This is where we would find Core/Core2 and eventually Core i families and multi-core computing.
Intel, by late 2008, had chosen to use a system known as a “ring bus” in order to further advance the multi-core theory. A ring bus is akin to a central nervous system of physical lines interconnecting various parts of the chip (the CPU cores, system agent, graphics unit etc) as well as external I/O (memory, PCI Express). The “ring” ties the “bus” together, and on this bus the devices communicate with each other, speaking and listening in a timed order. The end result is something similar to an automotive CAN bus, where devices such as the car’s engine computer, the lights, radio and so on all communicate with each other across wires - but on a microscopic scale.
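To make the ring idea a bit more concrete, here is a toy model, not Intel's actual design. It assumes a bidirectional ring with evenly spaced "stops" (cores, graphics, system agent) and counts how many hops a message needs. The function names and stop counts are made up for illustration.

```python
def ring_hops(src, dst, stops):
    """Fewest hops between two stops on a bidirectional ring (toy model)."""
    clockwise = (dst - src) % stops
    return min(clockwise, stops - clockwise)


def worst_case_hops(stops):
    """Longest shortest-path trip from stop 0 to any other stop."""
    return max(ring_hops(0, dst, stops) for dst in range(stops))


# Hypothetical stop counts: more devices on the ring means longer trips.
for stops in (4, 8, 12):
    print(stops, "stops -> worst case", worst_case_hops(stops), "hops")
```

In this simplified picture the worst-case trip grows with half the number of stops, so every core you bolt onto the ring makes the far side of the chip a little further away.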
As the bus (and various devices) cycle at ever-increasing speeds - including speed of the memory itself - the bus needs an ever-increasing level of power to sustain itself and provide the basis for devices to interconnect, at the correct speed and times.
Unfortunately, the bus is a fragile thing, delicate interconnections woven in the silicon substrate - and they can be damaged very easily with stray high voltage for any extended period of time. By 2010, limitations were being discussed in scientific papers, and in 2021 Intel’s biggest competitor AMD was being warned it too could face the same fate.
By the 2020s, in computing processor speeds, 6GHz “single core clock” was on the horizon (sometimes even achievable in some rare overclocking circumstances), with Intel leading the pack head heavy with performance crowns in 2021-22 for its 12th generation chips reaching speeds of more than 5.2GHz on “Turbo boost”, and sometimes, if the stars aligned, a few MHz more for specific cores.
Turbo boost is an Intel feature: a momentary increase in clock speed beyond the base frequency, which consumes more power and ideally in turn produces more computational output in a given workload.
12th gen was good, apparently, at keeping the ring going so to speak. So what came next? Enter 13th gen. And in 2023, the refresh: 14th gen.
The Bus breaks down
As we alluded to in prior work, Intel’s preferred method to winning the MHz wars involves what amounts to blasting silicon with undue voltage and hoping for the best, while changing as little as possible in its x86 pasta sauce.
The Pentium 4 and subsequent first-generation Core products were, to put it mildly, personal desk heaters. This would improve slightly over the years as Intel moved to smaller lithography scales, but the heat problem somehow continues to plague them even in 2024.
While competitors innovate and offer new architectures on a seemingly daily basis, Intel ultimately stagnates, and wants to relive the glory days. In this era you may sometimes see fantastic engineering designs like backside power delivery, but for the most part on the consumer-facing side they prefer to do the “tried-and-true” method, being gun-shy of taking on risk and changing things at the architecture level more than skin deep and iteratively.
As of 2022, this was not just their motto but entire corporate ethos. The agenda of “iterate as little as possible and add voltage”, something akin to Colin Chapman’s “Simplify, and add lightness” Lotus motto, wasn’t working out.
Intel knew it had to move on from legacy technology such as the ring bus and on to die-to-die interconnected chiplet systems, which could sustain much more throughput. They would do this starting with their high-end workstation/server parts known as “Sapphire Rapids” and on some mobile devices too. Their core desktop segment redesign would come soon enough. Instead…
Intel would attempt to use what they had learned in that space for 13th and 14th generation without actually making chiplet tiles, opting for monolithic design on these products. This would be a costly mistake, one outsourced to some degree to their arch rivals at TSMC.
Some users of the 13th and 14th generation multi-core parts quickly noticed that units with high (8+) “performance” cores were having instability issues, with applications crashing and reporting issues like “out of video memory”. This issue was noted as early as 2023 on Internet forums, with Intel offering several “microcode changes” as time went on to resolve the issue. None of these did, and the furor continued.
By July 2024, the issue was dragging on their stock price and forcing their hand with an investor class-action suit, which in part claims Intel knew of the problem but had, in all of its prowess, no real way to fix it. The damage had been done. The ring bus had been, for all intents and porpoises, burnt, with Intel promising that more microcode fixes and BIOS updates from vendors were to come to “help solve the issue”.
Their other response, for those seeking compensation or a refund, or who had previously had their RMA rejected? Better RMA resolution. 2 years of extra warranty* and troubleshooting tips that suggested to the user they may have to accept performance below what Intel had been claiming in promotional materials just months prior. The reason for all of this?
Intel didn’t stringently direct motherboard manufacturers or system integrators (think Lenovo, HP, Dell, Acer; companies you buy whole computers from) on system core and ancillary voltages, instead preferring to offer non-binding guidance. And because the ring bus is effectively tied to CPU core voltage, under some circumstances and after a period of time (possibly under Turbo boost) damage could be done to this critical component.
*That 2 years extended warranty? Probably doesn’t apply to you, unless you built your computer from a selection of components that you bought in a store or you ordered online, and you kept your original sales receipt. “Boxed only” as the kids say. “Tray” would have to talk to their vendor. If you bought a single Intel processor in a box from Best Buy, Intel will (probably) send you a comparable replacement. Gratis. Maybe.
Bought a 13900K-based whole computer from a company (or a processor from a wholesale tray vendor) and you’re having issues? It’s a crapshoot. Some will gladly honor your claim. However: these integrators/wholesalers aren’t obligated to provide additional support, nor are they obligated to take a defective unit back after their return policy. You’re shit out of luck if it happens to you, according to Team Blue. Not their problem.
0x129
As reported by tech YouTube blogger “TechYesCity”, Intel’s response in updated microcode (available as of yesterday at time of writing) didn’t solve the issue. It may in fact be exacerbating it, impacting even more users long-term.
The 0x129 variant of microcode caused his machine to run hotter and at higher voltages under testing (and, subsequently, higher overall wattage - producing more heat) compared to the code given at launch. Intel was seemingly now causing these chips to enter a state of “thermal throttling” under some conditions: an over-temperature state where the CPU sensors hit a hard limit (212F/100C in this case) and the chip begins to restrain itself, reducing the core clock speeds in order to reduce heat and potential damage.
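If you have never watched throttling happen, the logic is roughly a thermostat in reverse. The sketch below is a deliberately crude model, not how Intel's firmware actually does it. Only the 100C limit comes from the reporting above; the step size, floor and ceiling clocks are hypothetical.

```python
T_LIMIT_C = 100.0  # hard limit quoted above (212F / 100C)


def c_to_f(temp_c):
    """Convert Celsius to Fahrenheit."""
    return temp_c * 9 / 5 + 32


def next_clock_mhz(temp_c, clock_mhz, step_mhz=100, floor_mhz=800, ceiling_mhz=5500):
    """One control step of a toy throttle: back off at the limit, else creep back up.

    step_mhz, floor_mhz and ceiling_mhz are made-up illustration values.
    """
    if temp_c >= T_LIMIT_C:
        return max(floor_mhz, clock_mhz - step_mhz)
    return min(ceiling_mhz, clock_mhz + step_mhz)


print(c_to_f(T_LIMIT_C))            # 212.0
print(next_clock_mhz(100.0, 5500))  # 5400, at the limit so the clock backs off
print(next_clock_mhz(85.0, 5400))   # 5500, cooled down so the clock recovers
```

The point is that a chip sitting at its thermal limit spends its time giving clock speed back, so extra voltage that adds heat can end up costing performance rather than buying it.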
They would slow themselves down quicker, in order to drop power consumption and heat - and protect themselves - diminishing clock speeds and reducing computational power. The new microcode appears only to make the issue worse, by taking voltages that should normalize around 1.29-1.3V and pushing them beyond 1.4V, causing this new over-temperature condition.
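As a back-of-envelope check on those voltage figures: if you assume switching power goes roughly with the square of voltage at a fixed clock (a textbook simplification that ignores leakage and any change in frequency), the jump from 1.3V to 1.4V looks like this. The helper name is made up for the example.

```python
def relative_switching_power(v_new, v_old):
    """Rough power ratio at a fixed clock, assuming power scales with V^2."""
    return (v_new / v_old) ** 2


# Voltages taken from the reporting above; the V^2 scaling is an assumption.
print(round(relative_switching_power(1.4, 1.3), 2))  # 1.16
```

So on this crude estimate a tenth of a volt is worth something like 16% more switching power, all of which has to leave the chip as heat.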
The fix that didn’t
So what’s left for these users? As TechYesCity suggests, if you have the option to set such power values (K-series parts and applicable motherboards) you should reduce the voltages manually, adjust memory speeds/timings to correspond where applicable, and also revert to the launch microcode. Intel’s latest fix is effectively making the problem just that much worse.
Beyond that, some motherboard manufacturers (even after update) allow users to modify these voltage values beyond safe specifications, as reported by BuildZoid. This is a dangerous oversight: these values must be rigorously enforced, and it’s clear that there is still no actual enforcement, putting end users who may modify these values incorrectly at severe risk.
So.. as far as I can see: nothing has been fixed, new bugs were added, and things are seemingly in a worse state than ever before. Intel is encouraging users to take the microcode update and hope for the best. Most everyone else is mum on the subject.
Elsewhere in the news, Intel recently cancelled an upcoming developer talk which was set to preview new products in just a few weeks, seemingly to save costs, saying it would instead present these marvels at vendor fairs. Any hopes for a good cycle of press are likely quashed amid the scandalous response. No “Thanks, Steve” this year.
We will of course have further thoughts as time goes on.
-RS