Comments on: Copper Wires Have Already Failed Clustered AI Systems https://www.nextplatform.com/2024/09/13/copper-wires-have-already-failed-clustered-ai-systems/ In-depth coverage of high-end computing at large enterprises, supercomputing centers, hyperscale data centers, and public clouds. Mon, 16 Dec 2024 12:10:00 +0000 hourly 1 https://wordpress.org/?v=6.7.1 By: Hermann Strobel https://www.nextplatform.com/2024/09/13/copper-wires-have-already-failed-clustered-ai-systems/#comment-242489 Mon, 16 Dec 2024 12:10:00 +0000 https://www.nextplatform.com/?p=144690#comment-242489 In reply to Michael Orwin.

POET has partners of crucial importance:
Mitsubishi, FIT (Foxconn), and Luxshare: they all supply the big players.

In 2024, POET won three awards for best AI solution.

POET recently received two additional $25 million investments.

Together with Mitsubishi, it is developing 3.2 Tb transceivers (400 Gb/lane).
Far ahead of the market!

POET is ramping up mass production in Asia (China and Malaysia).

With POET’s light sources, the company is at the forefront of chip-to-chip communication.

The company has NO debt, and around $83 million in cash.

By: Michael Orwin https://www.nextplatform.com/2024/09/13/copper-wires-have-already-failed-clustered-ai-systems/#comment-235549 Tue, 24 Sep 2024 23:10:32 +0000 https://www.nextplatform.com/?p=144690#comment-235549 I’ve heard that Poet Technologies has an optical interposer that’s supposed to let connections be made accurately and automatically. Apparently, the accuracy reduces the heat generated. Any thoughts on whether Poet’s interposers are likely to make a big difference, or whether they aren’t critical because Wade says the problems with optical components are being solved? It might be hard to say, because it looks like Poet doesn’t have anything in volume production yet.

By: James Choi https://www.nextplatform.com/2024/09/13/copper-wires-have-already-failed-clustered-ai-systems/#comment-235286 Sun, 22 Sep 2024 16:48:11 +0000 https://www.nextplatform.com/?p=144690#comment-235286 Just because Intel had a few bad quarters doesn’t mean it is defunct. NVLink is not just a protocol, and a lot of assumptions are being made that interconnect bandwidth is the bottleneck. That is a limited view of system architecture. As for tokens/s, the buzzword parameter of today — we have already seen it achieved by the likes of Cerebras and Groq, so are these charts really relevant? There are a lot of smart people at Nvidia, AMD, and Intel, and it is not up to Wade to solve this problem, which may not actually be a problem.

By: itellu3times https://www.nextplatform.com/2024/09/13/copper-wires-have-already-failed-clustered-ai-systems/#comment-234913 Thu, 19 Sep 2024 15:03:11 +0000 https://www.nextplatform.com/?p=144690#comment-234913 We will look back on this in twenty years and laugh.
Anyone remember the early Cray computers? Very steampunk, by modern standards.
Hardware and software architectures will change and compress all this stuff by 10^3 or 10^6 in twenty years, and the macro-level power density issues will vanish. What about power density at the chip level? I dunno; that’s probably harder.

By: Calamity Jim https://www.nextplatform.com/2024/09/13/copper-wires-have-already-failed-clustered-ai-systems/#comment-234735 Tue, 17 Sep 2024 21:10:30 +0000 https://www.nextplatform.com/?p=144690#comment-234735 In reply to Slim Albert.

I reckon such spinal flexibility as exhibited in Google’s Apollo networking could be as beneficial there as in a rodeo show (prevents injury), and a great way to save a whole bunch of silver dollars on them there optical AI interconnects too ( https://www.nextplatform.com/2024/08/23/this-ai-network-has-no-spine-and-thats-a-good-thing/ ). A win-win for all involved!

By: Slim Albert https://www.nextplatform.com/2024/09/13/copper-wires-have-already-failed-clustered-ai-systems/#comment-234694 Tue, 17 Sep 2024 17:08:53 +0000 https://www.nextplatform.com/?p=144690#comment-234694 Wade’s quite right about this IMHO. The computational potential of the very large machines installed for HPC and at hyperscalers (thousands of GPUs) can’t be properly tapped with copper, and that limits how they are used and what kind of models are run on them (at present). Google seems to have gotten that message already with its “‘Apollo’ optical circuit switching (OCS) network […] reconfigurable for different topologies [replacing] the spine layer in a leaf/spine Clos topology. (We need to dig into this a little deeper.)” ( source: https://www.nextplatform.com/2023/05/11/when-push-comes-to-shove-google-invests-heavily-in-gpu-compute/ )

On the “technical” side, Wade expresses Profitability (P) in original (yet valuable) units of (Token/Second) / (($/Token) x Watts), which simplifies to (Token/$) x (Token/Joule). The second term scales linearly to units of tokens-per-kilowatt-hour: P_kwh = P x 3.6e6 — this keeps the 6x profitability advantage on his plot at 50 tokens/s. If the price of electricity in $-per-kilowatt-hour is C (e.g. C = 0.17 $/kWh), then his Profitability (P) can be converted to P_$, in units of tokens-per-$, using: P_$ = √(P_kwh/C) = 1897 x √(P/C). With this, optics have a 2.4x (≈ √6) token-per-$ advantage over copper at 50 tokens/s of interactivity — which is still quite valuable IMHO (caveats: all errors mine, and I’m not an economist … capital and operating costs may need separating).

By: Rakesh Cheerla https://www.nextplatform.com/2024/09/13/copper-wires-have-already-failed-clustered-ai-systems/#comment-234536 Mon, 16 Sep 2024 15:30:07 +0000 https://www.nextplatform.com/?p=144690#comment-234536 The base models like 14T MoE are rarely used for inference. Instead, multi-agents rely on smaller distilled models (see https://platform.openai.com/docs/models), switching between them for better performance and cost savings. Since each agent interacts via software, scale-up and 256GPUs don’t come into the picture. I think of the base models as powerful gods — you don’t call them every day!

That said, I love the model and the insightful analysis; it helps us think about next-generation systems. It would be valuable to talk to multi-agent startups to understand the models they use and the inference-speed challenges, and to measure the best token rates today.
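The routing idea can be sketched in a few lines of Python. Everything here — model names, per-token prices, and the difficulty threshold — is purely hypothetical, made up for illustration rather than taken from any real API or price list:

```python
# Toy sketch of multi-agent model routing: serve most requests from a cheap
# distilled model and escalate only the hardest ones to the big base model.
SMALL = {"name": "distilled-small", "usd_per_1k_tokens": 0.0002}
LARGE = {"name": "base-moe", "usd_per_1k_tokens": 0.06}

def route(difficulty: float) -> dict:
    """Pick the distilled model unless the task is estimated to be hard."""
    return LARGE if difficulty > 0.9 else SMALL

def request_cost(difficulty: float, tokens: int) -> float:
    """Dollar cost of answering one request with the routed model."""
    return route(difficulty)["usd_per_1k_tokens"] * tokens / 1000.0
```

If, say, 95% of requests fall below the threshold, the blended cost per token stays close to the distilled model’s price, which is the economic point of the comment.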
