In our research, we’ve heard several arguments against the cloud being a suitable host for HPC today. Here are a few of the biggest: 1) you can’t get enough instances in the same region for a large HPC workload; 2) you can’t significantly reduce your IT staffing just because you go to the cloud, since you still need application specialists, storage specialists, tuning specialists, and so on; and 3) the costs of public cloud are 3-7x the fully burdened costs of on-prem.
It’s interesting to consider the recent moves by Amazon and Google to extend the useful life of their systems by an additional year, roughly a 25% increase. There’s a reason for this. Could it be that their reading of the tea leaves doesn’t have them taking over the IT world? I don’t know, but it’s something to consider.
The way I see it, this gives us all a lot more time to work on the warp drive…
Anyway, it is a shame to see HPC get swallowed up by a cloud of rack-mounted PCs, but I suspect there will always be a lab somewhere building something so different that they even have to make their own parts.
Seems to me the HPC people can have that now or in the very near future and, as the article points out, at much better economics, not years and years away. That’s mainly because of what is offered today and the quick rate of further development taking place in AI. Did you see how NVDA opted not to focus on the FP64 market with its latest GPUs? I imagine that is because the AI they’ve already managed to achieve together with partners provides the needed answers, just in a different way. Someone likened the AI approach to doing the math in one’s head, versus the brute-force approach with its emphasis on FP64 precision being longhand math on paper. While the answers will be the same, one is much more expensive in terms of compute power, and unnecessary. I find that an apt analogy.
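To put a rough number on that analogy, here is a minimal NumPy sketch (my own illustration with made-up data, not anything from the article or the comment) comparing the same reduction in FP64 and FP32. The reduced-precision result agrees with the full-precision one to several digits while moving half the data:

```python
import numpy as np

# Illustration only: the data, size, and seed below are arbitrary choices.
rng = np.random.default_rng(seed=0)
values = rng.uniform(0.0, 1.0, size=10_000_000)

# "Longhand" full precision vs. a reduced-precision shortcut.
sum_fp64 = np.sum(values)                                        # FP64 accumulation
sum_fp32 = np.sum(values.astype(np.float32), dtype=np.float32)   # FP32 accumulation

print(f"FP64 sum: {sum_fp64:.6f}")
print(f"FP32 sum: {float(sum_fp32):.6f}")
print(f"Relative difference: {abs(sum_fp64 - float(sum_fp32)) / sum_fp64:.2e}")
```

Whether that level of agreement is acceptable is, of course, workload-dependent, which is exactly the judgment call the analogy is getting at.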
This is not to say that there isn’t room for custom hardware components (integrated, chiplet, or wafer-based), but that hardware needs to seamlessly provide its value without requiring application and workload modification. Let’s face it, HPC’s Achilles’ heel has always been its legacy (some would say antiquated or perhaps decrepit) software stacks, which take anywhere from 9 to 18 months to port before much, but never all, of the potential of any new hardware or system can be realized. When you combine system delivery delays (how often has any custom HPC system been delivered on time?), the huge porting cost and time delays, and the rapid performance gains of the underlying hardware (distributed HPC systems don’t use or cannot tolerate rolling hardware upgrades, which means that for much of their very limited productive years they run on hardware lagging by two or more generations), one has to question their economic viability. Add into the mix politics driven by many who have little to no understanding of the technology, science, or potential benefits to humanity, and the current custom HPC operating model becomes extremely questionable.
In contrast, cloud providers have mastered supply chains that yield massive economies of scale, scale-out system and solution management with fully integrated security and resiliency, and the ability to rapidly and seamlessly integrate new hardware and capabilities (lagging hardware is quickly redeployed downstream to less demanding applications, ensuring that demanding applications run on the best at hand). If they see a viable market or an opportunity through public-private funding, they are more than willing to invest to deliver what their customers need. Some cloud providers are even certified to provide on-premises gear within high-security and sensitive environments, including various three-letter agencies.
Cloud providers are far from perfect, but they can support nearly all but perhaps the most extreme niche HPC applications, and their capabilities and benefits, from multiple economic and technology perspectives, far outweigh their shortcomings. Industry standards are critical to defining the mechanical and communication edges, but they need to be carefully designed to enable, not hinder, innovation. Far too often, hardware and semiconductor companies driving industry standards try to lock down, and all too often artificially delay, specifications and innovations to meet their own business requirements. Such efforts have slowed and constrained innovation, and they have produced overly complex standards, which in turn lead to many components that are not fully interoperable or compliant. Fortunately, some of this is starting to change: the DMTF, for example, specifies a wide range of flexible data models that abstract the underlying hardware implementation, which can dramatically simplify software-hardware integration while accelerating innovation and the delivery of new services and capabilities.
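As a concrete illustration of that last point, here is a minimal sketch (my own, not from the comment; the controller address and credentials are hypothetical) of reading server inventory through the DMTF’s Redfish data model, which presents vendor-specific hardware behind a common HTTP interface:

```python
import requests

# Hypothetical management-controller address and credentials, for illustration only.
BMC = "https://bmc.example.com"
AUTH = ("admin", "password")

# /redfish/v1/Systems is the standard Redfish collection of compute systems.
systems = requests.get(f"{BMC}/redfish/v1/Systems", auth=AUTH, verify=False).json()

for member in systems.get("Members", []):
    # Each member links to a ComputerSystem resource with vendor-neutral properties.
    system = requests.get(f"{BMC}{member['@odata.id']}", auth=AUTH, verify=False).json()
    print(system.get("Manufacturer"), system.get("Model"),
          system.get("PowerState"), system.get("Status", {}).get("Health"))
```

The same few lines work against any compliant implementation, which is the kind of abstraction that lets software-hardware integration move faster than vendor-specific tooling would allow.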
I think GreenLake is a kind of outpost, and it has traction and a chance. But that only addresses the infrastructure layer. The addition of Cray certainly gives HPE more longevity because of all of those skills in HPC, particularly the workload expertise. But consider that many of the key architects of Cray now work at Microsoft Azure.