The Hidden Cost Of Compromise: Why HPC Still Demands Precision

PARTNER CONTENT: As the technology industry continues its shift towards AI dominance, a schism is opening up that threatens scientific progress, along with humanitarian endeavors such as disaster response.

To understand what’s at stake, consider critical weather modeling: these simulations hinge on maintaining precision across billions of calculations. When tracking a hurricane’s potential landfall or a wildfire’s spread, a single rounding error, magnified across millions of data points, could misplace a storm’s eye, or misjudge a fire’s direction – the difference between timely evacuation and catastrophe.

While this industry-wide shift has accelerated AI advances, it’s creating an unintended consequence: the erosion of high precision computing, which is vital for applications such as climate modeling, aerospace engineering, and pharmaceutical research. As scientific fields face growing pressure to trade precision for performance, the consequences range from wasted resources to potentially lost lives.

High performance computing (HPC) is the backbone of global research and innovation. It enables scientists to make breakthroughs in understanding molecular interactions, optimize clean energy solutions, and forecast extreme weather events. The potential impact of this work extends beyond the laboratory. It directly affects our ability to protect communities, drive innovation, and sustain economic growth.

In scientific computing, double precision floating point arithmetic (FP64) represents numbers using 64 bits of data, providing roughly 16 significant decimal digits of accuracy. This level of precision matters for scientific applications where small errors can compound dramatically. In weather modeling, millions of variables – from temperature gradients to ocean currents – interact across countless time steps. These calculations demand FP64 accuracy because errors compound from one step to the next. Unlike AI, where a rounding error might generate an out-of-context sentence, imprecise weather calculations can miss critical patterns that impact emergency response and cost lives.
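To make the compounding concrete, here is a minimal sketch (not taken from the article or the studies it cites) that accumulates a small increment a million times, the way a time-stepping code adds up small contributions, in single and double precision. The increment and step count are arbitrary illustrations.

```python
import numpy as np

# Minimal illustration: repeatedly add a small increment, as a time-stepping
# simulation might, and compare FP32 against FP64. All values are arbitrary.
n = 1_000_000
increment = 0.1

total32 = np.float32(0.0)
total64 = 0.0                      # Python floats are FP64
for _ in range(n):
    total32 += np.float32(increment)
    total64 += increment

exact = n * increment              # 100,000
print(f"FP32 total: {float(total32):.3f}   absolute error: {abs(float(total32) - exact):.3f}")
print(f"FP64 total: {total64:.9f}   absolute error: {abs(total64 - exact):.3e}")
```

After a million additions the FP32 total has drifted visibly away from 100,000, while the FP64 total is still correct to many digits; in a real model that drift would feed back into every subsequent calculation rather than ending at a print statement.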

In pharmaceutical research, precision dictates lifesaving innovation. Molecular dynamics simulations require extraordinary accuracy to model how proteins fold and how drugs bind to receptors. A single computational error could cause researchers to miss promising drug candidates or waste millions pursuing compounds that ultimately fail in clinical trials.

In aerospace and automotive engineering, computational fluid dynamics (CFD) simulations validate every design through rigorous modeling of aerodynamics and structural integrity. These simulations require double-precision calculations to capture critical details in turbulent flows. Reduced precision can create unstable simulations that produce incorrect models, potentially leading to catastrophic failures.
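The same effect shows up in the linear systems that CFD and structural codes assemble at enormous scale. As a small illustration (using a textbook ill-conditioned test matrix, not an actual CFD operator), the sketch below solves the same system in FP32 and FP64:

```python
import numpy as np

# Illustrative only: the 10x10 Hilbert matrix is a classic ill-conditioned
# test case, standing in for the poorly conditioned systems that turbulence
# and structural models can produce. It is not a real CFD matrix.
n = 10
i, j = np.indices((n, n))
A = 1.0 / (i + j + 1.0)            # Hilbert matrix, built in FP64
x_true = np.ones(n)
b = A @ x_true

x32 = np.linalg.solve(A.astype(np.float32), b.astype(np.float32))
x64 = np.linalg.solve(A, b)

def rel_err(x):
    return np.linalg.norm(x - x_true) / np.linalg.norm(x_true)

print(f"FP32 relative error: {rel_err(x32):.2e}")
print(f"FP64 relative error: {rel_err(x64):.2e}")
```

With a condition number around 10^13, the single-precision solve typically returns no correct digits at all, while the double-precision solve still recovers the answer to several digits; this is the kind of silent accuracy loss that can destabilize a larger simulation.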

Precision: The Growing Gap

The importance of this conversation is not lost on leaders in scientific computing. A recent study from Oak Ridge National Laboratory demonstrated that while mixed precision techniques can deliver up to 8X performance gains in select scenarios, they fall critically short for workloads that require full FP64 precision. The study underscores the stability and reproducibility that FP64 provides for applications that depend on high precision and accuracy.
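For readers unfamiliar with what mixed precision techniques look like in practice, the general pattern behind benchmarks of this kind is roughly the sketch below: do the expensive solve in low precision, then correct it with high-precision residuals. This is a generic toy version on an arbitrary, deliberately well-conditioned matrix, not the ORNL code; on harder problems (ill-conditioned or sparse systems, for example) the refinement can stall, which is exactly where full FP64 remains necessary.

```python
import numpy as np

# Toy sketch of mixed-precision iterative refinement: solve in FP32, then
# refine the answer using residuals computed in FP64. A production code
# would factor the FP32 matrix once (LU) and reuse the factors rather than
# calling solve repeatedly; the matrix here is an arbitrary test case.
rng = np.random.default_rng(0)
n = 500
A = rng.standard_normal((n, n)) + n * np.eye(n)    # deliberately well conditioned
x_true = rng.standard_normal(n)
b = A @ x_true

A32 = A.astype(np.float32)
x = np.linalg.solve(A32, b.astype(np.float32)).astype(np.float64)

for step in range(5):
    r = b - A @ x                                   # residual in full FP64
    dx = np.linalg.solve(A32, r.astype(np.float32)) # cheap low-precision correction
    x += dx
    err = np.linalg.norm(x - x_true) / np.linalg.norm(x_true)
    print(f"refinement step {step}: relative error {err:.2e}")
```

On a forgiving matrix like this one the error drops to near FP64 levels within a few refinement steps, which is where the headline speedups come from; the ORNL finding is that many scientific workloads are not this forgiving.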

This challenge extends beyond mere performance metrics. As industry expert Earl Dodd notes, the prevailing myth that “lower precision is always less reliable” has led to oversimplified decision making about precision requirements. The reality is more nuanced: different applications require different precision levels, and blanket approaches to reducing precision can have severe consequences for scientific applications. While AI workloads might tolerate reduced precision, scientific computing often requires the unwavering accuracy that only double-precision floating-point calculations can provide.

However, despite the continuing need for high precision formats, AI accelerators are coming to dominate chip manufacturers’ hardware roadmaps virtually across the board. The industry is making a strong shift towards AI-first architectures at the expense of traditional scientific computing needs. This shift isn’t merely a temporary trend – it represents a fundamental realignment of the computing landscape with far-reaching consequences. Just as scientific organizations are finding powerful ways to combine AI and HPC approaches in their research, hardware development is forcing an artificial separation between these complementary methodologies. This divergence threatens to limit the potential of both scientific computing and AI by constraining how they can work together.

This shift is especially evident in the latest GPU architectures, where the gap between precision formats continues to widen with each generation. The trend is clearly illustrated in the recent ORNL research: its analysis of both Nvidia and AMD GPU architectures shows FP16 matrix operation performance rising rapidly while FP64 performance grows at a far more modest pace. This data tells a clear story: the computing industry’s focus on AI is creating an increasing divide between low-precision and high-precision computing capabilities. The result is a computing ecosystem that is fracturing.

GPU performance trends (2012-2024) showing a widening gap between AI-focused FP16 and scientific FP64 operations. Source: Oak Ridge National Laboratory, 2024

The consequences of this shift are undeniably concerning for scientific computing. As Dodd emphasizes, precision in scientific computing directly impacts human lives, environmental protection, and technological progress. The trade-offs between performance and precision aren’t just technical considerations – they represent fundamental choices about the reliability and trustworthiness of our scientific simulations. When we compromise on precision, we risk more than just computational accuracy; we risk the integrity of scientific discovery itself.

Scientific users are being forced to adapt to architectures that weren’t built for their needs. Software frameworks, development tools, and even hardware interfaces are being designed primarily for AI workloads, creating an environment where scientific computing increasingly feels like an afterthought. This shift raises a serious question: in the rush to AI dominance, is the industry throwing the baby out with the bathwater?

NextSilicon’s Innovation: The Maverick-2 ICA

NextSilicon has developed a fundamentally different approach to scientific computing acceleration. The Maverick-2 accelerator – built upon its intelligent compute architecture (ICA) – represents a breakthrough in addressing the precision-performance paradox, offering a purpose-built solution that refuses to compromise on either front.

Unlike traditional accelerators that force scientific applications to adapt to fixed architectures, Maverick-2 introduces an innovative software-defined hardware approach that dynamically adapts to workload demands. This intelligent architecture learns from application behavior in real time, identifying and optimizing computational “hotspots” while maintaining the precision essential for scientific computing. Maverick-2 represents a fundamental shift in scientific computing – a purpose-built solution engineered to address the precision, power, and scalability challenges that modern scientific workloads demand.

The innovation lies in Maverick-2’s distributed compute architecture, powered by specialized mill cores designed specifically for scientific workloads. These cores work together to automatically identify and optimize the hotspot flows – the portions of applications that account for the bulk of runtime. By transforming these hotspots into optimized compute graphs and distributing them efficiently across the system, Maverick-2 enables massive parallelization while maintaining full FP64 precision.
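As a purely conceptual illustration of what “hotspot flows” means (this is ordinary software profiling in Python, and says nothing about how NextSilicon’s toolchain or hardware actually detects or maps hotspots), the sketch below profiles a toy program in which one inner loop accounts for nearly all of the runtime; that concentration of work is the property an adaptive accelerator can exploit.

```python
import cProfile
import io
import pstats

# Conceptual only: "hotspot" here simply means the function that dominates
# runtime in an ordinary profile of a toy relaxation loop.

def relax(grid):
    # The inner loop that does nearly all of the work.
    return [0.5 * (grid[k - 1] + grid[k + 1]) for k in range(1, len(grid) - 1)]

def bookkeeping(n):
    # Setup work that contributes little to the total runtime.
    return [float(k % 7) for k in range(n)]

def run():
    grid = bookkeeping(10_002)
    for _ in range(300):
        grid = [grid[0]] + relax(grid) + [grid[-1]]
    return grid

profiler = cProfile.Profile()
profiler.runcall(run)
report = io.StringIO()
pstats.Stats(profiler, stream=report).sort_stats("cumulative").print_stats(6)
print(report.getvalue())   # relax() should dominate the cumulative-time column
```

In the printed report, relax() carries almost all of the cumulative time, which is the shape of profile that makes hotspot-driven optimization worthwhile.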

Maverick-2’s adaptive compute model represents a paradigm shift in acceleration technology. Rather than requiring extensive code rewrites or compromising on precision, it dynamically reconfigures based on workload characteristics. The system’s intelligence orchestrates multiple layers of optimization, managing everything from workload distribution to sophisticated power management, thermal controls, and automated resource allocation.

Where traditional GPU-centric systems struggle with data movement inefficiencies, introducing latency and limiting precision scalability, Maverick-2’s unified memory architecture eliminates these bottlenecks. By providing industry-leading random access memory bandwidth through its distributed HBM3E memory system, Maverick-2 significantly reduces the data movement overhead that typically constrains throughput in traditional architectures. This approach delivers up to 4X better performance-per-watt compared to conventional GPU solutions.

Peak performance metrics can be misleading – what matters is sustained performance in real-world scientific computing. Maverick-2 is engineered specifically for this reality, delivering consistent acceleration across an organization’s entire application portfolio. Rather than optimizing for narrow benchmarks or specific scenarios, the system maintains its performance advantages across diverse scientific workloads, ensuring reliable results and maximizing research productivity across the board.

Perhaps most significantly for scientific organizations, Maverick-2 removes the burden of extensive code modifications through its intelligent developer toolchain, which natively supports common programming languages and frameworks such as C/C++, Fortran, OpenMP, and Kokkos. Scientific researchers no longer have to rewrite their code to achieve performance gains. With Maverick-2, they can run their applications without modification, eliminating weeks, months, or even years of porting effort while avoiding vendor lock-in. Organizations can leverage their existing software investments while gaining unprecedented performance improvements without sacrificing established workflows or investing in specialized programming expertise.

Validation At Scale

Maverick-2 is proving its impact in some of the most demanding HPC environments. At Sandia National Laboratories, the National Nuclear Security Administration is integrating Maverick-2 into its Advanced Simulation and Computing program, leveraging its real-time reconfiguration capabilities to enhance simulation accuracy and power efficiency. This collaboration demonstrates how Maverick-2’s intelligent acceleration approach can enhance critical scientific computing workloads at one of the world’s premier research institutions. The platform’s real-time optimization capabilities ensure that computational resources are always aligned with application requirements while delivering consistent performance across diverse scientific workloads – all while maintaining the precision essential for national security applications.

As Hyperion Research wrote in a recent white paper: “This innovative approach is set to strengthen the capabilities of Sandia and its associated laboratories, and positions NextSilicon to take on future large-scale production systems.”

There are other success stories underway. A leading German automotive manufacturer has validated Maverick-2’s effectiveness for crash simulations running in the open source SU2 design and simulation software, demonstrating significant speedups and computational efficiency. Top research labs around the globe are evaluating Maverick-2, as scientific organizations look to ensure that high-precision computing remains viable, efficient, and scalable in the face of AI-driven hardware shifts. Maverick-2 proves that scientific computing doesn’t have to compromise. It delivers precision where it is needed, performance where it’s demanded, and efficiency where it is essential.

As we navigate the AI revolution, we must not lose sight of a fundamental truth: Scientific progress depends on computational precision just as much as it depends on raw performance. We can bridge the growing gap between AI and HPC requirements by rethinking how we approach scientific computing acceleration. Solutions like Maverick-2 demonstrate that we don’t have to accept compromises between precision and performance. Through intelligent, software-defined acceleration, we can deliver the computational capabilities that science demands while maintaining the precision it requires.

Elad Raz is chief executive officer of NextSilicon.

Contributed by NextSilicon.

4 Comments

  1. It’s a good point that the double precision used for scientific computation is getting overshadowed by quarter and half precision AI hardware. The dream of a plug-in card that automatically accelerates Fortran and C codes with CUDA and ROCm compatibility planned sounds wonderful.

    Where are the benchmarks, feeds and speeds?

    For example it would be interesting to see to what extent, if any, SPECfp benefits from Maverick-2 acceleration.

  2. I have some friendly suggestions for NextSilicon. There should be a liquid-cooled version of the Maverick-2 PCIe card that is compatible with Supermicro’s liquid-cooled workstation. When NextSilicon is ready to release the specifications of Maverick-2, add a “Tech Specs” tab to the NextSilicon website so potential customers can quickly find the specs of this product without any marketing fluff. Maverick-2 should support Compute Express Link Type 2 (CXL Type 2) to provide cache coherent access to host memory and the HBM on Maverick-2. NextSilicon should have a Maverick card supporting PCIe Gen 6 and CXL for sale by 2026 to align with the launch of AMD’s Venice processor and Intel’s Diamond Rapids processor.

    A cloud service should be created that allows potential customers to easily upload source code and evaluate the performance of Maverick-2. The runtime limit could be 60 seconds. The uploaded source code should be prevented from accessing the internet or the local file system. The uploaded source code should be able to print a maximum of 5000 characters. Potential customers would submit a job to a queue and see the results on the website when the job completes. No registration or email address should be required.

    The NextSilicon website should have detailed documentation describing the microarchitecture of Maverick-2, especially the memory subsystem. The latency, bandwidth and size of each level in the memory hierarchy should be listed. Techniques for optimizing the performance of Maverick-2 should be described. When Maverick-2 becomes available for sale, there should be links on the NextSilicon website to webpages where the product can be bought.

    NextSilicon should work with HPC software providers, like Q-Chem, to get their applications running on Maverick-2. The NextSilicon website should show the performance of these real-world HPC applications on Maverick-2 and alternative platforms. Maverick-2 should provide at least a 3x better price/performance ratio on real-world HPC applications than the best available NVIDIA GPUs and x86 CPUs.

  3. A nice addition to Tim’s recent article on many tech details of Maverick-2 ( https://www.nextplatform.com/2024/10/29/hpc-gets-a-reconfigurable-dataflow-engine-to-take-on-cpus-and-gpus/ )!

    Also, thanks for linking to “A recent study from Oak Ridge National Laboratory” where Table 3 shows that a 10x speedup with MxP only occurs (so far at least) for dense matrix situations, while the more relevant sparse matrix computations may see just 1.5x (eg. CFD and Climate/Weather in their Table 2, and I expect contaminant transport and environmental flows, unfortunately).

    Maverick-2’s mill cores that automatically identify computational hotspots and optimize the corresponding compute graphs (reducing data movement overhead among others) sounds like quite the ticket imho. I like the 4x better perf-per-watt vs GPUs and hope it gets realized in sparse computations (eg. both direct and iterative solvers). In particular, I expect the distributed HBM3E approach, with dynamic reconfiguration of compute, to help out with the HPCG and Graph500 memory access challenges (here in HPC-relevant FP64, rather than the more common FP32 of dataflow devices).

    Looking forward to seeing some conference presentations or papers related to Sandia’s NNSA Vanguard Penguin Tundra testing of this quite promising ASIC ( https://www.sandia.gov/research/news/sandia-partners-with-nextsilicon-and-penguin-solutions-to-deliver-first-of-its-kind-runtime-reconfigurable-accelerator-technology/ )!

    • Quite so! This could be quite useful to help navigate the environmental regulation rodeo as explained by BASF when presenting their new 3 PetaFlop/s Quriosity — “the world’s largest supercomputer used in industrial chemical research”. Those reconfigurable FP64 NextSilicon-branded Maverick-2 chips could get them to assess the “Potential impact of crop protection products on groundwater quality” in even less than a few hours (instead of years), or through more complex environment models: https://www.basf.com/global/en/who-we-are/innovation/how-we-innovate/our-RnD/Digitalization_in_R-D/supercomputer

      I reckon it’d be quite a sight to see the rancheros at Corteva (Dow/Elanco-Dupont), Bayer-Monsanto, and Syngenta, down the Arbuckle and saddle-up to join this here computational steer wrestling contest, for everyone’s improved environmental protection, with great crop production!
