We built Popcorn Linux [1] in academia using a DSM (we also support a distributed hypervisor on top of KVM that spreads vCPUs across the infrastructure). If the application running on top is serial or exhibits little memory sharing between its threads, it works great. However, for multi-threaded software with lots of memory sharing, you’re screwed: the slowdown is enormous, and that’s with the DSM implemented over InfiniBand RDMA.
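To make “lots of memory sharing” concrete, here is a minimal sketch (plain pthreads, not Popcorn code) of the pattern that hurts a page-based DSM: every thread writes one hot location, so the page holding it has to bounce between nodes on each write fault instead of just ping-ponging between caches.

    /* Minimal sketch, plain pthreads (not Popcorn code): every thread hammers
     * one shared counter behind one lock. On a page-granularity DSM the page
     * holding the lock and counter must bounce between nodes on each write. */
    #include <pthread.h>
    #include <stdio.h>

    #define NTHREADS 8
    #define ITERS    1000000

    static long shared_counter;                 /* one hot location shared by all threads */
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    static void *worker(void *arg)
    {
        (void)arg;
        for (int i = 0; i < ITERS; i++) {
            pthread_mutex_lock(&lock);          /* lock word and counter share a page */
            shared_counter++;                   /* every increment dirties that page  */
            pthread_mutex_unlock(&lock);
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t t[NTHREADS];
        for (int i = 0; i < NTHREADS; i++)
            pthread_create(&t[i], NULL, worker, NULL);
        for (int i = 0; i < NTHREADS; i++)
            pthread_join(t[i], NULL);
        printf("final count: %ld\n", shared_counter);
        return 0;
    }

On a single machine this is ordinary lock contention; spread the threads across nodes on a DSM and each of those increments can turn into a network round trip for the whole page.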
We built Popcorn Linux in academia. But we use a DSM, which means we move memory instead of CPUs; processes can also migrate to other machines to scale out.
I invested in TidalScale because commercialization requires working with existing workloads (imposing neither new consistency models, as software DSMs did, nor an excessive code-optimization burden, as NUMA and hardware DSMs did), creating a reasonable control plane, productizing around a value proposition beyond out-of-the-box performance (which is hard), supporting customers, and creating and delivering on roadmaps. TidalScale has done all of this better than anyone who went before them, and trust me on that, because I was on the other side of these technologies for a while (not a purveyor or investor, but a customer or potential acquirer).
Yes, it certainly can be done, just as this company is doing. But, just as you say, you want software to perceive this as one big NUMA system (and I assume ccNUMA) with a single address space per process, with threads spanning the processors of multiple nodes. Yes, you can have software create the illusion of a single address space over multiple nodes. But there is one more aspect of ccNUMA that is critical to latency. No matter how many copies of a block of data exist across all of the processors’ caches, cache coherency makes it look as though there is exactly one version of it. From the software’s point of view, there is only ever one block, one location for that data. Ever. Sure, software can create that last illusion as well, simply by keeping exactly one instance of a data block across all of the non-NUMA nodes of the IO-linked (i.e., non-cache-coherent) cluster. If the data is changing, the block lives on one node at a time. (If not, copies can reside on multiple nodes.) If the data isn’t changing, this solution can work great, just like normal distributed computing. Change it frequently, and all performance hell breaks loose.
Again, from the application’s point of view, it can be made to work. And, yes, the performance of the links between systems has gotten remarkably fast, just as you have been reporting for years. But now picture your application wanting to rapidly change many data blocks, just as you would do transparently on a ccNUMA system. A block isn’t on your system, so it needs to be found, a global lock of sorts needs to be taken to change ownership, software I/O needs to be invoked to move the data, and then you get told of its availability, all functionally transparent to you. Now picture that, while this is occurring, the block gets pulled away from you by one, and then N, other nodes all competing in exactly the same way. ccNUMA gets into wars like this too, but resolves them rapidly.
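A toy, single-process model of that ownership dance (the directory, states, and message functions are all invented for illustration, not any real DSM’s protocol) shows the shape of the cost: every remote write is a transfer plus an invalidation per existing copy, each of which is a network round trip in a real cluster, and each of which can lose the race to another node.

    /* Toy, single-process model of the ownership dance above; directory, node
     * IDs, and message functions are all invented for illustration. */
    #include <stdio.h>

    #define NODES 4

    enum copy_state { NONE, READ_COPY, OWNER };

    struct page_entry {
        enum copy_state state[NODES];   /* who holds what for this page */
        int owner;                      /* node with the writable copy  */
    };

    /* stand-ins for the RDMA/network messages a real DSM would send */
    static void invalidate_copy(int node)       { printf("  invalidate copy on node %d\n", node); }
    static void transfer_page(int from, int to) { printf("  move page: node %d -> node %d\n", from, to); }

    static void write_fault(struct page_entry *e, int writer)
    {
        printf("node %d wants to write:\n", writer);
        if (e->owner != writer)
            transfer_page(e->owner, writer);    /* software I/O moves the data */
        for (int n = 0; n < NODES; n++)         /* kill every stale copy first */
            if (n != writer && e->state[n] != NONE) {
                invalidate_copy(n);
                e->state[n] = NONE;
            }
        e->owner = writer;
        e->state[writer] = OWNER;
    }

    int main(void)
    {
        /* node 0 owns the page, node 1 holds a read copy */
        struct page_entry e = { { OWNER, READ_COPY, NONE, NONE }, 0 };

        /* nodes fighting over one hot page: the "war" ccNUMA hardware resolves
         * in nanoseconds happens here in software, over a NIC */
        write_fault(&e, 1);
        write_fault(&e, 2);
        write_fault(&e, 1);
        return 0;
    }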
When we were looking at it, we were thinking of a DBMS as an example. It’s not just the data in the DB that matters here; it’s all of the supporting structures as well, and they are all being accessed rapidly by scads of threads modifying those common structures. More power to this company, but as with distributed computing, the application needs to be designed to take the extra latency into account. Used right, it’s a win. Used wrong, well, I sure hope I am wrong for the sake of the programmers using this system, because I don’t see it being as transparent to them as a ccNUMA system is assumed to be.
Being able to send the registers/stack from one physical machine to another is interesting: just pick up on the new machine and access the data that lives there. But how do you design locality into the system to prevent a situation where the virtual CPU is hopping between physical machines every fourth instruction because the data needed for a tight block of code is scattered all over the place? That would slow down execution tremendously.
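For what it’s worth, the state that gets shipped eagerly is tiny; roughly something like this on x86-64 (field names invented for the sketch, not Popcorn’s or anyone’s actual migration format). The hard part isn’t moving it, it’s the policy question above: when is moving this little blob cheaper than pulling over the data it is about to touch?

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    /* Illustrative only: roughly what "ship the registers/stack" means on x86-64. */
    struct cpu_snapshot {
        uint64_t rip, rsp, rbp, rflags;   /* where execution is, where the stack is */
        uint64_t gpr[16];                 /* general-purpose registers */
        uint8_t  fpu[512];                /* FXSAVE area: FP/SSE state */
    };

    struct migration_msg {
        struct cpu_snapshot cpu;          /* a few KB of architectural state ...  */
        uint64_t stack_base, stack_len;   /* ... plus the thread's stack pages    */
        /* heap and code are NOT shipped; they are pulled on demand, which is
         * exactly where the "hop every fourth instruction" worry comes from */
    };

    int main(void)
    {
        struct migration_msg m;
        memset(&m, 0, sizeof m);
        printf("state shipped eagerly: %zu bytes (everything else faults over later)\n",
               sizeof m);
        return 0;
    }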
Sure, a fair number of workloads likely exhibit locality. But not all do. Can static compiler/analysis tools even reason about this problem, given that things can change dynamically at runtime? Years ago, when dealing with NUMA architectures with large memory spaces, I encountered situations where, even within a single physical box, the cost of migrating a process across sockets to reach the memory it needed could strain a system in interesting ways. Here those costs involve hops across a network, which are orders of magnitude slower.
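Within one box you can at least reason about and control that placement with libnuma; a quick sketch (link with -lnuma, error handling mostly omitted) that deliberately puts the working set on one node and the thread on another reproduces the strain I’m describing, while still inside one cache-coherent machine. Across a network-backed “NUMA” there is no equally cheap knob; a miss becomes a software fault plus a network round trip.

    #include <numa.h>
    #include <stdio.h>
    #include <string.h>

    #define BUF_SIZE (64UL * 1024 * 1024)

    int main(void)
    {
        if (numa_available() < 0) {
            fprintf(stderr, "no NUMA support on this machine\n");
            return 1;
        }
        int last = numa_max_node();

        char *buf = numa_alloc_onnode(BUF_SIZE, 0);  /* memory lives on node 0    */
        if (!buf) return 1;
        numa_run_on_node(last);                      /* code runs on another node */

        memset(buf, 1, BUF_SIZE);                    /* remote access on every store
                                                        whenever last != 0        */

        printf("touched %lu MB on node 0 while running on node %d\n",
               BUF_SIZE >> 20, last);
        numa_free(buf, BUF_SIZE);
        return 0;
    }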
And then there is the question of stack frames. Does every machine have to have the same binaries installed, with the same jump targets, so a process can reach the code it needs regardless of which machine it’s on? Is there a fixup pass that has to deal with this? Does layout randomization (ASLR and the like) interfere with the ability to move CPU state between machines, and would turning it off create security problems?
I’m sure others have thought about all this; it would be interesting to know more of the details. Creating a uniform machine model with a huge memory and compute space that transcends the capabilities of a single machine, yet seamlessly shares a single binary, sounds fascinating.
Thanks for that. I did not know that.
Enlighten us with your software stack, then, and how it does what Virtual Iron, RNA Networks, or ScaleMP did.
I am well aware of NUMA awareness, but that is not what I am talking about. I am talking about the software that creates the NUMA cluster, not the software embedded in the OS kernel or hypervisor that figures out how to cordon off workloads so they have locality within a NUMA region.
You need to be less mean.