We built Popcorn Linux [1] in academia using a DSM (we also support a distributed hypervisor on top of KVM that spreads vCPUs across the infrastructure). If the application running on top is serial or exhibits little memory sharing between its threads, it works great. However, for multi-threaded software with lots of memory sharing, you’re screwed: the slowdown is enormous, and that’s with the DSM implemented over InfiniBand RDMA.
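To make “lots of memory sharing” concrete, here is a minimal sketch (plain pthreads, not Popcorn code) of the pattern that hurts a page-based DSM: every thread writes one hot location, so the page holding it has to bounce between nodes on each write fault instead of just ping-ponging between caches.

    /* Minimal sketch, plain pthreads (not Popcorn code): every thread hammers
     * one shared counter behind one lock. On a page-granularity DSM the page
     * holding the lock and counter must bounce between nodes on each write. */
    #include <pthread.h>
    #include <stdio.h>

    #define NTHREADS 8
    #define ITERS    1000000

    static long shared_counter;                 /* one hot location shared by all threads */
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    static void *worker(void *arg)
    {
        (void)arg;
        for (int i = 0; i < ITERS; i++) {
            pthread_mutex_lock(&lock);          /* lock word and counter share a page */
            shared_counter++;                   /* every increment dirties that page  */
            pthread_mutex_unlock(&lock);
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t t[NTHREADS];
        for (int i = 0; i < NTHREADS; i++)
            pthread_create(&t[i], NULL, worker, NULL);
        for (int i = 0; i < NTHREADS; i++)
            pthread_join(t[i], NULL);
        printf("final count: %ld\n", shared_counter);
        return 0;
    }

On a single machine this is ordinary lock contention; spread the threads across nodes on a DSM and each of those increments can turn into a network round trip for the whole page.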
We built Popcorn Linux in academia. But we use a DSM, which means we move memory instead of CPUs; processes can also migrate to other machines to scale out.
I invested in TidalScale because commercialization requires working with existing workloads (imposing neither new consistency models, as software DSMs did, nor an excessive code-optimization burden, as NUMA and hardware DSMs did), creating a reasonable control plane, productizing around a value proposition beyond out-of-the-box performance (which is hard), supporting customers, and creating and delivering on roadmaps. TidalScale has done all of this better than anyone who went before them, and trust me on that, because I was on the other side of these technologies for a while (not a purveyor or investor, but a customer or potential acquirer).
Yes, it certainly can be done, just as this company is doing. But, just as you say, you want software to perceive this as one big NUMA system (and I assume ccNUMA) with a single address space per process, with threads spanning the processors of multiple nodes. Yes, you can have software create the illusion of a single address space over multiple nodes. But there is one more aspect of ccNUMA that is critical to latency. No matter how many copies of a block of data exist across all of the processors’ caches, cache coherency makes it look as though there is exactly one version of it. From the software’s point of view, there is only ever one block, one location for that data. Ever. Sure, software can create that last illusion as well, simply by keeping exactly one instance of a data block across all of the non-NUMA nodes of the IO-linked (i.e., non-cache-coherent) cluster. If the data is changing, the block lives on one node at a time. (If not, copies can reside on multiple nodes.) If the data isn’t changing, this solution can work great, just like normal distributed computing. Change it frequently, and all performance hell breaks loose.
Again, from the application’s point of view, it can be made to work. And, yes, the performance of the links between systems has gotten remarkably fast, just as you have been reporting for years. But now picture your application wanting to rapidly change many data blocks, just as you would do transparently on a ccNUMA system. A block isn’t on your system, so it needs to be found, a global lock of sorts needs to be taken to change ownership, software I/O needs to be invoked to move the data, and then you get told of its availability, all functionally transparent to you. Now picture that, while this is occurring, the block gets pulled away from you by one, and then N, other nodes all competing in exactly the same way. ccNUMA gets into wars like this too, but resolves them rapidly.
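A toy, single-process model of that ownership dance (the directory, states, and message functions are all invented for illustration, not any real DSM’s protocol) shows the shape of the cost: every remote write is a transfer plus an invalidation per existing copy, each of which is a network round trip in a real cluster, and each of which can lose the race to another node.

    /* Toy, single-process model of the ownership dance above; directory, node
     * IDs, and message functions are all invented for illustration. */
    #include <stdio.h>

    #define NODES 4

    enum copy_state { NONE, READ_COPY, OWNER };

    struct page_entry {
        enum copy_state state[NODES];   /* who holds what for this page */
        int owner;                      /* node with the writable copy  */
    };

    /* stand-ins for the RDMA/network messages a real DSM would send */
    static void invalidate_copy(int node)       { printf("  invalidate copy on node %d\n", node); }
    static void transfer_page(int from, int to) { printf("  move page: node %d -> node %d\n", from, to); }

    static void write_fault(struct page_entry *e, int writer)
    {
        printf("node %d wants to write:\n", writer);
        if (e->owner != writer)
            transfer_page(e->owner, writer);    /* software I/O moves the data */
        for (int n = 0; n < NODES; n++)         /* kill every stale copy first */
            if (n != writer && e->state[n] != NONE) {
                invalidate_copy(n);
                e->state[n] = NONE;
            }
        e->owner = writer;
        e->state[writer] = OWNER;
    }

    int main(void)
    {
        /* node 0 owns the page, node 1 holds a read copy */
        struct page_entry e = { { OWNER, READ_COPY, NONE, NONE }, 0 };

        /* nodes fighting over one hot page: the "war" ccNUMA hardware resolves
         * in nanoseconds happens here in software, over a NIC */
        write_fault(&e, 1);
        write_fault(&e, 2);
        write_fault(&e, 1);
        return 0;
    }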
When we were looking at it, we were thinking of a DBMS as an example. It’s not just the data in the DB that matters here; it’s all of the supporting structures as well, and they are all being accessed rapidly by scads of threads modifying those common structures. More power to this company, but as with distributed computing, the application needs to be designed to take the extra latency into account. Used right, it’s a win. Used wrong, well, I sure hope I am wrong for the sake of the programmers using this system, because I don’t see it being as transparent to them as a ccNUMA system is assumed to be.
Being able to send the registers/stack from one physical machine to another is interesting: just pick up on the new machine and access the data that lives there. But how do you design locality into the system to prevent a situation where the virtual CPU is hopping between physical machines every fourth instruction because the data needed for a tight block of code is scattered all over the place? That would slow down execution tremendously.
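For what it’s worth, the state that gets shipped eagerly is tiny; roughly something like this on x86-64 (field names invented for the sketch, not Popcorn’s or anyone’s actual migration format). The hard part isn’t moving it, it’s the policy question above: when is moving this little blob cheaper than pulling over the data it is about to touch?

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    /* Illustrative only: roughly what "ship the registers/stack" means on x86-64. */
    struct cpu_snapshot {
        uint64_t rip, rsp, rbp, rflags;   /* where execution is, where the stack is */
        uint64_t gpr[16];                 /* general-purpose registers */
        uint8_t  fpu[512];                /* FXSAVE area: FP/SSE state */
    };

    struct migration_msg {
        struct cpu_snapshot cpu;          /* a few KB of architectural state ...  */
        uint64_t stack_base, stack_len;   /* ... plus the thread's stack pages    */
        /* heap and code are NOT shipped; they are pulled on demand, which is
         * exactly where the "hop every fourth instruction" worry comes from */
    };

    int main(void)
    {
        struct migration_msg m;
        memset(&m, 0, sizeof m);
        printf("state shipped eagerly: %zu bytes (everything else faults over later)\n",
               sizeof m);
        return 0;
    }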
Sure, a fair number of workloads likely exhibit locality. But not all do. Can static compiler/analysis tools even reason about this problem, given that things can change dynamically at runtime? Years ago, when dealing with NUMA architectures with large memory spaces, I encountered situations where, even within a single physical box, the cost of migrating a process across sockets to reach the memory it needed could strain a system in interesting ways. Here those costs involve hops across a network, which are orders of magnitude slower.
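Within one box you can at least reason about and control that placement with libnuma; a quick sketch (link with -lnuma, error handling mostly omitted) that deliberately puts the working set on one node and the thread on another reproduces the strain I’m describing, while still inside one cache-coherent machine. Across a network-backed “NUMA” there is no equally cheap knob; a miss becomes a software fault plus a network round trip.

    #include <numa.h>
    #include <stdio.h>
    #include <string.h>

    #define BUF_SIZE (64UL * 1024 * 1024)

    int main(void)
    {
        if (numa_available() < 0) {
            fprintf(stderr, "no NUMA support on this machine\n");
            return 1;
        }
        int last = numa_max_node();

        char *buf = numa_alloc_onnode(BUF_SIZE, 0);  /* memory lives on node 0    */
        if (!buf) return 1;
        numa_run_on_node(last);                      /* code runs on another node */

        memset(buf, 1, BUF_SIZE);                    /* remote access on every store
                                                        whenever last != 0        */

        printf("touched %lu MB on node 0 while running on node %d\n",
               BUF_SIZE >> 20, last);
        numa_free(buf, BUF_SIZE);
        return 0;
    }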
And then there is the question of stack frames. Does every machine have to have the same binaries installed, with the same jump targets, so a process can reach the code it needs regardless of which machine it’s on? Is there a fixup pass that has to deal with this? Does layout randomization (ASLR and the like) interfere with the ability to move CPU state between machines, and would turning it off create security problems?
I’m sure others have thought about all this; it would be interesting to know more of the details. Creating a uniform machine model with a huge memory and compute space that transcends the capabilities of a single machine, yet seamlessly shares a single binary, sounds fascinating.
Thanks for that. I did not know that.
Enlighten us with your software stack, then, and how it does what Virtual Iron, RNA Networks, or ScaleMP did.
I am well aware of NUMA awareness, but that is not what I am talking about. I am talking about the software that creates the NUMA cluster, not the software embedded in the OS kernel or hypervisor that figures out how to cordon off workloads so they have locality within a NUMA region.
You need to be less mean.