Why d-Matrix bets on in-memory compute to break the AI inference bottleneck

AI inference is poised to become the single largest compute workload. AI service providers like OpenAI have long claimed that they are constrained mostly by the availability of compute. Even as these companies build massive, power-hungry data centers to run both their training and inference workloads, the need for optimized hardware and software remains.

Meanwhile, it’s not just the frontier AI labs that are concerned with the cost of running these models; increasingly, enterprises want full control over their AI stack, too.

Open reasoning models became mainstream in 2025, and they have grown increasingly competitive with frontier models from the likes of Anthropic, Google, and OpenAI. Thanks to this, it is now significantly easier and more cost-effective to run a small fine-tuned model focused on a specific domain than to pay for API access to a large general model.

Running those models is a huge market, and unsurprisingly, we’re now seeing a lot of innovation on the hardware side as well, even if it tends to take a few years for those companies to get their inference chips to market.

Why d-Matrix decided to build a new kind of inference chip

D-Matrix, which recently raised a $275 million funding round, is tackling this by moving away from standard GPU architectures. Rather than just competing on raw FLOPs like many peers, the company’s Corsair platform bets on a heterogeneous architecture designed specifically to break the memory bottleneck.

As d-Matrix CEO Sid Sheth noted in an interview earlier this month, he was somewhat lucky not to be part of the first batch of AI chip companies to get started around ten years ago. Back then, convolutional neural networks (CNNs) were the state of the art, and most people assumed that vision acceleration and similar workloads would be the killer application for those chips. But when Sheth and Sudeep Bhoja started d-Matrix in 2019, they quickly realized that they didn’t want to be just another computer vision accelerator.

The d-Matrix Jetstream card. (Credit: d-Matrix)

Sheth, who worked on early Pentium processors at Intel and then left to work at a number of other semiconductor and networking companies, saw that Nvidia already owned the AI training space.

By 2019, Sheth says, Nvidia had effectively won the training crown. “It’d be a fool’s errand to try to do something there unless you’re substantially differentiated,” he says.

Instead, the team asked itself what a new, highly efficient inference hardware platform would look like if it could be built from scratch. D-Matrix isn’t the first company to do this, of course: Google has been building its Tensor Processing Units (TPUs) for quite a few generations now, as has AWS with its Trainium chips (which, despite their name, are now optimized for inference, too).

The area the d-Matrix team decided to focus on was the connection between compute cores and memory. Modern large language models (LLMs) need a lot of compute capacity and fast memory access to generate their tokens, but separating compute and memory introduces latency and creates a potential bandwidth bottleneck.
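
To see why this matters, a rough back-of-envelope calculation helps: during single-stream decoding, every generated token requires streaming roughly all of the model’s weights from memory to the compute units. The Python sketch below uses illustrative assumed numbers (a 70B-parameter model with 8-bit weights and about 3 TB/s of memory bandwidth), not any vendor’s specifications, and it ignores KV-cache traffic and batching.

```python
# Illustrative back-of-envelope decode estimate (assumed numbers, not vendor specs).
# During single-stream decoding, each generated token requires streaming roughly
# all model weights from memory to the compute units once.

params = 70e9           # assumed model size: 70B parameters
bytes_per_weight = 1    # assumed 8-bit quantized weights
mem_bandwidth = 3e12    # assumed memory bandwidth: ~3 TB/s

weight_bytes = params * bytes_per_weight
seconds_per_token = weight_bytes / mem_bandwidth  # bandwidth-bound lower bound

print(f"Weights moved per token: {weight_bytes / 1e9:.0f} GB")
print(f"Lower bound per token:   {seconds_per_token * 1e3:.1f} ms "
      f"(~{1 / seconds_per_token:.0f} tokens/s at batch size 1)")
```

Under those assumptions, the memory system alone caps single-stream decoding at roughly 40 tokens per second, no matter how many FLOPs sit idle next to it.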

“We knew that something special was needed, something that was more efficient, something that did not just solve for compute, but solved for compute and memory and memory bandwidth and memory capacity and all of these things,” Sheth explains. “And how do you put all of that together in a very efficient way and make it scalable? We knew something different was needed, so we just started building it.”

D-Matrix’s AI inference chip compute/memory architecture. (Credit: d-Matrix)

What d-Matrix decided to build looks quite different from the GPUs most commonly used to run models today. The company’s solution is, in a way, an even more radical approach to connecting memory and compute than Apple’s Unified Memory Architecture in its M-series chips, which places memory and compute in a single package, with the CPU and GPU sharing that memory.

For transformer-based inference, the bottleneck is rarely compute; it’s moving weights. D-Matrix addresses this issue via its Digital In-Memory Compute (DIMC) technology, where matrix multiplications occur directly within the memory cell.

“In our case, it is not like the memory block is discrete from the compute block. The memory block is the compute block. We essentially do all the matrix multiplications right inside the memory cell, and then we use adder trees that are embedded in the memory array to do the summation,” Sheth explains.
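
As a purely conceptual software analogy of that dataflow (and not d-Matrix’s actual hardware, numerics, or API), the sketch below computes a matrix-vector product by multiplying the broadcast activations against each stored weight row “in place” and summing the partial products with a pairwise adder tree:

```python
import numpy as np

def adder_tree_sum(values):
    """Pairwise (tree) reduction, mirroring an adder tree embedded in a memory array."""
    vals = list(values)
    while len(vals) > 1:
        nxt = []
        for i in range(0, len(vals) - 1, 2):
            nxt.append(vals[i] + vals[i + 1])  # one adder per pair, per tree level
        if len(vals) % 2:                      # odd element passes through to the next level
            nxt.append(vals[-1])
        vals = nxt
    return vals[0]

def dimc_style_matvec(weights, activations):
    """Matrix-vector product where each 'row of memory' multiplies the broadcast
    activations locally and an adder tree sums the partial products."""
    out = np.empty(weights.shape[0], dtype=weights.dtype)
    for r, row in enumerate(weights):
        partial_products = row * activations   # multiplies happen "at" the stored weights
        out[r] = adder_tree_sum(partial_products)
    return out

W = np.random.randn(4, 8).astype(np.float32)   # toy weight matrix "stored in memory"
x = np.random.randn(8).astype(np.float32)      # broadcast activations
assert np.allclose(dimc_style_matvec(W, x), W @ x, atol=1e-5)
```

The appeal of doing this in hardware, as Sheth describes it, is that the weights never have to cross a memory bus on their way to a separate compute unit.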

By using a chiplet approach, d-Matrix can then scale those DIMC units as needed, with a control core based on the RISC-V architecture managing the overall data flow.

“That [architecture] was the leap of faith,” Sheth says, and building out this novel compute architecture took a few years. After a few pivots in how the team approached the overall design, it settled on the current approach; the idea of using chiplets with die-to-die interconnects to scale the solution more easily emerged as part of that cycle.

As Sheth also stresses, this chiplet approach will allow the company to not just scale its solution based on customer needs, but also quickly react to changes in workloads if a new type of model architecture suddenly becomes popular.

Other AI hardware companies take a slightly different approach, with Cerebras, for example, focusing on massive wafer-sized chips with 900,000 AI cores. Cerebras puts 44GB of SRAM on its latest WSE-3 chips, but it’s not combining memory and compute in the same way d-Matrix does, and instead continues to separate the two into more distinct units.

The d-Matrix hardware/software stack. (Credit: d-Matrix)

Unlike some other companies in this space, d-Matrix currently focuses on selling its hardware as air-cooled PCIe cards (branded as Jetstream) or as trays with its accelerators built in, rather than powering its own inference service or selling a rack-based solution.

As with all custom hardware solutions, the software stack may be just as important as the underlying hardware. Different models need different kernels to run on the d-Matrix Jetstream cards. Sheth notes that most models look quite similar to each other, so the changes needed to make new DeepSeek, Llama, Qwen, or other models run on the platform are pretty straightforward.

For those models, the company offers precompiled kernels. When the company works with hyperscalers and their first-party models, creating or adapting an existing kernel is usually no problem for the developers in those companies. But over time, d-Matrix plans to make it far easier for developers to adopt its platform by integrating with Nvidia’s Triton Inference Server, for example.
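
If that Triton Inference Server integration lands, the practical benefit for developers is that the client side stays accelerator-agnostic. The sketch below uses the standard tritonclient HTTP API against a hypothetical endpoint; the server address, model name (“sub-frontier-llm”), and tensor names are assumptions for illustration, not a documented d-Matrix deployment.

```python
import numpy as np
import tritonclient.http as httpclient  # standard Triton Inference Server HTTP client

# Hypothetical endpoint and model name -- assumptions for illustration only.
client = httpclient.InferenceServerClient(url="localhost:8000")

# Hypothetical input/output tensor names; a real model's config.pbtxt defines these.
token_ids = np.array([[101, 2023, 2003, 1037, 3231, 102]], dtype=np.int64)
inp = httpclient.InferInput("input_ids", list(token_ids.shape), "INT64")
inp.set_data_from_numpy(token_ids)
out = httpclient.InferRequestedOutput("logits")

# The same client call works regardless of which accelerator serves the model.
result = client.infer(model_name="sub-frontier-llm", inputs=[inp], outputs=[out])
logits = result.as_numpy("logits")
print(logits.shape)
```

The request would look the same whether the model behind the server runs on GPUs or on d-Matrix cards, which is presumably what makes server-level integration a lower-friction adoption path than hand-adapted kernels.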

Most of d-Matrix’s customers, Sheth notes, aren’t running frontier models but what he calls “sub-frontier” models — often distilled down from the larger version of an open model.

What’s next?

As for the coming year, Sheth believes we will see true heterogeneous deployments that mix GPUs and other accelerators. He also expects even more companies to look at generative video. While products like Google’s Veo and OpenAI’s Sora have gathered quite a bit of momentum, Sheth doesn’t think these models have had their “ChatGPT moment” yet, and that’s certainly true for open video models, which are what would most likely run on d-Matrix chips.

He also notes that plenty of AI chip companies are launching all the time, but argues that it’s one thing for those startups to get funding and quite another to get a chip to market.

A lot of inference, he believes, will also not run in a data center but on the user’s phone or on their PC or laptop.

“You need to have this kind of a hybrid loop where developers can use localized inference on their desktop or notebook — or whatever it is — and develop applications in real time,” he says. “But then they can stream that application to run at scale in the cloud, and then see how it runs. And if they have to make any changes, then they can loop it back into their local developer environment. Inference cannot really go broad without the ability to run it locally, right on your notebook, or some localized computing form factor.”
