However, the amount of resources at stake is enormous. The delta between NVIDIA's market value and AMD's is bigger than the annual GDP of Spain. Even if they needed to hire a few thousand engineers at a few million dollars in comp each, it'd still be a good investment.
They have made an alternative to the CUDA language with HIP, which can do most of what the CUDA language can.
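As a rough illustration (my own untested sketch, not AMD sample code): a minimal CUDA vector-add like the one below maps to HIP almost line-for-line. The kernel body is unchanged, and the host calls just swap the cuda prefix for hip (cudaMalloc -> hipMalloc, cudaMemcpy -> hipMemcpy, same <<<>>> launch syntax), which is the mechanical translation the hipify tools perform.

    #include <cuda_runtime.h>   // HIP build: <hip/hip_runtime.h>
    #include <cstdio>
    #include <cstdlib>

    // Element-wise add; identical kernel source under HIP.
    __global__ void vecAdd(const float* a, const float* b, float* c, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) c[i] = a[i] + b[i];
    }

    int main() {
        const int n = 1 << 20;
        const size_t bytes = n * sizeof(float);

        float *ha = (float*)malloc(bytes), *hb = (float*)malloc(bytes), *hc = (float*)malloc(bytes);
        for (int i = 0; i < n; ++i) { ha[i] = 1.0f; hb[i] = 2.0f; }

        float *da, *db, *dc;
        cudaMalloc(&da, bytes);                                // hipMalloc
        cudaMalloc(&db, bytes);
        cudaMalloc(&dc, bytes);
        cudaMemcpy(da, ha, bytes, cudaMemcpyHostToDevice);     // hipMemcpy / hipMemcpyHostToDevice
        cudaMemcpy(db, hb, bytes, cudaMemcpyHostToDevice);

        vecAdd<<<(n + 255) / 256, 256>>>(da, db, dc, n);       // same launch syntax in HIP

        cudaMemcpy(hc, dc, bytes, cudaMemcpyDeviceToHost);
        printf("c[0] = %.1f (expect 3.0)\n", hc[0]);

        cudaFree(da); cudaFree(db); cudaFree(dc);              // hipFree
        free(ha); free(hb); free(hc);
        return 0;
    }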
You could say that they haven't released supporting libraries like cuDNN, but they are making progress on that front with AiTer, for example.
You could say that they have fragmented their efforts across too many different paradigms, but I don't think that's it, because Nvidia also supports a lot of different programming models.
I think the reason is that they have not prioritised ROCm support across all of their products. There are too many different architectures with varying levels of support. This isn't just historical: there is no ROCm support for their latest AI Max 395 APU. There is no nice cross-architecture ISA like PTX. The drivers are buggy. It's just all a pain to use. And for that reason "the community" doesn't really want to use it, so it remains a second-class citizen.
This is a management and leadership problem. They need to make using their hardware easy. They need to support all of their hardware. They need to fix their driver bugs.
My understanding is that the real market for #3 (GPUs for compute) didn't show up until very late, so AMD's GCN bet didn't pay off. Even in 2021, NVIDIA's revenue from gaming was above data center revenue (a segment in which they basically had no competition, and where 100% of the revenue came through CUDA). AMD meanwhile won the battle for the Playstation and Xbox consoles, and was executing a turnaround in data centers (with EPYC) and in CPUs (with Zen). So my guess as to why they might have underinvested is basically: for much of the 2010s they were just trying to survive, so they focused on battles they could win that would bring them revenue.
This high-level prioritization would explain a lot of the "misexecution": e.g. underhiring for ROCm, prioritizing the APU SDK experience over data center, or a testing philosophy of "does this game work OK? great".
They can't bring themselves to put so much money into it that it would be an obvious fail if it didn't work.
But I think more importantly, what is often missed in this analysis is that most programmers doing ML work aren't writing their own custom kernels. They're just using pytorch (or maybe something even more abstracted/multi-backend like keras 3.x) and letting the library deal with the implementation details of their GPU.
That doesn't mean there aren't footguns in that particular land of abstraction, but the delta between the two providers is not nearly as stark as it's often portrayed. At least not for the average programmer working with ML tooling.
(EDIT: also worth noting that the work being done in the MLIR project has a role to play in closing the gap as well for similar reasons)
"it'd still be a good investment." - that's definitely not a sure thing. Su isn't a risk taker, seems to prefer incremental growth, mainly focused on the CPU side.
The first time, they went ahead and killed off their own effort and consolidated on OpenCL. OpenCL went terribly (in no small part because NVIDIA held out on OpenCL 2 support), and that set AMD back a long way.
Beyond that, AMD does not have a strong software division, or one with the teeth to really influence hardware to its needs. They have great engineers, but leadership doesn't know how to get them to where they need to be.
Nvidia is massively overvalued right now. AI has rocketed them into absolute absurdity, and it's not sustainable. Put aside the actual technology for a second and realize that the public image of AI is at rock bottom. Every single time a company puts out AI-generated materials, it receives immense public backlash. That's not going away any time soon, and it's only likely to get worse.
Speaking as someone who's not even remotely anti-AI, I wouldn't touch the shit with a 10-foot pole because of how bad the public image is. The moment capital realizes this, that bubble is going to pop, and it's going to pop hard.
That switch will reduce the NVIDIA margins by a lot. NVIDIA probably has 2 years left of being the only one with golden shovels.
Perhaps in keeping with the broader thread here, they had only ever funded a single contract developer to work on it, and then discontinued the project (for who-knows-what legal or political reasons). But the developer had stipulated that he could open-source the pre-AMD state if the contract was dissolved, and he did exactly that! The project now has an actively contributing community and is rapidly catching up to where it was.
https://www.phoronix.com/review/radeon-cuda-zluda
https://vosen.github.io/ZLUDA/blog/zludas-third-life/
https://vosen.github.io/ZLUDA/blog/zluda-update-q4-2024/
IMO it's vital that even if NVIDIA's future falters in some way, the (likely) collective millennia of research built on top of CUDA will continue to have a path forward on other constantly-improving hardware.
It's frustrating that AMD will benefit from this without contributing - but given the entire context of this thread, maybe it's best that they aren't actively managing the thing that gives their product a future!
Throwing a vast amount of effort at something isn't sufficient.
https://www.modular.com/blog/democratizing-ai-compute-part-5...
(Maybe "nothing special" is a little bit strong, but as a chip designer I've never seen the actual NVIDIA chips as all that much of a moat. What makes it hard to find alternatives to NVIDIA is their driver and CUDA stack.)
Curious to hear others' opinions on this.
So far those guesses haven't worked out (not surprising as they have no specific ML expertise and are not partnered with any frontier lab), and no amount of papering over with software will help.
That said, I'm hopeful the rise of reasoning models can help: no one wants to bet the farm on their untested clusters, but buying some chips for inference is much safer.
If it's a question of first principles, there is a small glimmer of hope in a company called tinygrad making the attempt - https://geohot.github.io//blog/jekyll/update/2025/03/08/AMD-...
If the current 1:16 AMD:NVIDIA stock value difference is entirely due to the CUDA moat, you might make some money if the tide turns. But who can say…
OpenCL is completely open (source), so why wouldn't we, all of us, throw our weight behind OpenCL?
(no, I have no connection with them and have nothing to do with them, other than having learned a bit).
https://www.linkedin.com/posts/jtatarchuk_beyondcuda-democra...
Actually, it might be better to spend $1B on shares and 10x $100M on development: run ten attempts in parallel and use the best of them.
If they wanted to prioritize this, they would. They're simply not taking it seriously.
But — too late. First versions of ROCm were terrible. Too much boilerplate. 1200 lines of template-heavy C++ for a simple FFT. Can't just start hacking around.
Since then, the CUDA way has been cemented in the minds of developers. Intel now has oneAPI, and it is not too bad, and hackable, but there is no hardware and no one will learn it. And HIP is "CUDA-like", so why not just use CUDA, unless you _have to_ use AMD hardware?
Tl;dr: the first versions of ROCm were bad. Now they are better, but it is too late.
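For a sense of scale on the FFT complaint: against cuFFT, a "simple FFT" is roughly the sketch below (my own untested sketch, error handling omitted), and today's hipFFT mirrors that API almost name-for-name (hipfftPlan1d, hipfftExecC2C, and so on). Which underlines the point: the cleanup arrived only after this way of working was already what everyone had learned.

    #include <cuda_runtime.h>
    #include <cufft.h>          // hipFFT analogue: hipfft.h (include path varies by ROCm version)
    #include <cstdio>
    #include <cstdlib>

    int main() {
        const int N = 1024;
        const size_t bytes = N * sizeof(cufftComplex);

        // All-ones complex input, so bin 0 of the forward transform should equal N.
        cufftComplex* h = (cufftComplex*)malloc(bytes);
        for (int i = 0; i < N; ++i) { h[i].x = 1.0f; h[i].y = 0.0f; }

        cufftComplex* d;                                  // hipfftComplex
        cudaMalloc(&d, bytes);
        cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice);

        cufftHandle plan;                                 // hipfftHandle
        cufftPlan1d(&plan, N, CUFFT_C2C, 1);              // hipfftPlan1d(..., HIPFFT_C2C, 1)
        cufftExecC2C(plan, d, d, CUFFT_FORWARD);          // hipfftExecC2C(..., HIPFFT_FORWARD)

        cudaMemcpy(h, d, bytes, cudaMemcpyDeviceToHost);
        printf("bin 0 = %.1f (expect %d)\n", h[0].x, N);

        cufftDestroy(plan);                               // hipfftDestroy
        cudaFree(d);
        free(h);
        return 0;
    }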
Neither will encroach too much on the other's turf. The two companies don't want to directly compete on the things that really drive the share price.