One recent example I can think of is geometry kernel computing. For decades, these kernels have been largely sequential bodies of software. In CAD systems there are some opportunities to model geometry concurrently, but they hinge on the dependencies that arise from how models are defined, constructed, changed, combined, and assembled.
But there are limits, and they center squarely on sequential compute. Roughly 5 GHz is what's available these days, and that places an upper bound on what can be resolved without worrying about dependencies.
Recently, Dyndrite has built a GPU-centric geometry kernel from first principles. It can do things in the blink of an eye that are either impractical or very time consuming with the sequential kernels.
That capability is going to have profound effects on how we manufacture things. Additive manufacturing is a current target, but the idea of fast, flexible geometry is going to have ripple effects all over the place.
I would argue this kind of innovation would see less investment and overall demand had we been able to continue ramping up sequential compute capability.
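To give a rough feel for why geometry maps so well onto many-core hardware, here is a toy sketch (my own illustration, not Dyndrite's actual kernel) of the kind of independent per-vertex work involved, written with C++17 parallel algorithms as a stand-in for a GPU:

```cpp
// Toy illustration only -- not a real geometry kernel. It just shows the shape
// of the problem: millions of independent per-vertex operations with no
// dependencies between them, which is exactly what many-core hardware eats up.
#include <algorithm>
#include <execution>
#include <vector>

struct Vec3 { float x, y, z; };

// Offset every vertex of a mesh along a direction (a trivial "geometry op").
void translate(std::vector<Vec3>& verts, Vec3 d) {
    std::transform(std::execution::par_unseq,
                   verts.begin(), verts.end(), verts.begin(),
                   [d](Vec3 v) { return Vec3{v.x + d.x, v.y + d.y, v.z + d.z}; });
}
```

Every vertex is independent, so the work scales with however many cores or lanes you can throw at it; the hard parts in a real kernel are the operations where the dependencies mentioned above do show up.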
Concurrency may mean a return to more custom hardware too. Same basic reason: sequential compute is no longer seeing the massive gains it did earlier on.
When we can package tasks up into hardware, those tasks become very efficient and the software can be simplified. And because those tasks run at the same time on their own, there is less need for kernel-level software managing things like interrupts.
It's almost like we are coming around full circle!
When one looks back at, say, early-'80s computing on 8-bit machines, the first ones didn't have much in the way of custom hardware.
The original Apple II was made from discrete logic and a CPU. Software drove pretty much everything except the built-in graphics system, and that was nothing more than a frame buffer. No assists of any kind.
The CPU pretty much did everything, even reading and writing the disk drives, which had only a simple hardware assist in the form of a state machine, and that's it.
Despite a 1 MHz clock, having so much of the computer driven by software meant it could be optimized over time. Those machines were sold and used from the late '70s through the early '90s.
Along came custom hardware, and often the same CPU at a similar clock.
The Atari and C64 machines, for example, had dedicated graphics and sound devices, and the Atari had a serial I/O system that looks a little like USB does today. Those machines were able to do more because some basic capability was not driven by the CPU itself. Graphics, sound, and some I/O could all happen concurrently, and that made more things possible.
The IBM PC looked a lot like an Apple II: lots of discrete logic, but no built-in graphics system. Add-on cards, MDA, CGA, EGA, VGA and more, offered more and more capability, and like the Apple II, additional cards could deliver more features and some concurrency where needed.
Test and measurement was one use case. Music was another, where multiple devices could be added and perform concurrently. Automation and control was yet another, where lots of I/O could reasonably be added to the system and coupled with local compute resources for concurrency.
The same was seen in the 16-bit era. The Atari ST and Apple Mac were a simple CPU with a dumb graphics system, while the Amiga featured a similar CPU coupled with custom chips that made things like video editing and production possible.
Today, we've got computers with lots of little sub-systems doing many things, and they all use similar CPUs too.
We've topped out on CPUs, much like the older parts and computers did.
8-bit machines ran from under 1 MHz to a few MHz, maybe 10 tops.
16-bit machines ran from a few MHz to tens of MHz.
32-bit and 64-bit machines today top out at roughly 5 GHz.
Until the last decade or so, this was all single-core, sequential compute too. Concurrency happened via add-on dedicated subsystems, or via interrupts and a complex scheduling system able to get things done concurrently.
The only places to go are multi-core CPUs, many-core GPUs, and custom task- or process-based hardware.
Again, that is assuming we do not somehow find a way to significantly improve single-thread performance.
Software will continue to evolve. More and more things will be done in ways that can take advantage of simultaneous execution and/or dedicated hardware.
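As a small, simplified example of what that restructuring can look like, here's a sketch that fans independent chunks of work out across cores; parse() is just a hypothetical stand-in for any CPU-heavy, independent task:

```cpp
// Minimal sketch: run independent pieces of work at the same time on a
// multi-core CPU. parse() is a made-up placeholder, not a real API.
#include <cstdio>
#include <future>
#include <string>
#include <vector>

int parse(const std::string& file) {
    return static_cast<int>(file.size());  // stand-in for real, CPU-heavy work
}

int main() {
    std::vector<std::string> files = {"a.dat", "b.dat", "c.dat", "d.dat"};

    std::vector<std::future<int>> jobs;
    for (const auto& f : files)                            // fan out across cores
        jobs.push_back(std::async(std::launch::async, parse, f));

    int total = 0;
    for (auto& j : jobs)                                   // join and combine results
        total += j.get();

    std::printf("total = %d\n", total);
    return 0;
}
```

Nothing clever there, but the structure is the point: once the pieces are independent, adding cores helps; when they aren't, you're back to sequential limits.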
And it can play out in a lot of ways!
Cell phones are super interesting. In terms of general compute performance, some of them approach laptop speeds. But they are super efficient and have custom hardware assists for a lot of things.
Getting things done in real time, with enough fidelity not to degrade the task in any meaningful way, on an increasingly small power budget means doing more on a smaller battery, potentially not using a battery at all, or making devices smaller and more transparent to their users.
Desktop computers are fading fast, but they remain the home for brute-force, high-demand tasks that require tons of RAM, CPU cores, GPUs, many I/O channels, and so on.
Laptops continue to get more lean and mean, while remaining performant for many tasks.
More is possible on a cell phone every cycle.
Wearables are now a thing.
Wow, I just rambled on a lot.
Really, I think your question is better expressed in terms of:
Multi-core CPU
Role of the GPU
Custom, dedicated-to-task hardware.
Those things may contribute in a parallel sense, a concurrent one, or be mixed mode, depending.
Mix in the power budget, and that's going to speak more to what is important for the masses than anything else.
Your question reads differently in some niches, where it really is all about peak sequential compute limits and the slow move of software over to multi-core computing vs. increasingly complex custom hardware.
And on that last note, think about instructions. Just adding one or two that package up whole loops, or groups of instructions, is a powerful way to get more sequential compute power these days.
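SIMD/vector extensions are one concrete form of that: a single instruction does the work of several iterations of a scalar loop. A rough sketch, using x86 AVX intrinsics purely as an example:

```cpp
// Hypothetical illustration of "packaging up a loop" into wider instructions.
// The scalar loop does one add per element; the AVX version does eight floats
// per vaddps instruction. (x86 AVX is just one example of such an extension.)
#include <immintrin.h>

void add_scalar(const float* a, const float* b, float* out, int n) {
    for (int i = 0; i < n; ++i)
        out[i] = a[i] + b[i];
}

void add_avx(const float* a, const float* b, float* out, int n) {
    int i = 0;
    for (; i + 8 <= n; i += 8) {
        __m256 va = _mm256_loadu_ps(a + i);                 // load 8 floats
        __m256 vb = _mm256_loadu_ps(b + i);
        _mm256_storeu_ps(out + i, _mm256_add_ps(va, vb));   // 8 adds in one go
    }
    for (; i < n; ++i)                                      // handle the tail
        out[i] = a[i] + b[i];
}
```

Compilers will often vectorize loops like this automatically; either way, the win comes from the hardware packaging many operations into one instruction, not from any cleverness in the code.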