1. Make instructions and CPUs simpler for the CPU rather than for human beings, since most assembly is written by compilers, not humans.
2. Use a "load-store" paradigm that prefers registers over slower RAM for intermediate values.
3. Use the chip space saved by simpler instructions for more registers, to support #2 above.
x86 traces its lineage to the 8080 CPU of the 1970s, an era when RAM was about as fast as the CPU, 4K of RAM cost $2,000, and hand-written assembly was common. So it made sense to squeeze the most action out of each instruction and rely on memory for intermediate values.
The RISC/CISC "war" predates what the mid-90's and beyond brought us: large CPU caches, SIMD instructions, multi-core CPUs, and the numerous tricks CPUs now play on the instruction stream, like branch prediction, register renaming, and out-of-order execution. So the distinction doesn't matter much anymore.
SIMD in particular is something like the RISC paradigm on steroids: load wide registers with many values and operate on all of them with a few instructions.
ARM's simplicity, along with the fact that ARM licenses its designs to other silicon designers/integrators, now translates into lower power usage/heat output and easy integration into system-on-a-chip designs. That's why it's so common in phones.
ARM64 (AArch64) and 32-bit ARM are different architectures with different register sets and instruction sets, just as AMD64 (x86-64) and 32-bit x86 are.
---------
Let me start off by saying how they're the same. Both modern x86 and modern Application-class ARM chips are pipelined, superscalar, SIMD-accelerated, out-of-order, micro-coded, multicore systems.
* Microcoded / macro-op fused -- Both ARM and x86 split complex instructions into micro-ops, and fuse adjacent instructions together, for better efficiency. For example, ARM cores fuse the AESE + AESMC pair into a single fused operation that executes once per clock tick, and x86 fuses cmp + jmp (compare-and-branch) pairs. "Under the hood", both x86 and ARM behave as load/store architectures.
* Pipelined -- Every clock cycle, the CPU tries to start a new instruction, even if previous instructions haven't completed yet. A multiply commonly takes around 5 clock cycles, for example, but both ARM and x86 can still start a new multiply every clock cycle.
* Superscalar -- Instead of executing one instruction at a time, modern CPUs try to execute many instructions in parallel. This "combos" with pipelining. Modern CPUs can sustain about 4 uops per clock, though Apple's newest ARM chips try for 8 uops per clock (the Cortex-A78 only does 4, IIRC).
* Out-of-order -- The pipelined and superscalar parts of modern processors are extremely aggressive, and are willing to execute assembly instructions out of order; a "retirement" unit puts the results back in order afterward. This out-of-order machinery mostly works with the reorder buffer (ROB) and "secret" rename registers that hold values for the sole purpose of out-of-order execution.
* Multicore -- Both x86 and ARM have multiple cores that can each execute a pipelined / superscalar / out-of-order thread.
------------
What are the differences? Well, they have different assembly languages, but that's kinda obvious and boring.
IMO, the most "exciting" difference is the memory model. The memory model is the set of rules that the multicore part of the chip follows. In particular: when to flush the L1 cache, and when to wait on external cores.
Let's think about the following piece of code:
store [x], 1; // Set the memory location "x" to 1
// What if Thread#2 sets "x" to 2 here??
load R0, [x]; // Load the memory location "x"
printf("%d\n", R0); // print the value of R0
Now the printf will print "1" under most circumstances. But another thread could cause printf to print "2", at least in theory. In practice, though, theory and practice diverge. What if "x" is sitting in this core's L1 cache? L1 caches are never shared between cores on any system I'm aware of (x86 or ARM). So if your core can satisfy that "load R0, [x]" entirely from its own local state, it will essentially never see x change between those two lines!
In practice, x86 and ARM will essentially never see x set to 2 between the store and the load. Both CPUs have "decided" to optimize this path to keep the above lines as fast as possible. (In fact, this optimization is called store-to-load forwarding: instead of reading the value of "x" from memory, both x86 and ARM forward the value straight from the "store [x], 1" instruction right above it.)
-------
Okay, so ARM and x86 do the same thing in this case. But I think I've explained enough to finally bring up the "memory model" of these systems: that is, the decisions the multicore designers made about where, and when, a core is allowed to update variables in a multithreaded context.
In between each assembly instruction, your CPU-cores are communicating with each other (called "snooping"), determining if loads or stores need to be reordered. And it turns out, x86 and ARM have made different decisions here.
The x86 system follows a set of rules called "Total Store Ordering" (TSO), which can be summarized as "stores become visible to other threads in program order". If Thread#1 does "store [x], then store [y]", then every other thread sees "store [x], then store [y]", in that order.
ARM, however, is far more aggressive: ARM cores are willing to reorder stores in their memory model. That means if Thread#1 does "store [x], then store [y]", it may look like "store [y], then store [x]" from Thread#2.
But why does ARM do this? Simple: because it's faster. If "x" stays in L1 cache but "y" leaves L1 cache first, then "y" reaches global memory before "x" does.
On x86, by contrast, a snoop message must be sent when "y" is flushed out of L1 cache, telling all other cores that "x" has changed as well, and Core#1 must flush "x" before any other core is allowed to read it.
------
ARM REQUIRES the programmer (or really, the compiler) to insert memory barriers into the code to force the stores back into the correct order.
In practice, these memory barriers are only needed in your "spinlock" and "mutex" implementations. In all other cases, the ARM CPU is allowed to reorder reads and writes as seen from other threads, which buys slightly faster performance across the board.
x86 was more conservative with its multithreaded implementation: fewer memory barriers are needed because fewer reorderings are allowed. As such, I expect a whole slew of multithreaded bugs to surface as x86 code is ported to ARM. In fact, Apple foresaw this and provided a total-store-ordering mode on their most recent ARM chips!