HACKER Q&A
📣 vednig

How do new models differ from small updates of the same model?


  👤 dontoni Accepted Answer ✓
GPT-4 is reported to have roughly an order of magnitude more parameters than GPT-3.5, and it is also widely rumored to use a different architecture: a Mixture of Experts (MoE) rather than a single dense GPT-style transformer. OpenAI has not confirmed either detail.
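
For anyone unfamiliar with how routing works, here is a minimal sketch of an MoE layer in PyTorch. It is illustrative only: the class name `MoELayer`, the top-k router, and all dimensions are assumptions for the example, not OpenAI's (unpublished) implementation. Note that in standard MoE designs (Shazeer et al. 2017, Switch Transformer), the "experts" are small feed-forward blocks inside each layer, not standalone LLMs.

```python
# Minimal sketch of a Mixture-of-Experts layer (illustrative names/sizes).
import torch
import torch.nn as nn
import torch.nn.functional as F


class MoELayer(nn.Module):
    """Routes each token to the top-k of several small feed-forward
    'experts' and mixes their outputs by the router's softmax weights."""

    def __init__(self, d_model: int, d_hidden: int, num_experts: int, k: int = 2):
        super().__init__()
        self.k = k
        # Each expert is just a small MLP, not a full LLM on its own.
        self.experts = nn.ModuleList(
            nn.Sequential(
                nn.Linear(d_model, d_hidden),
                nn.GELU(),
                nn.Linear(d_hidden, d_model),
            )
            for _ in range(num_experts)
        )
        # The router scores every expert for every token.
        self.router = nn.Linear(d_model, num_experts)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model) -> flatten tokens for routing.
        tokens = x.reshape(-1, x.size(-1))
        scores = self.router(tokens)                    # (tokens, num_experts)
        weights, indices = scores.topk(self.k, dim=-1)  # top-k experts per token
        weights = F.softmax(weights, dim=-1)

        out = torch.zeros_like(tokens)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = indices[:, slot] == e  # tokens routed to expert e
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(tokens[mask])
        return out.reshape(x.shape)
```

The key property is sparsity: every token only activates k experts, so inference cost grows with k, not with the total number of experts.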

What is interesting to me is that they haven’t developed this idea further (or at least haven’t publicly disclosed doing so). What if you had 37 “experts”, but each one very small? Is it a requirement that each expert be a fully functional LLM on its own? Can’t they interconnect the way the brain’s lobes do?
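
To make the “37 tiny experts” idea concrete, the sketch above could be instantiated like this; every number here is made up purely for illustration:

```python
# Hypothetical configuration: 37 tiny experts, top-2 routing.
layer = MoELayer(d_model=64, d_hidden=128, num_experts=37, k=2)
x = torch.randn(4, 16, 64)   # (batch, seq_len, d_model)
y = layer(x)
print(y.shape)               # torch.Size([4, 16, 64])
# Only 2 of the 37 experts run per token, so per-token compute stays
# low even though total parameter count grows with the expert count.
```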