- Have you ever stopped and wondered what the secret sauce is behind DeepSeek?
- What architectural changes are responsible for the impressive improvements in GPT-5?
- How exactly did Google manage to make Gemma, its compact model, perform fabulously despite its small size?
We live in an AI-saturated world. You open Twitter (sorry, X) and there’s a new model every week. Just like people once waited in anticipation for the next iPhone release, we now wait for the next iteration of an LLM. The general public is satisfied just talking to these models, but for people like me, that’s not enough. The AI scientist in me always wants to peek inside. I want to see what’s actually going on under the hood.
And here’s the truth: the LLM race is no longer just about throwing more GPUs at the wall and scaling parameters. It’s about architecture. The small, clever design tricks that make a modern LLM more memory-efficient, more stable, and yes, more powerful.
This blog is about those design tricks for a modern LLM. I went down the rabbit hole of model papers and engineering write-ups, and I found 10 architectural optimisations that explain why models like DeepSeek V3, Gemma 3, and GPT-5 punch above their weight.
Why should you read this blog?
If you’re just curious about AI, you can skip to the cool diagrams and metaphors. If you’re building or fine-tuning models, you’ll pick up ideas that might actually save your GPU bill. And if you’re like me, someone who finds beauty in elegant systems, then you’ll appreciate the creativity here.
Also, fair warning: this is a long post. I’m aiming for breadth. Grab some coffee or chai. Or both. You’ll need it.
The recipe for cooking up a hit LLM (one that makes money and people actually use) looks something like this:
Awesome LLM = Super-sized model + Super-sized and high-quality data + Extra toppings (RLHF, RLAIF, post-training, Constitutional AI, whatever’s trending this week)
The super-sized model is usually built on a variant of the Transformer architecture released by Vaswani et al. in 2017.

However, we have observed over the years that the base version itself is not enough. Different labs, research groups, and tech companies have come out with different interpretations of the same design for a modern LLM.
Let’s look at some of these architectures in this blog.
10 Architectural Optimisation Techniques
Multi-Head Latent Attention (MLA)

Let’s break it down word by word: Multi-head is the same as in classic Transformers, multiple attention heads. Latent means squeezing things into a smaller space. Attention is still attention.
The issue: in normal attention, we project Q/K/V and store a KV cache that grows linearly with sequence length. At 128k tokens, it’s monstrous.
The trick: MLA compresses keys and values into a smaller latent vector before caching them, then projects them back up when attention is computed. Like zipping files before uploading to Dropbox.
DeepSeek uses this to save tons of memory. The small extra compute is nothing compared to buying another giant GPU. You can read more about it in the DeepSeek V2 paper.
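If you want to see the "zip before caching" idea in code, here is a minimal sketch. The dimensions and names are mine, and the real MLA has extra details (like handling RoPE separately and folding the up-projections into the query path), but the core compress-then-decompress move looks like this:

```python
import torch
import torch.nn as nn

class MLAKVCompression(nn.Module):
    """Minimal sketch of MLA-style KV compression (not the full DeepSeek layer)."""
    def __init__(self, d_model=4096, d_latent=512, n_heads=32, d_head=128):
        super().__init__()
        # Down-project hidden states into a small latent; this is what gets cached.
        self.kv_down = nn.Linear(d_model, d_latent, bias=False)
        # Up-project the latent back into full keys and values at attention time.
        self.k_up = nn.Linear(d_latent, n_heads * d_head, bias=False)
        self.v_up = nn.Linear(d_latent, n_heads * d_head, bias=False)

    def compress(self, hidden):              # hidden: (batch, seq, d_model)
        return self.kv_down(hidden)          # cache this: (batch, seq, d_latent)

    def decompress(self, latent):
        return self.k_up(latent), self.v_up(latent)

x = torch.randn(1, 1024, 4096)
layer = MLAKVCompression()
latent = layer.compress(x)                   # what actually sits in the KV cache
k, v = layer.decompress(latent)              # reconstructed just-in-time for attention
print(latent.shape, k.shape, v.shape)        # the cached latent is far smaller than k + v
```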
Sparse Mixture-of-Experts (MoE) + Shared Expert
MoE layers are one of the coolest hacks in recent years. Instead of one giant feed-forward layer, you have many experts. A router decides which experts get activated per token. This means you can have a huge total parameter count but only activate a small fraction of it per token.
But early MoE models were unstable. Enter the shared expert. It’s like a default expert that every token always goes through, ensuring some consistency across tokens. The shared expert acts like the safety net at a circus: you can still do wild flips, but you won’t die if you miss.
Here’s an illustration of this approach from the DeepSeek V3 paper.

What’s more, even Llama 4 utilizes this setup:

Meta has written a very insightful blog on Llama 4 where they discuss it in detail.
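Here is a toy sketch of what the routing plus shared-expert logic looks like. The expert sizes, top-k value, and naming are illustrative, not lifted from DeepSeek V3 or Llama 4:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoEWithSharedExpert(nn.Module):
    """Toy sparse MoE layer: top-k routed experts plus one always-on shared expert."""
    def __init__(self, d_model=512, d_ff=2048, n_experts=8, top_k=2):
        super().__init__()
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
             for _ in range(n_experts)]
        )
        self.shared_expert = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model)
        )
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.top_k = top_k

    def forward(self, x):                              # x: (tokens, d_model)
        scores = self.router(x)                        # (tokens, n_experts)
        weights, idx = scores.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)           # mixing weights for chosen experts
        routed = torch.zeros_like(x)
        for slot in range(self.top_k):                 # loop form for clarity, not speed
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e               # tokens routed to expert e in this slot
                if mask.any():
                    routed[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        # The shared expert runs on every token: the "safety net" under the routed acrobatics.
        return self.shared_expert(x) + routed

tokens = torch.randn(16, 512)
print(MoEWithSharedExpert()(tokens).shape)             # torch.Size([16, 512])
```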
Grouped-Query Attention (GQA)
Full multi-head attention is expensive: every query head gets its own key/value projections. GQA says: let’s share keys and values across groups of query heads. Suddenly, your memory and compute bill drops without killing performance.

It’s like ordering food at a restaurant as a group instead of everyone placing separate orders. Less chaos, faster service, and better experience for everybody.
Check out the GQA paper if you want to read further.
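And if you’re wondering what "sharing K/V across query groups" means mechanically, here is a stripped-down sketch. Real implementations fuse this into optimized attention kernels, but the shapes tell the story:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GroupedQueryAttention(nn.Module):
    """Toy GQA: many query heads share a smaller set of key/value heads."""
    def __init__(self, d_model=512, n_q_heads=8, n_kv_heads=2):
        super().__init__()
        assert n_q_heads % n_kv_heads == 0
        self.n_q, self.n_kv = n_q_heads, n_kv_heads
        self.d_head = d_model // n_q_heads
        self.q_proj = nn.Linear(d_model, n_q_heads * self.d_head, bias=False)
        self.k_proj = nn.Linear(d_model, n_kv_heads * self.d_head, bias=False)  # smaller projection
        self.v_proj = nn.Linear(d_model, n_kv_heads * self.d_head, bias=False)  # smaller projection
        self.o_proj = nn.Linear(n_q_heads * self.d_head, d_model, bias=False)

    def forward(self, x):                                   # x: (batch, seq, d_model)
        b, t, _ = x.shape
        q = self.q_proj(x).view(b, t, self.n_q, self.d_head).transpose(1, 2)
        k = self.k_proj(x).view(b, t, self.n_kv, self.d_head).transpose(1, 2)
        v = self.v_proj(x).view(b, t, self.n_kv, self.d_head).transpose(1, 2)
        # Each group of query heads reuses the same K/V head.
        k = k.repeat_interleave(self.n_q // self.n_kv, dim=1)
        v = v.repeat_interleave(self.n_q // self.n_kv, dim=1)
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.o_proj(out.transpose(1, 2).reshape(b, t, -1))

x = torch.randn(2, 128, 512)
print(GroupedQueryAttention()(x).shape)                     # torch.Size([2, 128, 512])
```

The key point: only the K/V projections (and the KV cache) shrink; the query side keeps all its heads, which is why quality barely moves.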
Sliding Window + Global Attention Hybrids
Do we really need every token to see every other token? Nope. Most tokens only care about their neighbors. Some rare ones, like a title, a summary sentence, or a special marker, may actually need the big picture.
So the trick is simple: mix both worlds. Use sliding window attention for the bulk (cheap and efficient), and drop in global attention every now and then to stitch the whole story together. Gemma 3 does this neatly: five local layers, then one global, repeat. Always start local.
Think of it like gossip. Most of the time, you’re tuned into what your neighbors are whispering. But once in a while, you need that one loud announcement from across the room to make sense of the whole story.
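In code, the two regimes differ only in the attention mask. Here is a small sketch; the window size and the five-local-then-one-global pattern are just illustrative parameters, not Gemma’s actual implementation:

```python
import torch

def causal_mask(seq_len: int) -> torch.Tensor:
    """Standard 'global' causal mask: each token sees every previous token."""
    return torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

def sliding_window_mask(seq_len: int, window: int) -> torch.Tensor:
    """Causal mask restricted to the last `window` tokens."""
    pos = torch.arange(seq_len)
    dist = pos[:, None] - pos[None, :]            # how far back each key sits
    return (dist >= 0) & (dist < window)

def mask_for_layer(layer_idx: int, seq_len: int, window: int = 1024) -> torch.Tensor:
    # Interleave: five local layers, then one global layer, repeat.
    if (layer_idx + 1) % 6 == 0:                  # every 6th layer goes global
        return causal_mask(seq_len)
    return sliding_window_mask(seq_len, window)

print(mask_for_layer(0, seq_len=8, window=3).int())   # banded: only nearby neighbours
print(mask_for_layer(5, seq_len=8).int())             # full lower triangle: whole prefix
```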
This isn’t brand new: Longformer (Beltagy et al., 2020) showed years ago how sliding windows could scale attention, and even experimented with different window configurations:

And Luong et al. (2015) explored global attention in neural machine translation. Here is an illustration from the paper of how it’s computed in practice:

What’s cool is how a modern LLM remixes these design ideas in production-ready ways. I definitely recommend reading the Gemma paper on how they utilize the combination of these two techniques.
Smart Normalization Tricks
LayerNorm, RMSNorm, Pre-Norm, Post-Norm, QK-Norm… sounds like a shampoo aisle, right? But these tiny tweaks actually make a huge difference.
Pre-Norm stabilizes gradients in deep networks. Without it, your loss can explode halfway through training, and you’ll be crying over hours of wasted compute. (Yes, I’ve been there.)
RMSNorm is cheaper, simpler, and often just as good. No mean subtraction; it just rescales by the root mean square. Saves memory, speeds things up.
QK-Norm normalizes queries and keys inside attention. Keeps values sane, prevents them from going wild like a toddler on sugar.
The OLMo 2 paper showed how much stability these small tweaks add. Honestly, normalization is like salt in cooking: the right pinch makes the dish delicious, but too little or too much and your whole training “dish” collapses into a bland or bitter mess.
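To make RMSNorm and QK-Norm a little less abstract, here is a small sketch. The epsilon value and exactly where QK-Norm sits vary across papers; this shows one common placement, normalizing queries and keys per head right before the dot product:

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """RMSNorm: no mean subtraction, just rescale by the root mean square."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x):
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * x * rms

# QK-Norm: normalize queries and keys per head *before* the dot product,
# so attention logits can't blow up as training progresses.
d_head = 64
q_norm, k_norm = RMSNorm(d_head), RMSNorm(d_head)
q = torch.randn(2, 8, 128, d_head)   # (batch, heads, seq, d_head)
k = torch.randn(2, 8, 128, d_head)
logits = (q_norm(q) @ k_norm(k).transpose(-2, -1)) / d_head ** 0.5
print(logits.shape)                  # torch.Size([2, 8, 128, 128])
```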
No Positional Embeddings (NoPE)
This one honestly shocked me. Some models just ditch positional embeddings entirely. No absolute, no relative, nothing. They just rely on causal masking to figure out the order.
And weirdly? It works. Models actually generalize better to longer sequences than they saw during training. It’s like teaching a dancer rhythm by feel instead of counting “1, 2, 3, 4.” They just get it.
SmolLM3 uses NoPE in some layers, proving you don’t always need fancy positional hacks. The paper Rope to Nope and Back Again: A New Hybrid Attention Strategy dives into this more if you want the geeky details.
Not mainstream yet, but definitely something to keep an eye on as it might just be the secret sauce for handling ultra-long contexts.
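In practice, NoPE mostly shows up as a per-layer switch: some layers apply RoPE to queries and keys, others simply skip it. Here is a hypothetical config-style sketch; the one-in-four ratio mirrors what SmolLM3 describes, but the helper names are mine:

```python
# Hypothetical layer-config sketch: which layers apply RoPE vs. no positional signal at all.
n_layers = 36
nope_every = 4   # e.g. every 4th layer skips RoPE (SmolLM3-style ratio)

layer_uses_rope = [((i + 1) % nope_every) != 0 for i in range(n_layers)]

# Inside the attention forward pass, the only difference is a branch like:
#   if layer_uses_rope[layer_idx]:
#       q, k = apply_rope(q, k, positions)   # rotate q/k by position-dependent angles
#   # else: rely purely on the causal mask for ordering information
print(sum(layer_uses_rope), "RoPE layers,", n_layers - sum(layer_uses_rope), "NoPE layers")
```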
Playing with Depth vs Width
Do you build deep and narrow or shallow and wide? There’s no silver bullet here. It is all about what your hardware can handle and how fast you need answers.
- Deep means more layers: great for building a rich representational hierarchy, but can be slower and trickier to train.
- Wide means fewer, wider layers: better parallelism, faster inference, easier on certain hardware.
Case study: GPT-OSS goes wide, Qwen3 goes deep. Both crush it.
Think of it like athletes: sprinters vs marathon runners. Different builds, different strengths, same elite results.
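A quick back-of-the-envelope comparison makes the trade-off concrete. Both configs below are made-up numbers, chosen so the total parameter count comes out roughly the same:

```python
def transformer_params(n_layers: int, d_model: int, ffn_mult: int = 4) -> float:
    """Rough per-block parameter count: attention (~4*d^2) + FFN (~2*ffn_mult*d^2)."""
    per_block = 4 * d_model**2 + 2 * ffn_mult * d_model**2
    return n_layers * per_block / 1e9   # in billions, embeddings ignored

deep_narrow = transformer_params(n_layers=64, d_model=4096)
shallow_wide = transformer_params(n_layers=32, d_model=5792)   # ~sqrt(2) * 4096

print(f"deep & narrow : {deep_narrow:.1f}B params, 64 sequential layers")
print(f"shallow & wide: {shallow_wide:.1f}B params, 32 sequential layers")
# Similar total size, but the wide model runs half as many sequential steps
# per token, which is what helps latency and per-layer parallelism.
```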
Attention Biases & Sinks
Attention layers often drop bias terms to save parameters. But some models are bringing them back. Biases act like tiny steering wheels by helping attention lock onto important signals.
Then there are attention sinks: dedicated tokens (or learned per-head logits) that every position can attend to. Think of them like “anchor points” in the sequence that soak up excess attention and stabilize long-context behavior.
GPT-OSS uses both. Sometimes old-school ideas come back in fashion!
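Here is a sketch of the “learned sink” flavour of the idea, where each head gets an extra logit column that can absorb attention mass. Names and placement are illustrative, and this is not GPT-OSS’s exact code:

```python
import torch
import torch.nn as nn

n_heads, seq_len, d_head = 8, 128, 64
q = torch.randn(1, n_heads, seq_len, d_head)
k = torch.randn(1, n_heads, seq_len, d_head)
v = torch.randn(1, n_heads, seq_len, d_head)

# One learnable "sink" logit per head, appended to every row of attention scores.
sink_logit = nn.Parameter(torch.zeros(n_heads))

scores = (q @ k.transpose(-2, -1)) / d_head ** 0.5          # (1, heads, seq, seq); causal mask omitted for brevity
sink = sink_logit.view(1, n_heads, 1, 1).expand(1, n_heads, seq_len, 1)
scores_with_sink = torch.cat([scores, sink], dim=-1)        # extra "attend to nowhere" column

probs = scores_with_sink.softmax(dim=-1)
attn = probs[..., :-1] @ v   # drop the sink column: the mass it absorbed simply goes nowhere
print(attn.shape)            # torch.Size([1, 8, 128, 64])
```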
Hybrid Blocks & Gated Variants
Not all blocks in modern LLMs have to be boring, identical Transformer layers. Some models mix attention layers with gated blocks, state-space layers, or other lightweight modules.
Why? Power + efficiency. It’s like a relay race team, some runners are sprinters and some are endurance champs, but together they crush it.
Case in point: Qwen3-Next mixes Gated Attention and DeltaNet-style blocks. Totally experimental, but super exciting.

Honestly, this brings back a bit of nostalgia for me, since I grew up in the era of gates in NLP: LSTMs and GRUs were the coolest architectures on the block.
Seeing gates make a comeback in modern Transformers reminds me why foundations still matter. Modern AI scientists shouldn’t just chase the latest trends; sometimes the old tricks come in handy in ways you never expect.
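To give a flavour of what “gated” means here, below is a tiny sketch of a sigmoid output gate wrapped around an attention block. This is the generic gating pattern, not Qwen3-Next’s actual design:

```python
import torch
import torch.nn as nn

class GatedAttentionBlock(nn.Module):
    """Generic gated-output attention: a learned sigmoid gate scales the attention output."""
    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.gate = nn.Linear(d_model, d_model)   # produces per-dimension gate values

    def forward(self, x):                          # x: (batch, seq, d_model)
        t = x.size(1)
        causal = torch.ones(t, t).triu(1).bool()   # True = masked (future positions)
        attn_out, _ = self.attn(x, x, x, attn_mask=causal)
        g = torch.sigmoid(self.gate(x))            # values in (0, 1), like an LSTM gate
        return x + g * attn_out                    # the gate decides how much attention to let through

x = torch.randn(2, 64, 512)
print(GatedAttentionBlock()(x).shape)              # torch.Size([2, 64, 512])
```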
Multi-Token Prediction (MTP)
Traditionally, LLMs predict one token at a time. Slow, painstaking, like pecking each letter on a typewriter.
MTP says: Why not predict multiple tokens at once? Train the model to see ahead a bit, and combine it with speculative decoding. It leads to faster training, faster inference, same or even better performance.

It’s like typing with autocomplete on steroids. You’re not pecking letter by letter anymore; you’re basically sliding over whole words, sometimes phrases, in one go. Qwen3-Next loves this trick, and it’s one of the reasons it feels snappy despite being a big model.
Here’s the original MTP paper for you to get more context about this approach.
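A minimal sketch of the training side of the idea: keep one shared trunk, bolt on a few extra output heads, and train head k to predict the token k+1 steps ahead. The head count and loss weighting here are arbitrary choices of mine, not the paper’s exact recipe:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiTokenHeads(nn.Module):
    """Toy MTP setup: a shared trunk feeds one output head per future offset."""
    def __init__(self, d_model=512, vocab=32000, n_future=3):
        super().__init__()
        self.heads = nn.ModuleList([nn.Linear(d_model, vocab) for _ in range(n_future)])

    def forward(self, hidden):                     # hidden: (batch, seq, d_model)
        # heads[k] is trained to predict the token at position t + k + 1
        return [head(hidden) for head in self.heads]

def mtp_loss(logit_list, targets):                 # targets: (batch, seq) token ids
    loss = 0.0
    for k, logits in enumerate(logit_list):
        # Shift targets so head k lines up with the token (k+1) steps ahead.
        shifted = targets[:, k + 1:]
        trimmed = logits[:, : shifted.size(1)]
        loss = loss + F.cross_entropy(trimmed.reshape(-1, trimmed.size(-1)), shifted.reshape(-1))
    return loss / len(logit_list)

hidden = torch.randn(2, 16, 512)                   # pretend this came from the transformer trunk
targets = torch.randint(0, 32000, (2, 16))
print(mtp_loss(MultiTokenHeads()(hidden), targets))  # scalar training loss
```

At inference, the extra heads double as cheap draft predictions for speculative decoding, which is where the speed-up comes from.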
Key Highlights
Alright, let’s zoom out and take stock. We’ve talked about 10 design tricks that a modern LLM uses to punch way above its weight. What really matters?
Memory & Efficiency: MLA and GQA are lifesavers if your KV cache is exploding. Compressing keys and values and sharing K/V projections saves memory without killing performance. Sliding window + global hybrids? Same idea. Most tokens don’t care about the whole context, so give them local attention and sprinkle in global layers when needed, like Gemma 3 does.
Capacity & Scaling: MoE with shared experts, hybrid blocks, and depth vs width choices prove there’s no single right answer. Need huge capacity without slowing inference? Sparse MoE + shared expert. Prefer parallelism and speed? Go wide. Look at Qwen3 (deep) vs GPT-OSS (wide); they both work. Hybrid blocks, whether gated layers or state-space blocks, give you the flexibility to mix and match.
Stability & Training: RMSNorm, Pre/Post-Norm, QK-Norm, and dense layers before MoE routing are the unsung heroes. They prevent exploding gradients, stabilize early training, and save you from tearing your hair out. Tiny tweaks, huge payoff.
Generalization & Context: NoPE, attention sinks, sliding/global hybrids, all help with long sequences and make models generalize better. Sometimes, throwing away positional embeddings actually makes the model stronger.
Fun Takeaway: A lot of these tricks are old ideas reborn. Gates? Hello LSTMs and GRUs. MTP? Typing with autocomplete on steroids. Lesson: know your foundations. The old stuff will save your life when the new stuff gets weird.
Recommendations (my take)
- Memory tight? Start with MLA, GQA, and sliding window attention.
- Scale smart? Sparse MoE + shared expert.
- Stabilize training? Don’t skip normalization tricks and dense prep layers.
- Latency a concern? Choose depth vs width wisely.
- Long-context needed? Try NoPE, attention sinks, and hybrid attention.
Conclusion
Here’s the bottom line: a modern LLM design isn’t just about “bigger = better.” It’s a mix of efficiency hacks, scaling strategies, stability tweaks, and generalization tricks. Put them together thoughtfully, and you get a model that’s fast, powerful, and actually usable.
Understanding why these tricks work is more important than just copying them. The old ideas, such as gates, normalization, and careful routing, keep coming back and saving your GPU, your sanity, or both.
At the end of the day, building LLMs is part engineering, part art, and part controlled chaos. Nail the foundations, sprinkle in clever hacks, and you’ve got a model that doesn’t just exist but also performs, scales, and impresses. And honestly, that’s way more fun than just chasing the latest hype.
Still curious about these techniques that help design a modern LLM? Feel free to check out Sebastian Raschka’s Blog, where he covers these topics in further detail!