I’m a bit confused by the “hidden_size ≥ 512 threshold” framing.
High-tier models include 32L (hidden=384) and 64L (hidden=256), while several configs below 512 fail.
This seems less like a strict hidden-size threshold and more like specific depth–width interaction points.
Could you clarify why this is described as a hidden-dimension threshold rather than a joint depth–width effect?