Speech tokens and text tokens are treated the same in LLMs; as I stated, the model just learns speech tokens as another language, and it learns to use them in much the same way it learns to use text tokens.
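Concretely, this usually just means the discrete speech codes are extra ids appended after the text vocabulary, so one shared embedding table covers both and the transformer never treats them differently. A minimal PyTorch sketch (the vocabulary sizes and offset scheme here are illustrative assumptions, not any specific model's setup):

```python
import torch
import torch.nn as nn

# Hypothetical sizes, purely for illustration.
TEXT_VOCAB = 32_000    # original text vocabulary
SPEECH_VOCAB = 4_096   # discrete audio codes (e.g. from a neural codec)
D_MODEL = 512

# One shared embedding table: speech codes are just new token ids
# appended after the text ids. The model has no notion of "modality",
# it only ever sees integer ids.
embed = nn.Embedding(TEXT_VOCAB + SPEECH_VOCAB, D_MODEL)

text_ids = torch.tensor([17, 934, 2051])               # ordinary text tokens
speech_ids = torch.tensor([3, 77, 1002]) + TEXT_VOCAB  # offset into the speech range

# A mixed text/speech sequence goes through the exact same forward pass.
mixed = torch.cat([text_ids, speech_ids])
hidden = embed(mixed)  # shape: (6, D_MODEL), identical treatment for both
```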
Unfortunately, reasoning capabilities do decrease, for a few reasons:
- There simply isn't much speech training data that forces the model to reason.
- They are usually trained on relatively little data. For example, most models are trained on trillions of tokens of text but only billions of tokens of audio.
- Small model sizes: most speech models are under 3B parameters, so they just don't have great reasoning capabilities.