Special tokens inquiry

#3
by kevinlu1248 - opened

Were these special tokens used in part of the pre-training pipeline to show other code files in context?

<|file_name_start|>
<|file_name_end|>

These tokens were introduced during our mid-training long-context extension phase to indicate file boundaries. We’ll share more about our formatting choices in our tech report.

Awesome, thanks! We already got a fine-tune out with it but so far it's worse than Qwen2.5 Coder 7B. I suspect it's because the format is wrong, so I'd love to read the tech report (even if it's an early draft).

Or if you can just share the format here so we can start testing it, that would be awesome too. Here's the one from Qwen for example:

[Image: Screenshot 2025-12-07 at 4.09.25 PM]
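In text form, the repo-level layout documented for Qwen2.5-Coder puts `<|repo_name|>` first and then a `<|file_sep|>` line with the path before each file's content. The sketch below builds that layout, and next to it a purely guessed layout for the `<|file_name_start|>` / `<|file_name_end|>` tokens. The second function is an assumption for experimentation only: the actual format has not been shared in this thread, and both function names are hypothetical.

```python
from typing import Iterable, Tuple

File = Tuple[str, str]  # (path, content)


def qwen_repo_prompt(repo_name: str, files: Iterable[File]) -> str:
    """Repo-level layout as documented for Qwen2.5-Coder."""
    parts = [f"<|repo_name|>{repo_name}"]
    for path, content in files:
        # Each file starts with <|file_sep|> followed by its path,
        # then the file content on the following lines.
        parts.append(f"<|file_sep|>{path}")
        parts.append(content)
    return "\n".join(parts)


def guessed_file_name_prompt(files: Iterable[File]) -> str:
    """ASSUMPTION: an unconfirmed guess at how the file-name tokens
    might wrap a path. Not the model authors' confirmed format."""
    return "".join(
        f"<|file_name_start|>{path}<|file_name_end|>\n{content}\n"
        for path, content in files
    )


if __name__ == "__main__":
    demo = [("a.py", "x = 1"), ("b.py", "y = 2")]
    print(qwen_repo_prompt("demo", demo))
    print(guessed_file_name_prompt(demo))
```

Trying a few variants of the guessed layout (newline placement, path-only vs. path plus content) against a held-out completion benchmark would at least show whether the format is what's hurting the fine-tune.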
