SpoomplesMaxx Base — Qwen3-14B CPT

A continued pre-training (CPT) of Qwen3-14B-Base on a curated mix of fiction, character knowledge, prose, and domain-specific corpora. This is the base model — further SFT and DPO stages follow.

Model Description

This model is part of the SpoomplesMaxx training pipeline: CPT → SFT → DPO

The CPT stage teaches the model general language patterns, domain knowledge, and writing styles by training on raw text corpora without chat templates. It grounds the model in character knowledge, narrative prose, multilingual content, and uncensored language before instruction tuning.

Training Data

CPT Curriculum (3 phases)

The prepared dataset (aimeri/spoomplesmaxx-cpt-small-Qwen3-14B-Base) was assembled from three curriculum phases, each with different repeat factors to control emphasis.

| Phase | Focus | Repeat | Key Sources |
|---|---|---|---|
| Phase 1: Core Knowledge | Characters, lore, world-building | – | Custom character cards, AO3 works, fictional character DBpedia, NSFW prose (Literotica, nsfwstory, NSFW-Stories), movie spoilers |
| Phase 2: Domain Prose | Writing quality, narrative style | – | Gutenberg prose, LongPage (long-form + planning traces), light novels, TV dialogue, FimFiction, TV Tropes, Brazilian news/law, Huberman Lab transcripts |
| Phase 3: Language Diversity | Robustness, multilingual | – | Toxic conversations (pile-toxicity-balanced series), Fandom wiki lore |
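
To make the repeat-factor idea concrete, here is a purely illustrative sketch: repeating one phase's data before mixing upweights it relative to the others. The file names and repeat counts below are placeholders, not the actual sources or weights used.

from datasets import load_dataset, concatenate_datasets

# Hypothetical illustration of repeat-factor emphasis: Phase 1 is repeated 3x
# before mixing, so it contributes proportionally more tokens to the CPT run.
phase1 = load_dataset("json", data_files="phase1_core_knowledge.jsonl", split="train")
phase2 = load_dataset("json", data_files="phase2_domain_prose.jsonl", split="train")

mixed = concatenate_datasets([phase1] * 3 + [phase2]).shuffle(seed=42)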

Total training samples: 50,000 (after applying dataset_limit, repacked into 3072-token sequences)

The prepared dataset is pre-tokenized and publicly available on Hugging Face. Private data (custom character cards) is included only in tokenized form.
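
If you want to inspect the training data, a minimal sketch for pulling the prepared dataset from the Hub follows. The split and column names (e.g. input_ids) are assumptions; check the dataset card for the actual schema.

from datasets import load_dataset

# Load the prepared, pre-tokenized CPT dataset from the Hugging Face Hub.
# Split and column names ("train", "input_ids") are assumed; verify against the dataset viewer.
ds = load_dataset("aimeri/spoomplesmaxx-cpt-small-Qwen3-14B-Base", split="train")

print(ds)                          # row count and column names
print(len(ds[0]["input_ids"]))     # expected 3072 if samples are repacked to fixed-length blocks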

Training Configuration

| Parameter | Value |
|---|---|
| Base Model | Qwen/Qwen3-14B-Base |
| Training Phase | CPT (Continued Pre-Training) |
| Epochs | 2 |
| Steps | ~782 |
| Batch Size (per device) | 4 |
| Gradient Accumulation | 8 |
| Effective Batch Size | 128 (4 × 8 × 4 GPUs) |
| Learning Rate | 3e-5 |
| LR Scheduler | Cosine with min LR (min_lr_rate=0.01 → floor 3e-7) |
| Warmup Ratio | 0.0 |
| Weight Decay | 0.1 |
| Max Gradient Norm | 1.0 |
| Max Sequence Length | 3072 |
| Precision | BF16 |
| Optimizer | 8-bit Paged AdamW |
| Gradient Checkpointing | Yes |
| Liger Kernel | Yes (fused lm_head + cross-entropy) |
| Dataset Repacking | Yes (stream mode, 3072 tokens) |
| DeepSpeed | ZeRO-3 (full parameter sharding) |
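
For orientation, the table above maps fairly directly onto Hugging Face TrainingArguments. The sketch below is a reconstruction for reference only, not the actual training script; output_dir and the DeepSpeed config path are placeholders, and the scheduler, optimizer, and Liger options assume a recent transformers release.

from transformers import TrainingArguments

# Approximate reconstruction of the CPT hyperparameters above.
# Effective batch size: 4 (per device) x 8 (grad accum) x 4 (GPUs) = 128 sequences/step.
# Optimizer steps: 50,000 samples x 2 epochs / 128 ≈ 782, matching "~782" above.
args = TrainingArguments(
    output_dir="spoomplesmaxx-cpt",              # placeholder
    num_train_epochs=2,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,
    learning_rate=3e-5,
    lr_scheduler_type="cosine_with_min_lr",
    lr_scheduler_kwargs={"min_lr_rate": 0.01},   # LR floor = 3e-5 * 0.01 = 3e-7
    warmup_ratio=0.0,
    weight_decay=0.1,
    max_grad_norm=1.0,
    bf16=True,
    optim="paged_adamw_8bit",
    gradient_checkpointing=True,
    use_liger_kernel=True,                       # requires the liger-kernel package
    deepspeed="ds_zero3.json",                   # placeholder ZeRO-3 config path
)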

Hardware

  • GPU: 4× NVIDIA H100
  • Training Time: ~14 hours

Metrics

| Metric | Value |
|---|---|
| Train Loss | 1.735 |
| Train Perplexity | 5.668 |
| Samples/sec | 1.964 |
| Total FLOPs | 1.26 × 10¹⁸ |
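
The reported perplexity is simply the exponential of the mean cross-entropy train loss, which is easy to check:

import math

# perplexity = exp(mean cross-entropy loss)
print(math.exp(1.735))  # ≈ 5.67, consistent with the reported train perplexity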

Intended Use

This CPT base model is intended as a foundation for the SpoomplesMaxx pipeline:

  • Next step: SFT with chat/instruction data and persona injection
  • Final step: DPO alignment with preference pairs

Use cases after full pipeline:

  • Creative writing and fiction generation
  • Character roleplay with consistent personas
  • Uncensored conversational AI
  • Multilingual content (English, Portuguese, some Italian/Spanish)

Not Recommended For:

  • Direct use as a chat assistant (this is a base model — no instruction tuning yet)
  • Factual Q&A or knowledge retrieval (CPT emphasizes narrative over factuality)
  • Production safety-critical applications

Limitations

  • This is a base model without instruction following. It will continue text, not answer questions.
  • Domain knowledge is biased toward fiction, character knowledge, and creative writing.
  • Contains uncensored/NSFW training data — outputs may include explicit content.
  • Multilingual content is weighted toward English with some Portuguese/Brazilian content.
  • 3072-token context window during CPT (Qwen3-14B-Base natively supports a 32K context; longer contexts are untested post-CPT).

How to Use

from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "aimeri/spoomplesmaxx-base-qwen3-14b",
    dtype="auto",
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("aimeri/spoomplesmaxx-base-qwen3-14b")

# Text continuation (base model — no chat template)
inputs = tokenizer("The castle stood silent against the darkening sky, its towers reaching toward clouds that promised rain. Inside,", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256, do_sample=True, temperature=0.8, top_p=0.95)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
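
Passing do_sample=True forces sampled decoding regardless of the model's default generation config; without sampling, temperature and top_p are ignored. For creative continuation, higher temperature generally trades coherence for variety.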

Citation

If you use this model, please cite the base model and datasets:

@misc{qwen3-14b-base,
  title={Qwen3 Technical Report},
  author={Qwen Team},
  year={2025},
  url={https://huggingface.co/Qwen/Qwen3-14B-Base}
}

Acknowledgments

  • Qwen Team for the Qwen3-14B-Base model
  • PocketDoc for the Dans-Prosemaxx datasets, the DanChat format, and related datasets
  • PJMixers-Dev for curated fiction and RP datasets
  • Hugging Face for the Transformers and Accelerate libraries
  • DeepSpeed for ZeRO-3 distributed training
