---
title: StoryKimi Zero
emoji: πŸš€
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: 5.42.0
app_file: app.py
pinned: false
license: mit
hardware: zero-gpu
short_description: Generate stories with StoryKimi model using ZeroGPU
---

# StoryKimi Zero - DeepSeek V3-Inspired Model on ZeroGPU

A PyTorch implementation of a DeepSeek V3-inspired transformer model with Mixture of Experts (MoE), latent attention, and other advanced features, deployed on Hugging Face Spaces with ZeroGPU for efficient inference.

*(StoryKimi model banner image)*

## πŸ“Š Training Results & Model Weights

**πŸ“ˆ View Training Report:** StoryKimi Training Results on WandB

**πŸ’Ύ Pre-trained Weights:**

- **Hugging Face Model:** YuvrajSingh9886/StoryKimi
- **WandB Checkpoints:** see the WandB report above for additional trained model checkpoints

## 🌟 Features

- **ZeroGPU Integration:** Dynamic GPU allocation on NVIDIA H200 slices (70 GB VRAM)
- **Latent Attention:** Efficient attention mechanism with compressed key-value representations
- **Mixture of Experts (MoE):** 8 experts with top-2 routing and shared expert support
- **SwiGLU Activation:** Gated activation function in the expert layers
- **Sinusoidal Positional Embeddings:** Position encoding for sequence understanding
- **Interactive Interface:** User-friendly Gradio interface with real-time generation
- **Configurable Sampling:** Top-k sampling with temperature control
- **Real-time Generation:** Fast inference with automatic scaling

## πŸ”§ Model Architecture

### Default Configuration

The defaults below are grouped into a config sketch after this list.

- **Embedding Dimensions:** 384
- **Decoder Layers:** 6
- **Attention Heads:** 8
- **MoE Experts:** 8 (top-2 routing)
- **Block Size:** 128 tokens
- **Vocabulary Size:** ~32,000 tokens (Llama-2-7b tokenizer)
- **Latent Dimension:** 64 (for compressed attention)
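
For orientation, here is a hypothetical sketch of these defaults as a single config object. Field names follow the trainer flags shown later (such as `--embeddings_dims` and `--experts`), but this is not the repository's actual config class:

```python
from dataclasses import dataclass

@dataclass
class StoryKimiConfig:
    # Hypothetical container mirroring the defaults listed above.
    embeddings_dims: int = 384   # embedding / model width
    decoder_layers: int = 6      # number of decoder blocks
    attention_heads: int = 8     # attention heads per layer
    experts: int = 8             # MoE experts per layer
    top_k_experts: int = 2       # each token routes to its top-2 experts
    block_size: int = 128        # maximum context length in tokens
    vocab_size: int = 32000      # Llama-2-7b tokenizer vocabulary
    latent_dim: int = 64         # compressed KV width for latent attention
```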

### ZeroGPU Configuration

- **GPU Type:** NVIDIA H200 slice
- **Available VRAM:** 70 GB per workload
- **Max Duration:** 120 seconds per generation
- **Deployment:** Hugging Face Spaces with automatic scaling

## 🎯 Usage

1. Enter your story prompt in the text box
2. Select a model checkpoint (checkpoint 2000 is available)
3. Adjust the generation parameters:
   - **Max Length:** 10-128 tokens
   - **Temperature:** 0.1-2.0 (coherence vs. creativity)
   - **Top-k:** 1-100 (vocabulary filtering)
4. Click "Generate Text" to create your AI-generated story
5. Enjoy your personalized story!

## πŸ’‘ Generation Tips

- Lower temperature (0.1-0.7) for more coherent and focused stories
- Higher temperature (0.8-2.0) for more creative and diverse outputs
- Adjust top-k to control vocabulary diversity and randomness (see the sampling sketch below)
- Use descriptive prompts for better and more relevant results
- Experiment with different lengths to find your preferred story format
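
Under the hood, top-k sampling with temperature typically looks like the following minimal sketch (a generic implementation of the technique, not the app's exact code):

```python
import torch
import torch.nn.functional as F

def sample_next_token(logits: torch.Tensor, temperature: float = 0.7, top_k: int = 50) -> int:
    """Sample one token id from a [vocab_size] logits vector."""
    logits = logits / max(temperature, 1e-5)          # <1.0 sharpens, >1.0 flattens
    if top_k > 0:
        values, _ = torch.topk(logits, min(top_k, logits.size(-1)))
        logits[logits < values[-1]] = float("-inf")   # drop everything outside the top-k
    probs = F.softmax(logits, dim=-1)
    return torch.multinomial(probs, num_samples=1).item()
```

Generation repeats this step, appending each sampled token to the context, until the requested length is reached.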

## πŸ”„ ZeroGPU Benefits

- **Free GPU Access:** No cost for users to generate stories
- **Efficient Resource Usage:** GPU allocated only when needed for inference
- **Automatic Scaling:** Handles multiple concurrent users seamlessly
- **High Performance:** NVIDIA H200 acceleration for fast generation
- **No Setup Required:** Ready-to-use interface with a pre-loaded model

πŸ—οΈ Technical Implementation

Model Features

  • Latent Attention: Compressed key-value representations for efficiency
  • Mixture of Experts: 8 experts with intelligent routing
  • Advanced Activation: SWiGLU for better performance
  • Positional Encoding: Sinusoidal embeddings for sequence understanding
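
To make "compressed key-value representations" concrete: keys and values are squeezed through a small latent bottleneck (the 64-dimensional latent from the configuration above), so a KV cache can store the compressed latent instead of full-width keys and values. The following is an illustrative simplification in the spirit of DeepSeek's multi-head latent attention, not the repository's exact module:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LatentAttention(nn.Module):
    """Keys/values pass through a latent bottleneck before attention, so a
    KV cache would hold [seq, latent_dim] entries instead of [seq, dim]."""
    def __init__(self, dim: int = 384, heads: int = 8, latent_dim: int = 64):
        super().__init__()
        self.heads = heads
        self.q_proj = nn.Linear(dim, dim)
        self.kv_down = nn.Linear(dim, latent_dim)  # compress to the latent space
        self.k_up = nn.Linear(latent_dim, dim)     # expand keys back out
        self.v_up = nn.Linear(latent_dim, dim)     # expand values back out
        self.out = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, d = x.shape
        latent = self.kv_down(x)                   # [b, t, latent_dim], the cacheable part
        q = self.q_proj(x).view(b, t, self.heads, -1).transpose(1, 2)
        k = self.k_up(latent).view(b, t, self.heads, -1).transpose(1, 2)
        v = self.v_up(latent).view(b, t, self.heads, -1).transpose(1, 2)
        att = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.out(att.transpose(1, 2).reshape(b, t, d))
```

And a minimal sketch of a top-2 MoE layer with SwiGLU experts and a shared expert, reusing the imports above (the hidden width of 1024 is an assumption, not the repository's value):

```python
class SwiGLUExpert(nn.Module):
    """SwiGLU feed-forward: silu(W1 x) gates (W3 x), then W2 projects back."""
    def __init__(self, dim: int = 384, hidden: int = 1024):
        super().__init__()
        self.w1 = nn.Linear(dim, hidden, bias=False)
        self.w3 = nn.Linear(dim, hidden, bias=False)
        self.w2 = nn.Linear(hidden, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w2(F.silu(self.w1(x)) * self.w3(x))


class Top2MoE(nn.Module):
    """Routes each token to its two highest-scoring experts and blends their
    outputs by the softmaxed router scores; a shared expert always runs."""
    def __init__(self, dim: int = 384, n_experts: int = 8):
        super().__init__()
        self.router = nn.Linear(dim, n_experts, bias=False)
        self.experts = nn.ModuleList(SwiGLUExpert(dim) for _ in range(n_experts))
        self.shared = SwiGLUExpert(dim)            # shared expert support

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: [tokens, dim]
        weights, idx = self.router(x).topk(2, dim=-1)    # top-2 routing
        weights = weights.softmax(dim=-1)
        out = self.shared(x)
        for slot in range(2):                      # both routed experts per token
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot, None] * expert(x[mask])
        return out
```

Only two of the eight experts run per token, so the layer adds parameter capacity without a proportional increase in compute.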

### Deployment Features

- **ZeroGPU Decorator:** `@spaces.GPU(duration=120)` for dynamic allocation (example below)
- **Optimized Loading:** Efficient model loading and initialization
- **Error Handling:** Robust error management for a better user experience
- **Real-time Feedback:** Live generation status and results
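
The ZeroGPU pattern is to load the model on CPU at startup and request a GPU only around the decorated function. A minimal sketch follows; the model stand-in and function body are placeholders, not the Space's actual app.py:

```python
import spaces                        # ZeroGPU helper preinstalled on Spaces
import torch

# Loaded once at startup, on CPU, so no GPU is held while the app idles.
model = torch.nn.Identity()          # placeholder for the StoryKimi model

@spaces.GPU(duration=120)            # a GPU slice is attached only for this call
def generate(prompt: str) -> str:
    model.to("cuda")                 # the H200 slice is visible only in here
    # ... tokenize the prompt, sample tokens on the GPU, decode the result ...
    return prompt                    # placeholder return
```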

## πŸš€ Local Development

Want to run this locally or contribute? Check out the full repository:

**πŸ“ Source Code:** YuvrajSingh-mist/SmolHub/StoryKimi

### Quick Local Setup

```bash
# Clone the repository
git clone https://github.com/YuvrajSingh-mist/SmolHub.git
cd SmolHub/StoryKimi

# Install dependencies
chmod +x install.sh
./install.sh

# Run the Gradio interface
cd gradio
python app.py
```

### Training Your Own Model

```bash
# Set your HF token for Llama-2 tokenizer access
export HF_TOKEN="your_token_here"

# Basic training
python trainer.py

# Advanced training with custom parameters
python trainer.py --embeddings_dims 512 --experts 16 --epochs 5
```

## πŸ“Š Model Performance

The model has been trained on diverse text data and shows strong performance in:

- **Story Generation:** Creative and coherent narrative creation
- **Text Continuation:** Natural extension of given prompts
- **Style Adaptation:** Adapting to different writing styles and genres
- **Character Development:** Creating consistent characters and dialogue

πŸ“ License

MIT License - See LICENSE file for details