---
title: Kimi 48B Fine-tuned - Inference
emoji: 🚀
colorFrom: purple
colorTo: blue
sdk: docker
pinned: false
license: apache-2.0
app_port: 7860
suggested_hardware: l40sx4
---

# 🚀 Kimi Linear 48B A3B Instruct - Fine-tuned

Professional inference Space for the fine-tuned Kimi-Linear-48B-A3B-Instruct model.

## Model Information

## Features

### ✨ Professional Chat Interface

- Clean, modern UI for seamless conversations
- Chat history with copy functionality
- System prompt customization

### ⚙️ Advanced Generation Settings

- Temperature control for creativity
- Top-P and Top-K sampling
- Repetition penalty adjustment
- Configurable response length

### 🎮 Optimized Performance

- Multi-GPU support (4x L40S recommended)
- Automatic device mapping
- bfloat16 precision for efficiency
- ~96GB VRAM requirement
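The device mapping and precision settings above correspond to a standard Transformers loading call. A minimal sketch — the model ID is illustrative; substitute your fine-tuned checkpoint:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative model ID -- replace with your actual fine-tuned checkpoint.
MODEL_ID = "moonshotai/Kimi-Linear-48B-A3B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,  # halves memory vs. float32 (~96GB total)
    device_map="auto",           # shards layers across all visible GPUs
    trust_remote_code=True,
)
```

With `device_map="auto"`, Accelerate places the model's layers across however many GPUs are visible, which is what makes the 4x L40S / 4x L4 configurations interchangeable.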

## Usage

1. **Click "Load Model"** - Initialize the model (takes 2-5 minutes)
2. **Set System Prompt** (optional) - Define the assistant's behavior
3. **Start Chatting** - Type your message and hit send
4. **Adjust Settings** - Fine-tune generation parameters as needed

## Generation Parameters

### Temperature (0.0 - 2.0)

- Low (0.1-0.5): Focused, deterministic responses
- Medium (0.6-0.9): Balanced creativity
- High (1.0-2.0): More creative and diverse outputs

### Top P (0.0 - 1.0)

- 0.9 (recommended): Good balance
- Lower values: More focused
- Higher values: More diverse

### Max New Tokens

- Maximum length of the generated response, in tokens
- 1024 (default): Good for most use cases
- Increase for longer responses
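Conceptually, temperature rescales the model's logits before the softmax, and top-p keeps only the smallest "nucleus" of tokens whose cumulative probability reaches the threshold. A self-contained toy sketch of that interaction (not the Space's actual sampling code — Transformers handles this internally during `generate()`):

```python
import math
import random

def sample_next_token(logits, temperature=0.7, top_p=0.9, rng=random):
    """Toy illustration of temperature and top-p (nucleus) sampling.

    logits: dict mapping token -> raw score from the model.
    """
    # Temperature rescales logits: <1 sharpens, >1 flattens the distribution.
    scaled = {t: l / temperature for t, l in logits.items()}
    m = max(scaled.values())
    exp = {t: math.exp(l - m) for t, l in scaled.items()}  # stable softmax
    z = sum(exp.values())
    probs = {t: p / z for t, p in exp.items()}

    # Top-p filtering: keep the smallest high-probability set of tokens
    # whose cumulative mass reaches top_p, then renormalize over that set.
    kept, cum = [], 0.0
    for tok, p in sorted(probs.items(), key=lambda kv: -kv[1]):
        kept.append((tok, p))
        cum += p
        if cum >= top_p:
            break
    z = sum(p for _, p in kept)
    tokens, weights = zip(*[(t, p / z) for t, p in kept])
    return rng.choices(tokens, weights=weights)[0]
```

At very low temperature the distribution collapses onto the highest-logit token, which is why low settings feel deterministic; lowering top-p shrinks the candidate set the same way.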

## Hardware Requirements

- Recommended: 4x NVIDIA L40S GPUs (192GB total VRAM)
- Minimum: 4x NVIDIA L4 GPUs (96GB total VRAM)
- Memory: ~96GB VRAM in bfloat16 precision

## Fine-tuning Details

This model was fine-tuned using QLoRA with the following configuration:

- LoRA Rank (r): 16
- LoRA Alpha: 32
- Target Modules: q_proj, k_proj, v_proj, o_proj (attention layers only)
- Dropout: 0.05
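In PEFT terms, the configuration above corresponds to a `LoraConfig` roughly like the following. This is a reconstruction from the listed values — the actual training script is not included in this Space:

```python
from peft import LoraConfig

# Reconstructed from the hyperparameters listed above (assumption:
# standard QLoRA setup with bias untouched, causal-LM task type).
lora_config = LoraConfig(
    r=16,                     # LoRA rank
    lora_alpha=32,            # scaling factor (alpha / r = 2.0)
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
```

Targeting only the attention projections keeps the adapter small relative to a 48B base model, which is what makes fine-tuning feasible on the same hardware tier used for inference.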

## Support

For issues or questions:


Built with ❤️ using Transformers and Gradio