Nanbeige4.1-3B for Apple Neural Engine
CoreML conversion of Nanbeige/Nanbeige4.1-3B for on-device inference on Apple Neural Engine using ANEMLL.
Available Context Lengths
| Variant | Context | Use Case |
|---|---|---|
| ctx512 | 512 | Quick Q&A, simple tasks |
| ctx1024 | 1024 | Short conversations |
| ctx2048 | 2048 | Tool calls, multi-turn chat |
Model Details
| Parameter | Value |
|---|---|
| Source Model | Nanbeige/Nanbeige4.1-3B |
| Architecture | LLaMA |
| Context Length | 512 |
| Batch Size | 16 |
| Chunks | 3 |
| LUT (FFN) | 6-bit, per_channel=4 |
| LUT (LM Head) | 6-bit, per_channel=4 |
Recommended Sampling
| Parameter | Value |
|---|---|
| Temperature | 0.6 |
| Top-p | 0.95 |
| Repeat penalty | 1.0 |
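As a rough illustration of what these settings do, the sketch below applies temperature scaling and top-p (nucleus) filtering to a list of raw logits in plain Python. It is not the sampler used by chat.py; the function name `sample_top_p` is hypothetical, and a repeat penalty of 1.0 is treated as "no penalty", so it is omitted.

```python
import math
import random

def sample_top_p(logits, temperature=0.6, top_p=0.95, rng=random):
    """Sample a token index from raw logits using temperature and top-p filtering."""
    # Temperature scaling followed by a numerically stable softmax.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Keep the smallest set of highest-probability tokens whose mass reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break

    # Draw from the kept tokens, weighted by their probabilities.
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]
```

Lower temperatures sharpen the distribution before the top-p cut, so fewer tokens survive the filter and the output becomes more deterministic.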
Requirements
- macOS 15 (Sequoia) or later with Apple Neural Engine
- 8GB RAM or more
- Python 3.9+ with coremltools>=9.0 and transformers
Quick Start
# Install dependencies
pip install "coremltools>=9.0" transformers huggingface_hub
# Clone the model
git lfs install
git clone https://huggingface.co/anemll/nanbeige41-3b-ane-lut6-4-ctx512
cd nanbeige41-3b-ane-lut6-4-ctx512
# Extract CoreML model files
find . -type f -name "*.zip" -exec unzip {} \;
# Run chat
python chat.py --meta ./meta.yaml
For conversation mode with history:
python chat_full.py --meta ./meta.yaml
Note: The first load takes time while macOS places the model on the ANE. Subsequent loads are instant. Press Ctrl-D to exit or Ctrl-C to interrupt.
Tool Calling
Nanbeige4.1-3B natively supports tool/function calling. The model uses <tool_call> tags to invoke functions.
A tool call consumes context for the system prompt, the tools schema, the <think> reasoning, and the response. ctx512 works for simple queries; use ctx1024 or larger for reliable tool calling.
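One way to reason about this budget is to add up the prompt tokens and the tokens you expect the model to generate, then pick the smallest variant that holds both. The helper below is a minimal sketch; `pick_variant` is a hypothetical name, and the token counts are supplied by the caller (for example, from the model's tokenizer) rather than measured here.

```python
def pick_variant(prompt_tokens, generation_tokens, variants=(512, 1024, 2048)):
    """Return the smallest context length that fits prompt plus generation, or None."""
    needed = prompt_tokens + generation_tokens
    for ctx in sorted(variants):
        if needed <= ctx:
            return ctx
    return None  # nothing fits: shorten the prompt or trim the tools schema

# Illustrative numbers only: a 300-token prompt with room for 400 generated tokens.
choice = pick_variant(prompt_tokens=300, generation_tokens=400)
```

With those illustrative numbers the 512 variant is too small, which is the situation the advice above is meant to avoid.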
Live Demo
The included test_tool_call.py runs a full tool-calling loop on ANE with live HTTP calls:
- Model decides which tool to call
- Script executes the tool (real API request)
- Model summarizes the result
# Weather in San Francisco (default)
python test_tool_call.py --meta ./meta.yaml
# Custom city
python test_tool_call.py --meta ./meta.yaml --question "Weather in Paris?"
python test_tool_call.py --meta ./meta.yaml --question "Weather in Tokyo?"
The get_weather tool fetches live data from open-meteo.com (free, no API key).
Example output:
[Step 3] Parsed: get_weather({"location": "San Diego"})
[Step 4] Executing 'get_weather' (live HTTP call)...
Result: {"location": "San Diego", "temp_f": 57.3, "humidity": 78, "condition": "Partly cloudy"}
[Step 5] Second inference (model summarizes result)...
Final answer:
──────────────────────────────────────────────────
The current weather in San Diego is:
- **Temperature**: 57.3°F
- **Conditions**: Partly cloudy
- **Humidity**: 78%
How It Works
The model receives tools as JSON schemas in the system prompt using ChatML format:
<|im_start|>system
You are a helpful assistant with access to tools.
<tools>
{"type": "function", "function": {"name": "get_weather", ...}}
</tools>
<|im_end|>
<|im_start|>user
What is the weather in San Francisco?<|im_end|>
<|im_start|>assistant
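A minimal sketch of assembling this prompt in Python is shown here. The function `build_prompt` and the `weather_tool` schema are hypothetical illustrations, not code from test_tool_call.py; in practice the tokenizer's own chat template is likely the more reliable way to produce the exact format.

```python
import json

def build_prompt(tools, question,
                 system="You are a helpful assistant with access to tools."):
    """Render a ChatML prompt with one JSON tool schema per line inside <tools>."""
    tool_lines = "\n".join(json.dumps(tool) for tool in tools)
    return (
        "<|im_start|>system\n"
        f"{system}\n"
        "<tools>\n"
        f"{tool_lines}\n"
        "</tools>\n"
        "<|im_end|>\n"
        "<|im_start|>user\n"
        f"{question}<|im_end|>\n"
        "<|im_start|>assistant\n"
    )

# Hypothetical schema: the parameter layout is illustrative only.
weather_tool = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a location.",
        "parameters": {
            "type": "object",
            "properties": {"location": {"type": "string"}},
            "required": ["location"],
        },
    },
}

prompt = build_prompt([weather_tool], "What is the weather in San Francisco?")
```

The prompt ends with an open assistant turn so that generation starts with the model's reply.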
The model responds with a tool call:
<think>
The user wants weather info. I should call get_weather.
</think>
<tool_call>
{"name": "get_weather", "arguments": {"location": "San Francisco"}}
</tool_call>
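Output in this shape can be parsed with a regular expression and `json.loads`. The sketch below assumes the JSON payload sits directly between the <tool_call> tags, as in the example; `parse_tool_calls` is a hypothetical helper and not necessarily how test_tool_call.py does it.

```python
import json
import re

TOOL_CALL_RE = re.compile(r"<tool_call>\s*(.*?)\s*</tool_call>", re.DOTALL)
THINK_RE = re.compile(r"<think>.*?</think>", re.DOTALL)

def parse_tool_calls(text):
    """Return a list of (name, arguments) tuples found in the model output."""
    visible = THINK_RE.sub("", text)  # ignore anything inside the reasoning block
    calls = []
    for payload in TOOL_CALL_RE.findall(visible):
        try:
            call = json.loads(payload)
        except json.JSONDecodeError:
            continue  # skip malformed JSON instead of crashing mid-conversation
        if isinstance(call, dict) and "name" in call:
            calls.append((call["name"], call.get("arguments", {})))
    return calls
```

Stripping the <think> block first avoids treating a tool call that the model merely mentions while reasoning as a real invocation.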
The tool result is fed back as a <tool_response>, and the model generates a natural language summary.
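The feedback step can be sketched as follows. The tool is looked up in a registry, called with the model-supplied arguments, and its result is wrapped in <tool_response> tags for the next turn. The helper names, the `user` role carrying the response, and the stub tool are all assumptions made for illustration; check the model's chat template for the exact role and spacing.

```python
import json

def run_tool(name, arguments, registry):
    """Look up a tool by name and call it with the model-supplied arguments."""
    if name not in registry:
        return {"error": f"unknown tool: {name}"}
    return registry[name](**arguments)

def tool_response_turn(result, role="user"):
    """Wrap a tool result in <tool_response> tags as a ChatML turn.

    Assumption: the role carrying the tool result is a guess, not confirmed
    by the model card.
    """
    body = json.dumps(result)
    return (
        f"<|im_start|>{role}\n"
        f"<tool_response>\n{body}\n</tool_response><|im_end|>\n"
        "<|im_start|>assistant\n"
    )

def fake_get_weather(location):
    # Stand-in for the live HTTP call; the values are made up for illustration.
    return {"location": location, "temp_f": 60.0, "condition": "Clear"}

result = run_tool("get_weather", {"location": "San Francisco"},
                  {"get_weather": fake_get_weather})
follow_up = tool_response_turn(result)
```

Appending `follow_up` to the earlier prompt and model output gives the input for the second inference, in which the model writes its summary.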
iOS/macOS App
Try the sample Chat-Bot app on TestFlight:
- Install TestFlight from App Store
- Join beta: TestFlight Link
- Add custom models via HuggingFace URLs
License
- ANEMLL conversion pipeline: MIT License
- Source model (Nanbeige4.1-3B): Apache 2.0