---
license: mit
title: Yuuki-api
sdk: docker
emoji: 🚀
colorFrom: purple
colorTo: pink
pinned: true
thumbnail: >-
  https://cdn-uploads.huggingface.co/production/uploads/68a8bd1d45ff88ffe886e331/64rKm_tmEj0DPMdzTAWCM.png
---
**Self-hosted inference server.** Three Yuuki model variants. Lazy loading with memory caching. REST API with OpenAPI docs. Health check and model list endpoints. CORS enabled for web clients. Automatic model downloads at build time. CPU-optimized at ~50 tokens/second.

**Production-ready deployment.** Dockerized for HuggingFace Spaces. Health checks with auto-restart. Request/response timing metrics. Configurable token limits. Temperature and top-p sampling. No API keys. No rate limits. Just inference.
### Multi-Model Support

Three Yuuki variants: `yuuki-best`, `yuuki-3.7`, and `yuuki-v0.1`. Each model maps to its HuggingFace checkpoint. Clients specify the model via the `model` field in POST requests. Default is `yuuki-best` if not specified.

### Lazy Loading & Caching

Models are loaded into memory only when first requested, not at server startup. Once loaded, they remain cached for the lifetime of the process. This allows the server to start instantly while supporting multiple models without consuming memory upfront.

### REST API with Docs

Standard REST endpoints: `GET /` for API info, `GET /health` for status, `GET /models` for available models, and `POST /generate` for inference. FastAPI automatically generates interactive OpenAPI documentation at `/docs` and JSON schema at `/openapi.json`.

### CORS Enabled

Configured with permissive CORS headers to allow requests from any origin. Essential for browser-based clients like [Yuuki Chat](https://github.com/YuuKi-OS/Yuuki-chat) or web demos.
### Request Validation

Pydantic models validate all inputs: `prompt` (1-4000 chars), `max_new_tokens` (1-512), `temperature` (0.1-2.0), and `top_p` (0.0-1.0). Invalid requests return structured error messages with HTTP 400/422 status codes.

### Response Timing

Every `/generate` response includes a `time_ms` field showing inference latency in milliseconds. Useful for performance monitoring and client-side UX (e.g., showing "Generated in 2.1s").

### Dockerized Deployment

Multi-stage Dockerfile that pre-downloads all three model checkpoints during the build step. This means the container starts with models already cached, eliminating cold-start delays. Optimized for HuggingFace Spaces but works anywhere Docker runs.

### Health Checks

Built-in `/health` endpoint returns server status and lists which models are currently loaded in memory. Docker health check configured to auto-restart on failures.
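A Pydantic request model with the validation bounds listed above might look like the following sketch. The field bounds come from the README; the default values (`256`, `0.7`, `0.9`) are assumptions for illustration, not confirmed server defaults.

```python
from pydantic import BaseModel, Field


class GenerateRequest(BaseModel):
    """Sketch of a /generate request schema; defaults here are assumed."""

    # prompt: 1-4000 characters (bounds from the README)
    prompt: str = Field(..., min_length=1, max_length=4000)
    # model defaults to yuuki-best when omitted
    model: str = "yuuki-best"
    # max_new_tokens: 1-512
    max_new_tokens: int = Field(256, ge=1, le=512)
    # temperature: 0.1-2.0
    temperature: float = Field(0.7, ge=0.1, le=2.0)
    # top_p: 0.0-1.0
    top_p: float = Field(0.9, ge=0.0, le=1.0)
```

With a schema like this, FastAPI rejects out-of-range values automatically and returns a structured 422 error, which matches the behavior described above.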
**Training Details**

| | |
|:--|:--|
| Base model | GPT-2 (124M parameters) |
| Training type | Continued pre-training |
| Hardware | Snapdragon 685, CPU only |
| Training time | 50+ hours |
| Progress | 2,000 / 37,500 steps (5.3%) |
| Cost | $0.00 |

**Quality Scores (Checkpoint 2000)**

| Language | Score |
|:---------|:------|
| Agda | 55 / 100 |
| C | 20 / 100 |
| Assembly | 15 / 100 |
| Python | 8 / 100 |