feat(optim): load the model and tokenizer outside of the spaces wrapped method
On one side we lose the lazy init, but we benefit from ZeroGPU's tensor packing, so the model has a smaller memory footprint when idle. Besides, this way callers do not consume their GPU quota just to load the model: it is already downloaded, loaded in memory, and prepared for serving.
app.py
CHANGED
@@ -54,13 +54,14 @@ def _history_to_messages(history: List[Tuple[str, str]]) -> List[Dict[str, str]]
         msgs.append({"role": "assistant", "content": bot_msg})
     return msgs
 
+_ensure_loaded()
+
 @spaces.GPU(duration=120)
 def generate_stream(message: str, history: List[Tuple[str, str]]):
     """
     Minimal streaming chat function for gr.ChatInterface.
     Uses instruct chat template. No token UI. No extra controls.
     """
-    _ensure_loaded()
 
     messages = _history_to_messages(history) + [{"role": "user", "content": message}]
     inputs = _tokenizer.apply_chat_template(
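
For context, a minimal sketch of the pattern this commit moves to: loading the model and tokenizer at import time, which is what the module-level _ensure_loaded() call achieves. The body of _ensure_loaded(), the model id, and the dtype below are assumptions for illustration; they are not part of this diff.

# Sketch only: load the model and tokenizer at import time, outside the
# @spaces.GPU-wrapped function. On ZeroGPU the packed tensors sit in CPU
# memory while the Space is idle and are moved to the GPU only when a
# decorated function runs, so callers do not spend GPU quota on loading.
import spaces
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

_MODEL_ID = "org/instruct-model"  # placeholder, not the Space's actual checkpoint

_tokenizer = AutoTokenizer.from_pretrained(_MODEL_ID)
_model = AutoModelForCausalLM.from_pretrained(_MODEL_ID, torch_dtype=torch.bfloat16)
_model.to("cuda")  # intercepted by ZeroGPU; no GPU is actually held while idle

@spaces.GPU(duration=120)
def generate_stream(message, history):
    # GPU quota is consumed here, for generation only, not for model loading.
    ...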