Can I run this with vLLM?

#1
by DrXaviere

Can I run this with vLLM?
Also, I have 2× A100 80GB.
The requirements section says I need 4 of those, but since the model is 32B, I'm guessing one A100 80GB might be fine.

https://github.com/vllm-project/vllm/pull/31471

I am currently working on it :)

And I guess you can probably run this model with 2× A100 80GB.
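
Rough back-of-the-envelope on why a single card is tight (assuming bf16 weights and no quantization):

```python
# Back-of-the-envelope estimate of weight memory for a 32B model in bf16
# (an assumption for illustration, not a measured number).
params = 32e9                 # 32B parameters
bytes_per_param = 2           # bf16
weight_gb = params * bytes_per_param / 1e9
print(f"~{weight_gb:.0f} GB just for weights")  # ~64 GB, little headroom on one 80 GB GPU
```

So the weights alone would technically fit on one 80 GB card, but with almost no room left for the KV cache or the vision encoder, which is why two cards are the safer bet.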

HyperCLOVA X org

If you use OmniServe, the vision encoder and the LLM run as separate services, so you need to cap vLLM's GPU memory usage. By default vLLM will try to use almost all available GPU memory, so on 2× A100 80GB I'd run the 32B model with tensor parallelism 2 and set --gpu-memory-utilization to around 0.7. That way you leave some headroom on each GPU and can still run the vision encoder on the same GPUs.
(https://github.com/NAVER-Cloud-HyperCLOVA-X/OmniServe)
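
For reference, once the vLLM PR linked above is merged, a minimal offline-inference sketch could look like the following (the model ID is a placeholder, and 0.7 mirrors the memory cap suggested above):

```python
# Minimal sketch of offline inference with vLLM on 2x A100 80GB.
# Assumes the vLLM PR linked above is merged so the architecture is supported;
# "<model-id>" is a placeholder for the actual Hugging Face repo id.
from vllm import LLM, SamplingParams

llm = LLM(
    model="<model-id>",
    tensor_parallel_size=2,       # shard the 32B weights across both GPUs
    gpu_memory_utilization=0.7,   # cap vLLM so the vision encoder service keeps headroom
)

outputs = llm.generate(
    ["Describe what HyperCLOVA X is in one sentence."],
    SamplingParams(temperature=0.7, max_tokens=128),
)
print(outputs[0].outputs[0].text)
```

When serving instead of running offline, the equivalent would be something like `vllm serve <model-id> --tensor-parallel-size 2 --gpu-memory-utilization 0.7`.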
