Can I run this with vLLM?

#1
by DrXaviere

Can I run this with vLLM?
Also, I have 2× A100 80GB.
The requirements section says I need 4 of those, but since the model is 32B, I'm guessing one A100 80GB might be fine.

https://github.com/vllm-project/vllm/pull/31471

I am currently working on it :)

And I guess you can probably run this model with 2× A100 80GB.
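
Rough back-of-the-envelope on why a single card is tight (assuming bf16 weights and no quantization):

```python
# Back-of-the-envelope estimate of weight memory for a 32B model in bf16
# (an assumption for illustration, not a measured number).
params = 32e9                 # 32B parameters
bytes_per_param = 2           # bf16
weight_gb = params * bytes_per_param / 1e9
print(f"~{weight_gb:.0f} GB just for weights")  # ~64 GB, little headroom on one 80 GB GPU
```

So the weights alone would technically fit on one 80 GB card, but with almost no room left for the KV cache or the vision encoder, which is why two cards are the safer bet.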

HyperCLOVA X org

If you use OmniServe, the vision encoder and the LLM run as separate services, so you need to cap vLLM's GPU memory usage. By default vLLM will try to use almost all available GPU memory, so on 2× A100 80GB I'd run the 32B model with tensor parallelism 2 and set --gpu-memory-utilization to around 0.7. That way you leave some headroom on each GPU and can still run the vision encoder on the same GPUs.
(https://github.com/NAVER-Cloud-HyperCLOVA-X/OmniServe)
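
For reference, once the vLLM PR linked above is merged, a minimal offline-inference sketch could look like the following (the model ID is a placeholder, and 0.7 mirrors the memory cap suggested above):

```python
# Minimal sketch of offline inference with vLLM on 2x A100 80GB.
# Assumes the vLLM PR linked above is merged so the architecture is supported;
# "<model-id>" is a placeholder for the actual Hugging Face repo id.
from vllm import LLM, SamplingParams

llm = LLM(
    model="<model-id>",
    tensor_parallel_size=2,       # shard the 32B weights across both GPUs
    gpu_memory_utilization=0.7,   # cap vLLM so the vision encoder service keeps headroom
)

outputs = llm.generate(
    ["Describe what HyperCLOVA X is in one sentence."],
    SamplingParams(temperature=0.7, max_tokens=128),
)
print(outputs[0].outputs[0].text)
```

When serving instead of running offline, the equivalent would be something like `vllm serve <model-id> --tensor-parallel-size 2 --gpu-memory-utilization 0.7`.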
