does not run

#4
by MikaSouthworth - opened

Using https://github.com/Thireus/llama.cpp/releases/tag/tr-qwen3-vl-6-b7133-affdb0d, compiled from source since there is no CUDA or generic Linux build. It would be helpful to include the commands to run this locally in the .md file; --offline does work, but being sent to an online version is not helpful.

Then: it does not run. It compiled flawlessly on the first go (the new version; the old one eventually compiled too, but produced garbled output). I tested your GGUFs, not my own conversions, so the problem was not on my side. This thing does not run, period. And yes, I copied all related files directly into build/bin just to rule out any path errors. It does not fail after loading into VRAM; it fails within a split second of launch. My CPU is fast enough that it reaches the point where it starts doing something and immediately dies. I have nvidia-smi -l running and it does not even blip.

llama-mtmd-cli -m Huihui-Qwen3-VL-8B-Instr-ablit-Q4_K_M.gguf --mmproj Huihui-Qwen3-VL-8B-Instr-ablit-mmproj-f16.gguf -c 4096 --n-gpu-layers 5 --offline --image test_image2.jpg -p "Describe this image."

print_info: file format = GGUF V3 (latest)
print_info: file type = Q4_K - Medium
print_info: file size = 4.68 GiB (4.90 BPW)
llama_model_load: error loading model: error loading model architecture: unknown model architecture: 'qwen3vl'
llama_model_load_from_file_impl: failed to load model
common_init_from_params: failed to load model 'Huihui-Qwen3-VL-8B-Instr-ablit-Q4_K_M.gguf', try reducing --n-gpu-layers if you're running out of VRAM
fish: Job 1, 'llama-mtmd-cli -m Huihui-Qwen3-…' terminated by signal SIGSEGV (Address boundary error)

The same happened with 1 GPU layer and with 0. I have 192 GB of RAM and a 4090. I compiled against CUDA, just like I always do, and I have never had issues with models until the VL series.

MikaSouthworth changed discussion title from do you even test these? to does not run
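
The "unknown model architecture: 'qwen3vl'" line means the binary that actually ran does not have the qwen3vl architecture registered, which usually points at the wrong binary being picked up (for example, a system-wide llama-mtmd-cli shadowing the fresh build). As a quick sanity check (a rough sketch only; paths are illustrative, and gguf-dump comes from the gguf Python package, not from this fork):

which llama-mtmd-cli                  # is this the fork's build or an older system install?
./build/bin/llama-mtmd-cli --version  # should report the fork's build/commit

pip install gguf                      # provides the gguf-dump helper
gguf-dump Huihui-Qwen3-VL-8B-Instr-ablit-Q4_K_M.gguf | grep -i architecture  # should print qwen3vl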

Compilation under Windows

cd llama.cpp-tr-qwen3-vl-6-b7106-495c611
mkdir build
cd build
cmake -A x64 .. -DLLAMA_CUDA=ON -DLLAMA_CURL=OFF
cmake --build . --config Release -t llama-cli llama-quantize llama-gguf-split llama-mtmd-cli

Download

cd llama.cpp-tr-qwen3-vl-6-b7106-495c611
hf download huihui-ai/Huihui-Qwen3-VL-8B-Thinking-abliterated --local-dir ./huihui-ai/Huihui-Qwen3-VL-8B-Thinking-abliterated

Convert

cd llama.cpp-tr-qwen3-vl-6-b7106-495c611
python convert_hf_to_gguf.py huihui-ai/Huihui-Qwen3-VL-8B-Thinking-abliterated --outfile huihui-ai/Huihui-Qwen3-VL-8B-Thinking-abliterated/ggml-model-f16.gguf --outtype f16
python convert_hf_to_gguf.py huihui-ai/Huihui-Qwen3-VL-8B-Thinking-abliterated --outfile huihui-ai/Huihui-Qwen3-VL-8B-Thinking-abliterated/mmproj-model-f16.gguf --outtype f16 --mmproj

Quantize

cd llama.cpp-tr-qwen3-vl-6-b7106-495c611

build\bin\Release\llama-quantize huihui-ai/Huihui-Qwen3-VL-8B-Thinking-abliterated/ggml-model-f16.gguf  huihui-ai/Huihui-Qwen3-VL-8B-Thinking-abliterated/ggml-model-Q4_K_M.gguf Q4_K_M

Chat with image

cd llama.cpp-tr-qwen3-vl-6-b7106-495c611

build\bin\Release\llama-mtmd-cli -m huihui-ai/Huihui-Qwen3-VL-30B-A3B-Instruct-abliterated/ggml-model-Q4_K_M.gguf --mmproj huihui-ai/Huihui-Qwen3-VL-30B-A3B-Instruct-abliterated/mmproj-ggml-model-f16.gguf -c 4096 --image png/cc.png -p "Describe this image." 

Chat

cd llama.cpp-tr-qwen3-vl-6-b7106-495c611
build\bin\Release\llama-cli -m huihui-ai/Huihui-Qwen3-VL-30B-A3B-Instruct-abliterated/ggml-model-Q4_K_M.gguf -c 40960

I don't run Windows. I am on Linux, but not Ubuntu or a Debian distro (Arch).
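
On Linux, the steps above should translate roughly to the following (a sketch only, untested; it mirrors the CMake options from the Windows steps, and the model paths are the ones used earlier; note that mainline llama.cpp nowadays names the CUDA switch GGML_CUDA, so use whichever this fork expects):

cd llama.cpp-tr-qwen3-vl-6-b7106-495c611
cmake -B build -DLLAMA_CUDA=ON -DLLAMA_CURL=OFF
cmake --build build --config Release -t llama-cli llama-quantize llama-gguf-split llama-mtmd-cli

./build/bin/llama-mtmd-cli -m huihui-ai/Huihui-Qwen3-VL-8B-Thinking-abliterated/ggml-model-Q4_K_M.gguf --mmproj huihui-ai/Huihui-Qwen3-VL-8B-Thinking-abliterated/mmproj-model-f16.gguf -c 4096 --n-gpu-layers 99 --image test_image2.jpg -p "Describe this image."  # lower --n-gpu-layers if VRAM is tight

Calling the binary by its full ./build/bin path also sidesteps any system-installed llama.cpp on PATH.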

Try using an absolute path.

It is recommended to use the version we are converting to now. https://github.com/Thireus/llama.cpp/releases/tag/tr-qwen3-vl-6-b7106-495c611
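
To grab exactly that tag from source, something like this should work (a sketch; the tag name is taken from the release link above):

git clone --depth 1 --branch tr-qwen3-vl-6-b7106-495c611 https://github.com/Thireus/llama.cpp.git llama.cpp-tr-qwen3-vl-6-b7106-495c611
cd llama.cpp-tr-qwen3-vl-6-b7106-495c611
# then build as in the steps above (cmake ... and cmake --build ...)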

Same error (llama_model_load: error loading model: error loading model architecture: unknown model architecture: 'qwen3vl'), and I'm using the latest Ollama Docker image, version 0.12.9. I create the model with a Modelfile, but I just cannot run it.

FROM /models/mmproj-model-f16.gguf
FROM /models/ggml-model-q8_0.gguf
...

Ollama is not compatible with the latest version of llama.cpp
