🚀 Starting Agent Loop Tool Efficiency Test
📊 Configuration:
Base URL: http://localhost:9099/v1
Model: nanbeige4.1-3b-Q5_K_M_06
Test Cases: 17
Output: results/agent_test_results_nanbeige4.1-3b-Q5_K_M_06_20260215_193411.json
Log File: logs/agent_test_logs_nanbeige4.1-3b-Q5_K_M_06_20260215_193411.log
🔄 Running agent tests...
Starting agent test suite with 17 test cases
Running agent test: zero_general_question
Running agent test: medium_view_and_add
Running agent test: zero_weather_question
Running agent test: simple_add_iphone
Running agent test: complex_shopping_workflow
Running agent test: simple_search_electronics
Running agent test: simple_checkout
Running agent test: simple_view_cart
Running agent test: simple_remove_product
Running agent test: medium_search_and_add
Running agent test: medium_search_category_and_add
Running agent test: medium_remove_and_add
Running agent test: complex_cart_management
Running agent test: complex_gift_shopping
Running agent test: zero_thank_you
Running agent test: zero_greeting
Running agent test: zero_capabilities
✅ Tests completed in 13m36.0942623s
📈 Agent Test Results
Total Tests: 17
✅ Passed: 14
❌ Failed: 3
⏱️ Total LLM Time: 1h39m47.1450554s
⏱️ Average Time per Request: 2m33.516539882s
📋 Test Case Results:
Test Case: zero_general_question
Status: ✅ PASSED
Matched Path: no_tools
Response Time: 9.4502925s
Tool Calls: 0
Test Case: zero_thank_you
Status: ✅ PASSED
Matched Path: no_tools
Response Time: 22.1211213s
Tool Calls: 0
Test Case: zero_greeting
Status: ✅ PASSED
Matched Path: no_tools
Response Time: 34.4542576s
Tool Calls: 0
Test Case: simple_add_iphone
Status: ✅ PASSED
Matched Path: direct_add
Response Time: 2m32.0190135s
Tool Calls: 1
Tools Used: add_to_cart
Test Case: zero_weather_question
Status: ✅ PASSED
Matched Path: no_tools
Response Time: 2m55.0826707s
Tool Calls: 0
Test Case: simple_remove_product
Status: ✅ PASSED
Matched Path: direct_remove
Response Time: 3m42.4184824s
Tool Calls: 1
Tools Used: remove_from_cart
Test Case: complex_gift_shopping
Status: ❌ FAILED
Response Time: 4m4.5459472s
Tool Calls: 0
Test Case: simple_checkout
Status: ✅ PASSED
Matched Path: direct_checkout
Response Time: 4m29.4751461s
Tool Calls: 1
Tools Used: checkout
Test Case: zero_capabilities
Status: ✅ PASSED
Matched Path: no_tools
Response Time: 5m27.8562081s
Tool Calls: 0
Test Case: medium_search_category_and_add
Status: ✅ PASSED
Matched Path: search_then_add
Response Time: 6m14.5257217s
Tool Calls: 2
Tools Used: search_products, add_to_cart
Test Case: medium_search_and_add
Status: ✅ PASSED
Matched Path: search_by_query
Response Time: 6m39.6903709s
Tool Calls: 2
Tools Used: search_products, add_to_cart
Test Case: simple_view_cart
Status: ✅ PASSED
Matched Path: view_cart
Response Time: 7m19.196443s
Tool Calls: 1
Tools Used: view_cart
Test Case: simple_search_electronics
Status: ✅ PASSED
Matched Path: search_by_category
Response Time: 8m20.0095853s
Tool Calls: 1
Tools Used: search_products
Test Case: medium_remove_and_add
Status: ✅ PASSED
Matched Path: remove_then_add
Response Time: 10m0.6238038s
Tool Calls: 2
Tools Used: remove_from_cart, add_to_cart
Test Case: medium_view_and_add
Status: ❌ FAILED
Response Time: 10m26.690716s
Tool Calls: 5
Tools Used: view_cart, search_products, search_products, search_products, search_products
Test Case: complex_cart_management
Status: ✅ PASSED
Matched Path: cart_organization
Response Time: 12m52.9361652s
Tool Calls: 3
Tools Used: view_cart, remove_from_cart, add_to_cart
Test Case: complex_shopping_workflow
Status: ❌ FAILED
Response Time: 13m36.093204s
Tool Calls: 5
Tools Used: search_products, add_to_cart, add_to_cart, view_cart, checkout
❌ Failed Tests Details:
Test Case: complex_gift_shopping
Expected Tool Variants: 2
Variant 1 (gift_shopping_workflow): 5 tools
Variant 2 (gift_shopping_workflow): 5 tools
Actual Tool Calls: 0
Response Time: 4m4.5459472s
Test Case: medium_view_and_add
Expected Tool Variants: 2
Variant 1 (view_then_add): 2 tools
Variant 2 (view_search_then_add): 3 tools
Actual Tool Calls: 5
1. view_cart
2. search_products
3. search_products
4. search_products
5. search_products
Response Time: 10m26.690716s
Test Case: complex_shopping_workflow
Expected Tool Variants: 4
Variant 1 (full_workflow_with_iphone): 4 tools
Variant 2 (full_workflow_with_headphones): 4 tools
Variant 3 (full_workflow_with_headphones_and_iphone): 5 tools
Variant 4 (full_workflow_with_iphone_and_headphones): 5 tools
Actual Tool Calls: 5
1. search_products
2. add_to_cart
3. add_to_cart
4. view_cart
5. checkout
Response Time: 13m36.093204s
📊 Overall Success Rate: 82.35%
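The summary figures can be cross-checked against the per-test lines. The sketch below is not part of the test tool; it only redoes the arithmetic in Python, with the durations copied from the log above. The helper name `to_seconds` is made up for this example. The sum of the 17 response times agrees with the reported Total LLM Time to within a fraction of a second, and dividing that total by the reported average gives 39 requests. Since the sum is far larger than the 13m36s wall-clock time, the tests presumably ran concurrently against the server, though the log does not say so explicitly.

```python
import re

# Per-test response times copied from the "Test Case Results" lines above.
RESPONSE_TIMES = [
    "9.4502925s", "22.1211213s", "34.4542576s", "2m32.0190135s",
    "2m55.0826707s", "3m42.4184824s", "4m4.5459472s", "4m29.4751461s",
    "5m27.8562081s", "6m14.5257217s", "6m39.6903709s", "7m19.196443s",
    "8m20.0095853s", "10m0.6238038s", "10m26.690716s", "12m52.9361652s",
    "13m36.093204s",
]


def to_seconds(text):
    """Convert a duration such as '1h39m47.1450554s' to seconds (hypothetical helper)."""
    match = re.fullmatch(r"(?:(\d+)h)?(?:(\d+)m)?([\d.]+)s", text)
    if match is None:
        raise ValueError(f"unrecognised duration: {text!r}")
    hours, minutes, seconds = match.groups()
    return int(hours or 0) * 3600 + int(minutes or 0) * 60 + float(seconds)


passed, total = 14, 17
success_rate = round(100 * passed / total, 2)

summed = sum(to_seconds(t) for t in RESPONSE_TIMES)
reported_total = to_seconds("1h39m47.1450554s")   # "Total LLM Time"
average = to_seconds("2m33.516539882s")           # "Average Time per Request"
wall_clock = to_seconds("13m36.0942623s")         # "Tests completed in"
requests = round(reported_total / average)

print(f"success rate: {success_rate}%")           # 82.35%
print(f"sum of response times: {summed:.2f}s (reported total: {reported_total:.2f}s)")
print(f"implied number of LLM requests: {requests}")  # 39
print(f"sum / wall clock: {summed / wall_clock:.1f}x")
```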
llama-server --port 9099 -ngl 99 -fa on -c 16000 --temp 0.6 -m X:\path\to\nanbeige4.1-3b-Q5_K_M.gguf
87 tokens/s on an RTX 3060 12GB. Test tool: https://github.com/docker/model-test
A great small function-calling model: it scores 82.35% here, and anything above 80% is worth keeping.
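To read the failed-test details above: each test lists several accepted tool-call variants, and a run passes when its actual calls match one of them. The sketch below illustrates that idea under loud assumptions. The log only shows each variant's name and tool count, so the tool sequences in `EXPECTED_VARIANTS` are guesses inferred from the variant names, and `match_variant` is a hypothetical helper that compares tool names in order. The real test tool may also compare arguments: complex_shopping_workflow made 5 calls and still failed even though two of its variants expect 5 tools, so the count alone is clearly not the criterion.

```python
# Hypothetical variant definitions for medium_view_and_add. The log reports only
# the variant names and tool counts (2 and 3); these sequences are assumptions.
EXPECTED_VARIANTS = {
    "view_then_add": ["view_cart", "add_to_cart"],
    "view_search_then_add": ["view_cart", "search_products", "add_to_cart"],
}


def match_variant(actual_calls, variants):
    """Return the first variant name whose tool sequence equals actual_calls, else None."""
    for name, expected in variants.items():
        if list(actual_calls) == expected:
            return name
    return None


# Actual calls from the failed medium_view_and_add run in the log.
actual = [
    "view_cart",
    "search_products",
    "search_products",
    "search_products",
    "search_products",
]

print(match_variant(actual, EXPECTED_VARIANTS))                        # None
print(match_variant(["view_cart", "add_to_cart"], EXPECTED_VARIANTS))  # view_then_add
```

Under these assumptions the failed run matches nothing: the model kept calling search_products and never reached add_to_cart.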