Update README.md
| tensorboard-data-server | 0.7.2 |
| wandb | 0.22.1 |

## Job Details
| model | Job ID | Runtime (mins) | Nodes | GPUs | Node-hours | GPU-hours | micro-batch | batch-size | gradient_accumulation | total_batch_size |
| ---------------------------------------- | -------- | -------------- | ----- | ---- | ---------- | ---------- | ----------- | ---------- | --------------------- | ---------------- |
| Llama-3.1-8B-Instruct_w16a8_rw | 31768103 | 115.75 | 1 | 4 | **1.929** | **7.716** | 2 | 2 | 4 | 32 |
| Llama-3.1-8B-Instruct_w16a8_rw_with_gw_hp | 31837629 | 109.00 | 1 | 4 | **1.816** | **7.266** | 2 | 2 | 4 | 32 |
| Llama-3.1-8B-Instruct-w16a8-mxtw | 31768031 | 64.00 | 1 | 4 | **1.066** | **4.266** | 2 | 2 | 4 | 32 |
| Llama-3.1-8B-Instruct-w16a16-tw | 31768074 | 138.75 | 1 | 4 | **2.312** | **9.250** | 2 | 2 | 4 | 32 |
| Llama-3.1-8B-Instruct-w16a8-1node-bs8 | 31768093 | 123.75 | 1 | 4 | **2.062** | **8.250** | 2 | 2 | 4 | 32 |
| Llama-3.1-8B-Instruct-w16a16-4nodes-bs32 | 31478433 | 31.75 | 4 | 4 | **2.117** | **8.467** | 4 | 4 | 8 | 512 |
| Llama-3.1-8B-Instruct-w16a8-4nodes-bs32 | 31478468 | 39.75 | 4 | 4 | **2.650** | **10.600** | 4 | 4 | 8 | 512 |
| Llama-3.1-8B-Instruct-w16a8-8nodes-bs32 | 31476844 | 23.50 | 8 | 4 | **3.133** | **12.533** | 4 | 4 | 8 | 1024 |
| Llama-3.1-8B-Instruct-w16a16-8nodes-bs64 | 31476914 | 22.00 | 8 | 4 | **2.933** | **11.733** | 4 | 8 | 8 | 1024 |
| Llama-3.1-8B-Instruct-w16a8-8nodes-bs64 | 31476844 | 23.50 | 8 | 4 | **3.133** | **12.533** | 4 | 8 | 8 | 1024 |
| Llama-3.1-8B-Instruct-w16a8-rowwise_4nodes | 33477070 | 39.75 | 4 | 4 | **2.650** | **10.600** | 4 | 4 | 8 | 512 |
| Llama-3.1-8B-Instruct-w16a8-rowwise_with_gw_hp_4nodes | 33477179 | 37.43 | 4 | 4 | **2.495** | **9.982** | 4 | 4 | 8 | 512 |
| Llama-3.1-8B-Instruct-w16a8-rowwise_8nodes | 33476690 | 23.50 | 8 | 4 | **3.133** | **12.533** | 4 | 4 | 8 | 1024 |
| Llama-3.1-8B-Instruct-w16a8-rowwise_with_gw_hp_8nodes | 33476618 | 22.13 | 8 | 4 | **2.951** | **11.802** | 4 | 4 | 8 | 1024 |
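The derived columns above follow a simple pattern; here is a minimal sketch of the assumed arithmetic (the relationships are inferred from the rows, not taken from the repo's scripts, and the function name is hypothetical):

```python
# Assumed relationships behind the Job Details columns (inferred from the table,
# not taken from the repo's job scripts).

def job_stats(runtime_mins, nodes, gpus_per_node, micro_batch, grad_accum):
    node_hours = runtime_mins / 60 * nodes
    gpu_hours = node_hours * gpus_per_node
    # total batch = micro-batch x gradient accumulation x data-parallel ranks
    total_batch = micro_batch * grad_accum * gpus_per_node * nodes
    return round(node_hours, 3), round(gpu_hours, 3), total_batch

# Llama-3.1-8B-Instruct-w16a16-4nodes-bs32: 31.75 min on 4 nodes x 4 GPUs
print(job_stats(31.75, 4, 4, 4, 8))  # -> (2.117, 8.467, 512)
```

Note the table rounds some entries down (e.g. 7.716 for 115.75/60 x 4 = 7.7167), so reproduced values can differ in the last digit.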
### *Training Time Analysis*
| Model | Training Time (mins) | Memory Allocated (avg %) | GPU Utilization (avg %) | Speed vs bf16 |
| :-------------------------------------------------- | --------------------: | -----------------------: | -----------------------: | -------------: |
| **Llama-3.1-8B-Instruct_w16a16-tw** | 138.75267 | 74.4189 | 56.6059 | baseline |
| **Llama-3.1-8B-Instruct-w16a8-1node-bs8** | 123.75267 | 68.8982 | 97.5364 | 12.11% |
| **Llama-3.1-8B-Instruct_w16a8_rw** | 115.75364 | 69.6132 | 97.7689 | 19.87% |
| **Llama-3.1-8B-Instruct_w16a8_rw_with_gw_hp** | 109.00364 | 69.4806 | 97.3312 | 27.33% |
| **Llama-3.1-8B-Instruct-w16a8-mxtw** | 64.00328 | 68.8982 | 95.5661 | 116.82% |
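The "Speed vs bf16" column appears to be the runtime ratio against the w16a16 tensorwise baseline; the sketch below reproduces the 19.87% entry exactly and the other entries to within about 0.05 percentage points (the exact formula used for the table is an assumption):

```python
# Assumed definition of "Speed vs bf16": percent runtime reduction relative to
# the bf16 (w16a16 tensorwise) baseline run.

BF16_BASELINE_MINS = 138.75267  # Llama-3.1-8B-Instruct_w16a16-tw

def speedup_vs_bf16(runtime_mins):
    return round((BF16_BASELINE_MINS / runtime_mins - 1) * 100, 2)

print(speedup_vs_bf16(115.75364))  # -> 19.87 (w16a8_rw)
```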
### *2 models trained on 1 node with fp8 recipes*

| Loss metric results for the w16a16 & rowwise_with_gw_hp recipes | Memory allocation for the w16a16 & rowwise_with_gw_hp recipes | Utilization for the w16a16 & rowwise_with_gw_hp recipes |
|---------|---------|---------|
|  |  |  |

### *All 15 models trained on 1 node, 4 nodes, and 8 nodes with bfp16 and bfp16-fp8 configurations and fp8 recipes*

| Perplexity metric results for bfp16 & bfp16-fp8 configurations | Accuracy metric results for bfp16 & bfp16-fp8 configurations | Loss metric results for bfp16 & bfp16-fp8 configurations | Memory allocation for bfp16 & bfp16-fp8 configurations | Utilization for bfp16 & bfp16-fp8 configurations |
|:--:|:--:|:--:|:--:|:--:|
|  |  |  |  |  |

| Model | Batch Size | Max Loss (train) | Min Loss (train) | Avg Loss (train) | Final Loss (train) | ± Std (train) | Max Loss (val) | Min Loss (val) | Avg Loss (val) | Final Loss (val) | ± Std (val) | Steps |
| ---------------------------------------------- | ---------- | ---------------- | ---------------- | ---------------- | ------------------ | ------------- | -------------- | -------------- | -------------- | ---------------- | ----------- | ----- |
| Llama-3.1-8B-Instruct-w16a8-rw | 8 | 3.1682 | 0.5740 | 0.8118 | 0.6431 | 0.2746 | 1.0613 | 0.8394 | 0.8937 | 0.8394 | 0.0688 | |
| Llama-3.1-8B-Instruct-w16a8-4nodes-bs32 | 32 | 3.2840 | 0.7478 | 0.9748 | 0.7581 | 0.4905 | 1.0701 | 0.8430 | 0.8922 | 0.8430 | 0.0764 | 70 |
| Llama-3.1-8B-Instruct-w16a16-8nodes-bs32 | 32 | 3.2311 | 0.8448 | 1.1856 | 0.8448 | 0.6434 | 1.0257 | 0.8977 | 0.9460 | 0.8977 | 0.0568 | 35 |
| Llama-3.1-8B-Instruct-w16a8-8nodes-bs32 | 32 | 3.3003 | 0.8473 | 1.1866 | 0.8473 | 0.6481 | 1.0203 | 0.8992 | 0.9445 | 0.8992 | 0.0539 | 35 |
| Llama-3.1-8B-Instruct-w16a16-4nodes-bs64 | 64 | 3.2311 | 0.8448 | 1.1856 | 0.8448 | 0.6434 | 1.0257 | 0.8977 | 0.9460 | 0.8977 | 0.0568 | 35 |
| Llama-3.1-8B-Instruct-w16a8-8nodes-bs64 | 64 | 3.3003 | 0.8473 | 1.1866 | 0.8473 | 0.6481 | 1.0203 | 0.8992 | 0.9445 | 0.8992 | 0.0539 | 17 |
| Llama-3.1-8B-Instruct-w16a8-rw_4nodes | 64 | 3.4517 | 0.7624 | 1.1173 | 0.7624 | 0.6891 | 1.3225 | 0.8791 | 0.9732 | 0.8791 | 0.1612 | 35 |
| Llama-3.1-8B-Instruct-w16a8-rw_8nodes | 64 | 3.8944 | 0.9583 | 1.6423 | 0.9583 | 1.0117 | 1.5384 | 1.0253 | 1.2103 | 1.0253 | 0.2849 | 17 |
| Llama-3.1-8B-Instruct-w16a8-rw_with_gw_hp_4nodes | 64 | 3.4517 | 0.7481 | 1.1091 | 0.7481 | 0.7021 | 1.3393 | 0.8660 | 0.9641 | 0.8666 | 0.1732 | 35 |
| Llama-3.1-8B-Instruct-w16a8-rw_with_gw_hp_8nodes | 64 | 3.9289 | 0.9702 | 1.6514 | 0.9702 | 1.0127 | 1.5537 | 1.0377 | 1.2222 | 1.0377 | 0.2877 | 17 |
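The per-run Max/Min/Avg/Final/±Std columns summarize the logged loss curves; a minimal sketch of computing them from one curve (the example curve values are hypothetical, not data from these runs):

```python
# Sketch: per-run loss summaries (Max/Min/Avg/Final/Std) from a logged curve.
# The example curve is hypothetical, not data from these runs.
import statistics

def loss_summary(losses):
    return {
        "max": max(losses),
        "min": min(losses),
        "avg": round(statistics.fmean(losses), 4),
        "final": losses[-1],
        "std": round(statistics.stdev(losses), 4),
    }

curve = [3.2, 1.4, 0.9, 0.8, 0.75, 0.74]  # hypothetical train-loss trajectory
print(loss_summary(curve))
```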
## **Implementation**
### *GPU & Memory Usage Profiling*
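The avg-% memory and utilization figures reported earlier suggest periodic device sampling; one hedged way to collect such averages is to poll `nvidia-smi` with its standard query flags (the polling loop below is an assumption for illustration, not necessarily how this repo's numbers were produced):

```python
# Sketch: average GPU utilization / memory use by periodically sampling nvidia-smi.
# Assumed approach for illustration, not this repo's actual profiler.
import subprocess
import time

QUERY = [
    "nvidia-smi",
    "--query-gpu=utilization.gpu,memory.used,memory.total",
    "--format=csv,noheader,nounits",
]

def parse_sample(line):
    """Parse one CSV sample -> (util %, mem used MiB, mem used as % of total)."""
    util, used, total = (float(x) for x in line.split(","))
    return util, used, 100.0 * used / total

def poll_averages(samples=10, interval_s=1.0):
    """Average utilization % and memory % across all visible GPUs over `samples` polls."""
    utils, mem_pcts = [], []
    for _ in range(samples):
        for line in subprocess.check_output(QUERY, text=True).strip().splitlines():
            util, _, mem_pct = parse_sample(line)
            utils.append(util)
            mem_pcts.append(mem_pct)
        time.sleep(interval_s)
    return sum(utils) / len(utils), sum(mem_pcts) / len(mem_pcts)
```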