nmmursit committed
Commit b7ca20b · verified · 1 Parent(s): d645a66

Update README.md

Files changed (1)
  1. README.md +15 -10
README.md CHANGED
@@ -93,13 +93,14 @@ model = convert_to_float8_training(model, config=config)
93
  | tensorboard-data-server | 0.7.2 |
94
  | wandb | 0.22.1 |
95
 
 
96
  ## Job Details
97
  | model | Job ID | Runtime (mins) | Nodes | GPUs | Node-hour | GPU-hour | micro-batch | batch-size | gradient_accumulation | total_batch_size |
98
  | ---------------------------------------- | -------- | -------------- | ----- | ---- | --------- | ---------- | ----------- | ---------- | --------------------- | ---------------- |
99
  | Llama-3.1-8B-Instruct_w16a8_rw | 31768103 | 115.75 | 1 | 4 | **1.929** | **7.716** | 2 | 2 | 4 | 32 |
100
  | Llama-3.1-8B-Instruct_w16a8_rw_with_gw_hp| 31837629 | 109.00 | 1 | 4 | **1.816** | **7.266** | 2 | 2 | 4 | 32 |
101
  | Llama-3.1-8B-Instruct-w16a8-mxtw | 31768031 | 64.00 | 1 | 4 | **1.066** | **4.266** | 2 | 2 | 4 | 32 |
102
- | Llama-3.1-8B-Instruct-w16a16-tw | 31768074 | 138.75 | 1 | 4 | **2.312** | **9.250** | 2 | 2 | 4 | 32 |
103
  | Llama-3.1-8B-Instruct-w16a8-1node-bs8 | 31768093 | 123.75 | 1 | 4 | **2.062** | **8.250** | 2 | 2 | 4 | 32 |
104
  | Llama-3.1-8B-Instruct-w16a16-4nodes-bs32 | 31478433 | 31.75 | 4 | 4 | **2.117** | **8.467** | 4 | 4 | 8 | 512 |
105
  | Llama-3.1-8B-Instruct-w16a8-4nodes-bs32 | 31478468 | 39.75 | 4 | 4 | **2.650** | **10.600** | 4 | 4 | 8 | 512 |
@@ -107,29 +108,29 @@ model = convert_to_float8_training(model, config=config)
107
  | Llama-3.1-8B-Instruct-w16a8-8nodes-bs32 | 31476844 | 23.50 | 8 | 4 | **3.133** | **12.533** | 4 | 4 | 8 | 1024 |
108
  | Llama-3.1-8B-Instruct-w16a16-8nodes-bs64 | 31476914 | 22.00 | 8 | 4 | **2.933** | **11.733** | 4 | 8 | 8 | 1024 |
109
  | Llama-3.1-8B-Instruct-w16a8-8nodes-bs64 | 31476844 | 23.50 | 8 | 4 | **3.133** | **12.533** | 4 | 8 | 8 | 1024 |
 
 
 
 
110
 
111
  ### *Training Time Analysis*
112
  | Model | Training Time (mins) | Memory Allocated (avg %) | GPU Utilization (avg %) | Speed vs bf16 |
113
  | :-------------------------------------------------- | --------------------: | -----------------------: | -----------------------: | -------------: |
114
- | **Llama-3.1-8B-Instruct_w16a16-tw** | 138.75267 | 74.4189 | 56.6059% | baseline |
115
- | **Llama-3.1-8B-Instruct-w16a8-1node-bs8** | 123.75267 | 68.8982 | 97.5364% | 12.11% |
116
  | **Llama-3.1-8B-Instruct_w16a8_rw** | 115.75364 | 69.6132 | 97.7689% | 19.87% |
117
  | **Llama-3.1-8B-Instruct_w16a8_rw_with_gw_hp** | 109.00364 | 69.4806 | 97.3312% | 27.33% |
118
  | **Llama-3.1-8B-Instruct-w16a8-mxtw** | 64.00328 | 68.8982 | 95.5661% | 116.82% |
119
-
120
-
121
- ## **Performance Evaluation**
122
  ### *2 models trained on 1 node with fp8 recipes*
123
  | Loss metric results for w16a16 & rowwise_with_gw_hp recipe | Memory allocation for w16a16 & rowwise_with_gw_hp recipe | Utilization for w16a16 & rowwise_with_gw_hp recipe |
124
  |---------|---------|---------|
125
  | ![lossRWGWHP](https://cdn-uploads.huggingface.co/production/uploads/683d4880e639f8d647355997/VAPmLlCaZPaks9SCSiGnW.png) | ![memALRWGWHP](https://cdn-uploads.huggingface.co/production/uploads/683d4880e639f8d647355997/WHhSPl1n2BpDhqzGl_ljh.png) | ![gpuutilsRWGWHP](https://cdn-uploads.huggingface.co/production/uploads/683d4880e639f8d647355997/PdiR1e2SGyTOloURHy19G.png) |
126
 
127
- ### *All 15 models trained on 1 node, 4 nodes, and 8 nodes with both bf16-fp8 and bf16 configurations and fp8 recipes*
128
  | Perplexity metric results for bf16 and bf16-fp8 configurations | Accuracy metric results for bf16 and bf16-fp8 configurations | Loss metric results for bf16 and bf16-fp8 configurations | Memory allocation for bf16 and bf16-fp8 configurations | Utilization for bf16 and bf16-fp8 configurations |
129
  |:--:|:--:|:--:|:--:|:--:|
130
  | ![perp](https://cdn-uploads.huggingface.co/production/uploads/683d4880e639f8d647355997/ij1hlr8E2qvdZM4uGC7lq.png) | ![acc](https://cdn-uploads.huggingface.co/production/uploads/683d4880e639f8d647355997/7lO8mVKPnQQkyUTw8H6GA.png) | ![train_loss](https://cdn-uploads.huggingface.co/production/uploads/683d4880e639f8d647355997/E73tvcC6u9VrvTIkznwU2.png) | ![memAlo](https://cdn-uploads.huggingface.co/production/uploads/683d4880e639f8d647355997/NsHL_yaTtnjwD1e4EHcLP.png) | ![utils](https://cdn-uploads.huggingface.co/production/uploads/683d4880e639f8d647355997/5mqF8xcRWuZdC_sGS9FCe.png) |
131
 
132
-
133
  | Model | Max Loss (train) | Min Loss (train) | Avg Loss (train) | Final Loss (train) | ± Std (train) | Max Loss (val) | Min Loss (val) | Avg Loss (val) | Final Loss (val) | ± Std (val) |
134
  | ---------------------------------------------- | ---------------- | ---------------- | ---------------- | ------------------ | ------------- | -------------- | -------------- | -------------- | ---------------- | ----------- |
135
  | Llama-3.1-8B-Instruct-w16a8-rw | 8 | 3.1682 | 0.5740 | 0.8118 | 0.6431 | 0.2746 | 1.0613 | 0.8394 | 0.8937 | 0.8394 | 0.0688 |
@@ -141,8 +142,12 @@ model = convert_to_float8_training(model, config=config)
141
  | Llama-3.1-8B-Instruct-w16a8-4nodes-bs32 | 32 | 3.2840 | 0.7478 | 0.9748 | 0.4905 | 0.7581 | 1.0701 | 0.8430 | 0.8922 | 0.0764 | 0.8430 | 70 |
142
  | Llama-3.1-8B-Instruct-w16a16-8nodes-bs32 | 32 | 3.2311 | 0.8448 | 1.1856 | 0.6434 | 0.8448 | 1.0257 | 0.8977 | 0.9460 | 0.0568 | 0.8977 | 35 |
143
  | Llama-3.1-8B-Instruct-w16a8-8nodes-bs32 | 32 | 3.3003 | 0.8473 | 1.1866 | 0.6481 | 0.8473 | 1.0203 | 0.8992 | 0.9445 | 0.0539 | 0.8992 | 35 |
144
- | Llama-3.1-8B-Instruct-w16a16-8nodes-bs64 | 64 | 3.2311 | 0.8448 | 1.1856 | 0.6434 | 0.8448 | 1.0257 | 0.8977 | 0.9460 | 0.0568 | 0.8977 | 35 |
145
- | Llama-3.1-8B-Instruct-w16a8-8nodes-bs64 | 64 | 3.3003 | 0.8473 | 1.1866 | 0.6481 | 0.8473 | 1.0203 | 0.8992 | 0.9445 | 0.0539 | 0.8992 | 35 |
 
 
 
 
146
 
147
  ## **Implementation**
148
  ### *GPU & Memory Usage Profiling*
 
93
  | tensorboard-data-server | 0.7.2 |
94
  | wandb | 0.22.1 |
95
 
96
+
97
  ## Job Details
98
  | model | Job ID | Runtime (mins) | Nodes | GPUs | Node-hour | GPU-hour | micro-batch | batch-size | gradient_accumulation | total_batch_size |
99
  | ---------------------------------------- | -------- | -------------- | ----- | ---- | --------- | ---------- | ----------- | ---------- | --------------------- | ---------------- |
100
  | Llama-3.1-8B-Instruct_w16a8_rw | 31768103 | 115.75 | 1 | 4 | **1.929** | **7.716** | 2 | 2 | 4 | 32 |
101
  | Llama-3.1-8B-Instruct_w16a8_rw_with_gw_hp| 31837629 | 109.00 | 1 | 4 | **1.816** | **7.266** | 2 | 2 | 4 | 32 |
102
  | Llama-3.1-8B-Instruct-w16a8-mxtw | 31768031 | 64.00 | 1 | 4 | **1.066** | **4.266** | 2 | 2 | 4 | 32 |
103
+ | Llama-3.1-8B-Instruct-w16a16-tw | 31768074 | 138.75 | 1 | 4 | **2.312** | **9.250** | 2 | 2 | 4 | 32 |
104
  | Llama-3.1-8B-Instruct-w16a8-1node-bs8 | 31768093 | 123.75 | 1 | 4 | **2.062** | **8.250** | 2 | 2 | 4 | 32 |
105
  | Llama-3.1-8B-Instruct-w16a16-4nodes-bs32 | 31478433 | 31.75 | 4 | 4 | **2.117** | **8.467** | 4 | 4 | 8 | 512 |
106
  | Llama-3.1-8B-Instruct-w16a8-4nodes-bs32 | 31478468 | 39.75 | 4 | 4 | **2.650** | **10.600** | 4 | 4 | 8 | 512 |
 
108
  | Llama-3.1-8B-Instruct-w16a8-8nodes-bs32 | 31476844 | 23.50 | 8 | 4 | **3.133** | **12.533** | 4 | 4 | 8 | 1024 |
109
  | Llama-3.1-8B-Instruct-w16a16-8nodes-bs64 | 31476914 | 22.00 | 8 | 4 | **2.933** | **11.733** | 4 | 8 | 8 | 1024 |
110
  | Llama-3.1-8B-Instruct-w16a8-8nodes-bs64 | 31476844 | 23.50 | 8 | 4 | **3.133** | **12.533** | 4 | 8 | 8 | 1024 |
111
+ | Llama-3.1-8B-Instruct-w16a8-rowwise_4nodes | 33477070 | 39.75 | 4 | 4 | **2.650** | **10.600** | 4 | 4 | 8 | 512 |
112
+ | Llama-3.1-8B-Instruct-w16a8-rowwise_with_gw_hp_4nodes | 33477179 | 37.43 | 4 | 4 | **2.495** | **9.982** | 4 | 4 | 8 | 512 |
113
+ | Llama-3.1-8B-Instruct-w16a8-rowwise_8nodes | 33476690 | 23.50 | 8 | 4 | **3.133** | **12.533** | 4 | 4 | 8 | 1024 |
114
+ | Llama-3.1-8B-Instruct-w16a8-rowwise_with_gw_hp_8nodes | 33476618 | 22.13 | 8 | 4 | **2.951** | **11.802** | 4 | 4 | 8 | 1024 |
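
The Node-hour and GPU-hour columns follow directly from the runtime and allocation: runtime in hours times the node count, and that again times the GPUs per node (4 in these jobs). A minimal sketch with a hypothetical helper name (not from the repo):

```python
# Hypothetical helper: reproduce the Node-hour / GPU-hour columns from the
# job runtime, node count, and GPUs per node (4 on this cluster).
def job_cost(runtime_mins: float, nodes: int, gpus_per_node: int = 4):
    node_hours = runtime_mins / 60 * nodes
    gpu_hours = node_hours * gpus_per_node
    return round(node_hours, 3), round(gpu_hours, 3)

# Llama-3.1-8B-Instruct-w16a16-4nodes-bs32: 31.75 min on 4 nodes x 4 GPUs
print(job_cost(31.75, nodes=4))  # (2.117, 8.467), matching the table row
# The 1-node bf16 run (138.75 min) likewise gives ~2.312 node-hours and 9.25 GPU-hours.
```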
115
 
116
  ### *Training Time Analysis*
117
  | Model | Training Time (mins) | Memory Allocated (avg %) | GPU Utilization (avg %) | Speed vs bf16 |
118
  | :-------------------------------------------------- | --------------------: | -----------------------: | -----------------------: | -------------: |
119
+ | **Llama-3.1-8B-Instruct_w16a16-tw** | 138.75267 | 74.4189 | 56.6059% | baseline |
120
+ | **Llama-3.1-8B-Instruct-w16a8-1node-bs8** | 123.75267 | 68.8982 | 97.5364% | 12.11% |
121
  | **Llama-3.1-8B-Instruct_w16a8_rw** | 115.75364 | 69.6132 | 97.7689% | 19.87% |
122
  | **Llama-3.1-8B-Instruct_w16a8_rw_with_gw_hp** | 109.00364 | 69.4806 | 97.3312% | 27.33% |
123
  | **Llama-3.1-8B-Instruct-w16a8-mxtw** | 64.00328 | 68.8982 | 95.5661% | 116.82% |
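
The "Speed vs bf16" column appears to be the relative speed-up over the bf16 tensorwise baseline (w16a16-tw, 138.75 min): baseline time divided by recipe time, minus one. A small sketch under that assumption:

```python
# Assumed definition of "Speed vs bf16": how much faster a recipe is than the
# bf16 tensorwise baseline run, as a percentage.
def speedup_vs_bf16(bf16_mins: float, recipe_mins: float) -> float:
    return (bf16_mins / recipe_mins - 1.0) * 100.0

print(f"{speedup_vs_bf16(138.75267, 115.75364):.2f}%")  # 19.87% (w16a8_rw)
print(f"{speedup_vs_bf16(138.75267, 64.00328):.2f}%")   # 116.79% (w16a8-mxtw; table shows 116.82%)
# The remaining rows reproduce to within a few hundredths of a percent,
# presumably due to rounding of the logged runtimes.
```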
 
 
 
124
  ### *2 models trained on 1 node with fp8 recipes*
125
  | Loss metric results for w16a16 & rowwise_with_gw_hp recipe | Memory allocation for w16a16 & rowwise_with_gw_hp recipe | Utilization for w16a16 & rowwise_with_gw_hp recipe |
126
  |---------|---------|---------|
127
  | ![lossRWGWHP](https://cdn-uploads.huggingface.co/production/uploads/683d4880e639f8d647355997/VAPmLlCaZPaks9SCSiGnW.png) | ![memALRWGWHP](https://cdn-uploads.huggingface.co/production/uploads/683d4880e639f8d647355997/WHhSPl1n2BpDhqzGl_ljh.png) | ![gpuutilsRWGWHP](https://cdn-uploads.huggingface.co/production/uploads/683d4880e639f8d647355997/PdiR1e2SGyTOloURHy19G.png) |
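
The recipe suffixes in the run names (tw = tensorwise, rw = rowwise, rw_with_gw_hp = rowwise with the grad-weight matmul kept in high precision, mxtw = MX tensorwise) map onto torchao's float8 training recipes. A minimal sketch of how such a run is configured, assuming a recent torchao release where `Float8LinearConfig.from_recipe_name` accepts these recipe names; the toy model is purely illustrative, and the MX (mxtw) path uses a separate config not shown here:

```python
import torch.nn as nn
from torchao.float8 import Float8LinearConfig, convert_to_float8_training

# Illustrative stand-in for the Llama-3.1-8B model; fp8 GEMMs need an
# fp8-capable GPU (e.g. H100) and linear dimensions divisible by 16.
model = nn.Sequential(nn.Linear(4096, 4096), nn.Linear(4096, 4096)).cuda().bfloat16()

# "rowwise_with_gw_hp" corresponds to the *_rw_with_gw_hp runs above;
# "rowwise" and "tensorwise" cover the *_rw and *_tw runs.
config = Float8LinearConfig.from_recipe_name("rowwise_with_gw_hp")
model = convert_to_float8_training(model, config=config)
```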
128
 
129
+ ### *All 15 models trained on 1 node, 4 nodes, and 8 nodes with both bf16-fp8 and bf16 configurations and fp8 recipes*
130
  | Perplexity metric results for bf16 and bf16-fp8 configurations | Accuracy metric results for bf16 and bf16-fp8 configurations | Loss metric results for bf16 and bf16-fp8 configurations | Memory allocation for bf16 and bf16-fp8 configurations | Utilization for bf16 and bf16-fp8 configurations |
131
  |:--:|:--:|:--:|:--:|:--:|
132
  | ![perp](https://cdn-uploads.huggingface.co/production/uploads/683d4880e639f8d647355997/ij1hlr8E2qvdZM4uGC7lq.png) | ![acc](https://cdn-uploads.huggingface.co/production/uploads/683d4880e639f8d647355997/7lO8mVKPnQQkyUTw8H6GA.png) | ![train_loss](https://cdn-uploads.huggingface.co/production/uploads/683d4880e639f8d647355997/E73tvcC6u9VrvTIkznwU2.png) | ![memAlo](https://cdn-uploads.huggingface.co/production/uploads/683d4880e639f8d647355997/NsHL_yaTtnjwD1e4EHcLP.png) | ![utils](https://cdn-uploads.huggingface.co/production/uploads/683d4880e639f8d647355997/5mqF8xcRWuZdC_sGS9FCe.png) |
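
Assuming the plotted perplexity is the usual exponential of the cross-entropy loss, the perplexity panel above and the loss table below carry the same information; for example:

```python
import math

# Perplexity = exp(cross-entropy loss); e.g. the w16a8_rw final validation
# loss of 0.8394 corresponds to a perplexity of roughly 2.3.
print(math.exp(0.8394))  # ~2.315
```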
133
 
 
134
  | Model | Max Loss (train) | Min Loss (train) | Avg Loss (train) | Final Loss (train) | ± Std (train) | Max Loss (val) | Min Loss (val) | Avg Loss (val) | Final Loss (val) | ± Std (val) |
135
  | ---------------------------------------------- | ---------------- | ---------------- | ---------------- | ------------------ | ------------- | -------------- | -------------- | -------------- | ---------------- | ----------- |
136
  | Llama-3.1-8B-Instruct-w16a8-rw | 8 | 3.1682 | 0.5740 | 0.8118 | 0.6431 | 0.2746 | 1.0613 | 0.8394 | 0.8937 | 0.8394 | 0.0688 |
 
142
  | Llama-3.1-8B-Instruct-w16a8-4nodes-bs32 | 32 | 3.2840 | 0.7478 | 0.9748 | 0.4905 | 0.7581 | 1.0701 | 0.8430 | 0.8922 | 0.0764 | 0.8430 | 70 |
143
  | Llama-3.1-8B-Instruct-w16a16-8nodes-bs32 | 32 | 3.2311 | 0.8448 | 1.1856 | 0.6434 | 0.8448 | 1.0257 | 0.8977 | 0.9460 | 0.0568 | 0.8977 | 35 |
144
  | Llama-3.1-8B-Instruct-w16a8-8nodes-bs32 | 32 | 3.3003 | 0.8473 | 1.1866 | 0.6481 | 0.8473 | 1.0203 | 0.8992 | 0.9445 | 0.0539 | 0.8992 | 35 |
145
+ | Llama-3.1-8B-Instruct-w16a16-4nodes-bs64 | 64 | 3.2311 | 0.8448 | 1.1856 | 0.6434 | 0.8448 | 1.0257 | 0.8977 | 0.9460 | 0.0568 | 0.8977 | 35 |
146
+ | Llama-3.1-8B-Instruct-w16a8-8nodes-bs64 | 64 | 3.3003 | 0.8473 | 1.1866 | 0.6481 | 0.8473 | 1.0203 | 0.8992 | 0.9445 | 0.0539 | 0.8992 | 17 |
147
+ | Llama-3.1-8B-Instruct-w16a8-rw_4nodes | 64 | 3.4517 | 0.7624 | 1.1173 | 0.7624 | 0.6891 | 1.3225 | 0.8791 | 0.9732 | 0.8791 | 0.1612 | 35 |
148
+ | Llama-3.1-8B-Instruct-w16a8-rw_8nodes | 64 | 3.8944 | 0.9583 | 1.6423 | 0.9583 | 1.0117 | 1.5384 | 1.0253 | 1.2103 | 1.0253 | 0.2849 | 17 |
149
+ | Llama-3.1-8B-Instruct-w16a8-rw_with_gw_hp_4nodes | 64 | 3.4517 | 0.7481 | 1.1091 | 0.7481 | 0.7021 | 1.3393 | 0.8660 | 0.9641 | 0.8666 | 0.1732 | 35 |
150
+ | Llama-3.1-8B-Instruct-w16a8-rw_with_gw_hp_8nodes | 64 | 3.9289 | 0.9702 | 1.6514 | 0.9702 | 1.0127 | 1.5537 | 1.0377 | 1.2222 | 1.0377 | 0.2877 | 17 |
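
The per-run statistics above (max, min, average, final loss and spread) can be reproduced from the logged loss curves. A minimal sketch with made-up values, since the actual W&B logs are not included here and it is not stated whether the reported ± Std is the sample or population deviation:

```python
import statistics

# Illustrative loss curve, not taken from the actual training logs.
train_loss = [3.17, 1.42, 0.95, 0.81, 0.64]

summary = {
    "max": max(train_loss),
    "min": min(train_loss),
    "avg": round(statistics.mean(train_loss), 4),
    "final": train_loss[-1],
    "std": round(statistics.stdev(train_loss), 4),  # sample std; use pstdev for population
}
print(summary)
```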
151
 
152
  ## **Implementation**
153
  ### *GPU & Memory Usage Profiling*