Update README.md
README.md CHANGED

@@ -15,6 +15,9 @@ In addition to sharing the model weights, we provide the core designs, engineeri
- **Language(s):** English; Chinese; Other languages
- **License:** Apache 2.0

+## Tech report
+
+[Tele-FLM Technical Report](https://arxiv.org/pdf/2404.16645)

## Bias, Risks, and Limitations

@@ -68,7 +71,7 @@ We adopt the architecture of FLM-101B as the backbone for Tele-FLM, with several
Consequently, Tele-FLM is largely compatible with Llama architecturally.
To maximize convenience for the community, we made minimal adjustments to Llama's code to adapt it to Tele-FLM and released it as open source.

-In the pre-training stage, we employ μP for optimal hyperparameter search. The μP model (Tele-FLM_μP) is architecturally identical to Tele-FLM except for the model width
+In the pre-training stage, we employ μP for optimal hyperparameter search. The μP model (Tele-FLM_μP) is architecturally identical to Tele-FLM except for the model width.
The architecture of Tele-FLM and Tele-FLM_μP is listed below.
For more details of μP, please refer to our technical report and the original Tensor Program papers.
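
Since Tele-FLM is largely Llama-compatible and the adapted modeling code is released as open source, loading it through `transformers` might look like the sketch below. The repo id `CofeAI/Tele-FLM` and the need for `trust_remote_code=True` are assumptions about the release, not confirmed by this change.

```python
# Minimal loading sketch; assumes the checkpoint is published under the repo id
# "CofeAI/Tele-FLM" and ships the adapted Llama-style modeling code, so
# trust_remote_code=True is needed. Adjust to the actual release.
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "CofeAI/Tele-FLM"  # hypothetical repo id
tokenizer = AutoTokenizer.from_pretrained(repo_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    repo_id,
    trust_remote_code=True,
    torch_dtype="auto",   # keep the checkpoint's native dtype
    device_map="auto",    # requires `accelerate` for automatic placement
)

inputs = tokenizer("Tele-FLM is", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
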
@@ -83,7 +86,7 @@ For more details of μP, please refer to our technical report and the original T
### Training Hyperparameters
Due to the smaller size, Tele-FLM_μP allows for significantly more experimental runs within fixed time and resource constraints.
-We searched
+We searched seven hyperparameters for pretraining. All the hyperparameters are shown below.

| Searched Hyperparameters ||| Non-Searched Hyperparameters ||
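
As a rough illustration of why a search on the narrow Tele-FLM_μP proxy is meaningful for the full-width model, the sketch below applies the width-scaling rules from the Tensor Programs papers to a searched learning rate and initialization. All widths and values are invented for illustration and are not the Tele-FLM settings.

```python
# Illustrative μP-style width transfer (per the Tensor Programs papers),
# not the exact Tele-FLM recipe. All numbers below are hypothetical.
proxy_width = 512     # hidden size of the small Tele-FLM_μP proxy (made up)
target_width = 8192   # hidden size of the full-width model (made up)
ratio = target_width / proxy_width

# Values found by searching on the proxy (made up).
proxy_lr = 3e-3
proxy_init_std = 0.02

# Under μP with an Adam-style optimizer, hidden-layer learning rates scale as
# 1/width and initialization std as 1/sqrt(width), so the proxy's optimum
# carries over to the wider model instead of being re-searched.
target_hidden_lr = proxy_lr / ratio
target_init_std = proxy_init_std / ratio ** 0.5

print(f"hidden-layer lr: {target_hidden_lr:.2e}, init std: {target_init_std:.4f}")
```
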
@@ -146,9 +149,8 @@ The parallel training setup for Tele-FLM is configured as follows: tensor parall
| Tele-FLM | 71.13 | 65.48 | 66.98 | 66.25 | 92.57 | 64.38 |

-## Tech report
-For more detailed capabilities of Tele-FLM, see [Tele-FLM Technical Report](https://arxiv.org/pdf/2404.16645)

+## Citation
If you find our work helpful, please consider citing it.
```
@misc{li2024teleflm,