Update README.md
README.md CHANGED
@@ -19,43 +19,12 @@ widget:
 example_title: "Question Answering"
 ---

-<h1>TOGETHER RESEARCH</h1>
-
-***!!! Be careful: this repo is still under construction, and its content may change at any time. !!!***
-
-# Model Summary
-
-We present Together-GPT-J-6B-ProxAdam-50x, a model capable of following human instructions and conducting zero/few-shot inference.
-The model was trained in a decentralized fashion with the ProxAdam optimizer, requiring only 2% of the cross-machine communication of vanilla data-parallel training.
-
 # Quick Start
 
 ```python
 from transformers import pipeline
 
-pipe = pipeline(model='togethercomputer/
 
 pipe("Where is Zurich? Ans:")
-```
-
-# Training Data
-
-We fine-tune [GPT-J-6B](https://huggingface.co/EleutherAI/gpt-j-6B) on NI, P3, MMLU-COT, and the Pile.
-- [Natural-Instructions](https://github.com/allenai/natural-instructions)
-- [P3](https://huggingface.co/datasets/Muennighoff/P3)
-- [MMLU-COT](https://github.com/jasonwei20/flan-2/blob/main/mmlu-cot.json)
-- [the Pile](https://huggingface.co/datasets/the_pile)
-
-The Pile is used to preserve the general ability of GPT-J.
-The others are instruction-tuning datasets.
-
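
For illustration only (this sketch is not part of the model card): the Training Data section above amounts to mixing instruction-tuning examples with general Pile text so that instruction tuning does not erode GPT-J's base language-modelling ability. With the Hugging Face `datasets` library, such a mixture could be assembled roughly as below; the inline instruction examples, the streaming/`select_columns` details, and the 0.8/0.2 mixing ratio are assumptions, not values documented here.

```python
# Rough sketch of the instruction + Pile data mixture described above.
# The instruction examples are inline stand-ins and the 0.8/0.2 ratio is assumed;
# the card does not state how the sources were weighted.
from datasets import Dataset, load_dataset, interleave_datasets

instruct = Dataset.from_dict({
    "text": [
        "Where is Zurich? Ans: Zurich is in Switzerland.",
        "Translate 'bonjour' to English. Ans: hello",
    ]
}).to_iterable_dataset()

general = load_dataset("the_pile", split="train", streaming=True)  # general text, keeps base ability
general = general.select_columns(["text"])                         # keep only the raw text field

mixture = interleave_datasets([instruct, general], probabilities=[0.8, 0.2], seed=0)
```

Sampling from `mixture` then yields mostly instruction-formatted text with a steady trickle of Pile text.
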
-# Hyperparameters
-
-We used AdamW with a learning rate of 1e-5 and a global batch size of 64, and trained for 5k steps.
-We used mixed-precision training, where activations are in FP16 while the optimizer states are kept in FP32.
-We truncate input sequences to 2048 tokens; sequences with fewer than 2048 tokens are concatenated into one long sequence to improve data efficiency.
-
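
To make the packing step above concrete, here is a minimal sketch (not the actual training code): tokenized examples are concatenated, separated by EOS tokens, and cut into fixed 2048-token blocks, so short examples do not waste the context window on padding. The helper name and the EOS separator are assumptions of this sketch.

```python
# Minimal illustration of packing short examples into fixed 2048-token blocks.
# Not the Together training pipeline; the EOS separator is an assumption.
from transformers import AutoTokenizer

MAX_LEN = 2048
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B")

def pack_sequences(texts, max_len=MAX_LEN):
    """Concatenate tokenized texts and cut them into fixed-length blocks."""
    buffer, blocks = [], []
    for text in texts:
        buffer.extend(tokenizer(text)["input_ids"])
        buffer.append(tokenizer.eos_token_id)   # separator between examples
        while len(buffer) >= max_len:           # emit every full 2048-token block
            blocks.append(buffer[:max_len])
            buffer = buffer[max_len:]
    return blocks                               # trailing partial block is dropped here

blocks = pack_sequences(["Where is Zurich? Ans: Zurich is in Switzerland."] * 2000)
```

Each packed block is a full-length example for the causal-LM objective, which is why packing improves data efficiency for inputs shorter than 2048 tokens.
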
-# Infrastructure
-
-We used [the Together Research Computer](https://together.xyz/) to conduct training.
-Specifically, we used 4 data parallel workers, each containing 2 \* A100 80GB GPUs.

 # Quick Start
 
 ```python
 from transformers import pipeline
 
+pipe = pipeline(model='togethercomputer/GPT-JT-6B-v0')
 
 pipe("Where is Zurich? Ans:")
+```
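
The updated Quick Start loads the renamed checkpoint `togethercomputer/GPT-JT-6B-v0` through the high-level `pipeline` API. For more control over generation, the same checkpoint can also be loaded directly; the dtype, device placement, and generation settings below are illustrative assumptions rather than recommendations from this card.

```python
# Illustrative alternative to the pipeline call above; settings are assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "togethercomputer/GPT-JT-6B-v0"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,   # assumed; fits a 6B model on a single modern GPU
    device_map="auto",           # requires the accelerate package
)

inputs = tokenizer("Where is Zurich? Ans:", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=32, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```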