Update README.md
Update Citation in README.md
README.md (CHANGED)
---
license: mit
library_name: transformers
tags:
- robotics
- vla
- diffusion
- multimodal
- pretraining
language:
- en
pipeline_tag: robotics
---

# CogACT-Base

CogACT is a new advanced VLA architecture derived from a VLM. Unlike previous works that directly repurpose a VLM for action prediction through simple action quantization, we propose a componentized VLA architecture with a specialized action module conditioned on the VLM output. CogACT-Base employs a [DiT-Base](https://github.com/facebookresearch/DiT) model as the action module.

All our [code](https://github.com/microsoft/CogACT) and [pre-trained model weights](https://huggingface.co/CogACT) are licensed under the MIT license.

Please refer to our [project page](https://cogact.github.io/) and [paper](https://cogact.github.io/CogACT_paper.pdf) for more details.

## Model Summary

- **Developed by:** The CogACT team, consisting of researchers from [Microsoft Research Asia](https://www.microsoft.com/en-us/research/lab/microsoft-research-asia/).
- **Model type:** Vision-Language-Action (language, image => robot actions)
- **Language(s) (NLP):** en
- **License:** MIT
- **Model components:**
  + **Vision Backbone**: DINOv2 ViT-L/14 and SigLIP ViT-So400M/14
  + **Language Model**: Llama-2
  + **Action Model**: DiT-Base
- **Pretraining Dataset:** A subset of [Open X-Embodiment](https://robotics-transformer-x.github.io/)
- **Repository:** [https://github.com/microsoft/CogACT](https://github.com/microsoft/CogACT)
- **Paper:** [CogACT: A Foundational Vision-Language-Action Model for Synergizing Cognition and Action in Robotic Manipulation](https://cogact.github.io/CogACT_paper.pdf)
- **Project Page:** [https://cogact.github.io/](https://cogact.github.io/)
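
To show how the components listed above fit together: the VLM (vision backbone + Llama-2) condenses the image and instruction into a cognition feature, and the DiT-based action module denoises a chunk of future actions conditioned on that feature, rather than emitting actions as quantized language tokens. The snippet below is only an illustrative sketch with made-up class and tensor names (`ToyActionModule`, `cognition_feature`); the real modules live in the [CogACT repository](https://github.com/microsoft/CogACT).

```python
# Illustrative sketch only -- not the actual CogACT implementation.
import torch
import torch.nn as nn

class ToyActionModule(nn.Module):
    """Stand-in for the DiT action module: predicts the noise on a chunk of
    future actions, conditioned on one cognition feature from the VLM."""
    def __init__(self, cognition_dim=4096, action_dim=7, horizon=16, hidden=256):
        super().__init__()
        self.horizon, self.action_dim = horizon, action_dim
        self.net = nn.Sequential(
            nn.Linear(horizon * action_dim + cognition_dim + 1, hidden),
            nn.GELU(),
            nn.Linear(hidden, horizon * action_dim),
        )

    def forward(self, noisy_actions, timestep, cognition_feature):
        # Condition on the VLM feature and the diffusion timestep.
        x = torch.cat([noisy_actions.flatten(1), cognition_feature, timestep[:, None]], dim=-1)
        return self.net(x).view(-1, self.horizon, self.action_dim)

cognition_feature = torch.randn(1, 4096)   # stand-in for the VLM output
noisy_actions = torch.randn(1, 16, 7)      # 16 future 7-DoF actions
noise_pred = ToyActionModule()(noisy_actions, torch.tensor([0.5]), cognition_feature)
print(noise_pred.shape)  # torch.Size([1, 16, 7])
```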

## Uses

CogACT takes a language instruction and a single-view RGB image as input and predicts the next 16 normalized robot actions (7-DoF end-effector deltas of the form ``x, y, z, roll, pitch, yaw, gripper``). These actions should be unnormalized and, optionally, integrated with our ``Adaptive Action Ensemble``; both steps depend on the dataset statistics (see the sketch after the inference example below).

CogACT models can be used zero-shot to control robots for setups seen in the [Open-X](https://robotics-transformer-x.github.io/) pretraining mixture. They can also be fine-tuned for new tasks and robot setups with an extremely small amount of demonstrations. See [our repository](https://github.com/microsoft/CogACT) for more information.

Here is a simple example for inference.

```python
# Please clone our repository and install its dependencies first.
# Minimal dependencies: `torch`, `transformers`, `timm`, `tokenizers`, ...

from PIL import Image
from vla import load_vla
import torch

model = load_vla(
    'CogACT/CogACT-Base',
    load_for_training=False,
    action_model_type='DiT-B',
    future_action_window_size=15,
)
# Requires about 30 GB of memory in fp32.

# (Optional) use "model.vlm = model.vlm.to(torch.bfloat16)" to run the VLM in bf16.

model.to('cuda:0').eval()

image: Image.Image = <input_your_image>
prompt = "move sponge near apple"  # input your prompt

# Predict actions (7-DoF; un-normalize with the RT-1 Google robot statistics, i.e. fractal20220817_data)
actions, _ = model.predict_action(
    image,
    prompt,
    unnorm_key='fractal20220817_data',  # the unnorm_key of your dataset
    cfg_scale=1.5,                      # CFG scales from 1.5 to 7 also perform well
    use_ddim=True,                      # use DDIM sampling
    num_ddim_steps=10,                  # number of DDIM sampling steps
)

# Returns 7-DoF actions for 16 steps with shape [16, 7].
```
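
In the call above, `unnorm_key` selects which dataset's statistics are used to map the normalized model outputs back to real robot units. As a rough, minimal sketch of that mapping (assuming, for illustration, that the statistics provide per-dimension lower and upper bounds and that actions are normalized to [-1, 1]; the bounds below are made up):

```python
# Illustrative only: un-normalizing a [16, 7] action chunk with hypothetical
# per-dimension bounds taken from dataset statistics.
import numpy as np

def unnormalize(actions_norm, low, high):
    """Map actions from the normalized range [-1, 1] back to dataset units."""
    actions_norm = np.asarray(actions_norm)
    return 0.5 * (actions_norm + 1.0) * (high - low) + low

# Hypothetical bounds for (x, y, z, roll, pitch, yaw, gripper);
# use your dataset's real statistics.
low = np.array([-0.1, -0.1, -0.1, -0.5, -0.5, -0.5, 0.0])
high = np.array([0.1, 0.1, 0.1, 0.5, 0.5, 0.5, 1.0])

actions_unnorm = unnormalize(np.random.uniform(-1.0, 1.0, size=(16, 7)), low, high)
print(actions_unnorm.shape)  # (16, 7)
```

For the optional ``Adaptive Action Ensemble`` and the actual per-dataset statistics, see the [repository](https://github.com/microsoft/CogACT).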

## Citation

```bibtex
@article{li2024cogact,
  title={CogACT: A Foundational Vision-Language-Action Model for Synergizing Cognition and Action in Robotic Manipulation},
  author={Li, Qixiu and Liang, Yaobo and Wang, Zeyu and Luo, Lin and Chen, Xi and Liao, Mozheng and Wei, Fangyun and Deng, Yu and Xu, Sicheng and Zhang, Yizhong and others},
  journal={arXiv preprint arXiv:2411.19650},
  year={2024}
}
```