---
library_name: stable-baselines3
tags:
- PandaReachDense-v3
- deep-reinforcement-learning
- reinforcement-learning
- stable-baselines3
model-index:
- name: A2C
  results:
  - task:
      type: reinforcement-learning
      name: reinforcement-learning
    dataset:
      name: PandaReachDense-v3
      type: PandaReachDense-v3
    metrics:
    - type: mean_reward
      value: -0.24 +/- 0.13
      name: mean_reward
      verified: false
---

# A2C Agent for PandaReachDense-v3

## Model Description

This repository contains a trained Advantage Actor-Critic (A2C) reinforcement learning agent for the PandaReachDense-v3 environment from panda-gym, which is built on the PyBullet physics engine. The agent was trained with the stable-baselines3 library to perform reaching tasks with a simulated Franka Emika Panda robot arm.

### Model Details

- **Algorithm**: A2C (Advantage Actor-Critic)
- **Environment**: PandaReachDense-v3 (panda-gym / PyBullet)
- **Framework**: Stable-Baselines3
- **Task Type**: Continuous control, goal-conditioned reaching
- **Action Space**: Continuous 3-dimensional end-effector displacement commands (the panda-gym default for this environment ID; the arm itself has 7 joints)
- **Observation Space**: Goal-conditioned dictionary observation (`observation`, `achieved_goal`, `desired_goal`) covering the end-effector position and velocity plus the target position, as shown in the sketch below

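A minimal sketch of creating the environment and inspecting its spaces (the details reflect panda-gym's defaults and are illustrative, not logged from the original training run):

```python
import gymnasium as gym
import panda_gym  # noqa: F401  (the import registers the Panda environments)

env = gym.make("PandaReachDense-v3")

# Dict observation: end-effector state plus achieved/desired goal positions
print(env.observation_space)
# Continuous 3-D end-effector displacement commands in [-1, 1]
print(env.action_space)

env.close()
```
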
### Environment Overview

PandaReachDense-v3 is a robotic manipulation task in which:

- **Objective**: A 7-DOF Franka Panda arm must move its end-effector to randomly sampled target positions
- **Reward Structure**: Dense reward based on the distance to the target, so rewards are negative and approach zero as reaching improves (see the sketch below)
- **Difficulty**: Continuous, goal-conditioned control; the target changes every episode, so the policy must generalize across target locations

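The dense reward is essentially the negative Euclidean distance between the end-effector and the target. The helper below is an illustrative sketch of that formulation (an assumption based on panda-gym's standard Reach reward, not code taken from this repository):

```python
import numpy as np


def dense_reach_reward(achieved_goal: np.ndarray, desired_goal: np.ndarray) -> float:
    """Per-step dense reward: negative distance between end-effector and target."""
    return -float(np.linalg.norm(achieved_goal - desired_goal))


# Reward is 0 only when the end-effector sits exactly on the target
print(dense_reach_reward(np.array([0.1, 0.0, 0.2]), np.array([0.1, 0.3, 0.2])))  # about -0.3
```
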
## Performance

The trained A2C agent achieves the following performance metrics:

- **Mean Reward**: -0.24 ± 0.13
- **Performance Context**: This represents strong performance for this environment, where typical untrained baselines often achieve rewards around -3.5
- **Training Stability**: The relatively low standard deviation indicates consistent performance across evaluation episodes

### Performance Analysis

The achieved mean reward of -0.24 demonstrates a significant improvement over a random baseline. In PandaReachDense-v3, rewards are negative and approach zero as the agent becomes more proficient at reaching targets, so improving from roughly -3.5 to -0.24 indicates the agent has successfully learned to:

- Move the end-effector efficiently toward the target position
- Take direct paths that minimize the accumulated distance penalty
- Reach consistently across varied target locations

## Usage

### Installation Requirements

```bash
pip install stable-baselines3[extra]
pip install huggingface-sb3
pip install panda-gym  # registers PandaReachDense-v3 and pulls in gymnasium + pybullet
```

### Loading and Using the Model

```python
import gymnasium as gym
import panda_gym  # noqa: F401  (the import registers the Panda environments)
from huggingface_sb3 import load_from_hub
from stable_baselines3 import A2C

# Download the checkpoint from the Hub and load it
checkpoint = load_from_hub(
    repo_id="Adilbai/a2c-PandaReachDense-v3",
    filename="a2c-PandaReachDense-v3.zip",
)
model = A2C.load(checkpoint)

# Create the environment (render_mode="human" opens the PyBullet viewer)
env = gym.make("PandaReachDense-v3", render_mode="human")

# Run the agent
obs, info = env.reset()
for _ in range(1000):
    action, _states = model.predict(obs, deterministic=True)
    obs, reward, terminated, truncated, info = env.step(action)
    if terminated or truncated:
        obs, info = env.reset()

env.close()
```

### Advanced Usage: Fine-tuning

```python
import gymnasium as gym
import panda_gym  # noqa: F401
from huggingface_sb3 import load_from_hub
from stable_baselines3 import A2C

# Download and load the pre-trained model
checkpoint = load_from_hub(
    repo_id="Adilbai/a2c-PandaReachDense-v3",
    filename="a2c-PandaReachDense-v3.zip",
)
model = A2C.load(checkpoint)

# Create an environment for fine-tuning
env = gym.make("PandaReachDense-v3")

# Continue training (fine-tuning)
model.set_env(env)
model.learn(total_timesteps=100_000)

# Save the fine-tuned model (written to fine_tuned_a2c_panda.zip)
model.save("fine_tuned_a2c_panda")
```

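If you want to share a fine-tuned checkpoint, `huggingface_sb3` also provides a `push_to_hub` helper. The snippet below is a hedged sketch: the repo id is a placeholder for a repository you own, and it assumes you are logged in via `huggingface-cli login`.

```python
from huggingface_sb3 import push_to_hub

# Uploads the file saved by model.save("fine_tuned_a2c_panda") above
push_to_hub(
    repo_id="your-username/a2c-PandaReachDense-v3-finetuned",  # placeholder repo id
    filename="fine_tuned_a2c_panda.zip",
    commit_message="Fine-tuned A2C on PandaReachDense-v3",
)
```
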
### Evaluation Script

```python
import gymnasium as gym
import numpy as np
import panda_gym  # noqa: F401  (the import registers the Panda environments)
from huggingface_sb3 import load_from_hub
from stable_baselines3 import A2C


def evaluate_model(model, env, num_episodes=10):
    """Evaluate the model's performance over multiple episodes."""
    episode_rewards = []

    for episode in range(num_episodes):
        obs, info = env.reset()
        episode_reward = 0.0
        done = False

        while not done:
            action, _states = model.predict(obs, deterministic=True)
            obs, reward, terminated, truncated, info = env.step(action)
            episode_reward += float(reward)
            done = terminated or truncated

        episode_rewards.append(episode_reward)
        print(f"Episode {episode + 1}: Reward = {episode_reward:.2f}")

    mean_reward = np.mean(episode_rewards)
    std_reward = np.std(episode_rewards)

    print("\nEvaluation Results:")
    print(f"Mean Reward: {mean_reward:.2f} ± {std_reward:.2f}")

    return episode_rewards


# Download, load, and evaluate the model
checkpoint = load_from_hub(
    repo_id="Adilbai/a2c-PandaReachDense-v3",
    filename="a2c-PandaReachDense-v3.zip",
)
model = A2C.load(checkpoint)

env = gym.make("PandaReachDense-v3")
rewards = evaluate_model(model, env, num_episodes=20)
env.close()
```

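Alternatively, Stable-Baselines3 ships a built-in evaluation helper that performs the same rollout loop; the shorter sketch below is equivalent to the script above.

```python
import gymnasium as gym
import panda_gym  # noqa: F401
from huggingface_sb3 import load_from_hub
from stable_baselines3 import A2C
from stable_baselines3.common.evaluation import evaluate_policy

checkpoint = load_from_hub(
    repo_id="Adilbai/a2c-PandaReachDense-v3",
    filename="a2c-PandaReachDense-v3.zip",
)
model = A2C.load(checkpoint)
env = gym.make("PandaReachDense-v3")

# evaluate_policy handles the episode loop and returns mean/std of episodic returns
mean_reward, std_reward = evaluate_policy(model, env, n_eval_episodes=20, deterministic=True)
print(f"Mean Reward: {mean_reward:.2f} ± {std_reward:.2f}")

env.close()
```
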
## Training Information

### Hyperparameters

The model was trained using A2C with the following key characteristics:

- **Policy**: `MultiInputPolicy` (MLP actor and critic networks over the dictionary observation); a sketch of a typical training setup follows this list
- **Environment**: PandaReachDense-v3 with dense reward shaping
- **Training Framework**: Stable-Baselines3

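For reference, a typical A2C training run for this environment looks like the sketch below. The number of parallel environments and the timestep budget are illustrative assumptions; the exact values used for this checkpoint are not documented here.

```python
import panda_gym  # noqa: F401  (registers the Panda environments)
from stable_baselines3 import A2C
from stable_baselines3.common.env_util import make_vec_env

# A2C collects synchronous rollouts, so several parallel workers help (4 is an arbitrary choice)
vec_env = make_vec_env("PandaReachDense-v3", n_envs=4)

# MultiInputPolicy is required because the observation space is a Dict
model = A2C("MultiInputPolicy", vec_env, verbose=1)
model.learn(total_timesteps=1_000_000)  # assumed budget
model.save("a2c-PandaReachDense-v3")
```
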
### Training Environment

- **Observation Space**: Goal-conditioned dictionary observation containing:
  - End-effector position and velocity (`observation`)
  - Current end-effector position (`achieved_goal`)
  - Target position (`desired_goal`)
- **Action Space**: 3-dimensional continuous end-effector displacement control (the panda-gym default for this environment ID)
- **Reward Function**: Dense reward based on the distance between the end-effector and the target

## Limitations and Considerations

- **Environment Specificity**: The model is trained specifically for PandaReachDense-v3 and may not generalize to other robotic tasks
- **Simulation Gap**: Trained entirely in simulation; real-world deployment would require domain adaptation
- **Deterministic Evaluation**: Reported metrics are based on deterministic policy evaluation
- **Hardware Requirements**: Real-time inference requires only modest computational resources

## Citation

If you use this model in your research, please cite:

```bibtex
@misc{a2c_panda_reach_2024,
  title={A2C Agent for PandaReachDense-v3},
  author={Adilbai},
  year={2024},
  publisher={Hugging Face},
  howpublished={\url{https://huggingface.co/Adilbai/a2c-PandaReachDense-v3}}
}
```

## License

This model is distributed under the MIT License. See the repository for full license details.
|