---
library_name: stable-baselines3
tags:
- PandaReachDense-v3
- deep-reinforcement-learning
- reinforcement-learning
- stable-baselines3
model-index:
- name: A2C
  results:
  - task:
      type: reinforcement-learning
      name: reinforcement-learning
    dataset:
      name: PandaReachDense-v3
      type: PandaReachDense-v3
    metrics:
    - type: mean_reward
      value: -0.24 +/- 0.13
      name: mean_reward
      verified: false
---
# A2C Agent for PandaReachDense-v3

## Model Description

This repository contains a trained Advantage Actor-Critic (A2C) reinforcement learning agent for the PandaReachDense-v3 environment from the panda-gym suite, which is built on the PyBullet physics engine. The agent was trained with the stable-baselines3 library to perform reaching tasks with a simulated Franka Emika Panda robotic arm.

### Model Details

- **Algorithm**: A2C (Advantage Actor-Critic)
- **Environment**: PandaReachDense-v3 (panda-gym, PyBullet-based)
- **Framework**: Stable-Baselines3
- **Task Type**: Continuous Control
- **Action Space**: Continuous (7-dimensional joint control)
- **Observation Space**: Dictionary observation combining the robot's proprioceptive state (positions and velocities) with the target position (the snippet below shows how to inspect both spaces)
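
The exact shapes depend on the installed panda-gym version, so it is worth inspecting the spaces directly. A minimal sketch, assuming panda-gym and Gymnasium are installed as described in the Installation Requirements section:

```python
import gymnasium as gym
import panda_gym  # registers the PandaReachDense-v3 environment

env = gym.make("PandaReachDense-v3")

# Print the registered spaces; panda-gym typically exposes a Dict observation
# with "observation", "achieved_goal" and "desired_goal" entries.
print("Action space:     ", env.action_space)
print("Observation space:", env.observation_space)

env.close()
```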

### Environment Overview

PandaReachDense-v3 is a robotic manipulation task where:
- **Objective**: Control a 7-DOF Franka Panda robotic arm to reach target positions
- **Reward Structure**: Dense reward based on the distance between the end-effector and the target position (illustrated in the sketch after this list)
- **Difficulty**: Continuous control with high-dimensional action and observation spaces
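
In panda-gym's dense variants the per-step reward is typically the negative Euclidean distance between the end-effector and the target. The following sketch (an illustration under that assumption, not part of this repository) makes the relationship visible by comparing the reward returned by the environment with the distance between `achieved_goal` and `desired_goal`:

```python
import gymnasium as gym
import numpy as np
import panda_gym  # registers the PandaReachDense-v3 environment

env = gym.make("PandaReachDense-v3")
obs, info = env.reset(seed=0)

# Take one random action and compare the returned reward with the
# end-effector/target distance; in the dense variant reward ~ -distance.
action = env.action_space.sample()
obs, reward, terminated, truncated, info = env.step(action)
distance = np.linalg.norm(obs["achieved_goal"] - obs["desired_goal"])
print(f"reward = {reward:.3f}, -distance = {-distance:.3f}")

env.close()
```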

## Performance

The trained A2C agent achieves the following performance metrics:

- **Mean Reward**: -0.24 ± 0.13
- **Performance Context**: This represents strong performance for this environment, where typical untrained baselines often achieve rewards around -3.5
- **Training Stability**: The relatively low standard deviation indicates consistent performance across evaluation episodes (the snippet below shows how these figures can be estimated)
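
Figures of this form can be estimated with Stable-Baselines3's built-in evaluation helper. A short sketch, assuming the checkpoint is downloaded from this repository as shown in the Usage section:

```python
import gymnasium as gym
import panda_gym  # registers the PandaReachDense-v3 environment
from stable_baselines3 import A2C
from stable_baselines3.common.evaluation import evaluate_policy
from huggingface_sb3 import load_from_hub

# Download the checkpoint from the Hub and load it into an A2C model
checkpoint = load_from_hub(
    repo_id="Adilbai/a2c-PandaReachDense-v3",
    filename="a2c-PandaReachDense-v3.zip",
)
model = A2C.load(checkpoint)

env = gym.make("PandaReachDense-v3")
mean_reward, std_reward = evaluate_policy(
    model, env, n_eval_episodes=20, deterministic=True
)
print(f"mean_reward = {mean_reward:.2f} +/- {std_reward:.2f}")
env.close()
```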

### Performance Analysis

The achieved mean reward of -0.24 demonstrates a clear improvement over untrained baselines. In the PandaReachDense-v3 environment, rewards are typically negative and approach zero as the agent becomes more proficient at reaching targets. The substantial improvement from the baseline of approximately -3.5 indicates the agent has successfully learned to:

- Navigate the robotic arm efficiently toward target positions
- Minimize unnecessary movements and energy expenditure
- Achieve consistent reaching behavior across varied target locations

## Usage

### Installation Requirements

```bash
pip install stable-baselines3[extra]
pip install huggingface-sb3
pip install panda-gym  # provides PandaReachDense-v3 (pulls in PyBullet and Gymnasium)
```

### Loading and Using the Model

```python
import gymnasium as gym
import panda_gym  # registers the PandaReachDense-v3 environment
from stable_baselines3 import A2C
from huggingface_sb3 import load_from_hub

# Download the checkpoint from the Hub and load it into an A2C model
checkpoint = load_from_hub(
    repo_id="Adilbai/a2c-PandaReachDense-v3",
    filename="a2c-PandaReachDense-v3.zip",
)
model = A2C.load(checkpoint)

# Create the environment (render_mode="human" opens the PyBullet viewer)
env = gym.make("PandaReachDense-v3", render_mode="human")

# Run the trained policy
obs, info = env.reset()
for _ in range(1000):
    action, _states = model.predict(obs, deterministic=True)
    obs, reward, terminated, truncated, info = env.step(action)
    if terminated or truncated:
        obs, info = env.reset()

env.close()
```

### Advanced Usage: Fine-tuning

```python
import gymnasium as gym
import panda_gym  # registers the PandaReachDense-v3 environment
from stable_baselines3 import A2C
from huggingface_sb3 import load_from_hub

# Download the pre-trained checkpoint and load it into an A2C model
checkpoint = load_from_hub(
    repo_id="Adilbai/a2c-PandaReachDense-v3",
    filename="a2c-PandaReachDense-v3.zip",
)
model = A2C.load(checkpoint)

# Create environment for fine-tuning
env = gym.make("PandaReachDense-v3")

# Continue training (fine-tuning)
model.set_env(env)
model.learn(total_timesteps=100000)

# Save the fine-tuned model
model.save("fine_tuned_a2c_panda")
```

### Evaluation Script

```python
import gymnasium as gym
import numpy as np
import panda_gym  # registers the PandaReachDense-v3 environment
from stable_baselines3 import A2C
from huggingface_sb3 import load_from_hub

def evaluate_model(model, env, num_episodes=10):
    """Evaluate the model performance over multiple episodes"""
    episode_rewards = []
    
    for episode in range(num_episodes):
        obs, info = env.reset()
        episode_reward = 0
        done = False

        while not done:
            action, _states = model.predict(obs, deterministic=True)
            obs, reward, terminated, truncated, info = env.step(action)
            done = terminated or truncated
            episode_reward += reward
        
        episode_rewards.append(episode_reward)
        print(f"Episode {episode + 1}: Reward = {episode_reward:.2f}")
    
    mean_reward = np.mean(episode_rewards)
    std_reward = np.std(episode_rewards)
    
    print(f"\nEvaluation Results:")
    print(f"Mean Reward: {mean_reward:.2f} ± {std_reward:.2f}")
    
    return episode_rewards

# Load and evaluate the model
checkpoint = load_from_hub(
    repo_id="Adilbai/a2c-PandaReachDense-v3",
    filename="a2c-PandaReachDense-v3.zip",
)
model = A2C.load(checkpoint)

env = gym.make("PandaReachDense-v3")
rewards = evaluate_model(model, env, num_episodes=20)
env.close()
```

## Training Information

### Hyperparameters

The model was trained using A2C with the following key characteristics:
- **Policy**: `MultiInputPolicy` (MLP actor and critic networks over the dictionary observation)
- **Environment**: PandaReachDense-v3 with dense reward shaping
- **Training Framework**: Stable-Baselines3 (a from-scratch training sketch follows this list)
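
The card does not record the exact hyperparameter values used for this checkpoint. The sketch below shows how a comparable agent could be trained from scratch with Stable-Baselines3 defaults; the timestep budget is an illustrative placeholder, not the value used for this model:

```python
import gymnasium as gym
import panda_gym  # registers the PandaReachDense-v3 environment
from stable_baselines3 import A2C

env = gym.make("PandaReachDense-v3")

# MultiInputPolicy handles the Dict observation (observation / achieved_goal /
# desired_goal); all other hyperparameters are left at SB3 defaults here.
model = A2C("MultiInputPolicy", env, verbose=1)
model.learn(total_timesteps=1_000_000)  # illustrative budget, not the original

model.save("a2c-PandaReachDense-v3-retrained")
```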

### Training Environment

- **Observation Space**: Continuous state representation including:
  - Joint positions and velocities
  - End-effector position
  - Target position
  - Distance to target
- **Action Space**: 7-dimensional continuous control (joint torques/positions)
- **Reward Function**: Dense reward shaped by the distance between the end-effector and the target (see the observation sketch after this list)
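
Because the environment is goal-conditioned, observations arrive as a dictionary rather than a flat vector, which is why the model uses a `MultiInputPolicy`. A short sketch that prints the observation structure (key names and shapes may vary slightly between panda-gym versions):

```python
import gymnasium as gym
import panda_gym  # registers the PandaReachDense-v3 environment

env = gym.make("PandaReachDense-v3")
obs, info = env.reset(seed=0)

# Typical keys: "observation" (robot state), "achieved_goal" (current
# end-effector position) and "desired_goal" (target position).
for key, value in obs.items():
    print(f"{key}: shape={value.shape}")

env.close()
```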

## Limitations and Considerations

- **Environment Specificity**: Model is specifically trained for PandaReachDense-v3 and may not generalize to other robotic tasks
- **Simulation Gap**: Trained in simulation; real-world deployment would require domain adaptation
- **Deterministic Evaluation**: Performance metrics based on deterministic policy evaluation
- **Hardware Requirements**: Real-time inference requires modest computational resources

## Citation

If you use this model in your research, please cite:

```bibtex
@misc{a2c_panda_reach_2024,
  title={A2C Agent for PandaReachDense-v3},
  author={Adilbai},
  year={2024},
  publisher={Hugging Face},
  howpublished={\url{https://huggingface.co/Adilbai/a2c-PandaReachDense-v3}}
}
```

## License

This model is distributed under the MIT License. See the repository for full license details.