Skip to content

Deep Deterministic Policy Gradient (DDPG)

Overview

DDPG is a popular DRL algorithm for continuous control. It runs reasonably fast by leveraging vector (parallel) environments and naturally works well with different action spaces, therefore supporting a variety of games. It also has good sample efficiency compared to algorithms such as DQN.

Original paper:

Reference resources:

Implemented Variants

Variants Implemented Description
ddpg_continuous_action.py, docs For continuous action space. Also implemented Mujoco-specific code-level optimizations

Below are our single-file implementations of PPO:

ddpg_continuous_action.py

The ppo.py has the following features:

  • For continuous action space. Also implemented Mujoco-specific code-level optimizations
  • Works with the Box observation space of low-level features
  • Works with the Box (continuous) action space

Usage

poetry install
poetry install -E pybullet
python cleanrl/ddpg_continuous_action.py --help
python cleanrl/ddpg_continuous_action.py --env-id HopperBulletEnv-v0
poetry install -E mujoco # only works in Linux
python cleanrl/ddpg_continuous_action.py --env-id Hopper-v3

Implementation details

Our ddpg_continuous_action.py is based on the OurDDPG.py from sfujim/TD3, which presents the the following implementation difference from (Lillicrap et al., 2016)1:

  1. ddpg_continuous_action.py uses a gaussian exploration noise \(\mathcal{N}(0, 0.1)\), while (Lillicrap et al., 2016)1 uses Ornstein-Uhlenbeck process with \(\theta=0.15\) and \(\sigma=0.2\).

  2. ddpg_continuous_action.py runs the experiments using the openai/gym MuJoCo environments, while (Lillicrap et al., 2016)1 uses their proprietary MuJoCo environments.

  3. ddpg_continuous_action.py uses the following architecture:

    class QNetwork(nn.Module):
        def __init__(self, env):
            super(QNetwork, self).__init__()
            self.fc1 = nn.Linear(np.array(env.single_observation_space.shape).prod() + np.prod(env.single_action_space.shape), 256)
            self.fc2 = nn.Linear(256, 256)
            self.fc3 = nn.Linear(256, 1)
    
        def forward(self, x, a):
            x = torch.cat([x, a], 1)
            x = F.relu(self.fc1(x))
            x = F.relu(self.fc2(x))
            x = self.fc3(x)
            return x
    
    
    class Actor(nn.Module):
        def __init__(self, env):
            super(Actor, self).__init__()
            self.fc1 = nn.Linear(np.array(env.single_observation_space.shape).prod(), 256)
            self.fc2 = nn.Linear(256, 256)
            self.fc_mu = nn.Linear(256, np.prod(env.single_action_space.shape))
    
        def forward(self, x):
            x = F.relu(self.fc1(x))
            x = F.relu(self.fc2(x))
            return torch.tanh(self.fc_mu(x))
    
    while (Lillicrap et al., 2016, see Appendix 7 EXPERIMENT DETAILS)1 uses the following architecture (difference highlighted):

    class QNetwork(nn.Module):
        def __init__(self, env):
            super(QNetwork, self).__init__()
            self.fc1 = nn.Linear(np.array(env.single_observation_space.shape).prod(), 400)
            self.fc2 = nn.Linear(400 + np.prod(env.single_action_space.shape), 300)
            self.fc3 = nn.Linear(300, 1)
    
        def forward(self, x, a):
            x = F.relu(self.fc1(x))
            x = torch.cat([x, a], 1)
            x = F.relu(self.fc2(x))
            x = self.fc3(x)
            return x
    
    
    class Actor(nn.Module):
        def __init__(self, env):
            super(Actor, self).__init__()
            self.fc1 = nn.Linear(np.array(env.single_observation_space.shape).prod(), 400)
            self.fc2 = nn.Linear(400, 300)
            self.fc_mu = nn.Linear(300, np.prod(env.single_action_space.shape))
    
        def forward(self, x):
            x = F.relu(self.fc1(x))
            x = F.relu(self.fc2(x))
            return torch.tanh(self.fc_mu(x))
    
  4. ddpg_continuous_action.py uses the following learning rates:

    q_optimizer = optim.Adam(list(qf1.parameters()), lr=3e-4)
    actor_optimizer = optim.Adam(list(actor.parameters()), lr=3e-4)
    
    while (Lillicrap et al., 2016, see Appendix 7 EXPERIMENT DETAILS)1 uses the following learning rates:

    q_optimizer = optim.Adam(list(qf1.parameters()), lr=1e-4)
    actor_optimizer = optim.Adam(list(actor.parameters()), lr=1e-3)
    
  5. ddpg_continuous_action.py uses --batch-size=256 --tau=0.005, while (Lillicrap et al., 2016, see Appendix 7 EXPERIMENT DETAILS)1 uses --batch-size=64 --tau=0.001

Experiment results

PR vwxyzjn/cleanrl#120 tracks our effort to conduct experiments, and the reprodudction instructions can be found at vwxyzjn/cleanrl/benchmark/ppo.

Below are the average episodic returns for ppo.py. To ensure the quality of the implementation, we compared the results against openai/baselies' PPO.

Environment ppo.py openai/baselies' PPO
CartPole-v1 488.75 ± 18.40 497.54 ± 4.02
Acrobot-v1 -82.48 ± 5.93 -81.82 ± 5.58
MountainCar-v0 -200.00 ± 0.00 -200.00 ± 0.00

Learning curves:

Tracked experiments and game play videos:


  1. Lillicrap, T.P., Hunt, J.J., Pritzel, A., Heess, N.M., Erez, T., Tassa, Y., Silver, D., & Wierstra, D. (2016). Continuous control with deep reinforcement learning. CoRR, abs/1509.02971. https://arxiv.org/abs/1509.02971 

Back to top