Agents

DQN

class aitraineree.agents.dqn.DQNAgent(obs_space: DataSpace, action_space: DataSpace, network_fn: Callable[[], NetworkType] = None, network_class: Type[NetworkTypeClass] = None, state_transform: Callable | None = None, reward_transform: Callable | None = None, **kwargs)

Deep Q-Learning Network (DQN).

The agent is not a vanilla DQN, although it can be configured as such. The default configuration includes dual dueling nets and the prioritized experience buffer. Learning is also delayed by slowly copying weights to the target nets (via the tau parameter). Although N-step returns are implemented, the default value is 1-step reward.

There is also a specific implementation of the DQN called Rainbow, which differs from this implementation by working on the discrete space projection of the Q(s,a) function.
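
The slow copying to target nets controlled by tau is commonly implemented as a soft (Polyak) update. The following is a minimal sketch on plain lists of floats, not the library's own code; soft_update and the weight lists are hypothetical names used only for illustration.

```python
def soft_update(target, source, tau):
    """Blend source weights into target weights: t <- tau * s + (1 - tau) * t."""
    return [tau * s + (1.0 - tau) * t for t, s in zip(target, source)]


# Hypothetical weights; tau=0.002 matches the documented default.
target_weights = [0.0, 1.0]
online_weights = [1.0, 1.0]
target_weights = soft_update(target_weights, online_weights, tau=0.002)
```

With a small tau the target weights move only slightly towards the online weights on each update, which is what delays learning.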

__init__(obs_space: DataSpace, action_space: DataSpace, network_fn: Callable[[], NetworkType] = None, network_class: Type[NetworkTypeClass] = None, state_transform: Callable | None = None, reward_transform: Callable | None = None, **kwargs)

Initiates the DQN agent.

Parameters:

obs_space (DataSpace) – Dataspace describing the input.
action_space (DataSpace) – Dataspace describing the output.
network_fn (optional func) – Function used to instantiate a network used by the agent.
network_class (optional cls) – Class of network that is instantiated with internal params to create network.
state_transform (optional func) – Function to transform (encode) state before used by the network.
reward_transform (optional func) – Function to transform reward before use.

Keyword Arguments:

hidden_layers (tuple of ints) – Shape of the hidden layers in fully connected network. Default: (64, 64).
lr (float) – Learning rate value. Default: 3e-4.
gamma (float) – Discount factor. Default: 0.99.
tau (float) – Soft-copy factor. Default: 0.002.
update_freq (int) – Number of steps between each learning step. Default: 1.
batch_size (int) – Number of samples to use at each learning step. Default: 80.
buffer_size (int) – Number of most recent samples to keep in memory for learning. Default: 1e5.
warm_up (int) – Number of samples to observe before starting any learning step. Default: 0.
number_updates (int) – How many times to use learning step in the learning phase. Default: 1.
max_grad_norm (float) – Maximum norm of the gradient used in learning. Default: 10.
using_double_q (bool) – Whether to use Double Q Learning network. Default: True.
n_steps (int) – Number of lookahead steps when estimating reward. See NStepBuffer. Default: 3.
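
To illustrate what n_steps controls, here is a small sketch of an N-step discounted reward. It assumes the usual definition (sum of gamma^k * r_k over the lookahead window, with the bootstrapped value added separately); whether NStepBuffer computes exactly this is an assumption, and n_step_return is a hypothetical helper.

```python
def n_step_return(rewards, gamma):
    """Discounted sum of the given rewards: r_0 + gamma * r_1 + gamma^2 * r_2 + ..."""
    total = 0.0
    for k, reward in enumerate(rewards):
        total += (gamma ** k) * reward
    return total


# Three lookahead steps with the documented default discount factor.
three_step = n_step_return([1.0, 1.0, 1.0], gamma=0.99)
```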

act(experience: Experience, eps: float = 0.0) → Experience

Returns actions for given obs as per current policy.

Parameters:

experience (Experience) – current observation
eps (optional float) – epsilon, for epsilon-greedy action selection. Default: 0.

Returns:

Categorical value for the action.
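
Epsilon-greedy selection, as referenced by the eps parameter, can be sketched as follows. This is an illustrative stand-alone function operating on a list of Q-values, not the agent's implementation; epsilon_greedy and its rng argument are hypothetical.

```python
import random


def epsilon_greedy(q_values, eps, rng=random):
    """Pick a random action with probability eps, otherwise the greedy one."""
    if rng.random() < eps:
        return rng.randrange(len(q_values))
    # Greedy choice: index of the largest Q-value.
    return max(range(len(q_values)), key=lambda a: q_values[a])


# With eps=0 the choice is always greedy.
greedy_action = epsilon_greedy([0.1, 0.7, 0.2], eps=0.0)
```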

static from_state(state: AgentState) → AgentBase

get_network_state() → NetworkState

get_state() → AgentState: Provides agent’s internal state.

learn(experiences: dict[str, list]) → None

Updates agent’s networks based on provided experience.

Parameters:: experiences – Experiences sampled from the experience buffer.
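
When using_double_q is enabled, a learning step typically builds its target by selecting the next action with the online network and evaluating it with the target network. The sketch below shows that rule for a single transition; it assumes the standard Double Q-learning formulation, and double_q_target is a hypothetical helper rather than part of the agent.

```python
def double_q_target(reward, done, gamma, next_q_online, next_q_target):
    """Double Q-learning target for one transition."""
    if done:
        # No bootstrapping from a terminal state.
        return reward
    # Action is selected by the online net but evaluated by the target net.
    best_action = max(range(len(next_q_online)), key=lambda a: next_q_online[a])
    return reward + gamma * next_q_target[best_action]


target = double_q_target(
    reward=1.0, done=False, gamma=0.99,
    next_q_online=[0.2, 0.9], next_q_target=[0.5, 0.3],
)
```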

load_buffer(path: str) → None

Loads data into the buffer from provided file path.

Parameters:: path – String path indicating where the buffer is stored.

load_state(*, path: str | None = None, state: AgentState | None = None) → None

Loads state from a file under provided path.

Parameters:: path – String path indicating where the state is stored.

log_metrics(data_logger: DataLogger, step: int, full_log: bool = False)

Uses provided DataLogger to provide agent’s metrics.

Parameters:

data_logger (DataLogger) – Instance of the SummaryView, e.g. torch.utils.tensorboard.SummaryWriter.
step (int) – Ordering value, e.g. episode number.
full_log (bool) – Whether to log all available information. Useful for logging at a lower frequency.

property loss: dict[str, float]

model: str = 'DQN'

reset(): Resets data not associated with learning.

save_buffer(path: str) → None

Saves data from the buffer into a file under provided path.

Parameters:: path – String path where to write the buffer.

save_state(path: str)

Saves agent’s state into a file.

Parameters:: path – String path where to write the state.

set_buffer(buffer_state: BufferState) → None

set_network(network_state: NetworkState) → None

state_dict() → dict[str, dict]

Describes agent’s networks.

Returns:: (dict) Provides actors and critics states.
Return type:: state

step(exp: Experience) → None

Lets the agent take a step.

On some steps the agent will initiate a learning step. This depends on the update_freq value.

Parameters:

obs (ObservationType) – Observation.
action (int) – Discrete action associated with observation.
reward (float) – Reward obtained for taking action at state.
next_obs (ObservationType) – Observation in the state that the action led to.
done (bool) – Whether in terminal (end of episode) state.
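
How warm_up, update_freq and number_updates might interact when deciding whether a step triggers learning can be sketched as below. This is an assumption about the scheduling logic, written as a stand-alone counter; StepScheduler and updates_due are hypothetical names.

```python
class StepScheduler:
    """Counts environment steps and reports how many learning steps are due."""

    def __init__(self, warm_up=0, update_freq=1, number_updates=1):
        self.warm_up = warm_up
        self.update_freq = update_freq
        self.number_updates = number_updates
        self.iteration = 0

    def updates_due(self):
        self.iteration += 1
        if self.iteration < self.warm_up:
            return 0  # still only collecting samples
        if self.iteration % self.update_freq != 0:
            return 0  # not a learning step
        return self.number_updates


scheduler = StepScheduler(warm_up=3, update_freq=2, number_updates=1)
schedule = [scheduler.updates_due() for _ in range(6)]
```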

Rainbow

class aitraineree.agents.rainbow.RainbowAgent(obs_space: DataSpace, action_space: DataSpace, state_transform: Callable | None = None, reward_transform: Callable | None = None, **kwargs)

Rainbow agent as described in [1].

Rainbow is a DQN agent with a set of improvements that were suggested before 2017. As the authors note, the set is not exhaustive, but the changes touch relatively separate areas, so combining them makes sense. These improvements are:

* Priority Experience Replay
* Multi-step
* Double Q net
* Dueling nets
* NoisyNet
* CategoricalNet for Q estimate

Consider this class as a particular version of the DQN agent.

[1] “Rainbow: Combining Improvements in Deep Reinforcement Learning” by Hessel et al. (DeepMind team): https://arxiv.org/abs/1710.02298
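
Of the improvements listed above, dueling nets combine a state value and per-action advantages into Q-values. A minimal sketch of the commonly used mean-subtracted aggregation follows; it works on plain floats, and dueling_q_values is a hypothetical helper, not the network used by this agent.

```python
def dueling_q_values(state_value, advantages):
    """Q(s, a) = V(s) + A(s, a) - mean over actions of A(s, a)."""
    mean_advantage = sum(advantages) / len(advantages)
    return [state_value + adv - mean_advantage for adv in advantages]


q_values = dueling_q_values(2.0, [1.0, 0.0, -1.0])
```

Subtracting the mean advantage keeps the split between value and advantage identifiable.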

__init__(obs_space: DataSpace, action_space: DataSpace, state_transform: Callable | None = None, reward_transform: Callable | None = None, **kwargs)

A wrapper over the DQN, thus the majority of the logic is in the DQNAgent. Special treatment is required because the Rainbow agent uses categorical nets, which operate on probability distributions. Each action is taken as the estimate from such distributions.

Parameters:

obs_space (DataSpace) – Dataspace describing the input.
action_space (DataSpace) – Dataspace describing the output.
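state_transform (optional func) – Function to transform (encode) state before used by the network.
reward_transform (optional func) – Function to transform reward before use.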

Keyword Arguments:

pre_network_fn (function that takes input_shape and returns network) – Used to preprocess state before it is used in the value- and advantage-function in the dueling nets.
hidden_layers (tuple of ints) – Shape of the hidden layers in fully connected network. Default: (100, 100).
lr (float) – Learning rate value. Default: 1e-3.
gamma (float) – Discount factor. Default: 0.99.
tau (float) – Soft-copy factor. Default: 0.002.
update_freq (int) – Number of steps between each learning step. Default: 1.
batch_size (int) – Number of samples to use at each learning step. Default: 80.
buffer_size (int) – Number of most recent samples to keep in memory for learning. Default: 1e5.
warm_up (int) – Number of samples to observe before starting any learning step. Default: 0.
number_updates (int) – How many times to use learning step in the learning phase. Default: 1.
max_grad_norm (float) – Maximum norm of the gradient used in learning. Default: 10.
using_double_q (bool) – Whether to use Double Q Learning network. Default: True.
n_steps (int) – Number of lookahead steps when estimating reward. See NStepBuffer. Default: 3.
v_min (float) – Lower bound for distributional value V. Default: -10.
v_max (float) – Upper bound for distributional value V. Default: 10.
num_atoms (int) – Number of atoms (discrete states) in the value V distribution. Default: 21.
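
The v_min, v_max and num_atoms arguments define a fixed support for the value distribution. The sketch below builds an evenly spaced support and reads a scalar estimate off a categorical distribution as its expectation. It is an illustration with hypothetical helpers (make_atoms, expected_value), not the agent's network code.

```python
def make_atoms(v_min, v_max, num_atoms):
    """Evenly spaced support points between v_min and v_max (inclusive)."""
    step = (v_max - v_min) / (num_atoms - 1)
    return [v_min + i * step for i in range(num_atoms)]


def expected_value(probs, atoms):
    """Scalar estimate: probability-weighted sum of the atoms."""
    return sum(p * z for p, z in zip(probs, atoms))


# Documented defaults: v_min=-10, v_max=10, num_atoms=21.
atoms = make_atoms(-10.0, 10.0, 21)
uniform = [1.0 / 21] * 21
q_estimate = expected_value(uniform, atoms)
```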

act(experience: Experience, eps: float = 0.0) → Experience

Returns actions for given state as per current policy.

Parameters:

experience – Current available state from the environment.
eps – Epsilon value in the epsilon-greedy policy.

static from_state(state: AgentState) → AgentBase

get_network_state() → NetworkState

get_state() → AgentState: Provides agent’s internal state.

learn(experiences: dict[str, list]) → None

Parameters:: experiences – Contains all experiences for the agent. Typically sampled from the memory buffer. Five keys are expected, i.e. state, action, reward, next_state, done. Each key contains an array and all arrays must have the same length.

load_buffer(path: str) → None

Loads data into the buffer from provided file path.

Parameters:: path – String path indicating where the buffer is stored.

load_state(path: str) → None

Loads state from a file under provided path.

Parameters:: path – String path indicating where the state is stored.

log_metrics(data_logger: DataLogger, step: int, full_log: bool = False)

property loss

model: str = 'Rainbow'

save_buffer(path: str) → None

Saves data from the buffer into a file under provided path.

Parameters:: path – String path where to write the buffer.

save_state(path: str) → None

Saves agent’s state into a file.

Parameters:: path – String path where to write the state.

set_buffer(buffer_state: BufferState) → None

set_network(network_state: NetworkState) → None

state_dict() → dict[str, dict]

Returns agent’s state dictionary.

Returns:: State dictionary for internal networks.

step(experience: Experience) → None

Lets the agent take a step.

On some steps the agent will initiate a learning step. This depends on the update_freq value.

Parameters:

obs (ObservationType) – Observation.
action (int) – Discrete action associated with observation.
reward (float) – Reward obtained for taking action at state.
next_obs (ObservationType) – Observation in the state that the action led to.
done (bool) – Whether in terminal (end of episode) state.

PPO

class aitraineree.agents.ppo.PPOAgent(obs_space: DataSpace, action_space: DataSpace, **kwargs)

Proximal Policy Optimization (PPO) [1] is an online policy gradient method that could be considered as an implementation-wise simplified version of the Trust Region Policy Optimization (TRPO).

[1] “Proximal Policy Optimization Algorithms” (2017) by J. Schulman, F. Wolski, P. Dhariwal, A. Radford, O. Klimov. https://arxiv.org/abs/1707.06347
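
The simplification relative to TRPO in [1] is usually realised through a clipped surrogate objective, which the ppo_ratio_clip keyword argument suggests is used here as well. The per-sample sketch below assumes that standard formulation; clipped_surrogate is a hypothetical helper, and the clip default mirrors the documented 0.25.

```python
def clipped_surrogate(ratio, advantage, clip=0.25):
    """Per-sample PPO objective: min(r * A, clip(r, 1 - clip, 1 + clip) * A)."""
    clipped_ratio = max(1.0 - clip, min(1.0 + clip, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)


# A large policy ratio with positive advantage is capped at (1 + clip) * A.
capped = clipped_surrogate(ratio=2.0, advantage=1.0)
```

The policy loss would then be the negative mean of this quantity over a batch.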

__init__(obs_space: DataSpace, action_space: DataSpace, **kwargs)

Parameters:

obs_space (DataSpace) – Dataspace describing the input.
action_space (DataSpace) – Dataspace describing the output.

Keyword Arguments:

hidden_layers (tuple of ints) – Shape of the hidden layers in fully connected network. Default: (128, 128).
is_discrete (bool) – Whether to return a discrete action. Default: False.
using_kl_div (bool) – Whether to use KL divergence in loss. Default: False.
using_gae (bool) – Whether to use Generalized Advantage Estimation (GAE). Default: True.
gae_lambda (float) – Value of lambda in GAE. Default: 0.96.
actor_lr (float) – Learning rate for the actor (policy). Default: 0.0003.
critic_lr (float) – Learning rate for the critic (value function). Default: 0.001.
gamma (float) – Discount value. Default: 0.99.
ppo_ratio_clip (float) – Policy ratio clipping value. Default: 0.25.
num_epochs (int) – Number of times to learn from samples. Default: 1.
rollout_length (int) – Number of actions to take before update. Default: 48.
batch_size (int) – Number of samples used in learning. Default: rollout_length.
actor_number_updates (int) – Number of times policy losses are propagated. Default: 10.
critic_number_updates (int) – Number of times value losses are propagated. Default: 10.
entropy_weight (float) – Weight of the entropy term in the loss. Default: 0.005.
max_grad_norm_actor (float) – Maximum norm value for actor gradient.
max_grad_norm_critic (float) – Maximum norm value for critic gradient. Default: 100.
using_curiosity (bool) – Whether to use Intrinsic Curiosity Module. Default: False.
curiosity_feature_dim (int) – Feature embedding dimension for ICM. Default: 64.
curiosity_hidden_layers (tuple of ints) – Hidden layers for ICM sub-networks. Default: (128,).
curiosity_beta (float) – Weight for forward vs inverse loss in ICM. Default: 0.2.
curiosity_eta (float) – Scaling factor for intrinsic reward. Default: 0.01.
curiosity_lr (float) – Learning rate for the ICM. Default: 0.001.

act(experience: Experience, noise: float = 0.0) → Experience

Acting on the observations. Returns action.

Parameters:

experience (Experience) – current state
noise (float) – epsilon, for epsilon-greedy action selection

Returns:

Experience updated with action taken.

compute_policy_loss(samples)

compute_value_loss(samples)

static from_state(state: AgentState) → AgentBase

get_network_state() → NetworkState

get_state() → AgentState: Returns agent’s internal state

learn(samples)

load_state(path: str): Reads the whole agent state from a local file.

log_metrics(data_logger: DataLogger, step: int, full_log: bool = False)

logger = <Logger PPO (WARNING)>

property loss: dict[str, float]

model: str = 'PPO'

save_state(path: str): Saves the whole agent state into a local file.

set_buffer(buffer_state: BufferState) → None

set_network(network_state: NetworkState) → None

step(experience: Experience) → None

Step agent’s internal learning mechanisms.

Updates the buffer with the current experience and increments the learning counter. When the learning counter hits rollout_length, a learning session commences. The learning counter isn’t updated when the agent is in test mode.

train_agent(): Main loop that initiates the training.

DDPG

class aitraineree.agents.ddpg.DDPGAgent(obs_space: DataSpace, action_space: DataSpace, noise_scale: float = 0.2, noise_sigma: float = 1.0, **kwargs)

Deep Deterministic Policy Gradients (DDPG).

Instead of popular Ornstein-Uhlenbeck (OU) process for noise this agent uses Gaussian noise.

This agent is intended for continuous tasks.
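
Adding Gaussian exploration noise to a continuous action can be sketched as follows. The sketch treats noise_sigma as the standard deviation passed to the sampler and clamps the result to hypothetical bounds low and high; both choices are assumptions, and noisy_action is not a function of the agent.

```python
import random


def noisy_action(action, noise_scale=0.2, noise_sigma=1.0, low=-1.0, high=1.0, rng=random):
    """Add scaled zero-mean Gaussian noise to each action component, then clamp."""
    noisy = [a + noise_scale * rng.gauss(0.0, noise_sigma) for a in action]
    return [min(high, max(low, a)) for a in noisy]


# A seeded generator keeps the example reproducible.
explored = noisy_action([0.5, -0.5], rng=random.Random(0))
```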

__init__(obs_space: DataSpace, action_space: DataSpace, noise_scale: float = 0.2, noise_sigma: float = 1.0, **kwargs)

Parameters:

obs_space (DataSpace) – Dataspace describing the input.
action_space (DataSpace) – Dataspace describing the output.
noise_scale (float) – Added noise amplitude. Default: 0.2.
noise_sigma (float) – Added noise variance. Default: 1.0.

Keyword Arguments:

hidden_layers (tuple of ints) – Shape of the hidden layers in fully connected network. Default: (64, 64).
gamma (float) – Discount value. Default: 0.99.
tau (float) – Soft-copy factor. Default: 0.002.
actor_lr (float) – Learning rate for the actor (policy). Default: 0.0003.
critic_lr (float) – Learning rate for the critic (value function). Default: 0.0003.
max_grad_norm_actor (float) – Maximum norm value for actor gradient.
max_grad_norm_critic (float) – Maximum norm value for critic gradient. Default: 10.
batch_size (int) – Number of samples used in learning. Default: 64.
buffer_size (int) – Maximum number of samples to store. Default: 1e6.
warm_up (int) – Number of samples to observe before starting any learning step. Default: 0.
update_freq (int) – Number of steps between each learning step. Default: 1.
number_updates (int) – How many times to use learning step in the learning phase. Default: 1.
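
The max_grad_norm_actor and max_grad_norm_critic arguments bound the gradient norm. A minimal sketch of clipping by global L2 norm on a flat list of gradient values is shown below; it illustrates the idea only, and clip_by_norm is a hypothetical helper rather than the routine the agent calls.

```python
import math


def clip_by_norm(grads, max_norm):
    """Rescale gradients so that their L2 norm does not exceed max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm <= max_norm:
        return list(grads)
    scale = max_norm / total_norm
    return [g * scale for g in grads]


# A gradient of norm 50 is scaled down to norm 10; the direction is kept.
clipped = clip_by_norm([30.0, 40.0], max_norm=10.0)
```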

act(experience: Experience, noise: float = 0.0) → Experience

Acting on the observations. Returns action.

Parameters:

obs (array_like) – current state
eps (optional float) – epsilon, for epsilon-greedy action selection. Default: 0.

Returns:

(list float) Action values.

Return type:

action

property action_max

property action_min

compute_policy_loss(states) → Tensor

Compute Policy loss based on provided states.

Loss = Mean(-Q(s, _a)), where _a is the actor’s estimate based on the state, _a = Actor(s).
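
The loss formula above can be made concrete with toy stand-ins for the actor and critic. The functions toy_actor and toy_critic are hypothetical scalar examples, not the agent's networks.

```python
def policy_loss(states, actor, critic):
    """Mean of -Q(s, actor(s)) over the batch of states."""
    q_values = [critic(s, actor(s)) for s in states]
    return -sum(q_values) / len(q_values)


def toy_actor(state):
    # Hypothetical deterministic policy.
    return 2.0 * state


def toy_critic(state, action):
    # Hypothetical Q-function.
    return state + action


loss = policy_loss([1.0, 2.0], toy_actor, toy_critic)
```

Minimising this loss pushes the actor towards actions that the critic rates more highly.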

compute_value_loss(states, actions, next_states, rewards, dones)

static from_state(state: AgentState) → AgentBase

get_network_state() → NetworkState

get_state() → AgentState: Returns agent’s internal state

learn(experiences) → None: Update critics and actors

load_state(*, path: str | None = None, agent_state: dict | None = None): Reads the whole agent state from a local file.

log_metrics(data_logger: DataLogger, step: int, full_log: bool = False)

property loss: dict[str, float]

model: str = 'DDPG'

reset_agent() → None

save_state(path: str) → None: Saves the whole agent state into a local file.

set_buffer(buffer_state: BufferState) → None

set_network(network_state: NetworkState) → None

state_dict() → dict[str, dict]

Describes agent’s networks.

Returns:: (dict) Provides actors and critics states.
Return type:: state

step(experience: Experience) → None

D3PG

class aitraineree.agents.d3pg.D3PGAgent(obs_space: DataSpace, action_space: DataSpace, **kwargs)

Distributional DDPG (D3PG) [1].

It’s closely related to, and sits in between, D4PG and DDPG. Compared to D4PG it lacks multi-actor support. It extends the DDPG agent with:

1. Distributional critic update.
2. N-step returns.
3. Prioritization of the experience replay (PER).

[1] “Distributed Distributional Deterministic Policy Gradients”: (2018, ICLR) by G. Barth-Maron & M. Hoffman et al.

__init__(obs_space: DataSpace, action_space: DataSpace, **kwargs)

Parameters:

obs_space (DataSpace) – Dataspace describing the input.
action_space (DataSpace) – Dataspace describing the output.

Keyword Arguments:

hidden_layers (tuple of ints) – Shape of the hidden layers in fully connected network. Default: (128, 128).
gamma (float) – Discount value. Default: 0.99.
tau (float) – Soft-copy factor. Default: 0.02.
actor_lr (float) – Learning rate for the actor (policy). Default: 0.0003.
critic_lr (float) – Learning rate for the critic (value function). Default: 0.0003.
actor_hidden_layers (tuple of ints) – Shape of network for actor. Default: hidden_layers.
critic_hidden_layers (tuple of ints) – Shape of network for critic. Default: hidden_layers.
max_grad_norm_actor (float) – Maximum norm value for actor gradient.
max_grad_norm_critic (float) – Maximum norm value for critic gradient. Default: 100.
num_atoms (int) – Number of discrete values for the value distribution. Default: 51.
v_min (float) – Value distribution minimum (left most) value. Default: -10.
v_max (float) – Value distribution maximum (right most) value. Default: 10.
n_steps (int) – Number of steps (N-steps) for the TD. Default: 3.
batch_size (int) – Number of samples used in learning. Default: 64.
buffer_size (int) – Maximum number of samples to store. Default: 1e6.
warm_up (int) – Number of samples to observe before starting any learning step. Default: 0.
update_freq (int) – Number of steps between each learning step. Default: 1.
action_scale (float) – Multiplier value for action. Default: 1.
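
A distributional critic update over num_atoms values between v_min and v_max typically needs a projection of the Bellman-shifted distribution back onto the fixed atoms. The sketch below shows the categorical projection for a single transition, assuming the C51-style scheme; whether this agent performs exactly this step is an assumption, and project_distribution is a hypothetical helper.

```python
import math


def project_distribution(probs, reward, done, gamma, v_min=-10.0, v_max=10.0):
    """Project the distribution of reward + gamma * z onto the fixed atoms."""
    num_atoms = len(probs)
    delta = (v_max - v_min) / (num_atoms - 1)
    projected = [0.0] * num_atoms
    for i, prob in enumerate(probs):
        atom = v_min + i * delta
        # Bellman-shifted atom, clamped to the supported value range.
        shifted = reward if done else reward + gamma * atom
        shifted = min(v_max, max(v_min, shifted))
        # Fractional index of the shifted atom on the fixed grid.
        position = min(max((shifted - v_min) / delta, 0.0), num_atoms - 1)
        lower, upper = math.floor(position), math.ceil(position)
        if lower == upper:
            projected[lower] += prob
        else:
            # Split the mass between the two neighbouring atoms.
            projected[lower] += prob * (upper - position)
            projected[upper] += prob * (position - lower)
    return projected


# 51 atoms as in the documented default; a terminal transition with reward 0.5.
uniform = [1.0 / 51] * 51
terminal = project_distribution(uniform, reward=0.5, done=True, gamma=0.99)
```

The projected distribution keeps the total probability mass, so it can serve as the target in a cross-entropy loss against the critic's output.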

act(experience: Experience, epsilon: float = 0.0) → Experience

Returns actions for given observation as per current policy.

Parameters:

obs – Current available observation from the environment.
epsilon – Epsilon value in the epsilon-greedy policy.

property action_max

property action_min

compute_policy_loss(states)

compute_value_loss(states, actions, next_states, rewards, dones, indices=None)

get_state(): Returns agent’s internal state

learn(experiences): Update critics and actors

load_state(path: str): Reads the whole agent state from a local file.

log_metrics(data_logger: DataLogger, step: int, full_log: bool = False)

property loss: dict[str, float]

model: str = 'D3PG'

save_state(path: str): Saves the whole agent state into a local file.

state_dict() → dict[str, dict]

Describes agent’s networks.

Returns:: (dict) Provides actors and critics states.
Return type:: state

step(experience: Experience) → None

D4PG

class aitraineree.agents.d4pg.D4PGAgent(obs_space: DataSpace, action_space: DataSpace, **kwargs)

Distributed Distributional DDPG (D4PG) [1].

Extends the DDPG agent with:

1. Distributional critic update.
2. The use of distributed parallel actors.
3. N-step returns.
4. Prioritization of the experience replay (PER).
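
Prioritization of the experience replay is commonly done by sampling transitions in proportion to a power of their priority and correcting the resulting bias with importance weights. The sketch below assumes that proportional scheme; the exponents alpha and beta and both helper functions are hypothetical and are not parameters documented for this agent.

```python
def sampling_probabilities(priorities, alpha=0.6):
    """P(i) = p_i^alpha / sum_k p_k^alpha."""
    scaled = [p ** alpha for p in priorities]
    total = sum(scaled)
    return [s / total for s in scaled]


def importance_weights(probs, beta=0.4):
    """w_i = (N * P(i))^(-beta), normalised so that the largest weight is 1."""
    n = len(probs)
    weights = [(n * p) ** (-beta) for p in probs]
    max_weight = max(weights)
    return [w / max_weight for w in weights]


probs = sampling_probabilities([1.0, 1.0, 2.0], alpha=1.0)
weights = importance_weights(probs)
```

Samples with higher priority are drawn more often and therefore receive smaller importance weights.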

[1] “Distributed Distributional Deterministic Policy Gradients”: (2018, ICLR) by G. Barth-Maron & M. Hoffman et al.

__init__(obs_space: DataSpace, action_space: DataSpace, **kwargs)

Parameters:

obs_space (DataSpace) – Dataspace describing the input.
action_space (DataSpace) – Dataspace describing the output.

Keyword Arguments:

hidden_layers (tuple of ints) – Shape of the hidden layers in fully connected network. Default: (128, 128).
gamma (float) – Discount value. Default: 0.99.
tau (float) – Soft-copy factor. Default: 0.02.
actor_lr (float) – Learning rate for the actor (policy). Default: 0.0003.
critic_lr (float) – Learning rate for the critic (value function). Default: 0.0003.
actor_hidden_layers (tuple of ints) – Shape of network for actor. Default: hidden_layers.
critic_hidden_layers (tuple of ints) – Shape of network for critic. Default: hidden_layers.
max_grad_norm_actor (float) – Maximum norm value for actor gradient.
max_grad_norm_critic (float) – Maximum norm value for critic gradient. Default: 100.
num_atoms (int) – Number of discrete values for the value distribution. Default: 51.
v_min (float) – Value distribution minimum (left most) value. Default: -10.
v_max (float) – Value distribution maximum (right most) value. Default: 10.
n_steps (int) – Number of steps (N-steps) for the TD. Default: 3.
batch_size (int) – Number of samples used in learning. Default: 64.
buffer_size (int) – Maximum number of samples to store. Default: 1e6.
warm_up (int) – Number of samples to observe before starting any learning step. Default: 0.
update_freq (int) – Number of steps between each learning step. Default: 1.
number_updates (int) – How many times to use learning step in the learning phase. Default: 1.
num_workers (int) – Number of workers that will assume this agent. Default: 1.

act(experience: Experience, epsilon: float = 0.0) → Experience

Returns actions for given observation as per current policy.

Parameters:

obs – Current available observation from the environment.
epsilon – Epsilon value in the epsilon-greedy policy.

property action_max

property action_min

compute_policy_loss(states)

compute_value_loss(states, actions, next_states, rewards, dones, indices=None)

get_state(): Returns agent’s internal state

learn(experiences): Update critics and actors

load_state(path: str): Reads the whole agent state from a local file.

log_metrics(data_logger: DataLogger, step, full_log=False)

property loss: dict[str, float]

model: str = 'D4PG'

save_state(path: str): Saves the whole agent state into a local file.

state_dict() → dict[str, dict]

Describes agent’s networks.

Returns:: (dict) Provides actors and critics states.
Return type:: state

step(experience: Experience)

SAC

class aitraineree.agents.sac.SACAgent(obs_space: DataSpace, action_space: DataSpace, **kwargs)

Soft Actor-Critic.

Uses stochastic policy and dual value network (two critics).

Based on “Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor” by Haarnoja et al. (2018) (http://arxiv.org/abs/1801.01290).

__init__(obs_space: DataSpace, action_space: DataSpace, **kwargs)

Parameters:

obs_space (DataSpace) – Dataspace describing the input.
action_space (DataSpace) – Dataspace describing the output.

Keyword Arguments:

hidden_layers (tuple of ints) – Shape of the hidden layers in fully connected network. Default: (128, 128).
gamma (float) – Discount value. Default: 0.99.
tau (float) – Soft copy fraction. Default: 0.02.
batch_size (int) – Number of samples in a batch. Default: 64.
buffer_size (int) – Size of the prioritized experience replay buffer. Default: 1e6.
warm_up (int) – Number of samples that need to be observed before starting to learn. Default: 0.
update_freq (int) – Number of samples between policy updates. Default: 1.
number_updates (int) – Number of times of batch sampling/training per update_freq. Default: 1.
alpha (float) – Weight of log probs in value function. Default: 0.2.
alpha_lr (Optional float) – If not None, alpha becomes a trainable parameter with alpha_lr as its learning rate. Default: None.
action_scale (float) – Scale for returned action values. Default: 1.
max_grad_norm_alpha (float) – Gradient clipping for the alpha. Default: 1.
max_grad_norm_actor (float) – Gradient clipping for the actor. Default: 10.
max_grad_norm_critic (float) – Gradient clipping for the critic. Default: 10.
device – Device used for computation. Default: CUDA if available.
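
The role of alpha together with the two critics can be illustrated by the soft target used in the standard SAC formulation: the smaller of the two critic estimates, minus alpha times the log probability of the next action. The sketch below assumes that formulation for a single transition; soft_q_target is a hypothetical helper, not a method of the agent.

```python
def soft_q_target(reward, done, gamma, next_q1, next_q2, next_log_prob, alpha=0.2):
    """Entropy-regularised target: r + gamma * (min(Q1, Q2) - alpha * log pi)."""
    if done:
        # No bootstrapping from a terminal state.
        return reward
    soft_value = min(next_q1, next_q2) - alpha * next_log_prob
    return reward + gamma * soft_value


target = soft_q_target(
    reward=1.0, done=False, gamma=0.99,
    next_q1=2.0, next_q2=3.0, next_log_prob=-1.0,
)
```

A larger alpha gives more weight to the entropy bonus, which favours more exploratory policies.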

act(experience: Experience, epsilon: float = 0.0, deterministic: bool = False) → Experience

Acting on the observations. Returns action.

Parameters:

obs (array_like) – current state
eps (float) – epsilon, for epsilon-greedy action selection
deterministic (optional bool) – Whether to use deterministic policy. Only has effect in train mode. In test mode all actions are deterministic.

Returns:

(list float) Action values.

Return type:

action

property action_max

property action_min

property alpha

compute_policy_loss(states)

compute_value_loss(states, actions, rewards, next_states, dones) → tuple[Tensor, Tensor]

static from_state(state: AgentState) → AgentBase

get_network_state() → NetworkState

get_state() → AgentState: Returns agent’s internal state

learn(samples): update the critics and actors of all the agents

load_state(path: str): Reads the whole agent state from a local file.

log_metrics(data_logger: DataLogger, step: int, full_log: bool = False)

property loss: dict[str, float]

model: str = 'SAC'

reset_agent() → None

save_state(path: str): Saves the whole agent state into a local file.

set_buffer(buffer_state: BufferState) → None

set_network(network_state: NetworkState) → None

state_dict() → dict[str, dict]: Returns network’s weights in order: Actor, TargetActor, Critic, TargetCritic

step(experience: Experience) → None