Agents

DQN

class ai_traineree.agents.dqn.DQNAgent(obs_space: ai_traineree.types.dataspace.DataSpace, action_space: ai_traineree.types.dataspace.DataSpace, network_fn: Optional[Callable[[], ai_traineree.networks.NetworkType]] = None, network_class: Optional[Type[ai_traineree.networks.NetworkTypeClass]] = None, state_transform: Optional[Callable] = None, reward_transform: Optional[Callable] = None, **kwargs)

Deep Q-Learning Network (DQN).

The agent is not a vanilla DQN, although it can be configured as such. The default configuration uses dueling networks (online and target) and a prioritized experience replay buffer. Learning is additionally smoothed by slowly copying weights to the target network (controlled by the tau parameter). Although N-step returns are implemented, the default is a 1-step reward.

There is also a specific implementation of the DQN called Rainbow, which differs from this implementation by working on a discrete-space projection of the Q(s, a) function.

__init__(obs_space: ai_traineree.types.dataspace.DataSpace, action_space: ai_traineree.types.dataspace.DataSpace, network_fn: Optional[Callable[[], ai_traineree.networks.NetworkType]] = None, network_class: Optional[Type[ai_traineree.networks.NetworkTypeClass]] = None, state_transform: Optional[Callable] = None, reward_transform: Optional[Callable] = None, **kwargs)

Initializes the DQN agent.

Parameters
  • obs_space (DataSpace) – Dataspace describing the input.

  • action_space (DataSpace) – Dataspace describing the output.

  • network_fn (optional func) – Function that instantiates the network used by the agent.

  • network_class (optional cls) – Network class that is instantiated with internal parameters to create the agent's network.

  • state_transform (optional func) – Function to transform (encode) the state before it is used by the network.

  • reward_transform (optional func) – Function to transform the reward before use.

Keyword Arguments
  • hidden_layers (tuple of ints) – Shape of the hidden layers in fully connected network. Default: (64, 64).

  • lr (float) – Learning rate value. Default: 3e-4.

  • gamma (float) – Discount factor. Default: 0.99.

  • tau (float) – Soft-copy factor. Default: 0.002.

  • update_freq (int) – Number of steps between each learning step. Default 1.

  • batch_size (int) – Number of samples to use at each learning step. Default: 80.

  • buffer_size (int) – Number of most recent samples to keep in memory for learning. Default: 1e5.

  • warm_up (int) – Number of samples to observe before starting any learning step. Default: 0.

  • number_updates (int) – How many times to use learning step in the learning phase. Default: 1.

  • max_grad_norm (float) – Maximum norm of the gradient used in learning. Default: 10.

  • using_double_q (bool) – Whether to use Double Q Learning network. Default: True.

  • n_steps (int) – Number of lookahead steps when estimating reward. See NStepBuffer. Default: 3.
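
As a rough illustration of the constructor and keyword arguments above, a minimal construction sketch follows. The DataSpace fields used here (dtype, shape, low, high) are an assumption about its constructor; adjust them to the actual DataSpace definition in ai_traineree.types.dataspace.

    from ai_traineree.agents.dqn import DQNAgent
    from ai_traineree.types.dataspace import DataSpace

    # Assumed DataSpace fields; verify against ai_traineree.types.dataspace.
    obs_space = DataSpace(dtype="float", shape=(4,), low=-1.0, high=1.0)
    action_space = DataSpace(dtype="int", shape=(1,), low=0, high=1)  # two discrete actions (illustrative)

    agent = DQNAgent(
        obs_space,
        action_space,
        hidden_layers=(64, 64),  # defaults shown explicitly
        lr=3e-4,
        gamma=0.99,
        tau=0.002,
        batch_size=80,
        buffer_size=int(1e5),
    )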

act(experience: ai_traineree.types.experience.Experience, eps: float = 0.0) ai_traineree.types.experience.Experience

Returns an action for the given observation as per the current policy.

Parameters
  • experience (Experience) – Experience containing the current observation.

  • eps (optional float) – epsilon, for epsilon-greedy action selection. Default 0.

Returns

Experience updated with the selected (categorical) action.

action_space: ai_traineree.types.dataspace.DataSpace
static from_state(state: ai_traineree.types.state.AgentState) ai_traineree.agents.AgentBase
get_network_state() ai_traineree.types.state.NetworkState
get_state() ai_traineree.types.state.AgentState

Provides agent’s internal state.
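
A sketch of the get_state / from_state round trip, assuming agent is an already constructed DQNAgent:

    # Capture the agent's internal state...
    state = agent.get_state()

    # ...and reconstruct an equivalent agent from it.
    restored_agent = DQNAgent.from_state(state)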

learn(experiences: Dict[str, list]) None

Updates agent’s networks based on provided experience.

Parameters

experiences – Experiences sampled from the experience buffer.

load_buffer(path: str) None

Loads data into the buffer from provided file path.

Parameters

path – String path indicating where the buffer is stored.

load_state(*, path: Optional[str] = None, state: Optional[ai_traineree.types.state.AgentState] = None) None

Loads state from a file under provided path.

Parameters

path – String path indicating where the state is stored.

log_metrics(data_logger: ai_traineree.loggers.data_logger.DataLogger, step: int, full_log: bool = False)

Uses the provided DataLogger to log the agent’s metrics.

Parameters
  • data_logger (DataLogger) – Instance of the SummaryView, e.g. torch.utils.tensorboard.SummaryWriter.

  • step (int) – Ordering value, e.g. episode number.

  • full_log (bool) – Whether to log all available information. Useful when logging with lower frequency.

property loss: Dict[str, float]
model: str = 'DQN'
obs_space: ai_traineree.types.dataspace.DataSpace
reset()

Resets data not associated with learning.

save_buffer(path: str) None

Saves data from the buffer into a file under provided path.

Parameters

path – String path where to write the buffer.
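
A short sketch pairing save_buffer with the load_buffer method documented above; the file path is illustrative and the on-disk format is whatever the library writes.

    # Persist the replay buffer and read it back later, e.g. to resume training.
    agent.save_buffer("dqn.buffer")
    agent.load_buffer("dqn.buffer")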

save_state(path: str)

Saves agent’s state into a file.

Parameters

path – String path where to write the state.
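
Similarly, save_state pairs with the load_state method documented above to persist and restore the agent; the path is illustrative.

    agent.save_state("dqn_agent.state")
    # Later, restore the state (path is keyword-only for DQNAgent.load_state).
    agent.load_state(path="dqn_agent.state")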

set_buffer(buffer_state: ai_traineree.types.state.BufferState) None
set_network(network_state: ai_traineree.types.state.NetworkState) None
state_dict() Dict[str, dict]

Describes agent’s networks.

Returns

(dict) Provides actors and critics states.

Return type

state

step(exp: ai_traineree.types.experience.Experience) None

Lets the agent take a step.

On some steps the agent will initiate a learning step, depending on the update_freq value.

Parameters

exp (Experience) – Experience describing a single transition, expected to contain:
  • obs (ObservationType) – Observation.

  • action (int) – Discrete action associated with the observation.

  • reward (float) – Reward obtained for taking the action at the state.

  • next_obs (ObservationType) – Observation of the state reached after taking the action.

  • done (bool) – Whether the state is terminal (end of episode).
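
A sketch of the act / step interaction loop described above, assuming a Gym-style env and that Experience accepts the transition fields listed (obs, action, reward, next_obs, done); both the environment API and the Experience field names are assumptions to adapt to your setup.

    from ai_traineree.types.experience import Experience

    eps = 0.1  # epsilon for the epsilon-greedy policy
    obs = env.reset()  # `env` is any Gym-style environment (assumption)
    done = False
    while not done:
        # Ask the agent for an action given the current observation.
        experience = agent.act(Experience(obs=obs), eps)
        action = experience.action  # assumed attribute holding the chosen action
        next_obs, reward, done, _ = env.step(action)
        # Feed the full transition back; learning triggers every `update_freq` steps.
        agent.step(Experience(obs=obs, action=action, reward=reward, next_obs=next_obs, done=done))
        obs = next_obs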

Rainbow

class ai_traineree.agents.rainbow.RainbowAgent(obs_space: ai_traineree.types.dataspace.DataSpace, action_space: ai_traineree.types.dataspace.DataSpace, state_transform: Optional[Callable] = None, reward_transform: Optional[Callable] = None, **kwargs)

Rainbow agent as described in [1].

Rainbow is a DQN agent with several improvements that were suggested before 2017. As the authors note, the set of improvements is not exhaustive, but the changes address relatively separate areas, so combining them makes sense. These improvements are:
  • Prioritized Experience Replay

  • Multi-step returns

  • Double Q-learning

  • Dueling networks

  • NoisyNet

  • Categorical network for the Q estimate

Consider this class as a particular version of the DQN agent.

[1] “Rainbow: Combining Improvements in Deep Reinforcement Learning” by Hessel et al. (DeepMind team)

https://arxiv.org/abs/1710.02298

__init__(obs_space: ai_traineree.types.dataspace.DataSpace, action_space: ai_traineree.types.dataspace.DataSpace, state_transform: Optional[Callable] = None, reward_transform: Optional[Callable] = None, **kwargs)

A wrapper over the DQN, thus the majority of the logic is in the DQNAgent. Special treatment is required because the Rainbow agent uses categorical networks, which operate on probability distributions. Each action is taken as an estimate from such a distribution.

Parameters
  • obs_space (DataSpace) – Dataspace describing the input.

  • action_space (DataSpace) – Dataspace describing the output.

  • state_transform (optional func) – Function to transform (encode) the state before it is used by the network.

  • reward_transform (optional func) – Function to transform the reward before use.

Keyword Arguments
  • pre_network_fn (function that takes input_shape and returns network) – Used to preprocess state before it is used in the value- and advantage-function in the dueling nets.

  • hidden_layers (tuple of ints) – Shape of the hidden layers in fully connected network. Default: (100, 100).

  • lr (float) – Learning rate value. Default: 1e-3.

  • gamma (float) – Discount factor. Default: 0.99.

  • tau (float) – Soft-copy factor. Default: 0.002.

  • update_freq (int) – Number of steps between each learning step. Default 1.

  • batch_size (int) – Number of samples to use at each learning step. Default: 80.

  • buffer_size (int) – Number of most recent samples to keep in memory for learning. Default: 1e5.

  • warm_up (int) – Number of samples to observe before starting any learning step. Default: 0.

  • number_updates (int) – How many times to use learning step in the learning phase. Default: 1.

  • max_grad_norm (float) – Maximum norm of the gradient used in learning. Default: 10.

  • using_double_q (bool) – Whether to use Double Q Learning network. Default: True.

  • n_steps (int) – Number of lookahead steps when estimating reward. See NStepBuffer. Default: 3.

  • v_min (float) – Lower bound for distributional value V. Default: -10.

  • v_max (float) – Upper bound for distributional value V. Default: 10.

  • num_atoms (int) – Number of atoms (discrete states) in the value V distribution. Default: 21.
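
A minimal construction sketch highlighting the distributional keyword arguments; obs_space and action_space are DataSpace instances, e.g. built as in the DQN sketch above.

    from ai_traineree.agents.rainbow import RainbowAgent

    agent = RainbowAgent(
        obs_space,
        action_space,
        v_min=-10, v_max=10, num_atoms=21,  # support of the categorical value distribution
        n_steps=3,                          # multi-step returns
        hidden_layers=(100, 100),
    )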

act(experience: ai_traineree.types.experience.Experience, eps: float = 0.0) ai_traineree.types.experience.Experience

Returns an action for the given state as per the current policy.

Parameters
  • experience (Experience) – Experience containing the current state from the environment.

  • eps (optional float) – Epsilon value for the epsilon-greedy policy. Default 0.

action_space: ai_traineree.types.dataspace.DataSpace
static from_state(state: ai_traineree.types.state.AgentState) ai_traineree.agents.AgentBase
get_network_state() ai_traineree.types.state.NetworkState
get_state() ai_traineree.types.state.AgentState

Provides agent’s internal state.

learn(experiences: Dict[str, List]) None
Parameters

experiences – Contains all experiences for the agent, typically sampled from the memory buffer. Five keys are expected, i.e. state, action, reward, next_state, done. Each key maps to an array, and all arrays must have the same length.

load_buffer(path: str) None

Loads data into the buffer from provided file path.

Parameters

path – String path indicating where the buffer is stored.

load_state(path: str) None

Loads state from a file under provided path.

Parameters

path – String path indicating where the state is stored.

log_metrics(data_logger: ai_traineree.loggers.data_logger.DataLogger, step: int, full_log: bool = False)
property loss
model: str = 'Rainbow'
obs_space: ai_traineree.types.dataspace.DataSpace
save_buffer(path: str) None

Saves data from the buffer into a file under provided path.

Parameters

path – String path where to write the buffer.

save_state(path: str) None

Saves agent’s state into a file.

Parameters

path – String path where to write the state.

set_buffer(buffer_state: ai_traineree.types.state.BufferState) None
set_network(network_state: ai_traineree.types.state.NetworkState) None
state_dict() Dict[str, dict]

Returns agent’s state dictionary.

Returns

State dictionary for internal networks.

step(experience: ai_traineree.types.experience.Experience) None

Lets the agent take a step.

On some steps the agent will initiate a learning step, depending on the update_freq value.

Parameters

experience (Experience) – Experience describing a single transition, expected to contain:
  • obs (ObservationType) – Observation.

  • action (int) – Discrete action associated with the observation.

  • reward (float) – Reward obtained for taking the action at the state.

  • next_obs (ObservationType) – Observation of the state reached after taking the action.

  • done (bool) – Whether the state is terminal (end of episode).

PPO

class ai_traineree.agents.ppo.PPOAgent(obs_space: ai_traineree.types.dataspace.DataSpace, action_space: ai_traineree.types.dataspace.DataSpace, **kwargs)

Proximal Policy Optimization (PPO) [1] is an online policy gradient method that can be considered an implementation-wise simplified version of Trust Region Policy Optimization (TRPO).

[1] “Proximal Policy Optimization Algorithms” (2017) by J. Schulman, F. Wolski, P. Dhariwal, A. Radford, O. Klimov. https://arxiv.org/abs/1707.06347

__init__(obs_space: ai_traineree.types.dataspace.DataSpace, action_space: ai_traineree.types.dataspace.DataSpace, **kwargs)
Parameters
  • obs_space (DataSpace) – Dataspace describing the input.

  • action_space (DataSpace) – Dataspace describing the output.

Keyword Arguments
  • hidden_layers (tuple of ints) – Shape of the hidden layers in fully connected network. Default: (128, 128).

  • is_discrete (bool) – Whether to return discrete actions. Default: False.

  • using_kl_div (bool) – Whether to use KL divergence in loss. Default: False.

  • using_gae (bool) – Whether to use Generalized Advantage Estimation (GAE). Default: True.

  • gae_lambda (float) – Value of lambda in GAE. Default: 0.96.

  • actor_lr (float) – Learning rate for the actor (policy). Default: 0.0003.

  • critic_lr (float) – Learning rate for the critic (value function). Default: 0.001.

  • gamma (float) – Discount value. Default: 0.99.

  • ppo_ratio_clip (float) – Policy ratio clipping value. Default: 0.25.

  • num_epochs (int) – Number of times to learn from samples. Default: 1.

  • rollout_length (int) – Number of actions to take before update. Default: 48.

  • batch_size (int) – Number of samples used in learning. Default: rollout_length.

  • actor_number_updates (int) – Number of times policy losses are propagated. Default: 10.

  • critic_number_updates (int) – Number of times value losses are propagated. Default: 10.

  • entropy_weight (float) – Weight of the entropy term in the loss. Default: 0.005.

  • max_grad_norm_actor (float) – Maximum norm value for the actor gradient.

  • max_grad_norm_critic (float) – Maximum norm value for critic gradient. Default: 100.
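
A minimal construction sketch using the keyword arguments above; obs_space and action_space are DataSpace instances describing the environment.

    from ai_traineree.agents.ppo import PPOAgent

    agent = PPOAgent(
        obs_space,
        action_space,
        rollout_length=48,  # learn after this many collected steps
        num_epochs=1,
        actor_lr=3e-4,
        critic_lr=1e-3,
        using_gae=True,
        gae_lambda=0.96,
    )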

act(experience: ai_traineree.types.experience.Experience, noise: float = 0.0) ai_traineree.types.experience.Experience

Acting on the observations. Returns action.

Parameters
  • experience (Experience) – Experience containing the current state.

  • noise (optional float) – Amount of exploration noise to apply during action selection. Default 0.

Returns

Experience updated with action taken.

action_space: ai_traineree.types.dataspace.DataSpace
compute_policy_loss(samples)
compute_value_loss(samples)
static from_state(state: ai_traineree.types.state.AgentState) ai_traineree.agents.AgentBase
get_network_state() ai_traineree.types.state.NetworkState
get_state() ai_traineree.types.state.AgentState

Returns agent’s internal state

learn(samples)
load_state(path: str)

Reads the whole agent state from a local file.

log_metrics(data_logger: ai_traineree.loggers.data_logger.DataLogger, step: int, full_log: bool = False)
logger = <Logger PPO (WARNING)>
property loss: Dict[str, float]
model: str = 'PPO'
obs_space: ai_traineree.types.dataspace.DataSpace
save_state(path: str)

Saves the whole agent state into a local file.

set_buffer(buffer_state: ai_traineree.types.state.BufferState) None
set_network(network_state: ai_traineree.types.state.NetworkState) None
step(experience: ai_traineree.types.experience.Experience) None

Step agent’s internal learning mechanisms.

Updates the buffer with the current experience and increments the learning counter. When the learning counter hits rollout_length, a learning session commences. The learning counter isn’t updated when the agent is in test mode.

train_agent()

Main loop that initiates the training.

DDPG

class ai_traineree.agents.ddpg.DDPGAgent(obs_space: ai_traineree.types.dataspace.DataSpace, action_space: ai_traineree.types.dataspace.DataSpace, noise_scale: float = 0.2, noise_sigma: float = 1.0, **kwargs)

Deep Deterministic Policy Gradients (DDPG).

Instead of the popular Ornstein-Uhlenbeck (OU) process, this agent uses Gaussian noise.

This agent is intended for continuous tasks.

__init__(obs_space: ai_traineree.types.dataspace.DataSpace, action_space: ai_traineree.types.dataspace.DataSpace, noise_scale: float = 0.2, noise_sigma: float = 1.0, **kwargs)
Parameters
  • obs_space (DataSpace) – Dataspace describing the input.

  • action_space (DataSpace) – Dataspace describing the output.

  • noise_scale (float) – Added noise amplitude. Default: 0.2.

  • noise_sigma (float) – Added noise variance. Default: 1.0.

Keyword Arguments
  • hidden_layers (tuple of ints) – Shape of the hidden layers in fully connected network. Default: (64, 64).

  • gamma (float) – Discount value. Default: 0.99.

  • tau (float) – Soft-copy factor. Default: 0.002.

  • actor_lr (float) – Learning rate for the actor (policy). Default: 0.0003.

  • critic_lr (float) – Learning rate for the critic (value function). Default: 0.0003.

  • max_grad_norm_actor (float) – Maximum norm value for the actor gradient.

  • max_grad_norm_critic (float) – Maximum norm value for critic gradient. Default: 10.

  • batch_size (int) – Number of samples used in learning. Default: 64.

  • buffer_size (int) – Maximum number of samples to store. Default: 1e6.

  • warm_up (int) – Number of samples to observe before starting any learning step. Default: 0.

  • update_freq (int) – Number of steps between each learning step. Default 1.

  • number_updates (int) – How many times to use learning step in the learning phase. Default: 1.
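
A minimal construction sketch for a continuous-control task; obs_space and action_space are DataSpace instances, and the commented act call is illustrative since it needs a live observation wrapped in an Experience.

    from ai_traineree.agents.ddpg import DDPGAgent

    agent = DDPGAgent(
        obs_space,
        action_space,
        noise_scale=0.2,  # amplitude of the Gaussian exploration noise
        noise_sigma=1.0,
        actor_lr=3e-4,
        critic_lr=3e-4,
        batch_size=64,
    )

    # During rollout, exploration noise is passed through act, e.g.:
    # experience = agent.act(Experience(obs=obs), noise=0.1)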

act(experience: ai_traineree.types.experience.Experience, noise: float = 0.0) ai_traineree.types.experience.Experience

Acting on the observations. Returns action.

Parameters
  • experience (Experience) – Experience containing the current state.

  • noise (optional float) – Exploration noise amplitude. Default 0.

Returns

(list of floats) Action values.

Return type

action

property action_max
property action_min
action_space: ai_traineree.types.dataspace.DataSpace
compute_policy_loss(states) torch.Tensor

Compute Policy loss based on provided states.

Loss = mean(-Q(s, â)), where â = Actor(s) is the actor’s action estimate for state s.

compute_value_loss(states, actions, next_states, rewards, dones)
static from_state(state: ai_traineree.types.state.AgentState) ai_traineree.agents.AgentBase
get_network_state() ai_traineree.types.state.NetworkState
get_state() ai_traineree.types.state.AgentState

Returns agent’s internal state

learn(experiences) None

Update critics and actors

load_state(*, path: Optional[str] = None, agent_state: Optional[dict] = None)

Reads the whole agent state from a local file.

log_metrics(data_logger: ai_traineree.loggers.data_logger.DataLogger, step: int, full_log: bool = False)
property loss: Dict[str, float]
model: str = 'DDPG'
obs_space: ai_traineree.types.dataspace.DataSpace
reset_agent() None
save_state(path: str) None

Saves the whole agent state into a local file.

set_buffer(buffer_state: ai_traineree.types.state.BufferState) None
set_network(network_state: ai_traineree.types.state.NetworkState) None
state_dict() Dict[str, dict]

Describes agent’s networks.

Returns

(dict) Provides actors and critics states.

Return type

state

step(experience: ai_traineree.types.experience.Experience) None

D3PG

class ai_traineree.agents.d3pg.D3PGAgent(obs_space: ai_traineree.types.dataspace.DataSpace, action_space: ai_traineree.types.dataspace.DataSpace, **kwargs)

Distributional DDPG (D3PG) [1].

It’s closely related to, and sits in-between, D4PG and DDPG. Compared to D4PG it lacks support for multiple actors. It extends the DDPG agent with:
  1. Distributional critic update.

  2. N-step returns.

  3. Prioritization of the experience replay (PER).

[1] “Distributed Distributional Deterministic Policy Gradients” (2018, ICLR) by G. Barth-Maron & M. Hoffman et al.

__init__(obs_space: ai_traineree.types.dataspace.DataSpace, action_space: ai_traineree.types.dataspace.DataSpace, **kwargs)
Parameters
  • obs_space (DataSpace) – Dataspace describing the input.

  • action_space (DataSpace) – Dataspace describing the output.

Keyword Arguments
  • hidden_layers (tuple of ints) – Shape of the hidden layers in fully connected network. Default: (128, 128).

  • gamma (float) – Discount value. Default: 0.99.

  • tau (float) – Soft-copy factor. Default: 0.02.

  • actor_lr (float) – Learning rate for the actor (policy). Default: 0.0003.

  • critic_lr (float) – Learning rate for the critic (value function). Default: 0.0003.

  • actor_hidden_layers (tuple of ints) – Shape of the network for the actor. Default: hidden_layers.

  • critic_hidden_layers (tuple of ints) – Shape of the network for the critic. Default: hidden_layers.

  • max_grad_norm_actor (float) – Maximum norm value for the actor gradient.

  • max_grad_norm_critic (float) – Maximum norm value for critic gradient. Default: 100.

  • num_atoms (int) – Number of discrete values for the value distribution. Default: 51.

  • v_min (float) – Value distribution minimum (left most) value. Default: -10.

  • v_max (float) – Value distribution maximum (right most) value. Default: 10.

  • n_steps (int) – Number of steps (N-steps) for the TD. Default: 3.

  • batch_size (int) – Number of samples used in learning. Default: 64.

  • buffer_size (int) – Maximum number of samples to store. Default: 1e6.

  • warm_up (int) – Number of samples to observe before starting any learning step. Default: 0.

  • update_freq (int) – Number of steps between each learning step. Default 1.

  • action_scale (float) – Multiplier value for the action. Default: 1.
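
A minimal construction sketch emphasizing the distributional-critic settings; obs_space and action_space are DataSpace instances.

    from ai_traineree.agents.d3pg import D3PGAgent

    agent = D3PGAgent(
        obs_space,
        action_space,
        num_atoms=51, v_min=-10, v_max=10,  # support of the value distribution
        n_steps=3,                          # N-step TD returns
        batch_size=64,
    )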

act(experience: ai_traineree.types.experience.Experience, epsilon: float = 0.0) ai_traineree.types.experience.Experience

Returns an action for the given observation as per the current policy.

Parameters
  • experience (Experience) – Experience containing the current observation from the environment.

  • epsilon (optional float) – Epsilon value for the epsilon-greedy policy. Default 0.

property action_max
property action_min
action_space: ai_traineree.types.dataspace.DataSpace
compute_policy_loss(states)
compute_value_loss(states, actions, next_states, rewards, dones, indices=None)
get_state()

Returns agent’s internal state

learn(experiences)

Update critics and actors

load_state(path: str)

Reads the whole agent state from a local file.

log_metrics(data_logger: ai_traineree.loggers.data_logger.DataLogger, step: int, full_log: bool = False)
property loss: Dict[str, float]
model: str = 'D3PG'
obs_space: ai_traineree.types.dataspace.DataSpace
save_state(path: str)

Saves the whole agent state into a local file.

state_dict() Dict[str, dict]

Describes agent’s networks.

Returns

(dict) Provides actors and critics states.

Return type

state

step(experience: ai_traineree.types.experience.Experience) None

D4PG

class ai_traineree.agents.d4pg.D4PGAgent(obs_space: ai_traineree.types.dataspace.DataSpace, action_space: ai_traineree.types.dataspace.DataSpace, **kwargs)

Distributed Distributional DDPG (D4PG) [1].

Extends the DDPG agent with:
  1. Distributional critic update.

  2. The use of distributed parallel actors.

  3. N-step returns.

  4. Prioritization of the experience replay (PER).

[1] “Distributed Distributional Deterministic Policy Gradients” (2018, ICLR) by G. Barth-Maron & M. Hoffman et al.

__init__(obs_space: ai_traineree.types.dataspace.DataSpace, action_space: ai_traineree.types.dataspace.DataSpace, **kwargs)
Parameters
  • obs_space (DataSpace) – Dataspace describing the input.

  • action_space (DataSpace) – Dataspace describing the output.

Keyword Arguments
  • hidden_layers (tuple of ints) – Shape of the hidden layers in fully connected network. Default: (128, 128).

  • gamma (float) – Discount value. Default: 0.99.

  • tau (float) – Soft-copy factor. Default: 0.02.

  • actor_lr (float) – Learning rate for the actor (policy). Default: 0.0003.

  • critic_lr (float) – Learning rate for the critic (value function). Default: 0.0003.

  • actor_hidden_layers (tuple of ints) – Shape of the network for the actor. Default: hidden_layers.

  • critic_hidden_layers (tuple of ints) – Shape of the network for the critic. Default: hidden_layers.

  • max_grad_norm_actor (float) – Maximum norm value for the actor gradient.

  • max_grad_norm_critic (float) – Maximum norm value for critic gradient. Default: 100.

  • num_atoms (int) – Number of discrete values for the value distribution. Default: 51.

  • v_min (float) – Value distribution minimum (left most) value. Default: -10.

  • v_max (float) – Value distribution maximum (right most) value. Default: 10.

  • n_steps (int) – Number of steps (N-steps) for the TD. Default: 3.

  • batch_size (int) – Number of samples used in learning. Default: 64.

  • buffer_size (int) – Maximum number of samples to store. Default: 1e6.

  • warm_up (int) – Number of samples to observe before starting any learning step. Default: 0.

  • update_freq (int) – Number of steps between each learning step. Default 1.

  • number_updates (int) – How many times to use learning step in the learning phase. Default: 1.

  • num_workers (int) – Number of workers that will use this agent. Default: 1.
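
A minimal construction sketch; compared to D3PG the notable addition is num_workers. obs_space and action_space are DataSpace instances.

    from ai_traineree.agents.d4pg import D4PGAgent

    agent = D4PGAgent(
        obs_space,
        action_space,
        num_workers=2,                      # parallel actors using this agent
        num_atoms=51, v_min=-10, v_max=10,  # support of the value distribution
        n_steps=3,
    )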

act(experience: ai_traineree.types.experience.Experience, epsilon: float = 0.0) ai_traineree.types.experience.Experience

Returns an action for the given observation as per the current policy.

Parameters
  • experience (Experience) – Experience containing the current observation from the environment.

  • epsilon (optional float) – Epsilon value for the epsilon-greedy policy. Default 0.

property action_max
property action_min
action_space: ai_traineree.types.dataspace.DataSpace
compute_policy_loss(states)
compute_value_loss(states, actions, next_states, rewards, dones, indices=None)
get_state()

Returns agent’s internal state

learn(experiences)

Update critics and actors

load_state(path: str)

Reads the whole agent state from a local file.

log_metrics(data_logger: ai_traineree.loggers.data_logger.DataLogger, step, full_log=False)
property loss: Dict[str, float]
model: str = 'D4PG'
obs_space: ai_traineree.types.dataspace.DataSpace
save_state(path: str)

Saves the whole agent state into a local file.

state_dict() Dict[str, dict]

Describes agent’s networks.

Returns

(dict) Provides actors and critics states.

Return type

state

step(experience: ai_traineree.types.experience.Experience)

SAC

class ai_traineree.agents.sac.SACAgent(obs_space: ai_traineree.types.dataspace.DataSpace, action_space: ai_traineree.types.dataspace.DataSpace, **kwargs)

Soft Actor-Critic.

Uses a stochastic policy and dual value networks (two critics).

Based on “Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor” by Haarnoja et al. (2018) (http://arxiv.org/abs/1801.01290).

__init__(obs_space: ai_traineree.types.dataspace.DataSpace, action_space: ai_traineree.types.dataspace.DataSpace, **kwargs)
Parameters
  • obs_space (DataSpace) – Dataspace describing the input.

  • action_space (DataSpace) – Dataspace describing the output.

Keyword Arguments
  • hidden_layers (tuple of ints) – Shape of the hidden layers in fully connected network. Default: (128, 128).

  • gamma (float) – Discount value. Default: 0.99.

  • tau (float) – Soft copy fraction. Default: 0.02.

  • batch_size (int) – Number of samples in a batch. Default: 64.

  • buffer_size (int) – Size of the prioritized experience replay buffer. Default: 1e6.

  • warm_up (int) – Number of samples that need to be observed before starting to learn. Default: 0.

  • update_freq (int) – Number of samples between policy updates. Default: 1.

  • number_updates (int) – Number of times of batch sampling/training per update_freq. Default: 1.

  • alpha (float) – Weight of log probs in value function. Default: 0.2.

  • alpha_lr (Optional float) – If not None, it adds alpha as a training parameter with alpha_lr as its learning rate. Default: None.

  • action_scale (float) – Scale for returned action values. Default: 1.

  • max_grad_norm_alpha (float) – Gradient clipping for the alpha. Default: 1.

  • max_grad_norm_actor (float) – Gradient clipping for the actor. Default: 10.

  • max_grad_norm_critic (float) – Gradient clipping for the critic. Default: 10.

  • device – Device used for computation. Default: CUDA if available.
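
A minimal construction sketch; obs_space and action_space are DataSpace instances for a continuous task. Setting alpha_lr makes the entropy weight alpha a trained parameter, per the keyword description above.

    from ai_traineree.agents.sac import SACAgent

    agent = SACAgent(
        obs_space,
        action_space,
        alpha=0.2,      # entropy weight
        alpha_lr=3e-4,  # if set, alpha becomes a trained parameter
        batch_size=64,
        warm_up=1000,
    )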

act(experience: ai_traineree.types.experience.Experience, epsilon: float = 0.0, deterministic: bool = False) ai_traineree.types.experience.Experience

Acting on the observations. Returns action.

Parameters
  • experience (Experience) – Experience containing the current state.

  • epsilon (optional float) – Epsilon, for epsilon-greedy action selection. Default 0.

  • deterministic (optional bool) – Whether to use deterministic policy. Only has effect in train mode. In test mode all actions are deterministic.

Returns

(list of floats) Action values.

Return type

action

property action_max
property action_min
action_space: ai_traineree.types.dataspace.DataSpace
property alpha
compute_policy_loss(states)
compute_value_loss(states, actions, rewards, next_states, dones) Tuple[torch.Tensor, torch.Tensor]
static from_state(state: ai_traineree.types.state.AgentState) ai_traineree.agents.AgentBase
get_network_state() ai_traineree.types.state.NetworkState
get_state() ai_traineree.types.state.AgentState

Returns agent’s internal state

learn(samples)

Updates the agent’s critics and actor.

load_state(path: str)

Reads the whole agent state from a local file.

log_metrics(data_logger: ai_traineree.loggers.data_logger.DataLogger, step: int, full_log: bool = False)
property loss: Dict[str, float]
model: str = 'SAC'
obs_space: ai_traineree.types.dataspace.DataSpace
reset_agent() None
save_state(path: str)

Saves the whole agent state into a local file.

set_buffer(buffer_state: ai_traineree.types.state.BufferState) None
set_network(network_state: ai_traineree.types.state.NetworkState) None
state_dict() Dict[str, dict]

Returns network’s weights in order: Actor, TargetActor, Critic, TargetCritic

step(experience: ai_traineree.types.experience.Experience) None