PolicyGradient class.
Macros:
PolicyGradient environment definition (a minimal implementation sketch follows the member list below)
Random number generator
CapyRandom rng;
Number of possible actions
size_t nbAction;
Parameters for the actions probability evaluation
CapyVec paramAction;
Parameters for the value evaluation
CapyVec paramValue;
Feature values of the current state
CapyVec curState;
Output vector for the actions probability evaluation
CapyVec actionsProb;
Destructor
void (*destruct)(void);
Step the environment
Input argument(s):
action: the applied action
Output and side effect(s):
Update the current state according to the action, and return the transition.
CapyPGTransition (*step)(size_t const action);
Set the current state to an initial state
Output and side effect(s):
The current state is set to an initial state.
void (*setToInitialState)(void);
Check if the current state is an end state
Output and side effect(s):
Return true if the current state is an end state, else false.
bool (*isEndState)(void);
Get an action for a given state, sampled according to the action probabilities
Input argument(s):
state: the state to use for evaluation
Output and side effect(s):
Return the selected action.
size_t (*getAction)(CapyVec const* const state);
Get the action with highest probability for a given state
Input argument(s):
state: the state to use for evaluation
Output and side effect(s):
Return the selected action.
size_t (*getBestAction)(CapyVec const* const state);
Evaluate the action probabilities
Input argument(s):
state: the state used for evaluation
actionsProb: the vector receiving the evaluated action probabilities
Output and side effect(s):
'actionsProb' is updated.
void (*getActionsProb)(CapyVec const* const state, CapyVec* const actionsProb);
Evaluate the value
Input argument(s):
state: the state used for evaluation
Output and side effect(s):
Return the evaluated value.
double (*getValue)(CapyVec const* const state);
Evaluate the gradient of the value function
Input argument(s):
state: the state used for evaluation
gradValue: the result gradient
Output and side effect(s):
'gradValue' is updated.
void (*getGradientValue)(CapyVec const* const state, CapyVec* const gradValue);
Evaluate the gradient of an action's probability
Input argument(s):
state: the state used for evaluation
iAction: the action to be evaluated
gradProb: the result gradient
Output and side effect(s):
'gradProb' is updated.
void (*getGradientActionsProb)(CapyVec const* const state, size_t const iAction, CapyVec* const gradProb);
Evaluate the gradient of an action's log probability
Input argument(s):
state: the state used for evaluation
iAction: the action to be evaluated
gradProb: the result gradient
Output and side effect(s):
'gradProb' is updated.
void (*getGradientActionsLogProb)(CapyVec const* const state, size_t const iAction, CapyVec* const gradProb);
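As an illustration of how these members fit together, the sketch below implements the three user-supplied callbacks for a toy environment (a one-dimensional random walk with two actions). It is a rough sketch only: the header name, CapyPGTransitionCreate, the CapyVec 'vals' array and the CapyPGTransition field names are assumptions, not identifiers confirmed by this documentation.

#include <stdbool.h>
#include <stddef.h>
#include "capy.h" /* assumed header name */

/* Environment instance shared with the callbacks (the callbacks take no
** 'self' argument, matching the signatures above). */
static CapyPGEnvironment* env;

/* Set the current state to the initial position 0. */
static void WalkSetToInitialState(void) {
  env->curState.vals[0] = 0.0; /* assumes CapyVec exposes a 'vals' array */
}

/* The episode ends when the walker reaches either boundary. */
static bool WalkIsEndState(void) {
  return (env->curState.vals[0] <= -5.0 || env->curState.vals[0] >= 5.0);
}

/* Apply an action (0: move left, 1: move right) and return the transition. */
static CapyPGTransition WalkStep(size_t const action) {
  CapyPGTransition transition = CapyPGTransitionCreate(1); /* assumed name */
  transition.from.vals[0] = env->curState.vals[0]; /* assumed field names */
  transition.action = action;
  env->curState.vals[0] += (action == 0 ? -1.0 : 1.0);
  transition.to.vals[0] = env->curState.vals[0];
  transition.reward = (env->curState.vals[0] >= 5.0 ? 1.0 : 0.0);
  return transition;
}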
Enumerations:
None.
Typedefs:
CapyPGEnvironment object
Struct CapyPGTransition :
Struct CapyPGTransition's properties:
'from' state
Action
'to' state
Reward
Struct CapyPGTransition's methods:
None.
Struct CapyPGTransitionRecorder :
Struct CapyPGTransitionRecorder's properties:
Number of transitions
Size of the recorder memory (in number of transitions)
Recorded transitions
Struct CapyPGTransitionRecorder's methods:
Destructor
Reset the recorder
Output and side effect(s):
'nbTransition' is reset to 0.
Record one transition
Input argument(s):
transition: the transition to be recorded
Output and side effect(s):
A copy of the transition is appended to 'transitions', which is reallocated if necessary; 'nbTransition' and 'nbMaxTransition' are updated accordingly (a sketch of this append is given below).
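The recording method is a standard grow-on-demand append. The sketch below illustrates the idea using the field names listed above; the function name, the doubling growth policy and the shallow copy are assumptions (the actual implementation may deep-copy the 'from'/'to' vectors and raise CapyExc_MallocFailed on allocation failure).

#include <stdlib.h>

/* Append a copy of 'transition' to the recorder, growing the buffer if full. */
static void RecorderAdd(
  CapyPGTransitionRecorder* const that,
  CapyPGTransition const* const transition) {
  if (that->nbTransition == that->nbMaxTransition) {
    size_t const newSize =
      (that->nbMaxTransition == 0 ? 1 : 2 * that->nbMaxTransition);
    CapyPGTransition* const mem =
      realloc(that->transitions, newSize * sizeof(CapyPGTransition));
    if (mem == NULL) return; /* allocation failure handling simplified here */
    that->transitions = mem;
    that->nbMaxTransition = newSize;
  }
  that->transitions[that->nbTransition] = *transition;
  that->nbTransition += 1;
}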
Struct CapyPolicyGradient :
Struct CapyPolicyGradient's properties:
The trained environment
Learning rate for action probabilities (in ]0,1], should be small, default: 0.01)
Learning rate for state value (in ]0,1], should be small, default: 0.01)
Discount rate (in ]0,1], default: 0.9)
Max number of steps when sampling a trajectory (default: 1000)
Average reward during training
Average final reward during training
Average number of steps per episode during training
Clipping coefficient for PPO (in ]0,+inf[, default: 0.2; the lower the value, the more stable but the slower the learning)
Gradient descent for the action probabilities (Adam)
Gradient descent for the state values (standard)
Struct CapyPolicyGradient's methods:
Destructor
Learn the weights of the action probabilities and state value functions using the REINFORCE with baseline algorithm (see the update rules after this method list)
Input argument(s):
nbEpisode: number of training episodes
Output and side effect(s):
The environment's action probabilities parameters and state values parameters are updated.
Learn the weights of the action probabilities and state value functions using the proximal policy optimisation (PPO) algorithm (see the update rules after this method list)
Input argument(s):
nbEpisode: number of training episodes
Output and side effect(s):
The environment's action probabilities parameters and state values parameters are updated.
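For reference, the two learning methods correspond to the standard update rules of the algorithms they name. The notation below (theta for paramAction, w for paramValue, gamma for the discount rate, epsilon for the clipping coefficient, A_t for the advantage) is the textbook formulation, not a transcription of the implementation.

% REINFORCE with baseline, for each step t of a sampled episode:
G_t = \sum_{k=t}^{T} \gamma^{k-t} r_k, \qquad \delta_t = G_t - V_w(s_t)
w \leftarrow w + \alpha_V \, \delta_t \, \nabla_w V_w(s_t), \qquad
\theta \leftarrow \theta + \alpha_\pi \, \gamma^{t} \, \delta_t \, \nabla_\theta \log \pi_\theta(a_t \mid s_t)

% Proximal policy optimisation, clipped surrogate objective over recorded transitions:
r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)}, \qquad
L^{\mathrm{CLIP}}(\theta) = \mathbb{E}_t\!\left[ \min\bigl( r_t(\theta) A_t,\ \mathrm{clip}(r_t(\theta),\, 1-\varepsilon,\, 1+\varepsilon)\, A_t \bigr) \right]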
Functions:
Create a CapyPGTransition
Input argument(s):
nbFeature: number of features describing an environment
Output and side effect(s):
Return a CapyPGTransition
Destruct a CapyPGTransition
Create a CapyPGEnvironment
Input argument(s):
nbFeature: number of features describing an environment
nbAction: number of possible actions
nbParamAction: number of parameters for actions probability evaluation
nbParamValue: number of parameters for value evaluation
seed: seed for the random number generator
Output and side effect(s):
Return a CapyPGEnvironment
Allocate memory for a new CapyPGEnvironment and create it
Input argument(s):
nbFeature: number of features describing an environment
nbAction: number of possible actions
nbParamAction: number of parameters for actions probability evaluation
nbParamValue: number of parameters for value evaluation
seed: seed for the random number generator
Output and side effect(s):
Return a CapyPGEnvironment
Exception(s):
May raise CapyExc_MallocFailed.
Free the memory used by a CapyPGEnvironment* and reset '*that' to NULL
Input argument(s):
that: a pointer to the CapyPGEnvironment to free
Create a CapyPGTransitionRecorder
Output and side effect(s):
Return a CapyPGTransitionRecorder
Create a CapyPolicyGradient
Input argument(s):
env: the environment to train
Output and side effect(s):
Return a CapyPolicyGradient
Allocate memory for a new CapyPolicyGradient and create it
Input argument(s):
env: the environment to train
Output and side effect(s):
Return a CapyPolicyGradient
Exception(s):
May raise CapyExc_MallocFailed.
Free the memory used by a CapyPolicyGradient* and reset '*that' to NULL
Input argument(s):
that: a pointer to the CapyPolicyGradient to free
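Putting the functions together, a typical training session could look like the sketch below. It reuses the random-walk callbacks from the environment sketch earlier in this document; the function and method names (CapyPGEnvironmentAlloc, CapyPolicyGradientAlloc, learnReinforce, learnPPO, the *Free functions) are assumptions chosen to match the descriptions above, not verified identifiers.

int main(void) {
  /* Allocate an environment: 1 feature, 2 actions, 4 action parameters,
  ** 2 value parameters, seed 0 (all counts arbitrary for this example). */
  env = CapyPGEnvironmentAlloc(1, 2, 4, 2, 0); /* assumed name */
  env->step = WalkStep;
  env->setToInitialState = WalkSetToInitialState;
  env->isEndState = WalkIsEndState;

  /* Train with REINFORCE with baseline, then refine with PPO. */
  CapyPolicyGradient* pg = CapyPolicyGradientAlloc(env); /* assumed name */
  pg->learnReinforce(1000); /* assumed method name, 1000 episodes */
  pg->learnPPO(1000);       /* assumed method name, 1000 episodes */

  /* Exploit the trained policy from the initial state. */
  env->setToInitialState();
  while (!env->isEndState()) {
    size_t const action = env->getBestAction(&(env->curState));
    (void)env->step(action);
  }

  CapyPolicyGradientFree(&pg);  /* assumed name */
  CapyPGEnvironmentFree(&env);  /* assumed name */
  return 0;
}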