MarkovDecisionProcess class.
Macros:
MarkovDecisionProcess policy definition
Number of states
size_t nbState;
Value of each state
double* values;
Optimal action for each state
size_t* actions;
Destructor
void (*destruct)(void);
Get the action for a given state
Input argument(s):
state: the state
Output and side effect(s):
Return the action
size_t (*getAction)(size_t const state);
Get the probability that a given action is selected given a state
Input argument(s):
state: the state
action: the action
Output and side effect(s):
Return the probability, in [0, 1]
double (*getProbAction)(size_t const state, size_t const action);
MarkovDecisionProcess environment definition
Destructor
void (*destruct)(void);
Get the resulting state for a given state and action
Input argument(s):
fromState: the 'from' state
action: the applied action
Output and side effect(s):
Return the resulting state
size_t (*step)(size_t const fromState, size_t const action);
Enumerations:
None.
Typedefs:
CapyMDPPolicy object
CapyMDPEnvironment object
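For illustration, a minimal sketch of a custom environment implementing the interface documented in the Macros section above. The 'step' member comes from the environment definition; the helper names GridStep and SetupGridEnvironment, and the direct assignment to the member, are assumptions for this sketch, not documented API.

  #include <stddef.h>

  /* Deterministic 5-state corridor: action 0 moves left, action 1 moves
     right, both clamped to [0, 4]. The signature matches the documented
     'step' member of the environment definition. */
  static size_t GridStep(size_t const fromState, size_t const action) {
    if (action == 0) return (fromState > 0 ? fromState - 1 : 0);
    return (fromState < 4 ? fromState + 1 : 4);
  }

  /* Hypothetical setup helper: plug the grid dynamics into an environment
     created beforehand (creation is documented in the Functions section below). */
  void SetupGridEnvironment(CapyMDPEnvironment* const env) {
    env->step = GridStep;
  }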
Struct CapyMDPPolicyEpsilonSoft :
Struct CapyMDPPolicyEpsilonSoft's properties:
Inherits CapyMDPPolicy
Random number generator
Epsilon constant for action selection
Number of actions
Struct CapyMDPPolicyEpsilonSoft's methods:
Destructor for the parent class
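As a rough sketch of the technique behind this policy (epsilon-greedy selection, a special case of epsilon-soft; not the library's actual implementation): with probability epsilon a uniformly random action is selected, otherwise the currently optimal action stored in the parent policy's 'actions' array is used. The helper below and its use of rand() are assumptions.

  #include <stdlib.h>

  /* Epsilon-soft action selection: explore with probability 'epsilon',
     otherwise exploit the per-state optimal action. */
  static size_t EpsilonSoftSelect(
    size_t const state, size_t const nbAction,
    size_t const* const actions, double const epsilon) {
    double const u = (double)rand() / ((double)RAND_MAX + 1.0);
    if (u < epsilon) return (size_t)rand() % nbAction;  /* explore */
    return actions[state];                              /* exploit */
  }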
Struct CapyMDPTransition :
Struct CapyMDPTransition's properties:
Index of the origin state of the transition
Index of the action of the transition
Index of the termination state of the transition
Probability of transition
Reward for transitioning through that transition
Action value of the transition
Number of times this transition has been visited
Number of times a transition with the same fromState and action has been visited
Struct CapyMDPTransition's methods:
None.
Struct CapyMDPTransitionRecorder :
Struct CapyMDPTransitionRecorder's properties:
Number of transitions
Size of the recorder's memory (in number of transitions)
Recorded transitions
Struct CapyMDPTransitionRecorder's methods:
Destructor
Reset the recorder
Output and side effect(s):
'nbTransition' is reset to 0.
Record one transition
Input argument(s):
transition: the transition to be recorded
Output and side effect(s):
A copy of the transition is added to the end of 'transitions', which is reallocated if necessary; 'nbTransition' and 'nbMaxTransition' are updated accordingly.
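A hedged sketch of the append-and-grow behaviour described above, using stand-in types (the doubling growth strategy and the stand-in names are assumptions; only the field names follow the documentation).

  #include <stdlib.h>

  /* Stand-ins for CapyMDPTransition and CapyMDPTransitionRecorder. */
  typedef struct {
    size_t fromState, action, toState;
    double probability, reward;
  } Transition_;
  typedef struct {
    size_t nbTransition;     /* number of recorded transitions */
    size_t nbMaxTransition;  /* capacity, in number of transitions */
    Transition_* transitions;
  } Recorder_;

  /* Append a copy of 'transition', growing the buffer when it is full. */
  int RecorderAdd(Recorder_* const that, Transition_ const* const transition) {
    if (that->nbTransition == that->nbMaxTransition) {
      size_t const newMax = (that->nbMaxTransition ? 2 * that->nbMaxTransition : 16);
      Transition_* const mem = realloc(that->transitions, newMax * sizeof(Transition_));
      if (mem == NULL) return 0;  /* allocation failure; error handling omitted */
      that->transitions = mem;
      that->nbMaxTransition = newMax;
    }
    that->transitions[that->nbTransition++] = *transition;
    return 1;
  }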
Struct CapyMarkovDecisionProcess :
Struct CapyMarkovDecisionProcess's properties:
Number of states
Number of actions
Number of transitions (nbState * nbAction * nbState; see the indexing sketch after the properties list)
Transition definition
Start state flags
End state flags
Optimal policy
Index of the current state (default: 0)
Number of steps executed (reset by setCurState() and incremented by step())
Maximum number of steps, to avoid infinite stepping (default: 1e9)
Pseudo-random generator used to step the process (initialised with the current time)
Discount factor (default: 0.9, in [0, 1]; the lower it is, the more weight near-future rewards carry compared to far-future ones)
Epsilon threshold for convergence during the search for the optimal policy (default: 1e-6)
Flag to select between "first visit" and "each visit" during Monte Carlo search (default: false)
Reference to the environment modelled by the MDP (default: NULL)
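Since the number of transitions is nbState * nbAction * nbState, each (fromState, action, toState) triple maps to exactly one transition. The flat row-major index below is only an illustration of one natural layout; the ordering actually used by the library is not specified here.

  #include <stddef.h>

  /* Illustrative flat index for the triple (fromState, action, toState). */
  static size_t TransitionIndex(
    size_t const nbState, size_t const nbAction,
    size_t const fromState, size_t const action, size_t const toState) {
    return (fromState * nbAction + action) * nbState + toState;
  }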
Struct CapyMarkovDecisionProcess's methods:
Destructor
Get a transition
Input argument(s):
fromState: index of the origin state
action: index of the action
toState: index of the termination state
Output and side effect(s):
Return a reference to the transition
Set the current state
Input argument(s):
state: index of the current state
Output and side effect(s):
The current state is set and the number of steps is reset
Get the current state
Output and side effect(s):
Return the index of the current state
Get the number of steps
Output and side effect(s):
Return the number of steps
Step the MDP according to its transition definitions
Output and side effect(s):
The current state and the number of steps are updated. Return the transition.
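A hedged sketch of how a step according to the transition definitions is typically performed (it illustrates the sampling technique, not the library's actual code): draw a uniform number in [0, 1) and walk the cumulative probabilities of the transitions leaving the current state.

  #include <stdlib.h>

  /* Sample the next state given prob[toState], the probabilities of the
     transitions from the current state under the selected action. */
  static size_t SampleNextState(size_t const nbState, double const* const prob) {
    double const u = (double)rand() / ((double)RAND_MAX + 1.0);
    double cumul = 0.0;
    for (size_t toState = 0; toState < nbState; ++toState) {
      cumul += prob[toState];
      if (u < cumul) return toState;
    }
    return nbState - 1;  /* guard against floating point rounding */
  }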
Step the MDP according to a given policy
Output and side effect(s):
The current state and the number of steps are updated. Return the transition. If the MDP's environment is known, it is used to get the resulting state.
Initialise the pseudo random generator
Input argument(s):
seed: the seed
Output and side effect(s):
The pseudo random generator is reset.
Search the optimal policy (given that the MDP's transitions are all set with the correct transition probabilities and rewards)
Output and side effect(s):
Calculate the optimal policy and update 'optimalPolicy', which is also used as the initial policy for the search
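The documentation does not name the algorithm; given that full transition probabilities and rewards are required and that convergence is controlled by the epsilon property, a standard choice is value iteration, sketched below under that assumption (the data layout and function name are illustrative only).

  #include <math.h>
  #include <stddef.h>

  /* Value iteration over a tabular MDP. P and R hold the transition
     probabilities and rewards, flattened as [(s * nbAction + a) * nbState + t].
     'values' should be initialised (e.g. to 0) by the caller; on return it
     holds the state values and 'actions' the greedy (optimal) action per state. */
  void ValueIteration(
    size_t const nbState, size_t const nbAction,
    double const* const P, double const* const R,
    double const gamma, double const epsilon,
    double* const values, size_t* const actions) {
    double delta;
    do {
      delta = 0.0;
      for (size_t s = 0; s < nbState; ++s) {
        double best = -INFINITY;
        size_t bestAction = 0;
        for (size_t a = 0; a < nbAction; ++a) {
          double q = 0.0;
          for (size_t t = 0; t < nbState; ++t) {
            size_t const i = (s * nbAction + a) * nbState + t;
            q += P[i] * (R[i] + gamma * values[t]);
          }
          if (q > best) { best = q; bestAction = a; }
        }
        if (fabs(best - values[s]) > delta) delta = fabs(best - values[s]);
        values[s] = best;
        actions[s] = bestAction;
      }
    } while (delta >= epsilon);
  }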
Get the expected sum of rewards
Input argument(s):
nbRun: number of runs used to calculate the expected reward
Output and side effect(s):
Return the expected sum of rewards, or 0.0 and raise CapyExc_UndefinedExecution if the MDP can't reach an end state within 'that->nbMaxStep' steps. The start state is selected at random. The transitions are selected at random according to their probabilities.
Get the expected sum of rewards from a given start state
Input argument(s):
fromState: the start state
nbRun: number of runs used to calculate the expected reward
Output and side effect(s):
Return the expected sum of rewards, or 0.0 and raise CapyExc_UndefinedExecution if the MDP can't reach an end state within 'that->nbMaxStep' steps. The transitions are selected at random according to their probabilities.
Get the expected sum of rewards using a given policy
Input argument(s):
nbRun: number of runs used to calculate the expected reward
policy: the policy
Output and side effect(s):
Return the expected sum of rewards, or 0.0 and raise CapyExc_UndefinedExecution if the MDP can't reach an end state within 'that->nbMaxStep' steps. The start state is selected at random. The transitions are selected according to the given policy.
Get the expected sum of rewards from a given start state using a given policy
Input argument(s):
fromState: the start state
nbRun: number of runs used to calculate the expected reward
policy: the policy
Output and side effect(s):
Return the expected sum of rewards, or 0.0 and raise CapyExc_UndefinedExecution if the MDP can't reach an end state within 'that->nbMaxStep' steps. The transitions are selected according to the given policy.
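All four 'expected sum of rewards' methods average the total reward observed over nbRun simulated runs. A minimal sketch of that averaging, where RunOneEpisode() is a hypothetical stand-in for stepping the MDP until an end state or the step limit is reached:

  #include <stddef.h>

  /* Hypothetical helper: runs one episode and returns its sum of rewards. */
  double RunOneEpisode(void);

  /* Average the per-episode returns over 'nbRun' runs. */
  double ExpectedReward(size_t const nbRun) {
    double sum = 0.0;
    for (size_t run = 0; run < nbRun; ++run) sum += RunOneEpisode();
    return sum / (double)nbRun;
  }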
Record a trajectory through the MDP given an initial state and a policy
Input argument(s):
recorder: the recorder
startState: the initial state of the trajectory
policy: the policy used to select transitions
Output and side effect(s):
The recorder is reset and updated with the trajectory. The trajectory stops when encountering an end state, or when it reaches 'that->nbMaxStep'. The current state of the MDP is modified.
Search the optimal policy using Q-Learning (converges to the optimal policy by exploring the environment instead of using transition probabilities, so only the transition rewards are needed; uses an epsilon-soft policy to explore the transitions)
Input argument(s):
epsilon: exploration coefficient (in ]0, 1])
alpha: learning rate (in ]0, 1])
nbEpisode: number of training episodes
Output and side effect(s):
Calculate the optimal policy and update 'optimalPolicy', which is also used as the initial policy for the search.
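A hedged sketch of the tabular Q-learning update this method relies on ('alpha' is the documented learning rate and 'gamma' the MDP's discount factor; the Q table layout and the function name are illustrative only).

  #include <stddef.h>

  /* One Q-learning update for the observed transition
     (fromState, action) -> (toState, reward).
     Q is flattened as Q[state * nbAction + action]. */
  void QLearningUpdate(
    double* const Q, size_t const nbAction,
    size_t const fromState, size_t const action,
    double const reward, size_t const toState,
    double const alpha, double const gamma) {
    /* Greedy value of the next state: max over a of Q(toState, a). */
    double best = Q[toState * nbAction];
    for (size_t a = 1; a < nbAction; ++a)
      if (Q[toState * nbAction + a] > best) best = Q[toState * nbAction + a];
    /* Move Q(fromState, action) toward the bootstrapped target. */
    double* const q = Q + fromState * nbAction + action;
    *q += alpha * (reward + gamma * best - *q);
  }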
Get a random start state
Output and side effect(s):
Return one of the start states. If there are no start states, return 0 by default.
Functions:
Create a CapyMDPPolicy
Input argument(s):
nbState: the number of states
Output and side effect(s):
Return a CapyMDPPolicy
Allocate memory for a new CapyMDPPolicy and create it
Input argument(s):
nbState: the number of states
Output and side effect(s):
Return a CapyMDPPolicy
Exception(s):
May raise CapyExc_MallocFailed.
Free the memory used by a CapyMDPPolicy* and reset '*that' to NULL
Input argument(s):
that: a pointer to the CapyMDPPolicy to free
Create a new CapyMDPPolicyEpsilonSoft
Input argument(s):
nbState: the number of states
nbAction: the number of actions
epsilon: the epsilon constant for the action selection
Output and side effect(s):
Return a CapyMDPPolicyEpsilonSoft
Allocate memory for a new CapyMDPPolicyEpsilonSoft and create it
Input argument(s):
nbState: the number of states
nbAction: the number of actions
epsilon: the epsilon constant for the action selection
Output and side effect(s):
Return a CapyMDPPolicyEpsilonSoft
Exception(s):
May raise CapyExc_MallocFailed.
Free the memory used by a CapyMDPPolicyEpsilonSoft* and reset '*that' to NULL
Input argument(s):
that: a pointer to the CapyMDPPolicyEpsilonSoft to free
Create a CapyMDPTransitionRecorder
Output and side effect(s):
Return a CapyMDPTransitionRecorder
Create a CapyMDPEnvironment
Output and side effect(s):
Return a CapyMDPEnvironment
Allocate memory for a new CapyMDPEnvironment and create it
Output and side effect(s):
Return a CapyMDPEnvironment
Exception(s):
May raise CapyExc_MallocFailed.
Free the memory used by a CapyMDPEnvironment* and reset '*that' to NULL
Input argument(s):
that: a pointer to the CapyMDPEnvironment to free
Create a CapyMarkovDecisionProcess
Input argument(s):
nbState: the number of states
nbAction: the number of actions
Output and side effect(s):
Return a CapyMarkovDecisionProcess
Allocate memory for a new CapyMarkovDecisionProcess and create it
Input argument(s):
nbState: the number of states
nbAction: the number of actions
Output and side effect(s):
Return a CapyMarkovDecisionProcess
Exception(s):
May raise CapyExc_MallocFailed.
Free the memory used by a CapyMarkovDecisionProcess* and reset '*that' to NULL
Input argument(s):
that: a pointer to the CapyMarkovDecisionProcess to free
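To close, a hedged lifecycle sketch for the Alloc/Free pair documented above. The identifiers CapyMarkovDecisionProcessAlloc and CapyMarkovDecisionProcessFree are assumptions derived from the type name and the descriptions; consult the header for the exact names.

  #include <stddef.h>

  void Example(void) {
    size_t const nbState = 3;
    size_t const nbAction = 2;
    /* Allocate and create (may raise CapyExc_MallocFailed); names assumed. */
    CapyMarkovDecisionProcess* mdp =
      CapyMarkovDecisionProcessAlloc(nbState, nbAction);
    /* ... set transition probabilities and rewards, flag start/end states,
       search the optimal policy, step the process ... */
    /* Free the memory and reset the pointer to NULL, as documented. */
    CapyMarkovDecisionProcessFree(&mdp);
  }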