LibCapy - markovdecisionprocess

MarkovDecisionProcess class.

Macros:

MarkovDecisionProcess policy definition:

Number of states: size_t nbState;

State values: double* values;

State optimal actions: size_t* actions;

Destructor: void (*destruct)(void);

Get the action for a given state

Input argument(s):

state: the state

Output and side effect(s):

Return the action.

size_t (*getAction)(size_t const state);

Get the probability that a given action is selected given a state

Input argument(s):

state: the state
action: the action

Output and side effect(s):

Return the probability in [0,1].

double (*getProbAction)(size_t const state, size_t const action);
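
Taken together, the members listed above suggest that the policy definition macro expands to a struct body along these lines (a minimal sketch inferred from this page; the exact layout and the macro's name are assumptions, only the members and their signatures are documented):

#include <stddef.h>

typedef struct CapyMDPPolicy {
  // Number of states
  size_t nbState;
  // State values (one per state)
  double* values;
  // State optimal actions (one per state)
  size_t* actions;
  // Destructor
  void (*destruct)(void);
  // Get the action for a given state
  size_t (*getAction)(size_t const state);
  // Get the probability that a given action is selected given a state
  double (*getProbAction)(size_t const state, size_t const action);
} CapyMDPPolicy;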

MarkovDecisionProcess environment definition:

Destructor: void (*destruct)(void);

Get the result state for a given action applied from a given state

Input argument(s):

fromState: the 'from' state
action: the applied action

Output and side effect(s):

Return the result state.

size_t (*step)(size_t const fromState, size_t const action);
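
Likewise, the environment definition reduces to two members; a sketch of the implied struct (layout assumed, members taken from this page):

#include <stddef.h>

typedef struct CapyMDPEnvironment {
  // Destructor
  void (*destruct)(void);
  // Get the result state for a given action applied from a given state
  size_t (*step)(size_t const fromState, size_t const action);
} CapyMDPEnvironment;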

Enumerations:

None.

Typedefs:

CapyMDPPolicy object

CapyMDPEnvironment object

Struct CapyMDPPolicyEpsilonSoft :

Struct CapyMDPPolicyEpsilonSoft's properties:

Inherits CapyMDPPolicy

Random number generator

Epsilon constant for action selection

Number of actions

Struct CapyMDPPolicyEpsilonSoft's methods:

Destructor for the parent class
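
By definition, an epsilon-soft policy gives every action at least probability epsilon / nbAction and puts the remaining probability mass on the action currently considered optimal. A sketch of the selection probability this implies, written with an explicit 'that' argument for readability (the struct's getProbAction pointer only takes the state and action); the field names epsilon and nbAction and the embedded parent member are assumptions:

double epsilonSoftProbAction(
  CapyMDPPolicyEpsilonSoft const* const that,
  size_t const state,
  size_t const action) {
  // Every action receives epsilon / nbAction probability from exploration
  double probExplore = that->epsilon / (double)(that->nbAction);
  // The currently optimal action receives the remaining probability mass
  if(action == that->policy.actions[state]) return 1.0 - that->epsilon + probExplore;
  return probExplore;
}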

Struct CapyMDPTransition :

Struct CapyMDPTransition's properties:

Index of the origin state of the transition

Index of the action of the transition

Index of the termination state of the transition

Probability of transition

Reward for transitioning through that transition

Action value of the transition

Number of times this transition has been visited

Number of times a transition with the same fromState and action has been visited

Struct CapyMDPTransition's methods:

None.
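
The properties above map naturally onto a plain record; a sketch of the implied layout (fromState, action and toState appear elsewhere on this page, the remaining field names are assumptions):

#include <stddef.h>

typedef struct CapyMDPTransition {
  // Index of the origin state, the action and the termination state
  size_t fromState;
  size_t action;
  size_t toState;
  // Probability of the transition, in [0, 1]
  double prob;
  // Reward for transitioning through that transition
  double reward;
  // Action value (Q value) of the transition
  double actionValue;
  // Number of times this transition has been visited
  size_t nbVisit;
  // Number of times a transition with the same fromState and action has been visited
  size_t nbVisitFromStateAction;
} CapyMDPTransition;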

Struct CapyMDPTransitionRecorder :

Struct CapyMDPTransitionRecorder's properties:

Number of transitions

Size of the recorder memory (in number of transitions)

Recorded transitions

Struct CapyMDPTransitionRecorder's methods:

Destructor

Reset the recorder

Output and side effect(s):

'nbTransition' is reset to 0.

Record one transition

Input argument(s):

transition: the transition to be recorded

Output and side effect(s):

A copy of the transition is appended to the end of 'transitions', which is reallocated if necessary; 'nbTransition' and 'nbMaxTransition' are updated accordingly.
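
A hedged usage sketch of the recorder (the create function corresponds to the 'Create a CapyMDPTransitionRecorder' entry below; its exact name, the method names reset and add, and the direct function-pointer call style are assumptions):

// Create a recorder, clear it, then append a copy of one transition
CapyMDPTransitionRecorder recorder = CapyMDPTransitionRecorderCreate();
recorder.reset();
CapyMDPTransition transition = {.fromState = 0, .action = 1, .toState = 2, .reward = 1.0};
recorder.add(&transition);
// The copy now sits at recorder.transitions[recorder.nbTransition - 1]
recorder.destruct();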

Struct CapyMarkovDecisionProcess :

Struct CapyMarkovDecisionProcess's properties:

Number of states

Number of actions

Number of transitions (nbState * nbAction * nbState, see the indexing sketch after this property list)

Transition definition

Start state flags

End state flags

Optimal policy

Index of the current state (default: 0)

Number of steps executed (reset by setCurState() and incremented by step())

Maximum number of steps to avoid infinite stepping (default: 1e9)

Pseudo-random generator used to step the process (initialised with the current time)

Discount factor (default: 0.9, in [0,1]; the lower the value, the more weight near-future rewards carry compared to far-future ones)

Epsilon value for convergence during the search for the optimal policy (default: 1e-6)

Flag to select between "first visit" and "each visit" during Monte Carlo search (default: false)

Reference to the environment modelled by the MDP (default: NULL)
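
Since the number of transitions is nbState * nbAction * nbState, the transition table is presumably a flat array indexed by (fromState, action, toState). A sketch of the indexing this implies (the field names nbState and nbAction and the row-major ordering are assumptions, not the library's documented layout):

// Hypothetical index of the transition (fromState, action, toState) in a
// flat array of nbState * nbAction * nbState transitions, row-major order
size_t transitionIndex(
  CapyMarkovDecisionProcess const* const that,
  size_t const fromState,
  size_t const action,
  size_t const toState) {
  return (fromState * that->nbAction + action) * that->nbState + toState;
}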

Struct CapyMarkovDecisionProcess's methods:

Destructor

Get a transition

Input argument(s):

fromState: index of the origin state
action: index of the action
toState: index of the termination state

Output and side effect(s):

Return a reference to the transition

Set the current state

Input argument(s):

state: index of the current state

Output and side effect(s):

The current state is set and the number of steps is reset

Get the current state

Output and side effect(s):

Return the index of the current state

Get the number of steps

Output and side effect(s):

Return the number of steps

Step the MDP according to its transition definitions

Output and side effect(s):

The current state and the number of steps are updated. Return the transition.

Step the MDP according to a given policy

Output and side effect(s):

The current state and the number of steps are updated. Return the transition. If the MDP's environment is known, it is used to get the result state.
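
A hedged sketch of running one episode with the step method (setCurState and step are named in the property descriptions above, while the field names curState, endState, nbStep and nbMaxStep and the direct call style are schematic assumptions):

// Run one episode from state 0 and accumulate the rewards along the way
double sumReward = 0.0;
mdp->setCurState(0);
while(!mdp->endState[mdp->curState] && mdp->nbStep < mdp->nbMaxStep) {
  CapyMDPTransition* transition = mdp->step();
  sumReward += transition->reward;
}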

Initialise the pseudo random generator

Input argument(s):

seed: the seed

Output and side effect(s):

The pseudo random generator is reset.

Search the optimal policy (given that the MDP's transitions are all set with the correct transition probabilities and rewards)

Output and side effect(s):

Calculate the optimal policy and update 'optimalPolicy', which is also used as the initial policy for the search
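
Once every transition probability and reward is set, the search can run and the result can be read back from 'optimalPolicy' (a sketch; the method name searchOptimalPolicy and the call style are assumptions, actions and values come from the policy definition above):

// Compute the optimal policy, then read the per-state result
mdp->searchOptimalPolicy();
for(size_t state = 0; state < mdp->nbState; ++state) {
  size_t bestAction = mdp->optimalPolicy.actions[state];
  double stateValue = mdp->optimalPolicy.values[state];
  // ... use bestAction and stateValue ...
}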

Get the expected sum of rewards

Input argument(s):

nbRun: number of runs used to calculate the expected reward

Output and side effect(s):

Return the expected sum of rewards, or 0.0 and raise CapyExc_UndefinedExecution if the MDP can't reach an end state within that->nbMaxStep. The start state is selected at random. The transitions are selected at random according to their probabilities.

Get the expected sum of rewards from a given start state

Input argument(s):

fromState: the start state
nbRun: number of runs used to calculate the expected reward

Output and side effect(s):

Return the expected sum of rewards, or 0.0 and raise CapyExc_UndefinedExecution if the MDP can't reach an end state within that->nbMaxStep. The transitions are selected at random according to their probabilities.

Get the expected sum of rewards using a given policy

Input argument(s):

nbRun: number of runs used to calculate the expected reward
policy: the policy

Output and side effect(s):

Return the expected sum of rewards, or 0.0 and raise CapyExc_UndefinedExecution if the MDP can't reach an end state within that->nbMaxStep. The start state is selected at random. The transitions are selected according to the given policy.

Get the expected sum of rewards from a given start state using a given policy

Input argument(s):

fromState: the start state
nbRun: number of runs used to calculate the expected reward
policy: the policy

Output and side effect(s):

Return the expected sum of rewards, or 0.0 and raise CapyExc_UndefinedExecution if the MDP can't reach an end state within that->nbMaxStep. The transitions are selected according to the given policy.
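
A hedged sketch comparing the expected return of the random behaviour and of a given policy (the method names getExpectedReward and getExpectedRewardWithPolicy stand in for the variants described above and are assumptions):

// Estimate the expected return over 1000 runs, with and without a policy
size_t nbRun = 1000;
double rewardRandom = mdp->getExpectedReward(nbRun);
double rewardWithPolicy = mdp->getExpectedRewardWithPolicy(nbRun, policy);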

Record a trajectory through the MDP given an initial state and a policy

Input argument(s):

recorder: the recorder
startState: the initial state of the trajectory
policy: the policy used to select transitions

Output and side effect(s):

The recorder is reset and updated with the trajectory. The trajectory stops when encountering an end state, or when it reaches 'that->nbMaxStep'. The current state of the MDP is modified.
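
A recorded trajectory is what a Monte Carlo style estimate works from: the discounted return of an episode is the sum over steps of discount^t times the reward at step t. A sketch of computing that return from the recorder (the field names discount, nbTransition, transitions and reward follow the descriptions on this page but are otherwise assumptions):

// Discounted return of the recorded trajectory:
// sum over t of discount^t * reward at step t
double sumReward = 0.0;
double discountPow = 1.0;
for(size_t iStep = 0; iStep < recorder.nbTransition; ++iStep) {
  sumReward += discountPow * recorder.transitions[iStep].reward;
  discountPow *= mdp->discount;
}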

Search the optimal policy using Q-Learning (converges to the optimal policy by exploring the environment instead of using transition probabilities, so it only needs the transition rewards; uses an epsilon-soft policy to explore the transitions)

Input argument(s):

epsilon: exploration coefficient (in ]0, 1])
alpha: learning rate (in ]0, 1])
nbEpisode: number of training episodes

Output and side effect(s):

Calculate the optimal policy and update 'optimalPolicy', which is also used as the initial policy for the search.
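
For reference, the textbook tabular Q-Learning update that such a search presumably performs after each observed transition is Q(s,a) <- Q(s,a) + alpha * (r + gamma * max over a' of Q(s',a') - Q(s,a)), with alpha the learning rate and gamma the discount factor. A self-contained sketch of that single update, written over a standalone nbState x nbAction table rather than the library's transition records:

#include <stddef.h>

// One tabular Q-Learning update on a nbState x nbAction table Q (row-major),
// after observing the transition (s, a) -> sNext with reward r
void qLearningUpdate(
  double* const Q, size_t const nbAction,
  size_t const s, size_t const a, size_t const sNext,
  double const r, double const alpha, double const gamma) {
  // Best action value reachable from the next state
  double bestNext = Q[sNext * nbAction];
  for(size_t aNext = 1; aNext < nbAction; ++aNext)
    if(Q[sNext * nbAction + aNext] > bestNext) bestNext = Q[sNext * nbAction + aNext];
  // Move Q(s, a) toward the bootstrapped target r + gamma * bestNext
  Q[s * nbAction + a] += alpha * (r + gamma * bestNext - Q[s * nbAction + a]);
}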

Get a random start state

Output and side effect(s):

Return one of the start states. If there are no start states, return 0 by default.

Functions:

Create a CapyMDPPolicy

Input argument(s):

nbState: the number of states

Output and side effect(s):

Return a CapyMDPPolicy

Allocate memory for a new CapyMDPPolicy and create it

Input argument(s):

nbState: the number of states

Output and side effect(s):

Return a CapyMDPPolicy

Exception(s):

May raise CapyExc_MallocFailed.

Free the memory used by a CapyMDPPolicy* and reset '*that' to NULL

Input argument(s):

that: a pointer to the CapyMDPPolicy to free

Create a new CapyMDPPolicyEpsilonSoft

Input argument(s):

nbState: the number of states
nbAction: the number of actions
epsilon: the epsilon constant for the action selection

Output and side effect(s):

Return a CapyMDPPolicyEpsilonSoft

Allocate memory for a new CapyMDPPolicyEpsilonSoft and create it

Input argument(s):

nbState: the number of states
nbAction: the number of actions
epsilon: the epsilon constant for the action selection

Output and side effect(s):

Return a CapyMDPPolicyEpsilonSoft

Exception(s):

May raise CapyExc_MallocFailed.

Free the memory used by a CapyMDPPolicyEpsilonSoft* and reset '*that' to NULL

Input argument(s):

that: a pointer to the CapyMDPPolicyEpsilonSoft to free

Create a CapyMDPTransitionRecorder

Output and side effect(s):

Return a CapyMDPTransitionRecorder

Create a CapyMDPEnvironment

Output and side effect(s):

Return a CapyMDPEnvironment

Allocate memory for a new CapyMDPEnvironment and create it

Output and side effect(s):

Return a CapyMDPEnvironment

Exception(s):

May raise CapyExc_MallocFailed.

Free the memory used by a CapyMDPEnvironment* and reset '*that' to NULL

Input argument(s):

that: a pointer to the CapyMDPEnvironment to free

Create a CapyMarkovDecisionProcess

Input argument(s):

nbState: the number of states
nbAction: the number of actions

Output and side effect(s):

Return a CapyMarkovDecisionProcess

Allocate memory for a new CapyMarkovDecisionProcess and create it

Input argument(s):

nbState: the number of states
nbAction: the number of actions

Output and side effect(s):

Return a CapyMarkovDecisionProcess

Exception(s):

May raise CapyExc_MallocFailed.

Free the memory used by a CapyMarkovDecisionProcess* and reset '*that' to NULL

Input argument(s):

that: a pointer to the CapyMarkovDecisionProcess to free
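
Putting the creation functions together, a hedged end-to-end sketch (the header path, the function names CapyMarkovDecisionProcessAlloc and CapyMarkovDecisionProcessFree, the field names startState and endState, and the method call style are all assumptions; only the argument lists and behaviours come from the entries above):

#include <stdbool.h>
#include "capy/markovdecisionprocess.h"  // assumed header path

int main(void) {
  // A toy MDP with 2 states and 2 actions
  CapyMarkovDecisionProcess* mdp = CapyMarkovDecisionProcessAlloc(2, 2);
  // Mark state 0 as a start state and state 1 as an end state
  mdp->startState[0] = true;
  mdp->endState[1] = true;
  // Action 0 in state 0 always moves to state 1 with reward 1.0
  CapyMDPTransition* transition = mdp->getTransition(0, 0, 1);
  transition->prob = 1.0;
  transition->reward = 1.0;
  // Seed the pseudo-random generator and search the optimal policy
  mdp->initRng(12345);
  mdp->searchOptimalPolicy();
  // Release the MDP and reset the pointer to NULL
  CapyMarkovDecisionProcessFree(&mdp);
  return 0;
}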

2025-04-08
Copyright 2021-2025 Baillehache Pascal