Reinforcement learning
Introduction

At time t, the agent perceives the environment to be in state st and decides to perform action at. As a result, at the next time step t+1 the environment state changes to st+1 and the agent receives a reward rt+1. The goal of the agent is to maximize the total reward.
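To make this interaction loop concrete, here is a minimal sketch in Python; `env` and `agent` are hypothetical objects (not defined in these notes), and the `reset`/`step`/`act` methods are assumptions modeled on Gymnasium-style interfaces:

```python
# Minimal sketch of the agent-environment loop: at each step the agent
# picks an action, the environment returns the next state and a reward.
def run_episode(env, agent, max_steps=100):
    total_reward = 0.0
    state = env.reset()                              # initial state s_0 ~ mu_0
    for _ in range(max_steps):
        action = agent.act(state)                    # a_t ~ pi(. | s_t)
        next_state, reward, done = env.step(action)  # s_{t+1}, r_{t+1}
        agent.observe(state, action, reward, next_state)
        total_reward += reward
        state = next_state
        if done:
            break
    return total_reward
```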
Markov decision process
S: state space
A: action space
P: transition model P(s′∣s,a), P:S×A×S→[0,1].
It's the probability of reaching state s′ by starting from state s and performing action a.
R: reward function R(s,a), R:S×A→R.
It's the reward obtained by performing action a in state s.
γ: discount factor. From an intuitive perspective it measures how much future rewards count compared with immediate rewards: the larger γ is, the more important future rewards are.
μ0: initial state distribution μ0(s), i.e. the probability that the agent-environment interaction starts in a given state.
Note that in problem-modeling exercises all the items of the definition above are requested, with the only exception of γ.
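As a concrete toy example, such a tuple can be written out explicitly; the two states, two actions and all the numbers below are made up purely for illustration:

```python
# A tiny, made-up MDP written out as the tuple (S, A, P, R, gamma, mu0).
S = ["s0", "s1"]
A = ["stay", "go"]

# Transition model P(s' | s, a): dict keyed by (s, a), value = distribution over s'.
P = {
    ("s0", "stay"): {"s0": 1.0, "s1": 0.0},
    ("s0", "go"):   {"s0": 0.2, "s1": 0.8},
    ("s1", "stay"): {"s0": 0.0, "s1": 1.0},
    ("s1", "go"):   {"s0": 0.9, "s1": 0.1},
}

# Reward function R(s, a).
R = {
    ("s0", "stay"): 0.0, ("s0", "go"): 1.0,
    ("s1", "stay"): 2.0, ("s1", "go"): 0.0,
}

gamma = 0.9                      # discount factor
mu0 = {"s0": 1.0, "s1": 0.0}     # interaction always starts in s0
```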
Notes
We're in a single-agent environment.
Why a Markov decision process? The name says that the transition model satisfies the Markov property: the next state depends only on the current state and action, not on the whole history of states and actions.
Important distinction with what we've seen so far: in reinforcement learning P and R are unknown, while in search the definition of the problem is known to the agent. This means that the agent has to interact with the environment in order to learn how to act in it.
Policy function
π is the policy function, defined as π(a∣s), π:S×A→[0,1]. It is the probability that the agent performs a certain action a∈A in a given state s∈S. In exercises starting from a Markov decision process we'll have to build the Q-table using the Bellman equation.
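For instance, a stochastic policy for the toy MDP sketched above can be stored as a table of per-state action probabilities (again, the numbers are arbitrary):

```python
import random

# pi(a | s) as a table: for each state, a probability distribution over actions.
pi = {
    "s0": {"stay": 0.3, "go": 0.7},
    "s1": {"stay": 0.9, "go": 0.1},
}

def sample_action(pi, state):
    """Draw an action a ~ pi(. | state)."""
    actions = list(pi[state].keys())
    probs = list(pi[state].values())
    return random.choices(actions, weights=probs, k=1)[0]
```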
Expected discounted cumulative reward
Since the goal of the agent is to maximize the total reward, we need to define it precisely. Often it is defined as the expected discounted cumulative reward: the expected value, over all possible interactions with the environment over an unlimited period of time, of the sum of the discounted rewards received by the agent:
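In symbols (the standard textbook form, written with the notation introduced above; the expectation is over the randomness of the policy and of the transitions):

$$V^{\pi}(s) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^{t}\, r_{t+1} \,\middle|\, s_0 = s,\ a_t \sim \pi(\cdot \mid s_t)\right]$$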
So basically we're summing the rewards at the different time instants and weighting them with the discount factor γ raised to the power of t. Since γ<1, we're giving less and less weight to rewards that are further in the future.
Q-Function (or state action value function)
It's an alternative value function, very similar to the total reward function:
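In the standard textbook form, using the same notation as above:

$$Q^{\pi}(s, a) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^{t}\, r_{t+1} \,\middle|\, s_0 = s,\ a_0 = a,\ a_t \sim \pi(\cdot \mid s_t) \text{ for } t \geq 1\right]$$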
The difference is that it is a function of a state-action pair, with the constraint that the agent plays action a as its first action.
Policy optimality
a∗(s) ∈ argmaxa Q∗(s,a): the optimal action to be played from s. In exercises we will not learn the optimal policy directly: we will first learn the optimal Q-Function (by looking at each state and finding a∗) and then derive π from it.
A policy π∗ is optimal if and only if it maximizes the expected discounted cumulative reward:
π∗ ∈ argmaxπ Vπ(s)
Therefore, we denote:
$V^{*}(s) := V^{\pi^{*}}(s)$ the value function of the optimal policy
$Q^{*}(s, a) := Q^{\pi^{*}}(s, a)$ the action-value function of the optimal policy
The optimal action to be played in a given state s, given by $\pi^{*}(s)$, can be defined in terms of the optimal Q-function:
π∗(s) ∈ argmaxa Q∗(s,a)
In other words, we fix the state and then find the action that maximizes the Q-function.
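A minimal sketch of this step, assuming Q∗ has already been computed and stored as a dictionary keyed by (state, action) pairs (the tabular layout is an assumption of this example):

```python
def greedy_policy(Q, S, A):
    """pi*(s) = argmax_a Q*(s, a) for a tabular Q stored as a dict keyed by (s, a)."""
    return {s: max(A, key=lambda a: Q[(s, a)]) for s in S}

# Example usage: pi_star = greedy_policy(Q, S, A)
```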
Bellman equation
It is a recursive way to define the optimal Q-Function:
Q∗(s,a)=R(s,a)+γ∑s′∈SP(s′∣s,a)maxa′∈AQ∗(s′,a′)
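When P and R are known, the Bellman equation can be applied iteratively until the Q-values stabilize (Q-value iteration). A sketch, reusing the dictionary layout of the toy MDP above; the fixed number of iterations is an arbitrary choice:

```python
def q_value_iteration(S, A, P, R, gamma, n_iters=100):
    """Iteratively apply the Bellman optimality equation to a tabular Q."""
    Q = {(s, a): 0.0 for s in S for a in A}
    for _ in range(n_iters):
        # Synchronous update: the new table is built from the previous one.
        Q = {
            (s, a): R[(s, a)] + gamma * sum(
                P[(s, a)][s_next] * max(Q[(s_next, a_next)] for a_next in A)
                for s_next in S
            )
            for s in S for a in A
        }
    return Q

# Example usage: Q_star = q_value_iteration(S, A, P, R, gamma)
```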
Discrete version
Q∗(s,a) = (1−α)Q∗(s,a) + α(rt+1 + γ maxa′∈A Q∗(st+1,a′))
It's used to compute an estimate of the optimal Q-Function. Since the definition is recursive, the Q∗ values on the right-hand side are the previous estimates. Some notes (a code sketch of this update follows the list):
rt+1 is the reward received when moving to state st+1
maxa′∈A Q∗(st+1,a′) is a single-sample estimate: the expectation over next states in the Bellman equation is replaced by the single observed sample st+1
α is the learning rate
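A sketch of one such update step, with Q stored as a dictionary and unseen (s, a) pairs defaulting to 0; the function name and the default values for α and γ are arbitrary:

```python
def q_learning_update(Q, s, a, r, s_next, A, alpha=0.1, gamma=0.9):
    """One tabular update:
    Q(s, a) = (1 - alpha) * Q(s, a) + alpha * (r + gamma * max_a' Q(s', a'))."""
    best_next = max(Q.get((s_next, a_next), 0.0) for a_next in A)
    Q[(s, a)] = (1 - alpha) * Q.get((s, a), 0.0) + alpha * (r + gamma * best_next)
    return Q
```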