Reinforcement learning

Introduction


At time $t$, the agent perceives the environment to be in state $s_t$ and decides to perform action $a_t$; as a result, at the next time step $t+1$ the environment state changes to $s_{t+1}$ and the agent receives a reward $r_{t+1}$. The goal of the agent is to maximize the total reward.

Markov decision process

$$\bold{P} = (S, A, P, R, [\gamma, \mu_0])$$
  • $S$: state space

  • $A$: action space

  • $P$: transition model $P(s'|s, a)$, $P: S \times A \times S \rightarrow [0, 1]$.

    It's the probability of reaching state $s'$ by starting from state $s$ and performing action $a$.

  • $R$: reward function $R(s, a)$, $R: S \times A \rightarrow \mathbb{R}$.

    It's the reward obtained by performing action $a$ from state $s$.

  • $\gamma$: discount factor; intuitively, it measures how much future rewards matter compared with current rewards. The larger it is, the more important future rewards are.

  • $\mu_0$: initial state distribution $\mu_0(s)$; it's the probability that the agent–environment interaction starts in state $s$.

Note that all the items in $\bold{P}$ are required in problem-modeling exercises, with the only exception of $\gamma$. A toy example in code is sketched below.
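
As a concrete illustration, a tabular MDP can be stored with a handful of arrays. The following is a minimal sketch with made-up numbers (the 2-state, 2-action MDP and all of its values are purely hypothetical):

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP, just to fix the notation.
S = [0, 1]                          # state space
A = [0, 1]                          # action space
# P[s, a, s'] = P(s' | s, a): transition model (each row sums to 1)
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.0, 1.0]]])
# R[s, a] = R(s, a): reward function
R = np.array([[0.0, 1.0],
              [2.0, 0.0]])
gamma = 0.9                         # discount factor
mu0 = np.array([1.0, 0.0])          # initial state distribution mu_0(s)
```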

Notes

  • We're in a single-agent environment.

  • Why Markov decision process? Because the transition model satisfies the Markov property: the next state does not depend on the history of all past states and actions, but only on the current state and action.

  • Important distinction with respect to what we've seen so far: in reinforcement learning $P$ and $R$ are unknown, while in search the definition of the problem is known to the agent. This means that the agent has to interact with the environment in order to learn how to act in it.

Policy function

$\pi$ is the policy function, defined as $\pi(a|s)$, $\pi : S \times A \rightarrow [0, 1]$. It is the probability that the agent performs action $a \in A$ in a given state $s \in S$. In exercises, starting from a Markov decision process we'll have to build the Q-table using the Bellman equation.
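
For intuition, a stochastic policy on a tabular MDP can be represented as a row-stochastic table $\pi[s, a] = \pi(a \mid s)$. The sketch below uses hypothetical probabilities and a made-up helper `sample_action`:

```python
import numpy as np

# pi[s, a] = pi(a | s): each row is a probability distribution over actions.
pi = np.array([[0.7, 0.3],
               [0.1, 0.9]])

def sample_action(pi, s, rng):
    """Draw an action a ~ pi(. | s)."""
    return rng.choice(len(pi[s]), p=pi[s])

rng = np.random.default_rng(0)
print(sample_action(pi, s=0, rng=rng))   # 0 with probability 0.7, 1 with probability 0.3
```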

Expected discounted cumulative reward

Since the goal of the agent is to maximize the total reward, we need to define it. Often it is defined as the expected discounted cumulative reward: the expected value, over all possible interactions with the environment over an unlimited period of time, of the sum of the discounted rewards received by the agent:

$$V^{\pi}(s)=\mathbb{E}\left[\sum_{t=0}^{+\infty} \gamma^{t} R\left(s_{t}, a_{t}\right) \mid s_{0}=s, a_{t} \in \pi\left(., s_{t}\right)\right]$$

So basically we're summing the rewards at each time instant and weighting them with the discount factor ($\gamma$) raised to the power of $t$. Since $\gamma < 1$, we're giving less and less weight to future rewards.
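
As a quick numerical check of the definition, the sketch below sums the discounted rewards of a single finite episode; the helper name and the reward sequence are made up:

```python
def discounted_return(rewards, gamma=0.9):
    """Sum of gamma^t * r_t over one episode (a finite-horizon sample of the return)."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))

# A reward of 1 received two steps in the future is worth gamma^2 = 0.81 today.
print(round(discounted_return([0, 0, 1], gamma=0.9), 2))   # 0.81
```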

Q-Function (or state–action value function)

It's an alternative value function, very similar to the expected discounted cumulative reward $V^{\pi}$ defined above:

$$Q^{\pi}(s, a)=\mathbb{E}\left[\sum_{t=0}^{+\infty} \gamma^{t} R\left(s_{t}, a_{t}\right) \mid s_{0}=s, a_{0}=a, a_{t} \in \pi\left(., s_{t}\right)\right]$$

The difference is that it is a function of a state–action pair, and it has the additional constraint that the agent plays action $a$ at the first step.

Policy optimality

$a^*(s) \in \operatorname{argmax}_{a} Q^*(s, a)$: the optimal action to be played from $s$. In exercises we will not learn the optimal policy directly: we will first learn the optimal Q-Function and then obtain $\pi^*$ by looking at each state and finding $a^*$.

A policy $\pi^*$ is optimal if and only if it maximizes the expected discounted cumulative reward:

$$\pi^{*} \in \operatorname{argmax}_{\pi} V^{\pi}(s)$$

Therefore, we denote:

  • $V^{*}(s):=V^{\pi^{*}}(s)$ the value function of the optimal policy

  • $Q^{*}(s, a):=Q^{\pi^{*}}(s, a)$ the action value function of the optimal policy

The optimal action to be played in a given state $s$, given by $\pi^{*}(s)$, can be defined in terms of the optimal Q-Function:

$$\pi^{*}(s) \in \operatorname{argmax}_{a} Q^{*}(s, a)$$

In other words, we fix the state and then find the action that maximizes the Q-Function.
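
In code, extracting the greedy policy from a tabular Q-Function is one argmax per state. The sketch below assumes `Q` is a NumPy array of shape `(n_states, n_actions)`:

```python
import numpy as np

def greedy_policy(Q):
    """pi*(s) = argmax_a Q*(s, a), computed for every state of a tabular Q-function."""
    return np.argmax(Q, axis=1)      # one action index per state

# Example with a hypothetical 2x2 Q-table: pick action 1 in state 0, action 0 in state 1.
print(greedy_policy(np.array([[0.0, 1.0], [2.0, 0.5]])))   # [1 0]
```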

From Value function and Q-f... | BookStack (bassopaolo.com)

Bellman equation

It is a recursive way to define the optimal Q-Function:

$$Q^{*}(s, a)=R(s, a)+\gamma \sum_{s^{\prime} \in S} P\left(s^{\prime} \mid s, a\right) \max_{a^{\prime} \in A} Q^{*}\left(s^{\prime}, a^{\prime}\right)$$
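
When $P$ and $R$ are known (the planning setting, not reinforcement learning proper), the Bellman equation can be turned directly into a fixed-point iteration on a tabular Q-Function. The sketch below assumes the array layout of the hypothetical MDP example above and is only illustrative:

```python
import numpy as np

def q_value_iteration(P, R, gamma=0.9, n_iters=100):
    """Iterate the Bellman optimality equation on a tabular MDP.

    P: shape (n_states, n_actions, n_states), P[s, a, s'] = P(s' | s, a)
    R: shape (n_states, n_actions),           R[s, a]     = R(s, a)
    """
    n_states, n_actions, _ = P.shape
    Q = np.zeros((n_states, n_actions))
    for _ in range(n_iters):
        # Q*(s, a) = R(s, a) + gamma * sum_s' P(s' | s, a) * max_a' Q*(s', a')
        Q = R + gamma * P @ Q.max(axis=1)
    return Q
```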

Discrete version

$$Q^{*}(s_t, a_t) \leftarrow (1-\alpha)\, Q^{*}(s_t, a_t)+\alpha\left(r_{t+1}+\gamma \max_{a^{\prime} \in A} Q^{*}\left(s_{t+1}, a^{\prime}\right)\right)$$

It's used to compute an estimate of the optimal Q-Function. Since the update is recursive, the $Q^*$ terms on the right-hand side are the previous estimates of the function. Some notes:

  • $r_{t+1}$ is the reward received when moving to state $s_{t+1}$

  • $Q^{*}\left(s_{t+1}, a'\right)$ is a single-sample estimate

  • $\alpha$ is the learning rate
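
Putting these notes together, a single sample-based update step on a tabular Q-Function could look like the sketch below (the function name and the default values of $\alpha$ and $\gamma$ are arbitrary choices):

```python
import numpy as np

def q_update(Q, s, a, r_next, s_next, alpha=0.1, gamma=0.9):
    """One discrete update: blend the old estimate with the new single-sample target."""
    target = r_next + gamma * np.max(Q[s_next])        # single-sample estimate of the future value
    Q[s, a] = (1 - alpha) * Q[s, a] + alpha * target   # weighted by the learning rate alpha
    return Q
```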
