Reinforcement learning
Introduction

At time t, the agent perceives the environment to be in state st and decides to perform action at. As a result, at the next time step t+1 the environment state changes to st+1 and the agent receives a reward rt+1. The goal of the agent is to maximize the total reward.
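To make this interaction loop concrete, here is a minimal sketch in Python; `env` and `agent` are hypothetical objects (not defined in these notes), and the `reset`/`step`/`act` methods are assumptions modeled on Gymnasium-style interfaces:

```python
# Minimal sketch of the agent-environment loop: at each step the agent
# picks an action, the environment returns the next state and a reward.
def run_episode(env, agent, max_steps=100):
    total_reward = 0.0
    state = env.reset()                              # initial state s_0 ~ mu_0
    for _ in range(max_steps):
        action = agent.act(state)                    # a_t ~ pi(. | s_t)
        next_state, reward, done = env.step(action)  # s_{t+1}, r_{t+1}
        agent.observe(state, action, reward, next_state)
        total_reward += reward
        state = next_state
        if done:
            break
    return total_reward
```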
Markov decision process
S: state space
A: action space
P: transition model P(s′∣s,a), P:S×A×S→[0,1].
It's the probability of reaching state s′ by starting from state s and performing action a.
R: reward function R(s,a), R:S×A→R.
It's the reward obtained by performing action a in state s.
γ: discount factor. From an intuitive perspective it measures how much future rewards count compared with immediate rewards: the larger γ is, the more important future rewards are.
μ0: initial state distribution μ0(s), i.e. the probability that the agent-environment interaction starts in a given state.
Note that in problem-modeling exercises all the items of the definition above are requested, with the only exception of γ.
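As a concrete toy example, such a tuple can be written out explicitly; the two states, two actions and all the numbers below are made up purely for illustration:

```python
# A tiny, made-up MDP written out as the tuple (S, A, P, R, gamma, mu0).
S = ["s0", "s1"]
A = ["stay", "go"]

# Transition model P(s' | s, a): dict keyed by (s, a), value = distribution over s'.
P = {
    ("s0", "stay"): {"s0": 1.0, "s1": 0.0},
    ("s0", "go"):   {"s0": 0.2, "s1": 0.8},
    ("s1", "stay"): {"s0": 0.0, "s1": 1.0},
    ("s1", "go"):   {"s0": 0.9, "s1": 0.1},
}

# Reward function R(s, a).
R = {
    ("s0", "stay"): 0.0, ("s0", "go"): 1.0,
    ("s1", "stay"): 2.0, ("s1", "go"): 0.0,
}

gamma = 0.9                      # discount factor
mu0 = {"s0": 1.0, "s1": 0.0}     # interaction always starts in s0
```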
Notes
We're in a single-agent environment.
Why a Markov decision process? The name says that the transition model satisfies the Markov property: the next state depends only on the current state and action, not on the whole history of states and actions.
Important distinction with what we've seen so far: in reinforcement learning P and R are unknown, while in search the definition of the problem is known to the agent. This means that the agent has to interact with the environment in order to learn how to act in it.
Policy function
π is the policy function, defined as π(a∣s), π:S×A→[0,1]. It is the probability that the agent performs a certain action a∈A in a given state s∈S. In exercises starting from a Markov decision process we'll have to build the Q-table using the Bellman equation.
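For instance, a stochastic policy for the toy MDP sketched above can be stored as a table of per-state action probabilities (again, the numbers are arbitrary):

```python
import random

# pi(a | s) as a table: for each state, a probability distribution over actions.
pi = {
    "s0": {"stay": 0.3, "go": 0.7},
    "s1": {"stay": 0.9, "go": 0.1},
}

def sample_action(pi, state):
    """Draw an action a ~ pi(. | state)."""
    actions = list(pi[state].keys())
    probs = list(pi[state].values())
    return random.choices(actions, weights=probs, k=1)[0]
```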
Expected discounted cumulative reward
Since the goal of the agent is to maximize the total reward, we need to define it precisely. Often it is defined as the expected discounted cumulative reward: the expected value, over all possible interactions with the environment over an unlimited period of time, of the sum of the discounted rewards received by the agent:
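In symbols (the standard textbook form, written with the notation introduced above; the expectation is over the randomness of the policy and of the transitions):

$$V^{\pi}(s) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^{t}\, r_{t+1} \,\middle|\, s_0 = s,\ a_t \sim \pi(\cdot \mid s_t)\right]$$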
So basically we're summing the rewards at the different time instants and weighting them with the discount factor γ raised to the power of t. Since γ<1, we're giving less and less weight to rewards that are further in the future.
Q-Function (or state action value function)
It's an alternative value function, very similar to the total reward function:
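In the standard textbook form, using the same notation as above:

$$Q^{\pi}(s, a) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^{t}\, r_{t+1} \,\middle|\, s_0 = s,\ a_0 = a,\ a_t \sim \pi(\cdot \mid s_t) \text{ for } t \geq 1\right]$$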
The difference is that it is a function of a state-action pair, with the constraint that the agent plays action a as its first action.
Policy optimality
a∗(s) ∈ argmaxa Q∗(s,a): the optimal action to be played from s. In exercises we will not learn the optimal policy directly: we will first learn the optimal Q-Function (by looking at each state and finding a∗) and then derive π from it.
A policy π∗ is optimal if and only if it maximizes the expected discounted cumulative reward:
π∗ ∈ argmaxπ Vπ(s)
Therefore, we denote:
$V^{*}(s) := V^{\pi^{*}}(s)$ the value function of the optimal policy
$Q^{*}(s, a) := Q^{\pi^{*}}(s, a)$ the action-value function of the optimal policy
The optimal action to be played in a given state s, given by $\pi^{*}(s)$, can be defined in terms of the optimal Q-function:
π∗(s) ∈ argmaxa Q∗(s,a)
In other words, we fix the state and then find the action that maximizes the Q-function.
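A minimal sketch of this step, assuming Q∗ has already been computed and stored as a dictionary keyed by (state, action) pairs (the tabular layout is an assumption of this example):

```python
def greedy_policy(Q, S, A):
    """pi*(s) = argmax_a Q*(s, a) for a tabular Q stored as a dict keyed by (s, a)."""
    return {s: max(A, key=lambda a: Q[(s, a)]) for s in S}

# Example usage: pi_star = greedy_policy(Q, S, A)
```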
Bellman equation
It is a recursive way to define the optimal Q-Function:
Q∗(s,a)=R(s,a)+γ∑s′∈SP(s′∣s,a)maxa′∈AQ∗(s′,a′)
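When P and R are known, the Bellman equation can be applied iteratively until the Q-values stabilize (Q-value iteration). A sketch, reusing the dictionary layout of the toy MDP above; the fixed number of iterations is an arbitrary choice:

```python
def q_value_iteration(S, A, P, R, gamma, n_iters=100):
    """Iteratively apply the Bellman optimality equation to a tabular Q."""
    Q = {(s, a): 0.0 for s in S for a in A}
    for _ in range(n_iters):
        # Synchronous update: the new table is built from the previous one.
        Q = {
            (s, a): R[(s, a)] + gamma * sum(
                P[(s, a)][s_next] * max(Q[(s_next, a_next)] for a_next in A)
                for s_next in S
            )
            for s in S for a in A
        }
    return Q

# Example usage: Q_star = q_value_iteration(S, A, P, R, gamma)
```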
Discrete version
Q∗(s,a) = (1−α)Q∗(s,a) + α(rt+1 + γ maxa′∈A Q∗(st+1,a′))
It's used to compute an estimate of the optimal Q-Function. Since the definition is recursive, the Q∗ values on the right-hand side are the previous estimates. Some notes (a code sketch of this update follows the list):
rt+1 is the reward received when moving to state st+1
maxa′∈A Q∗(st+1,a′) is a single-sample estimate: the expectation over next states in the Bellman equation is replaced by the single observed sample st+1
α is the learning rate
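A sketch of one such update step, with Q stored as a dictionary and unseen (s, a) pairs defaulting to 0; the function name and the default values for α and γ are arbitrary:

```python
def q_learning_update(Q, s, a, r, s_next, A, alpha=0.1, gamma=0.9):
    """One tabular update:
    Q(s, a) = (1 - alpha) * Q(s, a) + alpha * (r + gamma * max_a' Q(s', a'))."""
    best_next = max(Q.get((s_next, a_next), 0.0) for a_next in A)
    Q[(s, a)] = (1 - alpha) * Q.get((s, a), 0.0) + alpha * (r + gamma * best_next)
    return Q
```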