The structure of this series of posts closely follows David Silver's course. All credit goes to him; I have added a few notes of my own in places where they helped my understanding.
A reward R_t is a (scalar) feedback signal which indicates how well an agent is doing at step t.
The agent’s job is to maximize cumulative reward. The agent only receives the reward that the environment gives to it based on its actions.
Reinforcement learning is based on the reward hypothesis:
All goals can be described by the maximization of expected cumulative reward.
Some examples from the slides:
As noted above, our main goal is to maximize the total future reward.
We know from life that actions have long-term consequences, and hence rewards may be delayed (for example, you may have to invest money today, reducing the total money you have now, to receive dividends in the future).
Hence, in some cases it is better to make short-term sacrifices to gain a larger long-term reward, so we have to plan ahead.
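To make the investing example concrete, here is a minimal Python sketch of cumulative return; the discount factor gamma is an assumption borrowed from later in the course (gamma = 1 recovers the plain sum):

```python
# A minimal sketch: cumulative (optionally discounted) return of a reward
# sequence. The discount factor gamma is not introduced in this section;
# it appears later in the course, so treat it as an assumption here.

def cumulative_return(rewards, gamma=1.0):
    """Sum of future rewards: G = r_0 + gamma * r_1 + gamma^2 * r_2 + ..."""
    g = 0.0
    for t, r in enumerate(rewards):
        g += (gamma ** t) * r
    return g

# Investing example: pay 10 today, receive dividends of 3 for five steps.
rewards = [-10, 3, 3, 3, 3, 3]
print(cumulative_return(rewards))  # 5.0 -- the short-term sacrifice pays off
```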
The image here is the gist of reinforcement learning. At every step, the agent executes an action in the environment, receives an observation from the environment (i.e. observes the consequences of its action), and receives a reward for its action.
This is a continuous loop, making RL a time series of actions, observations and rewards. This time series defines the experience of the agent, which in turn is the data that we use in RL.
Except for performing actions, the agent has no control over the environment.
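Here is a minimal sketch of that loop in Python; the Agent and Environment classes and their methods are hypothetical stand-ins, not part of the course:

```python
import random

class Environment:
    """Toy environment: emits a random observation and a reward of -1 per step."""
    def step(self, action):
        observation = random.random()
        reward = -1.0
        return observation, reward

class Agent:
    """Toy agent: ignores the observation and picks a random action."""
    def act(self, observation, reward):
        return random.choice(["N", "S", "E", "W"])

env, agent = Environment(), Agent()
observation, reward = 0.0, 0.0
for t in range(5):  # the loop: act -> observe -> receive reward
    action = agent.act(observation, reward)
    observation, reward = env.step(action)
```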
The history is the sequence of observations, actions and rewards that the agent has seen up to time t, denoted H_t = O_1, R_1, A_1, ..., A_{t-1}, O_t, R_t (the data that we talked about in the previous section).
The action taken by the agent at the next time step t+1 depends on the history, and so do the observations and rewards emitted by the environment.
State is a function of the history that is used to determine what happens next.
S_t = f(H_t)
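A minimal sketch of the idea, with a made-up choice of f (keep only the latest observation):

```python
# A minimal sketch: the history as a list of (observation, reward, action)
# tuples, and a state that is some function f of that history. Using only
# the latest observation is one simple (hypothetical) choice of f.

history = []  # H_t = O_1, R_1, A_1, ..., A_{t-1}, O_t, R_t

def record(observation, reward, action):
    history.append((observation, reward, action))

def state(history):
    """f(H_t): here, just the most recent observation."""
    return history[-1][0] if history else None

record(0.5, -1.0, "N")
record(0.7, -1.0, "E")
print(state(history))  # 0.7
```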
The environment state S^e_t is the environment's private representation, i.e. whatever data the environment uses to pick the next observation/reward.
The agent state S^a_t is the agent's internal representation.
It can be any function of the history, S^a_t = f(H_t), which the agent uses to make a decision about the next action.
We saw two definitions of state. Here is a mathematical formulation of state using Markov states. (These will be explained in more detail in the next part; this is just the gist of the topic.)
A state S_t is Markov if and only if:
P[S_{t+1} | S_t] = P[S_{t+1} | S_1, ..., S_t]
In plain English, this tells us that the state at time t contains all the useful information from the history ("The future is independent of the past given the present"). Once the present state is known, the history can be thrown away.
The environment state S^e_t and the history H_t are both Markov.
In a fully observable environment, the agent directly observes the environment state, so O_t = S^a_t = S^e_t. Formally, this is a Markov decision process (MDP).
In a partially observable environment, by contrast, the agent state != the environment state, and the agent only indirectly observes the environment. Formally, this is a partially observable Markov decision process (POMDP). Example: a robot with camera vision isn't told its absolute location.
The agent must construct its own state representation S^a_t, e.g. by using the complete history (S^a_t = H_t), beliefs about the environment state, or a recurrent neural network.
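As one illustration, here is a minimal sketch of building an agent state from recent observations; the window size k = 4 is an arbitrary assumption:

```python
from collections import deque

# A minimal sketch of one way to build an agent state under partial
# observability: keep the last k observations. The window size k = 4
# is an arbitrary assumption for illustration.

class AgentState:
    def __init__(self, k=4):
        self.window = deque(maxlen=k)

    def update(self, observation):
        self.window.append(observation)
        return tuple(self.window)  # S^a_t = f(H_t): the last k observations

s = AgentState()
for obs in [0.1, 0.4, 0.2, 0.9, 0.3]:
    state = s.update(obs)
print(state)  # (0.4, 0.2, 0.9, 0.3)
```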
A typical RL agent may include one or more of these components: a policy, a value function, and a model.
A policy is the agent's behaviour function, a map from state to action. It can be deterministic, a = π(s), or stochastic, π(a | s) = P[A_t = a | S_t = s] (the probability of taking a particular action, conditioned on being in a particular state at time t); a code sketch of both appears just below.
Not all of these components are always required: there are model-free algorithms, off-policy algorithms, etc., which we will come to later.
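Here is the promised sketch of the two policy types; the states, actions and probabilities are made up for illustration:

```python
import random

# A minimal sketch of the two policy types. The state/action spaces and
# probabilities here are made up for illustration.

def deterministic_policy(state):
    """a = pi(s): always the same action in a given state."""
    return {"start": "E", "corridor": "N"}.get(state, "S")

def stochastic_policy(state):
    """pi(a|s) = P[A_t = a | S_t = s]: sample an action from a distribution."""
    probs = {"N": 0.7, "S": 0.1, "E": 0.1, "W": 0.1}  # conditioned on `state`
    actions, weights = zip(*probs.items())
    return random.choices(actions, weights=weights)[0]

print(deterministic_policy("start"))   # always 'E'
print(stochastic_policy("corridor"))   # usually 'N'
```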
Here is an example with a maze.
The environment is fully observable and the actions are deterministic.
Reward: -1 per time step (basically tells the agent to finish the task as soon as possible)
Actions: N, S, E, W
State: The agent’s location.
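As a rough illustration, here is a minimal sketch of such a maze as code; the grid layout and start position are made-up assumptions:

```python
# A minimal sketch of the maze as an environment, under assumptions:
# a small hand-made grid ('#' = wall, 'G' = goal), reward -1 per time step,
# and actions N, S, E, W that do nothing when they would hit a wall.

GRID = [
    "#####",
    "#  G#",
    "# ###",
    "#   #",
    "#####",
]
MOVES = {"N": (-1, 0), "S": (1, 0), "E": (0, 1), "W": (0, -1)}

def step(state, action):
    """State is the agent's (row, col) location; reward is -1 per step."""
    r, c = state
    dr, dc = MOVES[action]
    nr, nc = r + dr, c + dc
    if GRID[nr][nc] != "#":       # blocked moves leave the state unchanged
        r, c = nr, nc
    done = GRID[r][c] == "G"
    return (r, c), -1.0, done

state, total = (3, 1), 0.0
for action in ["N", "N", "E", "E"]:
    state, reward, done = step(state, action)
    total += reward
print(state, total, done)  # (1, 3) -4.0 True
```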
This is what a deterministic optimal policy for the agent would look like: in every state, it tells the agent the best possible action to receive maximum reward.
The value function, as defined before, tells the agent how much reward can be expected from a state or from taking a particular action.
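One simple (hypothetical) way to estimate such a value function is to average the total reward observed after visiting each state; the episodes below are made up:

```python
from collections import defaultdict

# A minimal sketch of estimating a value function: average the total reward
# observed after visiting each state. The sample episodes are made up.

returns = defaultdict(list)

def record_episode(states, rewards):
    """For each visited state, store the total reward that followed it."""
    for i, s in enumerate(states):
        returns[s].append(sum(rewards[i:]))

def value(s):
    return sum(returns[s]) / len(returns[s])

record_episode(["A", "B", "goal"], [-1, -1, 0])
record_episode(["B", "goal"], [-1, 0])
print(value("B"))  # -1.0: expected total reward from state B
```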
Based on the optimal policy and the value function, the agent builds a model of how the environment works.
The grid shown represents the transition model and the numbers represent the immediate rewards the agent expects.
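Here is a minimal sketch of such a model, learned by counting observed transitions and averaging observed rewards; the experience tuples are made up:

```python
from collections import defaultdict

# A minimal sketch of a learned model: estimate transitions P(s'|s,a) and
# immediate rewards R(s,a) from experienced (s, a, r, s') tuples. The
# sample experience below is made up for illustration.

counts = defaultdict(lambda: defaultdict(int))   # (s, a) -> {s': count}
reward_sum = defaultdict(float)                  # (s, a) -> summed rewards
visits = defaultdict(int)                        # (s, a) -> visit count

def update_model(s, a, r, s_next):
    counts[(s, a)][s_next] += 1
    reward_sum[(s, a)] += r
    visits[(s, a)] += 1

def transition_prob(s, a, s_next):
    return counts[(s, a)][s_next] / visits[(s, a)]

def expected_reward(s, a):
    return reward_sum[(s, a)] / visits[(s, a)]

for s, a, r, s_next in [("A", "E", -1, "B"), ("A", "E", -1, "B"), ("A", "E", -1, "C")]:
    update_model(s, a, r, s_next)
print(transition_prob("A", "E", "B"), expected_reward("A", "E"))  # 0.666... -1.0
```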
RL is based on trial and error: we want the agent to discover the optimal policy by interacting with the environment, but at the same time we don't want to lose out on rewards by trying actions that are unknown to the agent.
Exploration finds out more information about the environment.
Exploitation follows the policy already known to the agent in order to maximize reward.
Hence it is important both to explore and to exploit, and there is a trade-off between the two.
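A standard way to handle this trade-off (not named in this post) is epsilon-greedy action selection, sketched below with made-up action values:

```python
import random

# Epsilon-greedy action selection: with probability epsilon the agent
# explores a random action; otherwise it exploits the best-known action.
# The action values below are made up for illustration.

def epsilon_greedy(q_values, actions, epsilon=0.1):
    if random.random() < epsilon:
        return random.choice(actions)                             # explore
    return max(actions, key=lambda a: q_values.get(a, 0.0))      # exploit

q = {"N": -3.0, "S": -5.0, "E": -1.0, "W": -4.0}
print(epsilon_greedy(q, ["N", "S", "E", "W"]))  # usually 'E'
```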
Ideas? comments? suggestions for improvement?
Feel free to reach me on my E-mail