Algorithms in Reinforcement Learning

Imalka Prasadini
Published in The Startup
Nov 10, 2020 · 4 min read



In my last article, I briefly discussed reinforcement learning. Today, let's talk about some algorithms in reinforcement learning.

Reinforcement learning offers a number of algorithms for achieving the optimal policy. In this article I am going to explain some of the main reinforcement learning algorithms briefly.

Tabular Methods

Q-learning is a simple reinforcement learning algorithm for learning the optimal action in an unknown environment. Without a model of the environment, it can learn the optimal, long-term action. Q-learning is off-policy: it works with two policies, a target policy and a behavior policy. Tabular methods store the value functions and policies exactly in tables. In Q-learning, the optimal action-value function can be found through policy iteration, which requires both policy evaluation and policy improvement.
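To make this concrete, here is a minimal sketch of tabular Q-learning in Python. The 5-state chain environment, learning rate, and exploration rate are my own illustrative choices, not from a specific reference:

```python
import numpy as np

# A minimal sketch of tabular Q-learning on a hypothetical 5-state chain environment:
# states 0..4, actions 0 (left) / 1 (right), reward +1 only on reaching the last state.

N_STATES, N_ACTIONS = 5, 2

def step(state, action):
    """Deterministic toy dynamics: move left or right along the chain."""
    next_state = max(0, state - 1) if action == 0 else min(N_STATES - 1, state + 1)
    reward = 1.0 if next_state == N_STATES - 1 else 0.0
    done = next_state == N_STATES - 1
    return next_state, reward, done

Q = np.zeros((N_STATES, N_ACTIONS))   # tabular action-value estimates
alpha, gamma, epsilon = 0.1, 0.99, 0.1
rng = np.random.default_rng(0)

for episode in range(500):
    state, done = 0, False
    while not done:
        # epsilon-greedy behavior policy
        if rng.random() < epsilon:
            action = int(rng.integers(N_ACTIONS))
        else:
            action = int(np.argmax(Q[state]))
        next_state, reward, done = step(state, action)
        # off-policy update: the target uses the greedy (max) action value
        Q[state, action] += alpha * (reward + gamma * np.max(Q[next_state]) - Q[state, action])
        state = next_state

print(np.argmax(Q, axis=1))  # learned greedy policy, one action per state
```

Note how the update target takes the max over next actions regardless of what the behavior policy actually does next; that is exactly what makes Q-learning off-policy.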

Approximate Solution Methods

Tabular methods give exact values stored in tables. As the environment becomes more complex, the required computational power grows, so tabular methods are not efficient for large-scale environments. As a solution to these problems, approximation can be used. Approximation helps absorb both the hidden-state problem and the scalability problem.

The action-value function can be declared as a parameterized function approximator,

q̂(s, a, θ) ≈ qπ(s, a),

where θ is the parameter vector. θ can be updated using semi-gradient SARSA. SARSA stands for State-Action-Reward-State-Action, an algorithm used to learn a policy in a Markov decision process.

In SARSA, the behavior policy is equal to the target policy. That means SARSA is an on-policy method.

The action-value function qπ(s, a) is approximated by the SARSA algorithm, so the policy being evaluated should change only gradually. Normally, the performance of on-policy methods is better than that of off-policy methods in this setting. Off-policy methods like Q-learning can also work with function approximation, but this can cause reliability problems: combining bootstrapping, function approximation, and off-policy learning risks divergence and instability. However, giving up any one of the three is not easy. Function approximation is really important for computation and scalability, bootstrapping is needed for data and computational efficiency, and off-policy learning is important for finding the best policy.
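Here is a minimal sketch of semi-gradient SARSA with a linear approximator. The one-hot state features and the toy chain environment are illustrative assumptions; with richer features the same update scales to large state spaces:

```python
import numpy as np

# Semi-gradient SARSA with a linear function approximator: q_hat(s, a, theta) = theta[a] . x(s).
# The one-hot features and toy chain environment below are illustrative assumptions.

N_STATES, N_ACTIONS = 5, 2

def features(state):
    """One-hot features; with table-sized features this reduces to tabular SARSA."""
    x = np.zeros(N_STATES)
    x[state] = 1.0
    return x

def step(state, action):
    next_state = max(0, state - 1) if action == 0 else min(N_STATES - 1, state + 1)
    reward = 1.0 if next_state == N_STATES - 1 else 0.0
    return next_state, reward, next_state == N_STATES - 1

theta = np.zeros((N_ACTIONS, N_STATES))     # one weight vector per action
alpha, gamma, epsilon = 0.1, 0.99, 0.1
rng = np.random.default_rng(0)

def q_hat(state, action):
    return theta[action] @ features(state)

def epsilon_greedy(state):
    if rng.random() < epsilon:
        return int(rng.integers(N_ACTIONS))
    return int(np.argmax([q_hat(state, a) for a in range(N_ACTIONS)]))

for episode in range(500):
    state, done = 0, False
    action = epsilon_greedy(state)
    while not done:
        next_state, reward, done = step(state, action)
        next_action = epsilon_greedy(next_state)
        # on-policy target: bootstrap on the action the agent will actually take next
        target = reward + (0.0 if done else gamma * q_hat(next_state, next_action))
        theta[action] += alpha * (target - q_hat(state, action)) * features(state)
        state, action = next_state, next_action
```

The only structural difference from the Q-learning sketch above is the target: SARSA bootstraps on the next action chosen by the same ε-greedy policy, which is what makes it on-policy.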

Monte Carlo Method and Temporal-Difference Learning Method

Both of these learn directly from experience in episodes. The Monte Carlo method learns from complete episodes by sampling: it waits until an episode ends before updating. Monte Carlo has high variance, and the optimal behavior is learned by interacting with the environment directly. However, the Monte Carlo method needs full episodes to solve a problem, and it does not rely on the Markov property.

The temporal-difference (TD) method combines bootstrapping and sampling to learn from episodes. It never waits until the end of an episode, so it can work with incomplete episodes. Temporal-difference learning methods have lower variance. The main difference between the two is that temporal-difference learning exploits the Markov property through bootstrapping, while Monte Carlo does not. In this way TD learning binds together dynamic programming and Monte Carlo methods.
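The contrast is easiest to see in the value-prediction updates themselves. Below is a small sketch for a fixed policy; the episode format and step size are illustrative assumptions:

```python
import numpy as np

# Contrasting Monte Carlo and TD(0) prediction for a fixed policy.
# `episode` is assumed to be a list of (state, reward) pairs collected under that policy.

N_STATES = 5
alpha, gamma = 0.1, 0.99
V = np.zeros(N_STATES)   # state-value estimates

def monte_carlo_update(V, episode):
    """Waits for the complete episode, then moves each state toward its full sampled return."""
    G = 0.0
    for state, reward in reversed(episode):
        G = reward + gamma * G                  # full return observed from this state onward
        V[state] += alpha * (G - V[state])
    return V

def td0_update(V, state, reward, next_state, done):
    """Updates immediately from a single transition, bootstrapping on V[next_state]."""
    target = reward if done else reward + gamma * V[next_state]
    V[state] += alpha * (target - V[state])
    return V
```

Monte Carlo's target is the full return (unbiased, high variance), while TD(0)'s target reuses the current estimate of the next state (some bias, lower variance, usable before the episode ends).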

Policy Based Reinforcement Learning

In this method, an optimal policy is found directly, without a value function. It is very effective and has good convergence properties, and it can also learn stochastic policies. Policy-based methods are essentially optimization problems: a parameterized policy is updated to maximize the expected return, using either gradient-free or gradient-based optimization techniques. Both can use neural networks to represent the policy. Gradient-free methods perform well when the number of parameters is small, and they can also optimize non-differentiable policies. Actor-critic methods implement a generalized policy iteration, alternating between policy evaluation and policy improvement; the policy is called the actor and the estimated value function the critic.
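As a gradient-based example, here is a minimal sketch of REINFORCE with a softmax policy over linear preferences. The toy chain environment and step sizes are illustrative assumptions, not a definitive implementation:

```python
import numpy as np

# REINFORCE: update a parameterized stochastic policy directly, no value function.
# Toy chain environment and hyperparameters are illustrative assumptions.

N_STATES, N_ACTIONS = 5, 2
theta = np.zeros((N_STATES, N_ACTIONS))   # one preference per (state, action)
alpha, gamma = 0.05, 0.99
rng = np.random.default_rng(0)

def step(state, action):
    next_state = max(0, state - 1) if action == 0 else min(N_STATES - 1, state + 1)
    reward = 1.0 if next_state == N_STATES - 1 else 0.0
    return next_state, reward, next_state == N_STATES - 1

def policy(state):
    """Softmax over action preferences: a stochastic, differentiable policy."""
    prefs = theta[state] - np.max(theta[state])
    probs = np.exp(prefs)
    return probs / probs.sum()

for episode_idx in range(1000):
    # sample one complete episode with the current stochastic policy
    state, done, episode = 0, False, []
    while not done:
        action = int(rng.choice(N_ACTIONS, p=policy(state)))
        next_state, reward, done = step(state, action)
        episode.append((state, action, reward))
        state = next_state

    # gradient ascent on expected return: scale the log-policy gradient by the return
    G = 0.0
    for state, action, reward in reversed(episode):
        G = reward + gamma * G
        grad_log = -policy(state)
        grad_log[action] += 1.0               # gradient of log softmax at the taken action
        theta[state] += alpha * G * grad_log
```

An actor-critic method would replace the sampled return G with an estimate from a learned critic, trading variance for bias.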

Deep Q-Network Algorithm

The Deep Q-Network (DQN) algorithm combines reinforcement learning with the training of deep neural networks. DQN works from raw visual inputs and has been applied across a huge variety of environments. Combining reinforcement learning with function approximation creates an instability problem, and DQN addresses it with target networks and experience replay. The experience replay memory stores transitions in a cyclic buffer, which lets an agent sample and train on previous experience. DeepMind's DQN algorithm uses uniform sampling from this memory, and this already gives an effective learning algorithm. Experience replay is also known as a model-free technique. The second stabilizing technique is the target network, which breaks the correlation between the Q-network and its learning target. An MDP assumes that the agent has full visibility of the current state, whereas a Partially Observable Markov Decision Process (POMDP) assumes the agent receives only partial observations of the state. In such settings, a deep recurrent Q-network (DRQN) can adapt to changing observations better than a plain deep Q-network.
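To make the two stabilizing tricks concrete, here is a minimal PyTorch sketch of a Q-network with an experience replay buffer and a target network. The network sizes, hyperparameters, and buffer capacity are illustrative assumptions, the environment interaction loop is omitted, and this is not DeepMind's exact implementation:

```python
import random
from collections import deque

import torch
import torch.nn as nn

# Sketch of DQN's two stabilizers: a cyclic experience replay buffer and a target network.

class QNetwork(nn.Module):
    def __init__(self, obs_dim, n_actions):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 64), nn.ReLU(),
            nn.Linear(64, n_actions),
        )

    def forward(self, x):
        return self.net(x)

obs_dim, n_actions = 4, 2                         # illustrative sizes
q_net = QNetwork(obs_dim, n_actions)
target_net = QNetwork(obs_dim, n_actions)
target_net.load_state_dict(q_net.state_dict())   # target starts as a copy of the Q-network

optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
replay = deque(maxlen=10_000)                     # cyclic experience replay memory
gamma, batch_size = 0.99, 32

def store(state, action, reward, next_state, done):
    """Old transitions are overwritten once the buffer is full."""
    replay.append((state, action, reward, next_state, float(done)))

def train_step():
    if len(replay) < batch_size:
        return
    batch = random.sample(replay, batch_size)     # uniform sampling, as in the original DQN
    states, actions, rewards, next_states, dones = map(
        lambda xs: torch.as_tensor(xs, dtype=torch.float32), zip(*batch)
    )
    q_values = q_net(states).gather(1, actions.long().unsqueeze(1)).squeeze(1)
    with torch.no_grad():                         # the target network is held fixed here
        next_q = target_net(next_states).max(dim=1).values
        targets = rewards + gamma * (1.0 - dones) * next_q
    loss = nn.functional.mse_loss(q_values, targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

def sync_target():
    """Called every few thousand steps: copy the online weights into the target network."""
    target_net.load_state_dict(q_net.state_dict())
```

Sampling old transitions breaks the correlation between consecutive updates, and holding the target network fixed between syncs keeps the bootstrapped target from chasing the network's own rapidly changing estimates.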

I think you now have some idea about reinforcement learning algorithms. See you in the next one!
