
Temporal-Difference Learning for Prediction


In Monte Carlo learning, we estimate the value function with the incremental update:

$$V(S_t) \leftarrow V(S_t) + \alpha \big( G_t - V(S_t) \big)$$

 

$G_t$ is the return of the episode from time $t$, which can be calculated as:

$$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots + \gamma^{T-t-1} R_T$$

Recall that $G_t$ can only be computed at the end of a given episode. This reveals a disadvantage of Monte Carlo learning: we have to wait until the episode ends before making any update.
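To make the "wait until the end" point concrete, below is a minimal Python sketch of incremental every-visit Monte Carlo prediction for one finished episode. The names here (mc_update, a dict V, states[t] holding $S_t$ and rewards[t] holding $R_{t+1}$) are illustrative assumptions of mine, not something from the original post.

# Minimal incremental Monte Carlo update for ONE finished episode.
# Illustrative assumption: states[t] is S_t and rewards[t] is R_{t+1}.
def mc_update(V, states, rewards, alpha=0.1, gamma=1.0):
    """Apply V(S_t) <- V(S_t) + alpha * (G_t - V(S_t)) for every visited state."""
    G = 0.0
    # Walk the episode backwards so G_t = R_{t+1} + gamma * G_{t+1} accumulates in one pass.
    for t in reversed(range(len(states))):
        G = rewards[t] + gamma * G
        v_old = V.get(states[t], 0.0)
        V[states[t]] = v_old + alpha * (G - v_old)

Note that this update can only run once both lists are complete, i.e. after the episode has terminated.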

 

The TD(0) algorithm replaces $G_t$ in that update with the immediate reward plus the discounted estimated value of the next state:

$$V(S_t) \leftarrow V(S_t) + \alpha \big( R_{t+1} + \gamma V(S_{t+1}) - V(S_t) \big)$$

The algorithm updates the estimated state-value function at time $t+1$, because by then everything in the equation is known: we wait until the agent reaches the next state, so that it has observed the immediate reward $R_{t+1}$ and knows which state $S_{t+1}$ the environment has transitioned to.
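In contrast to Monte Carlo, a single TD(0) update needs only the observed transition $(S_t, R_{t+1}, S_{t+1})$. Here is a minimal sketch under the same illustrative assumptions as above (V is a dict; the function name td0_update is mine):

def td0_update(V, s, r, s_next, alpha=0.1, gamma=1.0):
    """One TD(0) step: V(S_t) <- V(S_t) + alpha * (R_{t+1} + gamma * V(S_{t+1}) - V(S_t))."""
    td_target = r + gamma * V.get(s_next, 0.0)   # R_{t+1} + gamma * V(S_{t+1})
    td_error = td_target - V.get(s, 0.0)         # delta_t, the TD error
    V[s] = V.get(s, 0.0) + alpha * td_error

This can be called as soon as the transition at time $t+1$ has been observed; nothing else about the episode is needed.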

 

The equations below give the state-value function as used in Dynamic Programming, where the complete environment dynamics are known. Compare TD(0) with these equations:

$$v_\pi(s) \doteq \mathbb{E}_\pi\!\left[ G_t \mid S_t = s \right] \tag{6.3}$$

$$\;\;\;\;\;\; = \mathbb{E}_\pi\!\left[ R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s \right] \tag{6.4}$$

The TD(0) update is quite like the Bellman equation (6.4), but it does not take the expectation. Instead, it uses a single sampled transition, i.e. the experience gathered so far, to estimate how much reward the agent will get from this state onward. The whole algorithm can be summarized as:

Tabular TD(0) for estimating $v_\pi$:

    Input: the policy $\pi$ to be evaluated
    Initialize $V(s)$ arbitrarily for all states (e.g. $V(s) = 0$), with $V(\text{terminal}) = 0$
    Loop for each episode:
        Initialize $S$
        Loop for each step of the episode, until $S$ is terminal:
            $A \leftarrow$ action given by $\pi$ for $S$
            Take action $A$, observe $R$, $S'$
            $V(S) \leftarrow V(S) + \alpha \big[ R + \gamma V(S') - V(S) \big]$
            $S \leftarrow S'$
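Putting the pieces together, a tabular TD(0) prediction loop might look like the sketch below. The environment interface (env.reset() returning a state, env.step(action) returning (next_state, reward, done)) and the policy(state) function are assumptions made for illustration; they are not part of the original post.

from collections import defaultdict

def td0_prediction(env, policy, num_episodes, alpha=0.1, gamma=1.0):
    """Estimate v_pi with tabular TD(0), assuming a gym-like env interface."""
    V = defaultdict(float)                  # V(s) initialized to 0 for every state
    for _ in range(num_episodes):
        state = env.reset()
        done = False
        while not done:
            action = policy(state)                        # A given by pi for S
            next_state, reward, done = env.step(action)   # observe R, S'
            # Bootstrapped target; the value of a terminal state is taken as 0.
            target = reward + gamma * V[next_state] * (not done)
            V[state] += alpha * (target - V[state])       # TD(0) update
            state = next_state
    return V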

 

TD Target, TD Error

In the update above, the quantity $R_{t+1} + \gamma V(S_{t+1})$ is called the TD target, and the difference $\delta_t = R_{t+1} + \gamma V(S_{t+1}) - V(S_t)$ is called the TD error.

Bias/Variance trade-off

The TD target is a biased estimate of $v_\pi(S_t)$, because it relies on the current estimate $V(S_{t+1})$, but it has much lower variance than the Monte Carlo return $G_t$, which depends on many random actions, transitions and rewards over the rest of the episode.

Bootstrapping

TD(0) bootstraps: it updates an estimate partly on the basis of another estimate, $V(S_{t+1})$, instead of waiting for the final outcome as Monte Carlo does.


Original post: https://www.cnblogs.com/rhyswang/p/11230642.html
