
github RL: DP

Date: 2018-07-31 14:16:56

These are my notes on the RL exercises hosted on GitHub.

https://github.com/dennybritz/reinforcement-learning/tree/master/DP

Implement Policy Evaluation in Python (Gridworld)

First, look at how the OpenAI env.P is structured:

env: OpenAI env. env.P represents the transition probabilities of the environment.
    env.P[s][a] is a list of transition tuples (prob, next_state, reward, done).
    env.nS is the number of states in the environment.
    env.nA is the number of actions in the environment.
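To make the layout concrete, here is a minimal sketch of a dictionary shaped like env.P. The two-state MDP, its rewards, and the helper `check_transitions` are made up for illustration; they are not the Gridworld from the exercise.

```python
# Hypothetical 2-state, 2-action MDP laid out like env.P (not the real Gridworld).
# P[s][a] is a list of (prob, next_state, reward, done) tuples.
P = {
    0: {0: [(1.0, 0, -1.0, False)], 1: [(1.0, 1, -1.0, True)]},
    1: {0: [(1.0, 1, 0.0, True)], 1: [(1.0, 1, 0.0, True)]},
}
nS, nA = 2, 2


def check_transitions(P):
    """Return True if the probabilities of every (state, action) pair sum to 1."""
    for s, actions in P.items():
        for a, transitions in actions.items():
            total = sum(prob for prob, _next_state, _reward, _done in transitions)
            if abs(total - 1.0) > 1e-9:
                return False
    return True


# Walk the transitions for taking action 1 in state 0.
for prob, next_state, reward, done in P[0][1]:
    print(prob, next_state, reward, done)
```

Each P[s][a] can hold several tuples when the environment is stochastic, which is why the evaluation code loops over the list instead of unpacking a single tuple.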


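Recall the iterative update for policy evaluation. The original image is missing here; in standard notation the update is:

$$v_{k+1}(s) = \sum_{a} \pi(a \mid s) \sum_{s'} P(s' \mid s, a) \left[ r(s, a, s') + \gamma\, v_k(s') \right]$$

In matrix form, with $R_\pi(s) = \sum_a \pi(a \mid s) \sum_{s'} P(s' \mid s, a)\, r(s, a, s')$ and $P_\pi(s, s') = \sum_a \pi(a \mid s)\, P(s' \mid s, a)$:

$$v_{k+1} = R_\pi + \gamma P_\pi v_k$$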

Compute it in vectorized form:

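import numpy as np

def policy_eval(policy, env, discount_factor=1.0, theta=0.00001):
    # Expected one-step reward and state-to-state transition matrix under the policy.
    R_pi = np.zeros(shape=(env.nS))
    P_pi = np.zeros(shape=(env.nS, env.nS))
    v_pi = np.zeros(shape=(env.nS))
    for s, s_item in env.P.items():
        for a, a_item in s_item.items():
            for dis in a_item:
                prob, next_state, reward, _ = dis
                # Weight the reward by prob too, so stochastic transitions are handled.
                R_pi[s] += policy[s, a] * prob * reward
                P_pi[s, next_state] += policy[s, a] * prob
    # Start above theta so the loop runs at least once.
    v_change = np.ones(shape=(env.nS))
    while (np.abs(v_change) > theta).any():
        # One full backup: v <- R_pi + gamma * P_pi v
        v_change = R_pi + discount_factor * np.dot(P_pi, v_pi) - v_pi
        v_pi += v_change
    return v_pi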

First unroll env.P to compute R and P, then iterate until convergence.
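As a sanity check on the two steps described above, the sketch below runs them on a small hand-built MDP and compares the iterative result with the closed-form solution of v = R_pi + gamma * P_pi v. The three-state chain, the uniform policy, gamma = 0.9, and the function names `build_model` and `evaluate_iteratively` are all assumptions chosen for illustration; none of them come from the exercise.

```python
import numpy as np


def build_model(P, policy, nS):
    """Unroll an env.P-style dict into R_pi (expected reward) and P_pi (transition matrix)."""
    R_pi = np.zeros(nS)
    P_pi = np.zeros((nS, nS))
    for s, actions in P.items():
        for a, transitions in actions.items():
            for prob, next_state, reward, _done in transitions:
                R_pi[s] += policy[s, a] * prob * reward
                P_pi[s, next_state] += policy[s, a] * prob
    return R_pi, P_pi


def evaluate_iteratively(R_pi, P_pi, gamma, theta=1e-8):
    """Repeat v <- R_pi + gamma * P_pi v until the largest change drops below theta."""
    v = np.zeros_like(R_pi)
    while True:
        delta = R_pi + gamma * P_pi @ v - v
        v += delta
        if np.abs(delta).max() < theta:
            return v


# Hypothetical chain: states 0 and 1 cost -1 per step, state 2 is absorbing with reward 0.
# Action 0 moves left (state 0 stays put), action 1 moves right.
P = {
    0: {0: [(1.0, 0, -1.0, False)], 1: [(1.0, 1, -1.0, False)]},
    1: {0: [(1.0, 0, -1.0, False)], 1: [(1.0, 2, -1.0, True)]},
    2: {0: [(1.0, 2, 0.0, True)], 1: [(1.0, 2, 0.0, True)]},
}
policy = np.full((3, 2), 0.5)  # uniform random policy
gamma = 0.9

R_pi, P_pi = build_model(P, policy, nS=3)
v_iter = evaluate_iteratively(R_pi, P_pi, gamma)
# Closed form: (I - gamma * P_pi) v = R_pi, valid here because gamma < 1.
v_exact = np.linalg.solve(np.eye(3) - gamma * P_pi, R_pi)
print(v_iter)
print(v_exact)
```

The direct solve is only practical for small state spaces; the iterative form is what scales, and it is the one the exercise asks for.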

 


Original post: https://www.cnblogs.com/esoteric/p/9395261.html
