GitHub RL: DP

Posted: 2018-07-31 14:16:56

These are my notes on the RL exercises hosted on GitHub:

https://github.com/dennybritz/reinforcement-learning/tree/master/DP

Implement Policy Evaluation in Python (Gridworld)

First, look at how env.P is structured in the OpenAI environment:

env: OpenAI env. env.P represents the transition probabilities of the environment.
            env.P[s][a] is a list of transition tuples (prob, next_state, reward, done).
            env.nS is the number of states in the environment.
            env.nA is the number of actions in the environment.
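To make the nested layout concrete, here is a minimal sketch that builds a dictionary in the same shape as env.P and walks through it. The two-state MDP, its numbers, and the helper `outgoing_probability` are hypothetical and chosen only for illustration; they are not taken from the Gridworld environment in the repository.

```python
# Hypothetical 2-state, 2-action MDP laid out like env.P:
# P[s][a] is a list of (prob, next_state, reward, done) tuples.
# State 1 plays the role of a terminal state: it loops to itself with reward 0.
P = {
    0: {
        0: [(1.0, 0, -1.0, False)],                         # action 0: stay in state 0
        1: [(0.8, 1, -1.0, True), (0.2, 0, -1.0, False)],   # action 1: usually reach state 1
    },
    1: {
        0: [(1.0, 1, 0.0, True)],
        1: [(1.0, 1, 0.0, True)],
    },
}
nS, nA = len(P), len(P[0])


def outgoing_probability(P, s, a):
    """Sum of the transition probabilities for taking action a in state s."""
    return sum(prob for prob, _next_state, _reward, _done in P[s][a])


# Unpack every transition tuple, the same way the evaluation loop will.
for s in P:
    for a in P[s]:
        for prob, next_state, reward, done in P[s][a]:
            print(f"s={s} a={a} -> s'={next_state} p={prob} r={reward} done={done}")
```

For every state-action pair the probabilities should sum to 1, which is a quick sanity check worth running on any hand-built model.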


Recall the iterative update used in policy evaluation:

[Figure: iterative policy evaluation update equation]
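For reference, the standard iterative policy evaluation backup and its matrix form can be written as below. This is a sketch in the usual textbook notation, which is an assumption on my part rather than something stated in the post; the symbols R_\pi and P_\pi are chosen to match the arrays R_pi and P_pi in the code that follows, and \gamma corresponds to discount_factor.

```latex
\begin{align}
v_{k+1}(s) &= \sum_{a} \pi(a \mid s) \sum_{s'} p(s' \mid s, a)\,
              \bigl[ r(s, a, s') + \gamma\, v_k(s') \bigr] \\
R_\pi(s)   &= \sum_{a} \pi(a \mid s) \sum_{s'} p(s' \mid s, a)\, r(s, a, s') \\
P_\pi(s, s') &= \sum_{a} \pi(a \mid s)\, p(s' \mid s, a) \\
v_{k+1}    &= R_\pi + \gamma\, P_\pi\, v_k
\end{align}
```

Splitting the bracket in the first line into its reward part and its value part gives the second and third definitions, and the last line is the same update applied to all states at once.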

Compute it in vectorized form:

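import numpy as np

def policy_eval(policy, env, discount_factor=1.0, theta=0.00001):
    # Expected one-step reward and state-to-state transition matrix under the policy
    R_pi = np.zeros(env.nS)
    P_pi = np.zeros((env.nS, env.nS))
    v_pi = np.zeros(env.nS)
    for s, s_item in env.P.items():
        for a, a_item in s_item.items():
            for prob, next_state, reward, _ in a_item:
                # The reward must be weighted by the transition probability too
                R_pi[s] += policy[s, a] * prob * reward
                P_pi[s, next_state] += policy[s, a] * prob
    # Start above theta so that the loop body runs at least once
    v_change = np.ones(env.nS)
    while (np.abs(v_change) > theta).any():
        # One sweep over all states: v <- R_pi + gamma * P_pi v
        v_change = R_pi + discount_factor * np.dot(P_pi, v_pi) - v_pi
        v_pi += v_change
    return v_pi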

The code first expands env.P to compute R and P under the policy, then iterates the update until the values converge.
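One way to check this two-step procedure is to compare the iterative result with the direct solution of the linear system (I - gamma * P_pi) v = R_pi, which exists whenever gamma < 1. The sketch below does that on a hypothetical two-state MDP; the toy model, the uniform random policy, gamma = 0.9, and the helper names `build_model` and `policy_eval_iterative` are all assumptions made for illustration and do not come from the repository.

```python
import numpy as np


def build_model(P, policy, nS):
    """Collapse an env.P-style dict and a policy into R_pi (nS,) and P_pi (nS, nS)."""
    R_pi = np.zeros(nS)
    P_pi = np.zeros((nS, nS))
    for s, actions in P.items():
        for a, transitions in actions.items():
            for prob, next_state, reward, _done in transitions:
                R_pi[s] += policy[s, a] * prob * reward
                P_pi[s, next_state] += policy[s, a] * prob
    return R_pi, P_pi


def policy_eval_iterative(R_pi, P_pi, discount_factor, theta=1e-10):
    """Repeat v <- R_pi + gamma * P_pi v until the largest change drops below theta."""
    v = np.zeros_like(R_pi)
    while True:
        v_new = R_pi + discount_factor * (P_pi @ v)
        delta = np.max(np.abs(v_new - v))
        v = v_new
        if delta < theta:
            return v


# Hypothetical toy MDP in the env.P layout; state 1 is absorbing with reward 0.
P = {
    0: {0: [(1.0, 0, -1.0, False)],
        1: [(0.8, 1, -1.0, True), (0.2, 0, -1.0, False)]},
    1: {0: [(1.0, 1, 0.0, True)],
        1: [(1.0, 1, 0.0, True)]},
}
nS, nA = 2, 2
policy = np.full((nS, nA), 1.0 / nA)   # uniform random policy
gamma = 0.9                            # gamma < 1 keeps I - gamma * P_pi invertible

R_pi, P_pi = build_model(P, policy, nS)
v_iter = policy_eval_iterative(R_pi, P_pi, gamma)
v_exact = np.linalg.solve(np.eye(nS) - gamma * P_pi, R_pi)
print("iterative:", v_iter)
print("direct:   ", v_exact)
```

The direct solve is only practical for small state spaces and needs gamma < 1 (or another guarantee that the matrix is invertible), which is why the iterative sweep is the usual choice for the undiscounted Gridworld exercise.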


Original post: https://www.cnblogs.com/esoteric/p/9395261.html
