The green bars show the reward function; the blue curve shows the probability of different trajectories.
If every green bar is raised by the same amount (giving the yellow bars), the resulting policy gradient update will change!
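
The effect described for the figure can be reproduced numerically. The following is a minimal sketch, not code from the lecture: it assumes a 1-D Gaussian policy where a "trajectory" is a single number drawn from N(mu, 1), and it uses a made-up quadratic reward. The names `grad_estimate`, `reward`, and `shift` are hypothetical. With a finite number of samples, adding the same constant to every reward changes the estimated gradient by `shift * mean(tau - mu)`, which is generally nonzero.

```python
import random


def reward(tau):
    # Hypothetical reward (the "green bars"): highest near tau = 1.
    return -(tau - 1.0) ** 2


def grad_estimate(samples, mu, shift=0.0):
    # Sample-based policy gradient estimate with respect to mu.
    # For tau ~ N(mu, 1), d/dmu log p(tau) = tau - mu.
    total = sum((tau - mu) * (reward(tau) + shift) for tau in samples)
    return total / len(samples)


rng = random.Random(0)  # fixed seed so the run is repeatable
mu = 0.0
samples = [rng.gauss(mu, 1.0) for _ in range(5)]  # only a few trajectories

g_green = grad_estimate(samples, mu)               # original rewards
g_yellow = grad_estimate(samples, mu, shift=10.0)  # every reward raised by 10

# The gap between the two estimates equals shift * mean(tau - mu).
score_mean = sum(tau - mu for tau in samples) / len(samples)
print("green estimate: ", g_green)
print("yellow estimate:", g_yellow)
print("difference:     ", g_yellow - g_green)
print("shift * mean(tau - mu):", 10.0 * score_mean)
```

In expectation `mean(tau - mu)` is zero, so the shifted and unshifted estimates agree on average; the difference shows up only because the sample is finite, and it shrinks as the number of samples grows.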
CS294-112 Deep Reinforcement Learning, Fall Semester (Berkeley), No. 4: Policy gradients introduction
Original post: https://www.cnblogs.com/ecoflex/p/9085805.html