1.1
Self-Play Suppose, instead of playing against a random opponent, the reinforcement learning algorithm described above played against itself, with both sides learning. What do you think would happen in this case? Would it learn a different policy for selecting moves?
1.2
Symmetries Many tic-tac-toe positions appear different but are really the same because of symmetries. How might we amend the learning process described above to take advantage of this? In what ways would this change improve the learning process? Now think again. Suppose the opponent did not take advantage of symmetries. In that case, should we? Is it true, then, that symmetrically equivalent positions should necessarily have the same value?
1.3
Greedy Play Suppose the reinforcement learning player was greedy, that is, it always played the move that brought it to the position that it rated the best. Might it learn to play better, or worse, than a nongreedy player? What problems might occur?
1.4
Learning from Exploration Suppose learning updates occurred after all moves, including exploratory moves. If the step-size parameter is appropriately reduced over time (but not the tendency to explore), then the state values would converge to a different set of probabilities. What (conceptually) are the two sets of probabilities computed when we do, and when we do not, learn from exploratory moves? Assuming that we do continue to make exploratory moves, which set of probabilities might be better to learn? Which would result in more wins?
1.5
Other Improvements Can you think of other ways to improve the reinforcement learning player? Can you think of any better way to solve the tic-tac-toe problem as posed?
2.1
In \(\epsilon\)-greedy action selection, for the case of two actions and \(\epsilon = 0.5\), what is the probability that the greedy action is selected?
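A quick worked check (assuming, as in the chapter, that the exploratory draw is uniform over all actions, including the greedy one): the greedy action is selected either when we exploit or when the random draw happens to land on it,

\[
P(\text{greedy}) = (1-\epsilon) + \frac{\epsilon}{2} = 0.5 + 0.25 = 0.75 .
\]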
2.2
Bandit example Consider a k-armed bandit problem with \(k = 4\) actions, denoted 1, 2, 3, and 4. Consider applying to this problem a bandit algorithm using \(\epsilon\)-greedy action selection, sample-average action-value estimates, and initial estimates of \(Q_1(a) = 0\), for all \(a\). Suppose the initial sequence of actions and rewards is \(A_1 = 1,R_1=-1,A_2=2,R_2=1,A_3=2,R_3=-2,A_4=2,R_4=2,A_5=3,R_5=0\). On some of these time steps the \(\epsilon\) case may have occurred, causing an action to be selected at random. On which time steps did this definitely occur? On which time steps could this possibly have occurred?
The table below shows the sample-average estimates \(Q(a)\) at the start of each time step, together with the action selected and the reward received:

time step | \(Q(1)\) | \(Q(2)\) | \(Q(3)\) | \(Q(4)\) | action | reward |
---|---|---|---|---|---|---|
1 | 0 | 0 | 0 | 0 | 1 | -1 |
2 | -1 | 0 | 0 | 0 | 2 | 1 |
3 | -1 | 1 | 0 | 0 | 2 | -2 |
4 | -1 | -0.5 | 0 | 0 | 2 | 2 |
5 | -1 | 1/3 | 0 | 0 | 3 | 0 |
We only need to check, at each time step, whether the selected action corresponds to the largest \(Q\) value at that moment; if it does not, the selection must have been a random (exploratory) action. By this test, time steps 4 and 5 were definitely random selections; on every other time step the action could have been chosen at random (a random draw can also land on a greedy action) or greedily.
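A small sketch (my own helper code, not from the original post) that replays the given sequence with incremental sample-average updates and flags the steps where the chosen action is not among the current greedy actions:

```python
import numpy as np

# Given sequence of (action, reward) pairs from the exercise (actions are 1-indexed).
episode = [(1, -1), (2, 1), (2, -2), (2, 2), (3, 0)]

q = np.zeros(4)   # sample-average estimates Q(a), initialised to 0
n = np.zeros(4)   # visit counts N(a)

for t, (a, r) in enumerate(episode, start=1):
    greedy = np.flatnonzero(q == q.max()) + 1   # greedy actions at the start of step t
    tag = "definitely exploratory" if a not in greedy else "possibly greedy"
    print(f"step {t}: Q = {q.round(3)}, chose {a} -> {tag}")
    # incremental sample-average update
    n[a - 1] += 1
    q[a - 1] += (r - q[a - 1]) / n[a - 1]
```

Running it reproduces the table above and marks steps 4 and 5 as necessarily exploratory.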
2.3
In the comparison shown in Figure 2.2, which method will perform best in the long run in terms of cumulative reward and probability of selecting the best action? How much better will it be? Express your answer quantitatively.
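A rough quantitative sketch, assuming both \(\epsilon\)-greedy methods eventually learn essentially exact value estimates on the 10-armed testbed: in the long run each selects the optimal action with probability about

\[
(1-\epsilon) + \frac{\epsilon}{10},
\]

i.e. roughly \(0.91\) for \(\epsilon = 0.1\) and \(0.991\) for \(\epsilon = 0.01\), so the \(\epsilon = 0.01\) method ends up choosing the best action about 8 percentage points more often and earns a correspondingly higher average reward per step.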
2.4
If the step-size parameters, \(\alpha_n\), are not constant, then the estimate \(Q_n\) is a weighted average of previously received rewards with a weighting different from that given by (2.6). What is the weighting on each prior reward for the general case, analogous to (2.6), in terms of the sequence of step-size parameters?
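A sketch of the general unrolling, following the same algebra that leads to (2.6) but with step sizes \(\alpha_n\):

\[
Q_{n+1} = Q_n + \alpha_n\bigl(R_n - Q_n\bigr)
        = \prod_{i=1}^{n}(1-\alpha_i)\,Q_1
        + \sum_{i=1}^{n}\alpha_i\prod_{j=i+1}^{n}(1-\alpha_j)\,R_i ,
\]

so the weight on each prior reward \(R_i\) is \(\alpha_i\prod_{j=i+1}^{n}(1-\alpha_j)\), with the empty product for \(i=n\) taken to be 1.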
2.5
(programming) Design and conduct an experiment to demonstrate the difficulties that sample-average methods have for nonstationary problems. Use a modified version of the 10-armed testbed in which all the \(q_*(a)\) start out equal and then take independent random walks (say by adding a normally distributed increment with mean zero and standard deviation 0.01 to all the \(q_*(a)\) on each step). Prepare plots like Figure 2.2 for an action-value method using sample averages, incrementally computed, and another action-value method using a constant step-size parameter, \(\alpha = 0.1\). Use \(\epsilon = 0.1\) and longer runs, say of 10,000 steps.
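A minimal sketch of one way to set up this experiment (function and variable names such as `run_bandit` are my own choices, and plotting is left out):

```python
import numpy as np

def run_bandit(steps=10_000, runs=200, k=10, eps=0.1, alpha=None, seed=0):
    """Nonstationary testbed: all q*(a) start equal and take independent
    random walks (mean-zero normal increments, sd 0.01) on every step.
    alpha=None uses incrementally computed sample averages; otherwise a
    constant step size alpha is used.  runs=200 keeps this quick;
    Figure 2.2 averages over 2000 runs."""
    rng = np.random.default_rng(seed)
    avg_reward = np.zeros(steps)
    optimal_frac = np.zeros(steps)
    for _ in range(runs):
        q_true = np.zeros(k)
        q_est = np.zeros(k)
        counts = np.zeros(k)
        for t in range(steps):
            # epsilon-greedy selection
            if rng.random() < eps:
                a = rng.integers(k)
            else:
                a = rng.choice(np.flatnonzero(q_est == q_est.max()))
            r = rng.normal(q_true[a], 1.0)
            counts[a] += 1
            step = 1.0 / counts[a] if alpha is None else alpha
            q_est[a] += step * (r - q_est[a])
            avg_reward[t] += r
            optimal_frac[t] += (a == np.argmax(q_true))
            q_true += rng.normal(0.0, 0.01, size=k)   # random walk of true values
    return avg_reward / runs, optimal_frac / runs

sample_avg = run_bandit(alpha=None)   # sample-average method
const_step = run_bandit(alpha=0.1)    # constant step size alpha = 0.1
```

Plotting `sample_avg` and `const_step` as in Figure 2.2 should show the constant-step-size method tracking the drifting action values much better in the later steps.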
2.6
Mysterious Spikes The results shown in Figure 2.3 should be quite reliable because they are averages over 2000 individual, randomly chosen 10-armed bandit tasks. Why, then, are there oscillations and spikes in the early part of the curve for the optimistic method? In other words, what might make this method perform particularly better or worse, on average, on particular early steps?
2.7
Unbiased Constant-Step-Size Trick In most of this chapter we have used sample averages to estimate action values because sample averages do not produce the initial bias that constant step sizes do (see the analysis leading to (2.6)). However, sample averages are not a completely satisfactory solution because they may perform poorly on nonstationary problems. Is it possible to avoid the bias of constant step sizes while retaining their advantages on nonstationary problems? One way is to use a step size of

\[
\beta_n \doteq \frac{\alpha}{\bar{o}_n}
\]

to process the \(n\)th reward for a particular action, where \(\alpha>0\) is a conventional constant step size, and \(\bar{o}_n\) is a trace of one that starts at 0:

\[
\bar{o}_n \doteq \bar{o}_{n-1} + \alpha\,(1-\bar{o}_{n-1}), \qquad n \ge 1, \quad \bar{o}_0 \doteq 0 .
\]
Carry out an analysis like that in (2.6) to show that \(Q_n\) is an exponential recency-weighted average without initial bias.
Unrolling the update \(Q_{n+1} = Q_n + \beta_n\,(R_n - Q_n)\) in the same way as (2.6) gives

\[
Q_{n+1} = \prod_{i=1}^{n}(1-\beta_i)\,Q_1 + \sum_{i=1}^{n}\beta_i\prod_{j=i+1}^{n}(1-\beta_j)\,R_i ,
\]

which is an exponential recency-weighted average of the rewards. To show that it carries no initial bias we only need to show that it does not depend on \(Q_1\), i.e. that the coefficient \(\prod_{i=1}^{n}(1-\beta_i)\) of \(Q_1\) is zero. Since \(\bar{o}_1 = \bar{o}_0 + \alpha(1-\bar{o}_0) = \alpha\), the first step size is \(\beta_1 = \alpha/\bar{o}_1 = 1\), so the factor \((1-\beta_1) = 0\) makes the whole product vanish and \(Q_{n+1}\) is independent of \(Q_1\).
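A small numerical sanity check (my own sketch; the helper name `estimate` is hypothetical): with the \(\beta_n\) step sizes the estimate is identical for any initial value \(Q_1\), while a plain constant step size still carries some of \(Q_1\):

```python
def estimate(rewards, q1, alpha=0.1, unbiased=True):
    """Incremental estimate with constant step size alpha, optionally using
    the unbiased trick beta_n = alpha / o_bar_n from Exercise 2.7."""
    q, o_bar = q1, 0.0
    for r in rewards:
        if unbiased:
            o_bar += alpha * (1.0 - o_bar)   # trace of one, o_bar_0 = 0
            step = alpha / o_bar             # beta_n
        else:
            step = alpha
        q += step * (r - q)
    return q

rewards = [1.0, -2.0, 0.5, 3.0]
print(estimate(rewards, q1=0.0), estimate(rewards, q1=100.0))    # identical
print(estimate(rewards, q1=0.0, unbiased=False),
      estimate(rewards, q1=100.0, unbiased=False))               # differ
```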
2.8
UCB Spikes In Figure 2.4 the UCB algorithm shows a distinct spike in performance on the 11th step. Why is this? Note that for your answer to be fully satisfactory it must explain both why the reward increases on the 11th step and why it decreases on the subsequent steps. Hint: If \(c = 1\), then the spike is less prominent.
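For reference, a sketch of the UCB selection rule the figure is based on, with untried actions (\(N_t(a) = 0\)) treated as maximizing:

\[
A_t \doteq \arg\max_a \left[\, Q_t(a) + c\,\sqrt{\frac{\ln t}{N_t(a)}} \,\right].
\]

With \(k = 10\) arms this forces every action to be tried once during the first 10 steps, so step 11 is the first step on which the value estimates and uncertainty bonuses actually compete.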
2.9
Show that in the case of two actions, the soft-max distribution is the same as that given by the logistic, or sigmoid, function often used in statistics and artificial neural networks.
For two actions the soft-max distribution is

\[
\pi_t(1) = \frac{e^{H_t(1)}}{e^{H_t(1)} + e^{H_t(2)}} = \frac{1}{1 + e^{-(H_t(1)-H_t(2))}},
\]

which is exactly the sigmoid (logistic) form; the probability of each action depends only on the difference between the two preferences \(H_t(a)\).
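A short numeric check (my own sketch; the helper names are hypothetical) that the two expressions agree:

```python
import numpy as np

def softmax(h):
    e = np.exp(h - h.max())          # subtract max for numerical stability
    return e / e.sum()

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

h = np.array([0.7, -1.2])            # arbitrary preferences H_t(1), H_t(2)
print(softmax(h)[0], sigmoid(h[0] - h[1]))   # same value
```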
2.10
Suppose you face a 2-armed bandit task whose true action values change randomly from time step to time step. Specifically, suppose that, for any time step, the true values of actions 1 and 2 are respectively 0.1 and 0.2 with probability 0.5 (case A), and 0.9 and 0.8 with probability 0.5 (case B). If you are not able to tell which case you face at any step, what is the best expectation of success you can achieve and how should you behave to achieve it? Now suppose that on each step you are told whether you are facing case A or case B (although you still don’t know the true action values). This is an associative search task. What is the best expectation of success you can achieve in this task, and how should you behave to achieve it?
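A worked sketch of the arithmetic behind both parts: without knowing the case, every way of behaving has the same expected reward,

\[
0.5\cdot 0.1 + 0.5\cdot 0.9 \;=\; 0.5\cdot 0.2 + 0.5\cdot 0.8 \;=\; 0.5,
\]

so the best achievable expectation of success is \(0.5\). If the case is revealed on each step, choosing action 2 in case A and action 1 in case B gives the best expectation,

\[
0.5\cdot 0.2 + 0.5\cdot 0.9 = 0.55 .
\]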
2.11
(programming) Make a figure analogous to Figure 2.6 for the nonstationary case outlined in Exercise 2.5. Include the constant-step-size \(\epsilon\)-greedy algorithm with \(\alpha=0.1\). Use runs of 200,000 steps and, as a performance measure for each algorithm and parameter setting, use the average reward over the last 100,000 steps.
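A sketch of how the parameter study could be organised, reusing the `run_bandit` helper from the Exercise 2.5 sketch above (the sweep values mirror Figure 2.6; reduce `steps`/`runs` for a quick test):

```python
# Constant-step-size epsilon-greedy, swept over epsilon as in Figure 2.6.
for eps in [1/128, 1/64, 1/32, 1/16, 1/8, 1/4]:
    avg_reward, _ = run_bandit(steps=200_000, runs=20, eps=eps, alpha=0.1)
    print(eps, avg_reward[100_000:].mean())   # average reward over the last 100,000 steps
```

The other algorithm families in Figure 2.6 (UCB, optimistic initialization, gradient bandit) would need analogous loops over their own parameters.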
Original post: https://www.cnblogs.com/initial-h/p/14515446.html