1 Markov Decision Process (10 Points)
You are in a Las Vegas casino! You have $20 for this casino venture and will play until you lose it all or as
soon as you double your money (i.e., increase your holding to at least $40). You can choose to play two slot
machines: 1) slot machine A costs $10 to play and will return $20 with probability 0.05 and $0 otherwise;
and 2) slot machine B costs $20 to play and will return $30 with probability 0.01 and $0 otherwise. Until
you are done, you will choose to play machine A or machine B in each turn. In the space below, provide
an MDP that captures the above description. Describe the state space, action space, rewards and transition
probabilities. Assume the discount factor γ = 1.
You can express your solution using equations instead of a table, or a diagram.
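For concreteness, here is a minimal Python sketch of one possible encoding of this MDP (not the only valid answer). The conventions it uses are assumptions, not given in the problem: states are the dollar holdings {0, 10, 20, 30, 40}, with $0 and $40 terminal, and the per-step reward is the change in holdings.

```python
# A minimal sketch of one possible encoding of this MDP (illustrative only).
# Assumptions not fixed by the problem: states are the dollar holdings
# {0, 10, 20, 30, 40}, where 0 and 40 are terminal, and the per-step reward
# is the change in holdings.

# transitions[(state, action)] -> list of (next_state, probability)
transitions = {
    # Machine A: pay $10; win $20 with prob. 0.05 (net +$10), else $0 (net -$10).
    (10, "A"): [(20, 0.05), (0, 0.95)],
    (20, "A"): [(30, 0.05), (10, 0.95)],
    (30, "A"): [(40, 0.05), (20, 0.95)],
    # Machine B: pay $20; win $30 with prob. 0.01 (net +$10), else $0 (net -$20).
    (20, "B"): [(30, 0.01), (0, 0.99)],
    (30, "B"): [(40, 0.01), (10, 0.99)],
    # With only $10, machine B is unaffordable, so only action "A" is available.
}

def reward(state, action, next_state):
    """Per-step reward: the change in holdings (an assumed convention)."""
    return next_state - state

terminal_states = {0, 40}   # broke, or doubled the initial $20
start_state = 20
gamma = 1.0                  # discount factor as specified in the problem
```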
2 Bellman Operator with Function Approximation (12 Points)
Consider an MDP M = (S, A, R, P, γ) with finite discrete state space S and action space A. Assume M has
dynamics model P(s'|s, a) for all s, s' ∈ S and a ∈ A, and reward model R(s, a) for all s ∈ S and a ∈ A.
Recall that the Bellman operator B applied to a function V : S → R is defined as

    B(V)(s) = max_a [ R(s, a) + γ Σ_{s'} P(s'|s, a) V(s') ]
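As a point of reference, below is a minimal Python sketch of this tabular backup. The array names and shapes (R as |S|×|A|, P as |A|×|S|×|S|) are illustrative assumptions, not part of the problem statement.

```python
import numpy as np

# Minimal sketch of the Bellman operator B for a tabular MDP.
# Assumed (illustrative) shapes: R is |S| x |A|, P is |A| x |S| x |S|
# with P[a, s, s2] = P(s2 | s, a); these conventions are not from the problem.

def bellman_backup(V, R, P, gamma):
    """Return B(V)(s) = max_a [ R(s, a) + gamma * sum_{s'} P(s'|s, a) V(s') ]."""
    EV = np.einsum("asx,x->as", P, V)    # expected next-state value, |A| x |S|
    Q = R + gamma * EV.T                 # state-action backups, |S| x |A|
    return Q.max(axis=1)                 # greedy max over actions

# Example: repeated backups (value iteration) on a random MDP with gamma < 1.
nS, nA = 4, 2
rng = np.random.default_rng(0)
R = rng.random((nS, nA))
P = rng.random((nA, nS, nS))
P /= P.sum(axis=2, keepdims=True)        # normalize rows into distributions
V = np.zeros(nS)
for _ in range(200):
    V = bellman_backup(V, R, P, gamma=0.9)
print(V)
```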
(a) (4 Points) Now consider a new operator which first applies a Bellman backup and then applies
function approximation, mapping the value function back to a space representable by the function
approximator. We will consider a linear value function approximator over a continuous state space.
Consider the following graphs:
The graphs show linear regression on the sample set X₀ = {0, 1, 2} for two hypothetical underlying functions.
On the left, a target function f (solid line) that evaluates to f(0) = f(1) = f(2) = 0, and its
corresponding fitted function f̂(x) = 0. On the right, another target function g (solid line) that
evaluates to g(0) = 0 and g(1) = g(2) = 1, and its fitted function ĝ(x) = 1/6 + x/2.
What happens to the distance between the points {f(0), f(1), f(2)} and {g(0), g(1), g(2)} after we do the
linear approximation? In other words, compare max_{x∈X₀} |f(x) − g(x)| and max_{x∈X₀} |f̂(x) − ĝ(x)|.
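A rough numerical sketch of this comparison is below. It assumes ordinary least squares with an intercept as the linear approximator, which is one natural reading of the figure; other fitting conventions would give different numbers.

```python
import numpy as np

# Sketch of the comparison in part (a): fit each target with ordinary least
# squares (with an intercept -- an assumed convention) on X0 = {0, 1, 2},
# then compare the max absolute gaps before and after fitting.
X0 = np.array([0.0, 1.0, 2.0])
f = np.array([0.0, 0.0, 0.0])            # f(0) = f(1) = f(2) = 0
g = np.array([0.0, 1.0, 1.0])            # g(0) = 0, g(1) = g(2) = 1

def ols_predictions(x, y):
    """Least-squares line evaluated back at the sample points."""
    slope, intercept = np.polyfit(x, y, deg=1)
    return slope * x + intercept

f_hat = ols_predictions(X0, f)
g_hat = ols_predictions(X0, g)

print("max |f - g|         =", np.max(np.abs(f - g)))          # before fitting
print("max |f_hat - g_hat| =", np.max(np.abs(f_hat - g_hat)))  # after fitting
```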
(b) (4 Points) Is the linear function approximator here a contraction operator? Explain your answer.
(c) (4 Points) Will the new operator be guaranteed to converge to a single value function? If yes, will this
be the optimal value function for the problem? Justify your answers.
3 Probably Approximately Correct (6 Points)
Let A(α, β) be a hypothetical reinforcement learning algorithm, parametrized in terms of α and β, such that
for any α > β > 1, it selects action a for state s satisfying |Q(s, a) − V∗(s)| ≤ αβ in all but
N = |S||A|αβ/(1 − γ) steps with probability at least 1 − 1/β².
Per the definition of Probably Approximately Correct Reinforcement Learning, express N as a function
of |S|, |A|, δ, and γ. What is the resulting N? Is algorithm A probably approximately correct (PAC)?
Briefly justify.
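For intuition only, here is a small sketch that evaluates the stated guarantee for a particular choice of α > β > 1. It simply plugs into the formulas exactly as written above (tolerance αβ, failure probability 1/β², and the stated N), so it inherits any ambiguity in that statement and is not itself an answer to the question.

```python
# Evaluate the stated guarantee of A(alpha, beta), using the formulas as
# given above (tolerance alpha*beta, failure probability 1/beta**2, and N).
def guarantee(alpha, beta, n_states, n_actions, gamma):
    assert alpha > beta > 1
    tolerance = alpha * beta                    # accuracy bound on |Q - V*|
    failure_prob = 1.0 / beta ** 2              # guarantee fails w.p. at most this
    N = n_states * n_actions * alpha * beta / (1.0 - gamma)   # non-accurate steps
    return tolerance, failure_prob, N

print(guarantee(alpha=4.0, beta=2.0, n_states=10, n_actions=2, gamma=0.9))
```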