Stanford University CS 234: Midterm 2017 Solutions (Questions & Answers)
1 Markov Decision Process (10 Points)

You are in a Las Vegas casino! You have $20 for this casino venture and will play until you lose it all or until you double your money (i.e., increase your holding to at least $40). You can choose to play two slot machines: 1) slot machine A costs $10 to play and will return $20 with probability 0.05 and $0 otherwise; and 2) slot machine B costs $20 to play and will return $30 with probability 0.01 and $0 otherwise. Until you are done, you will choose to play machine A or machine B in each turn. In the space below, provide an MDP that captures the above description. Describe the state space, action space, rewards and transition probabilities. Assume the discount factor $\gamma = 1$. You can express your solution using equations instead of a table, or a diagram.

2 Bellman Operator with Function Approximation (12 Points)

Consider an MDP $M = (S, A, R, P, \gamma)$ with finite discrete state space $S$ and action space $A$. Assume $M$ has dynamics model $P(s' \mid s, a)$ for all $s, s' \in S$ and $a \in A$, and reward model $R(s, a)$ for all $s \in S$ and $a \in A$. Recall that the Bellman operator $B$ applied to a function $V : S \to \mathbb{R}$ is defined as

$$B(V)(s) = \max_{a} \left( R(s, a) + \gamma \sum_{s' \in S} P(s' \mid s, a) V(s') \right)$$

(a) (4 Points) Now, consider a new operator which first applies a Bellman backup and then applies a function approximation, to map the value function back to a space representable by the function approximation. We will consider a linear value function approximator over a continuous state space. Consider the following graphs, which show linear regression on the sample $X_0 = \{0, 1, 2\}$ for hypothetical underlying functions. On the left, a target function $f$ (solid line) that evaluates to $f(0) = f(1) = f(2) = 0$, and its corresponding fitted function $\hat{f}(x) = 0$. On the right, another target function $g$ (solid line) that evaluates to $g(0) = 0$ and $g(1) = g(2) = 1$, and its fitted function $\hat{g}(x) = \frac{12}{7}x$. What happens to the distance between the points $\{f(0), f(1), f(2)\}$ and $\{g(0), g(1), g(2)\}$ after we do the linear approximation? In other words, compare $\max_{x \in X_0} |f(x) - g(x)|$ and $\max_{x \in X_0} |\hat{f}(x) - \hat{g}(x)|$.

(b) (4 Points) Is the linear function approximator here a contraction operator? Explain your answer.

(c) (4 Points) Will the new operator be guaranteed to converge to a single value function? If yes, will this be the optimal value function for the problem? Justify your answers.

3 Probably Approximately Correct (6 Points)

Let $A(\alpha, \beta)$ be a hypothetical reinforcement learning algorithm, parametrized in terms of $\alpha$ and $\beta$ such that for any $\alpha > \beta > 1$, it selects action $a$ for state $s$ satisfying $|Q(s, a) - V^*(s)| \le \alpha\beta$ in all but $N = \frac{|S||A|}{|1-\gamma|}\alpha\beta$ steps, with probability at least $1 - \frac{1}{\beta^2}$. Per the definition of Probably Approximately Correct Reinforcement Learning, express $N$ as a function of $|S|$, $|A|$, $\delta$, and $\gamma$. What is the resulting $N$? Is algorithm $A$ probably approximately correct (PAC)? Briefly justify.
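To make Question 1 concrete, here is a minimal sketch of one possible encoding in Python. The choice of state variable (dollars currently held), the reward convention (net change in holdings), and all names are assumptions for illustration, not the official solution.

```python
# A sketch of one possible encoding of the Question 1 MDP.
# State = dollars currently held; reward = net change in holdings.

TERMINAL = {0, 40}  # broke, or doubled the initial $20

def actions(s):
    """Machines affordable in non-terminal state s: A costs $10, B costs $20."""
    if s in TERMINAL:
        return []
    return [m for m, cost in (("A", 10), ("B", 20)) if s >= cost]

def transitions(s, a):
    """List of (probability, next_state, reward) triples."""
    if a == "A":   # pay $10; win $20 w.p. 0.05 (net +10), else lose the $10
        return [(0.05, s + 10, +10), (0.95, s - 10, -10)]
    else:          # pay $20; win $30 w.p. 0.01 (net +10), else lose the $20
        return [(0.01, s + 10, +10), (0.99, s - 20, -20)]

# From the $20 start, the reachable states are {0, 10, 20, 30, 40}; with
# gamma = 1, maximizing the return maximizes the expected change in holdings.
```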
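For Question 2(a), a quick numeric check is possible using the point values and fitted lines exactly as stated above ($\hat{f}(x) = 0$ and $\hat{g}(x) = \frac{12}{7}x$); the variable names below are my own.

```python
# Minimal numeric check for Question 2(a): compare the max-norm distance
# between f and g before and after the linear approximation.
X0 = [0, 1, 2]
f = {0: 0.0, 1: 0.0, 2: 0.0}   # f(0) = f(1) = f(2) = 0
g = {0: 0.0, 1: 1.0, 2: 1.0}   # g(0) = 0, g(1) = g(2) = 1

def f_hat(x):
    return 0.0                 # fitted line for f, as stated in the text

def g_hat(x):
    return (12.0 / 7.0) * x    # fitted line for g, as stated in the text

before = max(abs(f[x] - g[x]) for x in X0)          # max |f - g|   -> 1.0
after = max(abs(f_hat(x) - g_hat(x)) for x in X0)   # max |f^ - g^| -> 24/7

print(before, after)  # the distance grows after the linear approximation
```

Whatever the exact fitted line, the observation that the max-norm distance between two functions can grow after fitting is the crux of part (b): a contraction must bring every pair of functions closer, by a factor strictly less than 1, in the chosen norm.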
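As a hedged starting point for Question 3 (this is one reading of the bound as reconstructed above, not the official solution): the PAC definition is stated in terms of an accuracy $\varepsilon$ and a failure probability $\delta$, which suggests matching $\delta = \frac{1}{\beta^2}$ and $\varepsilon = \alpha\beta$:

$$\delta = \frac{1}{\beta^2} \;\Longrightarrow\; \beta = \frac{1}{\sqrt{\delta}}, \qquad \varepsilon = \alpha\beta \;\Longrightarrow\; N = \frac{|S||A|}{|1-\gamma|}\,\alpha\beta = \frac{|S||A|\,\varepsilon}{|1-\gamma|}.$$

When judging whether $A$ is PAC, note that the stated constraint $\alpha > \beta > 1$ forces $\alpha\beta > 1$, so it is worth checking whether the accuracy can be driven arbitrarily close to 0, as the PAC definition requires.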
About the document
Uploaded on: Jun 01, 2021
Number of pages: 12