Study Log (2019.12)

4 minute read

2019-12-31

  • Reinforcement Learning
    • Chapter 8. Planning and Learning with Tabular Methods
      • 8.1 Models and Planning
      • 8.2 Dyna: Integrated Planning, Acting, and Learning
        • Tabular Dyna-Q (a sketch follows this entry)
    • Page #166
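
A minimal sketch of the Tabular Dyna-Q loop from Section 8.2. The environment interface here (`reset()` returning a state, `step(action)` returning `(next_state, reward, done)`) and all parameter names are illustrative assumptions, not the book's code.

```python
import random
from collections import defaultdict

import numpy as np

# Minimal Tabular Dyna-Q sketch; the env interface is an assumed convention.
def dyna_q(env, n_actions, episodes=50, planning_steps=5,
           alpha=0.1, gamma=0.95, epsilon=0.1):
    Q = defaultdict(lambda: np.zeros(n_actions))   # action-value table
    model = {}                                     # (s, a) -> (r, s') for observed pairs
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # epsilon-greedy action selection
            if random.random() < epsilon:
                a = random.randrange(n_actions)
            else:
                a = int(np.argmax(Q[s]))
            s2, r, done = env.step(a)
            # direct RL: one-step Q-learning update from real experience
            Q[s][a] += alpha * (r + gamma * np.max(Q[s2]) - Q[s][a])
            # model learning (deterministic environment assumed)
            model[(s, a)] = (r, s2)
            # planning: replay random transitions from the learned model
            for _ in range(planning_steps):
                (ps, pa), (pr, ps2) = random.choice(list(model.items()))
                Q[ps][pa] += alpha * (pr + gamma * np.max(Q[ps2]) - Q[ps][pa])
            s = s2
    return Q
```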

2019-12-30

  • Reinforcement Learning
    • Chapter 7. n-step Bootstrapping
      • 7.4 Per-decision Methods with Control Variates
      • 7.5 Off-policy Learning Without Importance Sampling: The n-step Tree Backup Algorithm
      • 7.6 A Unifying Algorithm: n-step Q($\sigma$)
        • Off-policy n-step Q($\sigma$)
      • 7.7 Summary
    • Chapter 8. Planning and Learning with Tabular Methods
      • 8.1 Models and Planning
    • Page #161
  • endtoendAI

2019-12-29


2019-12-28

  • Reinforcement Learning
    • Chapter 7. n-step Bootstrapping
      • 7.1 n-step TD Prediction
        • 1) Compute G, the sum of discounted rewards up to step n : $G \leftarrow \sum\nolimits_{i=\tau+1}^{\min(\tau+n,T)} \gamma^{i-\tau-1} R_i$
        • 2) Add the bootstrapped value at step n (summarizing rewards beyond n steps) : $G \leftarrow G + \gamma^n V(\color{red}{ S_{\tau+n} })$
        • 3) Update V : $V(S_\tau) \leftarrow V(S_\tau) + \alpha [G - V(\color{red}{ S_\tau })]$
        • RandomWalk.py
          n-step TD for estimating V (a sketch follows this entry)
      • 7.2 n-step Sarsa
      • 7.3 n-step Off-policy Learning
      • 7.4 Per-decision Methods with Control Variates
    • Page #150
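
A minimal sketch of the three n-step TD prediction steps above, applied to the 19-state random walk used in the book; the function name and parameters are illustrative, not the actual RandomWalk.py.

```python
import numpy as np

N_STATES = 19   # non-terminal states 1..19; 0 and 20 are terminal
GAMMA = 1.0

def n_step_td(n, alpha, episodes=10):
    V = np.zeros(N_STATES + 2)            # V[0] and V[N_STATES + 1] stay 0 (terminal)
    for _ in range(episodes):
        state = (N_STATES + 1) // 2       # start in the middle state
        states, rewards = [state], [0.0]  # R_{t+1} stored at index t + 1
        T, t = float('inf'), 0
        while True:
            if t < T:
                # random-walk dynamics: step left or right with equal probability
                next_state = state + np.random.choice([-1, 1])
                reward = 1.0 if next_state == N_STATES + 1 else (-1.0 if next_state == 0 else 0.0)
                states.append(next_state)
                rewards.append(reward)
                if next_state == 0 or next_state == N_STATES + 1:
                    T = t + 1
                state = next_state
            tau = t - n + 1               # time whose state value is updated
            if tau >= 0:
                # 1) G = discounted rewards up to min(tau + n, T)
                G = sum(GAMMA ** (i - tau - 1) * rewards[i]
                        for i in range(tau + 1, min(tau + n, T) + 1))
                # 2) add the bootstrapped value at step tau + n
                if tau + n < T:
                    G += GAMMA ** n * V[states[tau + n]]
                # 3) update V(S_tau)
                V[states[tau]] += alpha * (G - V[states[tau]])
            if tau == T - 1:
                break
            t += 1
    return V
```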

2019-12-27


2019-12-25

  • Reinforcement Learning
    • Chapter 6. Temporal-Difference Learning
      • 6.1 TD Prediction
      • 6.2 Advantages of TD Prediction Methods
      • 6.3 Optimality of TD(0)
      • 6.4 Sarsa: On-policy TD Control
      • 6.5 Q-learning: Off-policy TD Control
    • Page #133

2019-12-24


2019-12-23


2019-12-22


2019-12-21


2019-12-20


2019-12-19


2019-12-17


2019-12-16


2019-12-15


2019-12-14


2019-12-10

  • Reinforcement Learning
    • Chapter 4. Dynamic Programming
      • 4.2 Policy Improvement
      • 4.3 Policy Iteration
      • 4.4 Value Iteration
      • 4.5 Asynchronous Dynamic Programming
      • 4.6 Generalized Policy Iteration
      • 4.7 Efficiency of Dynamic Programming
      • 4.8 Summary
    • Chapter 5. Monte Carlo Methods
      • 5.1 Monte Carlo Prediction
      • 5.2 Monte Carlo Estimation of Action Values
      • 5.3 Monte Carlo Control
      • 5.4 Monte Carlo Control without Exploring Starts
      • 5.5 Off-policy Prediction via Importance Sampling
    • Page #104

2019-12-04

  • Reinforcement Learning
    • Chapter 3. Finite Markov Decision Processes
      • 3.1 The Agent–Environment Interface
      • 3.2 Goals and Rewards
      • 3.3 Returns and Episodes
      • 3.4 Unified Notation for Episodic and Continuing Tasks
      • 3.5 Policies and Value Functions
      • 3.6 Optimal Policies and Optimal Value Functions
      • 3.7 Optimality and Approximation
      • 3.8 Summary
    • Chapter 4. Dynamic Programming
    • Page #79

2019-12-03

  • Reinforcement Learning
    • Chapter 2. Multi-armed Bandits
      • 2.7 Upper-Confidence-Bound Action Selection
      • 2.8 Gradient Bandit Algorithms
      • 2.9 Associative Search (Contextual Bandits)
      • 2.10 Summary
    • Page #47

2019-12-02


2019-12-01


2019-11-30


2019-11-29

