RL Introduction
Background
Agent: the learner and decision-maker; it observes states and selects actions.
Environment: everything outside the agent; it responds to each action with a new state and a reward.
State: a description of the environment at a given time step.
Action: a choice the agent makes in a given state.
Reward: a scalar feedback signal returned by the environment after each action.
Episode
One complete sequence of agent-environment interaction, from the initial state to a terminal state:
\[\begin{gathered}
\text{episode}=\{s_0,a_0,r_1,s_1,a_1,r_2,s_2,a_2,r_3,s_3,a_3,\cdots\}
\end{gathered}\]
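To make the trajectory concrete, here is a minimal Python sketch. `ChainEnv`, `rollout`, and `random_policy` are made-up, illustrative names (not a standard API): a 5-state chain environment and a helper that collects one episode as \((s_t, a_t, r_{t+1})\) triples.

```python
import random

# Toy, made-up environment: states 0..4 on a chain; actions are -1 (left)
# and +1 (right); reward 1 only when the agent reaches the terminal state 4.
class ChainEnv:
    def reset(self):
        self.s = 0
        return self.s

    def step(self, a):
        self.s = max(0, min(4, self.s + a))
        done = (self.s == 4)
        return self.s, (1.0 if done else 0.0), done

def rollout(env, policy):
    """Collect one episode {s_0, a_0, r_1, s_1, a_1, r_2, ...}."""
    episode, s, done = [], env.reset(), False
    while not done:
        a = policy(s)
        s_next, r, done = env.step(a)
        episode.append((s, a, r))  # r is r_{t+1}, received after (s_t, a_t)
        s = s_next
    return episode

def random_policy(s):
    return random.choice([-1, +1])

print(rollout(ChainEnv(), random_policy))
```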
Markov Decision Process (MDP)
An MDP assumes the Markov property: the next state depends only on the current state and action, not on the earlier history. Its dynamics are a transition probability and a reward function:
\[\begin{gathered}
P(s_{t+1}\mid s_t,a_t) \\
r_{t+1}=R(s_t,a_t)
\end{gathered}\]
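For a small finite MDP, the dynamics can be stored directly as tables. The sketch below uses made-up states, actions, probabilities, and rewards: `P[s][a]` holds the distribution over next states and `R[s][a]` the immediate reward.

```python
import random

# Made-up tabular MDP: two states, two actions.
P = {  # P[s][a][s'] = probability of moving to s' after taking a in s
    0: {"stay": {0: 0.9, 1: 0.1}, "go": {0: 0.2, 1: 0.8}},
    1: {"stay": {1: 1.0},         "go": {0: 0.5, 1: 0.5}},
}
R = {  # R[s][a] = immediate reward for taking a in s
    0: {"stay": 0.0, "go": 1.0},
    1: {"stay": 0.0, "go": -1.0},
}

def sample_transition(s, a):
    """Draw s' ~ P(. | s, a) and return it with the reward R(s, a)."""
    next_states, probs = zip(*P[s][a].items())
    return random.choices(next_states, weights=probs)[0], R[s][a]

print(sample_transition(0, "go"))
```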
Objective
The agent's goal is to act so as to maximize the reward it accumulates over time.
Cumulative Reward
\[\begin{gathered}
G_t=r_{t+1}+r_{t+2}+\cdots+r_T
\end{gathered}\]
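On a finished episode, \(G_t\) is just a tail sum over the recorded rewards; a tiny sketch (the reward values are made up):

```python
rewards = [0.0, 0.0, 1.0, 0.0, 5.0]  # made-up r_1 ... r_T of a finished episode

def undiscounted_return(rewards, t):
    """G_t = r_{t+1} + r_{t+2} + ... + r_T; rewards[k] stores r_{k+1}."""
    return sum(rewards[t:])

print(undiscounted_return(rewards, 0))  # 6.0
```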
Discount Factor
A discount factor \(\gamma\in[0,1)\) weights near-term rewards more heavily than distant ones and keeps the infinite-horizon sum finite:
\[\begin{aligned}
G_t&=r_{t+1}+\gamma{r_{t+2}}+\gamma^2{r_{t+3}}+\cdots \\
&=\sum_{k=0}^{\infty}{
\gamma^k{r_{t+k+1}}
}
\end{aligned}\]
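Since the sum satisfies the recursion \(G_t = r_{t+1} + \gamma G_{t+1}\), the return at every step of a finished episode comes out of one backward pass:

```python
def discounted_returns(rewards, gamma=0.99):
    """Return [G_0, G_1, ...] via the recursion G_t = r_{t+1} + gamma * G_{t+1}."""
    returns = [0.0] * len(rewards)
    g = 0.0
    for k in reversed(range(len(rewards))):
        g = rewards[k] + gamma * g  # rewards[k] stores r_{k+1}
        returns[k] = g
    return returns

print(discounted_returns([0.0, 0.0, 1.0], gamma=0.9))  # [0.81, 0.9, 1.0]
```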
Policy
A policy is the agent's behavior: for each state, a probability distribution over actions:
\[\begin{gathered}
\pi(a|s)=P(A_t=a|S_t=s)
\end{gathered}\]
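A tabular stochastic policy stores exactly this conditional distribution; a sketch with made-up states, actions, and probabilities:

```python
import random

pi = {  # pi[s][a] = P(A_t = a | S_t = s); states/actions are made up
    0: {"left": 0.3, "right": 0.7},
    1: {"left": 0.5, "right": 0.5},
}

def sample_action(s):
    """Draw a ~ pi(. | s)."""
    actions, probs = zip(*pi[s].items())
    return random.choices(actions, weights=probs)[0]

print(sample_action(0))
```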
Value Function
The state-value function is the expected return when starting in state \(s\) and following policy \(\pi\) thereafter:
\[\begin{gathered}
\begin{aligned}
v_\pi(s)&=\mathbb{E}_\pi[G_t\mid S_t=s] \\
&=\mathbb{E}_\pi\Big[\sum_{k=0}^\infty\gamma^k r_{t+k+1}\,\Big|\,S_t=s\Big],
\end{aligned} \\
\forall s\in\mathcal{S}.
\end{gathered}\]
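The expectation can be estimated without knowing the MDP by Monte Carlo: sample episodes under \(\pi\) and average the discounted return observed from each visit to \(s\). A sketch, reusing the illustrative `ChainEnv`, `rollout`, and `random_policy` helpers from the episode example above:

```python
def mc_state_value(env, policy, s_query, gamma=0.9, n_episodes=2000):
    """Every-visit Monte Carlo estimate of v_pi(s_query)."""
    total, count = 0.0, 0
    for _ in range(n_episodes):
        g = 0.0
        for s, a, r in reversed(rollout(env, policy)):
            g = r + gamma * g  # running return G_t at the current step
            if s == s_query:
                total += g
                count += 1
    return total / max(count, 1)

print(mc_state_value(ChainEnv(), random_policy, s_query=0))
```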
Action-Value Function (Q-Function)
The action-value function is the expected return after taking action \(a\) in state \(s\) and following \(\pi\) thereafter:
\[\begin{gathered}
Q_\pi(s,a)=\mathbb{E}_\pi[G_t\mid S_t=s,A_t=a]=\mathbb{E}_\pi\Big[\sum_{k=0}^\infty\gamma^k r_{t+k+1}\,\Big|\,S_t=s,A_t=a\Big], \\
\forall s\in\mathcal{S}\text{ and }\forall a\in\mathcal{A}.
\end{gathered}\]
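The same Monte Carlo averaging, keyed on (state, action) pairs instead of states, estimates \(Q_\pi\); again a sketch on the toy helpers above:

```python
from collections import defaultdict

def mc_action_values(env, policy, gamma=0.9, n_episodes=2000):
    """Every-visit Monte Carlo estimate of Q_pi(s, a) for all visited pairs."""
    totals, counts = defaultdict(float), defaultdict(int)
    for _ in range(n_episodes):
        g = 0.0
        for s, a, r in reversed(rollout(env, policy)):
            g = r + gamma * g
            totals[(s, a)] += g
            counts[(s, a)] += 1
    return {sa: totals[sa] / counts[sa] for sa in totals}

q = mc_action_values(ChainEnv(), random_policy)
print(q[(3, +1)])  # 1.0: moving right from state 3 ends the episode with reward 1
```

The two functions are linked by \(v_\pi(s)=\sum_a \pi(a\mid s)\,Q_\pi(s,a)\).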