RL in R

hn

2022 05 30

RL in R

Sources

A collection of notes and code snippets on reinforcement learning (RL), based on Sutton & Barto (2018) and Wataru Toyokawa’s k-armed bandit tutorial.

1 Sutton & Barto (2018): RL

Preface to the Second Edition

Sutton & Barto (2018) is the second edition of the book from 1998.

Extended contents, more mathematics, and new notation, yet still an introduction with

  • focus on core, online learning algorithms.

  • goal to provide a clear and simple account of the key ideas and algorithms of reinforcement learning that is accessible to readers in all the related disciplines.

New structure into three parts:

  1. tabular case with exact solutions (Chapters 2–8)

  2. function approximation (Chapters 9–13)

  3. relations of RL to psychology, neuroscience, and societal impacts (Chapters 14–17)

The book can be used to support one- or two-semester courses on reinforcement learning.

Sections on Bibliographical and Historical Remarks credit the sources of the ideas presented in each chapter.

Dedication to A. Harry Klopf (AFOSR): A heterostatic theory of adaptive systems

… systems that try to maximize something (whatever that might be) are
qualitatively different from equilibrium-seeking systems, …
maximizing systems hold the key to understanding important aspects
of natural intelligence and for building artificial intelligences.
(Sutton & Barto, 2018, p. xv)

Preface to the First Edition

Interest began in 1979, based on A. Harry Klopf’s heterostatic theory of adaptive systems

… the idea of a learning system that wants something,
that adapts its behavior in order to maximize a special signal from its environment.
This was the idea of a “hedonistic” learning system, or, as we would say now,
the idea of reinforcement learning.
(Sutton & Barto, 2018, p. xvii)

RL as “learning from interaction to achieve goals” (p. xviii); the critical step is that the learning system wants to get something from its environment.

1.1 Chapter 1: Introduction

A computational approach to learning from interacting with the environment:

Learning from interaction is a foundational idea underlying nearly all theories of learning and intelligence.
(Sutton & Barto, 2018, p. 1)

Definition:

Reinforcement learning is learning what to do — how to map situations to actions — so as to maximize a numerical reward signal.
The learner is not told which actions to take, but instead must discover which actions yield the most reward by trying them.
(Sutton & Barto, 2018, p. 1)

Actions may not only affect immediate rewards, but also subsequent rewards.

The two most important distinguishing features of reinforcement learning (p. 2):

  • trial-and-error search
  • delayed reward

RL as (a) a problem, (b) a class of solution methods, and (c) the field that studies both problems and solution methods.

Capture the most important aspects of the real problem facing a learning agent interacting over time with its environment to achieve a goal.

Markov decision processes (MDP) include 3 aspects:

  • sensation,
  • action, and
  • goal.

A learning agent must be able to sense the state of its environment and take actions that affect the state. The agent also must have a goal or goals relating to the state of the environment. Any method that is well suited to solving such problems is considered an RL method.

RL as a genuine paradigm in machine learning (ML) that differs from supervised learning and unsupervised learning:

  • Difference to supervised learning: In supervised learning, the learner is provided with examples (ideally representative data) and corresponding solutions. Task: Generalize from (representative) data to detect a pattern.
    By contrast, RL learns from experience without representative examples and correct solutions.

  • Difference to unsupervised learning: In unsupervised learning, a learner aims to discover structure hidden in unlabeled data.
    By contrast, RL aims to maximize reward (rather than discover structure).

Note: Trade-off between exploration and exploitation does not arise in other learning paradigms.

1.1.1 Key challenges arising in RL

  1. Trade-off between and need to balance exploration and exploitation:
  • exploration: Discover and evaluate actions, improve future actions.

  • exploitation: Use past experience to maximize reward, i.e., exploit the experience gained from past actions.

Dilemma: Both are necessary to solve the task:

The agent must try a variety of actions and progressively favor those that appear to be best.
(Sutton & Barto, 2018, p. 3)

  2. A holistic approach: Explicit consideration of the whole learning problem (goal-directed agent interacting with an uncertain environment), rather than isolated subproblems (e.g., planning steps). Start with complete agent (goals, senses, actions) and uncertainty about environmental states and effects of actions.

  3. A general approach: Search for simple general principles (p. 4).

Note on generality:

  • An RL agent can be an autonomous organism or robot, but can also be a component of larger systems.

  • The most naturalistic type of learning:

Of all the forms of machine learning, reinforcement learning is the closest to the kind of learning that humans and other animals do.
(Sutton & Barto, 2018, p. 4)

1.1.2 Examples of RL

Examples:

  • chess move
  • gazelle calf
  • mobile robot
  • making breakfast

Common features:

  • agent has a goal
  • interaction: actions affect states of the environment
  • uncertainty about environment
  • considering delayed outcomes involves foresight and planning
  • monitoring goal attainment by senses
  • memory: use experience to improve performance over time

1.1.3 Elements of RL

One can identify four main subelements of a reinforcement learning system (a minimal R sketch follows this list):

  1. A policy determines the agent’s behavior: An agent’s mapping from perceived states to corresponding actions (i.e., stimulus-response rules, associations). May be a simple lookup table or include complex computation and stochastic elements.

  2. A reward signal defines the goal of a reinforcement learning problem: Each time step yields an immediate reward. The agent’s goal is to maximize overall reward. Reward corresponds to an evaluation (e.g., gain, utility, pain or pleasure, etc.).

  3. A value function describes the long-term desirability of states (rather than the immediate reward signal): The value of a state is the total amount of reward an agent can expect to accumulate over the future. Thus, a state may yield a low immediate reward, but be followed by better states (with higher rewards).

  4. A model of the environment (optional) allows inferences about how the environment will behave. For instance, given a state and an action, a model could predict the next state and reward. Whereas model-free RL learns by trial and error, so-called model-based methods use models for planning.
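
To make these four elements concrete, here is a minimal R sketch (my own illustration, not from the book; all names are hypothetical) for a toy problem with two states and two actions:

# 1. Policy: a lookup table mapping perceived states to actions:
policy <- c(s1 = "left", s2 = "right")

# 2. Reward signal: an immediate numeric reward per (state, action) pair:
reward_signal <- function(state, action){
  if (state == "s2" && action == "right") 1 else 0
}

# 3. Value function: estimated long-term reward per state (to be learned):
V <- c(s1 = 0, s2 = 0)

# 4. Model (optional): predicts the next state from a (state, action) pair:
model <- function(state, action){
  if (action == "right") "s2" else "s1"
}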

Note the central role of value estimation:

Rewards are primary, whereas values are predictions of rewards (secondary). Nevertheless, actions are based on value-judgments, rather than judgments of rewards:

We seek actions that bring about states of highest value, not highest reward… (Sutton & Barto, 2018, p. 6)

Unfortunately, it is much harder to determine values than it is to determine rewards. Rewards are basically given directly by the environment, but values must be estimated and re-estimated from the sequences of observations an agent makes over its entire lifetime. (Sutton & Barto, 2018, p. 6)

1.1.4 Limitations and scope

  • Reliance on the notion of an environmental state: whatever information is available to the agent about its environment. The state serves as input to the policy, the value function, and the model, and as output of the model. (Disclaimer: This book focuses on decision-making, not on constructing the state signal.)

  • Focus on learning from interaction to estimate value functions, rather than on optimization methods like evolutionary processes (which do not exploit the states experienced by an individual learner). Evolutionary methods are therefore not well suited for solving RL problems.

1.1.5 Extended example: Tic-tac-toe

In the game of TTT, a skilled player never loses. Thus, assume we play an imperfect opponent:

  • Goal: Maximize our chance of winning (i.e., a row/column/diagonal of XXX, when playing Xs)

Value function as a table with slots for each possible state; value indicates the probability of winning from that state (e.g., 0 for any state with OOO; 1 for any state with XXX).

  • greedy move: Selecting a move to the state with the highest (estimated) probability of winning

  • exploratory move: Selecting a move at random (i.e., irrespective of its value estimate)

Improving value estimates: After each greedy move, the value of the earlier state is adjusted to be closer to the value of the later state:

\(V(S_t) \leftarrow\ V(S_t) + \alpha [V(S_{t+1}) - V(S_t)]\)

i.e., temporal-difference learning with learning rate/step size parameter \(\alpha\).

Exploratory moves do not lead to any learning (updating).

(See Figure 1.1, p. 10)
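
As a minimal R sketch of this update (my own illustration; it assumes that board states are encoded as character keys of a named vector V of win-probability estimates):

td_update <- function(V, s_t, s_next, alpha = 0.1){
  V[s_t] <- V[s_t] + alpha * (V[s_next] - V[s_t])  # move V(S_t) toward V(S_t+1)
  V
}

# Usage example (hypothetical state labels and values):
V <- c("X.O......" = 0.5, "X.O...X.." = 0.6)
V <- td_update(V, s_t = "X.O......", s_next = "X.O...X..")
V["X.O......"]  # 0.51: moved from 0.5 toward 0.6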

1.1.6 Summary

Reinforcement learning is a computational approach to understanding and automating goal-directed learning and decision making. It is distinguished from other computational approaches by its emphasis on learning by an agent from direct interaction with its environment, without requiring exemplary supervision or complete models of the environment. In our opinion, reinforcement learning is the first field to seriously address the computational issues that arise when learning from interaction with an environment in order to achieve long-term goals.

Reinforcement learning uses the formal framework of Markov decision processes to define the interaction between a learning agent and its environment in terms of states, actions, and rewards. This framework is intended to be a simple way of representing essential features of the artificial intelligence problem. These features include a sense of cause and effect, a sense of uncertainty and nondeterminism, and the existence of explicit goals.

The concepts of value and value function are key to most of the reinforcement learning methods that we consider in this book. We take the position that value functions are important for efficient search in the space of policies. The use of value functions distinguishes reinforcement learning methods from evolutionary methods that search directly in policy space guided by evaluations of entire policies.

(Sutton & Barto, 2018, p. 13)

1.1.7 Early history of RL

Two main strands:

  1. Optimal control theory and dynamic programming:
    Bellman equations, MDPs

  2. Trial and error learning in animal psychology:

    • Thorndike’s (1911) law of effect: Actions closely followed by satisfaction/discomfort are more/less likely to reoccur.
    • A reinforcer may strengthen or weaken links, yet must yield lasting changes.
    • Confusion with supervised learning: RL captures the essentials of trial-and-error learning by selecting actions on the basis of evaluative feedback that does not rely on knowledge of what the correct action should be.
    • Minsky (1961) describes various heuristics and the basic credit-assignment problem for reinforcement learning systems.
    • Harry Klopf: emphasis on trial-and-error learning and the hedonic aspects of behavior, i.e., trying to achieve goals and to control the environment toward desired ends.

A third strand:

  3. Temporal-difference learning is distinctive in being driven by the difference between temporally successive estimates of the same quantity (new and unique to RL).

1.2 Chapter 2: Multi-armed bandits (MAB)

Distinguish between two types of feedback:

  • evaluative feedback depends entirely on the action taken

  • instructive feedback indicates the correct (best) action to take, independently of the action actually taken

In contrast to supervised learning, RL evaluates only the actions taken and thus creates a need for exploration.

1.2.1 A \(k\)-armed bandit problem

A learning problem: Repeated choice from \(k\) options; each choice (or action \(A_t\)) yields some unknown reward (\(R_t\));
the goal is to maximize total rewards over an episode (e.g., of \(T = 1000\) time steps/rounds).

A \(k\)-armed bandit problem: A bandit with \(k\) levers. If the true value of each action \(q_{*}(a)\) were known, the solution would be trivial. The learning task consists of estimating the options’ reward distributions in order to shift more actions towards the better levers.

The estimated value of an action \(a\) at time \(t\) is denoted \(Q_t(a)\). Ideally, \(Q_t(a) \rightarrow q_{*}(a)\).

When keeping estimates of action values, we can distinguish between

  • greedy actions: exploit current knowledge

  • explorative actions: explore options to acquire knowledge

In RL, we need to strike a good balance between exploration and exploitation (here, without being concerned with balancing them in a theoretically optimal way).

1.2.2 Action-value methods

Action-value methods :=

  1. estimate the value of actions \(Q_t(a)\)

  2. select actions/make action selection decisions

Simplest solutions:

  • ad 1.: Compute the sample average (i.e., sum of rewards/number of times taken) for each action

  • ad 2.: Use greedy action selection: \(A_t = \arg\max_a Q_t(a)\) (breaking ties randomly)

However, we also need to explore options. Thus, a simple alternative to 2. is the \(\epsilon\)-greedy method of action selection:

  • Random action selection with a probability of \(\epsilon\), greedy selection with probability \(1-\epsilon\).
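
A minimal R sketch of \(\epsilon\)-greedy action selection (my own helper function, not from the book; Q is a vector of current action-value estimates):

select_action <- function(Q, epsilon = 0.1){
  if (runif(1) < epsilon){
    sample(seq_along(Q), 1)  # explore: choose a random action
  } else {
    ties <- which(Q == max(Q))  # position(s) of maximal Q values
    if (length(ties) > 1) sample(ties, 1) else ties  # exploit (break ties randomly)
  }
}

# Usage example:
Q <- c(0.2, 0.5, 0.5, -0.1)
select_action(Q, epsilon = 0.1)  # mostly returns 2 or 3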

1.2.3 The 10-armed testbed

Simulation to assess the relative effectiveness of the greedy and \(\epsilon\)-greedy action-value methods:

  • 2000 randomly generated \(k\)-armed bandit problems with \(k = 10\)

  • \(q_{*}(a)\) drawn from normal (Gaussian) distribution with mean 0 and variance 1.

Result: Greedy action selection is initially superior, but \(\epsilon\)-greedy outperforms greedy action selection in the long run.

Need for and advantages of \(\epsilon\)-greedy algorithms are task-dependent: Noisier and nonstationary environments require more exploration.

1.2.4 Incremental implementation

So far, action-value methods estimate action values as sample averages of observed rewards.

Goal: Compute sample averages efficiently (i.e., constant memory and computation per time step).

An incremental implementation does not need more memory or computation over time:

General form of update rule:

\(\mathrm{New\ estimate} \leftarrow\ \mathrm{Old\ estimate} + \mathrm{Step\ size} \cdot [\mathrm{Target} - \mathrm{Old\ estimate}]\)

The difference term \([\mathrm{Target} - \mathrm{Old\ estimate}]\) denotes the error of the estimate, which is reduced by taking a step (of size \(\mathrm{Step\ size}\)) toward the target.

Applied to computing sample averages on each step:

\(Q_{n+1} \leftarrow\ Q_n + \frac{1}{n} \cdot [R_n - Q_n]\)

Note: In this incremental method, the magnitude of the step size parameter \(\frac{1}{n}\) changes from step to step (i.e., gets smaller as \(n\) increases).
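
As a quick check (my own sketch, not from the book), the incremental update with step size \(\frac{1}{n}\) reproduces the ordinary sample average of the observed rewards:

set.seed(1)
rewards <- rnorm(1000, mean = 0.5, sd = 1)  # simulated rewards of one action

Q <- 0  # initial estimate
for (n in seq_along(rewards)) {
  Q <- Q + (1/n) * (rewards[n] - Q)  # NewEstimate <- OldEstimate + StepSize * error
}

c(incremental = Q, sample_mean = mean(rewards))  # agree (up to floating-point precision)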

Figure @ref(fig:bandit-algorithm) shows pseudo-code for a simple bandit algorithm that performs \(\epsilon\)-greedy action selection and incrementally updates its estimates of action values:

(ref:fig-bandit) Pseudo-code for a simple bandit algorithm (Sutton & Barto, 2018, p. 32).


Notes:

  • The bandit() function maps actions to rewards (in a given environment)

What is the environment?

The environment provides inputs (e.g., perceptions, some internal state) and outputs (e.g., rewards, when acted upon).

Thus, a definition of an environment must specify the interface to the agent.

1.3 Implementations in R

1.3.1 Simple \(k\)-armed bandit

Incremental implementation of a \(k\)-armed bandit (with rewards drawn from stable normal distributions) and an agent with \(\epsilon\)-greedy action selection and incremental updating of expected option values (i.e., table of Q values with decreasing step size parameter \(\alpha\)) (Sutton & Barto, 2018, Section 2.4):

Note: The following code was adapted from Wataru Toyokawa’s k-armed bandit tutorial and Exercise 1:

# Setup:
library(ggplot2)  # plotting (used below)
library(dplyr)    # data wrangling (used in Section 1.3.2)

set.seed(777)  # reproducible randomness 
n_T <- 1000    # number of agent-environment interactions/time steps/trials 

# A. Environment:
k <- 5  # number of options
q_true <- rnorm(n = k, mean = 0, sd = 1)  # true reward means 
names(q_true) <- paste0("O_", 1:k)
q_true 
##        O_1        O_2        O_3        O_4        O_5 
##  0.4897862 -0.3985414  0.5108363 -0.3988121  1.6386861
# B. Agent: 
epsilon   <- 0.1  # exploration rate
Q_initial <- rep(0, k)  # initial expected value per option

# C. Storing/recording data:

# History of actions and rewards:
n_chosen <- rep(0, k)   # option choices N(a)
choice <- rep(NA, n_T)  # choice history
reward <- rep(NA, n_T)  # reward history

# Q value matrix (t rows, k columns):
Q <- matrix(nrow = n_T, ncol = k, byrow = TRUE)
Q[1, ] <- Q_initial  # initialize 1st row
dim(Q)
## [1] 1000    5
# head(Q)

Note:

  • Some parameters and data structures (e.g., the number of trials or simulations, environmental states, rewards, agent actions and expectations) may belong to the agent, environment, or are merely of a technical nature (e.g., to store a record of the process). For some of these conceptual elements, it is not straightforward to locate them in a particular entity. For instance, does an action or a state belong to the environment or to the agent? Similarly, we tend to view rewards as a part of the environment, but appreciating and estimating their value seems to require an agent. Fortunately, the ontological status of these elements is less important than that they are explicitly defined and reliably record what we need to run and evaluate our model.

We could create the simulation with these settings. However, this would leave the interaction between agent and environment somewhat implicit, as we would need to translate the agent’s choice into an environmental reward on every trial. To explicate this interaction, we define a bandit() function that takes an agent’s action (here: a numeric choice for an option between 1 and k) and provides a corresponding reward value (based on q_true):

bandit <- function(action){
  
  # Check inputs:
  if (!is.numeric(action) || any(action < 1) || any(action > k)){
    message(paste0("bandit: action must be a choice between 1 and ", k))
    return(NA)
  }
  
  reward <- NA  # initialize 
  
  # Compute current reward:
  cur_mean <- q_true[action]
  reward <- rnorm(n = length(action), mean = cur_mean, sd = 1)
  
  return(reward)
  
}

# # Check:
# bandit(5)
# q_true[5]
# mean(bandit(rep(5, 10000))) 
# # Note: 
# bandit(99)
# bandit("best")

Notes:

  • The bandit() function maps actions to rewards, given some environment (here: k options and their average payoffs stored in q_true). In this particular case, the bandit() function could be replaced by a single line:
    rnorm(n = 1, mean = q_true[choice[t]], sd = 1).
    But when environments get more complex, explicitly representing them as a function may be useful.

  • Note that this environment could be described as relatively constant (as it always offers the same options) or as relatively dynamic (as the rewards vary between and within options). Technically speaking, this particular bandit provides a situation under risk (from the perspective of the modeler, who knows the reward distributions). But as the options’ reward distributions are unknown to the agent, the situation is also one of uncertainty (from the perspective of the agent).

Repeating a series of steps:

To simulate the interaction between our learning agent and its environment, we need to perform the following tasks on every time step or trial \(t\):

  1. Determine the type of move (greedy or explorative) on the current trial

  2. Select a corresponding action (i.e., a greedy or explorative choice)

  3. Determine the corresponding reward from the environment

  4. Process the reward by adjusting the agent’s expectations:

    1. Update the step size parameter (alpha)
    2. Update the expectations (Q values)

The following for loop implements a corresponding simulation that iterates through 1000 time steps:

for (t in 1:n_T) {
  
  # 1. Type of move in this trial: 
  greedy <- sample(c(FALSE, TRUE), 1, prob = c(epsilon, 1 - epsilon))
  
  # 2. Select action/make a choice:
  if (greedy) {
    
    ix_Q_max <- which(Q[t, ] == max(Q[t, ]))  # position(s) of max. Q values
    
    if (length(ix_Q_max) > 1) { # multiple maxima: 
      
      choice[t] <- sample(ix_Q_max, 1)  # break ties randomly 
      
    } else { # only 1 maximum:  
      
      choice[t] <- ix_Q_max 
      
    }
    
  } else { # non-greedy/exploratory move:
    
    choice[t] <- sample(1:k, 1)
    
  }
  
  # 3. Get reward (from environment): 
  
  # reward[t] <- rnorm(n = 1, mean = q_true[choice[t]], sd = 1)  # implicit
  
  reward[t] <- bandit(action = choice[t])  # explicit
  
  
  # 4. Process reward and adjust expectations:
  
  # a. Update step size (alpha): 
  n_chosen[choice[t]] <- n_chosen[choice[t]] + 1  # update N(a) value
  alpha <- 1/n_chosen[choice[t]]  # step size parameter
  
  # b. Update Q values: 
  if (t < n_T) {  # a next trial exists: 
    
    Q[t + 1, ] <- Q[t, ]  # copy previous Q values
    
    Q_t <- Q[t, choice[t]]  # Q value of current choice 
    
    # Update Q value of the chosen option:
    Q[t + 1, choice[t]] <- Q_t + (alpha * (reward[t] - Q_t))
    
  } # end if. 
  
} # end loop. 

After running this simulation, the data structures choice, reward, and Q contain all intermediate values that we need to document the outcome and understand the process. We can probe various parts of these data structures to check the integrity of the process and for evidence of learning:

# Data of first 10 trials:
choice[1:10]
##  [1] 1 1 2 5 4 4 4 4 4 3
reward[1:10]
##  [1]  0.69249052 -0.76983718 -0.67831563 -0.24224456  1.91268271  0.20432699
##  [7] -0.94294455 -0.98881466 -2.42944795  0.07463433
Q[1:10, ]
##              [,1]       [,2] [,3]        [,4]       [,5]
##  [1,]  0.00000000  0.0000000    0  0.00000000  0.0000000
##  [2,]  0.69249052  0.0000000    0  0.00000000  0.0000000
##  [3,] -0.03867333  0.0000000    0  0.00000000  0.0000000
##  [4,] -0.03867333 -0.6783156    0  0.00000000  0.0000000
##  [5,] -0.03867333 -0.6783156    0  0.00000000 -0.2422446
##  [6,] -0.03867333 -0.6783156    0  1.91268271 -0.2422446
##  [7,] -0.03867333 -0.6783156    0  1.05850485 -0.2422446
##  [8,] -0.03867333 -0.6783156    0  0.39135505 -0.2422446
##  [9,] -0.03867333 -0.6783156    0  0.04631262 -0.2422446
## [10,] -0.03867333 -0.6783156    0 -0.44883949 -0.2422446
# Data of last 10 trials:
choice[(n_T - 9):n_T]
##  [1] 5 5 5 5 5 5 5 5 5 2
Q[(n_T - 9):n_T, ]
##            [,1]       [,2]      [,3]       [,4]     [,5]
##  [1,] 0.4338763 -0.6884072 0.4383738 -0.4436954 1.645595
##  [2,] 0.4338763 -0.6884072 0.4383738 -0.4436954 1.643409
##  [3,] 0.4338763 -0.6884072 0.4383738 -0.4436954 1.643102
##  [4,] 0.4338763 -0.6884072 0.4383738 -0.4436954 1.639869
##  [5,] 0.4338763 -0.6884072 0.4383738 -0.4436954 1.641780
##  [6,] 0.4338763 -0.6884072 0.4383738 -0.4436954 1.641938
##  [7,] 0.4338763 -0.6884072 0.4383738 -0.4436954 1.641994
##  [8,] 0.4338763 -0.6884072 0.4383738 -0.4436954 1.640815
##  [9,] 0.4338763 -0.6884072 0.4383738 -0.4436954 1.641947
## [10,] 0.4338763 -0.6884072 0.4383738 -0.4436954 1.640969
# Summaries:
sum(reward)    # reward
## [1] 1384.849
table(choice)  # choice distribution
## choice
##   1   2   3   4   5 
##  20  22 110  24 824
Q[n_T, ]  # final expected values (per option)                    
## [1]  0.4338763 -0.6884072  0.4383738 -0.4436954  1.6409692
round(q_true - Q[n_T, ], 4)  # difference to true values
##     O_1     O_2     O_3     O_4     O_5 
##  0.0559  0.2899  0.0725  0.0449 -0.0023

Visualize results:

# Combine data: 
df_1 <- data.frame(t = 1:n_T, 
                   choice = choice, 
                   reward = reward) 

df_Q <- as.data.frame(Q)  # convert Q matrix into a data frame
names(df_Q) <- paste0("Q_", 1:k) # add names
df_all <- cbind(df_1, df_Q)

# Graphical parameters: 
q_true  # matching colors: 
##        O_1        O_2        O_3        O_4        O_5 
##  0.4897862 -0.3985414  0.5108363 -0.3988121  1.6386861
my_cols <- c("green2", "grey20", "green3", "grey40", "green4")

# Agent choices over time: 
ggplot(df_all, aes(x = t)) +
  geom_line(aes(y = choice),  alpha = .2) + 
  geom_point(aes(y = choice, color = factor(choice)), alpha = .5) + 
  scale_color_manual(values = my_cols) + 
  labs(x = "Trial t", y = "Option", color = "Choice:") + 
  theme_classic()

# Dynamics of Q values over time:
ggplot(df_all, aes(x = t)) +
  geom_line(aes(y = Q_1), color = my_cols[1]) +
  geom_line(aes(y = Q_2), color = my_cols[2]) +
  geom_line(aes(y = Q_3), color = my_cols[3]) +
  geom_line(aes(y = Q_4), color = my_cols[4]) +
  geom_line(aes(y = Q_5), color = my_cols[5], size = 1) + 
  labs(x = "Trial t", y = "Q value") + 
  theme_classic()

1.3.2 The 10-armed testbed

Another \(k\)-armed bandit (with rewards drawn from stable normal distributions) and an agent with \(\epsilon\)-greedy action selection and incremental updating of expected option values (i.e., table of \(Q\)-values with decreasing step size parameter \(\alpha\)) (Sutton & Barto, 2018, Section 2.3).

In this simulation, we pursue two additional goals:

  • Comparing greedy with several \(\epsilon\)-greedy action-value methods

  • Simulating a suite of n_S = 2000 test problems (rather than only a single one) to obtain a measure of the learning algorithm’s average behavior

Essentially, this implementation aims to reproduce Figure 2.2 of Sutton & Barto (2018, p. 29). Most of the code above can be recycled. However, as each simulation generates a 2D data frame as its output, we can either use multi-dimensional data structures (specifically, two 3-dimensional arrays) to record the results obtained from multiple simulations, pre-configure a very long data frame, or use corresponding vectors that allow storing the required values for all time steps/trials, simulations, and values of epsilon.

Steps:

Abstraction

  1. Wrap the simulation (from above) in a function (returning data):
simulate <- function(n_T){
  
  # Initialize data structures:
  n_chosen <- rep(0, k)   # option choices N(a)
  choice <- rep(NA, n_T)  # choice history
  reward <- rep(NA, n_T)  # reward history
  Q <- matrix(nrow = n_T, ncol = k, byrow = TRUE)  # Q value matrix
  Q[1, ] <- Q_initial  # initialize 1st row
  
  # Simulation:
  for (t in 1:n_T) {
    
    # 1. Type of move in this trial: 
    greedy <- sample(c(FALSE, TRUE), 1, prob = c(epsilon, 1 - epsilon))
    
    # 2. Select action/make a choice:
    if (greedy) {
      
      ix_Q_max <- which(Q[t, ] == max(Q[t, ]))  # position(s) of max. Q values
      
      if (length(ix_Q_max) > 1) { # multiple maxima: 
        
        choice[t] <- sample(ix_Q_max, 1)  # break ties randomly 
        
      } else { # only 1 maximum:  
        
        choice[t] <- ix_Q_max 
        
      }
      
    } else { # non-greedy/exploratory move:
      
      choice[t] <- sample(1:k, 1)
      
    }
    
    # 3. Get reward (from environment): 
    
    # reward[t] <- rnorm(n = 1, mean = q_true[choice[t]], sd = 1)  # implicit
    
    reward[t] <- bandit(action = choice[t])  # explicit
    
    
    # 4. Process reward and adjust expectations:
    
    # a. Update step size (alpha): 
    n_chosen[choice[t]] <- n_chosen[choice[t]] + 1  # update N(a) value
    alpha <- 1/n_chosen[choice[t]]  # step size parameter
    
    # b. Update Q values: 
    if (t < n_T) {  # a next trial exists: 
      
      Q[t + 1, ] <- Q[t, ]  # copy previous Q values
      
      Q_t <- Q[t, choice[t]]  # Q value of current choice 
      
      # Update Q value of the chosen option:
      Q[t + 1, choice[t]] <- Q_t + (alpha * (reward[t] - Q_t))
      
    } # end if. 
    
  } # end loop. 
  
  # Combine data: 
  df_1 <- data.frame(t = 1:n_T, 
                     choice = choice, 
                     reward = reward) 
  
  df_Q <- as.data.frame(Q)  # convert Q matrix into a data frame
  names(df_Q) <- paste0("Q_", 1:k) # add names
  df <- cbind(df_1, df_Q)
  
  return(df)
  
} # simulate(). 

# Check:
data <- simulate(1000)
dim(data)   # n_T x 8
## [1] 1000    8
head(data)  # first trials
##   t choice    reward Q_1       Q_2 Q_3       Q_4      Q_5
## 1 1      2 -2.653199   0  0.000000   0  0.000000 0.000000
## 2 2      4 -1.450149   0 -2.653199   0  0.000000 0.000000
## 3 3      5  1.712318   0 -2.653199   0 -1.450149 0.000000
## 4 4      5  1.664053   0 -2.653199   0 -1.450149 1.712318
## 5 5      5  1.915457   0 -2.653199   0 -1.450149 1.688186
## 6 6      5  1.836424   0 -2.653199   0 -1.450149 1.763943
tail(data)  # final trials
##         t choice    reward       Q_1        Q_2       Q_3        Q_4      Q_5
## 995   995      1 1.4880288 0.7546710 -0.5120639 0.2092641 -0.4861758 1.672361
## 996   996      5 2.5119905 0.8005059 -0.5120639 0.2092641 -0.4861758 1.672361
## 997   997      5 2.8431557 0.8005059 -0.5120639 0.2092641 -0.4861758 1.673268
## 998   998      1 0.1210793 0.8005059 -0.5120639 0.2092641 -0.4861758 1.674532
## 999   999      5 1.7562492 0.7605396 -0.5120639 0.2092641 -0.4861758 1.674532
## 1000 1000      5 0.5040202 0.7605396 -0.5120639 0.2092641 -0.4861758 1.674620

Given the abstraction gained from our simulate() function, we can now set up a simulation that varies some aspects of the agent and the environment.

Storing process data

A crucial part consists in extracting and collecting the data that we need to evaluate the result. In the following, we implement three alternative solutions to illustrate a trade-off between data structure, requirements for post-processing, and computing time:

    1. (A) Using two 3D arrays to collect results is relatively easy and fast, but requires more extensive post-processing into a 2D data structure later.
    2. (B) Using a long 2D data frame is slightly more complicated when defining and filling its contents into the correct slots, but allows for much easier post-processing later. However, this method takes considerably more computing time than (A).
    3. (C) Using two empty vectors and incrementally adding new results to them requires the least effort in setting up the data storage. However, it requires that the vectors are later integrated into a data frame that contains columns for the independent variables (here: epsilon, n_S, and n_T), and it gets slower as the vectors grow longer. (See Wataru Toyokawa’s Exercise 1 for an alternative implementation of this solution.)

Simulation

# Setup: ---- 
set.seed(246)  # reproducible randomness 

# A. Environment:
k <- 10  # number of options
# q_true <- rnorm(n = k, mean = 0, sd = 1)  # true reward means 
# names(q_true) <- paste0("O_", 1:k)

# B. Agent: 
epsilon_v <- c(0, .01, .10)  # vector of exploration rates
Q_initial <- rep(0, k)  # initial expected value per option

# C. Simulation:
n_S <- 2000  # N of simulations per agent
n_T <- 1000  # N of time steps/trials


## Prepare data structures for recording simulation results: ---- 

# Note: Use ONE of the following methods (A--C): 

# (A) Use two 3D-Arrays for storing simulation results:
all_rewards <- array(dim = c(n_T, n_S, length(epsilon_v)))
dimnames(all_rewards) <- list(paste0("t_", 1:n_T), paste0("s_", 1:n_S), paste0("e_", epsilon_v))

opt_choices <- array(dim = c(n_T, n_S, length(epsilon_v)))
dimnames(opt_choices) <- list(paste0("t_", 1:n_T), paste0("s_", 1:n_S), paste0("e_", epsilon_v))

# # (B) Use a very long 2D data frame:
# all_nr_trials <- length(epsilon_v) * n_S * n_T
# all_data_long <- data.frame(eps = rep(epsilon_v, each = (n_S * n_T)),
#                             sim = rep(rep(1:n_S, each = n_T), times = length(epsilon_v)),
#                             trial = rep(rep(1:n_T, times = n_S), times = length(epsilon_v)),
#                             reward = rep(NA, all_nr_trials),
#                             opt_choice = rep(NA, all_nr_trials))
# # dim(all_data_long)

# # (C) Use two vectors (to which new results are iteratively added):
# all_rewards_vec     <- NULL 
# all_opt_choices_vec <- NULL


## Wrap the simulation in 2 loops: ---- 
for (e in seq_along(epsilon_v)){
  
  epsilon <- epsilon_v[e]
  print(paste0("epsilon = ", epsilon))  # user feedback: epsilon value  
  
  st <- system.time( # time evaluation of expression:
    
    for (s in 1:n_S){
      
      # print(paste0("s = ", s))  # 4debugging 
      
      # Re-initialize environment:
      q_true <- rnorm(n = k, mean = 0, sd = 1)  # true reward means 
      names(q_true) <- paste0("O_", 1:k)
      
      # Run simulation:
      data <- simulate(n_T)  
      
      # Extract data needed:
      q_max <- which(q_true == max(q_true))
      opt_choice_v <- (data$choice == q_max)
      
      # Store data needed: ---- 
      
      # Note: Use ONE of the following methods (A--C): 
      
      # (A) Record data in two 3D arrays:
      all_rewards[ , s, e] <- data$reward   # vector of n_T reward values
      opt_choices[ , s, e] <- opt_choice_v  # vector of n_T logical values
      
      # # (B) Record data in a very long 2D data frame:
      # cur_row <- which((all_data_long$eps == epsilon)  # get current starting row
      #                  & (all_data_long$sim == s)
      #                  & (all_data_long$trial == 1))
      # # print(cur_row)  # 4debugging
      # 
      # all_data_long$reward[cur_row:(cur_row + n_T - 1)]     <- data$reward
      # all_data_long$opt_choice[cur_row:(cur_row + n_T - 1)] <- opt_choice_v
      
      # # (C) Append new vectors to the current ones:
      # all_rewards_vec     <- c(all_rewards_vec, data$reward)
      # all_opt_choices_vec <- c(all_opt_choices_vec, opt_choice_v)
      
    } # for s.
    
  ) # system.time(). 
  
  print(st)  # user feedback: time elapsed
  
} # for e. 
## [1] "epsilon = 0"
##    user  system elapsed 
##  30.036   1.068  31.375 
## [1] "epsilon = 0.01"
##    user  system elapsed 
##  29.784   1.039  30.928 
## [1] "epsilon = 0.1"
##    user  system elapsed 
##  31.234   1.406  33.432

Note that running n_S simulations for 3 values of \(\epsilon\) takes a while (e.g., several minutes on my PC). To monitor the computing duration, we added the system.time() function to measure the execution time of the 2000 simulations for each \(\epsilon\)-value. Running the loop with each of the three recording methods (A–C) in turn shows that recording the data in the long 2D data frame (B) is more than 5 times slower than using the 3D arrays (A). Similarly, incrementally adding new data to vectors (C) is initially about 4 times faster than (B), but still slower than (A), and gets slower with additional values of \(\epsilon\) (as the vectors to which we add data get longer).

Overall, our comparison shows that using the 3D arrays (A) is the fastest of the considered methods, but uses the most complex data construct (i.e., arrays, rather than data frame of B, or the vectors of C), and will also require the most elaborate post-processing efforts. Thus, running simulations involves considerable trade-offs between data constructs, requirements for post-processing, and computing time. As the effects of seemingly small choices (like the data structure used to store results or the way a current row index is computed) are multiplied by large numbers, small differences in the initial setup can result in huge overall effects. The optimal balance between these trade-offs depends less on the speed of the computer than on the modeler’s experience and skills.
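
The following minimal sketch (my own illustration, not part of the original simulation) shows the main reason why method (C) degrades: growing a vector with c() copies the entire vector on every iteration, whereas methods (A) and (B) write into storage that was reserved once:

n <- 50000

grow <- function(n){
  v <- NULL
  for (i in 1:n) v <- c(v, i)  # copies the whole vector on every iteration
  v
}

prealloc <- function(n){
  v <- numeric(n)
  for (i in 1:n) v[i] <- i  # writes into memory reserved once
  v
}

system.time(grow(n))      # noticeably slower
system.time(prealloc(n))  # fast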

Results

Depending on the method of storing the process data during the simulation, we must analyze different data structures to interpret the results.

  1. Rewards over time (using the 3D array all_rewards):
# (A) Rewards:
dim(all_rewards)
## [1] 1000 2000    3
# head(all_rewards)

# Flatten 3D-array into a 2D-data frame: ---- 

# # (a) using base R and tidyr functions:
# m1 <- apply(all_rewards, 3L, c)  # flatten by 3rd margin
# m0 <- tidyr::expand_grid(expand.grid(dimnames(all_rewards)[1:2]))
# ar <- data.frame(m0, m1)
# names(ar)[1:2] <- c("trial", "sim")

# (b) using i2ds array utility function: 
ar <- i2ds::flatten_array(all_rewards, margin = 3)
names(ar)[1:2] <- c("trial", "sim")

# Convert character data into numeric data:
ar$trial <- as.numeric(substr(ar$trial, 3, 99))
ar$sim   <- as.numeric(substr(ar$sim, 3, 99))

# Inspect: 
head(ar)
##   trial sim        e_0     e_0.01      e_0.1
## 1     1   1 -0.5579740  0.2693687 -1.9034697
## 2     2   1  0.5917097  1.5062428  2.3776825
## 3     3   1  3.6592104 -1.8476057 -0.4972642
## 4     4   1  0.6640653  3.2136294  0.6202576
## 5     5   1 -0.2190722  2.0805950  0.4926858
## 6     6   1 -0.2006110  2.3659025  1.2452664
tail(ar)
##         trial  sim        e_0     e_0.01      e_0.1
## 1999995   995 2000 0.42730797  3.7814709 -0.1572958
## 1999996   996 2000 1.21480299  2.1150134 -0.1330691
## 1999997   997 2000 0.17735335  2.6273146  0.4263504
## 1999998   998 2000 1.85546349 -0.0077369  0.7482845
## 1999999   999 2000 0.09821795 -0.5019054  0.1735554
## 2000000  1000 2000 3.21411779 -0.4726288  2.0584628
# Summarize for each trial: ---- 
df_rew <- ar %>% 
  group_by(trial) %>%
  summarise(mn_rew_e_0 = mean(e_0),
            mn_rew_e_0.01 = mean(e_0.01),
            mn_rew_e_0.10 = mean(e_0.1))


# Plot: ---- 
my_cols <- c("forestgreen", "firebrick3", "steelblue3")
my_x <- (n_T - n_T/15)

ggplot(df_rew, aes(x = trial)) +
  geom_line(aes(y = mn_rew_e_0), color = my_cols[1], size = .5) + 
  annotate("text", x = my_x, y = .92, label = paste("epsilon ==", 0, "~(greedy)"),   
           color = my_cols[1], parse = TRUE) + 
  geom_line(aes(y = mn_rew_e_0.01), color = my_cols[2], size = .5) + 
  annotate("text", x = my_x, y = 1.18, label = paste0("epsilon ==", 0.01), 
           color = my_cols[2], parse = TRUE) + 
  geom_line(aes(y = mn_rew_e_0.10), color = my_cols[3], size = .5) + 
  annotate("text", x = my_x, y = 1.50, label = paste0("epsilon ==", 0.10), 
           color = my_cols[3], parse = TRUE) + 
  labs(tag = "A", x = "Trial", y = "Mean reward") + 
  scale_y_continuous(limits = c(0, 1.5)) + 
  theme_classic()

  2. Optimal choices (using the 3D array opt_choices):
# (B) Optimal choices:
dim(opt_choices)
## [1] 1000 2000    3
# head(opt_choices)

# Flatten 3D-array into a 2D-data frame: ---- 

# # (a) using base R and tidyr functions:
# m1 <- apply(opt_choices, 3L, c)  # flatten by 3rd margin
# m0 <- expand_grid(expand.grid(dimnames(opt_choices)[1:2]))
# oc <- data.frame(m0, m1)
# names(oc)[1:2] <- c("trial", "sim")

# (b) using i2ds array utility function: 
oc <- i2ds::flatten_array(opt_choices, margin = 3)
names(oc)[1:2] <- c("trial", "sim")

# Convert character data into numeric data:
oc$trial <- as.numeric(substr(oc$trial, 3, 99))
oc$sim   <- as.numeric(substr(oc$sim, 3, 99))

# Inspect: 
head(oc)
##   trial sim   e_0 e_0.01 e_0.1
## 1     1   1 FALSE  FALSE FALSE
## 2     2   1 FALSE  FALSE FALSE
## 3     3   1 FALSE  FALSE FALSE
## 4     4   1 FALSE  FALSE FALSE
## 5     5   1 FALSE  FALSE FALSE
## 6     6   1 FALSE  FALSE FALSE
tail(oc)
##         trial  sim   e_0 e_0.01 e_0.1
## 1999995   995 2000 FALSE   TRUE FALSE
## 1999996   996 2000 FALSE   TRUE  TRUE
## 1999997   997 2000 FALSE   TRUE  TRUE
## 1999998   998 2000 FALSE   TRUE  TRUE
## 1999999   999 2000 FALSE   TRUE  TRUE
## 2000000  1000 2000 FALSE   TRUE  TRUE
# Summarize for each trial: ---- 
df_eps <- oc %>% 
  group_by(trial) %>%
  summarise(mean_e_0 = mean(e_0),
            mean_e_0.01 = mean(e_0.01),
            mean_e_0.10 = mean(e_0.1))

# Plot: ---- 
ggplot(df_eps, aes(x = trial)) +
  geom_line(aes(y = mean_e_0), color = my_cols[1], size = .8) + 
  annotate("text", x = my_x, y = .36, label = paste0("epsilon ==", 0, "~(greedy)"), 
           color = my_cols[1], parse = TRUE) + 
  geom_line(aes(y = mean_e_0.01), color = my_cols[2], size = .8) + 
  annotate("text", x = my_x, y = .54, label = paste0("epsilon ==", 0.01), 
           color = my_cols[2], parse = TRUE) + 
  geom_line(aes(y = mean_e_0.10), color = my_cols[3], size = .8) + 
  annotate("text", x = my_x, y = .76, label = paste0("epsilon ==", 0.10), 
           color = my_cols[3], parse = TRUE) + 
  labs(tag = "B", x = "Trial", y = "Optimal action (in %)") + 
  scale_y_continuous(limits = c(0, 1), labels = scales::percent) + 
  theme_classic()

  3. Using the 2D data frame all_data_long (from method B) or the two vectors all_rewards_vec and all_opt_choices_vec (from C) allows creating the same graphs (as it should, since they store the same data), but makes the post-processing tasks much simpler. The following code would replace both 1. and 2.:
  • Verify that the methods (B) and (C) yield the same results:
# Verify equality of B and C: ---- 

# Check equality of vectors from (B) and (C):
all.equal(all_data_long$reward, all_rewards_vec) 
all.equal(all_data_long$opt_choice, all_opt_choices_vec) 

# Turn 2 vectors from (C) into a data frame (as in B):
all_data_long_2 <- data.frame(eps = rep(epsilon_v, each = (n_S * n_T)),
                              sim = rep(rep(1:n_S, each = n_T), times = length(epsilon_v)),
                              trial = rep(rep(1:n_T, times = n_S), times = length(epsilon_v)),
                              reward = all_rewards_vec,
                              opt_choice = all_opt_choices_vec)

# Verify equality of data frames from (B) and (C): 
all.equal(all_data_long, all_data_long_2)
  • Analyze the result data when stored in the 2D data frame (from B or C):
# Inspect: ---- 
dim(all_data_long)
head(all_data_long)
tail(all_data_long)

# Summarize (aggregate over simulations): ---- 
df_sum <- all_data_long %>%
  group_by(eps, trial) %>%
  summarise(nr = n(),
            nr_NA_rew = sum(is.na(reward)),
            mn_reward = mean(reward),
            nr_NA_opt = sum(is.na(opt_choice)),
            mn_opt_choice = mean(opt_choice)
            )
tail(df_sum)

# Plot: ---- 

# A. Rewards over time:
ggplot(df_sum, aes(x = trial, color = factor(eps))) +
  geom_line(aes(y = mn_reward), size = .5) + 
  labs(tag = "A", x = "Trial", y = "Mean reward", 
       color = "epsilon:") + 
  scale_color_manual(values = my_cols) + 
  scale_y_continuous(limits = c(0, 1.5)) + 
  theme_classic()

# B. Optimal choices:
ggplot(df_sum, aes(x = trial, color = factor(eps))) +
  geom_line(aes(y = mn_opt_choice), size = .8) + 
  labs(tag = "B", x = "Trial", y = "Optimal choice (in %)", 
       color = "epsilon:") + 
  scale_color_manual(values = my_cols) + 
  scale_y_continuous(limits = c(0, 1), labels = scales::percent) + 
  theme_classic()


2 Miscellaneous notes

2.1 Key elements of RL

  • Conceptual distinction between agent vs. environment

  • Learning by interaction to maximize reward

  • Agent must sense some environmental state, select an action, receive reward, and evaluate the action’s success

  • Value: central role of evaluative feedback (value function provides estimates of future rewards)

2.1.1 Definitions

Key assumptions and characteristic elements of RL:

  • Agent: A hedonistic system that wants something and maximizes some reward signal (p. xvii). A learning agent must discover actions in a trial-and-error fashion; its repertoire of actions is shaped by feedback and evaluation. The agent senses environmental states and acts to affect them.
    An RL agent can be an autonomous organism or robot, but can also be a component of larger systems.

  • Environment: state, related to agent’s goals.

  • Process: Learning from experience gained by environmental interaction to achieve goals (p. xviii), goal-directed learning from interaction to solve problems (p. 1).

  • Perspective: A computational approach to learning from interaction (p. 1):
    Adopting the perspective of an artificial intelligence researcher or engineer: Explore idealized learning situations and evaluate the effectiveness of various learning methods.

References

Minsky, M. (1961). Steps toward artificial intelligence. Proceedings of the IRE, 49(1), 8–30. https://doi.org/10.1109/JRPROC.1961.287775

Sutton, R. S., & Barto, A. G. (2018). Reinforcement learning: An introduction (2nd ed.). Retrieved from http://incompleteideas.net/book/the-book-2nd.html