Research
Publications in reversed chronological order.
2022

UpsideDown Reinforcement Learning Can Diverge in Stochastic Environments With Episodic Resets Štrupl, Miroslav, Faccio, Francesco, Ashley, Dylan R., Schmidhuber, Jürgen, and Srivastava, Rupesh Kumar In arXiv [Abstract] [arXiv] [Code]
UpsideDown Reinforcement Learning (UDRL) is an approach for solving RL problems that does not require value functions and uses only supervised learning, where the targets for given inputs in a dataset do not change over time. Ghosh et al. proved that GoalConditional Supervised Learning (GCSL) – which can be viewed as a simplified version of UDRL – optimizes a lower bound on goalreaching performance. This raises expectations that such algorithms may enjoy guaranteed convergence to the optimal policy in arbitrary environments, similar to certain wellknown traditional RL algorithms. Here we show that for a specific episodic UDRL algorithm (eUDRL, including GCSL), this is not the case, and give the causes of this limitation. To do so, we first introduce a helpful rewrite of eUDRL as a recursive policy update. This formulation helps to disprove its convergence to the optimal policy for a wide class of stochastic environments. Finally, we provide a concrete example of a very simple environment where eUDRL diverges. Since the primary aim of this paper is to present a negative result, and the best counterexamples are the simplest ones, we restrict all discussions to finite (discrete) environments, ignoring issues of function approximation and limited sample size.

Learning Relative Return Policies With UpsideDown Reinforcement Learning Ashley, Dylan R., Arulkumaran, Kai, Schmidhuber, Jürgen, and Srivastava, Rupesh Kumar In Proceedings of the Multidisciplinary Conference on Reinforcement Learning and Decision Making [Abstract] [arXiv]
Lately, there has been a resurgence of interest in using supervised learning to solve reinforcement learning problems. Recent work in this area has largely focused on learning commandconditioned policies. We investigate the potential of one such method – upsidedown reinforcement learning – to work with commands that specify a desired relationship between some scalar value and the observed return. We show that upsidedown reinforcement learning can learn to carry out such commands online in a tabular bandit setting and in CartPole with nonlinear function approximation. By doing so, we demonstrate the power of this family of methods and open the way for their practical use under more complicated command structures.

All You Need Is Supervised Learning: From Imitation Learning to MetaRL With Upside Down RL Arulkumaran, Kai, Ashley, Dylan R., Schmidhuber, Jürgen, and Srivastava, Rupesh Kumar In Proceedings of the Multidisciplinary Conference on Reinforcement Learning and Decision Making [Abstract] [arXiv]
Upside down reinforcement learning (UDRL) flips the conventional use of the return in the objective function in RL upside down, by taking returns as input and predicting actions. UDRL is based purely on supervised learning, and bypasses some prominent issues in RL: bootstrapping, offpolicy corrections, and discount factors. While previous work with UDRL demonstrated it in a traditional online RL setting, here we show that this single algorithm can also work in the imitation learning and offline RL settings, be extended to the goalconditioned RL setting, and even the metaRL setting. With a general agent architecture, a single UDRL agent can learn across all paradigms.

RewardWeighted Regression Converges to a Global Optimum Štrupl, Miroslav, Faccio, Francesco, Ashley, Dylan R., Srivastava, Rupesh Kumar, and Schmidhuber, Jürgen In Proceedings of the AAAI Conference on Artificial Intelligence [Abstract] [arXiv] [Code]
RewardWeighted Regression (RWR) belongs to a family of widely known iterative Reinforcement Learning algorithms based on the ExpectationMaximization framework. In this family, learning at each iteration consists of sampling a batch of trajectories using the current policy and fitting a new policy to maximize a returnweighted loglikelihood of actions. Although RWR is known to yield monotonic improvement of the policy under certain circumstances, whether and under which conditions RWR converges to the optimal policy have remained open questions. In this paper, we provide for the first time a proof that RWR converges to a global optimum when no function approximation is used, in a general compact setting. Furthermore, for the simpler case with finite state and action spaces we prove Rlinear convergence of the statevalue function to the optimum.
2021

Automatic Embedding of Stories Into Collections of Independent Media Ashley, Dylan R., Herrmann, Vincent, Friggstad, Zachary, Mathewson, Kory W., and Schmidhuber, Jürgen In arXiv [Abstract] [arXiv] [Code]
We look at how machine learning techniques that derive properties of items in a collection of independent media can be used to automatically embed stories into such collections. To do so, we use models that extract the tempo of songs to make a music playlist follow a narrative arc. Our work specifies an opensource tool that uses pretrained neural network models to extract the global tempo of a set of raw audio files and applies these measures to create a narrativefollowing playlist. This tool is available at https://github.com/dylanashley/playliststorybuilder/releases/tag/v1.0.0

Does the Adam Optimizer Exacerbate Catastrophic Forgetting? Ashley, Dylan R., Ghiassian, Sina, and Sutton, Richard S. In arXiv [Abstract] [arXiv] [Code]
Catastrophic forgetting remains a severe hindrance to the broad application of artificial neural networks (ANNs), however, it continues to be a poorly understood phenomenon. Despite the extensive amount of work on catastrophic forgetting, we argue that it is still unclear how exactly the phenomenon should be quantified, and, moreover, to what degree all of the choices we make when designing learning systems affect the amount of catastrophic forgetting. We use various testbeds from the reinforcement learning and supervised learning literature to (1) provide evidence that the choice of which modern gradientbased optimization algorithm is used to train an ANN has a significant impact on the amount of catastrophic forgetting and show thatsurprisinglyin many instances classical algorithms such as vanilla SGD experience less catastrophic forgetting than the more modern algorithms such as Adam. We empirically compare four different existing metrics for quantifying catastrophic forgetting and (2) show that the degree to which the learning systems experience catastrophic forgetting is sufficiently sensitive to the metric used that a change from one principled metric to another is enough to change the conclusions of a study dramatically. Our results suggest that a much more rigorous experimental methodology is required when looking at catastrophic forgetting. Based on our results, we recommend intertask forgetting in supervised learning must be measured with both retention and relearning metrics concurrently, and intratask forgetting in reinforcement learning mustat the very leastbe measured with pairwise interference.

Back to Square One: Superhuman Performance in Chutes and Ladders Through Deep Neural Networks and Tree Search Ashley, Dylan R., Kanervisto, Anssi, and Bennett, Brendan In Proceedings of the Conference of the ACH Special Interest Group on Harry Q. Bovik [Abstract] [arXiv] [PDF] [Code]
We present AlphaChute: a stateoftheart algorithm that achieves superhuman performance in the ancient game of Chutes and Ladders. We prove that our algorithm converges to the Nash equilibrium in constant time, and therefore is–to the best of our knowledge–the first such formal solution to this game. Surprisingly, despite all this, our implementation of AlphaChute remains relatively straightforward due to domainspecific adaptations. We provide the source code for AlphaChute here in our Appendix.
2020

Understanding Forgetting in Artificial Neural Networks Ashley, Dylan R. Master’s thesis, University of Alberta [Abstract] [PDF] [Code] [Slides]
This thesis is offered as a step forward in our understanding of forgetting in artificial neural networks. ANNs are a learning system loosely based on our understanding of the brain and are responsible for recent breakthroughs in artificial intelligence. However, they have been reported to be particularly susceptible to forgetting. Specifically, existing research suggests that ANNs may exhibit unexpectedly high rates of retroactive inhibition when compared with results from psychology studies measuring forgetting in people. If this phenomenon, dubbed catastrophic forgetting, exists, then explicit methods intended to reduce it may increase the scope of problems ANNs can be successfully applied to. In this thesis, we contribute to the field by answering five questions related to forgetting in ANNs: How does forgetting in psychology relate to ideas in machine learning? What is catastrophic forgetting? Does it exist in contemporary systems, and, if so, is it severe? How can we measure a system’s susceptibility to it? Are the current optimization algorithms we use to train ANNs adding to its severity? This work answers each of the five questions sequentially. We begin by answering the first and second of the five questions by providing an analytical survey that looks at the concept of forgetting as it appears in psychology and connects it to various ideas in machine learning such as generalization, transfer learning, experience replay, and eligibility traces. We subsequently confirm the existence and severity of catastrophic forgetting in some contemporary machine learning systems by showing that it appears when a simple, modern ANN (multilayered fullyconnected network with rectified linear unit activation) is trained using a conventional algorithm (Stochastic Gradient Descent through backpropagation with normal random initialization) incrementally on a wellknown multiclass classification setting (MNIST). We demonstrate that the phenomenon is a more subtle problem than a simple reversal of learning. We accomplish this by noting that both total learning time and relearning time are reduced when the multiclass classification problem is split into multiple phases containing samples from disjoint subsets of the classes. We then move on to looking at how we can measure the degree to which ANNbased learning systems suffer from catastrophic forgetting by constructing a principled testbed out of the previous multitask supervised learning problem and two wellstudied reinforcement learning problems (Mountain Car and Acrobot). We apply this testbed to answer the final of the five questions by looking at how several modern gradientbased optimization algorithms used to train ANNs (SGD, SGD with Momentum, RMSProp, and Adam) affect the amount of catastrophic forgetting that occurs during training. While doing so, we are able to confirm and expand previous hypotheses surrounding the complexities of measuring catastrophic forgetting. We find that different algorithms, even when applied to the same ANN, result in significantly different amounts of catastrophic forgetting under a variety of different metrics. We believe that our answers to the five questions constitute a step forward in our understanding of forgetting as it appears in ANNs. Such an understanding is essential for realizing the full potential that ANNs offer to the study of artificial intelligence.

Universal Successor Features for Transfer Reinforcement Learning Ma, Chen, Ashley, Dylan R., Wen, Junfeng, and Bengio, Yoshua In arXiv [Abstract] [arXiv]
Transfer in Reinforcement Learning (RL) refers to the idea of applying knowledge gained from previous tasks to solving related tasks. Learning a universal value function (Schaul et al., 2015), which generalizes over goals and states, has previously been shown to be useful for transfer. However, successor features are believed to be more suitable than values for transfer (Dayan, 1993; Barreto et al., 2017), even though they cannot directly generalize to new goals. In this paper, we propose (1) Universal Successor Features (USFs) to capture the underlying dynamics of the environment while allowing generalization to unseen goals and (2) a flexible endtoend model of USFs that can be trained by interacting with the environment. We show that learning USFs is compatible with any RL algorithm that learns state values using a temporal difference method. Our experiments in a simple gridworld and with two MuJoCo environments show that USFs can greatly accelerate training when learning multiple tasks and can effectively transfer knowledge to new tasks.
2019

Learning to Select Mates in Evolving Nonplayable Characters Ashley, Dylan R., Chockalingam, Valliappa, Kuzma, Braedy, and Bulitko, Vadim In Proceedings of the IEEE Conference on Games [Abstract] [PDF] [Slides]
Procedural content generation (PCG) is an active area of research with the potential to significantly reduce game development costs as well as create game experiences meaningfully personalized to each player. Evolutionary methods are a promising method of generating content procedurally. In particular asynchronous evolution of AI agents in an artificial life (Alife) setting is notably similar to the online evolution of nonplayable characters in a video game. In this paper, we are concerned with improving the efficiency of evolution via more effective mate selection. In the spirit of PCG, we genetically encode each agent’s preference for mating partners and thereby allowing the mateselection process to evolve. We evaluate this approach in a simple predatorprey Alife environment and demonstrate that the ability to evolve a peragent mateselection preference function indeed significantly increases the extinction time of the population. Additionally, an inspection of the evolved preference function parameters shows that agents evolve to favor mates who have survival traits.

Learning to Select Mates in Artificial Life Ashley, Dylan R., Chockalingam, Valliappa, Kuzma, Braedy, and Bulitko, Vadim In Proceedings of the Genetic and Evolutionary Computation Conference Companion [Abstract] [PDF] [Code]
Artificial life (Alife) simulations present a natural way to study interesting phenomena emerging in a population of evolving agents. In this paper, we investigate whether allowing Alife agents to select mates can extend the lifetime of a population. In our approach, each agent evaluates potential mates via a preference function. The role of this function is to map information about an agent and its candidate mate to a scalar preference for deciding whether or not to form an offspring. We encode the parameters of the preference function genetically within each agent, thus allowing such preferences to be agentspecific as well as evolving over time. We evaluate this approach in a simple predatorprey Alife environment and demonstrate that the ability to evolve a peragent mateselection preference function indeed significantly increases the extinction time of the population. Additionally an inspection of the evolved preference function parameters shows that agents evolve to favor mates who have survival traits.
2018

Comparing Direct and Indirect TemporalDifference Methods for Estimating the Variance of the Return Sherstan, Craig, Ashley, Dylan R., Bennett, Brendan, Young, Kenny, White, Adam, White, Martha, and Sutton, Richard S. In Proceedings of the Conference on Uncertainty in Artificial Intelligence [Abstract] [PDF] [SUP] [Code] [Poster] [Slides]
Temporaldifference (TD) learning methods are widely used in reinforcement learning to estimate the expected return for each state, without a model, because of their significant advantages in computational and data efficiency. For many applications involving risk mitigation, it would also be useful to estimate the variance of the return by TD methods. In this paper, we describe a way of doing this that is substantially simpler than those proposed by Tamar, Di Castro, and Mannor in 2012, or those proposed by White and White in 2016. We show that two TD learners operating in series can learn expectation and variance estimates. The trick is to use the square of the TD error of the expectation learner as the reward of the variance learner, and the square of the expectation learner’s discount rate as the discount rate of the variance learner. With these two modifications, the variance learning problem becomes a conventional TD learning problem to which standard theoretical results can be applied. Our formal results are limited to the table lookup case, for which our method is still novel, but the extension to function approximation is immediate, and we provide some empirical results for the linear function approximation case. Our experimental results show that our direct method behaves just as well as a comparable indirect method, but is generally more robust.

The Alberta Workloads for the SPEC CPU 2017 Benchmark Suite Amaral, José Nelson, Borin, Edson, Ashley, Dylan R., Benedicto, Caian, Colp, Elliot, Hoffmam, Joao Henrique Stange, Karpoff, Marcus, Ochoa, Erick, Redshaw, Morgan, and Rodrigues, Raphael Ernani In Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software [Abstract] [PDF] [Code]
A proper evaluation of techniques that require multiple training and evaluation executions of a benchmark, such as FeedbackDirected Optimization (FDO), requires multiple workloads that can be used to characterize variations on the behaviour of a program based on the workload. This paper aims to improve the performance evaluation of computer systems  including compilers, computer architecture simulation, and operatingsystem prototypes  that rely on the industrystandard SPEC CPU benchmark suite. A main concern with the use of this suite in research is that it is distributed with a very small number of workloads. This paper describes the process to create additional workloads for this suite and offers useful insights in many of its benchmarks. The set of additional workloads created, named the Alberta Workloads for the SPEC CPU 2017 Benchmark Suite1 is made freely available with the goal of providing additional data points for the exploration of learning in computing systems. These workloads should also contribute to ameliorate the hidden learning problem where a researcher sets parameters to a system during development based on a set of benchmarks and then evaluates the system using the very same set of benchmarks with the very same workloads.

Directly Estimating the Variance of the λReturn Using TemporalDifference Methods Sherstan, Craig, Bennett, Brendan, Young, Kenny, Ashley, Dylan R., White, Adam, White, Martha, and Sutton, Richard S. In arXiv [Abstract] [arXiv]
This paper investigates estimating the variance of a temporaldifference learning agent’s update target. Most reinforcement learning methods use an estimate of the value function, which captures how good it is for the agent to be in a particular state and is mathematically expressed as the expected sum of discounted future rewards (called the return). These values can be straightforwardly estimated by averaging batches of returns using Monte Carlo methods. However, if we wish to update the agent’s value estimates during learning–before terminal outcomes are observed–we must use a different estimation target called the λreturn, which truncates the return with the agent’s own estimate of the value function. Temporal difference learning methods estimate the expected λreturn for each state, allowing these methods to update online and incrementally, and in most cases achieve better generalization error and faster learning than Monte Carlo methods. Naturally one could attempt to estimate higherorder moments of the λreturn. This paper is about estimating the variance of the λreturn. Prior work has shown that given estimates of the variance of the λreturn, learning systems can be constructed to (1) mitigate risk in action selection, and (2) automatically adapt the parameters of the learning process itself to improve performance. Unfortunately, existing methods for estimating the variance of the λreturn are complex and not well understood empirically. We contribute a method for estimating the variance of the λreturn directly using policy evaluation methods from reinforcement learning. Our approach is significantly simpler than prior methods that independently estimate the second moment of the λreturn. Empirically our new approach behaves at least as well as existing approaches, but is generally more robust.