Dylan Ashley | Research

2025

On the Convergence and Stability of Upside-Down Reinforcement Learning, Goal-Conditioned Supervised Learning, and Online Decision Transformers Miroslav Štrupl, Oleg Szehr, Francesco Faccio, Dylan R. Ashley, Rupesh Kumar Srivastava, and Jürgen Schmidhuber Submitted to the Journal of Machine Learning Research [Abstract] [arXiv] [Code] [BibTeX]
This article provides a rigorous analysis of convergence and stability of Episodic Upside-Down Reinforcement Learning, Goal-Conditioned Supervised Learning and Online Decision Transformers. These algorithms performed competitively across various benchmarks, from games to robotic tasks, but their theoretical understanding is limited to specific environmental conditions. This work initiates a theoretical foundation for algorithms that build on the broad paradigm of approaching reinforcement learning through supervised learning or sequence modeling. At the core of this investigation lies the analysis of conditions on the underlying environment, under which the algorithms can identify optimal solutions. We also assess whether emerging solutions remain stable in situations where the environment is subject to tiny levels of noise. Specifically, we study the continuity and asymptotic convergence of command-conditioned policies, values and the goal-reaching objective depending on the transition kernel of the underlying Markov Decision Process. We demonstrate that near-optimal behavior is achieved if the transition kernel is located in a sufficiently small neighborhood of a deterministic kernel. The mentioned quantities are continuous (with respect to a specific topology) at deterministic kernels, both asymptotically and after a finite number of learning cycles. The developed methods allow us to present the first explicit estimates on the convergence and stability of policies and values in terms of the underlying transition kernels. On the theoretical side we introduce a number of new concepts to reinforcement learning, like working in segment spaces, studying continuity in quotient topologies and the application of the fixed-point theory of dynamical systems. The theoretical study is accompanied by a detailed investigation of example environments and numerical experiments.
On the Distillation of Stories for Transferring Narrative Arcs in Collections of Independent Media Dylan R. Ashley*, Vincent Herrmann*, Zachary Friggstad, and Jürgen Schmidhuber Published in IEEE Transactions on Pattern Analysis and Machine Intelligence (IF 20.8) and previously presented at the NeurIPS 2023 Workshop on Machine Learning for Creativity and Design and at the NeurIPS 2022 Workshop Information-Theoretic Principles in Cognitive Systems [Abstract] [PDF] [Code] [Poster] [BibTeX]
The act of telling stories is a fundamental part of what it means to be human. This work introduces the concept of narrative information, which we define as the overlap in information space between a story and the items that compose the story. Using contrastive learning methods, we show how modern artificial neural networks can be leveraged to distill stories and extract a representation of the narrative information. We then demonstrate how evolutionary algorithms can leverage this to extract a set of narrative template curves and how these – in tandem with a novel curve-fitting algorithm we introduce – can reorder music albums to automatically induce stories in them. In doing so, we give statistically significant evidence that (1) these narrative information template curves are present in existing albums and that (2) people prefer an album ordered through one of these learned template curves over a random one. The premises of our work extend to any form of (largely) independent media, and as evidence, we also show that our method works with image data.
Upside Down Reinforcement Learning with Policy Generators Jacopo Di Ventura, Dylan R. Ashley, Vincent Herrmann, Francesco Faccio, and Jürgen Schmidhuber Presented at the 6th Multidisciplinary Conference on Reinforcement Learning and Decision Making [Abstract] [arXiv] [Code] [BibTeX]
Upside Down Reinforcement Learning (UDRL) is a promising framework for solving reinforcement learning problems which focuses on learning command-conditioned policies. In this work, we extend UDRL to the task of learning a command-conditioned generator of deep neural network policies. We accomplish this using Hypernetworks - a variant of Fast Weight Programmers, which learn to decode input commands representing a desired expected return into command-specific weight matrices. Our method, dubbed Upside Down Reinforcement Learning with Policy Generators (UDRLPG), streamlines comparable techniques by removing the need for an evaluator or critic to update the weights of the generator. To counteract the increased variance in last returns caused by not having an evaluator, we decouple the sampling probability of the buffer from the absolute number of policies in it, which, together with a simple weighting strategy, improves the empirical convergence of the algorithm. Compared with existing algorithms, UDRLPG achieves competitive performance and high returns, sometimes outperforming more complex architectures. Our experiments show that a trained generator can generalize to create policies that achieve unseen returns zero-shot. The proposed method appears to be effective in mitigating some of the challenges associated with learning highly multimodal functions. Altogether, we believe that UDRLPG represents a promising step forward in achieving greater empirical sample efficiency in RL. A full implementation of UDRLPG is publicly available at https://github.com/JacopoD/udrlpg_

2024

How to Correctly do Semantic Backpropagation on Language-based Agentic Systems Wenyi Wang*, Hisham A. Alyahya*, Dylan R. Ashley, Oleg Serikov, Dmitrii Khizbullin, Francesco Faccio, and Jürgen Schmidhuber Submitted to the 42nd International Conference on Machine Learning [Abstract] [arXiv] [Code] [BibTeX]
Language-based agentic systems have shown great promise in recent years, transitioning from solving small-scale research problems to being deployed in challenging real-world tasks. However, optimizing these systems often requires substantial manual labor. Recent studies have demonstrated that these systems can be represented as computational graphs, enabling automatic optimization. Despite these advancements, most current efforts in Graph-based Agentic System Optimization (GASO) fail to properly assign feedback to the system’s components given feedback on the system’s output. To address this challenge, we formalize the concept of semantic backpropagation with semantic gradients – a generalization that aligns several key optimization techniques, including reverse-mode automatic differentiation and the more recent TextGrad by exploiting the relationship among nodes with a common successor. This serves as a method for computing directional information about how changes to each component of an agentic system might improve the system’s output. To use these gradients, we propose a method called semantic gradient descent which enables us to solve GASO effectively. Our results on both BIG-Bench Hard and GSM8K show that our approach outperforms existing state-of-the-art methods for solving GASO problems. A detailed ablation study on the LIAR dataset demonstrates the parsimonious nature of our method. A full copy of our implementation is publicly available at https://github.com/HishamAlyahya/semantic_backprop
Automatic Album Sequencing Vincent Herrmann*, Dylan R. Ashley*, and Jürgen Schmidhuber Presented as a late-breaking demo at the 2024 Conference of the International Society for Music Information Retrieval [Abstract] [arXiv] [Code] [Poster] [BibTeX]
Album sequencing is a critical part of the album production process. Recently, a data-driven approach was proposed that sequences general collections of independent media by extracting the narrative essence of the items in the collections. While this approach implies an album sequencing technique, it is not widely accessible to a less technical audience, requiring advanced knowledge of machine learning techniques to use. To address this, we introduce a new user-friendly web-based tool that allows a less technical audience to upload music tracks, execute this technique in one click, and subsequently presents the result in a clean visualization to the user. To both increase the number of templates available to the user and address shortcomings of previous work, we also introduce a new direct transformer-based album sequencing method. We find that our more direct method outperforms a random baseline but does not reach the same performance as the narrative essence approach. Both methods are included in our web-based user interface, and this – alongside a full copy of our implementation – is publicly available at https://github.com/dylanashley/automatic-album-sequencing
Agent-as-a-Judge: Evaluate Agents with Agents Mingchen Zhuge, Changsheng Zhao, Dylan R. Ashley, Wenyi Wang, Dmitrii Khizbullin, Yunyang Xiong, Zechun Liu, Ernie Chang, Raghuraman Krishnamoorthi, Yuandong Tian, Yangyang Shi, Vikas Chandra, and Jürgen Schmidhuber Submitted to the Submitted to the 42nd International Conference on Machine Learning [Abstract] [arXiv] [BibTeX]
Contemporary evaluation techniques are inadequate for agentic systems. These approaches either focus exclusively on final outcomes – ignoring the step-by-step nature of agentic systems, or require excessive manual labour. To address this, we introduce the Agent-as-a-Judge framework, wherein agentic systems are used to evaluate agentic systems. This is an organic extension of the LLM-as-a-Judge framework, incorporating agentic features that enable intermediate feedback for the entire task-solving process. We apply the Agent-as-a-Judge to the task of code generation. To overcome issues with existing benchmarks and provide a proof-of-concept testbed for Agent-as-a-Judge, we present DevAI, a new benchmark of 55 realistic automated AI development tasks. It includes rich manual annotations, like a total of 365 hierarchical user requirements. We benchmark three of the popular agentic systems using Agent-as-a-Judge and find it dramatically outperforms LLM-as-a-Judge and is as reliable as our human evaluation baseline. Altogether, we believe that Agent-as-a-Judge marks a concrete step forward for modern agentic systems – by providing rich and reliable reward signals necessary for dynamic and scalable self-improvement.
Scaling Value Iteration Networks to 5000 Layers for Extreme Long-Term Planning Yuhui Wang, Qingyuan Wu, Weida Li, Dylan R. Ashley, Francesco Faccio, Chao Huang, and Jürgen Schmidhuber Submitted to the 42nd International Conference on Machine Learning and previously presented at the 17th European Workshop on Reinforcement Learning [Abstract] [arXiv] [Poster] [BibTeX]
The Value Iteration Network (VIN) is an end-to-end differentiable architecture that performs value iteration on a latent Markov Decision Process (MDP) for planning in reinforcement learning (RL). However, VINs struggle to scale to long-term and large-scale planning tasks, such as navigating a 100x100 maze – a task that typically requires thousands of planning steps to solve. We observe that this deficiency is due to two issues: the representation capacity of the latent MDP and the planning module’s depth. We address these by augmenting the latent MDP with a dynamic transition kernel, dramatically improving its representational capacity, and, to mitigate the vanishing gradient problem, introduce an "adaptive highway loss" that constructs skip connections to improve gradient flow. We evaluate our method on 2D maze navigation environments, the ViZDoom 3D navigation benchmark, and the real-world Lunar rover navigation task. We find that our new method, named Dynamic Transition VIN (DT-VIN), scales to 5000 layers and solves challenging versions of the above tasks. Altogether, we believe that DT-VIN represents a concrete step forward in performing long-term large-scale planning in RL environments.
Towards an Extremely Robust Baby Robot With Rich Interaction Ability for Advanced Machine Learning Algorithms Mohannad Alhakami*, Dylan R. Ashley*, Joel Dunham*, Yanning Dai, Francesco Faccio, Eric Feron, and Jürgen Schmidhuber Presented as a late-breaking result at the 2024 IEEE/RSJ International Conference on Intelligent Robots and Systems [Abstract] [arXiv] [Code] [Poster] [BibTeX]
Advanced machine learning algorithms require platforms that are extremely robust and equipped with rich sensory feedback to handle extensive trial-and-error learning without relying on strong inductive biases. Traditional robotic designs, while well-suited for their specific use cases, are often fragile when used with these algorithms. To address this gap – and inspired by the vision of enabling curiosity-driven baby robots – we present a novel robotic limb designed from scratch. Our design has a hybrid soft-hard structure, high redundancy with rich non-contact sensors (exclusively cameras), and easily replaceable failure points. Proof-of-concept experiments using two contemporary reinforcement learning algorithms on a physical prototype demonstrate that our design is able to succeed in a simple target-finding task even under simulated sensor failures, all with minimal human oversight during extended learning periods. We believe this design represents a concrete step toward more tailored robotic designs for achieving general-purpose, generally intelligent robots.
Mindstorms in Natural Language-Based Societies of Mind Mingchen Zhuge*, Haozhe Liu*, Francesco Faccio*, Dylan R. Ashley*, Róbert Csordás, Anand Gopalakrishnan, Abdullah Hamdi, Hasan Abed Al Kader Hammoud, Vincent Herrmann, Kazuki Irie, Louis Kirsch, Bing Li, Guohao Li, Shuming Liu, Jinjie Mai, Piotr Piękos, Aditya Ramesh, Imanol Schlag, Weimin Shi, Aleksandar Stanić, Wenyi Wang, Yuhui Wang, Mengmeng Xu, Deng-Ping Fan, Bernard Ghanem, and Jürgen Schmidhuber Published in Computational Visual Media (IF 17.3) and previously presented at the NeurIPS 2023 Workshop on Robustness of Zero/Few-Shot Learning in Foundation Models (Best-Paper Award) [Abstract] [arXiv] [Poster] [Slides] [BibTeX]
Both Minsky’s "society of mind" and Schmidhuber’s "learning to think" inspire diverse societies of large multimodal neural networks (NNs) that solve problems by interviewing each other in a "mindstorm." Recent implementations of NN-based societies of minds consist of large language models (LLMs) and other NN-based experts communicating through a natural language interface. In doing so, they overcome the limitations of single LLMs, improving multimodal zero-shot reasoning. In these natural language-based societies of mind (NLSOMs), new agents – all communicating through the same universal symbolic language – are easily added in a modular fashion. To demonstrate the power of NLSOMs, we assemble and experiment with several of them (having up to 129 members), leveraging mindstorms in them to solve some practical AI tasks: visual question answering, image captioning, text-to-image synthesis, 3D generation, egocentric retrieval, embodied AI, and general language-based task solving. We view this as a starting point towards much larger NLSOMs with billions of agents-some of which may be humans. And with this emergence of great societies of heterogeneous minds, many new research questions have suddenly become paramount to the future of artificial intelligence. What should be the social structure of an NLSOM? What would be the (dis)advantages of having a monarchical rather than a democratic structure? How can principles of NN economies be used to maximize the total reward of a reinforcement learning NLSOM? In this work, we identify, discuss, and try to answer some of these questions.

2023

The Languini Kitchen: Enabling Language Modelling Research at Different Scales of Compute Aleksandar Stanić*, Dylan R. Ashley, Oleg Serikov, Louis Kirsch, Francesco Faccio, Jürgen Schmidhuber, Thomas Hofmann, and Imanol Schlag* Preprint on arXiv [Abstract] [arXiv] [Code] [BibTeX]
The Languini Kitchen serves as both a research collective and codebase designed to empower researchers with limited computational resources to contribute meaningfully to the field of language modelling. We introduce an experimental protocol that enables model comparisons based on equivalent compute, measured in accelerator hours. The number of tokens on which a model is trained is defined by the model’s throughput and the chosen compute class. Notably, this approach avoids constraints on critical hyperparameters which affect total parameters or floating-point operations. For evaluation, we pre-process an existing large, diverse, and high-quality dataset of books that surpasses existing academic benchmarks in quality, diversity, and document length. On it, we compare methods based on their empirical scaling trends which are estimated through experiments at various levels of compute. This work also provides two baseline models: a feed-forward model derived from the GPT-2 architecture and a recurrent model in the form of a novel LSTM with ten-fold throughput. While the GPT baseline achieves better perplexity throughout all our levels of compute, our LSTM baseline exhibits a predictable and more favourable scaling law. This is due to the improved throughput and the need for fewer training tokens to achieve the same decrease in test perplexity. Extrapolating the scaling laws leads of both models results in an intersection at roughly 50,000 accelerator hours. We hope this work can serve as the foundation for meaningful and reproducible language modelling research.

2022

Upside-Down Reinforcement Learning Can Diverge in Stochastic Environments With Episodic Resets Miroslav Štrupl, Francesco Faccio, Dylan R. Ashley, Jürgen Schmidhuber, and Rupesh Kumar Srivastava Presented at the 5th Multidisciplinary Conference on Reinforcement Learning and Decision Making and the 15th European Workshop on Reinforcement Learning [Abstract] [arXiv] [Code] [Poster] [BibTeX]
Upside-Down Reinforcement Learning (UDRL) is an approach for solving RL problems that does not require value functions and uses only supervised learning, where the targets for given inputs in a dataset do not change over time. Ghosh et al. proved that Goal-Conditional Supervised Learning (GCSL) – which can be viewed as a simplified version of UDRL – optimizes a lower bound on goal-reaching performance. This raises expectations that such algorithms may enjoy guaranteed convergence to the optimal policy in arbitrary environments, similar to certain well-known traditional RL algorithms. Here we show that for a specific episodic UDRL algorithm (eUDRL, including GCSL), this is not the case, and give the causes of this limitation. To do so, we first introduce a helpful rewrite of eUDRL as a recursive policy update. This formulation helps to disprove its convergence to the optimal policy for a wide class of stochastic environments. Finally, we provide a concrete example of a very simple environment where eUDRL diverges. Since the primary aim of this paper is to present a negative result, and the best counterexamples are the simplest ones, we restrict all discussions to finite (discrete) environments, ignoring issues of function approximation and limited sample size.
Learning Relative Return Policies With Upside-Down Reinforcement Learning Dylan R. Ashley, Kai Arulkumaran, Jürgen Schmidhuber, and Rupesh Kumar Srivastava Presented at the 5th Multidisciplinary Conference on Reinforcement Learning and Decision Making [Abstract] [arXiv] [Poster] [BibTeX]
Lately, there has been a resurgence of interest in using supervised learning to solve reinforcement learning problems. Recent work in this area has largely focused on learning command-conditioned policies. We investigate the potential of one such method – upside-down reinforcement learning – to work with commands that specify a desired relationship between some scalar value and the observed return. We show that upside-down reinforcement learning can learn to carry out such commands online in a tabular bandit setting and in CartPole with non-linear function approximation. By doing so, we demonstrate the power of this family of methods and open the way for their practical use under more complicated command structures.
All You Need Is Supervised Learning: From Imitation Learning to Meta-RL With Upside Down RL Kai Arulkumaran, Dylan R. Ashley, Jürgen Schmidhuber, and Rupesh Kumar Srivastava Presented at the 5th Multidisciplinary Conference on Reinforcement Learning and Decision Making [Abstract] [arXiv] [Code] [Poster] [BibTeX]
Upside down reinforcement learning (UDRL) flips the conventional use of the return in the objective function in RL upside down, by taking returns as input and predicting actions. UDRL is based purely on supervised learning, and bypasses some prominent issues in RL: bootstrapping, off-policy corrections, and discount factors. While previous work with UDRL demonstrated it in a traditional online RL setting, here we show that this single algorithm can also work in the imitation learning and offline RL settings, be extended to the goal-conditioned RL setting, and even the meta-RL setting. With a general agent architecture, a single UDRL agent can learn across all paradigms.
Reward-Weighted Regression Converges to a Global Optimum Miroslav Štrupl, Francesco Faccio, Dylan R. Ashley, Rupesh Kumar Srivastava, and Jürgen Schmidhuber Published in the Proceedings of the Thirty-Sixth AAAI Conference on Artificial Intelligence [Abstract] [arXiv] [Code] [Poster] [BibTeX]
Reward-Weighted Regression (RWR) belongs to a family of widely known iterative Reinforcement Learning algorithms based on the Expectation-Maximization framework. In this family, learning at each iteration consists of sampling a batch of trajectories using the current policy and fitting a new policy to maximize a return-weighted log-likelihood of actions. Although RWR is known to yield monotonic improvement of the policy under certain circumstances, whether and under which conditions RWR converges to the optimal policy have remained open questions. In this paper, we provide for the first time a proof that RWR converges to a global optimum when no function approximation is used, in a general compact setting. Furthermore, for the simpler case with finite state and action spaces we prove R-linear convergence of the state-value function to the optimum.

2021

Automatic Embedding of Stories Into Collections of Independent Media Dylan R. Ashley, Vincent Herrmann, Zachary Friggstad, Kory W. Mathewson, and Jürgen Schmidhuber Preprint on arXiv [Abstract] [arXiv] [Code] [BibTeX]
We look at how machine learning techniques that derive properties of items in a collection of independent media can be used to automatically embed stories into such collections. To do so, we use models that extract the tempo of songs to make a music playlist follow a narrative arc. Our work specifies an open-source tool that uses pre-trained neural network models to extract the global tempo of a set of raw audio files and applies these measures to create a narrative-following playlist. This tool is available at https://github.com/dylanashley/playlist-story-builder/releases/tag/v1.0.0
Does the Adam Optimizer Exacerbate Catastrophic Forgetting? Dylan R. Ashley, Sina Ghiassian, and Richard S. Sutton Preprint on arXiv [Abstract] [arXiv] [Code] [BibTeX]
Catastrophic forgetting remains a severe hindrance to the broad application of artificial neural networks (ANNs), however, it continues to be a poorly understood phenomenon. Despite the extensive amount of work on catastrophic forgetting, we argue that it is still unclear how exactly the phenomenon should be quantified, and, moreover, to what degree all of the choices we make when designing learning systems affect the amount of catastrophic forgetting. We use various testbeds from the reinforcement learning and supervised learning literature to (1) provide evidence that the choice of which modern gradient-based optimization algorithm is used to train an ANN has a significant impact on the amount of catastrophic forgetting and show that-surprisingly-in many instances classical algorithms such as vanilla SGD experience less catastrophic forgetting than the more modern algorithms such as Adam. We empirically compare four different existing metrics for quantifying catastrophic forgetting and (2) show that the degree to which the learning systems experience catastrophic forgetting is sufficiently sensitive to the metric used that a change from one principled metric to another is enough to change the conclusions of a study dramatically. Our results suggest that a much more rigorous experimental methodology is required when looking at catastrophic forgetting. Based on our results, we recommend inter-task forgetting in supervised learning must be measured with both retention and relearning metrics concurrently, and intra-task forgetting in reinforcement learning must-at the very least-be measured with pairwise interference.
Back to Square One: Superhuman Performance in Chutes and Ladders Through Deep Neural Networks and Tree Search Dylan R. Ashley*, Anssi Kanervisto*, and Brendan Bennett* Published in the Proceedings of the 2021 Conference of the ACH Special Interest Group on Harry Q. Bovik [Abstract] [arXiv] [PDF] [Code] [BibTeX]
We present AlphaChute: a state-of-the-art algorithm that achieves superhuman performance in the ancient game of Chutes and Ladders. We prove that our algorithm converges to the Nash equilibrium in constant time, and therefore is–to the best of our knowledge–the first such formal solution to this game. Surprisingly, despite all this, our implementation of AlphaChute remains relatively straightforward due to domain-specific adaptations. We provide the source code for AlphaChute here in our Appendix.

2020

Understanding Forgetting in Artificial Neural Networks Dylan R. Ashley Master’s thesis (University of Alberta) [Abstract] [PDF] [Code] [Slides] [BibTeX]
This thesis is offered as a step forward in our understanding of forgetting in artificial neural networks. ANNs are a learning system loosely based on our understanding of the brain and are responsible for recent breakthroughs in artificial intelligence. However, they have been reported to be particularly susceptible to forgetting. Specifically, existing research suggests that ANNs may exhibit unexpectedly high rates of retroactive inhibition when compared with results from psychology studies measuring forgetting in people. If this phenomenon, dubbed catastrophic forgetting, exists, then explicit methods intended to reduce it may increase the scope of problems ANNs can be successfully applied to.

In this thesis, we contribute to the field by answering five questions related to forgetting in ANNs: How does forgetting in psychology relate to ideas in machine learning? What is catastrophic forgetting? Does it exist in contemporary systems, and, if so, is it severe? How can we measure a system’s susceptibility to it? Are the current optimization algorithms we use to train ANNs adding to its severity?

This work answers each of the five questions sequentially. We begin by answering the first and second of the five questions by providing an analytical survey that looks at the concept of forgetting as it appears in psychology and connects it to various ideas in machine learning such as generalization, transfer learning, experience replay, and eligibility traces.

We subsequently confirm the existence and severity of catastrophic forgetting in some contemporary machine learning systems by showing that it appears when a simple, modern ANN (multi-layered fully-connected network with rectified linear unit activation) is trained using a conventional algorithm (Stochastic Gradient Descent through backpropagation with normal random initialization) incrementally on a well-known multi-class classification setting (MNIST). We demonstrate that the phenomenon is a more subtle problem than a simple reversal of learning. We accomplish this by noting that both total learning time and relearning time are reduced when the multi-class classification problem is split into multiple phases containing samples from disjoint subsets of the classes.

We then move on to looking at how we can measure the degree to which ANN-based learning systems suffer from catastrophic forgetting by constructing a principled testbed out of the previous multi-task supervised learning problem and two well-studied reinforcement learning problems (Mountain Car and Acrobot). We apply this testbed to answer the final of the five questions by looking at how several modern gradient-based optimization algorithms used to train ANNs (SGD, SGD with Momentum, RMSProp, and Adam) affect the amount of catastrophic forgetting that occurs during training. While doing so, we are able to confirm and expand previous hypotheses surrounding the complexities of measuring catastrophic forgetting. We find that different algorithms, even when applied to the same ANN, result in significantly different amounts of catastrophic forgetting under a variety of different metrics.

We believe that our answers to the five questions constitute a step forward in our understanding of forgetting as it appears in ANNs. Such an understanding is essential for realizing the full potential that ANNs offer to the study of artificial intelligence.
Universal Successor Features for Transfer Reinforcement Learning Chen Ma, Dylan R. Ashley, Junfeng Wen, and Yoshua Bengio Preprint on arXiv [Abstract] [arXiv] [BibTeX]
Transfer in Reinforcement Learning (RL) refers to the idea of applying knowledge gained from previous tasks to solving related tasks. Learning a universal value function (Schaul et al., 2015), which generalizes over goals and states, has previously been shown to be useful for transfer. However, successor features are believed to be more suitable than values for transfer (Dayan, 1993; Barreto et al., 2017), even though they cannot directly generalize to new goals. In this paper, we propose (1) Universal Successor Features (USFs) to capture the underlying dynamics of the environment while allowing generalization to unseen goals and (2) a flexible end-to-end model of USFs that can be trained by interacting with the environment. We show that learning USFs is compatible with any RL algorithm that learns state values using a temporal difference method. Our experiments in a simple gridworld and with two MuJoCo environments show that USFs can greatly accelerate training when learning multiple tasks and can effectively transfer knowledge to new tasks.

2019

Learning to Select Mates in Evolving Non-playable Characters Dylan R. Ashley*, Valliappa Chockalingam*, Braedy Kuzma*, and Vadim Bulitko Published in the Proceedings of the 2019 IEEE Conference on Games (Oral Presentation) [Abstract] [PDF] [Slides] [BibTeX]
Procedural content generation (PCG) is an active area of research with the potential to significantly reduce game development costs as well as create game experiences meaningfully personalized to each player. Evolutionary methods are a promising method of generating content procedurally. In particular asynchronous evolution of AI agents in an artificial life (A-life) setting is notably similar to the online evolution of non-playable characters in a video game. In this paper, we are concerned with improving the efficiency of evolution via more effective mate selection. In the spirit of PCG, we genetically encode each agent’s preference for mating partners and thereby allowing the mate-selection process to evolve. We evaluate this approach in a simple predator-prey A-life environment and demonstrate that the ability to evolve a per-agent mate-selection preference function indeed significantly increases the extinction time of the population. Additionally, an inspection of the evolved preference function parameters shows that agents evolve to favor mates who have survival traits.
Learning to Select Mates in Artificial Life Dylan R. Ashley*, Valliappa Chockalingam*, Braedy Kuzma*, and Vadim Bulitko Published in the Proceedings of the Genetic and Evolutionary Computation Conference Companion [Abstract] [PDF] [Code] [Poster] [BibTeX]
Artificial life (A-life) simulations present a natural way to study interesting phenomena emerging in a population of evolving agents. In this paper, we investigate whether allowing A-life agents to select mates can extend the lifetime of a population. In our approach, each agent evaluates potential mates via a preference function. The role of this function is to map information about an agent and its candidate mate to a scalar preference for deciding whether or not to form an offspring. We encode the parameters of the preference function genetically within each agent, thus allowing such preferences to be agent-specific as well as evolving over time. We evaluate this approach in a simple predator-prey A-life environment and demonstrate that the ability to evolve a per-agent mate-selection preference function indeed significantly increases the extinction time of the population. Additionally an inspection of the evolved preference function parameters shows that agents evolve to favor mates who have survival traits.

2018

Comparing Direct and Indirect Temporal-Difference Methods for Estimating the Variance of the Return Craig Sherstan, Dylan R. Ashley*, Brendan Bennett*, Kenny Young, Adam White, Martha White, and Richard S. Sutton Published in the Proceedings of the 34th Conference on Uncertainty in Artificial Intelligence (Oral Presentation) [Abstract] [PDF] [SUP] [Code] [Poster] [Slides] [BibTeX]
Temporal-difference (TD) learning methods are widely used in reinforcement learning to estimate the expected return for each state, without a model, because of their significant advantages in computational and data efficiency. For many applications involving risk mitigation, it would also be useful to estimate the variance of the return by TD methods. In this paper, we describe a way of doing this that is substantially simpler than those proposed by Tamar, Di Castro, and Mannor in 2012, or those proposed by White and White in 2016. We show that two TD learners operating in series can learn expectation and variance estimates. The trick is to use the square of the TD error of the expectation learner as the reward of the variance learner, and the square of the expectation learner’s discount rate as the discount rate of the variance learner. With these two modifications, the variance learning problem becomes a conventional TD learning problem to which standard theoretical results can be applied. Our formal results are limited to the table lookup case, for which our method is still novel, but the extension to function approximation is immediate, and we provide some empirical results for the linear function approximation case. Our experimental results show that our direct method behaves just as well as a comparable indirect method, but is generally more robust.
The Alberta Workloads for the SPEC CPU 2017 Benchmark Suite José Nelson Amaral, Edson Borin, Dylan R. Ashley, Caian Benedicto, Elliot Colp, Joao Henrique Stange Hoffmam, Marcus Karpoff, Erick Ochoa, Morgan Redshaw, and Raphael Ernani Rodrigues Published in the Proceedings of the 2018 IEEE International Symposium on Performance Analysis of Systems and Software [Abstract] [PDF] [Code] [BibTeX]
A proper evaluation of techniques that require multiple training and evaluation executions of a benchmark, such as Feedback-Directed Optimization (FDO), requires multiple workloads that can be used to characterize variations on the behaviour of a program based on the workload. This paper aims to improve the performance evaluation of computer systems - including compilers, computer architecture simulation, and operating-system prototypes - that rely on the industrystandard SPEC CPU benchmark suite. A main concern with the use of this suite in research is that it is distributed with a very small number of workloads. This paper describes the process to create additional workloads for this suite and offers useful insights in many of its benchmarks. The set of additional workloads created, named the Alberta Workloads for the SPEC CPU 2017 Benchmark Suite1 is made freely available with the goal of providing additional data points for the exploration of learning in computing systems. These workloads should also contribute to ameliorate the hidden learning problem where a researcher sets parameters to a system during development based on a set of benchmarks and then evaluates the system using the very same set of benchmarks with the very same workloads.
Directly Estimating the Variance of the λ-Return Using Temporal-Difference Methods Craig Sherstan, Brendan Bennett, Kenny Young, Dylan R. Ashley, Adam White, Martha White, and Richard S. Sutton Preprint on arXiv [Abstract] [arXiv] [BibTeX]
This paper investigates estimating the variance of a temporal-difference learning agent’s update target. Most reinforcement learning methods use an estimate of the value function, which captures how good it is for the agent to be in a particular state and is mathematically expressed as the expected sum of discounted future rewards (called the return). These values can be straightforwardly estimated by averaging batches of returns using Monte Carlo methods. However, if we wish to update the agent’s value estimates during learning–before terminal outcomes are observed–we must use a different estimation target called the λ-return, which truncates the return with the agent’s own estimate of the value function. Temporal difference learning methods estimate the expected λ-return for each state, allowing these methods to update online and incrementally, and in most cases achieve better generalization error and faster learning than Monte Carlo methods. Naturally one could attempt to estimate higher-order moments of the λ-return. This paper is about estimating the variance of the λ-return. Prior work has shown that given estimates of the variance of the λ-return, learning systems can be constructed to (1) mitigate risk in action selection, and (2) automatically adapt the parameters of the learning process itself to improve performance. Unfortunately, existing methods for estimating the variance of the λ-return are complex and not well understood empirically. We contribute a method for estimating the variance of the λ-return directly using policy evaluation methods from reinforcement learning. Our approach is significantly simpler than prior methods that independently estimate the second moment of the λ-return. Empirically our new approach behaves at least as well as existing approaches, but is generally more robust.