##### The Scattering Compositional Learner: Discovering Objects, Attributes, Relationships in Analogical Reasoning
###### Yuhuai Wu, Honghua Dong, Roger Grosse, Jimmy Ba

In this work, we focus on an analogical reasoning task that contains rich compositional structures, Raven's Progressive Matrices (RPM). To discover compositional structures of the data, we propose the Scattering Compositional Learner (SCL), an architecture that composes neural networks in a sequence. Our SCL achieves state-of-the-art performance on two RPM datasets, with a 48.7% relative improvement on Balanced-RAVEN and 26.4% on PGM over the previous state-of-the-art. We additionally show that our model discovers compositional representations of objects' attributes (e.g., shape color, size), and their relationships (e.g., progression, union). We also find that the compositional representation makes the SCL significantly more robust to test-time domain shifts and greatly improves zero-shot generalization to previously unseen analogies.

##### NVAE: A Deep Hierarchical Variational Autoencoder
###### Arash Vahdat, Jan Kautz

Normalizing flows, autoregressive models, variational autoencoders (VAEs), and deep energy-based models are among competing likelihood-based frameworks for deep generative learning. Among them, VAEs have the advantage of fast and tractable sampling and easy-to-access encoding networks. However, they are currently outperformed by other models such as normalizing flows and autoregressive models. While the majority of the research in VAEs is focused on the statistical challenges, we explore the orthogonal direction of carefully designing neural architectures for hierarchical VAEs. We propose Nouveau VAE (NVAE), a deep hierarchical VAE built for image generation using depth-wise separable convolutions and batch normalization. NVAE is equipped with a residual parameterization of Normal distributions and its training is stabilized by spectral regularization. We show that NVAE achieves state-of-the-art results among non-autoregressive likelihood-based models on the MNIST, CIFAR-10, and CelebA HQ datasets and it provides a strong baseline on FFHQ. For example, on CIFAR-10, NVAE pushes the state-of-the-art from 2.98 to 2.91 bits per dimension, and it produces high-quality images on CelebA HQ as shown in Fig. 1. To the best of our knowledge, NVAE is the first successful VAE applied to natural images as large as 256$\times$256 pixels.

##### Pitfalls to Avoid when Interpreting Machine Learning Models
###### Christoph Molnar, Gunnar König, Julia Herbinger, Timo Freiesleben, Susanne Dandl, Christian A. Scholbeck, Giuseppe Casalicchio, Moritz Grosse-Wentrup, Bernd Bischl

Modern requirements for machine learning (ML) models include both high predictive performance and model interpretability. A growing number of techniques provide model interpretations, but can lead to wrong conclusions if applied incorrectly. We illustrate pitfalls of ML model interpretation such as bad model generalization, dependent features, feature interactions or unjustified causal interpretations. Our paper addresses ML practitioners by raising awareness of pitfalls and pointing out solutions for correct model interpretation, as well as ML researchers by discussing open issues for further research.

##### Decolonial AI: Decolonial Theory as Sociotechnical Foresight in Artificial Intelligence
###### Shakir Mohamed, Marie-Therese Png, William Isaac

This paper explores the important role of critical science, and in particular of post-colonial and decolonial theories, in understanding and shaping the ongoing advances in artificial intelligence. Artificial Intelligence (AI) is viewed as amongst the technological advances that will reshape modern societies and their relations. Whilst the design and deployment of systems that continually adapt holds the promise of far-reaching positive change, they simultaneously pose significant risks, especially to already vulnerable peoples. Values and power are central to this discussion. Decolonial theories use historical hindsight to explain patterns of power that shape our intellectual, political, economic, and social world. By embedding a decolonial critical approach within its technical practice, AI communities can develop foresight and tactics that can better align research and technology development with established ethical principles, centring vulnerable peoples who continue to bear the brunt of negative impacts of innovation and scientific progress. We highlight problematic applications that are instances of coloniality, and using a decolonial lens, submit three tactics that can form a decolonial field of artificial intelligence: creating a critical technical practice of AI, seeking reverse tutelage and reverse pedagogies, and the renewal of affective and political communities. The years ahead will usher in a wave of new scientific breakthroughs and technologies driven by AI research, making it incumbent upon AI communities to strengthen the social contract through ethical foresight and the multiplicity of intellectual perspectives available to us; ultimately supporting future technologies that enable greater well-being, with the goal of beneficence and justice for all.

##### Strong Generalization and Efficiency in Neural Programs
###### Yujia Li, Felix Gimeno, Pushmeet Kohli, Oriol Vinyals

We study the problem of learning efficient algorithms that strongly generalize in the framework of neural program induction. By carefully designing the input / output interfaces of the neural model and through imitation, we are able to learn models that produce correct results for arbitrary input sizes, achieving strong generalization. Moreover, by using reinforcement learning, we optimize for program efficiency metrics, and discover new algorithms that surpass the teacher used in imitation. With this, our approach can learn to outperform custom-written solutions for a variety of problems, as we tested it on sorting, searching in ordered lists and the NP-complete 0/1 knapsack problem, which sets a notable milestone in the field of Neural Program Induction. As highlights, our learned model can perform sorting perfectly on any input data size we tested on, with $O(n log n)$ complexity, whilst outperforming hand-coded algorithms, including quick sort, in number of operations even for list sizes far beyond those seen during training.

##### Predicting Afrobeats Hit Songs Using Spotify Data

This study approached the Hit Song Science problem with the aim of predicting which songs in the Afrobeats genre will become popular among Spotify listeners. A dataset of 2063 songs was generated through the Spotify Web API, with the provided audio features. Random Forest and Gradient Boosting algorithms proved to be successful with approximately F1 scores of 86%.

##### Deep Reinforcement Learning and its Neuroscientific Implications
###### Matthew Botvinick, Jane X. Wang, Will Dabney, Kevin J. Miller, Zeb Kurth-Nelson

The emergence of powerful artificial intelligence is defining new research directions in neuroscience. To date, this research has focused largely on deep neural networks trained using supervised learning, in tasks such as image classification. However, there is another area of recent AI work which has so far received less attention from neuroscientists, but which may have profound neuroscientific implications: deep reinforcement learning. Deep RL offers a comprehensive framework for studying the interplay among learning, representation and decision-making, offering to the brain sciences a new set of research tools and a wide range of novel hypotheses. In the present review, we provide a high-level introduction to deep RL, discuss some of its initial applications to neuroscience, and survey its wider implications for research on brain and behavior, concluding with a list of opportunities for next-stage research.

##### SurVAE Flows: Surjections to Bridge the Gap between VAEs and Flows
###### Didrik Nielsen, Priyank Jaini, Emiel Hoogeboom, Ole Winther, Max Welling

Normalizing flows and variational autoencoders are powerful generative models that can represent complicated density functions. However, they both impose constraints on the models: Normalizing flows use bijective transformations to model densities whereas VAEs learn stochastic transformations that are non-invertible and thus typically do not provide tractable estimates of the marginal likelihood. In this paper, we introduce SurVAE Flows: A modular framework of composable transformations that encompasses VAEs and normalizing flows. SurVAE Flows bridge the gap between normalizing flows and VAEs with surjective transformations, wherein the transformations are deterministic in one direction -- thereby allowing exact likelihood computation, and stochastic in the reverse direction -- hence providing a lower bound on the corresponding likelihood. We show that several recently proposed methods, including dequantization and augmented normalizing flows, can be expressed as SurVAE Flows. Finally, we introduce common operations such as the max value, the absolute value, sorting and stochastic permutation as composable layers in SurVAE Flows.

##### Meta-Learning through Hebbian Plasticity in Random Networks
###### Elias Najarro, Sebastian Risi

Lifelong learning and adaptability are two defining aspects of biological agents. Modern reinforcement learning (RL) approaches have shown significant progress in solving complex tasks, however once training is concluded, the found solutions are typically static and incapable of adapting to new information or perturbations. While it is still not completely understood how biological brains learn and adapt so efficiently from experience, it is believed that synaptic plasticity plays a prominent role in this process. Inspired by this biological mechanism, we propose a search method that, instead of optimizing the weight parameters of neural networks directly, only searches for synapse-specific Hebbian learning rules that allow the network to continuously self-organize its weights during the lifetime of the agent. We demonstrate our approach on several reinforcement learning tasks with different sensory modalities and more than 450K trainable plasticity parameters. We find that starting from completely random weights, the discovered Hebbian rules enable an agent to navigate a dynamical 2D-pixel environment; likewise they allow a simulated 3D quadrupedal robot to learn how to walk while adapting to different morphological damage in the absence of any explicit reward or error signal.

##### Meta-Learning Symmetries by Reparameterization
###### Allan Zhou, Tom Knowles, Chelsea Finn

Many successful deep learning architectures are equivariant to certain transformations in order to conserve parameters and improve generalization: most famously, convolution layers are equivariant to shifts of the input. This approach only works when practitioners know a-priori symmetries of the task and can manually construct an architecture with the corresponding equivariances. Our goal is a general approach for learning equivariances from data, without needing prior knowledge of a task's symmetries or custom task-specific architectures. We present a method for learning and encoding equivariances into networks by learning corresponding parameter sharing patterns from data. Our method can provably encode equivariance-inducing parameter sharing for any finite group of symmetry transformations, and we find experimentally that it can automatically learn a variety of equivariances from symmetries in data. We provide our experiment code and pre-trained models at https://github.com/AllanYangZhou/metalearning-symmetries.

##### Adaptive Risk Minimization: A Meta-Learning Approach for Tackling Group Shift
###### Marvin Zhang, Henrik Marklund, Abhishek Gupta, Sergey Levine, Chelsea Finn

A fundamental assumption of most machine learning algorithms is that the training and test data are drawn from the same underlying distribution. However, this assumption is violated in almost all practical applications: machine learning systems are regularly tested on data that are structurally different from the training set, either due to temporal correlations, particular end users, or other factors. In this work, we consider the setting where test examples are not drawn from the training distribution. Prior work has approached this problem by attempting to be robust to all possible test time distributions, which may degrade average performance, or by "peeking" at the test examples during training, which is not always feasible. In contrast, we propose to learn models that are adaptable, such that they can adapt to distribution shift at test time using a batch of unlabeled test data points. We acquire such models by learning to adapt to training batches sampled according to different sub-distributions, which simulate structural distribution shifts that may occur at test time. We introduce the problem of adaptive risk minimization (ARM), a formalization of this setting that lends itself to meta-learning methods. Compared to a variety of methods under the paradigms of empirical risk minimization and robust optimization, our approach provides substantial empirical gains on image classification problems in the presence of distribution shift.

##### Decentralized Reinforcement Learning: Global Decision-Making via Local Economic Transactions
###### Michael Chang, Sidhant Kaushik, S. Matthew Weinberg, Thomas L. Griffiths, Sergey Levine

This paper seeks to establish a framework for directing a society of simple, specialized, self-interested agents to solve what traditionally are posed as monolithic single-agent sequential decision problems. What makes it challenging to use a decentralized approach to collectively optimize a central objective is the difficulty in characterizing the equilibrium strategy profile of non-cooperative games. To overcome this challenge, we design a mechanism for defining the learning environment of each agent for which we know that the optimal solution for the global objective coincides with a Nash equilibrium strategy profile of the agents optimizing their own local objectives. The society functions as an economy of agents that learn the credit assignment process itself by buying and selling to each other the right to operate on the environment state. We derive a class of decentralized reinforcement learning algorithms that are broadly applicable not only to standard reinforcement learning but also for selecting options in semi-MDPs and dynamically composing computation graphs. Lastly, we demonstrate the potential advantages of a society's inherent modular structure for more efficient transfer learning.

##### Shaping dynamics with multiple populations in low-rank recurrent networks
###### Manuel Beiran, Alexis Dubreuil, Adrian Valente, Francesca Mastrogiuseppe, Srdjan Ostojic

An emerging paradigm proposes that neural computations can be understood at the level of dynamical systems that govern low-dimensional trajectories of collective neural activity. How the connectivity structure of a network determines the emergent dynamical system however remains to be clarified. Here we consider a novel class of models, Gaussian-mixture low-rank recurrent networks, in which the rank of the connectivity matrix and the number of statistically-defined populations are independent hyper-parameters. We show that the resulting collective dynamics form a dynamical system, where the rank sets the dimensionality and the population structure shapes the dynamics. In particular, the collective dynamics can be described in terms of a simplified effective circuit of interacting latent variables. While having a single, global population strongly restricts the possible dynamics, we demonstrate that if the number of populations is large enough, a rank $R$ network can approximate any $R$-dimensional dynamical system.

##### A (Slightly) Improved Approximation Algorithm for Metric TSP
###### Anna R. Karlin, Nathan Klein, Shayan Oveis Gharan

For some $\epsilon > 10^{-36}$ we give a $3/2-\epsilon$ approximation algorithm for metric TSP.

##### Facts as Experts: Adaptable and Interpretable Neural Memory over Symbolic Knowledge
###### Pat Verga, Haitian Sun, Livio Baldini Soares, William W. Cohen

Massive language models are the core of modern NLP modeling and have been shown to encode impressive amounts of commonsense and factual information. However, that knowledge exists only within the latent parameters of the model, inaccessible to inspection and interpretation, and even worse, factual information memorized from the training corpora is likely to become stale as the world changes. Knowledge stored as parameters will also inevitably exhibit all of the biases inherent in the source materials. To address these problems, we develop a neural language model that includes an explicit interface between symbolically interpretable factual information and subsymbolic neural knowledge. We show that this model dramatically improves performance on two knowledge-intensive question-answering tasks. More interestingly, the model can be updated without re-training by manipulating its symbolic representations. In particular this model allows us to add new facts and overwrite existing ones in ways that are not possible for earlier models.

##### Line Walking and Balancing for Legged Robots with Point Feet
###### Carlos Gonzalez, Victor Barasuol, Marco Frigerio, Roy Featherstone, Darwin G. Caldwell, Claudio Semini

The ability of legged systems to traverse highly-constrained environments depends by and large on the performance of their motion and balance controllers. This paper presents a controller that excels in a scenario that most state-of-the-art balance controllers have not yet addressed: line walking, or walking on nearly null support regions. Our approach uses a low-dimensional virtual model (2-DoF) to generate balancing actions through a previously derived four-term balance controller and transforms them to the robot through a derived kinematic mapping. The capabilities of this controller are tested in simulation, where we show the 90kg quadruped robot HyQ crossing a bridge of only 6 cm width (compared to its 4 cm diameter foot sphere), by balancing on two feet at any time while moving along a line. Additional simulations are carried to test the performance of the controller and the effect of external disturbances. The same controller is then used on the real robot to present for the first time a legged robot balancing on a contact line of nearly null support area.

##### Sequential Transfer in Reinforcement Learning with a Generative Model
###### Andrea Tirinzoni, Riccardo Poiani, Marcello Restelli

We are interested in how to design reinforcement learning agents that provably reduce the sample complexity for learning new tasks by transferring knowledge from previously-solved ones. The availability of solutions to related problems poses a fundamental trade-off: whether to seek policies that are expected to achieve high (yet sub-optimal) performance in the new task immediately or whether to seek information to quickly identify an optimal solution, potentially at the cost of poor initial behavior. In this work, we focus on the second objective when the agent has access to a generative model of state-action pairs. First, given a set of solved tasks containing an approximation of the target one, we design an algorithm that quickly identifies an accurate solution by seeking the state-action pairs that are most informative for this purpose. We derive PAC bounds on its sample complexity which clearly demonstrate the benefits of using this kind of prior knowledge. Then, we show how to learn these approximate tasks sequentially by reducing our transfer setting to a hidden Markov model and employing spectral methods to recover its parameters. Finally, we empirically verify our theoretical findings in simple simulated domains.

##### On Linear Identifiability of Learned Representations
###### Geoffrey Roeder, Luke Metz, Diederik P. Kingma

Identifiability is a desirable property of a statistical model: it implies that the true model parameters may be estimated to any desired precision, given sufficient computational resources and data. We study identifiability in the context of representation learning: discovering nonlinear data representations that are optimal with respect to some downstream task. When parameterized as deep neural networks, such representation functions typically lack identifiability in parameter space, because they are overparameterized by design. In this paper, building on recent advances in nonlinear ICA, we aim to rehabilitate identifiability by showing that a large family of discriminative models are in fact identifiable in function space, up to a linear indeterminacy. Many models for representation learning in a wide variety of domains have been identifiable in this sense, including text, images and audio, state-of-the-art at time of publication. We derive sufficient conditions for linear identifiability and provide empirical support for the result on both simulated and real-world data.

##### Object Goal Navigation using Goal-Oriented Semantic Exploration
###### Devendra Singh Chaplot, Dhiraj Gandhi, Abhinav Gupta, Ruslan Salakhutdinov

This work studies the problem of object goal navigation which involves navigating to an instance of the given object category in unseen environments. End-to-end learning-based navigation methods struggle at this task as they are ineffective at exploration and long-term planning. We propose a modular system called, `Goal-Oriented Semantic Exploration' which builds an episodic semantic map and uses it to explore the environment efficiently based on the goal object category. Empirical results in visually realistic simulation environments show that the proposed model outperforms a wide range of baselines including end-to-end learning-based methods as well as modular map-based methods and led to the winning entry of the CVPR-2020 Habitat ObjectNav Challenge. Ablation analysis indicates that the proposed model learns semantic priors of the relative arrangement of objects in a scene, and uses them to explore efficiently. Domain-agnostic module design allow us to transfer our model to a mobile robot platform and achieve similar performance for object goal navigation in the real-world.

##### Debiased Contrastive Learning
###### Ching-Yao Chuang, Joshua Robinson, Lin Yen-Chen, Antonio Torralba, Stefanie Jegelka

A prominent technique for self-supervised representation learning has been to contrast semantically similar and dissimilar pairs of samples. Without access to labels, dissimilar (negative) points are typically taken to be randomly sampled datapoints, implicitly accepting that these points may, in reality, actually have the same label. Perhaps unsurprisingly, we observe that sampling negative examples from truly different labels improves performance, in a synthetic setting where labels are available. Motivated by this observation, we develop a debiased contrastive objective that corrects for the sampling of same-label datapoints, even without knowledge of the true labels. Empirically, the proposed objective consistently outperforms the state-of-the-art for representation learning in vision, language, and reinforcement learning benchmarks. Theoretically, we establish generalization bounds for the downstream classification task.

##### Causal Discovery in Physical Systems from Videos
###### Yunzhu Li, Antonio Torralba, Animashree Anandkumar, Dieter Fox, Animesh Garg

Causal discovery is at the core of human cognition. It enables us to reason about the environment and make counterfactual predictions about unseen scenarios, that can vastly differ from our previous experiences. We consider the task of causal discovery from videos in an end-to-end fashion without supervision on the ground-truth graph structure. In particular, our goal is to discover the structural dependencies among environmental and object variables: inferring the type and strength of interactions that have a causal effect on the behavior of the dynamical system. Our model consists of (a) a perception module that extracts a semantically meaningful and temporally consistent keypoint representation from images, (b) an inference module for determining the graph distribution induced by the detected keypoints, and (c) a dynamics module that can predict the future by conditioning on the inferred graph. We assume access to different configurations and environmental conditions, i.e., data from unknown interventions on the underlying system; thus, we can hope to discover the correct underlying causal graph without explicit interventions. We evaluate our method in a planar multi-body interaction environment and scenarios involving fabrics of different shapes like shirts and pants. Experiments demonstrate that our model can correctly identify the interactions from a short sequence of images and make long-term future predictions. The causal structure assumed by the model also allows it to make counterfactual predictions and extrapolate to systems of unseen interaction graphs or graphs of various sizes.

##### GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding
###### Dmitry Lepikhin, HyoukJoong Lee, Yuanzhong Xu, Dehao Chen, Orhan Firat, Yanping Huang, Maxim Krikun, Noam Shazeer, Zhifeng Chen

Neural network scaling has been critical for improving the model quality in many real-world machine learning applications with vast amounts of training data and compute. Although this trend of scaling is affirmed to be a sure-fire approach for better model quality, there are challenges on the path such as the computation cost, ease of programming, and efficient implementation on parallel devices. GShard is a module composed of a set of lightweight annotation APIs and an extension to the XLA compiler. It provides an elegant way to express a wide range of parallel computation patterns with minimal changes to the existing model code. GShard enabled us to scale up multilingual neural machine translation Transformer model with Sparsely-Gated Mixture-of-Experts beyond 600 billion parameters using automatic sharding. We demonstrate that such a giant model can efficiently be trained on 2048 TPU v3 accelerators in 4 days to achieve far superior quality for translation from 100 languages to English compared to the prior art.

##### Involutive MCMC: a Unifying Framework
###### Kirill Neklyudov, Max Welling, Evgenii Egorov, Dmitry Vetrov

Markov Chain Monte Carlo (MCMC) is a computational approach to fundamental problems such as inference, integration, optimization, and simulation. The field has developed a broad spectrum of algorithms, varying in the way they are motivated, the way they are applied and how efficiently they sample. Despite all the differences, many of them share the same core principle, which we unify as the Involutive MCMC (iMCMC) framework. Building upon this, we describe a wide range of MCMC algorithms in terms of iMCMC, and formulate a number of "tricks" which one can use as design principles for developing new MCMC algorithms. Thus, iMCMC provides a unified view of many known MCMC algorithms, which facilitates the derivation of powerful extensions. We demonstrate the latter with two examples where we transform known reversible MCMC algorithms into more efficient irreversible ones.

##### FathomNet: An underwater image training database for ocean exploration and discovery
###### Océane Boulais, Ben Woodward, Brian Schlining, Lonny Lundsten, Kevin Barnard, Katy Croff Bell, Kakani Katija

Thousands of hours of marine video data are collected annually from remotely operated vehicles (ROVs) and other underwater assets. However, current manual methods of analysis impede the full utilization of collected data for real time algorithms for ROV and large biodiversity analyses. FathomNet is a novel baseline image training set, optimized to accelerate development of modern, intelligent, and automated analysis of underwater imagery. Our seed data set consists of an expertly annotated and continuously maintained database with more than 26,000 hours of videotape, 6.8 million annotations, and 4,349 terms in the knowledge base. FathomNet leverages this data set by providing imagery, localizations, and class labels of underwater concepts in order to enable machine learning algorithm development. To date, there are more than 80,000 images and 106,000 localizations for 233 different classes, including midwater and benthic organisms. Our experiments consisted of training various deep learning algorithms with approaches to address weakly supervised localization, image labeling, object detection and classification which prove to be promising. While we find quality results on prediction for this new dataset, our results indicate that we are ultimately in need of a larger data set for ocean exploration.

##### Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention
###### Angelos Katharopoulos, Apoorv Vyas, Nikolaos Pappas, François Fleuret

Transformers achieve remarkable performance in several tasks but due to their quadratic complexity, with respect to the input's length, they are prohibitively slow for very long sequences. To address this limitation, we express the self-attention as a linear dot-product of kernel feature maps and make use of the associativity property of matrix products to reduce the complexity from $\mathcal{O}\left(N^2\right)$ to $\mathcal{O}\left(N\right)$, where $N$ is the sequence length. We show that this formulation permits an iterative implementation that dramatically accelerates autoregressive transformers and reveals their relationship to recurrent neural networks. Our linear transformers achieve similar performance to vanilla transformers and they are up to 4000x faster on autoregressive prediction of very long sequences.