December 05, 2019
Normalizing Flows for Probabilistic Modeling and Inference
George Papamakarios, Eric Nalisnick, Danilo Jimenez Rezende, Shakir Mohamed, Balaji Lakshminarayanan
Normalizing flows provide a general mechanism for defining expressive probability distributions, only requiring the specification of a (usually simple) base distribution and a series of bijective transformations. There has been much recent work on normalizing flows, ranging from improving their expressive power to expanding their application. We believe the field has now matured and is in need of a unified perspective. In this review, we attempt to provide such a perspective by describing flows through the lens of probabilistic modeling and inference. We place special emphasis on the fundamental principles of flow design, and discuss foundational topics such as expressive power and computational trade-offs. We also broaden the conceptual framing of flows by relating them to more general probability transformations. Lastly, we summarize the use of flows for tasks such as generative modeling, approximate inference, and supervised learning.
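The mechanics behind every flow in the review fit in a few lines. Below is a minimal NumPy sketch of the change-of-variables formula, using a toy elementwise affine transform as the bijection (any invertible, differentiable map with a tractable Jacobian determinant would do):

```python
import numpy as np

# Change of variables: if x = f(z) with f bijective and z ~ base, then
# log p_x(x) = log p_z(z) - log|det Jf(z)|.
# Toy flow: elementwise affine transform x = exp(a) * z + b.

rng = np.random.default_rng(0)
a, b = rng.normal(size=2), rng.normal(size=2)   # flow parameters

def base_logpdf(z):
    # standard normal base distribution
    return -0.5 * np.sum(z**2 + np.log(2 * np.pi), axis=-1)

def forward(z):
    return np.exp(a) * z + b            # bijective for any a

def log_prob(x):
    z = (x - b) * np.exp(-a)            # inverse transform
    log_det = np.sum(a)                 # log|det J| of the forward map
    return base_logpdf(z) - log_det

x = forward(rng.normal(size=(5, 2)))
print(log_prob(x))                      # exact densities under the flow
```

Expressive flows in the paper replace the affine map with stacks of learned bijections, but the density computation is always this same two-term formula.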
Looking for something to read on your flight to #NeurIPS2019? Read about Normalizing Flows from our extensive review paper (also with new insights on how to think about and derive new flows) https://t.co/cPjQjZn3uf with @gpapamak @eric_nalisnick @DeepSpiker @balajiln @shakir_za pic.twitter.com/EWh8Aui7n0— Danilo J. Rezende (@DeepSpiker) December 6, 2019
Shameless self-plug (if interested):— Joan Serrà (@serrjoa) December 6, 2019
1) https://t.co/QcIxvklSdp Despite audio, there are general learnings, like using single-scale structure or forward-backward 'style' transfer.
2) https://t.co/rutwPguZbn Where we show flow likelihoods are strongly influenced by input complexity.
WRT the second paper: really cool! I also noticed the complexity bias effect in https://t.co/fw0KCviq5S where I was able to condition a flow model on images with monochromatic occlusions, but not noisy occlusions (figure 6)— Andrew Gambardella (@gambsgambs) December 6, 2019
Good job suddenly upending my flight plans, Danilo 😅— Julien Cornebise (@JCornebise) December 6, 2019
Very clear presentation and figures--appreciate this timely review. For researchers new to this area, @ari_seff just posted an introductory video on normalizing flows--worth checking out! https://t.co/I2OeGxQYdS— Geoffrey Roeder (@geoffroeder) December 6, 2019
Check out our extensive review paper on normalizing flows!— George Papamakarios (@gpapamak) December 6, 2019
This paper is the product of years of thinking about flows: it contains everything we know about them, and many new insights.
With @eric_nalisnick, @DeepSpiker, @shakir_za, @balajiln.https://t.co/BBymd1uSwx
Thread 👇 https://t.co/er8QebcPS2
We hope there is something there for everyone interested in flows:— George Papamakarios (@gpapamak) December 6, 2019
- A gentle introduction for those wanting to get started.
- Explanations of existing flows for practitioners who want to deepen their understanding.
- Advanced topics for seasoned experts. pic.twitter.com/j7q5jcgrJD
Computational Mirrors: Blind Inverse Light Transport by Deep Matrix Factorization
Miika Aittala, Prafull Sharma, Lukas Murmann, Adam B. Yedidia, Gregory W. Wornell, William T. Freeman, Fredo Durand
We recover a video of the motion taking place in a hidden scene by observing changes in indirect illumination in a nearby uncalibrated visible region. We solve this problem by factoring the observed video into a matrix product between the unknown hidden scene video and an unknown light transport matrix. This task is extremely ill-posed, as any non-negative factorization will satisfy the data. Inspired by recent work on the Deep Image Prior, we parameterize the factor matrices using randomly initialized convolutional neural networks trained in a one-off manner, and show that this results in decompositions that reflect the true motion in the hidden scene.
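As a rough illustration of the factorization idea (not the authors' architecture), here is a toy PyTorch sketch: the observed video is fit as a product of two nonnegative factors, each produced by a small, randomly initialized network trained on this one video alone, in the spirit of the Deep Image Prior. The data here is synthetic and the networks are simplistic stand-ins:

```python
import torch

# Observed video O (pixels x frames) is modeled as O ~= T @ H, where
# T is an unknown light transport matrix and H the hidden-scene video.
# Each factor is produced by a small randomly initialized network
# trained in a one-off manner on this single video.

P, F, S = 64, 100, 16                   # pixels, frames, hidden-scene size
O = torch.rand(P, F)                    # stand-in for the observed video

net_T = torch.nn.Sequential(torch.nn.Linear(32, 128), torch.nn.ReLU(),
                            torch.nn.Linear(128, P * S))
net_H = torch.nn.Sequential(torch.nn.Linear(32, 128), torch.nn.ReLU(),
                            torch.nn.Linear(128, S * F))
zT, zH = torch.randn(32), torch.randn(32)    # fixed random inputs

opt = torch.optim.Adam([*net_T.parameters(), *net_H.parameters()], lr=1e-3)
for step in range(2000):
    T = torch.nn.functional.softplus(net_T(zT)).reshape(P, S)   # nonnegative
    H = torch.nn.functional.softplus(net_H(zH)).reshape(S, F)
    loss = ((T @ H - O) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()
print(f"reconstruction MSE: {loss.item():.4f}")
```

The paper's point is that although any nonnegative factorization fits the data, the inductive bias of the convolutional parameterization steers the optimizer toward factorizations reflecting the true hidden motion.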
Using computers to view the unseen - A new computational imaging method could change how we view hidden information in scenes. Amazing research by scientists from @MIT_CSAIL. Full Paper Here > https://t.co/PQp4N1fX7z via @antgrasso #ComputerScience #AI #ComputerVision pic.twitter.com/2EPYzu9FAw— Antonio Grasso (@antgrasso) December 6, 2019
Computational Mirrors: Blind Inverse Light Transport by Deep Matrix Factorization. An overview of the architecture and data flow of @MIT_CSAIL blind inverse light transport method. Full Paper Here > https://t.co/PQp4N1fX7z via @antgrasso #ComputerScience #AI #ComputerVision pic.twitter.com/8OgDtOH9YD— Antonio Grasso (@antgrasso) December 6, 2019
Deep Ensembles: A Loss Landscape Perspective
Stanislav Fort, Huiyi Hu, Balaji Lakshminarayanan
Deep ensembles have been empirically shown to be a promising approach for improving accuracy, uncertainty and out-of-distribution robustness of deep learning models. While deep ensembles were theoretically motivated by the bootstrap, non-bootstrap ensembles trained with just random initialization also perform well in practice, which suggests that there could be other explanations for why deep ensembles work well. Bayesian neural networks, which learn distributions over the parameters of the network, are theoretically well-motivated by Bayesian principles, but do not perform as well as deep ensembles in practice, particularly under dataset shift. One possible explanation for this gap between theory and practice is that popular scalable approximate Bayesian methods tend to focus on a single mode, whereas deep ensembles tend to explore diverse modes in function space. We investigate this hypothesis by building on recent work on understanding the loss landscape of neural networks and adding our own exploration to measure the similarity of functions in the space of predictions. Our results show that random initializations explore entirely different modes, while functions along an optimization trajectory or sampled from the subspace thereof cluster within a single mode in prediction space, even though they often deviate significantly in weight space. We demonstrate that while low-loss connectors between modes exist, they are not connected in the space of predictions. Developing the concept of the diversity-accuracy plane, we show that the decorrelation power of random initializations is unmatched by popular subspace sampling methods.
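The paper's central measurement, disagreement between the predictions of independently initialized networks, is easy to reproduce in miniature. The following scikit-learn sketch on toy data is an assumption-laden stand-in for the paper's large-scale setup:

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.neural_network import MLPClassifier

# Train several networks that differ only in their random init, then
# measure pairwise disagreement of their predictions, a simple proxy
# for the function-space diversity studied in the paper.
X, y = make_moons(500, noise=0.2, random_state=0)
models = [MLPClassifier(hidden_layer_sizes=(32, 32), max_iter=2000,
                        random_state=seed).fit(X, y) for seed in range(5)]
preds = np.stack([m.predict(X) for m in models])

for i in range(len(models)):
    for j in range(i + 1, len(models)):
        d = np.mean(preds[i] != preds[j])   # fraction of differing labels
        print(f"disagreement({i},{j}) = {d:.3f}")
```

In the paper, the analogous quantity is computed between checkpoints along training trajectories, subspace samples, and independent runs, which is what separates "same mode" from "different mode" in function space.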
Why do deep ensembles trained with just random initialization work surprisingly well in practice?— Balaji Lakshminarayanan (@balajiln) December 7, 2019
In our recent paper https://t.co/pnvqezb7a9 with @stanislavfort & Huiyi Hu, we investigate this by using insights from recent work on loss landscape of neural nets.
2) One hypothesis is that ensembles may lead to different modes while scalable Bayesian methods may sample from a single mode.— Balaji Lakshminarayanan (@balajiln) December 7, 2019
We measure the similarity of functions (both in weight space and function space) to test this hypothesis. pic.twitter.com/QYfqvSDmWd
3) t-SNE plot of predictions along training trajectories (marked by different colors) shows that random initialization leads to diverse functions. Sampling functions from a subspace corresponding to a single trajectory increases diversity but not as much as random init. pic.twitter.com/1hxxu12a4c— Balaji Lakshminarayanan (@balajiln) December 7, 2019
4) From a bias-variance perspective, we care about both accurate solutions (low bias) and diverse solutions (as decorrelation reduces variance).— Balaji Lakshminarayanan (@balajiln) December 7, 2019
Given a reference solution, we plot diversity vs accuracy to measure how different methods trade-off diversity vs accuracy. pic.twitter.com/Gkh7N48QQH
5) We also validate the hypothesis by building low-loss tunnels between solutions found by different random inits. While points along low loss tunnel have similar accuracies, the function space disagreement between them & the two end points shows that the modes are diverse. pic.twitter.com/JUsaysXIpA— Balaji Lakshminarayanan (@balajiln) December 7, 2019
If you'd like to learn more, check out our paper https://t.co/pnvqezb7a9 :)@stanislavfort will also be giving a contributed talk about our work on Dec 13 (Friday) 9-915 AM and presenting a poster at the Bayesian deep learning workshop (https://t.co/OyPfyWua8Z) at #NeurIPS2019 pic.twitter.com/NrpnmTNlDv— Balaji Lakshminarayanan (@balajiln) December 7, 2019
By any chance, have you tried the mode investigation you did on CIFAR on ImageNet? There are possible arguments that on a dataset like ImageNet, where there is significantly more structure than CIFAR, finding "different" optima could be much harder.— AI Actor-Critic (@AIActorCritic) December 9, 2019
Combining Q-Learning and Search with Amortized Value Estimates
Jessica B. Hamrick, Victor Bapst, Alvaro Sanchez-Gonzalez, Tobias Pfaff, Theophane Weber, Lars Buesing, Peter W. Battaglia
We introduce "Search with Amortized Value Estimates" (SAVE), an approach for combining model-free Q-learning with model-based Monte-Carlo Tree Search (MCTS). In SAVE, a learned prior over state-action values is used to guide MCTS, which estimates an improved set of state-action values. The new Q-estimates are then used in combination with real experience to update the prior. This effectively amortizes the value computation performed by MCTS, resulting in a cooperative relationship between model-free learning and model-based search. SAVE can be implemented on top of any Q-learning agent with access to a model, which we demonstrate by incorporating it into agents that perform challenging physical reasoning tasks and Atari. SAVE consistently achieves higher rewards with fewer training steps, and---in contrast to typical model-based search approaches---yields strong performance with very small search budgets. By combining real experience with information computed during search, SAVE demonstrates that it is possible to improve on both the performance of model-free learning and the computational cost of planning.
Combining Q-Learning and Search with Amortized Value Estimates— hardmaru (@hardmaru) December 7, 2019
“By combining real experience with information computed during search, we show it is possible to improve on both the performance of model-free learning and the computational cost of planning”https://t.co/5zjaTQPQyM https://t.co/s0lQd7cLHl pic.twitter.com/gGOPQl4Xk8
Excited to add to the growing literature on model-based deep RL! Search with Amortized Value Estimates (SAVE) leverages both real and planned experience by combining Q-learning with MCTS, achieving strong performance with very small search budgets. 1/4 https://t.co/zjAmrAo3OZ pic.twitter.com/3Ra2qzcmqK— Jess Hamrick (@jhamrick) December 6, 2019
We show that SAVE improves upon the model-based results on our Construction tasks, and show that it also achieves good performance on a challenging new Marble Run environment. 2/4 pic.twitter.com/wG3vasPR2d— Jess Hamrick (@jhamrick) December 6, 2019
We also find that SAVE works very well with small search budgets, in comparison to related approaches such as AlphaZero. We show that such methods, which learn policies based on the visit counts of actions during search, struggle with small search budgets. 3/4— Jess Hamrick (@jhamrick) December 6, 2019
Neural Tangents: Fast and Easy Infinite Neural Networks in Python
Roman Novak, Lechao Xiao, Jiri Hron, Jaehoon Lee, Alexander A. Alemi, Jascha Sohl-Dickstein, Samuel S. Schoenholz
Neural Tangents is a library designed to enable research into infinite-width neural networks. It provides a high-level API for specifying complex and hierarchical neural network architectures. These networks can then be trained and evaluated either at finite-width as usual or in their infinite-width limit. Infinite-width networks can be trained analytically using exact Bayesian inference or using gradient descent via the Neural Tangent Kernel. Additionally, Neural Tangents provides tools to study gradient descent training dynamics of wide but finite networks in either function space or weight space. The entire library runs out-of-the-box on CPU, GPU, or TPU. All computations can be automatically distributed over multiple accelerators with near-linear scaling in the number of devices. Neural Tangents is available at www.github.com/google/neural-tangents. We also provide an accompanying interactive Colab notebook.
Nice work! I like how readable the examples are (e.g., https://t.co/EwPW9heVwd).— Dustin Tran (@dustinvtran) December 6, 2019
Over the past few years there has been a lot of research into very wide / infinitely wide neural networks. This limit has a lot of appealing properties: neural networks become Gaussian Processes and their gradient-descent training dynamics become completely tractable.— Sam Schoenholz (@sschoenholz) December 6, 2019
This gives unprecedented insight into how / why neural networks behave the way they do. However, the math for infinite networks can be tricky and had to be worked out from scratch for each new architecture.— Sam Schoenholz (@sschoenholz) December 6, 2019
This is very similar to deep learning pre-automatic differentiation.
The core of Neural Tangents is a high level neural network library. Any network specified in Neural Tangents automatically comes with a function to compute the infinite-width limit analytically.— Sam Schoenholz (@sschoenholz) December 6, 2019
Here's an example for a two-hidden layer FC network: pic.twitter.com/xtk3BTkGbD
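For readers who can't see the attached screenshot, a two-hidden-layer fully connected network in Neural Tangents looks roughly like this, based on the library's public `stax` API (consult the repository for the current signatures):

```python
import jax.random as random
from neural_tangents import stax

# Two-hidden-layer fully connected network. kernel_fn computes the
# corresponding infinite-width NNGP and NTK kernels analytically.
init_fn, apply_fn, kernel_fn = stax.serial(
    stax.Dense(512), stax.Relu(),
    stax.Dense(512), stax.Relu(),
    stax.Dense(1))

key = random.PRNGKey(0)
x_train = random.normal(key, (20, 10))
kernels = kernel_fn(x_train, x_train, ('nngp', 'ntk'))
print(kernels.nngp.shape, kernels.ntk.shape)   # (20, 20) each
```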
But of course, why stop there. Here's the code to compute the infinite-width limit of a Wide Residual Network (ResNet 28-∞): pic.twitter.com/3fQc8tj28b— Sam Schoenholz (@sschoenholz) December 6, 2019
We also include functions to perform exact Bayesian inference on any of these models as well as continuous time gradient descent via the Neural Tangent Kernel. Here's how one could do gradient descent training. Ofc these models naturally come with uncertainty estimates. pic.twitter.com/9Ef1y1bJkj— Sam Schoenholz (@sschoenholz) December 6, 2019
There's a lot more in NT and we're actively working on it to make it even better.— Sam Schoenholz (@sschoenholz) December 6, 2019
If you're around NeurIPS, Roman Novak will be giving a talk at the AABI and we'll also be at the Bayesian DL workshop and the Science meets Engineering workshop. Come say hi!
Learned about this project at Jascha's @FieldsInstitute @VectorInst talk! Code looks as tagline says: Fast and Easy (just replace jax.experimental.stax with neural_tangents.stax)!— Jesse Bettencourt (@jessebett) December 6, 2019
re: Fast can you talk a bit about scaling. For example, what's the cost vs vanilla stax.Dense?
Neural Tangents is a Python library designed to enable research into “infinite-width” neural networks.— hardmaru (@hardmaru) December 7, 2019
They provide an API for specifying complex neural network architectures that can then be trained and evaluated in their infinite-width limit. 🙉🤯https://t.co/Wr2SqlMOwA https://t.co/vAXC02pAs8
Infinite width networks (NNGPs and NTKs) are the most promising lead for theoretical understanding in deep learning. But, running experiments with them currently resembles the dark age of ML research before ubiquitous automatic differentiation. Neural Tangents fixes that. https://t.co/a3unONiXkV— Jascha (@jaschasd) December 6, 2019
In the papers this week.— Mark Ghuneim (@MarkGhuneim) December 7, 2019
TabFact: A Large-scale Dataset for Table-based Fact Verification https://t.co/PtV1QhWx3C
Learning a Representation for Cover Song Identification Using Convolutional Neural Network https://t.co/GTzxmNRFgl— Mark Ghuneim (@MarkGhuneim) December 7, 2019
Neural Tangents: Fast and Easy Infinite Neural Networks in Python https://t.co/BEIDXmh8wh— Mark Ghuneim (@MarkGhuneim) December 7, 2019
Building high-level features using large scale unsupervised learning https://t.co/uxE0QKTgIw— Mark Ghuneim (@MarkGhuneim) December 7, 2019
Troubling Trends in Machine Learning Scholarship https://t.co/IMdr1KDyDK— Mark Ghuneim (@MarkGhuneim) December 7, 2019
December 04, 2019
StarGAN v2: Diverse Image Synthesis for Multiple Domains
Yunjey Choi, Youngjung Uh, Jaejun Yoo, Jung-Woo Ha
A good image-to-image translation model should learn a mapping between different visual domains while satisfying the following properties: 1) diversity of generated images and 2) scalability over multiple domains. Existing methods address only one of these issues: they either offer limited diversity or require multiple models for all domains. We propose StarGAN v2, a single framework that tackles both and shows significantly improved results over the baselines. Experiments on CelebA-HQ and a new animal faces dataset (AFHQ) validate our superiority in terms of visual quality, diversity, and scalability. To better assess image-to-image translation models, we release AFHQ, high-quality animal faces with large inter- and intra-domain differences. The code, pretrained models, and dataset can be found at https://github.com/clovaai/stargan-v2.
Glad to share our new work, #StarGANv2. Our model can generate high-quality images reflecting the diverse styles (e.g., hairstyles, makeup) of reference images.— Yunjey Choi (@yunjey_choi) December 5, 2019
co-authors: @YoungjungUh @Jaejun_Yoo @JungWooHa2 pic.twitter.com/AHJx4itYLL
Additional results using different source and reference images. pic.twitter.com/G1M4rwOR8X— Yunjey Choi (@yunjey_choi) December 5, 2019
We will add more interesting videos. Please stay tuned.— Yunjey Choi (@yunjey_choi) December 5, 2019
Amazing stuff! Excited to see Naver working on awesome projects! ☺️— Jin (@jinyeom95) December 6, 2019
Fantastic Generalization Measures and Where to Find Them
Yiding Jiang, Behnam Neyshabur, Hossein Mobahi, Dilip Krishnan, Samy Bengio
Generalization of deep networks has been of great interest in recent years, resulting in a number of theoretically and empirically motivated complexity measures. However, most papers proposing such measures study only a small set of models, leaving open the question of whether the conclusions drawn from those experiments would remain valid in other settings. We present the first large-scale study of generalization in deep networks. We investigate more than 40 complexity measures taken from both theoretical bounds and empirical studies. We train over 10,000 convolutional networks by systematically varying commonly used hyperparameters. Hoping to uncover potentially causal relationships between each measure and generalization, we analyze carefully controlled experiments and show surprising failures of some measures as well as promising measures for further research.
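The paper's evaluation protocol boils down to rank-correlating a candidate complexity measure with the observed generalization gap across many trained models. A minimal sketch with made-up numbers, purely to show the mechanics (the paper uses Kendall's tau over thousands of real networks):

```python
import numpy as np
from scipy.stats import kendalltau

# Toy stand-ins: one complexity value and one generalization gap per
# trained model. Both columns here are invented for illustration.
measure = np.array([3.1, 4.5, 2.2, 5.0, 3.8])      # e.g. product of layer norms
gap     = np.array([0.05, 0.09, 0.03, 0.12, 0.07])  # train acc - test acc

tau, p = kendalltau(measure, gap)
print(f"Kendall tau = {tau:.2f}")   # +1 would mean a perfect rank predictor
```

The controlled part of the study is in how the models are generated: hyperparameters are varied systematically so that a measure's correlation can be attributed to the measure rather than to a confounder.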
Fantastic Generalization Measures and Where to Find Them— hardmaru (@hardmaru) December 5, 2019
“We present the first large scale study of generalization in deep networks. We train over 10,000 convolutional networks by systematically varying commonly used hyperparameters.”https://t.co/fRZDZvofHE https://t.co/c1Jd1tDlwA pic.twitter.com/b0FAhjAReX
Such a good title— commandasaurus 🦕 (@amcasari) December 5, 2019
Err..On Dec 4, 2 papers appear on ArXiv. The 1st one claims '...(sic 'module criticality') is able to explain the superior generalization performance of some architectures— Vinay Prabhu (@vinayprabhu) December 5, 2019
over others, whereas earlier measures fail to do so.',but no mention of this in the 2nd paper? @bneyshabur
These two papers were done in parallel, so there was no such measure when we were comparing different complexity measures 😀— Behnam Neyshabur (@bneyshabur) December 6, 2019
So, to conclude: module criticality is not a fantastic generalization measure and you will not find one in https://t.co/qhakiRsU1y ;) ?— Vinay Prabhu (@vinayprabhu) December 6, 2019
Do the same observations hold on larger datasets, like CIFAR-100 and Imagenet?— Grigory Yaroslavtsev (@gyaroslavtsev) December 5, 2019
A paper title written in WingDings, you heard it here first.— Ed Henry (@EdHenry_) December 5, 2019
you might not completely hate https://t.co/V2dTmgk0ti— Leon Derczynski (@LeonDerczynski) December 5, 2019
One of the most comprehensive studies of generalization to date; ≈40 complexity measures over ≈10K deep models. Surprising observations worthy of further investigations. Fantastic Generalization Measures: https://t.co/cjg94IIQbE w @yidingjiang @bneyshabur @dilipkay S. Bengio pic.twitter.com/POG4DoNaAU— Hossein Mobahi (@TheGradient) December 5, 2019
Where's the Takeuchi Information Criterion?!— Nicolas Le Roux (@le_roux_nicolas) December 5, 2019
Interesting paper from Google testing a variety of generalization measures on a variety of models. The most encouraging info for me was in the last paragraph: Google has computational constraints!!!! https://t.co/2XX7LAoFrm pic.twitter.com/pX0uk21Ex8— Dileep George (@dileeplearning) December 5, 2019
December 03, 2019
Dream to Control: Learning Behaviors by Latent Imagination
Danijar Hafner, Timothy Lillicrap, Jimmy Ba, Mohammad Norouzi
Learned world models summarize an agent's experience to facilitate learning complex behaviors. While learning world models from high-dimensional sensory inputs is becoming feasible through deep learning, there are many potential ways for deriving behaviors from them. We present Dreamer, a reinforcement learning agent that solves long-horizon tasks from images purely by latent imagination. We efficiently learn behaviors by propagating analytic gradients of learned state values back through trajectories imagined in the compact state space of a learned world model. On 20 challenging visual control tasks, Dreamer exceeds existing approaches in data-efficiency, computation time, and final performance.
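A toy PyTorch sketch of the core idea, backpropagating analytic value gradients through imagined latent trajectories. The linear modules are stand-ins for Dreamer's learned world model, actor, and value function, and many details (stochastic latents, λ-returns, the model and value losses) are omitted:

```python
import torch

# Roll the learned latent dynamics forward under the policy and
# backpropagate value gradients through the whole imagined trajectory.

D, A, H = 8, 2, 15                      # latent dim, action dim, horizon
dynamics = torch.nn.Linear(D + A, D)    # stand-in for the learned world model
policy   = torch.nn.Linear(D, A)
value    = torch.nn.Linear(D, 1)

z = torch.randn(16, D)                  # batch of starting latents
opt = torch.optim.Adam(policy.parameters(), lr=1e-3)

values = []
for t in range(H):
    a = torch.tanh(policy(z))           # differentiable action
    z = dynamics(torch.cat([z, a], -1)) # imagined next latent state
    values.append(value(z))
loss = -torch.stack(values).mean()      # ascend predicted state values
opt.zero_grad(); loss.backward(); opt.step()
```

Because the entire rollout happens in the compact latent space, the gradient of the values with respect to the policy parameters is available analytically, with no high-variance policy-gradient estimator needed.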
Dream to Control: Learning Behaviors by Latent Imagination— hardmaru (@hardmaru) December 4, 2019
The agent learns a latent world model via interactions, and backprops thru imagined latent trajectories of this model to learn useful behaviors. @danijarh et al.
code https://t.co/6mHirRhTvH pic.twitter.com/67ePHW3Z2T
Hm... so what is this? A MuZero (latent planning space learned from predicting rewards) which uses MPC over the unrolled model instead of MCTS?— 𝔊𝔴𝔢𝔯𝔫 (@gwern) December 5, 2019
We introduce Dreamer, an RL agent that solves long-horizon tasks from images purely by latent imagination inside a world model. Dreamer improves over existing methods across 20 tasks.— Danijar Hafner (@danijarh) December 4, 2019
Thread 👇 pic.twitter.com/K5DnooVIUH
Dreamer learns a world model from experience. Inside the compact latent space of the model, it predicts actions and state values. The policy is optimized efficiently by propagating analytic value gradients back through imagined trajectories. pic.twitter.com/61JMdSV76d— Danijar Hafner (@danijarh) December 4, 2019
Naturally, the value function enables longsighted behavior and lets Dreamer be robust to the imagination horizon. This lets us solve new tasks that a policy without value function or online planning with PlaNet could not solve. pic.twitter.com/fee40JWADV— Danijar Hafner (@danijarh) December 4, 2019
We evaluate Dreamer across 20 challenging visual control tasks with image inputs, where it exceeds previous methods in terms of final performance, sample-efficiency, and wall-clock time. Dreamer is also applicable to discrete actions and episodes with early termination. pic.twitter.com/JsfxKPoza3— Danijar Hafner (@danijarh) December 4, 2019
Did you try using local contexts for the contrastive loss as done in DIM and ST-DIM? You are right that the global contrastive loss cannot capture much information, but contrastive losses b/w local contexts seems to have quite an improvement (Table 1&2 in DIM, and Fig3 in ST-DIM)— Ankesh Anand (@ankesh_anand) December 4, 2019
PyTorch: An Imperative Style, High-Performance Deep Learning Library
Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Köpf, Edward Yang, Zach DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, Soumith Chintala
Deep learning frameworks have often focused on either usability or speed, but not both. PyTorch is a machine learning library that shows that these two goals are in fact compatible: it provides an imperative and Pythonic programming style that supports code as a model, makes debugging easy and is consistent with other popular scientific computing libraries, while remaining efficient and supporting hardware accelerators such as GPUs. In this paper, we detail the principles that drove the implementation of PyTorch and how they are reflected in its architecture. We emphasize that every aspect of PyTorch is a regular Python program under the full control of its user. We also explain how the careful and pragmatic implementation of the key components of its runtime enables them to work together to achieve compelling performance. We demonstrate the efficiency of individual subsystems, as well as the overall speed of PyTorch on several common benchmarks.
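The "code as a model" claim is concrete: the model is an ordinary Python program, data-dependent control flow included, and autograd differentiates whatever path actually executed. For example:

```python
import torch

# Define-by-run: autograd records the operations as they happen, so
# plain Python control flow is part of the model.
w = torch.randn(3, requires_grad=True)

def model(x):
    h = x @ w
    if h.sum() > 0:            # ordinary Python branching
        h = h * 2
    return h

loss = model(torch.randn(4, 3)).pow(2).mean()
loss.backward()                # gradients for the executed path only
print(w.grad)
```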
The first full paper on @pytorch after 3 years of development.— Soumith Chintala (@soumithchintala) December 6, 2019
It describes our goals, design principles, and technical details up to v0.4
Catch the poster at #NeurIPS2019
Authored by @apaszke , @colesbury et al.https://t.co/XFyX0qt1RH
This one is to all of us from around the world who in late 2016 to mid 2017 shaped the project, from the (Lua)Torch community to new core contributors, who baked the open-source cake on days, nights and weekends!— Soumith Chintala (@soumithchintala) December 6, 2019
This is only the first. We will be writing more papers on @PyTorch , on v1.0 headed by @dzhugakov @jiayq and many other pioneers, detailed papers on the JIT compiler headed by @zdevito @OwenResistor and team, on distributed pytorch, etc.— Soumith Chintala (@soumithchintala) December 6, 2019
Thanks for the shoutout to DyNet!— Chris Dyer (@redpony) December 6, 2019
December 02, 2019
LOGAN: Latent Optimisation for Generative Adversarial Networks
Yan Wu, Jeff Donahue, David Balduzzi, Karen Simonyan, Timothy Lillicrap
Training generative adversarial networks requires balancing of delicate adversarial dynamics. Even with careful tuning, training may diverge or end up in a bad equilibrium with dropped modes. In this work, we introduce a new form of latent optimisation inspired by the CS-GAN and show that it improves adversarial dynamics by enhancing interactions between the discriminator and the generator. We develop supporting theoretical analysis from the perspectives of differentiable games and stochastic approximation. Our experiments demonstrate that latent optimisation can significantly improve GAN training, obtaining state-of-the-art performance for the ImageNet (128 x 128) dataset. Our model achieves an Inception Score (IS) of 148 and a Fréchet Inception Distance (FID) of 3.4, an improvement of 17% and 32% in IS and FID respectively, compared with the baseline BigGAN-deep model with the same architecture and number of parameters.
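The latent optimisation step at the heart of LOGAN is compact enough to sketch. `G` and `D` below are toy stand-ins, and the step size is an arbitrary choice rather than the paper's tuned value:

```python
import torch

# Before the usual GAN updates, take a gradient step on the latent z
# itself in the direction that increases the discriminator's score of
# the generated sample.
G = torch.nn.Linear(16, 32)                  # toy generator
D = torch.nn.Linear(32, 1)                   # toy discriminator

z = torch.randn(8, 16, requires_grad=True)
score = D(G(z)).sum()
(grad_z,) = torch.autograd.grad(score, z)
z_prime = z + 0.9 * grad_z                   # latent-optimised code
# G and D are then trained as usual, but on G(z_prime) instead of G(z).
```

Because z_prime depends on both networks' parameters through the gradient, this step couples the generator and discriminator updates, which is the "enhanced interaction" the abstract refers to.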
We introduce LOGAN, a game-theory motivated algorithm, which improves the state-of-the-art in GAN image generation by over 30% measured in FID: https://t.co/6QphP1FOZU— DeepMind (@DeepMindAI) December 3, 2019
Here are samples showing higher diversity: pic.twitter.com/GkdRofrYRt
would have been fun to use images of wolverine in the example...— Prof Hugo Spiers (@hugospiers) December 4, 2019
Deep Learning for Symbolic Mathematics
Guillaume Lample, François Charton
Neural networks have a reputation for being better at solving statistical or approximate problems than at performing calculations or working with symbolic data. In this paper, we show that they can be surprisingly good at more elaborate tasks in mathematics, such as symbolic integration and solving differential equations. We propose a syntax for representing mathematical problems, and methods for generating large datasets that can be used to train sequence-to-sequence models. We achieve results that outperform commercial Computer Algebra Systems such as Matlab or Mathematica.
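The key representational move is serializing expression trees into token sequences that a seq2seq model can consume. A minimal sketch using SymPy and a prefix (Polish notation) traversal; this approximates, but is not, the authors' exact syntax:

```python
import sympy as sp

# Serialize a SymPy expression tree into a flat prefix token sequence:
# each operator is emitted before its arguments, so no parentheses are
# needed and the tree is recoverable from the sequence.

def to_prefix(expr):
    if expr.is_Atom:
        return [str(expr)]
    tokens = [type(expr).__name__]          # operator name, e.g. 'Add'
    for arg in expr.args:
        tokens += to_prefix(arg)
    return tokens

x = sp.Symbol('x')
f = sp.sin(x) * x + 2
print(to_prefix(f))   # e.g. ['Add', '2', 'Mul', 'x', 'sin', 'x']
```

With expressions flattened this way, integration and ODE solving become sequence-to-sequence translation problems over (input expression, solution) pairs.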
Deep Learning For Symbolic Mathematics— hardmaru (@hardmaru) September 27, 2019
They compare performance of standard seq2seq models (trained on generated datasets) on more elaborated mathematical tasks such as symbolic integration and solving differential equations, with Mathematica and Matlab.https://t.co/MuTcXpUh5Q pic.twitter.com/EhCce83Xgc
When humans solve integrals or differential equations by hand, I wonder how much pattern recognition we perform compared to the actual symbolic manipulation? pic.twitter.com/unrqRx78AK— hardmaru (@hardmaru) September 27, 2019
When it’s computer science they call it an algorithm, when it’s math they call it statistics, and when it’s bullsh🤬t they call it artificial intelligence 😇— Twenty Twentie and the Twenties (@cbarrett) December 5, 2019
Our new paper, Deep Learning for Symbolic Mathematics, is now on arXiv https://t.co/cxAa3upB6h— Guillaume Lample (@GuillaumeLample) December 4, 2019
We added *a lot* of new results compared to the original submission. With @f_charton (1/7) pic.twitter.com/GrhQRT5WRW
Although neural networks struggle on simple arithmetic tasks such as addition and multiplication, we show that transformers perform surprisingly well on difficult mathematical problems such as function integration and differential equations. (2/7)— Guillaume Lample (@GuillaumeLample) December 4, 2019
We define a general framework to adapt seq2seq models to various mathematical problems, and present different techniques to generate arbitrarily large datasets of functions with their integrals, and differential equations with their solutions. (3/7)— Guillaume Lample (@GuillaumeLample) December 4, 2019
On samples of randomly generated functions, we show that transformers achieve state-of-the-art performance and outperform computer algebra systems such as Mathematica. (4/7)— Guillaume Lample (@GuillaumeLample) December 4, 2019
We show that beam search can generate alternative solutions for a differential equation, all equivalent, but written in very different ways. The model was never trained to do this, but managed to figure out that different expressions correspond to the same mathematical object 5/7— Guillaume Lample (@GuillaumeLample) December 4, 2019
We also observe that a transformer trained on functions that SymPy can integrate is able, at test time, to integrate functions that SymPy cannot, i.e. the model generalizes beyond the set of functions integrable by SymPy. (6/7)— Guillaume Lample (@GuillaumeLample) December 4, 2019
A purely neural approach is not sufficient, since it still requires a symbolic framework to check generated hypotheses. Yet, our models perform best on very long inputs, where computer algebra systems struggle. Symbolic computation may benefit from hybrid approaches. (7/7)— Guillaume Lample (@GuillaumeLample) December 4, 2019
Do you think it's possible to interpret the network hidden layer activations (eg attention weights) to gain some insights on how the network manages to 'reason' and solve the hardest cases?— Olivier Grisel (@ogrisel) December 4, 2019
Awesome work! :) And thank you for the very helpful and clear Twitter summary.— Jeremy Howard (@jeremyphoward) December 4, 2019
Thanks for trying this! This totally overrules what I thought was possible— Phillip Wang (@lucidrains) December 4, 2019
If you are not aware, we had a discussion about the original review paper on the SymPy mailing list https://t.co/h6AG8RvUVG— Aaron Meurer (@asmeurer) December 4, 2019
This is literally the most well written ML paper I’ve ever read— (((David Shor))) (@davidshor) December 5, 2019
Deep learning for all the things!— Visiting Fellow (@jackiefloyd) December 5, 2019
if you can't help me no one can... i'm looking for a closed form solution for the inverse of the indefinite integral of f(x) = sqrt(A*x^2 + B*x + C). computing the integral (with some help) was easy, but i could not find a good inverse. had to use newton-raphson.— L. ☕️. Ritter (@paniq) December 8, 2019
Also Keplers conjecture— Suresh Venkatasubramanian (@geomblog) December 7, 2019
November 29, 2019
What's Hidden in a Randomly Weighted Neural Network?
Vivek Ramanujan, Mitchell Wortsman, Aniruddha Kembhavi, Ali Farhadi, Mohammad Rastegari
Training a neural network is synonymous with learning the values of the weights. In contrast, we demonstrate that randomly weighted neural networks contain subnetworks which achieve impressive performance without ever training the weight values. We show that hidden in a randomly weighted Wide ResNet-50 there is a subnetwork (with random weights) that is smaller than, but matches the performance of, a ResNet-34 trained on ImageNet. Not only do these "untrained subnetworks" exist, but we provide an algorithm to effectively find them. We empirically show that as randomly weighted neural networks with fixed weights grow wider and deeper, an "untrained subnetwork" approaches a network with learned weights in accuracy.
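A simplified single-layer sketch of the flavor of algorithm the paper proposes: learn a score per frozen weight, use only the top-k weights by score in the forward pass, and train the scores with a straight-through estimator. Details differ from the paper's exact recipe:

```python
import torch

class SubnetLinear(torch.nn.Module):
    """Linear layer whose weights are frozen at their random init;
    only per-weight scores are learned, selecting a subnetwork."""
    def __init__(self, d_in, d_out, k=0.5):
        super().__init__()
        self.weight = torch.nn.Parameter(torch.randn(d_out, d_in),
                                         requires_grad=False)   # frozen
        self.scores = torch.nn.Parameter(torch.randn(d_out, d_in))
        self.k = k                                              # keep ratio

    def forward(self, x):
        thresh = torch.quantile(self.scores.abs().flatten(), 1 - self.k)
        hard = (self.scores.abs() >= thresh).float()
        # straight-through: forward uses the hard mask, backward treats
        # the mask as the soft scores so gradients reach them
        mask = hard + self.scores.abs() - self.scores.abs().detach()
        return x @ (self.weight * mask).t()

layer = SubnetLinear(10, 4)
out = layer(torch.randn(2, 10))
out.sum().backward()          # gradients flow into layer.scores only
```

The weights themselves never change; optimization only decides which random weights to keep, which is exactly the sense in which a good subnetwork is "hidden" in the random network.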
What's Hidden in a Randomly Weighted Neural Network?— hardmaru (@hardmaru) December 3, 2019
“Hidden in a randomly weighted Wide ResNet-50 we show that there is a subnetwork (with random weights) that is smaller than, but matches the performance of a ResNet-34 trained on ImageNet.” 😮https://t.co/Z4MWcXEzR8 https://t.co/4Otdlz9oct
"Every block of stone has a statue inside it and it is the task of the sculptor to discover it." -- Michelangelo— Mario Klingemann (@quasimondo) December 3, 2019
Is it in the spirit of the lottery ticket hypothesis?— Piotr Migdal (@pmigdal) December 3, 2019
(IMHO by far the most profound result in deep learning theory.)
I hope to see more work into HyperNEAT-style genetic encoding for compressing large networks into a subnetwork enforcing regular structural patterns. An artificial “genetic bottleneck” should bring us closer to innate structures observed in animals: https://t.co/GijODKnFeb— hardmaru (@hardmaru) December 4, 2019
Now that would be something— Nikolai Yakovenko (@ivan_bezdomny) December 4, 2019
very interesting, but also not so interesting bc (1) isn't finding a subset of a net equiv. (almost) to training the net? (2) you sample more, you increase your chance. https://t.co/uzsMyKpBIl— Kyunghyun Cho (@kchonyc) December 3, 2019
Not sure I agree with these criticisms, but also unclear how it differs from the Iterative magnitude pruning used by LTH.— Daniel Roy (@roydanroy) December 3, 2019
I think in some limit (as # of params grows) there may exist some subnetwork that's equivalent to a trained network (since eventually all possible subnetworks will exist), but it's an empirical question whether it exists for the sizes of network we deal with in practice— Adam Santoro (@santoroAI) December 3, 2019
What's hidden in an overparameterized neural network with random weights? If the distribution is properly scaled (e.g. Kaiming Normal), then it contains a subnetwork which achieves high accuracy without ever modifying the values of the weights...https://t.co/szj5c2oohG— Mitchell Wortsman (@Mitchnw) December 2, 2019
Alternate title: Randomly weighted neural networks. What do they contain? Do they contain things? Lets find out.https://t.co/szj5c2oohG— Mitchell Wortsman (@Mitchnw) December 2, 2019
So this subnetwork has no training involved? But are these subnetworks as good in subsequent fine-tuning as subnetworks found using the lottery ticket hypothesis?— Carlos E. Perez 🧢 (@IntuitMachine) December 3, 2019
This work is inspired by and builds upon the incredible foundational work of @oh_that_hat et al. in Deconstructing the LTH, Gaier and @hardmaru in Weight Agnostic Neural Networks, @jefrankle and @mcarbin in the LTH, and many many more -- check out these amazing/inspiring papers!— Mitchell Wortsman (@Mitchnw) December 5, 2019
I guess the implication here is that the fixed network architecture has an outsized influence rather than the actual weights. Can we arrive at this conclusion?— Carlos E. Perez 🧢 (@IntuitMachine) December 3, 2019
We recently found that a randomly initialized + fine-tuned BERT performs surprisingly well in 5/6 NLP tasks (80% acc for sentiment analysis!). I guess fine-tuning could be interpreted as tweaking the net so as to amplify the successful subnetwork?— Anna Rogers (@annargrs) December 4, 2019
Paper: https://t.co/p4DlQWsNAC pic.twitter.com/5fFvFq1bVL
Procedural Content Generation: From Automatically Generating Game Levels to Increasing Generality in Machine Learning
Sebastian Risi, Julian Togelius
The idea behind procedural content generation (PCG) in games is to create content automatically, using algorithms, instead of relying on user-designed content. While PCG approaches have traditionally focused on creating content for video games, they are now being applied to all kinds of virtual environments, thereby enabling training of machine learning systems that are significantly more general. For example, PCG's ability to generate never-ending streams of new levels has allowed DeepMind's Capture the Flag agent to reach beyond human-level performance. Additionally, PCG-inspired methods such as domain randomization enabled OpenAI's robot arm to learn to manipulate objects with unprecedented dexterity. Level generation in 2D arcade games has also illuminated some shortcomings of standard deep RL methods, suggesting potential ways to train more general policies. This Review looks at key aspects of PCG, including its ability to (1) enable new video games (such as No Man's Sky), (2) create open-ended learning environments, (3) combat overfitting in supervised and reinforcement learning tasks, and (4) create better benchmarks that could ultimately spur the development of better learning algorithms. We hope this article can introduce the broader machine learning community to PCG, which we believe will be a critical tool in creating a more general machine intelligence.
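For readers new to PCG, a classic example of the genre is the cellular-automaton cave generator below. It is not from the paper, but each call produces a fresh level, which is exactly the "never-ending stream" of environments the Review advocates training on:

```python
import numpy as np

# Classic PCG recipe: seed a grid with random walls, then repeatedly
# apply a majority-vote smoothing rule to carve out organic caves.

def generate_cave(h=20, w=40, fill=0.45, steps=4, seed=None):
    rng = np.random.default_rng(seed)
    grid = (rng.random((h, w)) < fill).astype(int)   # 1 = wall
    for _ in range(steps):
        # count wall neighbours (including self) via padded shifted sums
        p = np.pad(grid, 1, constant_values=1)
        nb = sum(p[i:i+h, j:j+w] for i in range(3) for j in range(3))
        grid = (nb >= 5).astype(int)                 # smooth the caves
    return grid

for row in generate_cave(seed=0):
    print(''.join('#' if c else '.' for c in row))
```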
Procedural Content Generation: From Automatically Generating Game Levels to Increasing Generality in Machine Learning— hardmaru (@hardmaru) December 2, 2019
Nice review of PCG techniques in AI and games research by @risi1979 and @togelius 🤖👾https://t.co/uYM6wH2Fph https://t.co/X3PPMg3rrh pic.twitter.com/EkIUt8rPXT
PCG Gym Environment benchmarks released by OpenAI: https://t.co/1aquVBKTV5— hardmaru (@hardmaru) December 4, 2019
It's time for reinforcement learning researchers to take the domains they train on seriously. Training on a set of fixed scenarios/levels leads to brittle policies that don't generalize. Procedural content generation can help, argue @risi1979 and I here: https://t.co/Q7ZXPlgHOU pic.twitter.com/6UIzt8HY3l— Julian Togelius (@togelius) December 2, 2019
While some RL researchers have started using domain randomization, these are relatively simple methods for achieving variation. Procedural Content Generation is a set of techniques from game development and research that offers a much larger repertoire of methods for variation. pic.twitter.com/UVNioFHXe1— Julian Togelius (@togelius) December 2, 2019
PCG has been around in game development since the early eighties, originally to conserve storage space and remove the need for authoring some kinds of content, but increasingly also to make new types of games with their own PCG-based aesthetics possible. pic.twitter.com/2jf3PpLto8— Julian Togelius (@togelius) December 2, 2019
In games research, the last decade and a half has seen a number of very diverse techniques applied to content generation, including constraint satisfaction, evolutionary algorithms, fractals, grammar expansion, planning, and others. It's very far from simple randomness.— Julian Togelius (@togelius) December 2, 2019
Our paper surveys some of these methods mainly from the perspective of reinforcement learning research. But they are widely useful in other contexts as well. Note that this is a preprint, and while it has been submitted, we welcome constructive criticism!https://t.co/1627uhyCmJ— Julian Togelius (@togelius) December 2, 2019
Note that one thing that recent results in generality and specificity imply is that we need PCG-based environments for training general behavior. The Arcade Learning Environment and similar environments have had a good run, but it is time we move on to PCG-based training.— Julian Togelius (@togelius) December 2, 2019
Very much agree, fwiw— Dan Brickley (@danbri) December 3, 2019
We're signing the paper with our academic affiliations, @ITUkbh and @nyutandon, but also with our startup @modl_ai which applies this kind of research to problems in the game industry, working with major (and minor) game developers to make new kinds of games possible.— Julian Togelius (@togelius) December 2, 2019
Also, if you want a deeper dive into PCG methods, the following books are useful:— Julian Togelius (@togelius) December 2, 2019
Procedural Content Generation in Games (with @noorshak and @mm_jj_nn)https://t.co/d7Y1fTFomc
Chapter 4 of Artificial Intelligence and Games (with @yannakakis)https://t.co/nctY2rfcZu
Thanks! Very promising stuff!!— Carlos E. Perez 🧢 (@IntuitMachine) December 2, 2019
Aww, thanks for the little Wheel And Steal shot there.— mike cook (@mtrc) December 2, 2019
Most of the games there come with a set of levels, but no level generators per se. However, there are attempts at creating general level generators that work for any game, and even a competition for this!https://t.co/1nTyk9hPVx— Julian Togelius (@togelius) December 3, 2019
November 28, 2019
GitHub Typo Corpus: A Large-Scale Multilingual Dataset of Misspellings and Grammatical Errors
Masato Hagiwara, Masato Mita
The lack of large-scale datasets has been a major hindrance to the development of NLP tasks such as spelling correction and grammatical error correction (GEC). As a complementary new resource for these tasks, we present the GitHub Typo Corpus, a large-scale, multilingual dataset of misspellings and grammatical errors along with their corrections harvested from GitHub, a large and popular platform for hosting and sharing git repositories. The dataset, which we have made publicly available, contains more than 350k edits and 65M characters in more than 15 languages, making it the largest dataset of misspellings to date. We also describe our process for filtering true typo edits based on learned classifiers on a small annotated subset, and demonstrate that typo edits can be identified with F1 ~ 0.9 using a very simple classifier with only three features. The detailed analyses of the dataset show that existing spelling correctors merely achieve an F-measure of approx. 0.5, suggesting that the dataset serves as a new, rich source of spelling errors that complement existing datasets.
🎉Introducing GitHub Typo Corpus, a large-scale multilingual dataset of misspellings and grammatical errors. Contains 350k+ edits in 15+ languages. Code & Dataset https://t.co/yiXjxtUsYj Paper: https://t.co/poziDKWvGi joint work w/ @chemical_tree at RIKEN AIP and Tohoku Univ.— Masato Hagiwara (@mhagiwara) December 2, 2019
November 27, 2019
Contrastive Learning of Structured World Models
Thomas Kipf, Elise van der Pol, Max Welling
A structured understanding of our world in terms of objects, relations, and hierarchies is an important component of human cognition. Learning such a structured world model from raw sensory data remains a challenge. As a step towards this goal, we introduce Contrastively-trained Structured World Models (C-SWMs). C-SWMs utilize a contrastive approach for representation learning in environments with compositional structure. We structure each state embedding as a set of object representations and their relations, modeled by a graph neural network. This allows objects to be discovered from raw pixel observations without direct supervision as part of the learning process. We evaluate C-SWMs on compositional environments involving multiple interacting objects that can be manipulated independently by an agent, simple Atari games, and a multi-object physics simulation. Our experiments demonstrate that C-SWMs can overcome limitations of models based on pixel reconstruction and outperform typical representatives of this model class in highly structured environments, while learning interpretable object-based representations.
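The contrastive objective is simple to write down. A toy single-slot PyTorch sketch follows; the paper additionally factorizes the state into object slots and implements the transition model as a graph neural network:

```python
import torch

# Contrastive world-model loss in miniature: a transition model T
# predicts the change in latent state, a hinge pushes the prediction
# close to the true next state and random negatives far from it.

D, gamma = 4, 1.0
T = torch.nn.Linear(D + 2, D)                 # transition model (state, action)

z_t   = torch.randn(32, D)                    # encoded observation at t
a_t   = torch.randn(32, 2)                    # actions
z_t1  = torch.randn(32, D)                    # encoded observation at t+1
z_neg = z_t1[torch.randperm(32)]              # negatives drawn from the batch

pred = z_t + T(torch.cat([z_t, a_t], -1))     # additive transition
pos = ((pred - z_t1) ** 2).sum(-1)            # pull positives together
neg = torch.relu(gamma - ((z_neg - z_t1) ** 2).sum(-1))  # push negatives away
loss = (pos + neg).mean()
```

Because no pixels are reconstructed, the encoder is free to keep only information that is predictive of transitions, which is why the learned latent spaces end up so structured.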
Contrastive Learning of Structured World Models— hardmaru (@hardmaru) November 28, 2019
A structured understanding of our world in terms of objects, relations and hierarchies is an important part of human cognition.
This paper explores using graph neural nets to learn structured world models.https://t.co/D61ZLbByLP pic.twitter.com/8NwzQoSMZW
Excited to share our work on Contrastive Learning of Structured World Models!— Thomas Kipf (@thomaskipf) November 28, 2019
C-SWMs learn object-factorized models & discover objects without supervision, using a simple loss inspired by work on graph embeddings
This is joint work w/ @ElisevanderPol & @wellingmax— Thomas Kipf (@thomaskipf) November 28, 2019
C-SWMs structure the latent space into multiple 'object slots' and learn an (additive) relational transition model using a graph neural network, to predict the next latent state. Result: highly structured latent spaces.
This allows the model to learn object-specific transitions in latent space that depend on the latent states of other objects, e.g. the positions of objects in the example below. The emerging grid in latent space reflects the structure of the environment.— Thomas Kipf (@thomaskipf) November 28, 2019
We propose a new ranking-based evaluation of "world models" directly in latent space, inspired by evaluation scores used in statistical relational learning (e.g., for knowledge base completion). Avoids common pitfalls of pixel-based evaluation.— Thomas Kipf (@thomaskipf) November 28, 2019
We also evaluate on two Atari games and find promising results, but C-SWMs still contain a number of limitations: a) no stochasticity, b) Markov assumption, and c) encoder cannot disambiguate multiple instances of the same object -- exciting directions for future work!— Thomas Kipf (@thomaskipf) November 28, 2019
cool stuff! In Eq. 5, how do you solve the data association problem of matching the i’th object with the same object at time t+1? (This was an issue for COBRA so they ended up using pixel loss instead of loss on z). Is it not an issue since the object extractor is deterministic?— Patrick Emami (@PatrickOmid) November 29, 2019
Your Local GAN: Designing Two Dimensional Local Attention Mechanisms for Generative Models
Giannis Daras, Augustus Odena, Han Zhang, Alexandros G. Dimakis
We introduce a new local sparse attention layer that preserves two-dimensional geometry and locality. We show that by just replacing the dense attention layer of SAGAN with our construction, we obtain very significant FID, Inception score and pure visual improvements. FID score is improved from $18.65$ to $15.94$ on ImageNet, keeping all other parameters the same. The sparse attention patterns that we propose for our new layer are designed using a novel information theoretic criterion that uses information flow graphs. We also present a novel way to invert Generative Adversarial Networks with attention. Our method extracts from the attention layer of the discriminator a saliency map, which we use to construct a new loss function for the inversion. This allows us to visualize the newly introduced attention heads and show that they indeed capture interesting aspects of two-dimensional geometry of real images.
Your Local GAN: Designing Two Dimensional Local Attention Mechanisms for Generative Models— roadrunner01 (@roadrunning01) November 28, 2019
repo: https://t.co/wvY5tmt3zt pic.twitter.com/IXQq7Ea7Hd
Nice and easy Colab notebook that actually works right out the box, provided in the official repo! 👍https://t.co/LJU6HcApsF— Jonathan Fly 👾 (@jonathanfly) November 28, 2019
Your Local GAN: Designing Two Dimensional Local Attention Mechanisms for Generative Models— ML Review (@ml_review) November 28, 2019
By @giannis_daras @gstsdn @Han_Zhang_ @AlexGDimakis
New sparse attention layer improves SAGAN FID score on ImageNet from 18.65 to 15.94https://t.co/W48RlTJhi9https://t.co/GrIwiNPVhH pic.twitter.com/eTn0QvWTNX
Excited to announce our paper: Your Local GAN.— Giannis Daras (@giannis_daras) November 28, 2019
We obtain 14.53% FID ImageNet improvement on SAGAN by only changing the attention layer.
We introduce a new sparse attention layer with 2-D locality. Thread: 1/n https://t.co/3RfuWCLbFm
We show that the model sometimes attends to arbitrarily distant positions but at other times uses our grid locality bias to model homogeneous areas, such as backgrounds. 7/n pic.twitter.com/5pGSeDkULj— Giannis Daras (@giannis_daras) November 28, 2019
The surprising result of improving FID by attending to fewer positions with a 2-D locality bias encourages further research in this area. We open-source a trained model and our code. You can also quickly play with it in this Google Colab: https://t.co/80oH7mpMvk— Giannis Daras (@giannis_daras) November 28, 2019
Hopefully, you will be able to explore our new technique for inverting GANs with attention and also generate some cool images for the ImageNet category of your choice. To finish, a gif exploring the latent space of Maltese dogs. 9/9 pic.twitter.com/y6Ix0ZGXOL— Giannis Daras (@giannis_daras) November 28, 2019
New paper: Your Local GAN: a new layer of two-dimensional sparse attention and a new generative model. Also progress on inverting GANs which may be useful for inverse problems. https://t.co/x5Q39vMaOp— Alex Dimakis (@AlexGDimakis) November 28, 2019
with @giannis_daras from NTUA and @gstsdn @Han_Zhang_ from @googleai
November 26, 2019
MixNMatch: Multifactor Disentanglement and Encoding for Conditional Image Generation
Yuheng Li, Krishna Kumar Singh, Utkarsh Ojha, Yong Jae Lee
We present MixNMatch, a conditional generative model that learns to disentangle and encode background, object pose, shape, and texture from real images with minimal supervision, for mix-and-match image generation. We build upon FineGAN, an unconditional generative model, to learn the desired disentanglement and image generator, and leverage adversarial joint image-code distribution matching to learn the latent factor encoders. MixNMatch requires bounding boxes during training to model background, but requires no other supervision. Through extensive experiments, we demonstrate MixNMatch's ability to accurately disentangle, encode, and combine multiple factors for mix-and-match image generation, including sketch2color, cartoon2img, and img2gif applications. Our code/models/demo can be found at https://github.com/Yuheng-Li/MixNMatch
SuperGlue: Learning Feature Matching with Graph Neural Networks
Paul-Edouard Sarlin, Daniel DeTone, Tomasz Malisiewicz, Andrew Rabinovich
This paper introduces SuperGlue, a neural network that matches two sets of local features by jointly finding correspondences and rejecting non-matchable points. Assignments are estimated by solving a differentiable optimal transport problem, whose costs are predicted by a graph neural network. We introduce a flexible context aggregation mechanism based on attention, enabling SuperGlue to reason about the underlying 3D scene and feature assignments jointly. Compared to traditional, hand-designed heuristics, our technique learns priors over geometric transformations and regularities of the 3D world through end-to-end training from image pairs. SuperGlue outperforms other learned approaches and achieves state-of-the-art results on the task of pose estimation in challenging real-world indoor and outdoor environments. The proposed method performs matching in real-time on a modern GPU and can be readily integrated into modern SfM or SLAM systems.
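The differentiable assignment step can be sketched in a few lines: a matrix of pairwise matching scores between the two feature sets is normalized into a (soft) doubly stochastic matching by Sinkhorn iterations. This simplified version omits SuperGlue's "dustbin" row and column for unmatched points:

```python
import torch

# Sinkhorn normalization in log space: alternately normalize rows and
# columns so the result approaches a doubly stochastic matrix, i.e. a
# soft assignment between the two feature sets.

def sinkhorn(scores, n_iters=20):
    log_p = scores
    for _ in range(n_iters):
        log_p = log_p - torch.logsumexp(log_p, dim=1, keepdim=True)  # rows
        log_p = log_p - torch.logsumexp(log_p, dim=0, keepdim=True)  # cols
    return log_p.exp()

scores = torch.randn(5, 5)           # e.g. dot products of matched descriptors
P = sinkhorn(scores)
print(P.sum(0), P.sum(1))            # both approximately all-ones
```

In the full method, the scores themselves come from an attentional graph neural network over both images, so the matching layer's gradients train the feature aggregation end-to-end.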
Name conflict alert:— Krishna Murthy (@krrish94) December 8, 2019
SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems (Neurips 2019) https://t.co/gnLJplO8ek
SuperGlue: Learning Feature Matching with Graph Neural Networks https://t.co/sRrUrroi32@quantombone @ddetone
We know of the clash. The benchmark uses a slightly different name with all caps GLUE, and I don’t think there is any local feature matching method named SuperGlue that conflicts from the vision side. SuperGlue uses SuperPoint, so we wanted Super in the name too. @pesarlin— Tomasz Malisiewicz (@quantombone) December 8, 2019
SuperGlue: Learning Feature Matching with Graph Neural Networks. “A neural model that simultaneously performs context aggregation, feature matching, and filtering in a single unified architecture.”https://t.co/1Z7wqHAtfU #ComputerVision #Robotics pic.twitter.com/4U8cVrUQ9g— Tomasz Malisiewicz (@quantombone) November 27, 2019
Can it run in real-time?— Aakash Kumar Nain (@A_K_Nain) November 27, 2019
Single Headed Attention RNN: Stop Thinking With Your Head
The leading approaches in language modeling are all obsessed with TV shows of my youth - namely Transformers and Sesame Street. Transformers this, Transformers that, and over here a bonfire worth of GPU-TPU-neuromorphic wafer scale silicon. We opt for the lazy path of old and proven techniques with a fancy crypto inspired acronym: the Single Headed Attention RNN (SHA-RNN). The author's lone goal is to show that the entire field might have evolved a different direction if we had instead been obsessed with a slightly different acronym and slightly different result. We take a previously strong language model based only on boring LSTMs and get it to within a stone's throw of a stone's throw of state-of-the-art byte level language model results on enwik8. This work has undergone no intensive hyperparameter optimization and lived entirely on a commodity desktop machine that made the author's small studio apartment far too warm in the midst of a San Franciscan summer. The final results are achievable in plus or minus 24 hours on a single GPU as the author is impatient. The attention mechanism is also readily extended to large contexts with minimal computation. Take that Sesame Street.
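The titular ingredient, a single attention head over an RNN's outputs, is sketched below in PyTorch. This is the bare idea only, without the boom layers, layer normalization, and memory caching of Merity's actual SHA-RNN block:

```python
import torch

# One attention head bolted onto an LSTM: standard scaled dot-product
# attention with no head splitting at all. (A causal mask would be
# needed for actual language modelling.)

d = 64
lstm = torch.nn.LSTM(d, d, batch_first=True)
q_proj = torch.nn.Linear(d, d)

x = torch.randn(2, 10, d)                     # (batch, time, features)
h, _ = lstm(x)
q, k, v = q_proj(h), h, h                     # a single head, no splits
att = torch.softmax(q @ k.transpose(1, 2) / d ** 0.5, dim=-1)
out = att @ v                                 # (2, 10, 64)
```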
Introducing the SHA-RNN :)— Smerity (@Smerity) November 27, 2019
- Read alternative history as a research genre
- Learn of the terrifying tokenization attack that leaves language models perplexed
- Get near SotA results on enwik8 in hours on a lone GPU
No Sesame Street or Transformers allowed.https://t.co/oCArjFKVDK pic.twitter.com/RN5TPZ3xWH
Not boring. REJECT.— Yasser Souri (@yassersouri) November 28, 2019
I find the boom layer a bit hard to motivate. It has the same (theoretical) computation cost as a stack of N residual layers. The latter should outperform it, though.— Christian Szegedy (@ChrSzegedy) November 28, 2019
Loved it. "Wut? Wut indeed my astute reader"— Ben Lee (@benlee) November 27, 2019
The language in this one is very entertaining :) Were the diagrams made with TikZ?— Adam Erickson 🌎🛰️ (@admercs) November 27, 2019
The diagrams are upside down, though 😔— Alfredo Canziani (@alfcnz) November 27, 2019
Nice to see that you reference tweets in your paper.— Carlos E. Perez 🧢 (@IntuitMachine) November 27, 2019
Nice paper and congrats on the more savory one bedroom apartment move!— Roelof Pieters (@graphific) November 27, 2019
that’s hilarious:)— Dan Brickley (@danbri) November 27, 2019
Well this is an enjoyable read!— Jordan Burgess (@jordnb) November 27, 2019
So you didn't train it on 1024 TPUs to achieve a result 1% better than SOTA? The carbon footprint of training it was less than a transatlantic flight? Why would anyone care?— Tim (@tymwol) November 27, 2019
ha! love the abstract— Matthew Kenney (@baykenney) November 27, 2019
"All my best work seems to come from being relatively low resourced and creative anyway." 💯— Matt Henderson (@matthen2) November 27, 2019
is it robust to the random seed?— tsauri (@tsauri_eecs) November 27, 2019
last time I saw complaints that the random seed is a hyperparameter
I wish more papers were this fun.— Loren Lugosch (@lorenlugosch) November 27, 2019
haha, wonderful name: is the SHA in the name also a joke about the ability of transformers to memorize passages of text?— Federico Vaggi (@F_Vaggi) November 27, 2019
'The leading approaches in language modeling are all obsessed with TV shows of my youth - namely Transformers and Sesame Street. Transformers this, Transformers that, and over here a bonfire worth of GPU-TPU-neuromorphic wafer scale silicon.' STRONG abstract opener! 😂— Jonathan Fly 👾 (@jonathanfly) November 27, 2019
Would you mind sharing weights for wikitext-103? I would then write https://t.co/v9am7mqg7K code to place your model here: https://t.co/rhSyjLcBCB. You could upload them as a release on GitHub; it is then easy to fetch— Piotr Czapla (@PiotrCzapla) November 29, 2019
Warning: This is not a paper, but a masterful work of art.— hardmaru (@hardmaru) November 27, 2019
The tokenization “attack” doesn’t make sense to me. Teacher forcing is a property of *training*. So are you saying that finer-grained vocabularies result in less sparse signals during training? That seems as uncontroversial a thing to say as that high batchsizes reduce variance.— Sebastian J. Mielke (@sjmielke) November 27, 2019
Hi Smerity, this is great, both results wise & philosophically. Not having *to rent your augmented intelligence* is something very important but not widely appreciated.— deen-chan (@sir_deenicus) November 27, 2019
A running theme through your paper is also something I remarked on a year ago: https://t.co/AAkNnVPczs
Single Headed Attention RNN: Stop Thinking With Your Head— Thomas Lahore (@evolvingstuff) November 27, 2019
"The final results are achievable in plus or minus 24 hours on a single GPU as the author is impatient."
"Take that Sesame Street."
code: https://t.co/uWRPylK1tj pic.twitter.com/dT6v2DounV
I knew that was Smerity just from reading the caption.— 𝔊𝔴𝔢𝔯𝔫 (@gwern) November 27, 2019
The abstract ends with "...Take that, Sesame Street."— Avijit Thawani (@thawani_avijit) November 27, 2019
Can we all please start writing papers this way?
this paper from @Smerity on his new language model is a wonderful read... Technical innovations aside (and I see it's getting plenty of praise for those!) it's a fantastic intro to the Big Picture of language modeling https://t.co/xe46gKckw8 pic.twitter.com/elT3QD9q49— James Vincent (@jjvincent) November 28, 2019
My favorite part of abstract: “The final results are achievable in plus or minus 24 hours on a single GPU as the author is impatient. .... Take that Sesame Street”— Yasser Souri (@yassersouri) November 27, 2019
This paper https://t.co/FSPz17tj6K is a must read for those doing NLP work. It's refreshingly honest and written in an entertaining and readable style. I think very few can pull off this style! This is likely the best Deep Learning paper of 2019! @Smerity #ai #nlp #deeplearning— Carlos E. Perez 🧢 (@IntuitMachine) November 27, 2019
Was reading this paper and laughing out loud at the abstract (first time for an NLP paper!). I was like "I'd love to meet this person". Then I took a look at the author: turns out it's @Smerity. No surprise!— Samiur Rahman (@samiur1204) November 27, 2019
P.S. This is a great paper even without humor!https://t.co/pDaYu5atfh pic.twitter.com/PzxdXGWtK7
November 25, 2019
Oops! Predicting Unintentional Action in Video
Dave Epstein, Boyuan Chen, Carl Vondrick
From just a short glance at a video, we can often tell whether a person's action is intentional or not. Can we train a model to recognize this? We introduce a dataset of in-the-wild videos of unintentional action, as well as a suite of tasks for recognizing, localizing, and anticipating its onset. We train a supervised neural network as a baseline and analyze its performance compared to human consistency on the tasks. We also investigate self-supervised representations that leverage natural signals in our dataset, and show the effectiveness of an approach that uses the intrinsic speed of video to perform competitively with highly-supervised pretraining. However, a significant gap between machine and human performance remains. The project website is available at https://oops.cs.columbia.edu
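The self-supervised signal mentioned in the abstract is the intrinsic playback speed of video. A hedged sketch of the generic speed-prediction pretext task (the paper's exact formulation may differ; the encoder is assumed to be any clip model such as a 3D CNN):

import torch

def make_speed_example(frames, rates=(1, 2, 4), clip_len=16):
    # frames: (T, C, H, W) decoded video; assumes T > clip_len * max(rates).
    # Subsample at a random rate and return the clip plus its rate label;
    # an encoder is then trained to classify the rate with cross entropy.
    k = torch.randint(len(rates), (1,)).item()
    step = rates[k]
    start = torch.randint(frames.shape[0] - clip_len * step, (1,)).item()
    clip = frames[start : start + clip_len * step : step]
    return clip, k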
Oh here's a fun one: "Oops! Predicting Unintentional Action in Video"— Jonathan Fly 👾 (@jonathanfly) November 27, 2019
Great project site, check out the graphs showing ground-truth versus predictions for the videos:
site: https://t.co/ELHdGPWjxl pic.twitter.com/8MIn1mW5bE
Funny and interesting.— Manu Romero (@mrm8488) November 27, 2019
Remind me of an actually practical high-five predictor:https://t.co/jBKe14B2s9— Maxim Leyzerovich (@round) November 27, 2019
This is awesome.— Andy Baio (@waxpancake) November 27, 2019
Rigging the Lottery: Making All Tickets Winners
Utku Evci, Trevor Gale, Jacob Menick, Pablo Samuel Castro, Erich Elsen
Sparse neural networks have been shown to be more parameter and compute efficient compared to dense networks and in some cases are used to decrease wall clock inference times. There is a large body of work on training dense networks to yield sparse networks for inference. This limits the size of the largest trainable sparse model to that of the largest trainable dense model. In this paper we introduce a method to train sparse neural networks with a fixed parameter count and a fixed computational cost throughout training, without sacrificing accuracy relative to existing dense-to-sparse training methods. Our method updates the topology of the network during training by using parameter magnitudes and infrequent gradient calculations. We show that this approach requires fewer floating-point operations (FLOPs) to achieve a given level of accuracy compared to prior techniques. Importantly, by adjusting the topology it can start from any initialization - not just "lucky" ones. We demonstrate state-of-the-art sparse training results with ResNet-50, MobileNet v1 and MobileNet v2 on the ImageNet-2012 dataset, WideResNets on the CIFAR-10 dataset and RNNs on the WikiText-103 dataset. Finally, we provide some insights into why allowing the topology to change during the optimization can overcome local minima encountered when the topology remains static.
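The topology update is the heart of the method: periodically drop the lowest-magnitude active weights and regrow an equal number of inactive connections where the (infrequently computed) dense gradient is largest. A hedged sketch of one such update, with illustrative names:

import torch

def rigl_update(weight, mask, grad, fraction=0.1):
    # Sketch of a RigL-style step: swap `fraction` of the active
    # connections while keeping the total parameter count fixed.
    n_swap = int(fraction * mask.sum().item())
    active = mask.bool()

    # Drop: smallest |w| among active connections.
    mag = torch.where(active, weight.abs(), torch.full_like(weight, float('inf')))
    drop = torch.topk(mag.view(-1), n_swap, largest=False).indices

    # Grow: largest |grad| among inactive connections.
    score = torch.where(active, torch.full_like(grad, float('-inf')), grad.abs())
    grow = torch.topk(score.view(-1), n_swap).indices

    mask.view(-1)[drop] = 0.0
    mask.view(-1)[grow] = 1.0
    weight.data.view(-1)[grow] = 0.0  # regrown weights start at zero
    return mask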
My advice to people wanting to get into game programming has been to write small games completely from scratch while also working on commercial game mods and with unity or unreal. I’m following that myself for AI — I have some C++ backprop-from-scratch projects while also \— John Carmack (@ID_AA_Carmack) November 28, 2019
\ learning python / pytorch / jupyter and experimenting with pretrained models. I had to give myself a bit of a kick to not dwell too much in the lowest levels, but now I am enjoying the new world quite a bit. You can do a remarkable amount with very little code, but when I \— John Carmack (@ID_AA_Carmack) November 28, 2019
\ actually write a loop in python because I don’t know the correct way to do something with tensor ops I get reminded just how slow python is relative to C++.— John Carmack (@ID_AA_Carmack) November 28, 2019
I would recommend Cython for those loops...it's easier than linking up pure C or C++, and a much better development experience than trying to remember all the numpy trivia. Example casting numpy array to pointer, to drop into a loop: https://t.co/OvJbsYgDpB— Matthew Honnibal (@honnibal) November 29, 2019
Not only you. Have you tried Python C API🤣?— AI = f(ML) (@AIMLDeveloper) November 28, 2019
It turns out to be a good balance: reusable high-performance code and the ability to draw graphs using standard tools.
Have a look at Numpy. Lot of the tensor libraries are modeled (or directly built on) this. Once you grok how to do vectorized calculations with Numpy, the tensor stuff will mostly come naturally because it works most often the same. If you need to write a for loop ... 1/2— Jan Ciger (@janoc200) November 28, 2019
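The rewrite being suggested in this thread looks roughly like the following: replace a per-element Python loop with a single vectorized NumPy call, which dispatches to optimized native code.

import numpy as np

x = np.random.randn(1_000_000)

# Python-level loop: interpreted per element, hence slow.
total = 0.0
for v in x:
    total += v * v

# Vectorized: one call into compiled code, typically orders of
# magnitude faster for arrays of this size.
total_fast = float(np.dot(x, x))
assert np.isclose(total, total_fast)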
👍 When an AI is overtrained, it becomes less efficient as it is less able to assimilate unknown concepts.— Benoît Dumas (@dumas181) November 28, 2019
Same goes for people.
Experiment from scratch, explore your own universe or you will get stuck in well-known & industrialized ways.
10 GOTO 20— Robert Morrison (@RobertAnim8er) November 28, 2019
20 GOTO 10
Hey John, start your Youtube channel.— Adnan-عدنان (@kadnan) November 28, 2019
Can you start a Twitch channel? I’d quit my job to follow you on your adventure.— David Hurley (@davidhurley87) November 28, 2019
We also introduce a technique [https://t.co/SFz2vTThTv] for training neural networks that are sparse throughout training from a random initialization - no luck required, all initialization “tickets” are winners. pic.twitter.com/fA7VmXrj20— DeepMind (@DeepMindAI) November 26, 2019
I love the casting of NN initialization as a game of chance. It suggests that crossing one’s fingers during training is requisite best practice for state of the art results.— Brandon Rohrer (@_brohrer_) November 27, 2019
November 24, 2019
Causality for Machine Learning
Graphical causal inference as pioneered by Judea Pearl arose from research on artificial intelligence (AI), and for a long time had little connection to the field of machine learning. This article discusses where links have been and should be established, introducing key concepts along the way. It argues that the hard open problems of machine learning and AI are intrinsically related to causality, and explains how the field is beginning to understand them.
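To make the structural causal model (SCM) vocabulary concrete, here is a toy example (not from the article): each variable is a function of its parents plus noise, and an intervention do(X = x) replaces X's mechanism with a constant, which is different from merely conditioning on X = x.

import numpy as np

rng = np.random.default_rng(0)

def sample(n, do_x=None):
    # SCM: X := U_x;  Y := 2X + U_y
    u_x, u_y = rng.normal(size=n), rng.normal(size=n)
    x = u_x if do_x is None else np.full(n, do_x)  # do(X = x) overrides the mechanism
    y = 2.0 * x + u_y
    return x, y

x_obs, y_obs = sample(10_000)                # observational distribution
x_int, y_int = sample(10_000, do_x=1.0)      # interventional: do(X = 1)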
A big problem with #AI is that it hasn't read and couldn't understand @yudapearl's Book of Why. This easy to understand essay by @bschoelkopf (and inspired by Pearl) takes us through the gaps in ML thinking and reasoning, cause and effect https://t.co/WbeOWmy5HT @MPI_IS pic.twitter.com/hSQg3PsNKN— Eric Topol (@EricTopol) November 26, 2019
Also this book is great (and a free PDF is available): https://t.co/fCeFjA5su8— Mehrdad Yazdani (@crude2refined) November 26, 2019
#AI/ML can’t «understand» anything. And there’s also limits to prediction in #complexadaptivesystems (like biological systems/humans) as they are partially stochastic. https://t.co/Lc6WEK1TGQ #Causality #science pic.twitter.com/irjejn5Iwh— Julia B. (@JuliaB_fitness) November 26, 2019
I love @yudapearl 's book so much! Profound, heterodox.— D.A. Wallach (@dawallach) November 27, 2019
"Machine learning often disregards information that animals use heavily: interventions in the world, domain shifts, temporal structure"— Carlos E. Perez 🧢 (@IntuitMachine) November 28, 2019
"The transformation, however, really started already in the mid 20th century under the name of cybernetics. It replaced energy by information."— Carlos E. Perez 🧢 (@IntuitMachine) November 28, 2019
"In recent years, genuine connections between machine learning and causality have emerged, and— Carlos E. Perez 🧢 (@IntuitMachine) November 28, 2019
we will argue that these connections are crucial if we want to make progress on the major open problems of AI."
"Machines often perform poorly, however, when faced with problems that violate the IID assumption yet seem trivial to humans."— Carlos E. Perez 🧢 (@IntuitMachine) November 28, 2019
"Structural causal model (SCM) view is intuitive for those machine learning researchers who are more accustomed to thinking in terms of estimating functions rather than probability distributions."— Carlos E. Perez 🧢 (@IntuitMachine) November 28, 2019
"Causal discovery and learning tries to arrive at such— Carlos E. Perez 🧢 (@IntuitMachine) November 28, 2019
models in a data-driven way, using only weak assumptions."
"Causal— Carlos E. Perez 🧢 (@IntuitMachine) November 28, 2019
models can be seen as descriptions that lie in between, abstracting away from physical realism while retaining the power to answer certain interventional or counterfactual questions"
"Whenever we perceive an object, our brain makes the assumption that the object and the mechanism by which the information contained in its light reaches our brain are independent."— Carlos E. Perez 🧢 (@IntuitMachine) November 28, 2019
"This is an invariance implied by the above independence, allowing us to infer 3D information even without stereo vision (“structure from motion”)."— Carlos E. Perez 🧢 (@IntuitMachine) November 28, 2019
"For a model to correctly predict the effect of interventions, it— Carlos E. Perez 🧢 (@IntuitMachine) November 28, 2019
needs to be robust with respect to generalizing from an observational distribution to certain interventional distributions."
"Algorithmic information theory provides a natural framework for non-statistical graphical models."— Carlos E. Perez 🧢 (@IntuitMachine) November 28, 2019
"What is elegant about this approach is that it shows that causality is not intrinsically bound to— Carlos E. Perez 🧢 (@IntuitMachine) November 28, 2019
statistics, and that independence of noises and the independence of mechanisms now coincide since the independent programs play the role of the unexplained noise terms."
"We thus predicted that Semi Supervised Learning should be impossible for causal learning problems, but feasible otherwise, in particular for anticausal ones."— Carlos E. Perez 🧢 (@IntuitMachine) November 28, 2019
"One can hypothesize that causal direction should also have an influence on whether classifiers are vulnerable to adversarial attacks."— Carlos E. Perez 🧢 (@IntuitMachine) November 28, 2019
"In such an architecture (GAN), the encoder is an anticausal mapping that recognizes or reconstructs causal drivers in the world."— Carlos E. Perez 🧢 (@IntuitMachine) November 28, 2019
"The decoder establishes the connection between the low dimensional latent representation (of the noises driving the causal model) and the high dimensional world; this part constitutes a causal generative image model."— Carlos E. Perez 🧢 (@IntuitMachine) November 28, 2019
"I expect that going forward, causality will play a major role..., moving beyond the representation of statistical dependence structures towards models that support intervention, planning, and reasoning, realizing Konrad Lorenz’ notion of thinking as acting in an imagined space."— Carlos E. Perez 🧢 (@IntuitMachine) November 28, 2019
Causality for Machine Learning— ML Review (@ml_review) November 26, 2019
Mostly non-technical intro to key causal models and how they can contribute to resolving open ML problems like generalization across domains or "thinking" (i.e., acting in an imagined space)https://t.co/ULwPx1qFWP pic.twitter.com/gEyuWTKLbi
November 22, 2019
Instance Cross Entropy for Deep Metric Learning
Xinshao Wang, Elyor Kodirov, Yang Hua, Neil Robertson
Loss functions play a crucial role in deep metric learning; thus a variety of them have been proposed. Some supervise the learning process by pairwise or tripletwise similarity constraints while others take advantage of structured similarity information among multiple data points. In this work, we approach deep metric learning from a novel perspective. We propose instance cross entropy (ICE), which measures the difference between an estimated instance-level matching distribution and its ground-truth one. ICE has three main appealing properties. Firstly, similar to categorical cross entropy (CCE), ICE has a clear probabilistic interpretation and exploits structured semantic similarity information for learning supervision. Secondly, ICE is scalable to infinite training data as it learns on mini-batches iteratively and is independent of the training set size. Thirdly, motivated by our relative weight analysis, seamless sample reweighting is incorporated. It rescales samples' gradients to control the degree of differentiation over training examples instead of truncating them by sample mining. In addition to its simplicity and intuitiveness, extensive experiments on three real-world benchmarks demonstrate the superiority of ICE.
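A hedged reading of the loss in code: for each anchor, treat the softmax over its similarities to the other instances in the mini-batch as a matching distribution, and apply cross entropy against the anchor's true matches (the paper's relative-weighting scheme is omitted; names are illustrative):

import torch
import torch.nn.functional as F

def instance_cross_entropy(emb, labels, temperature=0.1):
    # emb: (B, d) embeddings; labels: (B,) class/instance ids.
    emb = F.normalize(emb, dim=1)
    sim = emb @ emb.t() / temperature
    sim.fill_diagonal_(float('-inf'))            # exclude self-matches
    log_p = F.log_softmax(sim, dim=1)            # matching distribution
    match = (labels[:, None] == labels[None, :]).float()
    match.fill_diagonal_(0.0)                    # ground-truth matches
    loss = -(log_p * match).sum(1) / match.sum(1).clamp(min=1)
    return loss.mean()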
"Instance Cross Entropy for Deep Metric Learning" --"instance cross entropy (ICE) [...] measures the difference between an estimated instance-level matching distribution and its ground-truth one" -- One of these things worth trying out next time! https://t.co/85LIOZ2Hp9 pic.twitter.com/YYVJKa5sPI— Sebastian Raschka (@rasbt) November 26, 2019
November 21, 2019
Fast Sparse ConvNets
Erich Elsen, Marat Dukhan, Trevor Gale, Karen Simonyan
Historically, the pursuit of efficient inference has been one of the driving forces behind research into new deep learning architectures and building blocks. Some recent examples include: the squeeze-and-excitation module, depthwise separable convolutions in Xception, and the inverted bottleneck in MobileNet v2. Notably, in all of these cases, the resulting building blocks enabled not only higher efficiency, but also higher accuracy, and found wide adoption in the field. In this work, we further expand the arsenal of efficient building blocks for neural network architectures; but instead of combining standard primitives (such as convolution), we advocate for the replacement of these dense primitives with their sparse counterparts. While the idea of using sparsity to decrease the parameter count is not new, the conventional wisdom is that this reduction in theoretical FLOPs does not translate into real-world efficiency gains. We aim to correct this misconception by introducing a family of efficient sparse kernels for ARM and WebAssembly, which we open-source for the benefit of the community as part of the XNNPACK library. Equipped with our efficient implementation of sparse primitives, we show that sparse versions of MobileNet v1, MobileNet v2 and EfficientNet architectures substantially outperform strong dense baselines on the efficiency-accuracy curve. On Snapdragon 835 our sparse networks outperform their dense equivalents by $1.3-2.4\times$ -- equivalent to approximately one entire generation of MobileNet-family improvement. We hope that our findings will facilitate wider adoption of sparsity as a tool for creating efficient and accurate deep learning architectures.
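The observation that makes the kernels possible: a 1x1 convolution is exactly a matrix product between the (C_out x C_in) weight matrix and the (C_in x H*W) activations, so sparsifying the weights turns it into sparse-times-dense matrix multiplication. A sketch of the equivalence with a generic CSR multiply (the paper's contribution is hand-tuned ARM/WebAssembly kernels in XNNPACK, not this reference code):

import numpy as np
from scipy.sparse import csr_matrix

c_in, c_out, h, w = 64, 128, 28, 28
W = np.random.randn(c_out, c_in) * (np.random.rand(c_out, c_in) > 0.9)  # ~90% sparse
X = np.random.randn(c_in, h * w)   # for a 1x1 conv, im2col is just a reshape

assert np.allclose(W @ X, csr_matrix(W) @ X)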
“Fast Sparse ConvNets”, a collaboration w/ @GoogleAI [https://t.co/TPD6mI9MA6], implements fast Sparse Matrix-Matrix Multiplication to replace dense 1x1 convolutions in MobileNet architectures. The sparse networks are 66% the size and 1.5-2x faster than their dense equivalents. pic.twitter.com/poDKMzfA4u— DeepMind (@DeepMindAI) November 26, 2019
Sparsification, or pruning of weights, in convolutional neural networks has a long history as a compression technique, and good support in deep learning frameworks, e.g. Model Optimization Toolkit in TensorFlow. [1/4]— Marat Dukhan (@MaratDukhan) November 28, 2019
Computations in sparsified models involve many multiplications by zeroes, which can be skipped in theory, but common wisdom suggested that it is impractical in software inference implementations. [2/4]— Marat Dukhan (@MaratDukhan) November 28, 2019
Our recent work [https://t.co/bbO55kRpPU] with colleagues from @DeepMind and @GoogleAI demonstrates that with the right layout and optimizations, sparse inference delivers practical and non-negligible speedups of 1.3X-2.4X on a range of MobileNet and EfficientNet models. [3/4]— Marat Dukhan (@MaratDukhan) November 28, 2019
Adversarial Examples Improve Image Recognition
Cihang Xie, Mingxing Tan, Boqing Gong, Jiang Wang, Alan Yuille, Quoc V. Le
Adversarial examples are commonly viewed as a threat to ConvNets. Here we present an opposite perspective: adversarial examples can be used to improve image recognition models if harnessed in the right manner. We propose AdvProp, an enhanced adversarial training scheme which treats adversarial examples as additional examples, to prevent overfitting. Key to our method is the usage of a separate auxiliary batch norm for adversarial examples, as they have different underlying distributions to normal examples. We show that AdvProp improves a wide range of models on various image recognition tasks and performs better when the models are bigger. For instance, by applying AdvProp to the latest EfficientNet-B7 on ImageNet, we achieve significant improvements on ImageNet (+0.7%), ImageNet-C (+6.5%), ImageNet-A (+7.0%), Stylized-ImageNet (+4.8%). With an enhanced EfficientNet-B8, our method achieves the state-of-the-art 85.5% ImageNet top-1 accuracy without extra data. This result even surpasses the best previous model, which is trained with 3.5B Instagram images (~3000X more than ImageNet) and ~9.4X more parameters. Models are available at https://github.com/tensorflow/tpu/tree/master/models/official/efficientnet.
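The auxiliary batch norm is simple to express: route clean and adversarial mini-batches through separate normalization statistics, since the two input distributions differ. A minimal sketch (illustrative names, not the released code):

import torch.nn as nn

class DualBatchNorm2d(nn.Module):
    # Clean and adversarial batches keep separate running statistics;
    # at test time only the clean branch is used.
    def __init__(self, channels):
        super().__init__()
        self.bn_clean = nn.BatchNorm2d(channels)
        self.bn_adv = nn.BatchNorm2d(channels)

    def forward(self, x, adversarial=False):
        return self.bn_adv(x) if adversarial else self.bn_clean(x)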
Hard to keep up with @tanmingxing @quocleix and team! I just finished making it through EfficientDet and Noisy Student and now have AdvProp, w/ a quietly released update of all EfficientNet weights, incl a new B8 model spec. https://t.co/EnYENU710e— Ross Wightman (@wightmanr) November 23, 2019
The TLDR of the paper; use adversarial examples as training data augmentation, maintain separate BatchNorm for normal vs adversarial examples. Neat. As usual I've ported & tested #PyTorch weights https://t.co/NMvRUrBYFp— Ross Wightman (@wightmanr) November 23, 2019
I don’t think we’ll see AGI in our lifetimes. This article echoes that sentiment, by someone who actually knows what they’re talking about :) https://t.co/vm6QrXwpsv— Vishal Kapur (@figelwump) December 4, 2019
I respectfully disagree for a few reasons.— Giuliano Giacaglia (@giacaglia) December 5, 2019
First, the number of neurons that we can simulate is around 1M, which is the equivalent of what a honeybee has in its brain. It’s no coincidence that we have self-flying drones, like Skydio
Skydio is a good example. Do you think its kick-ass navigation system deserves to be dubbed intelligence? Would you say it's 5% intelligent or 0%? I would argue 0%— Faraz Khan (@faraz_r_khan) December 5, 2019
1M compared to 200B is 0.05% intelligent compared to humans. So I would say 0.05% intelligent :-)— Giuliano Giacaglia (@giacaglia) December 5, 2019
Gotcha 😄 that's where we differ I guess. Some say being able to navigate anything is intelligence. I say intelligence is to come up with new untrained ways to navigate something totally new. Skydio would suck in a dark cave or open desert, for example.— Faraz Khan (@faraz_r_khan) December 5, 2019
Yeah, right now our neural networks are only based on seen data (and a lot of data), and this is where hardware seems to be a limiting factor for development of intelligence. A lot of our intelligence is based on Predictive Coding - our running simulations on predicting future— Giuliano Giacaglia (@giacaglia) December 5, 2019
Yann LeCun is working on this area, where you want to create simulations of what the future looks like. He is starting with frames of videos.— Giuliano Giacaglia (@giacaglia) December 5, 2019
Is it a great tool? Yes, of course! I think the problem is that it's called a "neural net" while what it basically does is some gradient-based math to minimize deviation from all training data. Maybe a different name would help 😁— Faraz Khan (@faraz_r_khan) December 4, 2019
Just go work for a company that uses any machine learning and you'll find out how restricted in scope it is. There's just no path (in code or math) which points towards any "intelligence". It's model fitting at best. I think the intelligence part is thrown in just for the buzz— Faraz Khan (@faraz_r_khan) December 4, 2019
NN training with adversarial examples doesn't improve the generalization ability because of the distribution gap between clean and adversarial ones. Only using different BNs for clean and adversarial can solve this problem, achieving new SOTA ImageNet acc https://t.co/FsQ7oR9uHu— Daisuke Okanohara (@hillbig) November 28, 2019