Accepted Submissions

📜 Reachability Embeddings: Scalable Self-Supervised Representation Learning from Mobility Trajectories for Multimodal Computer Vision

Swetava Ganguli (Apple)*; C. V. Krishnakumar Iyer (Apple); Vipul Pandey (Apple)
Abstract: Self-supervised representation learning techniques utilize large datasets without semantic annotations to learn meaningful, universal features that can be conveniently transferred to solve a wide variety of downstream supervised tasks. In this work, we propose a self-supervised method for learning representations of geographic locations from unlabeled GPS trajectories to solve downstream geospatial computer vision tasks. Tiles resulting from a raster representation of the earth's surface are modeled as nodes on a graph or pixels of an image. GPS trajectories are modeled as allowed Markovian paths on these nodes. A scalable and distributed algorithm is presented to compute image-like representations, called reachability summaries, of the spatial connectivity patterns between tiles and their neighbors implied by the observed Markovian paths. A convolutional, contractive autoencoder is trained to learn compressed representations, called reachability embeddings, of reachability summaries for every tile. Reachability embeddings serve as task-agnostic feature representations of geographic locations. Using reachability embeddings as pixel representations for five different downstream geospatial tasks, cast as supervised semantic segmentation problems, we quantitatively demonstrate that reachability embeddings are semantically meaningful representations and result in a 4-23% gain in performance, as measured by the area under the precision-recall curve (AUPRC), when compared to baseline models whose pixel representations do not account for the spatial connectivity between tiles. Reachability embeddings transform sequential, spatiotemporal mobility data into semantically meaningful tensor representations that can be combined with other sources of imagery and are designed to facilitate multimodal learning in geospatial computer vision.

📜 DAGMA: Learning DAGs via M-matrices and a Log-Determinant Acyclicity Characterization

Kevin Bello (UChicago/CMU)*
Abstract: The combinatorial problem of learning directed acyclic graphs (DAGs) from data was recently framed as a purely continuous optimization problem by leveraging a differentiable acyclicity characterization of DAGs based on the trace of a matrix exponential function. Existing acyclicity characterizations are based on the idea that powers of an adjacency matrix contain information about walks and cycles. In this work, we propose a fundamentally different acyclicity characterization based on the log-determinant (log-det) function, which leverages the nilpotency property of DAGs. To deal with the inherent asymmetries of a DAG, we relate the domain of our log-det characterization to the set of M-matrices, which is a key difference from the classical log-det function defined over the cone of positive definite matrices. Similar to previously proposed acyclicity functions, our characterization is also exact and differentiable. However, when compared to existing characterizations, our log-det function: (1) is better at detecting large cycles; (2) has better-behaved gradients; and (3) is in practice about an order of magnitude faster to compute. From the optimization side, we drop the typically used augmented Lagrangian scheme and propose DAGMA (Directed Acyclic Graphs via M-matrices for Acyclicity), a method that resembles the central path approach for barrier methods. Each point in the central path of DAGMA is a solution to an unconstrained problem regularized by our log-det function, and we show that at the limit of the central path, the solution is guaranteed to be a DAG. Finally, we provide extensive experiments for linear and nonlinear SEMs, and show that our approach can reach large speed-ups and smaller structural Hamming distance against state-of-the-art methods.
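
To make the log-det idea concrete, here is a minimal sketch in plain Python. It assumes the acyclicity function takes the form h(W) = -log det(sI - W∘W) + d log s, where ∘ is the elementwise product and d is the number of nodes; that formula is recalled from the DAGMA paper and is not spelled out in the abstract above. The helper names `log_det` and `logdet_acyclicity` are hypothetical, and the sketch only makes sense while sI - W∘W stays in the M-matrix domain (positive determinant).

```python
import math


def log_det(matrix):
    """Return log(det(matrix)) using Gaussian elimination with partial pivoting.

    Raises ValueError when the determinant is zero or negative.
    """
    n = len(matrix)
    a = [row[:] for row in matrix]  # work on a copy
    sign = 1.0
    log_abs = 0.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        if abs(a[pivot][col]) < 1e-15:
            raise ValueError("matrix is singular")
        if pivot != col:
            a[col], a[pivot] = a[pivot], a[col]
            sign = -sign  # a row swap flips the determinant's sign
        if a[col][col] < 0:
            sign = -sign
        log_abs += math.log(abs(a[col][col]))
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            for c in range(col, n):
                a[r][c] -= factor * a[col][c]
    if sign < 0:
        raise ValueError("determinant is not positive")
    return log_abs


def logdet_acyclicity(w, s=1.0):
    """Assumed form: h(W) = -log det(sI - W∘W) + d log s."""
    d = len(w)
    m = [[(s if i == j else 0.0) - w[i][j] ** 2 for j in range(d)]
         for i in range(d)]
    return -log_det(m) + d * math.log(s)


# Strictly upper-triangular weights describe a DAG (edges 0->1, 0->2, 1->2).
dag = [[0.0, 0.8, 0.5],
       [0.0, 0.0, 0.7],
       [0.0, 0.0, 0.0]]

# Two nodes pointing at each other form a cycle.
cyclic = [[0.0, 0.5],
          [0.5, 0.0]]

print(logdet_acyclicity(dag))     # zero up to floating-point error
print(logdet_acyclicity(cyclic))  # strictly positive
```

For the DAG, sI - W∘W is triangular with s on the diagonal, so the two terms cancel; any cycle lowers the determinant and makes the value positive.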

🎤 Radically Lower Data-Labeling Costs for Document Extraction Models with Selective Labeling

Yichao Zhou (Google)*; James B Wendt (Google); Navneet Potti (Google); Jing Xie (Google); Sandeep Tata (Google, USA)
Abstract: In this paper, we propose selective labeling to radically reduce the cost of acquiring the several thousand high-quality labeled documents that are needed to train a document extraction model with acceptable accuracy. The key is to simplify the labeling task to provide "yes/no" labels for candidate extractions predicted by a model trained on partially-labeled documents. We combine this with a custom active learning strategy to find the predictions that the model is most uncertain about. We show through extensive experiments on three document types that selective labeling can reduce the cost of acquiring labeled data by 10× while incurring negligible loss in accuracy.
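
The selection step can be illustrated with a small sketch. The abstract does not describe its custom active learning strategy, so the code below substitutes plain uncertainty sampling (pick the candidates whose confidence is closest to 0.5) as a stand-in. The function name `select_for_review`, the field names, and the confidence scores are all hypothetical.

```python
def select_for_review(candidates, budget):
    """Return the ids of the `budget` candidates the model is least sure about.

    `candidates` is a list of (candidate_id, confidence) pairs, where
    confidence is the model's probability that the extraction is correct.
    A confidence near 0.5 counts as most uncertain.
    """
    ranked = sorted(candidates, key=lambda c: abs(c[1] - 0.5))
    return [candidate_id for candidate_id, _ in ranked[:budget]]


# Hypothetical candidate extractions from one partially labeled document.
candidates = [
    ("invoice_date", 0.97),
    ("total_amount", 0.52),
    ("vendor_name", 0.10),
    ("due_date", 0.61),
]

# Only these candidates would be shown to an annotator for a yes/no answer.
picked = select_for_review(candidates, budget=2)
print(picked)  # ['total_amount', 'due_date']
```

Confident predictions, whether confidently right or confidently wrong, are skipped, which is where the labeling savings would come from under this simplified strategy.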

📜 Randomized Exploration for Reinforcement Learning with General Value Function Approximation

Haque Ishfaq (Mila, McGill University)*; Qiwen Cui (University of Washington); Viet Nguyen (Mila, McGill University); Alex Ayoub (University of Alberta); Zhuoran Yang (Princeton); Zhaoran Wang (Northwestern U); Doina Precup (McGill University); Lin Yang (UCLA)
Abstract: We propose a model-free reinforcement learning algorithm inspired by the popular randomized least squares value iteration (RLSVI) algorithm as well as the optimism principle. Unlike existing upper-confidence-bound (UCB) based approaches, which are often computationally intractable, our algorithm drives exploration by simply perturbing the training data with judiciously chosen i.i.d. scalar noises. To attain optimistic value function estimation without resorting to a UCB-style bonus, we introduce an optimistic reward sampling procedure. When the value functions can be represented by a function class $\mathcal{F}$, our algorithm achieves a worst-case regret bound of $\tilde{O}(\mathrm{poly}(d_E H)\sqrt{T})$, where $T$ is the time elapsed, $H$ is the planning horizon and $d_E$ is the eluder dimension of $\mathcal{F}$. In the linear setting, our algorithm reduces to LSVI-PHE, a variant of RLSVI, that enjoys an $\tilde{O}(\sqrt{d^3H^3T})$ regret. We complement the theory with an empirical evaluation across known difficult exploration tasks.

📜 Multi-Frame Self-Supervised Depth with Transformers

Vitor Guizilini (Toyota Research Institute)*; Rareș A Ambruș (Toyota Research Institute); Dian Chen (Toyota Research Institute); Sergey Zakharov (Toyota Research Institute); Adrien Gaidon (Toyota Research Institute)
Abstract: Multi-frame depth estimation improves over single-frame approaches by also leveraging geometric relationships between images via feature matching, in addition to learning appearance-based features. In this paper we revisit feature matching for self-supervised monocular depth estimation, and propose a novel transformer architecture for cost volume generation. We use depth-discretized epipolar sampling to select matching candidates, and refine predictions through a series of self- and cross-attention layers. These layers sharpen the matching probability between pixel features, improving over standard similarity metrics prone to ambiguities and local minima. The refined cost volume is decoded into depth estimates, and the whole pipeline is trained end-to-end from videos using only a photometric objective. Experiments on the KITTI and DDAD datasets show that our DepthFormer architecture establishes a new state of the art in self-supervised monocular depth estimation, and is even competitive with highly specialized supervised single-frame architectures. We also show that our learned cross-attention network yields representations transferable across datasets, increasing the effectiveness of pre-training strategies.

📜 Learning Optical Flow, Depth, and Scene Flow without Real-World Labels

Vitor Guizilini (Toyota Research Institute)*; Kuan-Hui Lee (Toyota Research Institute); Rareș A Ambruș (Toyota Research Institute); Adrien Gaidon (Toyota Research Institute)
Abstract: Self-supervised monocular depth estimation enables robots to learn 3D perception from raw video streams. This scalable approach leverages projective geometry and ego-motion to learn via view synthesis, assuming the world is mostly static. Dynamic scenes, which are common in autonomous driving and human-robot interaction, violate this assumption. Therefore, they require modeling dynamic objects explicitly, for instance via estimating pixel-wise 3D motion, i.e. scene flow. However, the simultaneous self-supervised learning of depth and scene flow is ill-posed, as there are infinitely many combinations that result in the same 3D point. In this paper we propose DRAFT, a new method capable of jointly learning depth, optical flow, and scene flow by combining synthetic data with geometric self-supervision. Building upon the RAFT architecture, we learn optical flow as an intermediate task to bootstrap depth and scene flow learning via triangulation. Our algorithm also leverages temporal and geometric consistency losses across tasks to improve multi-task learning. Our DRAFT architecture simultaneously establishes a new state of the art in all three tasks in the self-supervised monocular setting on the standard KITTI benchmark.

🎤 Self-Supervised Camera Self-Calibration from Video

Vitor Guizilini (Toyota Research Institute)*; Jiading Fang (Toyota Technological Institute at Chicago); Igor Vasiljevic (Toyota Research Institute); Rareș A Ambruș (Toyota Research Institute); Greg Shakhnarovich (Toyota Technological Institute at Chicago); Adrien Gaidon (Toyota Research Institute); Matthew Walter (Toyota Technological Institute at Chicago)
Abstract: Camera calibration is integral to robotics and computer vision algorithms that seek to infer geometric properties of the scene from visual input streams. In practice, calibration is a laborious procedure requiring specialized data collection and careful tuning. This process must be repeated whenever the parameters of the camera change, which can be a frequent occurrence for mobile robots and autonomous vehicles. In contrast, self-supervised depth and ego-motion estimation approaches can bypass explicit calibration by inferring per-frame projection models that optimize a view-synthesis objective. In this paper, we extend this approach to explicitly calibrate a wide range of cameras from raw videos in the wild. We propose a learning algorithm to regress per-sequence calibration parameters using an efficient family of general camera models. Our procedure achieves self-calibration results with sub-pixel reprojection error, outperforming other learning-based methods. We validate our approach on a wide variety of camera geometries, including perspective, fisheye, and catadioptric. Finally, we show that our approach leads to improvements in the downstream task of depth estimation, achieving state-of-the-art results on the EuRoC dataset with greater computational efficiency than contemporary methods.

🎤 Beyond Separability: Analyzing the Linear Transferability of Contrastive Representations to Related Subpopulations

Jeff Z. HaoChen (Stanford University)*; Colin Wei (Stanford University); Ananya Kumar (Stanford University); Tengyu Ma (Stanford)
Abstract: Contrastive learning is a highly effective method for learning representations from unlabeled data. Recent works show that contrastive representations can transfer across domains, leading to simple state-of-the-art algorithms for unsupervised domain adaptation. In particular, a linear classifier trained to separate the representations on the source domain can also predict classes on the target domain accurately, even though the representations of the two domains are far from each other. We refer to this phenomenon as linear transferability. This paper analyzes when and why contrastive representations exhibit linear transferability in a general unsupervised domain adaptation setting. We prove that linear transferability can occur when data from the same class in different domains (e.g., photo dogs and cartoon dogs) are more related with each other than data from different classes in different domains (e.g., photo dogs and cartoon cats) are. Our analyses are in a realistic regime where the source and target domains can have unbounded density ratios and be weakly related, and they have distant representations across domains.

📜 Few-shot Continual Learning using HyperTransformers

Maksym Vladymyrov (Google)*; Andrey Zhmoginov (Google); Mark Sandler (Google)
Abstract: In this paper we propose a novel continual few-shot learning method that makes it possible to learn without forgetting from multiple few-shot tasks arriving sequentially. Our approach is based on the recently proposed HyperTransformer (HT), which is able to generate CNN weights directly from the support set of a given episode. We propose to use these generated weights as an input to HT for the next episode of the continual-learning sequence. The weights themselves are used by HT as an embedding of the previously learned tasks. In experiments, we show that the HT is capable of retaining knowledge about previous tasks it has seen without catastrophic forgetting.

📜 iROAD: Learning an Implicit Recursive Octree Auto-Decoder to Efficiently Encode 3D Shapes

Sergey Zakharov (Toyota Research Institute)*; Rareș A Ambruș (Toyota Research Institute); Katherine Liu (Toyota Research Institute); Adrien Gaidon (Toyota Research Institute)
Abstract: Compact and accurate representations of 3D shapes are central to many perception and robotics tasks. State-of-the-art learning-based methods can reconstruct single objects but scale poorly to large datasets. We present a novel recursive implicit representation to efficiently and accurately encode large datasets of complex 3D shapes by recursively traversing an implicit octree in latent space. Our implicit Recursive Octree Auto-Decoder (iROAD) learns a hierarchically structured latent space enabling state-of-the-art reconstruction results at a compression ratio above 99%. We also propose an efficient curriculum learning scheme that naturally exploits the coarse-to-fine properties of the underlying octree spatial representation. We explore the scaling law relating latent space dimension, dataset size, and reconstruction accuracy, showing that increasing the latent space dimension is enough to scale to large shape datasets. Finally, we show that our learned latent space encodes a coarse-to-fine hierarchical structure yielding reusable latents across different levels of detail, and we provide qualitative evidence of generalization to novel shapes outside the training set.

📜 Photo-realistic Neural Domain Randomization

Sergey Zakharov (Toyota Research Institute)*; Rareș A Ambruș (Toyota Research Institute); Vitor Guizilini (Toyota Research Institute); Wadim Kehl (Woven Planet); Adrien Gaidon (Toyota Research Institute)
Abstract: Synthetic data is a scalable alternative to manual supervision, but it requires overcoming the sim-to-real domain gap. This discrepancy between virtual and real worlds is typically addressed by two seemingly opposed approaches: improving the realism of simulation or foregoing realism entirely via domain randomization. In this paper, we show that the recent progress in neural rendering enables a new unified approach we call photo-realistic neural domain randomization (PNDR). We propose to learn a composition of neural networks acting as a physics-based ray tracer that generates high-quality renderings from scene geometry alone. Our pipeline is modular, composed of different neural networks for materials, lighting, and rendering, thus enabling randomization of different key image generation components. We apply our approach to the task of 6D object detection, and show that we generalize well to novel scenes and also significantly outperform the state of the art in terms of real-world transfer.

📜 Contextual Mondegreen: Voice Query Transcriptions based on Contextual Signals

Otto Stegmaier (Google Research)*; Yan Zhu (Google Inc.); Santiago Ontanon (Google LLC); Sukhdeep Sodhi (Google); Vikram Aggarwal (Google); Ambarish Jash (Google); Ayooluwakunmi Jeje (Google); Allen Wu (Google); Senqiang Zhou (Google)
Abstract: As online search queries increasingly come from voice, automatic speech recognition becomes a key component to deliver relevant search results. Errors introduced during speech recognition lead to irrelevant search results, and hence user dissatisfaction. This paper builds on our previous work on the Mondegreen system, a statistical approach to correcting voice query transcriptions without using audio signals. Specifically, we present "Contextual Mondegreen", an approach to correcting voice queries in the text space using contextual signals about the query and the user without relying on audio signals. Thanks to its contextual signals, Contextual Mondegreen can learn that certain corrections might be desirable for the majority of users but not for a subset of them, and vice-versa, resulting in increased query correction precision.

📜 Image Search with Text Feedback by Additive Attention Compositional Learning

Yuxin Tian (University of California, Merced)*; Shawn Newsam (UC Merced); Kofi A Boakye (Pinterest)
Abstract: Effective image retrieval with text feedback stands to impact a range of real-world applications, such as e-commerce. Given a source image and text feedback that describes the desired modifications to that image, the goal is to retrieve the target images that resemble the source yet satisfy the given modifications by composing a multi-modal (image-text) query. We propose a novel solution to this problem, Additive Attention Compositional Learning (AACL), that uses a multi-modal transformer-based architecture and effectively models the image-text contexts. Specifically, we propose a novel image-text composition module based on additive attention that can be seamlessly plugged into deep neural networks. We also introduce a new challenging benchmark derived from the Shopping100k dataset. AACL is evaluated on three large-scale datasets (FashionIQ, Fashion200k, and Shopping100k), each with strong baselines. Extensive experiments show that AACL achieves new state-of-the-art results on all three datasets.

📜 ShAPO: Implicit Representations for Multi-Object Shape, Appearance, and Pose Optimization

Muhammad Zubair Irshad (Georgia Institute of Technology)*; Sergey Zakharov (Toyota Research Institute); Rareș A Ambruș (Toyota Research Institute); Thomas Kollar (Toyota Research Institute); Zsolt Kira (Georgia Institute of Technology); Adrien Gaidon (Toyota Research Institute)
Abstract: Our method studies the complex task of holistic object-centric 3D understanding from a single RGB-D observation. As it is an ill-posed problem, existing methods suffer from low performance for both 3D shape and 6D pose estimation in complex multi-object scenarios with occlusions. We present ShAPO, a method for joint multi-object detection, 3D textured reconstruction, 6D object pose and size estimation. Key to ShAPO is a single-shot pipeline to regress shape, appearance and pose latent codes along with the masks of each object instance, which are then further refined in a sparse-to-dense fashion. We propose a novel, octree-based differentiable optimization step, allowing us to further improve object shape, pose and appearance simultaneously under the learned latent space, in an analysis-by-synthesis fashion. Our novel joint implicit textured object representation allows us to accurately identify and reconstruct novel unseen objects without having access to their 3D meshes. Through extensive experiments, we show that our method, trained on simulated indoor scenes, accurately regresses the shape, appearance and pose of novel objects in the real world with minimal fine-tuning. Our method significantly outperforms all baselines on the NOCS dataset with an 8% absolute improvement in mAP for 6D pose estimation.

📜 RbX: Region-based explanations of prediction models

Ismael Lemhadri (Stanford University); Harrison Li (Stanford University)*; Trevor Hastie (Stanford)
Abstract: We introduce region-based explanations (RbX), a novel, model-agnostic method to generate local explanations of scalar predictions from a black-box prediction model using only query access. In many contexts, there is a natural notion of which prediction values are "close" to the prediction at some target point in feature space. RbX is based on a greedy algorithm for constructing a convex polytope that approximates the region of feature space with such "close" prediction values. The geometry of this polytope (specifically, the change in each coordinate necessary to escape the polytope) provides the desired region-based explanations of the local sensitivity of the predictions to each of the features near the target point. These "escape distances" can be standardized to rank the features according to local importance. The explanations from RbX are guaranteed to satisfy a "sparsity" axiom, requiring that features which do not enter into the prediction model are assigned zero importance. At the same time, real data examples and synthetic experiments suggest RbX can more readily detect locally relevant features than popular existing methods.

📜 Semi-Supervised Learning with Decision Trees: Graph Laplacian Tree Alternating Optimization

Arman Zharmagambetov (UC Merced)*; Miguel A Carreira-Perpinan (UC Merced)
Abstract: Semi-supervised learning seeks to learn a machine learning model when only a small amount of the available data is labeled. The most widespread approach uses a graph prior, which encourages similar instances to have similar predictions. This has been very successful with models ranging from kernel machines to neural networks, but has remained inapplicable to decision trees, for which the optimization problem is much harder. We solve this based on a reformulation of the problem which requires iteratively solving two simpler problems: a supervised tree learning problem, which can be solved by the Tree Alternating Optimization algorithm; and a label smoothing problem, which can be solved through a sparse linear system. The algorithm is scalable and highly effective even with very few labeled instances, and makes it possible to learn accurate, interpretable models based on decision trees in such situations.
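
The label smoothing subproblem can be sketched with a generic graph-Laplacian formulation. This is an assumption for illustration, not necessarily the paper's exact objective: minimize the squared error on labeled nodes plus `lam` times the sum of squared differences across graph edges, which leads to the linear system (M + lam·L) z = M y, with L the graph Laplacian and M a diagonal indicator of labeled nodes. The helpers `solve_linear_system` and `smooth_labels` are hypothetical, and a dense pure-Python solver stands in for the sparse solver one would use at scale.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x


def smooth_labels(edges, labels, n, lam=1.0):
    """Return smoothed scores z solving (M + lam * L) z = M y.

    edges:  list of undirected (i, j) pairs defining the similarity graph
    labels: dict mapping labeled node index -> target value y_i
    n:      total number of nodes
    """
    a = [[0.0] * n for _ in range(n)]
    b = [0.0] * n
    for i, j in edges:  # accumulate lam * L
        a[i][i] += lam
        a[j][j] += lam
        a[i][j] -= lam
        a[j][i] -= lam
    for node, y in labels.items():  # accumulate M and M y
        a[node][node] += 1.0
        b[node] += y
    return solve_linear_system(a, b)


# Chain of five instances; only the two endpoints carry labels.
chain = [(0, 1), (1, 2), (2, 3), (3, 4)]
z = smooth_labels(chain, {0: 0.0, 4: 1.0}, n=5)
print(z)  # scores rise evenly along the chain
```

In this toy case the unlabeled middle nodes receive interpolated scores, which could then serve as targets for a supervised tree-fitting step in an alternating scheme like the one the abstract describes.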

📜 Asynchronous Distributed Bayesian Optimization at HPC Scale

Romain P Egele (Université Paris Saclay)*; Isabelle Guyon (Clopinet); Prasanna Balaprakash (Argonne National Laboratory)
Abstract: Bayesian optimization (BO) is a widely used approach for computationally expensive black-box optimization such as simulator calibration and hyperparameter optimization of deep learning methods. In BO, a dynamically updated, computationally cheap surrogate model is employed to learn the input-output relationship of the black-box function; this surrogate model is used to explore and exploit the promising regions of the input space. Multipoint BO methods adopt a single manager/multiple workers strategy to achieve high-quality solutions in shorter time. However, the computational overhead in multipoint generation schemes is a major bottleneck in designing BO methods that can scale to thousands of workers. We present an asynchronous-distributed BO (ADBO) method wherein each worker runs a search and asynchronously exchanges the input-output values of black-box evaluations with all other workers, without a manager. We scale our method up to 4,096 workers and demonstrate improvement in the quality of the solution and faster convergence. We demonstrate the effectiveness of our approach for hyperparameter optimization (HPO) of neural networks from the Exascale Computing Project CANDLE benchmarks.

📜 Rewards Encoding Environment Dynamics (REED) Improves Preference-based Reinforcement Learning

Katherine Metcalf (Apple, Inc.)*
Abstract: N/A

📜 A Parametric Class of Approximate Gradient Updates for Policy Optimization

Ramki Gummadi (Google)*; Saurabh Kumar (Google Brain); Junfeng Wen (Carleton University); Dale Schuurmans (Google / University of Alberta)
Abstract: Approaches to policy optimization have been motivated from diverse principles, based on how the parametric model is interpreted (e.g. value versus policy representation) or how the learning objective is formulated, yet they share a common goal of maximizing expected return. To better capture the commonalities and identify key differences between policy optimization methods, we develop a unified perspective that re-expresses the underlying updates in terms of a limited choice of gradient form and scaling function. In particular, we identify a parameterized space of approximate gradient updates for policy optimization that is highly structured, yet covers both classical and recent examples, including PPO. As a result, we obtain novel yet well motivated updates that generalize existing algorithms in a way that can deliver benefits both in terms of convergence speed and final result quality. An experimental investigation demonstrates that the additional degrees of freedom provided in the parameterized family of updates can be leveraged to obtain non-trivial improvements both in synthetic domains and on popular deep RL benchmarks.

📜 FedEmbed: Personalized Private Federated Learning

Barry Theobald (Apple)*; Andrew Silva (Georgia Institute of Technology); Katherine Metcalf (Apple, Inc.); Nicholas Apostoloff (Apple Inc.)
Abstract: Private Federated Learning (PFL) updates a shared global model using decentralized data collection. Personalization in PFL is challenging because a data sample could have conflicting labels, e.g., one sub-population of users prefers a sample, whilst other sub-populations dislike the same sample. Our contribution, FedEmbed, allows personalization of a global model by (1) assigning a user to a sub-population of users with similar preferences using a dictionary of prototypical users, and (2) learning personal embeddings on-device. We demonstrate that current approaches to PFL are inadequate for handling data with conflicting labels, and we show that FedEmbed achieves up to 45% improvement over baseline approaches.

📜 Beyond Tabula Rasa: Reincarnating Reinforcement Learning

Rishabh Agarwal (Google Research, Brain Team)*; Max Schwarzer (MILA, Université de Montréal); Pablo Samuel Castro (Google); Marc G. Bellemare (Google Brain); Aaron Courville (MILA, Université de Montréal)
Abstract: Learning tabula rasa, that is, without any prior knowledge, is the prevalent workflow in reinforcement learning (RL) research. However, RL systems, when applied to large-scale settings, rarely operate tabula rasa. Such large-scale systems undergo multiple design or algorithmic changes during their development cycle and use ad hoc approaches for incorporating these changes without re-training from scratch, which would have been prohibitively expensive. Additionally, the inefficiency of tabula rasa RL typically excludes researchers without access to industrial-scale resources from tackling computationally demanding RL problems. To address these issues, we present reincarnating RL as an alternative workflow, where prior computational work (e.g., learned policies) is reused or transferred between design iterations of an RL agent, or from one agent to another. To exemplify challenges in setting up this workflow, we focus on the specific reincarnating RL setting of efficiently transferring an existing sub-optimal policy to a standalone value-based RL agent. We demonstrate that existing approaches fail in this setting and propose a simple algorithm to address their limitations. Equipped with this algorithm, we demonstrate reincarnating RL's gains over tabula rasa RL on Atari 2600 games, a challenging locomotion task, and the real-world problem of navigating stratospheric balloons. Our findings also raise several questions that call for further investigation of reincarnating RL. Overall, this work argues for a different approach to doing RL research, which we believe could significantly improve real-world RL adoption and help democratize RL research.

📜 MineDojo: Building Open-Ended Embodied Agents with Internet-Scale Knowledge

Linxi Fan (NVIDIA Corporation)*; Guanzhi Wang (Stanford University); Yunfan Jiang (Stanford University); Ajay Mandlekar (Stanford University); De-An Huang (Stanford University); Yuke Zhu (University of Texas - Austin); Anima Anandkumar (NVIDIA/Caltech)
Abstract: Autonomous agents have made great strides in specialist domains like Atari games and Go. However, they typically learn tabula rasa in isolated environments with limited and manually conceived objectives, thus failing to generalize across a wide spectrum of tasks and capabilities. Inspired by how humans continually learn and adapt in the open world, we advocate a trinity of ingredients for building generalist agents: 1) an environment that supports a multitude of tasks and goals, 2) a large-scale database of multimodal knowledge, and 3) a flexible and scalable agent architecture. We introduce MineDojo, a new framework built on the popular Minecraft game that features a simulation suite with thousands of diverse open-ended tasks and an internet-scale knowledge base with Minecraft videos, tutorials, wiki pages, and forum discussions. Using MineDojo's data, we propose a novel agent learning algorithm that leverages large pre-trained video-language models as a learned reward function. Our agent is able to solve a variety of open-ended tasks specified in free-form language without any manually designed dense shaping reward. We open-source the code and knowledge bases to promote research towards the goal of generally capable embodied agents.

📜 SAVi++: Towards End-to-End Object-Centric Learning from Real-World Videos

Gamaleldin F Elsayed (Google Research, Brain Team)*; Aravindh Mahendran (Google); Sjoerd van Steenkiste (Google); Klaus Greff (Google); Michael C Mozer (Google Research, Brain Team); Thomas Kipf (Google Brain)
Abstract: The visual world can be parsimoniously characterized in terms of distinct entities with sparse interactions. Discovering this compositional structure in dynamic visual scenes has proven challenging for end-to-end computer vision approaches unless explicit instance-level supervision is provided. Slot-based models leveraging motion cues have recently shown great promise in learning to represent, segment, and track objects without direct supervision, but they still fail to scale to complex real-world multi-object videos. In an effort to bridge this gap, we take inspiration from human development and hypothesize that information about scene geometry in the form of depth signals can facilitate object-centric learning. We introduce SAVi++, an object-centric video model which is trained to predict depth signals from a slot-based video representation. By further leveraging best practices for model scaling, we are able to train SAVi++ to segment complex dynamic scenes recorded with moving cameras, containing both static and moving objects of diverse appearance on naturalistic backgrounds, without the need for segmentation supervision. Finally, we demonstrate that by using sparse depth signals obtained from LiDAR, SAVi++ is able to learn emergent object segmentation and tracking from videos in the real-world Waymo Open dataset.

📜 Physics-based Validation of Machine-learning Approaches for the COSI Space Mission

Yasaman Ebrahimi (Space Sciences Laboratory)*; Olivia Salaben (University of California, Berkeley); Rhea Senthil Kumar (University of California, Berkeley); Andreas Zoglauer (University of California, Berkeley)
Abstract: Validating machine-learning approaches for application to the future COSI space telescope involves testing the approaches with realistic simulations of the instrument and the space radiation environment, and validating their performance as a function of all physical input and measured photon parameters as well as of all detrimental effects that can occur in the COSI detectors. This rigorous process, which covers all physically relevant parameters for the operation of the instrument, ensures that the performance of the selected machine-learning models is unbiased and produces consistent results in all segments of the data space.

📜 ItemSage: Learning Product Embeddings for Shopping Recommendations at Pinterest

Paul Baltescu (Pinterest); Haoyu Chen (Pinterest)*; Nikil Pancha (Pinterest, Inc.); Andrew H Zhai (Pinterest, Inc.); Jure Leskovec (Pinterest, Inc.); Charles Rosenberg (Pinterest)
Abstract: Learned embeddings for products are an important building block for web-scale e-commerce recommendation systems. At Pinterest, we build a single set of product embeddings called ItemSage to provide relevant recommendations in all shopping use cases including user, image and search based recommendations. This approach has led to significant improvements in engagement and conversion metrics, while reducing both infrastructure and maintenance cost. While most prior work focuses on building product embeddings from features coming from a single modality, we introduce a transformer-based architecture capable of aggregating information from both text and image modalities and show that it significantly outperforms single modality baselines. We also utilize multi-task learning to make ItemSage optimized for several engagement types, leading to a candidate generation system that is efficient for all of the engagement objectives of the end-to-end recommendation system. Extensive offline experiments are conducted to illustrate the effectiveness of our approach and results from online A/B experiments show substantial gains in key business metrics (up to +7% gross merchandise value/user and +11% click volume).

📜 Leveraging Unlabeled Data to Track Memorization

Mahsa Forouzesh (EPFL); Hanie Sedghi (Google)*; Patrick Thiran (EPFL)
Abstract: Deep neural networks may easily memorize noisy labels present in real-world datasets, which degrades their ability to generalize. It is therefore important to track and evaluate the robustness of models against noisy label memorization. We propose a method to assess this robustness, which bypasses the requirement to access ground-truth labels, by simply taking a subset of the dataset and assigning random labels to it. We first show, both empirically and theoretically, that networks with a high accuracy on unseen clean data are also resistant to the memorization of such a held-out randomly-labeled set. Next, we leverage this observation and introduce a metric, called susceptibility, which is easy to compute during training on a held-out randomly-labeled set, and is a very good indicator of the resistance of the trained model against memorization. We show through extensive experiments that one can utilize susceptibility and the overall training accuracy to distinguish models that maintain a low memorization on the training set and generalize well to unseen clean data.

📜 Adaptation of Surgical Activity Recognition Models Across Operating Rooms

Ali Mottaghi (Stanford University)*; Aidean Sharghi (Intuitive Surgical Inc.); Serena Yeung (Stanford University); Omid Mohareri (Intuitive Surgical Inc.)
Abstract: Automatic surgical activity recognition enables more intelligent surgical devices and a more efficient workflow. Integration of such technology in new operating rooms has the potential to improve care delivery to patients and decrease costs. Recent works have achieved a promising performance on surgical activity recognition; however, the lack of generalizability of these models is one of the critical barriers to the wide-scale adoption of this technology. In this work, we study the generalizability of surgical activity recognition models across operating rooms. We propose a new domain adaptation method to improve the performance of the surgical activity recognition model in a new operating room for which we only have unlabeled videos. Our approach generates pseudo labels for unlabeled video clips that it is confident about and trains the model on the augmented version of the clips. We extend our method to a semi-supervised domain adaptation setting where a small portion of the target domain is also labeled. In our experiments, our proposed method consistently outperforms the baselines on a dataset of more than 480 long surgical videos collected from two operating rooms.
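The confidence-based pseudo-labeling step mentioned in this abstract can be illustrated with a minimal sketch. The function name `select_pseudo_labels`, the threshold of 0.9, and the use of the maximum class probability as the confidence score are assumptions made for illustration; the abstract does not specify how confidence is measured.

```python
def select_pseudo_labels(probabilities, threshold=0.9):
    """Pick unlabeled clips whose top predicted class probability meets a threshold.

    probabilities: one list of class probabilities per unlabeled clip.
    threshold: hypothetical confidence cutoff (not taken from the paper).
    Returns a list of (clip_index, pseudo_label) pairs.
    """
    selected = []
    for index, probs in enumerate(probabilities):
        confidence = max(probs)
        if confidence >= threshold:
            # The pseudo label is the index of the most probable class.
            selected.append((index, probs.index(confidence)))
    return selected
```

In such a scheme, only the selected clips (after augmentation) would be added to the training set for the new operating room; the remaining clips stay unlabeled until the model becomes more confident about them.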

📜 Translation of Taxonomy Entities using Graph Neural Networks

Zhuliu Li (LinkedIn)*; Yanen Li (LINKEDIN CORPORATION); Yiming Wang (LINKEDIN CORPORATION); Xiao Yan (LinkedIn); Weizhi Meng (LinkedIn); Jaewon Yang (LINKEDIN CORPORATION)
Abstract: Taxonomies describe the definitions of entities, entities’ attributes and the relations among the entities, and thus play an important role in building a knowledge graph. In this paper, we tackle the task of taxonomy entity translation, which is to translate the names of taxonomy entities in a source language to a target language. The translations then can be utilized to build a knowledge graph in the target language. Despite its importance, taxonomy entity translation remains a hard problem for AI models due to two major challenges. One challenge is understanding the semantic context in very short entity names. Another challenge is having a deep understanding of the domain where the knowledge graph is built. We present TaxoTrans, a novel method for taxonomy entity translation that can capture the context in entity names and the domain knowledge in taxonomy. To achieve this, TaxoTrans creates a heterogeneous graph to connect entities, and formulates the entity name translation problem as link prediction in the heterogeneous graph: given a pair of entity names across two languages, TaxoTrans applies a graph neural network to determine whether they form a translation pair or not. Because of this graph, TaxoTrans can capture both the semantic context and the domain knowledge. Our offline experiments on LinkedIn’s skill and title taxonomies show that by modeling semantic information and domain knowledge in the heterogeneous graph, TaxoTrans outperforms the state-of-the-art translation methods by ~10%. Human annotation and A/B test results further demonstrate that the accurately translated entities significantly improve user engagements and advertising revenue at LinkedIn.

📜 Towards Building Explainable-AI Systems Across LinkedIn: Key Challenges and Resolutions

Jilei Yang (LinkedIn Corporation)*
Abstract: Delivering the best member and customer experiences with a focus on trust is core to our work at LinkedIn. To this end, we build transparent and explainable AI systems on top of machine learning models across LinkedIn via model interpretation. We encountered two key challenges when building the systems: I. Model interpretation results may not be intuitive to a non-technical audience. II. Model interpretation can be time-consuming. We proposed and developed two resolutions: CrystalCandle - a user-facing model explainer that creates user-digestible interpretations to deal with Challenge I, and FastTreeSHAP - an open-sourced Python package that speeds up the SHAP value computation by 3x to deal with Challenge II. Both resolutions have been successfully adopted at LinkedIn, leading to boosts in adoption rate of model recommendations and increases in downstream key metrics such as revenue.

📜 Results of the NeurIPS'22 Cross-Domain MetaDL Competition

Dustin J Carrion (Université Paris-Saclay)*; Hong Chen (Tsinghua University); Adrian El Baz (ChaLearn); Sergio Escalera (CVC and University of Barcelona); Chaoyu Guan (Tsinghua University); Isabelle Guyon (CNRS, INRIA, University Paris-Saclay and ChaLearn); Ihsan Ullah (Université Paris Saclay); Xin Wang (Tsinghua University); Wenwu Zhu (Tsinghua University)
Abstract: The aim of this paper is to present the results of the latest challenge in the ChaLearn meta-learning series, accepted at NeurIPS'22, focusing on "cross-domain" meta-learning. Meta-learning aims to leverage experience gained from previous tasks to solve new tasks efficiently (i.e., with better performance, little training data and/or modest computational resources). While previous challenges in the series focused on "within-domain" few-shot learning problems, with the aim of learning efficiently N-way k-shot tasks (i.e., N class classification problems with k training examples), this competition challenges the participants to solve "any-way" and "any-shot" problems drawn from various domains (healthcare, ecology, biology, manufacturing, and others), chosen for their humanitarian and societal impact. The competition is with code submission, fully blind-tested on the CodaLab challenge platform. The code of the winners will be open-sourced, enabling the deployment of automated machine learning solutions for few-shot image classification across several domains.

📜 Learning differentiable solvers for systems with hard constraints

Geoffrey Negiar (UC Berkeley)*; Michael Mahoney (University of California, Berkeley); Aditi Krishnapriyan (UC Berkeley)
Abstract: We introduce a practical method to enforce linear partial differential equation (PDE) constraints for functions defined by neural networks (NNs), up to a desired tolerance. By combining methods in differentiable physics and applications of the implicit function theorem to NN models, we develop a differentiable PDE-constrained NN layer. During training, our model learns a family of functions, each of which defines a mapping from PDE parameters to PDE solutions. At inference time, the model finds an optimal linear combination of the functions in the learned family by solving a PDE-constrained optimization problem. Our method provides continuous solutions over the domain of interest that exactly satisfy desired physical constraints. Our results show that incorporating hard constraints directly into the NN architecture achieves much lower test error when compared to training on an unconstrained objective.

📜 Towards Multimodal Multitask Scene Understanding Model for Indoor Mobile Agents

Yao-Hung Hubert Tsai (Apple)*; Hanlin Goh (Apple); Ali Farhadi (Apple); Jian Zhang (Apple Inc.)
Abstract: The perception system in personalized mobile agents requires developing indoor scene understanding models, which could capture objectiveness, analyze human behaviors, understand 3D geometries, etc. Nonetheless, this direction has not been well-explored when compared to models in the outdoor environment (e.g., the autonomous driving system that includes pedestrian prediction, car detection, traffic sign recognition, etc.). In this paper, we first discuss four core challenges: 1) fusion between heterogeneous sources of information (e.g., RGB images and Lidar point clouds), 2) modeling relationship between a diverse set of outputs (e.g., 3D object locations, depth estimation, and human poses), 3) computational efficiency under mobile compute constraints, and 4) insufficient, or even no, labeled data for real-world indoor environments. Then, we describe MMISM (Multi-modality input Multi-task output Indoor Scene understanding Model) to tackle the above challenges. MMISM considers RGB images as well as sparse Lidar points as inputs and 3D object detection, depth completion, human pose estimation, and semantic segmentation as output tasks. We show that MMISM performs on par or even better than single-task models; e.g., we improve the baseline 3D object detection results by 16% on the benchmark ARKitScenes dataset.

📜 Developing a Machine Learning Mechanism for Selecting MRI Radiology Titles Using Electronic Medical Records

Peyman Shokrollahi (Stanford University)*; Juan M. Zambrano Chaves (Stanford University); Jonathan P.H. Lam (Stanford University); Avishkar Sharma (Stanford University); Debashish Pal (GE Healthcare); Naeim Bahrami (GE Healthcare); Akshay S Chaudhari (Stanford University); Andreas M. Loening (Stanford University)
Abstract: N/A

📜 Safe Real-World Reinforcement Learning for Mobile Agent Obstacle Avoidance

Mario Srouji (Apple); Wei Ding (Apple); Yao-Hung Hubert Tsai (Apple)*; Ali Farhadi (Apple); Jian Zhang (Apple Inc.)
Abstract: This work builds an efficient safety system for real-world obstacle avoidance for mobile agents. At the core of our safety system is a reinforcement-learning guided search algorithm with the following three key properties. First, it performs learning and inference in the real world with a strong guarantee of safety. Second, it does in-time path planning when it detects the agent will be in unsafe positions in the future, to adapt the agent's travel path and avoid getting into worst-case braking scenarios (i.e., reaching a complete stop). Third, it considers a fusion strategy from heterogeneous sources of inputs, including Lidar, Ultrasonic array, and wheel odometry, for their complementary information to detect a diversified set of obstacles, including walls (static), glass (transparent), human (dynamic), etc. Our real-world experiments show that, when compared to mobile agents without a safety system and mobile agents with traditional safety systems, our approach enjoys a higher average speed, lower crash rate, higher goals reached rate, smaller computation overhead, and smoother overall control.

🎤 LM-Nav: Robotic Navigation with Large Pre-Trained Models of Language, Vision, and Action

Dhruv Shah (UC Berkeley); Błażej B Osiński (Lyft Inc)*; Brian Ichter (Google Brain); Sergey Levine (UC Berkeley)
Abstract: Goal-conditioned policies for robotic navigation can be trained on large, unannotated datasets, providing for good generalization to real-world settings. However, particularly in vision-based settings where specifying goals requires an image, this makes for an unnatural interface. Language provides a more convenient modality for communication with robots, but contemporary methods typically require expensive supervision, in the form of trajectories annotated with language descriptions. Our main observation is that we can utilize off-the-shelf pre-trained models trained on large corpora of visual and language datasets — that are widely available and show great few-shot generalization capabilities — to create this interface for embodied instruction following. To achieve this, we combine the strengths of two such robot-agnostic pre-trained models with a pre-trained navigation model. We use a visual navigation model (VNM: ViNG) to create a topological “mental map” of the environment using the robot’s observations. Given free-form textual instructions, we use a pre-trained large language model (GPT-3) to decode the instructions into a sequence of textual landmarks. We then use a vision-language model (CLIP) for grounding these textual landmarks in the topological map, by inferring a joint likelihood over the landmarks and nodes. A novel search algorithm is then used to maximize a probabilistic objective, and find a plan for the robot, which is then executed by VNM. Our contribution, Large Model Navigation, or LM-Nav, is the first instantiation of a robotic system that combines the confluence of pre-trained vision-and-language models with a goal-conditioned controller, to derive actionable plans without any fine-tuning in the target environment. We show that LM-Nav is able to successfully follow natural language instructions in new environments over the course of 100s of meters of complex, suburban navigation, while disambiguating paths with fine-grained commands.

🎤 When can you trust your model's predictions? A Mistrust Scoring Framework for inference

Nandita Bhaskhar (Stanford University)*; Daniel Rubin (Stanford University); Christopher Lee-Messer (Stanford University)
Abstract: N/A

📜 Self-Supervision for Scene Graph Embeddings

Brigit Schroeder (University of California Santa Cruz)*; Subarna Tripathi (Intel Labs)
Abstract: Scene graph embeddings are used in applications such as image retrieval, image generation and image captioning. Many of the models for these tasks are trained on large datasets such as Visual Genome, but the collection of these human-annotated datasets is costly and onerous. We seek to improve scene graph embedding representation learning by leveraging the already available data (e.g. the scene graphs themselves) with the addition of self-supervision. In self-supervised learning, models are trained for pretext tasks which do not depend on manual labels and use the existing available data. However, it is largely unexplored in the area of image scene graphs. In this work, starting from a baseline scene graph embedding model trained on the pretext task of layout prediction, we propose several additional self-supervised pretext tasks. The impact of these additions is evaluated on a downstream retrieval task that was originally associated with the baseline model. Experimentally, we demonstrate that the addition of each task individually and cumulatively improves on the retrieval performance of the baseline model, resulting in near saturation when all are combined.

📜 One-class recommendation systems with the hinge pairwise distance loss and orthogonal representations

Ramin Raziperchikolaei (Rakuten)*; Young Joo Chung (Rakuten)
Abstract: In one-class recommendation systems, the goal is to learn a model from a small set of interacted users and items and then identify the positively-related user-item pairs among a large number of pairs with unknown interactions. Most previous loss functions rely on dissimilar pairs of users and items, which are selected from the ones with unknown interactions, to obtain better prediction performance. This strategy introduces several challenges such as increasing training time and hurting the performance by picking "similar pairs with the unknown interactions" as dissimilar pairs. In our work, the goal is to use only the similar set to train the models and to discard the dissimilar ones. We achieve this goal by adding two terms to the objective function. The first one is a hinge pairwise distance loss that avoids the collapsed solution by keeping the average pairwise distance of all the representations greater than a margin. The second one is an orthogonality term that minimizes the correlation between the dimensions of the representations and avoids the partially collapsed solution. We conduct experiments on a variety of tasks on public and real-world datasets. The results show that our approach using only similar pairs outperforms state-of-the-art methods using similar pairs and a large number of dissimilar pairs.
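The two regularization terms described in this abstract can be sketched as follows. This is one plausible reading, not the authors' implementation: the function names, the Euclidean distance, and the use of squared off-diagonal covariance entries as the "correlation" penalty are all assumptions, since the abstract does not give exact formulas.

```python
import math
from itertools import combinations


def hinge_pairwise_distance_loss(representations, margin):
    """Penalize the batch when the average pairwise distance falls below the margin.

    Assumes at least two representations and Euclidean distance (an assumption).
    """
    pairs = list(combinations(representations, 2))
    average_distance = sum(math.dist(a, b) for a, b in pairs) / len(pairs)
    # Hinge: zero once the average distance exceeds the margin.
    return max(0.0, margin - average_distance)


def orthogonality_penalty(representations):
    """Sum of squared covariances between distinct dimensions (assumed form)."""
    count = len(representations)
    dims = len(representations[0])
    means = [sum(r[j] for r in representations) / count for j in range(dims)]
    centered = [[r[j] - means[j] for j in range(dims)] for r in representations]
    penalty = 0.0
    for j in range(dims):
        for k in range(j + 1, dims):
            covariance = sum(c[j] * c[k] for c in centered) / count
            penalty += covariance ** 2
    return penalty
```

Under this reading, a fully collapsed solution (all representations identical) gives an average distance of zero and therefore the maximum hinge penalty, while a partially collapsed solution (dimensions that carry the same information) is caught by the orthogonality penalty.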

📜 A Simple, Yet Effective Approach to Finding Biases in Code Generation

Spyridon Mouselinos (University of Warsaw)*; Mateusz Malinowski (DeepMind); Henryk Michalewski (Google LLC)
Abstract: Recently, scores of high-performing code generation models have surfaced. They use large language models as their backbone, and they offer to assist anyone in their coding routines. However, can we identify a piece of code or its specification that can mislead even top-performing models in completing the code? While assessing the difficulty of coding challenges remains an open question, we find that typical mistakes made during coding can render a task harder, thus leading to failures of the existing models. To demonstrate the failures, we have extended two popular code generation challenges with an automated error-inducing module, and tested various code generation models. Our results reveal biases towards specific prompt structure and keywords during code generation. Finally, we also study the effects of harnessing such augmentations during model training.

📜 Data Feedback Loops: Model-driven Amplification of Dataset Biases

Rohan Taori (Stanford University)*; Tatsunori Hashimoto (Stanford)
Abstract: We study a setting inspired by the internet-scale training and deployment of neural systems, where interactions with one model may be recorded as internet history and scraped as training data in the future. We analyze the stability of this system via changes to a bias statistic (e.g. toxicity rate from a language model) on the model's outputs over time. We find that the degree of bias amplification in these systems is closely linked to whether the model's outputs are consistently calibrated with respect to its training data distributions. Experiments in three different scenarios -- image classification, structured prediction, and language generation -- suggest that models that behave more like samplers are often more calibrated and thus more stable. Based on this insight, we propose an intervention to help stabilize existing unstable feedback systems.

📜 On the impact of overfitting in learning to rank using a margin loss: a case study in job recommender systems

Solal Nathan (Laboratoire Interdisciplinaire des Sciences du Numérique)*; Guillaume Bied (Laboratoire Interdisciplinaire des Sciences du Numérique)
Abstract: In learning to rank and recommender systems, it is typically intractable to directly optimize on metrics of interest such as recall or precision. Surrogate losses are instead used for learning: an important case is margin-based loss functions, which seek to separate relevant samples from irrelevant ones. This paper studies the relation between margin loss and the true metric in the real-world setting of job recommender systems. Intriguingly, in this setting, overfitting the margin loss does not translate to overfitting on the metric of interest. To understand this phenomenon, we introduce novel concepts of participation (the share of training samples with non-zero contribution to the loss) and cycling (stability of the population of samples with non-zero contribution throughout training).
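The notion of participation defined in this abstract can be made concrete with a short sketch. The pairwise form of the margin loss used here, `max(0, margin - (positive_score - negative_score))`, and the function names are assumptions for illustration; the abstract defines participation only as the share of training samples with non-zero contribution to the loss.

```python
def margin_losses(positive_scores, negative_scores, margin=1.0):
    """Per-sample pairwise margin loss (assumed form): zero once the
    relevant item outscores the irrelevant one by at least the margin."""
    return [
        max(0.0, margin - (pos - neg))
        for pos, neg in zip(positive_scores, negative_scores)
    ]


def participation(losses):
    """Share of samples whose loss contribution is non-zero."""
    return sum(1 for loss in losses if loss > 0) / len(losses)
```

For example, with positive scores `[2.0, 0.5, 3.0, 1.0]`, negative scores `[0.0, 0.4, 2.5, 1.5]`, and a margin of 1.0, the first pair is already separated by the margin and contributes nothing, so participation is 0.75. Tracking which samples make up that participating set from epoch to epoch would correspond to the abstract's notion of cycling.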

📜 Depth Field Networks for Generalizable Multi-view Scene Representation

Igor Vasiljevic (Toyota Research Institute)*; Vitor Guizilini (Toyota Research Institute); Jiading Fang (Toyota Technological Institute at Chicago); Rareș A Ambruș (Toyota Research Institute); Greg Shakhnarovich (Toyota Technological Institute at Chicago); Matthew Walter (Toyota Technological Institute at Chicago); Adrien Gaidon (Toyota Research Institute)
Abstract: Modern 3D computer vision leverages learning to boost geometric reasoning, mapping image data to classical structures such as cost volumes or epipolar constraints to improve matching. These architectures are specialized according to the particular problem, and thus require significant task-specific tuning, often leading to poor domain generalization performance. Recently, generalist Transformer architectures have achieved impressive results in tasks such as optical flow and depth estimation by encoding geometric priors as inputs rather than as enforced constraints. In this paper, we extend this idea and propose to learn an implicit, multi-view consistent scene representation, introducing a series of 3D data augmentation techniques as a geometric inductive prior to increase view diversity. We also show that introducing view synthesis as an auxiliary task further improves depth estimation. Our Depth Field Networks (DeFiNe) achieve state-of-the-art results in stereo and video depth estimation without explicit geometric constraints, and improve on zero-shot domain generalization by a wide margin.

📜 DANGER: A Framework of Danger-Aware Novel Dataset Generator Extension for Robustness Test of Machine Learning

Shengjie Xu (UC Santa Cruz)*; Leilani H Gilpin (UCSC)
Abstract: Benchmark datasets for autonomous driving, such as KITTI, Argoverse, or Waymo, are realistic, but they are designed to be too idealistic. These datasets do not contain errors, difficult driving maneuvers, or other corner cases. We propose a framework for perturbing autonomous vehicle datasets, the DANGER framework, which generates edge-case images on top of current autonomous driving datasets. The input to DANGER is a photorealistic dataset from real driving scenarios. We present the DANGER algorithm for vehicle position manipulation and the interface towards the renderer module, and present primitive generation cases applied to the virtual KITTI dataset. Our experiments prove that DANGER can be used as a framework for enlarging the current dataset to cover generative corner cases.

📜 Posterior Sampling Model-based Policy Optimization

Chaoqi Wang (University of Toronto)*; Yuxin Chen (UChicago); Kevin Murphy (Google)
Abstract: In this paper, we propose a model-based reinforcement learning (MBRL) algorithm based on posterior sampling. We first motivate our method by showing that several popular MBRL algorithms cannot trade off exploration and exploitation properly. Then, we construct a hierarchical Bayesian model over the optimal policies with the environment’s dynamics as the latent variable, and employ posterior sampling to handle exploration and exploitation. Empirically, our method surpasses the baselines by a large margin on several benchmark continuous control tasks.

📜 Layerwise Training of Convex Convolutional Neural Networks with the Burer-Monteiro Factorization

Arda Sahiner (Stanford University)*; Tolga Ergen (Stanford University); Batu Ozturkler (Stanford University); John Pauly (Stanford University); Morteza Mardani (Stanford University); Mert Pilanci (Stanford University)
Abstract: It has been demonstrated that two-layer ReLU-activation neural networks are equivalent to convex programs. Convex training of neural networks guarantees global optimality. The convex formulation of neural networks induces a unique regularizer: a type of nuclear norm which promotes sparse factorization while the left factor is constrained to an affine space. This constrained nuclear norm is NP-hard to compute. To address this, we leverage the Burer-Monteiro (BM) factorization to (i) inherit the per-iteration complexity of non-convex training of neural networks, and (ii) inherit the optimality guarantees of convex training. For convexified ReLU CNNs, we develop verifiable relative optimality bounds for all stationary points of the BM factorization. Our experiments with image classification indicate that this BM factorization allows layerwise training of convex CNNs, for the first time matching the performance of multi-layer non-convex CNNs.

📜 TE2Rules: Extracting Rule Lists from Tree Ensembles

G Roshan Lal (LinkedIn)*; Elaine Chen (LinkedIn); Varun Mithal (LinkedIn)
Abstract: Tree Ensemble (TE) models (e.g. Gradient Boosted Trees and Random Forests) often provide higher prediction performance compared to single decision trees. However, TE models generally lack transparency and interpretability, as humans have difficulty understanding their decision logic. This paper presents a novel approach to convert a TE trained for a binary classification task, to a rule list (RL) that is a global equivalent to the TE and is comprehensible for a human. This RL captures all necessary and sufficient conditions for decision making by the TE. Experiments on benchmark datasets demonstrate that, compared to state-of-the-art methods, (i) predictions from the RL generated by TE2Rules have high fidelity with respect to the original TE, (ii) the RL from TE2Rules has high interpretability measured by the number and the length of the decision rules, (iii) the run-time of TE2Rules algorithm can be reduced significantly at the cost of a slightly lower fidelity, and (iv) the RL is a fast alternative to the state-of-the-art rule-based instance-level outcome explanation techniques. For more details on our implementation and for reproducing the results in this paper, our code can be found here:

📜 Medical Codes Prediction from Clinical Notes: From Human Coders to Machines

Byung-Hak Kim (AKASA)*
Abstract: See the uploaded PDF file

📜 Surrogate for Long-Term User Experience in Recommender Systems

Yuyan Wang (Google Brain)*; Mohit Sharma (University of Minnesota); Can Xu (Google); Sriraj Badam (Google); Qian Sun (Google); Lee Richardson (Google); Lisa Chung (Google); Ed H. Chi (Google); Minmin Chen (Google)
Abstract: Over the years we have seen recommender systems shifting focus from optimizing short-term engagement toward improving long-term user experience on the platforms. While defining good long-term user experience is still an active research area, we focus on one specific aspect of improved long-term user experience here, which is users revisiting the platform. These long-term outcomes, however, are much harder to optimize due to the sparsity in observing these events and the low signal-to-noise ratio (weak connection) between these long-term outcomes and a single recommendation. To address these challenges, we propose to establish the association between these long-term outcomes and a set of more immediate-term user behavior signals that can serve as surrogates for optimization. To this end, we conduct a large-scale study of user behavior logs on one of the largest industrial recommendation platforms serving billions of users. We study a broad set of sequential user behavior patterns and standardize a procedure to pinpoint the subset that has strong predictive power of the change in users' long-term visiting frequency. Specifically, they are predictive of users' increased visiting to the platform in 5 months among the group of users with the same visiting frequency to begin with. We validate the identified subset of user behaviors by incorporating them as reward surrogates for long-term user experience in a reinforcement learning (RL) based recommender. Results from multiple live experiments on the industrial recommendation platform demonstrate the effectiveness of the proposed set of surrogates in improving long-term user experience.

🎤 Plex: Towards Reliability using Pretrained Large Model Extensions

Balaji Lakshminarayanan (Google Brain)*; Dustin Tran (Google); Jeremiah Liu (Google Research); Michael W Dusenberry (Google); Du Phan (Google); Mark Collier (Google); Jie Ren (Google Research); Kehang Han (Google); Zi Wang (Google); Zelda Mariet (Google); Huiyi Hu (Google DeepMind); Neil B Band (University of Oxford); Tim G. J. Rudner (University of Oxford); Karan Singhal (Google Research); Zachary Nado (Google Brain); Joost van Amersfoort (University of Oxford); Andreas Kirsch (University of Oxford); Rodolphe Jenatton (Google); Nithum Thain (Google); Honglin Yuan (Stanford); Kelly Buchanan (Google); Kevin Murphy (Google); D Sculley (Google); Yarin Gal (University of Oxford); Zoubin Ghahramani (Google); Jasper Snoek (Google Brain)
Abstract: A recent trend in artificial intelligence is the use of pretrained models for language and vision tasks, which have achieved extraordinary performance but also puzzling failures. Probing these models' abilities in diverse ways is therefore critical to the field. In this paper, we explore the reliability of models, where we define a reliable model as one that not only achieves strong predictive performance but also performs well consistently over many decision-making tasks involving uncertainty (e.g., selective prediction, open set recognition), robust generalization (e.g., accuracy and proper scoring rules such as log-likelihood on in- and out-of-distribution datasets), and adaptation (e.g., active learning, few-shot uncertainty). We devise 10 types of tasks over 38 datasets in order to evaluate different aspects of reliability on both vision and language domains. To improve reliability, we developed ViT-Plex and T5-Plex, pretrained large model extensions (plex) for vision and language modalities, respectively. Plex greatly improves the state-of-the-art across reliability tasks, and simplifies the traditional protocol as it does not require designing scores or tuning the model for each individual task. We demonstrate scaling effects over model sizes up to 1B parameters and pretraining dataset sizes up to 4B examples. We also demonstrate Plex's capabilities on challenging tasks including zero-shot open set recognition, active learning, and uncertainty in conversational language understanding.

📜 Beyond neural scaling laws: beating power law scaling via data pruning

Ben Sorscher (Stanford University)*; Surya Ganguli; Ari Morcos (Facebook AI Research); Robert Geirhos (University of Tubingen); Shashank Shekhar (University of Guelph)
Abstract: Widely observed neural scaling laws, in which error falls off as a power of the training set size, model size, or both, have driven substantial performance improvements in deep learning. However, these improvements through scaling alone require considerable costs in compute and energy. Here we focus on the scaling of error with dataset size and show how both in theory and practice we can break beyond power law scaling and reduce it to exponential scaling instead if we have access to a high-quality data pruning metric that ranks the order in which training examples should be discarded to achieve any pruned dataset size. We then test this new exponential scaling prediction with pruned dataset size empirically, and indeed observe better than power law scaling performance on ResNets trained on CIFAR-10, SVHN, and ImageNet. Given the importance of finding high-quality pruning metrics, we perform the first large-scale benchmarking study of ten different data pruning metrics on ImageNet. We find most existing high performing metrics scale poorly to ImageNet, while the best are computationally intensive and require labels for every image. We therefore developed a new simple, cheap and scalable self-supervised pruning metric that demonstrates comparable performance to the best supervised metrics. Overall, our work suggests that the discovery of good data-pruning metrics may provide a viable path forward to substantially improved neural scaling laws, thereby reducing the resource costs of modern deep learning.

📜 Accelerating Computational Chemistry with Machine Learning

Daniel Rothchild (UC Berkeley)*; Andrew Rosen (UC Berkeley); Eric Taw (UC Berkeley); Joseph E Gonzalez (UC Berkeley); Aditi Krishnapriyan (UC Berkeley)
Abstract: Quantum mechanical simulations, such as density functional theory (DFT), are widely used to study and design new materials. However, these methods can often be prohibitively slow. Machine learning (ML) provides an alternative to accelerate such simulations. However, current ML approaches for accelerating DFT must train on large amounts of energies and force data for many materials. We develop a self-supervised ML training method that only needs the material structure (bypassing the expensive energies/forces training). Our results show that we can achieve low error in a fraction of the DFT computation time and with a fraction of the training data of other ML methods.

Call for Abstracts

BayLearn 2022

The BayLearn 2022 abstract submission site is now CLOSED for submissions: BayLearn 2022 CMT

The abstract submission deadline is Thursday, July 14th, 2022, 11:59 pm PDT. Please submit abstracts as a 2-page PDF in NeurIPS format. An extra page for acknowledgements and references is allowed.

About BayLearn: The BayLearn Symposium is an annual gathering of machine learning researchers and scientists from the San Francisco Bay Area. While BayLearn promotes community building and technical discussions between local researchers from academic and industrial institutions, it also welcomes visitors. This one-day event combines invited talks, contributed talks, and posters, to foster exchange of ideas.

Meet with fellow Bay Area machine learning researchers and scientists during the symposium, which will be held in mid-October. The exact date is to be decided.

Feel free to circulate this invitation to your colleagues and relevant contacts.

Key Dates


We encourage submission of abstracts. Acceptable material includes work which has already been submitted or published, preliminary results and controversial findings. We do not intend to publish paper proceedings; only abstracts will be shared through an online repository. Our primary goal is to foster discussion! For examples of previously accepted talks, please watch the paper presentations from previous BayLearn Symposiums:

For more information about submissions, please look here:

Submit your abstracts via CMT: BayLearn 2022 CMT

Mailing List: If this email was forwarded to you, and you would like to join the BayLearn mailing list so that you will receive future communications from us directly, please sign up.

Best Regards,

The BayLearn Organizers