## Accepted Submissions

### Conditional Generation of Synthetic Geospatial Images from Pixel-level and Feature-level Inputs

Xuerong Xiao (Stanford University); Swetava Ganguli (Apple)*; Vipul Pandey (Apple)
Abstract: Training robust supervised deep learning models for many geospatial applications of computer vision is difficult due to a dearth of class-balanced and diverse training data. Conversely, obtaining enough training data for many applications is financially prohibitive or may be infeasible, especially when the application involves modeling rare or extreme events. As an example, detecting infrequent events or changes like road closures, road blocks, etc. is critical to keeping a geospatial mapping service up to date in real time and can significantly improve user experience and, above all, user safety. Synthetically generating data (and labels) using a generative model that can sample from a target distribution and exploit the multi-scale nature of images can be an inexpensive solution to address scarcity of labeled data. Towards this goal, we present a deep conditional generative model, called VAE-Info-cGAN, that combines a Variational Autoencoder (VAE) with a conditional Information Maximizing Generative Adversarial Network (InfoGAN), for synthesizing semantically rich images simultaneously conditioned on a pixel-level condition (PLC) and a macroscopic feature-level condition (FLC). Dimensionally, the PLC can only vary in the channel dimension from the synthesized image and is meant to be a task-specific input. The FLC is modeled as an attribute vector in the latent space of the generated image which controls the contributions of various characteristic attributes germane to the target distribution. Experiments on a GPS trajectories dataset show that the proposed model can accurately generate various forms of spatiotemporal aggregates across different geographic locations while conditioned only on a raster representation of the road network. The primary intended application of the VAE-Info-cGAN is synthetic data (and label) generation for targeted data augmentation for computer vision-based modeling of problems relevant to geospatial analysis and remote sensing.

### Finding Experts in Transformer Models

Xavier Suau Cuadros (Apple Inc.)*; Luca Zappella (Apple Inc.); Nicholas Apostoloff (Apple Inc.)
Abstract: In this work we study the presence of expert units in pre-trained Transformer Models (TM), and how they impact a model's performance. We define expert units to be neurons that are able to classify a concept with a given average precision, where a concept is represented by a binary set of sentences containing the concept (or not). Leveraging the OneSec dataset (Scarlini et al., 2019), we compile a dataset of 1641 concepts that allows diverse expert units in TM to be discovered. We show that expert units are important in several ways: (1) The presence of expert units is correlated (r^2=0.833) with the generalization power of TM, which allows ranking TM without requiring fine-tuning on suites of downstream tasks. We further propose an empirical method to decide how accurate such experts should be to evaluate generalization. (2) The overlap of top experts between concepts provides a sensible way to quantify concept co-learning, which can be used for explainability of unknown concepts. (3) We show how to self-condition off-the-shelf pre-trained language models to generate text with a given concept by forcing the top experts to be active, without requiring re-training the model or using additional parameters.

### Deep Reinforcement Learning at the Edge of the Statistical Precipice

Rishabh Agarwal (Google Research, Brain Team)*; Max Schwarzer (Mila); Pablo Samuel Castro (Google); Aaron Courville (MILA, Université de Montréal); Marc G. Bellemare (Google Brain)
Abstract: Deep reinforcement learning (RL) algorithms are predominantly evaluated by comparing their relative performance on a large suite of tasks. Most published results on deep RL benchmarks compare point estimates of aggregate performance such as mean and median scores across tasks. However, only reporting point estimates ignores the statistical uncertainty implied by the use of a finite number of evaluation runs. Beginning with the Arcade Learning Environment (ALE), the shift towards computationally-demanding benchmarks has led to the practice of evaluating only a handful of runs per task, exacerbating the statistical uncertainty in point estimates. In this paper, we argue that reliable evaluation in the few-run deep RL regime cannot ignore the uncertainty in results without running the risk of slowing down progress in the field. We illustrate this point using a case study on the Atari 100k benchmark, where we find substantial discrepancies between conclusions drawn from point estimates alone versus a more thorough statistical analysis. With the aim of increasing the field's confidence in reported results with a handful of runs, we advocate reporting interval estimates of aggregate performance and propose performance distributions to account for the variability in results, as well as present more robust and efficient aggregate metrics, such as interquartile mean scores, to achieve small uncertainty in results. Using such statistical tools, we scrutinize performance evaluations of existing algorithms on other widely used benchmarks including the ALE, Procgen, and the DeepMind Control Suite, again revealing discrepancies in prior comparisons. Our findings call for a change in how we evaluate performance in deep RL, for which we present a more rigorous evaluation methodology to prevent unreliable results from stagnating the field.
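
To make the two statistical tools named in this abstract concrete, here is a minimal sketch of an interquartile mean and a percentile bootstrap interval. It is an illustration under stated assumptions, not the authors' implementation: the scores are made-up numbers, the function names are hypothetical, the trimming rule is a simplification that drops `floor(n/4)` values from each end, and the resampling is a plain (unstratified) bootstrap over a pooled list of scores.

```python
import random
import statistics


def interquartile_mean(scores):
    """Mean of the middle half of the scores (a quarter trimmed from each end)."""
    ordered = sorted(scores)
    cut = len(ordered) // 4  # simplification: trims floor(n/4) values per side
    return statistics.fmean(ordered[cut:len(ordered) - cut])


def bootstrap_interval(scores, stat, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for stat(scores)."""
    rng = random.Random(seed)
    # Resample with replacement, recompute the statistic, and sort the estimates.
    estimates = sorted(
        stat(rng.choices(scores, k=len(scores))) for _ in range(n_resamples)
    )
    lower = estimates[int((alpha / 2) * n_resamples)]
    upper = estimates[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Hypothetical normalized scores pooled over runs and tasks (made-up numbers).
scores = [0.1, 0.2, 0.25, 0.3, 0.35, 0.4, 0.45, 5.0]
point = interquartile_mean(scores)
low, high = bootstrap_interval(scores, interquartile_mean)
```

The single outlier (5.0) pulls the plain mean well above the interquartile mean, which is the kind of robustness the abstract refers to; the interval reports how much the aggregate could move under resampling of a small number of runs.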

### Implicit vs. Explicit GAN-Based Style Transfer for Continuous Path Keyboard Input Modeling

Akash Mehra (Apple Inc.)*; Jerome Bellegarda (Apple Inc.); Ojas Bapat (Apple Inc.); Hema Koppula (Apple Inc.); Rick Chang (Apple Inc.); Ashish Shrivastava (Apple); Oncel Tuzel (Apple)
Abstract: The success of continuous path keyboard input as an alternative text input modality requires high-quality training data to inform the underlying recognition model. In [1], we have adopted generative adversarial networks (GANs) to augment the training corpus with user-realistic synthetic paths. GAN-driven synthesis makes it possible to emulate the acquisition of enough paths from enough users to learn a model sufficiently robust across a large population. Here we study the influence of different GAN architectures on path quality and diversity. Experiments show that explicit content/style disentanglement resulting from separate style encoding has only a limited impact on end user perception, but implicit and explicit style transfer paradigms are complementary in the kind of user-realistic artifacts they generate. Leveraging multiple GAN strategies thus injects more robustness into the model through broader coverage of user idiosyncrasies across a wide lexical range.

### Accuracy on the Line: on the Strong Correlation Between Out-of-Distribution and In-Distribution Generalization

John P Miller (University of California, Berkeley)*; Rohan Taori (Stanford University); Aditi Raghunathan (Stanford University); Shiori Sagawa (Stanford University); Pang Wei Koh (Stanford University); Vaishaal Shankar (UC Berkeley); Percy Liang (Stanford University); Yair Carmon (Tel Aviv University); Ludwig Schmidt (Toyota Research Institute)
Abstract: For machine learning systems to be reliable, we must understand their performance in unseen, out-of-distribution environments. In this paper, we empirically show that out-of-distribution performance is strongly correlated with in-distribution performance for a wide range of models and distribution shifts. Specifically, we demonstrate strong correlations between in-distribution and out-of-distribution performance on variants of CIFAR-10 & ImageNet, a synthetic pose estimation task derived from YCB objects, FMoW-WILDS satellite imagery classification, and wildlife classification in iWildCam-WILDS. The correlation holds across model architectures, hyperparameters, training set size, and training duration, and is more precise than what is expected from existing domain adaptation theory. To complete the picture, we also investigate cases where the correlation is weaker, for instance some synthetic distribution shifts from CIFAR-10-C and the tissue classification dataset Camelyon17-WILDS. Finally, we provide a candidate theory based on a Gaussian data model that shows how changes in the data covariance arising from distribution shift can affect the observed correlations.

### What Can I Do Here? Learning New Skills by Imagining Visual Affordances

Khazatsky Alexander (UC Berkeley)*; Ashvin V Nair (UC Berkeley)
Abstract: A generalist robot equipped with learned skills must be able to perform many tasks in many different environments. However, zero-shot generalization to new settings is not always possible. When the robot encounters a new environment or object, it may need to finetune some of its previously learned skills to accommodate this change. But crucially, previously learned behaviors and models should still be suitable to accelerate this relearning. In this paper, we aim to study how generative models of possible outcomes can allow a robot to learn visual representations of affordances, so that the robot can sample potentially possible outcomes in new situations, and then further train its policy to achieve those outcomes. In effect, prior data is used to learn what kinds of outcomes may be possible, such that when the robot encounters an unfamiliar setting, it can sample potential outcomes from its model, attempt to reach them, and thereby update both its skills and its outcome model. This approach, visuomotor affordance learning (VAL), can be used to train goal-conditioned policies that operate on raw image inputs, and can rapidly learn to manipulate new objects via our proposed affordance-directed exploration scheme. We show that VAL can utilize prior data to solve real-world tasks such as drawer opening, grasping, and placing objects in new scenes with only five minutes of online experience in the new scene.

### Evolving Reinforcement Learning Algorithms

John D Co-Reyes (UC Berkeley); Yingjie Miao (Google)*; Daiyi Peng (Google Brain); Esteban Real (Google Brain); Sergey Levine (Google); Quoc Le (Google Brain); Honglak Lee (LG AI Research / University of Michigan); Aleksandra Faust (Google Brain)
Abstract: We propose a method for meta-learning reinforcement learning algorithms by searching over the space of computational graphs that compute the loss function for a value-based model-free RL agent to optimize. The learned algorithms are domain-agnostic and generalize to new environments not seen during training. Our method can both learn from scratch and bootstrap off known existing algorithms, like DQN, enabling interpretable modifications which improve performance. We highlight two learned algorithms with good generalization performance over classical control tasks, gridworld type tasks, and Atari games, and release a dataset of 1000 discovered algorithms for further analysis.

### Sensitivity Analysis for Fairness

Aparna R Joshi (Apple)*; Xavier Suau Cuadros (Apple Inc.); Nivedha Sivakumar (Apple Inc.); Luca Zappella (Apple Inc.); Nicholas Apostoloff (Apple Inc.)
Abstract: As the use of deep learning in high-impact domains becomes ubiquitous, it is increasingly important to assess the resilience of the models. One such high-impact domain is that of face recognition, with real-world applications involving images affected by various degradations, such as blur, contrast, or noise. Moreover, images captured across different attributes, such as gender and race, can also challenge the robustness of a face recognition algorithm. While summary statistics suggest that the aggregate performance of face recognition models has continued to improve, Visual Psychophysics Sensitivity Analysis (VPSA) [6] provides a good way to pinpoint the individual causes of failure by way of introducing incremental perturbations in the data. However, different degradations may affect subgroups differently. With the increasing focus on the fairness of face recognition algorithms, we extend VPSA to analyze the ability of the model to perform well for different subgroups of a population, and pinpoint the exact failure modes for a subgroup by measuring targeted robustness.

### Neuroevolution-Enhanced Multi-Objective Optimization for Mixed-Precision Quantization

Santiago Miret (Intel Labs)*; Vui Seng Chua (Intel Labs); Mattias Marder (Intel Corp); Mariano Phielipp (Intel AI Lab); Nilesh Jain (Intel); Somdeb Majumdar (Intel Labs)
Abstract: Mixed-precision quantization is a powerful tool to enable memory and compute savings of neural network workloads by deploying different sets of bit-width precisions on separate compute operations. Recent research has shown significant progress in applying mixed-precision quantization techniques to reduce the memory footprint of various workloads, while also preserving task performance. Prior work, however, has often ignored additional objectives, such as bit-operations, that are important for deployment of workloads on hardware. Here we present a flexible and scalable framework for automated mixed-precision quantization that optimizes multiple objectives. Our framework relies on Neuroevolution-Enhanced Multi-Objective Optimization (NEMO), a novel search method, to find Pareto optimal mixed-precision configurations for memory and bit-operations objectives. Within NEMO, a population is divided into structurally distinct sub-populations (species) which jointly form the Pareto frontier of solutions for the multi-objective problem. At each generation, species are re-sized in proportion to the goodness of their contribution to the Pareto frontier. This allows NEMO to leverage established search techniques and neuroevolution methods to continually improve the goodness of the Pareto frontier. In our experiments we apply a graph-based representation to describe the underlying workload, enabling us to deploy graph neural networks trained by NEMO to find Pareto optimal configurations for various workloads trained on ImageNet. Compared to the state-of-the-art, we achieve competitive results on memory compression and superior results for compute compression for MobileNet-V2, ResNet50 and ResNeXt-101-32x8d. A deeper analysis of the results obtained by NEMO also shows that both the graph representation and the species-based approach are critical in finding effective configurations for all workloads.

### Unsupervised Activity Segmentation by Joint Representation Learning and Online Clustering

Sateesh Kumar (Retrocausal); Sanjay Haresh (Retrocausal, Inc); Awais Ahmed (Retrocausal, Inc.); Andrey Konin (Retrocausal); Zeeshan Zia (Retrocausal, Inc.); Quoc-Huy Tran (Retrocausal, Inc.)*
Abstract: We present a novel approach for unsupervised activity segmentation, which uses video frame clustering as a pretext task and simultaneously performs representation learning and online clustering. This is in contrast with prior works where representation learning and clustering are often performed sequentially. We leverage temporal information in videos by employing temporal optimal transport and temporal coherence loss. In particular, we incorporate a temporal regularization term into the standard optimal transport module, which preserves the temporal order of the activity, yielding the temporal optimal transport module for computing pseudo-label cluster assignments. Next, the temporal coherence loss encourages neighboring video frames to be mapped to nearby points while distant video frames are mapped to farther away points in the embedding space. The combination of these two components results in effective representations for unsupervised activity segmentation. Furthermore, previous methods require storing learned features for the entire dataset before clustering them in an offline manner, whereas our approach processes one mini-batch at a time in an online manner. Extensive evaluations on three public datasets, i.e., 50-Salads, YouTube Instructions, and Breakfast, and our dataset, i.e., Desktop Assembly, show that our approach performs on par with or better than previous methods for unsupervised activity segmentation, despite having significantly lower memory requirements.

### Correspondence between neuroevolution and gradient descent

Stephen Whitelam (Lawrence Berkeley National Lab)*
Abstract: We show analytically that training a neural network by conditioned stochastic mutation or "neuroevolution" of its weights is equivalent, in the limit of small mutations, to gradient descent on the loss function in the presence of Gaussian white noise. Averaged over independent realizations of the learning process, neuroevolution is equivalent to gradient descent on the loss function. We use numerical simulation to show that this correspondence can be observed for finite mutations, for shallow and deep neural networks. Our results provide a connection between two families of neural-network training methods that are usually considered to be fundamentally different.
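
As a toy illustration of the two training procedures this abstract compares, the sketch below runs a mutation-based search and plain gradient descent on the same loss. Everything in it is an assumption made for illustration: the two-parameter quadratic loss stands in for a neural network, the function names are hypothetical, and the acceptance rule (keep a mutation only if the loss does not increase) is one simple choice of "conditioning" that may differ from the rule analyzed in the paper.

```python
import random


def loss(w):
    """Toy quadratic loss with its minimum at w = (1, -2)."""
    return (w[0] - 1.0) ** 2 + (w[1] + 2.0) ** 2


def neuroevolution(w, steps=5000, sigma=0.05, seed=0):
    """Propose small Gaussian mutations; keep those that do not raise the loss."""
    rng = random.Random(seed)
    current = loss(w)
    for _ in range(steps):
        candidate = [wi + rng.gauss(0.0, sigma) for wi in w]
        candidate_loss = loss(candidate)
        if candidate_loss <= current:  # assumed acceptance condition
            w, current = candidate, candidate_loss
    return w, current


def gradient_descent(w, steps=5000, lr=0.01):
    """Plain gradient descent on the same loss, using the analytic gradient."""
    for _ in range(steps):
        grad = [2.0 * (w[0] - 1.0), 2.0 * (w[1] + 2.0)]
        w = [wi - lr * gi for wi, gi in zip(w, grad)]
    return w, loss(w)


w_evo, loss_evo = neuroevolution([0.0, 0.0])
w_gd, loss_gd = gradient_descent([0.0, 0.0])
```

Neither procedure sees the other's information (the mutation search never evaluates a gradient), yet both move the weights toward the same minimum, which is the qualitative behavior the stated correspondence would lead one to expect.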

### Evaluating the fairness of fine-tuning strategies in self-supervised learning

Jason Ramapuram (Apple Inc)*; Dan Busbridge (Apple); Russ Webb (Apple)
Abstract: In this work we examine how fine-tuning impacts the fairness of contrastive Self-Supervised Learning (SSL) models. Our findings indicate that Batch Normalization (BN) statistics play a crucial role, and that updating only the BN statistics of a pre-trained SSL backbone improves its downstream fairness (36% worst subgroup, 25% mean subgroup gap). This procedure is competitive with supervised learning, while taking 4.4× less time to train and requiring only 0.35% as many parameters to be updated. Finally, inspired by recent work in supervised learning, we find that updating BN statistics and training residual skip connections (12.3% of the parameters) achieves parity with a fully fine-tuned model, while taking 1.33× less time to train.

### SkillRank: Ranking Skills in Job Market

Abstract: We propose SkillRank to understand skills in the job market. SkillRank is a way of quantifying skills from a job market's perspective, and it works by taking into account the frequency and weight of interactions between jobs and skills to determine a rough estimate of the trendiness of skills in the job market. Initial results show that SkillRank provides us a novel and quantitative way to understand skills in the job market.

### Offline Reinforcement Learning for Mobile Notifications

Abstract: Mobile notifications have taken a major role in driving and maintaining user engagement. Most machine learning applications in notifications focus on the near-term impact through signals such as notification click-through rates and immediate user visits, without a principled consideration of long-term user impact. However, a user's experience depends on a sequence of notifications, and their joint impact can have different intermediate and long-term engagement patterns. Thus, these notification deliveries should not be treated as independent decisions on the platform and require coordinated planning for the optimal user experience. In this paper, we propose an offline reinforcement learning framework to optimize sequential notification decisions for long-term user engagement. We describe a state-marginalized importance sampling policy evaluation approach, which can be used to evaluate the policy offline and also tune learning hyperparameters. Through simulations that approximate the notifications ecosystem, we demonstrate the performance and benefits of the offline evaluation approach as a part of the reinforcement learning modeling approach. Finally, we collect data through online exploration in the production system, train an offline Double Deep Q-Network and launch a successful policy online.

### Evidential Softmax: A Sparse Multimodal Alternative to Softmax

Phil Chen (Stanford University)*; Masha Itkina (Stanford University); Ransalu Senanayake (Stanford University); Mykel J Kochenderfer (Stanford University)
Abstract: Many applications of generative models rely on the marginalization of their high-dimensional output probability distributions. Normalization functions that yield sparse probability distributions can make exact marginalization more computationally tractable. In this work, we present ev-softmax, a sparse alternative to softmax that preserves the multimodality of probability distributions. We introduce a continuous family of approximations to ev-softmax that have full support and can thus be trained with probabilistic loss functions such as the negative log-likelihood and Kullback-Leibler divergence. We demonstrate that ev-softmax successfully reduces the dimensionality of output sample space while maintaining multimodality.

### Therapeutics Data Commons: Machine Learning Datasets and Tasks for Drug Discovery and Development

Kexin Huang (Harvard University)*; Tianfan Fu (Georgia Institute of Technology); Wenhao Gao (Massachusetts Institute of Technology); Yue Zhao (Carnegie Mellon University); Yusuf Roohani (Stanford University); Jure Leskovec (Stanford University); Connor Coley (MIT); Cao Xiao (IQVIA); Jimeng Sun (UIUC); Marinka Zitnik (Harvard University)
Abstract: Therapeutics machine learning is an emerging field with incredible opportunities for innovation and impact. However, advancement in this field requires the formulation of meaningful learning tasks and careful curation of datasets. Here, we introduce Therapeutics Data Commons (TDC), the first unifying platform to systematically access and evaluate machine learning across the entire range of therapeutics. To date, TDC includes 66 AI-ready datasets spread across 22 learning tasks and spanning the discovery and development of safe and effective medicines. TDC also provides an ecosystem of tools and community resources, including 33 data functions and types of meaningful data splits, 23 strategies for systematic model evaluation, 17 molecule generation oracles, and 29 public leaderboards. All resources are integrated and accessible via an open Python library. We carry out extensive experiments on selected datasets, demonstrating that even the strongest algorithms fall short of solving key therapeutics challenges, including real dataset distributional shifts, multi-scale modeling of heterogeneous data, and robust generalization to novel data points. We envision that TDC can facilitate algorithmic and scientific advances and considerably accelerate machine-learning model development, validation and transition into biomedical and clinical implementation. TDC is an open-science initiative available at https://tdcommons.ai.

### A Unified Approach to Ads Relevance Prediction for Multiple Relevance Objectives

Sharare Zehtabian (University of Central Florida)*; Satyajit Gupte (Pinterest); Tarun Kumar (Pinterest); Sindhu Raghavan (Pinterest)
Abstract: Ads relevance modeling is the problem of predicting relevance of an ad for a given query/context, typically framed as a supervised machine learning problem. A typical ads auction takes ads relevance score as one of its inputs in order to deliver the right ads to users to drive maximum value to them. Ads relevance is defined on either a binary scale or a multi-point scale. Depending on different business and product objectives, the definition of relevance could vary for different features and surfaces in any product. It is not scalable to build different machine learning models to predict contextual relevance for different objectives or definitions. In this paper, we explore a unified approach to contextual ad relevance prediction using deep ordinal classification models. We demonstrate the efficacy of ordinal classification models on a benchmark dataset for search ads relevance use-case using two baselines including multi-class and binary classification.

### Minimizing Communication while Maximizing Performance in Multi-Agent Reinforcement Learning

Varun Kumar (Intel AI Lab)*; Hassam Sheikh (Intel Labs); Somdeb Majumdar (Intel Labs); Mariano Phielipp (Intel AI Lab)
Abstract: Inter-agent communication can significantly increase performance in multi-agent tasks that require co-ordination to achieve a shared goal. Prior work has shown that it is possible to learn inter-agent communication protocols using multi-agent reinforcement learning and message-passing network architectures. However, these models use an unconstrained broadcast communication model, in which an agent communicates with all other agents at every step, even when the task does not require it. In real-world applications, where communication may be limited by system constraints like bandwidth, power and network capacity, one might need to reduce the number of messages that are sent. In this work, we explore a simple method of minimizing communication while maximizing performance in multi-task learning: simultaneously optimizing a task-specific objective and a communication penalty. We show that the objectives can be optimized using Reinforce and the Gumbel-Softmax reparameterization. We introduce two techniques to stabilize training: 50% training and message forwarding. Training with the communication penalty on only 50% of the episodes prevents our models from turning off their outgoing messages. Second, repeating messages received previously helps models retain information, and further improves performance. With these techniques, we show that we can reduce communication by 75% with no loss of performance.

### Revisiting Nearest Neighbors from a signal approximation perspective

Sarath Shekkizhar (University of Southern California)*; Antonio Ortega (University of Southern California)
Abstract: Several machine learning methods leverage the idea of locality by using k-nearest neighbor or epsilon-neighborhood techniques to design pattern recognition models. However, the choice of parameters k/epsilon in these methods is often ad hoc and lacks a clear interpretation. We revisit the problem of neighborhood definition from a sparse signal approximation perspective and propose an improved approach, Non-Negative Kernel regression (NNK). NNK formulates neighborhood selection as a non-negative basis pursuit problem and is adaptive to the local distribution of samples near the data point of interest. NNK neighbors are geometric, robust and exhibit superior performance in neighborhood-based machine learning. We show that NNK classification with features extracted from a self-supervised model achieves 79.8% top-1 ImageNet accuracy, outperforming previous non-parametric benchmarks while requiring no hyperparameter tuning or data augmentation.

### Coupled Gradient Estimators for Discrete Latent Variables

Abstract: Training models with discrete latent variables is challenging due to the high variance of unbiased gradient estimators. While low-variance reparameterized gradients of a continuous relaxation can provide an effective solution, a continuous relaxation is not always available or tractable. Dong et al. (2020) and Yin et al. (2020) introduced a performant estimator that does not rely on continuous relaxations; however, it is limited to binary random variables. We introduce a novel derivation of their estimator based on importance sampling and statistical couplings, which we extend to the categorical setting. Motivated by the construction of a stick-breaking coupling, we introduce gradient estimators based on reparameterizing categorical variables as sequences of binary variables and Rao-Blackwellization. In systematic experiments, we show that our proposed categorical gradient estimators provide state-of-the-art performance, whereas even with additional Rao-Blackwellization, previous estimators (Yin et al., 2019) underperform a simpler REINFORCE with a leave-one-out-baseline estimator (Kool et al., 2019).

Andrew Campbell (AT&T)*; Tri Bui (AT&T)
Abstract: Record linkage is one of the most common tasks in enterprise data analysis and there are many good heuristics available, but the task is complicated when alternative spellings of names are used due to typos or cultural conventions around name orderings. These permutations can be thought of as adversarial transformations that expose blind spots in record matching algorithms. Identifying all possible iterations can be computationally intensive, requiring on-the-fly calculations of all likely permutations, and then checking them against a customer name list. In applications with very tight latency requirements this combinatorial explosion can be prohibitive. To mitigate the need to compute these permutations manually, we propose using embeddings that will automatically collocate permutations in a high dimensional space. Instead of generating permutations of a new name, we simply embed the name once and check the distances to embeddings of the names on the customer list. We restrict our analysis to two-part names (first and last), apply adversarial transformations, and use Mikolov’s doc2vec to embed the names. We find that embeddings achieve accuracy greater than 99% in the task of identifying matching permutations of the same name and consistently outperform the baseline heuristics using Jaro-Winkler and Levenshtein distances.
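
To illustrate why name-order permutations are hard for the edit-distance baselines mentioned in this abstract, here is a minimal sketch of the standard Levenshtein distance applied to two variants of one name. The example names are made up, and the sketch covers only the baseline heuristic; it does not attempt to reproduce the embedding approach the authors propose.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions, or substitutions."""
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,         # delete char_a
                    current[j - 1] + 1,      # insert char_b
                    previous[j - 1] + cost,  # substitute or match
                )
            )
        previous = current
    return previous[-1]


# Hypothetical customer record and two variants of the same name.
record = "john smith"
typo_distance = levenshtein(record, "jon smith")      # one dropped letter
swapped_distance = levenshtein(record, "smith john")  # first/last order swapped
```

A single typo stays close to the original under this metric, while swapping first and last name produces a much larger distance even though both strings refer to the same person, so a distance threshold tuned for typos would miss the swapped form.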

### Soft Calibration Objectives for Neural Networks

Abstract: Optimal decision making requires that classifiers produce uncertainty estimates consistent with their empirical accuracy. However, deep neural networks are often under- or over-confident in their predictions. Consequently, methods have been developed to improve the calibration of their predictive uncertainty both during training and post-hoc. In this work, we propose differentiable losses to improve calibration based on a soft (continuous) version of the binning operation underlying popular calibration-error estimators. When incorporated into training, these soft calibration losses achieve state-of-the-art single-model ECE across multiple datasets with less than 1% decrease in accuracy. For instance, we observe an 82% reduction in ECE (70% relative to the post-hoc rescaled ECE) in exchange for a 0.7% relative decrease in accuracy relative to the cross entropy baseline on CIFAR-100. When incorporated post-training, the soft-binning-based calibration error objective improves upon temperature scaling, a popular recalibration method. Overall, experiments across losses and datasets demonstrate that using calibration-sensitive procedures yields better uncertainty estimates under dataset shift than the standard practice of using a cross entropy loss and post-hoc recalibration methods.
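
For readers unfamiliar with the binning operation this abstract refers to, the sketch below computes the standard hard-binned expected calibration error (ECE) with equal-width bins. It is a minimal illustration with made-up predictions and a hypothetical function name, and it shows only the conventional estimator, not the soft (differentiable) variant the authors propose.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Hard-binned ECE: size-weighted mean of |accuracy - confidence| per bin."""
    bins = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        # Hard assignment to one equal-width bin; confidence 1.0 goes in the last bin.
        index = min(int(conf * n_bins), n_bins - 1)
        bins[index].append((conf, hit))
    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(h for _, h in members) / len(members)
        ece += (len(members) / total) * abs(accuracy - avg_conf)
    return ece


# Made-up predictions: top-class confidence and whether the prediction was right.
confidences = [0.95, 0.95, 0.95, 0.95, 0.55, 0.55]
correct = [1, 1, 1, 0, 1, 0]
ece = expected_calibration_error(confidences, correct)
```

The `int(conf * n_bins)` step is a piecewise-constant function of the confidence, so it has zero gradient almost everywhere; that is the obstacle to using this estimator directly as a training loss and the part a continuous version of binning would need to replace.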

### Sparcle: Spatial Reassignment of spots to cells via maximum likelihood estimation

Sandhya Prabhakaran (Moffitt Cancer Center)*
Abstract: Imaging-based spatial transcriptomics has the power to reveal patterns of single-cell gene expression by detecting mRNA transcripts as individually resolved spots in multiplexed images. However, molecular quantification has been severely limited by the computational challenges of segmenting poorly outlined, overlapping cells, and of overcoming technical noise; the majority of transcripts are routinely discarded because they fall outside the segmentation boundaries. This lost information leads to less accurate gene count matrices and weakens downstream analyses, such as cell type or gene program identification. Here, we present Sparcle, a probabilistic model that reassigns transcripts to cells based on gene covariation patterns and incorporates spatial features such as distance to nucleus. Its utility is shown on multiplexed error-robust fluorescence in situ hybridization (MERFISH), single-molecule FISH (smFISH) data, probabilistic cell typing by In situ Sequencing (pciSeq) and spatially-resolved transcript amplicon readout mapping (STARmap). Sparcle improves transcript assignment, provides clearer per-cell quantification of each gene, better delineation of cell boundaries, and improved cluster assignments.

### Few Shot Dialogue State Tracking using Meta-learning

Saket Dingliwal (Amazon)*; Shuyang Gao (Amazon); Sanchit Agarwal (Amazon); Chien-Wei Lin (Amazon); Tagyoung Chung (Amazon); Dilek Z Hakkani-Tur (Amazon Alexa AI)
Abstract: Dialogue State Tracking (DST) forms a core component of automated chatbot-based systems designed for specific goals such as hotel or taxi reservation and tourist information. With the increasing need to deploy such systems in new domains, solving the problem of zero/few-shot DST has become necessary. There has been a rising trend for learning to transfer knowledge from resource-rich domains to unknown domains with minimal need for additional data. In this work, we explore the merits of meta-learning algorithms for this transfer and hence propose a meta-learner, D-REPTILE, specific to the DST problem. With extensive experimentation, we provide clear evidence of benefits over conventional approaches across different domains, methods, base models and datasets, with significant (5-25%) improvement over the baseline in low-data settings. Our proposed meta-learner is agnostic to the underlying model, and hence any existing state-of-the-art DST system can improve its performance on unknown domains using our training strategy.

### On Local Aggregation in Heterophilic Graphs

Hesham Mostafa (Intel Corporation)*; Marcel Nassar (Intel Corporation); Somdeb Majumdar (Intel Labs)
Abstract: Graph Neural Networks (GNNs) often perform poorly in graph node classification tasks when the graph has low homophily, i.e., adjacent nodes are more likely to have different labels. Several recent GNN extensions claim to address this limitation by increasing the aggregation range of GNN layers, either through multi-hop aggregation, or through long-range aggregation from distant nodes. We experimentally show that these recent methods are superfluous on the commonly used heterophilic datasets as they are outperformed by properly tuned classical GNNs and multi-layer perceptrons (MLPs). We also show that the homophily metric is a poor predictor of the performance of GNNs. Instead, we propose the *Neighborhood Information Content* (NIC) metric, which is a novel information-theoretic graph metric that is more relevant for GNNs as it directly quantifies the label-relevant information in a node's neighborhood. We show that, empirically, NIC correlates better with GNN accuracy in node classification tasks than homophily.
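
For readers unfamiliar with the homophily metric that the abstract criticizes, a minimal sketch follows. The abstract does not say which definition is used; edge homophily (the fraction of edges joining same-label nodes) is one common choice and is assumed here. The function name and the toy graph are hypothetical.

```python
def edge_homophily(edges, labels):
    """Fraction of edges whose two endpoints share a label."""
    if not edges:
        raise ValueError("graph has no edges")
    same = sum(1 for u, v in edges if labels[u] == labels[v])
    return same / len(edges)

# Toy 4-node path graph 0-1-2-3 with alternating labels: every edge joins
# nodes of different classes, so the graph is fully heterophilic.
edges = [(0, 1), (1, 2), (2, 3)]
labels = [0, 1, 0, 1]
print(edge_homophily(edges, labels))  # 0.0
```

In this toy graph the neighbors' labels determine each node's label exactly even though the homophily is zero, which illustrates why a low homophily score need not mean that neighborhoods are uninformative.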

### Full Surround Monodepth from Multiple Cameras

Vitor Guizilini (Toyota Research Institute); Igor Vasiljevic (Toyota Technological Institute at Chicago)*; Rareș A Ambruș (Toyota Research Institute); Adrien Gaidon (Toyota Research Institute); Greg Shakhnarovich (Toyota Technological Institute at Chicago)
Abstract: Self-supervised monocular depth and ego-motion estimation is a promising approach to replace or supplement expensive depth sensors such as LiDAR for robotics applications like autonomous driving. However, most research in this area focuses on a single monocular camera or stereo pairs that cover only a fraction of the scene around the vehicle. In this work, we extend monocular self-supervised depth and ego-motion estimation to large-baseline multi-camera rigs. Using generalized spatio-temporal contexts, pose consistency constraints, and carefully designed photometric loss masking, we learn a single network generating dense, consistent, and scale-aware point clouds that cover the same full surround 360 degree field of view as a typical LiDAR scanner. We also propose a new scale-consistent evaluation metric more suitable to multi-camera settings. Experiments on two challenging benchmarks illustrate the benefits of our approach over strong baselines.

### The Evolution of Out-of-Distribution Robustness Throughout Fine-Tuning

Abstract: Although machine learning models typically experience a drop in performance on out-of-distribution data, accuracies on in- versus out-of-distribution data are widely observed to follow a single linear trend when evaluated across a testbed of models. Models that are more accurate on the out-of-distribution data relative to this baseline exhibit “effective robustness” and are exceedingly rare. Identifying such models, and understanding their properties, is key to improving out-of-distribution performance. We conduct a thorough empirical investigation of effective robustness during fine-tuning and surprisingly find that models pre-trained on larger datasets exhibit effective robustness during training that vanishes at convergence. We study how properties of the data influence effective robustness, and we show that it increases with larger dataset size, greater diversity, and higher example difficulty. We also find that models that display effective robustness are able to correctly classify 10% of the examples that no other current testbed model gets correct. Finally, we discuss several strategies for scaling effective robustness to the high-accuracy regime to improve the out-of-distribution accuracy of state-of-the-art models.

### Intrinsic Sliced Wasserstein Distances for Comparing Collections of Probability Distributions on Manifolds and Graphs

Raif Rustamov (-); Subhabrata Majumdar (AT&T Labs Research)*
Abstract: Collections of probability distributions arise in a variety of statistical applications ranging from user activity pattern analysis to brain connectomics. In practice, these distributions are represented by histograms over diverse domain types including finite intervals, circles, cylinders, spheres, other manifolds, and graphs. This paper introduces an approach for detecting differences between two collections of histograms over such general domains. We propose the intrinsic slicing construction that yields a novel class of Wasserstein distances on manifolds and graphs. These distances are Hilbert embeddable, allowing us to reduce the histogram collection comparison problem to a more familiar mean testing problem in a Hilbert space. We provide two testing procedures: one based on resampling and another on combining p-values from coordinate-wise tests. Our experiments in a variety of data settings show that the resulting tests are powerful and the p-values are well-calibrated. Example applications to synthetic and real data are provided.

### ML-CI: Machine Learning Confidence Intervals for Covid-19 forecasts

Isabelle Guyon (CNRS, INRIA, University Paris-Saclay and ChaLearn); Alice Lacan (Inria)*
Abstract: Epidemic forecasting has always been challenging, as the recent Covid-19 outbreaks emphasize. We introduce a novel approach to address the problem of evaluating confidence intervals (CI) of time series forecasts for compartmental models, using machine learning. We evaluate our approach using real data of the Covid pandemic from 27 countries. Compartmental models were trained taking into account non-pharmaceutical governmental measures. A Random Forest regressor was trained, using various engineered features, to predict the forecasting error for various horizons on synthetic data, then applied to estimate CI on real data forecasts. Our method outperforms baselines using forecast likelihood as the metric.

### Decomposing Variance from Mini-Batch Order and Parameter Initialization

Abstract: While statistical learning theory often accurately describes the empirical behavior of classical machine learning methods, modern deep learning methods diverge from what is expected. Notably, generalization error (composed of bias, variance, and irreducible error from noise) no longer follows the standard U-shaped curve of the bias-variance tradeoff observed in simpler models. Recent studies of overparameterized models have uncovered double descent -- after diverging at what is known as the interpolation threshold, variance decreases again as more parameters are added. This surprising behavior suggests the need for an improved understanding of the causes of variance.

### LocoProp: Enhancing BackProp via Local Loss Optimization

Abstract: We study a local loss construction approach for optimizing neural networks. We start by motivating the problem as minimizing a squared loss between the pre-activations of each layer and a local target, plus a regularizer term on the weights. The targets are chosen so that the first gradient descent step on the local objectives recovers vanilla BackProp, while the exact solution to each problem results in a preconditioned gradient update. We improve the local loss construction by forming a Bregman divergence in each layer tailored to the transfer function which keeps the local problem convex w.r.t. the weights. The generalized local problem is again solved iteratively by taking small gradient descent steps on the weights, for which the first step recovers BackProp. We run several ablations and show that our construction consistently improves convergence, reducing the gap between first-order and second-order methods.

### Improving robustness of Machine Learning/AI systems through trust scoring frameworks

Bhairav H Mehta (MICROSOFT)*; Deb Mukhopadhyay (MICROSOFT)
Abstract: Effectively evaluating AI/ML model robustness and defenses against adversarial examples has proven to be extremely difficult. Despite the significant amount of recent work attempting to design defenses that withstand adaptive attacks and to score robustness, few have succeeded; no rigorous study has iteratively assessed the effectiveness of various attacks, robustness scoring methods, and defenses against these attacks. In this paper, we discuss the methodological foundations, review commonly accepted best practices, and suggest new methods for evaluating attack, iterative robustness scoring, and defense cycles. We experimented with 6 white box and 7 black box attacks defined in the IBM ART library and tried them on 20 different datasets from Kaggle, mostly of commercial interest. We also used 7 currently available defense techniques, and we were able to derive a model defense framework with pre-processing, adversarial training and post-processing. We orchestrated these steps in ML pipelines so that a developer can iteratively calculate the robustness of their models and apply defense techniques to improve robustness. We hope that both researchers developing defenses as well as readers and reviewers who wish to understand the completeness of an evaluation consider our advice to appreciate the overall AI/ML model robustness lifecycle across multiple attacks and defenses. We provide this robustness cycle framework to assist developers in evaluating and securing their AI models.

### Loop Estimator for Discounted Values in Markov Reward Processes

Falcon Dai (Toyota Technological Institute at Chicago)*; Matthew Walter (Toyota Technological Institute at Chicago)
Abstract: At the working heart of policy iteration algorithms commonly used and studied in the discounted setting of reinforcement learning, the policy evaluation step estimates the value of states with samples from a Markov reward process induced by following a Markov policy in a Markov decision process. We propose a simple and efficient estimator called the *loop estimator* that exploits the regenerative structure of Markov reward processes without explicitly estimating a full model. Our method enjoys a space complexity of $O(1)$ when estimating the value of a single positive recurrent state $s$, unlike TD with $O(S)$ or model-based methods with $O(S^2)$. Moreover, the regenerative structure enables us to show, without relying on the generative model approach, that the estimator has an instance-dependent convergence rate of $\widetilde{O}(\sqrt{\tau_s/T})$ over steps $T$ on a single sample path, where $\tau_s$ is the maximal expected hitting time to state $s$. In preliminary numerical experiments, the loop estimator outperforms model-free methods, such as TD(k), and is competitive with the model-based estimator.
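
A minimal sketch of how regenerative structure can yield a constant-space estimate is given below. It rests on the identity $V(s) = \mathbb{E}[G] + \mathbb{E}[\gamma^{\tau}]\,V(s)$, where $G$ is the discounted reward collected during one loop from $s$ back to $s$ and $\tau$ is the loop length, which gives $V(s) = \mathbb{E}[G] / (1 - \mathbb{E}[\gamma^{\tau}])$. The function name and the plug-in form are assumptions made for illustration; the authors' estimator may differ in detail.

```python
def loop_estimate(states, rewards, s, gamma):
    """Estimate the discounted value of state `s` from one sample path.

    states[t] is the state at step t and rewards[t] the reward received there.
    The path is cut into loops that start at a visit to `s` and end just
    before the next visit. Only a few running sums are stored.
    Assumes 0 <= gamma < 1.
    """
    loops = 0
    sum_return = 0.0     # sum over loops of discounted reward within the loop
    sum_discount = 0.0   # sum over loops of gamma ** (loop length)
    in_loop = False
    partial, discount = 0.0, 1.0

    for state, reward in zip(states, rewards):
        if state == s:
            if in_loop:  # a loop has just closed
                loops += 1
                sum_return += partial
                sum_discount += discount
            in_loop, partial, discount = True, 0.0, 1.0
        if in_loop:
            partial += discount * reward
            discount *= gamma

    if loops == 0:
        raise ValueError("state was not revisited on this path")
    mean_return = sum_return / loops
    mean_discount = sum_discount / loops
    return mean_return / (1.0 - mean_discount)
```

For example, on a deterministic two-state cycle `0 -> 1 -> 0` with reward 1 in state 0, reward 0 in state 1 and `gamma = 0.5`, every loop has `G = 1` and `gamma ** tau = 0.25`, so the estimate equals `1 / 0.75`, which matches the exact value obtained from `V(0) = 1 + 0.25 * V(0)`.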

### Learning Rate Grafting: Transferability of Optimizer Tuning

Abstract: In the empirical science of training large neural networks, the learning rate schedule is a notoriously challenging-to-tune hyperparameter, which can depend on all other properties (architecture, optimizer, batch size, dataset, regularization, ...) of the problem. In this work, we probe the entanglements between the optimizer and the learning rate schedule. We propose the technique of optimizer grafting, which allows for the transfer of the overall implicit step size schedule from a tuned optimizer to a new optimizer, preserving empirical performance. This provides a robust plug-and-play baseline for optimizer comparisons, leading to reductions in the computational cost of optimizer hyperparameter search. Using grafting, we discover a non-adaptive learning rate correction to SGD which allows it to train a BERT model to state-of-the-art performance. Besides providing a resource-saving tool for practitioners, the invariances discovered via grafting shed light on the successes and failure modes of optimizers in deep learning.
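
The abstract does not spell out the update rule. One reading, assumed here, is that at each iteration the step proposed by the tuned optimizer supplies the step *size* while the step proposed by the new optimizer supplies the *direction*. The helper name `graft_step`, the use of the Euclidean norm, and the choice to apply it to a single flat parameter vector are illustrative assumptions.

```python
import math

def graft_step(magnitude_step, direction_step, eps=1e-16):
    """Rescale `direction_step` to have the norm of `magnitude_step`.

    magnitude_step: update proposed by the tuned optimizer (norm is kept)
    direction_step: update proposed by the new optimizer (direction is kept)
    eps guards against division by zero when the direction step vanishes.
    """
    m_norm = math.sqrt(sum(v * v for v in magnitude_step))
    d_norm = math.sqrt(sum(v * v for v in direction_step))
    scale = m_norm / (d_norm + eps)
    return [scale * v for v in direction_step]
```

Under this reading, the grafted update inherits whatever implicit step size schedule the tuned optimizer follows, without the new optimizer's learning rate having to be searched separately.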

### Interval Deep Learning

David Betancourt (Georgia Tech)*; Rafi Muhanna (Georgia Institute of Technology)
Abstract: The use of deep neural networks (DNNs) is becoming more prevalent in important fields such as healthcare, physical sciences, climate change, transportation, and finance, many of which include safety-critical applications. In such applications, the input data to DNNs is generally unstructured, perturbed, unlabeled, sparse or partially missing, and exposed to uncertainty from multiple sources. For any predictive model within a complex problem, not properly quantifying the input uncertainty can lead to inaccurate and even disastrous decision-making. Given the high risks associated with uncertainty, it is desirable to quantify both input and parameter uncertainty in DNNs, while also obtaining prediction uncertainty bounds. However, despite their exceptional prediction capabilities, current DNNs do not have an implicit mechanism to quantify and propagate significant input data uncertainty. Moreover, DNNs resemble frequentist estimators and only produce point estimates. Having point estimates for the parameters is a major drawback for uncertainty quantification because it does not allow reasoning about the uncertainty in the parameters of the model and its corresponding prediction. We argue that often we do not have enough knowledge to make distribution assumptions about the *epistemic uncertainty* in the problem. Thus, for this work, we assume only knowledge about the *interval uncertainty* of the input data in the form of upper and lower bounds, without making distribution assumptions. Given this knowledge, we use rigorous interval analysis (IA) to model the epistemic and aleatory uncertainty in the input data and propagate it through the computations of a DNN. In this work, we present novel *interval deep learning* algorithms capable of quantifying input and parameter uncertainty through IA, and of producing uncertainty prediction bounds with guaranteed enclosures for the model.
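
To make the idea of propagating input intervals through a network concrete, here is a minimal sketch of standard interval arithmetic for one affine layer followed by a ReLU. It is not the authors' algorithm: the function names are hypothetical, the weights are treated as exact point values, and floating-point rounding is ignored, so the bounds are not rigorously guaranteed in the sense of interval analysis with outward rounding.

```python
def interval_affine(lower, upper, weights, bias):
    """Propagate the box [lower, upper] through y = W x + b."""
    out_lo, out_hi = [], []
    for row, b in zip(weights, bias):
        lo = hi = b
        for w, l, u in zip(row, lower, upper):
            # A positive weight maps the lower input bound to the lower output
            # bound; a negative weight swaps the roles of the two bounds.
            if w >= 0:
                lo += w * l
                hi += w * u
            else:
                lo += w * u
                hi += w * l
        out_lo.append(lo)
        out_hi.append(hi)
    return out_lo, out_hi

def interval_relu(lower, upper):
    """ReLU is monotone, so it can be applied to each bound separately."""
    return [max(0.0, l) for l in lower], [max(0.0, u) for u in upper]

# Hypothetical 2-input, 2-output layer with inputs known only to lie in [0, 1].
W = [[1.0, -2.0], [0.5, 0.5]]
b = [0.0, -1.0]
lo, hi = interval_affine([0.0, 0.0], [1.0, 1.0], W, b)
lo, hi = interval_relu(lo, hi)
print(lo, hi)  # [0.0, 0.0] [1.0, 0.0]
```

Every input in the box produces an output inside the printed bounds (up to rounding), without any assumption about how the input is distributed within the box.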

### Distributed Full-batch Training of Graph Neural Networks Using Sequential Rematerialization

Hesham Mostafa (Intel Corporation)*
Abstract: We propose the Sequential Aggregation and Rematerialization (SAR) scheme for distributed full-batch training of Graph Neural Networks (GNNs) on large graphs. The key innovation in SAR is the sequential rematerialization scheme which sequentially re-constructs then frees pieces of the prohibitively expensive GNN computational graph during the backward pass. This results in excellent memory scaling behavior where the memory consumption per worker goes down linearly with the number of workers in a distributed setting, even for densely connected graphs. SAR is the first distributed full-batch GNN training approach that has this memory scaling behavior and is thus the first approach that can scale full-batch GNN training to arbitrarily large and dense graphs by simply adding more workers. Using SAR, we report the largest full-batch GNN training results on commodity hardware.

### A Near-Optimal Method for Minimizing the Maximum of N Convex Loss Functions

Hilal Asi (Stanford University); Yair Carmon (Tel Aviv University); Arun Jambulapati (Stanford University); Yujia Jin (Stanford University)*; Aaron Sidford (Stanford)
Abstract: We give a stochastic first-order method that minimizes $\max_{i\in[N]}f_i(x)$ for convex, Lipschitz $f_1,\cdots,f_N$ with $O(N\epsilon^{-2/3}+\epsilon^{-2})$ queries to first-order oracles. This improves upon the previous best bound of $O(N\epsilon^{-2})$ queries due to subgradient descent and matches a recent lower bound up to poly-logarithmic factors.
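
For context, the subgradient descent baseline mentioned in the abstract can be sketched on a toy problem. The code below illustrates the baseline only, not the proposed method; the one-dimensional instance $f_i(x) = |x - a_i|$, the function name, and the $1/\sqrt{t}$ step size are assumptions chosen for illustration.

```python
import math

def minimize_max_abs(anchors, x0=0.0, steps=2000):
    """Subgradient descent on F(x) = max_i |x - a_i| (a toy 1-D instance)."""
    def value_and_subgradient(x):
        # Evaluating F queries all N functions; a subgradient of the
        # maximizing term is a valid subgradient of F.
        values = [abs(x - a) for a in anchors]
        worst = max(range(len(anchors)), key=values.__getitem__)
        grad = 1.0 if x >= anchors[worst] else -1.0
        return values[worst], grad

    x = x0
    best_x, best_val = x, value_and_subgradient(x)[0]
    for t in range(steps):
        val, grad = value_and_subgradient(x)
        if val < best_val:
            best_x, best_val = x, val
        x -= grad / math.sqrt(t + 1)   # diminishing step size
    return best_x, best_val

best_x, best_val = minimize_max_abs([0.0, 1.0, 4.0])
print(best_x, best_val)  # close to the minimizer 2 and the optimal value 2
```

Each iteration touches all $N$ functions, so reaching accuracy $\epsilon$ in roughly $\epsilon^{-2}$ iterations costs on the order of $N\epsilon^{-2}$ queries, which is the bound the abstract reports improving.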

### Is Pseudo-Lidar needed for Monocular 3D Object detection?

Dennis Park (Toyota Research Institute); Rareș A Ambruș (Toyota Research Institute)*; Vitor Guizilini (Toyota Research Institute); Jie Li (Toyota Research Institute); Adrien Gaidon (Toyota Research Institute)
Abstract: Recent progress in 3D object detection from single images leverages monocular depth estimation as a way to produce 3D pointclouds, turning cameras into pseudo-lidar sensors. These two-stage detectors improve with the accuracy of the intermediate depth estimation network, which can itself be improved without manual labels via large-scale self-supervised learning. However, they tend to suffer from overfitting more than end-to-end methods, are more complex, and the gap with similar lidar-based detectors remains significant. In this work, we propose an end-to-end, single stage, monocular 3D object detector, DD3D, that can benefit from depth pre-training like pseudo-lidar methods, but without their limitations. Our architecture is designed for effective information transfer between depth estimation and 3D detection, allowing us to scale with the amount of unlabeled pre-training data. Our method achieves state-of-the-art results on two challenging benchmarks, with 16.34% and 9.28% AP for Cars and Pedestrians (respectively) on the KITTI-3D benchmark, and 41.5% mAP on NuScenes.

Abstract: We consider a sequential decision-making problem where an agent can take one action at a time and each action has a stochastic temporal extent, i.e., a new action cannot be taken until the previous one is finished. Upon completion, the chosen action yields a stochastic reward. The agent seeks to maximize its cumulative reward over a finite time budget, with the option of "giving up" on a current action (hence forfeiting any reward) in order to choose another action. We cast this problem as a variant of the stochastic multi-armed bandits problem with stochastic resource consumption. For this problem, we first establish that the optimal arm is the one that maximizes the ratio of the expected reward of the arm to the expected waiting time before the agent sees the reward due to pulling that arm. Using a novel upper confidence bound on this ratio, we then introduce an upper-confidence-based algorithm, \starucb{}, for which we establish a logarithmic, problem-dependent regret bound which has an improved dependence on problem parameters compared to previous work.
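
The optimality criterion stated above (expected reward divided by expected waiting time) can be illustrated with a small worked example. The arm names and numbers below are hypothetical, and the sketch assumes the expectations are known; it does not implement the confidence bound or the algorithm from the abstract.

```python
def best_arm_by_rate(arms):
    """Return the arm name with the largest expected reward per unit time."""
    return max(arms, key=lambda name: arms[name][0] / arms[name][1])

# Hypothetical arms: (expected reward, expected waiting time).
arms = {
    "slow_but_rich": (1.0, 2.0),   # rate 0.5
    "fast_but_lean": (0.6, 1.0),   # rate 0.6
}
print(best_arm_by_rate(arms))  # fast_but_lean
```

The arm with the smaller expected reward wins here because it can be pulled twice as often within the same time budget.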