Accepted Submissions (🎤 = oral)

CALID: Collaborative Accelerate LLM Inference with Draft Model with Filter Decoding

Yifan Hua (University of California, Santa Cruz); Shengze Wang (University of California, Santa Cruz)*; Xiaoxue Zhang (University of Nevada, Reno); Daniel Zhu (The Harker School); Arthur Cheong (Mountain View High School); Karan K Mohindroo (Pierrepont); Allan Dewey (University of California, Santa Cruz); Chen Qian (University of California, Santa Cruz)
Abstract: The burgeoning application of large language models (LLMs) in various AI-driven applications has elevated operational costs due to their reliance on expensive cloud computational resources. In response to this challenge, we propose CALID, a novel framework designed to optimize the utilization of edge-cloud computational resources and inference latency while maintaining the quality of service. By integrating the small language model (SLM) with the negative log-likelihood (NLL) based confidence scoring mechanism, CALID effectively filters preliminary drafts, allowing only those with low confidence scores to be refined by the larger, more computationally expensive LLM. Preliminary evaluations indicate a significant reduction in computational costs with minimal impact on output quality, showcasing CALID's potential to facilitate more sustainable, cost-effective, and high-performance LLM deployment in edge-cloud environments.
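
The filtering idea in the CALID abstract can be sketched as follows. This is only an illustration under stated assumptions, not the authors' implementation: the function names, the use of a per-token mean NLL, and the threshold value of 1.0 are all hypothetical.

```python
import math

def mean_nll(token_probs):
    """Mean negative log-likelihood of a draft's tokens (lower means more confident)."""
    return -sum(math.log(p) for p in token_probs) / len(token_probs)

def needs_refinement(token_probs, nll_threshold=1.0):
    """Hypothetical rule: send the draft to the large model only when the
    small model's mean NLL exceeds the threshold (i.e. confidence is low)."""
    return mean_nll(token_probs) > nll_threshold

# Token probabilities the small model assigned to its own draft (made-up values).
confident_draft = [0.9, 0.8, 0.95]
uncertain_draft = [0.2, 0.1, 0.3]

for name, draft in [("confident", confident_draft), ("uncertain", uncertain_draft)]:
    route = "LLM refinement" if needs_refinement(draft) else "keep SLM draft"
    print(f"{name}: mean NLL = {mean_nll(draft):.3f} -> {route}")
```

In a sketch like this, the threshold trades cost against quality: raising it keeps more drafts on the small model, lowering it sends more to the large one.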

SeMAnD: Self-Supervised Anomaly Detection in Multimodal Geospatial Datasets

Swetava Ganguli (Apple)*; Daria Reshetova (Stanford); C. V. Krishnakumar Iyer (Apple); Vipul Pandey (Apple)
Abstract: We propose a Self-supervised Anomaly Detection technique, called SeMAnD, to detect geometric anomalies in Multimodal geospatial datasets. Geospatial data comprises acquired and derived heterogeneous data modalities that we transform to semantically meaningful, image-like tensors to address the challenges of representation, alignment, and fusion of multimodal data. SeMAnD consists of (i) a simple data augmentation strategy, called RandPolyAugment, capable of generating diverse augmentations of vector geometries, and (ii) a self-supervised training objective with three components that incentivize learning representations of multimodal data that are discriminative to local changes in one modality which are not corroborated by the other modalities. Detecting local defects is crucial for geospatial anomaly detection where even small anomalies (e.g., shifted, incorrectly connected, malformed, or missing polygonal vector geometries like roads, buildings, landcover, etc.) are detrimental to the experience and safety of users of geospatial applications like mapping, routing, search, and recommendation systems. Our empirical study on test sets of different types of real-world geometric geospatial anomalies across 3 diverse geographical regions demonstrates that SeMAnD is able to detect real-world defects and outperforms domain-agnostic anomaly detection strategies by 4.8-19.7% as measured using anomaly classification AUC. We also show that model performance increases (i) up to 20.4% as the number of input modalities increases and (ii) up to 22.9% as the diversity and strength of training data augmentations increase.

Cubist-style image effects with oblique decision trees

Edric Chan (Great Oak High School, Temecula, California); Magzhan Gabidolla (University of California, Merced); Miguel A Carreira-Perpinan (UC Merced)*
Abstract: See submission

Improving the Faithfulness of LLM-based Abstractive Summarization with Span-level Unlikelihood Training

Sicong Huang (University of California, Santa Cruz)*; Qianqi Yan (University of California, Santa Cruz); Shengze Wang (University of California, Santa Cruz); Ian Lane (University of California, Santa Cruz)
Abstract: Abstractive summarization using large language models (LLMs) has become an essential tool for condensing information. Despite their ability to generate fluent summaries, these models often produce texts that are unfaithful to the original documents. Current approaches to mitigating unfaithfulness typically involve post-processing corrections or contrastive learning from synthetically generated negative samples, which do not fully address the spectrum of errors that can arise in LLM-generated summaries. In this paper, we introduce a novel approach to fine-tune LLMs specifically to reduce the occurrence of unfaithful spans of text in generated summaries. We first annotate span-level hallucinations in LLM-generated summaries using automatic labeling with GPT-4. We then fine-tune the LLM using both summaries with no hallucinations and spans of hallucinated text to improve the faithfulness of the model. This paper introduces a dataset labeled to distinguish between faithful and unfaithful content and compares the performance of three techniques: gradient ascent, unlikelihood training, and task vector negation. Our experimental results show that unlikelihood training can effectively use span-level annotations to enhance summary faithfulness, reducing the number of summaries with hallucinations from 31% to 13% (a 58% reduction) on the CNN summarization dataset and from 33% to 20% (a 39% reduction) on the SAMSum dataset.
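
As a rough illustration of span-level unlikelihood training, the sketch below applies the standard likelihood term, -log p, to tokens outside hallucinated spans and the unlikelihood term, -log(1 - p), to tokens inside them. The function name, the equal weighting of the two terms, and the toy probabilities are assumptions for illustration; the abstract does not specify the exact loss used.

```python
import math

def span_unlikelihood_loss(token_probs, hallucinated_mask, eps=1e-12):
    """Average per-token loss.

    token_probs[i]       : model probability of the i-th summary token
    hallucinated_mask[i] : True if token i lies in an annotated hallucinated span
    """
    total = 0.0
    for p, is_hallucinated in zip(token_probs, hallucinated_mask):
        p = min(max(p, eps), 1.0 - eps)  # clamp to keep the logs finite
        if is_hallucinated:
            total += -math.log(1.0 - p)  # unlikelihood: push probability down
        else:
            total += -math.log(p)        # likelihood: push probability up
    return total / len(token_probs)

mask = [False, True]  # second token is inside a hallucinated span
loss_before = span_unlikelihood_loss([0.9, 0.9], mask)  # model still likes the bad token
loss_after = span_unlikelihood_loss([0.9, 0.1], mask)   # model has learned to avoid it
print(f"before: {loss_before:.3f}, after: {loss_after:.3f}")
```

The loss falls as the model assigns less probability to tokens in the flagged span, which is the behavior the fine-tuning aims for.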

Hierarchical data visualization via PCA trees

Miguel A Carreira-Perpinan (UC Merced); Kuat Gazizov (UC Merced)*
Abstract: We propose a new model for dimensionality reduction, the PCA tree, which works like a regular autoencoder, having explicit projection and reconstruction mappings. The projection is effected by a sparse oblique tree, having hard, hyperplane splits using few features and linear leaves. The reconstruction mapping is a set of local linear mappings. Thus, rather than producing a global map as in t-SNE and other methods, which often leads to distortions, it produces a hierarchical set of local PCAs. The use of a sparse oblique tree and of PCA in its leaves makes the overall model interpretable and very fast to project or reconstruct new points. Joint optimization of all the parameters in the tree is a nonconvex nondifferentiable problem. We propose an algorithm that is guaranteed to decrease the error monotonically and which scales to large datasets without any approximation. In experiments, we show PCA trees are able to identify a wealth of low-dimensional and cluster structure in image and document datasets.

Leveraging Spiking Neural Networks for Solar Energy Prediction in Agriculture

Assel Kembay (UC Santa Cruz)*; Rui-Jie Zhu (University of California, Santa Cruz); Nicholas Kuipers (University of California, Santa Cruz); Jason Eshraghian (University of California, Santa Cruz); Colleen Josephson (University of California, Santa Cruz)
Abstract: Can the efficiency of biological neurons revolutionize solar energy prediction in agriculture? Unlike traditional neural networks, Spiking Neural Networks (SNNs) mimic the sparse, event-driven nature of biological neurons, offering superior temporal model capacity and energy efficiency. We propose a deep learning model leveraging weather forecasting data from the National Renewable Energy Laboratory (NREL), eliminating the need for costly solar irradiance meters. Our SNN-based model matches the accuracy of Long Short-Term Memory (LSTM) networks while consuming only 4.3% of the energy. This significant reduction makes SNNs ideal for resource-constrained agricultural deployments. This study shows that brain-inspired computing can lead to more sustainable and efficient energy management in agriculture, transforming renewable energy integration in farming operations.

IntentRec: Predicting User Session Intent in Netflix

Sejoon Oh (Netflix)*; Moumita Bhattacharya (Netflix); Yesu Feng (Netflix); Sudarshan Lamkhede (Netflix)
Abstract: Recommender systems have played a critical role in diverse digital services such as e-commerce, streaming media, social networks, etc. If we know what a user’s intent is in a given session, it becomes easier to provide high-quality recommendations. We introduce IntentRec, a novel recommendation framework based on hierarchical multi-task architecture that tries to estimate a user’s latent intent using their short- and long-term implicit signals as proxies, and uses the intent prediction to predict the next item the user is likely to engage with. By directly leveraging the intent prediction, we can offer accurate and personalized recommendations to users. IntentRec outperforms the SOTA baselines on Netflix user engagement data.

AFEN: Respiratory Disease Classification using Audio Machine Learning

Rahul Nadkarni (University of California, Santa Cruz)*; Emmanouil Nikolakakis (University of California, Santa Cruz); Razvan V Marinescu (UC Santa Cruz)
Abstract: Precision in respiratory disease classification can significantly impact patient outcomes. The complexity of respiratory sounds demands sophisticated approaches that can capture subtle variations and patterns indicative of various conditions. Therefore, developing advanced models integrating diverse methods is essential for enhancing diagnostic capabilities. In this paper, we propose a novel method for respiratory disease classification that advances previous techniques by introducing increased model complexity and including previously overlooked audio features. This method achieves state-of-the-art results in audio respiratory disease classification by employing ensemble learning.

Data Efficiency for Large Recommendation Models

Jingru Xie (Google)*; Kshitij Jain (Google); Kevin Regan (Google)
Abstract: Large recommendation models (LRMs) are fundamental to the multi-billion dollar online advertising industry. The massive scale of data directly impacts both computational costs and R&D velocity. This paper presents actionable principles and high-level frameworks to guide practitioners in optimizing training data requirements. These strategies have been successfully deployed in Google’s largest Ads CTR prediction models and are broadly applicable beyond LRMs. We outline the concept of data convergence, describe methods to accelerate this convergence, and finally, detail how to optimally balance training data volume with model size.

🎤 Integration of Graph Neural Network and Neural-ODEs for Tumor Dynamics Prediction

Omid Bazgir (Genentech)*; James Lu (Genentech); Marc Hafner (Genentech); Ji-won Park (Genentech); Zichen Wang (Genentech)
Abstract: In the development of anti-cancer drugs, a major scientific challenge is disentangling the complex interplay between high-dimensional genomics data derived from patient tumor samples, the organ of origin of the tumor, the drug targets associated with the specified treatments, and the ensuing treatment response. Furthermore, to realize the aspirations of precision medicine in identifying and adjusting treatments for patients depending on the therapeutic response, there is a need for building tumor dynamics models that can integrate the longitudinal tumor size measurements with multimodal, high-throughput data. In this work, we take a step towards enhancing personalized tumor dynamics predictions by proposing a heterogeneous graph encoder that utilizes bipartite Graph Convolutional Neural Networks (GCNs) combined with Neural Ordinary Differential Equations (Neural-ODEs). We apply the methodology to a large collection of patient-derived xenograft (PDX) data, spanning a wide variety of treatments (as well as their combinations) and tumor organs of origin. We first show that the methodology is able to discover a tumor dynamic model that significantly improves upon an empirical model in current use. Additionally, we show that the graph encoder is able to effectively incorporate multimodal data to enhance tumor predictions. Our findings indicate that the methodology holds significant promise and offers potential applications in pre-clinical settings.

Can Large Language Models Explain Themselves? A Study of LLM-Generated Self-Explanations

Shiyuan Huang (University of California, Santa Cruz)*; Siddarth Mamidanna (University of California, Santa Cruz); Shreedhar Jangam (University of California, Santa Cruz); Leilani H Gilpin (UCSC); Yilun Zhou (Salesforce)
Abstract: Large language models (LLMs) have demonstrated superior performance on a variety of natural language processing (NLP) tasks including sentiment analysis, mathematical reasoning and summarization. Furthermore, since these models are instruction-tuned on human conversations to produce "helpful" responses, they can and often will produce explanations along with the response, which we call self-explanations. For example, when analyzing the sentiment of a movie review, the model may output not only the positivity of the sentiment, but also an explanation (e.g., by listing the sentiment-laden words such as "fantastic" and "memorable" in the review). How good are these automatically generated self-explanations? In this paper, we investigate this question on the task of sentiment analysis and for feature attribution explanation, one of the most commonly studied settings in the interpretability literature (for pre-ChatGPT models). Specifically, we study different ways to elicit the self-explanations, evaluate their faithfulness on a set of evaluation metrics, and compare them to traditional explanation methods such as occlusion or LIME saliency maps. Through an extensive set of experiments, we find that ChatGPT's self-explanations perform on par with traditional ones, but are quite different from them according to various agreement metrics, meanwhile being much cheaper to produce (as they are generated along with the prediction). In addition, we identified several interesting characteristics of them, which prompt us to rethink many current model interpretability practices in the era of ChatGPT(-like) LLMs.

AutoEvalTTS: An Automatic Evaluation Framework for Text and Audio Assessment

Michel Wong (Apple)*; Mara Maldonado (Apple); Deepanshu Gupta (Apple)
Abstract: Text-To-Speech (TTS) evaluation for overall voice model quality, speech naturalness, and pronunciation accuracy is traditionally done using crowd-sourcing. This offers relatively high-quality assessment results. However, large-scale evaluation that involves a large number of languages is both costly and labor-intensive, especially for low-resourced languages. Evaluation requires extensive human effort and considerable time investment. While human feedback remains a vital element in the assessment, it is not always scalable or cost-effective for comprehensive testing across languages and scenarios. In this paper, we propose AutoEvalTTS, an auto-evaluation framework that addresses the limitations of traditional methods. AutoEvalTTS offers a high correlation with human-level grading and scales effectively across numerous languages, making it a cost-efficient and time-efficient solution for TTS evaluation. By automating the evaluation process, AutoEvalTTS provides early indications of the quality of voice assets, facilitates more informed decisions, and allows for the comparison of prior models without duplicating the cost and effort of previous evaluations. This innovative approach significantly reduces evaluation costs, accelerates the testing process, and enhances the overall performance of TTS systems, ultimately leading to higher-quality voice models and more efficient development cycles.

Adaptive Softmax Trees for Many-Class Classification

Rasul Kairgeldin (University of California, Merced)*; Miguel A Carreira-Perpinan (UC Merced)
Abstract: NLP tasks such as language models or document classification involve classification problems with thousands of classes. In these situations, it is difficult to get high predictive accuracy and the resulting model can be huge in number of parameters and inference time. A recent, successful approach is the softmax tree (ST): a decision tree having sparse hyperplane splits at the decision nodes (which make hard, not soft, decisions) and small softmax classifiers at the leaves. Inference here is very fast because only a small subset of class probabilities need to be computed, yet the model is quite accurate. However, a significant drawback is that it assumes a complete tree, whose size grows exponentially with depth. We propose a new algorithm to train a ST of arbitrary structure. The tree structure itself is learned optimally by interleaving steps that grow the structure with steps that optimize the parameters of the current structure. This makes it possible to learn STs that can grow much deeper but in an irregular way, adapting to the data distribution. The resulting STs improve considerably the predictive accuracy while reducing the model size and inference time even further, as demonstrated in datasets with thousands of classes. In addition, they are interpretable to some extent.
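
To make the inference path of a softmax tree concrete, here is a minimal sketch under stated assumptions: the dictionary-based tree layout, the `predict` function, and the toy two-leaf tree are hypothetical and only mirror the description above (sparse hyperplane splits with hard decisions, small softmax classifiers at the leaves). Training and structure learning are not shown.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def predict(node, x):
    """Route x down the tree with hard decisions, then score only the leaf's classes."""
    while "leaf" not in node:
        # Sparse hyperplane: only the few features listed in `weights` are read.
        activation = sum(w * x[i] for i, w in node["weights"].items()) + node["bias"]
        node = node["right"] if activation >= 0 else node["left"]
    leaf = node["leaf"]
    scores = [sum(w * xi for w, xi in zip(row, x)) + b
              for row, b in zip(leaf["W"], leaf["b"])]
    return dict(zip(leaf["classes"], softmax(scores)))

# Toy tree: the root splits on feature 0; each leaf handles 2 of the 4 classes.
tree = {
    "weights": {0: 1.0}, "bias": 0.0,
    "left":  {"leaf": {"classes": [0, 1], "W": [[1.0, 0.0], [0.0, 1.0]], "b": [0.0, 0.0]}},
    "right": {"leaf": {"classes": [2, 3], "W": [[1.0, 0.0], [0.0, 1.0]], "b": [0.0, 0.0]}},
}

right_probs = predict(tree, [1.0, 2.0])   # feature 0 >= 0, so the right leaf is used
left_probs = predict(tree, [-1.0, 0.5])   # feature 0 < 0, so the left leaf is used
print(right_probs, left_probs)
```

Each prediction touches one root-to-leaf path and a single small softmax, which is why only a subset of class probabilities needs to be computed.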

🎤 Learning to Route with Confidence Tokens

Yu-Neng Chuang (Rice University); Helen Zhou (Apple Inc.)*; Prathusha Sarma (Apple); John Boccio (Apple Inc.); Oliver Ruiz (Apple Inc.); Sara Bolouki (Apple Inc.)
Abstract: Large language models (LLMs) have demonstrated impressive performance on several tasks, and are increasingly deployed in real-world settings. However, especially in high-stakes settings (e.g. medical advice, digital assistants that can edit or delete personal data, etc.), it is vital to know when an LLM may be unreliable. In this work, we study the extent to which LLMs can reliably tell us whether they are confident in their answers, and how this can translate into downstream accuracy gains. We propose a novel strategy for reliable indication of model confidence through training confidence tokens, and compare this to conventional approaches such as verbalizing confidence and examining model logits. We show how confidence tokens can be used effectively for realistic downstream settings such as learning to route or abstain in the face of uncertainty, and demonstrate significant improvements over alternative approaches. Contributions include: (1) introducing SELF-REF, a lightweight fine-tuning strategy that helps LLMs better learn when they should be confident or uncertain in their predictions; (2) introducing the LLM routing setting, where queries that a model is uncertain about can be optionally routed to a larger model (at some cost to latency); (3) studying the LLM rejection learning setting, where the answer may be “none of the above” (e.g. if a model doesn’t have good actions to take, it should refrain from taking any); and (4) demonstrating that SELF-REF outperforms standard approaches in the downstream settings.

Decima: Decoding gene expression in individual cell types and disease states

Avantika Lal (Genentech)*; Alex M Tseng (Genentech); Nathaniel Diamant (Genentech); Surag Nair (Genentech); Alexander Karollus (Genentech); Gracie Gordon (Genentech); Tommaso Biancalani (Genentech); Gabriele Scalia (Genentech); Gokcen Eraslan (Genentech)
Abstract: The human genome contains over 20,000 genes, whose levels of activity vary across different cell types in the body. The level of activity of a gene is controlled to a large extent by the DNA sequence of the gene and neighboring regions. Here we present Decima, a model that can predict the activity of any gene in a variety of cell types and disease conditions based on its sequence. This model opens new frontiers in the mechanistic understanding of human development and disease, and has numerous therapeutic applications, including the discovery of novel drug targets and the development of improved gene and cell therapies.

🎤 Visual Haystacks: Answering Hard Questions About Sets of Images

Tsung-Han Wu (UC Berkeley)*; Giscard Biamby (UC Berkeley); Jerome Quenum (University of California, Berkeley); Ritwik Gupta (University of California, Berkeley); Joseph E Gonzalez (UC Berkeley); Trevor Darrell (UC Berkeley); David Chan (University of California, Berkeley)
Abstract: Large Multimodal Models (LMMs) excel in single-image visual question answering but struggle with tasks involving extensive image collections, such as photo album searches or satellite imagery monitoring. In response, we’ve focused on the “Multi-Image Question Answering” (MIQA) task, requiring LMMs to retrieve and reason across multiple unrelated images. We found that the current Needle-In-A-Haystack (NIAH) benchmark, a challenge to evaluate long-context processing, primarily tests OCR/textual retrieval and reasoning, which doesn’t fully assess LMMs’ visual data processing capabilities. To better evaluate these models, we’ve introduced the “Visual Haystacks” (VHs) benchmark, which focuses on long-context visual data. Moreover, we developed MIRAGE (Multi-Image Retrieval Augmented Generation), a pioneering system tailored to tackle real-world MIQA challenges, capable of managing scenarios involving up to 10,000 images.

Tool-Augmented Compositional Reasoning LLMs with Weak Supervision: A Scalable Approach to Reduce Human Efforts in Agent Customization

Vishnou Vinayagame (Docugami)*; Gregory Senay (Docugami); Luis Martí (Docugami)
Abstract: Mathematical reasoning capabilities are increasing with tool-augmented language agents, but methods often rely on proprietary models to generate trajectories for training or human efforts for prompt engineering. This work introduces a progressive refinement learning paradigm through self-annotation and weak-supervision. By updating the model's beliefs and evolving from human inputs, the reliance on human supervision and stronger teacher models is minimized, jointly reducing the need to adapt prompting to the models.

Right this way: Can VLMs Guide Us to See More to Answer Questions?

Li Liu (UC SANTA CRUZ); Diji Yang (University of California Santa Cruz); Sijia Zhong (University of California, Santa Cruz); Kalyana Suma Sree Tholeti (University of California, Santa Cruz); Lei Ding (UCSC); Yi Zhang (University of California, Santa Cruz); Leilani H Gilpin (UCSC)*
Abstract: In question-answering scenarios, humans can assess whether the available information is sufficient and seek additional information if necessary, rather than providing a forced answer. In contrast, Vision Language Models (VLMs) typically generate direct, one-shot responses without evaluating the sufficiency of the information. To investigate this gap, we identify a critical yet challenging task in the Visual-Question-Answering (VQA) scenario: can VLMs indicate how to adjust an image when the visual information is insufficient to answer a question? This capability is especially valuable for assisting visually impaired individuals. To evaluate this capability of current VLMs, we introduce a human-labeled dataset as a benchmark for this task. Additionally, we present an automated pipeline that generates synthetic training data by simulating "where to know" scenarios. Our empirical results demonstrate significant performance improvements when the synthetic data is used to fine-tune mainstream VLMs. Our study highlights the potential to bridge the gap between human-like information assessment and acquisition process.

Beyond Item Dissimilarities: Diversifying by Intent in Recommender Systems

Yuyan Wang (Stanford University)*; Cheenar Banerjee (Google); Samer Chucri (Google); Fabio Soldo (Google); Sriraj Badam (Google); Ed H. Chi (Google); Minmin Chen (Google)
Abstract: It has become increasingly clear that recommender systems that overly focus on short-term engagement prevent users from exploring diverse interests. To tackle this challenge, numerous diversification algorithms have been proposed. These algorithms rely on measures of item similarity, aiming to maximize the dissimilarity across items in the final set of recommendations. However, in this work, we demonstrate the benefits of going beyond item-level similarities by utilizing higher-level user understanding (specifically, user intents that persist across multiple interactions or recommendation sessions) in diversification. Our approach is motivated by the observation that user behaviors are driven by their underlying intents. Therefore, recommendations should ensure that a diverse set of user intents is represented. While user intents have primarily been studied in the context of search, it is less clear how to incorporate real-time dynamic intent predictions in the diversification stage. To address this gap, we develop a probabilistic intent-based whole-page diversification framework for the final stage of a recommender system. Starting with a prior belief of user intents, the proposed framework sequentially selects items for each position based on these beliefs and subsequently updates posterior beliefs about the intents. We experiment with the intent diversification framework on YouTube, the world's largest video recommendation platform, serving billions of users daily. Live experiments on a diverse set of intents show that the proposed framework increases Daily Active Users (DAU) and overall user enjoyment, validating its effectiveness in facilitating long-term planning. Specifically, it enables users to consistently discover and engage with diverse content that aligns with their underlying intents over time, leading to an improved long-term user experience.

Improved Microbiome Prediction through Functional Tree Input for Convolutional Neural Networks

Mohammad Soheilypour (Nexilico, Inc.)*; Vladimir Ivanov (Nexilico, Inc.); Wyatt Hartman (Nexilico, Inc.)
Abstract: The gut microbiome is a complex ecosystem of microbial interactions that has been implicated in numerous diseases, underscoring the critical importance of accurately modeling these interactions for the development of effective therapeutics. Microbial interactions are well known to be driven, in large part, by information encoded in microbial genes. Yet, this functional information is not incorporated in current state-of-the-art machine learning methods due to its sparse and high-dimensional nature, which causes model overfitting and poor generalizability when trained on small datasets that are prevalent in microbiome research. To address these limitations, we propose a novel method that encodes high-dimensional functional microbial data into a relational tree format that can be used to train convolutional neural networks. Our approach not only demonstrates better prediction performance compared to existing state-of-the-art methods, but also facilitates the use of various existing interpretation methods to automatically identify important functional groups of microbes. Overall, this work demonstrates a general approach to overcome high-dimensional and low-sample-size constraints that are commonplace to microbiome and other biological domains.

Lucy: Think and Reason to Solve Text-to-SQL

Nina Narodytska (VMware Research)*; Shay Vargaftik (VMware)
Abstract: Large Language Models (LLMs) have made significant progress in assisting users to query databases in natural language. While LLM-based techniques provide state-of-the-art results on many standard benchmarks, their performance significantly drops when applied to large enterprise databases. The reason is that these databases have a large number of tables with complex relationships that are challenging for LLMs to reason about. We analyze challenges that LLMs face in these settings and propose a new solution that combines the power of LLMs in understanding questions with automated reasoning techniques to handle complex database constraints. Based on these ideas, we have developed a new framework that outperforms state-of-the-art techniques in zero-shot text-to-SQL on complex benchmarks.

Detection Machine Revised Text via Style Preference Optimization

Chen Jiaqi (Fudan University)*; Xiaoye Zhu (South China University of Technology); Tianyang Liu (UC San Diego); Xinhui Chen (Independent researcher); Ying Chen (University of Illinois Urbana Champaign); Lei Zhang (University of California, San Diego); Yiwen Yuan (N/A)
Abstract: Large Language Models (LLMs) have revolutionized text generation, making machine-produced content nearly indistinguishable from human writing, thereby making detecting AI-generated text increasingly challenging. Although past research has accurately identified purely machine-generated text, detecting machine-revised text that incorporates human input through prompts remains significantly more difficult. As content may originate from human prompts, detecting machine-revised text (rewriting, expansion, and polishing) often involves identifying distinctive machine styles. We propose measuring the distance between text style distributions and machine style distributions to determine if a text has been machine-revised. To this end, we first introduce Style Preference Optimization (SPO), which aligns a scoring model with machine style, using pairs of texts with the same content but different styles (one human-written and one machine-revised). Then, we use the machine-style scoring model to calculate the style conditional probability curvature, quantifying the log probability difference between the original and conditionally sampled texts for effective detection. We conduct extensive comparisons across a diverse range of scenarios, encompassing text revisions by six LLMs, four distinct text domains and three machine revision types. Compared to existing state-of-the-art methods, our method yields a 13% increase in AUC for detecting text revised by open-source LLMs, and improves performance by 5% and 19% for detecting ChatGPT and GPT-4o revised text, respectively. Notably, our method surpasses the commercially trained GPT-Zero with just 1,000 samples and five minutes of SPO, demonstrating its efficiency and effectiveness.

Learning by Aligning 2D Skeleton Sequences and Multi-Modality Fusion

Quoc-Huy Tran (Retrocausal, Inc.)*; Muhammad Ahmed (Retrocausal); Murad Popattia (Retrocausal); Muhammad Hassan Ahmed (Retrocausal Inc); Andrey Konin (Retrocausal); Zeeshan Zia (Retrocausal, Inc.)
Abstract: This paper presents a self-supervised temporal video alignment framework which is useful for several fine-grained human activity understanding applications. In contrast with the state-of-the-art method of CASA, where sequences of 3D skeleton coordinates are taken directly as input, our key idea is to use sequences of 2D skeleton heatmaps as input. Unlike CASA, which performs self-attention in the temporal domain only, we feed 2D skeleton heatmaps to a video transformer which performs self-attention both in the spatial and temporal domains for extracting effective spatiotemporal and contextual features. In addition, we introduce simple heatmap augmentation techniques based on 2D skeletons for self-supervised learning. Despite the lack of 3D information, our approach achieves not only higher accuracy but also better robustness against missing and noisy keypoints than CASA. Furthermore, extensive evaluations on three public datasets, i.e., Penn Action, IKEA ASM, and H2O, demonstrate that our approach outperforms previous methods in different fine-grained human activity understanding tasks. Finally, fusing 2D skeleton heatmaps with RGB videos yields state-of-the-art results on all metrics and datasets. To the best of our knowledge, our work is the first to utilize 2D skeleton heatmap inputs and the first to explore multi-modality fusion for temporal video alignment.

End-To-End Recommendation Systems with Hybrid Graph Neural Networks

Matthias Fey (TU Dortmund University)*; Yiwen Yuan (kumo.ai); Jure Leskovec (Kumo.AI); Shenyang Huang (Kumo.AI); Jan Eric Lenssen (Kumo.AI); Zecheng Zhang (Kumo.AI); Xinwei He (Kumo.AI); Akihiro Nitta (Kumo.AI); Dong Wang (Kumo.AI); Manan Shah (Kumo.AI)
Abstract: We propose Hybrid-Graph Neural Networks, a novel single-stage GNN-based recommendation system. Hybrid-GNN fuses candidate generation and ranking into a single architecture that jointly predicts both collaborative (local) and content-based (global) recommendations. Local representations are learned for items within the subgraph of an entity (e.g., a user’s past purchases, viewed items, items purchased by friends), incorporating entity-specific knowledge (e.g., the frequency of repeated purchases of this specific entity/item pair). Global item embeddings are learned for items outside the subgraph, shared across all entities. A final network predicts a repetition scalar for each entity (balancing repetition vs. exploration), offsetting the contributions of local rankings. We demonstrate that Hybrid-GNN outperforms existing methods, traditional and GNN-based, on common recommendation tasks.

Evaluating Gender Bias Transfer between Pretrained and Prompt Adapted Language Models

Nivedha Sivakumar (Apple Inc.)*; Natalie Mackraz (Apple); Samira Khorshidi (Apple Inc); Krishna Patel (Apple); Barry-John Theobald (Apple); Luca Zappella (Apple Inc.); Nicholas Apostoloff (Apple Inc.)
Abstract: Large language models (LLMs) are increasingly being adapted to achieve task-specificity for deployment in real-world decision systems. Several previous works have investigated the bias transfer hypothesis (BTH) by studying the effect of adaptation strategies on model fairness. In this work, we expand the study of BTH to causal models under prompt adaptations, as prompting is an accessible and compute-efficient way to deploy models in real-world systems. While previous works find that fairness in pretrained masked language models has limited effect on the fairness of models when adapted using fine-tuning, we establish that intrinsic biases in pretrained Mistral models are correlated (ρ ≥ 0.97) with biases when the same models are zero- and few-shot prompted, using a co-reference resolution task that resolves a gender pronoun with one of two occupations in a sentence.

🎀 Lemur: Integrating Large Language Models in Automated Program Verification

Haoze Wu (Amherst College)*; Clark Barrett (Stanford University); Nina Narodytska (VMware Research)
Abstract: The demonstrated code-understanding capability of LLMs raises the question of whether they can be used for automated program verification, a task that demands high-level abstract reasoning about program properties that is challenging for verification tools. We propose a general methodology to combine the power of LLMs and automated reasoners for automated program verification. We formally describe this methodology as a set of derivation rules and prove its soundness. We instantiate the calculus as a sound automated verification procedure, which led to practical improvements on a set of synthetic and competition benchmarks.

BayesCNS: A Unified Bayesian Approach to Address Cold Start and Non-Stationarity in Large-Scale Search Systems

Randy Ardywibowo (Apple Inc.)*; Rakesh Sunki (Apple Inc.); Lucy Kuo (Apple Inc.); Sankalp Nayak (Apple Inc.)
Abstract: Information Retrieval (IR) systems used in search and recommendation platforms frequently employ Learning-to-Rank (LTR) models to rank items in response to user queries. These models depend significantly on features derived from user interactions, such as clicks and engagement data. This reliance introduces cold start issues for new and tail items as well as non-stationary distribution shifts, as user interaction signals are noisy, sparse, and change dynamically with user behaviors over time. To address these challenges, we propose BayesCNS, a unified Bayesian approach for handling both cold start and non-stationarity in large-scale search systems. BayesCNS is formulated as an online Bayesian learning problem, utilizing an empirical Bayesian framework to establish an informed prior distribution from existing data. This prior is parameterized using deep neural networks, enabling flexible and efficient posterior updates. The method incorporates a Thompson sampling algorithm to facilitate online learning and integrates seamlessly with existing ranking models. This enables ranker-guided online learning, allowing the system to adapt continuously to changing user interactions and feature distributions over time based on guidance provided by a ranker model. We demonstrate the efficacy of BayesCNS through comprehensive offline and online experiments, including an online A/B experiment showing a 10.59% increase in new item interactions and a 1.05% improvement in overall success rate.

Unsupervised End-to-End Task-Oriented Dialogue with LLMs: The Power of the Noisy Channel

Brendan D King (University of California, Santa Cruz)*; Jeffrey Flanigan (University of California, Santa Cruz)
Abstract: Training task-oriented dialogue systems typically requires annotations for tracking interactions with APIs and grounding responses, which are costly and error-prone to produce. With advances in LLMs, we hypothesize that one can build a working dialogue agent in an unsupervised setting by inferring these API interactions from unlabeled dialogues. We propose a novel approach using expectation maximization (EM) with a noisy-channel model to infer all of the turn-level annotations needed for training an agent as latent variables, and then use them to train an end-to-end agent. We find that an unsupervised agent trained using our approach more than doubles the dialogue success rate of a strong zero-shot GPT-3.5 baseline on the MultiWOZ benchmark.

Early Task-Adaptation of Language Models via Importance Sampling during Pretraining

David Grangier (Apple)*; Pierre Ablin (Apple); Angelos Katharopoulos (Apple); Skyler Seto (Apple)
Abstract: Language model pretraining is usually performed in a task-agnostic manner. However, practitioners often compose the pretraining set with implicit assumptions about the end-tasks. We consider the setting where a few examples from the targeted tasks are available to guide the composition of the large pretraining set. We show that sampling the pretraining set with an importance-sampling approach based on k-means clustering gives significant gains compared to task-agnostic pretraining.

Open-Source Molecular Processing Pipeline for Generating Molecules

Shreyas V (Birla Institute of Technology and Science, Pilani – Goa Campus)*; Jose Siguenza (Deep Forest Sciences); Karan C Bania (Birla Institute of Technology and Science, K. K. Birla Goa Campus); Bharath Ramsundar (DeepChem)
Abstract: Generative models for molecules have shown considerable promise for use in computational chemistry, but remain difficult to use for non-experts. For this reason, we introduce open-source infrastructure for easily building generative molecular models into the widely used DeepChem (Ramsundar et al., 2019) library with the aim of creating a robust and reusable molecular generation pipeline. In particular, we add high quality PyTorch (Paszke et al., 2019) implementations of the Molecular Generative Adversarial Networks (MolGAN) (Cao & Kipf, 2022) and Normalizing Flows (Papamakarios et al., 2021). Our implementations show strong performance comparable with past work (Kuznetsov & Polykovskiy, 2021; Cao & Kipf, 2022).

Sigmoid Self-Attention

Russ Webb (Apple)*
Abstract: Attention uses a sequence-to-sequence (seq-to-seq) map to build context-aware token representations. Usually, attention relies on the softmax function (SoftmaxAttn) to recover token representations as data-dependent convex combinations of values. Softmax in SoftmaxAttn can sometimes lead to a concentration of attention on a few features (Yang et al., 2018; Ganea et al., 2019), potentially neglecting informative aspects of the input. Moreover, applying SoftmaxAttn requires performing a reduction along the length of the input sequence, which slows down computation (Dao et al., 2022; Dao, 2023). In this work, we substitute the row-wise softmax operation with an element-wise sigmoid nonlinearity. As we show, it is critical to properly bias softmax to control the attention norms.
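
To make the substitution described in this abstract concrete, here is a minimal pure-Python sketch that contrasts row-wise softmax attention with element-wise sigmoid attention. It is an illustration under stated assumptions, not the author's implementation: the function names are hypothetical, and the default bias of -log(n) for a sequence of n keys is an assumed choice that the abstract does not specify.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def scores(q, k):
    """Scaled dot-product scores: one row per query, one column per key."""
    scale = 1.0 / math.sqrt(len(k[0]))
    return [[scale * sum(a * b for a, b in zip(qi, kj)) for kj in k] for qi in q]

def weighted_sum(weights, v):
    """Combine value vectors using one row of weights per query."""
    width = len(v[0])
    return [[sum(w * vj[c] for w, vj in zip(row, v)) for c in range(width)]
            for row in weights]

def softmax_attention(q, k, v):
    weights = []
    for row in scores(q, k):
        top = max(row)
        exps = [math.exp(s - top) for s in row]
        total = sum(exps)  # reduction along the sequence length
        weights.append([e / total for e in exps])
    return weighted_sum(weights, v)

def sigmoid_attention(q, k, v, bias=None):
    if bias is None:
        bias = -math.log(len(k))  # assumed default, not taken from the abstract
    # Each weight depends only on its own score: no row-wise normalization.
    weights = [[sigmoid(s + bias) for s in row] for row in scores(q, k)]
    return weighted_sum(weights, v)

q = [[0.0, 0.0]]
k = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
v = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
soft_out = softmax_attention(q, k, v)
sig_out = sigmoid_attention(q, k, v)
```

In this toy input all scores are zero, so softmax assigns each key a weight of 1/3, while the sigmoid weights are each 1/(1 + 3) = 0.25 and do not sum to one. The bias term is what keeps the overall magnitude of the sigmoid output under control.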

Utilizing Surrogate Modeling and Evolutionary Policy Search to Discover Effective Policies for Land-Use Planning

Daniel Young (Cognizant AI Labs)*; Olivier Francon (Cognizant AI Labs); Elliot Meyerson (Cognizant); Risto Miikkulainen (UT Austin; Cognizant Technology Solutions); Babak Hodjat (Cognizant AI Labs)
Abstract: Allocation of land for different uses significantly affects carbon balance and climate change. A surrogate model learned from historical land-use changes and carbon emission simulations allows efficient evaluation of such allocations. An evolutionary search then discovers effective land-use policies for specific locations. This system, built on the Project Resilience platform, generates Pareto fronts trading off carbon impact and amount of change customized to different locations, offering a useful tool for land-use planning.

🎀 Teaching an LLM To Explore Optimally

Allen Nie (Stanford University)*; Yi Su (Google); Bo Chang (Google); Ed H. Chi (Google); Quoc Le (Google DeepMind); Minmin Chen (Google)
Abstract: Learning to adapt quickly in a new environment is a crucial form of intelligence. When we make a decision in an unknown situation, we have to decide how to gather the information, balancing between exploration and exploitation. While traditional approaches have focused on well-structured domains, large language models offer a new paradigm for decision-making systems. In this work, we investigate the exploration capabilities of large language models in bandit environments, benchmarking their performance in a comprehensive set of environments against classical algorithms such as Upper-Confidence-Bound and Thompson Sampling. Furthermore, we delve into various strategies aimed at teaching LLMs to optimize exploration in decision-making scenarios.
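
For readers unfamiliar with the classical baselines named in this abstract, the following is a minimal sketch of the standard UCB1 rule on a Bernoulli bandit. The arm means, step count, and function name are illustrative assumptions and are not taken from the paper's benchmark.

```python
import math
import random

def ucb1(arm_means, steps, seed=0):
    """Run UCB1 on a Bernoulli bandit and return how often each arm was pulled."""
    rng = random.Random(seed)
    n_arms = len(arm_means)
    counts = [0] * n_arms    # pulls per arm
    totals = [0.0] * n_arms  # summed reward per arm
    for t in range(1, steps + 1):
        if t <= n_arms:
            arm = t - 1  # pull every arm once to initialize its estimate
        else:
            # Empirical mean plus an exploration bonus that shrinks with pulls.
            arm = max(
                range(n_arms),
                key=lambda a: totals[a] / counts[a]
                + math.sqrt(2.0 * math.log(t) / counts[a]),
            )
        reward = 1.0 if rng.random() < arm_means[arm] else 0.0
        counts[arm] += 1
        totals[arm] += reward
    return counts

pull_counts = ucb1([0.2, 0.5, 0.8], steps=2000)
```

The exploration bonus is large for rarely pulled arms, so every arm keeps being sampled occasionally, while most pulls concentrate on the arm with the highest estimated reward.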

Generative AI with Logical Reasoning

Kalyan Krishnamani (NVIDIA)*
Abstract: The proposed system infuses logical reasoning into Generative AI systems. It is extremely valuable in application domains that are regulated - finance, healthcare, etc., where explainability is a critical requirement. It involves (a) encoding natural language statements as logical formulas, (b) solving the logical formulas using a solver based on mathematical logic, and (c) encoding the proof results back to natural language. (a) and (c) are done using an accessible LLM. A prototype tool, the explain system, implementing the proposed technology, is also presented.

Enhanced Guardrails for Data Security in LLMs

Shubhi Asthana (IBM Research - Almaden)*; Bing Zhang (IBM Research); Anna Lisa Gentile (IBM Research); Chad Deluca (IBM Research); Pawan Chowdhary (IBM Research - Almaden); Jorge Sanz (IBM Research); Guang-Jie Ren (IBM Research)
Abstract: The rise of Large Language Models (LLMs) necessitates the implementation of guardrails to prevent the dissemination of harmful or inappropriate content, ensure the provision of accurate and reliable information within ethical boundaries, and mitigate biases inherent in training data. A critical aspect of these guardrails is the protection of user privacy, especially as the collection of Personally Identifiable Information (PII) intensifies in both consumer and enterprise environments. The task of PII detection becomes increasingly complex as models are trained on extensive datasets, making previously dispersed information accessible through model inference. This paper addresses the challenge of user data protection within the framework of guardrails. We present a suite of methods designed to mitigate privacy risks associated with extensive data usage, applicable during both the training and inference phases. Our approach employs a multi-stage process that integrates entity recognition, context classification, and sophisticated policies aligned with various regulatory standards. We demonstrate the superiority of our solution by benchmarking it against leading open-source PII detectors, showcasing enhanced extraction performance in most cases. Additionally, we provide insights into the real-world deployment of our method, specifically highlighting its application in identifying private data within public GitHub repositories to ensure code-of-conduct compliance. Furthermore, the techniques described in this paper have been utilized internally to mask sensitive information in large-scale training datasets for LLMs.

🎀 Encoded Modern Hopfield Networks - Addressing Practical Considerations for Large Scale Storage

Satyananda Kashyap (IBM Research)*; Niharika S. D'Souza (IBM Research); Luyao Shi (IBM Research); Ken C. L. Wong (IBM Research – Almaden Research Center); Hongzhi Wang (IBM Almaden Research Center); Tanveer Syeda-Mahmood (IBM Research)
Abstract: Content-addressable memories, such as Hopfield networks, have been studied as effective models for associative memory systems in humans. However, their practical application in large-scale storage systems is hindered by challenges like spurious meta-stable states. This paper presents a novel approach to address these challenges by encoding patterns before storage and decoding them upon recall. Experimental results show a substantial reduction in metastable states and increased storage capacity, enabling perfect recall of a significantly larger number of stored elements.

Distribution Agnostic Regression Paradigm for Watch Time Fitting and Prediction

Jiawei Huang (University of Southern California)*; Zhengze Zhou (Meta); Minhan Li (Meta); Yiming Liao (Meta); Ziheng Huang (Meta); Nathan Berrebbi (Meta); James Yang (Meta); Arun Singh (Meta)
Abstract: In the realm of short-form video recommendation systems, understanding and predicting user watch time is crucial due to its direct correlation with user engagement and platform dwell time. Traditional methods, either focusing on pointwise watch time prediction or assuming a specific distribution of watch time, fail to capture the multimodal nature of watch time distribution observed in user behavior on short-form video-sharing platforms. We propose a smoothed softmax loss and a pairwise ranking index inference scheme that can be applied to the distribution agnostic watch time prediction problem and improve the prediction accuracy. Specifically, the smoothed softmax loss utilizes information from watch time distribution rather than a single point, and the pairwise ranking index inference employs the relative order of watch time rather than their absolute values for ranking purposes. Through comprehensive offline evaluation and online experiments, we demonstrate the superior performance of smoothed softmax loss and pairwise ranking index inference. Our approach has been implemented on the Instagram Reels platform, resulting in a significant boost in user engagement and increased platform dwell time due to more accurate watch time prediction.

Enhancing MRI Abdominal Protocol Selection with a Machine-Learning Decision-Support System Utilizing Electronic Health Records

Peyman Shokrollahi (Stanford University)*; Juan Zambrano Chaves (Stanford University); Avishkar Sharma (Stanford University); Jonathan Lam (Stanford University); Debashish Pal (GE Healthcare); Naeim Bahrami (JNJ); Akshay S Chaudhari (Stanford University); Andreas Loening (Stanford University)
Abstract: Inaccurate selection of MRI protocols can impede diagnostic and therapeutic workflows, delay appropriate treatment, increase the likelihood of misdiagnosis, and increase healthcare costs. Here, we developed a machine-learning (ML) based decision-support system to enhance MRI protocol selection. The system integrates various algorithms with an ensemble classifier, utilizing electronic medical records to predict the top-three MRI protocols. We validated the system, achieving a cumulative F1-score of 97.1%. This high level of accuracy demonstrates the system's potential to improve radiologists’ protocol selection.

Improving Open Vocabulary Tagging using Semantic Label Clustering and Maximum Bipartite Matching

Raziuddin Mahmood (Rensselaer Polytechnic Institute); Tanveer Syeda-Mahmood (IBM Research)*
Abstract: The performance of tagging with vision language models (VLMs) decreases as the vocabulary of tags grows. In this paper, we develop a new method for improving image taggers using semantic label clustering and semantic ground truth matching. Specifically, we analyze the tags produced by VLMs and semantically cluster related tags to be represented by a single object category choice. We also derive a semantic match to the ground truth labels using maximum bipartite matching between the semantic encodings of predicted and ground truth labels. Experiments show the effectiveness of the approach, leading to an average improvement of 62.4% in NDCG performance in open vocabulary tagging, as tested on large label benchmark collections.
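
The matching step mentioned in this abstract can be illustrated with a small sketch. It pairs predicted labels with ground truth labels so that the total cosine similarity of their encodings is maximal. Everything here is an assumption for illustration: the two-dimensional embeddings are made up, the function names are hypothetical, and the brute-force search over permutations stands in for whatever solver the authors actually use.

```python
import math
from itertools import permutations

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def best_matching(pred, truth):
    """Maximum-weight bipartite matching by exhaustive search.

    pred and truth map label -> embedding. Assumes len(pred) <= len(truth)
    and small inputs, since the search is factorial in the number of labels.
    """
    pred_labels = list(pred)
    best_pairs, best_score = [], float("-inf")
    for perm in permutations(list(truth), len(pred_labels)):
        score = sum(cosine(pred[p], truth[t]) for p, t in zip(pred_labels, perm))
        if score > best_score:
            best_pairs, best_score = list(zip(pred_labels, perm)), score
    return best_pairs, best_score

# Hypothetical embeddings, chosen only to make the example readable.
predicted = {"puppy": [1.0, 0.1], "automobile": [0.1, 1.0]}
ground_truth = {"car": [0.0, 1.0], "dog": [1.0, 0.0], "tree": [0.7, 0.7]}
pairs, total = best_matching(predicted, ground_truth)
```

Here "puppy" is matched to "dog" and "automobile" to "car", even though neither predicted string appears in the ground truth vocabulary. For realistic label sets a polynomial-time assignment solver would replace the exhaustive search.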

Hybrid LLM Architecture for Advanced In-Vehicle Voice Assistants

Yahya Sowti Khiabani (MBRDNA)*; Faezeh Tafazzoli (MBRDNA)
Abstract: This work introduces an innovative hybrid large language model (LLM) architecture designed to revolutionize in-vehicle voice assistants. Our system addresses the unique challenges of the automotive industry by seamlessly integrating an on-board small language model (SLM) with a powerful cloud-based LLM, delivering a responsive, intelligent, and context-aware user experience. This hybrid architecture represents a significant step forward in applying innovative GenAI techniques to enhance the automotive user experience. By balancing on-board intelligence with cloud-based capabilities, our approach sets new standards for in-vehicle voice assistants, paving the way for more natural, context-aware, and capable automotive AI systems.

Slug Mobile: Test-Bench for RL Testing

Jonathan Wellington Morris (UC Santa Cruz)*; Vishrut Shah (UC Santa Cruz); Alex Besanceney (UC Santa Cruz); Daksh Shah (UC Santa Cruz); Leilani H Gilpin (UCSC)
Abstract: The sim-to-real gap in Reinforcement Learning arises when a model trained in a simulator does not transfer to the real world. This is a problem for Autonomous Vehicles (AVs), as vehicle dynamics can vary from simulation to reality, and also from vehicle to vehicle. Slug Mobile is a one-tenth scale autonomous vehicle created to help address the sim-to-real gap for AVs by acting as a test-bench to develop models that can easily scale from one vehicle to another. In addition to the traditional sensors found in other one-tenth scale AVs, we have also included a Dynamic Vision Sensor so we can train Spiking Neural Networks running on neuromorphic hardware.

Enhancing Temporal Activity Localization through Multimodal Large Language Models

Young Chol Song (Stanford University)*
Abstract: We evaluate the effectiveness of combining image-based and text-based large language models (LLMs) in a two-stage approach for temporal activity localization. Experimental results on a subset of the Charades-STA dataset highlight the potential of multimodal LLMs in advancing the field of temporal activity localization and video understanding.

Does the ‘most sinfully decadent cake ever’ taste good? Answering Yes/No Questions from Figurative Contexts

Geetanjali Rakshit (UC Santa Cruz)*; Jeffrey Flanigan (University of California, Santa Cruz)
Abstract: We investigate the robustness of Question Answering (QA) models on figurative text. Yes/no questions, in particular, are a useful probe of the figurative language understanding capabilities of large language models. We propose FigurativeQA, a set of 1000 yes/no questions with figurative and non-figurative contexts, extracted from the domains of restaurant and product reviews. We show that state-of-the-art BERT-based QA models exhibit an average performance drop of up to 15 percentage points when answering questions from figurative contexts, as compared to non-figurative ones. While models like GPT-3 and ChatGPT are better at handling figurative texts, we show that further performance gains can be achieved by automatically simplifying the figurative contexts into their non-figurative (literal) counterparts. We find that the best overall model is ChatGPT with chain-of-thought prompting to generate non-figurative contexts. Our work provides a promising direction for building more robust QA models with figurative language understanding capabilities.

Encoding Matters: Impact of Categorical Variable Encoding on Performance and Bias

Daniel R Kopp (Rensselaer Polytechnic Institute)*; Benjamin Maudet (Université Paris Saclay); Kristin Bennett (Rensselaer Polytechnic Institute)
Abstract: With the availability of ML packages that automate preprocessing to the greatest extent possible, it is tempting to elude the problem of categorical variable encoding and rely on default settings, e.g. label encoding or one-hot encoding. However, performance and bias are influenced by the choice of encoding in conjunction with the choice of model, the specific type of categorical variable, the number of categories, the size of the dataset, and the particular problem context. In this paper, we empirically revisit, on synthetic and real data, the issue of variable encoding and its impact on modeling bias. We introduce a new metric of feature influence to quantitatively evaluate such effect. The results indicate that contrary to common practice, ordinal and nominal variables should not necessarily be coded differently. Indeed, differences in coding can lead to differences in variable availability to the model, which dominate other effects and lead to modeling bias (and thus potentially to unfairness). We remark that well regularized universal approximators can handle variables encoded identically (regardless of whether they are ordinal or nominal), without adverse effects on generalization, while reducing problems of bias. Various encodings may yield better results in various contexts, so the choice of encoding should be considered a hyper-parameter of the workflow.

Combating Music Streaming Manipulation Fraud With Machine Learning

Sudheer B Tubati (Amazon)*; Amit Goyal (Amazon)
Abstract: The music streaming industry faces streaming manipulation, where bad actors inflate stream counts to boost chart rankings and royalty payments. This fraud diverts revenue from artists and content creators and undermines streaming service integrity. Detecting and mitigating this manipulation is critical for both industry and business. Our research leverages machine learning to identify and combat fraudulent behavior in paid tiers, ensuring fair and transparent royalty payments. We have deployed a high-precision, real-time machine learning model to detect streaming manipulation.

🎀 Do Music Generation Models Encode Music Theory?

Megan Wei (Brown University)*; Michael Freeman (Brown University); Chris Donahue (Carnegie Mellon University); Chen Sun (Brown University)
Abstract: Music foundation models possess impressive music generation capabilities. When people compose music, they may infuse their understanding of music into their work, by using notes and intervals to craft melodies, chords to build progressions, and tempo to create a rhythmic feel. To what extent is this true of music generation models? More specifically, are fundamental Western music theory concepts observable within the "inner workings" of these models? Recent work proposed leveraging latent audio representations from music generation models towards music information retrieval tasks (e.g. genre classification, emotion recognition), which suggests that high-level musical characteristics are encoded in these models. However, probing individual music theory concepts (e.g. tempo, pitch class, chord quality) remains under-explored. Thus, we introduce SynTheory, a synthetic MIDI and audio music theory dataset, consisting of tempos, time signatures, notes, intervals, scales, chords, and chord progressions concepts. We then propose a framework to probe for these music theory concepts in music foundation models (Jukebox and MusicGen) and assess how strongly they encode these concepts within their internal representations. Our findings suggest that music theory concepts are discernible within foundation models and that the degree to which they are detectable varies by model size and layer.


Call for Abstracts

BayLearn 2024 will be an in-person event hosted at Cupertino, CA.

The BayLearn 2024 abstract submission site is now open for submissions:

https://baylearn.org/submissions

The abstract submission deadline has been extended to Aug 5th, 2024 11:59pm PDT.

Please submit abstracts as a 2-page PDF in NeurIPS format. An extra page for acknowledgements and references is allowed.

About BayLearn

The BayLearn Symposium is an annual gathering of machine learning researchers and scientists from the San Francisco Bay Area. While BayLearn promotes community building and technical discussions between local researchers from academic and industrial institutions, it also welcomes visitors. This one-day event combines invited talks, contributed talks, and posters, to foster exchange of ideas.

Meet with fellow Bay Area machine learning researchers and scientists during the symposium, which will be held on Thursday, October 10th, 2024.

Feel free to circulate this invitation to your colleagues and relevant contacts.

Key Dates

We are planning for BayLearn 2024 to be a purely in-person (NOT hybrid) event at Cupertino. Details to be announced.

Submissions

We encourage submission of abstracts. Acceptable material includes work which has already been submitted or published, preliminary results, and controversial findings. We do not intend to publish paper proceedings; only abstracts will be shared through an online repository. Our primary goal is to foster discussion! For examples of previously accepted talks, please watch the paper presentations from previous BayLearn Symposiums: https://baylearn.org/previous

For more information about submissions, please look here:

https://baylearn.org/submissions

Submit your abstracts via CMT:

https://cmt3.research.microsoft.com/BAYLEARN2024

Please use the NeurIPS submission format: https://neurips.cc/Conferences/2023/PaperInformation/StyleFiles