Learning to Detect Novel and Fine-Grained Acoustic Sequences Using Pretrained Audio Representations

Authors: Vasudha Kowtha, Miquel Espi Marques, Jonathan Huang, Yichi Zhang, Carlos Avendano

This work investigates pre-trained audio representations for few-shot sound event detection. We specifically address few-shot detection of novel acoustic sequences, that is, sound events with semantically meaningful temporal structure, without assuming access to non-target audio. We develop procedures for pre-training suitable representations, along with methods that transfer them to our few-shot learning scenario. Our experiments evaluate the general-purpose utility of our pre-trained representations on AudioSet, and the utility of the proposed few-shot methods on tasks constructed from real-world acoustic sequences. Our pre-trained embeddings are well suited to the proposed task and enable multiple aspects of our few-shot framework.
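The abstract does not spell out how the pre-trained embeddings are matched against a handful of target examples. As a rough illustration only, the sketch below assumes a simple prototype matcher: the few target embeddings are averaged, and query frames are flagged by cosine similarity. This is a common few-shot baseline and not necessarily the authors' method. The names `fit_prototype`, `detect`, and `margin` are hypothetical, and the embeddings are assumed to come from some pre-trained audio encoder that is not shown. The decision threshold is derived from the target examples alone, mirroring the constraint that no non-target audio is available. The temporal structure of a sequence is not modeled here.

```python
import numpy as np


def l2_normalize(x, eps=1e-9):
    """Scale vectors along the last axis to (approximately) unit length."""
    x = np.asarray(x, dtype=float)
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def fit_prototype(support, margin=0.05):
    """Build a prototype and threshold from target-only examples.

    support: (n_shots, dim) embeddings of the target sound (assumed input).
    margin:  hypothetical slack subtracted from the weakest support match.
    """
    support = l2_normalize(support)
    prototype = l2_normalize(support.mean(axis=0))
    # Cosine similarity of each support example to the prototype.
    sims = support @ prototype
    # No non-target audio is used: the threshold depends only on how
    # tightly the target examples cluster around their own mean.
    threshold = float(sims.min() - margin)
    return prototype, threshold


def detect(query_frames, prototype, threshold):
    """Return a boolean mask marking query frames that match the target."""
    sims = l2_normalize(query_frames) @ prototype
    return sims >= threshold


# Toy 3-dimensional embeddings standing in for real encoder outputs.
support = [[1.0, 0.1, 0.0], [1.0, -0.1, 0.0], [1.0, 0.0, 0.1]]
query = [[0.0, 1.0, 0.0], [1.0, 0.05, 0.0], [0.0, 0.0, 1.0]]

prototype, threshold = fit_prototype(support)
hits = detect(query, prototype, threshold)
print(hits)
```

In this toy run only the second query frame points in roughly the same direction as the support examples, so it is the only one flagged. A practical system would likely also need to aggregate frame-level decisions over time to respect the temporal structure the abstract emphasizes.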

Related readings and updates.

MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training

March 20, 2024 · Research areas: Computer Vision; Speech and Natural Language Processing

In this work, we discuss building performant Multimodal Large Language Models (MLLMs). In particular, we study the importance of various architecture components and data choices. Through careful and comprehensive ablations of the image encoder, the vision language connector, and various pre-training data choices, we identified several crucial design lessons. For example, we demonstrate that for large-scale multimodal pre-training using a careful…

SynthDST: Synthetic Data is All You Need for Few-Shot Dialog State Tracking

March 4, 2024 · Research areas: Data Science and Annotation; Speech and Natural Language Processing · Conference: EACL

In-context learning with Large Language Models (LLMs) has emerged as a promising avenue of research in Dialog State Tracking (DST). However, the best-performing in-context learning methods involve retrieving and adding similar examples to the prompt, requiring access to labeled training data. Procuring such training data for a wide range of domains and applications is time-consuming, expensive, and, at times, infeasible. While zero-shot learning…