


This work investigates pre-trained audio representations for few-shot Sound Event Detection. We specifically address the task of few-shot detection of novel acoustic sequences, or sound events, that have semantically meaningful temporal structure, without assuming access to non-target audio. We develop procedures for pre-training suitable representations, as well as methods that transfer them to our few-shot learning scenario. Our experiments evaluate the general-purpose utility of our pre-trained representations on AudioSet, and the utility of the proposed few-shot methods via tasks constructed from real-world acoustic sequences. Our pre-trained embeddings are suitable for the proposed task and enable multiple aspects of our few-shot framework.
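The abstract does not describe a specific detection procedure. As a hypothetical illustration of what detecting a novel sound event from only a few target examples, with no non-target audio, might look like, the sketch below averages the embeddings of the labeled examples into a prototype, scores each query frame by cosine similarity to it, and merges consecutive above-threshold frames into events. The toy embedding vectors, the function names `prototype`, `cosine_scores`, and `detect`, and the threshold value are all assumptions made for this example; this is not the method proposed in the paper.

```python
import math
from typing import List, Optional, Sequence, Tuple

Vector = Sequence[float]


def _normalize(v: Vector) -> List[float]:
    """Scale v to unit L2 norm (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def prototype(support: Sequence[Vector]) -> List[float]:
    """Average the unit-normalized embeddings of the few labeled target examples.

    Hypothetical helper: only positive (target) examples are used, mirroring
    the assumption of no access to non-target audio.
    """
    normed = [_normalize(v) for v in support]
    dim = len(normed[0])
    mean = [sum(v[i] for v in normed) / len(normed) for i in range(dim)]
    return _normalize(mean)


def cosine_scores(frames: Sequence[Vector], proto: Vector) -> List[float]:
    """Cosine similarity between each query-frame embedding and the unit-norm prototype."""
    return [sum(a * b for a, b in zip(_normalize(f), proto)) for f in frames]


def detect(scores: Sequence[float], threshold: float) -> List[Tuple[int, int]]:
    """Merge consecutive frames with score >= threshold into (start, end) spans, end exclusive."""
    events: List[Tuple[int, int]] = []
    start: Optional[int] = None
    for i, s in enumerate(scores):
        if s >= threshold and start is None:
            start = i  # an event begins
        elif s < threshold and start is not None:
            events.append((start, i))  # the event ends before frame i
            start = None
    if start is not None:
        events.append((start, len(scores)))  # event runs to the last frame
    return events


if __name__ == "__main__":
    # Toy 2-D "embeddings" standing in for a pre-trained audio encoder's output.
    support = [[1.0, 0.1], [0.9, 0.0]]  # a few target examples
    frames = [[0.0, 1.0], [1.0, 0.05], [0.95, 0.1], [0.0, 1.0]]  # query frames
    proto = prototype(support)
    print(detect(cosine_scores(frames, proto), threshold=0.8))  # [(1, 3)]
```

A single prototype with a similarity threshold is one of the simplest positive-only designs. It needs no negative examples, but the threshold must be chosen without non-target data, which is exactly the kind of constraint the task setting above imposes.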

Related readings and updates.

Pre-trained Model Representations and their Robustness against Noise for Speech Emotion Analysis

Pre-trained model representations have demonstrated state-of-the-art performance in speech recognition, natural language processing, and other applications. Models such as Bidirectional Encoder Representations from Transformers (BERT) and Hidden-Unit BERT (HuBERT) have enabled generating lexical and acoustic representations, respectively, that benefit speech recognition applications. We investigated the use of pre-trained model representations for…

Generalized Reinforcement Meta Learning for Few-Shot Optimization

We present a generic and flexible Reinforcement Learning (RL)-based meta-learning framework for the problem of few-shot learning. During training, it learns the best optimization algorithm for producing a learner (a ranker, classifier, etc.) by exploiting stable patterns in loss surfaces. Our method implicitly estimates the gradients of a scaled loss function while keeping its general properties intact for parameter updates. Besides providing improved…