Guiding Cross-Modal Representations with MLLM Priors via Preference Alignment
AuthorsPengfei Zhao, Rongbo Luan, Wei Zhang, Peng Wu, Sifeng He
Despite Contrastive Language-Image Pretraining (CLIP)'s remarkable capability to retrieve content across modalities, a substantial modality gap persists in its feature space. Intriguingly, we find that off-the-shelf Multimodal Large Language Models (MLLMs) exhibit strong inherent modality alignment. While recent MLLM-based retrievers with unified architectures partially mitigate this gap, their reliance on coarse modality alignment mechanisms fundamentally limits their potential. In this work, we introduce MAPLE (Modality-Aligned Preference Learning for Embeddings), a novel framework that leverages the fine-grained alignment priors inherent in MLLMs to guide cross-modal representation learning. MAPLE formulates the learning process as reinforcement learning with two key components: (1) automatic preference-data construction using an off-the-shelf MLLM, and (2) a new Relative Preference Alignment (RPA) loss, which adapts Direct Preference Optimization (DPO) to the embedding-learning setting. Experimental results show that our preference-guided alignment achieves substantial gains in fine-grained cross-modal retrieval, underscoring its effectiveness in handling nuanced semantic distinctions.
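The abstract does not give the exact form of the RPA loss, but a DPO-style preference objective carried over to embeddings can be sketched as follows: replace DPO's log-probability gap between a preferred and a rejected response with the similarity gap between a preferred and a rejected candidate, and push that margin through a scaled log-sigmoid. The function names and the `beta` scale below are illustrative assumptions, not the paper's actual formulation.

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors (plain lists)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rpa_style_loss(query, preferred, rejected, beta=10.0):
    """DPO-style preference loss on similarity scores (hypothetical sketch):
    loss = -log sigmoid(beta * (s(q, d+) - s(q, d-))),
    which is small when the preferred candidate already outranks the
    rejected one by a wide margin, and large when the ranking is inverted."""
    margin = cosine(query, preferred) - cosine(query, rejected)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))
```

In this sketch, MLLM-derived preference pairs would supply `(preferred, rejected)` candidates per query, so the gradient concentrates on the fine-grained distinctions the MLLM judged, rather than on the uniform positives/negatives of a standard contrastive loss.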
EMBridge: Enhancing Gesture Generalization from EMG Signals through Cross-Modal Representation Learning
March 3, 2026 | Research areas: Human-Computer Interaction, Methods and Algorithms | Conference: ICLR
Hand gesture classification using high-quality structured data such as videos, images, and hand skeletons is a well-explored problem in computer vision. Alternatively, leveraging low-power, cost-effective bio-signals, e.g., surface electromyography (sEMG), allows for continuous gesture prediction on wearable devices. In this work, we aim to enhance EMG representation quality by aligning it with embeddings obtained from structured,…
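The abstract describes aligning EMG embeddings with embeddings from structured modalities; a common way to realize such cross-modal alignment is an InfoNCE-style objective that pulls each EMG embedding toward its paired structured-modality embedding and away from other samples in the batch. The sketch below is a generic illustration under that assumption; the function name, the temperature value, and the pure-Python setup are all hypothetical, not EMBridge's actual method.

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors (plain lists)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def emg_alignment_loss(emg_emb, target_embs, pos_idx, temperature=0.07):
    """InfoNCE-style cross-modal alignment (hypothetical sketch):
    treat target_embs[pos_idx] (e.g. a video or skeleton embedding paired
    with this EMG window) as the positive and the rest as negatives."""
    logits = [cosine(emg_emb, t) / temperature for t in target_embs]
    # Numerically stable log-sum-exp over all candidates.
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return -(logits[pos_idx] - log_z)
```

Training the EMG encoder against frozen structured-modality embeddings with a loss of this shape is one standard recipe for transferring the richer modality's structure into the low-cost signal's representation space.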
Promoting Cross-Modal Representations to Improve Multimodal Foundation Models for Physiological Signals
October 28, 2024 | Research area: Methods and Algorithms | Conference: NeurIPS
Many healthcare applications are inherently multimodal, involving several physiological signals. As sensors for these signals become more common, improving machine learning methods for multimodal healthcare data is crucial. Pretraining foundation models is a promising avenue for success. However, methods for developing foundation models in healthcare are still in early exploration, and it is unclear which pretraining strategies are most effective…