Contrasting Multiple Representations with the Multi-Marginal Matching Gap
Authors: Zoe Piran, Michal Klein, James Thornton, Marco Cuturi Cameto
Learning meaningful representations of complex objects that can be seen through multiple (k ≥ 3) views or modalities is a core task in machine learning. Existing methods use losses originally intended for paired views, and extend them to k views, either by instantiating k(k−1)/2 loss-pairs, or by using reduced embeddings, following a one vs. average-of-rest strategy. We propose the multi-marginal matching gap (M3G), a loss that borrows tools from multi-marginal optimal transport (MM-OT) theory to simultaneously incorporate all k views. Given a batch of n points, each seen as a k-tuple of views subsequently transformed into k embeddings, our loss contrasts the cost of matching these ground-truth n k-tuples with the MM-OT polymatching cost, which seeks n optimally arranged k-tuples chosen within these n × k vectors. While the exponential complexity O(n^k) of the MM-OT problem may seem daunting, we show in experiments that a suitable generalization of the Sinkhorn algorithm for that problem can scale to, e.g., k = 3 to 6 views using mini-batches of size 64 to 128. Our experiments demonstrate improved performance over multiview extensions of pairwise losses, for both self-supervised and multimodal tasks.
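To make the construction above concrete, the following is a minimal NumPy sketch of a matching-gap loss built on a multi-marginal Sinkhorn iteration. It is not the paper's implementation: the choice of negative inner products as the pairwise cost, the regularization value `eps`, the iteration count, and all function names are illustrative assumptions; it also materializes the full O(n^k) cost tensor, so it is only meant for very small n and k.

```python
import itertools
import numpy as np
from scipy.special import logsumexp

def tuple_cost_tensor(views):
    """Cost tensor C[i1,...,ik] summing a pairwise cost over all view pairs.
    Here the assumed pairwise cost is the negative inner product of embeddings.
    Shape (n,)*k: O(n^k) memory, so this sketch only works for small n, k."""
    k, n = len(views), views[0].shape[0]
    C = np.zeros((n,) * k)
    for a, b in itertools.combinations(range(k), 2):
        pair = -views[a] @ views[b].T          # (n, n) pairwise costs
        shape = [1] * k
        shape[a], shape[b] = n, n
        C += pair.reshape(shape)               # broadcast into the k-tensor
    return C

def mmot_sinkhorn_cost(C, eps=0.5, n_iters=200):
    """Entropy-regularized multi-marginal OT cost, via block coordinate
    updates of the k dual potentials (a generalization of Sinkhorn).
    All marginals are assumed uniform over n points."""
    k, n = C.ndim, C.shape[0]
    f = [np.zeros(n) for _ in range(k)]
    log_mu = -np.log(n) * np.ones(n)           # log of uniform weights

    def log_plan():
        M = -C / eps
        for l in range(k):
            shape = [1] * k
            shape[l] = n
            M = M + (f[l] / eps).reshape(shape)
        return M

    for _ in range(n_iters):
        for j in range(k):
            # project the current log-plan onto the j-th marginal constraint
            axes = tuple(a for a in range(k) if a != j)
            f[j] += eps * (log_mu - logsumexp(log_plan(), axis=axes))
    P = np.exp(log_plan())                     # (approximate) transport plan
    return float((P * C).sum())

def m3g_loss(views, eps=0.5):
    """Matching gap: cost of the ground-truth diagonal k-tuples minus the
    optimal (entropic) polymatching cost. Small when the ground-truth
    matching is already near-optimal."""
    C = tuple_cost_tensor(views)
    n = C.shape[0]
    diag = C[tuple(np.arange(n) for _ in range(C.ndim))].mean()
    return diag - mmot_sinkhorn_cost(C, eps=eps) / n
```

For example, `m3g_loss([z1, z2, z3])` with three `(n, d)` embedding arrays returns a scalar that can be minimized by gradient descent on the encoders producing the embeddings; the full-tensor computation here is purely didactic, and a practical version would operate in log space on mini-batches throughout.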