View publication

The composition of objects and their parts, along with object-object positional relationships, provides a rich source of information for representation learning. Hence, spatial-aware pretext tasks have been actively explored in self-supervised learning. Existing works commonly start from a grid structure, where the goal of the pretext task involves predicting the absolute position index of patches within a fixed grid. However, grid-based approaches fall short of capturing the fluid and continuous nature of real-world object compositions. We introduce PART, a self-supervised learning approach that leverages continuous relative transformations between off-grid patches to overcome these limitations. By modeling how parts relate to each other in a continuous space, PART learns the relative composition of images-an off-grid structural relative positioning that is less tied to absolute appearance and can remain coherent under variations such as partial visibility or stylistic changes. In tasks requiring precise spatial understanding such as object detection and time series prediction, PART outperforms grid-based methods like MAE and DropPos, while maintaining competitive performance on global classification tasks. By breaking free from grid constraints, PART opens up a new trajectory for universal self-supervised pretraining across diverse datatypes-from images to EEG signals-with potential in medical imaging, video, and audio.

Related readings and updates.

Real world images often have highly imbalanced content density. Some areas are very uniform, e.g., large patches of blue sky, while other areas are scattered with many small objects. Yet, the commonly used successive grid downsampling strategy in convolutional deep networks treats all areas equally. Hence, small objects are represented in very few spatial locations, leading to worse results in tasks such as segmentation. Intuitively, retaining…

Read more

This paper was accepted at the workshop “Self-Supervised Learning - Theory and Practice” at NeurIPS 2022.

Many state of the art self-supervised learning approaches fundamentally rely on transformations applied to the input in order to selectively extract task-relevant information. Recently, the field of equivariant deep learning has developed to introduce structure into the feature space of deep neural networks, specifically with respect to such…

Read more