
Understanding human motion from video is crucial for applications such as pose estimation, mesh recovery, and action recognition. While state-of-the-art methods predominantly rely on Transformer-based architectures, these approaches have limitations in practical scenarios: they are notably slower when processing a continuous stream of video frames in real time, and they do not adapt to new frame rates. Given these challenges, we propose an attention-free spatiotemporal model for human motion understanding, building upon recent advancements in diagonal state space models. Our model performs comparably to its Transformer-based counterpart while offering added benefits such as adaptability to different video frame rates and faster training on longer sequences with the same number of parameters. Moreover, we demonstrate that our model can be readily adapted to real-time video scenarios, where predictions rely exclusively on the current and prior frames. In such scenarios, during inference, our model is not only several times faster than its causal Transformer-based counterpart but also consistently outperforms it in task accuracy.
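To give a sense of the mechanism behind the properties claimed above, here is a minimal sketch of a single diagonal state space channel run as a causal recurrence. This is an illustrative toy, not the paper's model: the function name, shapes, and the choice to tie the discretization step `delta` to the frame interval are all assumptions made for the example. A diagonal state matrix makes the state update elementwise, so each new frame costs O(N) work, and rescaling `delta` is what allows the same learned parameters to be applied at a different frame rate.

```python
import numpy as np

def diagonal_ssm_scan(u, lam, b, c, delta):
    """Run one diagonal SSM channel causally over a 1-D input sequence.

    u:     (T,) input sequence (e.g., one feature stream from video frames)
    lam:   (N,) complex eigenvalues of the diagonal state matrix
    b, c:  (N,) input and output projections
    delta: scalar step size; tying it to 1/fps is an illustrative
           assumption that models frame-rate adaptability
    """
    # Zero-order-hold discretization of the continuous-time system
    a_bar = np.exp(lam * delta)          # (N,) elementwise, since A is diagonal
    b_bar = (a_bar - 1.0) / lam * b      # (N,)

    x = np.zeros_like(lam)
    ys = []
    for u_k in u:                        # causal: uses only current and prior frames
        x = a_bar * x + b_bar * u_k      # O(N) state update per frame
        ys.append((c * x).real.sum())    # real-valued readout
    return np.array(ys)

# Toy usage: two complex states with decaying (negative real part) eigenvalues
lam = np.array([-0.5 + 1.0j, -1.0 - 2.0j])
b = np.ones(2, dtype=complex)
c = np.ones(2, dtype=complex)
y = diagonal_ssm_scan(np.ones(8), lam, b, c, delta=0.1)
```

At inference time this recurrence processes each incoming frame in constant time with respect to the sequence length, which is the intuition behind the speed advantage over a causal Transformer, whose attention cost grows with the number of past frames.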

Related readings and updates.

Neural Face Video Compression using Multiple Views

Recent advances in deep generative models led to the development of neural face video compression codecs that use an order of magnitude less bandwidth than engineered codecs. These neural codecs reconstruct the current frame by warping a source frame and using a generative model to compensate for imperfections in the warped source frame. Thereby, the warp is encoded and transmitted using a small number of keypoints rather than a dense flow field…

Video Frame Interpolation via Structure-Motion based Iterative Feature Fusion

Video Frame Interpolation synthesizes non-existent images between adjacent frames, with the aim of providing a smooth and consistent visual experience. Two approaches for solving this challenging task are optical flow based and kernel-based methods. In existing works, optical flow based methods can provide accurate point-to-point motion description, however, they lack constraints on object structure. On the contrary, kernel-based methods focus on…