
The task of novel view synthesis aims to generate unseen perspectives of an object or scene from a limited set of input images. Synthesizing novel views from a single image, however, remains a significant challenge in computer vision. Previous approaches tackle this problem by adopting mesh prediction, multi-plane image construction, or more advanced techniques such as neural radiance fields. Recently, a pre-trained diffusion model designed for 2D image synthesis has demonstrated its capability to produce photorealistic novel views when sufficiently optimized on a 3D fine-tuning task. Although fidelity and generalizability are greatly improved, training such a powerful backbone requires a notoriously long time and demanding computational costs. To tackle this issue, we propose Efficient-3DiM, a simple but effective framework that remarkably diminishes the training overhead to a manageable scale. Taking inspiration from previous approaches, we start from a large-scale pre-trained text-to-image model (e.g., Stable Diffusion) and fine-tune its denoiser using features extracted from the reference image as conditioning. Motivated by an in-depth visual analysis of the synthesis process, we propose several pragmatic strategies spanning the data level to the algorithm level, including an enhanced noise schedule, a superior 3D feature extractor, and a dataset pruning approach. Combining all these efforts, our final framework reduces the total training cost from 11.6 days to less than 1 day, accelerating the training process by more than 12x on the same computational platform (an instance with 8 NVIDIA A100 GPUs). Comprehensive experiments demonstrate the efficiency and generalizability of the proposed method on several common benchmarks.
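The abstract does not spell out the enhanced noise schedule, so the following is only a minimal sketch of one plausible reading: biasing the sampling of diffusion timesteps during fine-tuning toward higher noise levels, where global 3D structure must be resolved. The `skew` parameter, function names, and the linear beta schedule are illustrative assumptions, not the paper's recipe.

```python
# Hypothetical sketch: skewed timestep sampling for diffusion fine-tuning.
# skew=1.0 recovers uniform sampling; skew>1 spends more of the training
# budget on heavily noised inputs. All names here are illustrative.
import torch

def sample_timesteps(batch_size, num_train_steps=1000, skew=2.0, device="cpu"):
    """Draw timesteps with extra probability mass on noisier (larger) t."""
    u = torch.rand(batch_size, device=device)  # uniform in [0, 1)
    u = u ** (1.0 / skew)                      # push mass toward 1 for skew > 1
    return (u * num_train_steps).long().clamp(max=num_train_steps - 1)

# Standard DDPM forward noising with a linear beta schedule, for context.
num_train_steps = 1000
betas = torch.linspace(1e-4, 2e-2, num_train_steps)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

def add_noise(x0, t):
    """q(x_t | x_0): noise clean latents x0 to timestep t."""
    noise = torch.randn_like(x0)
    a = alphas_cumprod[t].view(-1, 1, 1, 1)
    return a.sqrt() * x0 + (1.0 - a).sqrt() * noise, noise

# Usage: latents for a batch of 4 images
x0 = torch.randn(4, 4, 64, 64)
t = sample_timesteps(4, num_train_steps)
xt, eps = add_noise(x0, t)
```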

Related readings and updates.

Fast and Explicit Neural View Synthesis

We study the problem of novel view synthesis from sparse source observations of a scene composed of 3D objects. We propose a simple yet effective approach that is neither continuous nor implicit, challenging recent trends in view synthesis. Our approach explicitly encodes observations into a volumetric representation that enables amortized rendering. We demonstrate that although continuous radiance field representations have gained a lot of…
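For concreteness, below is a minimal sketch of how rendering from an explicit volumetric representation typically works: alpha compositing per-sample densities and colors along a ray. This is a generic illustration of the idea, not the paper's exact renderer; all names and the sample spacing are assumptions.

```python
# Generic volumetric rendering of one ray by front-to-back alpha compositing.
import torch

def composite_ray(densities, colors, delta=0.1):
    """densities: (N,) non-negative samples along a ray;
    colors: (N, 3) RGB at those samples; delta: spacing between samples."""
    alpha = 1.0 - torch.exp(-densities * delta)            # per-sample opacity
    trans = torch.cumprod(
        torch.cat([torch.ones(1), 1.0 - alpha[:-1]]), 0)   # accumulated transmittance
    weights = alpha * trans
    return (weights.unsqueeze(-1) * colors).sum(0)         # (3,) rendered RGB

# Usage: 32 samples along one ray through a (hypothetical) feature volume
rgb = composite_ray(torch.rand(32), torch.rand(32, 3))
```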

Improving the Realism of Synthetic Images

Most successful examples of neural nets today are trained with supervision. However, to achieve high accuracy, the training sets need to be large, diverse, and accurately annotated, which is costly. An alternative to labeling huge amounts of data is to use synthetic images from a simulator. This is cheap, as there is no labeling cost, but synthetic images may not be realistic enough, resulting in poor generalization to real test images. To help close this performance gap, we've developed a method for refining synthetic images to make them look more realistic. We show that training models on these refined images leads to significant improvements in accuracy on various machine learning tasks.
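The highlight does not detail the refinement method, so the following is only a hedged sketch of one common formulation: train a refiner network with an adversarial loss (push refined images toward the real-image distribution) plus a self-regularization term (stay close to the synthetic input so its annotations remain valid). The tiny architectures, loss weighting, and names are illustrative assumptions.

```python
# Hypothetical sketch of adversarial refinement of synthetic images.
import torch
import torch.nn as nn

refiner = nn.Sequential(                     # small conv net: synthetic -> refined
    nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
    nn.Conv2d(32, 3, 3, padding=1),
)
discriminator = nn.Sequential(               # scores real vs. refined images
    nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
    nn.Flatten(), nn.LazyLinear(1),
)
bce = nn.BCEWithLogitsLoss()
lam = 0.1                                    # weight of the self-regularization term

def refiner_loss(synthetic):
    refined = refiner(synthetic)
    # Adversarial term: refined images should be scored as "real" (label 1).
    adv = bce(discriminator(refined), torch.ones(synthetic.size(0), 1))
    # Self-regularization: preserve content so task labels stay correct.
    reg = (refined - synthetic).abs().mean()
    return adv + lam * reg

# Usage on a dummy batch of 64x64 RGB synthetic images
loss = refiner_loss(torch.rand(8, 3, 64, 64))
loss.backward()
```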
