We introduce Shape Tokens, a 3D representation that is continuous, compact, and easy to integrate into machine learning models. Shape Tokens serve as conditioning vectors, representing shape information within a 3D flow-matching model. This flow-matching model is trained to approximate probability density functions corresponding to delta functions concentrated on the surfaces of 3D shapes. By incorporating Shape Tokens into various machine learning models, we can generate new shapes, convert images to 3D, align 3D shapes with text and images, and render shapes directly at variable, user-specified resolutions. Additionally, Shape Tokens enable a systematic analysis of geometric properties, including normals, density, and deformation fields. Across tasks and experiments, models built on Shape Tokens achieve strong performance compared to existing baselines.
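To make the flow-matching objective concrete, below is a minimal training sketch, assuming a straight-line probability path between Gaussian noise and surface points and a velocity network `velocity_net` conditioned on Shape Tokens; the names and conventions are illustrative assumptions, not the exact implementation.

```python
import torch

def flow_matching_loss(velocity_net, x_surface, shape_tokens):
    """x_surface: (B, N, 3) points sampled i.i.d. on the shape surface.
    shape_tokens: (B, K, D) conditioning vectors produced by the tokenizer."""
    noise = torch.randn_like(x_surface)                  # sample in "uvw" noise space
    t = torch.rand(x_surface.shape[0], 1, 1, device=x_surface.device)
    x_t = (1.0 - t) * noise + t * x_surface              # straight-line probability path
    target_v = x_surface - noise                         # constant target velocity
    pred_v = velocity_net(x_t, t, shape_tokens)          # per-point velocity estimate
    return torch.mean((pred_v - target_v) ** 2)
```

Minimizing this loss pushes the conditional velocity field toward one whose terminal distribution concentrates on the shape surface, which is what the delta-function interpretation above refers to.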
Figure 1: Our Shape Tokens representation can be readily used as input/output to machine learning models in various applications, including single-image-to-3D (left), neural rendering of normal maps (top right), and 3D-CLIP alignment (bottom right). The resulting models achieve strong performance compared to baselines for the individual tasks.

Video 1: The video shows our single-image-to-3D point cloud results. The images are unseen objects from the Objaverse test set. Each video first shows the input image, then the generated point cloud. [Credits]

Video 2: From the same unseen input image, we generate multiple point clouds independently. [Credits]
Figure 2: Overview of our architecture. (Left) We model a 3D shape as a probability density function concentrated on its surface, i.e., a delta function in 3D. (Right) Our tokenizer uses cross-attention to aggregate information from the point cloud sampled on the shape into Shape Tokens (ST). The velocity estimator uses only cross-attention and MLPs, maintaining independence between points.
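A minimal sketch of the velocity estimator described in the caption, assuming PyTorch and placeholder layer sizes: each noisy point cross-attends to the Shape Tokens and is then processed by an MLP, with no attention between points, so every point is handled independently. This illustrates the design, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class PointVelocityEstimator(nn.Module):
    """Illustrative sketch: points query the Shape Tokens via cross-attention,
    then an MLP predicts a 3D velocity per point. Layer sizes and the time
    embedding are placeholder assumptions."""
    def __init__(self, token_dim=512, hidden=512, heads=8):
        super().__init__()
        self.point_embed = nn.Linear(3 + 1, hidden)          # xyz coordinates + time t
        self.cross_attn = nn.MultiheadAttention(hidden, heads,
                                                kdim=token_dim, vdim=token_dim,
                                                batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(hidden, hidden), nn.GELU(),
                                 nn.Linear(hidden, 3))       # 3D velocity output

    def forward(self, x_t, t, shape_tokens):
        # x_t: (B, N, 3) noisy points, t: (B, 1, 1) time, shape_tokens: (B, K, token_dim)
        q = self.point_embed(torch.cat([x_t, t.expand(-1, x_t.shape[1], 1)], dim=-1))
        h, _ = self.cross_attn(q, shape_tokens, shape_tokens)   # points attend to tokens
        return self.mlp(h)                                       # one velocity per point
```

Because there is no point-to-point attention, the cost scales linearly in the number of points and any number of points can be sampled at inference time.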
Figure 3: Reconstruction, densification, and normal estimation of unseen point clouds in the GSO dataset. For each row, we are given a point cloud containing 16,384 points (xyz only); we compute ST and i.i.d. sample the resulting p(x|s) to obtain 262,144 points. Different columns render the input and the sampled point clouds from different viewpoints. As indicated by the labels in parentheses, we color the input points by their xyz coordinates and the sampled points by their initial noise's uvw coordinates or by their estimated normals (last two columns). Note that we do not provide normals as input to the shape tokenizer.

Video 4: We compute Shape Tokens on input point clouds (16,384 points) from unseen Google Scanned Objects, then sample 16x more points (262,144 points). The video shows the uvw-to-xyz trajectory of the flow-matching sampling process, i.e., the ODE trajectory. We color the points by their initial positions in the noise space (uvw). [Credits]
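Densification of this kind only requires drawing more initial noise samples and integrating the learned ODE. A hedged sketch using a plain Euler solver (the step count and solver choice are assumptions):

```python
import torch

@torch.no_grad()
def sample_points(velocity_net, shape_tokens, num_points=262_144, steps=64):
    """Sketch of i.i.d. sampling from p(x | s): draw Gaussian noise in "uvw"
    space and integrate the learned velocity field to the data ("xyz") space."""
    B = shape_tokens.shape[0]
    x = torch.randn(B, num_points, 3, device=shape_tokens.device)  # initial uvw noise
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full((B, 1, 1), i * dt, device=x.device)
        x = x + dt * velocity_net(x, t, shape_tokens)              # Euler update
    return x  # points approximately on the shape surface
```

Running the same integration in reverse, from the data points at t = 1 back to t = 0, gives the data-to-noise (xyz-to-uvw) mapping visualized in Figure 4.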
Figure 4: The ODE integration trajectory defines a mapping from xyz (data) to uvw (noise).

Video 3: The video shows recent single-image-to-3D methods on Google Scanned Objects, which are unseen by all methods.
From left to right:
Input image
Splatter-image (CVPR 2024): trained on Objaverse
Point-e (2022): trained on several million proprietary 3D meshes.
Make-a-shape (ICML 2024): trained on 18 datasets, including Objaverse
Ours: trained on Objaverse
Note that this video is not intended as a comparison between individual methods: these models differ in their training data (e.g., Point-e was trained on proprietary 3D meshes) and mechanisms (e.g., Splatter-image is not a generative model, and our method assumes a known camera model). We provide the results for the viewer's reference. [Credits]

Video 5: The video shows neural rendering results on unseen point clouds. From Shape Tokens, we use a neural network to independently estimate each ray's intersection point and surface normal.
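A sketch of a per-ray rendering head along the lines of the Video 5 description, assuming a cross-attention-plus-MLP design analogous to the velocity estimator; the layer sizes, the 6-channel output, and the absence of hit/miss handling are simplifying assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RayIntersectionHead(nn.Module):
    """Illustrative sketch: each ray (origin, direction) attends to the Shape
    Tokens, and an MLP predicts the ray-surface intersection point and its
    normal. Rays are processed independently, so output resolution is set
    simply by how many rays are cast."""
    def __init__(self, token_dim=512, hidden=512, heads=8):
        super().__init__()
        self.ray_embed = nn.Linear(6, hidden)                 # origin (3) + direction (3)
        self.cross_attn = nn.MultiheadAttention(hidden, heads,
                                                kdim=token_dim, vdim=token_dim,
                                                batch_first=True)
        self.head = nn.Sequential(nn.Linear(hidden, hidden), nn.GELU(),
                                  nn.Linear(hidden, 6))       # hit point xyz + normal

    def forward(self, ray_o, ray_d, shape_tokens):
        # ray_o, ray_d: (B, R, 3); shape_tokens: (B, K, token_dim)
        q = self.ray_embed(torch.cat([ray_o, ray_d], dim=-1))
        h, _ = self.cross_attn(q, shape_tokens, shape_tokens)  # rays query the tokens
        out = self.head(h)
        hit_xyz = out[..., :3]
        normal = F.normalize(out[..., 3:], dim=-1)             # unit-length normal
        return hit_xyz, normal
```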