LatentHOI: On the Generalizable Hand Object Motion Generation with Latent Hand Diffusion

1University of British Columbia, 2Vector Institute for AI, 3ETH Zurich, 4Canada CIFAR AI Chair, 5NSERC CRC Chair, 6Meta Reality Labs
In CVPR 2025

We introduce a framework to learn generalizable hand object motion generation.

Abstract

Current research on generating 3D hand-object interaction motion primarily focuses on in-domain objects. Generalization to unseen objects is essential for practical applications, yet it remains both challenging and largely unexplored. In this paper, we propose LatentHOI, a novel approach designed to tackle the challenges of generalizing hand-object interaction to unseen objects. Our main insight lies in decoupling high-level temporal motion from fine-grained spatial hand-object interactions with a latent diffusion model coupled with a Grasping Variational Autoencoder (GraspVAE). This configuration not only enhances the conditional dependency between spatial grasp and temporal motion but also improves data utilization and reduces overfitting through regularization in the latent space. We conducted extensive experiments in an unseen-object setting on both single-hand grasping and bi-manual motion datasets, including GRAB, DexYCB, and OakInk. Quantitative and qualitative evaluations demonstrate that our method significantly enhances the realism and physical plausibility of generated motions for unseen objects, both in single and bimanual manipulations, compared to the state of the art.

Pipeline Overview

  • During training, grasp information is encoded into the latent code Z. The joint distribution of the global motion O and Hγ is learned through the diffusion objective.
  • During sampling, given the object point cloud and text, the latent diffusion model produces a sequence of latent grasps (for both hands) as well as the global motion. These frames are then sent into GraspVAE, where they are decoded into bi-manual grasps.
  • More details can be found in the paper.
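The two-stage sampling described above can be sketched as follows. All module names, dimensions, and the DDPM-style update below are illustrative assumptions for exposition, not the paper's actual architecture or training setup:

```python
import numpy as np

rng = np.random.default_rng(0)

LATENT_DIM = 32   # assumed size of a GraspVAE grasp latent (hypothetical)
MOTION_DIM = 9    # assumed per-frame global object motion dims (hypothetical)
T_FRAMES = 16     # number of motion frames to generate
N_STEPS = 50      # diffusion denoising steps

def denoiser(x_t, t, cond):
    """Stand-in for the learned latent-diffusion network: predicts the
    noise in x_t given step t and the conditioning (object point cloud +
    text embedding). Here it is just a small random linear map."""
    W = rng.standard_normal((x_t.shape[-1], x_t.shape[-1])) * 0.01
    return x_t @ W + 0.001 * cond.mean()

def sample_latent_motion(cond):
    """DDPM-style ancestral sampling over the joint sequence of
    per-frame grasp latents (both hands) and global motion O."""
    dim = 2 * LATENT_DIM + MOTION_DIM          # left z + right z + motion
    x = rng.standard_normal((T_FRAMES, dim))   # start from Gaussian noise
    betas = np.linspace(1e-4, 0.02, N_STEPS)
    alphas = 1.0 - betas
    alpha_bar = np.cumprod(alphas)
    for t in reversed(range(N_STEPS)):
        eps = denoiser(x, t, cond)
        # posterior mean of x_{t-1} given the predicted noise
        x = (x - betas[t] / np.sqrt(1.0 - alpha_bar[t]) * eps) / np.sqrt(alphas[t])
        if t > 0:
            x = x + np.sqrt(betas[t]) * rng.standard_normal(x.shape)
    return x

def graspvae_decode(z):
    """Stand-in for the GraspVAE decoder: maps a grasp latent to hand
    parameters (e.g. a MANO-style pose vector); a fixed linear map here."""
    W = np.ones((LATENT_DIM, 45)) / LATENT_DIM  # 45 = assumed hand-pose dims
    return z @ W

cond = rng.standard_normal(128)                 # fake object+text embedding
seq = sample_latent_motion(cond)
z_left = seq[:, :LATENT_DIM]
z_right = seq[:, LATENT_DIM:2 * LATENT_DIM]
motion = seq[:, 2 * LATENT_DIM:]
left_hand, right_hand = graspvae_decode(z_left), graspvae_decode(z_right)
print(left_hand.shape, right_hand.shape, motion.shape)  # (16, 45) (16, 45) (16, 9)
```

The key structural point the sketch conveys is that diffusion operates jointly on the grasp latents and the global motion, while fine-grained hand articulation is recovered only at the end by the (frozen) GraspVAE decoder.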

Quantitative Results

Trained on the GRAB training set, tested on the out-of-distribution (OOD) object splits from GRAB and OakInk.
  • IV: The volume of hand-object interpenetration.
  • ID: The depth of hand-object interpenetration.
  • CR: Contact ratio between the hand and the object surface.
  • IVU: Interpenetration volume per contact unit.
  • Phy: Contact rate while the object is off the ground.
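Assuming each hand vertex carries a signed distance to the object surface (negative inside the object), the per-frame metrics above can be approximated roughly as follows. The thresholds and the per-vertex voxel-volume proxy are illustrative assumptions, not the paper's exact evaluation protocol (Phy is omitted since it needs object-height information across frames):

```python
import numpy as np

def interaction_metrics(signed_dist, voxel_vol=1e-3, contact_thresh=0.005):
    """Approximate interaction metrics from signed distances (meters)
    of hand vertices to the object surface; negative values mean the
    vertex penetrates the object. All constants are assumptions."""
    penetrating = signed_dist < 0
    contact = np.abs(signed_dist) < contact_thresh
    iv = penetrating.sum() * voxel_vol                  # IV: volume proxy
    idepth = -signed_dist.min() if penetrating.any() else 0.0  # ID: max depth
    cr = contact.mean()                                 # CR: contact ratio
    ivu = iv / max(contact.sum(), 1)                    # IVU: volume per contact
    return {"IV": iv, "ID": idepth, "CR": cr, "IVU": ivu}

# Toy example: 6 hand vertices, two slightly inside the object surface
d = np.array([0.01, 0.002, -0.003, -0.001, 0.05, 0.004])
m = interaction_metrics(d)
print(m)  # IV = 0.002, ID = 0.003, CR = 4/6, IVU = 0.0005
```

Lower IV, ID, and IVU indicate less interpenetration, while a higher CR indicates a firmer grasp, so the metrics jointly penalize both floating and intersecting hands.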

Qualitative Results on OakInk Objects

Pour Teapot

Drink from Glass

Peel with Knife

Use Phone

More Qualitative Results

Pick up the camera.

Browse the camera

Shake the bottle

BibTeX

@InProceedings{Muchen_LatentHOI,
        author    = {Li, Muchen and Christen, Sammy and Wan, Chengde and Cai, Yujun and Liao, Renjie and Sigal, Leonid and Ma, Shugao},
        title     = {LatentHOI: On the Generalizable Hand Object Motion Generation with Latent Hand Diffusion},
        booktitle = {Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR)},
        month     = {June},
        year      = {2025},
        pages     = {17416-17425}
      }