ELMO: Enhanced Real-time LiDAR Motion Capture through Upsampling

ACM Transactions on Graphics (Proc. SIGGRAPH ASIA), 2024

1 MOVIN Inc., 2 KAIST
*Equal contribution †Corresponding author

Our ELMO framework enables upsampled motion capture (60 fps) from a LiDAR point cloud stream (20 fps) in real time.

Abstract

This paper introduces ELMO, a real-time upsampling motion capture framework designed for a single LiDAR sensor. Modeled as a conditional autoregressive transformer-based upsampling motion generator, ELMO achieves 60 fps motion capture from a 20 fps LiDAR point cloud sequence. The key feature of ELMO is the coupling of the self-attention mechanism with thoughtfully designed embedding modules for motion and point clouds, significantly elevating the motion quality.

To facilitate accurate motion capture, we develop a one-time skeleton calibration model capable of predicting user skeleton offsets from a single-frame point cloud. Additionally, we introduce a novel data augmentation technique utilizing a LiDAR simulator, which enhances global root tracking to improve environmental understanding. To demonstrate the effectiveness of our method, we compare ELMO with state-of-the-art methods in both image-based and point cloud-based motion capture. We further conduct an ablation study to validate our design principles.

ELMO's fast inference time makes it well-suited for real-time applications, exemplified in our demo video featuring live streaming and interactive gaming scenarios. Furthermore, we contribute a high-quality LiDAR-mocap synchronized dataset comprising 20 different subjects performing a range of motions, which can serve as a valuable resource for future research.


Overall framework

Overall network architecture. (a) Details of the feature extraction pipeline. (b) Overview of the generator for real-time upsampling LiDAR motion capture at run time.
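
The paper describes the generator as a conditional autoregressive transformer that couples self-attention with dedicated embedding modules for motion and point clouds, emitting three 60 fps poses per 20 fps LiDAR frame. The PyTorch sketch below is our own minimal illustration of that interface; the layer sizes, pose dimension, learnable output queries, and the omission of positional encoding are assumptions, not the authors' implementation.

import torch
import torch.nn as nn

class UpsamplingGenerator(nn.Module):
    """Hypothetical sketch: embed past 60 fps poses and the newest 20 fps
    point-cloud feature, then decode three new poses with learnable queries.
    Positional encoding and autoregressive masking are omitted for brevity."""
    def __init__(self, pose_dim=135, pc_feat_dim=256, d_model=512, n_layers=6, up=3):
        super().__init__()
        self.motion_embed = nn.Linear(pose_dim, d_model)        # motion embedding module
        self.pc_embed = nn.Linear(pc_feat_dim, d_model)         # point-cloud embedding module
        self.queries = nn.Parameter(torch.randn(up, d_model))   # one query per output frame
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, pose_dim)
        self.up = up

    def forward(self, past_poses, pc_feature):
        # past_poses: (B, T, pose_dim) previously generated 60 fps poses
        # pc_feature: (B, 1, pc_feat_dim) feature of the newest 20 fps LiDAR frame
        B = past_poses.size(0)
        q = self.queries.unsqueeze(0).expand(B, -1, -1)
        tokens = torch.cat([self.motion_embed(past_poses),
                            self.pc_embed(pc_feature), q], dim=1)
        h = self.backbone(tokens)           # self-attention couples all tokens
        return self.head(h[:, -self.up:])   # (B, 3, pose_dim): three new 60 fps poses

At run time, each incoming LiDAR frame thus yields three poses, matching the 20-to-60 fps upsampling ratio.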

Video


Data Construction

ELMO Dataset

We construct the ELMO dataset, a high-quality synchronized dataset of single-LiDAR point clouds, optical motion capture, and video, featuring 20 subjects.

We utilize a 4×4 m capture space, a Hesai QT128 LiDAR, and an OptiTrack motion capture system equipped with 23 cameras. The point cloud and mocap data were recorded at 20 Hz and 60 Hz, respectively.
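
Because the two streams run at an exact 1:3 rate, each 20 Hz LiDAR frame can be aligned with the three 60 Hz mocap frames it spans. A minimal sketch of this pairing (the index layout is our assumption, not the dataset's actual file format):

def pair_frames(num_lidar_frames, ratio=3):
    """Return (lidar index, [mocap indices]) pairs for one synchronized take."""
    return [(i, list(range(i * ratio, i * ratio + ratio)))
            for i in range(num_lidar_frames)]

# pair_frames(2) -> [(0, [0, 1, 2]), (1, [3, 4, 5])]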


Augmentation

The goal of the augmentation is to make the training dataset cover the entire motion capture space, since we use global coordinates for the root transformation. We augment each motion clip by applying global rotations of 90, 180, and 270 degrees, as sketched below. However, a fixed LiDAR would capture different sides of the subject for rotated motion clips, producing point-cloud shapes that differ from the original.
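
A minimal sketch of the rotation augmentation applied to the root transformation, assuming z-up global coordinates and SciPy rotation objects; the dataset's actual axis convention and data layout are not specified here:

import numpy as np
from scipy.spatial.transform import Rotation as R

def rotate_clip(root_pos, root_rot, deg):
    """Rotate one motion clip about the vertical axis.

    root_pos: (T, 3) global root positions
    root_rot: scipy Rotation stack of length T (global root orientations)
    """
    g = R.from_euler('z', deg, degrees=True)   # z-up assumed
    return g.apply(root_pos), g * root_rot

# Each clip is augmented at 90, 180, and 270 degrees:
# for deg in (90, 180, 270):
#     pos_aug, rot_aug = rotate_clip(root_pos, root_rot, deg)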

To address this issue, we use a point cloud simulator. To compute collision points with simulated laser beams, we use the SMPL body mesh, with shape parameters manually adjusted to match the subject's skeleton. During simulation, motion clips run at 60 fps, and point cloud data are captured every third frame (20 fps); see the sketch below.
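
As an illustration of the simulation step, the sketch below casts one ray per (azimuth, elevation) pair against a body mesh with trimesh and keeps the first hit per ray. The Hesai QT128's real scan pattern, the SMPL fitting, and the paper's own simulator are not reproduced; the function and its parameters are ours:

import numpy as np
import trimesh

def simulate_scan(mesh, sensor_pos, azimuths_deg, elevations_deg):
    """Cast one ray per (azimuth, elevation) pair from the sensor and keep
    the first surface hit, yielding a simulated point cloud for one frame."""
    a, e = np.meshgrid(np.radians(azimuths_deg), np.radians(elevations_deg))
    dirs = np.stack([np.cos(e) * np.cos(a),
                     np.cos(e) * np.sin(a),
                     np.sin(e)], axis=-1).reshape(-1, 3)
    origins = np.repeat(np.asarray(sensor_pos)[None], len(dirs), axis=0)
    hits, _, _ = mesh.ray.intersects_location(origins, dirs, multiple_hits=False)
    return hits

# Play the clip at 60 fps and scan every third frame to mimic the 20 fps LiDAR.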


Augmentation results using mirroring and simulation for 90°, 180°, and 270° global rotations. The yellow character represents the original data, while the blue characters represent the augmented data.


Results

Skeleton calibration

For the qualitative evaluation of the skeleton calibration model, we conducted an in-the-wild test with three subjects: a 157 cm female, a 171 cm male, and a 182 cm male. After acquiring a single-frame point cloud while each subject stood in an A-pose at the origin, our model predicted the user's skeleton offsets.
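
For intuition, a one-shot calibration model can be sketched as a permutation-invariant point encoder that regresses per-joint offsets from the single A-pose point cloud. Everything below (the layer widths, the joint count of 21, the plain max-pooling encoder) is our assumption, not the paper's architecture:

import torch
import torch.nn as nn

class SkeletonCalibration(nn.Module):
    """Hypothetical one-shot calibration: regress per-joint offsets from a
    single A-pose point cloud via a permutation-invariant point encoder."""
    def __init__(self, n_joints=21):
        super().__init__()
        self.point_mlp = nn.Sequential(
            nn.Linear(3, 64), nn.ReLU(),
            nn.Linear(64, 256), nn.ReLU())
        self.head = nn.Sequential(
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, n_joints * 3))
        self.n_joints = n_joints

    def forward(self, points):
        # points: (B, N, 3) single-frame point cloud of the A-posed user
        feat = self.point_mlp(points).max(dim=1).values     # order-invariant pooling
        return self.head(feat).view(-1, self.n_joints, 3)   # per-joint offsets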

Live streaming

We demonstrate the capability of our framework to stream output motion in real time for single-subject actions. ELMO successfully captures not only general actions such as walking, running, and jumping, but also more challenging ones such as lying down, doing push-ups, and performing cartwheels. Furthermore, our method predicts foot contact along with the output motion, which can be used to eliminate foot skating; a sketch follows below.
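
As a simple illustration of how predicted contact labels can remove foot skating, the sketch below pins a foot to the position it had when contact began and releases it when contact ends. A production cleanup would propagate the correction through IK; this hypothetical function only demonstrates the idea:

import numpy as np

def pin_contact_feet(foot_pos, contact):
    """foot_pos: (T, 3) foot positions; contact: (T,) boolean contact labels."""
    out = foot_pos.copy()
    anchor = None
    for t in range(len(foot_pos)):
        if contact[t]:
            if anchor is None:
                anchor = foot_pos[t].copy()  # contact begins: remember position
            out[t] = anchor                  # hold the foot while in contact
        else:
            anchor = None                    # contact ends: release the foot
    return out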

Comparison with SOTA methods

To validate the effectiveness of our method, we compare our results with state-of-the-art methods: MOVIN, a LiDAR-based method; NIKI, a vision-based method; and Xsens, a commercial IMU-based motion capture suit. To ensure a fair comparison, none of the results were post-processed.

* Subject 1: 172cm - Kicking.

* Subject 2: 165cm - Spinning.

* Subject 3: 175cm - Rotating foot.

* Subject 1: 172cm - Sitting.

* Subject 2: 165cm - Jumping.

* Subject 3: 175cm - Turn and run.


Related Links

See our previous LiDAR-based motion capture paper, "MOVIN: Real-time Motion Capture using a Single LiDAR".

If you are interested in AI-powered, markerless, LiDAR-based motion capture, check out our website, MOVIN Inc.

BibTeX

@misc{jang2024elmoenhancedrealtimelidar,
      title={ELMO: Enhanced Real-time LiDAR Motion Capture through Upsampling}, 
      author={Deok-Kyeong Jang and Dongseok Yang and Deok-Yun Jang and Byeoli Choi and Donghoon Shin and Sung-hee Lee},
      year={2024},
      eprint={2410.06963},
      archivePrefix={arXiv},
      primaryClass={cs.GR},
      url={https://arxiv.org/abs/2410.06963}, 
}