X-Fi: A Modality-Invariant Foundation Model for Multimodal Human Sensing

MARS Lab, Nanyang Technological University
{chen1909, jianfei.yang}@ntu.edu.sg


Accepted by ICLR 2025
We introduce X-Fi, the first foundation model to achieve modality-invariant multimodal human sensing. The model needs to be trained only once, after which any sensor modality that participated in training can be used independently or in any combination across a wide range of applications. We evaluated X-Fi on the HPE and HAR tasks of MM-Fi [1] and XRF55 [2], demonstrating that X-Fi surpasses previous methods by 24.8% in MPJPE and 21.4% in PA-MPJPE on the HPE task, and by 2.8% in accuracy on the HAR task.


Abstract

Human sensing, which employs various sensors and advanced deep learning technologies to accurately capture and interpret human body information, has significantly impacted fields like public security and robotics. However, current human sensing primarily depends on modalities such as cameras and LiDAR, each of which has its own strengths and limitations. Furthermore, existing multimodal fusion solutions are typically designed for fixed modality combinations, requiring extensive retraining when modalities are added or removed for diverse scenarios. In this paper, we propose a modality-invariant foundation model for all modalities, X-Fi, to address this issue. X-Fi enables the independent or combinatory use of sensor modalities without additional training by utilizing a transformer structure to accommodate variable input sizes and incorporating a novel "X-fusion" mechanism to preserve modality-specific features during multimodal integration. This approach not only enhances adaptability but also facilitates the learning of complementary features across modalities. Extensive experiments conducted on the MM-Fi and XRF55 datasets, employing six distinct modalities, demonstrate that X-Fi achieves state-of-the-art performance in human pose estimation (HPE) and human activity recognition (HAR) tasks. The findings indicate that our proposed model can efficiently support a wide range of human sensing applications, ultimately contributing to the evolution of scalable, multimodal sensing technologies.

Motivation

Currently, human sensing tasks mainly rely on vision-based modalities like cameras, which face inherent limitations such as reliance on illumination and privacy concerns. Alternatives like LiDAR, mmWave radar, and WiFi address these challenges, each offering distinct advantages but also having limitations. Therefore, a multi-modal approach that leverages strengths from each modality is essential for advancing human sensing.
Numerous methods have been proposed for multi-modal perception based on sensor fusion, but they usually predefine a fixed set of modalities for a specific scenario. Once such a model is trained, adding or removing even one modality requires considerable effort: the network must be restructured and retrained from scratch. In the real world, however, different scenarios may call for different combinations of sensor modalities.
Hence, we contemplate whether it is possible to design a one-for-all solution for modality-invariant human sensing. Such a model would require training only once, allowing all sensor modalities that participated in the training process to be utilized independently or in any combination for a wide range of potential applications.

Method

We propose a novel modality-invariant foundation model, X-Fi, for versatile human sensing.

X-Fi can take in any combination of modalities and activates the corresponding encoders to extract modality-specific features. A cross-modal transformer learns a joint cross-modal feature, and cross-attention multi-modal fusion steps then inject each modality's information back into this representation, preserving distinctive modal features.


The architecture of the proposed modality-invariant foundation model, X-Fi. X-Fi consists of modality feature encoders and an X-Fusion module, which includes a cross-modal transformer and modality-specific cross-attention modules. Modalities drawn with dotted lines are inactive in the given scenario. The \(N\) in the X-Fusion block denotes the number of iterations.
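
To make the fusion flow concrete, the following is a minimal PyTorch sketch of the X-Fusion idea described above. The module names, feature dimensions, and iteration scheme are illustrative assumptions rather than the released implementation.

import torch
import torch.nn as nn


class XFusionBlock(nn.Module):
    """Cross-modal transformer followed by per-modality cross-attention (illustrative sketch)."""

    def __init__(self, dim=256, heads=4, modalities=("rgb", "depth", "lidar", "mmwave", "wifi")):
        super().__init__()
        self.cross_modal = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        # One cross-attention module per modality; only the modules of active modalities are used.
        self.cross_attn = nn.ModuleDict(
            {m: nn.MultiheadAttention(dim, heads, batch_first=True) for m in modalities}
        )

    def forward(self, fused, feats):
        # fused: (B, T, dim) joint tokens; feats: dict mapping modality name -> (B, T_m, dim)
        fused = self.cross_modal(fused)
        for name, f in feats.items():  # only the active modalities appear in the dict
            attended, _ = self.cross_attn[name](query=fused, key=f, value=f)
            fused = fused + attended   # residual injection of modality-specific features
        return fused


class XFiSketch(nn.Module):
    def __init__(self, encoders: nn.ModuleDict, dim=256, num_iters=3, num_keypoints=17):
        super().__init__()
        self.encoders = encoders           # modality feature encoders, one per supported modality
        self.block = XFusionBlock(dim=dim, modalities=tuple(encoders.keys()))
        self.num_iters = num_iters         # the "N" in the X-Fusion block
        self.head = nn.Linear(dim, num_keypoints * 3)  # e.g. a 3D HPE head

    def forward(self, inputs: dict):
        # inputs: dict mapping modality name -> raw tensor; any subset of modalities may be given.
        feats = {m: self.encoders[m](x) for m, x in inputs.items()}
        fused = torch.cat(list(feats.values()), dim=1)   # variable-length token sequence
        for _ in range(self.num_iters):
            fused = self.block(fused, feats)
        return self.head(fused.mean(dim=1))              # pool tokens and predict keypoints

With this structure, dropping or adding a modality at inference time only means changing which keys appear in the input dictionary; no weights need to be modified or retrained.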

Evaluations

Qualitative Results

Human Pose Estimation Qualitative Results


The visualization results for HPE comprise two actions, 'picking up things' and 'throwing', each depicted through a sequence of four images. To facilitate a clearer comparison between the fused results and the single-modality results, we incorporated blue and orange dashed lines in the fused-result images to represent the predictions from the RGB and depth single-modality inputs, respectively.


Human Activity Recognition Qualitative Results



Comparison of multi-modal embedding distribution for HAR. To more closely analyze the distribution of sample points, we zoomed in on a small region containing points from two distinct categories. To quantify the distribution, we used the Silhouette score and Calinski–Harabasz index as indicators of clustering quality.
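
For reference, both clustering-quality indicators can be computed directly from the embedded sample points with scikit-learn, as in the small sketch below; the embedding array and labels here are random placeholders rather than the actual HAR features.

import numpy as np
from sklearn.metrics import silhouette_score, calinski_harabasz_score

# Placeholder embeddings and activity labels standing in for the HAR features.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(500, 64))    # (num_samples, feature_dim)
labels = rng.integers(0, 10, size=500)     # activity category per sample

# Higher is better for both: Silhouette lies in [-1, 1], Calinski-Harabasz is unbounded.
print("Silhouette score:", silhouette_score(embeddings, labels))
print("Calinski-Harabasz index:", calinski_harabasz_score(embeddings, labels))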


Quantitative Results

We train and evaluate our proposed X-Fi on the two largest public multimodal human sensing datasets, MM-Fi [1] and XRF55 [2], to assess its effectiveness as a unified modality-invariant foundation model across diverse human sensing tasks, including Human Pose Estimation (HPE) and Human Activity Recognition (HAR).

MM-Fi includes 5 sensing modalities: RGB images (I), Depth images (D), LiDAR point clouds (L), mmWave point clouds (R), and WiFi-CSI (W).
XRF55 includes 3 sensing modalities: mmWave Range-Doppler & Range-Angle Heatmaps (R), WiFi-CSI (W), and RFID phase series data (RF).

Human Pose Estimation Quantitative Results

Performance comparison of X-Fi with baseline methods on the MM-Fi dataset for the HPE task.
'Baseline1' denotes the decision-level fusion results and 'Baseline2' denotes the feature-level fusion results.
'Imp' denotes the percentage improvement achieved over the baseline.
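
As a reminder of how the two HPE metrics in this comparison are defined, here is a small NumPy sketch of MPJPE and PA-MPJPE (MPJPE after Procrustes alignment); the keypoint arrays below are illustrative placeholders, not results from the model.

import numpy as np

def mpjpe(pred, gt):
    """Mean per-joint position error: average Euclidean distance per joint (pred, gt: (J, 3))."""
    return np.linalg.norm(pred - gt, axis=-1).mean()

def pa_mpjpe(pred, gt):
    """MPJPE after Procrustes alignment (scale, rotation, and translation removed)."""
    mu_p, mu_g = pred.mean(0), gt.mean(0)
    p, g = pred - mu_p, gt - mu_g
    # Optimal rotation via SVD of the cross-covariance matrix (Kabsch algorithm).
    u, s, vt = np.linalg.svd(p.T @ g)
    r = vt.T @ u.T
    if np.linalg.det(r) < 0:               # avoid reflections
        vt[-1] *= -1
        s[-1] *= -1
        r = vt.T @ u.T
    scale = s.sum() / (p ** 2).sum()
    aligned = scale * p @ r.T + mu_g
    return mpjpe(aligned, gt)

# Example with random keypoints (17 joints, as in MM-Fi's skeleton format).
rng = np.random.default_rng(0)
gt = rng.normal(size=(17, 3))
pred = gt + 0.02 * rng.normal(size=(17, 3))
print(f"MPJPE: {mpjpe(pred, gt):.4f}  PA-MPJPE: {pa_mpjpe(pred, gt):.4f}")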
Human Activity Recognition Quantitative Results

HAR accuracy (%) on the MM-Fi and XRF55 datasets.

Related works

  • [1] Jianfei Yang, He Huang, Yunjiao Zhou, Xinyan Chen, Yuecong Xu, Shenghai Yuan, Han Zou, Chris Xiaoxuan Lu, and Lihua Xie. MM-Fi: Multi-modal non-intrusive 4D human dataset for versatile wireless sensing. Advances in Neural Information Processing Systems, 36, 2024.
  • [2] Fei Wang, Yizhe Lv, Mengdie Zhu, Han Ding, and Jinsong Han. XRF55: A radio frequency dataset for human indoor action analysis. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, 8(1):1–34, 2024.
  • BibTeX

    @misc{chen2024xfimodalityinvariantfoundationmodel,
          title={X-Fi: A Modality-Invariant Foundation Model for Multimodal Human Sensing}, 
          author={Xinyan Chen and Jianfei Yang},
          year={2024},
          eprint={2410.10167},
          archivePrefix={arXiv},
          primaryClass={cs.CV},
          url={https://arxiv.org/abs/2410.10167}, 
    }