This is a key research topic currently pursued by our team, with several PhD and graduate students working on it. It is also a widely recognized area in industry, with significant practical applications. In this domain, we focus on developing advanced multi-modality models and systems for a comprehensive understanding of video content, covering tasks such as video question answering, referring video object segmentation, and video grounding. These studies often require the joint processing of various modalities, including video, images, textual descriptions, and speech data.
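As a rough illustration of what "joint processing of various modalities" can look like in practice, below is a minimal, hypothetical sketch of a video question answering model that fuses pre-extracted frame features with an encoded question. The module choices, feature dimensions, fusion strategy, and fixed answer vocabulary are simplifying assumptions for illustration only, not the architectures used in the papers listed here.

# Minimal, illustrative video-QA sketch (assumed toy architecture, not our models):
# a question representation attends over temporally encoded frame features,
# and the fused vector is classified over a fixed answer vocabulary.
import torch
import torch.nn as nn


class ToyVideoQA(nn.Module):
    def __init__(self, vocab_size=1000, num_answers=500, dim=256):
        super().__init__()
        self.frame_proj = nn.Linear(2048, dim)        # project per-frame CNN features
        self.temporal = nn.GRU(dim, dim, batch_first=True)   # aggregate frames over time
        self.word_emb = nn.Embedding(vocab_size, dim)
        self.question = nn.GRU(dim, dim, batch_first=True)   # encode the question tokens
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.classifier = nn.Linear(dim, num_answers)

    def forward(self, frame_feats, question_tokens):
        # frame_feats: (B, T, 2048) pre-extracted frame features
        # question_tokens: (B, L) token ids
        v = self.frame_proj(frame_feats)              # (B, T, dim)
        v, _ = self.temporal(v)                       # temporal context per frame
        q = self.word_emb(question_tokens)            # (B, L, dim)
        _, q_last = self.question(q)                  # (1, B, dim) final question state
        q_last = q_last.transpose(0, 1)               # (B, 1, dim)
        fused, _ = self.attn(q_last, v, v)            # question attends to video frames
        return self.classifier(fused.squeeze(1))      # (B, num_answers) answer logits


if __name__ == "__main__":
    model = ToyVideoQA()
    frames = torch.randn(2, 16, 2048)                 # 2 clips, 16 frames each
    questions = torch.randint(0, 1000, (2, 12))       # 2 tokenized questions
    print(model(frames, questions).shape)             # torch.Size([2, 500])

The same fusion pattern generalizes to the other tasks above: for referring segmentation or grounding, the text-conditioned video features would feed a mask or temporal-boundary head instead of an answer classifier.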
Publications
[1] Tianming Liang, Kun-Yu Lin, Chaolei Tan, Jianguo Zhang, Wei-Shi Zheng, and Jian-Fang Hu*, "ReferDINO: Referring Video Object Segmentation with Visual Grounding Foundations", International Conference on Computer Vision (ICCV), 2025. [code & model] [project page] [ReferDINO-Plus]
[2] Jinxuan Li, Zihang Lin, Jian-Fang Hu*, Chaolei Tan, Tianming Liang, Zhi Jin, and Wei-Shi Zheng, "Collaborative Static and Dynamic Vision-Language Learning for Spatio-Temporal Video Grounding", submitted to IEEE TPAMI.
[3] Tianming Liang, Linhui Li, Jian-Fang Hu*, Xiangyang Yu, Wei-Shi Zheng, and Jianhuang Lai, "Rethinking Temporal Context in Video-QA: A Comprehensive Study of Single-frame Static Bias", IEEE Transactions on Multimedia (TMM), 2024.
[4] Chaolei Tan, Zihang Lin, Junfu Pu, Zhongang Qi, Wei-Yi Pei, Zhi Qu, Yexin Wang, Ying Shan, Wei-Shi Zheng, and Jian-Fang Hu*, "SynopGround: A Large-Scale Dataset for Multi-Paragraph Video Grounding from TV Dramas and Synopses", ACM Multimedia (ACM MM), 2024. [dataset, code & pretrained model]
[5] Tianming Liang, Chaolei Tan, Beihao Xia, Wei-Shi Zheng, and Jian-Fang Hu*, "Ranking Distillation for Open-Ended Video Question Answering with Insufficient Labels", IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2024.
[6] Chaolei Tan, Jianhuang Lai, Wei-Shi Zheng, and Jian-Fang Hu*, "Siamese Learning with Joint Alignment and Regression for Weakly-Supervised Video Paragraph Grounding", IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2024.
[7] Zihang Lin, Chaolei Tan, Jian-Fang Hu*, Zhi Jin, Tiancai Ye, and Wei-Shi Zheng, "Collaborative Static and Dynamic Vision-Language Streams for Spatio-Temporal Video Grounding", IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2023.
[8] Chaolei Tan, Zihang Lin, Jian-Fang Hu*, Wei-Shi Zheng, and Jianhuang Lai, "Hierarchical Semantic Correspondence Networks for Video Paragraph Grounding", IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2023.
2. Human Motion Prediction and Generation
Generating unobserved motion sequences from given conditions, such as textual descriptions, music, or partially observed sequences, is an important research direction with significant applications in human-computer interaction, virtual reality, and other fields. A key challenge is to build a prediction system that can effectively comprehend and align motion with different input modalities, while also addressing the error accumulation that arises during long-term motion propagation. We are dedicated to solving these problems and have developed a set of well-performing models, including a Continual Prior Compensation algorithm and an action-guided motion prediction model.
Please refer to this for more visualization results!
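To make the error-accumulation issue concrete, here is a minimal, hypothetical sketch of an autoregressive motion predictor: each predicted pose is fed back as the input for the next step, so small per-step errors compound over long horizons. The single GRU cell, residual output head, and pose dimensionality are illustrative assumptions and do not correspond to the models in the papers below.

# Minimal autoregressive motion-prediction sketch (assumed toy model).
# The rollout loop feeding predictions back as inputs is where long-horizon
# drift originates, which is what long-term prediction methods aim to mitigate.
import torch
import torch.nn as nn


class ToyMotionPredictor(nn.Module):
    def __init__(self, pose_dim=66, hidden=128):
        super().__init__()
        self.rnn = nn.GRUCell(pose_dim, hidden)
        self.out = nn.Linear(hidden, pose_dim)

    def forward(self, observed, horizon):
        # observed: (B, T_obs, pose_dim) observed pose sequence
        B = observed.size(0)
        h = observed.new_zeros(B, self.rnn.hidden_size)
        # Encode the observed frames into the recurrent state.
        for t in range(observed.size(1)):
            h = self.rnn(observed[:, t], h)
        # Autoregressive rollout: each prediction becomes the next input,
        # so any per-step error is propagated and amplified over time.
        pose = observed[:, -1]
        preds = []
        for _ in range(horizon):
            h = self.rnn(pose, h)
            pose = pose + self.out(h)        # residual update of the previous pose
            preds.append(pose)
        return torch.stack(preds, dim=1)     # (B, horizon, pose_dim)


if __name__ == "__main__":
    model = ToyMotionPredictor()
    past = torch.randn(4, 25, 66)            # 25 observed frames, 22 joints x 3D (assumed)
    future = model(past, horizon=100)        # long-term rollout
    print(future.shape)                      # torch.Size([4, 100, 66])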
Publications
[1] Jianwei Tang, Jian-Fang Hu*, Tianming Liang, Xiaotong Lin, Jiangxin Sun, Wei-Shi Zheng, and Jianhuang Lai, "Human Motion Prediction via Continual Prior Compensation", submitted to IEEE TPAMI.
[2] Jianwei Tang, Hong Yang, Tengyue Chen, and Jian-Fang Hu*, "Stochastic Human Motion Prediction with Memory of Action Transition and Action Characteristic", IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2025.
[3] Heng Li, Xing Liufu, Xiaotong Lin, Jian Zhu, and Jian-Fang Hu*, "Efficient Text-to-Motion via Multi-Head Generative Masked Modeling", IEEE International Conference on Multimedia and Expo (ICME), 2025.
[4] Jianwei Tang, Jiangxin Sun, Xiaotong Lin, Lifang Zhang, Wei-Shi Zheng, and Jian-Fang Hu*, "Temporal Continual Learning with Prior Compensation for Human Motion Prediction", Conference on Neural Information Processing Systems (NeurIPS), 2023.
[5] Jiangxin Sun, Chunyu Wang, Huang Hu, Hanjiang Lai, Zhi Jin, and Jian-Fang Hu*, "You Never Stop Dancing: Non-freezing Dance Generation via Bank-constrained Manifold Projection", Conference on Neural Information Processing Systems (NeurIPS), 2022.
[6] Jiangxin Sun, Zihang Lin, Xintong Han, Jian-Fang Hu*, Jia Xu, and Wei-Shi Zheng, "Action-guided 3D Human Motion Prediction", Conference on Neural Information Processing Systems (NeurIPS), 2021.