April 17-19, 2026 | Suzhou, China
"Vision-Language-Action Models for Medical and Healthcare Robotics: From Immersive Behaviour Capture to Intelligent Assistive Systems"
Organizers:

Fan Zhang, Xi’an Jiaotong-Liverpool University, China
Fan Zhang received the B.Eng. degree in Mechanical Engineering and Automation from Shanghai Jiao Tong University, China, in 2011, the M.Sc. degree in Robotics from King’s College London, U.K., in 2013, and the Ph.D. degree in Perceptual Intelligence from Delft University of Technology, The Netherlands, in 2019. He was a Postdoctoral Researcher at TU Delft in 2019, working on multisensory perception and consumer knowledge in collaboration with Procter & Gamble (Germany), and a Research Fellow with the Centre for Computational Neuroscience and Cognitive Robotics at the University of Birmingham, U.K., from 2019 to 2023. He is currently an Assistant Professor with the School of Robotics, Xi’an Jiaotong-Liverpool University, Suzhou, China. His research interests include cognitive robotics, human-robot interaction, visual-tactile material perception, and computer vision for medical and personal health applications.

Yuanrui Huang, Xi’an Jiaotong-Liverpool University, China
Yuanrui Huang (Member, IEEE) received the B.Eng. degree in Electronic Engineering from Chongqing University, Chongqing, China, in 2018, the M.Sc. degree from the Department of Informatics, King’s College London, London, U.K., in 2019, and the Ph.D. degree from the School of Artificial Intelligence, University of Chinese Academy of Sciences, and the Institute of Automation, Chinese Academy of Sciences, in 2024. He is currently an Assistant Professor with the School of Robotics, Xi’an Jiaotong-Liverpool University, Suzhou, China. His research interests include medical robots and systems, flexible robotics, and magnetic actuation and localization.
Introduction:
Medical and healthcare robotics are undergoing a rapid transformation driven by advances in multimodal sensing, personalised interaction, and intelligent assistance. As robots move closer to patients, caregivers, and home users, it becomes essential for them to understand human behaviour, interpret natural language instructions, adapt to diverse (non-)clinical contexts, and execute actions safely and transparently. However, real-world medical environments impose strict privacy, safety, and operational constraints, creating a significant bottleneck for collecting high-quality human-robot interaction data and training robust action policies.
Recent progress in Vision-Language-Action (VLA) foundation models offers a promising path toward generalisable, human-centred healthcare robotics. VLA models can unify visual perception, language understanding, and action generation in a single framework, enabling robots to perform assistive tasks such as dressing support, grooming, rehabilitation guidance, mobility aid, or daily-living assistance. Yet, their development critically depends on large-scale, high-fidelity data capturing human behaviour, intention, and task structure—data that is extremely difficult to obtain in clinics or home-care settings.
Immersive technologies such as VR/AR/XR provide an effective solution. They enable controlled capture of therapeutic motions, patient-caregiver interactions, daily-living task demonstrations, and multimodal behavioural signals (e.g., gaze, motion trajectories, semantic task descriptions). VR/XR-based digital twins also allow safe simulation of high-risk or personalised healthcare scenarios that are impractical to record in real environments. Such immersive datasets are ideal for training VLA models, benchmarking healthcare robot policies, and accelerating the deployment of intelligent, human-centred assistive systems.
This Special Session aims to bring together researchers working at the intersection of medical robotics, healthcare robotics, immersive simulation, embodied AI, and VLA models, thereby establishing a dedicated forum for advancing next-generation intelligent healthcare systems grounded in human behaviour and multimodal interaction.
The session will be organised into the following two thematic blocks:
1. VLA Models and Immersive XR for Healthcare Robotics
This block focuses on foundational methods and data-generation techniques that enable intelligent assistive systems. Potential topics may include embodied foundation models, multimodal policy learning, language-guided action generation, XR-based behaviour capture, digital twins of clinical procedures, rehabilitation movement datasets, and cognitive–motor behaviour analysis.
2. Human-Robot Interaction and Intelligent Assistive Applications
This block covers system-level integration and real-world deployment of VLA-enabled healthcare robots. Potential topics may include interactive perception for patient state estimation, personalised and adaptive assistive behaviours, VR/AR-based therapy or training systems, safety-critical HRI, and intelligent robots for rehabilitation, personal health, and home-care support.