How do speakers of English as a lingua franca (ELF) achieve multimodal cohesion on the basis of their specific interests and cultural backgrounds? Taking a dialogic and collaborative view of communication, this study examines how verbal and nonverbal modes cohere during intercultural conversations. The data comprise approximately 160 minutes of transcribed video recordings of ELF interactions among four groups of university students who engaged in two classroom tasks: responding to a film excerpt and responding to a music video. The results showed that individual participants engaged in processes of initiation and response to support or challenge one another, using a range of communication strategies. The results further indicated that, during these discursive activities, the small groups achieved multimodal cohesion by deploying specific embodied resources in four types of participation structure: (1) interlock, (2) unison, (3) plurality and (4) dominance. Future research may broaden our understanding of the embodied interaction involved in intercultural conversation.