1. Introduction

Historically, the question of human language perception has predominantly been studied with a focus on the auditory processing of speech \cite{Friederici_2012}. Recent investigations, however, suggest that language perception instead relies on an interactive multimodal system, involving not only auditory but also visual \cite{Bernstein_2014} and motor systems \cite{Pulverm_ller_2010,Glenberg_2012}.
    The use of visual speech cues for language processing is present early in ontogeny. Four-month-old infants are capable of detecting a switch from their native to a non-native language (and vice versa) in silent videos, suggesting that visual input alone is sufficient for language discrimination at this early age \cite{Sebasti_n_Gall_s_2012,Weikum_2007}. Further specialization seems to occur during the second half of the first year of life, when visual attention shifts from the eyes toward the articulatory movements of the mouth, helping to construct a sensory-motor model for emerging speech production \cite{Tenenbaum_2012,Lewkowicz_2012}. Studies with adults have demonstrated that access to the visual information afforded by the interlocutor’s face is especially advantageous in noisy environments \cite{Sumby_1954,Ross_2006} and when hearing acuity is impaired \cite{Bernstein_2000,Auer_2007}. Early auditory deprivation leads to a greater dependence on vision during speech perception in deaf people, reflected behaviorally by a reorientation of visual attention that improves the perception of the visual speech cues provided by orofacial movements \cite{Dole_2017,Letourneau_2013,Worster_2017}. On the other hand, adults often fail to hear the difference between certain non-native phonemic contrasts when these are presented in the auditory modality alone, yet they successfully distinguish the same contrasts when presented audiovisually \cite{Navarra_2005,Hirata_2010}. Paris, Kim and Davis (2013) reported that access to visual speech form speeds up the processing of auditory speech compared to presentation in the auditory modality alone. They argued that the temporal priority of visual speech may serve as a cue for predicting aspects of the upcoming auditory signal \cite{Paris_2013}. Interestingly, the more salient and predictive of a possible speech sound the articulatory movements are, the faster the auditory signal is processed. The authors propose that human adults possess “abstract internal representations” that link a specific visual form of the mouth to a restricted set of possible subsequent auditory inputs \cite{van_Wassenhove_2005}. An alternative to this abstract representational format emphasizes the role of the motor system and of sensorimotor coupling as a mode of internal representation. The motor system appears to play an important role even in the most abstract forms of language \cite{2017,Gallese_2018,Kemmerer_2014,Cardona_2014}. Abstract concepts activate the mouth motor representation in a way that has been interpreted as “a re-enactment of acquisition experience, or re-explanation of the word meaning, possibly through inner talk” \cite{Borghi_2016}.
    Consistent with behavioral studies, neuroimaging has revealed that silent lip-reading activates areas of the auditory temporal cortex that overlap considerably with those activated by auditory speech perception. Notably, the auditory cortex appears to be similarly activated by visual pseudospeech, in contrast to mouth movements with no linguistic content. Considered a central hub for multimodal integration, the left posterior superior temporal sulcus (pSTS) is thought to play a crucial role in predicting upcoming auditory speech on the basis of visual information, which typically precedes the acoustic signal in natural face-to-face conversation. For instance, greater functional connectivity has been found between the left pSTS and auditory-speech areas when the visual cue mismatches the upcoming auditory cue, suggesting the existence of predictive error signals \cite{Blank_2013}. Skipper, Nusbaum and Small (2005) used fMRI to examine brain activity associated with the comprehension of short stories presented in three conditions: audiovisual, auditory-only, and visual-only. They reported several notable results. First, the activity of the pSTS is modulated by the saliency of articulatory movements, becoming stronger as visemic content increases. Second, Broca’s area, and particularly the pars opercularis (BA 44), is activated to a greater extent in the visual-only condition than in the audiovisual condition. Based on their shared functional properties and connectivity, the authors suggested that the pSTS and pars opercularis work together to associate the sensory patterns of phonemes and/or visemes with the motor commands needed to produce them. Finally, activity in the dorsal precentral gyrus and sulcus (i.e., premotor and motor cortices) is, like that of the pSTS, modulated by the amount of visemic content. These areas are postulated to be involved in encoding the motor plans of the specific articulatory effectors (e.g., lips, tongue, jaw) corresponding to the sensorimotor representation generated by the pSTS and pars opercularis \cite{Skipper_2005}. Although Broca’s area and the premotor and motor cortices have traditionally been associated with language production, they also appear to be an important part of a highly interactive network that “translates” orofacial movements into phonetic representations based on the motor commands required to generate those movements. We propose that this network supports the development of a trimodal repertoire in which phoneme, viseme, and ‘articuleme’ are linked to achieve a more ecological and seamless perception of speech.
    Whereas evidence regarding the spatial organization of these processes in the brain is increasingly robust and consistent, the temporal dimension of visual speech processing and its electrophysiological correlates remain poorly understood. The temporal dimension is crucial for audiovisual processing, as illustrated by the effects of desynchronization between auditory and visual speech inputs, but also because visual speech cues are perceived first and can disambiguate the upcoming acoustic signal. The high temporal resolution of EEG makes it especially well suited to address such questions of temporal dynamics. In the current study, two experiments were performed. The first experiment aimed to elucidate whether the linguistic content of visual speech cues modulates the electrophysiological response elicited by perceiving orofacial movements. We recorded participants’ EEG signal while they attentively observed or imitated different types of orofacial movements (a: still mouth; b: syllables; c: backward-played syllables; d: non-linguistic movements) and non-biological movements displayed in short videos. The second experiment aimed to investigate to what extent interfering with automatic mimicry affects the electrophysiological dynamics underlying the processing of orofacial movements. To this end, the same experiment was run a second time, but participants were asked to hold an effector depressor between their teeth while observing the videos.