Figure 1. Experiment structure. a. Structure of a single learning block, consisting of one 20 s acquisition trial followed by 6 test trials. During acquisition trials, participants either actively explored (agent condition) or passively observed (observer condition) the relationships between the movement directions of a cursor and 8 different sound stimuli. In test trials, participants were tested on their memory of the associations. b. Structure of a contingency block. Each contingency block consisted of 7 learning blocks; the first three were considered the “early learning stage” and the last three the “late learning stage”. c. Structure of the experiment. The experiment consisted of 14 contingency blocks, 7 belonging to the agent condition and 7 to the observer condition.
To make the cursor move in a “gaze-like” style in the observer condition, it was computer-animated using the participant’s own eye movements recorded during the acquisition trials of the preceding agent contingency block. If the experiment started with the observer condition, we used the eye movement recordings from the training block, which always involved active exploration. To make the replayed eye movements less recognisable to the participant, we randomized the order of previously recorded trials across the learning blocks.
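This randomization amounts to a simple permutation of the recorded acquisition trials. A minimal Python sketch, where the recorded_trials list and the one-replayed-trial-per-block structure are illustrative assumptions rather than the actual implementation:

```python
import random

def shuffle_replay(recorded_trials, n_blocks=7, seed=None):
    """Randomize the order of previously recorded acquisition trials and
    assign one replayed trajectory to each learning block."""
    rng = random.Random(seed)      # seed only if reproducibility is needed
    trials = recorded_trials[:]    # copy so the original recordings stay intact
    rng.shuffle(trials)
    return trials[:n_blocks]       # one replayed acquisition trial per block
```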
Training
Before starting the experiment, participants underwent two stages of training. The first was a “free training” session, whose purpose was to adjust the eye tracker, let participants familiarise themselves with the equipment, and teach them how to use the gaze-controlled cursor. Participants sat facing a screen with their head stabilized for eye tracking on a chin and forehead rest placed 60 cm from the screen (an eye-screen distance of about 70 cm; see section “Visual stimulation and gaze-controlled sound generation”), and wore a pair of headphones connected to the experiment computer.
Participants were then instructed to move their gaze across the screen and “explore” the sounds that they could trigger by moving the cursor (for details, see section “Visual stimulation and gaze-controlled sound generation”). During the free training, the experimenter ensured that the participant understood how to use the gaze-controlled cursor and was familiar with the experiment structure. The duration of the free training was variable, but it typically lasted around 5 minutes.
The subsequent “structured training” followed the same pattern as an agent experimental block, but with only 3 instead of 6 test trials.
Visual stimulation and gaze-controlled sound generation
Before the start of the free training and before every agent experimental block, the eye tracker was calibrated by collecting fixation samples from known target points in order to map raw eye data to the participant’s gaze position (the standard built-in EyeLink calibration procedure). After successful calibration, the experiment screen appeared: a grid of 9 red squares over a black background. Each red square’s side subtended a visual angle of 5° 18’ 0.99”, with gaps of 1° 28’ 0.39” between squares. The center of each red square was marked by a small black square with a side length of 0° 49’ 0.11”. The gaze position of the participant appeared on the screen as a white dot (radius = 0° 19’ 0.64”). A fixation on a square was defined as the gaze resting within 0° 29’ 0.47” of the square’s edges. The distance between the chin and forehead rest and the screen was 60 cm, as suggested by the EyeLink 1000 user manual, which translates to an eye-screen distance of about 70 cm.
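For reference, a visual angle θ at viewing distance d corresponds to an on-screen size of 2d·tan(θ/2). A minimal Python sketch, assuming the roughly 70 cm eye-screen distance stated above:

```python
import math

def dms_to_deg(d, m, s):
    """Convert degrees, arcminutes, arcseconds to decimal degrees."""
    return d + m / 60 + s / 3600

def visual_angle_to_size(angle_deg, distance_cm):
    """On-screen size (cm) subtending angle_deg at viewing distance distance_cm."""
    return 2 * distance_cm * math.tan(math.radians(angle_deg) / 2)

# Square side of 5° 18' 0.99" at the ~70 cm eye-screen distance:
side_deg = dms_to_deg(5, 18, 0.99)            # ≈ 5.30°
print(visual_angle_to_size(side_deg, 70.0))   # ≈ 6.48 cm
```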
During the free training, the structured training and the agent experimental condition, participants were able to generate sounds by moving their gaze from one square on the screen to another, adjacent square. The possible movement directions that could trigger a sound were: vertical up and down, horizontal left and right, and diagonal up-right, up-left, down-right, and down-left. To trigger a sound, the participant had to fixate the target square for 750 ms; if the fixation was interrupted before this 750 ms delay elapsed, no sound was played.
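The trigger is thus a dwell-time criterion. The following Python sketch illustrates the logic; the gaze-sample stream, the square_of lookup and play_sound are hypothetical stand-ins for the actual MATLAB/Psychtoolbox implementation, and the adjacency check is omitted for brevity:

```python
DWELL_MS = 750  # required fixation duration before a sound is triggered

def gaze_sound_loop(samples, square_of, play_sound):
    """samples yields (t_ms, x, y) gaze samples; square_of maps a gaze
    position to a grid square index (or None if the gaze is off the grid);
    play_sound receives the movement direction as (from_square, to_square)."""
    last_sounded = None            # square on which the last dwell completed
    fixated, onset = None, None    # currently fixated square and its onset time
    for t, x, y in samples:
        sq = square_of(x, y)
        if sq != fixated:          # gaze moved: the dwell timer restarts
            fixated, onset = sq, t
            continue
        if sq is not None and t - onset >= DWELL_MS:
            if last_sounded is not None and sq != last_sounded:
                play_sound(last_sounded, sq)
            last_sounded = sq
            onset = float("inf")   # fire at most once per fixation
```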
Sound stimuli
Sound stimuli were synthesized speech sounds created with the Google text-to-speech API through Python, set to a male Spanish voice with a sampling rate of 16000 Hz. The sound stimuli were then manually manipulated in Praat (Boersma, 2002) using the Vocal Toolkit to have the same duration and a flat pitch. Sounds were normalized and resampled to 96000 Hz. Each sound was a 500 ms consonant-vowel (CV) syllable delivered at 70 dB, formed by a random combination of one of 8 pitches, one of 8 consonants, and one of 5 vowels. The pitch (in Hz) was 90, 120, 150, 180, 210, 240, 270 or 300; the consonant was [f], [g], [l], [m], [p], [r], [s] or [t]; the vowel was [a], [e], [i], [o] or [u]. Per participant, 14 sets of 8 different sounds were generated. In each contingency block, 8 sounds were randomly paired with the 8 possible movement directions.
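As an illustration of this randomization, the Python sketch below draws 14 sets of 8 unique pitch/consonant/vowel combinations and pairs each set with the 8 movement directions. The exact sampling constraints (e.g., whether every pitch and consonant occurred exactly once per set) are not specified here, so the sketch simply draws unique combinations:

```python
import random

PITCHES = [90, 120, 150, 180, 210, 240, 270, 300]   # Hz
CONSONANTS = ["f", "g", "l", "m", "p", "r", "s", "t"]
VOWELS = ["a", "e", "i", "o", "u"]
DIRECTIONS = ["up", "down", "left", "right",
              "up-right", "up-left", "down-right", "down-left"]

def make_sound_sets(n_sets=14, n_sounds=8, seed=None):
    """Draw n_sets sets of n_sounds unique pitch/consonant/vowel combinations
    and randomly pair each set's sounds with the 8 movement directions."""
    rng = random.Random(seed)
    combos = [(p, c, v) for p in PITCHES for c in CONSONANTS for v in VOWELS]
    sets = []
    for _ in range(n_sets):
        sounds = rng.sample(combos, n_sounds)   # unique combinations per set
        directions = DIRECTIONS[:]
        rng.shuffle(directions)
        sets.append(dict(zip(directions, sounds)))
    return sets
```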
Apparatus
Visual stimuli were presented on a monitor driven by an ATI Radeon HD 2400 graphics card, and auditory stimuli were delivered through Sennheiser KD380 PRO noise-cancelling headphones. A MIDI keyboard, the Korg nanoPAD2, was used to record participants’ responses; this keyboard was chosen because its key presses do not produce any audible sound. The presentation of the stimuli and the recording of participants’ responses were controlled using MATLAB R2017a (The MathWorks Inc.), the Psychophysics Toolbox extension (Brainard, 1997; Kleiner et al., 2007; Pelli, 1997), and the EyeLink add-in toolbox for eye tracker control.
EEG was recorded using Curry 8 Neuroscan software and a Neuroscan SynAmps RT amplifier (NeuroScan, Compumedics, Charlotte, NC, USA). Continuous DC recordings were acquired using Ag/AgCl electrodes attached to a nylon cap (Quick-Cap; Compumedics, Charlotte, NC, USA) at 64 standard locations following the 10% extension of the international 10-20 system (Chatrian, Lettich, & Nelson, 1985; Oostenveld & Praamstra, 2001). Additional electrodes were placed on the tip of the nose (online reference) and above and below the left eye (vertical electrooculogram, VEOG). Two further electrodes were placed next to the outer canthi of both eyes and referenced to the common reference (horizontal electrooculogram, HEOG). The ground electrode was located at AFz. Impedances were required to remain below 10 kΩ throughout the recording session, and data were sampled at 500 Hz.
Horizontal and vertical gaze position of the left eye were recorded using the EyeLink 1000 desktop mount (SR Research) at a sampling rate of 1,000 Hz.

Behavioural data analysis

We analysed the percentage of correct responses (%Correct) to the question of whether the movement-sound pair presented in a test trial was congruent (“Did they match?”). Missing responses were counted as incorrect. Test trials presenting unseen movement-sound pairs were excluded from the analysis to avoid forced guessing. After this exclusion, we calculated each participant’s %Correct per learning block, distinguishing between associations acquired in the agent and observer conditions. We then performed a repeated-measures ANOVA with the factors agency (agent/observer) and learning block (seven levels).
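A sketch of this analysis in Python using statsmodels is given below; the file and column names are assumptions, and the original analysis was not necessarily run this way:

```python
import pandas as pd
from statsmodels.stats.anova import AnovaRM

# One row per participant x agency x learning block, holding %Correct
# computed after excluding unseen movement-sound pairs.
df = pd.read_csv("percent_correct.csv")  # columns: subject, agency, block, pct_correct

aov = AnovaRM(df, depvar="pct_correct", subject="subject",
              within=["agency", "block"]).fit()
print(aov)  # F-tests for agency, block, and their interaction
```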
During the initial stages of learning, participants were expected to perform very poorly on the memory task owing to their limited exposure to the associations; during late stages, they were expected to be proficient.

EEG data analysis

Preprocessing
EEG data was preprocessed using EEGLAB (Delorme & Makeig, 2004). After a high-pass filter was applied to the data (0.5 Hz high-pass, Kaiser window, Kaiser β 5.653, filter order 1812), the continuous recording of each participant was inspected and non-stereotypical artefacts were manually rejected. Eye movements were then removed from the data using Independent Component Analysis (SOBI algorithm): independent components representing eye movement artefacts were rejected based on visual inspection, and the remaining components were projected back into electrode space. A low-pass filter was applied (30 Hz low-pass, Kaiser window, Kaiser β 5.653, filter order 1812), and malfunctioning electrodes were interpolated (spherical interpolation). Epochs from −100 ms to 500 ms were defined around each sound onset in both acquisition and test trials, with baseline correction over the −100 to 0 ms window. A 75 μV maximal signal-change threshold per epoch was used to reject remaining artefacts. Participant averages were calculated for each event of interest, as well as grand averages across all participants. We obtained ERPs for acquisition sounds in the agent and observer acquisition modes, as well as in the early (blocks 1 to 3) and late (blocks 5 to 7) learning stages. For test sounds, we calculated averaged ERPs for sounds acquired in agent versus observer mode, in early versus late learning stages, and for congruent versus incongruent test sounds (relative to the movement-sound associations learned in acquisition trials). The mean number of trials per subject-level average was 361 (standard deviation: 185 trials).
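The pipeline above used EEGLAB; purely for illustration, an analogous sketch in MNE-Python (a different toolbox) follows. The file name, event codes and bad-channel list are assumptions, 'infomax' stands in for SOBI (which MNE does not provide), and MNE's amplitude rejection is peak-to-peak, matching the maximal signal-change criterion:

```python
import mne

raw = mne.io.read_raw_fif("subject01_raw.fif", preload=True)
raw.filter(l_freq=0.5, h_freq=None)     # 0.5 Hz high-pass

ica = mne.preprocessing.ICA(n_components=30, method="infomax")
ica.fit(raw)
ica.exclude = [0, 1]                    # ocular components, chosen by inspection
ica.apply(raw)                          # project remaining components back

raw.filter(l_freq=None, h_freq=30.0)    # 30 Hz low-pass
raw.info["bads"] = ["T7"]               # malfunctioning channels, if any
raw.interpolate_bads()                  # spherical-spline interpolation

events = mne.find_events(raw)
epochs = mne.Epochs(raw, events, event_id={"sound": 1},
                    tmin=-0.1, tmax=0.5, baseline=(-0.1, 0.0),
                    reject=dict(eeg=75e-6), preload=True)  # 75 µV rejection
evoked = epochs.average()               # per-participant ERP
```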
Statistical analyses
For both acquisition and test sounds, statistical comparisons were conducted to assess the effects of agency and learning stage and their interaction. For test sounds, we additionally analysed the effect of congruency and its interactions with the factors agency and learning stage.