2.2. Vocabulary in the news
To analyse how the vocabulary changed over the period from the discovery of COVID-19 to its spread outside China, we extracted terms from the whole corpus. For this purpose, we used a ranking function based on term frequency and importance (the F-TFIDF-C measure; Lossio-Ventura et al., 2014), with the support of BioTex, a text-mining tool adapted to the biomedical domain (Lossio-Ventura et al., 2016). BioTex relies on (i) a combination of information retrieval techniques and statistical methods, and (ii) a list of syntactic structures of terms learnt from relevant sources (e.g. MeSH). The terms extracted with BioTex can be simple (e.g. influenza) or compound (e.g. avian influenza), and are lowercased. We then identified the terms referring to COVID-19, such as “new virus” and “mystery pneumonia”, and manually categorized them as “mystery” (terms referring to the unknown threat), “pneumonia” (terms referring to the clinical signs), “coronavirus” (terms referring to the virus taxonomy) or “technical” (technical acronyms for the virus itself). A single news item can contain terms from different categories. We calculated the daily proportion of each category as the number of occurrences of terms in that category divided by the total number of occurrences.
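The daily proportion described above can be illustrated with a minimal sketch. It assumes that each term occurrence has already been assigned a date and one of the four categories, and that the denominator is the total number of categorized occurrences on the same day; both points are assumptions made for illustration. The function name and the toy data are hypothetical and are not taken from the study.

```python
from collections import Counter, defaultdict


def daily_category_proportions(occurrences):
    """Return {day: {category: proportion}} from (day, category) pairs.

    Each pair stands for one occurrence of a COVID-19-related term.
    Assumption: proportions are computed per day, over all categorized
    occurrences recorded on that day.
    """
    counts = defaultdict(Counter)
    for day, category in occurrences:
        counts[day][category] += 1

    proportions = {}
    for day, by_category in counts.items():
        total = sum(by_category.values())
        proportions[day] = {cat: n / total for cat, n in by_category.items()}
    return proportions


# Toy data for illustration only (not from the corpus).
sample = [
    ("2019-12-31", "mystery"),
    ("2019-12-31", "pneumonia"),
    ("2019-12-31", "pneumonia"),
    ("2020-01-09", "coronavirus"),
]

for day, shares in sorted(daily_category_proportions(sample).items()):
    print(day, shares)
```

Counting occurrences rather than news items means that an item containing terms from several categories contributes to each of them, which is consistent with the remark that one item can span categories.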
3. Results
3.1. Detection of news
ProMED was the first to detect and report a news item from a Chinese online source (https://promedmail.org/promed-post/?id=6864153). The ProMED report was dated Dec. 30, 2019, one day before the first official notification of pneumonia-like cases in Wuhan (Wuhan Municipal Health Commission, 2020). PADI-web and HealthMap detected three and one COVID-19-related news items, respectively, on Dec. 31, 2019, the same day as the first official notification (the HealthMap item came from an English source; the three PADI-web items came from two English sources and one Chinese source). The news items detected by the three EBS originated from five different media sources.
Of the 275 COVID-19-related news items retrieved by PADI-web, 45.5% (n=125) were retrieved by disease-specific RSS feeds and the remaining 54.5% (n=150) by syndromic RSS feeds (Table 1). Content-wise, 31.7% (n=87) of the news items compared COVID-19 to five animal diseases (avian influenza, African swine fever, classical swine fever, West Nile virus, and Rift Valley fever), 24.4% (n=67) described the broad range of animal species susceptible to coronaviruses, 18.2% (n=50) described avian influenza being ruled out in the diagnosis of COVID-19, 7.7% (n=21) described ongoing outbreaks in addition to COVID-19 (avian influenza, African swine fever, classical swine fever, and foot-and-mouth disease), 2.5% (n=7) referred to animal species present in Chinese markets as the potential source of COVID-19, and 0.7% (n=2) advised avoiding contact with animals. Irrelevant keyword matches were found in 12 news items (e.g. a host keyword found in the name of a source), and no link could be established between the RSS feed and the article for the remaining 29 news items (10.5%).