Web scraping shows how media coverage of COVID-19 turned “neutral” even as deaths increased

I like to use web data to answer questions, and questions about COVID-19 are no different. Over the course of the year, I felt that the tone of US corporate communications had changed from “emergency!” in the early stages of the pandemic to “this is the new normal”, and I wondered whether this change in tone could also be detected in the wider US news media.

I used Import.io to collect 665,000 news stories from the websites of 6,800 US news sources in which “covid”, “coronavirus” or “pandemic” appeared in the headline. I calculated a sentiment score for a random sample of these stories and then plotted the average sentiment against the total number of COVID-19 news stories and the number of US COVID-19 deaths to see what relationships would be revealed.

The most notable result was that while media sentiment towards the pandemic started off negative (it’s a global pandemic, after all), it quickly became more neutral, even as reported US COVID-19 deaths peaked and then stabilized at around 5,000 per week.

Interpreting the graph

  • The grey area represents the 665,000 US news articles that mention “covid”, “coronavirus” or “pandemic” in the headline, plotted weekly over time.
  • The red area represents 200,000 US COVID-19 deaths, plotted weekly over time.
  • The headline and snippet of a random sample of up to 800 COVID-19 news articles from each day (30% of the overall total) were scored for positive or negative sentiment on a scale of +1 to -1. The blue line represents the average news story sentiment, plotted weekly over time.


I used Import.io to gather 1.2 million news articles from the websites of 13,000 different global, English-language news sources. Each article was published between January 1st, 2020 and September 28th, 2020 and had “covid”, “coronavirus” or “pandemic” in the headline. Each news website was classified according to the country of its primary audience. Articles from non-US news websites and from known US link aggregators (Reddit etc.) were excluded, leaving me with 665,000 news articles from 6,800 US news sources. Duplicate articles were removed prior to analysis, based on URL and on the combination of headline and snippet.
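The filtering and de-duplication steps described above can be sketched in a few lines of Python. This is only an illustration, not the code used for the analysis: the field names (`url`, `headline`, `snippet`, `country`, `domain`), the case-insensitive matching and the whitespace/case normalization of the headline + snippet key are all assumptions.

```python
import re

# The three search terms from the article, matched case-insensitively (assumption).
KEYWORDS = re.compile(r"covid|coronavirus|pandemic", re.IGNORECASE)


def headline_matches(headline):
    """Return True if the headline mentions any of the three search terms."""
    return bool(KEYWORDS.search(headline))


def deduplicate(articles):
    """Keep the first article seen for each URL and each headline + snippet pair."""
    seen_urls = set()
    seen_text = set()
    unique = []
    for article in articles:
        url = article["url"]
        # Normalizing case and surrounding whitespace is an assumption.
        text_key = (
            article["headline"].strip().lower(),
            article["snippet"].strip().lower(),
        )
        if url in seen_urls or text_key in seen_text:
            continue
        seen_urls.add(url)
        seen_text.add(text_key)
        unique.append(article)
    return unique


def filter_us_articles(articles, excluded_domains=frozenset()):
    """Keep US articles with a matching headline, drop aggregators, then de-duplicate."""
    kept = [
        a
        for a in articles
        if a["country"] == "US"
        and a["domain"] not in excluded_domains
        and headline_matches(a["headline"])
    ]
    return deduplicate(kept)
```

Calling `filter_us_articles(articles, excluded_domains={"reddit.com"})` on a list of article dictionaries would return the US-only, de-duplicated subset.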

For each of the 271 days since the beginning of the year, I randomly sampled up to 800 news articles published on that date and performed sentiment analysis on the article headline and snippet using Google’s Natural Language API. In total, 200,000 news stories (30% of the total) were scored for sentiment in this way.
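The per-day sampling step can be sketched as follows. It is a minimal illustration under stated assumptions: each article is a dictionary with a hypothetical `date` field, and the fixed `seed` is only there to make the example repeatable. The call to Google’s Natural Language API is deliberately left out; each sampled article’s headline and snippet would be sent for scoring afterwards.

```python
import random
from collections import defaultdict


def sample_per_day(articles, max_per_day=800, seed=0):
    """Randomly pick up to `max_per_day` articles for each publication date."""
    rng = random.Random(seed)  # seeded only so the sketch is reproducible

    # Group the articles by their publication date.
    by_date = defaultdict(list)
    for article in articles:
        by_date[article["date"]].append(article)

    sampled = []
    for day in sorted(by_date):
        day_articles = by_date[day]
        # Days with fewer than `max_per_day` articles are kept in full.
        k = min(max_per_day, len(day_articles))
        sampled.extend(rng.sample(day_articles, k))
    return sampled
```

Because quiet days contribute all of their articles and busy days are capped, the sample is not a fixed fraction of each day’s output.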

Google’s sentiment scoring gives ratings from +1 (most positive) to -1 (most negative). The average weekly COVID-19 story sentiment ranges from -0.22 in February to -0.12 in June. The notable result here is that there was a marked movement in the positive direction, even as cases and then deaths climbed steeply during March and April; both have remained high since.
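Turning per-article scores into a weekly average is a simple grouping exercise. The sketch below assumes the scores are available as `(date, score)` pairs and that weeks are ISO calendar weeks; the article does not say how weeks were defined, so that choice is an assumption.

```python
from collections import defaultdict
from datetime import date


def weekly_mean_sentiment(scored):
    """Average sentiment per ISO week from an iterable of (date, score) pairs."""
    totals = defaultdict(lambda: [0.0, 0])  # week -> [sum of scores, count]
    for day, score in scored:
        iso = day.isocalendar()
        week = (iso[0], iso[1])  # (ISO year, ISO week number)
        totals[week][0] += score
        totals[week][1] += 1
    return {week: total / count for week, (total, count) in sorted(totals.items())}


# Example with made-up scores (not data from the article):
example = [
    (date(2020, 3, 2), -0.2),
    (date(2020, 3, 8), -0.4),
    (date(2020, 3, 9), 0.0),
]
weekly = weekly_mean_sentiment(example)
```

Here every scored article carries equal weight within its week, so days with more sampled articles influence the weekly figure more than quiet days.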

Data from Johns Hopkins University was used for US COVID-19 deaths.  

The distribution of news stories per news source has a long tail, as you would expect: larger news organizations contributed more news stories to the dataset. The 100 largest contributing news sources account for 60% of the total articles in the dataset.
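A concentration figure like this can be checked with a short calculation. The sketch below assumes one source name per article in a plain list; the function name `top_source_share` is hypothetical.

```python
from collections import Counter


def top_source_share(sources, n=100):
    """Fraction of all articles contributed by the n most prolific sources."""
    counts = Counter(sources)  # articles per source
    total = sum(counts.values())
    if total == 0:
        return 0.0
    top = sum(count for _, count in counts.most_common(n))
    return top / total
```

With the real dataset, `top_source_share(sources, n=100)` would be expected to come out at roughly 0.6.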
