[UPDATE: Charts and dataset are being updated automatically and are available for download, links below.]
Web scraping news stories reveals interesting trends in how the media cover the 2020 election campaigns of Donald Trump and Joe Biden. I used Import.io to access and analyze more than 100,000 news stories from the websites of 1,500 US news organizations to compare the media coverage of Trump and Biden in the run-up to the 2020 US presidential election.
The web data story
During the September 29th, 2020 Presidential debate held in Cleveland, OH, Donald Trump said to Joe Biden:
“They give you good press, they give me bad press because that’s the way it is, unfortunately”
Is this a true statement? Kind of. I found that Trump does get consistently more negative press than Biden, but he also gets far more of it: more than double the coverage overall, and 5x when comparing stories about only one candidate. The Trump media strategy seems to be best expressed by the old saying “there is no such thing as bad publicity”.
Events, dear boy, events
Both candidates’ media sentiment is in “negative neutral” territory and has moved up and down in response to events. The presidential debate, for example, negatively impacted both candidates’ media sentiment scores. Trump’s COVID-19 diagnosis caused media sentiment towards him to turn more positive, raising his daily media sentiment score above Biden’s for the longest continuous period since I started measuring at the beginning of September. But his earlier-than-expected return to the White House reversed those gains.
Trump has more bad days than good days
Overall, media sentiment towards Trump is lower than media sentiment towards Biden. In addition, media sentiment towards Trump swings more dramatically across a wider range than media sentiment towards Biden.
Trump gets much more coverage than Biden
Since September, Trump has received more than double the media coverage of Biden (5x more if you compare stories just about Trump with stories just about Biden).
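The two ratios differ because stories that mention both candidates count towards each candidate's total but drop out of the "just about" comparison. Here is a minimal sketch of that arithmetic, assuming each story is represented by the set of candidates it mentions. The function name and the toy counts are made up for illustration and are not taken from the dataset.

```python
def coverage_ratios(mentions):
    """Return (any-mention ratio, exclusive ratio) of Trump to Biden coverage.

    `mentions` is a list of sets, one per story, holding the candidates
    that story names. This representation is an assumption for the sketch.
    """
    trump_any = sum(1 for m in mentions if "Trump" in m)
    biden_any = sum(1 for m in mentions if "Biden" in m)
    trump_only = sum(1 for m in mentions if m == {"Trump"})
    biden_only = sum(1 for m in mentions if m == {"Biden"})
    return trump_any / biden_any, trump_only / biden_only


# Toy counts chosen only to illustrate the effect, not real figures.
toy = [{"Trump"}] * 10 + [{"Trump", "Biden"}] * 6 + [{"Biden"}] * 2
any_ratio, only_ratio = coverage_ratios(toy)
print(any_ratio, only_ratio)
```

In the toy data the shared stories pull the any-mention ratio down towards 1, while the exclusive ratio ignores them entirely.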
The longest week
Did you feel like you experienced a year’s worth of news in the last week of September / first week of October? Turns out that in terms of the volume of election news stories, it was only 1.7x more than the average of the previous 4 weeks. Still…it felt like a lot.
Since the beginning of September I have used Import.io to collect 71,252 news stories mentioning Trump, Biden, or both from the websites of 2,135 English-language news sources. Each news website was classified according to the primary country of its audience. Articles from non-US news websites and US link aggregators (e.g. Reddit) were excluded, leaving 49,682 news stories from 1,571 US news sources for sentiment analysis. Duplicate articles, identified by both URL and headline+snippet, were removed.
Entity-level sentiment analysis was performed on every one of the 49,682 headline+snippet combinations using Google’s Natural Language API. Entity-level sentiment analysis first identifies entities in the text and then calculates a sentiment score for each entity. That score rates how positively or negatively the entity is talked about in the news story, based on an analysis of the language, and is a decimal number in the range from +1 (positive sentiment) to -1 (negative sentiment). For example, here is a headline and snippet combination that was positive for Joe Biden and negative for Donald Trump:
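The filtering and de-duplication described above can be sketched roughly as follows. This is an illustration under assumptions, not the pipeline that produced the dataset: the `Story` fields, the function name, the order of the two steps, and the lowercase/whitespace normalisation of headline+snippet are all hypothetical choices.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Story:
    url: str
    headline: str
    snippet: str
    source_country: str  # primary audience country of the source (hypothetical field)
    is_aggregator: bool  # True for link aggregators such as Reddit (hypothetical field)


def filter_and_dedupe(stories):
    """Keep US, non-aggregator stories; drop repeats by URL or by headline+snippet."""
    seen_urls = set()
    seen_text = set()
    kept = []
    for s in stories:
        # Exclude non-US sources and link aggregators.
        if s.source_country != "US" or s.is_aggregator:
            continue
        # Assumed normalisation: ignore case and surrounding whitespace.
        text_key = (s.headline.strip().lower(), s.snippet.strip().lower())
        if s.url in seen_urls or text_key in seen_text:
            continue
        seen_urls.add(s.url)
        seen_text.add(text_key)
        kept.append(s)
    return kept
```

Checking both keys catches the same article republished under a different URL as well as the same URL collected twice.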
September 3rd, 2020: “Former Michigan governor Rick Snyder: I am a Republican vote for Biden. Donald Trump is a bully who lacks a moral compass. Joe Biden would bring back civility. Forty-four years ago, I celebrated my 18th birthday at the 1976 Republican National Convention as part of Gera…”
You can see that in this news story, both Donald Trump and Joe Biden were identified. Google’s Natural Language service judged that sentiment towards Trump was very negative (“is a bully who lacks a moral compass”), while sentiment towards Joe Biden was closer to neutral but positive (“would bring back civility”).
The Google Natural Language service returns a Wikipedia URL as entity metadata for each entity that it positively identifies with a high level of confidence. Entities and their associated sentiment scores were only included for analysis where the entity Wikipedia URL was either https://en.wikipedia.org/wiki/Donald_Trump or https://en.wikipedia.org/wiki/Joe_Biden. This excludes sentiment scores for different entities with names similar to the two candidates. For example, I did not want to include sentiment scores for the Trump Organization (https://en.wikipedia.org/wiki/The_Trump_Organization).
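A minimal sketch of that Wikipedia URL filter, assuming the entity sentiment response has already been parsed into a Python dict in which each entity carries `metadata.wikipedia_url` and `sentiment.score`. The function and constant names are hypothetical, and the sketch makes no API call.

```python
# Only these two Wikipedia URLs count as the candidates themselves.
CANDIDATE_URLS = {
    "https://en.wikipedia.org/wiki/Donald_Trump": "Trump",
    "https://en.wikipedia.org/wiki/Joe_Biden": "Biden",
}


def candidate_scores(response):
    """Return {candidate: [scores]} for entities whose Wikipedia URL matches a candidate.

    `response` is assumed to be a dict with an "entities" list; entities
    without a Wikipedia URL, or with any other URL, are skipped.
    """
    scores = {}
    for entity in response.get("entities", []):
        url = entity.get("metadata", {}).get("wikipedia_url")
        candidate = CANDIDATE_URLS.get(url)
        if candidate is None:
            continue
        scores.setdefault(candidate, []).append(entity["sentiment"]["score"])
    return scores
```

Matching on the URL rather than the entity name means that "Trump Organization" or an unresolved "Trump" mention never contributes a score.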
Selection of sources
The selection of news sources was blind: I did not have a human sit down and choose news sources to use. Such a deliberate selection would inevitably have introduced bias that would have been difficult to control for. Instead, I monitored social media and news aggregators for the stories that people share and, as news stories appeared, I added each news source to my catalogue of news sources to be searched for stories about Trump and Biden. These are the news stories that the electorate actually see: on social media, as alerts on their phones, as talking points on the television news. I believe that, taken in aggregate, this dataset and the media sentiment scores calculated from it represent a good estimate of the media coverage and media sentiment of the two candidates going into the 2020 election.
Individual news sources appear to have their own partisan biases reflected in the candidate sentiment scores that I calculated. If you didn’t know anything about The New York Times, Fox News or Bloomberg, you might be able to guess their preferred candidate just by inspecting the candidate sentiment scores.