Text Mining BBC Headlines With R

Recently I discovered Text Mining with R: A Tidy Approach, a new book by Stack Overflow data scientists Julia Silge and David Robinson. Text Mining's goal? Empower R users to easily complete common text analysis tasks by leveraging the tidyverse framework familiar to Hadley Wickham adherents.

Of course, to properly apply the book's techniques myself I needed some good data. Enter BBC headlines. I'm always intrigued by what the BBC chooses to cover - and especially how the site's focus has evolved over time. To get historical headlines, I scraped Wayback Machine snapshots of the BBC homepage going back to 2014. Mix in a typical data wrangling/wrestling process and I was left with a deduped dataset of 2,885 headlines.

Now for the tidyverse part. A simple application of Text Mining's unnest_tokens function generated an appropriately "tidy" dataset of one word per row. I did concatenate some obvious bigrams (e.g. North Korea) but otherwise stuck with individual words throughout. Next, I applied Text Mining's code to summarize word frequencies. We find some perhaps predictable terms in the top spots:
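To make the tokenizing step concrete, here is a minimal sketch on a few made-up headlines. It is not the exact code behind this post: the `headlines` data frame and its column names are hypothetical, and dropping stop words with tidytext's `stop_words` table is an assumption borrowed from the book's examples.

```r
library(dplyr)
library(tidytext)

# Hypothetical stand-in for the scraped headline data
headlines <- tibble(
  year = c(2014, 2015, 2016, 2017),
  headline = c(
    "Gaza conflict: Israel and Hamas agree ceasefire",
    "Greece debt crisis: Eurozone talks continue",
    "Trump wins election",
    "Trump meets Putin"
  )
)

# One word per row; unnest_tokens lowercases and strips punctuation by default
tidy_headlines <- headlines %>%
  unnest_tokens(word, headline) %>%
  anti_join(stop_words, by = "word")  # assumption: common stop words are dropped

# Word frequencies, most common first
word_counts <- tidy_headlines %>%
  count(word, sort = TRUE)
```

With real data, `word_counts` is the table you would plot to get the top terms.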

While it is not surprising that the U.S. and Trump have dominated headlines, it is interesting to examine how the focus of BBC coverage has shifted over the years. One approach suggested in Text Mining is to calculate each word's tf-idf score within a document, where tf-idf refers to "term frequency–inverse document frequency." The goal is to find words that occur frequently in a particular document (e.g. all the headlines for a given year) but are not terribly common across an entire corpus (e.g. all headlines from 2014-2017). Applying tf-idf by year, we find that Gaza stories were prominent in 2014, the Greece debt crisis was on everybody's mind in 2015, while from 2016 onwards we have all been living in TrumpWorld:
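As a rough illustration of the idea, the sketch below treats each year as one "document" and scores words with tidytext's `bind_tf_idf`. The `year_words` table and its counts are toy values invented for the example, not figures from my dataset.

```r
library(dplyr)
library(tidytext)

# Hypothetical word counts per year (toy numbers, not the post's data)
year_words <- tibble(
  year = c(2014, 2014, 2015, 2015, 2016, 2016),
  word = c("gaza", "us", "greece", "us", "trump", "us"),
  n    = c(5, 10, 4, 10, 8, 10)
)

# bind_tf_idf(tbl, term, document, n): here each year is one document
year_tf_idf <- year_words %>%
  bind_tf_idf(word, year, n) %>%
  arrange(desc(tf_idf))
```

Because "us" shows up in every year, its inverse document frequency is ln(3/3) = 0 and its tf-idf is zero no matter how often it appears. Words tied to a single year, like "gaza" or "trump" in this toy table, float to the top.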

There are other options beyond tf-idf. In particular, one of Text Mining's case studies demonstrates a model-based technique using Twitter archives. Here, a separate binomial GLM is fit for each word, modeling that word's count relative to the total word count in each time period. A positive slope for a given word's model indicates the word is now appearing (relatively) more often, while a negative slope indicates reduced frequency. Given the high volume of models, significance is assessed with adjusted p-values to avoid multiple comparison issues. I used a .01 adjusted p-value threshold to assess significance (along with requiring a minimum of 50 appearances per word). Below are the relative frequencies across time for all words found to have significant slopes:
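The sketch below shows the general shape of this approach in base R. It is a simplified stand-in rather than the book's or this post's exact code: the `word_by_year` table is invented, time is measured in whole years, and the response is written as successes and failures (`count`, `total - count`). The 50-appearance minimum and the .01 threshold match the values above, and `p.adjust` is left on its default Holm method.

```r
# Hypothetical counts per word per year; `total` is all words seen that year
word_by_year <- data.frame(
  word  = rep(c("trump", "ukraine", "police"), each = 4),
  year  = rep(2014:2017, times = 3),
  count = c(2, 5, 40, 60,   50, 30, 10, 5,   20, 21, 19, 20),
  total = 1000
)
word_by_year$time <- word_by_year$year - 2014  # years since the first snapshot

# Keep only words with at least 50 appearances overall
appearances <- tapply(word_by_year$count, word_by_year$word, sum)
frequent <- word_by_year[word_by_year$word %in% names(appearances)[appearances >= 50], ]

# Fit one binomial GLM per word: successes = count, failures = total - count
fit_slope <- function(df) {
  model <- glm(cbind(count, total - count) ~ time, data = df, family = "binomial")
  coefs <- summary(model)$coefficients
  data.frame(
    word    = df$word[1],
    slope   = coefs["time", "Estimate"],
    p_value = coefs["time", "Pr(>|z|)"]
  )
}
slopes <- do.call(rbind, lapply(split(frequent, frequent$word), fit_slope))

# Adjust for multiple comparisons (Holm by default), then apply the threshold
slopes$adj_p_value <- p.adjust(slopes$p_value)
significant <- slopes[slopes$adj_p_value < 0.01, ]
```

In this toy table "trump" gets a positive slope and "ukraine" a negative one, while the flat "police" series does not clear the threshold.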

Unsurprisingly, words like Trump and Gaza appear again, but the GLM approach also identifies "Ukraine" as a significant decliner - a word missed by tf-idf scores.

You might be wondering what the "opportunity cost" has been of all the U.S. election/Trump stories from the past 1.5 years. Using the BBC's regional classification for stories - as parsed from headline URLs - we can see that stories from Europe, the Middle East, and Africa have absorbed the bulk of reduced press coverage since 2016:
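For a sense of what parsing a region out of a URL can look like, here is a small base R sketch. The URLs, story ids, and the `/news/world-<region>-<id>` path pattern are assumptions made for illustration, so the regular expression may need adjusting for real links.

```r
# Hypothetical headline URLs; the path pattern is an assumption about BBC links
urls <- c(
  "http://www.bbc.com/news/world-europe-12345678",
  "http://www.bbc.com/news/world-middle-east-23456789",
  "http://www.bbc.com/news/world-us-canada-34567890",
  "http://www.bbc.com/news/business-45678901"
)

# Capture the text between "world-" and the trailing numeric story id
pattern <- "^.*/news/world-([a-z-]+)-[0-9]+$"

# URLs that do not match the pattern get NA rather than a bogus region
region <- ifelse(grepl(pattern, urls), sub(pattern, "\\1", urls), NA_character_)
```

Counting `region` by year then gives the share of coverage each part of the world receives.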

Sentiment Analysis

Switching topics, Text Mining has a nice sentiment analysis section that demonstrates usage of the various sentiment dictionaries found in the book's companion tidytext package. Below I break out BBC headline word frequencies into positive and negative categories, as determined by the included "bing" lexicon.

Note that applying sentiment dictionaries does require some care! For example, "trump" was initially listed as a positive word. This may or may not ring true depending on the reader's political persuasion, but for the purpose of analytic objectivity I thought it best to exclude "trump" from either category.
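A minimal sketch of this step, assuming a tidy one-word-per-row table: the `headline_words` values below are made up, while `get_sentiments("bing")` is the lexicon lookup that ships with tidytext. Filtering "trump" out of the lexicon before joining keeps it out of both categories.

```r
library(dplyr)
library(tidytext)

# Hypothetical tidy headline words (one word per row)
headline_words <- tibble(
  word = c("attack", "crash", "trump", "win", "attack", "bomb", "peace")
)

# Load the bing lexicon and drop "trump" so it lands in neither category
bing <- get_sentiments("bing") %>%
  filter(word != "trump")

# Keep only words found in the lexicon, then count by word and sentiment
sentiment_counts <- headline_words %>%
  inner_join(bing, by = "word") %>%
  count(word, sentiment, sort = TRUE)
```

Words missing from the lexicon simply fall out at the `inner_join`, so it is worth checking what gets dropped as well as what gets labeled.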

In any case, we find that BBC headlines are dominated by one-off destructive events, from "attack" and "crash" to "strike" and "bomb." Terror is very effective at garnering media coverage.

Network Effects

On a brighter note, I'd like to close with my favorite visual from Text Mining: word networks using igraph and ggraph. I think these look terrific. The network below visualizes connections between words appearing in the same headline (minimum 8 matches). Here, we can see the vast quantity of BBC coverage devoted to U.S. foreign policy, from the crises in Ukraine and Syria to interactions with Russia, Iran, and China. Some anomalous stories are shown detached, such as the search for missing flight MH370 and news on Israel and the Gaza Strip.
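Here is a rough sketch of how such a network can be put together. It is not the code behind the figure: the `headline_words` table is invented, the co-occurrence counts come from a plain self-merge rather than any particular helper package, and `min_matches` is set to 1 so the toy data survives the filter (the network above used 8).

```r
library(dplyr)
library(igraph)
library(ggraph)

# Hypothetical tidy data: one row per (headline id, word)
headline_words <- tibble(
  id   = c(1, 1, 1, 2, 2, 3, 3, 3),
  word = c("us", "russia", "syria", "us", "russia", "israel", "gaza", "strip")
)

# Self-merge on headline id to pair up words, keeping each unordered pair once
word_pairs <- merge(headline_words, headline_words, by = "id", suffixes = c("_a", "_b")) %>%
  filter(word_a < word_b) %>%
  count(word_a, word_b, sort = TRUE)

min_matches <- 1  # the post used 8; lowered here for the toy data

# First two columns become the edge list; n is kept as an edge attribute
word_graph <- word_pairs %>%
  filter(n >= min_matches) %>%
  graph_from_data_frame(directed = FALSE)

# Build the plot object; print(network_plot) draws it
network_plot <- ggraph(word_graph, layout = "fr") +
  geom_edge_link(aes(edge_alpha = n)) +
  geom_node_point() +
  geom_node_text(aes(label = name), vjust = 1, hjust = 1)
```

In the toy graph the Israel/Gaza words form their own component, separate from the U.S./Russia/Syria cluster, which is the same kind of detachment visible in the real network.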

Find all code for this analysis here.
