Recently I discovered Text Mining with R: A Tidy Approach, a new guide by Julia Silge and David Robinson that synthesizes common text analysis tasks with the tidyverse concepts familiar to all Hadley Wickham adherents.
To test out the book's techniques, I scraped BBC headlines since 2014 using the Wayback Machine. After the usual data-wrangling process, I was left with a de-duplicated dataset of 2,885 headlines that had appeared in the BBC's top headline slot. From there, a simple application of tidytext's unnest_tokens function gave me an appropriately "tidy" dataset of one word per row. I did concatenate some obvious bigrams (e.g. "North Korea") but otherwise stuck with individual words.
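The tokenization step looks roughly like the following sketch, assuming a data frame `headlines` with a `headline` column (the column names are my own, not the exact ones I used):

```r
library(dplyr)
library(tidytext)

tidy_headlines <- headlines %>%
  # concatenate obvious bigrams before tokenizing, e.g. "North Korea"
  mutate(headline = gsub("North Korea", "north_korea", headline)) %>%
  # one row per word, lowercased, punctuation stripped
  unnest_tokens(word, headline) %>%
  # drop common stop words ("the", "of", ...) via tidytext's built-in list
  anti_join(stop_words, by = "word")
```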
Jumping into some analysis, I leveraged Text Mining's code to summarize word frequencies. Here, we find some predictable terms in the top spots:
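With a tidy one-word-per-row data frame, the frequency summary is a one-liner, as in the book:

```r
library(dplyr)

# top words across all headlines; `tidy_headlines` is the
# one-word-per-row data frame described above
tidy_headlines %>%
  count(word, sort = TRUE) %>%
  top_n(10, n)
```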
While not surprising that the U.S. and Trump have dominated headlines, it is interesting to examine how the focus of BBC coverage has shifted over the years. One approach suggested in Text Mining is to calculate a document's tf-idf score, where tf-idf refers to "term frequency–inverse document frequency." The goal is to find words that occur frequently in a particular document (e.g. all the headlines for a given year) but are not terribly common across an entire corpus (e.g. all headlines from 2014-2017). Applying tf-idf by year, we find that Gaza stories were prominent in 2014, the Greece debt crisis was on everybody's mind in 2015, while from 2016 onwards we have been living in TrumpWorld:
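The yearly tf-idf scores can be computed with tidytext's bind_tf_idf, treating each year as a "document"; this sketch assumes a `year` column on the tidy data frame:

```r
library(dplyr)
library(tidytext)

headline_tf_idf <- tidy_headlines %>%
  count(year, word) %>%
  # tf-idf with word as the term and year as the document
  bind_tf_idf(word, year, n) %>%
  arrange(desc(tf_idf))

# highest-scoring words within each year
headline_tf_idf %>%
  group_by(year) %>%
  top_n(10, tf_idf)
```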
There are other options beyond tf-idf. In particular, one of Text Mining's case studies demonstrates a model-based technique using Twitter archives. Separate binomial GLMs are fit for each word, modeling the word's count out of the total words in each time period as a function of time. A positive slope indicates the word is appearing more often over time, while a negative slope indicates declining frequency. Because so many models are fit, significance is assessed with adjusted p-values to avoid multiple-comparison issues; I used an adjusted p-value threshold of .01 (along with a minimum of 50 total appearances). Below are the frequencies over time for all words with significant slopes:
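Adapting the book's Twitter case study to headlines, the modeling step is roughly as follows; the column names and monthly binning are my assumptions:

```r
library(dplyr)
library(tidyr)
library(purrr)
library(broom)
library(lubridate)

# monthly word counts alongside monthly totals
words_by_time <- tidy_headlines %>%
  mutate(time_floor = floor_date(date, unit = "month")) %>%
  count(time_floor, word) %>%
  group_by(time_floor) %>%
  mutate(time_total = sum(n)) %>%
  ungroup() %>%
  group_by(word) %>%
  filter(sum(n) >= 50) %>%   # minimum 50 total appearances
  ungroup()

# one binomial GLM per word: successes = word count,
# failures = remaining words that month, predictor = time
slopes <- words_by_time %>%
  nest(data = -word) %>%
  mutate(model  = map(data, ~ glm(cbind(n, time_total - n) ~ time_floor,
                                  data = .x, family = "binomial")),
         tidied = map(model, tidy)) %>%
  unnest(tidied) %>%
  filter(term == "time_floor") %>%
  # adjust p-values to guard against multiple comparisons
  mutate(adjusted.p.value = p.adjust(p.value)) %>%
  filter(adjusted.p.value < 0.01)
```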
Unsurprisingly, words like Trump and Gaza appear again, but the GLM approach also identifies "Ukraine" as a significant decliner, a word missed using tf-idf scores.
You might be wondering about the "opportunity cost" of all the U.S. election/Trump stories from the past year and a half. Using the BBC's regional classification for stories (parsed from headline URLs), we can see that stories from Europe, the Middle East, and Africa have borne the bulk of the reduced press coverage since 2016:
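As a hypothetical sketch of the URL parsing, BBC world-news URLs tend to embed the region (e.g. ".../news/world-europe-12345678"); the regex and column names below are my assumptions, not the exact code used here:

```r
library(dplyr)
library(stringr)
library(lubridate)

headlines <- headlines %>%
  # first capture group grabs e.g. "europe" or "middle-east"
  mutate(region = str_match(url, "news/world-([a-z-]+?)-\\d+")[, 2])

# story counts by year and region
headlines %>%
  count(year = year(date), region)
```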
Switching to a different topic, Text Mining has a nice sentiment-analysis section that demonstrates usage of the sentiment dictionaries found in the book's associated tidytext package. Below I break out BBC headline word frequencies by positive and negative categories, as classified by the included "bing" lexicon.
Note that using these sentiment dictionaries does require some care. For example, "trump" was listed as a positive word. This may or may not ring true depending on the reader's political persuasion, but in the interest of analytic objectivity I thought it best to remove the word entirely.
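The sentiment join, with "trump" excluded as discussed above, can be sketched as:

```r
library(dplyr)
library(tidytext)

tidy_headlines %>%
  # attach positive/negative labels from the "bing" lexicon
  inner_join(get_sentiments("bing"), by = "word") %>%
  # "trump" is labeled positive in the lexicon; drop it
  filter(word != "trump") %>%
  count(sentiment, word, sort = TRUE)
```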
In any case, we find that BBC headline coverage is dominated by one-off destructive events, from "attack" and "crash" to "strike" and "bomb." Terror is still very effective at garnering media coverage.
I'd like to close with my favorite visual from Text Mining: word networks using igraph and ggraph. The network below visualizes connections between words appearing in the same headline (minimum 8 matches). Here, we can see the vast quantity of BBC coverage devoted to U.S. foreign policy, from Ukraine and Syria to interactions with Russia, Iran, and China. On the edges are some anomalous stories, such as the search for MH370 plus reports on Israel and the Gaza Strip.
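The network itself can be built roughly as in the book, using pairwise_count from the widyr package to count words co-occurring in the same headline; the `headline_id` column is an assumption:

```r
library(dplyr)
library(widyr)
library(igraph)
library(ggraph)

# pairs of words appearing in the same headline at least 8 times
word_pairs <- tidy_headlines %>%
  pairwise_count(word, headline_id, sort = TRUE) %>%
  filter(n >= 8)

set.seed(2017)  # reproducible layout
word_pairs %>%
  graph_from_data_frame() %>%
  ggraph(layout = "fr") +
  geom_edge_link(aes(edge_alpha = n)) +
  geom_node_point() +
  geom_node_text(aes(label = name), repel = TRUE)
```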