Plotting Frequency Over Time

Overview

Teaching: 0 min
Exercises: 0 min

Questions

How can I extract and plot the frequency of specific terms over time?

Objectives

We will use NLTK’s ConditionalFreqDist class to extract the frequency of defined words.

We will use the US Presidential Inaugural Addresses, which are provided with NLTK.

Plotting frequency over time

As with lexical dispersion, you can also plot the frequency of terms over time. This is similar to the Google n-gram visualisation for the Google Books corpus, but we will show you how to do the same for your own corpus.

You first need to import NLTK’s ConditionalFreqDist class from the nltk.probability package. To generate the graph, you have to specify the list of words to be plotted (see targets) and the x-axis labels (in this case the year the inaugural address was given, which appears at the start of each file name: fileid[:4]).
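As a small illustration of the fileid[:4] slice, the sketch below uses a hand-written list of file ids in the style of the inaugural corpus (a four-digit year, a dash, then a name). The list is an assumption made for the example and is not read from NLTK.

```python
# File ids written by hand in the style of NLTK's inaugural corpus
# (assumed for illustration, not loaded from the corpus itself).
fileids = ["1789-Washington.txt", "1861-Lincoln.txt", "2009-Obama.txt"]

# The first four characters of each file id give the year of the address.
years = [fileid[:4] for fileid in fileids]
print(years)
```

These year strings are what end up as labels on the x-axis of the plot.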

The plot is created by:

looping through each file (speech)
then looping through each word in each speech
then looping through the list of specified target words and
checking if each target word matches the start of each word in the speeches (after being lower-cased).

The ConditionalFreqDist object (cfd) stores the number of times each target word appears in each speech, and the plot() method is used to draw the graph.

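import re

import matplotlib.pyplot as plt
from nltk.corpus import inaugural
from nltk.probability import ConditionalFreqDist

# type this to set the figure size
plt.rcParams["figure.figsize"] = (12, 9)

# targets are treated as regular expressions, so 'wom[ae]n' covers woman and women
targets = ['great', 'good', 'tax', 'work', 'change', 'wom[ae]n']

# count (target, year) pairs; re.match anchors each target at the start of the word
cfd = ConditionalFreqDist((target, fileid[:4])
    for fileid in inaugural.fileids()
    for word in inaugural.words(fileid)
    for target in targets
    if re.match(target, word.lower()))
cfd.plot()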
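To see what the counting step stores before anything is plotted, here is a minimal sketch that reproduces the same nested loops with the standard library only. The mini corpus is made up for the example, and the nested Counter is a stand-in for ConditionalFreqDist, not NLTK's own implementation.

```python
from collections import Counter, defaultdict

# Made-up mini corpus: file id -> list of tokens (not real inaugural text).
speeches = {
    "1801-Example.txt": ["Great", "work", "is", "good", "work"],
    "1805-Example.txt": ["Workers", "change", "greatly"],
}
targets = ["great", "good", "work", "change"]

# counts[target][year] plays the role of cfd[target][year].
counts = defaultdict(Counter)
for fileid, words in speeches.items():   # each file (speech)
    for word in words:                   # each word in the speech
        for target in targets:           # each target word
            if word.lower().startswith(target):
                counts[target][fileid[:4]] += 1

for target in targets:
    print(target, dict(counts[target]))
```

Because the test is a prefix match, "Workers" is counted under work and "greatly" under great.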

Task 1: See how the plot changes when choosing different target words.

Answer

plt.rcParams["figure.figsize"] = (12, 9)
targets=['god','work']
cfd = ConditionalFreqDist((target, fileid[:4])
    for fileid in inaugural.fileids()
    for word in inaugural.words(fileid)
    for target in targets
    if word.lower().startswith(target))
cfd.plot()

Task 2: Use a regular expression to match the target words exactly, instead of matching every word that starts with a target word.

Answer

import re

plt.rcParams["figure.figsize"] = (12, 9)
targets=['m[ea]n']
# re.fullmatch only succeeds when the whole word matches the pattern
cfd = ConditionalFreqDist((target, fileid[:4])
    for fileid in inaugural.fileids()
    for word in inaugural.words(fileid)
    for target in targets
    if re.fullmatch(target, word.lower()))
cfd.plot()
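The difference between literal, prefix and exact matching can be checked on a handful of words without loading the corpus. The sketch below uses only Python's re module; the word list is made up for the example.

```python
import re

# Made-up sample words, already lower-cased.
words = ["man", "men", "mankind", "mention", "woman", "many"]
pattern = "m[ea]n"

# str.startswith treats the pattern as literal text, so the brackets never match.
literal = [w for w in words if w.startswith(pattern)]
# re.match anchors at the start of the word only (prefix match).
prefix = [w for w in words if re.match(pattern, w)]
# re.fullmatch requires the whole word to match (exact match).
exact = [w for w in words if re.fullmatch(pattern, w)]

print(literal)
print(prefix)
print(exact)
```

Only the exact match restricts the count to "man" and "men"; the prefix match also picks up words such as "mankind" and "mention".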

Key Points

Here we extracted the terms and the years from the files using NLTK’s ConditionalFreqDist class from the nltk.probability package.

We then plotted these on a graph to visualise how their use changes over time.
