This lesson is still being designed and assembled (Pre-Alpha version)

Plotting Frequency Over Time

Overview

Teaching: 0 min
Exercises: 0 min
Questions
  • How can I extract and plot the frequency of specific terms over time?

Objectives
  • We will use NLTK’s ConditionalFreqDist class to extract the frequency of defined words.

  • We will use the US Presidential Inaugural Addresses, which are provided with NLTK.

Plotting frequency over time

As with lexical dispersion, you can also plot the frequency of terms over time. This is similar to the Google n-gram visualisation for the Google Books corpus, but we will show you how to do it for your own corpus.

You first need to import NLTK’s ConditionalFreqDist class from the nltk.probability module. To generate the graph, you have to specify the list of words to be plotted (see targets) and the x-axis labels (in this case the year in which the inaugural address was given, which appears at the start of each file name: fileid[:4]).
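As a quick illustration of the fileid[:4] slice, the sketch below uses a few file names written in the same year-first style as the inaugural corpus. They are typed in by hand here as an assumption, rather than read from NLTK.

```python
# File names in the year-first style of the inaugural corpus
# (hard-coded here for illustration, not loaded from NLTK).
fileids = ['1789-Washington.txt', '1861-Lincoln.txt', '2009-Obama.txt']

# The first four characters of each file name are the year of the address.
years = [fileid[:4] for fileid in fileids]
print(years)
```

Note that the years are kept as strings, which is enough for using them as x-axis labels.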

The ConditionalFreqDist object (cfd) stores the number of times each of the target words appears in each of the speeches, and the plot() method is used to draw the graph.

The plot is created by:

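import re

import matplotlib.pyplot as plt
from nltk.corpus import inaugural
from nltk.probability import ConditionalFreqDist

# type this to set the figure size
plt.rcParams["figure.figsize"] = (12, 9)

targets = ['great', 'good', 'tax', 'work', 'change', 'wom[ae]n']

# Pair each matching target (condition) with the speech year (sample).
# re.match anchors at the start of the word, so plain targets behave like
# startswith() and the pattern 'wom[ae]n' matches both "woman" and "women".
cfd = ConditionalFreqDist((target, fileid[:4])
    for fileid in inaugural.fileids()
    for word in inaugural.words(fileid)
    for target in targets
    if re.match(target, word.lower()))
cfd.plot()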
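To see what the cfd object holds without drawing a graph, here is a minimal sketch on a tiny made-up corpus. The toy_speeches dictionary and its contents are invented for illustration; only ConditionalFreqDist itself comes from NLTK.

```python
from nltk.probability import ConditionalFreqDist

# Hypothetical mini-corpus: file name -> list of words (invented data).
toy_speeches = {
    '1789-Example.txt': ['Great', 'work', 'and', 'good', 'workers'],
    '1793-Example.txt': ['good', 'work', 'is', 'good'],
}
targets = ['good', 'work']

# Same pattern as in the lesson: (condition, sample) = (target, year).
toy_cfd = ConditionalFreqDist(
    (target, fileid[:4])
    for fileid, words in toy_speeches.items()
    for word in words
    for target in targets
    if word.lower().startswith(target)
)

# Each condition (target word) maps to a frequency distribution over years.
# "work" and "workers" both start with "work", so both are counted for 1789.
print(toy_cfd['work']['1789'])
print(toy_cfd['good']['1793'])
```

Calling toy_cfd.tabulate() should print the same counts as a small table, which can be a handy check before plotting.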

Task 1: See how the plot changes when choosing different target words.

Answer

plt.rcParams["figure.figsize"] = (12, 9)
targets = ['god', 'work']
cfd = ConditionalFreqDist((target, fileid[:4])
    for fileid in inaugural.fileids()
    for word in inaugural.words(fileid)
    for target in targets
    if word.lower().startswith(target))
cfd.plot()

Task 2: Use a regular expression to match the target words exactly, instead of matching any word that starts with a target word.

Answer

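import re

plt.rcParams["figure.figsize"] = (12, 9)
targets = ['m[ea]n']
# re.fullmatch succeeds only when the whole word matches the pattern,
# so "man" and "men" are counted but "many" or "mention" are not.
cfd = ConditionalFreqDist((target, fileid[:4])
    for fileid in inaugural.fileids()
    for word in inaugural.words(fileid)
    for target in targets
    if re.fullmatch(target, word.lower()))
cfd.plot()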
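The difference between prefix matching and exact matching can be checked on a handful of words before running it over the whole corpus. The word list below is made up for illustration.

```python
import re

pattern = 'm[ea]n'
words = ['man', 'men', 'many', 'mention', 'woman']  # invented sample words

# str.startswith treats the pattern as literal text, so nothing matches.
literal = [w for w in words if w.startswith(pattern)]

# re.match anchors only at the start of the word, so any word that
# begins with "man" or "men" matches.
prefix = [w for w in words if re.match(pattern, w)]

# re.fullmatch requires the whole word to match the pattern.
exact = [w for w in words if re.fullmatch(pattern, w)]

print(literal)
print(prefix)
print(exact)
```

Only the last list is restricted to "man" and "men", which is the behaviour Task 2 asks for.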

Key Points

  • Here we extracted the terms and the years from the files using NLTK’s ConditionalFreqDist class from the nltk.probability package

  • We then plotted these on a graph to visualise how the use of each term changes over time