Counting Tokens in Text

Overview

Teaching: 0 min
Exercises: 0 min

Questions

How can I count tokens in text?

Objectives

Learn how to count tokens in text.

Counting tokens in text

You can also do other useful things, such as counting the number of tokens in a text, determining the count and relative frequency of particular tokens, and plotting the count distribution as a graph. To do this, import the FreqDist class from NLTK's probability module. When creating a FreqDist object, pass a list of tokens from a text or corpus as the argument in parentheses.

from nltk.probability import FreqDist
fdist = FreqDist(lower_india_tokens)
fdist

FreqDist({'the': 5923, ',': 5332, '.': 5258, 'of': 4062, 'and': 2118, 'in': 2117, 'to': 1891, 'is': 1124, 'a': 1049, 'that': 816, ...})

The output shows the ten most frequent tokens and their frequency counts; the trailing ... indicates that the remaining tokens are not displayed.
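To try this without loading a full corpus, here is a minimal sketch using a small hand-made token list. The list `tokens` is an illustrative stand-in for lower_india_tokens and is not part of the lesson data. The most_common() method, which FreqDist inherits from Python's collections.Counter, returns the most frequent tokens as (token, count) pairs.

```python
from nltk.probability import FreqDist

# A tiny hand-made token list (an assumption for illustration,
# standing in for lower_india_tokens)
tokens = ["the", "cat", "sat", "on", "the", "mat", ".", "the", "end", "."]

fdist = FreqDist(tokens)

# most_common(n) returns the n most frequent tokens as (token, count) pairs
top_two = fdist.most_common(2)
print(top_two)  # [('the', 3), ('.', 2)]
```

Calling most_common() without an argument returns every token, ordered from most to least frequent.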

We can count the total number of tokens in a corpus using the N() method:

fdist.N()

And count the number of times a token appears in a corpus:

fdist['she']

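We can also determine the relative frequency of a token, that is, the proportion of all tokens in the corpus that it accounts for. The freq() method returns a value between 0 and 1, so multiply it by 100 to get a percentage: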

fdist.freq('she')

0.0002778638680787851
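The three calls above are related: the relative frequency is the token's count divided by the total number of tokens. The sketch below checks this on a small hand-made token list (an assumption for illustration, not the lesson corpus).

```python
from nltk.probability import FreqDist

# Hand-made tokens for illustration only
tokens = ["she", "said", "that", "she", "would", "go", "."]
fdist = FreqDist(tokens)

total = fdist.N()        # total number of tokens: 7
count = fdist["she"]     # occurrences of 'she': 2
rel = fdist.freq("she")  # relative frequency: 2 / 7

print(total, count)
print(rel == count / total)   # True
print(f"{rel * 100:.2f}%")    # 28.57%

# A token that never occurs has a count of 0 rather than raising an error
print(fdist["he"])            # 0
```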

If you have a list of tokens created using regular expression matching, as in the previous section, and you would like to count them, you can simply take the length of the list:

len(womaen_strings)
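As a rough sketch of how the two approaches fit together, the example below builds a small list of matches and counts it both ways. The token list and the pattern `wom[ae]n` are assumptions made for illustration; the list created in the previous section may have been built differently.

```python
import re
from nltk.probability import FreqDist

# Hand-made tokens for illustration only
tokens = ["a", "woman", "met", "two", "women", "and", "another", "woman", "."]

# Hypothetical reconstruction of a regex-filtered list:
# keep tokens that are exactly 'woman' or 'women'
womaen_strings = [t for t in tokens if re.fullmatch(r"wom[ae]n", t)]

# The length of the list is the total number of matches
print(len(womaen_strings))  # 3

# A FreqDist over the same list gives the same total,
# plus a breakdown by distinct token
womaen_fdist = FreqDist(womaen_strings)
print(womaen_fdist.N())            # 3
print(womaen_fdist.most_common())  # [('woman', 2), ('women', 1)]
```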

Frequency counts of tokens are useful for comparing corpora in terms of how often particular words or expressions occur, for example to see whether a word is much rarer in one corpus than in another. Counts of tokens per document and across an entire corpus can also be used to compute a simple similarity score between two documents (e.g. see Jana Vembunarayanan’s blogpost for a hands-on example of how to do that).
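The sketch below illustrates both ideas on two tiny hand-made token lists (assumptions for illustration). It compares relative frequencies, which are fairer than raw counts when corpora differ in size, and then computes cosine similarity over raw token counts. Cosine similarity is one common choice of similarity measure and is not necessarily the method used in the blogpost mentioned above; the helper `cosine_similarity` is written here for the example and is not an NLTK function.

```python
import math
from nltk.probability import FreqDist

# Two tiny hand-made "corpora" for illustration only
corpus_a = ["the", "river", "is", "wide", "and", "the", "river", "is", "deep"]
corpus_b = ["the", "road", "is", "long", "and", "the", "river", "is", "far", "away"]

fdist_a = FreqDist(corpus_a)
fdist_b = FreqDist(corpus_b)

# Compare relative frequencies rather than raw counts
for word in ["river", "road"]:
    print(word, round(fdist_a.freq(word), 3), round(fdist_b.freq(word), 3))

def cosine_similarity(fd1, fd2):
    """Cosine similarity between two frequency distributions (illustrative helper)."""
    vocab = set(fd1) | set(fd2)
    # Missing tokens count as 0, so only shared tokens contribute to the dot product
    dot = sum(fd1[w] * fd2[w] for w in vocab)
    norm1 = math.sqrt(sum(c * c for c in fd1.values()))
    norm2 = math.sqrt(sum(c * c for c in fd2.values()))
    return dot / (norm1 * norm2)

similarity = cosine_similarity(fdist_a, fdist_b)
print(round(similarity, 3))  # 0.759
```

A value close to 1 means the two token lists use similar words in similar proportions; a value of 0 means they share no tokens at all.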

Key Points

To count tokens, use NLTK’s FreqDist class from the probability module. The N() method then returns how many tokens a text or corpus contains.

Counts for a specific token can be obtained using fdist["token"].

Library Carpentry: Text & Data Mining
