This lesson is still being designed and assembled (Pre-Alpha version)

Library Carpentry: Text & Data Mining: Glossary

Key Points

  • Text mining refers to a range of methods for automatically analysing text.

Jupyter Notebook
  • Jupyter Notebook is a tool to run small pieces of code and create visualisations more easily than via the command line. It is useful for running tutorials and lessons such as this one.

Python Fundamentals
  • Use name = value to assign a value to a variable with a specific name in order to record it in memory

  • Use the print(variable) function to print the value of the variable

  • Create a list by assigning it several values (list_name = ['value1', 'value2', 'value3']) and use a for loop to iterate over each value in the list.
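The points above can be sketched in a few lines of Python (the variable names and values are just for illustration):

```python
# Assign a value to a variable with a specific name
greeting = "Hello, Library Carpentry"

# Print the value of the variable
print(greeting)

# Create a list and use a for loop to iterate over each value
names = ["Alice", "Bob", "Carol"]
for name in names:
    print(name)
```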

Tokenising Text
  • Tokenisation means to split a string into separate words and punctuation, for example to be able to count them.

  • Text can be tokenised using a tokeniser, e.g. the Punkt tokeniser in NLTK.

Pre-processing Data Collections
  • To open and read a file on your computer, the open() and read() functions can be used.

  • To read an entire collection of text files you can use the PlaintextCorpusReader class provided by NLTK and its words() function to extract all the words from the text in the collection.
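The two bullets above can be sketched as follows; the two tiny text files are created on the fly purely for illustration:

```python
from pathlib import Path
from nltk.corpus import PlaintextCorpusReader

# Build a tiny example corpus of two plain-text files
corpus_dir = Path("corpus")
corpus_dir.mkdir(exist_ok=True)
(corpus_dir / "a.txt").write_text("The cat sat on the mat.")
(corpus_dir / "b.txt").write_text("Dogs bark loudly.")

# Read a single file with open() and read()
with open(corpus_dir / "a.txt") as f:
    contents = f.read()
print(contents)

# Read the whole collection: '.*\.txt' matches every .txt file
# in the directory, and words() returns all words and punctuation
corpus = PlaintextCorpusReader(str(corpus_dir), r".*\.txt")
words = corpus.words()
print(len(words), words[:5])
```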

Tokens in Context: Concordance Lists
  • A concordance list is a list of all contexts in which a particular token appears in a corpus or text.

  • A concordance list can be created using the concordance() method of the Text class in NLTK.
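A short sketch of building a concordance with NLTK's Text class (the sentence is invented; concordance() prints the contexts, while concordance_list() returns them as data):

```python
from nltk.text import Text
from nltk.tokenize import WordPunctTokenizer

raw = ("The whale surfaced. The crew watched the whale dive, "
       "and the whale was gone.")
tokens = WordPunctTokenizer().tokenize(raw)

# Text wraps a list of tokens; concordance() prints every context
# in which the token appears (the search is case-insensitive)
text = Text(tokens)
text.concordance("whale")

# concordance_list() returns the same hits as a list we can inspect
hits = text.concordance_list("whale")
print(len(hits))
```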

Searching Text using Regular Expressions
  • To search for tokens in text using regular expressions you need the re module and its search function.

  • You will need to learn how to construct regular expressions. For example, you can use character classes: [ae] matches a or e, [a-z] matches any letter from a to z, and [0-9] matches any single digit. The . wildcard matches any single character, and * repeats the preceding pattern zero or more times. Regular expressions can be very powerful if used correctly. To find all mentions of the words woman or women you need the regular expression wom[ae]n.
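The wom[ae]n example, sketched with Python's built-in re module (the token list is made up; \b marks a word boundary so that e.g. "womanly" is not matched):

```python
import re

tokens = ["woman", "women", "wooden", "womanly", "1984"]

# wom[ae]n matches 'woman' or 'women'; [ae] is a character class
matches = [t for t in tokens if re.search(r"\bwom[ae]n\b", t)]
print(matches)

# [0-9]+ matches one or more digits
years = [t for t in tokens if re.search(r"[0-9]+", t)]
print(years)
```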

Counting Tokens in Text
  • To count tokens, one can make use of NLTK’s FreqDist class from the probability package. The N() method can then be used to count how many tokens a text or corpus contains.

  • Counts for a specific token can be obtained using fdist["token"].
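Both bullets can be sketched with FreqDist on a made-up token list:

```python
from nltk.probability import FreqDist

raw = "the cat and the dog and the bird"
tokens = raw.split()

fdist = FreqDist(tokens)
print(fdist.N())             # total number of tokens
print(fdist["the"])          # count of one specific token
print(fdist.most_common(2))  # the two most frequent tokens
```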

Visualising Frequency Distributions
  • A frequency distribution can be visualised using FreqDist’s plot() method.

  • In this episode you have also learned how to clean data by removing stopwords and other types of tokens from the text.

  • A word cloud can be used to visualise tokens in text and their frequency in a different way.
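A sketch of the cleaning step; the tiny stopword list here is hand-made for illustration (NLTK ships a fuller one in nltk.corpus.stopwords after nltk.download('stopwords')), and the plotting and word-cloud steps need the matplotlib and wordcloud packages, so they are only indicated in comments:

```python
from nltk.probability import FreqDist

tokens = ["the", "whale", "and", "the", "sea", "and", "the", "whale"]

# Hand-made stopword list for illustration only
stopwords = {"the", "and", "a", "of"}
filtered = [t for t in tokens if t not in stopwords]

fdist = FreqDist(filtered)
print(fdist.most_common())

# fdist.plot(10) would draw the 10 most frequent tokens (needs matplotlib);
# the wordcloud package can render the same counts as a word cloud
```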

Lexical Dispersion Plot
  • Lexical dispersion is a visualisation that allows us to see where a particular term appears across a document or set of documents.

  • We used NLTK’s dispersion_plot() function.
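A dispersion plot simply marks the positions (offsets) at which a token occurs; the sketch below computes those offsets directly, on a made-up token list, and only indicates the drawing step in a comment because dispersion_plot() needs matplotlib:

```python
from nltk.text import Text

tokens = ["war", "is", "peace", "freedom", "is", "slavery", "war", "again"]
text = Text(tokens)

# text.dispersion_plot(["war", "is"]) would draw the plot (needs matplotlib).
# The data behind it is just the list of positions where each word occurs:
offsets = [i for i, tok in enumerate(tokens) if tok == "war"]
print(offsets)
```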

Plotting Frequency Over Time
  • Here we extracted the terms and the years from the files using NLTK’s ConditionalFreqDist class from the nltk.probability package.

  • We then plotted these on a graph to visualise how the use changes over time.

  • We used NLTK’s BigramAssocMeasures() and BigramCollocationFinder to find the words commonly found together in this document set.

  • We then scored these collocations using bigram_measures.likelihood_ratio.
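The ConditionalFreqDist step can be sketched as below; the (year, term) pairs are invented for illustration, and the plot itself is only indicated in a comment because cfd.plot() needs matplotlib:

```python
from nltk.probability import ConditionalFreqDist

# (condition, observation) pairs: a term observed in a given year
pairs = [
    ("1950", "machine"), ("1950", "machine"), ("1950", "computer"),
    ("1960", "machine"), ("1960", "computer"), ("1960", "computer"),
]
cfd = ConditionalFreqDist(pairs)

# How often 'computer' occurs under each condition (year)
for year in sorted(cfd.conditions()):
    print(year, cfd[year]["computer"])

# cfd.plot() would draw one frequency line per condition over time
```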
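The collocation step, sketched on a made-up token list:

```python
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

tokens = ("strong tea and strong tea with strong coffee "
          "and weak tea").split()

bigram_measures = BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(tokens)

# Score every bigram with the log-likelihood ratio; a higher score
# means the pair occurs together more often than chance would predict
scored = finder.score_ngrams(bigram_measures.likelihood_ratio)
print(scored[:3])
```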

Part-of-Speech Tagging Text
  • We used NLTK’s part-of-speech tagger, averaged_perceptron_tagger, to label each word with its part of speech, tense, number (plural/singular) and case.

  • We used the text from the US Presidential inaugural speeches, in particular the last speech by Trump.

  • We then extracted all nouns both plural (NNS) and singular (NN).

  • We then visualised the nouns from these speeches using a frequency distribution plot and a word cloud.