This lesson is still being designed and assembled (Pre-Alpha version)

Library Carpentry: Text & Data Mining: Glossary

Key Points

  • Text mining refers to a range of methods for automatically analysing text.

Jupyter Notebook
  • Jupyter Notebook is a tool to run small pieces of code and create visualisations more easily than via the command line. It is useful for running tutorials and lessons such as this one.

Python Fundamentals
  • Use name = value to assign a value to a variable with a specific name in order to record it in memory

  • Use the print(variable) function to print the value of the variable

  • Create a list by assigning it several values (list_name = ['value1', 'value2', 'value3']) and use a for loop to iterate over each value in the list.
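The points above can be sketched in a few lines of Python (the variable names and values are just for illustration):

```python
# Assign a value to a variable with a specific name
greeting = "Hello, Library Carpentry"

# Print the value of the variable
print(greeting)

# Create a list and use a for loop to iterate over each value
names = ["Alice", "Bob", "Carol"]
for name in names:
    print(name)
```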

Tokenising Text
  • Tokenisation means to split a string into separate words and punctuation, for example to be able to count them.

  • Text can be tokenised using a tokeniser, e.g. the Punkt tokeniser in NLTK.

Pre-processing Data Collections
  • To open and read a file on your computer, the open() and read() functions can be used.

  • To read an entire collection of text files you can use the PlaintextCorpusReader class provided by NLTK and its words() function to extract all the words from the text in the collection.
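The two bullets above can be sketched as follows; the two tiny text files are created on the fly purely for illustration:

```python
from pathlib import Path
from nltk.corpus import PlaintextCorpusReader

# Build a tiny example corpus of two plain-text files
corpus_dir = Path("corpus")
corpus_dir.mkdir(exist_ok=True)
(corpus_dir / "a.txt").write_text("The cat sat on the mat.")
(corpus_dir / "b.txt").write_text("Dogs bark loudly.")

# Read a single file with open() and read()
with open(corpus_dir / "a.txt") as f:
    contents = f.read()
print(contents)

# Read the whole collection: '.*\.txt' matches every .txt file
# in the directory, and words() returns all words and punctuation
corpus = PlaintextCorpusReader(str(corpus_dir), r".*\.txt")
words = corpus.words()
print(len(words), words[:5])
```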

Tokens in Context: Concordance Lists
  • A concordance list is a list of all contexts in which a particular token appears in a corpus or text.

  • A concordance list can be created using the concordance() method of the Text class in NLTK.
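A short sketch of building a concordance with NLTK's Text class (the sentence is invented; concordance() prints the contexts, while concordance_list() returns them as data):

```python
from nltk.text import Text
from nltk.tokenize import WordPunctTokenizer

raw = ("The whale surfaced. The crew watched the whale dive, "
       "and the whale was gone.")
tokens = WordPunctTokenizer().tokenize(raw)

# Text wraps a list of tokens; concordance() prints every context
# in which the token appears (the search is case-insensitive)
text = Text(tokens)
text.concordance("whale")

# concordance_list() returns the same hits as a list we can inspect
hits = text.concordance_list("whale")
print(len(hits))
```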

Searching Text using Regular Expressions
  • To search for tokens in text using regular expressions you need the re module and its search function.

  • You will need to learn how to construct regular expressions. For example, you can use character classes: [ae] matches a or e, [a-z] matches any letter from a to z, and [0-9] matches any single digit. The . wildcard matches any single character, and * repeats the preceding pattern zero or more times. Regular expressions can be very powerful if used correctly. To find all mentions of the words woman or women you need the regular expression wom[ae]n.
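The wom[ae]n example, sketched with Python's built-in re module (the token list is made up; \b marks a word boundary so that e.g. "womanly" is not matched):

```python
import re

tokens = ["woman", "women", "wooden", "womanly", "1984"]

# wom[ae]n matches 'woman' or 'women'; [ae] is a character class
matches = [t for t in tokens if re.search(r"\bwom[ae]n\b", t)]
print(matches)

# [0-9]+ matches one or more digits
years = [t for t in tokens if re.search(r"[0-9]+", t)]
print(years)
```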

Counting Tokens in Text
  • To count tokens, one can make use of NLTK’s FreqDist class from the probability package. The N() method can then be used to count how many tokens a text or corpus contains.

  • Counts for a specific token can be obtained using fdist["token"].
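Both bullets can be sketched with FreqDist on a made-up token list:

```python
from nltk.probability import FreqDist

raw = "the cat and the dog and the bird"
tokens = raw.split()

fdist = FreqDist(tokens)
print(fdist.N())             # total number of tokens
print(fdist["the"])          # count of one specific token
print(fdist.most_common(2))  # the two most frequent tokens
```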

Visualising Frequency Distributions
  • A frequency distribution can be visualised using FreqDist’s plot() method.

  • In this episode you have also learned how to clean data by removing stopwords and other types of tokens from the text.

  • A word cloud can be used to visualise tokens in text and their frequency in a different way.
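A sketch of the cleaning step; the tiny stopword list here is hand-made for illustration (NLTK ships a fuller one in nltk.corpus.stopwords after nltk.download('stopwords')), and the plotting and word-cloud steps need the matplotlib and wordcloud packages, so they are only indicated in comments:

```python
from nltk.probability import FreqDist

tokens = ["the", "whale", "and", "the", "sea", "and", "the", "whale"]

# Hand-made stopword list for illustration only
stopwords = {"the", "and", "a", "of"}
filtered = [t for t in tokens if t not in stopwords]

fdist = FreqDist(filtered)
print(fdist.most_common())

# fdist.plot(10) would draw the 10 most frequent tokens (needs matplotlib);
# the wordcloud package can render the same counts as a word cloud
```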

Lexical Dispersion Plot
  • Lexical dispersion is a visualisation that allows us to see where a particular term appears across a document or set of documents.

  • We used NLTK’s dispersion_plot() function.
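A dispersion plot simply marks the positions (offsets) at which a token occurs; the sketch below computes those offsets directly, on a made-up token list, and only indicates the drawing step in a comment because dispersion_plot() needs matplotlib:

```python
from nltk.text import Text

tokens = ["war", "is", "peace", "freedom", "is", "slavery", "war", "again"]
text = Text(tokens)

# text.dispersion_plot(["war", "is"]) would draw the plot (needs matplotlib).
# The data behind it is just the list of positions where each word occurs:
offsets = [i for i, tok in enumerate(tokens) if tok == "war"]
print(offsets)
```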

Plotting Frequency Over Time
  • Here we extracted the terms and the years from the files using NLTK’s ConditionalFreqDist class from the nltk.probability package.

  • We then plotted these on a graph to visualise how the use changes over time.

  • We used NLTK’s BigramAssocMeasures() and BigramCollocationFinder to find the words commonly found together in this document set.

  • We then scored these collocations using bigram_measures.likelihood_ratio.
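The ConditionalFreqDist step can be sketched as below; the (year, term) pairs are invented for illustration, and the plot itself is only indicated in a comment because cfd.plot() needs matplotlib:

```python
from nltk.probability import ConditionalFreqDist

# (condition, observation) pairs: a term observed in a given year
pairs = [
    ("1950", "machine"), ("1950", "machine"), ("1950", "computer"),
    ("1960", "machine"), ("1960", "computer"), ("1960", "computer"),
]
cfd = ConditionalFreqDist(pairs)

# How often 'computer' occurs under each condition (year)
for year in sorted(cfd.conditions()):
    print(year, cfd[year]["computer"])

# cfd.plot() would draw one frequency line per condition over time
```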
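The collocation step, sketched on a made-up token list:

```python
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

tokens = ("strong tea and strong tea with strong coffee "
          "and weak tea").split()

bigram_measures = BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(tokens)

# Score every bigram with the log-likelihood ratio; a higher score
# means the pair occurs together more often than chance would predict
scored = finder.score_ngrams(bigram_measures.likelihood_ratio)
print(scored[:3])
```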

Part-of-Speech Tagging Text
  • We used NLTK’s part-of-speech tagger, averaged_perceptron_tagger, to label each word with its part of speech, tense, number (plural/singular) and case.

  • We used the text from the US Presidential inaugural speeches, in particular the last speech by Trump.

  • We then extracted all nouns both plural (NNS) and singular (NN).

  • We then visualised the nouns from these speeches using a frequency distribution plot and a word cloud.