This lesson is still being designed and assembled (Pre-Alpha version)

Library Carpentry: Text & Data Mining

This Library Carpentry lesson on text and data mining is for beginners and is scheduled to run over two days. It is aimed at people with little or no prior expertise in text mining who want to learn how to manipulate and analyse text in different ways and how to visualise the results.

If you use any of the material in this course for your own purposes, please cite the following reference:

Alex, Beatrice and Llewellyn, Clare. (2020) Library Carpentry: Text & Data Mining. Centre for Data, Culture & Society, University of Edinburgh.


This course is aimed at participants with little or no prior experience in text analysis. You will benefit from knowing how to start a terminal on your computer, how to run commands in the terminal, and how to do basic coding in Python, but we will go over each required step in the course for anyone with no previous knowledge of these things.


Setup Download files required for the lesson
00:00 1. Introduction What is text mining?
00:00 2. Jupyter Notebook What is Jupyter Notebook?
00:00 3. Python Fundamentals How can I create a new variable in Python?
How do I print the value of a variable?
How can I create a list and iterate through it?
00:00 4. Tokenising Text What is tokenisation?
How can a string of raw text be tokenised?
00:00 5. Pre-processing Data Collections How can I load a file and tokenise it?
How can I load a text collection made up of multiple text files and tokenise them?
00:00 6. Tokens in Context: Concordance Lists What is a concordance list?
How can a concordance list be created for a particular search term?
00:00 7. Searching Text using Regular Expressions How can I search for tokens in text more flexibly, for example to find all mentions of woman and women?
00:00 8. Counting Tokens in Text How can I count tokens in text?
00:00 9. Visualising Frequency Distributions How can I draw a frequency distribution of the most frequent words in a collection?
How can I visualise this data as a word cloud?
00:00 10. Lexical Dispersion Plot How can I measure how frequently a word appears across the parts of a corpus?
How can I plot the occurrences of a word against their position, counted in words from the beginning of the corpus?
00:00 11. Plotting Frequency Over Time How can I extract and plot the frequency of specific terms over time?
00:00 12. Collocations How can I see what terms are often used together in a text or corpus?
00:00 13. Part-of-Speech Tagging Text How can I extract words that have a particular part of speech (POS) such as a noun or a verb?
How can I visualise those extracted words?
00:00 Finish

The actual schedule may vary slightly depending on the topics and exercises chosen by the instructor.