Library Carpentry: Text & Data Mining

This Carpentries lesson on text and data mining is for beginners and is scheduled to run over two days. It is aimed at people with little or no prior experience of text mining who want to learn how to manipulate and analyse text in different ways and how to visualise the results.

If you use any of the material in this course for your own purposes, please cite the following reference:

Alex, Beatrice and Llewellyn, Clare. (2020) Library Carpentry: Text & Data Mining. Centre for Data, Culture & Society, University of Edinburgh. http://librarycarpentry.org/lc-tdm/.

Prerequisites

This course is aimed at participants with little or no prior experience in text analysis. You will benefit from knowing how to start a terminal on your computer, how to run commands in it, and how to do basic coding in Python, but we will go over each required step in the course for anyone with no previous knowledge of these things.

Schedule

	Setup	Download files required for the lesson
00:00	1. Introduction	What is text mining?
00:00	2. Jupyter Notebook	What is Jupyter Notebook?
00:00	3. Python Fundamentals	How can I create a new variable in Python? How do I print the value of a variable? How can I create a list and iterate through it?
00:00	4. Tokenising Text	What is tokenisation? How can a string of raw text be tokenised?
00:00	5. Pre-processing Data Collections	How can I load a file and tokenise it? How can I load a text collection made up of multiple text files and tokenise them?
00:00	6. Tokens in Context: Concordance Lists	What is a concordance list? How can a concordance list be created for a particular search term?
00:00	7. Searching Text using Regular Expressions	How can I search for tokens in text more flexibly, for example to find all mentions of woman and women?
00:00	8. Counting Tokens in Text	How can I count tokens in text?
00:00	9. Visualising Frequency Distributions	How can I draw a frequency distribution of the most frequent words in a collection? How can I visualise this data as a word cloud?
00:00	10. Lexical Dispersion Plot	How can I measure how frequently a word appears across the parts of a corpus? How can I plot each occurrence of a word against its position, counted in words from the beginning of the corpus?
00:00	11. Plotting Frequency Over Time	How can I extract and plot the frequency of specific terms over time?
00:00	12. Collocations	How can I see what terms are often used together in a text or corpus?
00:00	13. Part-of-Speech Tagging Text	How can I extract words that have a particular part of speech (POS) such as a noun or a verb? How can I visualise those extracted words?
00:00	Finish

The actual schedule may vary slightly depending on the topics and exercises chosen by the instructor.
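
As a rough taste of the kind of task raised in episodes 4, 7 and 8 of the schedule (tokenising, searching with regular expressions, and counting tokens), here is a minimal sketch that uses only the Python standard library. The sample sentence, the `tokenise` helper and the simple `\w+` tokenisation rule are assumptions made for illustration; the lesson itself may use different tools and a more careful tokeniser.

```python
import re
from collections import Counter


def tokenise(text):
    """Lowercase the text and split it into word tokens (a simplification)."""
    # \w+ matches runs of letters, digits and underscores; punctuation is dropped.
    return re.findall(r"\w+", text.lower())


# Illustrative sample text, not taken from the lesson data.
sample = "The woman spoke. The women listened, and the woman smiled."

tokens = tokenise(sample)

# Count how often each token occurs.
counts = Counter(tokens)

# Keep only the tokens that are exactly "woman" or "women".
matches = [t for t in tokens if re.fullmatch(r"wom[ae]n", t)]

print(tokens)
print(counts.most_common(3))
print(matches)
```

Splitting on `\w+` is a deliberately crude choice: it is enough to show the idea, but it handles contractions and hyphenated words poorly, which is one reason dedicated tokenisers exist.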