The MLT Corpus
Download the MLT Corpus
The Māori Loanword Twitter Corpus (MLT Corpus) is a diachronic corpus of tweets posted between 2008 and 2018 and harvested using 77 "query words" (Māori words of interest). It consists of three key components:
- Raw Corpus: 1.6 million tweets, each containing at least one query word. In some of these tweets the query word is not used in a relevant (New Zealand English, NZE) context.
- Labelled Corpus: 3,685 tweets that were manually labelled as relevant (i.e. the query words they contain are used in relevant contexts).
- Processed Corpus: 1.1 million tweets that were classified as relevant by a machine learning model trained on the Labelled Corpus.
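
The relationship between the three components can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the three query words are a hypothetical stand-in for the 77 real ones, the example tweets are invented, and the toy Naive Bayes classifier is a placeholder, since this page does not say which machine learning model was actually used.

```python
import math
import re
import unicodedata
from collections import Counter

# Hypothetical stand-ins for the 77 query words (assumption, not the real list).
QUERY_WORDS = {"aroha", "kai", "whānau"}


def tokenize(text):
    """Lowercase and NFC-normalise text (so macrons compare equal), then split into words."""
    return re.findall(r"\w+", unicodedata.normalize("NFC", text).lower())


def build_raw_corpus(tweets, query_words=QUERY_WORDS):
    """Raw Corpus step: keep every tweet with at least one query word, relevant or not."""
    return [t for t in tweets if query_words.intersection(tokenize(t))]


def train_naive_bayes(labelled):
    """Labelled Corpus step: fit a toy multinomial Naive Bayes model on (text, is_relevant) pairs.

    Assumes both classes occur at least once in the training data.
    """
    word_counts = {True: Counter(), False: Counter()}
    class_counts = Counter()
    for text, label in labelled:
        class_counts[label] += 1
        word_counts[label].update(tokenize(text))
    vocab = set(word_counts[True]) | set(word_counts[False])
    return word_counts, class_counts, vocab


def is_relevant(text, model):
    """Processed Corpus step: classify one tweet using add-one smoothed log-probabilities."""
    word_counts, class_counts, vocab = model
    total = sum(class_counts.values())
    scores = {}
    for label in (True, False):
        score = math.log(class_counts[label] / total)
        denom = sum(word_counts[label].values()) + len(vocab)
        for tok in tokenize(text):
            if tok in vocab:  # words never seen in training are ignored
                score += math.log((word_counts[label][tok] + 1) / denom)
        scores[label] = score
    return scores[True] > scores[False]


# Invented example data.
tweets = [
    "Much aroha to my whānau tonight",
    "My friend Kai plays football in Berlin",  # query word in an irrelevant context
    "Nothing to see here",                      # no query word at all
    "Time for some kai with the whānau",
]
labelled = [
    ("Sending aroha to the whānau", True),
    ("Lovely kai with whānau tonight", True),
    ("Kai plays football in Munich", False),
    ("My friend Kai moved to Berlin", False),
]

raw_corpus = build_raw_corpus(tweets)
model = train_naive_bayes(labelled)
processed_corpus = [t for t in raw_corpus if is_relevant(t, model)]
```

In this toy run the Raw Corpus keeps the tweet about a person named Kai because it matches a query word, and the classifier then drops it from the Processed Corpus. That mirrors why the Raw Corpus (1.6 million tweets) is larger than the Processed Corpus (1.1 million).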
Below is a description of these components and a flowchart outlining how the Processed Corpus was built.
| Description | Raw Corpus V1 | Raw Corpus V2 | Labelled Corpus | Processed V1 | Processed V2 |