The MLT Corpus


Download the MLT Corpus

The Māori Loanword Twitter Corpus (MLT Corpus) is a diachronic corpus of Tweets posted between 2008 and 2018 and harvested using 77 "query words" (Māori words of interest). It consists of three key components:

  1. Raw Corpus: 1.6 million Tweets containing at least one query word. In some of these Tweets the query word is not used in a relevant (New Zealand English, NZE) context.
  2. Labelled Corpus: 3,685 Tweets that were manually labelled as relevant (i.e. the query words they contain are used in relevant contexts).
  3. Processed Corpus: 1.1 million Tweets that were classified as relevant by a machine learning model which used the Labelled Corpus as training data.
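
The harvesting step behind the Raw Corpus (keeping any Tweet that contains at least one query word) can be sketched as follows. This is a minimal illustration, not the authors' collection code: the three words in `QUERY_WORDS` are hypothetical stand-ins for the 77 real query words, and the function names `matched_query_words` and `harvest` are invented for this example. It also shows why a later relevance filter is needed, since a surface match says nothing about the context in which the word is used.

```python
import re

# Hypothetical stand-ins for the 77 query words; the real list is not given here.
QUERY_WORDS = ["aroha", "whānau", "kai"]

# One case-insensitive pattern with word boundaries, so that "kai"
# does not match inside a longer word such as "Kaiser".
_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(w) for w in QUERY_WORDS) + r")\b",
    re.IGNORECASE,
)


def matched_query_words(tweet):
    """Return the set of query words (lower-cased) found in one Tweet."""
    return {m.group(1).lower() for m in _PATTERN.finditer(tweet)}


def harvest(tweets):
    """Keep only Tweets that contain at least one query word."""
    return [t for t in tweets if matched_query_words(t)]


if __name__ == "__main__":
    sample = [
        "Much aroha to my whānau",
        "Kaiser rolls for lunch",
        "Time for some KAI",
    ]
    for tweet in harvest(sample):
        print(tweet, sorted(matched_query_words(tweet)))
```

Matching on whole words keeps obvious false positives out of the raw set, but a Tweet that passes this filter may still use the word outside an NZE context, which is what the Labelled and Processed components address.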

Below is a description of these components and a flowchart outlining how the Processed Corpus was built.

Key Stats

Description          Raw Corpus V1   Raw Corpus V2   Labelled Corpus   Processed V1   Processed V2
Tokens (words)          28,804,640      70,964,941            49,477     21,810,637     46,827,631
Tweets                   1,628,042       4,559,105             2,495      1,179,390      2,880,211
Tweeters (authors)         604,006       1,839,707             1,866        426,280      1,226,109
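
For readers who want derived figures, the short sketch below computes two simple ratios from the numbers in the table: the average number of tokens per Tweet, and the share of Raw Corpus Tweets retained in the corresponding Processed Corpus. The dictionary layout and the function names `tokens_per_tweet` and `retained_share` are assumptions made for this illustration; only the counts come from the table.

```python
# Counts copied from the Key Stats table; the structure itself is illustrative.
STATS = {
    "Raw Corpus V1": {"tokens": 28_804_640, "tweets": 1_628_042, "tweeters": 604_006},
    "Raw Corpus V2": {"tokens": 70_964_941, "tweets": 4_559_105, "tweeters": 1_839_707},
    "Labelled Corpus": {"tokens": 49_477, "tweets": 2_495, "tweeters": 1_866},
    "Processed V1": {"tokens": 21_810_637, "tweets": 1_179_390, "tweeters": 426_280},
    "Processed V2": {"tokens": 46_827_631, "tweets": 2_880_211, "tweeters": 1_226_109},
}


def tokens_per_tweet(name):
    """Average number of tokens per Tweet in the named component."""
    row = STATS[name]
    return row["tokens"] / row["tweets"]


def retained_share(raw_name, processed_name):
    """Fraction of Raw Corpus Tweets that remain in the Processed Corpus."""
    return STATS[processed_name]["tweets"] / STATS[raw_name]["tweets"]


if __name__ == "__main__":
    for name in STATS:
        print(f"{name}: {tokens_per_tweet(name):.2f} tokens per Tweet")
    for version in ("V1", "V2"):
        share = retained_share(f"Raw Corpus {version}", f"Processed {version}")
        print(f"{version}: {share:.1%} of raw Tweets retained")
```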

Building the MLT Corpus

Process