Māori Twitter Corpus
The Reo Māori Twitter (RMT) Corpus is a collection of 79,018 te reo Māori tweets, designed for linguistic analysis and to help in the development of new Natural Language Processing (NLP) resources for the Māori community. The corpus captures output from 2,302 users, including a mixture of personal and institutional accounts. These users were identified via Prof. Kevin Scannell's Indigenous Tweets website.
Download the RMT Corpus
The tweets and user metadata in the RMT Corpus can be hydrated (downloaded from Twitter) using the code provided. The source code is adapted from Twitter's sample code for API v2 endpoints.
Note: Some tweets in the corpus are no longer publicly available and, as such, cannot be downloaded. Please email David Trye if you would like access to the complete dataset, including additional metadata mentioned in our paper.
The speed at which you can download the corpus depends on the rate limit for your Twitter developer account (e.g. 300 or 900 requests per 15-minute window). If you exceed the allocated limit, a 429 'Too many requests' error will be returned.
- Apply for a Twitter developer account if you do not have one already.
- Ensure that Python 3 is installed on your machine. The code for hydrating the corpus uses
requests==2.24.0, which in turn uses
requests-oauthlib==1.3.0. You can install these packages as follows:
pip install requests pip install requests-oauthlib
Download and extract all files in the rmt-v1 folder. This folder contains a file called
rmt-corpus-v1.csv, which has the tweet IDs and selected metadata, as well as two Python scripts for downloading and formatting the data (namely,
Configure your API bearer token by running the following command in the terminal:
get_tweets_with_bearer_token.pyfrom the terminal.
python get_tweets_with_bearer_token.py > output.json
This will download the corpus in batches of 100 tweets. If you use the default settings, the script will take roughly 45 minutes to run, as it will attempt to download 30,000 tweets (300 requests x 100 tweets) every 15 minutes. The resulting file,
output.json, is only pseudo-JSON (each batch is separated by a line in the form "Batch
Y are numbers).
json_to_tsv.pyto convert the output file to TSV format.
This script will produce a file called
rmt-corpus-final.csv, which you can then open in a spreadsheet application. Tweet text is formatted consistently (special characters are removed, any HTML is decoded, and user mentions and links are standardised). The tweets are also supplemented with metadata from the original
rmt-corpus-v1.csv file. A description of the variables in each of the CSV files is given below.
Data Description: rmt-corpus-v1.csv
The following information (where known) is supplied in
rmt-corpus-v1.csv, even if the tweet is no longer available on Twitter.
|id||Twitter's unique identifier for the tweet. You can search for a tweet online by visiting twitter.com/user/status/XXX, where
|num_maori_words||The number of Māori words in the tweet.|
|total_words||The total number of words in the tweet.|
|percent_maori||The percentage of Māori text detected in the tweet (=
|favourites||The number of favourites (likes, retweets & quotes) that the given tweet received.|
|reply_count||The number of replies that the given tweet received.|
|user.alias||An alias for the author of the tweet in the form T
|user.status||The account status (as of Decemeber 2020) of the user who wrote the tweet: 'active', 'protected', 'suspended' or 'not found'.|
|user.followers||The user's number of followers (as of December 2020).|
|user.friends||The number of accounts that the user follows (as of December 2020).|
|user.num_tweets||The total number of tweets in the corpus that were written by this user.|
|user.region||The user's location, based on self-reported information. Where possible, the data has been aggregated into New Zealand regions and names of foreign countries.|
|year||The year the tweet was written (between 2007 and 2020).|
Data Description: rmt-corpus-final.csv
If the tweet is still publicly available on Twitter, the following variables will appear alongside those mentioned above. Missing values are indicated with 'None':
|text||The tweet content, with consistent formatting applied (special characters stripped, user mentions and links standardised).|
|conversation_id||The ID for the conversation that the tweet is part of.|
|in_reply_to_user_id||If the tweet is written in reply to another, this is the ID of the user who who wrote the original tweet.|
|author_id||Twitter's unique identifier for the user who wrote the tweet.|
|created_at||The timestamp when the tweet was posted, in the format
|lang||The two-letter code representing the language that the tweet was (erroneously) classified as (NOT Māori, as the API does not support te reo).|
|source||The device or third-party application from which the tweet was posted (e.g. 'Twitter Web Client', 'Twitter for iPhone').|
|error||The reason why the tweet could not be downloaded, if there was an error ('Authorization Error', 'Not Found Error', 'None').|
- Code for cleaning and analysing the RMT Corpus is available on the project GitHub repository.
- You can download a wordlist with frequencies for all words and hashtags in the corpus.
Citing the RMT Corpus
If you use the RMT corpus, please cite the following paper:
- 'Building a Corpus of Māori Language Tweets' by Trye et al. (full reference coming soon!).
We graciously acknowledge the generous support of:
- Ngā Pae o te Māramatanga
- The University of Waikato
The information on this page was last checked in April 2021. Please let us know if you notice any errors in the code and/or instructions. As of 12 April 2021, 72,575 tweets (91.85% of the RMT Corpus) could be successfully downloaded from Twitter.