Author Profiling for Abuse Detection (COLING 2018)
Code for paper “Author Profiling for Abuse Detection”, in Proceedings of the 27th International Conference on Computational Linguistics (COLING) 2018
If you use this code, please cite our paper:
@inproceedings{mishra-etal-2018-author,
title = "Author Profiling for Abuse Detection",
author = "Mishra, Pushkar and
Del Tredici, Marco and
Yannakoudakis, Helen and
Shutova, Ekaterina",
booktitle = "Proceedings of the 27th International Conference on Computational Linguistics",
month = aug,
year = "2018",
address = "Santa Fe, New Mexico, USA",
publisher = "Association for Computational Linguistics",
url = "https://www.aclweb.org/anthology/C18-1093",
pages = "1088--1098",
}
Python 3.5+ is required to run the code. Dependencies can be installed with pip install -r requirements.txt,
followed by python -m nltk.downloader punkt
The dataset for the code is provided in the TwitterData/twitter_data_waseem_hovy.csv file as a list of [tweet ID, annotation] pairs.
To run the code, first use the Twitter API (twitter_access.py employs Tweepy) to retrieve the tweets for the given tweet IDs, then replace the dataset file with a
file of the same name containing a list of [tweet ID, tweet, annotation] triples.
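The dataset replacement step can be sketched as below. The helper fetch_text is an assumption for illustration, standing in for whatever retrieval logic twitter_access.py provides (e.g. a thin wrapper around Tweepy); it is not part of the repository's code.

```python
import csv

def hydrate_dataset(pairs_path, triples_path, fetch_text):
    """Turn a [tweet ID, annotation] pairs file into a
    [tweet ID, tweet, annotation] triples file.

    fetch_text is any callable mapping a tweet ID to its text;
    it should return None for tweets that can no longer be
    retrieved (deleted tweets, suspended accounts), which are
    then skipped.
    """
    with open(pairs_path, newline="", encoding="utf-8") as fin, \
         open(triples_path, "w", newline="", encoding="utf-8") as fout:
        writer = csv.writer(fout)
        for tweet_id, annotation in csv.reader(fin):
            text = fetch_text(tweet_id)
            if text is not None:  # drop unretrievable tweets
                writer.writerow([tweet_id, text, annotation])
```

In practice the fetcher would batch IDs and respect Twitter's rate limits; the sketch only shows the file rewriting.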
Additionally, twitter_access.py contains functions to retrieve follower-following relationships amongst the authors of the tweets (specified in resources/authors.txt). Once the relationships have been retrieved, please use Node2vec (see resources/node2vec) to produce embeddings for each of the authors and store them in a file named authors.emb in the resources directory.
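A minimal sketch of reading authors.emb back in, assuming the word2vec-style text format that the reference node2vec implementation emits (a "count dimension" header line, then one "author_id v1 ... vd" line per node):

```python
def load_author_embeddings(path):
    """Parse a node2vec .emb file into {author_id: [float, ...]}.

    Assumes the word2vec text layout: a header "count dim" line
    followed by one whitespace-separated line per author.
    """
    embeddings = {}
    with open(path, encoding="utf-8") as f:
        count, dim = map(int, f.readline().split())  # header line
        for line in f:
            parts = line.split()
            embeddings[parts[0]] = [float(x) for x in parts[1:]]
    # sanity-check against the header
    assert len(embeddings) == count
    assert all(len(v) == dim for v in embeddings.values())
    return embeddings
```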
To run the best method (LR + AUTH):
python twitter_model.py -c 16202 -m lna
To run the other methods:
python twitter_model.py -c 16202 -m a
python twitter_model.py -c 16202 -m ln
python twitter_model.py -c 16202 -m ws
python twitter_model.py -c 16202 -m hs
python twitter_model.py -c 16202 -m wsa
python twitter_model.py -c 16202 -m hsa
For the HS- and WS-based methods, adding the -ft
flag to the command ensures that the pre-trained deep neural models from the Models directory
are not used and that all training instead happens from scratch. This requires that the file of pre-trained GloVe embeddings be downloaded from
http://nlp.stanford.edu/data/glove.twitter.27B.zip, unzipped, and placed in the resources directory before execution.
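Reading the unzipped GloVe file and building an embedding matrix for a vocabulary can be sketched as below; the headerless "token v1 ... vd" line layout is GloVe's published format, while the function names, the out-of-vocabulary random initialisation, and its range are illustrative assumptions, not the repository's exact code.

```python
import random

def load_glove(path):
    """Parse GloVe's plain-text format: each line is a token
    followed by its space-separated vector components; no header."""
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors

def embedding_matrix(vocab, vectors, dim, seed=0):
    """Row i holds the vector for vocab[i]; out-of-vocabulary
    tokens get small random vectors (an assumed initialisation)
    so that training from scratch can adjust them."""
    rng = random.Random(seed)
    matrix = []
    for tok in vocab:
        if tok in vectors:
            matrix.append(vectors[tok])
        else:
            matrix.append([rng.uniform(-0.25, 0.25) for _ in range(dim)])
    return matrix
```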
An overview of the complete training-testing flow is as follows:
In the 10-fold cross-validation, steps 3-7 are run 10 times (each time with a different set of tweets as the test set), and the final precision, recall and
F1 are calculated by averaging the results across the 10 runs.