Corpus of German misogynistic hatespeech posts (GMHP7k)

A German Corpus on misogynistic hatespeech posts from Twitter

On this page we provide the data set for the corpus of German misogynistic hatespeech posts (GMHP7k), which was first presented at the 18th International AAAI Conference on Web and Social Media (ICWSM 2024) along with a dataset paper.

Description

We provide a German corpus consisting of 7,061 posts authored by users of social media platforms. A group of volunteers annotated each post in a binary fashion for two criteria: hatespeech and misogynistic (misogynous) hatespeech. The inter-rater reliability over all annotators according to Fleiss’ Kappa is 0.6409 for hatespeech and 0.8258 for misogynistic hatespeech. Furthermore, baseline measurements for machine-learning-based text classification with BERT are presented. Initial experiments with the corpus achieve macro-average F1-scores of up to 0.79 for hatespeech and 0.75 for misogynistic hatespeech.
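For readers unfamiliar with the metric, the macro-average F1-score mentioned above is the unweighted mean of the per-class F1-scores, so the rare positive class counts as much as the frequent negative class. The following is a minimal sketch in plain Python, not the evaluation code used for the paper; the function names and the toy labels are assumptions made up for illustration.

```python
def f1_for_class(y_true, y_pred, cls):
    """F1-score of one class, treating `cls` as the positive label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == cls and p == cls)
    fp = sum(1 for t, p in pairs if t != cls and p == cls)
    fn = sum(1 for t, p in pairs if t == cls and p != cls)
    denom = 2 * tp + fp + fn
    # Convention chosen here: F1 is 0.0 if the class never occurs at all.
    return 2 * tp / denom if denom else 0.0


def macro_f1(y_true, y_pred, classes=(0, 1)):
    """Unweighted mean of the per-class F1-scores."""
    return sum(f1_for_class(y_true, y_pred, c) for c in classes) / len(classes)


# Toy labels, invented for illustration only (not taken from the corpus).
y_true = [1, 0, 0, 0, 1, 0, 0, 0]
y_pred = [1, 0, 0, 1, 0, 0, 0, 0]

score = macro_f1(y_true, y_pred)
print(f"macro F1 = {score:.4f}")
```

In this toy case the positive class reaches an F1 of 0.5 and the negative class about 0.83, which illustrates why a macro average is more demanding than plain accuracy on imbalanced labels.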

Classes to annotate

During annotation, volunteers rated two aspects of each post: the presence of hatespeech and the presence of misogynistic hatespeech. Whether a post counts as hatespeech depends on how the annotators perceive the text, and each post is rated as either hatespeech or not hatespeech. Likewise, each post is rated as either misogynistic hatespeech or not misogynistic hatespeech.

Data Description

| Column | Name | Description |
|---|---|---|
| tweet_id | Tweet ID | Source ID from Twitter |
| review_text | Text of the tweet or comment | User mentions were replaced by @TwitterUser |
| hs | Hatespeech annotation | Binary (1 or 0) |
| m_hs | Misogynistic hatespeech annotation | Binary (1 or 0) |
| annotation_id | ID of annotation | Tweets of phase 2 were annotated by all experts |
| created_at | Created timestamp of annotation | |
| updated_at | Updated timestamp of annotation | |
| lead_time | Elapsed time of annotation | |
| phase | Phase | 1, 2.1, 2.2, 2.3 or 3 |
| annotator_name | Annotator name | Pseudonymized identity of the annotators as consecutive numbers |
| source | Source of text | Source dataset of the text |
| split_hs | Data split for hatespeech | “train”, “test”, or “val” |
| split_m_hs | Data split for misogynistic hatespeech | “train”, “test”, or “val” |
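The columns listed above can be read with Python's standard `csv` module. The sketch below is an illustration only: it assumes the data is available as a comma-separated file with a header row, and the file name `gmhp7k.csv` as well as the sample rows are hypothetical placeholders, not content of the corpus.

```python
import csv
import io

# Placeholder rows invented for illustration. They only mirror some of the
# column names from the table above and are NOT taken from the corpus.
SAMPLE_CSV = """tweet_id,review_text,hs,m_hs,phase,annotator_name,split_hs,split_m_hs
1,placeholder text a,0,0,3,1,train,train
2,placeholder text b,1,0,3,2,train,test
3,placeholder text c,1,1,3,1,test,train
4,placeholder text d,0,0,3,3,val,val
"""


def load_rows(handle):
    """Read annotation rows and convert the binary labels to integers."""
    rows = []
    for row in csv.DictReader(handle):
        row["hs"] = int(row["hs"])
        row["m_hs"] = int(row["m_hs"])
        rows.append(row)
    return rows


def select_split(rows, split_column, split_name):
    """Return the rows whose split column ("split_hs" or "split_m_hs") matches."""
    return [r for r in rows if r[split_column] == split_name]


rows = load_rows(io.StringIO(SAMPLE_CSV))
# With a real file one would instead open it, for example:
#     with open("gmhp7k.csv", newline="", encoding="utf-8") as f:  # hypothetical name
#         rows = load_rows(f)

train_hs = select_split(rows, "split_hs", "train")
train_m_hs = select_split(rows, "split_m_hs", "train")
print(len(train_hs), "training rows for hs;", len(train_m_hs), "for m_hs")
```

Because `split_hs` and `split_m_hs` are separate columns, a post may fall into different splits for the two tasks, so each task should be filtered by its own split column.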

Statistics

In order to achieve a high quality of annotation, two preliminary training phases were carried out, whereby the volunteers evaluated 46, 43 and 46 posts in each phase. After each phase, the inter-rater reliability was measured with Fleiss’ Kappa to assess the quality of the annotation. The resulting kappa values are shown in Figure 1, with the values for hatespeech on the left and those for misogynistic hatespeech on the right. In order to determine the impact of each volunteer on the kappa value, further kappa values were calculated for all combinations of n-1 volunteers.
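The procedure described above, Fleiss' Kappa over all volunteers plus additional kappa values for every group of n-1 volunteers, can be sketched as follows. This is a plain-Python illustration of the standard Fleiss' Kappa formula, not the authors' analysis script; the function names and the toy ratings are assumptions made up for this example.

```python
from itertools import combinations


def fleiss_kappa(ratings, categories=(0, 1)):
    """Fleiss' Kappa for `ratings`: one list per post, one label per rater.

    Assumes every post was rated by the same number of raters and that
    more than one category is actually used (otherwise 1 - p_e is zero).
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    # counts[i][j] = number of raters who assigned category j to post i
    counts = [[item.count(c) for c in categories] for item in ratings]
    # Observed agreement per post, averaged over all posts
    p_items = [
        (sum(n * n for n in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_bar = sum(p_items) / n_items
    # Chance agreement from the overall category proportions
    p_cat = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(len(categories))
    ]
    p_e = sum(p * p for p in p_cat)
    return (p_bar - p_e) / (1 - p_e)


def leave_one_out_kappas(ratings):
    """Kappa for every combination of n-1 raters, keyed by the omitted rater."""
    n_raters = len(ratings[0])
    result = {}
    for kept in combinations(range(n_raters), n_raters - 1):
        left_out = (set(range(n_raters)) - set(kept)).pop()
        subset = [[item[r] for r in kept] for item in ratings]
        result[left_out] = fleiss_kappa(subset)
    return result


# Toy ratings invented for illustration: 5 posts, 4 raters, binary labels.
# The rater at index 2 disagrees with the others on two posts.
ratings = [
    [1, 1, 1, 1],
    [0, 0, 0, 0],
    [1, 1, 0, 1],
    [0, 0, 0, 0],
    [0, 0, 1, 0],
]

kappa_all = fleiss_kappa(ratings)
kappa_without = leave_one_out_kappas(ratings)
print(f"all raters: {kappa_all:.4f}")
for rater, value in sorted(kappa_without.items()):
    print(f"without rater {rater}: {value:.4f}")
```

A rater whose omission raises the kappa value markedly, as the rater at index 2 does in this toy data, is the one contributing most of the disagreement.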

Fig.1 - Inter-rater reliability (left to right: hatespeech and misogynistic hatespeech)
Fig.2 - Word clouds (left to right: neutral, hatespeech and misogynistic hatespeech)

After completion of the training phases, a further 7,061 posts were annotated, which form the core of the corpus. Their quality can be considered assured due to the solid inter-rater reliability of the training phases. Table 1 shows the share of the 7,061 posts assigned to each class. The distribution of hatespeech reveals that 22.29 % of the posts were annotated as hatespeech. The table also shows the distribution of the second criterion, misogynistic hatespeech, with 6.51 % of all posts being rated as misogynistic hatespeech. Consequently, 29.22 % of hatespeech posts are also misogynistic.

Tab.1 - Number of posts per class in 7,061 posts

| Class | Posts | Percent |
|---|---|---|
| Not hatespeech | 5,487 | 77.71 % |
| Hatespeech | 1,574 | 22.29 % |
| Not misogynistic hatespeech | 6,601 | 93.49 % |
| Misogynistic hatespeech | 460 | 6.51 % |
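The percentages in Table 1 and the 29.22 % figure in the text can be re-derived from the post counts. The short check below is only a worked calculation; the variable names are made up, and the last value assumes, as the text implies, that every post rated as misogynistic hatespeech is also rated as hatespeech.

```python
# Post counts as reported in Table 1
total = 7061
not_hs, hs = 5487, 1574
not_m_hs, m_hs = 6601, 460


def pct(part, whole):
    """Share of `part` in `whole` as a percentage, rounded to two decimals."""
    return round(100 * part / whole, 2)


shares = {
    "not hatespeech": pct(not_hs, total),
    "hatespeech": pct(hs, total),
    "not misogynistic hatespeech": pct(not_m_hs, total),
    "misogynistic hatespeech": pct(m_hs, total),
    # Assumes all misogynistic posts are a subset of the hatespeech posts.
    "misogynistic share of hatespeech": pct(m_hs, hs),
}

for label, value in shares.items():
    print(f"{label}: {value} %")
```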

License

The corpus is provided under the terms of the Creative Commons Attribution 4.0 International (CC BY 4.0) License. By using the corpus you agree to this license.


Citation

The corpus was first presented at ICWSM 2024. You can find a copy of the paper here.

Jonas Glasebach, Max-Emanuel Keller, Alexander Döschl, Peter Mandl
GMHP7k: A Corpus of German Misogynistic Hatespeech Posts
Proceedings of the Eighteenth International AAAI Conference on Web and Social Media
Buffalo, NY, USA, June 6–9, 2024

If you are using the corpus, please cite the following publication. Reference in BibTeX format:

@article{Glasebach.2024, 
	author={Glasebach, Jonas and Keller, Max-Emanuel and Döschl, Alexander and Mandl, Peter},
	title={GMHP7k: A Corpus of German Misogynistic Hatespeech Posts}, 
	url={https://ojs.aaai.org/index.php/ICWSM/article/view/31438}, 
	DOI={10.1609/icwsm.v18i1.31438}, 
	journal={Proceedings of the International AAAI Conference on Web and Social Media},
	volume={18}, 	
	year={2024}, 
	number={1}, 
	month={May}, 
	pages={1946-1957} 
}

How to use the data set?

The repository for this page provides the data set of the corpus along with the statistics and instructions for use.

About

The presented corpus was developed during a project of the Competence Center Wirtschaftsinformatik (CCWI) at the Munich University of Applied Sciences.

Acknowledgement

Our special thanks go to the experts who contributed to the annotation of the corpus. The presented work was conducted as part of a project funded by the Forschungs- und Entwicklungsprogramm Informations- und Kommunikationstechnik des Freistaates Bayern. Funding reference number: DIK-2104-0033// DIK0278/01, DIK0278/02, DIK0278/03.

The methodology of this work was inspired by the great work of Schabus et al., who created the One Million Posts Corpus together with the Austrian newspaper Der Standard from user comments posted under online articles on the newspaper's website.