On this page we provide the data set for the corpus on German topic classification and success (GTCS6k), which was first presented at the 25th IEEE FRUCT conference (FRUCT'25) along with a work-in-progress paper. Details can be found in the section Citation below.
The corpus consists of 6,000 annotated Facebook posts from the food delivery services sector in Germany. The posts are in German and come from the pages of six of the industry's most important brands: Call a Pizza, Deliveroo, Domino's, Lieferando, Mundfein and Smiley's. The annotation was conducted by the agency ALTHALLER communication GbR, which advises companies in corporate communications. Five experts of the agency, who regularly produce social media posts on behalf of clients, carried out the annotation over a period of two months (May to June 2019). The experts annotated each post according to topic and success.
During the annotation, the experts evaluated two aspects of a post: its success and its topic. Success depends on how users perceive the post and is rated as either successful or not successful. The topic, on the other hand, describes the content of the post. Since a post can cover several topics at the same time, the experts could choose one or more topics that best describe its content. Eleven thematic classes, developed by the experts, were available for selection; they are listed in Table 1.
To achieve a high annotation quality, two training phases were carried out, in each of which the experts evaluated 50 posts. After each phase, the inter-rater reliability was measured with Fleiss' kappa to assess the quality of the annotation. The resulting kappa values are shown in Figure 1, with the values of the first phase on the left and those of the second phase on the right. To determine the impact of each expert on the kappa value, further kappa values were calculated for all combinations of n-1 experts.
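The reliability check described above can be sketched as follows. This is a minimal illustration of Fleiss' kappa and of the n-1 (leave-one-rater-out) variant, not the code used for the paper. The function names, the toy ratings and the use of a single label per rater and post are assumptions; how the multi-label topic annotation was fed into the kappa calculation is not stated on this page.

```python
from itertools import combinations


def fleiss_kappa(ratings, categories):
    """Fleiss' kappa for `ratings`: one row per post, one label per rater."""
    n_items = len(ratings)
    n_raters = len(ratings[0])
    # counts[i][j]: how many raters assigned post i to category j
    counts = [[row.count(c) for c in categories] for row in ratings]
    # Observed agreement per post, averaged over all posts
    p_items = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_bar = sum(p_items) / n_items
    # Agreement expected by chance, from the overall category proportions
    p_cat = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(len(categories))
    ]
    p_exp = sum(p * p for p in p_cat)
    if p_exp == 1.0:
        # Only one category was ever used; kappa is undefined in this case
        return float("nan")
    return (p_bar - p_exp) / (1.0 - p_exp)


def leave_one_out_kappas(ratings, categories):
    """Kappa for every combination of n-1 raters, keyed by the omitted rater."""
    n_raters = len(ratings[0])
    result = {}
    for kept in combinations(range(n_raters), n_raters - 1):
        left_out = (set(range(n_raters)) - set(kept)).pop()
        subset = [[row[r] for r in kept] for row in ratings]
        result[left_out] = fleiss_kappa(subset, categories)
    return result


# Hypothetical toy data: 4 posts, 5 raters, binary success label
labels = ["successful", "not successful"]
s, n = labels
demo = [
    [s, s, s, s, n],
    [n, n, n, n, n],
    [s, s, s, s, n],
    [n, n, n, n, s],
]

full_kappa = fleiss_kappa(demo, labels)
loo = leave_one_out_kappas(demo, labels)
```

In this toy example the fifth rater is the only one who deviates from the others, so omitting that rater yields perfect agreement, while omitting any other rater leaves the kappa well below 1. Comparing the n-1 values in this way shows which expert lowers the overall agreement.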
After completion of the training phases, a further 6,000 posts were annotated, which form the core of the corpus. Their quality can be considered assured due to the solid inter-rater reliability of the training phases. Table 1 shows the share of the 6,000 posts assigned to each of the thematic categories. The distribution of the second criterion, success, is shown in Table 2.
Topic | Posts | Percent |
---|---|---|
1. Product/Service | 316 | 5.22 % |
2. Event/Fair | 368 | 6.07 % |
3. Interactions | 2370 | 39.12 % |
4. News | 547 | 9.03 % |
5. Entertainment | 978 | 16.14 % |
6. Knowledge | 390 | 6.44 % |
7. Recruiting/HR | 65 | 1.07 % |
8. Corporate Social Responsibility (CSR) | 40 | 0.66 % |
9. Advertising/Campaign | 4098 | 67.63 % |
10. Sponsoring | 322 | 5.31 % |
11. Other | 541 | 8.93 % |

Success | Posts | Percent |
---|---|---|
Not successful | 4578 | 76.3 % |
Successful | 1422 | 23.7 % |
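
Because a post can carry several topics, the topic counts in Table 1 add up to 10,035 labels, which is more than the 6,000 posts, and the topic percentages therefore sum to more than 100 %. The sketch below shows how such multi-label shares and the success share could be computed. The record layout and the field names `topics` and `successful` are hypothetical and do not describe the actual file format of the corpus.

```python
from collections import Counter

# Hypothetical toy records; field names are assumptions, not the corpus schema
posts = [
    {"topics": {"Advertising/Campaign", "Interactions"}, "successful": False},
    {"topics": {"Advertising/Campaign"}, "successful": True},
    {"topics": {"Entertainment", "Interactions"}, "successful": False},
    {"topics": {"News"}, "successful": False},
]


def topic_shares(posts):
    """Map each topic to (number of posts, percent of all posts)."""
    counts = Counter(topic for post in posts for topic in post["topics"])
    return {t: (c, 100.0 * c / len(posts)) for t, c in counts.items()}


def success_share(posts):
    """Percent of posts rated as successful."""
    return 100.0 * sum(post["successful"] for post in posts) / len(posts)


shares = topic_shares(posts)
successful_percent = success_share(posts)
```

Each post is counted once for every topic it carries, so in the toy data the four posts produce six topic labels and the topic percentages sum to 150 %.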
The corpus is provided under the terms of the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0) License. By using the corpus you agree to this license.
The corpus was first presented at FRUCT'25.
Max-Emanuel Keller, Johannes Forster, Peter Mandl, Frederic Aich, Jacqueline Althaller
A German Corpus on Topic Classification and Success of Social Media Posts
Proceedings of the 25th Conference of Open Innovations Association FRUCT
Helsinki, Finland, November 2019
If you are using the corpus, please cite the following publication. You can find a copy of the paper here. Reference in BibTeX format:
@inproceedings{Keller.2019,
author = {Keller, Max-Emanuel and Forster, Johannes and Mandl, Peter and Aich, Frederic and Althaller, Jacqueline},
title = {A German Corpus on Topic Classification and Success of Social Media Posts},
booktitle = {Proceedings of the 25th Conference of Open Innovations Association FRUCT},
series = {FRUCT'25},
year = {2019},
location = {Helsinki, Finland},
publisher = {FRUCT Oy},
address = {Helsinki, Finland},
}
The repository for this page provides the data set of the corpus along with the experiments and instructions for use.
The presented corpus was developed during a joint project between the Competence Center Wirtschaftsinformatik (CCWI) at the Munich University of Applied Sciences and the agency ALTHALLER communication GbR.
Our special thanks go to all the experts of ALTHALLER communication GbR who contributed to the annotation of the corpus. The presented work was conducted as part of a project funded by the Forschungs- und Entwicklungsprogramm Informations- und Kommunikationstechnik des Freistaates Bayern. Funding reference number: IUK482/002.
The methodology of this work was inspired by the great work of Schabus et al., who created the One Million Posts Corpus together with the Austrian newspaper Der Standard from user comments posted under online articles on the newspaper's website.