Abstract: Social media is a rich source of up-to-date information about events such as incidents. The sheer volume of available information makes machine learning approaches necessary for processing it further. This learning problem often involves regionally restricted datasets, such as data from only one city. Because social media data such as tweets varies considerably across cities, training effective models requires labeling data from each city of interest, which is costly and time-consuming. To avoid such an expensive labeling procedure, a generalizable model can be trained on data from one city and then applied to data from other cities. In this paper, we present Semantic Abstraction to improve the generalization of tweet classification. In particular, we derive features from Linked Open Data and include location and temporal mentions. A comprehensive evaluation on twenty datasets from ten different cities shows that Semantic Abstraction is indeed a valuable means of improving generalization. We show that this holds not only for a two-class problem, where incident-related tweets are separated from unrelated ones, but also for a four-class problem, where three different incident types and a neutral class are distinguished. To gain a thorough understanding of the generalization problem itself, we closely examine rule-based models from our evaluation. We conclude that, first, the quality of the model strongly depends on the class distribution and, second, the rules learned on cities with an equal class distribution are in most cases much more intuitive than those induced from skewed distributions. We also find that most of the learned rules rely on the novel semantically abstracted features.
Keywords: Tweets, classification, Linked Open Data, Semantic Abstraction, incident detection