Excerpt from Natural Language Processing for Social Media

By Atefeh ‘Anna’ Farzindar and Diana Inkpen

February 6, 2018

This is an excerpt from Atefeh Farzindar and Diana Inkpen’s book Natural Language Processing for Social Media, Second Edition. 2017 Morgan & Claypool Publishers.

Discount for UXmatters Readers—Buy Natural Language Processing for Social Media from Morgan & Claypool Publishers, using the discount code uxreader, and save 20% off the retail price.

Chapter 3: Semantic Analysis of Social Media Texts

In this chapter, we discuss current NLP methods for social media applications that aim at extracting useful information from social media data. Examples of such applications are geo-location detection, opinion mining, emotion analysis, event and topic detection, summarization, and machine translation. We survey the current techniques, and we briefly define the evaluation measures used for each application, followed by examples of results.

Geo-location Detection

One of the important topics in semantic analysis in social media is the identification of geo-location information for social content such as blog posts or tweets. By geo-location we mean a real location in the world, such as a region, a city, or a point described by longitude and latitude. Automatic detection of event location for individuals or groups of individuals with common interests is important for marketing purposes and also for detecting potential threats to public safety.

Geo-location information could be readily available from the user profiles registered on the social network service; however, for several reasons, including privacy, not all users provide correct and precise information about their location. Therefore, other techniques such as inferring the location from the communication network infrastructure or from the text content need to be used in addition to the geo-location information available in some messages. It makes sense to combine multiple sources of evidence when they are available—geo-tags such as longitude, latitude, location names or other measures; the results from geo-location detection based on the content; and information from network infrastructure.

Mapping Social Media Information on Maps

Mapping Twitter conversations over time has become a popular way of visualizing conversations around events on the social media platform. A straightforward approach for visualizing tweets on a map is using GPS location information from the geo-tagged tweets. This information, however, is present only in around 1–5% of tweets, which makes for not-so-interesting visualizations of smaller events or events in smaller countries where there are fewer people to tweet.

Heravi and Salawdeh (2015) presented a method of mapping Twitter conversations on maps to visualize conversations around events on Twitter. The paper presented a tweet location detection system (Twiloc), which uses various features in a tweet for predicting the most likely location for the tweet. The researchers used this as a part of their journalism work for mapping the Twitter conversations around the Ireland-Scotland Euro Qualifiers game. With Twiloc, 70% of the tweets in this dataset were geo-referenced, as opposed to the 4.5% that users had originally geo-tagged. They further used Twiloc for geo-tagging the Dublin Marathon tweets and compared their results with the results they got from using CartoDB Tweet Maps for the same event. Twiloc geo-referenced slightly more tweets than CartoDB: 66.5% versus 60%. Overall, Twiloc shows promising results for location detection and geo-tagging tweets on the datasets presented in the paper. However, further testing and evaluation are required to determine the quality of the detected locations.

Readily Available Geo-location Information

Information is becoming increasingly geographic as it becomes easier to geo-tag all forms of data, and many devices have embedded GPS (Backstrom et al., 2010). Hecht et al. (2011) showed that, left to their own devices, the vast majority of users—64% of the geo-located tweets—prefer to provide their locational information at the city level, with state level coming in as second choice. Stefanidis et al. (2013) reported that approximately 16% of the Twitter feeds they have collected had detailed location information—that is, coordinates—while another 45% had locational information at coarser granularity—for example, city level. Cheng et al. (2010) reported that 5% of users in their study listed locational information at the level of coordinates, with another 21% of users listing locational information at the city level. This has directed the research community to focus on the techniques discussed later as alternatives to improve the identification of event and user locations from social networks.

Geo-location Based on Network Infrastructure

The geo-location information can be deduced from the network infrastructure. Poese et al. (2011) and Eriksson et al. (2010) proposed using IP addresses. They use geo-location databases for linking IP addresses to locations. There are several databases that can be used for mapping between IP blocks and a geographic location. They are usually accurate at the country level, but they are a lot less accurate at the city level. Poese et al. (2011) showed that these databases are not very reliable for the following reasons. First, the vast majority of entries in the databases refer only to a few popular countries such as the U.S. This creates an imbalance in the representation of countries across the IP blocks of the databases. Second, the entries do not always reflect the original allocation of IP blocks. Eriksson et al. (2010) used a Naïve Bayes classifier to achieve a better accuracy for location prediction based on IP mappings from several sources.

Geo-location Based on the Social Network Structure

Another approach to geo-locating users of online social networks can be based solely on their lists of friends (“you are where your friends are”) or follower-followee relations. Users tend to interact more regularly with other users close to themselves and, in many cases, a person’s social network is sufficient to reveal their location (Rout et al., 2013). Backstrom et al. (2010) were the first to create a model for the distribution of the distances between pairs of friends; then they used this distribution to find the most likely location for a given user. The disadvantage of the approach is that it assumes that all users have the same distribution of friends in terms of distance, and it does not account for the density of the population in each area. Rout et al. (2013) showed that using the density of the population leads to more accurate user location detection. They also parsed the location field from the users’ Twitter profile as an additional source of information. An analysis of how users use the location field was presented by Hecht et al. (2011).
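The idea of choosing the location that best explains the distances to a user's friends can be sketched as follows. This is a simplified illustration, not the model of Backstrom et al. (2010): the power-law form and the parameters `a`, `b`, and `c` in `friend_prob` are assumptions made for the example, the candidate locations are restricted to the friends' own locations, and the coordinates are invented.

```python
import math

def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))

def friend_prob(distance_km, a=0.2, b=1.0, c=1.0):
    """Assumed decay: the chance of friendship falls as distance grows.
    The parameter values are hypothetical, not fitted to any data."""
    return a * (b + distance_km) ** -c

def most_likely_location(friend_locations):
    """Return the candidate with the highest log-likelihood of the
    observed friend distances. Candidates are the friends' own locations."""
    best, best_score = None, -math.inf
    for candidate in friend_locations:
        score = sum(math.log(friend_prob(haversine_km(candidate, f)))
                    for f in friend_locations)
        if score > best_score:
            best, best_score = candidate, score
    return best

# Invented example: two friends near Ottawa, one in Montreal, one in Vancouver.
friends = [(45.42, -75.70), (45.50, -73.57), (45.40, -75.69), (49.28, -123.12)]
guess = most_likely_location(friends)
```

Because every user shares the same decay function here, the sketch has the limitation noted above: it ignores population density.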

Content-Based Location Detection

Geo-location information can be determined from the content of tweets, Facebook posts, and blog postings, although this is challenging because the location names mentioned in these texts are often ambiguous. For example, there might exist several cities with the same name, so a disambiguation module is needed. Another difficulty is detecting locations in the first place, without confusing them with other proper names. For example, Georgia can be the name of a person, the name of a state in the U.S., or the name of a country. The challenges are even bigger in social media text where users might not use capital letters for names and locations, giving Named Entity Recognition (NER) tools a harder time.

Some NER tools detect entities such as People, Organizations, and Locations. Therefore, they include locations, but they do not target more detailed information, such as: Is the location a city, a province, a state, or a county? In what country is it? Location detection should also go further to disambiguate the location when there is more than one geographic location with the same name. It is not trivial to decide which of the locations with the same name is referred to in a context. If there is evidence about the country or the state/province, a decision can be based on this information. A default choice is always to choose the city with the largest population, since there is a higher chance that more people from a larger place post messages about that location.
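The disambiguation strategy described above (use country or state/province evidence when it exists, otherwise fall back to the most populous candidate) can be sketched as follows. The gazetteer entries and the population figures are rough illustrative values, and the function name `disambiguate` is hypothetical.

```python
from typing import Optional

# Hypothetical mini-gazetteer: (name, country code, region, population).
# Population figures are rough illustrative values, not authoritative data.
GAZETTEER = [
    ("London", "GB", "England", 8_900_000),
    ("London", "CA", "Ontario", 400_000),
    ("London", "US", "Kentucky", 8_000),
]

def disambiguate(name: str, country: Optional[str] = None,
                 region: Optional[str] = None) -> Optional[tuple]:
    """Pick one gazetteer entry for a possibly ambiguous place name."""
    candidates = [e for e in GAZETTEER if e[0].lower() == name.lower()]
    # Narrow by contextual evidence only when it leaves at least one candidate.
    if country:
        candidates = [e for e in candidates if e[1] == country] or candidates
    if region:
        candidates = [e for e in candidates if e[2] == region] or candidates
    if not candidates:
        return None
    # Default choice: the candidate with the largest population.
    return max(candidates, key=lambda e: e[3])
```

With no context, `disambiguate("london")` returns the entry for England; with `country="CA"` it returns the Ontario entry.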

User Locations

Detecting the physical location of a user is a different task from detecting the locations of the events mentioned in the text content, but similar techniques can disambiguate the location when there exist several locations with the same name. A user who has a Twitter account can write anything in the space for user location: a fully correct name with province/state and country, or only the city name. Sometimes arbitrary strings or misspellings are found in that field. Many users do not specify their location at all.

Several methods have been proposed to predict users’ locations based on the social media text data they generate. One of the very first is by Cheng et al. (2010), who first learned the location distribution for each word, then inferred the location of users at the U.S. city level according to the words in their tweets.
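A much-simplified sketch of the word-distribution idea follows. It is not the method of Cheng et al. (2010), which is more elaborate; it only estimates p(city | word) from relative frequencies and sums those probabilities over a user's words. The training tweets and the function names are invented for illustration.

```python
from collections import Counter, defaultdict

# Toy training data (invented): (city, tweet text) pairs with known locations.
TRAIN = [
    ("Houston", "rockets game tonight howdy"),
    ("Houston", "howdy traffic on the loop"),
    ("New York", "subway delayed again"),
    ("New York", "bagel then subway to work"),
]

def word_city_distributions(pairs):
    """Estimate p(city | word) from relative frequencies in the training data."""
    counts = defaultdict(Counter)
    for city, text in pairs:
        for word in text.lower().split():
            counts[word][city] += 1
    return {
        word: {city: n / sum(c.values()) for city, n in c.items()}
        for word, c in counts.items()
    }

def predict_city(tweets, dists):
    """Sum p(city | word) over all known words in a user's tweets."""
    scores = Counter()
    for text in tweets:
        for word in text.lower().split():
            for city, p in dists.get(word, {}).items():
                scores[city] += p
    return scores.most_common(1)[0][0] if scores else None

dists = word_city_distributions(TRAIN)
```

Words seen in only one city carry the strongest signal here; a realistic system would also need smoothing and far more data.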

Location Mentions

Detecting all locations mentioned in a message differs from detecting the location of each user. The mentions could refer to locations near the users’ homes, to places they travel to, or to events anywhere in the world. The methods of detecting locations mentioned in messages are similar to those used for NER. The most successful methods use machine learning classifiers such as CRF to detect sequences of words that represent locations. Gazetteers and other dictionaries or geographical resources play an important role. They could contain a list of places: cities, states/provinces/counties, countries, rivers, and mountains. Abbreviations for country codes, states, or provinces need to be considered when detecting the locations, as well as alternative spellings for places (for example, Los Angeles, L.A., or LA). One very useful resource for this kind of information is GeoNames, a geographical database that covers all countries and contains over eight million place names, available for download free of charge. It contains information about countries, cities, mountains, lakes, and a lot more. Another available resource is OpenStreetMap. It is open in the sense that people can add new locations, and the resources it provides can be used for any purpose, with attribution. The main advantage of using it is that it offers the possibility to display any detected locations on the map.
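A gazetteer lookup that handles alternative spellings can be sketched as a longest-match search over an alias table. The table below is hypothetical and tiny; a real system might populate it from a resource such as GeoNames. Note that short aliases such as "la" produce false positives, which is one reason trained sequence classifiers are preferred.

```python
import re

# Hypothetical alias table mapping lowercase surface forms to canonical names.
ALIASES = {
    "los angeles": "Los Angeles",
    "l.a.": "Los Angeles",
    "la": "Los Angeles",
    "new york": "New York",
    "nyc": "New York",
}
MAX_WORDS = max(len(alias.split()) for alias in ALIASES)

def find_locations(text):
    """Return (surface form, canonical name) pairs, preferring longer matches."""
    tokens = re.findall(r"[A-Za-z][A-Za-z.]*", text)
    found, i = [], 0
    while i < len(tokens):
        for n in range(min(MAX_WORDS, len(tokens) - i), 0, -1):
            surface = " ".join(tokens[i:i + n])
            if surface.lower() not in ALIASES:
                surface = surface.rstrip(".")  # drop a sentence-final period
            if surface.lower() in ALIASES:
                found.append((surface, ALIASES[surface.lower()]))
                i += n
                break
        else:
            i += 1  # no alias starts at this token
    return found
```

Matching is case-insensitive, which helps with social media text in which users do not capitalize place names.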

Sequence classification techniques such as CRF are most useful when detecting location expressions mentioned in texts. Inkpen et al. (2015) proposed methods of extracting locations mentioned in texts and disambiguating them when a location can also be a proper name, or a common noun, or when multiple locations with the same name exist. This is a sub-task of named entity recognition, but with deeper focus on location mentions in text in order to classify them as cities, provinces/states, or countries. The authors annotated a dataset of 6,000 Twitter messages. An initial annotation was done using gazetteer lookups in GATE (Cunningham et al., 2002), then two annotators performed manual annotations in order to add missing locations and to correct or remove wrong ones. Then CRF classifiers were trained with various sets of features (such as bag-of-words, gazetteer, part-of-speech, and context-based features) to detect spans of text that denote the locations. In the next stage, the authors applied disambiguation rules in case a detected location mention corresponded to more than one geographic location.

Another dataset of social media data annotated with location expressions was produced by Liu et al. (2014). It contains a variety of social media data: 500 blogs, 500 YouTube comments, 500 forums, 1,000 Twitter messages, and 500 English Wikipedia articles. The annotated location expressions are generic, without distinguishing the type of location or the exact geographical location.

Opinion Mining and Emotion Analysis

Sentiment Analysis

What people think is always an important piece of information. Asking a friend to recommend a dentist or writing a reference letter for a job application are examples of this importance in our daily life (Liu, 2012; Pang and Lee, 2008). On social media platforms such as Weblogs, social blogs, microblogging, wikis, and discussion forums, people can easily express and share their opinions. These opinions can be accessed by people who need more information in order to make decisions. The complexity of these decisions varies from simple things, such as choosing a restaurant for lunch or buying a smartphone, to such grave matters as approving laws in the parliament and even critical decisions such as monitoring public safety by security officers.

Due to the mass of information exchanged daily on social media, the traditional monitoring techniques are not useful. Therefore, a number of research directions aim to establish automated tools which should be intelligent enough to extract the opinion of a writer from a given text. Processing a text in order to identify and extract its subjective information is known as sentiment analysis, also referred to as opinion mining. The basic goal of sentiment analysis is to identify the overall polarity of a document: positive, negative, or neutral (Pang and Lee, 2008). The polarity magnitude is also taken into account, for example on a scale of 1–5 stars for movie reviews. Sentiment analysis is not an easy job even for humans, because sometimes two people disagree on the sentiment expressed in a given text. Therefore, such analysis is a difficult task for the algorithms and it gets harder when the texts get shorter. Another challenge is to connect the opinion to the entity that is the target of the opinion, and it is often the case that there are multiple aspects of the entities. Users could express positive opinions toward some aspects and negative toward other aspects; for example, a user could like a hotel for its location but not for its quality. Popescu and Etzioni (2005), among others, developed methods for extracting aspects and opinions on them from product reviews.

There is a huge demand from major companies, so most research has focused on product reviews, aiming to predict whether a review has a positive or a negative opinion. There also is, however, some research on investigating the sentiment of informal social interactions. From the perspective of social sciences, the informal social interactions provide more clues about the public opinion about various topics. For instance, a study was conducted to measure the levels of happiness based on the sentiment analysis of the songs, blogs, and presidential speeches (Dodds and Danforth, 2010). In this study, the age and geographic differences in the levels of happiness were analyzed as well.

One of the social concerns is the dramatic changes in social interactions when an important event occurs. These changes can be detected by a sharp increase in the frequency of terms related to the event. Identifiable changes are useful in detecting new events and determining their importance for the public (Thelwall et al., 2011).

Although the analysis of social interaction is similar to product review analysis, there are many differences between these two domains: the length of the content—a product review is longer than a typical social interaction; the topic of the context which could be anything in social interaction but is known in a product review; and the informality of the spelling and the frequent use of abbreviations in social media texts. Furthermore, in informal social interactions, no clear standard exists, while metadata—such as star rating and thumbing up/down—often accompany product reviews (Paltoglou and Thelwall, 2012).

There are many challenges when one applies typical opinion mining and sentiment analysis techniques to social media (Maynard et al., 2012). Microposts such as tweets are challenging because they do not contain much contextual information and assume much implicit knowledge. Ambiguity is a particular problem since we cannot easily make use of co-reference information. Unlike in blog posts and comments, tweets do not typically follow a conversation thread, and appear more in isolation from other tweets. They also exhibit more language variation, tend to be less grammatical than longer posts, contain non-standard capitalization, and make frequent use of emoticons, abbreviations, and hashtags, which can form an important part of the meaning. Typically, they also contain extensive use of irony and sarcasm, which are particularly difficult for a machine to detect. On the other hand, they tend to focus on the topics more explicitly and most often a tweet is about a single topic.

Twitter was often targeted by sentiment analysis projects in order to investigate how the public mood is affected by the social, political, cultural, and economic events. Guerra et al. (2014) showed that Twitter users tend to report more positive opinions than negative ones, and more extreme opinions rather than average ones. This has an effect on the training data that can be collected, because imbalanced data are more difficult for classification tasks.

A benchmark dataset was created for a shared task at SemEval 2013—sentiment analysis in Twitter. The dataset consists of approximately 8,000 tweets annotated with the labels: positive, negative, neutral, and objective (no opinion). There were two sub-tasks. Given a message that contains a marked instance of a word or phrase, the goal of Task A was to determine whether that instance is positive, negative, or neutral in that context. The goal of Task B was to classify whether the message expresses a positive, negative, or neutral sentiment. For messages conveying both a positive and negative sentiment, the stronger sentiment needed to be chosen. There were also messages annotated as objective, expressing facts not opinions. Here is an example of one of the annotated messages. It includes the message ID, the user ID, the topic, the label, and the text of the message:

100032373000896513 15486118 lady gaga “positive” Wow!! Lady Gaga is actually at the Britney Spears Femme Fatale Concert tonight!!! She still listens to her music!!!! WOW!!!

More editions have been held at SemEval 2014, 2015, and 2016, and more datasets were released for various sub-tasks: an expression-level task, a message-level task, a topic-related task, a trend task, and a task on prior polarity of terms.

The methods used in sentiment analysis are based on learning from annotated data, or on counting the number of positive and negative terms. Hybrid systems were also proposed. Many lists of positive and negative terms were developed, all of them with limited coverage. Also, some words have different polarity depending on the sense of the word in the context or in the domain. Here are several lists of positive/negative words, called polarity lexicons: the General Inquirer (Stone et al., 1962), the MPQA polarity lexicon (Wiebe et al., 2005), SentiWordNet (Baccianella et al., 2010), Bing Liu’s polarity lexicon (Hu and Liu, 2004), and LIWC (Linguistic Inquiry and Word Count) (Pennebaker et al., 2007). Intensity levels for these words could also be available in some of the resources. The lexicons can be used directly in methods that count polarity-bearing words (then choose as the text polarity the one that corresponds to the largest value according to some formula, possibly normalized by the length of the text), or these counts (values) can be used as features in machine learning techniques. The task is difficult because the polarity of the words changes with the domain, and even in the same domain it changes in different contexts (Wilson et al., 2009). Another drawback of using lexicons is their limited coverage, but they can still be useful as a basis to which domain-specific words and their polarities can be added.
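The counting approach can be sketched in a few lines. The two word sets below are a tiny hypothetical stand-in for a real polarity lexicon, and the length-normalized score is just one possible formula; the same counts could instead be passed as features to a classifier.

```python
# Tiny hypothetical lexicon; real polarity lexicons are far larger.
POSITIVE = {"good", "great", "love", "wow", "excellent"}
NEGATIVE = {"bad", "awful", "hate", "terrible", "boring"}

def polarity(text, threshold=0.0):
    """Return (label, score), where score = (positives - negatives) / length."""
    tokens = [t.strip(".,!?\"'").lower() for t in text.split()]
    tokens = [t for t in tokens if t]
    if not tokens:
        return "neutral", 0.0
    pos = sum(t in POSITIVE for t in tokens)
    neg = sum(t in NEGATIVE for t in tokens)
    score = (pos - neg) / len(tokens)  # normalized by text length
    if score > threshold:
        return "positive", score
    if score < -threshold:
        return "negative", score
    return "neutral", score
```

This sketch ignores negation, intensity, and word sense, which is why the domain and context effects described above limit its accuracy.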

An early work that focused on sentiment classification in Twitter messages was done by Go et al. (2009). They classified messages as either positive or negative with respect to a query term. This is useful for consumers who want to research the sentiment of products before purchase, or companies that want to monitor the public sentiment of their brands. They used machine learning algorithms (such as Naïve Bayes, Maximum Entropy, and SVM) for classifying the sentiment of Twitter messages using distant supervision. Distant supervision means that the training data was automatically collected by using positive and negative emoticons as noisy labels. This type of training data is easy to collect, but is not very reliable. Pak and Paroubek (2010b) also automatically collected a corpus from Twitter for sentiment analysis and opinion mining purposes and built a classifier to determine positive, negative, and neutral sentiments.
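Collecting noisy labels from emoticons can be sketched as follows. The emoticon lists and the function name `noisy_label` are assumptions made for this example. Removing the emoticon from the text, so that a classifier trained on the result cannot rely on the labeling cue itself, and discarding tweets with conflicting emoticons are design choices of the sketch.

```python
# Assumed emoticon lists used as noisy label cues.
POSITIVE_EMOTICONS = (":)", ":-)", ":D", "=)")
NEGATIVE_EMOTICONS = (":(", ":-(")

def noisy_label(tweet):
    """Return (text without emoticons, label), or None if no usable cue."""
    has_pos = any(e in tweet for e in POSITIVE_EMOTICONS)
    has_neg = any(e in tweet for e in NEGATIVE_EMOTICONS)
    if has_pos == has_neg:
        return None  # no emoticon at all, or conflicting emoticons
    label = "positive" if has_pos else "negative"
    text = tweet
    for emoticon in POSITIVE_EMOTICONS + NEGATIVE_EMOTICONS:
        text = text.replace(emoticon, " ")
    return " ".join(text.split()), label

raw = ["great show :)", "missed my bus :(", "no emoticon here"]
training_data = [pair for pair in map(noisy_label, raw) if pair is not None]
```

The labels are only as reliable as the emoticons: a sarcastic tweet ending in a smiley would be labeled positive.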

Adjectives were considered the most important features in sentiment analysis, starting from the early work on customer reviews (Hatzivassiloglou and McKeown, 1997). Moghaddam and Popowich (2010) determined the polarity of reviews by identifying the polarity of the adjectives that appear in them. Pak and Paroubek (2010a) studied ambiguous sentiment adjectives and presented experiments on the SemEval 2010 data, for the task of disambiguating sentiment ambiguous adjectives for Chinese.

Many of the methods from the sentiment analysis in the Twitter SemEval task are based on machine learning methods that use a large variety of features, from simple words to complex linguistic and sentiment-related features. Mohammad et al. (2013) used SVM classifiers with features such as: n-grams, character n-grams, emoticons, hashtags, capitalization information, parts of speech, negation features, word clusters, and multiple lexicons. In the 2015 sub-tasks, similarly to the previous two years, almost all systems used supervised learning. Popular machine learning approaches included SVM, Maximum Entropy, CRFs, and linear regression. In several of the subtasks, the top system used deep neural networks and word embeddings, and some systems benefited from special weighting of the positive and negative examples. The most important features were those derived from sentiment lexicons. Other important features included bag-of-word features, hashtags, handling of negation, word shape and punctuation features, and elongated words. Moreover, tweet pre-processing and normalization were an important part of the processing pipeline (Rosenthal et al., 2015).

Emotion Analysis

Emotion analysis emerged as a task somewhat more specific than opinion analysis, since it looks at fine-grained types of emotion. Research on emotion detection started with Holzman and Pottenger (2003) and Rubin et al. (2004) who investigated emotion detection on very small data sets. More recently, work was done on classifying blog sentences (Aman and Szpakowicz, 2007) and newspaper headlines (Strapparava and Mihalcea, 2007) into the six classes of emotions proposed by Ekman (1992). Classification of sentences by emotions was also done into the nine classes of emotions proposed by Izard (1971), for types of sentences (Neviarouskaya et al., 2009) and on sentences from fairy tales (Alm et al., 2005).

There is no consensus on how many emotion classes should be used. Plutchik’s wheel of emotions proposes many emotions and arranges them in a wheel where each emotion type has a corresponding emotion with inverse polarity (Plutchik and Kellerman, 1980). Ekman’s six emotion classes (happiness, anger, sadness, fear, disgust, and surprise) are the ones used more often because they have associated facial expressions (Ekman, 1992).

Most of the methods used in emotion classification are based on machine learning. SVM classifiers tend to achieve the best results on this task. Rule-based approaches were also proposed (Neviarouskaya et al., 2009). Lists of emotion words were developed in order to add term counting features to the classification. Examples of such emotion lexicons are WordNetAffect (Strapparava and Valitutti, 2004) and ANEW (Affective Norms for English Words) (Bradley and Lang, 1999). LIWC also has labeled emotion words in addition to the labels for positivity/negativity mentioned above. Mohammad and Turney (2013) collected a larger emotion lexicon by crowdsourcing.

In Bollen et al. (2011), other types of emotions were extracted, including tension, depression, anger, vigor, fatigue, and confusion. The results of this analysis showed that the events that cause these emotions have a significant, immediate, and highly specific effect on the public mood in various dimensions. Jung et al. (2006) used some common-sense knowledge from ConceptNet (Liu and Singh, 2004), and a list of affective words (Bradley and Lang, 1999) to treat four emotion classes, a subset of Ekman’s six emotions.

Among the work on emotion analysis, that on social media data was focused on blogs and on tweets. Aman and Szpakowicz (2007) applied SVM classifiers to the dataset of annotated blog sentences mentioned above, and used emotion words from Roget’s thesaurus as features for classification. Ghazi et al. (2010) applied hierarchical classification to the same dataset, by classifying the blog sentences into neutral or expressing emotions, then the latter ones into positive and negative emotions. The positive ones were mostly in the class of happiness, while the rest were negative. Surprise could be positive or negative, but it was mostly negative in that dataset. Syntactic dependency features were also explored for the same task (Ghazi et al., 2014).

On Twitter data, Mohammad and Kiritchenko (2014) used hashtags to capture fine-grained emotion categories. The hashtags were used to label the data, with the risk of obtaining noisy training data. The experiments showed that classification is still possible in this setting called distant supervision. Similarly, Abdul-Mageed and Ungar (2017) built a very large dataset for fine-grained emotion detection using carefully chosen hashtags for automatic labeling of the data. Because the training data was large enough, they were able to train deep learning models that achieved high accuracies.

Nakov et al. (2016) discuss the fourth year of the SemEval task called the “Sentiment Analysis in Twitter” task. This latest iteration included five subtasks. Subtasks A, C, and D predict tweets as having positive, negative, or neutral sentiment. The remaining subtasks challenged researchers to map the sentiment of tweets on given topics to a five-point scale. A total of 43 teams participated in this SemEval-2016 Task 4, representing 25 countries. Many top-ranked teams showcased the efficacy of deep learning, including convolutional neural networks, recurrent neural networks, and word embeddings in such analysis. The following are examples of teams that achieved high ranks in these tasks.

One such team was Deriu et al. (2016). Their sentiment classification model used an ensemble of convolutional neural networks with distant supervision (tweets labeled by hashtags rather than human annotators). This combination achieved a winning F-score of 0.63 on the Twitter-2016 test set. Palogiannidi et al. (2016) presented a method of sentiment analysis using semantic-affective model adaptation. They used a large generic corpus which included 116M sentences. Balikas and Amini (2016) proposed a two-step approach. In the first step, they generated and validated diverse feature sets for Twitter sentiment evaluation. In the second step, they focused on the optimization of the evaluation measure of the different subtasks. This method included feature extraction, feature representation, and feature transformation, and ranked among the top ten teams in four out of five subtasks. Stojanovski et al. (2016) used a deep learning architecture for sentiment analysis that employed convolutional and gated recurrent neural networks. Their system leveraged preprocessing, pre-trained word embeddings, convolutional neural networks, gated recurrent neural networks, and network fusion (sharing layers across networks), and achieved the second-best average rank on the binary and 5-point classification and quantification subtasks.

A few researchers focused on mood classification in social media data. Moods are similar to emotions, but they express more transient states. LiveJournal is a Web site that allows users to write how they feel and to label their blog posts with one of the 132 existing moods, or even to create new labels. Mishne (2005) collected a corpus of posts from LiveJournal annotated with mood labels, and implemented an SVM classifier to automatically classify blogs into the 40 most frequent moods. He used features such as frequency counts, lengths, sentiment orientations, emphasized words, and special symbols. Keshtkar and Inkpen (2012) further investigated this dataset by adding more sentiment orientation features. Moreover, they proposed a hierarchical classifier based on the hierarchy of moods, using SVM in each branch of the hierarchy. They experimented with all 132 moods. Since 132 classes are easily confused (by humans and by the automatic system), the hierarchical approach was essential in order to obtain good classification results. The features used in the classification started with bag-of-words features and added semantic orientation features calculated by using multiple polarity lexicons.

Sarcasm Detection

One of the problems in opinion mining systems is that sarcastic or ironic statements could easily fool these systems. The difference between irony and sarcasm is subtle; it lies in the idea that irony can be involuntary, while sarcasm is deliberate.

Irony, generally speaking, can naturally occur in both language and circumstance; one experiences irony when the opposite of an expected situation or idea occurs. In essence, an individual does not need to go out of their way to experience an ironic situation or idea: they can occur naturally. Sarcasm, for its part, can make use of irony to make an observation or remark about an idea, person, or situation. Sarcasm is generally intended to express ridicule or reservation about an expression or idea, and that is why it tends to find broader usage than irony. For an automatic system, the difference between them is difficult to catch, and perhaps not necessary. From applications’ point of view, it is important to detect sarcastic/ironic statements in order to distinguish them from genuine opinions.

Several researchers attempted to detect sarcastic statements, mainly by using classification approaches. SVM and other classifiers were used, and many sets of features were tested. The features include specific punctuation (such as exclamation marks), Twitter-specific symbols, syntactic information, and world knowledge. González-Ibáñez et al. (2011) explored lexical and pragmatic features and found that smileys, frowns, and ToUser features were among the most discriminating for the classification task. They also found that human judges have a hard time performing the sarcasm detection task. Barbieri et al. (2014) proposed lexical features that aim to detect sarcasm by its structure, by computing unexpectedness, intensity of terms, and imbalance between styles. Riloff et al. (2013) identified only sarcastic messages created by a contrast between a positive sentiment and a negative situation. Their bootstrapping method acquired a list of positive sentiment phrases and a list of negative activities and states.

Training data for detecting sarcasm can be manually annotated as sarcastic or not, or can be obtained automatically. Many researchers worked on Twitter data, and collected messages with the #sarcasm hashtag to use as training examples for the sarcasm class. For the negative class they collected other messages not including this hashtag, but there is no guarantee that some sarcastic messages were not included as examples of the non-sarcastic class. Ideally, the latter examples should be manually checked, but this is time consuming, so it is not usually done. Davidov et al. (2010) proposed a semi-supervised approach in order to reduce the need for annotated training data. Lukin and Walker (2013) also used bootstrapping, but they worked on online dialog texts, unlike the previously cited work focused on Twitter messages.

Event and Topic Detection

Event detection in social media texts is important because people tend to post many messages about current events, and many users read those comments in order to find the information that they need. Event detection techniques can be classified according to the event type—specified or unspecified; the detection task—retrospective or new event detection; and the detection method—supervised or unsupervised—as described in the survey paper by Farzindar and Khreich (2013).

Specified Versus Unspecified Event Detection

Depending on the information available on the event of interest, event detection can be classified into techniques for specified and for unspecified events. When no prior information is available about the event, the unspecified event detection techniques rely on the temporal signal of social media streams to detect the occurrence of a real-world event. These techniques typically require monitoring for bursts or trends in social media streams, grouping the features with identical trends into events, and ultimately classifying the events into different categories. On the other hand, the specified event detection relies on specific information and features that are known about the event, such as a venue, time, type, and description, which are provided by the user or from the event’s context. These features can be exploited by adapting traditional information retrieval and extraction techniques (such as filtering, query generation and expansion, clustering, and information aggregation) to the unique characteristics of social media data.

Unspecified Event Detection

The nature of Twitter posts reflects events as they unfold, so tweets can be particularly useful for detecting unknown events. Unknown events of interest are typically driven by emerging events, breaking news, and general topics that attract the attention of a large number of Twitter users. Since no event information is available, unknown events are typically detected by exploiting the temporal patterns or signals of Twitter streams. New events of general interest exhibit a burst of features in the Twitter streams, yielding, for instance, a sudden increased use of specific keywords. Bursty features that occur frequently together in tweets can then be grouped into trends (Mathioudakis and Koudas, 2010). In addition to trending events, endogenous or non-event trends are also abundant on Twitter (Naaman et al., 2011). Techniques for unspecified event detection in Twitter must therefore distinguish trending events of general interest from the trivial or non-event trends (exhibiting similar temporal patterns) using scalable and efficient algorithms. The techniques described below attempted to meet these challenges. Most of them are based on detecting topic words that might signal a new event, and then using similarity calculation or classification to detect more messages about the same event.
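
As a rough illustration of what a burst of features can mean in practice, the sketch below flags keywords whose count in the newest time window is far above their own history. The z-score rule, the thresholds, and the function name bursty_terms are assumptions chosen for this example; the systems surveyed below use their own, more elaborate criteria.

```python
from collections import Counter
from statistics import mean, pstdev

def bursty_terms(history, current, z_threshold=3.0, min_count=5):
    """Flag terms whose count in the current window is far above their past counts.

    history: non-empty list of Counter objects, one per past time window.
    current: Counter for the newest window.
    """
    bursty = []
    for term, count in current.items():
        if count < min_count:
            continue  # ignore terms that are too rare to matter
        past = [window.get(term, 0) for window in history]
        mu = mean(past)
        # Floor the deviation at 1 so a flat history cannot cause division by zero.
        sigma = max(pstdev(past), 1.0)
        z = (count - mu) / sigma
        if z >= z_threshold:
            bursty.append((term, z))
    return sorted(bursty, key=lambda pair: pair[1], reverse=True)

history = [Counter(the=50, quake=1), Counter(the=52, quake=0), Counter(the=48, quake=2)]
current = Counter(the=51, quake=40)
print(bursty_terms(history, current))  # only "quake" is flagged
```

A frequent but stable word such as "the" is not flagged, because its current count is close to its historical mean.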

Sankaranarayanan et al. (2009) presented a system called TwitterStand that captures tweets that correspond to late-breaking news. They employed a Naïve Bayes classifier to separate news from irrelevant information, and an online clustering algorithm based on weighted term vectors of TF-IDF values and on cosine similarity to form clusters of news. In addition, hashtags were used to reduce clustering errors. Clusters were also associated with time information. Other issues addressed included removing the noise and determining the relevant locations associated with the tweets. Similarly, Phuvipadawat and Murata (2010) collected, grouped, ranked, and tracked breaking news from Twitter. They collected sample tweets from the Twitter API using predefined search queries—for example, #breakingnews—and indexed their content with Apache Lucene. Similar messages were then grouped to form a news story based on TF-IDF with an increased weight for proper noun terms, hashtags, and usernames. The authors used a weighted combination of reliability and popularity of tweets, with a time adjustment for the freshness of the messages, to rank each cluster. New messages were included in a cluster if they were similar to the first message and to the top k terms in that cluster. The authors stressed the importance of proper noun identification in enhancing the similarity comparison between tweets, and hence improving the overall system accuracy. An application based on the proposed method called Hot-stream has been developed.
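
Both systems rest on TF-IDF term weighting and cosine similarity. The following minimal sketch shows one common formulation of the two measures for tokenized messages; the exact weighting variants used by TwitterStand and Hot-stream are not given here, so the formula below (relative term frequency times the logarithm of inverse document frequency) should be read as an assumption.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Map each tokenized document to a {term: TF-IDF weight} dictionary."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        term_freq = Counter(doc)
        vectors.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in term_freq.items()
        })
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dictionaries."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

docs = [
    ["earthquake", "hits", "city"],
    ["earthquake", "shakes", "city"],
    ["new", "phone", "released"],
]
vectors = tfidf_vectors(docs)
print(cosine(vectors[0], vectors[1]), cosine(vectors[0], vectors[2]))
```

The two earthquake messages share weighted terms and therefore score higher than the unrelated pair, whose similarity is zero.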

Petrovic et al. (2010) adapted the approach proposed for news media by Allan et al. (2000). Cosine similarity between documents was used to detect new events that have never appeared in previous tweets. Replies, retweets, and hashtags were not considered in their experiments, nor the significance of newly detected events—for example, trivial or not. Results have shown that ranking according to the number of users is better than ranking according to the number of tweets, and considering entropy of the message reduces the amount of spam messages in the output.

Becker et al. (2011b) focused on online identification of real-world event content and its associated Twitter messages using an online clustering technique, which continuously clusters similar tweets, and then classifies the cluster’s content into real-world events or non-events. These non-events involve Twitter-centric topics, which are trending activities in Twitter that do not reflect any real-world occurrences (Naaman et al., 2011). Twitter-centric activities are difficult to detect, because they often share similar temporal distribution characteristics with real-world events. Each message is represented as a TF-IDF weight vector of its textual content, and cosine similarity is used to compute the distance from a message to cluster centroids. In addition to traditional pre-processing steps such as stop-word elimination and stemming, the weights of hashtag terms were doubled since they are considered strongly indicative of the message content. The authors combined temporal, social, topical, and Twitter-centric features. Since the clusters constantly evolve over time, the features were periodically updated for old clusters and computed for newly formed ones. Finally, an SVM classifier was trained on a labeled set of cluster features, and used to decide whether the cluster—and its associated messages—contains real-world event information.
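
The incremental clustering step can be sketched as follows. This is a simplified illustration under stated assumptions: raw term counts stand in for TF-IDF weights, the similarity threshold of 0.5 is arbitrary, and the names weighted_terms and online_cluster are hypothetical. Only the doubling of hashtag weights and the comparison against cluster centroids come from the description above.

```python
import math
from collections import Counter

def weighted_terms(tokens, hashtag_boost=2.0):
    """Term-count vector in which hashtag terms count double."""
    vector = Counter()
    for token in tokens:
        vector[token] += hashtag_boost if token.startswith("#") else 1.0
    return vector

def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm = math.sqrt(sum(w * w for w in u.values()) * sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0

def online_cluster(messages, threshold=0.5):
    """Assign each message to the most similar cluster, or start a new one."""
    clusters = []  # each cluster: {"centroid": Counter, "members": [message indices]}
    for index, tokens in enumerate(messages):
        vector = weighted_terms(tokens)
        best, best_sim = None, 0.0
        for cluster in clusters:
            sim = cosine(vector, cluster["centroid"])
            if sim > best_sim:
                best, best_sim = cluster, sim
        if best is not None and best_sim >= threshold:
            # The centroid is kept as a sum; cosine similarity ignores its scale.
            best["centroid"].update(vector)
            best["members"].append(index)
        else:
            clusters.append({"centroid": Counter(vector), "members": [index]})
    return clusters

messages = [
    ["#quake", "tokyo", "shaking"],
    ["tokyo", "#quake", "strong"],
    ["new", "album", "out"],
]
print([c["members"] for c in online_cluster(messages)])  # [[0, 1], [2]]
```

In a full system, cluster-level features would then be computed for each cluster and passed to a classifier that separates real-world events from non-events.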

Long et al. (2011) adapted a traditional clustering approach by integrating some specific features into the characteristics of microblog data. These features are based on topical words, which are more popular than others with respect to an event. Topical words are extracted from daily messages based on word frequency, word occurrence in hashtags, and word entropy. A top-down, hierarchical divisive clustering is applied to a co-occurrence graph—connecting messages in which topical words co-occur—to divide topical words into event clusters. To track changes among events at different times, a maximum weighted bipartite graph matching is employed to create event chains, with a variation of the Jaccard coefficient as the similarity measure between clusters. Finally, cosine similarity augmented with a time interval between messages is used to find the top k most relevant posts that summarize an event. These event summaries were then linked to event chain clusters and plotted on a time line. For event detection, the authors found that top-down divisive clustering outperforms both k-means and traditional hierarchical clustering algorithms.
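
To make the event-chain step concrete, the sketch below links clusters of topical words from two consecutive time periods. It uses the plain Jaccard coefficient rather than the authors' variation, and it finds the maximum-weight matching by exhaustive search, which is only practical for a handful of clusters; both choices, and the function names, are assumptions made for illustration.

```python
from itertools import permutations

def jaccard(a, b):
    """Plain Jaccard coefficient between two collections of words."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def best_matching(prev_clusters, next_clusters):
    """Maximum-weight bipartite matching by exhaustive search (small inputs only).

    Returns a list of (prev_index, next_index, similarity) triples.
    """
    n_prev, n_next = len(prev_clusters), len(next_clusters)
    if n_prev > n_next:
        # Match from the smaller side, then restore the original orientation.
        flipped = best_matching(next_clusters, prev_clusters)
        return sorted((i, j, w) for j, i, w in flipped)
    best_pairs, best_total = [], -1.0
    for assignment in permutations(range(n_next), n_prev):
        pairs = [(i, j, jaccard(prev_clusters[i], next_clusters[j]))
                 for i, j in enumerate(assignment)]
        total = sum(weight for _, _, weight in pairs)
        if total > best_total:
            best_pairs, best_total = pairs, total
    return best_pairs

day1 = [{"quake", "tokyo"}, {"election", "vote"}]
day2 = [{"vote", "election", "poll"}, {"quake", "tokyo", "aftershock"}, {"concert"}]
print(best_matching(day1, day2))
```

Here the earthquake cluster of the first day is linked to the second cluster of the next day, and the election cluster to the first. A real system would discard links whose similarity falls below some threshold and would use a polynomial-time matching algorithm.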

A bursty topic or event in Twitter is one that triggers many related tweets in a short period of time. Ex post facto analysis of such events has long been a topic of social media research; however, real-time detection of bursty events is relatively novel. Xie et al. devised a sketch-based topic model called “TopicSketch” to solve this challenge. This approach involved a soft moving window over a Twitter stream to efficiently detect surges in rare words and rare tuples of words. These frequencies were then stored in a matrix, and decomposed into smaller matrices that approximated topics by using Singular Value Decomposition (SVD). Evaluated over 30 million tweets in real-time, this approach proved to be both more efficient and effective than previous models.

Weng and Lee (2011) proposed event detection based on clustering of discrete wavelet signals built from individual words generated by Twitter. In contrast with Fourier transforms, which have been proposed for event detection from more traditional media, wavelet transformations were used in both the time and frequency domains, in order to identify the time and the duration of a bursty event within the signal. A sliding window was then applied to capture the change over time. Trivial words were filtered out based on a threshold set on signal cross-correlation, which measures similarity between two signals as a function of a time lag. The remaining words were then clustered to form events with a modularity-based graph partitioning technique, which splits the graph into subgraphs each corresponding to an event. Finally, significant events were detected from the number of words and the cross-correlation among the words related to an event.
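
Cross-correlation as a function of a time lag can be illustrated on two short word-frequency signals. The normalization below (mean-centered signals divided by the product of their norms) is one common convention and is an assumption here; it is applied to raw counts rather than to the wavelet-based signals used by the authors.

```python
import math

def cross_correlation(x, y, lag=0):
    """Normalized cross-correlation of two equal-length signals at a given lag.

    A positive lag compares x[t] with y[t + lag].
    """
    n = len(x)
    if n != len(y):
        raise ValueError("signals must have the same length")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    dx = [value - mean_x for value in x]
    dy = [value - mean_y for value in y]
    norm = math.sqrt(sum(d * d for d in dx) * sum(d * d for d in dy))
    if norm == 0.0:
        return 0.0  # a constant signal carries no information
    total = sum(dx[t] * dy[t + lag] for t in range(n) if 0 <= t + lag < n)
    return total / norm

# Two words whose usage peaks one time step apart.
word_a = [0, 0, 5, 0, 0, 0]
word_b = [0, 0, 0, 5, 0, 0]
print(cross_correlation(word_a, word_b, lag=0))
print(cross_correlation(word_a, word_b, lag=1))
```

The correlation is much higher at a lag of one step than at a lag of zero, which reflects the one-step offset between the two peaks.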

Similarly, Cordeiro (2012) proposed a continuous wavelet transformation based on hashtag occurrences combined with a topic model inference using Latent Dirichlet Allocation (LDA) (Blei et al., 2003). Instead of individual words, hashtags are used for building wavelet signals. An abrupt increase in the number of occurrences of a given hashtag is considered a good indicator of an event that is happening at a given time. Therefore, all hashtags were retrieved from tweets and then grouped in intervals of five minutes. Hashtag signals were constructed over time by counting the hashtag mentions in each interval, grouping them into separated time series—one for each hashtag—and concatenating all tweets that mention the hashtag during each time series. Adaptive filters were then used to remove noisy hashtag signals, before applying the continuous wavelet transformation and getting a time-frequency representation of the signal. Next, wavelet peak and local maxima detection techniques were used to detect peaks and changes in the hashtag signal. Finally, when an event was detected within a given time interval, LDA was applied to all the tweets related to the hashtag in each corresponding time series in order to extract a set of latent topics from which to build an event description.
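
The construction of hashtag signals can be sketched as follows. The input format (pairs of a Unix timestamp and the tweet text), the lowercasing of hashtags, and the function name hashtag_signals are assumptions for this example; the filtering, wavelet, and LDA stages are not shown.

```python
import re
from collections import defaultdict

HASHTAG = re.compile(r"#\w+")
INTERVAL_SECONDS = 5 * 60  # five-minute intervals, as described above

def hashtag_signals(tweets, interval=INTERVAL_SECONDS):
    """Build one time series per hashtag from (unix_timestamp, text) pairs.

    Returns two nested dictionaries keyed by hashtag and interval index:
    the number of tweets mentioning the hashtag, and the tweets themselves.
    """
    counts = defaultdict(lambda: defaultdict(int))
    texts = defaultdict(lambda: defaultdict(list))
    for timestamp, text in tweets:
        bucket = int(timestamp // interval)
        # A set ensures each tweet is counted once per hashtag.
        for tag in set(HASHTAG.findall(text.lower())):
            counts[tag][bucket] += 1
            texts[tag][bucket].append(text)
    return counts, texts

tweets = [
    (0, "#Quake felt here"),
    (100, "big #quake"),
    (400, "#quake again #tokyo"),
]
counts, texts = hashtag_signals(tweets)
print(dict(counts["#quake"]))  # {0: 2, 1: 1}
```

The stored texts per hashtag and interval correspond to the concatenated tweets on which a topic model could later be run.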

Specified Event Detection

Specified event detection includes known or planned social events. These events could be partially or fully specified with the related content or metadata information such as location, time, venue, and performers. The techniques described below attempt to exploit Twitter textual content or metadata information or both, using a wide range of machine learning, data mining, and text analysis techniques.

Popescu and Pennacchiotti (2010) focused on identifying controversial events that provoke public discussions with opposing opinions in Twitter. Their detection framework is based on the notion of a Twitter snapshot. Given a set of Twitter snapshots, an event detection module first distinguishes between event and non-event snapshots using a supervised Gradient Boosted Decision Trees (GBDT) (Friedman, 2001), trained on a manually labeled dataset. To rank these event snapshots, a controversy model assigns higher scores to controversial event snapshots, based on a regression algorithm applied to a large number of features. Feature analysis of the single-stage system revealed that the event’s core is the most relevant feature since it discriminates event from non-event snapshots. Hashtags are found to be important semantic features for tweets, since they help identify the topic of a tweet and estimate the topical cohesiveness of a set of tweets. In addition, the linguistic, structural, and sentiment features also provide considerable effects. The authors concluded that a rich, varied set of features is crucial for controversy detection.

In a follow-up, Popescu et al. (2011) employed the framework described above, but with additional features to extract events and their descriptions from Twitter. The key idea is based on the importance and the number of the entities to capture common-sense intuitions about event and non-event snapshots. As the authors observe: “Most event snapshots have a small set of important entities and additional minor entities while non-event snapshots may have a larger set of equally unimportant entities.” These new features are inspired by the document aboutness system (Paranjpe, 2009), and aim at ranking the entities in a snapshot with respect to their relative importance to the snapshot. This includes relative positional information—for example, offset of a term in snapshot; term-level information—that is, term frequency, Twitter corpus IDF—and snapshot-level information—length of snapshot, category, language. Part-of-speech tagging and regular expressions have also been applied for improved event and main entity extraction. The number of snapshots containing action verbs, the buzziness of an entity in the news on a given day and the number of reply tweets are among the most useful new features found by the authors.

Benson et al. (2011) presented a novel way of identifying Twitter messages for concert events using a factor graph model, which simultaneously analyzes individual messages, clusters them according to event type, and induces a canonical value for each event property. The motivation is to infer a comprehensive list of musical events from Twitter—based on artist-venue pairs—to complete an existing list—for example, city event calendar table—by discovering new musical events mentioned by Twitter users that are difficult to find in other media sources. At the message level, this approach relies on a CRF model to extract the name of the artist and the location of the event. The input features to the CRF model included word shape; a set of regular expressions for common emoticons, time references, and venue types; a bag of words for artist names extracted from an external source—for example, Wikipedia; and a bag of words for city venue names. Clustering was guided by term popularity, which is an alignment score among the message term labels—artist, venue, none—and some candidate value—for example, a specific artist or venue name. To capture the large text variation in Twitter messages, this score was based on a weighted combination of term similarity measures, including complete string matching, and adjacency and equality indicators scaled by the inverse document frequency. In addition, a uniqueness factor—favoring single messages—was employed during clustering to uncover rare event messages that are dominated by the popular ones, and to discourage various messages from the same event from clustering into multiple events. On the other hand, a consistency indicator was employed to discourage messages from multiple events from forming a single cluster. A factor graph model was then employed to capture the interaction between all components and provide the final decision. The output of the model was a clustering of messages based on a musical event, where each cluster was represented by an artist-venue pair.

Lee and Sumiya (2010) presented a geo-social local event detection system based on modeling and monitoring crowd behavior via Twitter, to identify local festivals. They relied on geographical regularities deduced from the usual behavior patterns of crowds using geotags. The authors found that an increased user activity combined with an increased number of tweets provides a strong indicator of local festivals. Sakaki et al. (2010) exploited tweets to detect specific types of events like earthquakes and typhoons. They formulated event detection as a classification problem, and trained an SVM classifier on a manually labeled Twitter data set comprising positive events—earthquakes and typhoons—and negative events—other events or non-events. Three types of features were employed: the number of words (statistical), the keywords in a tweet message, and the words surrounding the user's query (contextual). Experiments have shown that the statistical feature by itself provided the best results, while a small improvement in performance was achieved by the combination of the three features. The authors have also applied Kalman filtering and particle filtering (Fox et al., 2003) for the estimation of earthquake center and typhoon trajectory from Twitter temporal and spatial information. They found that particle filters outperformed Kalman filters in both cases, due to the inappropriate Gaussian assumption of the latter for this type of problem.
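
The three feature types used for the tweet classifier can be illustrated with a small extractor. The feature naming scheme, the one-word context window, and the function name event_tweet_features are assumptions made for this sketch; the resulting dictionary would still have to be converted into the input format of whatever classifier is used.

```python
def event_tweet_features(tokens, query, window=1):
    """Extract statistical, keyword, and contextual features from a tokenized tweet."""
    # Statistical feature: the number of words in the message.
    features = {"num_words": len(tokens)}
    # Keyword features: one indicator per word in the message.
    for token in tokens:
        features["kw=" + token] = 1
    # Contextual features: words within `window` positions of the query word.
    for position, token in enumerate(tokens):
        if token != query:
            continue
        start = max(0, position - window)
        end = min(len(tokens), position + window + 1)
        for neighbor in range(start, end):
            if neighbor != position:
                features["ctx=" + tokens[neighbor]] = 1
    return features

print(event_tweet_features(["big", "earthquake", "now"], "earthquake"))
```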

Becker et al. (2011a) presented a system for augmenting information about planned events with Twitter messages, using a combination of simple rules and query building strategies. To identify Twitter messages for an event, they begin with simple and precise query strategies derived from the event description and its associated aspects—for example, combining time and venue. In addition, they build queries using URL and hashtag statistics from the high-precision tweets for an event. Finally, they build a rule-based classifier to select among this new set of queries, and then use the selected queries to retrieve additional event messages. In a related work, Becker et al. (2011c) proposed centrality-based approaches to extract high-quality, relevant, and useful Twitter messages related to an event. These approaches are based on the observation that the most topically central messages in a cluster are more likely to reflect key aspects of the event than other, less central cluster messages. The techniques from both works have recently been extended and incorporated into a more general approach that aims at identifying social media contents for known events across different social media sites (Becker et al., 2012).

Massoudi et al. (2011) employed a generative language modeling approach based on query expansion and microblog “quality indicators” to retrieve individual microblog messages. However, the authors only considered the existence of a query term within a specific post and discarded its local frequency. The quality indicators include part of the blog “credibility indicators” proposed by Weerkamp and De Rijke (2008), extended with specific microblog characteristics such as a recency factor and the number of reposts and followers. The recency factor is based on the difference between the query time and the post time. The query expansion technique selects the top k terms that occur in a user-specified number of posts close to the query date. The final query is therefore a weighted mixture of the original and the expanded query. The combination of the quality indicator terms and the microblog characteristics has been shown to outperform each method alone. In addition, tokens with numeric or non-alphabetic characters have turned out to be beneficial for query expansion.

Rather than retrieving individual microblog messages in response to an event query, Metzler et al. (2012) proposed retrieving a ranked list (or timeline) of historical event summaries. The search task involves temporal query expansion, timespan retrieval, and summarization. In response to a user query, this approach retrieves a ranked set of timespans based on the occurrence of the query keywords. The idea is to capture terms that are heavily discussed and trending during a retrieved timespan because they are more likely to be related to the query. To produce a short summary for each retrieved time interval, a small set of query-relevant messages posted during the timespan is then selected. These relevant messages are retrieved as top-ranked messages according to a weighted variant of the query likelihood scoring function, which is based on the burstiness score for expansion terms and a Dirichlet smoothed language modeling estimate for each term in the message. The authors showed that their approach is more robust and effective than the traditional relevance-based language models (Lavrenko and Croft, 2001) applied to the collected Twitter corpus and to the English Gigaword corpus.
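
A weighted query likelihood score with Dirichlet smoothing can be sketched as below. The smoothing formula is the standard one; how the term weights are derived from burstiness is not specified here, so the weights in the example are placeholders, and the function names and the value of the smoothing parameter mu are assumptions.

```python
import math
from collections import Counter

def dirichlet_probability(term, message_counts, message_length,
                          collection_counts, collection_length, mu):
    """Dirichlet-smoothed estimate of P(term | message)."""
    p_collection = collection_counts.get(term, 0) / collection_length
    return (message_counts.get(term, 0) + mu * p_collection) / (message_length + mu)

def weighted_query_likelihood(term_weights, message, collection_counts,
                              collection_length, mu=2500.0):
    """Sum of weight * log P(term | message) over the weighted query terms."""
    message_counts = Counter(message)
    score = 0.0
    for term, weight in term_weights.items():
        p = dirichlet_probability(term, message_counts, len(message),
                                  collection_counts, collection_length, mu)
        if p > 0.0:  # terms never seen in the collection are skipped
            score += weight * math.log(p)
    return score

messages = [["earthquake", "tokyo", "now"], ["nice", "lunch", "today"]]
collection = Counter(token for message in messages for token in message)
total = sum(collection.values())
weights = {"earthquake": 1.0, "tokyo": 0.5}  # placeholder burstiness weights
scores = [weighted_query_likelihood(weights, m, collection, total, mu=10.0)
          for m in messages]
print(scores)
```

The message that contains the weighted terms receives the higher (less negative) score, so it would be ranked first in the summary for that timespan.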

Gu et al. (2011) proposed an approach called ETree for event modeling from Twitter streams. ETree employs n-gram-based content analysis techniques to group a large number of event-related messages into semantically coherent information blocks, an incremental modeling process to construct hierarchical theme structures, and a life cycle based temporal analysis technique to identify potential causal relationships between information blocks. ETree is more efficient than both its non-incremental version and TSCAN—a widely used algorithm that derives major themes of events from the eigenvectors of a temporal block association matrix (Chen and Chen, 2008).

New Versus Retrospective Events

Similar to event detection from conventional media (Allan, 2002; Yang et al., 1998, 2002), event detection in Twitter can also be classified into retrospective and new event detection depending on the task and application requirements, and on the type of event. Since new event detection (NED) techniques involve continuous monitoring of the Twitter signal for discovering new events in near real-time, they are naturally suited for detecting unknown real-world events or breaking news. In general, trending events on Twitter could be aligned with real-world breaking news. However, sometimes a comment, person, or photo related to real-world breaking news may become more trending on Twitter than the original event. One such example is Bobak Ferdowsi’s hair style on social media during NASA’s operation in 2012, when the media reported: “Mohawk guy Bobak Ferdowsi’s hair goes viral as Curiosity lands on Mars.”

Although NED approaches do not impose any assumption on the event, they are not restricted to detecting unspecified events. When the monitoring task involves specific events—such as natural disasters or celebrities—or specific information about the event—for example, geographical location—this information could be integrated into the NED system by, for instance, using filtering techniques (Sakaki et al., 2010) or exploiting additional features such as the controversy (Popescu and Pennacchiotti, 2010) or the geo-tagged information (Lee and Sumiya, 2010), to better focus on the event of interest. Most NED approaches could also be applied to historical data in order to detect and analyze past events.

While most research focuses on new event detection to exploit the timely information provided by Twitter streams, recent studies show an interest in retrospective event detection from Twitter’s historical data. Existing microblog search services, such as those offered by Twitter and Google, only provide limited search capabilities that allow individual microblog posts to be retrieved in response to a query (Metzler et al., 2012). The challenges in finding Twitter messages relevant to a given user query are mainly due to the sparseness of the tweets and the large number of vocabulary mismatches—because the vocabulary dynamically evolves. For example, relevant messages may not contain any query term, or new abbreviations or hashtags may emerge with the event. Traditional query expansion techniques rely on terms that co-occur with query terms in relevant documents. In contrast, event retrieval from Twitter data focuses on temporal and dynamic query expansion techniques. Recent research effort has begun to focus on providing more structured and comprehensive summaries of Twitter events.

Emergency Situation Awareness

Event detection in Twitter and other social media can be used for emergency situation awareness. New events can be detected and classified as an emergency, and then updates on the situation can be processed in order to keep people informed and to help resolve or alleviate the situation. We present two examples of systems that focus on this kind of monitoring. Both are based on machine learning techniques in order to classify the Twitter messages as being of interest or not.

Yin et al. (2012) implemented a system that extracts situation awareness information from Twitter messages generated during various disasters and crises. They collected tweets for specific areas of interest in Australia and New Zealand, starting in March 2010. The data contained 66 million tweets from approximately 2.51 million distinct Twitter profiles that cover a range of natural disasters and security incidents, including: the tropical cyclone Ului (March 2010), the Brisbane storms (June 2010), the gunman in Melbourne (June 2010), the Christchurch earthquake (September 2010), the Qantas A380 incident (November 2010), the Brisbane floods (January 2011), the tropical cyclone Yasi (February 2011), and the Christchurch earthquake (February 2011). The method started with burst detection for expected incidents, followed by a classification for impact assessment. The classifiers (Naïve Bayes and SVM) used lexical features and Twitter-specific features for classification. These features included unigrams, bigrams, word length, the number of hashtags contained in a tweet, the number of user mentions, whether a tweet is retweeted, and whether a tweet is replied to by other users. In a next step, online clustering was applied to topic discovery—using cosine similarity and Jaccard similarity to group messages in the same clusters.
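
The lexical and Twitter-specific features listed above can be gathered with a short extractor such as the one below. It is a sketch under several assumptions: tokens are obtained by whitespace splitting, "word length" is read as the number of words in the tweet, the retweet and reply flags are supplied by the caller from tweet metadata, and the function name situation_tweet_features is hypothetical.

```python
import re

def situation_tweet_features(text, is_retweeted=False, is_replied_to=False):
    """Collect lexical and Twitter-specific features for one tweet."""
    tokens = text.lower().split()
    features = {}
    for token in tokens:  # unigram indicators
        features["uni=" + token] = 1
    for first, second in zip(tokens, tokens[1:]):  # bigram indicators
        features["bi=" + first + "_" + second] = 1
    features["num_words"] = len(tokens)  # assumed reading of "word length"
    features["num_hashtags"] = len(re.findall(r"#\w+", text))
    features["num_mentions"] = len(re.findall(r"@\w+", text))
    features["is_retweeted"] = int(is_retweeted)
    features["is_replied_to"] = int(is_replied_to)
    return features

features = situation_tweet_features("Flood in #Brisbane @abcnews", is_retweeted=True)
print(features["num_hashtags"], features["num_mentions"], features["num_words"])  # 1 1 4
```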

Cobb et al. (2014) described automatic identification of Twitter messages that contribute to situational awareness. They collected tweets broadcast during each of four emergency events, based on selected keywords. The four datasets were: the 2009 Oklahoma wildfire (527 tweets), the 2009 Red River flooding (453 tweets), the 2010 Red River flooding (499 tweets), and the 2010 Haiti earthquake (486 tweets). Their method was based on Naïve Bayes and Maximum Entropy (MaxEnt) classifiers, in order to differentiate the tweets across several dimensions: subjectivity, personal or impersonal style, and linguistic register—formal or informal style. The features used for classification included: unigrams, bigrams, part-of-speech tags, the subjectivity of the message (objective / subjective), its style (formal / informal), and its tone (personal / impersonal). The last three features were calculated automatically by classifiers designed specifically for this. In an alternative experiment, they were manually annotated.