Thursday, December 12, 2013

Detection of Community Anomalies in Twitter Trends


   Users, when forming their own views on different trends, pay great attention to other users' points of view. The ratio of the numbers of users holding different opinions plays an important role in how a user's own view is formed. Obviously, there are forces interested in shaping users' trends and opinions. Such methods of influence are much more sophisticated than mere spam. In particular, a whole community with a given trend may be created artificially.
 When a user finds himself or herself in such a community, he or she may get the false impression that the community's trend is supported by a great number of users and therefore must be well-reasoned, analyzed, and unbiased. Having only a selective acquaintance with trends, it is very difficult for a user to detect that the communities which give rise to them are artificial. Such artificial trends may be created in discussions of various political, social, economic, or financial issues.
 One may detect artificial communities through long-term observation of information streams on a given topic. Analysis of the quantitative characteristics of the communities that emerge can reveal anomalies; the communities exhibiting these anomalies may be regarded as anomalous and thus excluded from further consideration and from the information stream.
 In Ukraine, for the last few weeks there have been nonviolent mass protests against government policy, and in particular against the breakdown of the association agreement with the EU, the use of force against peaceful demonstrators, etc. Evidently, these processes are reflected in social networks. It is also obvious that some forces are trying to influence network users' viewpoints on these events.
 That is why it is interesting and important to analyze social network information streams concerning the events in Ukraine, in order to reveal both the anomalous communities and the productive communities with effective discussions written by real users.
 Using the Twitter API, we have been loading tweets for several days with filtering keywords such as Ukraine, Euromaidan, etc. The analysis was conducted using the R and Python languages. From our point of view, the most effective analysis of tweets can be based on: formal concept analysis, the theory of frequent itemsets and association rules, network theory, and supervised and unsupervised classification.
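 A minimal sketch of this collection step might look as follows. It assumes the tweepy library (its pre-4.0 streaming interface) and placeholder credentials; the keyword list and the storage format are illustrative choices, not details of the original setup.

```python
# Hedged sketch: collect tweets matching keywords via the Twitter streaming API.
# Assumes tweepy's pre-4.0 StreamListener interface; credentials are placeholders.
import tweepy

CONSUMER_KEY = "..."
CONSUMER_SECRET = "..."
ACCESS_TOKEN = "..."
ACCESS_SECRET = "..."

class FileListener(tweepy.StreamListener):
    """Append every raw tweet (one JSON document per line) for later offline analysis."""
    def __init__(self, path):
        super(FileListener, self).__init__()
        self.out = open(path, "a")

    def on_data(self, data):
        self.out.write(data.strip() + "\n")
        return True          # keep streaming

    def on_error(self, status_code):
        return False         # stop on errors such as rate limiting

auth = tweepy.OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)
auth.set_access_token(ACCESS_TOKEN, ACCESS_SECRET)
stream = tweepy.Stream(auth=auth, listener=FileListener("tweets.json"))
stream.filter(track=["Ukraine", "Euromaidan"])   # filtering keywords mentioned above
```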
 Users mention other users in their tweets; they also quote other users by retweeting their messages. This makes it possible to establish connections among users and to build a graph that represents those connections. On such a graph, one may single out communities using various existing approaches. One popular approach is based on the notion of modularity, which relates the number of connections between vertices inside a community to the number of connections leading outside it.
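 Building the user graph can be sketched as follows, assuming the tweets were stored one JSON object per line as in the previous sketch and using the networkx library; retweeted authors appear in the tweet's user_mentions, so mentions and retweets are covered by the same field.

```python
# Hedged sketch: build a user connection graph from mentions and retweets.
import json
import networkx as nx

def build_user_graph(path):
    g = nx.Graph()
    with open(path) as f:
        for line in f:
            try:
                tweet = json.loads(line)
            except ValueError:
                continue                               # skip keep-alive or partial lines
            author = tweet.get("user", {}).get("screen_name")
            if not author:
                continue
            # every mentioned user; retweets also show up here ("RT @user: ...")
            for mention in tweet.get("entities", {}).get("user_mentions", []):
                other = mention.get("screen_name")
                if other and other != author:
                    g.add_edge(author, other)
    return g

graph = build_user_graph("tweets.json")
print(graph.number_of_nodes(), "users,", graph.number_of_edges(), "connections")
```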
 To identify the communities that formed dynamically in the discussion, we used the fast greedy modularity optimization algorithm. To lay out the graph, we used the Fruchterman-Reingold algorithm, which belongs to the family of force-directed (spring) algorithms. The characteristic look of the graph comes from the model used in force-directed algorithms: vertices are treated as balls that repel each other, while edges act as springs that attract the vertices they connect. We have built networks with user communities marked with different colours (a sketch of this step is given after the figures below):




3,000 random tweets sample
10,000 random tweets sample
50,000 random tweets sample
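 In outline, such coloured community graphs can be produced as in the following sketch, which continues the previous one. It uses networkx's greedy modularity maximization (an implementation of the Clauset-Newman-Moore fast greedy algorithm) and its spring layout (a Fruchterman-Reingold implementation); the original figures may have been produced with different tooling (e.g. igraph in R), so this is only one possible variant.

```python
# Hedged sketch: detect communities by fast greedy modularity optimization and
# draw the graph with a Fruchterman-Reingold (spring) layout, one colour per community.
# `graph` is the user graph built in the previous sketch.
import matplotlib.pyplot as plt
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

communities = list(greedy_modularity_communities(graph))

# map every user to the index of its community, used as the node colour
community_of = {}
for i, members in enumerate(communities):
    for user in members:
        community_of[user] = i
colours = [community_of[n] for n in graph.nodes()]

pos = nx.spring_layout(graph)        # Fruchterman-Reingold force-directed layout
nx.draw_networkx_nodes(graph, pos, node_size=20, node_color=colours, cmap=plt.cm.tab20)
nx.draw_networkx_edges(graph, pos, alpha=0.2)
plt.axis("off")
plt.savefig("communities.png", dpi=200)
```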

 We then noticed that the obtained graph contains two big communities that appeared to be totally isolated. The analysis of those communities revealed that one of them has only one influencer, who is quoted by all the community members. Those users appeared to have no connections with any other user in the network. It is evidently an anomalous community, since one can hardly believe in the existence of a real big community whose members quote only one source and none of whom writes his or her own tweets or retweets other users. Using the adhesion coefficient, we can measure the degree of community isolation: for isolated communities the adhesion coefficient is equal to zero. Other features may be found by analyzing influencers' activity, the ratio of their tweets and retweets, users' activity inside the community, etc. All these characteristics may be used to train classifiers for anomaly detection.
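 The exact formula for the adhesion coefficient is not given here; as a rough illustration in the same spirit, an isolation measure can be computed by counting the edges that connect a community to the rest of the graph, which is zero for a fully isolated community. The sketch below continues the previous ones.

```python
# Hedged sketch: an isolation measure in the spirit of the adhesion coefficient
# described above (the precise definition used in the study is an assumption here).
# A value of zero means the community is completely isolated from the rest of the graph.
def boundary_edges(graph, members):
    members = set(members)
    return sum(1 for u, v in graph.edges()
               if (u in members) != (v in members))

for i, members in enumerate(communities):
    outside = boundary_edges(graph, members)
    inside = graph.subgraph(members).number_of_edges()
    print("community %d: %d users, %d internal edges, %d edges to the rest of the graph"
          % (i, len(members), inside, outside))
```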
 In our analysis, we detected several big communities with different adhesion coefficients and different quantitative characteristics of influencer activity. The analysis of top trends in the communities with a zero adhesion coefficient showed that their influencers are anomalous, and it is rather difficult to establish their social identity. On the other hand, for the big communities with the maximum adhesion coefficient, the top influencers are well-known Ukrainian and European politicians and news agencies.
 One conclusion of the study is that a zero or minimal adhesion coefficient points to the anomalous nature of a given community, while a high adhesion coefficient indicates that the community is effective and productive.
 Our next step was to remove the tweets belonging to the users who were identified as members of anomalous communities (a sketch of this filtering step is given after the figures below). As a result, we obtained the following graphs of users:

3,000 random tweets sample (2 anomalous communities found and removed)
10,000 random tweets sample (3 anomalous communities found and removed)
10,000 random tweets sample (10 anomalous communities found and removed)
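 Continuing the previous sketches, the filtering step itself can be as simple as the following; the indices of the anomalous communities are hypothetical placeholders, chosen here for illustration only, whereas in the study they were selected using the isolation and activity criteria discussed above.

```python
# Hedged sketch: drop the tweets of users belonging to communities labelled as anomalous.
import json

anomalous_ids = {0, 3}                     # illustrative indices only
anomalous_users = set()
for i in anomalous_ids:
    anomalous_users.update(communities[i])

clean_tweets = []
with open("tweets.json") as f:
    for line in f:
        try:
            tweet = json.loads(line)
        except ValueError:
            continue
        if tweet.get("user", {}).get("screen_name") not in anomalous_users:
            clean_tweets.append(tweet)
```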



 There is no doubt that the removal of anomalous communities is very important in the data mining of trends, as it makes it possible to get a real picture of users' opinions. Another useful feature for social networks could be an additional service that filters the activity of users from anomalous communities, or that informs other users about suspicious users and information streams.
 In our further studies, we are planning to analyze other types of anomalous information streams, using formal concept analysis, the theory of semantic fields, and the theory of frequent itemsets and association rules.