My Analytic Research: 2013

Thursday, December 12, 2013

Detection of Community Anomalies in Twitter Trends

When forming their own views on different trends, users pay great attention to other users' points of view. The ratio between the numbers of users holding different opinions plays a very important role in forming a user's view. Naturally, some forces emerge that are interested in shaping users' trends and opinions. Their methods of influence are much more sophisticated than mere spam. In particular, a whole community promoting a given trend may be created artificially.
A user who finds himself or herself in such a community may get the false impression that the community's trend is supported by a great number of users and is therefore well-reasoned, analysed and unbiased. With only a selective acquaintance with trends, it is very difficult for a user to detect that the communities giving rise to them are artificial. Such artificial trends may be created in discussions of various political, social, economic, or financial issues.

One may detect artificial communities through long-term observation of information streams on a given topic. Analysing the quantitative characteristics of the communities that form can reveal anomalies. Communities that exhibit these anomalies may be regarded as anomalous and thus excluded from further consideration and from the information stream.

In Ukraine, for the last few weeks there have been nonviolent mass protests against government policy, in particular against the breakdown of the association agreement with the EU, against the use of force against peaceful demonstrators, etc. Evidently, these processes are reflected in social networks. It is also obvious that some forces are trying to influence network users' views of these events.
That is why it is interesting and important to analyse social network information streams concerning the events in Ukraine, in order to reveal both the anomalous communities and the productive communities with effective discussions written by real users.

Using the Twitter API, we collected tweets over several days with filtering keywords such as Ukraine, Euromaidan, etc. The analysis was conducted in R and Python. In our view, the most effective analysis of tweets can be based on the theory of formal concept analysis, the theory of frequent itemsets and association rules, network theories, and supervised and unsupervised classification.

Users mention other users in their tweets. They also quote other users by retweeting their messages. This makes it possible to establish connections among users and to build a graph that shows them. On such a graph, one may single out different communities using various existing approaches. One popular approach is based on the notion of modularity, which compares the density of connections between vertices inside a community with the density of connections to vertices outside it.
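
As a rough illustration of how such a graph can be assembled, the following Python sketch links each author to every user mentioned or retweeted in a tweet. The `(author, text)` records, the `build_mention_graph` helper and the regular expression are assumptions made for this example, not the code used in the study.

```python
import re
from collections import defaultdict

# Assumed pattern for Twitter handles. Retweets ("RT @user: ...") are
# captured as well, because the retweeted author appears as a mention.
MENTION_RE = re.compile(r"@(\w+)")

def build_mention_graph(tweets):
    """Return {user: set(neighbours)} built from (author, text) pairs."""
    graph = defaultdict(set)
    for author, text in tweets:
        author = author.lower()
        for mentioned in MENTION_RE.findall(text):
            mentioned = mentioned.lower()
            if mentioned != author:  # ignore self-mentions
                graph[author].add(mentioned)
                graph[mentioned].add(author)
    return dict(graph)

# Hypothetical sample data
tweets = [
    ("alice", "RT @bob: news from Kyiv #Euromaidan"),
    ("carol", "@alice have you seen this? #Ukraine"),
    ("dave", "just my own thoughts"),
]
graph = build_mention_graph(tweets)
```

Users who neither mention anyone nor are mentioned (like `dave` here) do not appear in the graph at all, which is one possible design choice among several.
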
To identify the communities that formed dynamically in the discussion, we used a fast greedy modularity optimization algorithm. To lay out the graph, we used the Fruchterman-Reingold algorithm, which belongs to the family of force-directed, or spring, algorithms. The shape of the drawn graph follows from the underlying physical model: the vertices are treated as balls that repel one another, and the edges are treated as springs that pull connected vertices together. We have built a network with user communities marked in different colours:

Sample of 3000 random tweets

Sample of 10000 random tweets

Sample of 50000 random tweets
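
To make the modularity notion more concrete, here is a small Python sketch that computes the standard Newman modularity score of a given partition. It only evaluates the quantity that a greedy optimizer tries to increase; it is not the fast greedy algorithm itself, and the toy graph and helper name are hypothetical.

```python
def modularity(graph, communities):
    """Newman modularity Q of a partition of an undirected simple graph.

    graph: {node: set(neighbours)}; communities: list of sets of nodes.
    """
    m = sum(len(nbrs) for nbrs in graph.values()) / 2  # number of edges
    q = 0.0
    for community in communities:
        # every internal edge is seen from both endpoints, hence / 2
        internal = sum(len(graph[u] & community) for u in community) / 2
        degree = sum(len(graph[u]) for u in community)
        q += internal / m - (degree / (2 * m)) ** 2
    return q

# Hypothetical graph: two triangles joined by the single edge c-d
graph = {
    "a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"},
    "d": {"c", "e", "f"}, "e": {"d", "f"}, "f": {"d", "e"},
}
split = [{"a", "b", "c"}, {"d", "e", "f"}]
whole = [set(graph)]
q_split = modularity(graph, split)
q_whole = modularity(graph, whole)
```

Splitting the toy graph into its two triangles scores higher than keeping all vertices in one community, which is the kind of comparison a modularity-based method relies on.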

We then noticed that the resulting graph contains two big communities that turned out to be totally isolated. Analysis of those communities revealed that one of them has only one influencer, who is quoted by all the community members. Those users have no connections with any other user in the network. This is evidently an anomalous community: one can hardly believe in a real, large community whose members quote only one source, while none of them writes his or her own tweets or retweets other users. Using the adhesion coefficient, we can measure how isolated a community is. For isolated communities the adhesion coefficient is equal to zero. Further features can be obtained by analysing the influencers' activity, the ratio of their tweets to retweets, the users' activity inside the community, etc. All these characteristics may be used to train classifiers for anomaly detection.
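
The post does not give a formula for the adhesion coefficient, so the sketch below uses a stand-in of our own choosing: the share of a community's connections that lead outside it, which is zero for an isolated community. A second helper measures how strongly a community depends on a single hub. Both functions and the toy graph are assumptions for illustration only.

```python
def external_edge_share(graph, community):
    """Share of the members' connections (summed degrees) that lead
    outside the community. Assumed stand-in for the adhesion
    coefficient: 0.0 when the community has no outside links."""
    total = sum(len(graph[u]) for u in community)
    outside = sum(len(graph[u] - community) for u in community)
    return outside / total if total else 0.0

def top_hub_share(graph, community):
    """Fraction of internal edges that touch the best-connected member."""
    internal = {u: len(graph[u] & community) for u in community}
    edges = sum(internal.values()) / 2
    return max(internal.values()) / edges if edges else 0.0

# Hypothetical graph: a star around "hub", cut off from a small group
# whose members talk to each other and to an outsider "d".
graph = {
    "hub": {"u1", "u2", "u3"},
    "u1": {"hub"}, "u2": {"hub"}, "u3": {"hub"},
    "a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"},
    "d": {"c"},
}
star = {"hub", "u1", "u2", "u3"}
group = {"a", "b", "c"}
star_features = (external_edge_share(graph, star), top_hub_share(graph, star))
group_features = (external_edge_share(graph, group), top_hub_share(graph, group))
```

Feature pairs like these could serve as inputs to a classifier; here the star community is fully isolated and fully dependent on one account, while the discussion group is neither.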

In our analysis, we detected several big communities with different adhesion coefficients and different quantitative characteristics of influencers' activity. The analysis of top trends in the communities with a zero adhesion coefficient showed that their influencers are anomalous, and it is rather difficult to establish their social identity. On the other hand, for the big communities with the maximum adhesion coefficient, the top influencers are well-known Ukrainian and European politicians and news agencies.

One conclusion of the study is that a zero or minimal adhesion coefficient points to the anomalous nature of a given community, whereas a high adhesion coefficient indicates that the community is effective and productive.

Our next step was to remove the tweets belonging to users who had been identified as members of anomalous communities. As a result, we obtained the following graphs of users:

Sample of 3000 random tweets (2 anomalous communities were found and removed)

Sample of 10000 random tweets (3 anomalous communities were found and removed)

Sample of 10000 random tweets (10 anomalous communities were found and removed)
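
The removal step can be sketched in Python as follows. We assume, for illustration only, that a community counts as anomalous when none of its members has a link to any outside user; the helper names and sample data are hypothetical.

```python
def isolated_members(graph, communities):
    """Users in communities that have no link to any outside user."""
    flagged = set()
    for community in communities:
        if all(graph[u] <= community for u in community):
            flagged |= community
    return flagged

def drop_tweets(tweets, flagged):
    """Keep (author, text) pairs whose author is not flagged."""
    return [(author, text) for author, text in tweets if author not in flagged]

# Hypothetical graph, partition and tweets
graph = {
    "hub": {"u1", "u2"}, "u1": {"hub"}, "u2": {"hub"},
    "a": {"b"}, "b": {"a", "c"}, "c": {"b"},
}
communities = [{"hub", "u1", "u2"}, {"a", "b"}, {"c"}]
tweets = [("u1", "RT @hub: breaking"), ("a", "@b what do you think?")]

flagged = isolated_members(graph, communities)
kept = drop_tweets(tweets, flagged)
```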

There is no doubt that removing anomalous communities is very important in the data mining of trends, as it makes it possible to get a real picture of users' opinions. Another useful feature for social networks could be an additional service that filters the activity of users from anomalous communities or informs other users about suspicious users and information streams.

In our further studies we plan to analyse other types of anomalous information streams, using the theory of formal concept analysis, the theory of semantic fields, and the theory of frequent itemsets and association rules.

Tuesday, October 15, 2013

Data Mining of Informational Stream in Social Networks

Data Mining of Informational Stream in Social Networks from Bohdan Pavlyshenko

Tuesday, July 23, 2013

Can Twitter predict royal baby's name? (Updated)

One of today's main news stories is the birth of the royal baby, the crown prince. We congratulate Kate and William on this event and wish them and their son much health and happiness!

Is it possible to predict the crown prince's name by analysing tweets? Using NLP methods and the theory of frequent itemsets and association rules, I have analysed the tweets. For my analysis, I used the R environment and the algorithms from my previous studies. I obtained the following distribution of names:

So, we'll see whether the crown prince's name is really among these male names.

After the Royal baby's name was announced

At last the Royal baby's name has been announced: Prince George of Cambridge!

As a result, we can see that tweet mining could predict the Royal baby's name! What does this mean? Somebody wrote to me that this study is nuts. Taken literally, it is indeed not a serious problem. But the main goal of this study is to test whether there is a correlation between social network users' opinions and the decisions made by individuals who are highly influential in certain spheres of society. As the obtained results show, such a correlation does exist. The Crown Prince's full name is George Alexander Louis. Unfortunately, I don't know the history of England very well, and I didn't take into account that the full name of the Royal baby might consist of three names. I studied once again the array of tweets that had been downloaded before the Crown Prince's name was announced. Using the theory of frequent itemsets and association rules, we studied which names occur together in tweets. As the analysis showed, the three names George, Alexander and Louis form one of the top 5 frequent itemsets with the highest support.

Top of frequent itemsets:

    items                        support
1   {alexander,george,james}     0.135593220
2   {george,henry,james}         0.121725732
3   {george,james,louis}         0.104776579
4   {alexander,james,louis}      0.098613251
5   {alexander,george,louis}     0.098613251
6   {george,henry,louis}         0.095531587
7   {alexander,henry,james}      0.093990755
8   {alexander,george,henry}     0.093990755
9   {henry,james,louis}          0.092449923
10  {alexander,henry,louis}      0.090909091
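
For readers who want to see how such support values arise, here is a minimal Python sketch. It assumes a fixed vocabulary of candidate names and treats each tweet that mentions at least one of them as a transaction; both choices, the helper name and the sample tweets are ours and may differ from the original R analysis.

```python
from collections import Counter
from itertools import combinations

# Assumed vocabulary of candidate names
NAMES = {"george", "james", "alexander", "louis", "henry"}

def itemset_support(tweets, size=3):
    """Support of every `size`-name itemset among name-bearing tweets."""
    transactions = []
    for text in tweets:
        words = {w.strip(".,!?#'\"").lower() for w in text.split()}
        found = words & NAMES
        if found:
            transactions.append(found)
    counts = Counter()
    for names in transactions:
        for combo in combinations(sorted(names), size):
            counts[combo] += 1
    n = len(transactions)
    return {combo: c / n for combo, c in counts.items()}

# Hypothetical sample tweets
tweets = [
    "George, Alexander or Louis? #RoyalBaby",
    "I bet on George Alexander Louis",
    "James or George",
    "Henry James George",
    "nothing relevant in this one",
]
support = itemset_support(tweets)
```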

The formation of frequent itemsets can be represented as the following graph:

On the basis of frequent itemsets with three elements, we analysed the association rules with a high level of support and confidence. The names George, Alexander and Louis also appear among the top 5 association rules, sorted by confidence:

Top of association rules:

    lhs                     rhs           support     confidence  lift
1   {james,louis}        => {george}      0.10477658  0.9855072   1.714730
2   {henry,louis}        => {george}      0.09553159  0.9841270   1.712328
3   {alexander,louis}    => {james}       0.09861325  0.9696970   2.192799
4   {alexander,louis}    => {george}      0.09861325  0.9696970   1.687221
5   {james,louis}        => {alexander}   0.09861325  0.9275362   4.459045
6   {george,louis}       => {james}       0.10477658  0.9189189   2.077973
7   {alexander,james}    => {george}      0.13559322  0.8888889   1.546619
8   {george,louis}       => {alexander}   0.09861325  0.8648649   4.157758
9   {alexander,george}   => {james}       0.13559322  0.8543689   1.932005
10  {george,louis}       => {henry}       0.09553159  0.8378378   3.649374
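
The three numbers given for each rule are its support, confidence and lift. A short Python sketch of the standard definitions is given below; the function name and the toy transactions are assumptions, and the original values were produced in R.

```python
def rule_metrics(transactions, lhs, rhs):
    """Return (support, confidence, lift) of the rule lhs => rhs.

    support    = share of transactions containing lhs and rhs together
    confidence = support(lhs and rhs) / support(lhs)
    lift       = confidence / support(rhs)
    """
    lhs, rhs = set(lhs), set(rhs)
    n = len(transactions)
    n_lhs = sum(1 for t in transactions if lhs <= t)
    n_rhs = sum(1 for t in transactions if rhs <= t)
    n_both = sum(1 for t in transactions if (lhs | rhs) <= t)
    if n_lhs == 0 or n_rhs == 0:
        raise ValueError("lhs and rhs must each occur at least once")
    support = n_both / n
    confidence = n_both / n_lhs
    lift = confidence / (n_rhs / n)
    return support, confidence, lift

# Hypothetical transactions (sets of names found in single tweets)
transactions = [
    {"george", "louis", "james"},
    {"george", "louis"},
    {"george"},
    {"james", "louis"},
    {"henry"},
]
support, confidence, lift = rule_metrics(transactions, {"louis"}, {"george"})
```

Under these definitions, dividing confidence by lift recovers the support of the right-hand side. For the first rule listed above, 0.9855072 / 1.714730 is roughly 0.575, the implied share of transactions that contain george.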

The top 5 association rules obtained can be represented as follows:

Consider the structure of the set of users who participated in the discussion of the prince's name. To identify the communities that formed dynamically in the discussion under analysis, we used a fast greedy modularity optimization algorithm. To lay out the graph, we used the Fruchterman-Reingold algorithm, which belongs to the family of force-directed, or spring, algorithms. The shape of the drawn graph follows from the underlying physical model: the vertices are treated as balls that repel one another, and the edges are treated as springs that pull connected vertices together. In the tweet arrays, we found 6919 users who sent 37191 tweets. These tweets mentioned 2645 users. A substantial part of these mentions relates to retweets. For further analysis, we take active users who sent more than one tweet during the discussion or who were mentioned in tweets more than once. We found 2300 active users who sent more than one tweet, and 923 users who were mentioned in tweets more than once. Figure 6 shows the graph of users' interrelations; the shades of colour on it mark the users' communities. On this graph, we can see that there are several large user communities.

Revealed users' communities.

Our next step is to repeat the analysis after removing the most popular users, those mentioned in tweets 100 times or more. We found only 6 such users. Having removed them from the analysis, we obtained a new community graph. The removed users constitute nearly 0.2% of all the users mentioned in tweets. The obtained data show that removing only the most popular users changes the community structure significantly: only numerous small communities remain.

Users' communities without the six most popular users.
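
The effect of removing the most mentioned users can be illustrated with a small Python sketch. For simplicity it counts connected components rather than modularity-based communities, which is a cruder stand-in for the method used in the study; the helper names, the threshold and the edge list are hypothetical.

```python
from collections import Counter

def remove_top_mentioned(edges, threshold):
    """Drop every edge touching a user mentioned `threshold` or more times.

    edges: list of (author, mentioned_user) pairs, one per mention.
    """
    mentions = Counter(target for _, target in edges)
    hubs = {user for user, count in mentions.items() if count >= threshold}
    kept = [(a, b) for a, b in edges if a not in hubs and b not in hubs]
    return hubs, kept

def connected_components(edges):
    """Connected components of the undirected graph given by `edges`."""
    neighbours = {}
    for a, b in edges:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)
    seen, components = set(), []
    for start in neighbours:
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(neighbours[node] - component)
        seen |= component
        components.append(component)
    return components

# Hypothetical data: "celebrity" is mentioned by everyone
edges = [("u1", "celebrity"), ("u2", "celebrity"), ("u3", "celebrity"),
         ("u4", "celebrity"), ("u1", "u2"), ("u3", "u4")]
hubs, kept = remove_top_mentioned(edges, threshold=4)
before = len(connected_components(edges))
after = len(connected_components(kept))
```

In this toy example a single hub holds the whole graph together, and removing it leaves two small groups.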

The results of the study demonstrate that tweet mining could predict the Royal baby's name. We showed that George, the first name of the newborn prince, was dominant in the spectrum of names before the official announcement. It follows from the obtained data that the theory of frequent sets gives a more precise prediction of the full name than the analysis of name frequencies, which can predict the first name only. The prince's three names, George, Alexander and Louis, form a frequent itemset, and this itemset was among the top 5 frequent itemsets by support. We also showed that the structure of the dynamically formed communities of users who participated in the discussion is defined by only a few leaders, who have a significant influence on the position of other users. What do these results mean? Taken literally, this is not a serious problem. But the main goal of this study is to test whether there is any correlation between social network users' opinions and the decisions made by individuals who are highly influential in certain spheres of society. In our studies, we found that such a correlation does exist. This means that there is a certain correlation between the bloggers' viewpoints and the Royal family's decision on the prince's name.

Popular retweets about the Royal Baby name:

"RT @Lord_Voldemort7: They should name the #RoyalBaby 'Weasley' so that in future people can go around singing "Weasley is our King." "
"RT @Lord_Voldemort7: #RoyalBabyName It seems only fitting that the son of Prince William and commoner Kate Middleton be named Severus Snape…"
"RT @PrincessKateNOT: We have decided to name our #RoyalBaby with a popular British boys name. Mohammed."
"RT @eonline: We still don't know the #RoyalBaby's name...but we may have an idea of what his surname could be!"
"RT @AmazingPhil: I think they should name him after his great grandfather! Prince Philip. #RoyalBaby"
"RT @AdamCatterall: I woke to see #thunderstorm was trending. For a moment I thought they'd let Kanye West name the #RoyalBaby"
"RT @gracehelbig: Should've named it "Norther Wester." #RoyalBaby"
"RT @wescraven: Suggested name for the new little prince... Freddy. #RoyalBaby"
"RT @Lord_Voldemort7: The #RoyalBaby has not yet been named. They should just call him 'You Know Who.'"
"RT @MelissaJoanHart: I seriously don't wanna go to bed without knowing the #royalbaby name. Isn't that ridiculous?! "
"RT @Telegraph: #RoyalBaby: George is the bookies' favourite for the new prince's name, followed by James, Alexander, Louis and Henry"

Saturday, June 1, 2013

The analysis of travelling trends, using tweets data mining.

In our previous studies, we showed that data mining of Twitter microblogs can be used for event forecasting (see Eurovision 2013 forecasting), goods marketing, and stock market analysis (see the article at arxiv.org).

In this research, I try to show that tweet mining can be used in the service sector. As an example, we take the theme of travelling.

So, we analyse Twitter messages on the theme of “travelling”. In our analysis, we use the theory of frequent sets and association rules and the theory of semantic fields. We conducted the analysis using the R language and the corresponding special-purpose packages.

For testing, we take semantic frames such as the time of a trip, city, country, and some associated key concepts. For the analysis, we downloaded the tweets for the last week of May.
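
A very simplified way to picture the frame-based approach is keyword matching, sketched below in Python. The tiny vocabularies, the `tag_frames` helper and the sample tweets are assumptions for illustration; the actual study used semantic fields and R packages.

```python
from collections import Counter

# Assumed, deliberately tiny vocabularies for three frames
FRAMES = {
    "month": {"june", "july", "august"},
    "city": {"london", "paris", "rome"},
    "country": {"italy", "france", "spain"},
}

def tag_frames(text):
    """Map each frame to the sorted vocabulary words found in the tweet."""
    words = {w.strip(".,!?#").lower() for w in text.split()}
    return {frame: sorted(words & vocab)
            for frame, vocab in FRAMES.items() if words & vocab}

# Hypothetical sample tweets
tweets = [
    "Travelling to Rome, Italy in July!",
    "Paris in July? Yes please",
    "Rome again",
]
counts = Counter()
for text in tweets:
    for frame, hits in tag_frames(text).items():
        counts.update((frame, hit) for hit in hits)
first = tag_frames(tweets[0])
```

Counting the tagged words per frame gives the kind of frequency distributions for countries, cities and months discussed in this post.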

We found that the following countries were mentioned most often as travel destinations:

Similar calculations for cities:

The results of data mining for months are shown in the following diagram:

Further analysis can be made for each separate month:

Similar data mining was carried out for tweets posted by users located in London and New York:

Users from London:

Users from New York:

Using the theory of frequent sets, we can find the cities that are often mentioned together. These frequent itemsets can be displayed as the following graph:

In the loaded tweets, the following association rules can be found:

For each of the revealed trends or association rules, it is possible to find the list of users whose messages create these trends and rules. Such a list of users may be used for targeted marketing.

The research conducted shows that data mining of Twitter microblogs can be used for the marketing research of services, in particular, travelling.

Saturday, May 25, 2013

Tweets mining using NLP can help in goods marketing.

Recently, data mining of messages in social networks, particularly Twitter, has been widely discussed from the point of view of sociology, politics, economics, marketing, etc.

In my previous studies, I have analysed Eurovision 2013 forecasting, a tweets miner for stock markets, the Granger test for financial tweets, the "End of the world" concept in Twitter microblogs, and the use of data mining for sports event forecasting.

In this study, I want to show that tweet mining can be valuable for the marketing research of goods. I took smartphones as an example. I downloaded several thousand tweets that refer to smartphones and contain keywords such as iPhone, Apple, Galaxy, Blackberry, etc. To study the iPhone concept, we used the theory of frequent sets and association rules. The research was conducted in R, a language for statistical computing.

Here are the results obtained:

We obtained a matrix of aggregated association rules for the concept 'iphone':

The association rules for the concept 'iphone' with high support can be represented by the following graphs:

The frequent sets for the concept 'iphone' with high support can be represented by the following graphs:

Here is an example of associations for different smartphones:

The popularity of different models:

The comparison of two models based on tweets mining, using syntactic parsing:

Sunday, May 19, 2013

About Eurovision 2013 forecasting using NLP: the day after

Previous part.

Yesterday the final of the Eurovision Song Contest took place.
Before the final, I made a forecast on the basis of data mining of tweets.
The results of my analysis were the following:
the winner would be a singer from Denmark, and the next three places would go to Ukraine, Russia and Ireland.
The announced results of the final are:
1st place - Denmark, 2nd place - Azerbaijan, 3rd place - Ukraine, 4th place - Norway, 5th place - Russia.
Our data mining analysis correctly detected the winner and the top places for Ukraine and Russia. However, Ireland was mistakenly included in the top five, and Azerbaijan and Norway were not mentioned at all.
I have conducted the analysis again and found a basic mistake in the algorithm. The analysis produces a great number of association rules, which must be filtered. For the filtering, I chose high-frequency words, which included "Ireland" but not "Azerbaijan" and "Norway". That is why the tweets with those words were excluded from the analysis. The mistake is that high-frequency keywords may be used in other contexts and have nothing to do with the favourites of the competition. I then repeated the analysis on the same tweets, which had been loaded on May 17, this time taking into account all countries participating in the contest. This forecast turned out to be very close to the real results of the Eurovision Song Contest.
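
One way to read the mistake described above is shown in the Python sketch below: a vocabulary chosen by raw word frequency can admit a country that is popular in unrelated contexts and miss a real contender, whereas an explicit list of participants cannot. The helper names, the partial country list and the sample tweets are hypothetical.

```python
from collections import Counter

# Assumed partial list of participating countries, for illustration
COUNTRIES = {"denmark", "azerbaijan", "ukraine", "norway", "russia", "ireland"}

def top_frequency_vocab(tweets, k):
    """The k most frequent words: the kind of filter that went wrong."""
    counts = Counter(w for text in tweets for w in text.lower().split())
    return {word for word, _ in counts.most_common(k)}

def to_transactions(tweets, vocabulary):
    """Reduce each tweet to the vocabulary words it contains."""
    result = []
    for text in tweets:
        items = set(text.lower().split()) & vocabulary
        if items:
            result.append(items)
    return result

# Hypothetical sample tweets
tweets = [
    "denmark will win",
    "denmark to win eurovision",
    "holiday in ireland",
    "ireland weather is great",
    "azerbaijan could win",
]
by_frequency = to_transactions(tweets, top_frequency_vocab(tweets, 3))
by_country_list = to_transactions(tweets, COUNTRIES | {"win"})
```

With the frequency-based vocabulary, "ireland" enters the analysis through tweets about holidays and weather, while "azerbaijan" is dropped; with the explicit list, the pair of "azerbaijan" and "win" is preserved for rule mining.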

Besides, it is worth noting that Twitter is not equally widespread in all countries, which is why the number of tweets from different countries also differed. Additionally, there is an unpleasant political factor. E.g., Ukraine is in the top 3 according to the results, but Russia (Ukraine's neighbour) gave it only one point.

The results presented below were obtained from the same set of tweets, with the previous errors taken into account:

    lhs                                           rhs     support      confidence  lift
1   {denmark,norway}                           => {win}   0.014610390  0.9000000   1.3137441
2   {denmark,favourites}                       => {win}   0.011363636  1.0000000   1.4597156
3   {azerbaijan,norway}                        => {win}   0.011363636  0.8750000   1.2772512
4   {denmark,ukraine}                          => {win}   0.008116883  0.8333333   1.2164297
5   {azerbaijan,russia}                        => {win}   0.008116883  0.8333333   1.2164297
6   {azerbaijan,denmark}                       => {win}   0.008116883  0.7142857   1.0426540
7   {finland,sweden}                           => {win}   0.008116883  1.0000000   1.4597156
8   {russia,ukraine}                           => {win}   0.006493506  0.8000000   1.1677725
9   {azerbaijan,ukraine}                       => {win}   0.006493506  0.8000000   1.1677725
10  {norway,ukraine}                           => {win}   0.006493506  0.8000000   1.1677725
11  {norway,russia}                            => {win}   0.006493506  0.8000000   1.1677725
12  {denmark,russia}                           => {win}   0.006493506  0.8000000   1.1677725
13  {denmark,sweden}                           => {win}   0.006493506  0.8000000   1.1677725
14  {azerbaijan,russia,ukraine}                => {win}   0.006493506  0.8000000   1.1677725
15  {norway,russia,ukraine}                    => {win}   0.006493506  0.8000000   1.1677725
16  {denmark,russia,ukraine}                   => {win}   0.006493506  0.8000000   1.1677725
17  {azerbaijan,norway,ukraine}                => {win}   0.006493506  0.8000000   1.1677725
18  {azerbaijan,denmark,ukraine}               => {win}   0.006493506  0.8000000   1.1677725
19  {denmark,norway,ukraine}                   => {win}   0.006493506  0.8000000   1.1677725
20  {azerbaijan,norway,russia}                 => {win}   0.006493506  0.8000000   1.1677725
21  {azerbaijan,denmark,russia}                => {win}   0.006493506  0.8000000   1.1677725
22  {denmark,norway,russia}                    => {win}   0.006493506  0.8000000   1.1677725
23  {azerbaijan,denmark,norway}                => {win}   0.006493506  0.8000000   1.1677725
24  {azerbaijan,norway,russia,ukraine}         => {win}   0.006493506  0.8000000   1.1677725
25  {azerbaijan,denmark,russia,ukraine}        => {win}   0.006493506  0.8000000   1.1677725
26  {denmark,norway,russia,ukraine}            => {win}   0.006493506  0.8000000   1.1677725
27  {azerbaijan,denmark,norway,ukraine}        => {win}   0.006493506  0.8000000   1.1677725
28  {azerbaijan,denmark,norway,russia}         => {win}   0.006493506  0.8000000   1.1677725
29  {azerbaijan,denmark,norway,russia,ukraine} => {win}   0.006493506  0.8000000   1.1677725
30  {azerbaijan,georgia}                       => {win}   0.004870130  1.0000000   1.4597156
31  {georgia,norway}                           => {win}   0.004870130  1.0000000   1.4597156
32  {favourites,norway}                        => {win}   0.004870130  1.0000000   1.4597156
