My Analytic Research

Tuesday, 23 July 2013

Can Twitter predict royal baby's name? (Updated)

One of today's main news stories is the birth of the royal baby, the crown prince. We congratulate Kate and William on this event and wish them and their son much health and happiness!

Is it possible to predict the crown prince's name by analysing tweets? I analysed the tweets using NLP methods, the theory of frequent itemsets, and association rules. For the analysis, I used the R environment and the algorithms from my previous studies. I obtained the following distribution of names:

So, we'll see whether the crown prince's name really is among these male names.

After the Royal baby's name was announced

At last the Royal baby's name has been announced: Prince George of Cambridge!

As a result, we can see that tweet mining could predict the Royal baby's name! What does this mean? Somebody wrote to me that this study is nuts. Taken literally, it is indeed not a serious problem. But the main goal of this study is to test whether there is a correlation between the opinions of social network users and the decisions made by individuals who are highly influential in certain spheres of society. The results obtained suggest that such a correlation does exist.

The Crown Prince's full name is George Alexander Louis. Unfortunately, I don't know the history of England very well, and I didn't take into account that the Royal baby's full name may consist of three names. I studied once again the array of tweets that had been downloaded before the Crown Prince's name was announced. Using the theory of frequent itemsets and association rules, we studied which names occur together in tweets. The analysis showed that the three names George, Alexander and Louis form one of the top 5 frequent itemsets by support.

Top of frequent itemsets:

    items                       support
1   {alexander,george,james}    0.135593220
2   {george,henry,james}        0.121725732
3   {george,james,louis}        0.104776579
4   {alexander,james,louis}     0.098613251
5   {alexander,george,louis}    0.098613251
6   {george,henry,louis}        0.095531587
7   {alexander,henry,james}     0.093990755
8   {alexander,george,henry}    0.093990755
9   {henry,james,louis}         0.092449923
10  {alexander,henry,louis}     0.090909091
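
For readers unfamiliar with the measure: the support of an itemset is the share of tweets (transactions) that contain all of its items. Below is a minimal Python sketch of that count. The study itself was done in R, so the language, the `CANDIDATE_NAMES` list, the toy tweets and the function names are illustrative assumptions, not the original code. The sketch also assumes that only tweets mentioning at least one candidate name count as transactions; the post does not say how the denominator was chosen.

```python
import re
from collections import Counter
from itertools import combinations

# Hypothetical candidate list; the vocabulary used in the study is not given.
CANDIDATE_NAMES = {"george", "james", "alexander", "louis", "henry"}

def tweet_to_itemset(text):
    """Lowercase the tweet and keep only the candidate names it mentions."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return frozenset(t for t in tokens if t in CANDIDATE_NAMES)

def itemset_support(tweets, size=3):
    """Return {itemset: support} for every name combination of the given size.

    Assumption: support is measured over tweets that mention at least one
    candidate name.
    """
    transactions = [s for s in map(tweet_to_itemset, tweets) if s]
    counts = Counter()
    for items in transactions:
        for combo in combinations(sorted(items), size):
            counts[combo] += 1
    n = len(transactions)
    return {combo: c / n for combo, c in counts.items()}

# Toy data, invented for illustration only.
tweets = [
    "George, Alexander or Louis? #RoyalBaby",
    "My bet: George James Alexander",
    "James or Henry for the #RoyalBaby",
    "George Alexander Louis sounds right",
]
support = itemset_support(tweets)
for combo, s in sorted(support.items(), key=lambda kv: -kv[1]):
    print(combo, round(s, 2))
```

Enumerating every combination is fine for five names; for a large vocabulary an Apriori-style search, which prunes infrequent subsets first, would be the usual choice.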

The formation of frequent itemsets can be represented as the following graph:

On the basis of the frequent itemsets with three elements, we analysed the association rules with high support and confidence. The names George, Alexander and Louis also appear together in the top 5 association rules, ranked by confidence:

Top of association rules:

    lhs                    rhs          support     confidence  lift
1   {james,louis}       => {george}     0.10477658  0.9855072   1.714730
2   {henry,louis}       => {george}     0.09553159  0.9841270   1.712328
3   {alexander,louis}   => {james}      0.09861325  0.9696970   2.192799
4   {alexander,louis}   => {george}     0.09861325  0.9696970   1.687221
5   {james,louis}       => {alexander}  0.09861325  0.9275362   4.459045
6   {george,louis}      => {james}      0.10477658  0.9189189   2.077973
7   {alexander,james}   => {george}     0.13559322  0.8888889   1.546619
8   {george,louis}      => {alexander}  0.09861325  0.8648649   4.157758
9   {alexander,george}  => {james}      0.13559322  0.8543689   1.932005
10  {george,louis}      => {henry}      0.09553159  0.8378378   3.649374
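
The three numeric columns of a rule X => Y are its support (the share of tweets containing X and Y together), its confidence (support of the rule divided by the support of X) and its lift (confidence divided by the support of Y). A small Python sketch of these standard definitions follows; the toy transactions and function names are invented for illustration and are not the R code used in the study.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= t for t in transactions) / len(transactions)

def rule_metrics(transactions, lhs, rhs):
    """Return (support, confidence, lift) for the rule lhs => rhs.

    Assumes lhs and rhs each occur in at least one transaction.
    """
    supp_rule = support(transactions, set(lhs) | set(rhs))
    supp_lhs = support(transactions, lhs)
    supp_rhs = support(transactions, rhs)
    confidence = supp_rule / supp_lhs
    lift = confidence / supp_rhs
    return supp_rule, confidence, lift

# Toy transactions, invented for illustration only.
transactions = [
    {"george", "james", "louis"},
    {"george", "louis"},
    {"george", "alexander"},
    {"james", "henry"},
    {"george", "james", "louis", "alexander"},
]
supp, conf, lift = rule_metrics(transactions, {"james", "louis"}, {"george"})
print(supp, conf, lift)
```

The same definitions can be used to read the table: dividing confidence by lift for any rule with {george} on the right-hand side (for example, 0.9855072 / 1.714730) gives a support of about 0.57 for George on its own, which agrees with George being the dominant name.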

The top 5 association rules obtained can be represented as follows:

Consider the structure of the set of users who took part in the discussion of the prince's name. To identify the communities that formed dynamically in the discussion, we used a fast greedy modularity optimization algorithm. To draw the graph, we used the Fruchterman-Reingold algorithm, which belongs to the family of force-directed (spring) algorithms. The appearance of the graph follows from the model behind these algorithms: vertices are treated as balls that repel one another, and edges are treated as springs that pull together the vertices they connect.

In the tweet array, we found 6,919 users who sent 37,191 tweets. These tweets mentioned 2,645 users. A substantial share of these mentions comes from retweets. For further analysis, we took the active users: those who sent more than one tweet during the discussion or who were mentioned in tweets more than once. We found 2,300 active users who sent more than one tweet and 923 users who were mentioned in tweets more than once. Figure 6 shows the graph of the users' interrelations; the shades of color mark the user communities. The graph shows several large user communities.
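
The preprocessing step described above (building the mention graph and keeping only active users) can be sketched in a few lines of Python. This is only an illustration under assumptions: the `(author, text)` input format, the `@name` mention pattern and all function names are mine, and the sketch covers neither the community detection nor the Fruchterman-Reingold layout.

```python
import re
from collections import Counter

# Assumed pattern for a Twitter mention; retweets ("RT @user: ...") match too.
MENTION = re.compile(r"@(\w+)")

def build_mention_edges(tweets):
    """tweets: list of (author, text). Returns Counter of (author, mentioned)."""
    edges = Counter()
    for author, text in tweets:
        for mentioned in MENTION.findall(text):
            edges[(author.lower(), mentioned.lower())] += 1
    return edges

def active_users(tweets, edges):
    """Users who sent more than one tweet, and users mentioned more than once."""
    sent = Counter(author.lower() for author, _ in tweets)
    mentioned = Counter()
    for (_, target), n in edges.items():
        mentioned[target] += n
    active_senders = {u for u, n in sent.items() if n > 1}
    active_mentioned = {u for u, n in mentioned.items() if n > 1}
    return active_senders, active_mentioned

# Toy data, invented for illustration only.
tweets = [
    ("alice", "RT @Telegraph: George is the favourite #RoyalBaby"),
    ("bob", "RT @Telegraph: George is the favourite #RoyalBaby"),
    ("alice", "@bob I still say James"),
    ("carol", "Any news? #RoyalBaby"),
]
edges = build_mention_edges(tweets)
senders, mentioned = active_users(tweets, edges)
print(senders, mentioned)
```

The resulting edge list is the kind of input a graph library would take for modularity-based community detection and a force-directed layout.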

Revealed users' communities.

Our next step was to repeat the analysis after removing the most popular users, those mentioned in tweets 100 times or more. We found only 6 such users. After removing them, we obtained a new community graph. The removed users make up about 0.2% of all the users mentioned in tweets. The data show that removing only these most popular users changes the community structure significantly: only numerous small communities remain.
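
The effect of removing the most mentioned users can be illustrated with a rough Python sketch. Note the substitution: it counts connected components, a much coarser notion than the modularity-based communities used in the study, and the edge format, the function names and the toy threshold of 3 (the study used 100) are assumptions made for the example.

```python
from collections import Counter, defaultdict

def mention_counts(edges):
    """edges: list of (author, mentioned) pairs, one pair per mention."""
    return Counter(target for _, target in edges)

def remove_hubs(edges, threshold):
    """Drop every edge touching a user mentioned `threshold` times or more."""
    counts = mention_counts(edges)
    hubs = {u for u, n in counts.items() if n >= threshold}
    kept = [(a, b) for a, b in edges if a not in hubs and b not in hubs]
    return kept, hubs

def connected_components(nodes, edges):
    """Group nodes into connected components with a depth-first search."""
    adjacency = defaultdict(set)
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)
    seen, components = set(), []
    for start in nodes:
        if start in seen:
            continue
        seen.add(start)
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            component.add(node)
            for neighbour in adjacency[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
        components.append(component)
    return components

# Toy graph, invented for illustration: one hub mentioned by everyone.
edges = [("a", "hub"), ("b", "hub"), ("c", "hub"), ("d", "hub"),
         ("a", "b"), ("c", "d")]
nodes_before = {user for edge in edges for user in edge}
before = connected_components(nodes_before, edges)
kept, hubs = remove_hubs(edges, threshold=3)
after = connected_components(nodes_before - hubs, kept)
print(len(before), len(after), hubs)
```

In the toy graph a single component splits in two once the hub is gone, which mirrors on a small scale the fragmentation into many small communities reported above.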

Users' communities without six most popular users.

The results of the study demonstrate that tweet mining could predict the Royal baby's name. We showed that George, the first name of the newborn prince, was dominant in the spectrum of names before the official announcement. The data also show that the theory of frequent itemsets gives a more precise prediction of the full name than an analysis of name frequencies, which can predict only the first name. The three component names George, Alexander and Louis form a frequent itemset, and this itemset was among the top 5 frequent itemsets by support. We also showed that the structure of the dynamically formed communities of users in the discussion is determined by only a few leaders, who have a significant influence on the position of other users.

What do these results mean? Taken literally, this is not a serious problem. But the main goal of this study is to test whether there is any correlation between the opinions of social network users and the decisions made by individuals who are highly influential in certain spheres of society. Our study revealed that such a correlation does exist: there is a certain correlation between bloggers' viewpoints and the Royal family's decision on the prince's name.

Popular retweets about the Royal Baby's name:

"RT @Lord_Voldemort7: They should name the #RoyalBaby 'Weasley' so that in future people can go around singing "Weasley is our King." "
"RT @Lord_Voldemort7: #RoyalBabyName It seems only fitting that the son of Prince William and commoner Kate Middleton be named Severus Snape…"
"RT @PrincessKateNOT: We have decided to name our #RoyalBaby with a popular British boys name. Mohammed."
"RT @eonline: We still don't know the #RoyalBaby's name...but we may have an idea of what his surname could be!"
"RT @AmazingPhil: I think they should name him after his great grandfather! Prince Philip. #RoyalBaby"
"RT @AdamCatterall: I woke to see #thunderstorm was trending. For a moment I thought they'd let Kanye West name the #RoyalBaby"
"RT @gracehelbig: Should've named it "Norther Wester." #RoyalBaby"
"RT @wescraven: Suggested name for the new little prince... Freddy. #RoyalBaby"
"RT @Lord_Voldemort7: The #RoyalBaby has not yet been named. They should just call him 'You Know Who.'"
"RT @MelissaJoanHart: I seriously don't wanna go to bed without knowing the #royalbaby name. Isn't that ridiculous?! "
"RT @Telegraph: #RoyalBaby: George is the bookies' favourite for the new prince's name, followed by James, Alexander, Louis and Henry"

Saturday, 1 June 2013

The analysis of travelling trends, using tweets data mining.

In our previous studies, we showed that data mining of Twitter microblogs can be used for event forecasting (see Eurovision 2013 forecasting), goods marketing, and stock market analysis (see the article at arxiv.org).

In this research, I try to show that tweet mining can also be used in the service sector. As an example, we take the theme of travelling.

So, we analyse Twitter messages on the theme of travelling. In our analysis, we use the theory of frequent sets and association rules and the theory of semantic fields. We conducted the analysis using the R language and the corresponding special-purpose packages.

For testing, we take the following semantic frames: time of trip, city, country, and some associated key concepts. For the analysis, we downloaded the tweets for the last week of May.
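
One simple way to picture the frame-based counting is a small vocabulary per frame and a tally of how many tweets mention each term. The Python sketch below is a guess at that idea, not the R code used in the study: the `FRAMES` vocabularies, the toy tweets and the function name are all hypothetical, and single-word matching will miss multi-word names such as New York.

```python
import re
from collections import Counter

# Hypothetical frame vocabularies; the real ones are not listed in the post.
FRAMES = {
    "month": {"june", "july", "august"},
    "city": {"paris", "london", "rome"},
    "country": {"france", "italy", "spain"},
}

def frame_counts(tweets):
    """Count, per frame, how many tweets mention each term of that frame."""
    counts = {frame: Counter() for frame in FRAMES}
    for text in tweets:
        tokens = set(re.findall(r"[a-z]+", text.lower()))
        for frame, vocabulary in FRAMES.items():
            for term in tokens & vocabulary:
                counts[frame][term] += 1
    return counts

# Toy data, invented for illustration only.
tweets = [
    "Travelling to Paris in July!",
    "Paris, France in August",
    "Thinking about Rome or Paris this July",
]
counts = frame_counts(tweets)
print(counts["city"].most_common(1))
print(counts["month"].most_common(1))
```

Each tweet contributes at most once per term because its tokens are collected into a set first.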

The countries mentioned most often as travel destinations are the following:

Similar calculations for cities:

The results of data mining for months are shown in the following diagram:

Further analysis can be made for each separate month:

Similar data mining was done for tweets posted by users located in London and New York:

Users from London:

Users from New York:

Using the theory of frequent sets, we can find the cities, which are often mentioned together. These frequent itemsets can be displayed by the following graph:

In the loaded tweets, you can find the following association rules:

For each revealed trend or association rule, it is possible to find the list of users whose messages create it. Such a list of users may be used for targeted marketing.
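
Finding the users behind a rule amounts to selecting the tweets that contain every item of the rule and collecting their authors. A minimal Python sketch, assuming tweets are stored as `(user, text)` pairs; the function name and the toy data are invented for illustration.

```python
import re

def users_supporting_rule(tweets, lhs, rhs):
    """Return the sorted users with at least one tweet containing lhs and rhs."""
    required = set(lhs) | set(rhs)
    users = set()
    for user, text in tweets:
        tokens = set(re.findall(r"[a-z]+", text.lower()))
        if required <= tokens:
            users.add(user)
    return sorted(users)

# Toy data, invented for illustration only.
tweets = [
    ("anna", "Flying to Paris then Rome in July"),
    ("ben", "Paris in July"),
    ("cleo", "Rome and Paris, can't wait"),
    ("anna", "Rome again"),
]
target_users = users_supporting_rule(tweets, {"paris"}, {"rome"})
print(target_users)
```

Here the rule {paris} => {rome} is supported by two users, while a user who mentions only Paris is left out.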

The research shows that data mining of Twitter microblogs can be used for marketing research on services, in particular travelling.

Saturday, 25 May 2013

Tweets mining using NLP can help in goods marketing.

Recently, data mining of messages in social networks, particularly Twitter, has been widely discussed from the point of view of sociology, politics, economics, marketing, etc.

My previous studies covered Eurovision 2013 forecasting, a tweets miner for stock markets, the Granger test for financial tweets, the "End of the world" concept in Twitter microblogs, and the use of data mining for sports events forecasting.

In this study, I want to show that tweet mining can be valuable for marketing research on goods. I took smartphones as an example. I downloaded several thousand tweets that refer to smartphones and contain keywords such as iPhone, Apple, Galaxy, Blackberry, etc. To study the iPhone concept, we used the theory of frequent sets and association rules. The research was conducted in R, a language for statistical computing.

Here are the results obtained:

We obtained a matrix of aggregated association rules for the concept 'iphone':

The association rules for the concept 'iphone' with high support can be represented by the following graphs:

The frequent sets for the concept 'iphone' with high support can be represented by the following graphs:

Here is an example of associations for different smartphones:

The popularity of different models:

The comparison of two models based on tweets mining, using syntactic parsing:

Sunday, 19 May 2013

About Eurovision 2013 forecasting using NLP: the day after

Previous part.

Yesterday the final of the Eurovision Song Contest took place. Before the final, I made a forecast based on data mining of tweets. The result of my analysis was the following: the winner would be the singer from Denmark, and the next three places would go to Ukraine, Russia and Ireland.

The announced results of the final are: 1st place - Denmark, 2nd place - Azerbaijan, 3rd place - Ukraine, 4th place - Norway, 5th place - Russia.

Our data mining analysis correctly detected the winner and the top places for Ukraine and Russia. However, Ireland was mistakenly included in the top five, and Azerbaijan and Norway were not mentioned at all.

I repeated the analysis and found a basic mistake in the algorithm. The analysis produces a great number of association rules, and they must be filtered. For the filtering, I chose high-frequency words, which included "Ireland" but not "Azerbaijan" or "Norway". That is why the tweets with those two words were excluded from the analysis. The mistake is that high-frequency keywords may be used in other contexts and have nothing to do with the favourites of the competition. I conducted the analysis once again on the same tweets, which were downloaded on May 17, this time taking into account all the countries participating in the contest. This forecast turned out to be very close to the real results of the Eurovision Song Contest.
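
The difference between the two filtering strategies can be shown on a toy example in Python. Everything here is an assumption made for illustration: the tweets are invented, `PARTICIPANTS` is a tiny hypothetical subset of the contest's countries, and the frequency threshold is arbitrary. The original filtering was done in R and is not reproduced here.

```python
import re
from collections import Counter

def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())

def frequent_words(tweets, min_count):
    """Words that occur at least `min_count` times across all tweets."""
    counts = Counter(token for text in tweets for token in tokenize(text))
    return {word for word, n in counts.items() if n >= min_count}

def keep_items(tweets, vocabulary):
    """Turn each tweet into a transaction restricted to `vocabulary`."""
    return [set(tokenize(text)) & vocabulary for text in tweets]

# Hypothetical subset of participating countries and outcome words.
PARTICIPANTS = {"denmark", "ireland", "azerbaijan"}
OUTCOME_WORDS = {"win"}

# Toy data, invented for illustration only.
tweets = [
    "Denmark to win Eurovision",
    "Denmark will win",
    "Ireland win in the rugby too",
    "Ireland trip next week",
    "Azerbaijan could win Eurovision",
]

frequent = frequent_words(tweets, min_count=2)
by_frequency = keep_items(tweets, frequent)
by_participants = keep_items(tweets, PARTICIPANTS | OUTCOME_WORDS)
print(sorted(frequent))
print(by_frequency[4], by_participants[4])
```

With the frequency filter, "ireland" survives because of unrelated tweets while "azerbaijan" is dropped; with the explicit country list, Azerbaijan stays in the transactions. The explicit list does not by itself remove off-topic mentions such as the rugby tweet.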

Besides, it is worth noting that Twitter is not equally widespread in all countries, which is why the number of tweets differed from country to country. There is also an unpleasant political factor. For example, Ukraine finished in the top 3, but Russia (Ukraine's neighbour) gave it only one point.

The results below are for the same set of tweets, recomputed with the previous error corrected:

lhs rhs support confidence lift

1 {denmark,
    norway}      => {win}    0.014610390 0.9000000 1.3137441
2 {denmark,
    favourites} => {win}    0.011363636 1.0000000 1.4597156
3 {azerbaijan,
    norway}      => {win}    0.011363636 0.8750000 1.2772512
4 {denmark,
    ukraine}     => {win}    0.008116883 0.8333333 1.2164297
5 {azerbaijan,
    russia}      => {win}    0.008116883 0.8333333 1.2164297
6 {azerbaijan,
    denmark}     => {win}    0.008116883 0.7142857 1.0426540
7 {finland,
    sweden}      => {win}    0.008116883 1.0000000 1.4597156
8 {russia,
    ukraine}     => {win}    0.006493506 0.8000000 1.1677725
9 {azerbaijan,
    ukraine}     => {win}    0.006493506 0.8000000 1.1677725
10 {norway,
    ukraine}     => {win}    0.006493506 0.8000000 1.1677725
11 {norway,
    russia}      => {win}    0.006493506 0.8000000 1.1677725
12 {denmark,
    russia}      => {win}    0.006493506 0.8000000 1.1677725
13 {denmark,
    sweden}      => {win}    0.006493506 0.8000000 1.1677725
14 {azerbaijan,
    russia,
    ukraine}     => {win}    0.006493506 0.8000000 1.1677725
15 {norway,
    russia,
    ukraine}     => {win}    0.006493506 0.8000000 1.1677725
16 {denmark,
    russia,
    ukraine}     => {win}    0.006493506 0.8000000 1.1677725
17 {azerbaijan,
    norway,
    ukraine}     => {win}    0.006493506 0.8000000 1.1677725
18 {azerbaijan,
    denmark,
    ukraine}     => {win}    0.006493506 0.8000000 1.1677725
19 {denmark,
    norway,
    ukraine}     => {win}    0.006493506 0.8000000 1.1677725
20 {azerbaijan,
    norway,
    russia}      => {win}    0.006493506 0.8000000 1.1677725
21 {azerbaijan,
    denmark,
    russia}      => {win}    0.006493506 0.8000000 1.1677725
22 {denmark,
    norway,
    russia}      => {win}    0.006493506 0.8000000 1.1677725
23 {azerbaijan,
    denmark,
    norway}      => {win}    0.006493506 0.8000000 1.1677725
24 {azerbaijan,
    norway,
    russia,
    ukraine}     => {win}    0.006493506 0.8000000 1.1677725
25 {azerbaijan,
    denmark,
    russia,
    ukraine}     => {win}    0.006493506 0.8000000 1.1677725
26 {denmark,
    norway,
    russia,
    ukraine}     => {win}    0.006493506 0.8000000 1.1677725
27 {azerbaijan,
    denmark,
    norway,
    ukraine}     => {win}    0.006493506 0.8000000 1.1677725
28 {azerbaijan,
    denmark,
    norway,
    russia}      => {win}    0.006493506 0.8000000 1.1677725
29 {azerbaijan,
    denmark,
    norway,
    russia,
    ukraine}     => {win}    0.006493506 0.8000000 1.1677725
30 {azerbaijan,
    georgia}     => {win}    0.004870130 1.0000000 1.4597156
31 {georgia,
    norway}      => {win}    0.004870130 1.0000000 1.4597156
32 {favourites,
    norway}      => {win}    0.004870130 1.0000000 1.4597156

Saturday, 18 May 2013

Eurovision 2013 forecasting using NLP and association rules

Next part: About Eurovision 2013 forecasting using NLP: the day after

Today, May 18, the Eurovision Song Contest 2013 will be held in Sweden. I tried to forecast the winners from Twitter messages, using natural language processing (NLP), the theory of association rules, and semantic fields. I downloaded the tweets with the keyword "eurovision" for just one day, May 17. The analysis shows that the winner is going to be the singer from Denmark, and the next three places will go to Ukraine, Russia and Ireland. Well, we'll see :)
If my forecast is correct, I will describe the algorithms of my analysis in detail.

Here are some of the association rules obtained and their representations:

lhs          rhs          support confidence     lift
1 {denmark,
   russia,
   ukraine} => {win}    0.009779951        0.8 1.216357
2 {ireland,
   russia,
   ukraine} => {winner} 0.002444988        1.0 2.655844
3 {denmark,
   ireland,
   ukraine} => {winner} 0.002444988        1.0 2.655844
4 {denmark,
   ireland,
   russia} => {winner} 0.002444988        1.0 2.655844
5 {denmark,
   ireland,
   russia,
   ukraine} => {winner} 0.002444988        1.0 2.655844

   lhs              rhs          support confidence     lift
1 {denmark,
    favourites} => {win}    0.017114914 1.0000000 1.520446
2 {denmark,
    ukraine}     => {win}    0.012224939 0.8333333 1.267038
3 {russia,
    ukraine}     => {win}    0.009779951 0.8000000 1.216357
4 {denmark,
    russia}      => {win}    0.009779951 0.8000000 1.216357
5 {denmark,
    russia,
    ukraine}     => {win}    0.009779951 0.8000000 1.216357
6 {holland,
    netherlands} => {win}    0.004889976 1.0000000 1.520446
7 {denmark,
    ireland}     => {win}    0.004889976 0.6666667 1.013631
8 {lithuania,
    moldova}     => {win}    0.002444988 1.0000000 1.520446
9 {ireland,
    ukraine}     => {winner} 0.002444988 1.0000000 2.655844
10 {ireland,
    russia}      => {winner} 0.002444988 1.0000000 2.655844
11 {ireland,
    russia,
    ukraine}     => {winner} 0.002444988 1.0000000 2.655844
12 {denmark,
    ireland,
    ukraine}     => {winner} 0.002444988 1.0000000 2.655844
13 {denmark,
    ireland,
    russia}      => {winner} 0.002444988 1.0000000 2.655844
14 {denmark,
    ireland,
    russia,
    ukraine}     => {winner} 0.002444988 1.0000000 2.655844

   lhs              rhs              support confidence      lift
1 {netherlands} => {vote}        0.06060606        0.8 2.200000
2 {russia}      => {love}        0.04545455        1.0 3.300000
3 {denmark}     => {win}         0.04545455        0.5 1.375000
4 {moldova}     => {votes}       0.03030303        1.0 16.500000
5 {irish}       => {votes}       0.01515152        1.0 16.500000
6 {holland}     => {netherlands} 0.01515152        1.0 13.200000
7 {holland}     => {win}         0.01515152        1.0 2.750000
8 {belgium}     => {love}        0.01515152        0.5 1.650000
9 {belgium}     => {win}         0.01515152        0.5 1.375000
10 {holland,
    netherlands} => {win}         0.01515152        1.0 2.750000
11 {denmark,
    ukraine}     => {favourite}   0.01515152        1.0 7.333333

lhs          rhs         support confidence      lift
1 {denmark} => {win}          0.24 0.8571429 1.785714
2 {denmark} => {favourite}    0.16 0.5714286 2.857143
3 {belgium} => {win}          0.08 1.0000000 2.083333
4 {ireland} => {win}          0.08 0.5000000 1.041667
5 {ukraine} => {favourite}    0.04 1.0000000 5.000000
6 {serbia} => {fun}          0.04 1.0000000 25.000000
7 {pretty} => {russia}       0.04 1.0000000 8.333333
8 {pretty} => {denmark}      0.04 1.0000000 3.571429
9 {pretty,
    russia} => {denmark}      0.04 1.0000000 3.571429
10 {denmark,
    pretty} => {russia}       0.04 1.0000000 8.333333
11 {denmark,
    russia} => {pretty}       0.04 1.0000000 25.000000