Thursday, January 9, 2014

Mining of Social Network Streams in Marketing, Predictive Analytics, and Risk Management.


   The analysis of modern social networks is widely used in business areas such as marketing, forecasting, and financial and stock markets. Marketing, predictive analytics, and risk management are parts of business intelligence (BI). BI provides reporting and analysis that help to make business decisions and show what happened and why. We would like to consider how data mining methods applied to unstructured data streams can be used in BI solutions. Such data can be gathered from different sources, e.g. social network streams, specialized forums, and RSS channels. We especially study how this type of analysis can be applied to predictive analytics and risk management.
   Let us consider the basics of these areas. Predictive analytics is an area of data mining that deals with extracting information from data and using it to predict trends and behavior patterns. Predictive models analyze past performance to assess how likely a customer is to exhibit a specific behavior, in order to improve marketing effectiveness. With the number of competing services available, businesses need to focus their efforts on maintaining continuous consumer satisfaction and rewarding consumer loyalty. It is therefore important to analyze users' opinions, which can be retrieved from users' messages in social networks. Predictive analytics can also predict this behavior, so that the company can take proper actions to increase customer activity. Apart from identifying prospects, predictive analytics can also help to identify the most effective combination of product versions, marketing material, communication channels, and timing that should be used to target a given consumer. Predictive analytics and the mining of social network streams can also be used to identify high-risk fraud candidates in business or the public sector.
   Another area where social network streams can be applied is risk management, which is about the identification, assessment, and prioritization of risks. Social network streams make it possible to reveal quantitative characteristics of background factors for the processes under analysis. Monitoring the indicators of social network streams makes it possible to control the probability and/or impact of unfortunate events or to maximize the realization of opportunities. Knowledge that is otherwise lacking can be retrieved from the semi-structured data of message streams in social networks. This additional knowledge can also help to optimize the analyzed processes and minimize overall risk. Text stream mining enables us to reveal the dynamics of different risk sources by analyzing quantitative indicators retrieved from social network data. One of the important factors in risk management is users' opinion about some entity, e.g. a process or a service. Such opinions can be retrieved by applying a sentiment mining approach to the information streams of social networks. Modern business intelligence systems widely use analytical methods for unstructured and semi-structured data gathered from different sources.
   We would like to show the possibility of analyzing economic and financial indicators using streams of textual data and the information streams of social networks, special-purpose forums, and RSS channels. Consider the data mining of social network streams. To receive the information streams, we used the Twitter API and custom Python software for web scraping of special-purpose forums. The theoretical basis for the analysis was the theory of semantic fields, formal concept analysis, the theory of frequent sets and association rules, and sentiment mining methods. For the predictive analytics we used ARIMA and VAR models. The Granger test was used to find causality between time series. As a result of data mining of text messages, we obtain time series of various quantitative characteristics of blog messages, e.g. the support and confidence of association rules. The next step is to find correlations between the time series that result from social network data mining and the time series that represent real stock markets. At this step, we need to find time series of social media trends that not only correlate with stock market series but also have predictive potential. Data visualization and infographics, on the basis of which an expert makes a decision, are very important for decision-making in risk management. That is why we attached great importance to the various ways of presenting our results. As our previous studies show, it is very important to detect and remove anomalous communities that are dynamically formed in tweet streams. We also showed that it is very important to single out the tweets of competent users and main influencers. We can find them using different methods of graph theory.
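The time series of support and confidence mentioned above can be sketched as follows. This is a minimal illustration, not the software used in the study: it assumes that each tweet is reduced to a set of lowercase whitespace-separated keywords, and the function names and sample tweets are hypothetical.

```python
def support(itemset, transactions):
    """Fraction of transactions (keyword sets) that contain every item of `itemset`."""
    if not transactions:
        return 0.0
    items = set(itemset)
    return sum(1 for t in transactions if items <= t) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Confidence of the rule antecedent -> consequent: supp(A and C) / supp(A)."""
    base = support(antecedent, transactions)
    if base == 0:
        return 0.0
    return support(set(antecedent) | set(consequent), transactions) / base


def tokenize(tweet):
    """Assumption: a tweet is modeled as the set of its lowercase words."""
    return set(tweet.lower().split())


def daily_support_series(tweets_by_day, itemset):
    """Map each day to the support of `itemset` among that day's tweets."""
    return {
        day: support(itemset, [tokenize(t) for t in tweets])
        for day, tweets in sorted(tweets_by_day.items())
    }


# Hypothetical sample data, for illustration only
sample = {
    "2014-01-08": ["Apple iPhone launch", "apple stock up", "nice weather", "iphone apple sale"],
    "2014-01-09": ["apple iphone", "apple iphone price", "apple"],
}
series = daily_support_series(sample, {"apple", "iphone"})
```

Each value of `series` is one point of a time series that can then be compared with a stock market series for the same days.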
    As an example, consider the dynamics of the popularity of some cosmetic brands, based on the downloaded tweet streams. Fig. 1-4 show the results obtained. We used various graph types to visualize our results. Such graphs may be used in business intelligence dashboards. They may also provide additional business information for experts in marketing, predictive analytics, and risk management.

Fig.1


Fig.2


Fig.3

Fig.4


Let us consider the dynamics of the chosen brands, based on the analysis of messages from economic forums. Those messages were downloaded from the forums using the corresponding Python software. Fig. 5-7 show the obtained results in different graphical presentations.
Fig.5


Fig.6

Fig.7

Now we consider the dynamics of the quantitative characteristics of one company, based on the analysis of downloaded tweet streams. We chose Apple as an example. Fig. 8-9 show the dynamics of frequent itemsets of keywords and the dynamics of users' opinion. These results reflect the dynamics of the popularity of Apple products and users' opinion of them.
Fig.8

Fig.9

Our next step is to consider whether it is possible to predict Apple stock prices on the basis of the obtained time series of frequent itemsets of keywords. In our previous studies, we conducted the Granger test for the time series of frequent itemsets and Apple stock prices. This test showed that the time series of frequent itemsets of the analyzed tweet stream Granger-cause features of the stock price dynamics. We use the VAR model to analyze the possibility of predicting stock prices. This model takes into account both the dynamics of stock prices and the dynamics of some chosen frequent itemsets. Fig. 10-12 show the calculation results with different sets of frequent itemsets. The bold points are the predicted values that were calculated on the basis of previous historical data. Fig. 10-11 show the predictions for three days ahead, and Fig. 12 shows the prediction for one day ahead. The confidence interval is marked in grey.
Fig.10

Fig.11

Fig.12

In the VAR model, we included the time series of frequent itemsets of keywords and of users' opinions. The obtained data show that on some of the analyzed intervals the VAR model proved to be effective as a predictive analytics approach to stock market forecasting. In our further studies, we are going to concentrate on algorithms for effectively selecting the sets of time series of frequent itemsets, with the aim of narrowing the confidence interval and making more accurate predictions over longer time periods.
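The idea of a vector autoregressive forecast can be illustrated with a small sketch. It assumes a first-order model, y_t = c + A y_{t-1}, fitted by ordinary least squares in plain Python; the lag order, the estimation software, and the input series of the study itself are not reproduced here, the function names are hypothetical, and no confidence interval is computed.

```python
def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix built from copies, so the inputs are not modified
    m = [[float(v) for v in a[i]] + [float(b[i])] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system: regressors are collinear")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for k in range(col, n + 1):
                m[r][k] -= factor * m[col][k]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(m[i][k] * x[k] for k in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_var1(series):
    """Fit y_t = c + A y_{t-1} by least squares, one equation per variable.

    series: list of equal-length lists, one list per variable.
    Returns one coefficient row [c_i, a_i1, ..., a_ik] per variable.
    """
    k = len(series)
    n = len(series[0])
    # Regressor rows: intercept plus every variable at time t-1
    rows = [[1.0] + [series[j][t - 1] for j in range(k)] for t in range(1, n)]
    # Normal equations X'X beta = X'y; X'X is shared by all equations
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k + 1)]
           for i in range(k + 1)]
    coefs = []
    for i in range(k):
        target = series[i][1:]
        xty = [sum(r[j] * y for r, y in zip(rows, target)) for j in range(k + 1)]
        coefs.append(solve_linear(xtx, xty))
    return coefs


def forecast(series, coefs, steps):
    """Iterate the fitted model forward from the last observation."""
    state = [s[-1] for s in series]
    predictions = []
    for _ in range(steps):
        state = [row[0] + sum(a * v for a, v in zip(row[1:], state))
                 for row in coefs]
        predictions.append(state)
    return predictions
```

In use, the first list passed to `fit_var1` would be the stock price series and the remaining lists the supports of the chosen frequent itemsets; `forecast(series, coefs, 3)` then gives a three-step-ahead prediction for every variable.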

Our previous similar investigations can be found at:
We also list selected scientific e-prints and links in which we describe the theoretical grounds of social network mining used in our studies:

B. Pavlyshenko
Tweets Miner for Stock Market Analysis
             In this paper, we present a software package for the data mining of Twitter microblogs with the purpose of their usage in the stock market analysis. The package is written in R language using appropriate R packages. We considered the model of tweets and then compared stock market charts with frequent sets of keywords in Twitter microblog messages.
B. Pavlyshenko
Can Twitter Predict Royal Baby's Name?
             We analyze the existence of possible correlation between public opinion of twitter users and the decision-making of persons who are influential in the society. In our study, we use the methods of quantitative processing of natural language, the theory of frequent sets, the algorithms of visual displaying of users' communities. It was revealed that the structure of dynamically formed users' communities participating in the discussion is determined by only a few leaders who influence significantly the viewpoints of other users.
B. Pavlyshenko
Forecasting of Events by Tweet Data Mining
             This paper describes the analysis of quantitative characteristics of frequent sets and association rules in the posts of Twitter microblogs related to different event discussions. For the analysis, we used a theory of frequent sets, association rules and a theory of formal concept analysis. We revealed the frequent sets and association rules which characterize the semantic relations between the concepts of analyzed subjects. The support of some frequent sets reaches its global maximum before the expected event but with some time delay. Such frequent sets may be considered as predictive markers that characterize the significance of expected events for blogosphere users. We showed that the time dynamics of confidence in some revealed association rules can also have predictive characteristics. Exceeding a certain threshold may be a signal for corresponding reaction in the society within the time interval between the maximum and the probable coming of an event. In this paper, we considered two types of events: the Olympic tennis tournament final in London, 2012 and the prediction of Eurovision 2013 winner.

Thursday, January 2, 2014

Graph Approach for Data Mining in BI Applications

    I would like to share some thoughts about using graph theory in modern business intelligence (BI). Business intelligence is about providing reporting and analysis solutions that show business users what happened and why. Advanced analytics solutions, on the other hand, deliver deeper insight into what might happen in the future, based on high volumes of historical data and sophisticated modeling techniques. BI includes advanced econometric and statistical analyses, machine learning, predictive analytics, and other modern approaches, including methods of graph theory. Recently IBM revealed its technological trend predictions for the next 5 years (CNN). These are Smarter Classrooms, Smarter Stores, Smarter Medicine, Smarter Privacy and Protection, and Smarter Cities. The structure of data in these technology areas can be effectively represented by graphs. I would like to explain briefly the prospects of using the methods of graph theory. Graphs make it possible to establish connections between the different component elements of a system. One of the main advantages of the graph approach is that one can analyze an entity not only as an element of some set or transaction, but also take into account the types of its relations with the nodes of the whole network. Graph approaches open new possibilities in such areas as fraud detection, finding key influencers, gatekeepers, and suspicious users and processes, predictive analytics, anomaly detection, detection of artificial communities, recommendation systems, etc. Community detection, including the detection of fraudulent personalities in networks, is an important part of network analysis and has become a popular area of investigation. Recommendation systems process relations between users and other entities, e.g. products, processes, and services, based on users' activity such as purchasing. Graphs are most widely used to describe social ties between users. Many methods of social network analysis are based on graph theory.
Graphs make it possible not only to establish connections between users but also to include various entities in these structures. For this purpose, an N-mode graph representation can be used. For a specific analysis, N-mode graphs can be transformed into a one-mode graph using different aggregation functions. So, using the graph approach, we can describe different types of users' relations together with processes and other entity relations. To build a practical system, it is necessary to consider quantitative graph characteristics, which may be further used in supervised and unsupervised classification algorithms. We can consider different graph characteristics that can be used for the analysis of graph vertices. The simple ones are betweenness and closeness centralities, vertex PageRank, authority, coreness, etc. These measures make it possible to estimate the importance of a node in the network. As an example, I took a social graph that was created from a tweet stream in my previous investigation, Detection of Community Anomalies in Twitter Trends. Fig. 1 shows the graph with detected communities.
Fig. 1
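The transformation of an N-mode graph into a one-mode graph mentioned above can be sketched for the two-mode case. This is only an illustration under stated assumptions: the edges connect users to items (e.g. products they tweeted about), the aggregation function is a simple count of shared items, and the function name and sample data are hypothetical.

```python
from collections import Counter
from itertools import combinations


def project_to_users(user_item_edges):
    """Collapse a two-mode user-item graph into a weighted one-mode user graph.

    The weight of an edge between two users is the number of items
    both of them are connected to (count aggregation).
    """
    users_by_item = {}
    for user, item in user_item_edges:
        users_by_item.setdefault(item, set()).add(user)
    weights = Counter()
    for users in users_by_item.values():
        # Every pair of users sharing this item gets one more unit of weight
        for u, v in combinations(sorted(users), 2):
            weights[(u, v)] += 1
    return dict(weights)


# Hypothetical sample data, for illustration only
edges = [
    ("ann", "iphone"), ("bob", "iphone"),
    ("ann", "ipad"), ("bob", "ipad"), ("cat", "ipad"),
]
user_graph = project_to_users(edges)
```

Other aggregation functions (for example, a sum of interaction strengths instead of a count) would change only the line that increments the weight.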
 
 
 
Fig. 2 shows the subgraph of vertices with authority > 0.
 
Fig. 2
 
Fig. 3 shows the subgraph of vertices with coreness > 1.
Fig. 3
 
Fig. 4 shows the subgraph of vertices with coreness > 3.
 
Fig. 4
 
Using these scores, we can find both artificial anomalous communities and the communities of users who are really important in the analyzed tweet stream. The removal of anomalous communities is very important in the data mining of trends, since it enables us to get a real picture of users' opinions. Another useful feature for social networks could be an additional service that filters out anomalous communities or informs other users about suspicious users and information streams.
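The coreness threshold used for the subgraphs in Fig. 3 and Fig. 4 can be sketched as follows. This is a simple, unoptimized k-core peeling procedure for an undirected graph given as an adjacency dictionary; it is not the software used for the figures, and the function name and sample graph are hypothetical.

```python
def coreness(adj):
    """Return the core number of every vertex of an undirected graph.

    adj: dict mapping each vertex to the set of its neighbours (symmetric).
    """
    degree = {v: len(neighbours) for v, neighbours in adj.items()}
    remaining = set(adj)
    core = {}
    k = 0
    while remaining:
        # Peel the vertex with the smallest remaining degree
        v = min(remaining, key=lambda u: degree[u])
        # The core number never decreases along the peeling order
        k = max(k, degree[v])
        core[v] = k
        remaining.remove(v)
        for u in adj[v]:
            if u in remaining:
                degree[u] -= 1
    return core


# Hypothetical sample graph: a triangle a-b-c, a pendant vertex d, an isolated vertex e
graph = {
    "a": {"b", "c", "d"},
    "b": {"a", "c"},
    "c": {"a", "b"},
    "d": {"a"},
    "e": set(),
}
core = coreness(graph)
dense_part = {v for v, c in core.items() if c > 1}
```

Raising the threshold in the last line keeps only the more densely connected part of the graph, which is how the subgraph shrinks from coreness > 1 to coreness > 3.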
   Here are some other results. For the last several months we have been collecting the tweet stream for given keywords such as apple, cosmetics, etc. Consider the tweet stream with the keyword 'apple'. Fig. 5 shows the revealed users' communities, formed on the basis of the analysis of graph relations.
Fig.5
 
 
Fig.6 demonstrates the cloud of keyword frequencies.
Fig.6 
 
As a result of the graph analysis, we revealed competent users and key influencers. The connections between the most competent users and influencers are shown in Fig. 7.
Fig.7
 
 
For further consideration we took the tweets of the most competent and authoritative users, as defined by several graph characteristics. The cloud of keyword frequencies for this tweet array is shown in Fig. 8. A user's authority was defined on the basis of the user's connections with other users, taking into account the connections of those users with others. In this analysis, we calculated the principal eigenvector of the product of the transposed adjacency matrix and the adjacency matrix of the graph. The most authoritative user turned out to be the one with the name jastinbieber. It is very interesting that Apple products occur in the tweets of this user not directly but implicitly, in the context of other topics, e.g. as a link to itunes. In my opinion, this is the most effective type of advertising.
 
Fig. 8
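The authority score described above, i.e. the principal eigenvector of the product of the transposed adjacency matrix and the adjacency matrix, can be approximated by power iteration. The sketch below assumes a directed edge list in which an edge (source, target) means that the source user mentions or retweets the target user; this edge semantics, the function name, and the sample data are assumptions for illustration.

```python
def authority_scores(edges, iterations=100):
    """Approximate the principal eigenvector of A^T A by power iteration.

    edges: iterable of (source, target) pairs of a directed graph,
    where A[source][target] = 1 for every edge.
    """
    edges = list(edges)
    nodes = sorted({n for edge in edges for n in edge})
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # hub = A * auth: sum of the authority of the nodes a vertex points to
        hub = {n: 0.0 for n in nodes}
        for source, target in edges:
            hub[source] += auth[target]
        # new_auth = A^T * hub: sum of the hub scores of the vertices pointing in
        new_auth = {n: 0.0 for n in nodes}
        for source, target in edges:
            new_auth[target] += hub[source]
        norm = sum(v * v for v in new_auth.values()) ** 0.5
        if norm == 0:
            return new_auth
        auth = {n: v / norm for n, v in new_auth.items()}
    return auth


# Hypothetical sample data: three users point to "star", one of them also to "other"
sample_edges = [("u1", "star"), ("u2", "star"), ("u3", "star"), ("u1", "other")]
scores = authority_scores(sample_edges)
most_authoritative = max(scores, key=scores.get)
```

Users that nobody points to receive an authority of zero, which corresponds to the authority > 0 filter used for the subgraph in Fig. 2.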
 
We carried out a similar analysis for the tweet stream with the two keywords 'phone' and 'galaxy'. Fig. 9 shows the revealed users' communities, Fig. 10 shows the cloud of keyword frequencies, and Fig. 11 shows the cloud of keyword frequencies in the tweet array of the most competent and authoritative users. The most authoritative user in this analysis turned out to be the one with the name hayyouapp.
Fig. 9
Fig. 10
Fig. 11
 

    In our next studies, we intend to research users' communities and the entities denoted by keywords in financial and business tweet streams. Our aim is to find out whether stream graph scores have predictive features that make it possible to predict important business and financial time series. In our previous studies, we found that frequent sets based on tweet streams can be used for financial predictive analytics. You can find these results here:
Forecasting of Stock Financial Series Using Multivariate Vector Autoregressive Model 
Granger Causality Test for Frequent Itemsets of Keywords in Financial Tweets
Tweets Miner for Stock Markets
  We intend to research how the graph approach to the analysis of social networks can be used in financial analytics and business intelligence applications.