Thursday, January 2, 2014

Graph Approach for Data Mining in BI Applications

    I would like to share some thoughts on using graph theory in modern business intelligence (BI). Business intelligence is about providing reporting and analysis solutions that show business users what happened and why. Advanced analytics solutions, on the other hand, deliver deeper insight into what might happen in the future, based on large volumes of historical data and sophisticated modeling techniques. BI includes advanced econometric and statistical analysis, machine learning, predictive analytics and other modern approaches, including methods of graph theory.
    IBM recently published its technology trend predictions for the next five years (CNN): Smarter Classrooms, Smarter Stores, Smarter Medicine, Smarter Privacy and Protection, and Smarter Cities. The structure of data in these technology areas can be effectively represented by graphs, so I would like to explain briefly the prospects of using graph-theoretic methods.
    Graphs make it possible to establish connections between the different components of a system. One of the main advantages of the graph approach is that an entity can be analyzed not only as an element of some set or transaction, but also in terms of the types of its relations with nodes across the whole network. Graph approaches open up new possibilities in areas such as fraud detection, finding key influencers, gatekeepers, and suspicious users and processes, predictive analytics, anomaly detection, detection of artificial communities, recommendation systems, etc. Community detection and the detection of fraudulent identities in networks are important parts of network analysis and have become popular areas of research. Recommendation systems process relations between users and other entities, e.g. products, processes and services, based on users' activity such as purchasing. Graphs are most widely used to describe social ties between users, and many methods of social network analysis are based on graph theory.
    Graphs make it possible to establish connections not only between users but also to include various other entities in these structures. An N-mode graph representation can be used for this purpose. For a specific analysis, an N-mode graph can be transformed into a one-mode graph using different aggregation functions. Thus, using the graph approach, we can describe different types of relations between users together with their relations to processes and other entities.
    To build a practical system, it is necessary to consider quantitative graph characteristics, which can then be used in supervised and unsupervised classification algorithms. Various characteristics can be used to analyze graph vertices. The simple ones are betweenness and closeness centralities, vertex PageRank, authority, coreness, etc. These measures make it possible to estimate the importance of a node in the network. As an example, I took a social graph that was built from a tweet stream in my previous investigation, Detection of Community Anomalies in Twitter Trends. Fig. 1 shows the graph with detected communities.
Fig. 1
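The transformation of an N-mode graph into a one-mode graph mentioned above can be illustrated with a minimal sketch for the two-mode case (users and entities). The function name `project_two_mode`, the choice of min(weight_u, weight_v) as the per-entity contribution, and the sample data are all assumptions made for illustration; they are not taken from the analysis in this post.

```python
from collections import defaultdict
from itertools import combinations


def project_two_mode(edges, aggregate=sum):
    """Project a two-mode (user, entity) graph onto the users.

    edges: iterable of (user, entity, weight) triples.
    Returns {(u, v): weight} with u < v. Every entity shared by u and v
    contributes min(weight_u, weight_v) (an assumed choice), and the
    contributions are combined with the `aggregate` function.
    """
    by_entity = defaultdict(dict)
    for user, entity, weight in edges:
        by_entity[entity][user] = by_entity[entity].get(user, 0) + weight

    contributions = defaultdict(list)
    for users in by_entity.values():
        for u, v in combinations(sorted(users), 2):
            contributions[(u, v)].append(min(users[u], users[v]))

    return {pair: aggregate(vals) for pair, vals in contributions.items()}


# Hypothetical data: how often each user mentioned each entity.
edges = [
    ("alice", "iphone", 2), ("bob", "iphone", 1),
    ("alice", "itunes", 1), ("bob", "itunes", 3), ("carol", "itunes", 1),
]
projected = project_two_mode(edges)
```

Passing a different `aggregate` (for example `max` or `len`) changes how shared entities are combined into a single user-to-user weight, which is the role the aggregation function plays in the text.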
Fig. 2 shows the subgraph of vertices with authority > 0.
Fig. 2
Fig. 3 shows the subgraph of vertices with coreness > 1.
Fig. 3
Fig. 4 shows the subgraph of vertices with coreness > 3.
Fig. 4
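For readers unfamiliar with coreness, the filtering behind Figs. 3 and 4 can be sketched as follows. A vertex has coreness k if it belongs to the k-core (the largest subgraph in which every vertex has at least k neighbours) but not to the (k+1)-core. The code below is a simple peeling procedure on a small hypothetical graph; it is only a sketch of the idea, not the tool used to produce the figures.

```python
def coreness(adj):
    """Return {vertex: core number} for an undirected graph.

    adj: {vertex: set of neighbours}; must be symmetric, no self-loops.
    Repeatedly removes a vertex of minimum remaining degree; the core
    number of a vertex is the largest minimum degree seen up to its removal.
    """
    degree = {v: len(nbrs) for v, nbrs in adj.items()}
    remaining = set(adj)
    core = {}
    k = 0
    while remaining:
        v = min(remaining, key=degree.get)
        k = max(k, degree[v])
        core[v] = k
        remaining.remove(v)
        for u in adj[v]:
            if u in remaining:
                degree[u] -= 1
    return core


def vertices_with_coreness_above(core, threshold):
    """Vertices that would be kept in a 'coreness > threshold' subgraph."""
    return sorted(v for v, c in core.items() if c > threshold)


# Hypothetical graph: a, b, c, d are all connected to each other,
# e is linked to a and b, and f is linked only to e.
adj = {
    "a": {"b", "c", "d", "e"},
    "b": {"a", "c", "d", "e"},
    "c": {"a", "b", "d"},
    "d": {"a", "b", "c"},
    "e": {"a", "b", "f"},
    "f": {"e"},
}
core = coreness(adj)
```

Raising the threshold strips away loosely attached vertices first, so what remains is the densely interconnected part of the network.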
Using these scores, we can find both artificial anomalous communities and communities of users who are really important in the analyzed tweet stream. Removing anomalous communities is very important when mining trends, since it gives us a truer picture of what users actually think. Another useful feature for social networks could be an additional service that filters out anomalous communities or informs other users about suspicious users and information streams.
   Here are some other results. For the last several months we have been collecting tweet streams filtered by given keywords such as 'apple', 'cosmetics', etc. Consider the tweet stream with the keyword 'apple'. Fig. 5 shows the revealed user communities, formed on the basis of the analysis of graph relations.
Fig. 6 shows the cloud of keyword frequencies.
As a result of the graph analysis, we identified competent users and key influencers. The connections between the most competent users and influencers are shown in Fig. 7.
For further consideration, we took the tweets of the most competent and authoritative users, as determined by several graph characteristics. The cloud of keyword frequencies for this set of tweets is shown in Fig. 8. A user's authority was defined on the basis of the user's connections with other users, taking into account the connections of those users with others. In this analysis, we calculated the principal eigenvector of the product of the transposed adjacency matrix and the adjacency matrix of the graph. The most authoritative user turned out to be the one named justinbieber. Interestingly, Apple products appear in this user's tweets not directly but implicitly, in the context of other topics, e.g. as a link to iTunes. In my opinion, this is the most effective type of advertising.
Fig. 8
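The authority score described above, the principal eigenvector of the product of the transposed adjacency matrix and the adjacency matrix, can be approximated by power iteration. The sketch below is only an illustration: the edge direction (an edge means that the source user points to the target user, e.g. by a mention or retweet), the normalization, and the sample data are assumptions, and the original analysis may have been computed differently.

```python
def authority_scores(edges, iterations=100):
    """Approximate the principal eigenvector of A^T A by power iteration.

    edges: iterable of directed (source, target) pairs; A[i][j] = 1 when
    i points to j. Returns {vertex: score}, normalized to sum to 1.
    """
    edges = list(edges)
    nodes = sorted({n for edge in edges for n in edge})
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # hub = A @ auth: each node sums the authority of nodes it points to.
        hub = {n: 0.0 for n in nodes}
        for src, dst in edges:
            hub[src] += auth[dst]
        # new_auth = A^T @ hub: each node sums the hub weight of nodes pointing to it.
        new_auth = {n: 0.0 for n in nodes}
        for src, dst in edges:
            new_auth[dst] += hub[src]
        total = sum(new_auth.values())
        if total == 0:
            return new_auth
        auth = {n: x / total for n, x in new_auth.items()}
    return auth


# Hypothetical data: three fans point to "star", one of them also to "niche".
edges = [
    ("fan1", "star"), ("fan2", "star"), ("fan3", "star"), ("fan1", "niche"),
]
scores = authority_scores(edges)
top_user = max(scores, key=scores.get)
```

In this toy example the user that many others point to gets the highest score, while users that nobody points to get a score of zero. This is the same quantity as the authority score in the HITS algorithm.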
We carried out a similar analysis for the tweet stream with the two keywords 'phone' and 'galaxy'. Fig. 9 shows the revealed user communities, Fig. 10 shows the cloud of keyword frequencies, and Fig. 11 shows the cloud of keyword frequencies in the tweets of the most competent and authoritative users. The most authoritative user in this analysis turned out to be the one named hayyouapp.
Fig. 9
Fig. 10
Fig. 11

    In our next studies, we intend to research user communities and entities denoted by keywords in financial and business tweet streams. Our aim is to find out whether the graph scores of a stream have predictive power for important business and financial time series. In our previous studies we found that frequent itemsets of keywords derived from tweet streams can be used for financial predictive analytics. You can find these results here:
Forecasting of Stock Financial Series Using Multivariate Vector Autoregressive Model 
Granger Causality Test for Frequent Itemsets of Keywords in Financial Tweets
Tweets Miner for Stock Markets
  We intend to research how the graph approach to the analysis of social networks can be used in financial analytics and business intelligence applications.

1 comment:

  1. Making data visual and "readable" is most valuable.

    By the way, Justin Bieber is a very famous (and rather ill-behaved) Canadian singer.