Tuesday, December 18, 2012

Investigation of the Concept "End of the World" in Blogs Using Data Mining Methods

    The end of the world, which is supposedly to take place on December 21, 2012, according to the Mayan calendar, is being intensively discussed in the information space. There are currently no grounds to expect natural disasters; even NASA has published a clarification on this matter (http://www.nasa.gov/topics/earth/features/2012.html). However, the stream of information associated with the "end of the world" concept may itself have an impact both on individuals and on society as a whole.
    Certain combinations of semantic concepts, repeated periodically by different information sources, can act as information viruses: they can affect people's mentality and behavior and cause certain trends in society.
    Therefore, in my opinion, developing methods for detecting trends in the information stream is very promising. I downloaded Twitter messages containing the keywords "End of the world" and analyzed them with data mining methods, including the theory of semantic fields, frequent sets, association rules, opinion mining, and Galois lattices. My previous investigations of Twitter messages can be found here.
    This approach makes it possible to identify the semantic network of concepts associated with the subject of the analysis and to construct the corresponding semantic rules. By analyzing the dynamics of the characteristics of semantic rules and of the relevant frequent sets, one can find connections with various important indicators.
    The tweets with the keywords "world end", "#endoftheworld", "December 21", "Dec 21", "#December 21", and "#end of the world" were loaded into separate files.
    The analysis was conducted in the following sequence. First, the stop words and the words of the highest and lowest frequency were removed from the loaded arrays. Then the frequent sets of words with a given level of support were found. Based on the array of frequent sets, association rules were constructed. I analyzed the daily dynamics of such characteristics of association rules as support and confidence. Next, the semantic field of words that reflects the semantic frame of the analyzed domain was formed. Using this semantic field, I constructed Galois lattices, which represent the semantic relations between the analyzed concepts. On the resulting lattice the ideals and filters were marked; they reflect how a semantic concept is formed from other concepts, together with the support values of these concepts.
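The first steps of this sequence can be illustrated with a minimal sketch. It is not the software used for the analysis: the stop-word list and the sample tweets are hypothetical, the removal of the highest- and lowest-frequency words is omitted, and only rules of the form {a} => {b} are built. Support is taken as the share of tweets that contain the word set, and confidence as support({a, b}) / support({a}).

```python
from collections import Counter
from itertools import combinations

# Hypothetical stop-word list and toy tweets, for illustration only.
STOP_WORDS = {"the", "is", "on", "of", "a", "we", "all", "will", "says"}

def tokenize(tweet):
    """Lowercase, strip punctuation, drop stop words; return a set of words."""
    words = (w.strip(".,!?#\"'") for w in tweet.lower().split())
    return {w for w in words if w and w not in STOP_WORDS}

def frequent_sets(transactions, min_support, max_size=2):
    """Return {word set: support} for sets whose support is >= min_support."""
    n = len(transactions)
    counts = Counter()
    for words in transactions:
        for size in range(1, max_size + 1):
            for combo in combinations(sorted(words), size):
                counts[frozenset(combo)] += 1
    return {s: c / n for s, c in counts.items() if c / n >= min_support}

def association_rules(freq, min_confidence):
    """Build rules {a} => {b} from frequent pairs as (a, b, support, confidence)."""
    rules = []
    for itemset, support in freq.items():
        if len(itemset) != 2:
            continue
        for a in itemset:
            (b,) = itemset - {a}
            # A frequent pair implies both of its single words are frequent too.
            confidence = support / freq[frozenset([a])]
            if confidence >= min_confidence:
                rules.append((a, b, support, confidence))
    return rules

tweets = [
    "We will all die on Friday",
    "Mayan calendar says the world ends Friday",
    "The Mayan calendar is stupid",
    "Die Friday? hahaha",
]
transactions = [tokenize(t) for t in tweets]
freq = frequent_sets(transactions, min_support=0.5)
rules = association_rules(freq, min_confidence=0.6)
for a, b, support, confidence in sorted(rules):
    print(f"{{{a}}} => {{{b}}}  {support:.2%}  {confidence:.2%}")
```

Enumerating every word pair is only practical for small vocabularies; on a real tweet array an Apriori-style pruning of infrequent sets would be needed.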
    On the basis of semantic fields of words with positive and negative sentiment, I calculated the daily dynamics of frequent sets with positive and negative sentiment.
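A minimal sketch of such a daily calculation is given here under simplifying assumptions: the two sentiment fields are hypothetical (the words actually used are not listed in this post), and instead of frequent sets it counts the share of each day's tweets that contain at least one word of a field.

```python
from collections import defaultdict

# Hypothetical sentiment fields, for illustration only.
POSITIVE = {"happy", "love", "party", "hahaha"}
NEGATIVE = {"die", "dead", "fear", "stupid"}

def daily_sentiment(dated_tweets):
    """Return {day: (positive_share, negative_share)} over the tweets of each day."""
    totals = defaultdict(lambda: [0, 0, 0])  # [tweets, positive hits, negative hits]
    for day, text in dated_tweets:
        words = {w.strip(".,!?#") for w in text.lower().split()}
        row = totals[day]
        row[0] += 1
        row[1] += bool(words & POSITIVE)
        row[2] += bool(words & NEGATIVE)
    return {day: (pos / n, neg / n) for day, (n, pos, neg) in sorted(totals.items())}

sample = [
    ("2012-12-17", "We all die Friday"),
    ("2012-12-17", "End of the world party hahaha"),
    ("2012-12-18", "So stupid, nobody will die"),
]
dynamics = daily_sentiment(sample)
for day, (pos, neg) in dynamics.items():
    print(day, f"positive={pos:.2f}", f"negative={neg:.2f}")
```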
    These are the first results, as the investigation is still in progress.

Some association rules with support and confidence for the channel #endoftheworld:

Semantic field used for message filtering:
 {21st, die, apocalypse, dying, dead, predict, prediction, calendar,
stupid, zombies, world, friday, mayan, hahaha}

association rules (rule, support, confidence):
{die}        => {friday}  6.79%  66.04%
{apocalypse} => {mayan}   0.76%  38.09%
{apocalypse} => {friday}  0.62%  30.95%
{21st}       => {world}   5.35%  40.57%
{dying}      => {friday}  1.09%  41.81%
{dead}       => {friday}  0.9%   50.0%
{predict}    => {world}   1.0%   75.0%
{calendar}   => {mayan}   3.1%   60.18%
{stupid}     => {friday}  0.9%   46.34%
{zombies}    => {friday}  0.52%  52.38%

Ideal & Filter of Galois lattice for different concepts (channel #endoftheworld):
Ideal & Filter for concept {world,die,21st}

Ideal & Filter for concept {world,die,friday}

Ideal & Filter for concept {world,apocalypse,21st}

Ideal & Filter for concept {die,21st, hahaha}
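As an illustration of what the ideal and the filter of a concept contain, here is a minimal sketch on a hypothetical context of four tweets. It is not the software used to draw the lattices. Concepts are identified by their intents (word sets), and a concept is taken to lie below another when its intent is larger, which means its extent (set of tweets) is smaller. The ideal of a concept is then the set of concepts below it, and the filter is the set of concepts above it.

```python
# Hypothetical formal context: tweet id -> words of the semantic field it contains.
context = {
    "t1": {"world", "die", "21st"},
    "t2": {"world", "die", "friday"},
    "t3": {"world", "21st"},
    "t4": {"die", "21st", "hahaha"},
}

def concept_intents(context):
    """All concept intents: the full word set plus every intersection of tweet word sets."""
    all_attrs = frozenset().union(*context.values())
    intents = {all_attrs}
    for words in context.values():
        intents |= {frozenset(words) & intent for intent in intents}
    return intents

def extent(context, intent):
    """Tweets that contain every word of the intent."""
    return frozenset(t for t, words in context.items() if intent <= words)

def ideal_and_filter(context, intent):
    """Ideal: concepts with a larger intent (more specific); filter: smaller intent (more general)."""
    intents = concept_intents(context)
    ideal = {i for i in intents if intent <= i}
    filt = {i for i in intents if i <= intent}
    return ideal, filt

concept = frozenset({"world", "die", "21st"})
ideal, filt = ideal_and_filter(context, concept)
for i in sorted(filt, key=len):
    support = len(extent(context, i)) / len(context)
    print(sorted(i), f"support={support:.2f}")
```

The support printed for each concept of the filter is the share of tweets in its extent, which corresponds to the support values marked on the lattice nodes.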

 The first results for the tweets with the keywords {"end", "world"}:

 Examples of association rules with support and confidence:
die        => 21st  0.0018   0.11
worry      => 21st  0.00337  0.31
apocalypse => 21st  0.00043  0.06

The dynamics of support for the association rule {end, world, mayan} => {friday}

    Later on I will describe the results in detail. I also intend to develop software that will analyze information trends in a given subject area using the theory of semantic fields, frequent sets, association rules, opinion mining, and Galois lattices.
    I think it is time to study a new information object, which can be called "an informational mental virus": it can produce latent association rules in the subconscious and thus influence human behavior.

Sunday, December 16, 2012

The Clustering of Author's Texts of English Fiction in the Vector Space of Semantic Fields

   The clustering of text documents in the vector space of semantic fields and in the semantic space with an orthogonal basis has been analysed. It is shown that the vector space model with a basis of semantic fields is effective in cluster analysis algorithms for authors' texts in English fiction. The analysis of how the authors' texts are distributed over the cluster structure showed the presence of areas of semantic space that represent the idiolects of individual authors. SVD factorization of the semantic fields matrix makes it possible to reduce significantly the dimension of the semantic space in the cluster analysis of authors' texts.
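The idea of a vector space with a basis of semantic fields can be shown with a minimal sketch. The three fields and the sample sentences are hypothetical, since the abstract does not list the fields that were used. Each text is mapped to the relative frequencies of the words of each field, and texts are compared by cosine similarity; the SVD step is not shown.

```python
import math

# Hypothetical semantic fields, for illustration only.
SEMANTIC_FIELDS = {
    "motion": {"run", "walk", "ride", "sail"},
    "emotion": {"love", "fear", "joy", "grief"},
    "nature": {"sea", "wind", "forest", "river"},
}

def field_vector(text):
    """Relative frequency of each semantic field's words in the text."""
    words = [w.strip(".,;!?\"'") for w in text.lower().split()]
    total = len(words) or 1
    return [sum(w in field for w in words) / total for field in SEMANTIC_FIELDS.values()]

def cosine(u, v):
    """Cosine similarity of two vectors; 0.0 if either vector is zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

a = field_vector("They sail the sea and ride the wind")
b = field_vector("Love and fear and joy")
c = field_vector("We walk by the river")
print(a, b, c)
print(cosine(a, b), cosine(a, c))
```

Texts a and c use the same fields in the same proportions, so they point in the same direction of this space, while b is orthogonal to both. A clustering algorithm would operate on such vectors.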

Quantum Algorithm of Evolutionary Analysis of 1D Cellular Automata

It is shown that irreversible classical cellular automata can be implemented by a quantum algorithm using additional ancilla registers. An algorithm for the analysis of cellular automata states has been proposed. This algorithm is based on elements of Grover's algorithm: the inversion of the amplitudes of the searched states and the unitary transform of inversion about the average. The inversion of the amplitudes of the searched states can be performed by a quantum Toffoli gate.
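The two Grover steps named above can be illustrated with a classical simulation of the amplitudes. This is only a sketch under assumptions that are not taken from the paper: the automaton is elementary rule 90 on a ring of four cells, and the searched states are the predecessors of a given target configuration. The ancilla registers and the Toffoli-gate implementation of the oracle are not modelled.

```python
import math
from itertools import product

def rule90_step(cells):
    """One step of rule 90 on a ring: each cell becomes the XOR of its two neighbours."""
    n = len(cells)
    return tuple(cells[(i - 1) % n] ^ cells[(i + 1) % n] for i in range(n))

def grover_search(n_cells, target, iterations):
    """Simulate Grover iterations; return ({state: probability}, set of searched states)."""
    states = list(product((0, 1), repeat=n_cells))
    marked = {s for s in states if rule90_step(s) == target}
    amp = {s: 1 / math.sqrt(len(states)) for s in states}  # uniform superposition
    for _ in range(iterations):
        # Oracle: invert the amplitudes of the searched states.
        for s in marked:
            amp[s] = -amp[s]
        # Diffusion: inversion about the average amplitude.
        mean = sum(amp.values()) / len(amp)
        amp = {s: 2 * mean - a for s, a in amp.items()}
    return {s: a * a for s, a in amp.items()}, marked

probs, marked = grover_search(4, (0, 0, 0, 0), iterations=1)
p_marked = sum(probs[s] for s in marked)
print(sorted(marked), p_marked)
```

In this example 4 of the 16 configurations map to the all-zero target, so a quarter of the states are searched and a single iteration already concentrates the whole probability on them.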

Genetic Optimization of Keywords Subset in the Classification Analysis of Texts Authorship

   The genetic selection of a keyword set, whose frequencies in texts are used as attributes in text classification analysis, has been analyzed. The genetic optimization was performed on a set of words that is the part of the frequency dictionary lying within given frequency limits. The frequency dictionary was built from the analyzed array of English fiction texts. The error of the k-nearest neighbors classifier was used as the fitness function minimized by the genetic algorithm. The results show high precision and recall of text classification by authorship categories based on the attributes of the keyword set selected by the genetic algorithm from the frequency dictionary.

Tuesday, December 11, 2012

The Model of Semantic Concepts Lattice For Data Mining Of Microblogs

        The methods of modern data mining are used effectively in processing Web content. The Twitter microblog system is one of the most popular means of user interaction through short messages. A model of the semantic concept lattice for data mining of microblogs is proposed in this work. It is shown that this model is effective for the analysis of semantic relations and for the detection of association rules of keywords in an array of microblog messages. For the experimental research, a package of application programs in Perl has been developed. With the help of this package and the Twitter API, a test array of messages containing the word "software" and the hashtag "#software" has been downloaded. A set of thematic messages associated with software topics has been selected. The lattice of formal concepts has been considered for semantic fields of different size and content, and the tweets containing words of different semantic fields have been analysed. The semantic concept lattice reflects the interaction of concepts in microblog messages. After filtering the array of input messages by a given semantic field, an array of 8920 tweets was obtained. The Lattice Miner software package was used for calculating the concept lattice. On the basis of the concept lattice, the association rules that represent the relations between the semantic concepts of the analysed subjects have been found. The theory of formal concept analysis is effective in the intellectual processing of microblog messages. Lattice models of semantic concepts make it possible to analyse semantically related sets of words and to construct association rules. Forming semantic fields from the array of identified frequent sets significantly narrows the search for association rules and reduces the size of the semantic concept lattice in text mining algorithms.
     Similar investigations were carried out for an array of Twitter messages with the hash tags "#london2012" and "olympics", which were loaded during the Olympic Games in London (2012). We studied the events of the Olympic Games, in particular the final of the tennis tournament.
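The filtering step can be illustrated with a minimal sketch in Python (the original package was written in Perl and is not reproduced here). The semantic field and the tweets are hypothetical. The sketch keeps only the tweets that contain at least one word of the field and builds a binary tweet-by-word context, which is the kind of data formal concept analysis works on; the input format of Lattice Miner itself is not shown.

```python
def build_context(tweets, semantic_field):
    """Keep tweets that mention the field; map tweet index -> field words it contains."""
    context = {}
    for idx, text in enumerate(tweets):
        words = {w.strip(".,!?#") for w in text.lower().split()}
        hits = words & semantic_field
        if hits:
            context[idx] = hits
    return context

def cross_table(context, semantic_field):
    """Render the context as a plain-text table of tweets against field words."""
    attrs = sorted(semantic_field)
    lines = ["tweet  " + "  ".join(attrs)]
    for idx, hits in context.items():
        marks = ["X".center(len(a)) if a in hits else " " * len(a) for a in attrs]
        lines.append(f"{idx:<5}  " + "  ".join(marks))
    return "\n".join(lines)

# Hypothetical semantic field and tweets, for illustration only.
field = {"android", "developer", "london", "browser"}
tweets = [
    "Android developer wanted in London",
    "New browser released",
    "Great weather today",
    "#android phones are popular",
]
context = build_context(tweets, field)
print(cross_table(context, field))
```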

Examples of Galois Lattices:

Ideal & Filter for concept {android, developer, london}

Ideal & Filter for concept {london}

Ideal & Filter for concept {android, developer}

Ideal & Filter for concept {browser}

Ideal & Filter for concept {android}

Ideal & Filter for concept {android, phones, popular}

Investigation of the tennis final at the Olympic Games (London 2012)

Galois Lattice

 Ideal & Filter for concept {aug_05, men, federer, murrey}

 Ideal & Filter for concept {aug_04, women, williams, sharapova}

The dynamics of support for the association rules Gold->Sharapova, Gold->Williams

The dynamics of confidence for the association rules Gold->Sharapova, Gold->Williams


    In this work we consider the use of the vector space of semantic fields in the classification analysis of authors' texts in fiction. We also analyze the precision and recall of the naive Bayesian classifier and the k-nearest neighbors classifier for a vector model of text documents in the space of semantic fields that reflect the denotative and connotative characteristics of the vocabulary.