DiffusionPredictor

Introduction

DiffusionPredictor predicts the structure of diffusion events on social networks based on public opinion.

It is a project I worked on as a research assistant at the Network Security Lab of Tsinghua University in Beijing, China, where I was supervised by Professor Jun Li.

Details

1. Problems and Motivations

Factors that lead to viral diffusion have been studied widely, but the potential influence of public opinion has not yet been considered.

However, public opinion may have a great impact.

In certain social networks (such as Twitter and Weibo), users can share information together with their own comments. These comments often express the users' sentiments and thus reflect public opinion on a given diffusion event. For example, comments with positive (happy) and negative (sad) emotions may lead to totally different diffusion patterns.

Thus, we want to examine the relationship between public opinion among information propagators and the virality of online diffusion.

2. Goals

We want to develop a system to:

  1. collect diffusion events from Weibo;
  2. extract the emotions of propagators;
  3. propose new structures of diffusion events;
  4. build models to predict the structures.

3. Methods and Solutions

1. Emotion-Labeled Datasets

Weibo's APIs provide the complete repost list and explicit parent-node information for every tweet and retweet. We developed web crawlers based on these APIs.

We collect users' personal information, tweet contents, and retweeting activities. The dataset involves 28,307,541 messages and 11,225,257 unique user profiles.

Then, by applying algorithms from this paper, we reconstruct a total of 60,551 independent diffusion events as diffusion trees. We select the 31,978 events that have at least 100 nodes to construct the diffusion dataset for our structural research.
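The reconstruction step can be illustrated with a minimal sketch. It is not the algorithm from the cited paper. It only assumes that each crawled record is a hypothetical (message_id, parent_id) pair with unique message ids, where parent_id is None for an original tweet, and it groups records into trees by following parent links. The function names are illustrative.

```python
from collections import defaultdict


def build_diffusion_trees(records):
    """Group (message_id, parent_id) records into trees keyed by root id.

    Returns {root_id: {node_id: [child_id, ...]}}.
    """
    parent = {mid: pid for mid, pid in records}
    children = defaultdict(list)
    for mid, pid in records:
        if pid is not None:
            children[pid].append(mid)

    def find_root(mid):
        # Walk up parent links until the parent is unknown or missing.
        # The `seen` set is a defensive guard against malformed cyclic data.
        seen = set()
        while parent[mid] in parent and mid not in seen:
            seen.add(mid)
            mid = parent[mid]
        return mid

    trees = defaultdict(dict)
    for mid in parent:
        trees[find_root(mid)][mid] = children.get(mid, [])
    return dict(trees)


def select_large_events(trees, min_nodes=100):
    """Keep only the events whose tree has at least `min_nodes` nodes."""
    return {root: t for root, t in trees.items() if len(t) >= min_nodes}
```

With the default min_nodes=100, select_large_events mirrors the size filter described above.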

We use an open-source Chinese emotion lexicon to build our own emotion-labeled dataset, which contains about 4.2 million tweets. The labeled dataset is then used to train our Naïve Bayes classifier. The classifier analyzes the content published by a user and determines that user's individual opinion on a specific diffusion event. Users who post no comment, or whose comments the classifier cannot categorize, are assumed to share the opinion of their parent node.
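The sketch below illustrates the two ideas in this step: a small multinomial Naïve Bayes classifier with Laplace smoothing, and the rule that unlabeled users inherit the opinion of their parent node. It is a toy version under stated assumptions, not the project's implementation. Comments are assumed to be already tokenized, the classifier returns None when no token is in its vocabulary, and the fallback label for an unlabeled root ("neutral") is our own placeholder.

```python
import math
from collections import Counter, defaultdict


class TinyNaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for tokens, label in zip(docs, labels):
            self.word_counts[label].update(tokens)
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, tokens):
        known = [w for w in tokens if w in self.vocab]
        if not known:
            return None  # assumption: such comments count as uncategorized
        total = sum(self.class_counts.values())
        best, best_score = None, -math.inf
        for label, n in self.class_counts.items():
            denom = sum(self.word_counts[label].values()) + len(self.vocab)
            score = math.log(n / total)
            for w in known:
                score += math.log((self.word_counts[label][w] + 1) / denom)
            if score > best_score:
                best, best_score = label, score
        return best


def inherit_opinions(children, root, labels, default="neutral"):
    """Return {node: opinion}; unlabeled nodes copy their parent's opinion.

    `children` maps node -> list of child nodes; `labels` maps node ->
    classifier output (None or missing when the node is uncategorized).
    """
    opinions = {root: labels.get(root) or default}
    stack = [root]
    while stack:
        node = stack.pop()
        for child in children.get(node, []):
            opinions[child] = labels.get(child) or opinions[node]
            stack.append(child)
    return opinions
```

Because the tree is traversed from the root down, a chain of silent retweeters all inherit the opinion of the nearest labeled ancestor.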

2. New Diffusion Structures

We first calculate the existing metrics, including Wiener Number (WN), Modularity (MOD), Average Depth (AD), Maximum Sub-Tree Proportion (MST), Non-Maximum Broadcast (NMB), and Distinct Parent (DP). The following figure shows the distribution of the structural metric values.

Based on this distribution, we define the following new structures.

  • Flare. Structures with metric values close to the origin on the bottom-left diagonal in figure d (NMB<0.5, NMB≈MST, MST<1/11). The broadcast from the root node dominates the whole diffusion, and the influence of any other node is an order of magnitude weaker (less than 1/10 of the root node's). Such a diffusion structure contains just one main exposure by the original publisher and resembles a flare.
  • Echo. Structures around the bottom-left diagonal but further from the origin than Flare (NMB<0.5, NMB≈MST, MST≥1/11). The root broadcast dominates the diffusion with a majority, while one secondary node leads almost all of the remaining diffusion. The root node gets support from a weaker but important opinion leader, like a single echo.
  • Detonation. Structures with metric values around the bottom-right diagonal of figure d (NMB<0.5, NMB+MST≈1). A secondary node leads the diffusion with a majority, while the root node broadcasts to most of the remaining nodes. The first broadcast causes a larger explosion, like a detonation.
  • Colony. Structures in the top and right quarters of figure d (NMB+MST≥1), which means that some secondary nodes are stronger influencers than the root node. Unlike Detonation, the diffusion process is driven by multiple secondary influencers, and the diffusion topology looks like a bacterial colony.
  • Firework. Structures in the top half of the left quarter (NMB≥0.5, NMB+MST<1). Metrics in this area show that the root broadcast has the largest influence but does not reach an absolute majority. More than half of the retweets happen at secondary nodes.
  • Galaxy. Structures with metric values in the bottom half of the left quarter in figure d (NMB<0.5, NMB+MST<1). Unlike Firework, the majority (more than half) of the diffusion comes directly from the root broadcast. On the other hand, multigenerational.
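The regions listed above can be turned into a small rule-based labeler. The sketch below takes precomputed NMB and MST values and applies the thresholds from the list. Two points are our own assumptions rather than part of the definitions: the tolerance `tol` used to interpret "≈", and the order in which overlapping regions are checked (the two diagonals first, then Colony, then Firework and Galaxy).

```python
def classify_structure(nmb, mst, tol=0.02):
    """Map (NMB, MST) metric values to one of the six structure names.

    `tol` is a hypothetical tolerance for the "approximately equal"
    conditions; the source does not state a numeric value.
    """
    if nmb < 0.5:
        # Bottom-left diagonal: NMB is approximately equal to MST.
        if abs(nmb - mst) <= tol:
            return "Flare" if mst < 1 / 11 else "Echo"
        # Bottom-right diagonal: NMB + MST is approximately 1.
        if abs(nmb + mst - 1) <= tol:
            return "Detonation"
    if nmb + mst >= 1:
        return "Colony"
    # Remaining area: NMB + MST < 1, split by the NMB = 0.5 line.
    return "Firework" if nmb >= 0.5 else "Galaxy"
```

For example, classify_structure(0.03, 0.03) falls in the Flare region, while classify_structure(0.6, 0.2) falls in the Firework region.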

3. Predictors

We define the prediction tasks as classification problems. Using the features from another of our papers, we apply a logistic regression classifier in all our prediction tasks.
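As an illustration of the classification setup, here is a minimal logistic regression trained with batch gradient descent in plain Python. It is only a sketch under stated assumptions: each task is treated as a binary problem ("does this event belong to a given pattern?"), and the feature vectors are hypothetical placeholders rather than the actual features from our other paper. The function names and hyperparameters are illustrative.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_logistic(X, y, lr=0.5, epochs=2000):
    """Fit weights and bias by batch gradient descent on the log loss."""
    n_features = len(X[0])
    w = [0.0] * n_features
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * g / len(X) for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / len(X)
    return w, b


def predict_proba(w, b, x):
    """Probability that feature vector `x` belongs to the positive class."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)


# Toy data: two hypothetical features per event, label 1 = target pattern.
X = [[0.1, 0.2], [0.2, 0.1], [0.8, 0.9], [0.9, 0.7]]
y = [0, 0, 1, 1]
w, b = train_logistic(X, y)
```

The predicted probabilities, rather than hard labels, are what an AUC evaluation would consume.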

4. Performance

The prediction performance, evaluated by AUC for each pattern, is reported below.

We first notice that the prediction results for all patterns are significantly improved by the introduction of opinion-based features. In addition, predictability differs considerably between patterns. For instance, it may be easier to detect a Firework diffusion from early observation than an Echo one. An analysis of individual features is also carried out, and the result indicates that the WOD, opinion distribution entropy, and objection proportion are usually effective predictors of structural patterns.
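For readers unfamiliar with the metric, AUC can be computed directly from its pairwise definition: the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as one half. The helper below is a minimal sketch with an illustrative name. It is quadratic in the number of examples, so it suits small checks rather than large datasets.

```python
def auc_score(labels, scores):
    """Area under the ROC curve from pairwise comparisons.

    `labels` holds 0/1 ground truth; `scores` holds predicted scores.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    # Count positive-over-negative wins; a tie contributes half a win.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

A value of 0.5 corresponds to random ranking and 1.0 to a perfect ranking of positives above negatives.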

Related Technologies

  1. Crawlers: We build web crawlers to collect tweets from Weibo.
  2. Sentiment Analysis: We build classifiers to analyze the sentiment information of diffusion events.
  3. Algorithms: We use machine learning algorithms to build prediction models.

Programming Languages

  1. Java: 70%. The main components are written in Java.
  2. Python: 15%. For part of the web crawlers.
  3. SAS and MATLAB: 15%. For statistical analysis.

Links to Papers

  1. Opinion-based Analysis of Structural Patterns in Online Viral Diffusion
  2. Poster abstract: Homophily and controversy: On the role of public opinion in online viral diffusion