Automatically Infer Human Traits and Behavior from Social Media Data


1. Abstract

  • Analyze and predict people's traits and behavior from social media (Twitter, Facebook).

2. Feasibility

  • large scale: (large user base, large volume of data)
  • comprehensive: (many data types, e.g., text posts, image posts, likes, and friendships)
  • longitudinal: (data spans a long period of time)
  • objective: (the data are objective records)

3. Related Work

3.1 The survey lists the topics studied in 24 papers.

  • Platform: Twitter, Facebook, Reddit (social news), Quora (a Q&A site comparable to Zhihu), Instagram
  • Source Data Type: tweet, user profile, post, social network...
  • Predicted Target: political leaning, ethnicity...
  • Explicit User Characteristics: age, gender, name...
  • Latent User Characteristics: behavior, personality...
  • Sample size: max: 100K people; min: 383

3.2 Technical difficulties

  • graph analytics for social networks
  • NLP (text analysis)

Figure 1: System architecture

3.3 Three major challenges

  • small labeled training datasets
  • unstructured and high dimensional user data
  • heterogeneous user data

3.4 Classification of methods:

  1. Stage (addresses challenge 1): 2-stage: unsupervised learning on a large sample; 1-stage: supervised learning
  2. Dimension Reduction (addresses challenge 2): human engineered; supervised selection; unsupervised feature learning
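
As one possible illustration of "supervised selection", the sketch below ranks features by the absolute Pearson correlation between each feature column and the label, then keeps the top k. This is only one of many selection criteria and is not necessarily what the surveyed papers use; the function names `pearson` and `select_top_k` and the toy data are assumptions made for this example.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0


def select_top_k(rows, labels, k):
    """Return indices of the k features most correlated (in absolute value) with the labels."""
    d = len(rows[0])
    scores = [abs(pearson([r[j] for r in rows], labels)) for j in range(d)]
    return sorted(range(d), key=lambda j: scores[j], reverse=True)[:k]


# Hypothetical toy data: 4 users, 3 features, binary label.
rows = [[1, 5, 0], [2, 5, 1], [3, 5, 0], [4, 5, 1]]
labels = [0, 0, 1, 1]
print(select_top_k(rows, labels, 1))
```

Because the labels drive the ranking, this kind of selection needs labeled data, which is why it is grouped separately from unsupervised feature learning.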

4. Feature Extraction

4.1 Text feature extraction

  • n-gram language model (unigram; bigram; trigram)
    example (from Baidu Baike):
    西安交通大学 (Xi'an Jiaotong University):
    unigram form: 西/安/交/通/大/学
    bigram form: 西安/安交/交通/通大/大学
    trigram form: 西安交/安交通/交通大/通大学
  • LIWC (Linguistic Inquiry and Word Count; LIWC2015 is the gold standard in computerized text analysis)

  • customized vocabulary
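
The n-gram example and the customized-vocabulary idea above can be sketched in a few lines of Python. This is a minimal illustration, not a full tokenizer: `char_ngrams` and `vocabulary_counts` are names chosen for this example, and the sample tokens and vocabulary are made up.

```python
from collections import Counter


def char_ngrams(text, n):
    """Return the character n-grams of `text`, in order of appearance."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def vocabulary_counts(tokens, vocabulary):
    """Count how often each term of a customized vocabulary occurs in `tokens`."""
    counts = Counter(t for t in tokens if t in vocabulary)
    return {term: counts[term] for term in sorted(vocabulary)}


name = "西安交通大学"
print("/".join(char_ngrams(name, 1)))  # unigram form
print("/".join(char_ngrams(name, 2)))  # bigram form
print("/".join(char_ngrams(name, 3)))  # trigram form

# Hypothetical tokens and vocabulary, for illustration only.
tokens = ["happy", "sad", "happy", "lunch"]
print(vocabulary_counts(tokens, {"happy", "sad", "angry"}))
```

For English text the same function would usually be applied to a list of word tokens rather than to characters; slicing works the same way on lists.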

4.2 Image feature extraction

  • presence of tattoos, graffiti; Toward multimodal cyberbullying detection

4.3 Unsupervised Single View Feature Learning

  • Singular Value Decomposition (SVD)
  • Principal Component Analysis (PCA)
  • Latent Dirichlet Allocation (LDA)
  • GloVe
  • Autoencoder (AE)
  • Word Embedding with Word2Vec
  • Document Embedding with Doc2Vec
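
To make one of the methods above concrete, here is a minimal PCA sketch using only the standard library: it centers the data, builds the sample covariance matrix, and finds the first principal component by power iteration. In practice a numerical library would be used instead; the function names and the toy data are assumptions for this example, and the fixed all-ones start vector is a simplification that fails if it happens to be orthogonal to the leading component.

```python
import math


def first_principal_component(rows, iterations=100):
    """Approximate the first principal component of `rows` by power iteration."""
    n, d = len(rows), len(rows[0])
    # Center each column so the covariance is taken around the mean.
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix (d x d).
    cov = [[sum(r[a] * r[b] for r in centered) / (n - 1) for b in range(d)]
           for a in range(d)]
    # Power iteration: multiply by the covariance matrix and renormalize.
    v = [1.0] * d
    for _ in range(iterations):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:  # no variance along v; nothing more to find
            return v
        v = [x / norm for x in w]
    return v


def project(rows, component):
    """Project the centered rows onto `component`, giving one score per row."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    return [sum((r[j] - means[j]) * component[j] for j in range(d)) for r in rows]


# Hypothetical 2-D user features lying on the line y = 2x.
rows = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]]
pc = first_principal_component(rows)
print(pc)
print(project(rows, pc))
```

The projection reduces each two-dimensional row to a single score, which is the dimension-reduction step that addresses high-dimensional user data.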

4.4 Multi-view Feature Fusion

  • Canonical Correlation Analysis (CCA)
  • Deep Canonical Correlation Analysis (DCCA)
  • Multi-task learning (MTL)
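
For reference, the standard CCA objective can be written as follows. The notation is an assumption of this note, not taken from the survey: $X$ and $Y$ are two views of the same users (for example text and likes), $\Sigma_{xx}$ and $\Sigma_{yy}$ are their covariance matrices, $\Sigma_{xy}$ is the cross-covariance, and $w_x$, $w_y$ are the projection vectors being learned.

```latex
(w_x^{*}, w_y^{*}) = \arg\max_{w_x,\, w_y}
\frac{w_x^{\top} \Sigma_{xy}\, w_y}
     {\sqrt{w_x^{\top} \Sigma_{xx}\, w_x}\; \sqrt{w_y^{\top} \Sigma_{yy}\, w_y}}
```

CCA thus finds one linear projection per view such that the projected views are maximally correlated. DCCA keeps the same correlation objective but replaces the linear projections with neural networks applied to each view.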

5. Future Directions Large-scale

  • multi-users