Fake News Detection on Social Media: A Data Mining Perspective

Paper Review

Yumere 2018.03.26 14:28

The paper views the fake news detection problem from two perspectives:

  1. Characterization
  2. Detection


  • Discuss narrow and broad definitions of fake news that cover most existing definitions in the literature, and further present the unique characteristics of fake news on social media and its implications compared with traditional media
  • Give an overview of existing fake news detection methods with a principled way to group representative methods into different categories
  • Discuss several open issues and provide future directions of fake news detection in social media

Fake News Characterization

  • Introduce the basic social and psychological theories
  • Discuss more advanced patterns introduced by social media

Definitions of Fake News

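In the narrow sense, fake news is news that

  • is created intentionally,
  • is verifiably false, and
  • can mislead its readers.

In the broad sense, definitions of fake news focus on only one of the two: the authenticity or the intent of the news content. Example: satire.

The paper uses the following definition: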

Fake news is a news article that is intentionally and verifiably false

Under this definition, the following five are not fake news:

  • satire news with proper context
  • rumors that did not originate from news events
  • conspiracy theories
  • misinformation that is created unintentionally
  • hoaxes that are only motivated by fun or to scam targeted individuals

Psychological Foundations of Fake News

Humans are naturally not very good at differentiating between real and fake news

There are two major factors which make consumers naturally vulnerable to fake news

  • Naive Realism: consumers tend to believe that their own perception is the only accurate view, and that people who disagree are uninformed or biased
  • Confirmation Bias: consumers prefer information that confirms their existing views

Fake News on Social Media

Unique characteristics of fake news on social media

  • Malicious Accounts on Social Media for Propaganda
    social bots, cyborg users, and trolls
  • Echo Chamber Effect
    Consumers are selectively exposed to certain kinds of news because of the way news feeds appear on their homepage in social media

Fake News Detection

  • problem definition
  • propose approaches for fake news detection

Problem Definition


  • $a$: News Article. It consists of two components, Publisher and Content
    Publisher $\vec{p_a}$ includes a set of profile features to describe the original author (name, domain, age, among other attributes)
    Content $\vec{c_a}$ consists of a set of attributes that represent the news article (headline, text, image, etc.)

  • Define Social News Engagements as a set of tuples $\varepsilon=\{e_{it}\}$ to represent the process of how news spreads over time among $n$ users $\mathcal{U}=\{u_1, u_2, ..., u_n\}$

    $\mathcal{P}=\{p_1, p_2, ..., p_n\}$ are the corresponding posts on social media regarding news article $a$

    $e_{it} = \{u_i, p_i, t\}$ represents that a user $u_i$ spreads news article $a$ using $p_i$ at time $t$

    If article $a$ has no engagements yet, $t=Null$ and $u_i$ is the publisher

Given the social news engagements $\varepsilon$ among $n$ users for news article $a$, the task of fake news detection is to predict whether the news article $a$ is a fake news piece or not, i.e., $\mathcal{F}:\varepsilon\to\{0,1\}$

$$\mathcal{F}(a)= \begin{cases}1,& \text{if } a \text{ is a piece of fake news} \\ 0,& \text{otherwise.}\end{cases}$$
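
The definitions above can be written down as plain data structures. The sketch below is only an illustration: the class names, the field names, and `toy_detector` are hypothetical and do not come from the paper, and the keyword rule merely stands in for a real model $\mathcal{F}$.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class NewsArticle:
    """News article a with publisher features p_a and content features c_a."""
    publisher: dict  # e.g. {"name": ..., "domain": ..., "age": ...}
    content: dict    # e.g. {"headline": ..., "text": ...}


@dataclass
class Engagement:
    """One tuple e_it = (u_i, p_i, t)."""
    user: str
    post: str
    time: Optional[float]  # None plays the role of t = Null


@dataclass
class NewsInstance:
    """An article together with its social news engagements."""
    article: NewsArticle
    engagements: List[Engagement] = field(default_factory=list)


def toy_detector(instance: NewsInstance) -> int:
    """Placeholder for F: returns 1 if most timed posts contain the word 'fake'.

    This rule is an assumption made for illustration, not a method from the paper.
    """
    posts = [e.post.lower() for e in instance.engagements if e.time is not None]
    if not posts:
        return 0
    flagged = sum("fake" in p for p in posts)
    return 1 if flagged * 2 > len(posts) else 0
```

Any real detector would keep the same shape (an article plus its engagements in, a label in {0, 1} out) and replace the keyword rule with a learned model.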

Fake news detection is therefore treated as a binary classification problem.
The paper proposes a general data mining framework for fake news detection, which includes two phases

  • feature extraction
  • model construction

Feature Extraction

  • In traditional news media, detection focused on the news content
  • On social media, additional information (social context) can be added

News Content Features $\vec{c}$
Attributes: source, headline, body text, image/video

  • lexical features, including character-level and word-level features such as total words, characters per word, frequency of large words, and unique words
  • syntactic features, such as the frequency of function words and phrases, bag-of-words (BOW), or part-of-speech (POS) tagging
  • visual features, such as sensational or even fake images
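
As a rough illustration of the lexical features listed above, the following sketch computes the four word-level statistics for a piece of text. The function name `lexical_features`, the tokenization regex, and the threshold of 7 characters for a "large" word are assumptions made for this example; the paper does not specify them.

```python
import re
from typing import Dict


def lexical_features(text: str, large_word_len: int = 7) -> Dict[str, float]:
    """Compute simple word-level lexical features of a news text."""
    # Naive tokenization: runs of letters and apostrophes, lowercased.
    words = re.findall(r"[A-Za-z']+", text.lower())
    total = len(words)
    if total == 0:
        return {"total_words": 0, "chars_per_word": 0.0,
                "large_word_freq": 0.0, "unique_words": 0}
    chars = sum(len(w) for w in words)
    large = sum(len(w) >= large_word_len for w in words)
    return {
        "total_words": total,
        "chars_per_word": chars / total,   # average word length
        "large_word_freq": large / total,  # share of "large" words
        "unique_words": len(set(words)),
    }
```

For example, `lexical_features("Shocking news shocking everyone")` yields 4 total words, 7.0 characters per word, a large-word frequency of 0.75, and 3 unique words.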

Social Context Features
Social context features can also be derived from the user-driven social engagements of news consumption on social media platforms.

  • User-based
    Fake news pieces are likely to be created and spread by non-human accounts, such as social bots or cyborgs.

    Individual level: registration age, number of followers/followees, number of tweets, etc.

    Group level: The assumption is that the spreaders of fake news and real news may form different communities with unique characteristics. Group-level features are obtained by aggregating individual-level features, e.g., by averaging and weighting.

  • Post-based
    Post-level features: stance (supporting/denying), topic (e.g., via LDA), and credibility

    Group-level features: aggregate the feature values of all relevant posts for a specific news article

    Temporal-level features: consider the temporal variations of post-level features (e.g., with an RNN)

  • Network-based
    Features are extracted by constructing specific networks among the users who published related social media posts

    Stance network: nodes indicate all the tweets relevant to the news, and edges indicate the weights of similarity of stances

    Co-occurrence network: built from the user engagements by counting whether those users write posts relevant to the same news articles

    Friendship network: indicates the following/followee structure of users who post related tweets
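
Two of the social context features above are easy to sketch in code: group-level user features obtained by averaging individual-level ones, and a co-occurrence network over users. The function names, the input formats, and the choice of a plain unweighted average are assumptions made for this example rather than details given in the paper.

```python
from collections import defaultdict
from itertools import combinations
from typing import Dict, List, Set, Tuple


def group_level_features(users: List[Dict[str, float]]) -> Dict[str, float]:
    """Average each individual-level feature over the users engaging with one article."""
    if not users:
        return {}
    keys = users[0].keys()  # assumes every user dict has the same keys
    return {k: sum(u[k] for u in users) / len(users) for k in keys}


def co_occurrence_network(posts: List[Tuple[str, str]]) -> Dict[Tuple[str, str], int]:
    """Build a co-occurrence network from (user, article_id) pairs.

    The edge weight between two users is the number of news articles
    that both of them wrote posts about.
    """
    users_by_article: Dict[str, Set[str]] = defaultdict(set)
    for user, article in posts:
        users_by_article[article].add(user)

    edges: Dict[Tuple[str, str], int] = defaultdict(int)
    for users in users_by_article.values():
        # Sorting keeps each undirected edge under a single (u, v) key.
        for u, v in combinations(sorted(users), 2):
            edges[(u, v)] += 1
    return dict(edges)
```

The resulting edge dictionary can be fed into any graph library to compute network metrics such as degree or clustering coefficient, which then serve as network-based features.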

Model Construction

The paper discusses the details of the model construction process for several existing approaches, categorizing existing methods by their main input sources: News Content Models and Social Context Models

News Content Models

  • Knowledge-based: Use external sources to fact-check
  • Style-based: Detect fake news by capturing the manipulators in the writing style of news content

Social Context Models

  • Stance-based: Utilize users’ viewpoints from relevant post contents to infer the veracity of original news articles
  • Propagation-based: The basic assumption is that the credibility of a news event is highly related to the credibilities of relevant social media posts
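
To make the stance-based idea concrete, here is a deliberately simple sketch: each relevant post carries a stance (supporting or denying) and a weight, and the article is flagged as fake when the weighted stances lean toward denial. The function `stance_vote`, the +1/-1 encoding, and the weighting scheme are assumptions made for this example; they are not the method of the paper or of any specific approach it surveys.

```python
from typing import List, Tuple

# Hypothetical numeric encoding of the two stances.
STANCE_VALUE = {"supporting": 1.0, "denying": -1.0}


def stance_vote(posts: List[Tuple[str, float]]) -> int:
    """Infer veracity from (stance, weight) pairs of relevant posts.

    Returns 1 (fake) if the weighted sum of stances is negative,
    i.e. denying posts outweigh supporting ones, and 0 otherwise.
    """
    score = sum(STANCE_VALUE[stance] * weight for stance, weight in posts)
    return 1 if score < 0 else 0
```

The weight could be, for instance, a credibility score of the post or its author, which is where stance-based and propagation-based ideas begin to overlap.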


Datasets

  • BuzzFeedNews
    Sample of news published on Facebook by 9 news agencies over a week close to the 2016 U.S. election, from September 19 to 23 and September 26 and 27

  • LIAR
    No link. Data collected using the PolitiFact API

  • BS Detector
    Collected from a browser extension developed for checking news veracity (GitHub)

  • CREDBANK
    This is a large-scale crowdsourced dataset of approximately 60 million tweets that cover 96 days. All the tweets are grouped into over 1,000 news events, with each event assessed for credibility by 30 annotators from Amazon Mechanical Turk

    CREDBANK was originally collected for tweet credibility assessment, so the tweets in this dataset are not really the social engagements for specific news articles

  • FakeNewsNet
    This dataset includes all of the news content and social context features mentioned above, with reliable ground-truth fake news labels (introduced in this paper)

    A pretty good dataset. It might be worth trying out

The paper just ends without any experimental results…
