Fake News Detection on Social Media: A Data Mining Perspective

paper | dataset

The paper views the fake news detection problem from two perspectives:

Characterization
Detection

Contribution

Discuss narrow and broad definitions of fake news that cover most existing definitions in the literature, and present the unique characteristics of fake news on social media and its implications compared with traditional media
Give an overview of existing fake news detection methods with a principled way to group representative methods into different categories
Discuss several open issues and provide future directions of fake news detection in social media

Fake News Characterization

Introduce the basic social and psychological theories
Discuss more advanced patterns introduced by social media

Definitions of Fake News

In the narrow sense, fake news refers to news that:

is created with intent,
is verifiably false, and
can mislead readers.

In the broad sense, definitions of fake news focus on only one of the two: the authenticity or the intent of the news content. Example: satire.

Definition
The definition used in this paper is as follows:

Fake news is a news article that is intentionally and verifiably false

Under this definition, the following five are not fake news:

satire news with proper context
rumors that did not originate from news events
conspiracy theories
misinformation that is created unintentionally
hoaxes that are only motivated by fun or to scam targeted individuals

Psychological Foundations of Fake News

Humans are naturally not very good at differentiating between real and fake news

There are two major factors which make consumers naturally vulnerable to fake news

Naive Realism: consumers tend to believe that their own perception of reality is the only accurate view, and that those who disagree are uninformed or biased
Confirmation Bias: consumers prefer information that confirms their existing views

Unique characteristics of fake news on social media

Malicious Accounts on Social Media for Propaganda
social bots, cyborg users, and trolls
Echo Chamber Effect
Consumers are selectively exposed to certain kinds of news because of the way news feeds appear on their homepage in social media

Fake News Detection

problem definition
propose approaches for fake news detection

Problem Definition

Notation:

$a$ : News Article: It consists of Publisher and Content
Publisher $\vec{p_a}$ includes a set of profile features to describe the original author (name, domain, age, among other attributes)
Content $\vec{c_a}$ consists of a set of attributes that represent the news article (headline, text, image, etc.)
Define Social News Engagements as a set of tuples $\varepsilon=\{e_{it}\}$ to represent the process of how news spreads over time among $n$ users $\mathcal{U}=\{u_1, u_2, ..., u_n\}$ and their corresponding posts $\mathcal{P}=\{p_1, p_2, ..., p_n\}$ on social media regarding news article $a$

$e_{it} = \{u_i, p_i, t\}$ represents that a user $u_i$ spreads news article $a$ using $p_i$ at time $t$

If article $a$ has no engagements yet, then $t=Null$ and $u_i$ represents the publisher

Given the social news engagements $\varepsilon$ among $n$ users for news article $a$, the task of fake news detection is to predict whether the news article $a$ is a fake news piece or not, i.e., $\mathcal{F}:\varepsilon\to\{0,1\}$ such that

$\mathcal{F}(a)= \begin{cases}1,& \text{if } a \text{ is a piece of fake news} \\ 0,& \text{otherwise.}\end{cases}$
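To make the notation concrete, here is a minimal Python sketch of the problem setup. The names `Engagement`, `Detector`, and `toy_detector` are my own, and the keyword rule inside `toy_detector` is just a hypothetical placeholder to show the shape of $\mathcal{F}$, not a method from the paper.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass(frozen=True)
class Engagement:
    """One tuple e_it = {u_i, p_i, t} (class name is an assumption)."""
    user: str               # u_i; the publisher when the article has no engagements
    post: Optional[str]     # p_i; text of the post spreading article a
    time: Optional[float]   # t; None plays the role of Null


# F maps the engagements of one article to {0, 1}.
Detector = Callable[[List[Engagement]], int]


def toy_detector(engagements: List[Engagement]) -> int:
    """Hypothetical placeholder rule (NOT from the paper):
    return 1 (fake) when more than half of the posts mention 'hoax'."""
    posts = [e.post for e in engagements if e.post is not None]
    if not posts:
        return 0  # no social signal available, default to "not fake"
    flagged = sum("hoax" in p.lower() for p in posts)
    return 1 if flagged * 2 > len(posts) else 0
```

Any real detector would replace the keyword rule with the feature extraction and model construction steps described in the rest of this post; only the input and output types carry over.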

Binary Classification problem
General data mining framework for fake news detection which includes two phases

feature extraction
model construction

Feature Extraction

In traditional news media, detection focused on news content
On social media, additional information can be used as well

News Content Features $\vec{c}$
source, headline, body text, image/video

lexical features, including character level and word-level features such as total words, characters per word, frequency of large words, and unique words
syntactic features, including sentence-level features such as frequency of function words and phrases, bag-of-words (BOW), or part-of-speech (POS) tagging
sensational or even fake images
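The lexical features listed above are easy to compute directly. The sketch below is one possible way to do it with the standard library only; the function name `lexical_features`, the simple regex tokenizer, and the 7-character threshold for a "large" word are all assumptions of mine, not values from the paper.

```python
import re
from typing import Dict


def lexical_features(text: str, large_word_len: int = 7) -> Dict[str, float]:
    """Compute a few word-level and character-level features of a text.

    `large_word_len` is an assumed threshold for what counts as a large word.
    """
    # Naive tokenizer: lowercase runs of letters and apostrophes.
    words = re.findall(r"[A-Za-z']+", text.lower())
    total = len(words)
    if total == 0:
        return {"total_words": 0, "chars_per_word": 0.0,
                "large_word_freq": 0.0, "unique_words": 0}
    return {
        "total_words": total,
        "chars_per_word": sum(len(w) for w in words) / total,
        "large_word_freq": sum(len(w) >= large_word_len for w in words) / total,
        "unique_words": len(set(words)),
    }


feats = lexical_features("Shocking news shocking everyone")
```

For this example headline, `feats` has 4 total words, 3 unique words, 7.0 characters per word, and a large-word frequency of 0.75.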

Social Context Features
Social context features can also be derived from the user-driven social engagements of news consumption on social media platforms.

User-based
Fake news pieces are likely to be created and spread by non-human accounts, such as social bots or cyborgs.

Individual level: registration age, number of followers/followees, number of tweets, etc.

Group level: the assumption is that the spreaders of fake news and real news may form different communities with unique characteristics. Group-level features are commonly obtained by aggregating individual-level features, e.g., by averaging or weighting them.
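As a rough illustration of the group-level idea, the sketch below averages individual-level features over all users who spread one article. The function name `group_level_features` and the feature keys (`followers`, `account_age_days`) are hypothetical; plain averaging is only one of the aggregation options mentioned above.

```python
from statistics import mean
from typing import Dict, List


def group_level_features(users: List[Dict[str, float]]) -> Dict[str, float]:
    """Average each individual-level feature over the spreaders of one article.

    Assumes every user dict has the same keys as the first one.
    """
    if not users:
        return {}
    keys = list(users[0].keys())
    return {f"avg_{k}": mean([u[k] for u in users]) for k in keys}


spreaders = [
    {"followers": 100, "account_age_days": 10},
    {"followers": 300, "account_age_days": 30},
]
group = group_level_features(spreaders)
```

Here `group` contains `avg_followers` of 200 and `avg_account_age_days` of 20. A weighted version would replace `mean` with a weighted sum using some per-user weight.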
Post-based
Post level features: stance(supporting/denying), topic(LDA) and credibility

Group level features: aggregate the features values for all relevant posts for specific news articles

Temporal level features: consider the temporal variations of post level features (RNN)
Network-based
features are extracted via constructing specific networks among the users who published related social media posts

Stance network: nodes indicate all the tweets relevant to the news, and edges indicate the weights of similarity of stances

Co-occurrence network: built from the user engagements by counting whether those users write posts relevant to the same news articles

Friendship network: indicates the following/followee structure of users who post related tweets
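Of the three networks, the co-occurrence network is the simplest to build from raw engagement data. The following sketch is one possible construction: the function name `co_occurrence_network` and the `(user_id, article_id)` input format are assumptions, and the edge weight is taken to be the number of articles both users posted about.

```python
from collections import defaultdict
from itertools import combinations
from typing import Dict, Iterable, Set, Tuple


def co_occurrence_network(
    engagements: Iterable[Tuple[str, str]]
) -> Dict[Tuple[str, str], int]:
    """Build weighted user-user edges from (user_id, article_id) pairs.

    The weight of edge (u, v) is the number of articles both users posted about.
    """
    users_by_article: Dict[str, Set[str]] = defaultdict(set)
    for user, article in engagements:
        users_by_article[article].add(user)

    weights: Dict[Tuple[str, str], int] = defaultdict(int)
    for users in users_by_article.values():
        # Sort so each undirected edge gets a single canonical key.
        for u, v in combinations(sorted(users), 2):
            weights[(u, v)] += 1
    return dict(weights)


pairs = [("alice", "a1"), ("bob", "a1"),
         ("alice", "a2"), ("bob", "a2"), ("carol", "a2")]
net = co_occurrence_network(pairs)
```

In this example alice and bob share two articles, so their edge has weight 2, while each edge involving carol has weight 1. Network features such as degree or clustering coefficient could then be computed on top of this edge list.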

Model Construction

Discuss the details of the model construction process for several existing approaches, categorizing existing methods based on their main input sources as: News Content Models and Social Context Models

News Content Models

Knowledge-based: use external sources to fact-check claims in news content
Style-based: detect fake news by capturing signs of manipulation in the writing style of news content

Social Context Models

Stance-based: utilize users’ viewpoints from relevant post contents to infer the veracity of original news articles
Propagation-based: the basic assumption is that the credibility of a news event is highly related to the credibilities of relevant social media posts
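To give a feel for how post-level signals could be turned into an article-level prediction, here is a deliberately simple sketch that combines the two ideas: each post contributes its stance, weighted by a credibility score. The function `stance_vote`, the +1/-1 stance encoding, and the weights are all hypothetical; the approaches surveyed in the paper are considerably more elaborate than a weighted vote.

```python
from typing import List, Tuple


def stance_vote(posts: List[Tuple[int, float]]) -> int:
    """Aggregate (stance, credibility_weight) pairs into a fake/not-fake label.

    stance is +1 for a post supporting the article and -1 for one denying it.
    Returns 1 (predicted fake) when weighted denial outweighs weighted support.
    """
    score = sum(stance * weight for stance, weight in posts)
    return 1 if score < 0 else 0


label = stance_vote([(+1, 0.2), (-1, 0.9), (-1, 0.5)])
```

In this example the two denying posts carry more total weight than the single supporting post, so `label` is 1.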

Datasets

BuzzFeedNews
A sample of news published on Facebook by 9 news agencies over a week close to the 2016 U.S. election, from September 19 to 23 and September 26 and 27
LIAR
No link. Data collected using the PolitiFact API
BS Detector
Collected from a browser extension developed for checking news veracity (GitHub)
CREDBANK
This is a large-scale crowdsourced dataset of approximately 60 million tweets that cover 96 days. All the tweets are broken down to be related to over 1,000 news events, with each event assessed for credibility by 30 annotators from Amazon Mechanical Turk

CREDBANK was originally collected for tweet credibility assessment, so the tweets in this dataset are not really the social engagements for specific news articles
FakeNewsNet
This dataset includes all mentioned news content and social context features with reliable ground truth fake news labels (presented from this paper)

A fairly good dataset. It might be worth trying out

The paper just ends without any experimental results…

Yumere
