argument extraction from news, blogs and social media

Argument Extraction from News, Blogs, and Social Media

Theodosis Goudas, Christos Louizos, Georgios Petasis, Vangelis Karkaletsis

Presented by:

Sharath T.S

Shubhangi Tandon

What is Argument Extraction?

An argument can usually be decomposed into a claim and one or more premises justifying it.

Argument extraction is the task of identifying arguments, along with their components, in text.

It is difficult even for humans to distinguish whether a part of a sentence contains an argument element or not.

Why social media?

● The most widely used and accessible platform for seeking advice or expressing opinions

● A storehouse of both meaningful and meaningless information about recent trends and topics.

● Almost no prior research in this field; only one publication related to product reviews on Amazon.

Why is this difficult?

● Almost no prior research in this field; only one publication related to product reviews on Amazon!

● Text from social media may not always contain arguments

● Arguments are expressed informally and do not follow any formal guidelines or specific rules

● No widely used corpora exist against which approaches to argument extraction can be comparably evaluated.

● Traditional research in the area concentrates mainly on law documents and scientific publications.

Existing methods

● Palau et al. [4,7]

○ Classify at the sentence level to identify possible argumentative sentences, using Naive Bayes, SVM and maximum entropy classifiers.

○ Identify groups of sentences that refer to the same argument, using semantic distance based on the relatedness of the words they contain

○ Detect clauses of sentences with a parsing tool; the clauses are classified as argumentative or not with a maximum entropy classifier

○ Argumentative clauses are classified into premises and claims with support vector machines

○ Evaluated on the Araucaria corpus and the ECHR corpus [11], achieving accuracies of 73% and 80% respectively

● A rule-based system: takes as input an argumentation scheme and an ontology concerning an object, for example a camera and its characteristic features. Argumentation schemes are populated with discourse indicators and domain-specific features, and rules are constructed.

○ Applied argument extraction to product reviews in an electronic shop, a setting related to social media.

Proposed Method

The proposed automatic argument extraction is a two-step process:

Step A: Identification of argumentative sentences (supervised classification using standard classifiers: Logistic Regression, Random Forest, Support Vector Machines, Naive Bayes)

Step B: Extraction of claims and premises (from the output of Step A, using Conditional Random Fields)

Feature Selection for Corpus

State-of-the-art features:

● Position

● Comma token number

● Connective number

● Verb number

● Word number

● Cue words

● Number of verbs in passive voice

● Domain entities number

● Adverb number

● Word mean length
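A few of these surface features can be computed with very little code. The following is only a minimal sketch: the function name, the tokenisation, and the small English connective list are illustrative assumptions (the corpus is Greek and the slides do not enumerate the actual word lists), and "position" is assumed to be the index of the sentence within its document.

```python
import re

# Hypothetical connective list; the slides do not enumerate the real one.
CONNECTIVES = {"because", "therefore", "but", "since", "however"}

def surface_features(sentence, position):
    """Compute a handful of the surface features for one sentence."""
    # Split into word tokens and single punctuation tokens.
    tokens = re.findall(r"\w+|[^\w\s]", sentence.lower())
    words = [t for t in tokens if t[0].isalnum()]
    mean_length = sum(len(w) for w in words) / len(words) if words else 0.0
    return {
        "position": position,  # assumed: sentence index in the document
        "comma_number": tokens.count(","),
        "connective_number": sum(1 for w in words if w in CONNECTIVES),
        "word_number": len(words),
        "word_mean_length": mean_length,
    }

features = surface_features("Wind farms are noisy, but they are clean.", position=3)
```

Features such as verb, adverb and passive-voice counts would additionally need a part-of-speech tagger, which this sketch leaves out.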

Feature Selection for Corpus (contd.)

New Domain Specific Features:

Adjective number: Number of adjectives in a sentence. In argumentation, opinions towards an entity or claim are usually expressed through adjectives.

Entities in previous sentences: Number of entities in the nth previous sentence. With a history of n = 5 sentences this yields five features. It correlates with the probability that the current sentence contains an argument element.

Cumulative number of entities in previous sentences: Total number of entities from the previous n sentences. With a history of n = 5 this yields four features.

Ratio of distributions: Two language models are created, one from sentences that contain argument elements and one from sentences that do not. The feature is the ratio between these two distributions, based on unigrams, bigrams and trigrams of words. It can be described as P(X | sentence contains an argument element) / P(X | sentence does not contain an argument element), where X is a unigram, bigram or trigram.

Distributions over unigrams, bigrams, trigrams of part of speech tags (POS tags): Identical to the ratio of distributions above, with the exception that unigrams, bigrams and trigrams are extracted from the part of speech tags instead of words.
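The entity-history features and the ratio of distributions can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the four cumulative features are assumed to be totals over the previous 2 to 5 sentences, and add-one smoothing is an assumption introduced here to avoid division by zero.

```python
from collections import Counter

def entity_history_features(entity_counts, i, n=5):
    """History features for sentence i, given the entity count of each sentence."""
    # previous[k-1] = entities in the k-th previous sentence (0 if it does not exist).
    previous = [entity_counts[i - k] if i - k >= 0 else 0 for k in range(1, n + 1)]
    # Assumed: cumulative totals over the previous 2..n sentences, giving n - 1
    # features (a total over one sentence would just repeat previous[0]).
    cumulative = [sum(previous[:k]) for k in range(2, n + 1)]
    return previous, cumulative

def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[j:j + n]) for j in range(len(tokens) - n + 1)]

def ratio_model(arg_sentences, non_arg_sentences, n=1):
    """Return a function scoring an n-gram by P(x | arg) / P(x | non-arg)."""
    arg = Counter(g for s in arg_sentences for g in ngrams(s, n))
    non = Counter(g for s in non_arg_sentences for g in ngrams(s, n))
    vocab = len(set(arg) | set(non))
    arg_total, non_total = sum(arg.values()), sum(non.values())

    def ratio(gram):
        # Add-one smoothing is an assumption of this sketch.
        p_arg = (arg[gram] + 1) / (arg_total + vocab)
        p_non = (non[gram] + 1) / (non_total + vocab)
        return p_arg / p_non

    return ratio

previous, cumulative = entity_history_features([2, 0, 1, 3, 0, 4, 1], i=6)
ratio = ratio_model([["turbines", "cause", "noise"]], [["the", "weather", "is", "nice"]])
```

Passing lists of POS tags instead of words to `ratio_model` gives the POS-tag variant of the feature.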

Step B: Argument Extraction with CRF

Why Conditional Random Fields?

Structured prediction algorithm

Can take local context into consideration (helps maintain linguistic aspects such as the word ordering in the sentence)

Features:

The words in these sentences

Gazetteer lists of known entities for the thematic domain related to the arguments we want to extract

Gazetteer lists of cue words and indicator phrases

Lexica of verbs and adjectives automatically acquired using Term Frequency - Inverse Document Frequency (TF-IDF) between two “documents” (the text classified as argumentative and the text classified as non-argumentative in Step A)
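The TF-IDF lexicon acquisition could look roughly like this. It is a minimal sketch with a hypothetical function name and toy English tokens: with only two "documents", a term that occurs in both has an IDF of log(2/2) = 0, so only terms specific to the argumentative text receive a positive score. Restricting candidates to verbs and adjectives would need a POS tagger and is omitted here.

```python
import math
from collections import Counter

def tfidf_lexicon(arg_tokens, non_arg_tokens, top_k=3):
    """Rank terms of the argumentative 'document' by TF-IDF over two documents."""
    docs = [Counter(arg_tokens), Counter(non_arg_tokens)]

    def idf(term):
        # Document frequency is 1 or 2 here, so IDF is log(2) or 0.
        df = sum(1 for d in docs if term in d)
        return math.log(len(docs) / df)

    total = sum(docs[0].values())
    scores = {t: (c / total) * idf(t) for t, c in docs[0].items()}
    # Sort by descending score, breaking ties alphabetically.
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    return [t for t in ranked[:top_k] if scores[t] > 0]

lexicon = tfidf_lexicon(
    ["pollute", "harm", "pollute", "is", "build"],
    ["is", "build", "announce"],
)
```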

Corpus Preparation

● 204 documents (in Greek) collected from social media

● Thematic domain of Renewable Energy Sources

● Selected documents were manually annotated with domain entities and text segments that correspond to argument premises.

● Claims are not represented in the documents as segments, but are implied by the author as positive or negative views

760 sentences: Annotated as containing arguments

16000 sentences from 204 documents

[Figure: processing pipeline in Ellogon, with Step A, Step B and the final output]

Evaluation: Base Case

Simple base case classifier:

1. Manually annotated segments (argument components) used to form a gazetteer.

2. Applied on the corpus in order to detect all exact matches of all these segments.

a. All segments identified are marked as argumentative segments

b. All sentences that contain at least one argumentative segment identified by the gazetteer are characterised as argumentative sentences.

3. Argumentative segments/sentences are compared to “gold” counterparts, manually annotated by humans.

a. Sentences that contain these recognized fragments are marked as argumentative for the first step base case.

b. Segments marked as argumentative are evaluated for the second step base case.

4. Results are obtained through 10-fold cross validation on the whole corpus (all 16k sentences)
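The gazetteer base case described above amounts to exact string matching. A minimal sketch, assuming a hypothetical function name and toy English data, and treating "exact match" as a plain substring test:

```python
def gazetteer_baseline(sentences, gazetteer):
    """Mark exact matches of known segments and the sentences that contain them."""
    results = []
    for sentence in sentences:
        # Every gazetteer segment found verbatim is an argumentative segment.
        matches = [segment for segment in gazetteer if segment in sentence]
        results.append({
            "sentence": sentence,
            "segments": matches,
            # A sentence is argumentative if it has at least one match.
            "argumentative": bool(matches),
        })
    return results

gazetteer = ["generate noise", "reduce emissions"]
output = gazetteer_baseline(
    ["Wind turbines generate noise in the summer", "The park opened in May"],
    gazetteer,
)
```

Under cross validation, the gazetteer would presumably be built only from the training folds, so that segments seen only in the test fold cannot be matched.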

Evaluation: Step A

Each sentence is represented as a fixed-size feature vector using the features described above, together with its class label (supervised learning).

Classifiers tested: Support Vector Machines, Naive Bayes, Random Forest and Logistic Regression.

Precision, Recall, F1-measure and Accuracy were used for evaluation.

Logistic Regression and Naive Bayes performed the best.

The initial data set is heavily skewed towards non-argumentative sentences (760 of roughly 16,000 sentences are annotated as containing arguments). Therefore, data sampling and testing were done in two different ways:

Way #1

● Sampling: randomly ignore negative examples, so that the result set contains an equal number of instances from both classes

● Evaluation: 10-fold cross validation; achieved high accuracy

Way #2

● Sampling: split the initial data set in the ratio 70:30 for testing and training

● Evaluation: achieved 49% accuracy; discarded
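The sampling of Way #1 is random undersampling of the majority class. A minimal sketch, assuming examples are (features, label) pairs with label 1 for argumentative sentences; the function name and the fixed seed are illustrative choices:

```python
import random

def undersample(examples, seed=0):
    """Balance (features, label) pairs by randomly dropping negative examples."""
    positives = [e for e in examples if e[1] == 1]
    negatives = [e for e in examples if e[1] == 0]
    rng = random.Random(seed)
    # Keep only as many negatives as there are positives.
    kept = rng.sample(negatives, min(len(positives), len(negatives)))
    balanced = positives + kept
    rng.shuffle(balanced)
    return balanced

data = [({"id": i}, 1) for i in range(3)] + [({"id": i}, 0) for i in range(3, 23)]
balanced = undersample(data)
```

The balanced set would then be passed to 10-fold cross validation. Note that accuracy measured on a balanced sample does not reflect the class skew of the original corpus.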

Evaluation: Step B

To use a CRF, the sentences need BIO tagging:

B for starting a text segment (premise),

I for a token in a premise other than the first, and

O for all other tokens (outside of the premise segment)

Example for “Wind turbines generate noise in the summer”
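The tagging scheme can be sketched in a few lines. The premise boundaries used below are a hypothetical choice made for illustration (the first four tokens); the annotation shown on the original slide is not reproduced here.

```python
def bio_tags(tokens, premise_start, premise_end):
    """Tag tokens[premise_start:premise_end] as one premise, everything else as O."""
    tags = []
    for i in range(len(tokens)):
        if i == premise_start:
            tags.append("B")  # first token of the premise
        elif premise_start < i < premise_end:
            tags.append("I")  # later token inside the premise
        else:
            tags.append("O")  # outside the premise segment
    return tags

tokens = "Wind turbines generate noise in the summer".split()
# Assumed premise span: "Wind turbines generate noise" (tokens 0 to 3).
tags = bio_tags(tokens, premise_start=0, premise_end=4)
tagged = list(zip(tokens, tags))
```

Each (token, tag) pair, together with the token-level features listed for Step B, forms one position of the sequence the CRF is trained on.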

Final Result after CRF

Baseline Results

What did we think ?

Questions/ Observations/Inputs?

Appendix

Evaluation Results

Step A

Evaluation Results: Step A (contd.)
