
Prompt-Learning for Fine-Grained Entity Typing

Ning Ding1*, Yulin Chen3*, Xu Han1,5, Guangwei Xu2, Pengjun Xie2, Hai-Tao Zheng3†, Zhiyuan Liu1,5†, Juanzi Li1, Hong-Gee Kim4

1 Department of Computer Science and Technology, Tsinghua University  2 Alibaba Group  3 SIGS, Tsinghua University  4 Seoul National University
5 State Key Lab on Intelligent Technology and Systems, Tsinghua University

{dingn18, yl-chen21, hanxu17}@mails.tsinghua.edu.cn

Abstract

As an effective approach to tune pre-trained language models (PLMs) for specific tasks, prompt-learning has recently attracted much attention from researchers. By using cloze-style language prompts to stimulate the versatile knowledge of PLMs, prompt-learning can achieve promising results on a series of NLP tasks, such as natural language inference, sentiment classification, and knowledge probing. In this work, we investigate the application of prompt-learning on fine-grained entity typing in fully supervised, few-shot and zero-shot scenarios. We first develop a simple and effective prompt-learning pipeline by constructing an entity-oriented verbalizer and templates and conducting masked language modeling. Further, to tackle the zero-shot regime, we propose a self-supervised strategy that carries out distribution-level optimization in prompt-learning to automatically summarize the information of entity types. Extensive experiments on three fine-grained entity typing benchmarks (with up to 86 classes) under fully supervised, few-shot and zero-shot settings show that prompt-learning methods significantly outperform fine-tuning baselines, especially when the training data is insufficient.

1 Introduction

In recent years, pre-trained language models (PLMs) have been widely explored and become a key instrument for natural language understanding (Devlin et al., 2019; Liu et al., 2019) and generation (Radford et al., 2018; Raffel et al., 2020). By applying self-supervised learning on large-scale unlabeled corpora, PLMs can capture rich lexical (Jawahar et al., 2019), syntactic (Hewitt and Manning, 2019; Wang et al., 2021), and factual knowledge (Petroni et al., 2019) that well benefits downstream NLP tasks.

* Equal contribution.  † Corresponding authors.

[Figure 1: Examples of prompt-learning that stimulate the knowledge of PLMs by formalizing specific tasks as equivalent cloze-style tasks: knowledge probing ("[CLS] iPhone is produced by [MASK]. [SEP]" → Apple), sentiment classification (~2 classes; "[CLS] I like this. It was [MASK]. [SEP]" → Great → Positive), natural language inference (~3 classes; "[CLS] What happened to his lab? [MASK], his lab was torn down in 1904. [SEP]" → Yes → Entailment), and entity typing (>46 classes; "[CLS] Bob Dylan, the author of 'Blowing in the Wind', won the Nobel Prize in Literature in 2016. Bob Dylan is [MASK]. [SEP]" → label words such as Singer, Writer, ... → class set such as PERSON-ARTIST, PERSON-AUTHOR).]

Considering the versatile knowledge contained in PLMs, many efforts of researchers have been devoted to stimulating task-specific knowledge in PLMs and adapting such knowledge to downstream NLP tasks. Fine-tuning with extra classifiers has been one typical solution for adapting PLMs to specific tasks and achieves promising results on various NLP tasks (Qiu et al., 2020; Han et al., 2021a).

Some recent efforts on probing knowledge of PLMs show that, by writing some natural language prompts, we can induce PLMs to complete factual knowledge (Petroni et al., 2019). GPT-3 further utilizes the information provided by prompts to conduct few-shot learning and achieves awesome results (Brown et al., 2020). Inspired by this, prompt-learning has been introduced. As shown


in Figure 1, in prompt-learning, downstream tasks are formalized as equivalent cloze-style tasks, and PLMs are asked to handle these cloze-style tasks instead of the original downstream tasks. Compared with conventional fine-tuning methods, prompt-learning does not require extra neural layers and intuitively bridges the objective form gap between pre-training and fine-tuning. Sufficient empirical analysis shows that, either for manually picking hand-crafted prompts (Liu et al., 2021b; Han et al., 2021b) or automatically building auto-generated prompts (Shin et al., 2020; Gao et al., 2020; Lester et al., 2021), taking prompts for tuning models is surprisingly effective for the knowledge stimulation and model adaptation of PLMs, especially in the low-data regime.

Intuitively, prompt-learning is applicable to fine-grained entity typing, which aims at classifying marked entities from input sequences into specific types in a pre-defined label set. We discuss this topic with a motivating example: "He is from New York." By adding a prompt with a masking token [MASK], the sentence becomes "He is from New York. In this sentence, New York is [MASK]." Due to the wealth of knowledge acquired during pre-training, PLMs can compute a probability distribution over the vocabulary at the masked position and assign a relatively higher probability to the word "city" than to the word "person". In other words, with simple prompts, the abstract entity attributes contained in PLMs can be efficiently exploited, which is meaningful for downstream entity-related tasks.
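As a quick sanity check of this intuition, the sketch below queries an off-the-shelf masked language model with the prompted sentence through the Huggingface fill-mask pipeline; the model choice and the API call are ours for illustration and are not part of the pipeline proposed later in this paper.

```python
# Minimal illustration (not the paper's pipeline): query an off-the-shelf masked
# LM with the prompted sentence and inspect the top predictions for [MASK].
from transformers import pipeline

unmasker = pipeline("fill-mask", model="bert-base-cased")

prompted = "He is from New York. In this sentence, New York is [MASK]."
for pred in unmasker(prompted, top_k=5):
    print(f"{pred['token_str']:>12s}  {pred['score']:.4f}")
# Type-indicating words such as "city" are expected to score far higher than
# unrelated words such as "person" at the masked position.
```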

In this work, we comprehensively explore the application of prompt-learning to fine-grained entity typing in fully supervised, few-shot and zero-shot settings. Particularly, we first introduce a naive pipeline where we construct entity-oriented prompts and formalize fine-grained entity typing as a cloze-style task. This simple pipeline yields promising results in our experiments, especially when supervision is insufficient. Then, to tackle the zero-shot scenario where no explicit supervision exists in training, we develop a self-supervised strategy under our prompt-learning pipeline. Our self-supervised strategy attempts to automatically summarize entity types by optimizing the similarity of the predicted probability distributions of paired examples in prompt-learning.

Three popular benchmarks are used for our experiments, including FEW-NERD (Ding et al., 2021b), OntoNotes (Weischedel et al., 2013), and BBN (Weischedel and Brunstein, 2005). All these datasets have a complex type hierarchy consisting of rich entity types, requiring models to have good capabilities of entity attribute detection. Empirically, our method yields significant improvements on these benchmark datasets, especially under the zero-shot and few-shot settings. We also make an analysis and point out both the superiority and the bottleneck of prompt-learning in fine-grained entity typing, which may advance further efforts to extract entity attributes using PLMs. Our source code and pre-trained models will be publicly available.

2 Background

In this section, we first give a problem definition of the entity typing task (§ 2.1), followed by an introduction of conventional vanilla fine-tuning (§ 2.2) and prompt-based tuning (§ 2.3) with PLMs.

2.1 Problem Definition

The input of entity typing is a dataset D = {x1, ..., xn} with n sentences, and each sentence x contains a marked entity mention m. For each input sentence x, entity typing aims at predicting the entity type y ∈ Y of its marked mention m, where Y is a pre-defined set of entity types. Entity typing is typically regarded as a context-aware classification task. For example, in the sentence "London is the fifth album by the rock band Jesus Jones", the entity mention London should be classified as Music rather than Location. In the era of PLMs, using pre-trained neural language models (e.g., BERT) as the encoder and performing model tuning for classifying types becomes a standard paradigm.

2.2 Vanilla Fine-tuning

In the vanilla fine-tuning paradigm of entity typing, for each token ti in an input sequence x = [CLS] t1 ... m ... tT [SEP] with a marked entity mention m = ti, ..., tj, the PLM M produces the contextualized representations h[CLS], h1, ..., hT, h[SEP]. Empirically, we choose the embedding of the [CLS] token, h[CLS], as the final representation, which is fed into an output layer to predict the probability distribution over the label space:

P(y ∈ Y | x) = softmax(W h[CLS] + b),    (1)

where W and b are learnable parameters. W, b, and all parameters of the PLM are tuned by maximizing the objective function (1/n) Σ_{i=1}^{n} log P(yi | xi), where yi is the gold type label of xi.

[Figure 2: The illustration of prompt-learning for fine-grained entity typing with supervision. The input "London is one of the biggest cities in the world." is wrapped into a prompt that copies the entity mention ("... London is a [MASK]. [SEP]"), the MLM head predicts label words such as "city" and "location" at the [MASK] position, and the label words are mapped to the class set (e.g., LOCATION-CITY). We take the hard-encoding prompt strategy as an example in this figure.]
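For concreteness, a minimal sketch of the fine-tuning baseline of § 2.2 (Eq. 1) is given below, assuming a BERT-style encoder from the transformers library; the class name, the number of types, and the example sentence are illustrative.

```python
# Sketch of vanilla fine-tuning (Eq. 1): encode the sentence, take the [CLS]
# representation, and feed it into a randomly initialized linear output layer.
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class EntityTypingFineTuner(nn.Module):
    def __init__(self, model_name="bert-base-cased", num_types=66):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)
        hidden = self.encoder.config.hidden_size
        self.classifier = nn.Linear(hidden, num_types)  # W, b in Eq. 1

    def forward(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        h_cls = out.last_hidden_state[:, 0]                   # h_[CLS]
        return torch.softmax(self.classifier(h_cls), dim=-1)  # P(y | x)

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
batch = tokenizer(["London is the fifth album by the rock band Jesus Jones."],
                  return_tensors="pt")
probs = EntityTypingFineTuner()(batch["input_ids"], batch["attention_mask"])
```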

2.3 Prompt-based Tuning

In prompt-based tuning, for each label y ∈ Y, we define a label word set Vy = {w1, ..., wm}. Vy is a subset of the vocabulary V of the PLM M, i.e., Vy ⊆ V. By taking the union of the dictionaries corresponding to each label, we get an overall dictionary V*. For example, in sentiment classification, we could map the label y = POSITIVE into a set Vy = {great, good, wonderful, ...}. Another primary component of prompt-learning is a prompt template T(·), which modifies the original input x into a prompt input T(x) by adding a set of additional tokens at the end of x. Conventionally, a [MASK] token is added for PLMs to predict the missing label word w ∈ V*. Thus, in prompt-learning, a classification problem is transferred into a masked language modeling problem:

p(y ∈ Y | x) = p([MASK] = w ∈ Vy | T(x)).    (2)

3 Prompt-learning for Entity Typing: A Naive Pipeline

After being transferred into masked language modeling, the prompt-learning method is applicable to learning and aggregating type information of entities. In this section, we first introduce a naive but empirically strong baseline that utilizes prompts to extract entity types with explicit supervision, including the construction of label words (§ 3.1) and templates (§ 3.2), and the training procedure (§ 3.3). Such a simple pipeline yields remarkable results on three benchmark datasets. Then we propose a self-supervised prompt-learning method that automatically learns type information from unlabeled data (§ 4).

3.1 Label Words Set V*

Fine-grained entity typing datasets usually use a hierarchical label space, such as PERSON-ARTIST (FEW-NERD) and ORGANIZATION-PARTY (OntoNotes). In this case, we use all the words in the hierarchical label as the label words set for the entity type, e.g., y = LOCATION-CITY → Vy = {location, city}. As the entity types are all well-defined nouns with clear boundaries, it is also intuitive to expand the label words set with obtainable related nouns. For example, in Related Words1, the top-10 related words of the label word city are "metropolis, town, municipality, urban, suburb, municipal, megalopolis, civilization, downtown, country". These words are strongly related to the class CITY, and they are hardly mapped to other entity types, even ones under the same LOCATION class such as LOCATION-MOUNTAIN and LOCATION-ISLAND.

1 https://relatedwords.org

In masked language modeling, we use the confidence scores of all the words in Vy to construct the final score of the particular type y. That is, for an input x (which is mapped to T(x)) and its entity type y (which is mapped to Vy = {w1, ..., wm}), the conditional probability becomes

P(y | x) = (1/m) Σ_{j=1}^{m} λj P([MASK] = wj | T(x)),    (3)

where λj is a parameter that indicates the importance of the current word wj ∈ Vy. Note that λj could be heuristically defined or learned during the training procedure.
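A minimal sketch of this scoring step (Eq. 3) is shown below, assuming a BERT-style masked LM from the transformers library, a toy two-type verbalizer, and uniform weights λj = 1 (one of the heuristic choices mentioned above).

```python
# Sketch of Eq. 3: score each type by averaging the MLM probabilities of its
# label words at the [MASK] position. Toy verbalizer; uniform weights (λj = 1).
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
mlm = AutoModelForMaskedLM.from_pretrained("bert-base-cased")

verbalizer = {
    "LOCATION-CITY": ["location", "city", "town", "metropolis"],
    "PERSON-ARTIST": ["person", "artist", "singer", "musician"],
}

def type_scores(prompted_text):
    enc = tokenizer(prompted_text, return_tensors="pt")
    mask_pos = (enc["input_ids"][0] == tokenizer.mask_token_id).nonzero()[0, 0]
    with torch.no_grad():
        logits = mlm(**enc).logits[0, mask_pos]   # scores over the vocabulary
    probs = torch.softmax(logits, dim=-1)
    scores = {}
    for label, words in verbalizer.items():
        ids = [tokenizer.convert_tokens_to_ids(w) for w in words]
        scores[label] = probs[ids].mean().item()  # (1/m) Σ_j P([MASK]=wj | T(x))
    return scores

print(type_scores("He is from New York. In this sentence, New York is a [MASK]."))
```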

3.2 Templates

In this section, we construct entity-oriented prompts for the fine-grained entity typing task. We choose hard-encoding templates with natural language and soft-encoding templates with additional special tokens in our work.

For the choice of hard-encoding templates, we do not use automatic searching methods for discrete prompts, since the fine-grained entity typing task is clearly defined and the prompts are easily purposeful. We select simple declarative templates rather than hypernym templates to avoid grammatical errors. In the hard-encoding setting, we first copy the marked entity mention in x, then we add a few linking verbs and articles, followed by the [MASK] token. With the marked entity mention [Ent], we use the following templates:

T1(x) = x. [Ent] is [MASK].
T2(x) = x. [Ent] is a [MASK].
T3(x) = x. In this sentence, [Ent] is a [MASK].

where [Ent] is the entity mention in x. In § 5, we report the results of T3(·).

We also adopt the soft-encoding strategy, which introduces additional special tokens [P1], ..., [Pl] as the template, where l is a pre-defined hyper-parameter. The template begins with a delimiter [P] and a copy of the entity mention [Ent]. The complete template becomes

T4(x) = x [P] [Ent] [P1] ... [Pl] [MASK].

where each prompt embedding is randomly initialized and optimized during training. Intuitively, these special tokens can represent a cluster of words with similar semantics in the vocabulary.
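The soft-encoding template can be set up, for instance, by registering the special prompt tokens with the tokenizer and enlarging the embedding matrix so that the new rows can be trained; the sketch below assumes the transformers API, and the token strings are placeholders of our choosing.

```python
# Sketch of the soft-encoding template T4: add l learnable prompt tokens to the
# vocabulary so that their embeddings can be optimized during training.
from transformers import AutoTokenizer, AutoModelForMaskedLM

l = 2  # number of soft prompt tokens (a pre-defined hyper-parameter)
soft_tokens = ["[P]"] + [f"[P{i}]" for i in range(1, l + 1)]

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-cased")

tokenizer.add_special_tokens({"additional_special_tokens": soft_tokens})
model.resize_token_embeddings(len(tokenizer))  # new rows are randomly initialized

def soft_template(x, entity):
    # T4(x) = x [P] [Ent] [P1] ... [Pl] [MASK]
    return f"{x} [P] {entity} {' '.join(soft_tokens[1:])} {tokenizer.mask_token}"

print(soft_template("London is one of the biggest cities in the world.", "London"))
```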

3.3 Training and Inference

The hard- and soft-encoding strategies provide different initializations of templates, and both can be parameterized by φ and optimized along with M during training. We train the pre-trained model M (parameterized by θ) along with the additional prompt embeddings by using the cross-entropy loss function

L = − Σ log P(y | x; θ, φ).    (4)

For inference, we can directly use Eq. 3 to predict the label of the current input instance based on the predicted words at the [MASK] position.
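Putting the pieces together, the sketch below illustrates one optimization step of the objective in Eq. 4 on top of the label-word scores of Eq. 3; the toy verbalizer and the single-example "dataset" are illustrative, and a real run would batch over the full training set.

```python
# Sketch of prompt-based training (Eq. 4): cross-entropy over type scores that
# are aggregated from the MLM logits of each type's label words.
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
mlm = AutoModelForMaskedLM.from_pretrained("bert-base-cased")
optimizer = torch.optim.AdamW(mlm.parameters(), lr=5e-5)

verbalizer = {"LOCATION-CITY": ["city", "town"], "PERSON-ARTIST": ["artist", "singer"]}
types = list(verbalizer.keys())
word_ids = [[tokenizer.convert_tokens_to_ids(w) for w in verbalizer[t]] for t in types]

examples = [("He is from New York. In this sentence, New York is a [MASK].",
             "LOCATION-CITY")]

for text, gold in examples:
    enc = tokenizer(text, return_tensors="pt")
    mask_pos = (enc["input_ids"][0] == tokenizer.mask_token_id).nonzero()[0, 0]
    probs = mlm(**enc).logits[0, mask_pos].softmax(dim=-1)
    type_probs = torch.stack([probs[ids].mean() for ids in word_ids])  # Eq. 3
    loss = -torch.log(type_probs[types.index(gold)])                   # Eq. 4
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```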

This pipeline can be applied to the entity typing task with explicit supervision, and it is effective even if the training data are insufficient, i.e., the few-shot scenario (§ 5.5). Naturally, we consider a more extreme situation, that is, a scenario without any training data (the zero-shot scenario). In this setting, if we directly use an additional classifier to predict the label, the result is equivalent to random guessing because the parameters of the classifier are randomly initialized. If we use prompts to infer the label based on the predicted words, although the performance is significantly better than guessing, there will also be a catastrophic decline (§ 5.6). At this point, a question emerges: "Is it possible for PLMs to predict entity types without any explicit supervision?"

4 Self-supervised Prompt-learning for Zero-shot Entity Typing

With prompt-learning, the answer is yes, because in the pre-training stage the contexts of entities have already implied the corresponding type information, which provides an advantageous initialization point for the prompt-learning paradigm. For example, consider the input sentence wrapped with the T3(·) template: "Steve Jobs founded Apple. In this sentence, Steve Jobs is a [MASK]." In our observations, the probability of PLMs predicting person at the masked position is significantly higher than the probability of location. If we make reasonable use of this superior initialization point, it is possible for PLMs to automatically summarize the type information and finally extract the correct entity type.

4.1 Overview

In order to create conditions for PLMs to summarize entity types, we consider a self-supervised paradigm that optimizes the similarity of the probability distributions predicted by similar examples over a projected vocabulary V*. To achieve that in prompt-learning, we need to (1) impose a limit on the prediction range of the model, so that only the words we need, that is, words that express entity types, participate in the optimization of the gradient, and (2) provide an unlabeled dataset where entity mentions are marked without any types, to allow the model to learn the process of inducing type information in a self-supervised manner. The inputs contain a pre-trained model M, a pre-defined label schema Y, and a dataset without labels D = {x1, ..., xn} (entity mentions are marked without any types); our goal is to make M capable of automatically carrying out zero-shot entity typing after being trained on D and Y. Using prompt-learning as the training strategy, we first construct a label words set V* from Y, and for each sentence x in D we wrap it with a hard-encoding template containing a [MASK] symbol. The key idea is to make the prediction distributions of the same type of entities on V* as similar as possible. In this way, we can perform contrastive learning by sampling positive and negative examples, while ignoring the impact of other words that are not in V* on optimization during the MLM process.

[Figure 3: The illustration of self-supervised prompt-learning for fine-grained entity typing with unlabeled data and a pre-defined label schema. Two sentences from the unlabeled dataset D are wrapped with templates (the entity mention is copied, or randomly hidden with the [HIDE] symbol with probability α), the MLM head produces predictions over the projected label words V*, and the JS divergence between the two distributions is optimized. V* denotes the label words projected from the input label set. Note that we only show the positive pair in this figure.]
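The key operation behind this idea is to read the [MASK] prediction as a distribution over the projected vocabulary V* only; a minimal sketch, assuming the same masked-LM interface as in the earlier snippets:

```python
# Sketch: project the MLM prediction at [MASK] onto the label-word vocabulary V*,
# so that only type-indicating words take part in the self-supervised objective.
import torch

def distribution_over_vstar(mlm, tokenizer, prompted_text, vstar_token_ids):
    enc = tokenizer(prompted_text, return_tensors="pt")
    mask_pos = (enc["input_ids"][0] == tokenizer.mask_token_id).nonzero()[0, 0]
    logits = mlm(**enc).logits[0, mask_pos]
    # Renormalize over V* only, ignoring every other word in the vocabulary.
    return torch.softmax(logits[vstar_token_ids], dim=-1)   # P_V*(w | x)
```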

4.2 Self-supervised Learning

Although there are no labels in D, we can still develop a sampling strategy based on a simple hypothesis: the same entity in different sentences tends to have similar types. For instance, we would sample two sentences containing "Steve Jobs" as a positive pair. Moreover, considering that entity typing is context-aware ("Steve Jobs" could be an entrepreneur, designer, or philanthropist in different contexts), we choose to optimize the similarity between the distributions of the words over V*. This strategy not only softens the supervision but also eliminates the impact of other words in self-supervised learning.

Particularly, we randomly sample c positive pairs, i.e., sentence pairs that share one same entity mention, denoted as D̂pos, and c negative pairs, i.e., pairs of sentences with different entity mentions marked, denoted as D̂neg, from a large-scale entity-linked corpus D. To avoid generating false negative samples, the negative samples are further restricted by a large dictionary that contains common entities and their type information: only sentence pairs whose entities have different types in the dictionary are selected as negative samples. We then wrap the sentences with the hard-encoding template T3(·). To avoid overfitting to the entity names, we randomly hide the entity mention (in the original input and the template) with a special symbol [HIDE] with a probability of α. Empirically, α is set to 0.4.
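A sketch of this pair construction is given below; the corpus format, the dictionary lookup, and the helper names are illustrative assumptions rather than the exact implementation.

```python
# Sketch of pair construction: sentences sharing an entity mention form a
# positive pair, sentences with different (dictionary-verified) entities form a
# negative pair, and the mention is hidden with probability alpha.
import random

ALPHA = 0.4          # probability of replacing the mention with [HIDE]
HIDE_TOKEN = "[HIDE]"

def wrap(sentence, mention):
    # Hard-encoding template T3 with optional mention hiding.
    if random.random() < ALPHA:
        sentence = sentence.replace(mention, HIDE_TOKEN)
        mention = HIDE_TOKEN
    return f"{sentence} In this sentence, {mention} is a [MASK]."

def sample_pairs(corpus, entity_types, c):
    """corpus: list of (sentence, mention); entity_types: mention -> type."""
    by_mention = {}
    for sent, mention in corpus:
        by_mention.setdefault(mention, []).append(sent)
    positives, negatives = [], []
    while len(positives) < c:
        mention, sents = random.choice(list(by_mention.items()))
        if len(sents) >= 2:
            s1, s2 = random.sample(sents, 2)
            positives.append((wrap(s1, mention), wrap(s2, mention)))
    while len(negatives) < c:
        (s1, m1), (s2, m2) = random.sample(corpus, 2)
        # Keep only pairs whose dictionary types differ, to avoid false negatives.
        if m1 != m2 and entity_types.get(m1) != entity_types.get(m2):
            negatives.append((wrap(s1, m1), wrap(s2, m2)))
    return positives, negatives
```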

Since the impact of a pair of examples on training should be measured at the distribution level, we choose the Jensen-Shannon divergence as a metric to assess the similarity of two distributions. Thus, for a sentence pair (x, x′), the similarity score of the two predictions h and h′ at the [MASK] position is computed by

s(h, h′) = JS(P_V*(w | x), P_V*(w | x′)),    (5)

where JS is the Jensen-Shannon divergence, and P_V*(w | x) and P_V*(w | x′) are the probability distributions of the predicted token w over V* obtained from h and h′.

As we attempt to make the predictions of the positive pairs similar, the objective is computed by

L = − (1 / |D̂pos|²) Σ_{x ∈ D̂pos} Σ_{x′ ∈ D̂pos} log(1 − s(h, h′))
    − (1 / |D̂neg|²) Σ_{x ∈ D̂neg} Σ_{x′ ∈ D̂neg} γ log(s(h, h′)),    (6)

where γ is a penalty term, because the assumption is loose in negative pairs. Overall, we use the entity-linked English Wikipedia corpus as the raw data and generate about 1 million pairs of data each as D̂pos and D̂neg.
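The distribution-level objective can be sketched as follows; the JS divergence is computed between the two distributions over V*, and the value of γ as well as the per-set averaging are illustrative simplifications of Eq. 6.

```python
# Sketch of Eqs. 5-6: the similarity of a pair is the Jensen-Shannon divergence
# between the two [MASK] distributions over V*; positive/negative pairs are
# pushed towards low/high divergence respectively (gamma is the penalty term).
import torch

def js_divergence(p, q, eps=1e-12):
    m = 0.5 * (p + q)
    kl = lambda a, b: torch.sum(a * (torch.log(a + eps) - torch.log(b + eps)))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)          # s(h, h') in Eq. 5

def pairwise_loss(pos_dists, neg_dists, gamma=0.5, eps=1e-12):
    # pos_dists / neg_dists: lists of (P_V*(w|x), P_V*(w|x')) distribution pairs.
    # Eq. 6 in spirit: the paper normalizes by the squared set sizes; here each
    # term is simply averaged over its own set of pairs, and gamma is illustrative.
    pos = sum(-torch.log(1 - js_divergence(p, q) + eps) for p, q in pos_dists)
    neg = sum(-gamma * torch.log(js_divergence(p, q) + eps) for p, q in neg_dists)
    return pos / max(len(pos_dists), 1) + neg / max(len(neg_dists), 1)
```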

Dataset     #Types   Supervised                       Few-shot                           Zero-shot
                     |Dtrain|   |Ddev|   |Dtest|      |Dtrain|    |Ddev|       |Dtest|   |Dtrain|  |Ddev|  |Dtest|
FEW-NERD    66       340382     48758    96901        66 ~ 1056   = |Dtrain|   96901     0         0       96901
OntoNotes   86       253239     2200     8962         86 ~ 1376   = |Dtrain|   8962      0         0       8962
BBN         46       86077      12824    12824        46 ~ 736    = |Dtrain|   12824     0         0       12824

Table 1: Statistics of FEW-NERD, OntoNotes, and BBN for the three experimental settings. It can be seen that for all three settings the test sets are identical. For the training set of the few-shot setting, we report the summation from 1-shot to 16-shot.

Shot  Metric   Few-NERD                      OntoNotes                     BBN
               Fine-tuning  PLET             Fine-tuning  PLET             Fine-tuning  PLET
1     Acc      8.94         43.87 (+34.93)   3.70         38.97 (+35.27)   0.80         40.70 (+39.90)
      MiF      19.85        60.60 (+45.75)   18.98        59.91 (+40.93)   5.79         49.25 (+43.46)
      MaF      19.85        60.60 (+40.75)   19.43        61.42 (+41.99)   4.42         48.48 (+43.06)
2     Acc      20.83        47.78 (+26.95)   7.27         39.19 (+31.92)   6.68         41.33 (+34.65)
      MiF      32.67        62.09 (+29.42)   24.89        61.09 (+36.20)   13.70        54.00 (+40.30)
      MaF      32.67        62.09 (+29.42)   25.64        62.68 (+37.04)   13.23        51.97 (+38.74)
4     Acc      33.09        57.00 (+23.91)   11.15        38.39 (+27.24)   19.34        52.21 (+32.87)
      MiF      44.14        68.61 (+24.47)   27.69        59.81 (+32.12)   27.03        61.13 (+34.10)
      MaF      44.14        68.61 (+24.47)   28.26        60.89 (+32.63)   24.69        58.91 (+34.22)
8     Acc      46.44        55.75 (+9.31)    18.37        39.37 (+21.00)   27.01        44.30 (+17.29)
      MiF      57.76        68.74 (+10.98)   38.16        57.97 (+19.81)   40.19        56.21 (+16.02)
      MaF      57.76        68.74 (+10.98)   37.77        58.32 (+20.55)   39.50        55.15 (+15.65)
16    Acc      60.98        61.58 (+0.60)    32.26        42.29 (+10.03)   39.67        55.00 (+15.33)
      MiF      71.59        72.39 (+0.80)    51.40        60.79 (+9.39)    49.01        62.84 (+13.83)
      MaF      71.59        72.39 (+0.80)    51.45        61.80 (+10.35)   47.09        62.38 (+15.29)

Table 2: Results of few-shot entity typing on FEW-NERD, OntoNotes, and BBN. All the methods use BERT-base with the same initialization weights as the backbone encoder. The training set and the dev set have the same size.

5 Experiments

In this section, we conduct experiments to evaluate the effectiveness of our methods. We use FT to denote the BERT-based fine-tuning approach, PLET to denote the naive prompt-learning approach for entity typing in § 3, and PLET (S) to denote the self-supervised prompt-learning approach in § 4. Our experiments are carried out in fully supervised (§ 5.4), few-shot (§ 5.5) and zero-shot (§ 5.6) settings on three fine-grained entity typing datasets.

5.1 Datasets

We use three fine-grained entity typing datasets: FEW-NERD, OntoNotes, and BBN.

FEW-NERD. We use FEW-NERD (Ding et al., 2021b) as the main dataset, which has the following advantages: (1) FEW-NERD is large-scale and fine-grained, containing 8 coarse-grained and 66 fine-grained entity types; (2) FEW-NERD is manually annotated, thereby we can precisely assess the capability of entity typing models. Specifically, we use the supervised setting of the dataset, FEW-NERD (SUP), and its official split to conduct our experiments.

OntoNotes. We also use the OntoNotes 5.0 dataset (Weischedel et al., 2013) in our experiments. Following previous works on fine-grained entity typing, we adopt the 86-class version of OntoNotes, where each class has at most 3 levels in the type hierarchy. The data split is identical to (Shimaoka et al., 2017).

BBN. The BBN dataset is selected from the Penn Treebank corpus of Wall Street Journal texts and labeled by (Weischedel and Brunstein, 2005). We follow the version processed by (Ren et al., 2016a) and the data split of (Ren et al., 2016b). The dataset contains 46 types, and each type has a maximum type hierarchy level of 2.

5.2 Experimental Settings

The experiments are performed under three different settings to evaluate the effect of the prompt-learning method and self-supervised training. In Table 1, we show the statistics of all the settings on the three datasets.

Supervised Setting. In the fully supervised setting, all training data are used in the training phase. FT and PLET are used to train the model. We run the experiments on all three datasets with the BERT-base-cased backbone. Both hard and soft encodings are used for PLET.

Few-shot Setting. In the few-shot setting, we randomly sample 1, 2, 4, 8, 16 instances for each entity type for training. We apply both FT and PLET with hard encoding on all three datasets.

Zero-shot Setting. In the zero-shot setting, no labeled training data are available. The model is required to infer the entity type without any supervised training. Since fine-tuning is not applicable in this setting, we only conduct experiments on PLET and PLET (S).

Metrics. In terms of evaluation metrics, we follow the widely used setting of Ling and Weld (2012), which includes strict accuracy (Acc), loose macro F1-score (MaF), and loose micro F1-score (MiF), to evaluate the performances of models. The loose F1-score calculation considers type labels at different granularities.
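For reference, a sketch of these metrics under their standard definitions (Ling and Weld, 2012) is shown below; treating each gold and predicted label as a set of types at all granularities (e.g., {PERSON, PERSON-ARTIST}) is our reading of "loose".

```python
# Sketch of strict accuracy and loose micro/macro F1 (Ling and Weld, 2012).
# golds / preds: lists of type-label sets, one set per entity mention.
def strict_accuracy(golds, preds):
    return sum(g == p for g, p in zip(golds, preds)) / len(golds)

def loose_micro_f1(golds, preds):
    inter = sum(len(g & p) for g, p in zip(golds, preds))
    precision = inter / max(sum(len(p) for p in preds), 1)
    recall = inter / max(sum(len(g) for g in golds), 1)
    return 2 * precision * recall / max(precision + recall, 1e-12)

def loose_macro_f1(golds, preds):
    precision = sum(len(g & p) / max(len(p), 1) for g, p in zip(golds, preds)) / len(golds)
    recall = sum(len(g & p) / max(len(g), 1) for g, p in zip(golds, preds)) / len(golds)
    return 2 * precision * recall / max(precision + recall, 1e-12)

golds = [{"PERSON", "PERSON-ARTIST"}]
preds = [{"PERSON", "PERSON-AUTHOR"}]
print(strict_accuracy(golds, preds), loose_micro_f1(golds, preds), loose_macro_f1(golds, preds))
```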

5.3 Experimental Details

We use BERT-base (Devlin et al., 2019) as the backbone structure of our model, initialized with the corresponding pre-trained cased weights2. The hidden size is 768 and the number of layers is 12. Models are implemented with the PyTorch framework3 (Paszke et al., 2019) and Huggingface Transformers4 (Wolf et al., 2020). BERT models are optimized by AdamW (Loshchilov and Hutter, 2019) with a learning rate of 5e-5. The training batch size is 16 for all models. In the supervised setting, each model is trained for 10 epochs and evaluated on the dev set every 2000 steps. In the few-shot setting, each model is trained for 30 epochs and evaluated every 10~50 steps; each evaluation runs for 200 steps. For the methods with hard encoding, we report the experimental results of T3(·). For the soft-encoding method, we report the results of l = 2. Experiments are conducted with CUDA on NVIDIA Tesla V100 GPUs.

2 https://github.com/google-research/bert
3 https://pytorch.org
4 https://github.com/huggingface/transformers
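The optimizer configuration described above can be reproduced with the standard PyTorch and transformers APIs, as sketched below; everything beyond the stated hyper-parameters (backbone, optimizer, learning rate, batch size) is omitted.

```python
# Sketch of the optimization setup described above: BERT-base backbone, AdamW
# with a learning rate of 5e-5, and a training batch size of 16.
import torch
from transformers import AutoModelForMaskedLM

model = AutoModelForMaskedLM.from_pretrained("bert-base-cased")
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
BATCH_SIZE = 16
```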

5.4 Results of Fully Supervised Entity Typing

Dataset     Metric   FT      PLET (H)  PLET (S)
FEW-NERD    Acc      79.75   79.90     79.86
            MiF      85.74   85.84     85.76
            MaF      85.74   85.84     85.76
OntoNotes   Acc      59.71   60.37     65.68
            MiF      70.47   70.78     74.53
            MaF      76.57   76.42     79.77
BBN         Acc      62.39   65.92     63.11
            MiF      68.88   71.55     68.68
            MaF      67.37   70.82     67.81

Table 3: Fully supervised entity typing results. FT denotes the vanilla fine-tuning method, (H) denotes the hard-encoding strategy, and (S) denotes the soft-encoding strategy. All the methods use BERT-base with the same initialization weights as the backbone encoder.

The results on all three datasets across different models are reported in Table 3. Overall, the prompt-based methods show certain improvements compared to directly fine-tuned models, indicating that the prompt-based method does help with capturing entity-type information from a given context.

It is also observed that the magnitude of the improvement and the preferred prompt encoding strategy may vary across datasets. The prompt-based method seems less effective on FEW-NERD than on the other two datasets, indicating that the effect of the prompt-based method partially depends on the characteristics of the dataset and that different prompt designs may suit different data. Specifically, FEW-NERD is manually annotated and contains much less noise than the other two datasets, which benefits the FT method in learning classification with an extra linear layer. Moreover, for the OntoNotes dataset, soft encoding significantly outperforms hard encoding, while for the other two datasets the effect seems reversed.

5.5 Results of Few-shot Entity Typing

Table 2 shows the results on few-shot entity typing. The prompt-based model outperforms fine-tuning by a large margin under the few-shot setting, especially when only 1~2 training instances per type are available. It should be noted that for the OntoNotes and BBN datasets, sampling 16 instances for each entity type already amounts to over 0.5% of the total training data. Meanwhile, some of the data in BBN are distantly supervised and potentially erroneous, which brings more randomness to few-shot training. The results support the idea that a well-designed prompt has much potential in mining the learned knowledge in pre-trained models and thus yields better performance in few-shot settings. The results also indicate that even when the number of entity types is large (46~86), the superiority of prompt-learning still holds.

Dataset      Metric   PLET    PLET (S)
FEW-NERD     Acc      17.55   23.99 (+6.44)
             MiF      28.39   47.98 (+19.59)
             MaF      28.39   47.98 (+19.59)
OntoNotes‡   Acc      25.10   28.27 (+3.17)
             MiF      33.61   49.79 (+16.18)
             MaF      37.91   49.95 (+12.04)
BBN          Acc      55.82   57.79 (+1.97)
             MiF      60.64   63.24 (+2.60)
             MaF      59.99   64.00 (+4.01)

Table 4: Results of zero-shot entity typing on FEW-NERD, OntoNotes, and BBN. ‡ means that we remove the "Other" class during testing. PLET denotes the prompt-learning pipeline and PLET (S) denotes self-supervised prompt-learning; both methods use BERT-base as the backbone encoder.

5.6 Results of Zero-shot Entity Typing

Table 4 shows the results on the zero-shot entity typing task. We do not report the performance of the vanilla fine-tuning approach because it cannot produce reasonable results with a randomly initialized classifier. It should also be noted that the prompt method without fine-tuning already outperforms random guessing, indicating that adding a prompt is informative for a model pre-trained with a masked-language-model objective (e.g., BERT) and can induce reasonable predictions in entity typing tasks. Second, the performance of the model improves by a large margin when trained on unlabeled data. This shows the effectiveness of the proposed self-supervised training approach and points to the potential of a pre-trained prompt-based model under the zero-shot setting, when no labeled data are available.

To explore the more subtle changes in performance, we carry out a case study for zero-shot entity typing. In Figure 4, we illustrate the zero-shot prediction distributions (the correct prediction and the other top-5 predictions) for four entity types in FEW-NERD: ORG-SPORTSLEAGUE, EVENT-ATTACK, MISC-CURRENCY, and LOC-MOUNTAIN. We observe that with self-supervised prompt-learning, PLET (S) can summarize entity type information and infer the related words to a certain extent. In Figure 4 (a) and Figure 4 (b), the PLET model suffers from a severe bias and predicts almost no correct labels in the zero-shot setting, since such words are low-frequency. Although there is no explicit supervision in the training stage of PLET (S), the model can still find the corresponding words that express the ORG-SPORTSLEAGUE and EVENT-ATTACK types. In Figure 4 (c), self-supervised learning increases the performance of the original encoder. Further, in Figure 4 (d), PLET is already able to make satisfying predictions for the type LOC-MOUNTAIN. In this case, the use of self-supervised learning has hardly weakened the performance, which means that the process of automatically summarizing type information has little negative impact on high-confidence entity types.

5.7 Effect of Templates

As stated in previous studies (Gao et al., 2020; Zhao et al., 2021), the choice of templates may have a huge impact on performance in prompt-learning. In this section, we carry out experiments to investigate such influence. Experiments are conducted under the 8-shot setting on the FEW-NERD dataset, and we use 3 different hard-encoding templates and 4 soft-encoding templates (by changing the number of prompt tokens l). The results (Table 5) demonstrate that the choice of templates exerts a considerable influence on the performance of prompt-based few-shot learning. For the hard-encoding templates, the phrase that describes the location, "in this sentence", contributes a remarkable improvement in performance. For the soft-encoding templates, surprisingly, the prompt-learning model yields the best result with the fewest special tokens.

[Figure 4: Zero-shot prediction distributions on four types in FEW-NERD: (a) ORG-SPORTSLEAGUE, (b) EVENT-ATTACK, (c) MISC-CURRENCY, (d) LOC-MOUNTAIN. In each subgraph, the left part illustrates the results of PLET and the right part shows the results of PLET (S); bars distinguish correct predictions, wrong predictions with the correct coarse-grained type, and wrong predictions with a wrong coarse-grained type.]

Encoding Strategy  Template T(x)                              Acc     MiF     MaF
Hard-encoding      x [Ent] is [MASK].                         54.45   67.34   67.34
                   x [Ent] is a [MASK].                       53.93   66.44   66.44
                   x In this sentence, [Ent] is [MASK].       55.75   68.74   68.74
Soft-encoding      x [P] [Ent] [P1] ... [Pl] [MASK], l = 2    59.25   69.58   69.58
                   x [P] [Ent] [P1] ... [Pl] [MASK], l = 3    53.66   66.06   66.06
                   x [P] [Ent] [P1] ... [Pl] [MASK], l = 4    52.96   66.01   66.01
                   x [P] [Ent] [P1] ... [Pl] [MASK], l = 5    55.44   68.39   68.39

Table 5: Effect of templates. The results are produced under the 8-shot setting on the FEW-NERD dataset by PLET.

6 Related Work

After a series of effective PLMs like GPT (Radford et al., 2018), BERT (Devlin et al., 2019), RoBERTa (Liu et al., 2019), and T5 (Raffel et al., 2020), fine-tuned PLMs have demonstrated their effectiveness on various important NLP tasks, such as dialogue generation (Zhang et al., 2020), text summarization (Zhang et al., 2019; Liu and Lapata, 2019), question answering (Adiwardana et al., 2020), and text classification (Baldini Soares et al., 2019; Peng et al., 2020; Ding et al., 2021a).

Despite the success of fine-tuning PLMs, the huge objective form gap between pre-training and fine-tuning still hinders the full use of pre-trained knowledge for downstream tasks (Liu et al., 2021b; Han et al., 2021b; Hu et al., 2021). To this end, prompt-learning has been proposed. In prompt-learning, by leveraging language prompts as contexts, downstream tasks can be expressed as cloze-style objectives similar to those pre-training objectives. The seminal work that stimulates the development of prompt-learning is the birth of GPT-3 (Brown et al., 2020), which uses hand-crafted prompts for tuning and achieves very impressive performance on various tasks, especially under the setting of few-shot learning.

Inspired by GPT-3, a series of hand-crafted prompts have been widely explored in knowledge probing (Trinh and Le, 2018; Petroni et al., 2019; Davison et al., 2019), relation classification (Han et al., 2021b), sentiment classification, and natural language inference (Schick and Schütze, 2021; Liu et al., 2021b). To avoid labor-intensive prompt design, automatic prompt search has also been extensively explored (Schick et al., 2020; Schick and Schütze, 2021; Shin et al., 2020; Gao et al., 2020; Liu et al., 2021a) to generate language phrases for prompts. Recently, continuous prompts have also been proposed (Li and Liang, 2021; Lester et al., 2021), which directly use a series of learnable continuous embeddings as prompts rather than discrete language phrases.

In this paper, we aim to stimulate PLMs with prompt-learning to capture the attribute information of entities. We take fine-grained entity typing, a crucial task in knowledge extraction that assigns entity types to entity mentions (Lin et al., 2012), as the foothold to develop prompt-learning strategies. In fact, Dai et al. (2021) use hypernym extraction patterns to enhance the context and apply masked language modeling to tackle the ultra-fine entity typing problem (Choi et al., 2018) with free-form labels, which shares a similar idea with prompt-learning. In our work, we mainly emphasize using prompt-learning to extract pre-defined entity types in low-data scenarios.

7 Conclusion

This work investigates the application of prompt-learning to fine-grained entity typing. More specifically, we propose a framework, PLET, that can deal with fine-grained entity typing in fully supervised, few-shot, and zero-shot scenarios. In PLET, we first introduce a simple and effective prompt-learning pipeline that can be used to extract entity types with both sufficient and insufficient supervision. Furthermore, to handle the zero-shot setting, we propose a self-supervised prompt-learning approach that automatically learns and summarizes entity types based on unlabeled corpora and a pre-defined label schema. PLET utilizes prompts to take advantage of the prior knowledge distributed in PLMs and can learn pre-defined type information without overfitting by performing distribution-level optimization. In future work, along the direction of PLET (S), we will explore better prompt-learning approaches to automatically learning entity types from unlabeled data.

References

Daniel Adiwardana, Minh-Thang Luong, David R. So, Jamie Hall, Noah Fiedel, Romal Thoppilan, Zi Yang, Apoorv Kulshreshtha, Gaurav Nemade, Yifeng Lu, et al. 2020. Towards a human-like open-domain chatbot. arXiv preprint arXiv:2001.09977.

Livio Baldini Soares, Nicholas FitzGerald, Jeffrey Ling, and Tom Kwiatkowski. 2019. Matching the blanks: Distributional similarity for relation learning. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2895–2905, Florence, Italy. Association for Computational Linguistics.

Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. In Proceedings of NIPS, pages 1877–1901.

Eunsol Choi, Omer Levy, Yejin Choi, and Luke Zettlemoyer. 2018. Ultra-fine entity typing. In Proceedings of ACL, pages 87–96.

Hongliang Dai, Yangqiu Song, and Haixun Wang. 2021. Ultra-fine entity typing with weak supervision from a masked language model. In Proceedings of ACL, pages 1790–1799.

Joe Davison, Joshua Feldman, and Alexander M. Rush. 2019. Commonsense knowledge mining from pretrained models. In Proceedings of EMNLP-IJCNLP, pages 1173–1178.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of NAACL-HLT, pages 4171–4186.

Ning Ding, Xiaobin Wang, Yao Fu, Guangwei Xu, Rui Wang, Pengjun Xie, Ying Shen, Fei Huang, Hai-Tao Zheng, and Rui Zhang. 2021a. Prototypical representation learning for relation extraction. In Proceedings of ICLR.

Ning Ding, Guangwei Xu, Yulin Chen, Xiaobin Wang, Xu Han, Pengjun Xie, Hai-Tao Zheng, and Zhiyuan Liu. 2021b. Few-NERD: A few-shot named entity recognition dataset. In Proceedings of ACL, pages 3198–3213.

Tianyu Gao, Adam Fisch, and Danqi Chen. 2020. Making pre-trained language models better few-shot learners. arXiv preprint arXiv:2012.15723.

Xu Han, Zhengyan Zhang, Ning Ding, Yuxian Gu, Xiao Liu, Yuqi Huo, Jiezhong Qiu, Liang Zhang, Wentao Han, Minlie Huang, et al. 2021a. Pre-trained models: Past, present and future. arXiv preprint arXiv:2106.07139.

Xu Han, Weilin Zhao, Ning Ding, Zhiyuan Liu, and Maosong Sun. 2021b. PTR: Prompt tuning with rules for text classification. arXiv preprint arXiv:2105.11259.

John Hewitt and Christopher D. Manning. 2019. A structural probe for finding syntax in word representations. In Proceedings of NAACL, pages 4129–4138.

Shengding Hu, Ning Ding, Huadong Wang, Zhiyuan Liu, Juanzi Li, and Maosong Sun. 2021. Knowledgeable prompt-tuning: Incorporating knowledge into prompt verbalizer for text classification. arXiv preprint arXiv:2108.02035.

Ganesh Jawahar, Benoît Sagot, and Djamé Seddah. 2019. What does BERT learn about the structure of language? In Proceedings of ACL, pages 3651–3657.

Brian Lester, Rami Al-Rfou, and Noah Constant. 2021. The power of scale for parameter-efficient prompt tuning. arXiv preprint arXiv:2104.08691.

Xiang Lisa Li and Percy Liang. 2021. Prefix-tuning: Optimizing continuous prompts for generation. arXiv preprint arXiv:2101.00190.

Thomas Lin, Oren Etzioni, et al. 2012. No noun phrase left behind: Detecting and typing unlinkable entities. In Proceedings of EMNLP-CoNLL, pages 893–903.

Xiao Ling and Daniel S. Weld. 2012. Fine-grained entity recognition. In AAAI.

Pengfei Liu, Weizhe Yuan, Jinlan Fu, Zhengbao Jiang, Hiroaki Hayashi, and Graham Neubig. 2021a. Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing. arXiv preprint arXiv:2107.13586.

Xiao Liu, Yanan Zheng, Zhengxiao Du, Ming Ding, Yujie Qian, Zhilin Yang, and Jie Tang. 2021b. GPT understands, too. arXiv preprint arXiv:2103.10385.

Yang Liu and Mirella Lapata. 2019. Text summarization with pretrained encoders. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3730–3740, Hong Kong, China. Association for Computational Linguistics.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.

Ilya Loshchilov and Frank Hutter. 2019. Decoupled weight decay regularization. In Proceedings of ICLR.

Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Köpf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. 2019. PyTorch: An imperative style, high-performance deep learning library. In Proceedings of NIPS, pages 8024–8035.

Hao Peng, Tianyu Gao, Xu Han, Yankai Lin, Peng Li, Zhiyuan Liu, Maosong Sun, and Jie Zhou. 2020. Learning from context or names? An empirical study on neural relation extraction. In Proceedings of EMNLP, pages 3661–3672.

Fabio Petroni, Tim Rocktäschel, Sebastian Riedel, Patrick Lewis, Anton Bakhtin, Yuxiang Wu, and Alexander Miller. 2019. Language models as knowledge bases? In Proceedings of EMNLP, pages 2463–2473.

Xipeng Qiu, Tianxiang Sun, Yige Xu, Yunfan Shao, Ning Dai, and Xuanjing Huang. 2020. Pre-trained models for natural language processing: A survey. Science China Technological Sciences, pages 1–26.

Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding by generative pre-training. OpenAI.

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. JMLR, 21:1–67.

Xiang Ren, Wenqi He, Meng Qu, Lifu Huang, Heng Ji, and Jiawei Han. 2016a. AFET: Automatic fine-grained entity typing by hierarchical partial-label embedding. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1369–1378, Austin, Texas. Association for Computational Linguistics.

Xiang Ren, Wenqi He, Meng Qu, Clare R. Voss, Heng Ji, and Jiawei Han. 2016b. Label noise reduction in entity typing by heterogeneous partial-label embedding. In Proceedings of SIGKDD, pages 1825–1834.

Timo Schick, Helmut Schmid, and Hinrich Schütze. 2020. Automatically identifying words that can serve as labels for few-shot text classification. In Proceedings of COLING, pages 5569–5578.

Timo Schick and Hinrich Schütze. 2021. Exploiting cloze-questions for few-shot text classification and natural language inference. In Proceedings of EACL, pages 255–269.

Sonse Shimaoka, Pontus Stenetorp, Kentaro Inui, and Sebastian Riedel. 2017. Neural architectures for fine-grained entity type classification. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, pages 1271–1280, Valencia, Spain. Association for Computational Linguistics.

Taylor Shin, Yasaman Razeghi, Robert L. Logan IV, Eric Wallace, and Sameer Singh. 2020. AutoPrompt: Eliciting knowledge from language models using automatically generated prompts. In Proceedings of EMNLP, pages 4222–4235.

Trieu H. Trinh and Quoc V. Le. 2018. A simple method for commonsense reasoning. arXiv preprint arXiv:1806.02847.

Dong Wang, Ning Ding, Piji Li, and Haitao Zheng. 2021. CLINE: Contrastive learning with semantic negative examples for natural language understanding. In Proceedings of ACL.

Ralph Weischedel and Ada Brunstein. 2005. BBN Pronoun Coreference and Entity Type Corpus. Linguistic Data Consortium, Philadelphia.

Ralph Weischedel, Martha Palmer, Mitchell Marcus, Eduard Hovy, Sameer Pradhan, Lance Ramshaw, Nianwen Xue, Ann Taylor, Jeff Kaufman, Michelle Franchini, Mohammed El-Bachouti, Robert Belvin, and Ann Houston. 2013. OntoNotes Release 5.0. Abacus Data Network.

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Remi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander Rush. 2020. Transformers: State-of-the-art natural language processing. In Proceedings of EMNLP, pages 38–45.

Jingqing Zhang, Yao Zhao, Mohammad Saleh, and Peter J. Liu. 2019. PEGASUS: Pre-training with extracted gap-sentences for abstractive summarization. In Proceedings of ICML, pages 11328–11339.

Yizhe Zhang, Siqi Sun, Michel Galley, Yen-Chun Chen, Chris Brockett, Xiang Gao, Jianfeng Gao, Jingjing Liu, and Bill Dolan. 2020. DialoGPT: Large-scale generative pre-training for conversational response generation. In Proceedings of ACL, pages 270–278.

Tony Z. Zhao, Eric Wallace, Shi Feng, Dan Klein, and Sameer Singh. 2021. Calibrate before use: Improving few-shot performance of language models. arXiv preprint arXiv:2102.09690.

in Figure 1 in prompt-learning downstream tasksare formalized as equivalent cloze-style tasks andPLMs are asked to handle these cloze-style tasksinstead of original downstream tasks Comparedwith conventional fine-tuning methods prompt-learning does not require extra neural layers andintuitively bridges the objective form gap betweenpre-training and fine-tuning Sufficient empiricalanalysis shows that either for manually pickinghand-crafted prompts (Liu et al 2021b Han et al2021b) or automatically building auto-generatedprompts (Shin et al 2020 Gao et al 2020 Lesteret al 2021) taking prompts for tuning models issurprisingly effective for the knowledge stimula-tion and model adaptation of PLMs especially inthe low-data regime

Intuitively prompt-learning is applicable to fine-grained entity typing which aims at classifyingmarked entities from input sequences into specifictypes in a pre-defined label set We discuss thistopic with a motivating example ldquoHe is from NewYorkrdquo By adding a prompt with a masking token[MASK] the sentence becomes ldquoHe is from NewYork In this sentence New York is [MASK]rdquo Dueto the wealth of knowledge acquired during pre-training PLMs can compute a probability distri-bution over the vocabulary at the masked positionand a relatively higher probability with the wordldquocityrdquo than the word ldquopersonrdquo In other words withsimple prompts the abstract entity attributes con-tained in PLMs can be efficiently exploited whichis meaningful for downstream entity-related tasks

In this work we comprehensively explore theapplication of prompt-learning to fine-grained en-tity typing in fully supervised few-shot and zero-shot settings Particularly we first introduce anaive pipeline where we construct entity-orientedprompts and formalize fine-grained entity typingas a cloze-style task This simple pipeline yieldspromising results in our experiments especiallywhen supervision is insufficient Then to tacklethe zero-shot scenario where no explicit supervi-sion exists in training we develop a self-supervisedstrategy under our prompt-learning pipeline Ourself-supervised strategy attempts to automaticallysummarize entity types by optimizing the similarityof the predicted probability distributions of pairedexamples in prompt-learning

Three popular benchmarks are used for our ex-periments including FEW-NERD (Ding et al2021b) OntoNotes (Weischedel et al 2013)

BBN (Weischedel and Brunstein 2005) All thesedatasets have a complex type hierarchy consistingof rich entity types requiring models to have goodcapabilities of entity attribute detection Empiri-cally our method yields significant improvementson these benchmark datasets especially under thezero-shot and few-shot settings We also make ananalysis and point out both the superiority and bot-tleneck of prompt-learning in fine-grained entitytyping which may advance further efforts to ex-tract entity attributes using PLMs Our source codeand pre-trained models will be publicly available

2 Background

In this section we first give a problem definition ofthe entity typing task (sect 21) followed by an intro-duction of conventional vanilla fine-tuning (sect 22)and prompt-based tuning (sect 23) with PLMs

21 Problem DefinitionThe input of entity typing is a dataset D =x1 xn with n sentences and each sentencex contains a marked entity mention m For eachinput sentence x entity typing aims at predictingthe entity type y isin Y of its marked mention mwhere Y is a pre-defined set of entity types En-tity typing is typically regarded as a context-awareclassification task For example in the sentenceldquoLondon is the fifth album by the rock band JesusJonesrdquo the entity mention London should be clas-sified as Music rather than Location In the eraof PLMs using pre-trained neural language models(eg BERT) as the encoder and performing modeltuning for classifying types becomes a standardparadigm

22 Vanilla Fine-tuningIn the vanilla fine-tuning paradigm of entity typ-ing for each token ti in an input sequencex = [CLS] t1 m tT [SEP] witha marked entity mention m = ti tj thePLM M produces its contextualized representa-tion h[CLS]h1 hT h[SEP] Empiricallywe choose the embedding of the [CLS] tokenh[CLS] as the final representation that is fed intoan output layer to predict the probability distribu-tion over the label space

P (y isin Y|s) = softmax(Wh[CLS] + b) (1)

where W and b are learnable parameters W band all parameters of PLMs are tuned by maximiz-

London is one of the biggest cities in the world London is a

Input Prompt

[MASK] [SEP]

Copy the entity mention

LOCATIONCITY

Label Words Class Sets

Mapping

[CLS]

MLM head

City Location hellip

Figure 2 The illustration of prompt-learning for fine-grained entity typing with supervision We take hard-encoding prompt strategy as an example in this figure

ing the objective function 1n

sumni=1 log(P (yi|si))

where yi is the golden type label of si

23 Prompt-based Tuning

In prompt-based tuning for each label y isin Y wedefine a label word set Vy = w1 wm Vyis a subset of the vocabulary V of the PLM Mie Vy sube V By taking the union of the dictio-nary corresponding to each label we get an overalldictionary Vlowast For example in sentiment classi-fication we could map the label y = POSITIVE

into a set Vy = great good wonderful Andanother primary component of prompt-learning is aprompt template T (middot) which modifies the originalinput x into a prompt input T (x) by adding a setof additional tokens at the end of x Convention-ally a [MASK] token is added for PLMs to predictthe missing label word w isin Vlowast Thus in prompt-learning a classification problem is transferred intoa masked language modeling problem

p(y isin Y|s)=p([MASK]=wisinVy|T (s)) (2)

3 Prompt-learning for Entity Typing ANaive Pipeline

After transferred into masked language modelingthe prompt-learning method is applicable to learn-ing and aggregating type information of entities Inthis section we first introduce a naive but empiri-cally strong baseline that utilizes prompts to extractentity types with explicit supervision includingthe construction of label words (sect 31) templates(sect 32) and training (sect 33) And such a simplepipeline yields remarkable results on three bench-mark datasets Then we propose a self-supervisedprompt-learning method that automatically learnstype information from unlabeled data (sect 4)

3.1 Label Words Set V*

For fine-grained entity typing, datasets usually use a hierarchical label space, such as PERSON-ARTIST (FEW-NERD) and ORGANIZATION-PARTY (OntoNotes). In this case, we use all the words in the hierarchy as the label words set V* for this entity type; for example, y = LOCATION-CITY → v = {location, city}. As the entity types are all well-defined nouns with clear boundaries, it is intuitive to expand the label words set V* with obtainable related nouns. For example, in Related Words¹, the top-10 related words of the label word city are "metropolis, town, municipality, urban, suburb, municipal, megalopolis, civilization, downtown, country". These words are strongly related to the class CITY, and they are hardly mapped to other entity types, even under the same LOCATION class, such as LOCATION-MOUNTAIN, LOCATION-ISLAND, etc.
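As an illustration, a label word set of this kind could be assembled as in the sketch below; the related-noun dictionary is assumed to have been collected offline (e.g., from Related Words), and the function name is hypothetical.

```python
def build_label_words(type_name: str, related_nouns: dict, top_k: int = 10) -> list:
    """Map a hierarchical type such as 'LOCATION-CITY' to its label words V_y."""
    words = [level.lower() for level in type_name.split("-")]   # e.g., ["location", "city"]
    # expand the finest-grained label word with obtainable related nouns
    words += related_nouns.get(words[-1], [])[:top_k]
    return words

# toy related-noun dictionary, assumed to be gathered beforehand
related_nouns = {"city": ["metropolis", "town", "municipality", "suburb", "downtown"]}
print(build_label_words("LOCATION-CITY", related_nouns))
# ['location', 'city', 'metropolis', 'town', 'municipality', 'suburb', 'downtown']
```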

In masked language modeling, we use the confidence scores of all the words in V_y to construct the final score of the particular type y. That is, for an input x (which is mapped to T(x)) and its entity type y (which is mapped to V_y = {w_1, …, w_m}), the conditional probability becomes

$$P(y \mid x) = \frac{1}{m} \sum_{j=1}^{m} \lambda_j\, P(\texttt{[MASK]} = w_j \mid T(x)) \quad (3)$$

where λ_j is a parameter that indicates the importance of the current word w_j ∈ V_y. Note that λ_j could also be learnable or heuristically defined during the training procedure.
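A minimal sketch of this aggregation (Eq. 3) is shown below; `mask_logits` is assumed to be the MLM-head output at the [MASK] position, and the importance weights λ_j default to 1, which reduces to a plain average over the label words.

```python
import torch

def type_score(mask_logits: torch.Tensor, label_word_ids: list, lambdas=None) -> torch.Tensor:
    """P(y|x) as the weighted average of [MASK] probabilities over the words in V_y (Eq. 3)."""
    probs = torch.softmax(mask_logits, dim=-1)       # distribution over the whole vocabulary
    word_probs = probs[label_word_ids]               # P([MASK] = w_j | T(x)) for each w_j in V_y
    if lambdas is None:
        lambdas = torch.ones(len(label_word_ids))    # equal importance for every label word
    return (lambdas * word_probs).mean()             # (1/m) * sum_j lambda_j * P([MASK] = w_j | T(x))
```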

¹ https://relatedwords.org

3.2 Templates

In this section, we construct entity-oriented prompts for the fine-grained entity typing task. We choose hard-encoding templates with natural language and soft-encoding templates with additional special tokens in our work.

For the choice of hard-encoding templates, we do not use automatic searching methods for discrete prompts, since the fine-grained entity typing task is clearly defined and the prompts are easily made purposeful. We select simple declarative templates rather than hypernym templates to avoid grammatical errors. In the hard-encoding setting, we first copy the marked entity mention in x, then we add a few linking verbs and articles, followed by the [MASK] token. With the marked entity mention [Ent], we use the following templates:

T1(x) = x [Ent] is [MASK].
T2(x) = x [Ent] is a [MASK].
T3(x) = x In this sentence, [Ent] is a [MASK].

where [Ent] is the entity mention in x. In § 5, we report the results of T3(·).
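The hard-encoding templates are simple string operations; the sketch below shows one possible wrapper (function names are illustrative).

```python
def t1(x: str, ent: str) -> str:
    return f"{x} {ent} is [MASK]."

def t2(x: str, ent: str) -> str:
    return f"{x} {ent} is a [MASK]."

def t3(x: str, ent: str) -> str:
    # the template reported in the experiments (§ 5)
    return f"{x} In this sentence, {ent} is a [MASK]."

print(t3("London is the fifth album by the rock band Jesus Jones.", "London"))
# London is the fifth album by the rock band Jesus Jones. In this sentence, London is a [MASK].
```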

We also adopt the soft-encoding strategy, which introduces additional special tokens [P1], …, [Pl] as the template, where l is a pre-defined hyper-parameter. The template begins with a delimiter [P] and a copy of the entity mention [Ent]. The complete template becomes

T4(x) = x [P] [Ent] [P1] … [Pl] [MASK]

where each prompt embedding is randomly initialized and optimized during training. Intuitively, these special tokens can represent a cluster of words with similar semantics in the vocabulary.

3.3 Training and Inference

The strategies of hard or soft encoding provide different initializations of templates, and both can be parameterized by φ and optimized along with M during training. We train the pre-trained model M (parameterized by θ) along with the additional prompt embeddings by using the cross-entropy loss function

$$\mathcal{L} = -\sum \log P(y \mid x; \theta, \phi) \quad (4)$$

For inference, we can directly use Eq. (3) to predict the label of the current input instance based on the predicted words at the [MASK] position.
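One plausible way to instantiate the training objective (Eq. 4) on top of the Eq. (3) scores is sketched below, reusing the `type_score` helper from the earlier sketch; the normalization over the type set is our assumption, since the paper only states that a cross-entropy loss is used.

```python
import torch

def prompt_loss(mask_logits: torch.Tensor, label_words_per_type: list, gold_type: int) -> torch.Tensor:
    """Cross-entropy over per-type scores obtained from the [MASK] prediction (Eq. 4)."""
    scores = torch.stack([type_score(mask_logits, ids) for ids in label_words_per_type])
    scores = scores / scores.sum()                   # normalize the Eq. (3) scores over Y (assumption)
    return -torch.log(scores[gold_type])

# Inference simply follows Eq. (3): predict the argmax over the per-type scores.
```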

This pipeline can be applied to the entity typing task with explicit supervision, and it is effective even if the training data are insufficient, i.e., the few-shot scenario (§ 5.5). Naturally, we consider a more extreme situation, that is, a scenario without any training data (the zero-shot scenario). In this setting, if we directly use an additional classifier to predict the label, the result is equivalent to random guessing, because the parameters of the classifier are randomly initialized. If we use prompts to infer the label based on the predicted words, although the performance is significantly better than guessing, there is still a catastrophic decline (§ 5.6). At this point, a question emerges: "Is it possible for PLMs to predict entity types without any explicit supervision?"

4 Self-supervised Prompt-learning for Zero-shot Entity Typing

With prompt-learning, the answer is yes, because in the pre-training stage, the contexts of entities have already implied the corresponding type information, which provides an advantageous initialization point for the prompt-learning paradigm. For example, consider an input sentence with the T3(·) template: "Steve Jobs founded Apple. In this sentence, Steve Jobs is a [MASK]." In our observations, the probability of PLMs predicting person at the masked position is significantly higher than the probability of location. If we make reasonable use of this superior initialization point, it is possible for PLMs to automatically summarize the type information and finally extract the correct entity type.

4.1 Overview

In order to create conditions for PLMs to summarize entity types, we consider a self-supervised paradigm that optimizes the similarity of the probability distributions predicted by similar examples over a projected vocabulary V*. To achieve that in prompt-learning, we need to (1) impose a limit on the prediction range of the model, so that only the words that we need, that is, words that express entity types, participate in the optimization of the gradient; and (2) provide an unlabeled dataset where entity mentions are marked without any types, to allow the model to learn the process of inducing type information in a self-supervised manner. The inputs contain a pre-trained model M, a pre-defined label schema Y, and a dataset without labels D = {x_1, …, x_n} (entity mentions are marked without any types); our goal is to make M capable of automatically carrying out zero-shot entity typing after being trained on D and Y.

Figure 3: The illustration of self-supervised prompt-learning for fine-grained entity typing with unlabeled data and a pre-defined label set. V* denotes the label words projected from the input label set. Note that we only show the positive pair in this figure.

Using prompt-learning as the training strategy, we first construct a label words set V* from Y, and for each sentence x in D, we wrap it with a hard-encoding template with a [MASK] symbol. The key idea is to make the prediction distributions of the same type of entities over V* as similar as possible. In this way, we can perform contrastive learning by sampling positive and negative examples, while ignoring the impact of other words that are not in V* on the optimization during the MLM process.

4.2 Self-supervised Learning

Although there are no labels in D, we can still develop a sampling strategy based on a simple hypothesis, that is, the same entities in different sentences have similar types. For instance, we sample two sentences containing "Steve Jobs" as a positive pair. Moreover, considering that entity typing is context-aware ("Steve Jobs" could be entrepreneur, designer, or philanthropist in different contexts), we choose to optimize the similarity between the distributions of the words over V*. This strategy not only softens the supervision but also eliminates the impact of other words in self-supervised learning.

Particularly, we randomly sample c positive pairs, i.e., sentence pairs that share one same entity mention, denoted as D̂_pos, and c negative pairs, i.e., two sentences with different entity mentions marked, denoted as D̂_neg, from a large-scale entity-linked corpus D. To avoid generating false negative samples, the negative samples are further restricted by a large dictionary that contains common entities and their type information; only sentence pairs with entities of different types in the dictionary are selected as negative samples. Then we wrap them with the hard-encoding template T3(·). To avoid overfitting to the entity names, we randomly hide the entity mention (in the original input and in the template) with a special symbol [HIDE] with a probability of α. Empirically, α is set to 0.4.
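A sketch of this pair construction with entity hiding is given below; the raw-corpus format and the helper names are assumptions, while α = 0.4 and the T3(·) template follow the description above.

```python
import random

ALPHA = 0.4   # probability of hiding the entity mention, as reported above

def wrap_for_ssl(sentence: str, mention: str) -> str:
    """Wrap a sentence with the T3 template, randomly hiding the mention with [HIDE]."""
    if random.random() < ALPHA:
        sentence = sentence.replace(mention, "[HIDE]")
        mention = "[HIDE]"
    return f"{sentence} In this sentence, {mention} is a [MASK]."

# a positive pair: two sentences sharing the same entity mention
positive_pair = (
    wrap_for_ssl("Steve Jobs founded Apple.", "Steve Jobs"),
    wrap_for_ssl("Steve Jobs unveiled the iPhone in 2007.", "Steve Jobs"),
)
```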

Since the impact of a pair of examples on training should be measured at the distribution level, we choose the Jensen-Shannon divergence as a metric to assess the similarity of two distributions. Thus, for a sentence pair (x, x′), the similarity score of the two predictions h and h′ at the [MASK] position is computed by

$$s(\mathbf{h}, \mathbf{h}') = \mathrm{JS}\big(P_{\mathcal{V}^*}(w \mid x),\, P_{\mathcal{V}^*}(w \mid x')\big) \quad (5)$$

where JS is the Jensen-Shannon divergence, and P_V*(w|x) and P_V*(w|x′) are the probability distributions of the predicted token w over V*, obtained from h and h′.
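The similarity in Eq. (5) can be sketched as follows; `vstar_ids` is assumed to index the label words V* in the PLM vocabulary, and both [MASK] predictions are renormalized over V* before the divergence is computed.

```python
import torch
import torch.nn.functional as F

def js_similarity(mask_logits_x: torch.Tensor, mask_logits_x2: torch.Tensor,
                  vstar_ids: torch.Tensor) -> torch.Tensor:
    """Jensen-Shannon divergence between the two [MASK] distributions over V* (Eq. 5)."""
    p = F.softmax(mask_logits_x[vstar_ids], dim=-1)     # P_V*(w | x)
    q = F.softmax(mask_logits_x2[vstar_ids], dim=-1)    # P_V*(w | x')
    m = 0.5 * (p + q)
    # JS(p, q) = 0.5 * KL(p || m) + 0.5 * KL(q || m)
    return 0.5 * (F.kl_div(m.log(), p, reduction="sum")
                  + F.kl_div(m.log(), q, reduction="sum"))
```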

As we attempt to make the predictions of the positive pairs similar, the objective is computed by

$$\mathcal{L} = -\frac{1}{|\hat{\mathcal{D}}_{pos}|^2} \sum_{x \in \hat{\mathcal{D}}_{pos}} \sum_{x' \in \hat{\mathcal{D}}_{pos}} \log\big(1 - s(\mathbf{h}, \mathbf{h}')\big) \;-\; \frac{1}{|\hat{\mathcal{D}}_{neg}|^2} \sum_{x \in \hat{\mathcal{D}}_{neg}} \sum_{x' \in \hat{\mathcal{D}}_{neg}} \gamma \log\big(s(\mathbf{h}, \mathbf{h}')\big) \quad (6)$$

where γ is a penalty term, because the assumption is looser for negative pairs. Overall, we use an entity-linked English Wikipedia corpus as the raw data and generate about 1 million pairs of data each for D̂_pos and D̂_neg.
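Putting the pieces together, the objective in Eq. (6) could be computed as in the sketch below, assuming the pairwise scores s(h, h′) come from the `js_similarity` helper above; the value of γ here is a placeholder, since it is not reported in this section, and a small epsilon is added purely for numerical stability.

```python
import torch

def ssl_loss(pos_scores: torch.Tensor, neg_scores: torch.Tensor, gamma: float = 0.5) -> torch.Tensor:
    """Distribution-level contrastive objective over positive and negative pairs (Eq. 6)."""
    eps = 1e-8                                               # numerical stability only
    pos_term = -torch.log(1.0 - pos_scores + eps).mean()     # pull positive-pair distributions together
    neg_term = -gamma * torch.log(neg_scores + eps).mean()   # push negative-pair distributions apart
    return pos_term + neg_term
```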

Dataset     Type   Supervised                       Few-shot                           Zero-shot
                   |Dtrain|   |Ddev|    |Dtest|     |Dtrain|    |Ddev|       |Dtest|   |Dtrain|  |Ddev|  |Dtest|
Few-NERD    66     340,382    48,758    96,901      66∼1,056    = |Dtrain|   96,901    0         0       96,901
OntoNotes   86     253,239    2,200     8,962       86∼1,376    = |Dtrain|   8,962     0         0       8,962
BBN         46     86,077     12,824    12,824      46∼736      = |Dtrain|   12,824    0         0       12,824

Table 1: Statistics of FEW-NERD, OntoNotes, and BBN under the three experimental settings. It can be seen that for all three settings, the test sets are identical. For the training set of the few-shot setting, we report the summation from 1-shot to 16-shot.

Shot  Metric  Few-NERD                      OntoNotes                     BBN
              Fine-tuning  PLET             Fine-tuning  PLET             Fine-tuning  PLET
1     Acc     8.94         43.87 (+34.93)   3.70         38.97 (+35.27)   0.80         40.70 (+39.90)
      MiF     19.85        60.60 (+45.75)   18.98        59.91 (+40.93)   5.79         49.25 (+43.46)
      MaF     19.85        60.60 (+40.75)   19.43        61.42 (+41.99)   4.42         48.48 (+43.06)
2     Acc     20.83        47.78 (+26.95)   7.27         39.19 (+31.92)   6.68         41.33 (+34.65)
      MiF     32.67        62.09 (+29.42)   24.89        61.09 (+36.20)   13.70        54.00 (+40.30)
      MaF     32.67        62.09 (+29.42)   25.64        62.68 (+37.04)   13.23        51.97 (+38.74)
4     Acc     33.09        57.00 (+23.91)   11.15        38.39 (+27.24)   19.34        52.21 (+32.87)
      MiF     44.14        68.61 (+24.47)   27.69        59.81 (+32.12)   27.03        61.13 (+34.10)
      MaF     44.14        68.61 (+24.47)   28.26        60.89 (+32.63)   24.69        58.91 (+34.22)
8     Acc     46.44        55.75 (+9.31)    18.37        39.37 (+21.00)   27.01        44.30 (+17.29)
      MiF     57.76        68.74 (+10.98)   38.16        57.97 (+19.81)   40.19        56.21 (+16.02)
      MaF     57.76        68.74 (+10.98)   37.77        58.32 (+20.55)   39.50        55.15 (+15.65)
16    Acc     60.98        61.58 (+0.60)    32.26        42.29 (+10.03)   39.67        55.00 (+15.33)
      MiF     71.59        72.39 (+0.80)    51.40        60.79 (+9.39)    49.01        62.84 (+13.83)
      MaF     71.59        72.39 (+0.80)    51.45        61.80 (+10.35)   47.09        62.38 (+15.29)

Table 2: Results of few-shot entity typing on FEW-NERD, OntoNotes, and BBN. All the methods use BERT-base with the same initialization weights as the backbone encoder. The training set and dev set have the same size.

5 Experiments

In this section, we conduct experiments to evaluate the effectiveness of our methods. We use FT to denote the BERT-based fine-tuning approach, PLET to denote the naive prompt-learning approach for entity typing in § 3, and PLET (S) to denote the self-supervised prompt-learning approach in § 4. Our experiments are carried out on fully supervised (§ 5.4), few-shot (§ 5.5), and zero-shot (§ 5.6) settings on three fine-grained entity typing datasets.

5.1 Datasets

We use three fine-grained entity typing datasets: FEW-NERD, OntoNotes, and BBN.

FEW-NERD. We use FEW-NERD (Ding et al., 2021b) as the main dataset, which has the following advantages: (1) FEW-NERD is large-scale and fine-grained, containing 8 coarse-grained and 66 fine-grained entity types; (2) FEW-NERD is manually annotated, and thereby we can precisely assess the capability of entity typing models. Specifically, we use the supervised setting of the dataset, FEW-NERD (SUP), and its official split to conduct our experiments.

OntoNotes. We also use the OntoNotes 5.0 dataset (Weischedel et al., 2013) in our experiments. Following previous works on fine-grained entity typing, we adopt the 86-class version of OntoNotes, where each class has at most 3 levels of the type hierarchy. The data split is identical to that of Shimaoka et al. (2017).

BBN. The BBN dataset is selected from the Penn Treebank corpus of Wall Street Journal texts and labeled by Weischedel and Brunstein (2005). We follow the version processed by Ren et al. (2016a) and the data split of Ren et al. (2016b). The dataset contains 46 types, and each type has a maximum type hierarchy depth of 2.

5.2 Experimental Settings

The experiments are performed under three different settings to evaluate the effect of the prompt-learning method and self-supervised training. In Table 1, we show the statistics of all the settings on the three datasets.

Supervised Setting. In the fully supervised setting, all training data are used in the training phase. FT and PLET are used to train the model. We run the experiments on all three datasets with a BERT-base-cased backbone. Both hard and soft encodings are used for PLET.

Few-shot Setting. In the few-shot setting, we randomly sample 1, 2, 4, 8, or 16 instances for each entity type for training. We apply both the FT and PLET methods with hard encoding on all three datasets.

Zero-shot Setting. In the zero-shot setting, no labeled training data are available. The model is required to infer the entity type without any supervised training. Since fine-tuning is not applicable in this setting, we only conduct experiments with PLET and PLET (S).

Metrics. In terms of evaluation metrics, we follow the widely used setting of Ling and Weld (2012), which includes strict accuracy (Acc), loose macro F1-score (MaF), and loose micro F1-score (MiF) to evaluate the performance of models. The loose F1-score calculation considers type labels at different granularities.

5.3 Experimental Details

We use BERT-base (Devlin et al., 2019) as the backbone structure of our model, initialized with the corresponding pre-trained cased weights². The hidden size is 768 and the number of layers is 12. Models are implemented with the PyTorch framework³ (Paszke et al., 2019) and Huggingface transformers⁴ (Wolf et al., 2020). BERT models are optimized by AdamW (Loshchilov and Hutter, 2019) with a learning rate of 5e-5. The training batch size is 16 for all models. In the supervised setting, each model is trained for 10 epochs and evaluated on the dev set every 2,000 steps. In the few-shot setting, each model is trained for 30 epochs and evaluated every 10∼50 steps; each time, the evaluation is run for 200 steps. For the methods with hard-encoding, we report the experimental results of T3(·). For the soft-encoding method, we report the results of l = 2. Experiments are conducted with CUDA on NVIDIA Tesla V100 GPUs.

² https://github.com/google-research/bert
³ https://pytorch.org
⁴ https://github.com/huggingface/transformers
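For reference, the reported optimization setup (AdamW, learning rate 5e-5, batch size 16, 10 supervised epochs) corresponds to a loop of the following shape; the model and data here are tiny stand-ins so the snippet runs on its own and are not part of the original implementation.

```python
import torch
from torch import nn
from torch.optim import AdamW
from torch.utils.data import DataLoader, TensorDataset

model = nn.Linear(10, 2)                 # stand-in for BERT-base plus prompt embeddings
data = TensorDataset(torch.randn(64, 10), torch.randint(0, 2, (64,)))
train_loader = DataLoader(data, batch_size=16)          # batch size 16, as reported
optimizer = AdamW(model.parameters(), lr=5e-5)          # AdamW with learning rate 5e-5

for epoch in range(10):                                 # supervised setting: 10 epochs
    for inputs, labels in train_loader:
        optimizer.zero_grad()
        loss = nn.functional.cross_entropy(model(inputs), labels)
        loss.backward()
        optimizer.step()
    # in the paper, evaluation on the dev set runs every 2,000 steps (elided here)
```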

5.4 Results of Fully Supervised Entity Typing

Dataset     Metric  FT      PLET (H)  PLET (S)
Few-NERD    Acc     79.75   79.90     79.86
            MiF     85.74   85.84     85.76
            MaF     85.74   85.84     85.76
OntoNotes   Acc     59.71   60.37     65.68
            MiF     70.47   70.78     74.53
            MaF     76.57   76.42     79.77
BBN         Acc     62.39   65.92     63.11
            MiF     68.88   71.55     68.68
            MaF     67.37   70.82     67.81

Table 3: Fully supervised entity typing results. FT denotes the vanilla fine-tuning method, (H) denotes the hard-encoding strategy, and (S) denotes the soft-encoding strategy. All the methods use BERT-base with the same initialization weights as the backbone encoder.

The results on all three datasets across different models are reported in Table 3. Overall, the prompt-based methods show certain improvements compared to directly fine-tuned models. This shows that the prompt-based method does help with capturing entity-type information from a given context.

It is also observed that the magnitude of the improvement and the preferred prompt encoding strategy may vary across datasets. The prompt-based method seems less effective on the FEW-NERD dataset than on the other two. This indicates that the effect of the prompt-based method partially depends on the characteristics of the dataset, and that different prompt designs may suit different data. Specifically, FEW-NERD is manually annotated and contains much less noise than the other two datasets, which benefits the FT method in learning classification with an extra linear layer. Moreover, for the OntoNotes dataset, soft encoding significantly outperforms hard encoding, while for the other two datasets the effect seems reversed.

5.5 Results of Few-shot Entity Typing

Table 2 shows the results of few-shot entity typing. It is shown that the prompt-based model outperforms fine-tuning by a large margin under the few-shot setting, especially when only 1∼2 training instances per type are available.

Dataset      Metric  PLET    PLET (S)
Few-NERD     Acc     17.55   23.99 (+6.44)
             MiF     28.39   47.98 (+19.59)
             MaF     28.39   47.98 (+19.59)
OntoNotes‡   Acc     25.10   28.27 (+3.17)
             MiF     33.61   49.79 (+16.18)
             MaF     37.91   49.95 (+12.04)
BBN          Acc     55.82   57.79 (+1.97)
             MiF     60.64   63.24 (+2.60)
             MaF     59.99   64.00 (+4.01)

Table 4: Results of zero-shot entity typing on FEW-NERD, OntoNotes, and BBN. ‡ means that we remove the "Other" class during testing. PLET denotes the prompt-learning pipeline and PLET (S) denotes self-supervised prompt-learning; both methods use BERT-base as the backbone encoder.

It should be noted that for the OntoNotes and BBN datasets, sampling 16 instances for each entity type already amounts to over 0.5% of the total training data. Meanwhile, some of the data in BBN are distantly supervised and potentially erroneous, which brings more randomness to few-shot training. The results support the idea that a well-designed prompt has much potential in mining the learned knowledge in pre-trained models and thus yields better performance in few-shot settings. The results also indicate that even when the number of entity types is large (46∼86), the superiority of prompt-learning still holds.

5.6 Results of Zero-shot Entity Typing

Table 4 shows the results of the zero-shot entity typing task on the three datasets. We do not report the performance of the vanilla fine-tuning approach because it cannot produce reasonable results with a randomly initialized classifier. It should also be noted that the prompt method without any fine-tuning already outperforms random guessing. This indicates that adding a prompt is informative for a model pre-trained with the masked language modeling task (e.g., BERT) and can induce reasonable predictions in entity typing tasks. Second, the performance of the model improves by a large margin if trained on unlabeled data. This shows the effectiveness of the proposed self-supervised training approach and points to the potential of a pre-trained prompt-based model under the zero-shot setting, when no labeled data are available.

To explore more subtle changes in performance, we carry out a case study for zero-shot entity typing. In Figure 4, we illustrate the zero-shot prediction distribution (the correct prediction and the other top-5 predictions) for four entity types in FEW-NERD: ORG-SPORTSLEAGUE, EVENT-ATTACK, MISC-CURRENCY, and LOC-MOUNTAIN. We observe that with self-supervised prompt-learning, PLET (S) can summarize entity type information and infer the related words to a certain extent. In Figure 4 (a) and Figure 4 (b), the PLET model suffers from a severe bias and predicts almost no correct labels in the zero-shot setting, since such words are low-frequency. Although there is no explicit supervision in the training of PLET (S), the model can still find the corresponding words that express the ORG-SPORTSLEAGUE and EVENT-ATTACK types. In Figure 4 (c), self-supervised learning increases the performance of the original encoder. Further, in Figure 4 (d), PLET is already able to make satisfying predictions for the type LOC-MOUNTAIN; in this case, the use of self-supervised learning has hardly weakened the performance, which means that the process of automatically summarizing type information has little negative impact on high-confidence entity types.

5.7 Effect of Templates

As stated in previous studies (Gao et al., 2020; Zhao et al., 2021), the choice of templates may have a huge impact on performance in prompt-learning. In this section, we carry out experiments to investigate such influence. Experiments are conducted under the 8-shot setting on the FEW-NERD dataset, and we use 3 different hard-encoding templates and 4 soft-encoding templates (obtained by changing the number of prompt tokens l). The results in Table 5 demonstrate that the choice of templates exerts a considerable influence on the performance of prompt-based few-shot learning. For the hard-encoding templates, the phrase that describes the location, "in this sentence", contributes a remarkable improvement in performance. For the soft-encoding templates, surprisingly, the prompt-learning model yields the best result with the fewest special tokens.


Figure 4: Zero-shot prediction distributions on four types in FEW-NERD: (a) ORG-SPORTSLEAGUE, (b) EVENT-ATTACK, (c) MISC-CURRENCY, and (d) LOC-MOUNTAIN. In each subgraph, the left part illustrates the results of PLET and the right part shows the results of PLET (S); colors distinguish correct predictions, wrong predictions with correct coarse-grained types, and wrong predictions with wrong coarse-grained types.

Encoding Strategy   Template T(x)                              Acc     MiF     MaF
Hard-encoding       x [Ent] is [MASK]                          54.45   67.34   67.34
                    x [Ent] is a [MASK]                        53.93   66.44   66.44
                    x In this sentence, [Ent] is a [MASK]      55.75   68.74   68.74
Soft-encoding       x [P] [Ent] [P1] … [Pl] [MASK], l = 2      59.25   69.58   69.58
                    x [P] [Ent] [P1] … [Pl] [MASK], l = 3      53.66   66.06   66.06
                    x [P] [Ent] [P1] … [Pl] [MASK], l = 4      52.96   66.01   66.01
                    x [P] [Ent] [P1] … [Pl] [MASK], l = 5      55.44   68.39   68.39

Table 5: Effect of templates. The results are produced under the 8-shot setting on the FEW-NERD dataset by PLET.

6 Related Work

After a series of effective PLMs like GPT (Radford et al., 2018), BERT (Devlin et al., 2019), RoBERTa (Liu et al., 2019), and T5 (Raffel et al., 2020), fine-tuned PLMs have demonstrated their effectiveness on various important NLP tasks, such as dialogue generation (Zhang et al., 2020), text summarization (Zhang et al., 2019; Liu and Lapata, 2019), question answering (Adiwardana et al., 2020), and text classification (Baldini Soares et al., 2019; Peng et al., 2020; Ding et al., 2021a).

Despite the success of fine-tuning PLMs, the huge gap in objective form between pre-training and fine-tuning still hinders the full use of pre-trained knowledge for downstream tasks (Liu et al., 2021b; Han et al., 2021b; Hu et al., 2021). To this end, prompt-learning has been proposed. In prompt-learning, by leveraging language prompts as contexts, downstream tasks can be expressed as cloze-style objectives similar to the pre-training objectives. The seminal work that stimulated the development of prompt-learning is the birth of GPT-3 (Brown et al., 2020), which uses hand-crafted prompts for tuning and achieves very impressive performance on various tasks, especially under the few-shot learning setting.

Inspired by GPT-3, a series of hand-crafted prompts have been widely explored in knowledge probing (Trinh and Le, 2018; Petroni et al., 2019; Davison et al., 2019), relation classification (Han et al., 2021b), sentiment classification, and natural language inference (Schick and Schütze, 2021; Liu et al., 2021b). To avoid labor-intensive prompt design, automatic prompt search has also been extensively explored (Schick et al., 2020; Schick and Schütze, 2021; Shin et al., 2020; Gao et al., 2020; Liu et al., 2021a) to generate language phrases for prompts. Recently, continuous prompts have also been proposed (Li and Liang, 2021; Lester et al., 2021), which directly use a series of learnable continuous embeddings as prompts rather than discrete language phrases.

In this paper, we aim to stimulate PLMs with prompt-learning to capture the attribute information of entities. We take fine-grained entity typing, a crucial task in knowledge extraction that assigns entity types to entity mentions (Lin et al., 2012), as the foothold to develop prompt-learning strategies. In fact, Dai et al. (2021) use hypernym extraction patterns to enhance the context and apply masked language modeling to tackle the ultra-fine entity typing problem (Choi et al., 2018) with free-form labels, which shares a similar idea with prompt-learning. In our work, we mainly emphasize using prompt-learning to extract pre-defined entity types in low-data scenarios.

7 Conclusion

This work investigates the application of prompt-learning to fine-grained entity typing. More specifically, we propose a framework, PLET, that deals with fine-grained entity typing in fully supervised, few-shot, and zero-shot scenarios. In PLET, we first introduce a simple and effective prompt-learning pipeline that can be used to extract entity types with both sufficient and insufficient supervision. Furthermore, to handle the zero-shot setting, we propose a self-supervised prompt-learning approach that automatically learns and summarizes entity types based on unlabeled corpora and a pre-defined label schema. PLET utilizes prompts to take advantage of the prior knowledge distributed in PLMs, and it can learn pre-defined type information without overfitting by performing distribution-level optimization. In our future work, along the direction of PLET (S), we will explore better prompt-learning approaches to automatically learn entity types from unlabeled data.

References

Daniel Adiwardana, Minh-Thang Luong, David R. So, Jamie Hall, Noah Fiedel, Romal Thoppilan, Zi Yang, Apoorv Kulshreshtha, Gaurav Nemade, Yifeng Lu, et al. 2020. Towards a human-like open-domain chatbot. arXiv preprint arXiv:2001.09977.

Livio Baldini Soares, Nicholas FitzGerald, Jeffrey Ling, and Tom Kwiatkowski. 2019. Matching the blanks: Distributional similarity for relation learning. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2895–2905, Florence, Italy. Association for Computational Linguistics.

Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. In Proceedings of NIPS, pages 1877–1901.

Eunsol Choi, Omer Levy, Yejin Choi, and Luke Zettlemoyer. 2018. Ultra-fine entity typing. In Proceedings of ACL, pages 87–96.

Hongliang Dai, Yangqiu Song, and Haixun Wang. 2021. Ultra-fine entity typing with weak supervision from a masked language model. In Proceedings of ACL, pages 1790–1799.

Joe Davison, Joshua Feldman, and Alexander M. Rush. 2019. Commonsense knowledge mining from pretrained models. In Proceedings of EMNLP-IJCNLP, pages 1173–1178.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of NAACL-HLT, pages 4171–4186.

Ning Ding, Xiaobin Wang, Yao Fu, Guangwei Xu, Rui Wang, Pengjun Xie, Ying Shen, Fei Huang, Hai-Tao Zheng, and Rui Zhang. 2021a. Prototypical representation learning for relation extraction. In Proceedings of ICLR.

Ning Ding, Guangwei Xu, Yulin Chen, Xiaobin Wang, Xu Han, Pengjun Xie, Hai-Tao Zheng, and Zhiyuan Liu. 2021b. Few-NERD: A few-shot named entity recognition dataset. In Proceedings of ACL, pages 3198–3213.

Tianyu Gao, Adam Fisch, and Danqi Chen. 2020. Making pre-trained language models better few-shot learners. arXiv preprint arXiv:2012.15723.

Xu Han, Zhengyan Zhang, Ning Ding, Yuxian Gu, Xiao Liu, Yuqi Huo, Jiezhong Qiu, Liang Zhang, Wentao Han, Minlie Huang, et al. 2021a. Pre-trained models: Past, present and future. arXiv preprint arXiv:2106.07139.

Xu Han, Weilin Zhao, Ning Ding, Zhiyuan Liu, and Maosong Sun. 2021b. PTR: Prompt tuning with rules for text classification. arXiv preprint arXiv:2105.11259.

John Hewitt and Christopher D. Manning. 2019. A structural probe for finding syntax in word representations. In Proceedings of NAACL, pages 4129–4138.

Shengding Hu, Ning Ding, Huadong Wang, Zhiyuan Liu, Juanzi Li, and Maosong Sun. 2021. Knowledgeable prompt-tuning: Incorporating knowledge into prompt verbalizer for text classification. arXiv preprint arXiv:2108.02035.

Ganesh Jawahar, Benoît Sagot, and Djamé Seddah. 2019. What does BERT learn about the structure of language? In Proceedings of ACL, pages 3651–3657.

Brian Lester, Rami Al-Rfou, and Noah Constant. 2021. The power of scale for parameter-efficient prompt tuning. arXiv preprint arXiv:2104.08691.

Xiang Lisa Li and Percy Liang. 2021. Prefix-tuning: Optimizing continuous prompts for generation. arXiv preprint arXiv:2101.00190.

Thomas Lin, Oren Etzioni, et al. 2012. No noun phrase left behind: Detecting and typing unlinkable entities. In Proceedings of EMNLP-CoNLL, pages 893–903.

Xiao Ling and Daniel S. Weld. 2012. Fine-grained entity recognition. In AAAI.

Pengfei Liu, Weizhe Yuan, Jinlan Fu, Zhengbao Jiang, Hiroaki Hayashi, and Graham Neubig. 2021a. Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing. arXiv preprint arXiv:2107.13586.

Xiao Liu, Yanan Zheng, Zhengxiao Du, Ming Ding, Yujie Qian, Zhilin Yang, and Jie Tang. 2021b. GPT understands, too. arXiv preprint arXiv:2103.10385.

Yang Liu and Mirella Lapata. 2019. Text summarization with pretrained encoders. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3730–3740, Hong Kong, China. Association for Computational Linguistics.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.

Ilya Loshchilov and Frank Hutter. 2019. Decoupled weight decay regularization. In Proceedings of ICLR.

Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Köpf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. 2019. PyTorch: An imperative style, high-performance deep learning library. In Proceedings of NIPS, pages 8024–8035.

Hao Peng, Tianyu Gao, Xu Han, Yankai Lin, Peng Li, Zhiyuan Liu, Maosong Sun, and Jie Zhou. 2020. Learning from context or names? An empirical study on neural relation extraction. In Proceedings of EMNLP, pages 3661–3672.

Fabio Petroni, Tim Rocktäschel, Sebastian Riedel, Patrick Lewis, Anton Bakhtin, Yuxiang Wu, and Alexander Miller. 2019. Language models as knowledge bases? In Proceedings of EMNLP, pages 2463–2473.

Xipeng Qiu, Tianxiang Sun, Yige Xu, Yunfan Shao, Ning Dai, and Xuanjing Huang. 2020. Pre-trained models for natural language processing: A survey. Science China Technological Sciences, pages 1–26.

Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding by generative pre-training. OpenAI.

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. JMLR, 21:1–67.

Xiang Ren, Wenqi He, Meng Qu, Lifu Huang, Heng Ji, and Jiawei Han. 2016a. AFET: Automatic fine-grained entity typing by hierarchical partial-label embedding. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1369–1378, Austin, Texas. Association for Computational Linguistics.

Xiang Ren, Wenqi He, Meng Qu, Clare R. Voss, Heng Ji, and Jiawei Han. 2016b. Label noise reduction in entity typing by heterogeneous partial-label embedding. In Proceedings of SIGKDD, pages 1825–1834.

Timo Schick, Helmut Schmid, and Hinrich Schütze. 2020. Automatically identifying words that can serve as labels for few-shot text classification. In Proceedings of COLING, pages 5569–5578.

Timo Schick and Hinrich Schütze. 2021. Exploiting cloze-questions for few-shot text classification and natural language inference. In Proceedings of EACL, pages 255–269.

Sonse Shimaoka, Pontus Stenetorp, Kentaro Inui, and Sebastian Riedel. 2017. Neural architectures for fine-grained entity type classification. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, pages 1271–1280, Valencia, Spain. Association for Computational Linguistics.

Taylor Shin, Yasaman Razeghi, Robert L. Logan IV, Eric Wallace, and Sameer Singh. 2020. AutoPrompt: Eliciting knowledge from language models using automatically generated prompts. In Proceedings of EMNLP, pages 4222–4235.

Trieu H. Trinh and Quoc V. Le. 2018. A simple method for commonsense reasoning. arXiv preprint arXiv:1806.02847.

Dong Wang, Ning Ding, Piji Li, and Haitao Zheng. 2021. CLINE: Contrastive learning with semantic negative examples for natural language understanding. In Proceedings of ACL.

Ralph Weischedel and Ada Brunstein. 2005. BBN pronoun coreference and entity type corpus. Linguistic Data Consortium, Philadelphia.

Ralph Weischedel, Martha Palmer, Mitchell Marcus, Eduard Hovy, Sameer Pradhan, Lance Ramshaw, Nianwen Xue, Ann Taylor, Jeff Kaufman, Michelle Franchini, Mohammed El-Bachouti, Robert Belvin, and Ann Houston. 2013. OntoNotes Release 5.0. Abacus Data Network.

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Remi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander Rush. 2020. Transformers: State-of-the-art natural language processing. In Proceedings of EMNLP, pages 38–45.

Jingqing Zhang, Yao Zhao, Mohammad Saleh, and Peter J. Liu. 2019. PEGASUS: Pre-training with extracted gap-sentences for abstractive summarization. In Proceedings of ICML, pages 11328–11339.

Yizhe Zhang, Siqi Sun, Michel Galley, Yen-Chun Chen, Chris Brockett, Xiang Gao, Jianfeng Gao, Jingjing Liu, and Bill Dolan. 2020. DialoGPT: Large-scale generative pre-training for conversational response generation. In Proceedings of ACL, pages 270–278.

Tony Z. Zhao, Eric Wallace, Shi Feng, Dan Klein, and Sameer Singh. 2021. Calibrate before use: Improving few-shot performance of language models. arXiv preprint arXiv:2102.09690.

London is one of the biggest cities in the world London is a

Input Prompt

[MASK] [SEP]

Copy the entity mention

LOCATIONCITY

Label Words Class Sets

Mapping

[CLS]

MLM head

City Location hellip

Figure 2 The illustration of prompt-learning for fine-grained entity typing with supervision We take hard-encoding prompt strategy as an example in this figure

ing the objective function 1n

sumni=1 log(P (yi|si))

where yi is the golden type label of si

23 Prompt-based Tuning

In prompt-based tuning for each label y isin Y wedefine a label word set Vy = w1 wm Vyis a subset of the vocabulary V of the PLM Mie Vy sube V By taking the union of the dictio-nary corresponding to each label we get an overalldictionary Vlowast For example in sentiment classi-fication we could map the label y = POSITIVE

into a set Vy = great good wonderful Andanother primary component of prompt-learning is aprompt template T (middot) which modifies the originalinput x into a prompt input T (x) by adding a setof additional tokens at the end of x Convention-ally a [MASK] token is added for PLMs to predictthe missing label word w isin Vlowast Thus in prompt-learning a classification problem is transferred intoa masked language modeling problem

p(y isin Y|s)=p([MASK]=wisinVy|T (s)) (2)

3 Prompt-learning for Entity Typing ANaive Pipeline

After transferred into masked language modelingthe prompt-learning method is applicable to learn-ing and aggregating type information of entities Inthis section we first introduce a naive but empiri-cally strong baseline that utilizes prompts to extractentity types with explicit supervision includingthe construction of label words (sect 31) templates(sect 32) and training (sect 33) And such a simplepipeline yields remarkable results on three bench-mark datasets Then we propose a self-supervisedprompt-learning method that automatically learnstype information from unlabeled data (sect 4)

31 Label Words Set Vlowast

For fine-grained entity typing datasets usu-ally use hierarchical label space such as PER-SONARTIST (FEW-NERD) and ORGANIZA-TIONPARTY (OntoNotes) In this case we useall the words as the label words set Vlowast for this en-tity type For example y = LOCATIONCITY rarrv = location city And as the entity types areall well-defined nouns with clear boundaries it isintuitive to expand the label words set Vlowast with ob-tainable related nouns For example in RelatedWords1 the top-10 related words of the label wordcity is ldquometropolis town municipality urban sub-urb municipal megalopolis civilization down-town countryrdquo These words are strongly relatedto the class CITY and they are hardly mappedto other entity types even under the same LOCA-TION class such as LOCATIONMOUNTAIN LO-CATIONISLAND etc

In masked language modeling we use confi-dence scores of all the words in Vy to constructthe final score of the particular type y That is foran input x (which is mapped to T (x)) and its entitytype y (which is mapped to Vy = w1 wm)the conditional probability becomes

P (y|x)= 1

m

msumj

λjP ([MASK]=wj |T (x)) (3)

where λi is a parameter to indicate the importanceof the current word wj isin Vy Note that λi couldalso be learnable or heuristically defined during thetraining procedure

32 Templates

In this section we construct entity-orientedprompts for the fine-grained entity typing task We

1httpsrelatedwordsorg

choose hard-encoding templates with natural lan-guage and soft-encoding templates with additionalspecial tokens in our work

For the choice of hard-encoding templates wedo not use automatic searching methods for dis-crete prompts since the fine-grained entity typingtask is clearly defined and the prompts are easilypurposeful We select simple declarative templatesrather than hypernym templates to avoid grammarti-cal errors In the template of hard encoding settingwe first copy the marked entity mention in x thenwe add a few linking verbs and articles followed bythe [MASK] token With the marked entity mention[Ent] we use the following templates

T1(x) = x [Ent] is [MASK]

T2(x) = x [Ent] is a [MASK]

T3(x) = x In this sentence [Ent] is a [MASK]

where [Ent] is the entity mention in x In sect 5 wereport the the results of T3(middot)

We also adopt the soft-encoding strategywhich introduces some additional special tokens[P1] [Pl] as the template where l is a pre-defined hyper-parameter The template begins witha delimiter [P] and a copy of the entity mention [M]The complete template becomes

T4(x) = x [P] [Ent] [P1] [Pl] [MASK]

where each embedding of prompts is randomly ini-tialized and optimized during training Intuitivelythese special tokens can represent a cluster of wordswith similar semantics in the vocabulary

33 Training and InferenceThe strategies of hard or soft encoding provide dif-ferent initialization of templates and they both canbe parameterized by φ and optimized along withM during training We train the pre-trained modelM (parameterized by θ) along with the additionalprompt embeddings by using the cross-entropy lossfunction

L = minussum

logP (y|x θ φ) (4)

For inference we can directly use Eq 3 to predictthe label of the current input instance based on thepredicted words of the [MASK] position

This pipeline could be applied to entity typingtask with explicit supervision and it is effectiveeven if the training data are insufficient ie the

few-shot scenario (sect 55) Naturally we considera more extreme situation that is a scenario with-out any training data (zero-shot scenario) In thissetting if we directly use an additional classifier topredict the label the result is equivalent to randomguessing because the parameters of the classifierare randomly initialized If we use prompts to inferthe label based on the predicted words although itsperformance is significantly better than guessingthere will also be a catastrophic decline (sect 56) Atthis time a question emerges ldquoIs it possible forPLMs to predict entity types without any explicitsupervision rdquo

4 Self-supervised Prompt-learning forZero-shot Entity Typing

With prompt-learning the answer is yes be-cause in the pre-training stage the contexts ofentities have already implied the correspondingtype information which provides an advanta-geous initialization point for the prompt-learningparadigm For example in the input sentencewith the T3(middot) template ldquoSteve Jobs found Ap-ple In this sentence Steve Jobs is a [MASK] rdquoIn our observations the probability of PLMs pre-dicting person at the masked position will be sig-nificantly higher than the probability of locationAnd if we make reasonable use of this superiorinitialization point it is possible for PLMs to au-tomatically summarize the type information andfinally extract the correct entity type

41 Overview

In order to create conditions for PLMs to sum-marize entity types we consider a self-supervisedparadigm that optimizes the similarity of the prob-ability distribution predicted by similar examplesover a projected vocabulary Vlowast To achieve thatin prompt-learning we need to (1) impose a limiton the prediction range of the model so that onlythose words that we need that is words that ex-press entity types participate in the optimizationof the gradient (2) provide an unlabeled datasetwhere entity mentions are marked without anytypes to allow the model to learn the process ofinducing type information in a self-supervised man-ner The inputs contain a pre-trained modelM apre-defined label schema Y and a dataset with-out labels D = x1 xn (entity mentions aremarked without any types) our goal is to makeMcapable to automatically carry out zero-shot entity

London is one of the biggest citieshellip London

Input Prompt

[MASK] [SEP][CLS]

[HIDE] is located in the south-easthellip

Input

[CLS]

[P1] [P2]

Prompt

[MASK] [SEP][P1] [P2]

Copy the entity mention

Randomly hide the mention with probability α

[HIDE]

D

MLM

MLM Predictionover

Predictionover

JS Divergence

LD Unlabeled Datset Pre-defined Label Schema S Sampler L

L

S

Mapping

Mapping

Figure 3 The illustration of self-supervised prompt-learning for fine-grained entity typing with unlabeled data anda pre-defined label set Vlowast denotes the label words projected from the input label set Note that we only show thepositive pair in this figure

typing after trained on D and Y Using prompt-learning as the training strategy we first construct alabel words set Vlowast from Y and for each sentence xin D we wrap it with hard-encoding template witha [MASK] symbol The key idea is to make theprediction distributions of the same type of entitieson Vlowast as similar as possible In this way we canperform contrastive learning by sampling positiveand negative examples while ignoring the impactof other words that are not in Vlowast on optimizationduring the MLM process

42 Self-supervised Learning

Although there are no labels in D we can stilldevelop a sampling strategy based on a simplehypothesis that is same entities in different sen-tences have similar types For instance we willsample two sentences contain ldquoSteve Jobsrdquo as a pos-itive pair Moreover considering entity typing iscontext-aware ldquoSteve Jobsrdquo could be entrepreneurdesigner philanthropist in different contexts wechoose to optimize the similarity between distribu-tions of the words over Vlowast This strategy not onlysoftens the supervision but also eliminates the im-pact of other words in self-supervised learning

Particularly we randomly sample c positivepairs ie sentence pairs that share one same en-tity mention denoted as ˆDpos and c negative pairsie two sentences with different entity mentionsmarked denoted as ˆDneg from a large-scale entity-linked corpusD To avoid generating false negativesamples the negative samples are further restrictedby a large dictionary that contains common entitiesand their type information Only sentence pairs

with entities of different types in the dictionary areselected as negative samples Then we wrap themwith hard-encoding T3(middot) To avoid overfitting ofthe entity names we randomly hide the entity men-tion (in the original input and the template) witha special symbol [Hide] with a probability of αEmpirically α is set to 04

Since the impact of a pair of examples on train-ing should be measured at the distribution level wechoose Jensen-Shannon divergence as a metric toassess the similarity of two distributions Thus ina sentence pair (x xprime) the similarity score of tworepresentations of the the predictions h and hprime ofthe [MASK] position is computed by

s(hhprime) = JS(PVlowast(w|x) PVlowast(w|xprime)) (5)

where JS is Jensen-Shannon divergence PVlowast(w|x)and PVlowast(w|xprime) are probability distributions of thepredicting token w over Vlowast obtained by h and hprime

As we attempt to make the predictions of thepositive pairs similar the objective is computed by

L = minus 1

| ˆDpos|2sum

xisinDpos

sumxprimeisinDpos

log(1minus s(hhprime))

minus 1

| ˆDneg|2sum

xisin ˆDneg

sumxprimeisin ˆDneg

γ log(s(hhprime))

(6)

where γ is a penalty term because the assumptionis loose in negative pairs Overall we use entity-linked English Wikipedia corpus as the raw dataand generate about 1 million pairs of data each asˆDpos and ˆDneg

Dataset Type Supervised Few-shot Zero-shot

|Dtrain| |Ddev| |Dtest| |Dtrain| |Ddev| |Dtest| |Dtrain| |Ddev| |Dtest|

Few-NERD 66 340382 48758 96901 66~1056 = |Dtrain| 96901 0 0 96901OntoNotes 86 253239 2200 8962 86~1376 = |Dtrain| 8962 0 0 8962BBN 46 86077 12824 12824 46~736 = |Dtrain| 12824 0 0 12824

Table 1 Statistics of FEW-NERD OntoNotes and BBN from three experimental settings It can be seen that forall three settings the test sets are identical For the training set of the few-shot setting we report the summationfrom 1-shot to 16-shot

Shot Metric Few-NERD OntoNotes BBN

Fine-tuning PLET Fine-tuning PLET Fine-tuning PLET

1Acc 894 4387 (+3493) 370 3897 (+3527) 080 4070 (+3990)MiF 1985 6060 (+4575) 1898 5991 (+4093) 579 4925 (+4346)MaF 1985 6060 (+4075) 1943 6142 (+4199) 442 4848 (+4306)

2Acc 2083 4778 (+2695) 727 3919 (+3192) 668 4133 (+3465)MiF 3267 6209 (+2942) 2489 6109 (+3620) 1370 5400 (+4030)MaF 3267 6209 (+2942) 2564 6268 (+3704) 1323 5197 (+3874)

4Acc 3309 5700 (+2391) 1115 3839 (+2724) 1934 5221 (+3287)MiF 4414 6861 (+2447) 2769 5981 (+3212) 2703 6113 (+3410)MaF 4414 6861 (+2447) 2826 6089 (+3263) 2469 5891 (+3422)

8Acc 4644 5575 (+931) 1837 3937 (+2100) 2701 4430 (+1729)MiF 5776 6874 (+1098) 3816 5797 (+1981) 4019 5621 (+1602)MaF 5776 6874 (+1098) 3777 5832 (+2055) 3950 5515 (+1565)

16Acc 6098 6158 (+060) 3226 4229 (+1003) 3967 5500 (+1533)MiF 7159 7239 (+080) 5140 6079 (+939) 4901 6284 (+1383)MaF 7159 7239 (+080) 5145 6180 (+1035) 4709 6238 (+1529)

Table 2 Results of few-shot entity typing on FEW-NERD OntoNotes and BBN all the methods use BERTbasewith same initialization weights as the backbone encoder Training set and dev set have the same size

5 Experiments

In this section we conduct experiments to evaluatethe effectiveness of our methods We use FT to de-note the BERT-based fine-tuning approach PLET

to denote the naive prompt-learning approach forentity typing in sect 3 and PLET (S) to denote theself-supervised prompt-learning approach in sect 4Our experiments are carried out on fully supervised(sect 54) few-shot (sect 55) and zero-shot (sect 56) set-tings on three fine-grained entity typing datasets

51 Datasets

We use three fine-grained entity typing datasetsFEW-NERD OntoNotes and BBN

FEW-NERD We use FEW-NERD (Dinget al 2021b) as the main dataset which has the fol-lowing advantages (1) FEW-NERD is large-scaleand fine-grained which contains 8 coarse-grainedand 66 fine-grained entity types (2) FEW-NERDis manually annotated thereby we can precisely as-

sess the capability of entity typing models Specifi-cally we use the supervised setting of the datasetFEW-NERD (SUP) and the official split of it toconduct our experiments

OntoNotes We also use the OntoNotes 50dataset (Weischedel et al 2013) in experimentsFollowing previous works for fine-grained entitytyping we adopt 86-classes version of OntoNoteswhile each class has at most 3 levels of the typehierarchy And the data split is identical to (Shi-maoka et al 2017)

BBN BBN dataset is selected from Penn Tree-bank corpus of Wall Street Journal texts and labeledby (Weischedel and Brunstein 2005) We followthe version processed by (Ren et al 2016a) andthe data split by (Ren et al 2016b) The datasetcontains 46 types and each type has a maximumtype hierarchy level of 2

52 Experimental SettingsThe experiments are performed under three differ-ent settings to evaluate the effect of the prompt-learning method and semi-supervised training Intable 1 we show the statistics of all the settings onthe three datasets

Supervised Setting In a fully supervised set-ting all training data are used in the training phaseFT and PLET are used to train the model We runthe experiments on all three datasets with BERT-base-cased backbone Both hard and soft encodingsare used for PLET

Few-shot Setting In a few-shot setting werandomly sample 1 2 4 8 16 instances for eachentity type for training We apply both FT andPLET methods with hard encoding on all the threedatasets

Zero-shot Setting In zero-shot setting notraining data with labels are available The model isrequired to infer the entity type without any super-vised training Since fine-tuning is not applicablein this setting we only conduct experiments onPLET and PLET (S)

Metrics In terms of evaluation metrics wefollow the widely used setting of Ling and Weld(2012) which includes strict accuracy (Acc) loosemacro F1-score (MaF) and loose micro F1-score(MiF) to evaluate the performances of models Theloose F1-score calculation concerns type labels bydifferent granularities

53 Experimental DetailsWe use BERT-base (Devlin et al 2019) as thebackbone structures of our model and initializedwith the corresponding pre-trained cased weights2The hidden sizes are 768 and the number of lay-ers are 12 Models are implemented by Pytorchframework3 (Paszke et al 2019) and Huggingfacetransformers4 (Wolf et al 2020) BERT modelsare optimized by AdamW (Loshchilov and Hutter2019) with the learning rate of 5e-5 The trainingbatch size used is 16 for all models In the super-vised setting each model is trained for 10 epochsand evaluated on the dev set every 2000 steps Inthe few-shot setting each model is trained for 30epochs and evaluated every 10sim50 steps each timethe evaluation is run for 200 steps For the methods

2httpsgithubcomgoogle-researchbert

3httpspytorchorg4httpsgithubcomhuggingface

transformers

with hard-encoding we report the experimental re-sults of T3(middot) For the soft-encoding method wereport the results of m = 2 Experiments are con-ducted with CUDA on NVIDIA Tesla V100 GPUs

54 Results of Fully Supervised Entity Typing

Dataset Metric Method

FT PLET (H) PLET (S)

Few-NERDAcc 7975 7990 7986MiF 8574 8584 8576MaF 8574 8584 8576

OntoNotesAcc 5971 6037 6568MiF 7047 7078 7453MaF 7657 7642 7977

BBNAcc 6239 6592 6311MiF 6888 7155 6868MaF 6737 7082 6781

Table 3 Fully supervised entity typing results FTdenotes the vanilla fine-tuning method (H) denotesthe hard-encoding strategy and (S) denotes the soft-encoding strategy All the methods use BERTbase withsame initialization weights as the backbone encoder

The results on all three datasets across differentmodels are reported in Table 3 Overall the prompt-based methods have shown certain improvementscomparing to directly fine-tuned models It showsthat the prompt-based method does help with cap-turing entity-type information from a given context

It is also observed that the magnitude of the im-provement and the preference of prompt encod-ing strategy may vary with different datasets Theprompt-based method seems less effective on FEW-NERD dataset than the other two It indicates thatthe effect of the prompt-based method partially de-pends on the characteristics of the dataset and thatdifferent prompt designs may suit different dataSpecifically FEW-NERD is manually annotatedand contains much less noise than the other twodatasets benefiting the FT method to learn classi-fication with an extra linear layer Moreover forthe OntoNotes dataset soft encoding significantlyoutperforms hard encoding while for the other twodatasets the effect seems reversed

55 Results of Few-shot Entity Typing

Table 2 shows the results on few-shot entity typ-ing It is shown that prompt-based model outper-forms fine-tuning by a large margin under few-shotsetting especially when only 1 sim 2 training in-stances per type are available It should be noted

Dataset Metric Method

PLET PLET (S)

Few-NERDAcc 1755 2399 (+644)MiF 2839 4798 (+1959)MaF 2839 4798 (+1959)

OntoNotesDaggerAcc 2510 2827 (+317)MiF 3361 4979 (+1618)MaF 3791 4995 (+1204)

BBNAcc 5582 5779 (+197)MiF 6064 6324 (+260)MaF 5999 6400 (+401)

Table 4 Results of zero-shot entity typing on FEW-NERD OntoNotes and BBN Dagger means that we re-move the ldquoOtherrdquo class during testing PLET denotesthe prompt-learning pipeline and PLET (S) denotesself-supervised prompt-learning both methods use theBERTbase as the backbone encoder

that for OntoNotes and BBN datasets sampling 16instances for each entity type already amounts toover 05 of the total training data Meanwhilesome of the data in BBN are distantly-supervisedand are potentially erroneous It brings more ran-domness to few-shot training The results supportthe idea that a well-designed prompt has muchpotential in mining the learned knowledge in pre-trained models and thus yields better performancein few-shot settings The results also indicate thateven when the number of entity types is large (46sim86) the superiority of prompt-learning still holds

56 Results of Zero-shot Entity Typing

Table 4 shows the results on zero-shot entity typ-ing task on FEW-NERD dataset We did not re-port the performance of the vanilla fine-tuning ap-proach because it cannot produce reasonable resultswith a randomly initialized classifier And it alsoshould be noted that the prompt method withoutfine-tuning already outperforms random guessingIt indicates that adding a prompt is informative fora model pre-trained on masked-language-modeltask (eg BERT) and can induce reasonable pre-dictions in entity typing tasks Second the perfor-mance of the model improves by a large margin iftrained on unlabeled data It shows the effective-ness of the proposed self-supervised training ap-proach and points to the potential of a pre-trainedprompt-based model under the zero-shot settingwhen no labeled data are available

To explore the more subtle changes in performance, we carry out a case study for zero-shot entity typing. In Figure 4, we illustrate the zero-shot prediction distribution (the correct prediction and the other top-5 predictions) for four entity types in FEW-NERD: ORG-SPORTSLEAGUE, EVENT-ATTACK, MISC-CURRENCY and LOC-MOUNTAIN. We observe that with self-supervised prompt-learning, PLET (S) could summarize entity type information and infer the related words to a certain extent. In Figure 4 (a) and Figure 4 (b), the PLET model suffers from a severe bias and predicts almost no correct labels in the zero-shot setting, since such label words are low-frequency. Although there is no explicit supervision in the training stage of PLET (S), the model could still find the corresponding words that express the ORG-SPORTSLEAGUE and EVENT-ATTACK types. In Figure 4 (c), self-supervised learning increases the performance of the original encoder. Further, in Figure 4 (d), PLET already makes satisfying predictions for the LOC-MOUNTAIN type; in this case, self-supervised learning hardly weakens the performance, which means that the process of automatically summarizing type information has little negative impact on high-confidence entity types.

5.7 Effect of Templates

As stated in previous studies (Gao et al., 2020; Zhao et al., 2021), the choice of templates may have a huge impact on performance in prompt-learning. In this section, we carry out experiments to investigate such influence. Experiments are conducted under the 8-shot setting on the FEW-NERD dataset, and we use 3 different hard-encoding templates and 4 soft-encoding templates (by changing the number of soft prompt tokens l). The results, reported in Table 5, demonstrate that the choice of templates exerts a considerable influence on the performance of prompt-based few-shot learning. For the hard-encoding templates, the phrase that describes the location, "in this sentence", contributes a remarkable improvement in performance. For the soft-encoding templates, surprisingly, the prompt-learning model yields the best result with the fewest special tokens.

6 Related Work

After a series of effective PLMs like GPT (Radford et al., 2018), BERT (Devlin et al., 2019), RoBERTa (Liu et al., 2019) and T5 (Raffel et al., 2020),

[Figure 4 (bar charts omitted): Zero-shot prediction distributions on four types in FEW-NERD: (a) ORG-SPORTSLEAGUE, (b) EVENT-ATTACK, (c) MISC-CURRENCY, (d) LOC-MOUNTAIN. In each subgraph, the left part illustrates the top predictions of PLET and the right part those of PLET (S); the bars distinguish correct predictions, wrong predictions with the correct coarse-grained type, and wrong predictions with a wrong coarse-grained type.]

Encoding Strategy   Template T(x)                                 Acc     MiF     MaF
Hard-encoding       x [Ent] is [MASK]                             54.45   67.34   67.34
                    x [Ent] is a [MASK]                           53.93   66.44   66.44
                    x In this sentence, [Ent] is a [MASK]         55.75   68.74   68.74
Soft-encoding       x [P] [Ent] [P1] … [Pl] [MASK]  (l = 2)       59.25   69.58   69.58
                    x [P] [Ent] [P1] … [Pl] [MASK]  (l = 3)       53.66   66.06   66.06
                    x [P] [Ent] [P1] … [Pl] [MASK]  (l = 4)       52.96   66.01   66.01
                    x [P] [Ent] [P1] … [Pl] [MASK]  (l = 5)       55.44   68.39   68.39

Table 5: Effect of templates. The results are produced under the 8-shot setting on the FEW-NERD dataset by PLET.

fine-tuned PLMs have demonstrated their effectiveness on various important NLP tasks such as dialogue generation (Zhang et al., 2020), text summarization (Zhang et al., 2019; Liu and Lapata, 2019), question answering (Adiwardana et al., 2020) and text classification (Baldini Soares et al., 2019; Peng et al., 2020; Ding et al., 2021a).

Despite the success of fine-tuning PLMs, the huge gap between the objective forms of pre-training and fine-tuning still hinders the full use of pre-trained knowledge for downstream tasks (Liu et al., 2021b; Han et al., 2021b; Hu et al., 2021). To this end, prompt-learning has been proposed. In prompt-learning, by leveraging language prompts as contexts, downstream tasks can be expressed as

cloze-style objectives similar to the pre-training objectives. The seminal work that stimulated the development of prompt-learning is GPT-3 (Brown et al., 2020), which uses hand-crafted prompts for tuning and achieves very impressive performance on various tasks, especially under the few-shot learning setting.

Inspired by GPT-3, a series of hand-crafted prompts have been widely explored in knowledge probing (Trinh and Le, 2018; Petroni et al., 2019; Davison et al., 2019), relation classification (Han et al., 2021b), sentiment classification and natural language inference (Schick and Schütze, 2021; Liu et al., 2021b). To avoid labor-intensive prompt design, automatic prompt search has also been

extensively explored (Schick et al., 2020; Schick and Schütze, 2021; Shin et al., 2020; Gao et al., 2020; Liu et al., 2021a) to generate language phrases for prompts. Recently, some continuous prompts have also been proposed (Li and Liang, 2021; Lester et al., 2021), which directly use a series of learnable continuous embeddings as prompts rather than discrete language phrases.

In this paper, we aim to stimulate PLMs with prompt-learning to capture the attribute information of entities. We take fine-grained entity typing, a crucial task in knowledge extraction that assigns entity types to entity mentions (Lin et al., 2012), as the foothold to develop prompt-learning strategies. In fact, Dai et al. (2021) use hypernym extraction patterns to enhance the context and apply masked language modeling to tackle the ultra-fine entity typing problem (Choi et al., 2018) with free-form labels, which shares a similar idea with prompt-learning. In our work, we mainly emphasize using prompt-learning to extract entity types that have been pre-defined, in low-data scenarios.

7 Conclusion

This work investigates the application of prompt-learning to fine-grained entity typing. More specifically, we propose a framework, PLET, that deals with fine-grained entity typing in fully supervised, few-shot and zero-shot scenarios. In PLET, we first introduce a simple and effective prompt-learning pipeline that can be used to extract entity types with both sufficient and insufficient supervision. Furthermore, to handle the zero-shot setting, we propose a self-supervised prompt-learning approach that automatically learns and summarizes entity types based on unlabeled corpora and a pre-defined label schema. PLET utilizes prompts to take advantage of the prior knowledge distributed in PLMs, and it can learn pre-defined type information without overfitting by performing distribution-level optimization. In future work along the direction of PLET (S), we will explore better prompt-learning approaches for automatically learning entity types from unlabeled data.

References

Daniel Adiwardana, Minh-Thang Luong, David R. So, Jamie Hall, Noah Fiedel, Romal Thoppilan, Zi Yang, Apoorv Kulshreshtha, Gaurav Nemade, Yifeng Lu, et al. 2020. Towards a human-like open-domain chatbot. arXiv preprint arXiv:2001.09977.

Livio Baldini Soares, Nicholas FitzGerald, Jeffrey Ling, and Tom Kwiatkowski. 2019. Matching the blanks: Distributional similarity for relation learning. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2895–2905, Florence, Italy. Association for Computational Linguistics.

Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. In Proceedings of NIPS, pages 1877–1901.

Eunsol Choi, Omer Levy, Yejin Choi, and Luke Zettlemoyer. 2018. Ultra-fine entity typing. In Proceedings of ACL, pages 87–96.

Hongliang Dai, Yangqiu Song, and Haixun Wang. 2021. Ultra-fine entity typing with weak supervision from a masked language model. In Proceedings of ACL, pages 1790–1799.

Joe Davison, Joshua Feldman, and Alexander M. Rush. 2019. Commonsense knowledge mining from pretrained models. In Proceedings of EMNLP-IJCNLP, pages 1173–1178.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of NAACL-HLT, pages 4171–4186.

Ning Ding, Xiaobin Wang, Yao Fu, Guangwei Xu, Rui Wang, Pengjun Xie, Ying Shen, Fei Huang, Hai-Tao Zheng, and Rui Zhang. 2021a. Prototypical representation learning for relation extraction. In Proceedings of ICLR.

Ning Ding, Guangwei Xu, Yulin Chen, Xiaobin Wang, Xu Han, Pengjun Xie, Hai-Tao Zheng, and Zhiyuan Liu. 2021b. Few-NERD: A few-shot named entity recognition dataset. In Proceedings of ACL, pages 3198–3213.

Tianyu Gao, Adam Fisch, and Danqi Chen. 2020. Making pre-trained language models better few-shot learners. arXiv preprint arXiv:2012.15723.

Xu Han, Zhengyan Zhang, Ning Ding, Yuxian Gu, Xiao Liu, Yuqi Huo, Jiezhong Qiu, Liang Zhang, Wentao Han, Minlie Huang, et al. 2021a. Pre-trained models: Past, present and future. arXiv preprint arXiv:2106.07139.

Xu Han, Weilin Zhao, Ning Ding, Zhiyuan Liu, and Maosong Sun. 2021b. PTR: Prompt tuning with rules for text classification. arXiv preprint arXiv:2105.11259.

John Hewitt and Christopher D. Manning. 2019. A structural probe for finding syntax in word representations. In Proceedings of NAACL, pages 4129–4138.

Shengding Hu, Ning Ding, Huadong Wang, Zhiyuan Liu, Juanzi Li, and Maosong Sun. 2021. Knowledgeable prompt-tuning: Incorporating knowledge into prompt verbalizer for text classification. arXiv preprint arXiv:2108.02035.

Ganesh Jawahar, Benoît Sagot, and Djamé Seddah. 2019. What does BERT learn about the structure of language? In Proceedings of ACL, pages 3651–3657.

Brian Lester, Rami Al-Rfou, and Noah Constant. 2021. The power of scale for parameter-efficient prompt tuning. arXiv preprint arXiv:2104.08691.

Xiang Lisa Li and Percy Liang. 2021. Prefix-tuning: Optimizing continuous prompts for generation. arXiv preprint arXiv:2101.00190.

Thomas Lin, Oren Etzioni, et al. 2012. No noun phrase left behind: Detecting and typing unlinkable entities. In Proceedings of EMNLP-CoNLL, pages 893–903.

Xiao Ling and Daniel S. Weld. 2012. Fine-grained entity recognition. In AAAI.

Pengfei Liu, Weizhe Yuan, Jinlan Fu, Zhengbao Jiang, Hiroaki Hayashi, and Graham Neubig. 2021a. Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing. arXiv preprint arXiv:2107.13586.

Xiao Liu, Yanan Zheng, Zhengxiao Du, Ming Ding, Yujie Qian, Zhilin Yang, and Jie Tang. 2021b. GPT understands, too. arXiv preprint arXiv:2103.10385.

Yang Liu and Mirella Lapata. 2019. Text summarization with pretrained encoders. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3730–3740, Hong Kong, China. Association for Computational Linguistics.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.

Ilya Loshchilov and Frank Hutter. 2019. Decoupled weight decay regularization. In Proceedings of ICLR.

Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Köpf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. 2019. PyTorch: An imperative style, high-performance deep learning library. In Proceedings of NIPS, pages 8024–8035.

Hao Peng, Tianyu Gao, Xu Han, Yankai Lin, Peng Li, Zhiyuan Liu, Maosong Sun, and Jie Zhou. 2020. Learning from context or names? An empirical study on neural relation extraction. In Proceedings of EMNLP, pages 3661–3672.

Fabio Petroni, Tim Rocktäschel, Sebastian Riedel, Patrick Lewis, Anton Bakhtin, Yuxiang Wu, and Alexander Miller. 2019. Language models as knowledge bases? In Proceedings of EMNLP, pages 2463–2473.

Xipeng Qiu, Tianxiang Sun, Yige Xu, Yunfan Shao, Ning Dai, and Xuanjing Huang. 2020. Pre-trained models for natural language processing: A survey. Science China Technological Sciences, pages 1–26.

Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding by generative pre-training. OpenAI.

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. JMLR, 21:1–67.

Xiang Ren, Wenqi He, Meng Qu, Lifu Huang, Heng Ji, and Jiawei Han. 2016a. AFET: Automatic fine-grained entity typing by hierarchical partial-label embedding. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1369–1378, Austin, Texas. Association for Computational Linguistics.

Xiang Ren, Wenqi He, Meng Qu, Clare R. Voss, Heng Ji, and Jiawei Han. 2016b. Label noise reduction in entity typing by heterogeneous partial-label embedding. In Proceedings of SIGKDD, pages 1825–1834.

Timo Schick, Helmut Schmid, and Hinrich Schütze. 2020. Automatically identifying words that can serve as labels for few-shot text classification. In Proceedings of COLING, pages 5569–5578.

Timo Schick and Hinrich Schütze. 2021. Exploiting cloze-questions for few-shot text classification and natural language inference. In Proceedings of EACL, pages 255–269.

Sonse Shimaoka, Pontus Stenetorp, Kentaro Inui, and Sebastian Riedel. 2017. Neural architectures for fine-grained entity type classification. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, pages 1271–1280, Valencia, Spain. Association for Computational Linguistics.

Taylor Shin, Yasaman Razeghi, Robert L. Logan IV, Eric Wallace, and Sameer Singh. 2020. AutoPrompt: Eliciting knowledge from language models with automatically generated prompts. In Proceedings of EMNLP, pages 4222–4235.

Trieu H. Trinh and Quoc V. Le. 2018. A simple method for commonsense reasoning. arXiv preprint arXiv:1806.02847.

Dong Wang, Ning Ding, Piji Li, and Haitao Zheng. 2021. CLINE: Contrastive learning with semantic negative examples for natural language understanding. In Proceedings of ACL.

Ralph Weischedel and Ada Brunstein. 2005. BBN Pronoun Coreference and Entity Type Corpus. Linguistic Data Consortium, Philadelphia.

Ralph Weischedel, Martha Palmer, Mitchell Marcus, Eduard Hovy, Sameer Pradhan, Lance Ramshaw, Nianwen Xue, Ann Taylor, Jeff Kaufman, Michelle Franchini, Mohammed El-Bachouti, Robert Belvin, and Ann Houston. 2013. OntoNotes Release 5.0. Abacus Data Network.

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Remi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander Rush. 2020. Transformers: State-of-the-art natural language processing. In Proceedings of EMNLP, pages 38–45.

Jingqing Zhang, Yao Zhao, Mohammad Saleh, and Peter J. Liu. 2019. PEGASUS: Pre-training with extracted gap-sentences for abstractive summarization. In Proceedings of ICML, pages 11328–11339.

Yizhe Zhang, Siqi Sun, Michel Galley, Yen-Chun Chen, Chris Brockett, Xiang Gao, Jianfeng Gao, Jingjing Liu, and Bill Dolan. 2020. DialoGPT: Large-scale generative pre-training for conversational response generation. In Proceedings of ACL, pages 270–278.

Tony Z. Zhao, Eric Wallace, Shi Feng, Dan Klein, and Sameer Singh. 2021. Calibrate before use: Improving few-shot performance of language models. arXiv preprint arXiv:2102.09690.

We choose hard-encoding templates with natural language and soft-encoding templates with additional special tokens in our work.

For the choice of hard-encoding templates, we do not use automatic searching methods for discrete prompts, since the fine-grained entity typing task is clearly defined and the prompts can easily be made purposeful. We select simple declarative templates rather than hypernym templates to avoid grammatical errors. In the hard-encoding setting, we first copy the marked entity mention in x, then add a few linking verbs and articles, followed by the [MASK] token. With the marked entity mention [Ent], we use the following templates:

T1(x) = x [Ent] is [MASK]

T2(x) = x [Ent] is a [MASK]

T3(x) = x In this sentence, [Ent] is a [MASK]

where [Ent] is the entity mention in x. In § 5 we report the results of T3(·).

We also adopt the soft-encoding strategy, which introduces additional special tokens [P1], ..., [Pl] as the template, where l is a pre-defined hyper-parameter. The template begins with a delimiter [P] and a copy of the entity mention. The complete template becomes

T4(x) = x [P] [Ent] [P1] … [Pl] [MASK]

where each prompt embedding is randomly initialized and optimized during training. Intuitively, these special tokens can represent a cluster of words with similar semantics in the vocabulary.
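For concreteness, the sketch below shows how the hard-encoding template T3(·) and the soft-encoding template T4(·) can be instantiated for a sentence with a marked entity mention. The helper names and the exact string layout are illustrative assumptions; in practice [P], [P1], ..., [Pl] would be registered as new tokens whose embeddings are trained, as described above.

```python
# A minimal sketch (helper names are illustrative) of the hard template T3 and
# the soft template T4 for a sentence x with a marked entity mention.

def hard_template_t3(x: str, mention: str) -> str:
    """T3(x): append 'In this sentence, [Ent] is a [MASK].' after the input."""
    return f"{x} In this sentence, {mention} is a [MASK]."

def soft_template_t4(x: str, mention: str, l: int = 2) -> str:
    """T4(x): delimiter [P], a copy of the mention, l soft tokens, then [MASK]."""
    soft_tokens = " ".join(f"[P{i}]" for i in range(1, l + 1))
    return f"{x} [P] {mention} {soft_tokens} [MASK]."

x = "London is one of the biggest cities in the world."
print(hard_template_t3(x, "London"))
# -> London is one of the biggest cities in the world. In this sentence, London is a [MASK].
print(soft_template_t4(x, "London", l=2))
# -> London is one of the biggest cities in the world. [P] London [P1] [P2] [MASK].
```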

3.3 Training and Inference

The strategies of hard or soft encoding provide different initializations of templates, and both can be parameterized by φ and optimized along with M during training. We train the pre-trained model M (parameterized by θ) along with the additional prompt embeddings using the cross-entropy loss function:

$\mathcal{L} = -\sum \log P(y \mid x;\, \theta, \phi)$    (4)

For inference, we directly use Eq. 3 to predict the label of the current input instance based on the predicted words at the [MASK] position.
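A minimal PyTorch-style sketch of this training and inference procedure is given below. It assumes a hypothetical one-word-per-type verbalizer and BERT-base as the backbone: the MLM logits at the [MASK] position are restricted to the label-word ids and cross-entropy is applied over types, mirroring Eq. 4. This is a sketch under those assumptions, not the released implementation.

```python
# A sketch (not the released implementation) of prompt-based training and inference
# for entity typing: MLM logits at [MASK] are restricted to label words, and
# cross-entropy over types is applied, mirroring Eq. 4.
import torch
import torch.nn.functional as F
from transformers import BertForMaskedLM, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
model = BertForMaskedLM.from_pretrained("bert-base-cased")

# Hypothetical one-word-per-type verbalizer (an assumption for brevity).
types = ["person", "location", "organization"]
label_words = ["person", "location", "company"]
label_word_ids = torch.tensor(tokenizer.convert_tokens_to_ids(label_words))

def type_logits(prompted_text: str) -> torch.Tensor:
    """Logits over entity types, read off the [MASK] position."""
    inputs = tokenizer(prompted_text, return_tensors="pt")
    mask_pos = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0][0]
    vocab_logits = model(**inputs).logits[0, mask_pos]   # scores over the full vocabulary
    return vocab_logits[label_word_ids]                  # keep only the label words

text = "Steve Jobs founded Apple. In this sentence, Steve Jobs is a [MASK]."
gold = torch.tensor([types.index("person")])

# Training step: cross-entropy between the [MASK] prediction and the gold type.
loss = F.cross_entropy(type_logits(text).unsqueeze(0), gold)
loss.backward()

# Inference: the predicted type is the one whose label word is most probable at [MASK].
with torch.no_grad():
    print(types[type_logits(text).argmax().item()])
```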

This pipeline can be applied to the entity typing task with explicit supervision, and it is effective even if the training data are insufficient, i.e., the few-shot scenario (§ 5.5). Naturally, we consider a more extreme situation, that is, a scenario without any training data (the zero-shot scenario). In this setting, if we directly use an additional classifier to predict the label, the result is equivalent to random guessing, because the parameters of the classifier are randomly initialized. If we use prompts to infer the label based on the predicted words, although the performance is significantly better than guessing, there is still a catastrophic decline (§ 5.6). At this point, a question emerges: "Is it possible for PLMs to predict entity types without any explicit supervision?"

4 Self-supervised Prompt-learning for Zero-shot Entity Typing

With prompt-learning, the answer is yes, because in the pre-training stage the contexts of entities have already implied the corresponding type information, which provides an advantageous initialization point for the prompt-learning paradigm. For example, consider the input sentence wrapped with the T3(·) template: "Steve Jobs founded Apple. In this sentence, Steve Jobs is a [MASK]." In our observations, the probability of PLMs predicting person at the masked position is significantly higher than the probability of location. If we make reasonable use of this superior initialization point, it is possible for PLMs to automatically summarize the type information and finally extract the correct entity type.
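This observation is easy to probe with an off-the-shelf masked language model; the snippet below is a small illustrative check (the model choice and candidate words are assumptions), scoring two candidate type words at the masked position of a T3-style prompt.

```python
# Illustrative probe (assumed model choice) of the observation above: compare the
# scores of candidate type words at the masked position of a T3-style prompt.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-cased")
prompt = "Steve Jobs founded Apple. In this sentence, Steve Jobs is a [MASK]."

# `targets` restricts scoring to the given candidate words.
for candidate in fill_mask(prompt, targets=["person", "location"]):
    print(candidate["token_str"], round(candidate["score"], 4))
```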

4.1 Overview

In order to create conditions for PLMs to summarize entity types, we consider a self-supervised paradigm that optimizes the similarity of the probability distributions predicted by similar examples over a projected vocabulary V*. To achieve that in prompt-learning, we need to (1) impose a limit on the prediction range of the model, so that only the words we need, that is, words that express entity types, participate in the optimization; and (2) provide an unlabeled dataset in which entity mentions are marked without any types, to allow the model to learn to induce type information in a self-supervised manner. The inputs contain a pre-trained model M, a pre-defined label schema Y, and a dataset without labels D = {x1, ..., xn} (entity mentions are marked without any types); our goal is to make M capable of automatically carrying out zero-shot entity

[Figure 3 (diagram omitted): Illustration of self-supervised prompt-learning for fine-grained entity typing with unlabeled data and a pre-defined label set. The entity mention is copied into the prompt and randomly hidden with the [Hide] symbol with probability α; the MLM predictions of a sentence pair are mapped onto the label words V* projected from the label set and compared via JS divergence. Only the positive pair is shown.]

typing after being trained on D and Y. Using prompt-learning as the training strategy, we first construct a label word set V* from Y, and for each sentence x in D we wrap it with a hard-encoding template containing a [MASK] symbol. The key idea is to make the prediction distributions over V* of entities of the same type as similar as possible. In this way, we can perform contrastive learning by sampling positive and negative examples, while ignoring the impact on optimization of words that are not in V* during the MLM process.
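As a hypothetical illustration of the label-word projection (an assumption for illustration, not necessarily the paper's exact construction rule), one simple choice is to split each fine-grained type name in Y into its component words and pool them into V*:

```python
# A hypothetical label-word projection: derive V* by splitting each fine-grained
# type name in the label schema Y into words (illustrative, not the exact rule).
def build_label_words(label_schema: list[str]) -> dict[str, list[str]]:
    verbalizer = {}
    for type_name in label_schema:                             # e.g. "location-mountain"
        verbalizer[type_name] = type_name.lower().split("-")   # e.g. ["location", "mountain"]
    return verbalizer

schema = ["person-artist", "location-mountain", "organization-sportsleague"]
verbalizer = build_label_words(schema)
v_star = sorted({w for words in verbalizer.values() for w in words})  # projected vocabulary V*
print(v_star)
# -> ['artist', 'location', 'mountain', 'organization', 'person', 'sportsleague']
```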

4.2 Self-supervised Learning

Although there are no labels in D, we can still develop a sampling strategy based on a simple hypothesis: the same entity in different sentences has similar types. For instance, we sample two sentences containing "Steve Jobs" as a positive pair. Moreover, considering that entity typing is context-aware ("Steve Jobs" could be entrepreneur, designer, or philanthropist in different contexts), we choose to optimize the similarity between the distributions of words over V*. This strategy not only softens the supervision but also eliminates the impact of other words in self-supervised learning.

In particular, we randomly sample c positive pairs, i.e., sentence pairs that share one same entity mention, denoted as D̂pos, and c negative pairs, i.e., two sentences with different entity mentions marked, denoted as D̂neg, from a large-scale entity-linked corpus D. To avoid generating false negative samples, the negative samples are further restricted by a large dictionary that contains common entities and their type information; only sentence pairs with entities of different types in the dictionary are selected as negative samples. Then we wrap them with the hard-encoding template T3(·). To avoid overfitting to the entity names, we randomly hide the entity mention (in the original input and the template) with a special symbol [Hide] with probability α. Empirically, α is set to 0.4.
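The sketch below illustrates this pair-construction procedure under assumed data structures (an entity-linked corpus given as (sentence, mention) tuples and a mention-to-type dictionary); the helper names are hypothetical.

```python
# A sketch of sampling positive/negative pairs from an entity-linked corpus and
# randomly hiding the mention with probability alpha (data structures are assumed).
import random

ALPHA = 0.4  # probability of hiding the mention with the [Hide] symbol

def wrap_t3(sentence: str, mention: str) -> str:
    return f"{sentence} In this sentence, {mention} is a [MASK]."

def maybe_hide(sentence: str, mention: str) -> str:
    """Randomly replace the mention with [Hide] in both the input and the template copy."""
    if random.random() < ALPHA:
        return wrap_t3(sentence.replace(mention, "[Hide]"), "[Hide]")
    return wrap_t3(sentence, mention)

def sample_pairs(corpus, type_dict, num_pairs):
    by_mention = {}
    for sentence, mention in corpus:
        by_mention.setdefault(mention, []).append(sentence)
    mentions = list(by_mention)

    positives, negatives = [], []
    while len(positives) < num_pairs:                      # same mention, two sentences
        m = random.choice([m for m in mentions if len(by_mention[m]) >= 2])
        s1, s2 = random.sample(by_mention[m], 2)
        positives.append((maybe_hide(s1, m), maybe_hide(s2, m)))
    while len(negatives) < num_pairs:                      # different mentions, different types
        m1, m2 = random.sample(mentions, 2)
        if type_dict.get(m1) and type_dict.get(m2) and type_dict[m1] != type_dict[m2]:
            negatives.append((maybe_hide(random.choice(by_mention[m1]), m1),
                              maybe_hide(random.choice(by_mention[m2]), m2)))
    return positives, negatives

corpus = [("London is one of the biggest cities in the world.", "London"),
          ("London is located in the south-east of England.", "London"),
          ("Steve Jobs founded Apple.", "Steve Jobs")]
type_dict = {"London": "location", "Steve Jobs": "person"}
pos, neg = sample_pairs(corpus, type_dict, num_pairs=1)
```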

Since the impact of a pair of examples on training should be measured at the distribution level, we choose the Jensen-Shannon divergence as the metric to assess the similarity of two distributions. Thus, for a sentence pair (x, x′), the similarity score of the two [MASK]-position representations h and h′ is computed as

$s(\mathbf{h}, \mathbf{h}') = \mathrm{JS}\big(P_{\mathcal{V}^*}(w \mid x)\,\|\,P_{\mathcal{V}^*}(w \mid x')\big)$    (5)

where JS is the Jensen-Shannon divergence, and P_V*(w|x) and P_V*(w|x′) are the probability distributions of the predicted token w over V*, obtained from h and h′.

As we attempt to make the predictions of the positive pairs similar, the objective is computed as

$\mathcal{L} = -\frac{1}{|\hat{\mathcal{D}}_{\mathrm{pos}}|^2} \sum_{x \in \hat{\mathcal{D}}_{\mathrm{pos}}} \sum_{x' \in \hat{\mathcal{D}}_{\mathrm{pos}}} \log\big(1 - s(\mathbf{h}, \mathbf{h}')\big) \;-\; \frac{1}{|\hat{\mathcal{D}}_{\mathrm{neg}}|^2} \sum_{x \in \hat{\mathcal{D}}_{\mathrm{neg}}} \sum_{x' \in \hat{\mathcal{D}}_{\mathrm{neg}}} \gamma \log\big(s(\mathbf{h}, \mathbf{h}')\big)$    (6)

where γ is a penalty term, because the assumption is looser for negative pairs. Overall, we use the entity-linked English Wikipedia corpus as the raw data and generate about 1 million pairs each for D̂pos and D̂neg.
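A compact PyTorch sketch of Eq. 5 and Eq. 6 is shown below: it renormalizes the [MASK]-position logits over the projected vocabulary V*, measures pair similarity with the Jensen-Shannon divergence, and averages the pairwise terms over the sampled pairs (a simplification of the double sums in Eq. 6). The batch layout, the γ value, and the label-word ids are assumptions.

```python
# A compact sketch of the distribution-level objective (Eq. 5 and Eq. 6); the batch
# layout, gamma, and the label-word ids are assumptions, and the pairwise terms are
# averaged over the sampled pairs rather than the full double sums.
import torch
import torch.nn.functional as F

def label_word_dist(mask_logits: torch.Tensor, label_word_ids: torch.Tensor) -> torch.Tensor:
    """P_V*(w|x): renormalize the [MASK]-position logits over the projected vocabulary V*."""
    return F.softmax(mask_logits[label_word_ids], dim=-1)

def js_divergence(p: torch.Tensor, q: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Jensen-Shannon divergence between two categorical distributions (Eq. 5)."""
    m = 0.5 * (p + q)
    kl = lambda a, b: (a * torch.log((a + eps) / (b + eps))).sum(-1)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def pair_loss(pos_pairs, neg_pairs, gamma: float = 0.5, eps: float = 1e-8) -> torch.Tensor:
    """pos_pairs / neg_pairs: lists of (p, q) distributions over V* (cf. Eq. 6)."""
    loss = torch.zeros(())
    for p, q in pos_pairs:                 # pull positives together: drive s(h, h') toward 0
        loss = loss - torch.log(1.0 - js_divergence(p, q) + eps)
    for p, q in neg_pairs:                 # push negatives apart: drive s(h, h') toward 1
        loss = loss - gamma * torch.log(js_divergence(p, q) + eps)
    return loss / max(len(pos_pairs) + len(neg_pairs), 1)

# Toy usage with random [MASK] logits and a four-word projected vocabulary.
vocab_size = 28996                                          # BERT-base-cased vocabulary size
label_word_ids = torch.tensor([1000, 2000, 3000, 4000])     # hypothetical label-word ids
h, h_prime = torch.randn(vocab_size), torch.randn(vocab_size)
p, q = label_word_dist(h, label_word_ids), label_word_dist(h_prime, label_word_ids)
print(pair_loss(pos_pairs=[(p, q)], neg_pairs=[(p, q.flip(0))]))
```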

Dataset     #Types   Supervised (|Dtrain| / |Ddev| / |Dtest|)   Few-shot (|Dtrain| / |Ddev| / |Dtest|)   Zero-shot (|Dtrain| / |Ddev| / |Dtest|)
Few-NERD    66       340,382 / 48,758 / 96,901                  66∼1,056 / = |Dtrain| / 96,901           0 / 0 / 96,901
OntoNotes   86       253,239 / 2,200 / 8,962                    86∼1,376 / = |Dtrain| / 8,962            0 / 0 / 8,962
BBN         46       86,077 / 12,824 / 12,824                   46∼736 / = |Dtrain| / 12,824             0 / 0 / 12,824

Table 1: Statistics of FEW-NERD, OntoNotes and BBN under the three experimental settings. For all three settings the test sets are identical. For the training set of the few-shot setting, we report the summation from 1-shot to 16-shot.

Shot  Metric  Few-NERD (FT / PLET)       OntoNotes (FT / PLET)      BBN (FT / PLET)
1     Acc     8.94 / 43.87 (+34.93)      3.70 / 38.97 (+35.27)      0.80 / 40.70 (+39.90)
      MiF     19.85 / 60.60 (+40.75)     18.98 / 59.91 (+40.93)     5.79 / 49.25 (+43.46)
      MaF     19.85 / 60.60 (+40.75)     19.43 / 61.42 (+41.99)     4.42 / 48.48 (+43.06)
2     Acc     20.83 / 47.78 (+26.95)     7.27 / 39.19 (+31.92)      6.68 / 41.33 (+34.65)
      MiF     32.67 / 62.09 (+29.42)     24.89 / 61.09 (+36.20)     13.70 / 54.00 (+40.30)
      MaF     32.67 / 62.09 (+29.42)     25.64 / 62.68 (+37.04)     13.23 / 51.97 (+38.74)
4     Acc     33.09 / 57.00 (+23.91)     11.15 / 38.39 (+27.24)     19.34 / 52.21 (+32.87)
      MiF     44.14 / 68.61 (+24.47)     27.69 / 59.81 (+32.12)     27.03 / 61.13 (+34.10)
      MaF     44.14 / 68.61 (+24.47)     28.26 / 60.89 (+32.63)     24.69 / 58.91 (+34.22)
8     Acc     46.44 / 55.75 (+9.31)      18.37 / 39.37 (+21.00)     27.01 / 44.30 (+17.29)
      MiF     57.76 / 68.74 (+10.98)     38.16 / 57.97 (+19.81)     40.19 / 56.21 (+16.02)
      MaF     57.76 / 68.74 (+10.98)     37.77 / 58.32 (+20.55)     39.50 / 55.15 (+15.65)
16    Acc     60.98 / 61.58 (+0.60)      32.26 / 42.29 (+10.03)     39.67 / 55.00 (+15.33)
      MiF     71.59 / 72.39 (+0.80)      51.40 / 60.79 (+9.39)      49.01 / 62.84 (+13.83)
      MaF     71.59 / 72.39 (+0.80)      51.45 / 61.80 (+10.35)     47.09 / 62.38 (+15.29)

Table 2: Results of few-shot entity typing on FEW-NERD, OntoNotes and BBN. All methods use BERT-base with the same initialization weights as the backbone encoder. The training set and the dev set have the same size.

5 Experiments

In this section, we conduct experiments to evaluate the effectiveness of our methods. We use FT to denote the BERT-based fine-tuning approach, PLET to denote the naive prompt-learning approach for entity typing in § 3, and PLET (S) to denote the self-supervised prompt-learning approach in § 4. Our experiments are carried out in the fully supervised (§ 5.4), few-shot (§ 5.5) and zero-shot (§ 5.6) settings on three fine-grained entity typing datasets.

5.1 Datasets

We use three fine-grained entity typing datasets: FEW-NERD, OntoNotes and BBN.

FEW-NERD. We use FEW-NERD (Ding et al., 2021b) as the main dataset, which has the following advantages: (1) FEW-NERD is large-scale and fine-grained, containing 8 coarse-grained and 66 fine-grained entity types; (2) FEW-NERD is manually annotated, so we can precisely assess the capability of entity typing models. Specifically, we use the supervised setting of the dataset, FEW-NERD (SUP), and its official split to conduct our experiments.

OntoNotes. We also use the OntoNotes 5.0 dataset (Weischedel et al., 2013) in our experiments. Following previous works on fine-grained entity typing, we adopt the 86-class version of OntoNotes, where each class has at most 3 levels in the type hierarchy. The data split is identical to that of Shimaoka et al. (2017).

BBN. The BBN dataset is selected from the Penn Treebank corpus of Wall Street Journal texts and labeled by Weischedel and Brunstein (2005). We follow the version processed by Ren et al. (2016a) and the data split of Ren et al. (2016b). The dataset contains 46 types, and each type has a maximum type hierarchy level of 2.

5.2 Experimental Settings

The experiments are performed under three different settings to evaluate the effect of the prompt-learning method and self-supervised training. In Table 1, we show the statistics of all the settings on the three datasets.

Supervised Setting. In the fully supervised setting, all training data are used in the training phase. FT and PLET are used to train the model. We run the experiments on all three datasets with the BERT-base-cased backbone. Both hard and soft encodings are used for PLET.

Few-shot Setting. In the few-shot setting, we randomly sample 1, 2, 4, 8, and 16 instances for each entity type for training. We apply both the FT and PLET methods with hard encoding on all three datasets.

Zero-shot Setting. In the zero-shot setting, no labeled training data are available, and the model is required to infer the entity type without any supervised training. Since fine-tuning is not applicable in this setting, we only conduct experiments with PLET and PLET (S).

Metrics. For evaluation metrics, we follow the widely used setting of Ling and Weld (2012), which includes strict accuracy (Acc), loose macro F1-score (MaF) and loose micro F1-score (MiF) to evaluate the performance of models. The loose F1-score calculation considers type labels at different granularities.
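For reference, the sketch below computes these three metrics from gold and predicted type sets, following the standard definitions of Ling and Weld (2012); the helper names are illustrative.

```python
# A compact sketch of the evaluation metrics of Ling and Weld (2012) for
# multi-label type sets: strict accuracy, loose macro F1 and loose micro F1.
def f1(p: float, r: float) -> float:
    return 2 * p * r / (p + r) if p + r > 0 else 0.0

def entity_typing_metrics(gold: list[set], pred: list[set]) -> dict:
    n = len(gold)
    strict = sum(g == p for g, p in zip(gold, pred)) / n

    # loose macro: average precision / recall per mention, then F1
    macro_p = sum(len(g & p) / len(p) if p else 0.0 for g, p in zip(gold, pred)) / n
    macro_r = sum(len(g & p) / len(g) if g else 0.0 for g, p in zip(gold, pred)) / n

    # loose micro: pool the overlap counts over all mentions, then F1
    overlap = sum(len(g & p) for g, p in zip(gold, pred))
    micro_p = overlap / max(sum(len(p) for p in pred), 1)
    micro_r = overlap / max(sum(len(g) for g in gold), 1)

    return {"Acc": strict, "MaF": f1(macro_p, macro_r), "MiF": f1(micro_p, micro_r)}

# toy example with hierarchical labels at two granularities
gold = [{"person", "person-artist"}, {"location", "location-mountain"}]
pred = [{"person", "person-artist"}, {"location", "location-island"}]
print(entity_typing_metrics(gold, pred))  # Acc 0.5; partial credit for the coarse label
```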

5.3 Experimental Details

We use BERT-base (Devlin et al., 2019) as the backbone structure of our model, initialized with the corresponding pre-trained cased weights.2 The hidden size is 768 and the number of layers is 12. Models are implemented with the PyTorch framework3 (Paszke et al., 2019) and Huggingface Transformers4 (Wolf et al., 2020). BERT models are optimized by AdamW (Loshchilov and Hutter, 2019) with a learning rate of 5e-5. The training batch size is 16 for all models. In the supervised setting, each model is trained for 10 epochs and evaluated on the dev set every 2,000 steps. In the few-shot setting, each model is trained for 30 epochs and evaluated every 10∼50 steps; each evaluation is run for 200 steps. For the methods

2 https://github.com/google-research/bert
3 https://pytorch.org
4 https://github.com/huggingface/transformers

with hard encoding, we report the experimental results of T3(·). For the soft-encoding method, we report the results with l = 2. Experiments are conducted with CUDA on NVIDIA Tesla V100 GPUs.


The results on all three datasets across differentmodels are reported in Table 3 Overall the prompt-based methods have shown certain improvementscomparing to directly fine-tuned models It showsthat the prompt-based method does help with cap-turing entity-type information from a given context

It is also observed that the magnitude of the im-provement and the preference of prompt encod-ing strategy may vary with different datasets Theprompt-based method seems less effective on FEW-NERD dataset than the other two It indicates thatthe effect of the prompt-based method partially de-pends on the characteristics of the dataset and thatdifferent prompt designs may suit different dataSpecifically FEW-NERD is manually annotatedand contains much less noise than the other twodatasets benefiting the FT method to learn classi-fication with an extra linear layer Moreover forthe OntoNotes dataset soft encoding significantlyoutperforms hard encoding while for the other twodatasets the effect seems reversed

55 Results of Few-shot Entity Typing

Table 2 shows the results on few-shot entity typ-ing It is shown that prompt-based model outper-forms fine-tuning by a large margin under few-shotsetting especially when only 1 sim 2 training in-stances per type are available It should be noted

Dataset Metric Method

PLET PLET (S)

Few-NERDAcc 1755 2399 (+644)MiF 2839 4798 (+1959)MaF 2839 4798 (+1959)

OntoNotesDaggerAcc 2510 2827 (+317)MiF 3361 4979 (+1618)MaF 3791 4995 (+1204)

BBNAcc 5582 5779 (+197)MiF 6064 6324 (+260)MaF 5999 6400 (+401)

Table 4 Results of zero-shot entity typing on FEW-NERD OntoNotes and BBN Dagger means that we re-move the ldquoOtherrdquo class during testing PLET denotesthe prompt-learning pipeline and PLET (S) denotesself-supervised prompt-learning both methods use theBERTbase as the backbone encoder

that for OntoNotes and BBN datasets sampling 16instances for each entity type already amounts toover 05 of the total training data Meanwhilesome of the data in BBN are distantly-supervisedand are potentially erroneous It brings more ran-domness to few-shot training The results supportthe idea that a well-designed prompt has muchpotential in mining the learned knowledge in pre-trained models and thus yields better performancein few-shot settings The results also indicate thateven when the number of entity types is large (46sim86) the superiority of prompt-learning still holds

56 Results of Zero-shot Entity Typing

Table 4 shows the results on zero-shot entity typ-ing task on FEW-NERD dataset We did not re-port the performance of the vanilla fine-tuning ap-proach because it cannot produce reasonable resultswith a randomly initialized classifier And it alsoshould be noted that the prompt method withoutfine-tuning already outperforms random guessingIt indicates that adding a prompt is informative fora model pre-trained on masked-language-modeltask (eg BERT) and can induce reasonable pre-dictions in entity typing tasks Second the perfor-mance of the model improves by a large margin iftrained on unlabeled data It shows the effective-ness of the proposed self-supervised training ap-proach and points to the potential of a pre-trainedprompt-based model under the zero-shot settingwhen no labeled data are available

To explore the more subtle changes in perfor-

mance we carry out case study for the zero-shotentity typing In Figure 4 we illustrate the zero-shot prediction distribution (the correct predictionand other top-5 predictions) for four entity typesin FEW-NERD which are ORG-SPORTSTEAMEVENT-ATTACK MISC-CURRENCY and LOC-MOUNTAIN We could observe that with self-supervised prompt-learning PLET (S) could sum-marize entity type information and infer the relatedwords to a certain extent In Figure 4 (a) and Fig-ure 4 (b) the PLET model suffers from a severe biasand almost predict no correct labels in the zero-shotsetting since such words are low-frequency Andalthough there is no explicit supervision in the pre-training stage of UNPLET the model could stillfind the corresponding words that express the ORG-SPORTSLEAGUE and the EVENT-ATTACK typesIn Figure 4 (c) self-supervised learning increasesthe performance of the original encoder Furtherin Figure 4 (d) PLET has been able to make sat-isfying predictions for this type LOC-MOUNTAINIn this case the use of self-supervised learning hashardly weakened the performance which meansthat the process of automatically summarizing typeinformation has a little negative impact on high-confidence entity types

57 Effect of Templates

As stated in previous studies (Gao et al 2020Zhao et al 2021) the choice of templates mayhave a huge impact on the performance in prompt-learning In this section we carry out experi-ments to investigate such influence Experimentsare conducted under the 8-shot setting on FEW-NERD dataset and we use 3 different hard en-coding templates and 4 soft encoding templates(by changing the number of prompt tokens m)The results demonstrate that the choice of tem-plates exerts a considerable influence on the per-formance of prompt-based few-shot learning Forthe hard-encoding templates the phrase that de-scribes the location ldquoin this sentencerdquo contributes aremarkable improvement in performance For thesoft-encoding templates surprisingly the prompt-learning model yields the best result with the fewestspecial tokens

6 Related Work

After a series of effective PLMs like GPT (Rad-ford et al 2018) BERT (Devlin et al 2019)RoBERTa (Liu et al 2019) and T5 (Raffel et al

0

600

1200

1800

2400

3000

ᤒ໒ 1

Org-SportsLeague 50

Product-Game 2783

Org-SportsTeam 43

MISC-Livingthing 43

Org-Company 27

Org-Political 16

OR

G-SPORTSL

EAG

UE

PROD

UCT-G

AM

EO

RG-SPORTST

EAM

MISC-L

IVIN

G

ORG-C

OM

PAN

YO

RG-POLITICA

L

926

14 10 0717

ᤒ໒ 1-1

Org-SportsLeague 1929

Per-Athlete 368

Event-Other 132

Location-Mountain 117

Org-SportsTeam 116

Build-Airport 54

OR

G-SPORTSL

EAG

UE

123

Zero-shot Prediction distribution for MSIC-Astronomy class by UNPLET

44

642

PER-ATH

LETEE

VEN

T-OTH

ERL

OCATIO

N-MO

UN

TAIN

ORG-SPO

RTSTEA

MB

UILD-A

IRPORT

PLET PLET (S)

39 39 1814

1

(a) Zero-shot prediction distribution on ORG-SPORTSLEAGUE

0

600

1200

1800

2400

3000

ᤒ໒ 1

Event-Attack 47

PROD-Game 2662Event-Disaster 540

Loc-Other 242

event-protest 203

MISC-Language 169

EV

ENT-A

TTAC

KPRO

D-G

AM

EE

VEN

T-DISA

STERL

OC-O

THER

EV

ENT-PRO

TESTM

ISC-LA

NG

UA

GE

630

12857 48 40

ᤒ໒ 1-1

Event-Attack 1894

Org-Political 420

Per-Politician 392

Loc-Island 342

Loc-Other 342

Other-Language 154

EV

ENT-A

TTAC

K

99

Zero-shot Prediction distribution for MSIC-Astronomy class by UNPLET

93

449

ORG-PO

LITICAL

PER-POLITICIA

NL

OC-ISLA

ND

LO

C-OTH

ER

MISC-L

AN

GU

AG

EPLET PLET (S)

81 81 3711

2

(b) Zero-shot prediction distribution on EVENT-ATTACK

0

500

1000

1500

2000

2500

ᤒ໒ 1

Misc-Currency 1196Person-Other 782

PROD-Car 666

Misc-Language 282

product-game 179Org-Company 116

MISC-C

URRENCY

PERSON-O

THER

PROD-C

AR

MISC-L

AN

GU

AG

EPRO

DU

CT-GA

ME

ORG-C

OM

PAN

Y

203

86 55 35

ᤒ໒ 1-1

Misc-Currency 2197

Loc-Island 194Org-Political 157Org-Company 138Loc-Mountain 108Misc-Language 75

MISC-C

URRENCY

59 48

670

LO

C-ISLAN

DO

RG-POLITICA

LO

RG-CO

MPA

NY

LO

C-MO

UN

TAIN

MISC-L

AN

GU

AG

E

PLET PLET (S)

42 33 23

365

239

1

(c) Zero-shot prediction distribution on MISC-CURRENCY

0

500

1000

1500

2000

2500

ᤒ໒ 1

Loc-Mountain 2229Misc-language 292Loc-Other 97Person-Other 87Misc-Living 69Prod-Car 22

LOC-M

OUNTAIN

MISC-L

AN

GU

AG

EL

OC-O

THER

PERSON-O

THER

MISC-L

IVIN

G

PROD-C

AR

10130 19 08

ᤒ໒ 1-1

Loc-Mountain 2212Loc-Island 247Person-Artist 52Person-Politician 45Misc-Language 42Prod-Ship 31

LOC-M

OUNTAIN

86 18

767

LO

C-ISLAN

DPERSO

N-ARTIST

PERSON-PO

LITICIAN

MISC-L

AN

GU

AG

EPRO

D-SHIP

PLET PLET (S)

16 15 1134

772

1

(d) Zero-shot prediction distribution on LOC-MOUNTAIN

Figure 4 Zero-shot prediction distribution on four types in FEW-NERD in each subgraph the left part illustratesthe results of PLET and the right part shows the results of PLET (S) denotes the correct predictions denotesthe wrong predictions with correct coarse-grained types and denotes the wrong predictions with wrong coarse-grained types

Encoding Strategy Template T(x) Acc MiF MaF

Hard-encodingx [Ent] is [MASK] 5445 6734 6734x [Ent] is a [MASK] 5393 6644 6644x In this sentence [E] is [MASK] 5575 6874 6874

Soft-encoding

x [P] [Ent] [P1] [Pl] [MASK] l = 2 5925 6958 6958x [P] [Ent] [P1] [Pl] [MASK] l = 3 5366 6606 6606x [P] [Ent] [P1] [Pl] [MASK] l = 4 5296 6601 6601x [P] [Ent] [P1] [Pl] [MASK] l = 5 5544 6839 6839

Table 5 Effect of templates The results are produced under 8-shot setting on FEW-NERD dataset by PLET

2020) fine-tuned PLMs have demonstrated theireffectiveness on various important NLP tasks suchas dialogue generation (Zhang et al 2020) textsummarization (Zhang et al 2019 Liu and Lap-ata 2019) question answering (Adiwardana et al2020) and text classification (Baldini Soares et al2019 Peng et al 2020 Ding et al 2021a)

Despite the success of fine-tuning PLMs thehuge objective form gap between pre-training andfine-tuning still hinders the full use of per-trainedknowledge for downstream tasks (Liu et al 2021bHan et al 2021b Hu et al 2021) To this endprompt-learning has been proposed In prompt-learning by leveraging language prompts as con-texts downstream tasks can be expressed as some

cloze-style objectives similar to those pre-trainingobjectives The seminal work that stimulates the de-velopment of prompt-learning is the birth of GPT-3 (Brown et al 2020) which uses hand-craftedprompts for tuning and achieves very impressiveperformance on various tasks especially under thesetting of few-shot learning

Inspired by GPT-3 a series of hand-craftedprompts have been widely explored in knowledgeprobing (Trinh and Le 2018 Petroni et al 2019Davison et al 2019) relation classification (Hanet al 2021b) entiment classification and naturallanguage inference (Schick and Schuumltze 2021 Liuet al 2021b) To avoid labor-intensive promptdesign automatic prompt search has also been

extensively explored Schick et al (2020) Schickand Schuumltze (2021) Shin et al (2020) Gao et al(2020) Liu et al (2021a) to generate languagephrases for prompts Recently some continuousprompts have also been proposed (Li and Liang2021 Lester et al 2021) which directly usea series of learnable continuous embeddings asprompts rather than discrete language phrases

In this paper we aim to stimulate PLMs withprompt-learning to capture the attribute informa-tion of entities We take fine-grained entity typinga crucial task in knowledge extraction to assignentity types to entity mentions (Lin et al 2012) asthe foothold to develop prompt-learning strategiesIn fact Dai et al (2021) use hypernym extractionpatterns to enhance the context and apply maskedlanguage modeling to tackle the ultra-fine entitytyping problem (Choi et al 2018) with free-formlabels which shares a similar idea with prompt-learning In our work we mainly emphasize usingprompt-learning to extract entity types that havebeen pre-defined in low-data scenarios

7 Conclusion

This work investigates the application of prompt-learning on fine-grained entity typing More specif-ically we proposes a framework PLET that coulddeal with fine-grained entity typing in fully super-vised few-shot and zero-shot scenarios In PLETwe first introduce a simple and effective prompt-learning pipeline that could be used to extract entitytypes with both sufficient and insufficient supervi-sion Furthermore to handle the zero-shot settingwe propose a self-supervised prompt-learning ap-proach that automatically learns and summarizesentity types based on unlabeled corpora and a pre-defined label schema PLET utilizes prompts totake advantage of prior knowledge distributed inPLMs and could learn pre-defined type informa-tion without overfitting by performing distribution-level optimization In our future work along the di-rection of PLET (S) we will explore better prompt-learning approaches to automatically learning en-tity types from unlabeled data

ReferencesDaniel Adiwardana Minh-Thang Luong David R So

Jamie Hall Noah Fiedel Romal Thoppilan Zi YangApoorv Kulshreshtha Gaurav Nemade Yifeng Luet al 2020 Towards a human-like open-domainchatbot arXiv preprint arXiv200109977

Livio Baldini Soares Nicholas FitzGerald JeffreyLing and Tom Kwiatkowski 2019 Matching theblanks Distributional similarity for relation learn-ing In Proceedings of the 57th Annual Meetingof the Association for Computational Linguisticspages 2895ndash2905 Florence Italy Association forComputational Linguistics

Tom B Brown Benjamin Mann Nick Ryder MelanieSubbiah Jared Kaplan Prafulla Dhariwal ArvindNeelakantan Pranav Shyam Girish Sastry AmandaAskell et al 2020 Language models are few-shotlearners In Proceedings of NIPS pages 1877ndash1901

Eunsol Choi Omer Levy Yejin Choi and Luke Zettle-moyer 2018 Ultra-fine entity typing In Proceed-ings of ACL pages 87ndash96

Hongliang Dai Yangqiu Song and Haixun Wang2021 Ultra-fine entity typing with weak supervisionfrom a masked language model In Proceedings ofACL pages 1790ndash1799

Joe Davison Joshua Feldman and Alexander M Rush2019 Commonsense knowledge mining from pre-trained models In Proceedings of EMNLP-IJCNLPpages 1173ndash1178

Jacob Devlin Ming-Wei Chang Kenton Lee andKristina Toutanova 2019 Bert Pre-training of deepbidirectional transformers for language understand-ing In Proceedings of NAACL-HLT pages 4171ndash4186

Ning Ding Xiaobin Wang Yao Fu Guangwei Xu RuiWang Pengjun Xie Ying Shen Fei Huang Hai-TaoZheng and Rui Zhang 2021a Prototypical repre-sentation learning for relation extraction In Pro-ceedings of ICLR

Ning Ding Guangwei Xu Yulin Chen Xiaobin WangXu Han Pengjun Xie Hai-Tao Zheng and ZhiyuanLiu 2021b Few-nerd A few-shot named entityrecognition dataset In Proceedings of ACL pages3198ndash3213

Tianyu Gao Adam Fisch and Danqi Chen 2020Making pre-trained language models better few-shotlearners arXiv preprint arXiv201215723

Xu Han Zhengyan Zhang Ning Ding Yuxian GuXiao Liu Yuqi Huo Jiezhong Qiu Liang ZhangWentao Han Minlie Huang et al 2021a Pre-trained models Past present and future arXivpreprint arXiv210607139

Xu Han Weilin Zhao Ning Ding Zhiyuan Liuand Maosong Sun 2021b Ptr Prompt tuningwith rules for text classification arXiv preprintarXiv210511259

John Hewitt and Christopher D Manning 2019 Astructural probe for finding syntax in word repre-sentations In Proceedings of NAACL pages 4129ndash4138

Shengding Hu Ning Ding Huadong Wang ZhiyuanLiu Juanzi Li and Maosong Sun 2021 Knowl-edgeable prompt-tuning Incorporating knowledgeinto prompt verbalizer for text classification arXivpreprint arXiv210802035

Ganesh Jawahar Benoicirct Sagot and Djameacute Seddah2019 What does bert learn about the structure oflanguage In Proceedings of ACL pages 3651ndash3657

Brian Lester Rami Al-Rfou and Noah Constant 2021The power of scale for parameter-efficient prompttuning arXiv preprint arXiv210408691

Xiang Lisa Li and Percy Liang 2021 Prefix-tuning Optimizing continuous prompts for genera-tion arXiv preprint arXiv210100190

Thomas Lin Oren Etzioni et al 2012 No noun phraseleft behind detecting and typing unlinkable entitiesIn Proceedings of EMNLP-CoNLL pages 893ndash903

Xiao Ling and Daniel S Weld 2012 Fine-grained en-tity recognition In AAAI

Pengfei Liu Weizhe Yuan Jinlan Fu Zhengbao JiangHiroaki Hayashi and Graham Neubig 2021a Pre-train prompt and predict A systematic survey ofprompting methods in natural language processingarXiv preprint arXiv210713586

Xiao Liu Yanan Zheng Zhengxiao Du Ming DingYujie Qian Zhilin Yang and Jie Tang 2021b Gptunderstands too arXiv preprint arXiv210310385

Yang Liu and Mirella Lapata 2019 Text summariza-tion with pretrained encoders In Proceedings ofthe 2019 Conference on Empirical Methods in Nat-ural Language Processing and the 9th InternationalJoint Conference on Natural Language Processing(EMNLP-IJCNLP) pages 3730ndash3740 Hong KongChina Association for Computational Linguistics

Yinhan Liu Myle Ott Naman Goyal Jingfei Du Man-dar Joshi Danqi Chen Omer Levy Mike LewisLuke Zettlemoyer and Veselin Stoyanov 2019Roberta A robustly optimized bert pretraining ap-proach arXiv preprint arXiv190711692

Ilya Loshchilov and Frank Hutter 2019 Decoupledweight decay regularization In Proceedings ofICLR

Adam Paszke Sam Gross Francisco Massa AdamLerer James Bradbury Gregory Chanan TrevorKilleen Zeming Lin Natalia Gimelshein LucaAntiga Alban Desmaison Andreas Koumlpf EdwardYang Zachary DeVito Martin Raison Alykhan Te-jani Sasank Chilamkurthy Benoit Steiner Lu FangJunjie Bai and Soumith Chintala 2019 PytorchAn imperative style high-performance deep learn-ing library In Proceedings of NIPS pages 8024ndash8035

Hao Peng Tianyu Gao Xu Han Yankai Lin PengLi Zhiyuan Liu Maosong Sun and Jie Zhou 2020Learning from Context or Names An EmpiricalStudy on Neural Relation Extraction In Proceed-ings of EMNLP pages 3661ndash3672

Fabio Petroni Tim Rocktaumlschel Sebastian RiedelPatrick Lewis Anton Bakhtin Yuxiang Wu andAlexander Miller 2019 Language models as knowl-edge bases In Proceedings of EMNLP pages2463ndash2473

Xipeng Qiu Tianxiang Sun Yige Xu Yunfan ShaoNing Dai and Xuanjing Huang 2020 Pre-trainedmodels for natural language processing A surveyScience China Technological Sciences pages 1ndash26

Alec Radford Karthik Narasimhan Tim Salimans andIlya Sutskever 2018 Improving language under-standing by generative pre-training OpenAI

Colin Raffel Noam Shazeer Adam Roberts KatherineLee Sharan Narang Michael Matena Yanqi ZhouWei Li and Peter J Liu 2020 Exploring the limitsof transfer learning with a unified text-to-text trans-former JMLR 211ndash67

Xiang Ren Wenqi He Meng Qu Lifu Huang HengJi and Jiawei Han 2016a AFET Automatic fine-grained entity typing by hierarchical partial-labelembedding In Proceedings of the 2016 Conferenceon Empirical Methods in Natural Language Process-ing pages 1369ndash1378 Austin Texas Associationfor Computational Linguistics

Xiang Ren Wenqi He Meng Qu Clare R Voss HengJi and Jiawei Han 2016b Label noise reduction inentity typing by heterogeneous partial-label embed-ding In Proceedings of SIGKDD page 1825ndash1834

Timo Schick Helmut Schmid and Hinrich Schuumltze2020 Automatically identifying words that canserve as labels for few-shot text classification InProceedings of COLING pages 5569ndash5578

Timo Schick and Hinrich Schuumltze 2021 Exploitingcloze-questions for few-shot text classification andnatural language inference In Proceedings of EACLpages 255ndash269

Sonse Shimaoka Pontus Stenetorp Kentaro Inui andSebastian Riedel 2017 Neural architectures forfine-grained entity type classification In Proceed-ings of the 15th Conference of the European Chap-ter of the Association for Computational LinguisticsVolume 1 Long Papers pages 1271ndash1280 ValenciaSpain Association for Computational Linguistics

Taylor Shin Yasaman Razeghi Robert L Logan IVEric Wallace and Sameer Singh 2020 AutopromptEliciting knowledge from language models using au-tomatically generated prompts In Proceedings ofEMNLP pages 4222ndash4235

Trieu H Trinh and Quoc V Le 2018 A simplemethod for commonsense reasoning arXiv preprintarXiv180602847

Dong Wang Ning Ding Piji Li and Haitao Zheng2021 CLINE Contrastive learning with semanticnegative examples for natural language understand-ing In Proceedings of ACL

Ralph Weischedel and Ada Brunstein 2005 BBN Pro-noun Coreference and Entity Type Corpus Linguis-tic Data Consortium Philadelphia

Ralph Weischedel Martha Palmer Mitchell MarcusEduard Hovy Sameer Pradhan Lance Ramshaw Ni-anwen Xue Ann Taylor Jeff Kaufman MichelleFranchini Mohammed El-Bachouti Robert Belvinand Ann Houston 2013 OntoNotes Release 50Abacus Data Network

Thomas Wolf Lysandre Debut Victor Sanh JulienChaumond Clement Delangue Anthony Moi Pier-ric Cistac Tim Rault Remi Louf Morgan Funtow-icz Joe Davison Sam Shleifer Patrick von PlatenClara Ma Yacine Jernite Julien Plu Canwen XuTeven Le Scao Sylvain Gugger Mariama DrameQuentin Lhoest and Alexander Rush 2020 Trans-formers State-of-the-art natural language process-ing In Proceedings of EMNLP pages 38ndash45

Jingqing Zhang Yao Zhao Mohammad Saleh and Pe-ter J Liu 2019 Pegasus Pre-training with ex-tracted gap-sentences for abstractive summarizationIn Proceedings of ICML pages 11328ndash11339

Yizhe Zhang Siqi Sun Michel Galley Yen-Chun ChenChris Brockett Xiang Gao Jianfeng Gao JingjingLiu and Bill Dolan 2020 Dialogpt Large-scalegenerative pre-training for conversational responsegeneration In Proceedings of ACL pages 270ndash278

Tony Z Zhao Eric Wallace Shi Feng Dan Klein andSameer Singh 2021 Calibrate before use Im-proving few-shot performance of language modelsarXiv preprint arXiv210209690

London is one of the biggest citieshellip London

Input Prompt

[MASK] [SEP][CLS]

[HIDE] is located in the south-easthellip

Input

[CLS]

[P1] [P2]

Prompt

[MASK] [SEP][P1] [P2]

Copy the entity mention

Randomly hide the mention with probability α

[HIDE]

D

MLM

MLM Predictionover

Predictionover

JS Divergence

LD Unlabeled Datset Pre-defined Label Schema S Sampler L

L

S

Mapping

Mapping

Figure 3 The illustration of self-supervised prompt-learning for fine-grained entity typing with unlabeled data anda pre-defined label set Vlowast denotes the label words projected from the input label set Note that we only show thepositive pair in this figure

typing after trained on D and Y Using prompt-learning as the training strategy we first construct alabel words set Vlowast from Y and for each sentence xin D we wrap it with hard-encoding template witha [MASK] symbol The key idea is to make theprediction distributions of the same type of entitieson Vlowast as similar as possible In this way we canperform contrastive learning by sampling positiveand negative examples while ignoring the impactof other words that are not in Vlowast on optimizationduring the MLM process

42 Self-supervised Learning

Although there are no labels in D we can stilldevelop a sampling strategy based on a simplehypothesis that is same entities in different sen-tences have similar types For instance we willsample two sentences contain ldquoSteve Jobsrdquo as a pos-itive pair Moreover considering entity typing iscontext-aware ldquoSteve Jobsrdquo could be entrepreneurdesigner philanthropist in different contexts wechoose to optimize the similarity between distribu-tions of the words over Vlowast This strategy not onlysoftens the supervision but also eliminates the im-pact of other words in self-supervised learning

Particularly we randomly sample c positivepairs ie sentence pairs that share one same en-tity mention denoted as ˆDpos and c negative pairsie two sentences with different entity mentionsmarked denoted as ˆDneg from a large-scale entity-linked corpusD To avoid generating false negativesamples the negative samples are further restrictedby a large dictionary that contains common entitiesand their type information Only sentence pairs

with entities of different types in the dictionary areselected as negative samples Then we wrap themwith hard-encoding T3(middot) To avoid overfitting ofthe entity names we randomly hide the entity men-tion (in the original input and the template) witha special symbol [Hide] with a probability of αEmpirically α is set to 04

Since the impact of a pair of examples on train-ing should be measured at the distribution level wechoose Jensen-Shannon divergence as a metric toassess the similarity of two distributions Thus ina sentence pair (x xprime) the similarity score of tworepresentations of the the predictions h and hprime ofthe [MASK] position is computed by

s(hhprime) = JS(PVlowast(w|x) PVlowast(w|xprime)) (5)

where JS is Jensen-Shannon divergence PVlowast(w|x)and PVlowast(w|xprime) are probability distributions of thepredicting token w over Vlowast obtained by h and hprime

As we attempt to make the predictions of thepositive pairs similar the objective is computed by

L = minus 1

| ˆDpos|2sum

xisinDpos

sumxprimeisinDpos

log(1minus s(hhprime))

minus 1

| ˆDneg|2sum

xisin ˆDneg

sumxprimeisin ˆDneg

γ log(s(hhprime))

(6)

where γ is a penalty term because the assumptionis loose in negative pairs Overall we use entity-linked English Wikipedia corpus as the raw dataand generate about 1 million pairs of data each asˆDpos and ˆDneg

Dataset Type Supervised Few-shot Zero-shot

|Dtrain| |Ddev| |Dtest| |Dtrain| |Ddev| |Dtest| |Dtrain| |Ddev| |Dtest|

Few-NERD 66 340382 48758 96901 66~1056 = |Dtrain| 96901 0 0 96901OntoNotes 86 253239 2200 8962 86~1376 = |Dtrain| 8962 0 0 8962BBN 46 86077 12824 12824 46~736 = |Dtrain| 12824 0 0 12824

Table 1 Statistics of FEW-NERD OntoNotes and BBN from three experimental settings It can be seen that forall three settings the test sets are identical For the training set of the few-shot setting we report the summationfrom 1-shot to 16-shot

Shot  Metric  Few-NERD FT  Few-NERD PLET     OntoNotes FT  OntoNotes PLET    BBN FT  BBN PLET
1     Acc     8.94         43.87 (+34.93)    3.70          38.97 (+35.27)    0.80    40.70 (+39.90)
      MiF     19.85        60.60 (+45.75)    18.98         59.91 (+40.93)    5.79    49.25 (+43.46)
      MaF     19.85        60.60 (+40.75)    19.43         61.42 (+41.99)    4.42    48.48 (+43.06)
2     Acc     20.83        47.78 (+26.95)    7.27          39.19 (+31.92)    6.68    41.33 (+34.65)
      MiF     32.67        62.09 (+29.42)    24.89         61.09 (+36.20)    13.70   54.00 (+40.30)
      MaF     32.67        62.09 (+29.42)    25.64         62.68 (+37.04)    13.23   51.97 (+38.74)
4     Acc     33.09        57.00 (+23.91)    11.15         38.39 (+27.24)    19.34   52.21 (+32.87)
      MiF     44.14        68.61 (+24.47)    27.69         59.81 (+32.12)    27.03   61.13 (+34.10)
      MaF     44.14        68.61 (+24.47)    28.26         60.89 (+32.63)    24.69   58.91 (+34.22)
8     Acc     46.44        55.75 (+9.31)     18.37         39.37 (+21.00)    27.01   44.30 (+17.29)
      MiF     57.76        68.74 (+10.98)    38.16         57.97 (+19.81)    40.19   56.21 (+16.02)
      MaF     57.76        68.74 (+10.98)    37.77         58.32 (+20.55)    39.50   55.15 (+15.65)
16    Acc     60.98        61.58 (+0.60)     32.26         42.29 (+10.03)    39.67   55.00 (+15.33)
      MiF     71.59        72.39 (+0.80)     51.40         60.79 (+9.39)     49.01   62.84 (+13.83)
      MaF     71.59        72.39 (+0.80)     51.45         61.80 (+10.35)    47.09   62.38 (+15.29)

Table 2: Results of few-shot entity typing on FEW-NERD, OntoNotes, and BBN. All methods use BERTbase with the same initialization weights as the backbone encoder. The training set and dev set have the same size.

5 Experiments

In this section, we conduct experiments to evaluate the effectiveness of our methods. We use FT to denote the BERT-based fine-tuning approach, PLET to denote the naive prompt-learning approach for entity typing in § 3, and PLET (S) to denote the self-supervised prompt-learning approach in § 4. Our experiments are carried out in the fully supervised (§ 5.4), few-shot (§ 5.5), and zero-shot (§ 5.6) settings on three fine-grained entity typing datasets.

5.1 Datasets

We use three fine-grained entity typing datasets: FEW-NERD, OntoNotes, and BBN.

FEW-NERD. We use FEW-NERD (Ding et al., 2021b) as the main dataset, which has the following advantages: (1) FEW-NERD is large-scale and fine-grained, containing 8 coarse-grained and 66 fine-grained entity types; (2) FEW-NERD is manually annotated, so we can precisely assess the capability of entity typing models. Specifically, we use the supervised setting of the dataset, FEW-NERD (SUP), and its official split to conduct our experiments.

OntoNotes. We also use the OntoNotes 5.0 dataset (Weischedel et al., 2013) in our experiments. Following previous works on fine-grained entity typing, we adopt the 86-class version of OntoNotes, where each class has at most 3 levels in the type hierarchy. The data split is identical to that of Shimaoka et al. (2017).

BBN. The BBN dataset is selected from the Penn Treebank corpus of Wall Street Journal texts and labeled by Weischedel and Brunstein (2005). We follow the version processed by Ren et al. (2016a) and the data split of Ren et al. (2016b). The dataset contains 46 types, and each type has a maximum type hierarchy depth of 2.

5.2 Experimental Settings

The experiments are performed under three different settings to evaluate the effect of the prompt-learning method and self-supervised training. In Table 1, we show the statistics of all the settings on the three datasets.

Supervised Setting. In the fully supervised setting, all training data are used in the training phase. FT and PLET are used to train the model. We run the experiments on all three datasets with the BERT-base-cased backbone. Both hard and soft encodings are used for PLET.

Few-shot Setting. In the few-shot setting, we randomly sample 1, 2, 4, 8, and 16 instances for each entity type for training. We apply both the FT and PLET methods with hard encoding on all three datasets.

Zero-shot Setting. In the zero-shot setting, no labeled training data are available. The model is required to infer the entity type without any supervised training. Since fine-tuning is not applicable in this setting, we only conduct experiments with PLET and PLET (S).

Metrics. In terms of evaluation metrics, we follow the widely used setting of Ling and Weld (2012), which includes strict accuracy (Acc), loose macro F1-score (MaF), and loose micro F1-score (MiF) to evaluate the performance of models. The loose F1-score calculation considers type labels at different granularities.
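For reference, a compact sketch of these three metrics, assuming each prediction and each gold annotation is a set of type labels (as in the loose evaluation protocol of Ling and Weld, 2012); the function names are illustrative:

```python
# Strict accuracy and loose micro/macro F1 over sets of predicted type labels.
from typing import List, Set

def strict_acc(preds: List[Set[str]], golds: List[Set[str]]) -> float:
    # An instance is correct only if the predicted set equals the gold set.
    return sum(p == g for p, g in zip(preds, golds)) / len(golds)

def loose_macro_f1(preds: List[Set[str]], golds: List[Set[str]]) -> float:
    # Precision and recall are computed per instance, then averaged.
    p = sum(len(pr & gd) / len(pr) for pr, gd in zip(preds, golds) if pr) / len(golds)
    r = sum(len(pr & gd) / len(gd) for pr, gd in zip(preds, golds)) / len(golds)
    return 2 * p * r / (p + r) if p + r else 0.0

def loose_micro_f1(preds: List[Set[str]], golds: List[Set[str]]) -> float:
    # Overlap counts are pooled over the whole test set before computing P/R.
    overlap = sum(len(pr & gd) for pr, gd in zip(preds, golds))
    p = overlap / sum(len(pr) for pr in preds)
    r = overlap / sum(len(gd) for gd in golds)
    return 2 * p * r / (p + r) if p + r else 0.0
```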

5.3 Experimental Details

We use BERT-base (Devlin et al., 2019) as the backbone structure of our model, initialized with the corresponding pre-trained cased weights². The hidden size is 768 and the number of layers is 12. Models are implemented with the PyTorch framework³ (Paszke et al., 2019) and Huggingface Transformers⁴ (Wolf et al., 2020). BERT models are optimized by AdamW (Loshchilov and Hutter, 2019) with a learning rate of 5e-5. The training batch size is 16 for all models. In the supervised setting, each model is trained for 10 epochs and evaluated on the dev set every 2000 steps. In the few-shot setting, each model is trained for 30 epochs and evaluated every 10∼50 steps; each evaluation runs for 200 steps.

2 https://github.com/google-research/bert
3 https://pytorch.org
4 https://github.com/huggingface/transformers

For the methods with hard encoding, we report the experimental results of T3(·). For the soft-encoding method, we report the results of m = 2. Experiments are conducted with CUDA on NVIDIA Tesla V100 GPUs.
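A minimal sketch of a supervised prompt-learning step under these settings (BERT-base-cased, AdamW, learning rate 5e-5, batch size 16) is shown below; the verbalizer mapping and batch format are illustrative assumptions, and the hard template follows the wording of T3(·):

```python
# Supervised prompt-learning step: predict the verbalizer word at [MASK].
import torch
from torch.optim import AdamW
from transformers import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
model = BertForMaskedLM.from_pretrained("bert-base-cased")
optimizer = AdamW(model.parameters(), lr=5e-5)

def training_step(batch, verbalizer):
    # batch: list of (sentence, mention, type); verbalizer: type -> label word.
    texts = [f"{s} In this sentence, {m} is [MASK]." for s, m, _ in batch]
    enc = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
    labels = torch.full_like(enc["input_ids"], -100)      # ignore non-[MASK] positions
    mask_pos = enc["input_ids"] == tokenizer.mask_token_id
    label_ids = tokenizer.convert_tokens_to_ids([verbalizer[t] for _, _, t in batch])
    labels[mask_pos] = torch.tensor(label_ids)
    loss = model(**enc, labels=labels).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```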

5.4 Results of Fully Supervised Entity Typing

Dataset    Metric  FT     PLET (H)  PLET (S)
Few-NERD   Acc     79.75  79.90     79.86
           MiF     85.74  85.84     85.76
           MaF     85.74  85.84     85.76
OntoNotes  Acc     59.71  60.37     65.68
           MiF     70.47  70.78     74.53
           MaF     76.57  76.42     79.77
BBN        Acc     62.39  65.92     63.11
           MiF     68.88  71.55     68.68
           MaF     67.37  70.82     67.81

Table 3: Fully supervised entity typing results. FT denotes the vanilla fine-tuning method, (H) denotes the hard-encoding strategy, and (S) denotes the soft-encoding strategy. All methods use BERTbase with the same initialization weights as the backbone encoder.

The results on all three datasets across different models are reported in Table 3. Overall, the prompt-based methods show certain improvements compared to directly fine-tuned models. This shows that the prompt-based method does help with capturing entity-type information from a given context.

It is also observed that the magnitude of the improvement and the preferred prompt encoding strategy may vary across datasets. The prompt-based method seems less effective on the FEW-NERD dataset than on the other two. This indicates that the effect of the prompt-based method partially depends on the characteristics of the dataset and that different prompt designs may suit different data. Specifically, FEW-NERD is manually annotated and contains much less noise than the other two datasets, benefiting the FT method in learning classification with an extra linear layer. Moreover, for the OntoNotes dataset, soft encoding significantly outperforms hard encoding, while for the other two datasets the effect seems reversed.

5.5 Results of Few-shot Entity Typing

Table 2 shows the results on few-shot entity typing. The prompt-based model outperforms fine-tuning by a large margin under the few-shot setting, especially when only 1∼2 training instances per type are available.

Dataset     Metric  PLET   PLET (S)
Few-NERD    Acc     17.55  23.99 (+6.44)
            MiF     28.39  47.98 (+19.59)
            MaF     28.39  47.98 (+19.59)
OntoNotes‡  Acc     25.10  28.27 (+3.17)
            MiF     33.61  49.79 (+16.18)
            MaF     37.91  49.95 (+12.04)
BBN         Acc     55.82  57.79 (+1.97)
            MiF     60.64  63.24 (+2.60)
            MaF     59.99  64.00 (+4.01)

Table 4: Results of zero-shot entity typing on FEW-NERD, OntoNotes, and BBN. ‡ means that we remove the "Other" class during testing. PLET denotes the prompt-learning pipeline and PLET (S) denotes self-supervised prompt-learning; both methods use BERTbase as the backbone encoder.

It should be noted that for the OntoNotes and BBN datasets, sampling 16 instances for each entity type already amounts to over 0.5% of the total training data. Meanwhile, some of the data in BBN are distantly supervised and potentially erroneous, which brings more randomness to few-shot training. The results support the idea that a well-designed prompt has much potential in mining the learned knowledge in pre-trained models and thus yields better performance in few-shot settings. The results also indicate that even when the number of entity types is large (46∼86), the superiority of prompt-learning still holds.

5.6 Results of Zero-shot Entity Typing

Table 4 shows the results of zero-shot entity typing on the three datasets. We do not report the performance of the vanilla fine-tuning approach because it cannot produce reasonable results with a randomly initialized classifier. It should also be noted that the prompt method without fine-tuning already outperforms random guessing, indicating that adding a prompt is informative for a model pre-trained with a masked language modeling objective (e.g., BERT) and can induce reasonable predictions in entity typing tasks. Second, the performance of the model improves by a large margin when trained on unlabeled data. This shows the effectiveness of the proposed self-supervised training approach and points to the potential of a pre-trained prompt-based model under the zero-shot setting, when no labeled data are available.

To explore more subtle changes in performance, we carry out a case study for zero-shot entity typing. In Figure 4, we illustrate the zero-shot prediction distribution (the correct prediction and the other top-5 predictions) for four entity types in FEW-NERD: ORG-SPORTSLEAGUE, EVENT-ATTACK, MISC-CURRENCY, and LOC-MOUNTAIN. We observe that, with self-supervised prompt-learning, PLET (S) can summarize entity type information and infer the related words to a certain extent. In Figure 4 (a) and Figure 4 (b), the PLET model suffers from a severe bias and predicts almost no correct labels in the zero-shot setting, since such words are low-frequency. Although there is no explicit supervision in the pre-training stage of PLET (S), the model can still find the corresponding words that express the ORG-SPORTSLEAGUE and EVENT-ATTACK types. In Figure 4 (c), self-supervised learning increases the performance of the original encoder. Further, in Figure 4 (d), PLET already makes satisfying predictions for the type LOC-MOUNTAIN. In this case, the use of self-supervised learning has hardly weakened the performance, which means that the process of automatically summarizing type information has little negative impact on high-confidence entity types.

5.7 Effect of Templates

As stated in previous studies (Gao et al., 2020; Zhao et al., 2021), the choice of templates may have a huge impact on performance in prompt-learning. In this section, we carry out experiments to investigate such influence. Experiments are conducted under the 8-shot setting on the FEW-NERD dataset, and we use 3 different hard-encoding templates and 4 soft-encoding templates (by changing the number of prompt tokens m). The results in Table 5 demonstrate that the choice of templates exerts a considerable influence on the performance of prompt-based few-shot learning. For the hard-encoding templates, the phrase that describes the location, "in this sentence", contributes a remarkable improvement in performance. For the soft-encoding templates, surprisingly, the prompt-learning model yields the best result with the fewest special tokens.
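For illustration, the two template families compared here can be instantiated as simple string builders; the hard-encoding wordings follow Table 5, while the soft variant only fixes where the learnable placeholder tokens are inserted (their number, l, corresponds to m above). The function names are assumptions, not the paper's code:

```python
# Illustrative template builders for hard-encoding and soft-encoding prompts.
def hard_template(sentence: str, mention: str, variant: int = 3) -> str:
    templates = {
        1: f"{sentence} {mention} is [MASK].",
        2: f"{sentence} {mention} is a [MASK].",
        3: f"{sentence} In this sentence, {mention} is [MASK].",
    }
    return templates[variant]

def soft_template(sentence: str, mention: str, l: int = 2) -> str:
    # [P*] tokens are placeholders whose embeddings are learned during tuning.
    soft_tokens = " ".join(f"[P{i}]" for i in range(1, l + 1))
    return f"{sentence} [P] {mention} {soft_tokens} [MASK]."
```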

6 Related Work


Figure 4: Zero-shot prediction distributions for four types in FEW-NERD: (a) ORG-SPORTSLEAGUE, (b) EVENT-ATTACK, (c) MISC-CURRENCY, and (d) LOC-MOUNTAIN. In each subgraph, the left part illustrates the results of PLET and the right part shows the results of PLET (S); the bars distinguish correct predictions, wrong predictions with correct coarse-grained types, and wrong predictions with wrong coarse-grained types.

Encoding Strategy  Template T(x)                               Acc    MiF    MaF
Hard-encoding      x [Ent] is [MASK]                           54.45  67.34  67.34
                   x [Ent] is a [MASK]                         53.93  66.44  66.44
                   x In this sentence [Ent] is [MASK]          55.75  68.74  68.74
Soft-encoding      x [P] [Ent] [P1] ... [Pl] [MASK], l = 2     59.25  69.58  69.58
                   x [P] [Ent] [P1] ... [Pl] [MASK], l = 3     53.66  66.06  66.06
                   x [P] [Ent] [P1] ... [Pl] [MASK], l = 4     52.96  66.01  66.01
                   x [P] [Ent] [P1] ... [Pl] [MASK], l = 5     55.44  68.39  68.39

Table 5: Effect of templates. The results are produced under the 8-shot setting on the FEW-NERD dataset by PLET.

After a series of effective PLMs like GPT (Radford et al., 2018), BERT (Devlin et al., 2019), RoBERTa (Liu et al., 2019), and T5 (Raffel et al., 2020), fine-tuned PLMs have demonstrated their effectiveness on various important NLP tasks, such as dialogue generation (Zhang et al., 2020), text summarization (Zhang et al., 2019; Liu and Lapata, 2019), question answering (Adiwardana et al., 2020), and text classification (Baldini Soares et al., 2019; Peng et al., 2020; Ding et al., 2021a).

Despite the success of fine-tuning PLMs, the huge gap in objective form between pre-training and fine-tuning still hinders the full use of pre-trained knowledge for downstream tasks (Liu et al., 2021b; Han et al., 2021b; Hu et al., 2021). To this end, prompt-learning has been proposed. In prompt-learning, by leveraging language prompts as contexts, downstream tasks can be expressed as cloze-style objectives similar to the pre-training objectives. The seminal work that stimulated the development of prompt-learning is the birth of GPT-3 (Brown et al., 2020), which uses hand-crafted prompts for tuning and achieves very impressive performance on various tasks, especially under the few-shot learning setting.

Inspired by GPT-3, a series of hand-crafted prompts have been widely explored in knowledge probing (Trinh and Le, 2018; Petroni et al., 2019; Davison et al., 2019), relation classification (Han et al., 2021b), sentiment classification, and natural language inference (Schick and Schütze, 2021; Liu et al., 2021b). To avoid labor-intensive prompt design, automatic prompt search has also been extensively explored (Schick et al., 2020; Schick and Schütze, 2021; Shin et al., 2020; Gao et al., 2020; Liu et al., 2021a) to generate language phrases for prompts. Recently, continuous prompts have also been proposed (Li and Liang, 2021; Lester et al., 2021), which directly use a series of learnable continuous embeddings as prompts rather than discrete language phrases.

In this paper, we aim to stimulate PLMs with prompt-learning to capture the attribute information of entities. We take fine-grained entity typing, a crucial task in knowledge extraction that assigns entity types to entity mentions (Lin et al., 2012), as the foothold to develop prompt-learning strategies. In fact, Dai et al. (2021) use hypernym extraction patterns to enhance the context and apply masked language modeling to tackle the ultra-fine entity typing problem (Choi et al., 2018) with free-form labels, which shares a similar idea with prompt-learning. In our work, we mainly emphasize using prompt-learning to extract entity types that have been pre-defined in low-data scenarios.

7 Conclusion

This work investigates the application of prompt-learning to fine-grained entity typing. More specifically, we propose a framework, PLET, that can deal with fine-grained entity typing in fully supervised, few-shot, and zero-shot scenarios. In PLET, we first introduce a simple and effective prompt-learning pipeline that can be used to extract entity types with both sufficient and insufficient supervision. Furthermore, to handle the zero-shot setting, we propose a self-supervised prompt-learning approach that automatically learns and summarizes entity types based on unlabeled corpora and a pre-defined label schema. PLET utilizes prompts to take advantage of prior knowledge distributed in PLMs and can learn pre-defined type information without overfitting by performing distribution-level optimization. In future work, along the direction of PLET (S), we will explore better prompt-learning approaches to automatically learn entity types from unlabeled data.

References

Daniel Adiwardana, Minh-Thang Luong, David R. So, Jamie Hall, Noah Fiedel, Romal Thoppilan, Zi Yang, Apoorv Kulshreshtha, Gaurav Nemade, Yifeng Lu, et al. 2020. Towards a human-like open-domain chatbot. arXiv preprint arXiv:2001.09977.

Livio Baldini Soares, Nicholas FitzGerald, Jeffrey Ling, and Tom Kwiatkowski. 2019. Matching the blanks: Distributional similarity for relation learning. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2895–2905, Florence, Italy. Association for Computational Linguistics.

Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. In Proceedings of NIPS, pages 1877–1901.

Eunsol Choi, Omer Levy, Yejin Choi, and Luke Zettlemoyer. 2018. Ultra-fine entity typing. In Proceedings of ACL, pages 87–96.

Hongliang Dai, Yangqiu Song, and Haixun Wang. 2021. Ultra-fine entity typing with weak supervision from a masked language model. In Proceedings of ACL, pages 1790–1799.

Joe Davison, Joshua Feldman, and Alexander M. Rush. 2019. Commonsense knowledge mining from pre-trained models. In Proceedings of EMNLP-IJCNLP, pages 1173–1178.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of NAACL-HLT, pages 4171–4186.

Ning Ding, Xiaobin Wang, Yao Fu, Guangwei Xu, Rui Wang, Pengjun Xie, Ying Shen, Fei Huang, Hai-Tao Zheng, and Rui Zhang. 2021a. Prototypical representation learning for relation extraction. In Proceedings of ICLR.

Ning Ding, Guangwei Xu, Yulin Chen, Xiaobin Wang, Xu Han, Pengjun Xie, Hai-Tao Zheng, and Zhiyuan Liu. 2021b. Few-NERD: A few-shot named entity recognition dataset. In Proceedings of ACL, pages 3198–3213.

Tianyu Gao, Adam Fisch, and Danqi Chen. 2020. Making pre-trained language models better few-shot learners. arXiv preprint arXiv:2012.15723.

Xu Han, Zhengyan Zhang, Ning Ding, Yuxian Gu, Xiao Liu, Yuqi Huo, Jiezhong Qiu, Liang Zhang, Wentao Han, Minlie Huang, et al. 2021a. Pre-trained models: Past, present and future. arXiv preprint arXiv:2106.07139.

Xu Han, Weilin Zhao, Ning Ding, Zhiyuan Liu, and Maosong Sun. 2021b. PTR: Prompt tuning with rules for text classification. arXiv preprint arXiv:2105.11259.

John Hewitt and Christopher D. Manning. 2019. A structural probe for finding syntax in word representations. In Proceedings of NAACL, pages 4129–4138.

Shengding Hu, Ning Ding, Huadong Wang, Zhiyuan Liu, Juanzi Li, and Maosong Sun. 2021. Knowledgeable prompt-tuning: Incorporating knowledge into prompt verbalizer for text classification. arXiv preprint arXiv:2108.02035.

Ganesh Jawahar, Benoît Sagot, and Djamé Seddah. 2019. What does BERT learn about the structure of language? In Proceedings of ACL, pages 3651–3657.

Brian Lester, Rami Al-Rfou, and Noah Constant. 2021. The power of scale for parameter-efficient prompt tuning. arXiv preprint arXiv:2104.08691.

Xiang Lisa Li and Percy Liang. 2021. Prefix-tuning: Optimizing continuous prompts for generation. arXiv preprint arXiv:2101.00190.

Thomas Lin, Oren Etzioni, et al. 2012. No noun phrase left behind: Detecting and typing unlinkable entities. In Proceedings of EMNLP-CoNLL, pages 893–903.

Xiao Ling and Daniel S. Weld. 2012. Fine-grained entity recognition. In AAAI.

Pengfei Liu, Weizhe Yuan, Jinlan Fu, Zhengbao Jiang, Hiroaki Hayashi, and Graham Neubig. 2021a. Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing. arXiv preprint arXiv:2107.13586.

Xiao Liu, Yanan Zheng, Zhengxiao Du, Ming Ding, Yujie Qian, Zhilin Yang, and Jie Tang. 2021b. GPT understands, too. arXiv preprint arXiv:2103.10385.

Yang Liu and Mirella Lapata. 2019. Text summarization with pretrained encoders. In Proceedings of EMNLP-IJCNLP, pages 3730–3740, Hong Kong, China. Association for Computational Linguistics.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.

Ilya Loshchilov and Frank Hutter. 2019. Decoupled weight decay regularization. In Proceedings of ICLR.

Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Köpf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. 2019. PyTorch: An imperative style, high-performance deep learning library. In Proceedings of NIPS, pages 8024–8035.

Hao Peng, Tianyu Gao, Xu Han, Yankai Lin, Peng Li, Zhiyuan Liu, Maosong Sun, and Jie Zhou. 2020. Learning from context or names? An empirical study on neural relation extraction. In Proceedings of EMNLP, pages 3661–3672.

Fabio Petroni, Tim Rocktäschel, Sebastian Riedel, Patrick Lewis, Anton Bakhtin, Yuxiang Wu, and Alexander Miller. 2019. Language models as knowledge bases? In Proceedings of EMNLP, pages 2463–2473.

Xipeng Qiu, Tianxiang Sun, Yige Xu, Yunfan Shao, Ning Dai, and Xuanjing Huang. 2020. Pre-trained models for natural language processing: A survey. Science China Technological Sciences, pages 1–26.

Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding by generative pre-training. OpenAI.

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. JMLR, 21:1–67.

Xiang Ren, Wenqi He, Meng Qu, Lifu Huang, Heng Ji, and Jiawei Han. 2016a. AFET: Automatic fine-grained entity typing by hierarchical partial-label embedding. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1369–1378, Austin, Texas. Association for Computational Linguistics.

Xiang Ren, Wenqi He, Meng Qu, Clare R. Voss, Heng Ji, and Jiawei Han. 2016b. Label noise reduction in entity typing by heterogeneous partial-label embedding. In Proceedings of SIGKDD, pages 1825–1834.

Timo Schick, Helmut Schmid, and Hinrich Schütze. 2020. Automatically identifying words that can serve as labels for few-shot text classification. In Proceedings of COLING, pages 5569–5578.

Timo Schick and Hinrich Schütze. 2021. Exploiting cloze-questions for few-shot text classification and natural language inference. In Proceedings of EACL, pages 255–269.

Sonse Shimaoka, Pontus Stenetorp, Kentaro Inui, and Sebastian Riedel. 2017. Neural architectures for fine-grained entity type classification. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, pages 1271–1280, Valencia, Spain. Association for Computational Linguistics.

Taylor Shin, Yasaman Razeghi, Robert L. Logan IV, Eric Wallace, and Sameer Singh. 2020. AutoPrompt: Eliciting knowledge from language models using automatically generated prompts. In Proceedings of EMNLP, pages 4222–4235.

Trieu H. Trinh and Quoc V. Le. 2018. A simple method for commonsense reasoning. arXiv preprint arXiv:1806.02847.

Dong Wang, Ning Ding, Piji Li, and Haitao Zheng. 2021. CLINE: Contrastive learning with semantic negative examples for natural language understanding. In Proceedings of ACL.

Ralph Weischedel and Ada Brunstein. 2005. BBN Pronoun Coreference and Entity Type Corpus. Linguistic Data Consortium, Philadelphia.

Ralph Weischedel, Martha Palmer, Mitchell Marcus, Eduard Hovy, Sameer Pradhan, Lance Ramshaw, Nianwen Xue, Ann Taylor, Jeff Kaufman, Michelle Franchini, Mohammed El-Bachouti, Robert Belvin, and Ann Houston. 2013. OntoNotes Release 5.0. Abacus Data Network.

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Remi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander Rush. 2020. Transformers: State-of-the-art natural language processing. In Proceedings of EMNLP, pages 38–45.

Jingqing Zhang, Yao Zhao, Mohammad Saleh, and Peter J. Liu. 2019. PEGASUS: Pre-training with extracted gap-sentences for abstractive summarization. In Proceedings of ICML, pages 11328–11339.

Yizhe Zhang, Siqi Sun, Michel Galley, Yen-Chun Chen, Chris Brockett, Xiang Gao, Jianfeng Gao, Jingjing Liu, and Bill Dolan. 2020. DialoGPT: Large-scale generative pre-training for conversational response generation. In Proceedings of ACL, pages 270–278.

Tony Z. Zhao, Eric Wallace, Shi Feng, Dan Klein, and Sameer Singh. 2021. Calibrate before use: Improving few-shot performance of language models. arXiv preprint arXiv:2102.09690.

Dataset Type Supervised Few-shot Zero-shot

|Dtrain| |Ddev| |Dtest| |Dtrain| |Ddev| |Dtest| |Dtrain| |Ddev| |Dtest|

Few-NERD 66 340382 48758 96901 66~1056 = |Dtrain| 96901 0 0 96901OntoNotes 86 253239 2200 8962 86~1376 = |Dtrain| 8962 0 0 8962BBN 46 86077 12824 12824 46~736 = |Dtrain| 12824 0 0 12824

Table 1 Statistics of FEW-NERD OntoNotes and BBN from three experimental settings It can be seen that forall three settings the test sets are identical For the training set of the few-shot setting we report the summationfrom 1-shot to 16-shot

Shot Metric Few-NERD OntoNotes BBN

Fine-tuning PLET Fine-tuning PLET Fine-tuning PLET

1Acc 894 4387 (+3493) 370 3897 (+3527) 080 4070 (+3990)MiF 1985 6060 (+4575) 1898 5991 (+4093) 579 4925 (+4346)MaF 1985 6060 (+4075) 1943 6142 (+4199) 442 4848 (+4306)

2Acc 2083 4778 (+2695) 727 3919 (+3192) 668 4133 (+3465)MiF 3267 6209 (+2942) 2489 6109 (+3620) 1370 5400 (+4030)MaF 3267 6209 (+2942) 2564 6268 (+3704) 1323 5197 (+3874)

4Acc 3309 5700 (+2391) 1115 3839 (+2724) 1934 5221 (+3287)MiF 4414 6861 (+2447) 2769 5981 (+3212) 2703 6113 (+3410)MaF 4414 6861 (+2447) 2826 6089 (+3263) 2469 5891 (+3422)

8Acc 4644 5575 (+931) 1837 3937 (+2100) 2701 4430 (+1729)MiF 5776 6874 (+1098) 3816 5797 (+1981) 4019 5621 (+1602)MaF 5776 6874 (+1098) 3777 5832 (+2055) 3950 5515 (+1565)

16Acc 6098 6158 (+060) 3226 4229 (+1003) 3967 5500 (+1533)MiF 7159 7239 (+080) 5140 6079 (+939) 4901 6284 (+1383)MaF 7159 7239 (+080) 5145 6180 (+1035) 4709 6238 (+1529)

Table 2 Results of few-shot entity typing on FEW-NERD OntoNotes and BBN all the methods use BERTbasewith same initialization weights as the backbone encoder Training set and dev set have the same size

5 Experiments

In this section we conduct experiments to evaluatethe effectiveness of our methods We use FT to de-note the BERT-based fine-tuning approach PLET

to denote the naive prompt-learning approach forentity typing in sect 3 and PLET (S) to denote theself-supervised prompt-learning approach in sect 4Our experiments are carried out on fully supervised(sect 54) few-shot (sect 55) and zero-shot (sect 56) set-tings on three fine-grained entity typing datasets

51 Datasets

We use three fine-grained entity typing datasetsFEW-NERD OntoNotes and BBN

FEW-NERD We use FEW-NERD (Dinget al 2021b) as the main dataset which has the fol-lowing advantages (1) FEW-NERD is large-scaleand fine-grained which contains 8 coarse-grainedand 66 fine-grained entity types (2) FEW-NERDis manually annotated thereby we can precisely as-

sess the capability of entity typing models Specifi-cally we use the supervised setting of the datasetFEW-NERD (SUP) and the official split of it toconduct our experiments

OntoNotes We also use the OntoNotes 50dataset (Weischedel et al 2013) in experimentsFollowing previous works for fine-grained entitytyping we adopt 86-classes version of OntoNoteswhile each class has at most 3 levels of the typehierarchy And the data split is identical to (Shi-maoka et al 2017)

BBN BBN dataset is selected from Penn Tree-bank corpus of Wall Street Journal texts and labeledby (Weischedel and Brunstein 2005) We followthe version processed by (Ren et al 2016a) andthe data split by (Ren et al 2016b) The datasetcontains 46 types and each type has a maximumtype hierarchy level of 2

52 Experimental SettingsThe experiments are performed under three differ-ent settings to evaluate the effect of the prompt-learning method and semi-supervised training Intable 1 we show the statistics of all the settings onthe three datasets

Supervised Setting In a fully supervised set-ting all training data are used in the training phaseFT and PLET are used to train the model We runthe experiments on all three datasets with BERT-base-cased backbone Both hard and soft encodingsare used for PLET

Few-shot Setting In a few-shot setting werandomly sample 1 2 4 8 16 instances for eachentity type for training We apply both FT andPLET methods with hard encoding on all the threedatasets

Zero-shot Setting In zero-shot setting notraining data with labels are available The model isrequired to infer the entity type without any super-vised training Since fine-tuning is not applicablein this setting we only conduct experiments onPLET and PLET (S)

Metrics In terms of evaluation metrics wefollow the widely used setting of Ling and Weld(2012) which includes strict accuracy (Acc) loosemacro F1-score (MaF) and loose micro F1-score(MiF) to evaluate the performances of models Theloose F1-score calculation concerns type labels bydifferent granularities

53 Experimental DetailsWe use BERT-base (Devlin et al 2019) as thebackbone structures of our model and initializedwith the corresponding pre-trained cased weights2The hidden sizes are 768 and the number of lay-ers are 12 Models are implemented by Pytorchframework3 (Paszke et al 2019) and Huggingfacetransformers4 (Wolf et al 2020) BERT modelsare optimized by AdamW (Loshchilov and Hutter2019) with the learning rate of 5e-5 The trainingbatch size used is 16 for all models In the super-vised setting each model is trained for 10 epochsand evaluated on the dev set every 2000 steps Inthe few-shot setting each model is trained for 30epochs and evaluated every 10sim50 steps each timethe evaluation is run for 200 steps For the methods

2httpsgithubcomgoogle-researchbert

3httpspytorchorg4httpsgithubcomhuggingface

transformers

with hard-encoding we report the experimental re-sults of T3(middot) For the soft-encoding method wereport the results of m = 2 Experiments are con-ducted with CUDA on NVIDIA Tesla V100 GPUs

54 Results of Fully Supervised Entity Typing

Dataset Metric Method

FT PLET (H) PLET (S)

Few-NERDAcc 7975 7990 7986MiF 8574 8584 8576MaF 8574 8584 8576

OntoNotesAcc 5971 6037 6568MiF 7047 7078 7453MaF 7657 7642 7977

BBNAcc 6239 6592 6311MiF 6888 7155 6868MaF 6737 7082 6781

Table 3 Fully supervised entity typing results FTdenotes the vanilla fine-tuning method (H) denotesthe hard-encoding strategy and (S) denotes the soft-encoding strategy All the methods use BERTbase withsame initialization weights as the backbone encoder

The results on all three datasets across differentmodels are reported in Table 3 Overall the prompt-based methods have shown certain improvementscomparing to directly fine-tuned models It showsthat the prompt-based method does help with cap-turing entity-type information from a given context

It is also observed that the magnitude of the im-provement and the preference of prompt encod-ing strategy may vary with different datasets Theprompt-based method seems less effective on FEW-NERD dataset than the other two It indicates thatthe effect of the prompt-based method partially de-pends on the characteristics of the dataset and thatdifferent prompt designs may suit different dataSpecifically FEW-NERD is manually annotatedand contains much less noise than the other twodatasets benefiting the FT method to learn classi-fication with an extra linear layer Moreover forthe OntoNotes dataset soft encoding significantlyoutperforms hard encoding while for the other twodatasets the effect seems reversed

55 Results of Few-shot Entity Typing

Table 2 shows the results on few-shot entity typ-ing It is shown that prompt-based model outper-forms fine-tuning by a large margin under few-shotsetting especially when only 1 sim 2 training in-stances per type are available It should be noted

Dataset Metric Method

PLET PLET (S)

Few-NERDAcc 1755 2399 (+644)MiF 2839 4798 (+1959)MaF 2839 4798 (+1959)

OntoNotesDaggerAcc 2510 2827 (+317)MiF 3361 4979 (+1618)MaF 3791 4995 (+1204)

BBNAcc 5582 5779 (+197)MiF 6064 6324 (+260)MaF 5999 6400 (+401)

Table 4 Results of zero-shot entity typing on FEW-NERD OntoNotes and BBN Dagger means that we re-move the ldquoOtherrdquo class during testing PLET denotesthe prompt-learning pipeline and PLET (S) denotesself-supervised prompt-learning both methods use theBERTbase as the backbone encoder

that for OntoNotes and BBN datasets sampling 16instances for each entity type already amounts toover 05 of the total training data Meanwhilesome of the data in BBN are distantly-supervisedand are potentially erroneous It brings more ran-domness to few-shot training The results supportthe idea that a well-designed prompt has muchpotential in mining the learned knowledge in pre-trained models and thus yields better performancein few-shot settings The results also indicate thateven when the number of entity types is large (46sim86) the superiority of prompt-learning still holds

56 Results of Zero-shot Entity Typing

Table 4 shows the results on zero-shot entity typ-ing task on FEW-NERD dataset We did not re-port the performance of the vanilla fine-tuning ap-proach because it cannot produce reasonable resultswith a randomly initialized classifier And it alsoshould be noted that the prompt method withoutfine-tuning already outperforms random guessingIt indicates that adding a prompt is informative fora model pre-trained on masked-language-modeltask (eg BERT) and can induce reasonable pre-dictions in entity typing tasks Second the perfor-mance of the model improves by a large margin iftrained on unlabeled data It shows the effective-ness of the proposed self-supervised training ap-proach and points to the potential of a pre-trainedprompt-based model under the zero-shot settingwhen no labeled data are available

To explore the more subtle changes in perfor-

mance we carry out case study for the zero-shotentity typing In Figure 4 we illustrate the zero-shot prediction distribution (the correct predictionand other top-5 predictions) for four entity typesin FEW-NERD which are ORG-SPORTSTEAMEVENT-ATTACK MISC-CURRENCY and LOC-MOUNTAIN We could observe that with self-supervised prompt-learning PLET (S) could sum-marize entity type information and infer the relatedwords to a certain extent In Figure 4 (a) and Fig-ure 4 (b) the PLET model suffers from a severe biasand almost predict no correct labels in the zero-shotsetting since such words are low-frequency Andalthough there is no explicit supervision in the pre-training stage of UNPLET the model could stillfind the corresponding words that express the ORG-SPORTSLEAGUE and the EVENT-ATTACK typesIn Figure 4 (c) self-supervised learning increasesthe performance of the original encoder Furtherin Figure 4 (d) PLET has been able to make sat-isfying predictions for this type LOC-MOUNTAINIn this case the use of self-supervised learning hashardly weakened the performance which meansthat the process of automatically summarizing typeinformation has a little negative impact on high-confidence entity types

57 Effect of Templates

As stated in previous studies (Gao et al 2020Zhao et al 2021) the choice of templates mayhave a huge impact on the performance in prompt-learning In this section we carry out experi-ments to investigate such influence Experimentsare conducted under the 8-shot setting on FEW-NERD dataset and we use 3 different hard en-coding templates and 4 soft encoding templates(by changing the number of prompt tokens m)The results demonstrate that the choice of tem-plates exerts a considerable influence on the per-formance of prompt-based few-shot learning Forthe hard-encoding templates the phrase that de-scribes the location ldquoin this sentencerdquo contributes aremarkable improvement in performance For thesoft-encoding templates surprisingly the prompt-learning model yields the best result with the fewestspecial tokens

6 Related Work

After a series of effective PLMs like GPT (Rad-ford et al 2018) BERT (Devlin et al 2019)RoBERTa (Liu et al 2019) and T5 (Raffel et al

0

600

1200

1800

2400

3000

ᤒ໒ 1

Org-SportsLeague 50

Product-Game 2783

Org-SportsTeam 43

MISC-Livingthing 43

Org-Company 27

Org-Political 16

OR

G-SPORTSL

EAG

UE

PROD

UCT-G

AM

EO

RG-SPORTST

EAM

MISC-L

IVIN

G

ORG-C

OM

PAN

YO

RG-POLITICA

L

926

14 10 0717

ᤒ໒ 1-1

Org-SportsLeague 1929

Per-Athlete 368

Event-Other 132

Location-Mountain 117

Org-SportsTeam 116

Build-Airport 54

OR

G-SPORTSL

EAG

UE

123

Zero-shot Prediction distribution for MSIC-Astronomy class by UNPLET

44

642

PER-ATH

LETEE

VEN

T-OTH

ERL

OCATIO

N-MO

UN

TAIN

ORG-SPO

RTSTEA

MB

UILD-A

IRPORT

PLET PLET (S)

39 39 1814

1

(a) Zero-shot prediction distribution on ORG-SPORTSLEAGUE

0

600

1200

1800

2400

3000

ᤒ໒ 1

Event-Attack 47

PROD-Game 2662Event-Disaster 540

Loc-Other 242

event-protest 203

MISC-Language 169

EV

ENT-A

TTAC

KPRO

D-G

AM

EE

VEN

T-DISA

STERL

OC-O

THER

EV

ENT-PRO

TESTM

ISC-LA

NG

UA

GE

630

12857 48 40

ᤒ໒ 1-1

Event-Attack 1894

Org-Political 420

Per-Politician 392

Loc-Island 342

Loc-Other 342

Other-Language 154

EV

ENT-A

TTAC

K

99

Zero-shot Prediction distribution for MSIC-Astronomy class by UNPLET

93

449

ORG-PO

LITICAL

PER-POLITICIA

NL

OC-ISLA

ND

LO

C-OTH

ER

MISC-L

AN

GU

AG

EPLET PLET (S)

81 81 3711

2

(b) Zero-shot prediction distribution on EVENT-ATTACK

0

500

1000

1500

2000

2500

ᤒ໒ 1

Misc-Currency 1196Person-Other 782

PROD-Car 666

Misc-Language 282

product-game 179Org-Company 116

MISC-C

URRENCY

PERSON-O

THER

PROD-C

AR

MISC-L

AN

GU

AG

EPRO

DU

CT-GA

ME

ORG-C

OM

PAN

Y

203

86 55 35

ᤒ໒ 1-1

Misc-Currency 2197

Loc-Island 194Org-Political 157Org-Company 138Loc-Mountain 108Misc-Language 75

MISC-C

URRENCY

59 48

670

LO

C-ISLAN

DO

RG-POLITICA

LO

RG-CO

MPA

NY

LO

C-MO

UN

TAIN

MISC-L

AN

GU

AG

E

PLET PLET (S)

42 33 23

365

239

1

(c) Zero-shot prediction distribution on MISC-CURRENCY

0

500

1000

1500

2000

2500

ᤒ໒ 1

Loc-Mountain 2229Misc-language 292Loc-Other 97Person-Other 87Misc-Living 69Prod-Car 22

LOC-M

OUNTAIN

MISC-L

AN

GU

AG

EL

OC-O

THER

PERSON-O

THER

MISC-L

IVIN

G

PROD-C

AR

10130 19 08

ᤒ໒ 1-1

Loc-Mountain 2212Loc-Island 247Person-Artist 52Person-Politician 45Misc-Language 42Prod-Ship 31

LOC-M

OUNTAIN

86 18

767

LO

C-ISLAN

DPERSO

N-ARTIST

PERSON-PO

LITICIAN

MISC-L

AN

GU

AG

EPRO

D-SHIP

PLET PLET (S)

16 15 1134

772

1

(d) Zero-shot prediction distribution on LOC-MOUNTAIN

Figure 4 Zero-shot prediction distribution on four types in FEW-NERD in each subgraph the left part illustratesthe results of PLET and the right part shows the results of PLET (S) denotes the correct predictions denotesthe wrong predictions with correct coarse-grained types and denotes the wrong predictions with wrong coarse-grained types

Encoding Strategy Template T(x) Acc MiF MaF

Hard-encodingx [Ent] is [MASK] 5445 6734 6734x [Ent] is a [MASK] 5393 6644 6644x In this sentence [E] is [MASK] 5575 6874 6874

Soft-encoding

x [P] [Ent] [P1] [Pl] [MASK] l = 2 5925 6958 6958x [P] [Ent] [P1] [Pl] [MASK] l = 3 5366 6606 6606x [P] [Ent] [P1] [Pl] [MASK] l = 4 5296 6601 6601x [P] [Ent] [P1] [Pl] [MASK] l = 5 5544 6839 6839

Table 5 Effect of templates The results are produced under 8-shot setting on FEW-NERD dataset by PLET

2020) fine-tuned PLMs have demonstrated theireffectiveness on various important NLP tasks suchas dialogue generation (Zhang et al 2020) textsummarization (Zhang et al 2019 Liu and Lap-ata 2019) question answering (Adiwardana et al2020) and text classification (Baldini Soares et al2019 Peng et al 2020 Ding et al 2021a)

Despite the success of fine-tuning PLMs thehuge objective form gap between pre-training andfine-tuning still hinders the full use of per-trainedknowledge for downstream tasks (Liu et al 2021bHan et al 2021b Hu et al 2021) To this endprompt-learning has been proposed In prompt-learning by leveraging language prompts as con-texts downstream tasks can be expressed as some

cloze-style objectives similar to those pre-trainingobjectives The seminal work that stimulates the de-velopment of prompt-learning is the birth of GPT-3 (Brown et al 2020) which uses hand-craftedprompts for tuning and achieves very impressiveperformance on various tasks especially under thesetting of few-shot learning

Inspired by GPT-3 a series of hand-craftedprompts have been widely explored in knowledgeprobing (Trinh and Le 2018 Petroni et al 2019Davison et al 2019) relation classification (Hanet al 2021b) entiment classification and naturallanguage inference (Schick and Schuumltze 2021 Liuet al 2021b) To avoid labor-intensive promptdesign automatic prompt search has also been

extensively explored Schick et al (2020) Schickand Schuumltze (2021) Shin et al (2020) Gao et al(2020) Liu et al (2021a) to generate languagephrases for prompts Recently some continuousprompts have also been proposed (Li and Liang2021 Lester et al 2021) which directly usea series of learnable continuous embeddings asprompts rather than discrete language phrases

In this paper we aim to stimulate PLMs withprompt-learning to capture the attribute informa-tion of entities We take fine-grained entity typinga crucial task in knowledge extraction to assignentity types to entity mentions (Lin et al 2012) asthe foothold to develop prompt-learning strategiesIn fact Dai et al (2021) use hypernym extractionpatterns to enhance the context and apply maskedlanguage modeling to tackle the ultra-fine entitytyping problem (Choi et al 2018) with free-formlabels which shares a similar idea with prompt-learning In our work we mainly emphasize usingprompt-learning to extract entity types that havebeen pre-defined in low-data scenarios

7 Conclusion

This work investigates the application of prompt-learning on fine-grained entity typing More specif-ically we proposes a framework PLET that coulddeal with fine-grained entity typing in fully super-vised few-shot and zero-shot scenarios In PLETwe first introduce a simple and effective prompt-learning pipeline that could be used to extract entitytypes with both sufficient and insufficient supervi-sion Furthermore to handle the zero-shot settingwe propose a self-supervised prompt-learning ap-proach that automatically learns and summarizesentity types based on unlabeled corpora and a pre-defined label schema PLET utilizes prompts totake advantage of prior knowledge distributed inPLMs and could learn pre-defined type informa-tion without overfitting by performing distribution-level optimization In our future work along the di-rection of PLET (S) we will explore better prompt-learning approaches to automatically learning en-tity types from unlabeled data

ReferencesDaniel Adiwardana Minh-Thang Luong David R So

Jamie Hall Noah Fiedel Romal Thoppilan Zi YangApoorv Kulshreshtha Gaurav Nemade Yifeng Luet al 2020 Towards a human-like open-domainchatbot arXiv preprint arXiv200109977

Livio Baldini Soares Nicholas FitzGerald JeffreyLing and Tom Kwiatkowski 2019 Matching theblanks Distributional similarity for relation learn-ing In Proceedings of the 57th Annual Meetingof the Association for Computational Linguisticspages 2895ndash2905 Florence Italy Association forComputational Linguistics

Tom B Brown Benjamin Mann Nick Ryder MelanieSubbiah Jared Kaplan Prafulla Dhariwal ArvindNeelakantan Pranav Shyam Girish Sastry AmandaAskell et al 2020 Language models are few-shotlearners In Proceedings of NIPS pages 1877ndash1901

Eunsol Choi Omer Levy Yejin Choi and Luke Zettle-moyer 2018 Ultra-fine entity typing In Proceed-ings of ACL pages 87ndash96

Hongliang Dai Yangqiu Song and Haixun Wang2021 Ultra-fine entity typing with weak supervisionfrom a masked language model In Proceedings ofACL pages 1790ndash1799

Joe Davison Joshua Feldman and Alexander M Rush2019 Commonsense knowledge mining from pre-trained models In Proceedings of EMNLP-IJCNLPpages 1173ndash1178

Jacob Devlin Ming-Wei Chang Kenton Lee andKristina Toutanova 2019 Bert Pre-training of deepbidirectional transformers for language understand-ing In Proceedings of NAACL-HLT pages 4171ndash4186

Ning Ding Xiaobin Wang Yao Fu Guangwei Xu RuiWang Pengjun Xie Ying Shen Fei Huang Hai-TaoZheng and Rui Zhang 2021a Prototypical repre-sentation learning for relation extraction In Pro-ceedings of ICLR

Ning Ding Guangwei Xu Yulin Chen Xiaobin WangXu Han Pengjun Xie Hai-Tao Zheng and ZhiyuanLiu 2021b Few-nerd A few-shot named entityrecognition dataset In Proceedings of ACL pages3198ndash3213

Tianyu Gao Adam Fisch and Danqi Chen 2020Making pre-trained language models better few-shotlearners arXiv preprint arXiv201215723

Xu Han Zhengyan Zhang Ning Ding Yuxian GuXiao Liu Yuqi Huo Jiezhong Qiu Liang ZhangWentao Han Minlie Huang et al 2021a Pre-trained models Past present and future arXivpreprint arXiv210607139

Xu Han Weilin Zhao Ning Ding Zhiyuan Liuand Maosong Sun 2021b Ptr Prompt tuningwith rules for text classification arXiv preprintarXiv210511259

John Hewitt and Christopher D Manning 2019 Astructural probe for finding syntax in word repre-sentations In Proceedings of NAACL pages 4129ndash4138

Shengding Hu Ning Ding Huadong Wang ZhiyuanLiu Juanzi Li and Maosong Sun 2021 Knowl-edgeable prompt-tuning Incorporating knowledgeinto prompt verbalizer for text classification arXivpreprint arXiv210802035

Ganesh Jawahar Benoicirct Sagot and Djameacute Seddah2019 What does bert learn about the structure oflanguage In Proceedings of ACL pages 3651ndash3657

Brian Lester Rami Al-Rfou and Noah Constant 2021The power of scale for parameter-efficient prompttuning arXiv preprint arXiv210408691

Xiang Lisa Li and Percy Liang 2021 Prefix-tuning Optimizing continuous prompts for genera-tion arXiv preprint arXiv210100190

Thomas Lin Oren Etzioni et al 2012 No noun phraseleft behind detecting and typing unlinkable entitiesIn Proceedings of EMNLP-CoNLL pages 893ndash903

Xiao Ling and Daniel S Weld 2012 Fine-grained en-tity recognition In AAAI

Pengfei Liu Weizhe Yuan Jinlan Fu Zhengbao JiangHiroaki Hayashi and Graham Neubig 2021a Pre-train prompt and predict A systematic survey ofprompting methods in natural language processingarXiv preprint arXiv210713586

Xiao Liu Yanan Zheng Zhengxiao Du Ming DingYujie Qian Zhilin Yang and Jie Tang 2021b Gptunderstands too arXiv preprint arXiv210310385

Yang Liu and Mirella Lapata 2019 Text summariza-tion with pretrained encoders In Proceedings ofthe 2019 Conference on Empirical Methods in Nat-ural Language Processing and the 9th InternationalJoint Conference on Natural Language Processing(EMNLP-IJCNLP) pages 3730ndash3740 Hong KongChina Association for Computational Linguistics

Yinhan Liu Myle Ott Naman Goyal Jingfei Du Man-dar Joshi Danqi Chen Omer Levy Mike LewisLuke Zettlemoyer and Veselin Stoyanov 2019Roberta A robustly optimized bert pretraining ap-proach arXiv preprint arXiv190711692

Ilya Loshchilov and Frank Hutter 2019 Decoupledweight decay regularization In Proceedings ofICLR

Adam Paszke Sam Gross Francisco Massa AdamLerer James Bradbury Gregory Chanan TrevorKilleen Zeming Lin Natalia Gimelshein LucaAntiga Alban Desmaison Andreas Koumlpf EdwardYang Zachary DeVito Martin Raison Alykhan Te-jani Sasank Chilamkurthy Benoit Steiner Lu FangJunjie Bai and Soumith Chintala 2019 PytorchAn imperative style high-performance deep learn-ing library In Proceedings of NIPS pages 8024ndash8035

Hao Peng Tianyu Gao Xu Han Yankai Lin PengLi Zhiyuan Liu Maosong Sun and Jie Zhou 2020Learning from Context or Names An EmpiricalStudy on Neural Relation Extraction In Proceed-ings of EMNLP pages 3661ndash3672

Fabio Petroni Tim Rocktaumlschel Sebastian RiedelPatrick Lewis Anton Bakhtin Yuxiang Wu andAlexander Miller 2019 Language models as knowl-edge bases In Proceedings of EMNLP pages2463ndash2473

Xipeng Qiu Tianxiang Sun Yige Xu Yunfan ShaoNing Dai and Xuanjing Huang 2020 Pre-trainedmodels for natural language processing A surveyScience China Technological Sciences pages 1ndash26

Alec Radford Karthik Narasimhan Tim Salimans andIlya Sutskever 2018 Improving language under-standing by generative pre-training OpenAI

Colin Raffel Noam Shazeer Adam Roberts KatherineLee Sharan Narang Michael Matena Yanqi ZhouWei Li and Peter J Liu 2020 Exploring the limitsof transfer learning with a unified text-to-text trans-former JMLR 211ndash67

Xiang Ren Wenqi He Meng Qu Lifu Huang HengJi and Jiawei Han 2016a AFET Automatic fine-grained entity typing by hierarchical partial-labelembedding In Proceedings of the 2016 Conferenceon Empirical Methods in Natural Language Process-ing pages 1369ndash1378 Austin Texas Associationfor Computational Linguistics

Xiang Ren Wenqi He Meng Qu Clare R Voss HengJi and Jiawei Han 2016b Label noise reduction inentity typing by heterogeneous partial-label embed-ding In Proceedings of SIGKDD page 1825ndash1834

Timo Schick Helmut Schmid and Hinrich Schuumltze2020 Automatically identifying words that canserve as labels for few-shot text classification InProceedings of COLING pages 5569ndash5578

Timo Schick and Hinrich Schuumltze 2021 Exploitingcloze-questions for few-shot text classification andnatural language inference In Proceedings of EACLpages 255ndash269

Sonse Shimaoka Pontus Stenetorp Kentaro Inui andSebastian Riedel 2017 Neural architectures forfine-grained entity type classification In Proceed-ings of the 15th Conference of the European Chap-ter of the Association for Computational LinguisticsVolume 1 Long Papers pages 1271ndash1280 ValenciaSpain Association for Computational Linguistics

Taylor Shin Yasaman Razeghi Robert L Logan IVEric Wallace and Sameer Singh 2020 AutopromptEliciting knowledge from language models using au-tomatically generated prompts In Proceedings ofEMNLP pages 4222ndash4235

Trieu H Trinh and Quoc V Le 2018 A simplemethod for commonsense reasoning arXiv preprintarXiv180602847

Dong Wang Ning Ding Piji Li and Haitao Zheng2021 CLINE Contrastive learning with semanticnegative examples for natural language understand-ing In Proceedings of ACL

Ralph Weischedel and Ada Brunstein 2005 BBN Pro-noun Coreference and Entity Type Corpus Linguis-tic Data Consortium Philadelphia

Ralph Weischedel Martha Palmer Mitchell MarcusEduard Hovy Sameer Pradhan Lance Ramshaw Ni-anwen Xue Ann Taylor Jeff Kaufman MichelleFranchini Mohammed El-Bachouti Robert Belvinand Ann Houston 2013 OntoNotes Release 50Abacus Data Network

Thomas Wolf Lysandre Debut Victor Sanh JulienChaumond Clement Delangue Anthony Moi Pier-ric Cistac Tim Rault Remi Louf Morgan Funtow-icz Joe Davison Sam Shleifer Patrick von PlatenClara Ma Yacine Jernite Julien Plu Canwen XuTeven Le Scao Sylvain Gugger Mariama DrameQuentin Lhoest and Alexander Rush 2020 Trans-formers State-of-the-art natural language process-ing In Proceedings of EMNLP pages 38ndash45

Jingqing Zhang Yao Zhao Mohammad Saleh and Pe-ter J Liu 2019 Pegasus Pre-training with ex-tracted gap-sentences for abstractive summarizationIn Proceedings of ICML pages 11328ndash11339

Yizhe Zhang Siqi Sun Michel Galley Yen-Chun ChenChris Brockett Xiang Gao Jianfeng Gao JingjingLiu and Bill Dolan 2020 Dialogpt Large-scalegenerative pre-training for conversational responsegeneration In Proceedings of ACL pages 270ndash278

Tony Z Zhao Eric Wallace Shi Feng Dan Klein andSameer Singh 2021 Calibrate before use Im-proving few-shot performance of language modelsarXiv preprint arXiv210209690

52 Experimental SettingsThe experiments are performed under three differ-ent settings to evaluate the effect of the prompt-learning method and semi-supervised training Intable 1 we show the statistics of all the settings onthe three datasets

Supervised Setting In a fully supervised set-ting all training data are used in the training phaseFT and PLET are used to train the model We runthe experiments on all three datasets with BERT-base-cased backbone Both hard and soft encodingsare used for PLET

Few-shot Setting In a few-shot setting werandomly sample 1 2 4 8 16 instances for eachentity type for training We apply both FT andPLET methods with hard encoding on all the threedatasets

Zero-shot Setting In zero-shot setting notraining data with labels are available The model isrequired to infer the entity type without any super-vised training Since fine-tuning is not applicablein this setting we only conduct experiments onPLET and PLET (S)

Metrics In terms of evaluation metrics wefollow the widely used setting of Ling and Weld(2012) which includes strict accuracy (Acc) loosemacro F1-score (MaF) and loose micro F1-score(MiF) to evaluate the performances of models Theloose F1-score calculation concerns type labels bydifferent granularities

53 Experimental DetailsWe use BERT-base (Devlin et al 2019) as thebackbone structures of our model and initializedwith the corresponding pre-trained cased weights2The hidden sizes are 768 and the number of lay-ers are 12 Models are implemented by Pytorchframework3 (Paszke et al 2019) and Huggingfacetransformers4 (Wolf et al 2020) BERT modelsare optimized by AdamW (Loshchilov and Hutter2019) with the learning rate of 5e-5 The trainingbatch size used is 16 for all models In the super-vised setting each model is trained for 10 epochsand evaluated on the dev set every 2000 steps Inthe few-shot setting each model is trained for 30epochs and evaluated every 10sim50 steps each timethe evaluation is run for 200 steps For the methods

2httpsgithubcomgoogle-researchbert

3httpspytorchorg4httpsgithubcomhuggingface

transformers

with hard-encoding we report the experimental re-sults of T3(middot) For the soft-encoding method wereport the results of m = 2 Experiments are con-ducted with CUDA on NVIDIA Tesla V100 GPUs

54 Results of Fully Supervised Entity Typing

Dataset Metric Method

FT PLET (H) PLET (S)

Few-NERDAcc 7975 7990 7986MiF 8574 8584 8576MaF 8574 8584 8576

OntoNotesAcc 5971 6037 6568MiF 7047 7078 7453MaF 7657 7642 7977

BBNAcc 6239 6592 6311MiF 6888 7155 6868MaF 6737 7082 6781

Table 3 Fully supervised entity typing results FTdenotes the vanilla fine-tuning method (H) denotesthe hard-encoding strategy and (S) denotes the soft-encoding strategy All the methods use BERTbase withsame initialization weights as the backbone encoder

The results on all three datasets across differentmodels are reported in Table 3 Overall the prompt-based methods have shown certain improvementscomparing to directly fine-tuned models It showsthat the prompt-based method does help with cap-turing entity-type information from a given context

It is also observed that the magnitude of the im-provement and the preference of prompt encod-ing strategy may vary with different datasets Theprompt-based method seems less effective on FEW-NERD dataset than the other two It indicates thatthe effect of the prompt-based method partially de-pends on the characteristics of the dataset and thatdifferent prompt designs may suit different dataSpecifically FEW-NERD is manually annotatedand contains much less noise than the other twodatasets benefiting the FT method to learn classi-fication with an extra linear layer Moreover forthe OntoNotes dataset soft encoding significantlyoutperforms hard encoding while for the other twodatasets the effect seems reversed

5.5 Results of Few-shot Entity Typing

Table 2 shows the results on few-shot entity typing. The prompt-based model outperforms fine-tuning by a large margin under the few-shot setting, especially when only 1∼2 training instances per type are available.

Dataset      Metric   PLET    PLET (S)
Few-NERD     Acc      17.55   23.99 (+6.44)
             MiF      28.39   47.98 (+19.59)
             MaF      28.39   47.98 (+19.59)
OntoNotes†   Acc      25.10   28.27 (+3.17)
             MiF      33.61   49.79 (+16.18)
             MaF      37.91   49.95 (+12.04)
BBN          Acc      55.82   57.79 (+1.97)
             MiF      60.64   63.24 (+2.60)
             MaF      59.99   64.00 (+4.01)

Table 4: Results of zero-shot entity typing on Few-NERD, OntoNotes, and BBN. † means that we remove the "Other" class during testing. PLET denotes the prompt-learning pipeline and PLET (S) denotes self-supervised prompt-learning; both methods use BERT-base as the backbone encoder.

It should be noted that for the OntoNotes and BBN datasets, sampling 16 instances for each entity type already amounts to over 0.5% of the total training data. Meanwhile, some of the data in BBN are distantly supervised and potentially erroneous, which brings more randomness to few-shot training. The results support the idea that a well-designed prompt has much potential in mining the knowledge learned by pre-trained models and thus yields better performance in few-shot settings. The results also indicate that even when the number of entity types is large (46∼86), the superiority of prompt-learning still holds.

5.6 Results of Zero-shot Entity Typing

Table 4 shows the results of zero-shot entity typing. We do not report the performance of the vanilla fine-tuning approach because it cannot produce reasonable results with a randomly initialized classifier. It should also be noted that the prompt method without any fine-tuning already outperforms random guessing, indicating that adding a prompt is informative for a model pre-trained with the masked-language-modeling objective (e.g., BERT) and can induce reasonable predictions for entity typing. Second, the performance of the model improves by a large margin when trained on unlabeled data. This shows the effectiveness of the proposed self-supervised training approach and points to the potential of a pre-trained prompt-based model under the zero-shot setting, when no labeled data are available.
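For intuition, the following sketch shows how an off-the-shelf masked language model can already produce zero-shot type predictions through a prompt and a verbalizer; the toy verbalizer and the helper function are illustrative assumptions, not the full label-word sets used in the experiments.

import torch
from transformers import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
model = BertForMaskedLM.from_pretrained("bert-base-cased").eval()

# toy verbalizer: each type is mapped to a few label words
verbalizer = {
    "person-artist":     ["artist", "singer"],
    "location-mountain": ["mountain", "peak"],
}

def zero_shot_type(sentence, mention):
    # score the [MASK] slot and pick the type whose label words get the most probability
    prompt = f"{sentence} In this sentence, {mention} is [MASK]."
    enc = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        logits = model(**enc).logits[0]
    mask_idx = (enc["input_ids"][0] == tokenizer.mask_token_id).nonzero()[0, 0]
    probs = logits[mask_idx].softmax(-1)
    scores = {
        t: float(sum(probs[tokenizer.convert_tokens_to_ids(w)] for w in words))
        for t, words in verbalizer.items()
    }
    return max(scores, key=scores.get)

# zero_shot_type("Bob Dylan wrote Blowing in the Wind.", "Bob Dylan")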

To explore more subtle changes in performance, we carry out a case study for zero-shot entity typing. In Figure 4, we illustrate the zero-shot prediction distribution (the correct prediction and the other top-5 predictions) for four entity types in Few-NERD: ORG-SPORTSLEAGUE, EVENT-ATTACK, MISC-CURRENCY, and LOC-MOUNTAIN. We observe that with self-supervised prompt-learning, PLET (S) can summarize entity type information and infer the related words to a certain extent. In Figure 4 (a) and Figure 4 (b), the PLET model suffers from a severe bias and predicts almost no correct labels in the zero-shot setting, since such words are low-frequency. Although there is no explicit supervision in the pre-training stage of PLET (S), the model can still find the corresponding words that express the ORG-SPORTSLEAGUE and EVENT-ATTACK types. In Figure 4 (c), self-supervised learning increases the performance of the original encoder. Further, in Figure 4 (d), PLET is already able to make satisfying predictions for the type LOC-MOUNTAIN; in this case, self-supervised learning has hardly weakened the performance, which means that the process of automatically summarizing type information has little negative impact on high-confidence entity types.

5.7 Effect of Templates

As stated in previous studies (Gao et al., 2020; Zhao et al., 2021), the choice of templates may have a huge impact on performance in prompt-learning. In this section, we carry out experiments to investigate this influence. Experiments are conducted under the 8-shot setting on the Few-NERD dataset, and we use 3 different hard-encoding templates and 4 soft-encoding templates (obtained by changing the number of prompt tokens m). The results in Table 5 demonstrate that the choice of templates exerts a considerable influence on the performance of prompt-based few-shot learning. For the hard-encoding templates, the phrase that describes the location, "in this sentence", contributes a remarkable improvement in performance. For the soft-encoding templates, surprisingly, the prompt-learning model yields the best result with the fewest special tokens.
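The templates compared in Table 5 can be assembled as in the sketch below; the helper functions are illustrative, and the soft tokens [P], [P1], ..., [Pm] stand for learnable embeddings whose surface forms are fixed here but whose parameters would be trained.

def hard_template(x, ent, variant=3):
    # three hard-encoding templates; variant 3 ("in this sentence") performs best
    if variant == 1:
        return f"{x} {ent} is [MASK]."
    if variant == 2:
        return f"{x} {ent} is a [MASK]."
    return f"{x} In this sentence, {ent} is [MASK]."

def soft_template(x, ent, m=2):
    # [P] and [P1]...[Pm] are placeholders for learnable soft tokens
    soft = " ".join(f"[P{i}]" for i in range(1, m + 1))
    return f"{x} [P] {ent} {soft} [MASK]"

print(soft_template("London is a city.", "London", m=2))
# -> "London is a city. [P] London [P1] [P2] [MASK]"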

[Figure 4: Zero-shot prediction distributions (the correct prediction and the other top-5 predictions) on four types in Few-NERD: (a) ORG-SPORTSLEAGUE, (b) EVENT-ATTACK, (c) MISC-CURRENCY, (d) LOC-MOUNTAIN. In each subgraph, the left part illustrates the results of PLET and the right part the results of PLET (S); bars distinguish correct predictions, wrong predictions with the correct coarse-grained type, and wrong predictions with a wrong coarse-grained type.]

Encoding Strategy   Template T(x)                                Acc     MiF     MaF
Hard-encoding       x [Ent] is [MASK].                           54.45   67.34   67.34
                    x [Ent] is a [MASK].                         53.93   66.44   66.44
                    x In this sentence, [Ent] is [MASK].         55.75   68.74   68.74
Soft-encoding       x [P] [Ent] [P1] ... [Pm] [MASK]  (m = 2)    59.25   69.58   69.58
                    x [P] [Ent] [P1] ... [Pm] [MASK]  (m = 3)    53.66   66.06   66.06
                    x [P] [Ent] [P1] ... [Pm] [MASK]  (m = 4)    52.96   66.01   66.01
                    x [P] [Ent] [P1] ... [Pm] [MASK]  (m = 5)    55.44   68.39   68.39

Table 5: Effect of templates. The results are produced under the 8-shot setting on the Few-NERD dataset by PLET.

6 Related Work

After a series of effective PLMs like GPT (Radford et al., 2018), BERT (Devlin et al., 2019), RoBERTa (Liu et al., 2019), and T5 (Raffel et al., 2020), fine-tuned PLMs have demonstrated their effectiveness on various important NLP tasks such as dialogue generation (Zhang et al., 2020), text summarization (Zhang et al., 2019; Liu and Lapata, 2019), question answering (Adiwardana et al., 2020), and text classification (Baldini Soares et al., 2019; Peng et al., 2020; Ding et al., 2021a).

Despite the success of fine-tuning PLMs, the huge gap in objective forms between pre-training and fine-tuning still hinders the full use of pre-trained knowledge for downstream tasks (Liu et al., 2021b; Han et al., 2021b; Hu et al., 2021). To this end, prompt-learning has been proposed. In prompt-learning, by leveraging language prompts as contexts, downstream tasks can be expressed as cloze-style objectives similar to the pre-training objectives. The seminal work that stimulates the development of prompt-learning is the birth of GPT-3 (Brown et al., 2020), which uses hand-crafted prompts for tuning and achieves very impressive performance on various tasks, especially under the few-shot learning setting.

Inspired by GPT-3, a series of hand-crafted prompts have been widely explored in knowledge probing (Trinh and Le, 2018; Petroni et al., 2019; Davison et al., 2019), relation classification (Han et al., 2021b), sentiment classification, and natural language inference (Schick and Schütze, 2021; Liu et al., 2021b). To avoid labor-intensive prompt design, automatic prompt search has also been extensively explored (Schick et al., 2020; Schick and Schütze, 2021; Shin et al., 2020; Gao et al., 2020; Liu et al., 2021a) to generate language phrases for prompts. Recently, continuous prompts have also been proposed (Li and Liang, 2021; Lester et al., 2021), which directly use a series of learnable continuous embeddings as prompts rather than discrete language phrases.

In this paper, we aim to stimulate PLMs with prompt-learning to capture the attribute information of entities. We take fine-grained entity typing, a crucial task in knowledge extraction that assigns entity types to entity mentions (Lin et al., 2012), as the foothold to develop prompt-learning strategies. In fact, Dai et al. (2021) use hypernym extraction patterns to enhance the context and apply masked language modeling to tackle the ultra-fine entity typing problem (Choi et al., 2018) with free-form labels, which shares a similar idea with prompt-learning. In our work, we mainly emphasize using prompt-learning to extract entity types that have been pre-defined, in low-data scenarios.

7 Conclusion

This work investigates the application of prompt-learning to fine-grained entity typing. More specifically, we propose a framework, PLET, that deals with fine-grained entity typing in fully supervised, few-shot, and zero-shot scenarios. In PLET, we first introduce a simple and effective prompt-learning pipeline that can be used to extract entity types with both sufficient and insufficient supervision. Furthermore, to handle the zero-shot setting, we propose a self-supervised prompt-learning approach that automatically learns and summarizes entity types based on unlabeled corpora and a pre-defined label schema. PLET utilizes prompts to take advantage of prior knowledge distributed in PLMs, and it can learn pre-defined type information without overfitting by performing distribution-level optimization. In future work, along the direction of PLET (S), we will explore better prompt-learning approaches for automatically learning entity types from unlabeled data.

References

Daniel Adiwardana, Minh-Thang Luong, David R. So, Jamie Hall, Noah Fiedel, Romal Thoppilan, Zi Yang, Apoorv Kulshreshtha, Gaurav Nemade, Yifeng Lu, et al. 2020. Towards a human-like open-domain chatbot. arXiv preprint arXiv:2001.09977.

Livio Baldini Soares, Nicholas FitzGerald, Jeffrey Ling, and Tom Kwiatkowski. 2019. Matching the blanks: Distributional similarity for relation learning. In Proceedings of ACL, pages 2895–2905.

Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. In Proceedings of NIPS, pages 1877–1901.

Eunsol Choi, Omer Levy, Yejin Choi, and Luke Zettlemoyer. 2018. Ultra-fine entity typing. In Proceedings of ACL, pages 87–96.

Hongliang Dai, Yangqiu Song, and Haixun Wang. 2021. Ultra-fine entity typing with weak supervision from a masked language model. In Proceedings of ACL, pages 1790–1799.

Joe Davison, Joshua Feldman, and Alexander M. Rush. 2019. Commonsense knowledge mining from pre-trained models. In Proceedings of EMNLP-IJCNLP, pages 1173–1178.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of NAACL-HLT, pages 4171–4186.

Ning Ding, Xiaobin Wang, Yao Fu, Guangwei Xu, Rui Wang, Pengjun Xie, Ying Shen, Fei Huang, Hai-Tao Zheng, and Rui Zhang. 2021a. Prototypical representation learning for relation extraction. In Proceedings of ICLR.

Ning Ding, Guangwei Xu, Yulin Chen, Xiaobin Wang, Xu Han, Pengjun Xie, Hai-Tao Zheng, and Zhiyuan Liu. 2021b. Few-NERD: A few-shot named entity recognition dataset. In Proceedings of ACL, pages 3198–3213.

Tianyu Gao, Adam Fisch, and Danqi Chen. 2020. Making pre-trained language models better few-shot learners. arXiv preprint arXiv:2012.15723.

Xu Han, Zhengyan Zhang, Ning Ding, Yuxian Gu, Xiao Liu, Yuqi Huo, Jiezhong Qiu, Liang Zhang, Wentao Han, Minlie Huang, et al. 2021a. Pre-trained models: Past, present and future. arXiv preprint arXiv:2106.07139.

Xu Han, Weilin Zhao, Ning Ding, Zhiyuan Liu, and Maosong Sun. 2021b. PTR: Prompt tuning with rules for text classification. arXiv preprint arXiv:2105.11259.

John Hewitt and Christopher D. Manning. 2019. A structural probe for finding syntax in word representations. In Proceedings of NAACL, pages 4129–4138.

Shengding Hu, Ning Ding, Huadong Wang, Zhiyuan Liu, Juanzi Li, and Maosong Sun. 2021. Knowledgeable prompt-tuning: Incorporating knowledge into prompt verbalizer for text classification. arXiv preprint arXiv:2108.02035.

Ganesh Jawahar, Benoît Sagot, and Djamé Seddah. 2019. What does BERT learn about the structure of language? In Proceedings of ACL, pages 3651–3657.

Brian Lester, Rami Al-Rfou, and Noah Constant. 2021. The power of scale for parameter-efficient prompt tuning. arXiv preprint arXiv:2104.08691.

Xiang Lisa Li and Percy Liang. 2021. Prefix-tuning: Optimizing continuous prompts for generation. arXiv preprint arXiv:2101.00190.

Thomas Lin, Oren Etzioni, et al. 2012. No noun phrase left behind: Detecting and typing unlinkable entities. In Proceedings of EMNLP-CoNLL, pages 893–903.

Xiao Ling and Daniel S. Weld. 2012. Fine-grained entity recognition. In AAAI.

Pengfei Liu, Weizhe Yuan, Jinlan Fu, Zhengbao Jiang, Hiroaki Hayashi, and Graham Neubig. 2021a. Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing. arXiv preprint arXiv:2107.13586.

Xiao Liu, Yanan Zheng, Zhengxiao Du, Ming Ding, Yujie Qian, Zhilin Yang, and Jie Tang. 2021b. GPT understands, too. arXiv preprint arXiv:2103.10385.

Yang Liu and Mirella Lapata. 2019. Text summarization with pretrained encoders. In Proceedings of EMNLP-IJCNLP, pages 3730–3740.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.

Ilya Loshchilov and Frank Hutter. 2019. Decoupled weight decay regularization. In Proceedings of ICLR.

Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Köpf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. 2019. PyTorch: An imperative style, high-performance deep learning library. In Proceedings of NIPS, pages 8024–8035.

Hao Peng, Tianyu Gao, Xu Han, Yankai Lin, Peng Li, Zhiyuan Liu, Maosong Sun, and Jie Zhou. 2020. Learning from context or names? An empirical study on neural relation extraction. In Proceedings of EMNLP, pages 3661–3672.

Fabio Petroni, Tim Rocktäschel, Sebastian Riedel, Patrick Lewis, Anton Bakhtin, Yuxiang Wu, and Alexander Miller. 2019. Language models as knowledge bases? In Proceedings of EMNLP, pages 2463–2473.

Xipeng Qiu, Tianxiang Sun, Yige Xu, Yunfan Shao, Ning Dai, and Xuanjing Huang. 2020. Pre-trained models for natural language processing: A survey. Science China Technological Sciences, pages 1–26.

Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding by generative pre-training. OpenAI.

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. JMLR, 21:1–67.

Xiang Ren, Wenqi He, Meng Qu, Lifu Huang, Heng Ji, and Jiawei Han. 2016a. AFET: Automatic fine-grained entity typing by hierarchical partial-label embedding. In Proceedings of EMNLP, pages 1369–1378.

Xiang Ren, Wenqi He, Meng Qu, Clare R. Voss, Heng Ji, and Jiawei Han. 2016b. Label noise reduction in entity typing by heterogeneous partial-label embedding. In Proceedings of SIGKDD, pages 1825–1834.

Timo Schick, Helmut Schmid, and Hinrich Schütze. 2020. Automatically identifying words that can serve as labels for few-shot text classification. In Proceedings of COLING, pages 5569–5578.

Timo Schick and Hinrich Schütze. 2021. Exploiting cloze-questions for few-shot text classification and natural language inference. In Proceedings of EACL, pages 255–269.

Sonse Shimaoka, Pontus Stenetorp, Kentaro Inui, and Sebastian Riedel. 2017. Neural architectures for fine-grained entity type classification. In Proceedings of EACL, pages 1271–1280.

Taylor Shin, Yasaman Razeghi, Robert L. Logan IV, Eric Wallace, and Sameer Singh. 2020. AutoPrompt: Eliciting knowledge from language models using automatically generated prompts. In Proceedings of EMNLP, pages 4222–4235.

Trieu H. Trinh and Quoc V. Le. 2018. A simple method for commonsense reasoning. arXiv preprint arXiv:1806.02847.

Dong Wang, Ning Ding, Piji Li, and Haitao Zheng. 2021. CLINE: Contrastive learning with semantic negative examples for natural language understanding. In Proceedings of ACL.

Ralph Weischedel and Ada Brunstein. 2005. BBN Pronoun Coreference and Entity Type Corpus. Linguistic Data Consortium, Philadelphia.

Ralph Weischedel, Martha Palmer, Mitchell Marcus, Eduard Hovy, Sameer Pradhan, Lance Ramshaw, Nianwen Xue, Ann Taylor, Jeff Kaufman, Michelle Franchini, Mohammed El-Bachouti, Robert Belvin, and Ann Houston. 2013. OntoNotes Release 5.0. Abacus Data Network.

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Remi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander Rush. 2020. Transformers: State-of-the-art natural language processing. In Proceedings of EMNLP, pages 38–45.

Jingqing Zhang, Yao Zhao, Mohammad Saleh, and Peter J. Liu. 2019. PEGASUS: Pre-training with extracted gap-sentences for abstractive summarization. In Proceedings of ICML, pages 11328–11339.

Yizhe Zhang, Siqi Sun, Michel Galley, Yen-Chun Chen, Chris Brockett, Xiang Gao, Jianfeng Gao, Jingjing Liu, and Bill Dolan. 2020. DialoGPT: Large-scale generative pre-training for conversational response generation. In Proceedings of ACL, pages 270–278.

Tony Z. Zhao, Eric Wallace, Shi Feng, Dan Klein, and Sameer Singh. 2021. Calibrate before use: Improving few-shot performance of language models. arXiv preprint arXiv:2102.09690.

Dataset Metric Method

PLET PLET (S)

Few-NERDAcc 1755 2399 (+644)MiF 2839 4798 (+1959)MaF 2839 4798 (+1959)

OntoNotesDaggerAcc 2510 2827 (+317)MiF 3361 4979 (+1618)MaF 3791 4995 (+1204)

BBNAcc 5582 5779 (+197)MiF 6064 6324 (+260)MaF 5999 6400 (+401)

Table 4 Results of zero-shot entity typing on FEW-NERD OntoNotes and BBN Dagger means that we re-move the ldquoOtherrdquo class during testing PLET denotesthe prompt-learning pipeline and PLET (S) denotesself-supervised prompt-learning both methods use theBERTbase as the backbone encoder

that for OntoNotes and BBN datasets sampling 16instances for each entity type already amounts toover 05 of the total training data Meanwhilesome of the data in BBN are distantly-supervisedand are potentially erroneous It brings more ran-domness to few-shot training The results supportthe idea that a well-designed prompt has muchpotential in mining the learned knowledge in pre-trained models and thus yields better performancein few-shot settings The results also indicate thateven when the number of entity types is large (46sim86) the superiority of prompt-learning still holds

56 Results of Zero-shot Entity Typing

Table 4 shows the results on zero-shot entity typ-ing task on FEW-NERD dataset We did not re-port the performance of the vanilla fine-tuning ap-proach because it cannot produce reasonable resultswith a randomly initialized classifier And it alsoshould be noted that the prompt method withoutfine-tuning already outperforms random guessingIt indicates that adding a prompt is informative fora model pre-trained on masked-language-modeltask (eg BERT) and can induce reasonable pre-dictions in entity typing tasks Second the perfor-mance of the model improves by a large margin iftrained on unlabeled data It shows the effective-ness of the proposed self-supervised training ap-proach and points to the potential of a pre-trainedprompt-based model under the zero-shot settingwhen no labeled data are available

To explore the more subtle changes in perfor-

mance we carry out case study for the zero-shotentity typing In Figure 4 we illustrate the zero-shot prediction distribution (the correct predictionand other top-5 predictions) for four entity typesin FEW-NERD which are ORG-SPORTSTEAMEVENT-ATTACK MISC-CURRENCY and LOC-MOUNTAIN We could observe that with self-supervised prompt-learning PLET (S) could sum-marize entity type information and infer the relatedwords to a certain extent In Figure 4 (a) and Fig-ure 4 (b) the PLET model suffers from a severe biasand almost predict no correct labels in the zero-shotsetting since such words are low-frequency Andalthough there is no explicit supervision in the pre-training stage of UNPLET the model could stillfind the corresponding words that express the ORG-SPORTSLEAGUE and the EVENT-ATTACK typesIn Figure 4 (c) self-supervised learning increasesthe performance of the original encoder Furtherin Figure 4 (d) PLET has been able to make sat-isfying predictions for this type LOC-MOUNTAINIn this case the use of self-supervised learning hashardly weakened the performance which meansthat the process of automatically summarizing typeinformation has a little negative impact on high-confidence entity types

57 Effect of Templates

As stated in previous studies (Gao et al 2020Zhao et al 2021) the choice of templates mayhave a huge impact on the performance in prompt-learning In this section we carry out experi-ments to investigate such influence Experimentsare conducted under the 8-shot setting on FEW-NERD dataset and we use 3 different hard en-coding templates and 4 soft encoding templates(by changing the number of prompt tokens m)The results demonstrate that the choice of tem-plates exerts a considerable influence on the per-formance of prompt-based few-shot learning Forthe hard-encoding templates the phrase that de-scribes the location ldquoin this sentencerdquo contributes aremarkable improvement in performance For thesoft-encoding templates surprisingly the prompt-learning model yields the best result with the fewestspecial tokens

6 Related Work

After a series of effective PLMs like GPT (Rad-ford et al 2018) BERT (Devlin et al 2019)RoBERTa (Liu et al 2019) and T5 (Raffel et al

0

600

1200

1800

2400

3000

ᤒ໒ 1

Org-SportsLeague 50

Product-Game 2783

Org-SportsTeam 43

MISC-Livingthing 43

Org-Company 27

Org-Political 16

OR

G-SPORTSL

EAG

UE

PROD

UCT-G

AM

EO

RG-SPORTST

EAM

MISC-L

IVIN

G

ORG-C

OM

PAN

YO

RG-POLITICA

L

926

14 10 0717

ᤒ໒ 1-1

Org-SportsLeague 1929

Per-Athlete 368

Event-Other 132

Location-Mountain 117

Org-SportsTeam 116

Build-Airport 54

OR

G-SPORTSL

EAG

UE

123

Zero-shot Prediction distribution for MSIC-Astronomy class by UNPLET

44

642

PER-ATH

LETEE

VEN

T-OTH

ERL

OCATIO

N-MO

UN

TAIN

ORG-SPO

RTSTEA

MB

UILD-A

IRPORT

PLET PLET (S)

39 39 1814

1

(a) Zero-shot prediction distribution on ORG-SPORTSLEAGUE

0

600

1200

1800

2400

3000

ᤒ໒ 1

Event-Attack 47

PROD-Game 2662Event-Disaster 540

Loc-Other 242

event-protest 203

MISC-Language 169

EV

ENT-A

TTAC

KPRO

D-G

AM

EE

VEN

T-DISA

STERL

OC-O

THER

EV

ENT-PRO

TESTM

ISC-LA

NG

UA

GE

630

12857 48 40

ᤒ໒ 1-1

Event-Attack 1894

Org-Political 420

Per-Politician 392

Loc-Island 342

Loc-Other 342

Other-Language 154

EV

ENT-A

TTAC

K

99

Zero-shot Prediction distribution for MSIC-Astronomy class by UNPLET

93

449

ORG-PO

LITICAL

PER-POLITICIA

NL

OC-ISLA

ND

LO

C-OTH

ER

MISC-L

AN

GU

AG

EPLET PLET (S)

81 81 3711

2

(b) Zero-shot prediction distribution on EVENT-ATTACK

0

500

1000

1500

2000

2500

ᤒ໒ 1

Misc-Currency 1196Person-Other 782

PROD-Car 666

Misc-Language 282

product-game 179Org-Company 116

MISC-C

URRENCY

PERSON-O

THER

PROD-C

AR

MISC-L

AN

GU

AG

EPRO

DU

CT-GA

ME

ORG-C

OM

PAN

Y

203

86 55 35

ᤒ໒ 1-1

Misc-Currency 2197

Loc-Island 194Org-Political 157Org-Company 138Loc-Mountain 108Misc-Language 75

MISC-C

URRENCY

59 48

670

LO

C-ISLAN

DO

RG-POLITICA

LO

RG-CO

MPA

NY

LO

C-MO

UN

TAIN

MISC-L

AN

GU

AG

E

PLET PLET (S)

42 33 23

365

239

1

(c) Zero-shot prediction distribution on MISC-CURRENCY

0

500

1000

1500

2000

2500

ᤒ໒ 1

Loc-Mountain 2229Misc-language 292Loc-Other 97Person-Other 87Misc-Living 69Prod-Car 22

LOC-M

OUNTAIN

MISC-L

AN

GU

AG

EL

OC-O

THER

PERSON-O

THER

MISC-L

IVIN

G

PROD-C

AR

10130 19 08

ᤒ໒ 1-1

Loc-Mountain 2212Loc-Island 247Person-Artist 52Person-Politician 45Misc-Language 42Prod-Ship 31

LOC-M

OUNTAIN

86 18

767

LO

C-ISLAN

DPERSO

N-ARTIST

PERSON-PO

LITICIAN

MISC-L

AN

GU

AG

EPRO

D-SHIP

PLET PLET (S)

16 15 1134

772

1

(d) Zero-shot prediction distribution on LOC-MOUNTAIN

Figure 4 Zero-shot prediction distribution on four types in FEW-NERD in each subgraph the left part illustratesthe results of PLET and the right part shows the results of PLET (S) denotes the correct predictions denotesthe wrong predictions with correct coarse-grained types and denotes the wrong predictions with wrong coarse-grained types

Encoding Strategy Template T(x) Acc MiF MaF

Hard-encodingx [Ent] is [MASK] 5445 6734 6734x [Ent] is a [MASK] 5393 6644 6644x In this sentence [E] is [MASK] 5575 6874 6874

Soft-encoding

x [P] [Ent] [P1] [Pl] [MASK] l = 2 5925 6958 6958x [P] [Ent] [P1] [Pl] [MASK] l = 3 5366 6606 6606x [P] [Ent] [P1] [Pl] [MASK] l = 4 5296 6601 6601x [P] [Ent] [P1] [Pl] [MASK] l = 5 5544 6839 6839

Table 5 Effect of templates The results are produced under 8-shot setting on FEW-NERD dataset by PLET

2020) fine-tuned PLMs have demonstrated theireffectiveness on various important NLP tasks suchas dialogue generation (Zhang et al 2020) textsummarization (Zhang et al 2019 Liu and Lap-ata 2019) question answering (Adiwardana et al2020) and text classification (Baldini Soares et al2019 Peng et al 2020 Ding et al 2021a)

Despite the success of fine-tuning PLMs thehuge objective form gap between pre-training andfine-tuning still hinders the full use of per-trainedknowledge for downstream tasks (Liu et al 2021bHan et al 2021b Hu et al 2021) To this endprompt-learning has been proposed In prompt-learning by leveraging language prompts as con-texts downstream tasks can be expressed as some

cloze-style objectives similar to those pre-trainingobjectives The seminal work that stimulates the de-velopment of prompt-learning is the birth of GPT-3 (Brown et al 2020) which uses hand-craftedprompts for tuning and achieves very impressiveperformance on various tasks especially under thesetting of few-shot learning

Inspired by GPT-3 a series of hand-craftedprompts have been widely explored in knowledgeprobing (Trinh and Le 2018 Petroni et al 2019Davison et al 2019) relation classification (Hanet al 2021b) entiment classification and naturallanguage inference (Schick and Schuumltze 2021 Liuet al 2021b) To avoid labor-intensive promptdesign automatic prompt search has also been

extensively explored Schick et al (2020) Schickand Schuumltze (2021) Shin et al (2020) Gao et al(2020) Liu et al (2021a) to generate languagephrases for prompts Recently some continuousprompts have also been proposed (Li and Liang2021 Lester et al 2021) which directly usea series of learnable continuous embeddings asprompts rather than discrete language phrases

In this paper we aim to stimulate PLMs withprompt-learning to capture the attribute informa-tion of entities We take fine-grained entity typinga crucial task in knowledge extraction to assignentity types to entity mentions (Lin et al 2012) asthe foothold to develop prompt-learning strategiesIn fact Dai et al (2021) use hypernym extractionpatterns to enhance the context and apply maskedlanguage modeling to tackle the ultra-fine entitytyping problem (Choi et al 2018) with free-formlabels which shares a similar idea with prompt-learning In our work we mainly emphasize usingprompt-learning to extract entity types that havebeen pre-defined in low-data scenarios

7 Conclusion

This work investigates the application of prompt-learning on fine-grained entity typing More specif-ically we proposes a framework PLET that coulddeal with fine-grained entity typing in fully super-vised few-shot and zero-shot scenarios In PLETwe first introduce a simple and effective prompt-learning pipeline that could be used to extract entitytypes with both sufficient and insufficient supervi-sion Furthermore to handle the zero-shot settingwe propose a self-supervised prompt-learning ap-proach that automatically learns and summarizesentity types based on unlabeled corpora and a pre-defined label schema PLET utilizes prompts totake advantage of prior knowledge distributed inPLMs and could learn pre-defined type informa-tion without overfitting by performing distribution-level optimization In our future work along the di-rection of PLET (S) we will explore better prompt-learning approaches to automatically learning en-tity types from unlabeled data

ReferencesDaniel Adiwardana Minh-Thang Luong David R So

Jamie Hall Noah Fiedel Romal Thoppilan Zi YangApoorv Kulshreshtha Gaurav Nemade Yifeng Luet al 2020 Towards a human-like open-domainchatbot arXiv preprint arXiv200109977

Livio Baldini Soares Nicholas FitzGerald JeffreyLing and Tom Kwiatkowski 2019 Matching theblanks Distributional similarity for relation learn-ing In Proceedings of the 57th Annual Meetingof the Association for Computational Linguisticspages 2895ndash2905 Florence Italy Association forComputational Linguistics

Tom B Brown Benjamin Mann Nick Ryder MelanieSubbiah Jared Kaplan Prafulla Dhariwal ArvindNeelakantan Pranav Shyam Girish Sastry AmandaAskell et al 2020 Language models are few-shotlearners In Proceedings of NIPS pages 1877ndash1901

Eunsol Choi Omer Levy Yejin Choi and Luke Zettle-moyer 2018 Ultra-fine entity typing In Proceed-ings of ACL pages 87ndash96

Hongliang Dai Yangqiu Song and Haixun Wang2021 Ultra-fine entity typing with weak supervisionfrom a masked language model In Proceedings ofACL pages 1790ndash1799

Joe Davison Joshua Feldman and Alexander M Rush2019 Commonsense knowledge mining from pre-trained models In Proceedings of EMNLP-IJCNLPpages 1173ndash1178

Jacob Devlin Ming-Wei Chang Kenton Lee andKristina Toutanova 2019 Bert Pre-training of deepbidirectional transformers for language understand-ing In Proceedings of NAACL-HLT pages 4171ndash4186

Ning Ding Xiaobin Wang Yao Fu Guangwei Xu RuiWang Pengjun Xie Ying Shen Fei Huang Hai-TaoZheng and Rui Zhang 2021a Prototypical repre-sentation learning for relation extraction In Pro-ceedings of ICLR

Ning Ding Guangwei Xu Yulin Chen Xiaobin WangXu Han Pengjun Xie Hai-Tao Zheng and ZhiyuanLiu 2021b Few-nerd A few-shot named entityrecognition dataset In Proceedings of ACL pages3198ndash3213

Tianyu Gao Adam Fisch and Danqi Chen 2020Making pre-trained language models better few-shotlearners arXiv preprint arXiv201215723

Xu Han Zhengyan Zhang Ning Ding Yuxian GuXiao Liu Yuqi Huo Jiezhong Qiu Liang ZhangWentao Han Minlie Huang et al 2021a Pre-trained models Past present and future arXivpreprint arXiv210607139

Xu Han Weilin Zhao Ning Ding Zhiyuan Liuand Maosong Sun 2021b Ptr Prompt tuningwith rules for text classification arXiv preprintarXiv210511259

John Hewitt and Christopher D Manning 2019 Astructural probe for finding syntax in word repre-sentations In Proceedings of NAACL pages 4129ndash4138

Shengding Hu Ning Ding Huadong Wang ZhiyuanLiu Juanzi Li and Maosong Sun 2021 Knowl-edgeable prompt-tuning Incorporating knowledgeinto prompt verbalizer for text classification arXivpreprint arXiv210802035

Ganesh Jawahar Benoicirct Sagot and Djameacute Seddah2019 What does bert learn about the structure oflanguage In Proceedings of ACL pages 3651ndash3657

Brian Lester Rami Al-Rfou and Noah Constant 2021The power of scale for parameter-efficient prompttuning arXiv preprint arXiv210408691

Xiang Lisa Li and Percy Liang 2021 Prefix-tuning Optimizing continuous prompts for genera-tion arXiv preprint arXiv210100190

Thomas Lin Oren Etzioni et al 2012 No noun phraseleft behind detecting and typing unlinkable entitiesIn Proceedings of EMNLP-CoNLL pages 893ndash903

Xiao Ling and Daniel S Weld 2012 Fine-grained en-tity recognition In AAAI

Pengfei Liu Weizhe Yuan Jinlan Fu Zhengbao JiangHiroaki Hayashi and Graham Neubig 2021a Pre-train prompt and predict A systematic survey ofprompting methods in natural language processingarXiv preprint arXiv210713586

Xiao Liu Yanan Zheng Zhengxiao Du Ming DingYujie Qian Zhilin Yang and Jie Tang 2021b Gptunderstands too arXiv preprint arXiv210310385

Yang Liu and Mirella Lapata 2019 Text summariza-tion with pretrained encoders In Proceedings ofthe 2019 Conference on Empirical Methods in Nat-ural Language Processing and the 9th InternationalJoint Conference on Natural Language Processing(EMNLP-IJCNLP) pages 3730ndash3740 Hong KongChina Association for Computational Linguistics

Yinhan Liu Myle Ott Naman Goyal Jingfei Du Man-dar Joshi Danqi Chen Omer Levy Mike LewisLuke Zettlemoyer and Veselin Stoyanov 2019Roberta A robustly optimized bert pretraining ap-proach arXiv preprint arXiv190711692

Ilya Loshchilov and Frank Hutter 2019 Decoupledweight decay regularization In Proceedings ofICLR

Adam Paszke Sam Gross Francisco Massa AdamLerer James Bradbury Gregory Chanan TrevorKilleen Zeming Lin Natalia Gimelshein LucaAntiga Alban Desmaison Andreas Koumlpf EdwardYang Zachary DeVito Martin Raison Alykhan Te-jani Sasank Chilamkurthy Benoit Steiner Lu FangJunjie Bai and Soumith Chintala 2019 PytorchAn imperative style high-performance deep learn-ing library In Proceedings of NIPS pages 8024ndash8035

Hao Peng Tianyu Gao Xu Han Yankai Lin PengLi Zhiyuan Liu Maosong Sun and Jie Zhou 2020Learning from Context or Names An EmpiricalStudy on Neural Relation Extraction In Proceed-ings of EMNLP pages 3661ndash3672

Fabio Petroni Tim Rocktaumlschel Sebastian RiedelPatrick Lewis Anton Bakhtin Yuxiang Wu andAlexander Miller 2019 Language models as knowl-edge bases In Proceedings of EMNLP pages2463ndash2473

Xipeng Qiu Tianxiang Sun Yige Xu Yunfan ShaoNing Dai and Xuanjing Huang 2020 Pre-trainedmodels for natural language processing A surveyScience China Technological Sciences pages 1ndash26

Alec Radford Karthik Narasimhan Tim Salimans andIlya Sutskever 2018 Improving language under-standing by generative pre-training OpenAI

Colin Raffel Noam Shazeer Adam Roberts KatherineLee Sharan Narang Michael Matena Yanqi ZhouWei Li and Peter J Liu 2020 Exploring the limitsof transfer learning with a unified text-to-text trans-former JMLR 211ndash67

Xiang Ren Wenqi He Meng Qu Lifu Huang HengJi and Jiawei Han 2016a AFET Automatic fine-grained entity typing by hierarchical partial-labelembedding In Proceedings of the 2016 Conferenceon Empirical Methods in Natural Language Process-ing pages 1369ndash1378 Austin Texas Associationfor Computational Linguistics

Xiang Ren Wenqi He Meng Qu Clare R Voss HengJi and Jiawei Han 2016b Label noise reduction inentity typing by heterogeneous partial-label embed-ding In Proceedings of SIGKDD page 1825ndash1834

Timo Schick Helmut Schmid and Hinrich Schuumltze2020 Automatically identifying words that canserve as labels for few-shot text classification InProceedings of COLING pages 5569ndash5578

Timo Schick and Hinrich Schuumltze 2021 Exploitingcloze-questions for few-shot text classification andnatural language inference In Proceedings of EACLpages 255ndash269

Sonse Shimaoka Pontus Stenetorp Kentaro Inui andSebastian Riedel 2017 Neural architectures forfine-grained entity type classification In Proceed-ings of the 15th Conference of the European Chap-ter of the Association for Computational LinguisticsVolume 1 Long Papers pages 1271ndash1280 ValenciaSpain Association for Computational Linguistics

Taylor Shin Yasaman Razeghi Robert L Logan IVEric Wallace and Sameer Singh 2020 AutopromptEliciting knowledge from language models using au-tomatically generated prompts In Proceedings ofEMNLP pages 4222ndash4235

Trieu H Trinh and Quoc V Le 2018 A simplemethod for commonsense reasoning arXiv preprintarXiv180602847

Dong Wang Ning Ding Piji Li and Haitao Zheng2021 CLINE Contrastive learning with semanticnegative examples for natural language understand-ing In Proceedings of ACL

Ralph Weischedel and Ada Brunstein 2005 BBN Pro-noun Coreference and Entity Type Corpus Linguis-tic Data Consortium Philadelphia

Ralph Weischedel Martha Palmer Mitchell MarcusEduard Hovy Sameer Pradhan Lance Ramshaw Ni-anwen Xue Ann Taylor Jeff Kaufman MichelleFranchini Mohammed El-Bachouti Robert Belvinand Ann Houston 2013 OntoNotes Release 50Abacus Data Network

Thomas Wolf Lysandre Debut Victor Sanh JulienChaumond Clement Delangue Anthony Moi Pier-ric Cistac Tim Rault Remi Louf Morgan Funtow-icz Joe Davison Sam Shleifer Patrick von PlatenClara Ma Yacine Jernite Julien Plu Canwen XuTeven Le Scao Sylvain Gugger Mariama DrameQuentin Lhoest and Alexander Rush 2020 Trans-formers State-of-the-art natural language process-ing In Proceedings of EMNLP pages 38ndash45

Jingqing Zhang Yao Zhao Mohammad Saleh and Pe-ter J Liu 2019 Pegasus Pre-training with ex-tracted gap-sentences for abstractive summarizationIn Proceedings of ICML pages 11328ndash11339

Yizhe Zhang Siqi Sun Michel Galley Yen-Chun ChenChris Brockett Xiang Gao Jianfeng Gao JingjingLiu and Bill Dolan 2020 Dialogpt Large-scalegenerative pre-training for conversational responsegeneration In Proceedings of ACL pages 270ndash278

Tony Z Zhao Eric Wallace Shi Feng Dan Klein andSameer Singh 2021 Calibrate before use Im-proving few-shot performance of language modelsarXiv preprint arXiv210209690

0

600

1200

1800

2400

3000

ᤒ໒ 1

Org-SportsLeague 50

Product-Game 2783

Org-SportsTeam 43

MISC-Livingthing 43

Org-Company 27

Org-Political 16

OR

G-SPORTSL

EAG

UE

PROD

UCT-G

AM

EO

RG-SPORTST

EAM

MISC-L

IVIN

G

ORG-C

OM

PAN

YO

RG-POLITICA

L

926

14 10 0717

ᤒ໒ 1-1

Org-SportsLeague 1929

Per-Athlete 368

Event-Other 132

Location-Mountain 117

Org-SportsTeam 116

Build-Airport 54

OR

G-SPORTSL

EAG

UE

123

Zero-shot Prediction distribution for MSIC-Astronomy class by UNPLET

44

642

PER-ATH

LETEE

VEN

T-OTH

ERL

OCATIO

N-MO

UN

TAIN

ORG-SPO

RTSTEA

MB

UILD-A

IRPORT

PLET PLET (S)

39 39 1814

1

(a) Zero-shot prediction distribution on ORG-SPORTSLEAGUE

0

600

1200

1800

2400

3000

ᤒ໒ 1

Event-Attack 47

PROD-Game 2662Event-Disaster 540

Loc-Other 242

event-protest 203

MISC-Language 169

EV

ENT-A

TTAC

KPRO

D-G

AM

EE

VEN

T-DISA

STERL

OC-O

THER

EV

ENT-PRO

TESTM

ISC-LA

NG

UA

GE

630

12857 48 40

ᤒ໒ 1-1

Event-Attack 1894

Org-Political 420

Per-Politician 392

Loc-Island 342

Loc-Other 342

Other-Language 154

EV

ENT-A

TTAC

K

99

Zero-shot Prediction distribution for MSIC-Astronomy class by UNPLET

93

449

ORG-PO

LITICAL

PER-POLITICIA

NL

OC-ISLA

ND

LO

C-OTH

ER

MISC-L

AN

GU

AG

EPLET PLET (S)

81 81 3711

2

(b) Zero-shot prediction distribution on EVENT-ATTACK

0

500

1000

1500

2000

2500

ᤒ໒ 1

Misc-Currency 1196Person-Other 782

PROD-Car 666

Misc-Language 282

product-game 179Org-Company 116

MISC-C

URRENCY

PERSON-O

THER

PROD-C

AR

MISC-L

AN

GU

AG

EPRO

DU

CT-GA

ME

ORG-C

OM

PAN

Y

203

86 55 35

ᤒ໒ 1-1

Misc-Currency 2197

Loc-Island 194Org-Political 157Org-Company 138Loc-Mountain 108Misc-Language 75

MISC-C

URRENCY

59 48

670

LO

C-ISLAN

DO

RG-POLITICA

LO

RG-CO

MPA

NY

LO

C-MO

UN

TAIN

MISC-L

AN

GU

AG

E

PLET PLET (S)

42 33 23

365

239

1

(c) Zero-shot prediction distribution on MISC-CURRENCY

0

500

1000

1500

2000

2500

ᤒ໒ 1

Loc-Mountain 2229Misc-language 292Loc-Other 97Person-Other 87Misc-Living 69Prod-Car 22

LOC-M

OUNTAIN

MISC-L

AN

GU

AG

EL

OC-O

THER

PERSON-O

THER

MISC-L

IVIN

G

PROD-C

AR

10130 19 08

ᤒ໒ 1-1

Loc-Mountain 2212Loc-Island 247Person-Artist 52Person-Politician 45Misc-Language 42Prod-Ship 31

LOC-M

OUNTAIN

86 18

767

LO

C-ISLAN

DPERSO

N-ARTIST

PERSON-PO

LITICIAN

MISC-L

AN

GU

AG

EPRO

D-SHIP

PLET PLET (S)

16 15 1134

772

1

(d) Zero-shot prediction distribution on LOC-MOUNTAIN

Figure 4 Zero-shot prediction distribution on four types in FEW-NERD in each subgraph the left part illustratesthe results of PLET and the right part shows the results of PLET (S) denotes the correct predictions denotesthe wrong predictions with correct coarse-grained types and denotes the wrong predictions with wrong coarse-grained types

Encoding Strategy Template T(x) Acc MiF MaF

Hard-encodingx [Ent] is [MASK] 5445 6734 6734x [Ent] is a [MASK] 5393 6644 6644x In this sentence [E] is [MASK] 5575 6874 6874

Soft-encoding

x [P] [Ent] [P1] [Pl] [MASK] l = 2 5925 6958 6958x [P] [Ent] [P1] [Pl] [MASK] l = 3 5366 6606 6606x [P] [Ent] [P1] [Pl] [MASK] l = 4 5296 6601 6601x [P] [Ent] [P1] [Pl] [MASK] l = 5 5544 6839 6839

Table 5 Effect of templates The results are produced under 8-shot setting on FEW-NERD dataset by PLET

2020) fine-tuned PLMs have demonstrated theireffectiveness on various important NLP tasks suchas dialogue generation (Zhang et al 2020) textsummarization (Zhang et al 2019 Liu and Lap-ata 2019) question answering (Adiwardana et al2020) and text classification (Baldini Soares et al2019 Peng et al 2020 Ding et al 2021a)

Despite the success of fine-tuning PLMs thehuge objective form gap between pre-training andfine-tuning still hinders the full use of per-trainedknowledge for downstream tasks (Liu et al 2021bHan et al 2021b Hu et al 2021) To this endprompt-learning has been proposed In prompt-learning by leveraging language prompts as con-texts downstream tasks can be expressed as some

cloze-style objectives similar to those pre-trainingobjectives The seminal work that stimulates the de-velopment of prompt-learning is the birth of GPT-3 (Brown et al 2020) which uses hand-craftedprompts for tuning and achieves very impressiveperformance on various tasks especially under thesetting of few-shot learning

Inspired by GPT-3 a series of hand-craftedprompts have been widely explored in knowledgeprobing (Trinh and Le 2018 Petroni et al 2019Davison et al 2019) relation classification (Hanet al 2021b) entiment classification and naturallanguage inference (Schick and Schuumltze 2021 Liuet al 2021b) To avoid labor-intensive promptdesign automatic prompt search has also been

extensively explored Schick et al (2020) Schickand Schuumltze (2021) Shin et al (2020) Gao et al(2020) Liu et al (2021a) to generate languagephrases for prompts Recently some continuousprompts have also been proposed (Li and Liang2021 Lester et al 2021) which directly usea series of learnable continuous embeddings asprompts rather than discrete language phrases

In this paper we aim to stimulate PLMs withprompt-learning to capture the attribute informa-tion of entities We take fine-grained entity typinga crucial task in knowledge extraction to assignentity types to entity mentions (Lin et al 2012) asthe foothold to develop prompt-learning strategiesIn fact Dai et al (2021) use hypernym extractionpatterns to enhance the context and apply maskedlanguage modeling to tackle the ultra-fine entitytyping problem (Choi et al 2018) with free-formlabels which shares a similar idea with prompt-learning In our work we mainly emphasize usingprompt-learning to extract entity types that havebeen pre-defined in low-data scenarios

7 Conclusion

This work investigates the application of prompt-learning on fine-grained entity typing More specif-ically we proposes a framework PLET that coulddeal with fine-grained entity typing in fully super-vised few-shot and zero-shot scenarios In PLETwe first introduce a simple and effective prompt-learning pipeline that could be used to extract entitytypes with both sufficient and insufficient supervi-sion Furthermore to handle the zero-shot settingwe propose a self-supervised prompt-learning ap-proach that automatically learns and summarizesentity types based on unlabeled corpora and a pre-defined label schema PLET utilizes prompts totake advantage of prior knowledge distributed inPLMs and could learn pre-defined type informa-tion without overfitting by performing distribution-level optimization In our future work along the di-rection of PLET (S) we will explore better prompt-learning approaches to automatically learning en-tity types from unlabeled data

ReferencesDaniel Adiwardana Minh-Thang Luong David R So

Jamie Hall Noah Fiedel Romal Thoppilan Zi YangApoorv Kulshreshtha Gaurav Nemade Yifeng Luet al 2020 Towards a human-like open-domainchatbot arXiv preprint arXiv200109977

Livio Baldini Soares Nicholas FitzGerald JeffreyLing and Tom Kwiatkowski 2019 Matching theblanks Distributional similarity for relation learn-ing In Proceedings of the 57th Annual Meetingof the Association for Computational Linguisticspages 2895ndash2905 Florence Italy Association forComputational Linguistics

Tom B Brown Benjamin Mann Nick Ryder MelanieSubbiah Jared Kaplan Prafulla Dhariwal ArvindNeelakantan Pranav Shyam Girish Sastry AmandaAskell et al 2020 Language models are few-shotlearners In Proceedings of NIPS pages 1877ndash1901

Eunsol Choi Omer Levy Yejin Choi and Luke Zettle-moyer 2018 Ultra-fine entity typing In Proceed-ings of ACL pages 87ndash96

Hongliang Dai Yangqiu Song and Haixun Wang2021 Ultra-fine entity typing with weak supervisionfrom a masked language model In Proceedings ofACL pages 1790ndash1799

Joe Davison Joshua Feldman and Alexander M Rush2019 Commonsense knowledge mining from pre-trained models In Proceedings of EMNLP-IJCNLPpages 1173ndash1178

Jacob Devlin Ming-Wei Chang Kenton Lee andKristina Toutanova 2019 Bert Pre-training of deepbidirectional transformers for language understand-ing In Proceedings of NAACL-HLT pages 4171ndash4186

Ning Ding Xiaobin Wang Yao Fu Guangwei Xu RuiWang Pengjun Xie Ying Shen Fei Huang Hai-TaoZheng and Rui Zhang 2021a Prototypical repre-sentation learning for relation extraction In Pro-ceedings of ICLR

Ning Ding Guangwei Xu Yulin Chen Xiaobin WangXu Han Pengjun Xie Hai-Tao Zheng and ZhiyuanLiu 2021b Few-nerd A few-shot named entityrecognition dataset In Proceedings of ACL pages3198ndash3213

Tianyu Gao Adam Fisch and Danqi Chen 2020Making pre-trained language models better few-shotlearners arXiv preprint arXiv201215723

Xu Han Zhengyan Zhang Ning Ding Yuxian GuXiao Liu Yuqi Huo Jiezhong Qiu Liang ZhangWentao Han Minlie Huang et al 2021a Pre-trained models Past present and future arXivpreprint arXiv210607139

Xu Han Weilin Zhao Ning Ding Zhiyuan Liuand Maosong Sun 2021b Ptr Prompt tuningwith rules for text classification arXiv preprintarXiv210511259

John Hewitt and Christopher D Manning 2019 Astructural probe for finding syntax in word repre-sentations In Proceedings of NAACL pages 4129ndash4138

Shengding Hu Ning Ding Huadong Wang ZhiyuanLiu Juanzi Li and Maosong Sun 2021 Knowl-edgeable prompt-tuning Incorporating knowledgeinto prompt verbalizer for text classification arXivpreprint arXiv210802035

Ganesh Jawahar Benoicirct Sagot and Djameacute Seddah2019 What does bert learn about the structure oflanguage In Proceedings of ACL pages 3651ndash3657

Brian Lester Rami Al-Rfou and Noah Constant 2021The power of scale for parameter-efficient prompttuning arXiv preprint arXiv210408691

Xiang Lisa Li and Percy Liang 2021 Prefix-tuning Optimizing continuous prompts for genera-tion arXiv preprint arXiv210100190

Thomas Lin Oren Etzioni et al 2012 No noun phraseleft behind detecting and typing unlinkable entitiesIn Proceedings of EMNLP-CoNLL pages 893ndash903

Xiao Ling and Daniel S Weld 2012 Fine-grained en-tity recognition In AAAI

Pengfei Liu Weizhe Yuan Jinlan Fu Zhengbao JiangHiroaki Hayashi and Graham Neubig 2021a Pre-train prompt and predict A systematic survey ofprompting methods in natural language processingarXiv preprint arXiv210713586

Xiao Liu Yanan Zheng Zhengxiao Du Ming DingYujie Qian Zhilin Yang and Jie Tang 2021b Gptunderstands too arXiv preprint arXiv210310385

Yang Liu and Mirella Lapata 2019 Text summariza-tion with pretrained encoders In Proceedings ofthe 2019 Conference on Empirical Methods in Nat-ural Language Processing and the 9th InternationalJoint Conference on Natural Language Processing(EMNLP-IJCNLP) pages 3730ndash3740 Hong KongChina Association for Computational Linguistics

Yinhan Liu Myle Ott Naman Goyal Jingfei Du Man-dar Joshi Danqi Chen Omer Levy Mike LewisLuke Zettlemoyer and Veselin Stoyanov 2019Roberta A robustly optimized bert pretraining ap-proach arXiv preprint arXiv190711692

Ilya Loshchilov and Frank Hutter 2019 Decoupledweight decay regularization In Proceedings ofICLR

Adam Paszke Sam Gross Francisco Massa AdamLerer James Bradbury Gregory Chanan TrevorKilleen Zeming Lin Natalia Gimelshein LucaAntiga Alban Desmaison Andreas Koumlpf EdwardYang Zachary DeVito Martin Raison Alykhan Te-jani Sasank Chilamkurthy Benoit Steiner Lu FangJunjie Bai and Soumith Chintala 2019 PytorchAn imperative style high-performance deep learn-ing library In Proceedings of NIPS pages 8024ndash8035

Hao Peng Tianyu Gao Xu Han Yankai Lin PengLi Zhiyuan Liu Maosong Sun and Jie Zhou 2020Learning from Context or Names An EmpiricalStudy on Neural Relation Extraction In Proceed-ings of EMNLP pages 3661ndash3672

Fabio Petroni Tim Rocktaumlschel Sebastian RiedelPatrick Lewis Anton Bakhtin Yuxiang Wu andAlexander Miller 2019 Language models as knowl-edge bases In Proceedings of EMNLP pages2463ndash2473

Xipeng Qiu Tianxiang Sun Yige Xu Yunfan ShaoNing Dai and Xuanjing Huang 2020 Pre-trainedmodels for natural language processing A surveyScience China Technological Sciences pages 1ndash26

Alec Radford Karthik Narasimhan Tim Salimans andIlya Sutskever 2018 Improving language under-standing by generative pre-training OpenAI

Colin Raffel Noam Shazeer Adam Roberts KatherineLee Sharan Narang Michael Matena Yanqi ZhouWei Li and Peter J Liu 2020 Exploring the limitsof transfer learning with a unified text-to-text trans-former JMLR 211ndash67

Xiang Ren Wenqi He Meng Qu Lifu Huang HengJi and Jiawei Han 2016a AFET Automatic fine-grained entity typing by hierarchical partial-labelembedding In Proceedings of the 2016 Conferenceon Empirical Methods in Natural Language Process-ing pages 1369ndash1378 Austin Texas Associationfor Computational Linguistics

Xiang Ren Wenqi He Meng Qu Clare R Voss HengJi and Jiawei Han 2016b Label noise reduction inentity typing by heterogeneous partial-label embed-ding In Proceedings of SIGKDD page 1825ndash1834

Timo Schick Helmut Schmid and Hinrich Schuumltze2020 Automatically identifying words that canserve as labels for few-shot text classification InProceedings of COLING pages 5569ndash5578

Timo Schick and Hinrich Schuumltze 2021 Exploitingcloze-questions for few-shot text classification andnatural language inference In Proceedings of EACLpages 255ndash269

Sonse Shimaoka Pontus Stenetorp Kentaro Inui andSebastian Riedel 2017 Neural architectures forfine-grained entity type classification In Proceed-ings of the 15th Conference of the European Chap-ter of the Association for Computational LinguisticsVolume 1 Long Papers pages 1271ndash1280 ValenciaSpain Association for Computational Linguistics

Taylor Shin Yasaman Razeghi Robert L Logan IVEric Wallace and Sameer Singh 2020 AutopromptEliciting knowledge from language models using au-tomatically generated prompts In Proceedings ofEMNLP pages 4222ndash4235

Trieu H Trinh and Quoc V Le 2018 A simplemethod for commonsense reasoning arXiv preprintarXiv180602847

Dong Wang Ning Ding Piji Li and Haitao Zheng2021 CLINE Contrastive learning with semanticnegative examples for natural language understand-ing In Proceedings of ACL

Ralph Weischedel and Ada Brunstein 2005 BBN Pro-noun Coreference and Entity Type Corpus Linguis-tic Data Consortium Philadelphia

Ralph Weischedel Martha Palmer Mitchell MarcusEduard Hovy Sameer Pradhan Lance Ramshaw Ni-anwen Xue Ann Taylor Jeff Kaufman MichelleFranchini Mohammed El-Bachouti Robert Belvinand Ann Houston 2013 OntoNotes Release 50Abacus Data Network

Thomas Wolf Lysandre Debut Victor Sanh JulienChaumond Clement Delangue Anthony Moi Pier-ric Cistac Tim Rault Remi Louf Morgan Funtow-icz Joe Davison Sam Shleifer Patrick von PlatenClara Ma Yacine Jernite Julien Plu Canwen XuTeven Le Scao Sylvain Gugger Mariama DrameQuentin Lhoest and Alexander Rush 2020 Trans-formers State-of-the-art natural language process-ing In Proceedings of EMNLP pages 38ndash45

Jingqing Zhang Yao Zhao Mohammad Saleh and Pe-ter J Liu 2019 Pegasus Pre-training with ex-tracted gap-sentences for abstractive summarizationIn Proceedings of ICML pages 11328ndash11339

Yizhe Zhang Siqi Sun Michel Galley Yen-Chun ChenChris Brockett Xiang Gao Jianfeng Gao JingjingLiu and Bill Dolan 2020 Dialogpt Large-scalegenerative pre-training for conversational responsegeneration In Proceedings of ACL pages 270ndash278

Tony Z Zhao Eric Wallace Shi Feng Dan Klein andSameer Singh 2021 Calibrate before use Im-proving few-shot performance of language modelsarXiv preprint arXiv210209690

extensively explored Schick et al (2020) Schickand Schuumltze (2021) Shin et al (2020) Gao et al(2020) Liu et al (2021a) to generate languagephrases for prompts Recently some continuousprompts have also been proposed (Li and Liang2021 Lester et al 2021) which directly usea series of learnable continuous embeddings asprompts rather than discrete language phrases

In this paper we aim to stimulate PLMs withprompt-learning to capture the attribute informa-tion of entities We take fine-grained entity typinga crucial task in knowledge extraction to assignentity types to entity mentions (Lin et al 2012) asthe foothold to develop prompt-learning strategiesIn fact Dai et al (2021) use hypernym extractionpatterns to enhance the context and apply maskedlanguage modeling to tackle the ultra-fine entitytyping problem (Choi et al 2018) with free-formlabels which shares a similar idea with prompt-learning In our work we mainly emphasize usingprompt-learning to extract entity types that havebeen pre-defined in low-data scenarios

7 Conclusion

This work investigates the application of prompt-learning on fine-grained entity typing More specif-ically we proposes a framework PLET that coulddeal with fine-grained entity typing in fully super-vised few-shot and zero-shot scenarios In PLETwe first introduce a simple and effective prompt-learning pipeline that could be used to extract entitytypes with both sufficient and insufficient supervi-sion Furthermore to handle the zero-shot settingwe propose a self-supervised prompt-learning ap-proach that automatically learns and summarizesentity types based on unlabeled corpora and a pre-defined label schema PLET utilizes prompts totake advantage of prior knowledge distributed inPLMs and could learn pre-defined type informa-tion without overfitting by performing distribution-level optimization In our future work along the di-rection of PLET (S) we will explore better prompt-learning approaches to automatically learning en-tity types from unlabeled data

References

Daniel Adiwardana, Minh-Thang Luong, David R. So, Jamie Hall, Noah Fiedel, Romal Thoppilan, Zi Yang, Apoorv Kulshreshtha, Gaurav Nemade, Yifeng Lu, et al. 2020. Towards a human-like open-domain chatbot. arXiv preprint arXiv:2001.09977.

Livio Baldini Soares, Nicholas FitzGerald, Jeffrey Ling, and Tom Kwiatkowski. 2019. Matching the blanks: Distributional similarity for relation learning. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2895–2905, Florence, Italy. Association for Computational Linguistics.

Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. In Proceedings of NIPS, pages 1877–1901.

Eunsol Choi, Omer Levy, Yejin Choi, and Luke Zettlemoyer. 2018. Ultra-fine entity typing. In Proceedings of ACL, pages 87–96.

Hongliang Dai, Yangqiu Song, and Haixun Wang. 2021. Ultra-fine entity typing with weak supervision from a masked language model. In Proceedings of ACL, pages 1790–1799.

Joe Davison, Joshua Feldman, and Alexander M. Rush. 2019. Commonsense knowledge mining from pre-trained models. In Proceedings of EMNLP-IJCNLP, pages 1173–1178.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of NAACL-HLT, pages 4171–4186.

Ning Ding, Xiaobin Wang, Yao Fu, Guangwei Xu, Rui Wang, Pengjun Xie, Ying Shen, Fei Huang, Hai-Tao Zheng, and Rui Zhang. 2021a. Prototypical representation learning for relation extraction. In Proceedings of ICLR.

Ning Ding, Guangwei Xu, Yulin Chen, Xiaobin Wang, Xu Han, Pengjun Xie, Hai-Tao Zheng, and Zhiyuan Liu. 2021b. Few-NERD: A few-shot named entity recognition dataset. In Proceedings of ACL, pages 3198–3213.

Tianyu Gao, Adam Fisch, and Danqi Chen. 2020. Making pre-trained language models better few-shot learners. arXiv preprint arXiv:2012.15723.

Xu Han, Zhengyan Zhang, Ning Ding, Yuxian Gu, Xiao Liu, Yuqi Huo, Jiezhong Qiu, Liang Zhang, Wentao Han, Minlie Huang, et al. 2021a. Pre-trained models: Past, present and future. arXiv preprint arXiv:2106.07139.

Xu Han, Weilin Zhao, Ning Ding, Zhiyuan Liu, and Maosong Sun. 2021b. PTR: Prompt tuning with rules for text classification. arXiv preprint arXiv:2105.11259.

John Hewitt and Christopher D. Manning. 2019. A structural probe for finding syntax in word representations. In Proceedings of NAACL, pages 4129–4138.

Shengding Hu, Ning Ding, Huadong Wang, Zhiyuan Liu, Juanzi Li, and Maosong Sun. 2021. Knowledgeable prompt-tuning: Incorporating knowledge into prompt verbalizer for text classification. arXiv preprint arXiv:2108.02035.

Ganesh Jawahar, Benoît Sagot, and Djamé Seddah. 2019. What does BERT learn about the structure of language? In Proceedings of ACL, pages 3651–3657.

Brian Lester, Rami Al-Rfou, and Noah Constant. 2021. The power of scale for parameter-efficient prompt tuning. arXiv preprint arXiv:2104.08691.

Xiang Lisa Li and Percy Liang. 2021. Prefix-tuning: Optimizing continuous prompts for generation. arXiv preprint arXiv:2101.00190.

Thomas Lin, Oren Etzioni, et al. 2012. No noun phrase left behind: Detecting and typing unlinkable entities. In Proceedings of EMNLP-CoNLL, pages 893–903.

Xiao Ling and Daniel S. Weld. 2012. Fine-grained entity recognition. In AAAI.

Pengfei Liu, Weizhe Yuan, Jinlan Fu, Zhengbao Jiang, Hiroaki Hayashi, and Graham Neubig. 2021a. Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing. arXiv preprint arXiv:2107.13586.

Xiao Liu, Yanan Zheng, Zhengxiao Du, Ming Ding, Yujie Qian, Zhilin Yang, and Jie Tang. 2021b. GPT understands, too. arXiv preprint arXiv:2103.10385.

Yang Liu and Mirella Lapata. 2019. Text summarization with pretrained encoders. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3730–3740, Hong Kong, China. Association for Computational Linguistics.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.

Ilya Loshchilov and Frank Hutter. 2019. Decoupled weight decay regularization. In Proceedings of ICLR.

Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Köpf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. 2019. PyTorch: An imperative style, high-performance deep learning library. In Proceedings of NIPS, pages 8024–8035.

Hao Peng, Tianyu Gao, Xu Han, Yankai Lin, Peng Li, Zhiyuan Liu, Maosong Sun, and Jie Zhou. 2020. Learning from Context or Names? An Empirical Study on Neural Relation Extraction. In Proceedings of EMNLP, pages 3661–3672.

Fabio Petroni, Tim Rocktäschel, Sebastian Riedel, Patrick Lewis, Anton Bakhtin, Yuxiang Wu, and Alexander Miller. 2019. Language models as knowledge bases? In Proceedings of EMNLP, pages 2463–2473.

Xipeng Qiu, Tianxiang Sun, Yige Xu, Yunfan Shao, Ning Dai, and Xuanjing Huang. 2020. Pre-trained models for natural language processing: A survey. Science China Technological Sciences, pages 1–26.

Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding by generative pre-training. OpenAI.

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. JMLR, 21:1–67.

Xiang Ren, Wenqi He, Meng Qu, Lifu Huang, Heng Ji, and Jiawei Han. 2016a. AFET: Automatic fine-grained entity typing by hierarchical partial-label embedding. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1369–1378, Austin, Texas. Association for Computational Linguistics.

Xiang Ren, Wenqi He, Meng Qu, Clare R. Voss, Heng Ji, and Jiawei Han. 2016b. Label noise reduction in entity typing by heterogeneous partial-label embedding. In Proceedings of SIGKDD, pages 1825–1834.

Timo Schick, Helmut Schmid, and Hinrich Schütze. 2020. Automatically identifying words that can serve as labels for few-shot text classification. In Proceedings of COLING, pages 5569–5578.

Timo Schick and Hinrich Schütze. 2021. Exploiting cloze-questions for few-shot text classification and natural language inference. In Proceedings of EACL, pages 255–269.

Sonse Shimaoka, Pontus Stenetorp, Kentaro Inui, and Sebastian Riedel. 2017. Neural architectures for fine-grained entity type classification. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, pages 1271–1280, Valencia, Spain. Association for Computational Linguistics.

Taylor Shin, Yasaman Razeghi, Robert L. Logan IV, Eric Wallace, and Sameer Singh. 2020. AutoPrompt: Eliciting knowledge from language models using automatically generated prompts. In Proceedings of EMNLP, pages 4222–4235.

Trieu H. Trinh and Quoc V. Le. 2018. A simple method for commonsense reasoning. arXiv preprint arXiv:1806.02847.

Dong Wang, Ning Ding, Piji Li, and Haitao Zheng. 2021. CLINE: Contrastive learning with semantic negative examples for natural language understanding. In Proceedings of ACL.

Ralph Weischedel and Ada Brunstein. 2005. BBN Pronoun Coreference and Entity Type Corpus. Linguistic Data Consortium, Philadelphia.

Ralph Weischedel, Martha Palmer, Mitchell Marcus, Eduard Hovy, Sameer Pradhan, Lance Ramshaw, Nianwen Xue, Ann Taylor, Jeff Kaufman, Michelle Franchini, Mohammed El-Bachouti, Robert Belvin, and Ann Houston. 2013. OntoNotes Release 5.0. Abacus Data Network.

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Remi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander Rush. 2020. Transformers: State-of-the-art natural language processing. In Proceedings of EMNLP, pages 38–45.

Jingqing Zhang, Yao Zhao, Mohammad Saleh, and Peter J. Liu. 2019. PEGASUS: Pre-training with extracted gap-sentences for abstractive summarization. In Proceedings of ICML, pages 11328–11339.

Yizhe Zhang, Siqi Sun, Michel Galley, Yen-Chun Chen, Chris Brockett, Xiang Gao, Jianfeng Gao, Jingjing Liu, and Bill Dolan. 2020. DialoGPT: Large-scale generative pre-training for conversational response generation. In Proceedings of ACL, pages 270–278.

Tony Z. Zhao, Eric Wallace, Shi Feng, Dan Klein, and Sameer Singh. 2021. Calibrate before use: Improving few-shot performance of language models. arXiv preprint arXiv:2102.09690.
