Prompt-Learning for Fine-Grained Entity Typing
Ning Ding1*, Yulin Chen3*, Xu Han15, Guangwei Xu2, Pengjun Xie2, Hai-Tao Zheng3†, Zhiyuan Liu15†, Juanzi Li1, Hong-Gee Kim4
1 Department of Computer Science and Technology, Tsinghua University  2 Alibaba Group  3 SIGS, Tsinghua University  4 Seoul National University
5 State Key Lab on Intelligent Technology and Systems, Tsinghua University
{dingn18, yl-chen21, hanxu17}@mails.tsinghua.edu.cn
Abstract
As an effective approach to tuning pre-trained language models (PLMs) for specific tasks, prompt-learning has recently attracted much attention from researchers. By using cloze-style language prompts to stimulate the versatile knowledge of PLMs, prompt-learning can achieve promising results on a series of NLP tasks, such as natural language inference, sentiment classification, and knowledge probing. In this work, we investigate the application of prompt-learning to fine-grained entity typing in fully supervised, few-shot, and zero-shot scenarios. We first develop a simple and effective prompt-learning pipeline by constructing entity-oriented verbalizers and templates and conducting masked language modeling. Further, to tackle the zero-shot regime, we propose a self-supervised strategy that carries out distribution-level optimization in prompt-learning to automatically summarize the information of entity types. Extensive experiments on three fine-grained entity typing benchmarks (with up to 86 classes) under fully supervised, few-shot, and zero-shot settings show that prompt-learning methods significantly outperform fine-tuning baselines, especially when the training data is insufficient.
1 Introduction
In recent years, pre-trained language models (PLMs) have been widely explored and have become a key instrument for natural language understanding (Devlin et al., 2019; Liu et al., 2019) and generation (Radford et al., 2018; Raffel et al., 2020). By applying self-supervised learning on large-scale unlabeled corpora, PLMs can capture rich lexical (Jawahar et al., 2019), syntactic (Hewitt and Manning, 2019; Wang et al., 2021), and factual knowledge (Petroni et al., 2019) that well benefits
* Equal contribution. † Corresponding authors.
[Figure 1 content: four cloze-style prompts fed to an MLM head.
Knowledge probing: "[CLS] iPhone is produced by [SEP] [MASK]" → Apple.
Sentiment classification (~2 classes: POSITIVE, NEGATIVE): "[CLS] I like this. It was [SEP] [MASK]" → Great ⇒ class POSITIVE.
Natural language inference (~3 classes: ENTAILMENT, NEUTRAL, CONTRADICTION): "[CLS] What happened to his lab? [MASK], his lab was torn down. [SEP]" → Yes ⇒ class ENTAILMENT.
Entity typing (>40 classes: PERSON, LOCATION, ORGANIZATION, ...): "[CLS] Bob Dylan, who wrote the song Blowing in the Wind, won the Nobel Prize in Literature in 2016. Bob Dylan is [SEP] [MASK]" → label words Singer, Writer, ... ⇒ class set PERSON-ARTIST/AUTHOR.]
Figure 1: Examples of prompt-learning to stimulate the knowledge of PLMs by formalizing specific tasks as equivalent cloze-style tasks.
downstream NLP tasks. Considering the versatile knowledge contained in PLMs, many research efforts have been devoted to stimulating task-specific knowledge in PLMs and adapting such knowledge to downstream NLP tasks. Fine-tuning with extra classifiers has been one typical solution for adapting PLMs to specific tasks, and it achieves promising results on various NLP tasks (Qiu et al., 2020; Han et al., 2021a).
Some recent efforts on probing the knowledge of PLMs show that by writing natural language prompts, we can induce PLMs to complete factual knowledge (Petroni et al., 2019). GPT-3 further utilizes the information provided by prompts to conduct few-shot learning and achieves impressive results (Brown et al., 2020). Inspired by this, prompt-learning has been introduced. As shown
arXiv:2108.10604v1 [cs.CL] 24 Aug 2021
in Figure 1, in prompt-learning, downstream tasks are formalized as equivalent cloze-style tasks, and PLMs are asked to handle these cloze-style tasks instead of the original downstream tasks. Compared with conventional fine-tuning methods, prompt-learning does not require extra neural layers and intuitively bridges the objective-form gap between pre-training and fine-tuning. Sufficient empirical analysis shows that, whether with manually picked hand-crafted prompts (Liu et al., 2021b; Han et al., 2021b) or automatically built auto-generated prompts (Shin et al., 2020; Gao et al., 2020; Lester et al., 2021), taking prompts for tuning models is surprisingly effective for the knowledge stimulation and model adaptation of PLMs, especially in the low-data regime.
Intuitively, prompt-learning is applicable to fine-grained entity typing, which aims at classifying marked entities in input sequences into specific types from a pre-defined label set. We discuss this topic with a motivating example: "He is from New York". By adding a prompt with a masking token [MASK], the sentence becomes "He is from New York. In this sentence, New York is [MASK]". Due to the wealth of knowledge acquired during pre-training, PLMs can compute a probability distribution over the vocabulary at the masked position and assign a relatively higher probability to the word "city" than to the word "person". In other words, with simple prompts, the abstract entity attributes contained in PLMs can be efficiently exploited, which is meaningful for downstream entity-related tasks.
In this work, we comprehensively explore the application of prompt-learning to fine-grained entity typing in fully supervised, few-shot, and zero-shot settings. Particularly, we first introduce a naive pipeline in which we construct entity-oriented prompts and formalize fine-grained entity typing as a cloze-style task. This simple pipeline yields promising results in our experiments, especially when supervision is insufficient. Then, to tackle the zero-shot scenario, where no explicit supervision exists in training, we develop a self-supervised strategy under our prompt-learning pipeline. Our self-supervised strategy attempts to automatically summarize entity types by optimizing the similarity of the predicted probability distributions of paired examples in prompt-learning.
Three popular benchmarks are used in our experiments: FEW-NERD (Ding et al., 2021b), OntoNotes (Weischedel et al., 2013), and BBN (Weischedel and Brunstein, 2005). All these datasets have a complex type hierarchy consisting of rich entity types, requiring models to have good capabilities of entity attribute detection. Empirically, our method yields significant improvements on these benchmark datasets, especially under the zero-shot and few-shot settings. We also analyze and point out both the superiority and the bottlenecks of prompt-learning in fine-grained entity typing, which may advance further efforts to extract entity attributes using PLMs. Our source code and pre-trained models will be publicly available.
2 Background
In this section, we first give a problem definition of the entity typing task (§2.1), followed by an introduction to conventional vanilla fine-tuning (§2.2) and prompt-based tuning (§2.3) with PLMs.
2.1 Problem Definition

The input of entity typing is a dataset D = {x1, ..., xn} with n sentences, where each sentence x contains a marked entity mention m. For each input sentence x, entity typing aims at predicting the entity type y ∈ Y of its marked mention m, where Y is a pre-defined set of entity types. Entity typing is typically regarded as a context-aware classification task. For example, in the sentence "London is the fifth album by the rock band Jesus Jones", the entity mention London should be classified as MUSIC rather than LOCATION. In the era of PLMs, using pre-trained neural language models (e.g., BERT) as the encoder and performing model tuning for classifying types has become a standard paradigm.
2.2 Vanilla Fine-tuning

In the vanilla fine-tuning paradigm of entity typing, for each token ti in an input sequence x = [CLS] t1 ... m ... tT [SEP] with a marked entity mention m = {ti, ..., tj}, the PLM M produces its contextualized representations {h[CLS], h1, ..., hT, h[SEP]}. Empirically, we choose the embedding of the [CLS] token, h[CLS], as the final representation, which is fed into an output layer to predict the probability distribution over the label space:
P(y ∈ Y | s) = softmax(W h[CLS] + b),  (1)
where W and b are learnable parameters. W, b, and all parameters of the PLM are tuned by maximizing the objective function (1/n) Σ_{i=1}^{n} log P(yi | si), where yi is the gold type label of si.

[Figure 2 content: the input "London is one of the biggest cities in the world" is concatenated with the prompt "London is a [MASK] [SEP]" (the entity mention is copied into the prompt); the MLM head predicts label words such as "city" and "location" at the [MASK] position, which are mapped to class sets such as LOCATION-CITY.]

Figure 2: The illustration of prompt-learning for fine-grained entity typing with supervision. We take the hard-encoding prompt strategy as an example in this figure.
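The fine-tuning baseline of §2.2 can be sketched as follows; this is a minimal illustration of Eq. (1) with toy dimensions and random weights standing in for a real PLM's output, not the paper's implementation.

```python
# Sketch of Sec. 2.2 / Eq. (1): a linear classifier over the [CLS]
# representation. h_cls stands in for a PLM's contextualized output
# (768-d for BERT-base); tiny sizes and random values for illustration.
import numpy as np

rng = np.random.default_rng(0)
hidden, num_types = 4, 3            # toy sizes

h_cls = rng.normal(size=hidden)     # [CLS] embedding (stand-in)
W = rng.normal(size=(num_types, hidden))
b = np.zeros(num_types)

def softmax(z):
    z = z - z.max()                 # numerical stability
    e = np.exp(z)
    return e / e.sum()

probs = softmax(W @ h_cls + b)      # P(y | s) over the label space
print(probs)                        # sums to 1 (a valid distribution)
```

During fine-tuning, W, b, and the PLM parameters would be optimized jointly by cross-entropy against the gold labels.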
2.3 Prompt-based Tuning

In prompt-based tuning, for each label y ∈ Y, we define a label word set Vy = {w1, ..., wm}. Vy is a subset of the vocabulary V of the PLM M, i.e., Vy ⊆ V. By taking the union of the label word sets corresponding to each label, we get an overall label word set V*. For example, in sentiment classification, we could map the label y = POSITIVE to a set Vy = {great, good, wonderful, ...}. The other primary component of prompt-learning is a prompt template T(·), which modifies the original input x into a prompt input T(x) by adding a set of additional tokens at the end of x. Conventionally, a [MASK] token is added for PLMs to predict the missing label word w ∈ V*. Thus, in prompt-learning, a classification problem is transferred into a masked language modeling problem:

P(y ∈ Y | s) = P([MASK] = w ∈ Vy | T(s)).  (2)
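To make Eq. (2) concrete, here is a minimal sketch (with a hand-picked vocabulary and probabilities, not an actual MLM output) of scoring classes by the probability mass their label words receive at the masked position:

```python
# Sketch of Eq. (2): classification as masked language modeling.
# The distribution below is a hypothetical stand-in for the MLM head's
# prediction at the [MASK] position of T(s).
mask_probs = {"great": 0.40, "good": 0.25, "terrible": 0.20,
              "bad": 0.10, "city": 0.05}

# Verbalizer: each label y maps to a label word set V_y.
verbalizer = {
    "POSITIVE": {"great", "good", "wonderful"},
    "NEGATIVE": {"terrible", "bad"},
}

def class_scores(probs, verbalizer):
    """Score each class by the total probability mass on its label words."""
    return {
        label: sum(probs.get(w, 0.0) for w in words)
        for label, words in verbalizer.items()
    }

scores = class_scores(mask_probs, verbalizer)
pred = max(scores, key=scores.get)
print(pred)  # POSITIVE (0.65 vs. 0.30)
```

Only the words in the verbalizer contribute to the decision; the rest of the vocabulary is ignored.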
3 Prompt-learning for Entity Typing: A Naive Pipeline

After being transferred into masked language modeling, the prompt-learning method is applicable to learning and aggregating the type information of entities. In this section, we first introduce a naive but empirically strong baseline that utilizes prompts to extract entity types with explicit supervision, including the construction of label words (§3.1) and templates (§3.2) and the training procedure (§3.3). This simple pipeline yields remarkable results on three benchmark datasets. We then propose a self-supervised prompt-learning method that automatically learns type information from unlabeled data (§4).
3.1 Label Words Set V*

For fine-grained entity typing, datasets usually use a hierarchical label space, such as PERSON/ARTIST (FEW-NERD) and ORGANIZATION/PARTY (OntoNotes). In this case, we use all the words in the label as the label words for the entity type; for example, y = LOCATION/CITY → Vy = {location, city}. And as the entity types are all well-defined nouns with clear boundaries, it is intuitive to expand the label words set with obtainable related nouns. For example, in Related Words1, the top-10 related words of the label word city are "metropolis, town, municipality, urban, suburb, municipal, megalopolis, civilization, downtown, country". These words are strongly related to the class CITY, and they can hardly be mapped to other entity types, even ones under the same LOCATION class, such as LOCATION/MOUNTAIN, LOCATION/ISLAND, etc.
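The construction above can be sketched as follows; the RELATED lookup is a hypothetical stand-in for a resource such as Related Words, and the type names are illustrative:

```python
# Sketch of Sec. 3.1: building label word sets from hierarchical type
# names, optionally expanded with related nouns. RELATED is a tiny
# hypothetical stand-in for relatedwords.org.
RELATED = {
    "city": ["metropolis", "town", "municipality"],  # illustrative top nouns
    "person": ["human", "individual"],
}

def label_words(type_name, expand=True, top_k=3):
    """Split a type like 'LOCATION/CITY' into lowercase label words,
    then extend each word with up to top_k related nouns."""
    words = [w.lower() for w in type_name.split("/")]
    if expand:
        for w in list(words):
            words.extend(RELATED.get(w, [])[:top_k])
    return words

print(label_words("LOCATION/CITY"))
# ['location', 'city', 'metropolis', 'town', 'municipality']
```

The union of these per-type sets gives the overall label word set V* used for masked prediction.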
In masked language modeling, we use the confidence scores of all the words in Vy to construct the final score of the particular type y. That is, for an input x (which is mapped to T(x)) and its entity type y (which is mapped to Vy = {w1, ..., wm}), the conditional probability becomes

P(y|x) = (1/m) Σ_{j=1}^{m} λj P([MASK] = wj | T(x)),  (3)

where λj is a parameter that indicates the importance of the current word wj ∈ Vy. Note that λj could also be learnable or heuristically defined during the training procedure.
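A minimal sketch of Eq. (3), with illustrative probabilities and uniform weights λj:

```python
# Sketch of Eq. (3): the score of type y is a weighted average of the
# MLM probabilities of its label words at the [MASK] position. The
# probabilities and label words below are illustrative, not model output.
def type_score(mask_probs, label_words, weights=None):
    """P(y|x) = (1/m) * sum_j lambda_j * P([MASK] = w_j | T(x))."""
    m = len(label_words)
    if weights is None:
        weights = [1.0] * m  # uniform importance
    return sum(lam * mask_probs.get(w, 0.0)
               for lam, w in zip(weights, label_words)) / m

mask_probs = {"city": 0.30, "town": 0.20, "mountain": 0.05}
score_city = type_score(mask_probs, ["city", "town"])       # (0.30 + 0.20) / 2
score_mount = type_score(mask_probs, ["mountain", "peak"])  # (0.05 + 0.00) / 2
print(score_city > score_mount)  # True
```

The predicted type is simply the argmax of these per-type scores.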
3.2 Templates

In this section, we construct entity-oriented prompts for the fine-grained entity typing task. We choose hard-encoding templates with natural language and soft-encoding templates with additional special tokens in our work.

For the choice of hard-encoding templates, we do not use automatic searching methods for discrete prompts, since the fine-grained entity typing task is clearly defined and the prompts can easily be made purposeful. We select simple declarative templates rather than hypernym templates to avoid grammatical errors. In the hard-encoding setting, we first copy the marked entity mention in x, then add a few linking verbs and articles, followed by the [MASK] token. With the marked entity mention [Ent], we use the following templates:

T1(x) = x [Ent] is [MASK]
T2(x) = x [Ent] is a [MASK]
T3(x) = x In this sentence [Ent] is a [MASK]

where [Ent] is the entity mention in x. In §5, we report the results of T3(·).

1 https://relatedwords.org
We also adopt the soft-encoding strategy, which introduces additional special tokens [P1], ..., [Pl] as the template, where l is a pre-defined hyper-parameter. The template begins with a delimiter [P] and a copy of the entity mention [Ent]. The complete template becomes

T4(x) = x [P] [Ent] [P1] ... [Pl] [MASK]

where each prompt embedding is randomly initialized and optimized during training. Intuitively, these special tokens can represent a cluster of words with similar semantics in the vocabulary.
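The templates above can be written as simple string functions. This is a sketch: soft prompt tokens are shown as placeholder strings, whereas in the actual model they are trainable embeddings.

```python
# Sketch of Sec. 3.2: hard-encoding template T3 and soft-encoding
# template T4 as plain string operations.
def t3(x, ent):
    """T3(x) = x In this sentence [Ent] is a [MASK]"""
    return f"{x} In this sentence {ent} is a [MASK]"

def t4(x, ent, l=2):
    """T4(x) = x [P] [Ent] [P1] ... [Pl] [MASK]"""
    soft = " ".join(f"[P{i}]" for i in range(1, l + 1))
    return f"{x} [P] {ent} {soft} [MASK]"

s = "He is from New York."
print(t3(s, "New York"))
# He is from New York. In this sentence New York is a [MASK]
print(t4(s, "New York"))
# He is from New York. [P] New York [P1] [P2] [MASK]
```

In a real pipeline the wrapped string would then be tokenized and fed to the PLM's MLM head.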
3.3 Training and Inference

The hard- and soft-encoding strategies provide different initializations of templates, and both can be parameterized by φ and optimized along with M during training. We train the pre-trained model M (parameterized by θ) along with the additional prompt embeddings by using the cross-entropy loss function

L = − Σ log P(y | x; θ, φ).  (4)

For inference, we can directly use Eq. (3) to predict the label of the current input instance based on the predicted words at the [MASK] position.
This pipeline can be applied to the entity typing task with explicit supervision, and it is effective even if the training data are insufficient, i.e., the few-shot scenario (§5.5). Naturally, we also consider a more extreme situation, namely a scenario without any training data (the zero-shot scenario). In this setting, if we directly use an additional classifier to predict the label, the result is equivalent to random guessing, because the parameters of the classifier are randomly initialized. If we use prompts to infer the label based on the predicted words, although the performance is significantly better than guessing, there will still be a catastrophic decline (§5.6). At this point, a question emerges: "Is it possible for PLMs to predict entity types without any explicit supervision?"
4 Self-supervised Prompt-learning for Zero-shot Entity Typing

With prompt-learning, the answer is yes, because in the pre-training stage the contexts of entities have already implied the corresponding type information, which provides an advantageous initialization point for the prompt-learning paradigm. Consider, for example, the input sentence wrapped with the T3(·) template: "Steve Jobs founded Apple. In this sentence, Steve Jobs is a [MASK]." In our observations, the probability of PLMs predicting person at the masked position is significantly higher than the probability of location. If we make reasonable use of this superior initialization point, it is possible for PLMs to automatically summarize the type information and finally extract the correct entity type.
4.1 Overview
In order to create conditions for PLMs to summarize entity types, we consider a self-supervised paradigm that optimizes the similarity of the probability distributions predicted by similar examples over a projected vocabulary V*. To achieve that in prompt-learning, we need to (1) impose a limit on the prediction range of the model, so that only the words we need, that is, words that express entity types, participate in the optimization of the gradient, and (2) provide an unlabeled dataset where entity mentions are marked without any types, to allow the model to learn to induce type information in a self-supervised manner. The inputs are a pre-trained model M, a pre-defined label schema Y, and a dataset without labels D = {x1, ..., xn} (entity mentions are marked without any types); our goal is to make M capable of automatically carrying out zero-shot entity typing after being trained on D and Y. Using prompt-learning as the training strategy, we first construct a label words set V* from Y, and for each sentence x in D, we wrap it with a hard-encoding template containing a [MASK] symbol. The key idea is to make the prediction distributions over V* of same-type entities as similar as possible. In this way, we can perform contrastive learning by sampling positive and negative examples, while ignoring the impact on optimization of words that are not in V* during the MLM process.

[Figure 3 content: two sentences from the unlabeled dataset D (e.g., "London is one of the biggest cities ..." and "[HIDE] is located in the south-east ...") are wrapped with prompts; the entity mention is copied into the prompt, or randomly hidden with the [HIDE] symbol with probability α. The MLM head produces a prediction over V* for each [MASK] position, and the JS divergence between the two distributions is optimized.]

Figure 3: The illustration of self-supervised prompt-learning for fine-grained entity typing with unlabeled data and a pre-defined label set. V* denotes the label words projected from the input label set. Note that we only show the positive pair in this figure.
4.2 Self-supervised Learning

Although there are no labels in D, we can still develop a sampling strategy based on a simple hypothesis: the same entity in different sentences tends to have similar types. For instance, we would sample two sentences containing "Steve Jobs" as a positive pair. Moreover, considering that entity typing is context-aware ("Steve Jobs" could be an entrepreneur, designer, or philanthropist in different contexts), we choose to optimize the similarity between the distributions of the words over V*. This strategy not only softens the supervision but also eliminates the impact of other words in self-supervised learning.

Specifically, we randomly sample c positive pairs, i.e., sentence pairs that share one same entity mention, denoted as D̂pos, and c negative pairs, i.e., pairs of sentences with different entity mentions marked, denoted as D̂neg, from a large-scale entity-linked corpus D. To avoid generating false negative samples, the negative samples are further restricted by a large dictionary that contains common entities and their type information: only sentence pairs whose entities have different types in the dictionary are selected as negative samples. We then wrap them with the hard-encoding template T3(·). To avoid overfitting to entity names, we randomly hide the entity mention (in the original input and the template) with a special symbol [HIDE] with probability α. Empirically, α is set to 0.4.
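The sampling and hiding steps can be sketched as follows; the corpus, mention-matching logic, and data layout here are illustrative, not the paper's pipeline:

```python
# Sketch of Sec. 4.2: sampling positive pairs (sentences sharing the
# same marked entity mention) and the [HIDE] replacement with
# probability alpha = 0.4.
import random
from itertools import combinations

ALPHA = 0.4

def positive_pairs(corpus):
    """Pair up sentences that share the same marked entity mention."""
    return [(a, b) for a, b in combinations(corpus, 2)
            if a["mention"] == b["mention"]]

def maybe_hide(sentence, rng):
    """Randomly replace the entity mention with [HIDE] to avoid
    overfitting to entity names."""
    if rng.random() < ALPHA:
        return sentence["text"].replace(sentence["mention"], "[HIDE]")
    return sentence["text"]

corpus = [
    {"text": "Steve Jobs founded Apple.", "mention": "Steve Jobs"},
    {"text": "Steve Jobs unveiled the iPhone.", "mention": "Steve Jobs"},
    {"text": "London hosted the games.", "mention": "London"},
]

pairs = positive_pairs(corpus)
print(len(pairs))  # 1: the two "Steve Jobs" sentences form a positive pair
```

Negative pairs would additionally be filtered by a type dictionary, as described above, to avoid false negatives.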
Since the impact of a pair of examples on training should be measured at the distribution level, we choose the Jensen-Shannon divergence as the metric to assess the similarity of two distributions. Thus, for a sentence pair (x, x′), the similarity score of the two predicted representations h and h′ at the [MASK] position is computed by

s(h, h′) = JS(P_V*(w|x) ‖ P_V*(w|x′)),  (5)

where JS is the Jensen-Shannon divergence, and P_V*(w|x) and P_V*(w|x′) are the probability distributions of the predicted token w over V*, obtained from h and h′.
As we attempt to make the predictions of the positive pairs similar, the objective is computed by

L = − (1/|D̂pos|²) Σ_{x ∈ D̂pos} Σ_{x′ ∈ D̂pos} log(1 − s(h, h′))
  − (1/|D̂neg|²) Σ_{x ∈ D̂neg} Σ_{x′ ∈ D̂neg} γ log(s(h, h′)),  (6)
where γ is a penalty term, because the assumption is loose for negative pairs. Overall, we use the entity-linked English Wikipedia corpus as the raw data and generate about 1 million pairs of data each for D̂pos and D̂neg.
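Eqs. (5) and (6) can be sketched in numpy as follows; the distributions are illustrative, and the double sums are simplified to a list of sampled pairs (a sketch, not the paper's PyTorch code):

```python
# Sketch of Eqs. (5)-(6): Jensen-Shannon divergence between the [MASK]
# distributions over V* of a sentence pair, and the pairwise objective.
import numpy as np

def js_divergence(p, q, eps=1e-12):
    """Jensen-Shannon divergence between two discrete distributions
    (natural log, so it is bounded by ln 2)."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    m = 0.5 * (p + q)
    def kl(a, b):
        return np.sum(a * np.log((a + eps) / (b + eps)))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def pair_loss(pos_pairs, neg_pairs, gamma=0.5, eps=1e-12):
    """-mean log(1 - s) over positives - gamma * mean log(s) over
    negatives, with s the JS divergence of a pair."""
    pos = -np.mean([np.log(1 - js_divergence(p, q) + eps) for p, q in pos_pairs])
    neg = -gamma * np.mean([np.log(js_divergence(p, q) + eps) for p, q in neg_pairs])
    return pos + neg

same = ([0.7, 0.2, 0.1], [0.7, 0.2, 0.1])      # identical -> JS = 0
diff = ([0.9, 0.05, 0.05], [0.05, 0.05, 0.9])  # dissimilar -> JS large
print(js_divergence(*same) < js_divergence(*diff))  # True
```

Minimizing the loss pushes positive pairs toward identical distributions over V* and negative pairs apart.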
Dataset    Types | Supervised: |Dtrain| / |Ddev| / |Dtest| | Few-shot: |Dtrain| / |Ddev| / |Dtest| | Zero-shot: |Dtrain| / |Ddev| / |Dtest|
Few-NERD   66    | 340,382 / 48,758 / 96,901               | 66~1,056 / = |Dtrain| / 96,901        | 0 / 0 / 96,901
OntoNotes  86    | 253,239 / 2,200 / 8,962                 | 86~1,376 / = |Dtrain| / 8,962         | 0 / 0 / 8,962
BBN        46    | 86,077 / 12,824 / 12,824                | 46~736 / = |Dtrain| / 12,824          | 0 / 0 / 12,824

Table 1: Statistics of FEW-NERD, OntoNotes, and BBN for the three experimental settings. For all three settings the test sets are identical. For the training set of the few-shot setting, we report the sizes from 1-shot to 16-shot.
Shot  Metric | Few-NERD: FT / PLET      | OntoNotes: FT / PLET     | BBN: FT / PLET
1     Acc    | 8.94 / 43.87 (+34.93)    | 3.70 / 38.97 (+35.27)    | 0.80 / 40.70 (+39.90)
      MiF    | 19.85 / 60.60 (+40.75)   | 18.98 / 59.91 (+40.93)   | 5.79 / 49.25 (+43.46)
      MaF    | 19.85 / 60.60 (+40.75)   | 19.43 / 61.42 (+41.99)   | 4.42 / 48.48 (+44.06)
2     Acc    | 20.83 / 47.78 (+26.95)   | 7.27 / 39.19 (+31.92)    | 6.68 / 41.33 (+34.65)
      MiF    | 32.67 / 62.09 (+29.42)   | 24.89 / 61.09 (+36.20)   | 13.70 / 54.00 (+40.30)
      MaF    | 32.67 / 62.09 (+29.42)   | 25.64 / 62.68 (+37.04)   | 13.23 / 51.97 (+38.74)
4     Acc    | 33.09 / 57.00 (+23.91)   | 11.15 / 38.39 (+27.24)   | 19.34 / 52.21 (+32.87)
      MiF    | 44.14 / 68.61 (+24.47)   | 27.69 / 59.81 (+32.12)   | 27.03 / 61.13 (+34.10)
      MaF    | 44.14 / 68.61 (+24.47)   | 28.26 / 60.89 (+32.63)   | 24.69 / 58.91 (+34.22)
8     Acc    | 46.44 / 55.75 (+9.31)    | 18.37 / 39.37 (+21.00)   | 27.01 / 44.30 (+17.29)
      MiF    | 57.76 / 68.74 (+10.98)   | 38.16 / 57.97 (+19.81)   | 40.19 / 56.21 (+16.02)
      MaF    | 57.76 / 68.74 (+10.98)   | 37.77 / 58.32 (+20.55)   | 39.50 / 55.15 (+15.65)
16    Acc    | 60.98 / 61.58 (+0.60)    | 32.26 / 42.29 (+10.03)   | 39.67 / 55.00 (+15.33)
      MiF    | 71.59 / 72.39 (+0.80)    | 51.40 / 60.79 (+9.39)    | 49.01 / 62.84 (+13.83)
      MaF    | 71.59 / 72.39 (+0.80)    | 51.45 / 61.80 (+10.35)   | 47.09 / 62.38 (+15.29)

Table 2: Results of few-shot entity typing on FEW-NERD, OntoNotes, and BBN. All the methods use BERTbase with the same initialization weights as the backbone encoder. The training set and dev set have the same size.
5 Experiments
In this section, we conduct experiments to evaluate the effectiveness of our methods. We use FT to denote the BERT-based fine-tuning approach, PLET to denote the naive prompt-learning approach for entity typing in §3, and PLET (S) to denote the self-supervised prompt-learning approach in §4. Our experiments are carried out under fully supervised (§5.4), few-shot (§5.5), and zero-shot (§5.6) settings on three fine-grained entity typing datasets.
5.1 Datasets

We use three fine-grained entity typing datasets: FEW-NERD, OntoNotes, and BBN.
FEW-NERD. We use FEW-NERD (Ding et al., 2021b) as the main dataset, which has the following advantages: (1) FEW-NERD is large-scale and fine-grained, containing 8 coarse-grained and 66 fine-grained entity types; (2) FEW-NERD is manually annotated, so we can precisely assess the capability of entity typing models. Specifically, we use the supervised setting of the dataset, FEW-NERD (SUP), and its official split to conduct our experiments.
OntoNotes. We also use the OntoNotes 5.0 dataset (Weischedel et al., 2013) in our experiments. Following previous work on fine-grained entity typing, we adopt the 86-class version of OntoNotes, where each class has at most 3 levels of type hierarchy. The data split is identical to Shimaoka et al. (2017).
BBN. The BBN dataset is selected from the Penn Treebank corpus of Wall Street Journal texts and labeled by Weischedel and Brunstein (2005). We follow the version processed by Ren et al. (2016a) and the data split of Ren et al. (2016b). The dataset contains 46 types, and each type has a maximum type hierarchy level of 2.
5.2 Experimental Settings

The experiments are performed under three different settings to evaluate the effect of the prompt-learning method and self-supervised training. In Table 1, we show the statistics of all the settings on the three datasets.

Supervised Setting. In the fully supervised setting, all training data are used in the training phase. FT and PLET are used to train the model. We run the experiments on all three datasets with the BERT-base-cased backbone. Both hard and soft encodings are used for PLET.

Few-shot Setting. In the few-shot setting, we randomly sample 1, 2, 4, 8, and 16 instances for each entity type for training. We apply both FT and PLET with hard encoding on all three datasets.

Zero-shot Setting. In the zero-shot setting, no labeled training data are available. The model is required to infer the entity type without any supervised training. Since fine-tuning is not applicable in this setting, we only conduct experiments with PLET and PLET (S).

Metrics. For evaluation metrics, we follow the widely used setting of Ling and Weld (2012), which includes strict accuracy (Acc), loose macro F1-score (MaF), and loose micro F1-score (MiF) to evaluate the performance of models. The loose F1-score calculation considers type labels at different granularities.
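A minimal reimplementation sketch of these metrics, following our reading of Ling and Weld (2012) over predicted vs. gold type sets (not the paper's evaluation script):

```python
# Sketch of the Ling and Weld (2012) metrics: strict accuracy and
# loose macro / micro F1, computed over sets of (possibly multi-
# granularity) type labels per entity.
def strict_acc(golds, preds):
    """Fraction of entities whose predicted type set matches exactly."""
    return sum(g == p for g, p in zip(golds, preds)) / len(golds)

def loose_macro_f1(golds, preds):
    """Average per-entity precision/recall, then harmonic mean."""
    p = sum(len(g & pr) / len(pr) for g, pr in zip(golds, preds)) / len(golds)
    r = sum(len(g & pr) / len(g) for g, pr in zip(golds, preds)) / len(golds)
    return 2 * p * r / (p + r) if p + r else 0.0

def loose_micro_f1(golds, preds):
    """Aggregate overlaps over all entities, then harmonic mean."""
    inter = sum(len(g & pr) for g, pr in zip(golds, preds))
    p = inter / sum(len(pr) for pr in preds)
    r = inter / sum(len(g) for g in golds)
    return 2 * p * r / (p + r) if p + r else 0.0

golds = [{"person", "person/artist"}, {"location"}]
preds = [{"person"}, {"location"}]
print(strict_acc(golds, preds))  # 0.5 (second entity exact, first not)
```

The loose variants reward partially correct predictions, e.g. predicting only the coarse type "person" for a "person/artist" entity.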
5.3 Experimental Details

We use BERT-base (Devlin et al., 2019) as the backbone structure of our model, initialized with the corresponding pre-trained cased weights2. The hidden size is 768 and the number of layers is 12. Models are implemented with the PyTorch framework3 (Paszke et al., 2019) and Huggingface transformers4 (Wolf et al., 2020). BERT models are optimized by AdamW (Loshchilov and Hutter, 2019) with a learning rate of 5e-5. The training batch size is 16 for all models. In the supervised setting, each model is trained for 10 epochs and evaluated on the dev set every 2,000 steps. In the few-shot setting, each model is trained for 30 epochs and evaluated every 10~50 steps; each evaluation is run for 200 steps. For the methods with hard encoding, we report the experimental results of T3(·). For the soft-encoding method, we report the results with l = 2. Experiments are conducted with CUDA on NVIDIA Tesla V100 GPUs.

2 https://github.com/google-research/bert
3 https://pytorch.org
4 https://github.com/huggingface/transformers
5.4 Results of Fully Supervised Entity Typing
Dataset    Metric | FT     | PLET (H) | PLET (S)
Few-NERD   Acc    | 79.75  | 79.90    | 79.86
           MiF    | 85.74  | 85.84    | 85.76
           MaF    | 85.74  | 85.84    | 85.76
OntoNotes  Acc    | 59.71  | 60.37    | 65.68
           MiF    | 70.47  | 70.78    | 74.53
           MaF    | 76.57  | 76.42    | 79.77
BBN        Acc    | 62.39  | 65.92    | 63.11
           MiF    | 68.88  | 71.55    | 68.68
           MaF    | 67.37  | 70.82    | 67.81

Table 3: Fully supervised entity typing results. FT denotes the vanilla fine-tuning method, (H) denotes the hard-encoding strategy, and (S) denotes the soft-encoding strategy. All the methods use BERTbase with the same initialization weights as the backbone encoder.
The results on all three datasets across different models are reported in Table 3. Overall, the prompt-based methods show certain improvements over directly fine-tuned models, which indicates that the prompt-based method does help with capturing entity-type information from a given context.

It is also observed that the magnitude of the improvement and the preferred prompt-encoding strategy may vary across datasets. The prompt-based method seems less effective on the FEW-NERD dataset than on the other two, indicating that its effect partially depends on the characteristics of the dataset and that different prompt designs may suit different data. Specifically, FEW-NERD is manually annotated and contains much less noise than the other two datasets, which benefits the FT method in learning classification with an extra linear layer. Moreover, on the OntoNotes dataset, soft encoding significantly outperforms hard encoding, while on the other two datasets the effect seems reversed.
5.5 Results of Few-shot Entity Typing
Table 2 shows the results of few-shot entity typing. The prompt-based model outperforms fine-tuning by a large margin under the few-shot setting, especially when only 1~2 training instances per type are available. It should be noted that for the OntoNotes and BBN datasets, sampling 16 instances for each entity type already amounts to over 0.5% of the total training data. Meanwhile, some of the data in BBN are distantly supervised and potentially erroneous, which brings more randomness to few-shot training. The results support the idea that a well-designed prompt has much potential in mining the learned knowledge in pre-trained models and thus yields better performance in few-shot settings. The results also indicate that even when the number of entity types is large (46~86), the superiority of prompt-learning still holds.

Dataset     Metric | PLET   | PLET (S)
Few-NERD    Acc    | 17.55  | 23.99 (+6.44)
            MiF    | 28.39  | 47.98 (+19.59)
            MaF    | 28.39  | 47.98 (+19.59)
OntoNotes‡  Acc    | 25.10  | 28.27 (+3.17)
            MiF    | 33.61  | 49.79 (+16.18)
            MaF    | 37.91  | 49.95 (+12.04)
BBN         Acc    | 55.82  | 57.79 (+1.97)
            MiF    | 60.64  | 63.24 (+2.60)
            MaF    | 59.99  | 64.00 (+4.01)

Table 4: Results of zero-shot entity typing on FEW-NERD, OntoNotes, and BBN. ‡ means that we remove the "Other" class during testing. PLET denotes the prompt-learning pipeline, and PLET (S) denotes self-supervised prompt-learning; both methods use BERTbase as the backbone encoder.
5.6 Results of Zero-shot Entity Typing
Table 4 shows the results of the zero-shot entity typing task. We do not report the performance of the vanilla fine-tuning approach because it cannot produce reasonable results with a randomly initialized classifier. It should also be noted that the prompt method without fine-tuning already outperforms random guessing, indicating that adding a prompt is informative for a model pre-trained on the masked-language-modeling task (e.g., BERT) and can induce reasonable predictions in entity typing tasks. Second, the performance of the model improves by a large margin when trained on unlabeled data. This shows the effectiveness of the proposed self-supervised training approach and points to the potential of a pre-trained prompt-based model under the zero-shot setting, when no labeled data are available.
To explore the more subtle changes in performance, we carry out a case study for zero-shot entity typing. In Figure 4, we illustrate the zero-shot prediction distributions (the correct prediction and the other top-5 predictions) for four entity types in FEW-NERD: ORG-SPORTSLEAGUE, EVENT-ATTACK, MISC-CURRENCY, and LOC-MOUNTAIN. We observe that with self-supervised prompt-learning, PLET (S) can summarize entity type information and infer the related words to a certain extent. In Figure 4 (a) and Figure 4 (b), the PLET model suffers from a severe bias and predicts almost no correct labels in the zero-shot setting, since such words are low-frequency. Although there is no explicit supervision in the training of PLET (S), the model can still find the corresponding words that express the ORG-SPORTSLEAGUE and EVENT-ATTACK types. In Figure 4 (c), self-supervised learning increases the performance of the original encoder. Further, in Figure 4 (d), PLET is already able to make satisfying predictions for the type LOC-MOUNTAIN; in this case, the use of self-supervised learning has hardly weakened the performance, which means that the process of automatically summarizing type information has little negative impact on high-confidence entity types.
5.7 Effect of Templates

As stated in previous studies (Gao et al., 2020; Zhao et al., 2021), the choice of templates may have a huge impact on performance in prompt-learning. In this section, we carry out experiments to investigate such influence. Experiments are conducted under the 8-shot setting on the FEW-NERD dataset, and we use 3 different hard-encoding templates and 4 soft-encoding templates (by changing the number of prompt tokens l). The results demonstrate that the choice of templates exerts a considerable influence on the performance of prompt-based few-shot learning. For the hard-encoding templates, the phrase that describes the location, "in this sentence", contributes a remarkable improvement in performance. For the soft-encoding templates, surprisingly, the prompt-learning model yields the best result with the fewest special tokens.
6 Related Work
After a series of effective PLMs like GPT (Radford et al., 2018), BERT (Devlin et al., 2019), RoBERTa (Liu et al., 2019), and T5 (Raffel et al.,
(a) Zero-shot prediction distribution on ORG-SPORTSLEAGUE
(b) Zero-shot prediction distribution on EVENT-ATTACK
(c) Zero-shot prediction distribution on MISC-CURRENCY
(d) Zero-shot prediction distribution on LOC-MOUNTAIN
Figure 4: Zero-shot prediction distributions for four types in FEW-NERD. In each subgraph, the left part illustrates the results of PLET and the right part shows the results of PLET (S). The bars distinguish the correct predictions, the wrong predictions with correct coarse-grained types, and the wrong predictions with wrong coarse-grained types.
Encoding Strategy   Template T(x)                                  Acc     MiF     MaF
Hard-encoding       x  [Ent] is [MASK]                             54.45   67.34   67.34
                    x  [Ent] is a [MASK]                           53.93   66.44   66.44
                    x  In this sentence, [Ent] is [MASK]           55.75   68.74   68.74
Soft-encoding       x  [P] [Ent] [P1] ... [Pl] [MASK]  (l = 2)     59.25   69.58   69.58
                    x  [P] [Ent] [P1] ... [Pl] [MASK]  (l = 3)     53.66   66.06   66.06
                    x  [P] [Ent] [P1] ... [Pl] [MASK]  (l = 4)     52.96   66.01   66.01
                    x  [P] [Ent] [P1] ... [Pl] [MASK]  (l = 5)     55.44   68.39   68.39

Table 5: Effect of templates. The results are produced under the 8-shot setting on the FEW-NERD dataset by PLET.
2020), fine-tuned PLMs have demonstrated their effectiveness on various important NLP tasks, such as dialogue generation (Zhang et al., 2020), text summarization (Zhang et al., 2019; Liu and Lapata, 2019), question answering (Adiwardana et al., 2020), and text classification (Baldini Soares et al., 2019; Peng et al., 2020; Ding et al., 2021a).
Despite the success of fine-tuning PLMs, the huge gap in objective form between pre-training and fine-tuning still hinders the full use of pre-trained knowledge for downstream tasks (Liu et al., 2021b; Han et al., 2021b; Hu et al., 2021). To this end, prompt-learning has been proposed. In prompt-learning, by leveraging language prompts as contexts, downstream tasks can be expressed as cloze-style objectives similar to the pre-training objectives. The seminal work that stimulated the development of prompt-learning is the birth of GPT-3 (Brown et al., 2020), which uses hand-crafted prompts for tuning and achieves very impressive performance on various tasks, especially under the setting of few-shot learning.

Inspired by GPT-3, a series of hand-crafted prompts have been widely explored in knowledge probing (Trinh and Le, 2018; Petroni et al., 2019; Davison et al., 2019), relation classification (Han et al., 2021b), and sentiment classification and natural language inference (Schick and Schütze, 2021; Liu et al., 2021b). To avoid labor-intensive prompt design, automatic prompt search has also been extensively explored (Schick et al., 2020; Schick and Schütze, 2021; Shin et al., 2020; Gao et al., 2020; Liu et al., 2021a) to generate language phrases for prompts. Recently, continuous prompts have also been proposed (Li and Liang, 2021; Lester et al., 2021), which directly use a series of learnable continuous embeddings as prompts rather than discrete language phrases.

In this paper, we aim to stimulate PLMs with prompt-learning to capture the attribute information of entities. We take fine-grained entity typing, a crucial task in knowledge extraction that assigns entity types to entity mentions (Lin et al., 2012), as the foothold to develop prompt-learning strategies. In fact, Dai et al. (2021) use hypernym extraction patterns to enhance the context and apply masked language modeling to tackle the ultra-fine entity typing problem (Choi et al., 2018) with free-form labels, which shares a similar idea with prompt-learning. In our work, we mainly emphasize using prompt-learning to extract entity types that have been pre-defined in low-data scenarios.
7 Conclusion
This work investigates the application of prompt-learning to fine-grained entity typing. More specifically, we propose a framework, PLET, that can deal with fine-grained entity typing in fully supervised, few-shot, and zero-shot scenarios. In PLET, we first introduce a simple and effective prompt-learning pipeline that can be used to extract entity types with both sufficient and insufficient supervision. Furthermore, to handle the zero-shot setting, we propose a self-supervised prompt-learning approach that automatically learns and summarizes entity types based on unlabeled corpora and a pre-defined label schema. PLET utilizes prompts to take advantage of the prior knowledge distributed in PLMs, and can learn pre-defined type information without overfitting by performing distribution-level optimization. In future work, along the direction of PLET (S), we will explore better prompt-learning approaches to automatically learning entity types from unlabeled data.
References

Daniel Adiwardana, Minh-Thang Luong, David R. So, Jamie Hall, Noah Fiedel, Romal Thoppilan, Zi Yang, Apoorv Kulshreshtha, Gaurav Nemade, Yifeng Lu, et al. 2020. Towards a human-like open-domain chatbot. arXiv preprint arXiv:2001.09977.

Livio Baldini Soares, Nicholas FitzGerald, Jeffrey Ling, and Tom Kwiatkowski. 2019. Matching the blanks: Distributional similarity for relation learning. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2895–2905, Florence, Italy. Association for Computational Linguistics.

Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. In Proceedings of NIPS, pages 1877–1901.

Eunsol Choi, Omer Levy, Yejin Choi, and Luke Zettlemoyer. 2018. Ultra-fine entity typing. In Proceedings of ACL, pages 87–96.

Hongliang Dai, Yangqiu Song, and Haixun Wang. 2021. Ultra-fine entity typing with weak supervision from a masked language model. In Proceedings of ACL, pages 1790–1799.

Joe Davison, Joshua Feldman, and Alexander M. Rush. 2019. Commonsense knowledge mining from pretrained models. In Proceedings of EMNLP-IJCNLP, pages 1173–1178.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of NAACL-HLT, pages 4171–4186.

Ning Ding, Xiaobin Wang, Yao Fu, Guangwei Xu, Rui Wang, Pengjun Xie, Ying Shen, Fei Huang, Hai-Tao Zheng, and Rui Zhang. 2021a. Prototypical representation learning for relation extraction. In Proceedings of ICLR.

Ning Ding, Guangwei Xu, Yulin Chen, Xiaobin Wang, Xu Han, Pengjun Xie, Hai-Tao Zheng, and Zhiyuan Liu. 2021b. Few-NERD: A few-shot named entity recognition dataset. In Proceedings of ACL, pages 3198–3213.

Tianyu Gao, Adam Fisch, and Danqi Chen. 2020. Making pre-trained language models better few-shot learners. arXiv preprint arXiv:2012.15723.

Xu Han, Zhengyan Zhang, Ning Ding, Yuxian Gu, Xiao Liu, Yuqi Huo, Jiezhong Qiu, Liang Zhang, Wentao Han, Minlie Huang, et al. 2021a. Pre-trained models: Past, present and future. arXiv preprint arXiv:2106.07139.

Xu Han, Weilin Zhao, Ning Ding, Zhiyuan Liu, and Maosong Sun. 2021b. PTR: Prompt tuning with rules for text classification. arXiv preprint arXiv:2105.11259.

John Hewitt and Christopher D. Manning. 2019. A structural probe for finding syntax in word representations. In Proceedings of NAACL, pages 4129–4138.

Shengding Hu, Ning Ding, Huadong Wang, Zhiyuan Liu, Juanzi Li, and Maosong Sun. 2021. Knowledgeable prompt-tuning: Incorporating knowledge into prompt verbalizer for text classification. arXiv preprint arXiv:2108.02035.

Ganesh Jawahar, Benoît Sagot, and Djamé Seddah. 2019. What does BERT learn about the structure of language? In Proceedings of ACL, pages 3651–3657.

Brian Lester, Rami Al-Rfou, and Noah Constant. 2021. The power of scale for parameter-efficient prompt tuning. arXiv preprint arXiv:2104.08691.

Xiang Lisa Li and Percy Liang. 2021. Prefix-tuning: Optimizing continuous prompts for generation. arXiv preprint arXiv:2101.00190.

Thomas Lin, Oren Etzioni, et al. 2012. No noun phrase left behind: Detecting and typing unlinkable entities. In Proceedings of EMNLP-CoNLL, pages 893–903.

Xiao Ling and Daniel S. Weld. 2012. Fine-grained entity recognition. In AAAI.

Pengfei Liu, Weizhe Yuan, Jinlan Fu, Zhengbao Jiang, Hiroaki Hayashi, and Graham Neubig. 2021a. Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing. arXiv preprint arXiv:2107.13586.

Xiao Liu, Yanan Zheng, Zhengxiao Du, Ming Ding, Yujie Qian, Zhilin Yang, and Jie Tang. 2021b. GPT understands, too. arXiv preprint arXiv:2103.10385.

Yang Liu and Mirella Lapata. 2019. Text summarization with pretrained encoders. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3730–3740, Hong Kong, China. Association for Computational Linguistics.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.

Ilya Loshchilov and Frank Hutter. 2019. Decoupled weight decay regularization. In Proceedings of ICLR.

Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Köpf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. 2019. PyTorch: An imperative style, high-performance deep learning library. In Proceedings of NIPS, pages 8024–8035.

Hao Peng, Tianyu Gao, Xu Han, Yankai Lin, Peng Li, Zhiyuan Liu, Maosong Sun, and Jie Zhou. 2020. Learning from context or names? An empirical study on neural relation extraction. In Proceedings of EMNLP, pages 3661–3672.

Fabio Petroni, Tim Rocktäschel, Sebastian Riedel, Patrick Lewis, Anton Bakhtin, Yuxiang Wu, and Alexander Miller. 2019. Language models as knowledge bases? In Proceedings of EMNLP, pages 2463–2473.

Xipeng Qiu, Tianxiang Sun, Yige Xu, Yunfan Shao, Ning Dai, and Xuanjing Huang. 2020. Pre-trained models for natural language processing: A survey. Science China Technological Sciences, pages 1–26.

Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding by generative pre-training. OpenAI.

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. JMLR, 21:1–67.

Xiang Ren, Wenqi He, Meng Qu, Lifu Huang, Heng Ji, and Jiawei Han. 2016a. AFET: Automatic fine-grained entity typing by hierarchical partial-label embedding. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1369–1378, Austin, Texas. Association for Computational Linguistics.

Xiang Ren, Wenqi He, Meng Qu, Clare R. Voss, Heng Ji, and Jiawei Han. 2016b. Label noise reduction in entity typing by heterogeneous partial-label embedding. In Proceedings of SIGKDD, pages 1825–1834.

Timo Schick, Helmut Schmid, and Hinrich Schütze. 2020. Automatically identifying words that can serve as labels for few-shot text classification. In Proceedings of COLING, pages 5569–5578.

Timo Schick and Hinrich Schütze. 2021. Exploiting cloze-questions for few-shot text classification and natural language inference. In Proceedings of EACL, pages 255–269.

Sonse Shimaoka, Pontus Stenetorp, Kentaro Inui, and Sebastian Riedel. 2017. Neural architectures for fine-grained entity type classification. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, pages 1271–1280, Valencia, Spain. Association for Computational Linguistics.

Taylor Shin, Yasaman Razeghi, Robert L. Logan IV, Eric Wallace, and Sameer Singh. 2020. AutoPrompt: Eliciting knowledge from language models using automatically generated prompts. In Proceedings of EMNLP, pages 4222–4235.

Trieu H. Trinh and Quoc V. Le. 2018. A simple method for commonsense reasoning. arXiv preprint arXiv:1806.02847.

Dong Wang, Ning Ding, Piji Li, and Haitao Zheng. 2021. CLINE: Contrastive learning with semantic negative examples for natural language understanding. In Proceedings of ACL.

Ralph Weischedel and Ada Brunstein. 2005. BBN Pronoun Coreference and Entity Type Corpus. Linguistic Data Consortium, Philadelphia.

Ralph Weischedel, Martha Palmer, Mitchell Marcus, Eduard Hovy, Sameer Pradhan, Lance Ramshaw, Nianwen Xue, Ann Taylor, Jeff Kaufman, Michelle Franchini, Mohammed El-Bachouti, Robert Belvin, and Ann Houston. 2013. OntoNotes Release 5.0. Abacus Data Network.

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Remi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander Rush. 2020. Transformers: State-of-the-art natural language processing. In Proceedings of EMNLP, pages 38–45.

Jingqing Zhang, Yao Zhao, Mohammad Saleh, and Peter J. Liu. 2019. PEGASUS: Pre-training with extracted gap-sentences for abstractive summarization. In Proceedings of ICML, pages 11328–11339.

Yizhe Zhang, Siqi Sun, Michel Galley, Yen-Chun Chen, Chris Brockett, Xiang Gao, Jianfeng Gao, Jingjing Liu, and Bill Dolan. 2020. DialoGPT: Large-scale generative pre-training for conversational response generation. In Proceedings of ACL, pages 270–278.

Tony Z. Zhao, Eric Wallace, Shi Feng, Dan Klein, and Sameer Singh. 2021. Calibrate before use: Improving few-shot performance of language models. arXiv preprint arXiv:2102.09690.
in Figure 1, in prompt-learning, downstream tasks are formalized as equivalent cloze-style tasks, and PLMs are asked to handle these cloze-style tasks instead of the original downstream tasks. Compared with conventional fine-tuning methods, prompt-learning does not require extra neural layers and intuitively bridges the gap in objective form between pre-training and fine-tuning. Sufficient empirical analysis shows that, whether manually picking hand-crafted prompts (Liu et al., 2021b; Han et al., 2021b) or automatically building auto-generated prompts (Shin et al., 2020; Gao et al., 2020; Lester et al., 2021), taking prompts for tuning models is surprisingly effective for the knowledge stimulation and model adaptation of PLMs, especially in the low-data regime.
Intuitively, prompt-learning is applicable to fine-grained entity typing, which aims at classifying marked entities from input sequences into specific types in a pre-defined label set. We discuss this topic with a motivating example: "He is from New York". By adding a prompt with a masking token [MASK], the sentence becomes "He is from New York. In this sentence, New York is [MASK]". Due to the wealth of knowledge acquired during pre-training, PLMs can compute a probability distribution over the vocabulary at the masked position, and assign a relatively higher probability to the word "city" than to the word "person". In other words, with simple prompts, the abstract entity attributes contained in PLMs can be efficiently exploited, which is meaningful for downstream entity-related tasks.
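The cloze-style idea above can be sketched in a few lines. The prompt construction mirrors the quoted example; the masked-position distribution below is a hand-made stand-in for a real PLM's output, so its words and probabilities are illustrative assumptions, not model predictions:

```python
# Sketch of prompt-based entity typing. The "distribution" is a toy
# stand-in for a masked language model's prediction at [MASK].

def build_prompt(sentence: str, mention: str) -> str:
    """Wrap a sentence with the cloze template from the example."""
    return f"{sentence} In this sentence, {mention} is [MASK]."

prompt = build_prompt("He is from New York.", "New York")

# Hypothetical masked-position probabilities (illustrative numbers).
mask_distribution = {"city": 0.41, "state": 0.22, "place": 0.18, "person": 0.01}

# A prompt-based typer would compare label-word probabilities like this.
assert mask_distribution["city"] > mask_distribution["person"]
```

With a real PLM, `mask_distribution` would come from the MLM head's softmax over the vocabulary at the [MASK] position.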
In this work, we comprehensively explore the application of prompt-learning to fine-grained entity typing in fully supervised, few-shot, and zero-shot settings. Particularly, we first introduce a naive pipeline, where we construct entity-oriented prompts and formalize fine-grained entity typing as a cloze-style task. This simple pipeline yields promising results in our experiments, especially when supervision is insufficient. Then, to tackle the zero-shot scenario, where no explicit supervision exists in training, we develop a self-supervised strategy under our prompt-learning pipeline. Our self-supervised strategy attempts to automatically summarize entity types by optimizing the similarity of the predicted probability distributions of paired examples in prompt-learning.
Three popular benchmarks are used for our experiments, including FEW-NERD (Ding et al., 2021b), OntoNotes (Weischedel et al., 2013), and BBN (Weischedel and Brunstein, 2005). All these datasets have a complex type hierarchy consisting of rich entity types, requiring models to have good capabilities of entity attribute detection. Empirically, our method yields significant improvements on these benchmark datasets, especially under the zero-shot and few-shot settings. We also make an analysis and point out both the superiority and the bottleneck of prompt-learning in fine-grained entity typing, which may advance further efforts to extract entity attributes using PLMs. Our source code and pre-trained models will be publicly available.
2 Background
In this section, we first give a problem definition of the entity typing task (§ 2.1), followed by an introduction of conventional vanilla fine-tuning (§ 2.2) and prompt-based tuning (§ 2.3) with PLMs.
2.1 Problem Definition
The input of entity typing is a dataset D = {x1, ..., xn} with n sentences, where each sentence x contains a marked entity mention m. For each input sentence x, entity typing aims at predicting the entity type y ∈ Y of its marked mention m, where Y is a pre-defined set of entity types. Entity typing is typically regarded as a context-aware classification task. For example, in the sentence "London is the fifth album by the rock band Jesus Jones", the entity mention London should be classified as Music rather than Location. In the era of PLMs, using pre-trained neural language models (e.g., BERT) as the encoder and performing model tuning for classifying types has become a standard paradigm.
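A minimal sketch of how one typing instance could be represented; the field names are my own illustration (the paper only specifies a sentence, a marked mention m, and a gold type y from the label set Y):

```python
from dataclasses import dataclass

@dataclass
class TypingExample:
    """One entity-typing instance: a sentence, a marked mention, a label.

    Illustrative structure, not the paper's data format.
    """
    tokens: list[str]               # the sentence, tokenized
    mention_span: tuple[int, int]   # [start, end) indices of mention m
    label: str                      # entity type y, e.g. "Music"

ex = TypingExample(
    tokens="London is the fifth album by the rock band Jesus Jones".split(),
    mention_span=(0, 1),
    label="Music",
)
mention = " ".join(ex.tokens[ex.mention_span[0]:ex.mention_span[1]])
assert mention == "London"
```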
2.2 Vanilla Fine-tuning
In the vanilla fine-tuning paradigm of entity typing, for each token ti in an input sequence x = [CLS] t1 ... m ... tT [SEP] with a marked entity mention m = {ti, ..., tj}, the PLM M produces its contextualized representations {h[CLS], h1, ..., hT, h[SEP]}. Empirically, we choose the embedding of the [CLS] token, h[CLS], as the final representation, which is fed into an output layer to predict the probability distribution over the label space:

P(y ∈ Y | s) = softmax(W h[CLS] + b),   (1)

where W and b are learnable parameters. W, b, and all parameters of the PLM are tuned by maximizing the objective function (1/n) Σ_{i=1}^{n} log P(yi | si), where yi is the gold type label of si.

Figure 2: The illustration of prompt-learning for fine-grained entity typing with supervision. We take the hard-encoding prompt strategy as an example in this figure. (The figure shows an input sentence wrapped with a prompt that copies the entity mention, the [MASK] position scored by the MLM head, and label words mapped to class sets.)
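The classification head in Eq. (1) is a single affine layer followed by a softmax over the [CLS] representation. A minimal numpy sketch, where the encoder output and weights are random placeholders (768 is BERT-base's hidden size; 66 is FEW-NERD's type count):

```python
import numpy as np

def softmax(z: np.ndarray) -> np.ndarray:
    z = z - z.max()          # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(0)
hidden, num_types = 768, 66

h_cls = rng.normal(size=hidden)                       # h_[CLS] from the encoder (here: random)
W = rng.normal(scale=0.02, size=(num_types, hidden))  # learnable weight
b = np.zeros(num_types)                               # learnable bias

p = softmax(W @ h_cls + b)                            # Eq. (1): P(y | s)
assert abs(p.sum() - 1.0) < 1e-9
```

Training then maximizes the log-probability of the gold label, i.e., standard cross-entropy over these outputs.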
2.3 Prompt-based Tuning
In prompt-based tuning, for each label y ∈ Y, we define a label word set Vy = {w1, ..., wm}. Vy is a subset of the vocabulary V of the PLM M, i.e., Vy ⊆ V. By taking the union of the dictionaries corresponding to each label, we get an overall dictionary V*. For example, in sentiment classification, we could map the label y = POSITIVE into a set Vy = {great, good, wonderful}. Another primary component of prompt-learning is a prompt template T(·), which modifies the original input x into a prompt input T(x) by adding a set of additional tokens at the end of x. Conventionally, a [MASK] token is added for PLMs to predict the missing label word w ∈ V*. Thus, in prompt-learning, a classification problem is transferred into a masked language modeling problem:

p(y ∈ Y | s) = p([MASK] = w ∈ Vy | T(s)).   (2)
3 Prompt-learning for Entity Typing ANaive Pipeline
After transferred into masked language modelingthe prompt-learning method is applicable to learn-ing and aggregating type information of entities Inthis section we first introduce a naive but empiri-cally strong baseline that utilizes prompts to extractentity types with explicit supervision includingthe construction of label words (sect 31) templates(sect 32) and training (sect 33) And such a simplepipeline yields remarkable results on three bench-mark datasets Then we propose a self-supervisedprompt-learning method that automatically learnstype information from unlabeled data (sect 4)
3.1 Label Words Set V*
Fine-grained entity typing datasets usually use a hierarchical label space, such as PERSON-ARTIST (FEW-NERD) and ORGANIZATION-PARTY (OntoNotes). In this case, we use all the words in a label as the label words for that entity type, e.g., y = LOCATION-CITY → v = {location, city}. As the entity types are all well-defined nouns with clear boundaries, it is intuitive to expand the label words set V* with obtainable related nouns. For example, in Related Words¹, the top-10 related words of the label word city are "metropolis, town, municipality, urban, suburb, municipal, megalopolis, civilization, downtown, country". These words are strongly related to the class CITY, and they can hardly be mapped to other entity types, even under the same LOCATION class, such as LOCATION-MOUNTAIN, LOCATION-ISLAND, etc.
In masked language modeling, we use the confidence scores of all the words in Vy to construct the final score of the particular type y. That is, for an input x (which is mapped to T(x)) and its entity type y (which is mapped to Vy = {w1, ..., wm}), the conditional probability becomes

P(y | x) = (1/m) Σ_{j=1}^{m} λj P([MASK] = wj | T(x)),   (3)

where λj is a parameter indicating the importance of the current word wj ∈ Vy. Note that λj could also be learnable or heuristically defined during the training procedure.
3.2 Templates
In this section, we construct entity-oriented prompts for the fine-grained entity typing task. We choose hard-encoding templates with natural language and soft-encoding templates with additional special tokens in our work.

1 https://relatedwords.org
For the choice of hard-encoding templates, we do not use automatic searching methods for discrete prompts, since the fine-grained entity typing task is clearly defined and the prompts are easily made purposeful. We select simple declarative templates rather than hypernym templates to avoid grammatical errors. In a hard-encoding template, we first copy the marked entity mention in x, then add a few linking verbs and articles, followed by the [MASK] token. With the marked entity mention [Ent], we use the following templates:

T1(x) = x [Ent] is [MASK]
T2(x) = x [Ent] is a [MASK]
T3(x) = x In this sentence, [Ent] is a [MASK]

where [Ent] is the entity mention in x. In § 5, we report the results of T3(·).
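The three hard-encoding templates are plain string operations; a sketch:

```python
def t1(x: str, ent: str) -> str:
    return f"{x} {ent} is [MASK]."

def t2(x: str, ent: str) -> str:
    return f"{x} {ent} is a [MASK]."

def t3(x: str, ent: str) -> str:
    # The template whose results are reported in Section 5.
    return f"{x} In this sentence, {ent} is a [MASK]."

x = "He is from New York."
assert t3(x, "New York") == "He is from New York. In this sentence, New York is a [MASK]."
```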
We also adopt the soft-encoding strategy, which introduces additional special tokens [P1], ..., [Pl] as the template, where l is a pre-defined hyper-parameter. The template begins with a delimiter [P] and a copy of the entity mention [Ent]. The complete template becomes

T4(x) = x [P] [Ent] [P1] ... [Pl] [MASK]

where each prompt embedding is randomly initialized and optimized during training. Intuitively, these special tokens can represent a cluster of words with similar semantics in the vocabulary.
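Building the token sequence for the soft template is mechanical; only the embeddings behind the [P] tokens are trained. A sketch of the sequence construction alone:

```python
def t4_tokens(x_tokens: list[str], ent_tokens: list[str], l: int) -> list[str]:
    """Token sequence for the soft template T4:
    x [P] [Ent] [P1] ... [Pl] [MASK].
    The embeddings of [P], [P1]..[Pl] would be learnable parameters."""
    soft = [f"[P{i}]" for i in range(1, l + 1)]
    return x_tokens + ["[P]"] + ent_tokens + soft + ["[MASK]"]

seq = t4_tokens("He is from New York .".split(), ["New", "York"], l=2)
assert seq[-1] == "[MASK]"
assert "[P1]" in seq and "[P2]" in seq
```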
3.3 Training and Inference
The hard- and soft-encoding strategies provide different initializations of templates, and both can be parameterized by φ and optimized along with M during training. We train the pre-trained model M (parameterized by θ) along with the additional prompt embeddings by using the cross-entropy loss function

L = − Σ log P(y | x; θ, φ).   (4)

For inference, we directly use Eq. 3 to predict the label of the current input instance based on the predicted words at the [MASK] position.
This pipeline can be applied to the entity typing task with explicit supervision, and it is effective even if the training data are insufficient, i.e., the few-shot scenario (§ 5.5). Naturally, we consider a more extreme situation, that is, a scenario without any training data (the zero-shot scenario). In this setting, if we directly use an additional classifier to predict the label, the result is equivalent to random guessing, because the parameters of the classifier are randomly initialized. If we use prompts to infer the label based on the predicted words, although the performance is significantly better than guessing, there is still a catastrophic decline (§ 5.6). At this point, a question emerges: "Is it possible for PLMs to predict entity types without any explicit supervision?"
4 Self-supervised Prompt-learning for Zero-shot Entity Typing
With prompt-learning, the answer is yes, because in the pre-training stage the contexts of entities have already implied the corresponding type information, which provides an advantageous initialization point for the prompt-learning paradigm. For example, consider the input sentence wrapped with the T3(·) template: "Steve Jobs founded Apple. In this sentence, Steve Jobs is a [MASK]." In our observations, the probability of PLMs predicting person at the masked position is significantly higher than the probability of location. If we make reasonable use of this superior initialization point, it is possible for PLMs to automatically summarize the type information and finally extract the correct entity type.
4.1 Overview
In order to create conditions for PLMs to summarize entity types, we consider a self-supervised paradigm that optimizes the similarity of the probability distributions predicted for similar examples over a projected vocabulary V*. To achieve that in prompt-learning, we need to (1) impose a limit on the prediction range of the model, so that only the words we need, i.e., words that express entity types, participate in the optimization of the gradient; and (2) provide an unlabeled dataset where entity mentions are marked without any types, to allow the model to learn the process of inducing type information in a self-supervised manner. The inputs contain a pre-trained model M, a pre-defined label schema Y, and a dataset without labels D = {x1, ..., xn} (entity mentions are marked without any types); our goal is to make M capable of automatically carrying out zero-shot entity typing after being trained on D and Y. Using prompt-learning as the training strategy, we first construct a label words set V* from Y, and for each sentence x in D we wrap it with a hard-encoding template with a [MASK] symbol. The key idea is to make the prediction distributions of the same type of entities on V* as similar as possible. In this way, we can perform contrastive learning by sampling positive and negative examples, while ignoring the impact of other words that are not in V* on the optimization during the MLM process.

Figure 3: The illustration of self-supervised prompt-learning for fine-grained entity typing with unlabeled data and a pre-defined label set. V* denotes the label words projected from the input label set. Note that we only show the positive pair in this figure. (The figure shows a pair of prompt-wrapped inputs, one with its mention randomly hidden as [HIDE], and the JS divergence between their [MASK] predictions over V*.)
4.2 Self-supervised Learning
Although there are no labels in D, we can still develop a sampling strategy based on a simple hypothesis: the same entities in different sentences have similar types. For instance, we will sample two sentences containing "Steve Jobs" as a positive pair. Moreover, considering that entity typing is context-aware ("Steve Jobs" could be entrepreneur, designer, or philanthropist in different contexts), we choose to optimize the similarity between distributions of the words over V*. This strategy not only softens the supervision but also eliminates the impact of other words in self-supervised learning.

Particularly, we randomly sample c positive pairs, i.e., sentence pairs that share one same entity mention, denoted as D̂pos, and c negative pairs, i.e., pairs of sentences with different entity mentions marked, denoted as D̂neg, from a large-scale entity-linked corpus D. To avoid generating false negative samples, the negative samples are further restricted by a large dictionary that contains common entities and their type information: only sentence pairs with entities of different types in the dictionary are selected as negative samples. Then we wrap them with the hard-encoding template T3(·). To avoid overfitting on the entity names, we randomly hide the entity mention (in the original input and the template) with a special symbol [HIDE], with a probability of α. Empirically, α is set to 0.4.
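The pair-sampling and mention-hiding steps can be sketched as follows. The function names, the corpus format (a list of (sentence, mention) tuples), and the toy examples are illustrative assumptions; the paper's dictionary of entity types is approximated by a small Python dict:

```python
import random

def make_pairs(corpus, c, type_dict, seed=0):
    """Sample c positive pairs (same mention) and c negative pairs
    (mentions with different types per a type dictionary) from an
    entity-linked corpus of (sentence, mention) tuples. Sketch only."""
    rng = random.Random(seed)
    by_mention = {}
    for sent, mention in corpus:
        by_mention.setdefault(mention, []).append(sent)

    pos, neg = [], []
    mentions = list(by_mention)
    while len(pos) < c:
        m = rng.choice([m for m in mentions if len(by_mention[m]) >= 2])
        s1, s2 = rng.sample(by_mention[m], 2)
        pos.append(((s1, m), (s2, m)))
    while len(neg) < c:
        m1, m2 = rng.sample(mentions, 2)
        if type_dict.get(m1) != type_dict.get(m2):   # avoid false negatives
            neg.append(((rng.choice(by_mention[m1]), m1),
                        (rng.choice(by_mention[m2]), m2)))
    return pos, neg

def maybe_hide(mention, alpha=0.4, rng=random):
    """Replace the mention with [HIDE] with probability alpha."""
    return "[HIDE]" if rng.random() < alpha else mention

corpus = [("Steve Jobs founded Apple.", "Steve Jobs"),
          ("Steve Jobs unveiled the iPhone.", "Steve Jobs"),
          ("London is a big city.", "London"),
          ("He visited London.", "London")]
type_dict = {"Steve Jobs": "person", "London": "location"}
pos, neg = make_pairs(corpus, c=2, type_dict=type_dict)
```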
Since the impact of a pair of examples on training should be measured at the distribution level, we choose the Jensen-Shannon divergence as the metric to assess the similarity of two distributions. Thus, for a sentence pair (x, x′), the similarity score of the two predictions h and h′ at the [MASK] position is computed by

s(h, h′) = JS(P_V*(w | x), P_V*(w | x′)),   (5)

where JS is the Jensen-Shannon divergence, and P_V*(w | x) and P_V*(w | x′) are the probability distributions of the predicted token w over V*, obtained from h and h′.
As we attempt to make the predictions of the positive pairs similar, the objective is computed by

L = − (1/|D̂pos|²) Σ_{x∈D̂pos} Σ_{x′∈D̂pos} log(1 − s(h, h′)) − (1/|D̂neg|²) Σ_{x∈D̂neg} Σ_{x′∈D̂neg} γ log(s(h, h′)),   (6)

where γ is a penalty term, because the assumption is loose for negative pairs. Overall, we use the entity-linked English Wikipedia corpus as the raw data and generate about 1 million pairs of data each as D̂pos and D̂neg.
Dataset     Type   Supervised                        Few-shot                              Zero-shot
                   |Dtrain|   |Ddev|    |Dtest|      |Dtrain|    |Ddev|       |Dtest|      |Dtrain|  |Ddev|  |Dtest|
Few-NERD    66     340,382    48,758    96,901       66~1,056    = |Dtrain|   96,901       0         0       96,901
OntoNotes   86     253,239    2,200     8,962        86~1,376    = |Dtrain|   8,962        0         0       8,962
BBN         46     86,077     12,824    12,824       46~736      = |Dtrain|   12,824       0         0       12,824

Table 1: Statistics of FEW-NERD, OntoNotes, and BBN under the three experimental settings. It can be seen that for all three settings the test sets are identical. For the training set of the few-shot setting, we report the sizes from the 1-shot to the 16-shot setting.
Shot  Metric  Few-NERD                         OntoNotes                        BBN
              Fine-tuning  PLET                Fine-tuning  PLET                Fine-tuning  PLET

1     Acc     8.94         43.87 (+34.93)      3.70         38.97 (+35.27)      0.80         40.70 (+39.90)
      MiF     19.85        60.60 (+45.75)      18.98        59.91 (+40.93)      5.79         49.25 (+43.46)
      MaF     19.85        60.60 (+40.75)      19.43        61.42 (+41.99)      4.42         48.48 (+43.06)

2     Acc     20.83        47.78 (+26.95)      7.27         39.19 (+31.92)      6.68         41.33 (+34.65)
      MiF     32.67        62.09 (+29.42)      24.89        61.09 (+36.20)      13.70        54.00 (+40.30)
      MaF     32.67        62.09 (+29.42)      25.64        62.68 (+37.04)      13.23        51.97 (+38.74)

4     Acc     33.09        57.00 (+23.91)      11.15        38.39 (+27.24)      19.34        52.21 (+32.87)
      MiF     44.14        68.61 (+24.47)      27.69        59.81 (+32.12)      27.03        61.13 (+34.10)
      MaF     44.14        68.61 (+24.47)      28.26        60.89 (+32.63)      24.69        58.91 (+34.22)

8     Acc     46.44        55.75 (+9.31)       18.37        39.37 (+21.00)      27.01        44.30 (+17.29)
      MiF     57.76        68.74 (+10.98)      38.16        57.97 (+19.81)      40.19        56.21 (+16.02)
      MaF     57.76        68.74 (+10.98)      37.77        58.32 (+20.55)      39.50        55.15 (+15.65)

16    Acc     60.98        61.58 (+0.60)       32.26        42.29 (+10.03)      39.67        55.00 (+15.33)
      MiF     71.59        72.39 (+0.80)       51.40        60.79 (+9.39)       49.01        62.84 (+13.83)
      MaF     71.59        72.39 (+0.80)       51.45        61.80 (+10.35)      47.09        62.38 (+15.29)

Table 2: Results of few-shot entity typing on FEW-NERD, OntoNotes, and BBN. All methods use BERT-base with the same initialization weights as the backbone encoder. The training set and dev set have the same size.
5 Experiments
In this section, we conduct experiments to evaluate the effectiveness of our methods. We use FT to denote the BERT-based fine-tuning approach, PLET to denote the naive prompt-learning approach for entity typing in § 3, and PLET (S) to denote the self-supervised prompt-learning approach in § 4. Our experiments are carried out under fully supervised (§ 5.4), few-shot (§ 5.5), and zero-shot (§ 5.6) settings on three fine-grained entity typing datasets.
5.1 Datasets
We use three fine-grained entity typing datasets: FEW-NERD, OntoNotes, and BBN.
FEW-NERD. We use FEW-NERD (Ding et al., 2021b) as the main dataset, which has the following advantages: (1) FEW-NERD is large-scale and fine-grained, containing 8 coarse-grained and 66 fine-grained entity types; (2) FEW-NERD is manually annotated, so we can precisely assess the capability of entity typing models. Specifically, we use the supervised setting of the dataset, FEW-NERD (SUP), and its official split to conduct our experiments.
OntoNotes. We also use the OntoNotes 5.0 dataset (Weischedel et al., 2013) in experiments. Following previous works on fine-grained entity typing, we adopt the 86-class version of OntoNotes, where each class has at most 3 levels of the type hierarchy. The data split is identical to that of Shimaoka et al. (2017).
BBN. The BBN dataset is selected from the Penn Treebank corpus of Wall Street Journal texts and labeled by Weischedel and Brunstein (2005). We follow the version processed by Ren et al. (2016a) and the data split of Ren et al. (2016b). The dataset contains 46 types, and each type has a maximum type hierarchy depth of 2.
5.2 Experimental Settings
The experiments are performed under three different settings to evaluate the effect of the prompt-learning method and self-supervised training. In Table 1, we show the statistics of all the settings on the three datasets.
Supervised Setting. In the fully supervised setting, all training data are used in the training phase. FT and PLET are used to train the model. We run the experiments on all three datasets with the BERT-base-cased backbone. Both hard and soft encodings are used for PLET.
Few-shot Setting. In the few-shot setting, we randomly sample 1, 2, 4, 8, and 16 instances for each entity type for training. We apply both the FT and PLET methods with hard encoding on all three datasets.
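The per-type sampling in this setting can be sketched as follows (a minimal illustration; the list-of-(sentence, label) dataset format and the function name are our own, not from the paper):

```python
import random
from collections import defaultdict

def sample_k_shot(dataset, k, seed=0):
    """Randomly sample k instances per entity type.

    `dataset` is assumed to be a list of (sentence, type_label) pairs;
    types with fewer than k instances contribute all of their instances.
    """
    rng = random.Random(seed)
    by_type = defaultdict(list)
    for sent, label in dataset:
        by_type[label].append((sent, label))
    support = []
    for label, items in by_type.items():
        support.extend(rng.sample(items, min(k, len(items))))
    return support

# Toy corpus with 2 entity types
toy = [("A", "person"), ("B", "person"),
       ("C", "location"), ("D", "location"), ("E", "location")]
print(len(sample_k_shot(toy, 1)))  # 1-shot over 2 types -> 2 instances
```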
Zero-shot Setting. In the zero-shot setting, no labeled training data are available. The model is required to infer the entity type without any supervised training. Since fine-tuning is not applicable in this setting, we only conduct experiments with PLET and PLET (S).
Metrics. In terms of evaluation metrics, we follow the widely used setting of Ling and Weld (2012), which includes strict accuracy (Acc), loose macro F1-score (MaF), and loose micro F1-score (MiF) to evaluate the performance of models. The loose F1-score calculation considers type labels at different granularities.
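These metrics can be sketched over per-mention sets of type labels, where each gold or predicted label is expanded to all of its granularities (e.g., LOCATION/CITY contributes both "location" and "location/city"). The set-based representation below is our assumption, following the common reading of Ling and Weld (2012):

```python
def strict_acc(gold, pred):
    # gold, pred: lists of per-mention label sets; strict = exact set match
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)

def loose_macro_f1(gold, pred):
    # average per-mention precision/recall, then take their F1
    p = sum(len(g & t) / len(t) for g, t in zip(gold, pred) if t) / len(gold)
    r = sum(len(g & t) / len(g) for g, t in zip(gold, pred) if g) / len(gold)
    return 2 * p * r / (p + r) if p + r else 0.0

def loose_micro_f1(gold, pred):
    # pool the label overlaps over all mentions before computing P/R
    inter = sum(len(g & t) for g, t in zip(gold, pred))
    p = inter / sum(len(t) for t in pred)
    r = inter / sum(len(g) for g in gold)
    return 2 * p * r / (p + r) if p + r else 0.0

gold = [{"location", "location/city"}, {"person"}]
pred = [{"location", "location/island"}, {"person"}]
print(strict_acc(gold, pred))               # 0.5
print(round(loose_macro_f1(gold, pred), 2)) # 0.75
```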
5.3 Experimental Details
We use BERT-base (Devlin et al., 2019) as the backbone structure of our model, initialized with the corresponding pre-trained cased weights2. The hidden size is 768 and the number of layers is 12. Models are implemented with the PyTorch framework3 (Paszke et al., 2019) and Huggingface Transformers4 (Wolf et al., 2020). BERT models are optimized by AdamW (Loshchilov and Hutter, 2019) with a learning rate of 5e-5. The training batch size is 16 for all models. In the supervised setting, each model is trained for 10 epochs and evaluated on the dev set every 2000 steps. In the few-shot setting, each model is trained for 30 epochs and evaluated every 10~50 steps; each evaluation runs for 200 steps. For the methods
2 https://github.com/google-research/bert
3 https://pytorch.org
4 https://github.com/huggingface/transformers
with hard-encoding, we report the experimental results of T3(·). For the soft-encoding method, we report the results of l = 2. Experiments are conducted with CUDA on NVIDIA Tesla V100 GPUs.
5.4 Results of Fully Supervised Entity Typing
Dataset     Metric   FT      PLET (H)   PLET (S)
Few-NERD    Acc      79.75   79.90      79.86
            MiF      85.74   85.84      85.76
            MaF      85.74   85.84      85.76
OntoNotes   Acc      59.71   60.37      65.68
            MiF      70.47   70.78      74.53
            MaF      76.57   76.42      79.77
BBN         Acc      62.39   65.92      63.11
            MiF      68.88   71.55      68.68
            MaF      67.37   70.82      67.81
Table 3: Fully supervised entity typing results. FT denotes the vanilla fine-tuning method, (H) denotes the hard-encoding strategy, and (S) denotes the soft-encoding strategy. All methods use BERT-base with the same initialization weights as the backbone encoder.
The results on all three datasets across different models are reported in Table 3. Overall, the prompt-based methods show certain improvements compared to directly fine-tuned models, which indicates that the prompt-based method does help with capturing entity-type information from a given context.
It is also observed that the magnitude of the improvement and the preferred prompt-encoding strategy may vary across datasets. The prompt-based method seems less effective on the FEW-NERD dataset than on the other two. This indicates that the effect of the prompt-based method partially depends on the characteristics of the dataset, and that different prompt designs may suit different data. Specifically, FEW-NERD is manually annotated and contains much less noise than the other two datasets, which benefits the FT method in learning classification with an extra linear layer. Moreover, for the OntoNotes dataset, soft encoding significantly outperforms hard encoding, while for the other two datasets the effect seems reversed.
5.5 Results of Few-shot Entity Typing
Table 2 shows the results on few-shot entity typing. The prompt-based model outperforms fine-tuning by a large margin under the few-shot setting, especially when only 1~2 training instances per type are available. It should be noted
Dataset      Metric   PLET    PLET (S)
Few-NERD     Acc      17.55   23.99 (+6.44)
             MiF      28.39   47.98 (+19.59)
             MaF      28.39   47.98 (+19.59)
OntoNotes‡   Acc      25.10   28.27 (+3.17)
             MiF      33.61   49.79 (+16.18)
             MaF      37.91   49.95 (+12.04)
BBN          Acc      55.82   57.79 (+1.97)
             MiF      60.64   63.24 (+2.60)
             MaF      59.99   64.00 (+4.01)
Table 4: Results of zero-shot entity typing on FEW-NERD, OntoNotes, and BBN. ‡ means that we remove the "Other" class during testing. PLET denotes the prompt-learning pipeline and PLET (S) denotes self-supervised prompt-learning; both methods use BERT-base as the backbone encoder.
that for the OntoNotes and BBN datasets, sampling 16 instances for each entity type already amounts to over 0.5% of the total training data. Meanwhile, some of the data in BBN are distantly supervised and potentially erroneous, which brings more randomness to few-shot training. The results support the idea that a well-designed prompt has much potential in mining the knowledge learned by pre-trained models and thus yields better performance in few-shot settings. The results also indicate that even when the number of entity types is large (46~86), the superiority of prompt-learning still holds.
5.6 Results of Zero-shot Entity Typing
Table 4 shows the results of the zero-shot entity typing task on the three datasets. We do not report the performance of the vanilla fine-tuning approach because it cannot produce reasonable results with a randomly initialized classifier. It should also be noted that the prompt method without fine-tuning already outperforms random guessing. This indicates that adding a prompt is informative for a model pre-trained with a masked-language-modeling objective (e.g., BERT) and can induce reasonable predictions in entity typing tasks. Second, the performance of the model improves by a large margin if it is trained on unlabeled data. This shows the effectiveness of the proposed self-supervised training approach and points to the potential of a pre-trained prompt-based model under the zero-shot setting, when no labeled data are available.
To explore the more subtle changes in performance, we carry out a case study for zero-shot entity typing. In Figure 4, we illustrate the zero-shot prediction distributions (the correct prediction and the other top-5 predictions) for four entity types in FEW-NERD: ORG-SPORTSLEAGUE, EVENT-ATTACK, MISC-CURRENCY, and LOC-MOUNTAIN. We observe that with self-supervised prompt-learning, PLET (S) can summarize entity type information and infer the related words to a certain extent. In Figure 4 (a) and Figure 4 (b), the PLET model suffers from a severe bias and predicts almost no correct labels in the zero-shot setting, since such words are low-frequency. Although there is no explicit supervision in the pre-training stage of PLET (S), the model can still find the corresponding words that express the ORG-SPORTSLEAGUE and EVENT-ATTACK types. In Figure 4 (c), self-supervised learning increases the performance of the original encoder. Further, in Figure 4 (d), PLET already makes satisfying predictions for the type LOC-MOUNTAIN; in this case, self-supervised learning has hardly weakened the performance, which means that the process of automatically summarizing type information has little negative impact on high-confidence entity types.
5.7 Effect of Templates
As stated in previous studies (Gao et al., 2020; Zhao et al., 2021), the choice of templates may have a huge impact on performance in prompt-learning. In this section, we carry out experiments to investigate this influence. Experiments are conducted under the 8-shot setting on the FEW-NERD dataset, with 3 different hard-encoding templates and 4 soft-encoding templates (obtained by changing the number of prompt tokens l). The results in Table 5 demonstrate that the choice of templates exerts considerable influence on the performance of prompt-based few-shot learning. For the hard-encoding templates, the phrase that describes the location, "In this sentence", contributes a remarkable improvement in performance. For the soft-encoding templates, surprisingly, the prompt-learning model yields the best result with the fewest special tokens.
6 Related Work
After a series of effective PLMs like GPT (Radford et al., 2018), BERT (Devlin et al., 2019), RoBERTa (Liu et al., 2019), and T5 (Raffel et al.,
[Figure 4: bar charts comparing the zero-shot prediction distributions of PLET (left part of each subgraph) and PLET (S) (right part) on four FEW-NERD types: (a) ORG-SPORTSLEAGUE, (b) EVENT-ATTACK, (c) MISC-CURRENCY, and (d) LOC-MOUNTAIN. For PLET, the most frequent prediction is PRODUCT-GAME in panels (a) and (b) and the correct type in panels (c) and (d); for PLET (S), the correct type is the most frequent prediction in all four panels.]
Figure 4: Zero-shot prediction distributions on four types in FEW-NERD. In each subgraph, the left part illustrates the results of PLET and the right part shows the results of PLET (S). Bars mark the correct predictions, the wrong predictions with correct coarse-grained types, and the wrong predictions with wrong coarse-grained types.
Encoding Strategy   Template T(x)                                    Acc     MiF     MaF
Hard-encoding       x. [Ent] is [MASK].                              54.45   67.34   67.34
                    x. [Ent] is a [MASK].                            53.93   66.44   66.44
                    x. In this sentence, [Ent] is a [MASK].          55.75   68.74   68.74
Soft-encoding       x. [P] [Ent] [P1] ... [Pl] [MASK]  (l = 2)       59.25   69.58   69.58
                    x. [P] [Ent] [P1] ... [Pl] [MASK]  (l = 3)       53.66   66.06   66.06
                    x. [P] [Ent] [P1] ... [Pl] [MASK]  (l = 4)       52.96   66.01   66.01
                    x. [P] [Ent] [P1] ... [Pl] [MASK]  (l = 5)       55.44   68.39   68.39
Table 5: Effect of templates. The results are produced under the 8-shot setting on the FEW-NERD dataset by PLET.
2020), fine-tuned PLMs have demonstrated their effectiveness on various important NLP tasks, such as dialogue generation (Zhang et al., 2020), text summarization (Zhang et al., 2019; Liu and Lapata, 2019), question answering (Adiwardana et al., 2020), and text classification (Baldini Soares et al., 2019; Peng et al., 2020; Ding et al., 2021a).
Despite the success of fine-tuning PLMs, the huge gap in objective forms between pre-training and fine-tuning still hinders the full use of pre-trained knowledge for downstream tasks (Liu et al., 2021b; Han et al., 2021b; Hu et al., 2021). To this end, prompt-learning has been proposed. In prompt-learning, by leveraging language prompts as contexts, downstream tasks can be expressed as cloze-style objectives similar to the pre-training objectives. The seminal work that stimulated the development of prompt-learning is the birth of GPT-3 (Brown et al., 2020), which uses hand-crafted prompts for tuning and achieves very impressive performance on various tasks, especially under the few-shot learning setting.
Inspired by GPT-3, a series of hand-crafted prompts have been widely explored in knowledge probing (Trinh and Le, 2018; Petroni et al., 2019; Davison et al., 2019), relation classification (Han et al., 2021b), sentiment classification, and natural language inference (Schick and Schütze, 2021; Liu et al., 2021b). To avoid labor-intensive prompt design, automatic prompt search has also been extensively explored (Schick et al., 2020; Schick and Schütze, 2021; Shin et al., 2020; Gao et al., 2020; Liu et al., 2021a) to generate language phrases for prompts. Recently, some continuous prompts have also been proposed (Li and Liang, 2021; Lester et al., 2021), which directly use a series of learnable continuous embeddings as prompts rather than discrete language phrases.
In this paper, we aim to stimulate PLMs with prompt-learning to capture the attribute information of entities. We take fine-grained entity typing, a crucial task in knowledge extraction that assigns entity types to entity mentions (Lin et al., 2012), as the foothold to develop prompt-learning strategies. In fact, Dai et al. (2021) use hypernym extraction patterns to enhance the context and apply masked language modeling to tackle the ultra-fine entity typing problem (Choi et al., 2018) with free-form labels, which shares a similar idea with prompt-learning. In our work, we mainly emphasize using prompt-learning to extract entity types that have been pre-defined in low-data scenarios.
7 Conclusion
This work investigates the application of prompt-learning to fine-grained entity typing. More specifically, we propose a framework, PLET, that deals with fine-grained entity typing in fully supervised, few-shot, and zero-shot scenarios. In PLET, we first introduce a simple and effective prompt-learning pipeline that can be used to extract entity types with both sufficient and insufficient supervision. Furthermore, to handle the zero-shot setting, we propose a self-supervised prompt-learning approach that automatically learns and summarizes entity types based on unlabeled corpora and a pre-defined label schema. PLET utilizes prompts to take advantage of the prior knowledge distributed in PLMs, and can learn pre-defined type information without overfitting by performing distribution-level optimization. In future work, along the direction of PLET (S), we will explore better prompt-learning approaches for automatically learning entity types from unlabeled data.
References

Daniel Adiwardana, Minh-Thang Luong, David R. So, Jamie Hall, Noah Fiedel, Romal Thoppilan, Zi Yang, Apoorv Kulshreshtha, Gaurav Nemade, Yifeng Lu, et al. 2020. Towards a human-like open-domain chatbot. arXiv preprint arXiv:2001.09977.

Livio Baldini Soares, Nicholas FitzGerald, Jeffrey Ling, and Tom Kwiatkowski. 2019. Matching the blanks: Distributional similarity for relation learning. In Proceedings of ACL, pages 2895–2905.

Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. In Proceedings of NIPS, pages 1877–1901.

Eunsol Choi, Omer Levy, Yejin Choi, and Luke Zettlemoyer. 2018. Ultra-fine entity typing. In Proceedings of ACL, pages 87–96.

Hongliang Dai, Yangqiu Song, and Haixun Wang. 2021. Ultra-fine entity typing with weak supervision from a masked language model. In Proceedings of ACL, pages 1790–1799.

Joe Davison, Joshua Feldman, and Alexander M. Rush. 2019. Commonsense knowledge mining from pretrained models. In Proceedings of EMNLP-IJCNLP, pages 1173–1178.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of NAACL-HLT, pages 4171–4186.

Ning Ding, Xiaobin Wang, Yao Fu, Guangwei Xu, Rui Wang, Pengjun Xie, Ying Shen, Fei Huang, Hai-Tao Zheng, and Rui Zhang. 2021a. Prototypical representation learning for relation extraction. In Proceedings of ICLR.

Ning Ding, Guangwei Xu, Yulin Chen, Xiaobin Wang, Xu Han, Pengjun Xie, Hai-Tao Zheng, and Zhiyuan Liu. 2021b. Few-NERD: A few-shot named entity recognition dataset. In Proceedings of ACL, pages 3198–3213.

Tianyu Gao, Adam Fisch, and Danqi Chen. 2020. Making pre-trained language models better few-shot learners. arXiv preprint arXiv:2012.15723.

Xu Han, Zhengyan Zhang, Ning Ding, Yuxian Gu, Xiao Liu, Yuqi Huo, Jiezhong Qiu, Liang Zhang, Wentao Han, Minlie Huang, et al. 2021a. Pre-trained models: Past, present and future. arXiv preprint arXiv:2106.07139.

Xu Han, Weilin Zhao, Ning Ding, Zhiyuan Liu, and Maosong Sun. 2021b. PTR: Prompt tuning with rules for text classification. arXiv preprint arXiv:2105.11259.

John Hewitt and Christopher D. Manning. 2019. A structural probe for finding syntax in word representations. In Proceedings of NAACL, pages 4129–4138.

Shengding Hu, Ning Ding, Huadong Wang, Zhiyuan Liu, Juanzi Li, and Maosong Sun. 2021. Knowledgeable prompt-tuning: Incorporating knowledge into prompt verbalizer for text classification. arXiv preprint arXiv:2108.02035.

Ganesh Jawahar, Benoît Sagot, and Djamé Seddah. 2019. What does BERT learn about the structure of language? In Proceedings of ACL, pages 3651–3657.

Brian Lester, Rami Al-Rfou, and Noah Constant. 2021. The power of scale for parameter-efficient prompt tuning. arXiv preprint arXiv:2104.08691.

Xiang Lisa Li and Percy Liang. 2021. Prefix-tuning: Optimizing continuous prompts for generation. arXiv preprint arXiv:2101.00190.

Thomas Lin, Oren Etzioni, et al. 2012. No noun phrase left behind: Detecting and typing unlinkable entities. In Proceedings of EMNLP-CoNLL, pages 893–903.

Xiao Ling and Daniel S. Weld. 2012. Fine-grained entity recognition. In AAAI.

Pengfei Liu, Weizhe Yuan, Jinlan Fu, Zhengbao Jiang, Hiroaki Hayashi, and Graham Neubig. 2021a. Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing. arXiv preprint arXiv:2107.13586.

Xiao Liu, Yanan Zheng, Zhengxiao Du, Ming Ding, Yujie Qian, Zhilin Yang, and Jie Tang. 2021b. GPT understands, too. arXiv preprint arXiv:2103.10385.

Yang Liu and Mirella Lapata. 2019. Text summarization with pretrained encoders. In Proceedings of EMNLP-IJCNLP, pages 3730–3740.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.

Ilya Loshchilov and Frank Hutter. 2019. Decoupled weight decay regularization. In Proceedings of ICLR.

Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Köpf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. 2019. PyTorch: An imperative style, high-performance deep learning library. In Proceedings of NIPS, pages 8024–8035.

Hao Peng, Tianyu Gao, Xu Han, Yankai Lin, Peng Li, Zhiyuan Liu, Maosong Sun, and Jie Zhou. 2020. Learning from context or names? An empirical study on neural relation extraction. In Proceedings of EMNLP, pages 3661–3672.

Fabio Petroni, Tim Rocktäschel, Sebastian Riedel, Patrick Lewis, Anton Bakhtin, Yuxiang Wu, and Alexander Miller. 2019. Language models as knowledge bases? In Proceedings of EMNLP, pages 2463–2473.

Xipeng Qiu, Tianxiang Sun, Yige Xu, Yunfan Shao, Ning Dai, and Xuanjing Huang. 2020. Pre-trained models for natural language processing: A survey. Science China Technological Sciences, pages 1–26.

Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding by generative pre-training. OpenAI.

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. JMLR, 21:1–67.

Xiang Ren, Wenqi He, Meng Qu, Lifu Huang, Heng Ji, and Jiawei Han. 2016a. AFET: Automatic fine-grained entity typing by hierarchical partial-label embedding. In Proceedings of EMNLP, pages 1369–1378.

Xiang Ren, Wenqi He, Meng Qu, Clare R. Voss, Heng Ji, and Jiawei Han. 2016b. Label noise reduction in entity typing by heterogeneous partial-label embedding. In Proceedings of SIGKDD, pages 1825–1834.

Timo Schick, Helmut Schmid, and Hinrich Schütze. 2020. Automatically identifying words that can serve as labels for few-shot text classification. In Proceedings of COLING, pages 5569–5578.

Timo Schick and Hinrich Schütze. 2021. Exploiting cloze-questions for few-shot text classification and natural language inference. In Proceedings of EACL, pages 255–269.

Sonse Shimaoka, Pontus Stenetorp, Kentaro Inui, and Sebastian Riedel. 2017. Neural architectures for fine-grained entity type classification. In Proceedings of EACL, pages 1271–1280.

Taylor Shin, Yasaman Razeghi, Robert L. Logan IV, Eric Wallace, and Sameer Singh. 2020. AutoPrompt: Eliciting knowledge from language models using automatically generated prompts. In Proceedings of EMNLP, pages 4222–4235.

Trieu H. Trinh and Quoc V. Le. 2018. A simple method for commonsense reasoning. arXiv preprint arXiv:1806.02847.

Dong Wang, Ning Ding, Piji Li, and Haitao Zheng. 2021. CLINE: Contrastive learning with semantic negative examples for natural language understanding. In Proceedings of ACL.

Ralph Weischedel and Ada Brunstein. 2005. BBN Pronoun Coreference and Entity Type Corpus. Linguistic Data Consortium, Philadelphia.

Ralph Weischedel, Martha Palmer, Mitchell Marcus, Eduard Hovy, Sameer Pradhan, Lance Ramshaw, Nianwen Xue, Ann Taylor, Jeff Kaufman, Michelle Franchini, Mohammed El-Bachouti, Robert Belvin, and Ann Houston. 2013. OntoNotes Release 5.0. Abacus Data Network.

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Remi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander Rush. 2020. Transformers: State-of-the-art natural language processing. In Proceedings of EMNLP, pages 38–45.

Jingqing Zhang, Yao Zhao, Mohammad Saleh, and Peter J. Liu. 2019. PEGASUS: Pre-training with extracted gap-sentences for abstractive summarization. In Proceedings of ICML, pages 11328–11339.

Yizhe Zhang, Siqi Sun, Michel Galley, Yen-Chun Chen, Chris Brockett, Xiang Gao, Jianfeng Gao, Jingjing Liu, and Bill Dolan. 2020. DialoGPT: Large-scale generative pre-training for conversational response generation. In Proceedings of ACL, pages 270–278.

Tony Z. Zhao, Eric Wallace, Shi Feng, Dan Klein, and Sameer Singh. 2021. Calibrate before use: Improving few-shot performance of language models. arXiv preprint arXiv:2102.09690.
[Figure 2 illustration: the input "London is one of the biggest cities in the world" is wrapped into the prompt "London is a [MASK] [SEP]" by copying the entity mention; the MLM head's predictions at the [MASK] position (e.g., "city", "location") are mapped through label words to class sets such as LOCATION-CITY.]
Figure 2: The illustration of prompt-learning for fine-grained entity typing with supervision. We take the hard-encoding prompt strategy as an example in this figure.
ing the objective function

(1/n) ∑_{i=1}^{n} log P(y_i | s_i),  (1)

where y_i is the gold type label of s_i.
2.3 Prompt-based Tuning
In prompt-based tuning, for each label y ∈ Y, we define a label word set V_y = {w_1, ..., w_m}. V_y is a subset of the vocabulary V of the PLM M, i.e., V_y ⊆ V. By taking the union of the sets corresponding to each label, we get an overall set V*. For example, in sentiment classification, we could map the label y = POSITIVE into a set V_y = {great, good, wonderful, ...}. Another primary component of prompt-learning is a prompt template T(·), which modifies the original input x into a prompt input T(x) by adding a set of additional tokens at the end of x. Conventionally, a [MASK] token is added for PLMs to predict the missing label word w ∈ V*. Thus, in prompt-learning, a classification problem is transferred into a masked language modeling problem:

P(y ∈ Y | s) = P([MASK] = w ∈ V_y | T(s)).  (2)
3 Prompt-learning for Entity Typing: A Naive Pipeline
After being transferred into masked language modeling, the prompt-learning method is applicable to learning and aggregating the type information of entities. In this section, we first introduce a naive but empirically strong baseline that utilizes prompts to extract entity types with explicit supervision, including the construction of label words (§ 3.1), templates (§ 3.2), and training (§ 3.3). Such a simple pipeline yields remarkable results on three benchmark datasets. Then, we propose a self-supervised prompt-learning method that automatically learns type information from unlabeled data (§ 4).
3.1 Label Words Set V*
Fine-grained entity typing datasets usually use a hierarchical label space, such as PERSON/ARTIST (FEW-NERD) and ORGANIZATION/PARTY (OntoNotes). In this case, we use all the words as the label word set V_y for this entity type; for example, y = LOCATION/CITY → V_y = {location, city}. As the entity types are all well-defined nouns with clear boundaries, it is intuitive to expand the label word set with obtainable related nouns. For example, in Related Words1, the top-10 related words of the label word city are "metropolis, town, municipality, urban, suburb, municipal, megalopolis, civilization, downtown, country". These words are strongly related to the class CITY, and they are hardly mapped to other entity types, even under the same LOCATION class, such as LOCATION/MOUNTAIN, LOCATION/ISLAND, etc.
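A minimal sketch of this label-word-set construction (the RELATED dictionary is an illustrative stand-in for a resource like Related Words, and the "/"-separated label format is an assumption):

```python
# Stand-in for an external related-words resource (entries are illustrative)
RELATED = {
    "city": ["metropolis", "town", "municipality"],
    "artist": ["painter", "musician"],
}

def build_label_words(label, expand=True):
    """Split a hierarchical label into its component words and
    optionally expand each word with related nouns.
    e.g. "LOCATION/CITY" -> ["location", "city", "metropolis", ...]"""
    words = [w.lower() for w in label.split("/")]
    if expand:
        for w in list(words):
            words.extend(RELATED.get(w, []))
    return words

print(build_label_words("LOCATION/CITY"))
# ['location', 'city', 'metropolis', 'town', 'municipality']
```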
In masked language modeling, we use the confidence scores of all the words in V_y to construct the final score of the particular type y. That is, for an input x (which is mapped to T(x)) and its entity type y (which is mapped to V_y = {w_1, ..., w_m}), the conditional probability becomes

P(y | x) = (1/m) ∑_{j=1}^{m} λ_j P([MASK] = w_j | T(x)),  (3)

where λ_j is a parameter that indicates the importance of the current word w_j ∈ V_y. Note that λ_j can also be learnable or heuristically defined during the training procedure.
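Eq. 3 can be sketched with toy numbers as follows (the vocabulary, logits, and uniform λ are illustrative; in the actual model, the probabilities come from the PLM's MLM head at the [MASK] position):

```python
from math import exp

def softmax(logits):
    # numerically stable softmax over a list of logits
    m = max(logits)
    exps = [exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def type_scores(mask_logits, vocab, label_words, lambdas=None):
    """Eq. (3): score each type y by a (weighted) average of the
    MASK probabilities of its label words w_j in V_y."""
    probs = softmax(mask_logits)
    idx = {w: i for i, w in enumerate(vocab)}
    scores = {}
    for y, words in label_words.items():
        lam = (lambdas or {}).get(y) or [1.0 / len(words)] * len(words)
        scores[y] = sum(l * probs[idx[w]] for l, w in zip(lam, words))
    return scores

# Toy MASK-position logits over a tiny vocabulary (illustrative values)
vocab = ["city", "location", "person", "athlete", "cat"]
logits = [3.0, 2.5, 0.1, 0.0, -1.0]
V = {"LOCATION-CITY": ["location", "city"],
     "PERSON-ATHLETE": ["person", "athlete"]}
scores = type_scores(logits, vocab, V)
print(max(scores, key=scores.get))  # LOCATION-CITY
```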
3.2 Templates
In this section, we construct entity-oriented prompts for the fine-grained entity typing task. We
1 https://relatedwords.org
choose hard-encoding templates with natural language and soft-encoding templates with additional special tokens in our work.
For the choice of hard-encoding templates, we do not use automatic searching methods for discrete prompts, since the fine-grained entity typing task is clearly defined and the prompts can easily be made purposeful. We select simple declarative templates rather than hypernym templates to avoid grammatical errors. In the hard-encoding setting, we first copy the marked entity mention in x, then add a few linking verbs and articles, followed by the [MASK] token. With the marked entity mention [Ent], we use the following templates:

T1(x) = x. [Ent] is [MASK].
T2(x) = x. [Ent] is a [MASK].
T3(x) = x. In this sentence, [Ent] is a [MASK].

where [Ent] is the entity mention in x. In § 5, we report the results of T3(·).
We also adopt the soft-encoding strategy, which introduces additional special tokens [P1], ..., [Pl] as the template, where l is a pre-defined hyper-parameter. The template begins with a delimiter [P] and a copy of the entity mention [Ent]. The complete template becomes

T4(x) = x. [P] [Ent] [P1] ... [Pl] [MASK]

where each prompt embedding is randomly initialized and optimized during training. Intuitively, these special tokens can represent a cluster of words with similar semantics in the vocabulary.
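The hard- and soft-encoding templates above amount to simple string wrapping (a sketch; registering the special tokens [P], [P1], ..., [Pl] with the tokenizer is omitted, and the function names are our own):

```python
def hard_template(x, ent, variant=3):
    """Build T1-T3 from Section 3.2 around sentence x and mention ent."""
    if variant == 1:
        return f"{x} {ent} is [MASK]."
    if variant == 2:
        return f"{x} {ent} is a [MASK]."
    return f"{x} In this sentence, {ent} is a [MASK]."

def soft_template(x, ent, l=2):
    """Build T4: a delimiter [P], the copied mention, and l soft tokens."""
    soft = " ".join(f"[P{i}]" for i in range(1, l + 1))
    return f"{x} [P] {ent} {soft} [MASK]"

print(hard_template("London is one of the biggest cities in the world.",
                    "London"))
# London is one of the biggest cities in the world. In this sentence, London is a [MASK].
```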
3.3 Training and Inference
The hard- and soft-encoding strategies provide different initializations of templates, and both can be parameterized by φ and optimized along with M during training. We train the pre-trained model M (parameterized by θ) along with the additional prompt embeddings using the cross-entropy loss

L = −∑ log P(y | x; θ, φ).  (4)
For inference, we can directly use Eq. 3 to predict the label of the current input instance based on the predicted words at the [MASK] position.
This pipeline can be applied to the entity typing task with explicit supervision, and it is effective even if the training data are insufficient, i.e., the few-shot scenario (§ 5.5). Naturally, we consider a more extreme situation, that is, a scenario without any training data (the zero-shot scenario). In this setting, if we directly use an additional classifier to predict the label, the result is equivalent to random guessing, because the parameters of the classifier are randomly initialized. If we use prompts to infer the label based on the predicted words, the performance is significantly better than guessing, but there is still a catastrophic decline (§ 5.6). At this point, a question emerges: "Is it possible for PLMs to predict entity types without any explicit supervision?"
4 Self-supervised Prompt-learning for Zero-shot Entity Typing
With prompt-learning, the answer is yes, because in the pre-training stage the contexts of entities have already implied the corresponding type information, which provides an advantageous initialization point for the prompt-learning paradigm. For example, consider the input sentence with the T3(·) template: "Steve Jobs founded Apple. In this sentence, Steve Jobs is a [MASK]." In our observations, the probability of PLMs predicting person at the masked position is significantly higher than the probability of location. If we make reasonable use of this superior initialization point, it is possible for PLMs to automatically summarize the type information and finally extract the correct entity type.
4.1 Overview
To create conditions for PLMs to summarize entity types, we consider a self-supervised paradigm that optimizes the similarity of the probability distributions predicted for similar examples over a projected vocabulary V*. To achieve this in prompt-learning, we need to (1) impose a limit on the prediction range of the model so that only the words we need, that is, words that express entity types, participate in the optimization of the gradient; and (2) provide an unlabeled dataset where entity mentions are marked without any types, to allow the model to learn to induce type information in a self-supervised manner. The inputs comprise a pre-trained model M, a pre-defined label schema Y, and a dataset without labels D = {x_1, ..., x_n} (entity mentions are marked without any types); our goal is to make M capable of automatically carrying out zero-shot entity
Figure 3: The illustration of self-supervised prompt-learning for fine-grained entity typing with unlabeled data and a pre-defined label set. V* denotes the label words projected from the input label set. Note that we only show the positive pair in this figure.
typing after being trained on D and Y. Using prompt-learning as the training strategy, we first construct a label word set V* from Y, and for each sentence x in D we wrap it with a hard-encoding template with a [MASK] symbol. The key idea is to make the prediction distributions over V* of the same type of entities as similar as possible. In this way, we can perform contrastive learning by sampling positive and negative examples, while ignoring the impact on optimization of words that are not in V* during the MLM process.
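As a minimal illustration of condition (1) above, restricting the prediction range to the projected vocabulary V*, the following sketch renormalizes a full-vocabulary distribution over a toy label-word set. The function name and probabilities are ours, not from the paper; in practice the distribution would come from the MLM head of BERT.

```python
# Sketch: keep only the probability mass on the label words V* and
# renormalize, so words outside V* do not affect the optimization.

def project_onto_label_words(full_probs, label_words):
    """Project a full-vocabulary distribution onto V* and renormalize."""
    masked = {w: full_probs.get(w, 0.0) for w in label_words}
    total = sum(masked.values())
    if total == 0.0:
        raise ValueError("no probability mass on the label words")
    return {w: p / total for w, p in masked.items()}

# Toy full-vocabulary probabilities at the [MASK] position.
full_probs = {"person": 0.30, "location": 0.05, "the": 0.40, "apple": 0.25}
v_star = ["person", "location", "organization"]
projected = project_onto_label_words(full_probs, v_star)
```

After projection, only the relative mass among the label words matters, which is exactly the quantity the distribution-level objective in §4.2 compares.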
4.2 Self-supervised Learning
Although there are no labels in D, we can still develop a sampling strategy based on a simple hypothesis: the same entities in different sentences have similar types. For instance, we sample two sentences containing "Steve Jobs" as a positive pair. Moreover, considering that entity typing is context-aware ("Steve Jobs" could be entrepreneur, designer, or philanthropist in different contexts), we choose to optimize the similarity between the distributions of the words over V*. This strategy not only softens the supervision but also eliminates the impact of other words in self-supervised learning.
Particularly, we randomly sample c positive pairs, i.e., sentence pairs that share one same entity mention, denoted as D̂pos, and c negative pairs, i.e., pairs of sentences with different entity mentions marked, denoted as D̂neg, from a large-scale entity-linked corpus D. To avoid generating false negative samples, the negative samples are further restricted by a large dictionary that contains common entities and their type information: only sentence pairs with entities of different types in the dictionary are selected as negative samples. Then we wrap them with the hard-encoding template T3(·). To avoid overfitting the entity names, we randomly hide the entity mention (in both the original input and the template) with a special symbol [HIDE], with probability α. Empirically, α is set to 0.4.
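The sampling and hiding procedure can be sketched as follows. The toy corpus, type dictionary, and function names are invented for illustration; the real pipeline operates on entity-linked Wikipedia at a much larger scale.

```python
import random
from collections import defaultdict

ALPHA = 0.4  # probability of hiding the entity mention (alpha in the text)

# Toy entity-linked corpus: (sentence, marked entity mention).
corpus = [
    ("Steve Jobs founded Apple.", "Steve Jobs"),
    ("Steve Jobs unveiled the iPhone.", "Steve Jobs"),
    ("London is one of the biggest cities.", "London"),
    ("Paris hosted the games.", "Paris"),
]
# Toy dictionary used to filter out false negatives.
type_dict = {"Steve Jobs": "person", "London": "location", "Paris": "location"}

def sample_pairs(corpus, c, rng):
    """Sample c positive pairs (same mention) and c negative pairs
    (mentions with different dictionary types)."""
    by_mention = defaultdict(list)
    for sent, mention in corpus:
        by_mention[mention].append(sent)
    positives, negatives = [], []
    while len(positives) < c:
        mention = rng.choice([m for m, s in by_mention.items() if len(s) >= 2])
        positives.append(tuple(rng.sample(by_mention[mention], 2)))
    while len(negatives) < c:
        (s1, m1), (s2, m2) = rng.sample(corpus, 2)
        # keep only pairs whose dictionary types are known and different
        if type_dict.get(m1) and type_dict.get(m2) and type_dict[m1] != type_dict[m2]:
            negatives.append((s1, s2))
    return positives, negatives

def maybe_hide(sentence, mention, rng):
    """Replace the mention with [HIDE] with probability ALPHA."""
    return sentence.replace(mention, "[HIDE]") if rng.random() < ALPHA else sentence

rng = random.Random(0)
pos, neg = sample_pairs(corpus, 2, rng)
```

The rejection step mirrors the dictionary restriction in the text: a candidate negative pair is dropped unless the two mentions carry different types.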
Since the impact of a pair of examples on training should be measured at the distribution level, we choose the Jensen-Shannon divergence as the metric to assess the similarity of two distributions. Thus, for a sentence pair (x, x′), the similarity score of the two predictions h and h′ at the [MASK] position is computed by

s(h, h′) = JS(P_{V*}(w | x), P_{V*}(w | x′)),   (5)

where JS is the Jensen-Shannon divergence, and P_{V*}(w | x) and P_{V*}(w | x′) are the probability distributions of the predicted token w over V*, obtained from h and h′.
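Eq. 5 can be implemented directly; this is a minimal sketch with toy distributions over V*. We use the natural logarithm, so the divergence is bounded by ln 2.

```python
from math import log

def kl(p, q):
    """Kullback-Leibler divergence between two dict-distributions."""
    return sum(p[w] * log(p[w] / q[w]) for w in p if p[w] > 0)

def js(p, q):
    """Jensen-Shannon divergence: symmetric, bounded by ln 2."""
    m = {w: 0.5 * (p[w] + q[w]) for w in p}
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Two similar [MASK] predictions over a toy V* give a small divergence.
p = {"person": 0.8, "location": 0.1, "organization": 0.1}
q = {"person": 0.7, "location": 0.2, "organization": 0.1}
score = js(p, q)
```

Unlike KL, the JS divergence is symmetric and finite even when one distribution assigns zero mass to a word, which is why it is a convenient pairwise similarity score here.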
As we attempt to make the predictions of the positive pairs similar, the objective is computed by

L = − (1 / |D̂pos|²) Σ_{x ∈ D̂pos} Σ_{x′ ∈ D̂pos} log(1 − s(h, h′))
    − (1 / |D̂neg|²) Σ_{x ∈ D̂neg} Σ_{x′ ∈ D̂neg} γ log(s(h, h′)),   (6)
where γ is a penalty term, since the hypothesis is looser for negative pairs. Overall, we use the entity-linked English Wikipedia corpus as the raw data and generate about 1 million pairs each for D̂pos and D̂neg.
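A self-contained sketch of the objective in Eq. 6 is below. We skip the degenerate x = x′ terms (whose divergence is always zero), and the distributions and the γ value are toy choices; in the paper, each distribution is the projected [MASK] prediction of one wrapped sentence.

```python
from math import log

def js(p, q):
    """Jensen-Shannon divergence between dict-distributions (bounded by ln 2)."""
    m = {w: 0.5 * (p[w] + q[w]) for w in p}
    kl = lambda a, b: sum(a[w] * log(a[w] / b[w]) for w in a if a[w] > 0)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def pair_loss(pos_dists, neg_dists, gamma=0.5, eps=1e-8):
    """Eq. 6 sketch: push positive pairs toward small JS divergence and
    negative pairs toward large JS divergence, with gamma discounting the
    looser negative-pair hypothesis. Identical-index terms are skipped."""
    l_pos = -sum(log(1.0 - js(p, q) + eps)
                 for i, p in enumerate(pos_dists)
                 for j, q in enumerate(pos_dists) if i != j) / len(pos_dists) ** 2
    l_neg = -gamma * sum(log(js(p, q) + eps)
                         for i, p in enumerate(neg_dists)
                         for j, q in enumerate(neg_dists) if i != j) / len(neg_dists) ** 2
    return l_pos + l_neg

# Similar predictions as positives, dissimilar predictions as negatives.
pos = [{"person": 0.8, "location": 0.2}, {"person": 0.75, "location": 0.25}]
neg = [{"person": 0.9, "location": 0.1}, {"person": 0.1, "location": 0.9}]
```

With the natural logarithm, js is bounded by ln 2 < 1, so log(1 − s) in the positive term is always defined.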
Dataset     Types   Supervised                      Few-shot                          Zero-shot
                    |Dtrain|   |Ddev|   |Dtest|     |Dtrain|    |Ddev|      |Dtest|   |Dtrain|  |Ddev|  |Dtest|
Few-NERD    66      340,382    48,758   96,901      66~1,056    = |Dtrain|  96,901    0         0       96,901
OntoNotes   86      253,239    2,200    8,962       86~1,376    = |Dtrain|  8,962     0         0       8,962
BBN         46      86,077     12,824   12,824      46~736      = |Dtrain|  12,824    0         0       12,824

Table 1: Statistics of FEW-NERD, OntoNotes, and BBN under the three experimental settings. Note that the test sets are identical across all three settings. For the training set of the few-shot setting, we report the summation from 1-shot to 16-shot.
Shot  Metric  Few-NERD               OntoNotes              BBN
              FT      PLET           FT      PLET           FT      PLET
1     Acc     8.94    43.87 (+34.93) 3.70    38.97 (+35.27) 0.80    40.70 (+39.90)
      MiF     19.85   60.60 (+40.75) 18.98   59.91 (+40.93) 5.79    49.25 (+43.46)
      MaF     19.85   60.60 (+40.75) 19.43   61.42 (+41.99) 4.42    48.48 (+44.06)
2     Acc     20.83   47.78 (+26.95) 7.27    39.19 (+31.92) 6.68    41.33 (+34.65)
      MiF     32.67   62.09 (+29.42) 24.89   61.09 (+36.20) 13.70   54.00 (+40.30)
      MaF     32.67   62.09 (+29.42) 25.64   62.68 (+37.04) 13.23   51.97 (+38.74)
4     Acc     33.09   57.00 (+23.91) 11.15   38.39 (+27.24) 19.34   52.21 (+32.87)
      MiF     44.14   68.61 (+24.47) 27.69   59.81 (+32.12) 27.03   61.13 (+34.10)
      MaF     44.14   68.61 (+24.47) 28.26   60.89 (+32.63) 24.69   58.91 (+34.22)
8     Acc     46.44   55.75 (+9.31)  18.37   39.37 (+21.00) 27.01   44.30 (+17.29)
      MiF     57.76   68.74 (+10.98) 38.16   57.97 (+19.81) 40.19   56.21 (+16.02)
      MaF     57.76   68.74 (+10.98) 37.77   58.32 (+20.55) 39.50   55.15 (+15.65)
16    Acc     60.98   61.58 (+0.60)  32.26   42.29 (+10.03) 39.67   55.00 (+15.33)
      MiF     71.59   72.39 (+0.80)  51.40   60.79 (+9.39)  49.01   62.84 (+13.83)
      MaF     71.59   72.39 (+0.80)  51.45   61.80 (+10.35) 47.09   62.38 (+15.29)

Table 2: Results of few-shot entity typing on FEW-NERD, OntoNotes, and BBN. All methods use BERT-base with the same initialization weights as the backbone encoder. The training and dev sets have the same size.
5 Experiments
In this section, we conduct experiments to evaluate the effectiveness of our methods. We use FT to denote the BERT-based fine-tuning approach, PLET to denote the naive prompt-learning approach for entity typing in §3, and PLET (S) to denote the self-supervised prompt-learning approach in §4. Our experiments are carried out in fully supervised (§5.4), few-shot (§5.5), and zero-shot (§5.6) settings on three fine-grained entity typing datasets.
5.1 Datasets
We use three fine-grained entity typing datasets: FEW-NERD, OntoNotes, and BBN.
FEW-NERD. We use FEW-NERD (Ding et al., 2021b) as the main dataset, which has the following advantages: (1) FEW-NERD is large-scale and fine-grained, containing 8 coarse-grained and 66 fine-grained entity types; (2) FEW-NERD is manually annotated, so we can precisely assess the capability of entity typing models. Specifically, we use the supervised setting of the dataset, FEW-NERD (SUP), and its official split to conduct our experiments.
OntoNotes. We also use the OntoNotes 5.0 dataset (Weischedel et al., 2013) in experiments. Following previous works on fine-grained entity typing, we adopt the 86-class version of OntoNotes, in which each class has at most 3 levels of type hierarchy. The data split is identical to Shimaoka et al. (2017).
BBN. The BBN dataset is selected from the Penn Treebank corpus of Wall Street Journal texts and labeled by Weischedel and Brunstein (2005). We follow the version processed by Ren et al. (2016a) and the data split of Ren et al. (2016b). The dataset contains 46 types, and each type has a maximum type hierarchy level of 2.
5.2 Experimental Settings
The experiments are performed under three different settings to evaluate the effect of the prompt-learning method and self-supervised training. In Table 1, we show the statistics of all the settings on the three datasets.
Supervised Setting. In the fully supervised setting, all training data are used in the training phase. FT and PLET are used to train the model. We run the experiments on all three datasets with the BERT-base-cased backbone. Both hard and soft encodings are used for PLET.
Few-shot Setting. In the few-shot setting, we randomly sample 1, 2, 4, 8, and 16 instances for each entity type for training. We apply both the FT and PLET methods with hard encoding on all three datasets.
Zero-shot Setting. In the zero-shot setting, no labeled training data are available. The model is required to infer the entity type without any supervised training. Since fine-tuning is not applicable in this setting, we only conduct experiments on PLET and PLET (S).
Metrics. In terms of evaluation metrics, we follow the widely used setting of Ling and Weld (2012), which includes strict accuracy (Acc), loose macro F1-score (MaF), and loose micro F1-score (MiF) to evaluate the performance of models. The loose F1-score calculations consider type labels at different granularities.
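The three metrics can be sketched as follows for set-valued predictions (one set of type labels per mention, covering all granularities of the hierarchy). The helper names and the toy labels are ours, and we assume non-empty gold and prediction sets.

```python
# Sketch of the Ling and Weld (2012) metrics: strict accuracy requires the
# predicted label set to match the gold set exactly; the loose variants
# credit partial overlap, averaged per example (macro) or pooled (micro).

def f1(p, r):
    return 2 * p * r / (p + r) if p + r > 0 else 0.0

def strict_acc(golds, preds):
    return sum(g == p for g, p in zip(golds, preds)) / len(golds)

def loose_macro_f1(golds, preds):
    p = sum(len(g & pr) / len(pr) for g, pr in zip(golds, preds)) / len(golds)
    r = sum(len(g & pr) / len(g) for g, pr in zip(golds, preds)) / len(golds)
    return f1(p, r)

def loose_micro_f1(golds, preds):
    inter = sum(len(g & pr) for g, pr in zip(golds, preds))
    p = inter / sum(len(pr) for pr in preds)
    r = inter / sum(len(g) for g in golds)
    return f1(p, r)

# Two toy mentions: the first fully correct, the second fully wrong.
golds = [{"person", "person-artist"}, {"location"}]
preds = [{"person", "person-artist"}, {"organization"}]
```

Micro averaging pools overlaps across mentions before computing precision and recall, so mentions with more labels weigh more; macro averaging treats every mention equally.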
5.3 Experimental Details
We use BERT-base (Devlin et al., 2019) as the backbone structure of our model, initialized with the corresponding pre-trained cased weights.2 The hidden size is 768 and the number of layers is 12. Models are implemented with the PyTorch framework3 (Paszke et al., 2019) and Huggingface Transformers4 (Wolf et al., 2020). BERT models are optimized by AdamW (Loshchilov and Hutter, 2019) with a learning rate of 5e-5. The training batch size is 16 for all models. In the supervised setting, each model is trained for 10 epochs and evaluated on the dev set every 2000 steps. In the few-shot setting, each model is trained for 30 epochs and evaluated every 10~50 steps; each evaluation runs for 200 steps. For the methods
2 https://github.com/google-research/bert
3 https://pytorch.org
4 https://github.com/huggingface/transformers
with hard encoding, we report the experimental results of T3(·). For the soft-encoding method, we report the results with l = 2. Experiments are conducted with CUDA on NVIDIA Tesla V100 GPUs.
5.4 Results of Fully Supervised Entity Typing
Dataset     Metric   FT      PLET (H)   PLET (S)
Few-NERD    Acc      79.75   79.90      79.86
            MiF      85.74   85.84      85.76
            MaF      85.74   85.84      85.76
OntoNotes   Acc      59.71   60.37      65.68
            MiF      70.47   70.78      74.53
            MaF      76.57   76.42      79.77
BBN         Acc      62.39   65.92      63.11
            MiF      68.88   71.55      68.68
            MaF      67.37   70.82      67.81

Table 3: Fully supervised entity typing results. FT denotes the vanilla fine-tuning method, (H) denotes the hard-encoding strategy, and (S) denotes the soft-encoding strategy. All methods use BERT-base with the same initialization weights as the backbone encoder.
The results on all three datasets across different models are reported in Table 3. Overall, the prompt-based methods show certain improvements compared to directly fine-tuned models, which indicates that the prompt-based method does help with capturing entity-type information from a given context.

It is also observed that the magnitude of the improvement and the preferred prompt-encoding strategy may vary across datasets. The prompt-based method seems less effective on the FEW-NERD dataset than on the other two. This indicates that the effect of the prompt-based method partially depends on the characteristics of the dataset, and that different prompt designs may suit different data. Specifically, FEW-NERD is manually annotated and contains much less noise than the other two datasets, which benefits the FT method in learning classification with an extra linear layer. Moreover, on the OntoNotes dataset, soft encoding significantly outperforms hard encoding, while on the other two datasets the effect seems reversed.
5.5 Results of Few-shot Entity Typing
Table 2 shows the results on few-shot entity typing. The prompt-based model outperforms fine-tuning by a large margin under the few-shot setting, especially when only 1~2 training instances per type are available. It should be noted
Dataset      Metric   PLET    PLET (S)
Few-NERD     Acc      17.55   23.99 (+6.44)
             MiF      28.39   47.98 (+19.59)
             MaF      28.39   47.98 (+19.59)
OntoNotes‡   Acc      25.10   28.27 (+3.17)
             MiF      33.61   49.79 (+16.18)
             MaF      37.91   49.95 (+12.04)
BBN          Acc      55.82   57.79 (+1.97)
             MiF      60.64   63.24 (+2.60)
             MaF      59.99   64.00 (+4.01)

Table 4: Results of zero-shot entity typing on FEW-NERD, OntoNotes, and BBN. ‡ means that we remove the "Other" class during testing. PLET denotes the prompt-learning pipeline and PLET (S) denotes self-supervised prompt-learning; both methods use BERT-base as the backbone encoder.
that for the OntoNotes and BBN datasets, sampling 16 instances for each entity type already amounts to over 0.5% of the total training data. Meanwhile, some of the data in BBN are distantly supervised and potentially erroneous, which brings more randomness to few-shot training. The results support the idea that a well-designed prompt has much potential in mining the knowledge learned by pre-trained models, and thus yields better performance in few-shot settings. The results also indicate that even when the number of entity types is large (46~86), the superiority of prompt-learning still holds.
5.6 Results of Zero-shot Entity Typing
Table 4 shows the results on the zero-shot entity typing task. We do not report the performance of the vanilla fine-tuning approach because it cannot produce reasonable results with a randomly initialized classifier. First, it should be noted that the prompt method without fine-tuning already outperforms random guessing, indicating that adding a prompt is informative for a model pre-trained with a masked-language-model objective (e.g., BERT) and can induce reasonable predictions in entity typing tasks. Second, the performance of the model improves by a large margin when trained on unlabeled data. This shows the effectiveness of the proposed self-supervised training approach and points to the potential of a pre-trained prompt-based model under the zero-shot setting, when no labeled data are available.
To explore the more subtle changes in performance, we carry out a case study for zero-shot entity typing. In Figure 4, we illustrate the zero-shot prediction distribution (the correct prediction and the other top-5 predictions) for four entity types in FEW-NERD: ORG-SPORTSLEAGUE, EVENT-ATTACK, MISC-CURRENCY, and LOC-MOUNTAIN. We observe that with self-supervised prompt-learning, PLET (S) could summarize entity type information and infer the related words to a certain extent. In Figure 4 (a) and Figure 4 (b), the PLET model suffers from a severe bias and predicts almost no correct labels in the zero-shot setting, since such words are low-frequency. Although there is no explicit supervision in the training of PLET (S), the model could still find the corresponding words that express the ORG-SPORTSLEAGUE and EVENT-ATTACK types. In Figure 4 (c), self-supervised learning increases the performance of the original encoder. Further, in Figure 4 (d), PLET is already able to make satisfying predictions for the type LOC-MOUNTAIN; in this case, self-supervised learning has hardly weakened the performance, which means that the process of automatically summarizing type information has little negative impact on high-confidence entity types.
5.7 Effect of Templates
As stated in previous studies (Gao et al., 2020; Zhao et al., 2021), the choice of templates may have a huge impact on performance in prompt-learning. In this section, we carry out experiments to investigate such influence. Experiments are conducted under the 8-shot setting on the FEW-NERD dataset, and we use 3 different hard-encoding templates and 4 soft-encoding templates (by changing the number of prompt tokens l). The results in Table 5 demonstrate that the choice of templates exerts a considerable influence on the performance of prompt-based few-shot learning. For the hard-encoding templates, the phrase that describes the location, "In this sentence", contributes a remarkable improvement in performance. For the soft-encoding templates, surprisingly, the prompt-learning model yields the best result with the fewest special tokens.
6 Related Work
After a series of effective PLMs like GPT (Radford et al., 2018), BERT (Devlin et al., 2019), RoBERTa (Liu et al., 2019), and T5 (Raffel et al.,
Figure 4: Zero-shot prediction distributions on four types in FEW-NERD: (a) ORG-SPORTSLEAGUE, (b) EVENT-ATTACK, (c) MISC-CURRENCY, and (d) LOC-MOUNTAIN. In each subgraph, the left part illustrates the results of PLET and the right part shows the results of PLET (S); bars denote correct predictions, wrong predictions with correct coarse-grained types, and wrong predictions with wrong coarse-grained types.
Encoding Strategy   Template T(x)                               Acc     MiF     MaF
Hard-encoding       x [Ent] is [MASK]                           54.45   67.34   67.34
                    x [Ent] is a [MASK]                         53.93   66.44   66.44
                    x In this sentence, [Ent] is a [MASK]       55.75   68.74   68.74
Soft-encoding       x [P] [Ent] [P1] ... [Pl] [MASK], l = 2     59.25   69.58   69.58
                    x [P] [Ent] [P1] ... [Pl] [MASK], l = 3     53.66   66.06   66.06
                    x [P] [Ent] [P1] ... [Pl] [MASK], l = 4     52.96   66.01   66.01
                    x [P] [Ent] [P1] ... [Pl] [MASK], l = 5     55.44   68.39   68.39

Table 5: Effect of templates. The results are produced under the 8-shot setting on the FEW-NERD dataset by PLET.
2020), fine-tuned PLMs have demonstrated their effectiveness on various important NLP tasks such as dialogue generation (Zhang et al., 2020), text summarization (Zhang et al., 2019; Liu and Lapata, 2019), question answering (Adiwardana et al., 2020), and text classification (Baldini Soares et al., 2019; Peng et al., 2020; Ding et al., 2021a).
Despite the success of fine-tuning PLMs, the huge gap between the objective forms of pre-training and fine-tuning still hinders the full use of pre-trained knowledge for downstream tasks (Liu et al., 2021b; Han et al., 2021b; Hu et al., 2021). To this end, prompt-learning has been proposed. In prompt-learning, by leveraging language prompts as contexts, downstream tasks can be expressed as cloze-style objectives similar to pre-training objectives. The seminal work that stimulated the development of prompt-learning is the birth of GPT-3 (Brown et al., 2020), which uses hand-crafted prompts for tuning and achieves very impressive performance on various tasks, especially under the setting of few-shot learning.
Inspired by GPT-3, a series of hand-crafted prompts have been widely explored in knowledge probing (Trinh and Le, 2018; Petroni et al., 2019; Davison et al., 2019), relation classification (Han et al., 2021b), sentiment classification, and natural language inference (Schick and Schütze, 2021; Liu et al., 2021b). To avoid labor-intensive prompt design, automatic prompt search has also been extensively explored (Schick et al., 2020; Schick and Schütze, 2021; Shin et al., 2020; Gao et al., 2020; Liu et al., 2021a) to generate language phrases for prompts. Recently, continuous prompts have also been proposed (Li and Liang, 2021; Lester et al., 2021), which directly use a series of learnable continuous embeddings as prompts rather than discrete language phrases.
In this paper, we aim to stimulate PLMs with prompt-learning to capture the attribute information of entities. We take fine-grained entity typing, a crucial task in knowledge extraction that assigns entity types to entity mentions (Lin et al., 2012), as the foothold to develop prompt-learning strategies. In fact, Dai et al. (2021) use hypernym extraction patterns to enhance the context and apply masked language modeling to tackle the ultra-fine entity typing problem (Choi et al., 2018) with free-form labels, which shares a similar idea with prompt-learning. In our work, we mainly emphasize using prompt-learning to extract entity types that have been pre-defined, in low-data scenarios.
7 Conclusion
This work investigates the application of prompt-learning to fine-grained entity typing. More specifically, we propose a framework, PLET, that deals with fine-grained entity typing in fully supervised, few-shot, and zero-shot scenarios. In PLET, we first introduce a simple and effective prompt-learning pipeline that can be used to extract entity types with both sufficient and insufficient supervision. Furthermore, to handle the zero-shot setting, we propose a self-supervised prompt-learning approach that automatically learns and summarizes entity types based on unlabeled corpora and a pre-defined label schema. PLET utilizes prompts to take advantage of the prior knowledge distributed in PLMs, and can learn pre-defined type information without overfitting by performing distribution-level optimization. In future work, along the direction of PLET (S), we will explore better prompt-learning approaches for automatically learning entity types from unlabeled data.
References

Daniel Adiwardana, Minh-Thang Luong, David R. So, Jamie Hall, Noah Fiedel, Romal Thoppilan, Zi Yang, Apoorv Kulshreshtha, Gaurav Nemade, Yifeng Lu, et al. 2020. Towards a human-like open-domain chatbot. arXiv preprint arXiv:2001.09977.

Livio Baldini Soares, Nicholas FitzGerald, Jeffrey Ling, and Tom Kwiatkowski. 2019. Matching the blanks: Distributional similarity for relation learning. In Proceedings of ACL, pages 2895–2905, Florence, Italy.

Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. In Proceedings of NIPS, pages 1877–1901.

Eunsol Choi, Omer Levy, Yejin Choi, and Luke Zettlemoyer. 2018. Ultra-fine entity typing. In Proceedings of ACL, pages 87–96.

Hongliang Dai, Yangqiu Song, and Haixun Wang. 2021. Ultra-fine entity typing with weak supervision from a masked language model. In Proceedings of ACL, pages 1790–1799.

Joe Davison, Joshua Feldman, and Alexander M. Rush. 2019. Commonsense knowledge mining from pretrained models. In Proceedings of EMNLP-IJCNLP, pages 1173–1178.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of NAACL-HLT, pages 4171–4186.

Ning Ding, Xiaobin Wang, Yao Fu, Guangwei Xu, Rui Wang, Pengjun Xie, Ying Shen, Fei Huang, Hai-Tao Zheng, and Rui Zhang. 2021a. Prototypical representation learning for relation extraction. In Proceedings of ICLR.

Ning Ding, Guangwei Xu, Yulin Chen, Xiaobin Wang, Xu Han, Pengjun Xie, Hai-Tao Zheng, and Zhiyuan Liu. 2021b. Few-NERD: A few-shot named entity recognition dataset. In Proceedings of ACL, pages 3198–3213.

Tianyu Gao, Adam Fisch, and Danqi Chen. 2020. Making pre-trained language models better few-shot learners. arXiv preprint arXiv:2012.15723.

Xu Han, Zhengyan Zhang, Ning Ding, Yuxian Gu, Xiao Liu, Yuqi Huo, Jiezhong Qiu, Liang Zhang, Wentao Han, Minlie Huang, et al. 2021a. Pre-trained models: Past, present and future. arXiv preprint arXiv:2106.07139.

Xu Han, Weilin Zhao, Ning Ding, Zhiyuan Liu, and Maosong Sun. 2021b. PTR: Prompt tuning with rules for text classification. arXiv preprint arXiv:2105.11259.

John Hewitt and Christopher D. Manning. 2019. A structural probe for finding syntax in word representations. In Proceedings of NAACL, pages 4129–4138.

Shengding Hu, Ning Ding, Huadong Wang, Zhiyuan Liu, Juanzi Li, and Maosong Sun. 2021. Knowledgeable prompt-tuning: Incorporating knowledge into prompt verbalizer for text classification. arXiv preprint arXiv:2108.02035.

Ganesh Jawahar, Benoît Sagot, and Djamé Seddah. 2019. What does BERT learn about the structure of language? In Proceedings of ACL, pages 3651–3657.

Brian Lester, Rami Al-Rfou, and Noah Constant. 2021. The power of scale for parameter-efficient prompt tuning. arXiv preprint arXiv:2104.08691.

Xiang Lisa Li and Percy Liang. 2021. Prefix-tuning: Optimizing continuous prompts for generation. arXiv preprint arXiv:2101.00190.

Thomas Lin, Oren Etzioni, et al. 2012. No noun phrase left behind: Detecting and typing unlinkable entities. In Proceedings of EMNLP-CoNLL, pages 893–903.

Xiao Ling and Daniel S. Weld. 2012. Fine-grained entity recognition. In AAAI.

Pengfei Liu, Weizhe Yuan, Jinlan Fu, Zhengbao Jiang, Hiroaki Hayashi, and Graham Neubig. 2021a. Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing. arXiv preprint arXiv:2107.13586.

Xiao Liu, Yanan Zheng, Zhengxiao Du, Ming Ding, Yujie Qian, Zhilin Yang, and Jie Tang. 2021b. GPT understands, too. arXiv preprint arXiv:2103.10385.

Yang Liu and Mirella Lapata. 2019. Text summarization with pretrained encoders. In Proceedings of EMNLP-IJCNLP, pages 3730–3740, Hong Kong, China.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.

Ilya Loshchilov and Frank Hutter. 2019. Decoupled weight decay regularization. In Proceedings of ICLR.

Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Köpf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. 2019. PyTorch: An imperative style, high-performance deep learning library. In Proceedings of NIPS, pages 8024–8035.

Hao Peng, Tianyu Gao, Xu Han, Yankai Lin, Peng Li, Zhiyuan Liu, Maosong Sun, and Jie Zhou. 2020. Learning from context or names? An empirical study on neural relation extraction. In Proceedings of EMNLP, pages 3661–3672.

Fabio Petroni, Tim Rocktäschel, Sebastian Riedel, Patrick Lewis, Anton Bakhtin, Yuxiang Wu, and Alexander Miller. 2019. Language models as knowledge bases? In Proceedings of EMNLP, pages 2463–2473.

Xipeng Qiu, Tianxiang Sun, Yige Xu, Yunfan Shao, Ning Dai, and Xuanjing Huang. 2020. Pre-trained models for natural language processing: A survey. Science China Technological Sciences, pages 1–26.

Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding by generative pre-training. OpenAI.

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. JMLR, 21:1–67.

Xiang Ren, Wenqi He, Meng Qu, Lifu Huang, Heng Ji, and Jiawei Han. 2016a. AFET: Automatic fine-grained entity typing by hierarchical partial-label embedding. In Proceedings of EMNLP, pages 1369–1378, Austin, Texas.

Xiang Ren, Wenqi He, Meng Qu, Clare R. Voss, Heng Ji, and Jiawei Han. 2016b. Label noise reduction in entity typing by heterogeneous partial-label embedding. In Proceedings of SIGKDD, pages 1825–1834.

Timo Schick, Helmut Schmid, and Hinrich Schütze. 2020. Automatically identifying words that can serve as labels for few-shot text classification. In Proceedings of COLING, pages 5569–5578.

Timo Schick and Hinrich Schütze. 2021. Exploiting cloze-questions for few-shot text classification and natural language inference. In Proceedings of EACL, pages 255–269.

Sonse Shimaoka, Pontus Stenetorp, Kentaro Inui, and Sebastian Riedel. 2017. Neural architectures for fine-grained entity type classification. In Proceedings of EACL, pages 1271–1280, Valencia, Spain.

Taylor Shin, Yasaman Razeghi, Robert L. Logan IV, Eric Wallace, and Sameer Singh. 2020. AutoPrompt: Eliciting knowledge from language models using automatically generated prompts. In Proceedings of EMNLP, pages 4222–4235.

Trieu H. Trinh and Quoc V. Le. 2018. A simple method for commonsense reasoning. arXiv preprint arXiv:1806.02847.

Dong Wang, Ning Ding, Piji Li, and Haitao Zheng. 2021. CLINE: Contrastive learning with semantic negative examples for natural language understanding. In Proceedings of ACL.

Ralph Weischedel and Ada Brunstein. 2005. BBN Pronoun Coreference and Entity Type Corpus. Linguistic Data Consortium, Philadelphia.

Ralph Weischedel, Martha Palmer, Mitchell Marcus, Eduard Hovy, Sameer Pradhan, Lance Ramshaw, Nianwen Xue, Ann Taylor, Jeff Kaufman, Michelle Franchini, Mohammed El-Bachouti, Robert Belvin, and Ann Houston. 2013. OntoNotes Release 5.0. Abacus Data Network.

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Remi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander Rush. 2020. Transformers: State-of-the-art natural language processing. In Proceedings of EMNLP, pages 38–45.

Jingqing Zhang, Yao Zhao, Mohammad Saleh, and Peter J. Liu. 2019. PEGASUS: Pre-training with extracted gap-sentences for abstractive summarization. In Proceedings of ICML, pages 11328–11339.

Yizhe Zhang, Siqi Sun, Michel Galley, Yen-Chun Chen, Chris Brockett, Xiang Gao, Jianfeng Gao, Jingjing Liu, and Bill Dolan. 2020. DialoGPT: Large-scale generative pre-training for conversational response generation. In Proceedings of ACL, pages 270–278.

Tony Z. Zhao, Eric Wallace, Shi Feng, Dan Klein, and Sameer Singh. 2021. Calibrate before use: Improving few-shot performance of language models. arXiv preprint arXiv:2102.09690.
choose hard-encoding templates with natural language and soft-encoding templates with additional special tokens in our work.

For the choice of hard-encoding templates, we do not use automatic searching methods for discrete prompts, since the fine-grained entity typing task is clearly defined and the prompts are easily made purposeful. We select simple declarative templates rather than hypernym templates to avoid grammatical errors. In the hard-encoding setting, we first copy the marked entity mention in x, then add a few linking verbs and articles, followed by the [MASK] token. With the marked entity mention [Ent], we use the following templates:
T1(x) = x [Ent] is [MASK]
T2(x) = x [Ent] is a [MASK]
T3(x) = x In this sentence [Ent] is a [MASK]
where [Ent] is the entity mention in x. In §5, we report the results of T3(·).
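The three hard-encoding templates can be reproduced with simple string formatting; the function names and the added punctuation are ours, but the template wording follows T1–T3 above.

```python
# Hard-encoding templates T1-T3: copy the sentence, then append the entity
# mention wrapped in a short cloze phrase ending in [MASK].

def t1(x, ent):
    return f"{x} {ent} is [MASK]."

def t2(x, ent):
    return f"{x} {ent} is a [MASK]."

def t3(x, ent):
    return f"{x} In this sentence, {ent} is a [MASK]."

sent = "Steve Jobs founded Apple."
wrapped = t3(sent, "Steve Jobs")
# Steve Jobs founded Apple. In this sentence, Steve Jobs is a [MASK].
```

The wrapped string is then fed to the MLM, and the distribution at the [MASK] position is projected onto the label words.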
We also adopt the soft-encoding strategy, which introduces additional special tokens [P1], ..., [Pl] as the template, where l is a pre-defined hyper-parameter. The template begins with a delimiter [P] and a copy of the entity mention [Ent]. The complete template becomes

T4(x) = x [P] [Ent] [P1] ... [Pl] [MASK]

where each prompt embedding is randomly initialized and optimized during training. Intuitively, these special tokens can represent a cluster of words with similar semantics in the vocabulary.
3.3 Training and Inference
The strategies of hard and soft encoding provide different initializations of templates, and both can be parameterized by φ and optimized along with M during training. We train the pre-trained model M (parameterized by θ) together with the additional prompt embeddings using the cross-entropy loss

L = − Σ log P(y | x; θ, φ).   (4)
For inference, we can directly use Eq. 3 to predict the label of the current input instance based on the predicted words at the [MASK] position.
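Inference can be sketched as follows: the [MASK] distribution over V* is mapped back to an entity type through a verbalizer (type → label words). Since Eq. 3 is not reproduced in this section, the aggregation here (summing the label-word probabilities per type) is our assumption, and the verbalizer and probabilities are toy values.

```python
# Toy verbalizer: each pre-defined entity type is associated with a small
# set of label words that may appear at the [MASK] position.
verbalizer = {
    "person-artist": ["artist", "singer"],
    "location-city": ["city", "town"],
}

def predict_type(mask_probs, verbalizer):
    """Score each type by aggregating (here: summing) the probabilities of
    its label words, and return the highest-scoring type."""
    scores = {
        t: sum(mask_probs.get(w, 0.0) for w in words)
        for t, words in verbalizer.items()
    }
    return max(scores, key=scores.get)

mask_probs = {"artist": 0.35, "singer": 0.20, "city": 0.30, "town": 0.05}
predicted = predict_type(mask_probs, verbalizer)  # person-artist
```

Because the verbalizer is fixed, this step needs no extra classifier parameters, which is what makes the zero-shot setting of §4 possible at all.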
This pipeline can be applied to the entity typing task with explicit supervision, and it is effective even if the training data are insufficient, i.e., the few-shot scenario (§ 5.5). Naturally, we consider a more extreme situation, that is, a scenario without any training data (the zero-shot scenario). In this setting, if we directly use an additional classifier to predict the label, the result is equivalent to random guessing because the parameters of the classifier are randomly initialized. If we use prompts to infer the label based on the predicted words, the performance is significantly better than guessing, but there is still a catastrophic decline (§ 5.6). At this point a question emerges: "Is it possible for PLMs to predict entity types without any explicit supervision?"
4 Self-supervised Prompt-learning for Zero-shot Entity Typing
With prompt-learning, the answer is yes, because in the pre-training stage the contexts of entities have already implied the corresponding type information, which provides an advantageous initialization point for the prompt-learning paradigm. For example, consider the input sentence wrapped with the T3(·) template: "Steve Jobs founded Apple. In this sentence, Steve Jobs is a [MASK]." In our observations, the probability of PLMs predicting person at the masked position is significantly higher than the probability of location. If we make reasonable use of this superior initialization point, it is possible for PLMs to automatically summarize the type information and finally extract the correct entity type.
4.1 Overview
In order to create conditions for PLMs to summarize entity types, we consider a self-supervised paradigm that optimizes the similarity of the probability distributions predicted by similar examples over a projected vocabulary V*. To achieve that in prompt-learning, we need to (1) impose a limit on the prediction range of the model, so that only the words we need, that is, words that express entity types, participate in the optimization of the gradient; and (2) provide an unlabeled dataset where entity mentions are marked without any types, to allow the model to learn to induce type information in a self-supervised manner. The inputs contain a pre-trained model M, a pre-defined label schema Y, and a dataset without labels D = {x1, ..., xn} (entity mentions are marked without any types); our goal is to make M capable of automatically carrying out zero-shot entity
[Figure 3 appears here.]

Figure 3: The illustration of self-supervised prompt-learning for fine-grained entity typing with unlabeled data and a pre-defined label set. V* denotes the label words projected from the input label set. Note that we only show the positive pair in this figure.
typing after being trained on D and Y. Using prompt-learning as the training strategy, we first construct a label word set V* from Y, and for each sentence x in D we wrap it with a hard-encoding template containing a [MASK] symbol. The key idea is to make the prediction distributions over V* of entities of the same type as similar as possible. In this way, we can perform contrastive learning by sampling positive and negative examples, while ignoring the impact on optimization of words not in V* during the MLM process.
4.2 Self-supervised Learning
Although there are no labels in D, we can still develop a sampling strategy based on a simple hypothesis, namely that the same entities in different sentences have similar types. For instance, we sample two sentences containing "Steve Jobs" as a positive pair. Moreover, considering that entity typing is context-aware ("Steve Jobs" could be entrepreneur, designer, or philanthropist in different contexts), we choose to optimize the similarity between the distributions of the words over V*. This strategy not only softens the supervision but also eliminates the impact of other words in self-supervised learning.
Particularly, we randomly sample c positive pairs, i.e., sentence pairs that share one same entity mention, denoted as D̂pos, and c negative pairs, i.e., pairs of sentences with different entity mentions marked, denoted as D̂neg, from a large-scale entity-linked corpus D. To avoid generating false negative samples, the negative samples are further restricted by a large dictionary that contains common entities and their type information: only sentence pairs with entities of different types in the dictionary are selected as negative samples. Then we wrap them with the hard-encoding template T3(·). To avoid overfitting to the entity names, we randomly hide the entity mention (in the original input and the template) with a special symbol [HIDE] with probability α. Empirically, α is set to 0.4.
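The positive-pair sampling and mention hiding can be sketched as follows. This is our own minimal illustration (function names, the corpus format, and the toy data are assumptions; the real pipeline works on an entity-linked Wikipedia corpus):

```python
import random
from collections import defaultdict

def hide_mention(sentence, mention, alpha=0.4, rng=None):
    """With probability alpha (0.4 in the paper), replace the entity mention
    with [HIDE] to avoid overfitting to entity surface forms."""
    rng = rng or random
    return sentence.replace(mention, "[HIDE]") if rng.random() < alpha else sentence

def sample_positive_pairs(corpus, c, rng):
    """corpus: list of (sentence, mention) pairs from entity-linked text.
    A positive pair is two different sentences sharing the same mention."""
    by_mention = defaultdict(list)
    for sent, mention in corpus:
        by_mention[mention].append(sent)
    pools = [sents for sents in by_mention.values() if len(sents) >= 2]
    return [tuple(rng.sample(rng.choice(pools), 2)) for _ in range(c)]
```

Negative pairs would additionally be filtered through the entity-type dictionary, which this sketch omits.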
Since the impact of a pair of examples on training should be measured at the distribution level, we choose the Jensen-Shannon divergence as the metric to assess the similarity of two distributions. Thus, for a sentence pair (x, x′), the similarity score of the two representations h and h′ of the predictions at the [MASK] position is computed by
s(h, h′) = JS(P_V*(w|x), P_V*(w|x′))   (5)

where JS is the Jensen-Shannon divergence, and P_V*(w|x) and P_V*(w|x′) are the probability distributions of the predicted token w over V* obtained from h and h′.
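The Jensen-Shannon divergence is the symmetrized KL divergence against the mixture of the two distributions; with natural logarithms it is bounded by ln 2. A minimal sketch over plain probability lists (our own helper, standing in for distributions over V*):

```python
import math

def kl(p, q):
    """Kullback-Leibler divergence KL(p || q) for discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js(p, q):
    """Jensen-Shannon divergence: average KL of p and q to their mixture m."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

p = [0.7, 0.2, 0.1]
q = [0.1, 0.2, 0.7]
print(js(p, p))  # 0.0
```

Unlike KL, JS is symmetric and finite even when the two distributions place mass on different words, which suits distribution-level comparison of [MASK] predictions.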
As we attempt to make the predictions of the positive pairs similar, the objective is computed by
L = −(1 / |D̂pos|²) Σ_{x ∈ D̂pos} Σ_{x′ ∈ D̂pos} log(1 − s(h, h′))
    −(1 / |D̂neg|²) Σ_{x ∈ D̂neg} Σ_{x′ ∈ D̂neg} γ log(s(h, h′))   (6)
where γ is a penalty term, since the assumption is looser for negative pairs. Overall, we use the entity-linked English Wikipedia corpus as the raw data and generate about 1 million pairs each for D̂pos and D̂neg.
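Given precomputed JS scores s for a batch of pairs, Eq. 6 can be sketched as below (our own simplification: it takes flat lists of per-pair scores and averages them, and the γ value is an arbitrary placeholder, not the paper's setting):

```python
import math

def eq6_loss(pos_s, neg_s, gamma=0.5):
    """Distribution-level contrastive loss in the spirit of Eq. 6.
    Each s is the JS divergence of a pair (0 <= s < 1 with natural log):
    positive pairs are pushed toward s = 0, negative pairs toward larger s.
    gamma down-weights the looser negative-pair assumption."""
    pos_term = -sum(math.log(1.0 - s) for s in pos_s) / len(pos_s)
    neg_term = -gamma * sum(math.log(s) for s in neg_s) / len(neg_s)
    return pos_term + neg_term
```

The loss decreases as positive pairs become more similar (smaller s) and as negative pairs become more dissimilar (larger s).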
Dataset    Types  Supervised (train/dev/test)   Few-shot (train/dev/test)      Zero-shot (train/dev/test)
Few-NERD   66     340,382 / 48,758 / 96,901     66~1,056 / = |Dtrain| / 96,901  0 / 0 / 96,901
OntoNotes  86     253,239 / 2,200 / 8,962       86~1,376 / = |Dtrain| / 8,962   0 / 0 / 8,962
BBN        46     86,077 / 12,824 / 12,824      46~736 / = |Dtrain| / 12,824    0 / 0 / 12,824

Table 1: Statistics of FEW-NERD, OntoNotes, and BBN for the three experimental settings. Note that for all three settings the test sets are identical. For the training set of the few-shot setting, we report the range from 1-shot to 16-shot.
Shot  Metric  Few-NERD: FT / PLET      OntoNotes: FT / PLET     BBN: FT / PLET
1     Acc     8.94 / 43.87 (+34.93)    3.70 / 38.97 (+35.27)    0.80 / 40.70 (+39.90)
      MiF     19.85 / 60.60 (+45.75)   18.98 / 59.91 (+40.93)   5.79 / 49.25 (+43.46)
      MaF     19.85 / 60.60 (+40.75)   19.43 / 61.42 (+41.99)   4.42 / 48.48 (+43.06)
2     Acc     20.83 / 47.78 (+26.95)   7.27 / 39.19 (+31.92)    6.68 / 41.33 (+34.65)
      MiF     32.67 / 62.09 (+29.42)   24.89 / 61.09 (+36.20)   13.70 / 54.00 (+40.30)
      MaF     32.67 / 62.09 (+29.42)   25.64 / 62.68 (+37.04)   13.23 / 51.97 (+38.74)
4     Acc     33.09 / 57.00 (+23.91)   11.15 / 38.39 (+27.24)   19.34 / 52.21 (+32.87)
      MiF     44.14 / 68.61 (+24.47)   27.69 / 59.81 (+32.12)   27.03 / 61.13 (+34.10)
      MaF     44.14 / 68.61 (+24.47)   28.26 / 60.89 (+32.63)   24.69 / 58.91 (+34.22)
8     Acc     46.44 / 55.75 (+9.31)    18.37 / 39.37 (+21.00)   27.01 / 44.30 (+17.29)
      MiF     57.76 / 68.74 (+10.98)   38.16 / 57.97 (+19.81)   40.19 / 56.21 (+16.02)
      MaF     57.76 / 68.74 (+10.98)   37.77 / 58.32 (+20.55)   39.50 / 55.15 (+15.65)
16    Acc     60.98 / 61.58 (+0.60)    32.26 / 42.29 (+10.03)   39.67 / 55.00 (+15.33)
      MiF     71.59 / 72.39 (+0.80)    51.40 / 60.79 (+9.39)    49.01 / 62.84 (+13.83)
      MaF     71.59 / 72.39 (+0.80)    51.45 / 61.80 (+10.35)   47.09 / 62.38 (+15.29)

Table 2: Results of few-shot entity typing on FEW-NERD, OntoNotes, and BBN. All methods use BERT-base with the same initialization weights as the backbone encoder. The training set and dev set have the same size.
5 Experiments
In this section, we conduct experiments to evaluate the effectiveness of our methods. We use FT to denote the BERT-based fine-tuning approach, PLET to denote the naive prompt-learning approach for entity typing in § 3, and PLET (S) to denote the self-supervised prompt-learning approach in § 4. Our experiments are carried out in fully supervised (§ 5.4), few-shot (§ 5.5), and zero-shot (§ 5.6) settings on three fine-grained entity typing datasets.
5.1 Datasets
We use three fine-grained entity typing datasets: FEW-NERD, OntoNotes, and BBN.
FEW-NERD. We use FEW-NERD (Ding et al., 2021b) as the main dataset, which has the following advantages: (1) FEW-NERD is large-scale and fine-grained, containing 8 coarse-grained and 66 fine-grained entity types; (2) FEW-NERD is manually annotated, so we can precisely assess the capability of entity typing models. Specifically, we use the supervised setting of the dataset, FEW-NERD (SUP), and its official split to conduct our experiments.
OntoNotes. We also use the OntoNotes 5.0 dataset (Weischedel et al., 2013) in experiments. Following previous work on fine-grained entity typing, we adopt the 86-class version of OntoNotes, where each class has at most 3 levels of type hierarchy. The data split is identical to that of Shimaoka et al. (2017).
BBN. The BBN dataset is selected from the Penn Treebank corpus of Wall Street Journal texts and labeled by Weischedel and Brunstein (2005). We follow the version processed by Ren et al. (2016a) and the data split of Ren et al. (2016b). The dataset contains 46 types, and each type has a maximum type hierarchy level of 2.
5.2 Experimental Settings
The experiments are performed under three different settings to evaluate the effect of the prompt-learning method and self-supervised training. In Table 1, we show the statistics of all the settings on the three datasets.
Supervised Setting. In the fully supervised setting, all training data are used in the training phase. FT and PLET are used to train the model. We run the experiments on all three datasets with the BERT-base-cased backbone. Both hard and soft encodings are used for PLET.
Few-shot Setting. In the few-shot setting, we randomly sample 1, 2, 4, 8, and 16 instances per entity type for training. We apply both the FT and PLET methods with hard encoding on all three datasets.
Zero-shot Setting. In the zero-shot setting, no labeled training data are available, and the model is required to infer the entity type without any supervised training. Since fine-tuning is not applicable in this setting, we only conduct experiments with PLET and PLET (S).
Metrics. In terms of evaluation metrics, we follow the widely used setting of Ling and Weld (2012), which includes strict accuracy (Acc), loose macro F1-score (MaF), and loose micro F1-score (MiF) to evaluate the performance of models. The loose F1-score calculation considers type labels at different granularities.
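A minimal sketch of these metrics, assuming each entity comes with a set of gold and predicted (possibly hierarchical) type labels; the function names and toy labels are ours:

```python
def strict_acc(gold, pred):
    """Fraction of entities whose predicted type set exactly matches gold."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)

def loose_macro_f1(gold, pred):
    """Average per-entity precision/recall of the type sets, then F1."""
    p = sum(len(g & h) / len(h) for g, h in zip(gold, pred)) / len(gold)
    r = sum(len(g & h) / len(g) for g, h in zip(gold, pred)) / len(gold)
    return 2 * p * r / (p + r)

def loose_micro_f1(gold, pred):
    """Pool the type-set overlaps over all entities, then F1."""
    inter = sum(len(g & h) for g, h in zip(gold, pred))
    p = inter / sum(len(h) for h in pred)
    r = inter / sum(len(g) for g in gold)
    return 2 * p * r / (p + r)

gold = [{"person", "person/artist"}, {"location"}]
pred = [{"person", "person/actor"}, {"location"}]
print(strict_acc(gold, pred))  # 0.5
```

Because partial overlap at a coarser granularity still earns credit, the loose scores are never lower than strict accuracy.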
5.3 Experimental Details
We use BERT-base (Devlin et al., 2019) as the backbone structure of our model, initialized with the corresponding pre-trained cased weights.2 The hidden size is 768 and the number of layers is 12. Models are implemented with the PyTorch framework3 (Paszke et al., 2019) and Huggingface Transformers4 (Wolf et al., 2020). BERT models are optimized by AdamW (Loshchilov and Hutter, 2019) with a learning rate of 5e-5. The training batch size is 16 for all models. In the supervised setting, each model is trained for 10 epochs and evaluated on the dev set every 2000 steps. In the few-shot setting, each model is trained for 30 epochs and evaluated every 10~50 steps; each evaluation is run for 200 steps. For the methods
2 https://github.com/google-research/bert
3 https://pytorch.org
4 https://github.com/huggingface/transformers
with hard encoding, we report the experimental results of T3(·). For the soft-encoding method, we report the results of l = 2. Experiments are conducted with CUDA on NVIDIA Tesla V100 GPUs.
5.4 Results of Fully Supervised Entity Typing
Dataset    Metric  FT     PLET (H)  PLET (S)
Few-NERD   Acc     79.75  79.90     79.86
           MiF     85.74  85.84     85.76
           MaF     85.74  85.84     85.76
OntoNotes  Acc     59.71  60.37     65.68
           MiF     70.47  70.78     74.53
           MaF     76.57  76.42     79.77
BBN        Acc     62.39  65.92     63.11
           MiF     68.88  71.55     68.68
           MaF     67.37  70.82     67.81

Table 3: Fully supervised entity typing results. FT denotes the vanilla fine-tuning method, (H) denotes the hard-encoding strategy, and (S) denotes the soft-encoding strategy. All methods use BERT-base with the same initialization weights as the backbone encoder.
The results on all three datasets across the different models are reported in Table 3. Overall, the prompt-based methods show certain improvements over directly fine-tuned models, indicating that the prompt-based method does help with capturing entity-type information from a given context.
It is also observed that the magnitude of the improvement and the preferred prompt encoding strategy may vary across datasets. The prompt-based method seems less effective on FEW-NERD than on the other two datasets, which indicates that its effect partially depends on the characteristics of the dataset and that different prompt designs may suit different data. Specifically, FEW-NERD is manually annotated and contains much less noise than the other two datasets, which benefits the FT method in learning classification with an extra linear layer. Moreover, on OntoNotes, soft encoding significantly outperforms hard encoding, while on the other two datasets the effect seems reversed.
5.5 Results of Few-shot Entity Typing
Table 2 shows the results on few-shot entity typing. The prompt-based model outperforms fine-tuning by a large margin under the few-shot setting, especially when only 1~2 training instances per type are available. It should be noted
Dataset     Metric  PLET   PLET (S)
Few-NERD    Acc     17.55  23.99 (+6.44)
            MiF     28.39  47.98 (+19.59)
            MaF     28.39  47.98 (+19.59)
OntoNotes‡  Acc     25.10  28.27 (+3.17)
            MiF     33.61  49.79 (+16.18)
            MaF     37.91  49.95 (+12.04)
BBN         Acc     55.82  57.79 (+1.97)
            MiF     60.64  63.24 (+2.60)
            MaF     59.99  64.00 (+4.01)

Table 4: Results of zero-shot entity typing on FEW-NERD, OntoNotes, and BBN. ‡ means that we remove the "Other" class during testing. PLET denotes the prompt-learning pipeline and PLET (S) denotes self-supervised prompt-learning; both methods use BERT-base as the backbone encoder.
that for the OntoNotes and BBN datasets, sampling 16 instances per entity type already amounts to over 0.5% of the total training data. Meanwhile, some of the data in BBN are distantly supervised and potentially erroneous, which brings more randomness to few-shot training. The results support the idea that a well-designed prompt has much potential in mining the learned knowledge of pre-trained models and thus yields better performance in few-shot settings. The results also indicate that even when the number of entity types is large (46~86), the superiority of prompt-learning still holds.
5.6 Results of Zero-shot Entity Typing
Table 4 shows the results on the zero-shot entity typing task. We do not report the performance of the vanilla fine-tuning approach because it cannot produce reasonable results with a randomly initialized classifier. It should also be noted that the prompt method without fine-tuning already outperforms random guessing, which indicates that adding a prompt is informative for a model pre-trained on the masked-language-modeling task (e.g., BERT) and can induce reasonable predictions in entity typing tasks. Second, the performance of the model improves by a large margin when trained on unlabeled data. This shows the effectiveness of the proposed self-supervised training approach and points to the potential of a pre-trained prompt-based model under the zero-shot setting, when no labeled data are available.
To explore the more subtle changes in performance, we carry out a case study for zero-shot entity typing. In Figure 4, we illustrate the zero-shot prediction distributions (the correct prediction and the other top-5 predictions) for four entity types in FEW-NERD: ORG-SPORTSLEAGUE, EVENT-ATTACK, MISC-CURRENCY, and LOC-MOUNTAIN. We observe that with self-supervised prompt-learning, PLET (S) can summarize entity type information and infer the related words to a certain extent. In Figure 4 (a) and Figure 4 (b), the PLET model suffers from a severe bias and predicts almost no correct labels in the zero-shot setting, since such words are low-frequency. Although there is no explicit supervision in the training of PLET (S), the model can still find the corresponding words that express the ORG-SPORTSLEAGUE and EVENT-ATTACK types. In Figure 4 (c), self-supervised learning increases the performance of the original encoder. Further, in Figure 4 (d), PLET is already able to make satisfying predictions for the type LOC-MOUNTAIN; in this case, the use of self-supervised learning has hardly weakened the performance, which means that the process of automatically summarizing type information has little negative impact on high-confidence entity types.
5.7 Effect of Templates

As stated in previous studies (Gao et al., 2020; Zhao et al., 2021), the choice of templates may have a huge impact on performance in prompt-learning. In this section, we carry out experiments to investigate such influence. Experiments are conducted under the 8-shot setting on the FEW-NERD dataset, with 3 different hard-encoding templates and 4 soft-encoding templates (obtained by changing the number of prompt tokens l). The results demonstrate that the choice of templates exerts a considerable influence on the performance of prompt-based few-shot learning. For the hard-encoding templates, the phrase that describes the location, "in this sentence", contributes a remarkable improvement in performance. For the soft-encoding templates, surprisingly, the prompt-learning model yields the best result with the fewest special tokens.
6 Related Work

After a series of effective PLMs like GPT (Radford et al., 2018), BERT (Devlin et al., 2019), RoBERTa (Liu et al., 2019), and T5 (Raffel et al.,
[Figure 4 appears here.]

Figure 4: Zero-shot prediction distributions (the correct prediction and the other top-5 predictions) on four types in FEW-NERD: (a) ORG-SPORTSLEAGUE, (b) EVENT-ATTACK, (c) MISC-CURRENCY, (d) LOC-MOUNTAIN. In each subgraph, the left part illustrates the results of PLET and the right part shows the results of PLET (S). Markers distinguish correct predictions, wrong predictions with the correct coarse-grained type, and wrong predictions with a wrong coarse-grained type.
Encoding Strategy  Template T(x)                             Acc    MiF    MaF
Hard-encoding      x [Ent] is [MASK]                         54.45  67.34  67.34
                   x [Ent] is a [MASK]                       53.93  66.44  66.44
                   x In this sentence [Ent] is [MASK]        55.75  68.74  68.74
Soft-encoding      x [P] [Ent] [P1] ... [Pl] [MASK], l = 2   59.25  69.58  69.58
                   x [P] [Ent] [P1] ... [Pl] [MASK], l = 3   53.66  66.06  66.06
                   x [P] [Ent] [P1] ... [Pl] [MASK], l = 4   52.96  66.01  66.01
                   x [P] [Ent] [P1] ... [Pl] [MASK], l = 5   55.44  68.39  68.39

Table 5: Effect of templates. The results are produced under the 8-shot setting on the FEW-NERD dataset by PLET.
2020), fine-tuned PLMs have demonstrated their effectiveness on various important NLP tasks such as dialogue generation (Zhang et al., 2020), text summarization (Zhang et al., 2019; Liu and Lapata, 2019), question answering (Adiwardana et al., 2020), and text classification (Baldini Soares et al., 2019; Peng et al., 2020; Ding et al., 2021a).

Despite the success of fine-tuning PLMs, the huge gap in objective form between pre-training and fine-tuning still hinders the full use of pre-trained knowledge for downstream tasks (Liu et al., 2021b; Han et al., 2021b; Hu et al., 2021). To this end, prompt-learning has been proposed. In prompt-learning, by leveraging language prompts as contexts, downstream tasks can be expressed as cloze-style objectives similar to the pre-training objectives. The seminal work that stimulated the development of prompt-learning is the birth of GPT-3 (Brown et al., 2020), which uses hand-crafted prompts for tuning and achieves very impressive performance on various tasks, especially under the setting of few-shot learning.
Inspired by GPT-3, a series of hand-crafted prompts have been widely explored in knowledge probing (Trinh and Le, 2018; Petroni et al., 2019; Davison et al., 2019), relation classification (Han et al., 2021b), sentiment classification, and natural language inference (Schick and Schütze, 2021; Liu et al., 2021b). To avoid labor-intensive prompt design, automatic prompt search has also been extensively explored (Schick et al., 2020; Schick and Schütze, 2021; Shin et al., 2020; Gao et al., 2020; Liu et al., 2021a) to generate language phrases for prompts. Recently, continuous prompts have also been proposed (Li and Liang, 2021; Lester et al., 2021), which directly use a series of learnable continuous embeddings as prompts rather than discrete language phrases.
In this paper, we aim to stimulate PLMs with prompt-learning to capture the attribute information of entities. We take fine-grained entity typing, a crucial task in knowledge extraction that assigns entity types to entity mentions (Lin et al., 2012), as the foothold to develop prompt-learning strategies. In fact, Dai et al. (2021) use hypernym extraction patterns to enhance the context and apply masked language modeling to tackle the ultra-fine entity typing problem (Choi et al., 2018) with free-form labels, which shares a similar idea with prompt-learning. In our work, we mainly emphasize using prompt-learning to extract pre-defined entity types in low-data scenarios.
7 Conclusion

This work investigates the application of prompt-learning to fine-grained entity typing. More specifically, we propose a framework, PLET, that can deal with fine-grained entity typing in fully supervised, few-shot, and zero-shot scenarios. In PLET, we first introduce a simple and effective prompt-learning pipeline that can be used to extract entity types with both sufficient and insufficient supervision. Furthermore, to handle the zero-shot setting, we propose a self-supervised prompt-learning approach that automatically learns and summarizes entity types based on unlabeled corpora and a pre-defined label schema. PLET utilizes prompts to take advantage of the prior knowledge distributed in PLMs, and can learn pre-defined type information without overfitting by performing distribution-level optimization. In future work, along the direction of PLET (S), we will explore better prompt-learning approaches to automatically learning entity types from unlabeled data.
References

Daniel Adiwardana, Minh-Thang Luong, David R. So, Jamie Hall, Noah Fiedel, Romal Thoppilan, Zi Yang, Apoorv Kulshreshtha, Gaurav Nemade, Yifeng Lu, et al. 2020. Towards a human-like open-domain chatbot. arXiv preprint arXiv:2001.09977.

Livio Baldini Soares, Nicholas FitzGerald, Jeffrey Ling, and Tom Kwiatkowski. 2019. Matching the blanks: Distributional similarity for relation learning. In Proceedings of ACL, pages 2895–2905.

Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. In Proceedings of NIPS, pages 1877–1901.

Eunsol Choi, Omer Levy, Yejin Choi, and Luke Zettlemoyer. 2018. Ultra-fine entity typing. In Proceedings of ACL, pages 87–96.

Hongliang Dai, Yangqiu Song, and Haixun Wang. 2021. Ultra-fine entity typing with weak supervision from a masked language model. In Proceedings of ACL, pages 1790–1799.

Joe Davison, Joshua Feldman, and Alexander M. Rush. 2019. Commonsense knowledge mining from pretrained models. In Proceedings of EMNLP-IJCNLP, pages 1173–1178.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of NAACL-HLT, pages 4171–4186.

Ning Ding, Xiaobin Wang, Yao Fu, Guangwei Xu, Rui Wang, Pengjun Xie, Ying Shen, Fei Huang, Hai-Tao Zheng, and Rui Zhang. 2021a. Prototypical representation learning for relation extraction. In Proceedings of ICLR.

Ning Ding, Guangwei Xu, Yulin Chen, Xiaobin Wang, Xu Han, Pengjun Xie, Hai-Tao Zheng, and Zhiyuan Liu. 2021b. Few-NERD: A few-shot named entity recognition dataset. In Proceedings of ACL, pages 3198–3213.
Tianyu Gao, Adam Fisch, and Danqi Chen. 2020. Making pre-trained language models better few-shot learners. arXiv preprint arXiv:2012.15723.

Xu Han, Zhengyan Zhang, Ning Ding, Yuxian Gu, Xiao Liu, Yuqi Huo, Jiezhong Qiu, Liang Zhang, Wentao Han, Minlie Huang, et al. 2021a. Pre-trained models: Past, present and future. arXiv preprint arXiv:2106.07139.

Xu Han, Weilin Zhao, Ning Ding, Zhiyuan Liu, and Maosong Sun. 2021b. PTR: Prompt tuning with rules for text classification. arXiv preprint arXiv:2105.11259.

John Hewitt and Christopher D. Manning. 2019. A structural probe for finding syntax in word representations. In Proceedings of NAACL, pages 4129–4138.

Shengding Hu, Ning Ding, Huadong Wang, Zhiyuan Liu, Juanzi Li, and Maosong Sun. 2021. Knowledgeable prompt-tuning: Incorporating knowledge into prompt verbalizer for text classification. arXiv preprint arXiv:2108.02035.

Ganesh Jawahar, Benoît Sagot, and Djamé Seddah. 2019. What does BERT learn about the structure of language? In Proceedings of ACL, pages 3651–3657.

Brian Lester, Rami Al-Rfou, and Noah Constant. 2021. The power of scale for parameter-efficient prompt tuning. arXiv preprint arXiv:2104.08691.

Xiang Lisa Li and Percy Liang. 2021. Prefix-tuning: Optimizing continuous prompts for generation. arXiv preprint arXiv:2101.00190.

Thomas Lin, Oren Etzioni, et al. 2012. No noun phrase left behind: Detecting and typing unlinkable entities. In Proceedings of EMNLP-CoNLL, pages 893–903.
Xiao Ling and Daniel S. Weld. 2012. Fine-grained entity recognition. In AAAI.

Pengfei Liu, Weizhe Yuan, Jinlan Fu, Zhengbao Jiang, Hiroaki Hayashi, and Graham Neubig. 2021a. Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing. arXiv preprint arXiv:2107.13586.

Xiao Liu, Yanan Zheng, Zhengxiao Du, Ming Ding, Yujie Qian, Zhilin Yang, and Jie Tang. 2021b. GPT understands, too. arXiv preprint arXiv:2103.10385.

Yang Liu and Mirella Lapata. 2019. Text summarization with pretrained encoders. In Proceedings of EMNLP-IJCNLP, pages 3730–3740.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.

Ilya Loshchilov and Frank Hutter. 2019. Decoupled weight decay regularization. In Proceedings of ICLR.

Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Köpf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. 2019. PyTorch: An imperative style, high-performance deep learning library. In Proceedings of NIPS, pages 8024–8035.

Hao Peng, Tianyu Gao, Xu Han, Yankai Lin, Peng Li, Zhiyuan Liu, Maosong Sun, and Jie Zhou. 2020. Learning from context or names? An empirical study on neural relation extraction. In Proceedings of EMNLP, pages 3661–3672.

Fabio Petroni, Tim Rocktäschel, Sebastian Riedel, Patrick Lewis, Anton Bakhtin, Yuxiang Wu, and Alexander Miller. 2019. Language models as knowledge bases? In Proceedings of EMNLP, pages 2463–2473.

Xipeng Qiu, Tianxiang Sun, Yige Xu, Yunfan Shao, Ning Dai, and Xuanjing Huang. 2020. Pre-trained models for natural language processing: A survey. Science China Technological Sciences, pages 1–26.
Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding by generative pre-training. OpenAI.

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. JMLR, 21:1–67.

Xiang Ren, Wenqi He, Meng Qu, Lifu Huang, Heng Ji, and Jiawei Han. 2016a. AFET: Automatic fine-grained entity typing by hierarchical partial-label embedding. In Proceedings of EMNLP, pages 1369–1378.

Xiang Ren, Wenqi He, Meng Qu, Clare R. Voss, Heng Ji, and Jiawei Han. 2016b. Label noise reduction in entity typing by heterogeneous partial-label embedding. In Proceedings of SIGKDD, pages 1825–1834.

Timo Schick, Helmut Schmid, and Hinrich Schütze. 2020. Automatically identifying words that can serve as labels for few-shot text classification. In Proceedings of COLING, pages 5569–5578.

Timo Schick and Hinrich Schütze. 2021. Exploiting cloze-questions for few-shot text classification and natural language inference. In Proceedings of EACL, pages 255–269.

Sonse Shimaoka, Pontus Stenetorp, Kentaro Inui, and Sebastian Riedel. 2017. Neural architectures for fine-grained entity type classification. In Proceedings of EACL, pages 1271–1280.

Taylor Shin, Yasaman Razeghi, Robert L. Logan IV, Eric Wallace, and Sameer Singh. 2020. AutoPrompt: Eliciting knowledge from language models using automatically generated prompts. In Proceedings of EMNLP, pages 4222–4235.

Trieu H. Trinh and Quoc V. Le. 2018. A simple method for commonsense reasoning. arXiv preprint arXiv:1806.02847.

Dong Wang, Ning Ding, Piji Li, and Haitao Zheng. 2021. CLINE: Contrastive learning with semantic negative examples for natural language understanding. In Proceedings of ACL.

Ralph Weischedel and Ada Brunstein. 2005. BBN Pronoun Coreference and Entity Type Corpus. Linguistic Data Consortium, Philadelphia.

Ralph Weischedel, Martha Palmer, Mitchell Marcus, Eduard Hovy, Sameer Pradhan, Lance Ramshaw, Nianwen Xue, Ann Taylor, Jeff Kaufman, Michelle Franchini, Mohammed El-Bachouti, Robert Belvin, and Ann Houston. 2013. OntoNotes Release 5.0. Abacus Data Network.

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Remi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander Rush. 2020. Transformers: State-of-the-art natural language processing. In Proceedings of EMNLP, pages 38–45.

Jingqing Zhang, Yao Zhao, Mohammad Saleh, and Peter J. Liu. 2019. PEGASUS: Pre-training with extracted gap-sentences for abstractive summarization. In Proceedings of ICML, pages 11328–11339.

Yizhe Zhang, Siqi Sun, Michel Galley, Yen-Chun Chen, Chris Brockett, Xiang Gao, Jianfeng Gao, Jingjing Liu, and Bill Dolan. 2020. DialoGPT: Large-scale generative pre-training for conversational response generation. In Proceedings of ACL, pages 270–278.

Tony Z. Zhao, Eric Wallace, Shi Feng, Dan Klein, and Sameer Singh. 2021. Calibrate before use: Improving few-shot performance of language models. arXiv preprint arXiv:2102.09690.
Figure 3: Illustration of self-supervised prompt-learning for fine-grained entity typing with unlabeled data and a pre-defined label set. V* denotes the label words projected from the input label set. Note that we only show the positive pair in this figure.
typing after being trained on D and Y. Using prompt-learning as the training strategy, we first construct a label word set V* from Y, and for each sentence x in D we wrap it with a hard-encoding template containing a [MASK] symbol. The key idea is to make the prediction distributions over V* of entities of the same type as similar as possible. In this way, we can perform contrastive learning by sampling positive and negative examples, while ignoring the impact on optimization of words that are not in V* during the MLM process.
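As a sketch of this step, the following shows how an input can be wrapped with an entity-oriented template and how a full-vocabulary MLM distribution can be restricted to the label words V*. The template wording, label words, and mock distribution here are illustrative stand-ins, not the paper's exact sets.

```python
def wrap_with_template(sentence, mention):
    """Wrap an input with a hard-encoding entity-oriented template (illustrative wording)."""
    return f"{sentence} In this sentence, {mention} is [MASK]."

def project_to_label_words(vocab_probs, label_words):
    """Renormalize the full-vocabulary MLM distribution over the label word set V*."""
    mass = sum(vocab_probs.get(w, 0.0) for w in label_words)
    return {w: vocab_probs.get(w, 0.0) / mass for w in label_words}

x = wrap_with_template("London is one of the biggest cities.", "London")
# toy output of a (mock) MLM head over the full vocabulary
vocab_probs = {"city": 0.30, "person": 0.05, "organization": 0.10, "the": 0.55}
p = project_to_label_words(vocab_probs, ["city", "person", "organization"])
```

Restricting the distribution to V* is what lets the later contrastive objective ignore all words outside the label set.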
4.2 Self-supervised Learning
Although there are no labels in D, we can still develop a sampling strategy based on a simple hypothesis: the same entity mention in different sentences tends to have similar types. For instance, we sample two sentences that contain "Steve Jobs" as a positive pair. Moreover, considering that entity typing is context-aware ("Steve Jobs" could be entrepreneur, designer, or philanthropist in different contexts), we choose to optimize the similarity between the distributions of words over V*. This strategy not only softens the supervision but also eliminates the impact of other words in self-supervised learning.
Specifically, we randomly sample c positive pairs (sentence pairs that share the same entity mention), denoted as D̂pos, and c negative pairs (two sentences with different entity mentions marked), denoted as D̂neg, from a large-scale entity-linked corpus D. To avoid generating false negative samples, the negatives are further restricted by a large dictionary that contains common entities and their type information: only sentence pairs whose entities have different types in the dictionary are selected as negative samples. We then wrap the sentences with the hard-encoding template T3(·). To avoid overfitting to entity names, we randomly replace the entity mention (in the original input and the template) with a special symbol [HIDE] with probability α. Empirically, α is set to 0.4.
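A minimal sketch of this pair-construction procedure. The corpus format, type dictionary, and function names are hypothetical; the actual pipeline operates on entity-linked Wikipedia at much larger scale.

```python
import random
from collections import defaultdict
from itertools import combinations

def build_pairs(corpus, type_dict, c, seed=0):
    """corpus: list of (sentence, mention) from an entity-linked corpus.
    type_dict: mention -> type; used only to filter out false negatives.
    Returns c positive pairs (two sentences sharing a mention) and c
    negative pairs (mentions whose dictionary types differ)."""
    rng = random.Random(seed)
    by_mention = defaultdict(list)
    for sent, mention in corpus:
        by_mention[mention].append(sent)

    # positive pairs: two sentences that share one entity mention
    positives = [((s1, m), (s2, m))
                 for m, sents in by_mention.items()
                 for s1, s2 in combinations(sents, 2)]

    # negative pairs: keep only mentions with different dictionary types
    mentions = list(by_mention)
    negatives = []
    while len(negatives) < c:
        m1, m2 = rng.sample(mentions, 2)
        if type_dict.get(m1) != type_dict.get(m2):
            negatives.append(((rng.choice(by_mention[m1]), m1),
                              (rng.choice(by_mention[m2]), m2)))
    return rng.sample(positives, min(c, len(positives))), negatives

def hide_mention(sentence, mention, alpha=0.4, rng=random):
    """Replace the entity mention with [HIDE] with probability alpha."""
    return sentence.replace(mention, "[HIDE]") if rng.random() < alpha else sentence
```

The dictionary check is the guard against false negatives described above: two different mentions of the same type (e.g., two athletes) would otherwise be pushed apart incorrectly.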
Since the impact of a pair of examples on training should be measured at the distribution level, we choose the Jensen-Shannon divergence as the metric to assess the similarity of two distributions. Thus, for a sentence pair (x, x'), the similarity score of the two predictions h and h' at the [MASK] position is computed by

s(h, h') = JS(P_V*(w|x), P_V*(w|x')),    (5)

where JS is the Jensen-Shannon divergence, and P_V*(w|x) and P_V*(w|x') are the probability distributions of the predicted token w over V*, obtained from h and h'.
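The Jensen-Shannon divergence in Eq. (5) is the average of two KL terms against the mixture distribution. A small self-contained sketch (with base-2 logarithm, so values fall in [0, 1]):

```python
import math

def js_divergence(p, q):
    """Jensen-Shannon divergence between two distributions given as
    equal-length probability vectors (base-2 log, so JS(p, q) in [0, 1])."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]  # mixture distribution

    def kl(a, b):
        # Kullback-Leibler divergence, skipping zero-probability terms
        return sum(ai * math.log2(ai / bi) for ai, bi in zip(a, b) if ai > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)
```

Unlike raw KL, JS is symmetric and bounded, which is what makes it usable as a similarity score s(h, h') inside the log terms of the training objective.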
As we attempt to make the predictions of the positive pairs similar, the objective is computed by

L = − (1/|D̂pos|²) Σ_{x∈D̂pos} Σ_{x'∈D̂pos} log(1 − s(h, h'))
    − (1/|D̂neg|²) Σ_{x∈D̂neg} Σ_{x'∈D̂neg} γ log(s(h, h')),    (6)
where γ is a penalty term, since the underlying assumption is looser for negative pairs. Overall, we use the entity-linked English Wikipedia corpus as the raw data and generate about 1 million pairs each for D̂pos and D̂neg.
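Putting Eq. (5) and Eq. (6) together, a toy sketch of the loss over sampled distribution pairs. As assumptions not fixed by the text: it averages over the sampled pairs rather than computing the full |D̂|² double sum, uses γ = 0.5, and adds a small eps for numerical stability at s = 0 or s = 1.

```python
import math

def js(p, q):
    """Jensen-Shannon divergence (base-2 log), used as s(h, h') in Eq. (5)."""
    m = [(a + b) / 2 for a, b in zip(p, q)]
    kl = lambda a, b: sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def pair_loss(pos_pairs, neg_pairs, gamma=0.5, eps=1e-7):
    """Eq. (6) over sampled pairs of label-word distributions: positives are
    pushed toward s -> 0 (similar), negatives toward s -> 1, the latter
    down-weighted by the penalty gamma."""
    loss = 0.0
    for p, q in pos_pairs:
        loss -= math.log(1 - js(p, q) + eps) / len(pos_pairs)
    for p, q in neg_pairs:
        loss -= gamma * math.log(js(p, q) + eps) / len(neg_pairs)
    return loss
```

Agreeing pairs with disagreeing negatives yield a near-zero loss; the loss blows up when a positive pair's distributions diverge.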
Dataset     #Types   Supervised (train / dev / test)   Few-shot (train / dev / test)         Zero-shot (train / dev / test)
Few-NERD    66       340,382 / 48,758 / 96,901         66~1,056 / same as train / 96,901     0 / 0 / 96,901
OntoNotes   86       253,239 / 2,200 / 8,962           86~1,376 / same as train / 8,962      0 / 0 / 8,962
BBN         46       86,077 / 12,824 / 12,824          46~736 / same as train / 12,824       0 / 0 / 12,824

Table 1: Statistics of FEW-NERD, OntoNotes, and BBN for the three experimental settings. For all three settings the test sets are identical. For the training set of the few-shot setting, we report the summation from 1-shot to 16-shot.
Shot   Metric   Few-NERD FT / PLET        OntoNotes FT / PLET       BBN FT / PLET
1      Acc      8.94 / 43.87 (+34.93)     3.70 / 38.97 (+35.27)     0.80 / 40.70 (+39.90)
       MiF      19.85 / 60.60 (+45.75)    18.98 / 59.91 (+40.93)    5.79 / 49.25 (+43.46)
       MaF      19.85 / 60.60 (+40.75)    19.43 / 61.42 (+41.99)    4.42 / 48.48 (+43.06)
2      Acc      20.83 / 47.78 (+26.95)    7.27 / 39.19 (+31.92)     6.68 / 41.33 (+34.65)
       MiF      32.67 / 62.09 (+29.42)    24.89 / 61.09 (+36.20)    13.70 / 54.00 (+40.30)
       MaF      32.67 / 62.09 (+29.42)    25.64 / 62.68 (+37.04)    13.23 / 51.97 (+38.74)
4      Acc      33.09 / 57.00 (+23.91)    11.15 / 38.39 (+27.24)    19.34 / 52.21 (+32.87)
       MiF      44.14 / 68.61 (+24.47)    27.69 / 59.81 (+32.12)    27.03 / 61.13 (+34.10)
       MaF      44.14 / 68.61 (+24.47)    28.26 / 60.89 (+32.63)    24.69 / 58.91 (+34.22)
8      Acc      46.44 / 55.75 (+9.31)     18.37 / 39.37 (+21.00)    27.01 / 44.30 (+17.29)
       MiF      57.76 / 68.74 (+10.98)    38.16 / 57.97 (+19.81)    40.19 / 56.21 (+16.02)
       MaF      57.76 / 68.74 (+10.98)    37.77 / 58.32 (+20.55)    39.50 / 55.15 (+15.65)
16     Acc      60.98 / 61.58 (+0.60)     32.26 / 42.29 (+10.03)    39.67 / 55.00 (+15.33)
       MiF      71.59 / 72.39 (+0.80)     51.40 / 60.79 (+9.39)     49.01 / 62.84 (+13.83)
       MaF      71.59 / 72.39 (+0.80)     51.45 / 61.80 (+10.35)    47.09 / 62.38 (+15.29)

Table 2: Results of few-shot entity typing on FEW-NERD, OntoNotes, and BBN. All methods use BERT-base with the same initialization weights as the backbone encoder. The training set and dev set have the same size.
5 Experiments
In this section, we conduct experiments to evaluate the effectiveness of our methods. We use FT to denote the BERT-based fine-tuning approach, PLET to denote the naive prompt-learning approach for entity typing in §3, and PLET (S) to denote the self-supervised prompt-learning approach in §4. Our experiments are carried out in fully supervised (§5.4), few-shot (§5.5), and zero-shot (§5.6) settings on three fine-grained entity typing datasets.
5.1 Datasets
We use three fine-grained entity typing datasets: FEW-NERD, OntoNotes, and BBN.

FEW-NERD. We use FEW-NERD (Ding et al., 2021b) as the main dataset, which has the following advantages: (1) FEW-NERD is large-scale and fine-grained, containing 8 coarse-grained and 66 fine-grained entity types; (2) FEW-NERD is manually annotated, so we can precisely assess the capability of entity typing models. Specifically, we use the supervised setting of the dataset, FEW-NERD (SUP), and its official split to conduct our experiments.

OntoNotes. We also use the OntoNotes 5.0 dataset (Weischedel et al., 2013) in experiments. Following previous works on fine-grained entity typing, we adopt the 86-class version of OntoNotes, in which each class has at most 3 levels of type hierarchy. The data split is identical to that of Shimaoka et al. (2017).

BBN. The BBN dataset is selected from the Penn Treebank corpus of Wall Street Journal texts and labeled by Weischedel and Brunstein (2005). We follow the version processed by Ren et al. (2016a) and the data split of Ren et al. (2016b). The dataset contains 46 types, and each type has a maximum type hierarchy level of 2.
5.2 Experimental Settings
The experiments are performed under three different settings to evaluate the effect of the prompt-learning method and of self-supervised training. Table 1 shows the statistics of all the settings on the three datasets.
Supervised Setting. In the fully supervised setting, all training data are used in the training phase, and both FT and PLET are used to train the model. We run the experiments on all three datasets with the BERT-base-cased backbone. Both hard and soft encodings are used for PLET.

Few-shot Setting. In the few-shot setting, we randomly sample 1, 2, 4, 8, or 16 instances per entity type for training. We apply both FT and PLET with hard encoding on all three datasets.

Zero-shot Setting. In the zero-shot setting, no labeled training data are available, and the model must infer the entity type without any supervised training. Since fine-tuning is not applicable here, we only conduct experiments with PLET and PLET (S).

Metrics. For evaluation, we follow the widely used setting of Ling and Weld (2012), which includes strict accuracy (Acc), loose macro F1-score (MaF), and loose micro F1-score (MiF). The loose F1-scores credit type labels at different granularities.
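These metrics can be sketched for multi-label type sets as follows. This is our reading of the Ling and Weld (2012) definitions with predictions and gold labels as Python sets, not their reference implementation; the sketch assumes non-empty prediction and gold sets.

```python
def strict_accuracy(preds, golds):
    """Fraction of examples whose predicted type set exactly matches the gold set."""
    return sum(p == g for p, g in zip(preds, golds)) / len(golds)

def loose_macro_f1(preds, golds):
    """Precision/recall averaged per example over overlapping type labels."""
    n = len(golds)
    p = sum(len(pr & go) / len(pr) for pr, go in zip(preds, golds)) / n
    r = sum(len(pr & go) / len(go) for pr, go in zip(preds, golds)) / n
    return 2 * p * r / (p + r)

def loose_micro_f1(preds, golds):
    """Precision/recall from total overlap counts pooled across all examples."""
    inter = sum(len(pr & go) for pr, go in zip(preds, golds))
    p = inter / sum(len(pr) for pr in preds)
    r = inter / sum(len(go) for go in golds)
    return 2 * p * r / (p + r)
```

Under strict accuracy a partially correct type set scores zero, while the loose scores give partial credit, which is why they diverge on hierarchical labels.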
5.3 Experimental Details
We use BERT-base (Devlin et al., 2019) as the backbone structure of our model, initialized with the corresponding pre-trained cased weights (https://github.com/google-research/bert). The hidden size is 768 and the number of layers is 12. Models are implemented with the PyTorch framework (https://pytorch.org; Paszke et al., 2019) and Huggingface Transformers (https://github.com/huggingface/transformers; Wolf et al., 2020). BERT models are optimized by AdamW (Loshchilov and Hutter, 2019) with a learning rate of 5e-5. The training batch size is 16 for all models. In the supervised setting, each model is trained for 10 epochs and evaluated on the dev set every 2000 steps. In the few-shot setting, each model is trained for 30 epochs and evaluated every 10~50 steps; each evaluation runs for 200 steps. For the methods with hard encoding, we report the experimental results of T3(·). For the soft-encoding method, we report the results of m = 2. Experiments are conducted with CUDA on NVIDIA Tesla V100 GPUs.
5.4 Results of Fully Supervised Entity Typing
Dataset     Metric   FT      PLET (H)   PLET (S)
Few-NERD    Acc      79.75   79.90      79.86
            MiF      85.74   85.84      85.76
            MaF      85.74   85.84      85.76
OntoNotes   Acc      59.71   60.37      65.68
            MiF      70.47   70.78      74.53
            MaF      76.57   76.42      79.77
BBN         Acc      62.39   65.92      63.11
            MiF      68.88   71.55      68.68
            MaF      67.37   70.82      67.81

Table 3: Fully supervised entity typing results. FT denotes the vanilla fine-tuning method, (H) the hard-encoding strategy, and (S) the soft-encoding strategy. All methods use BERT-base with the same initialization weights as the backbone encoder.
The results on all three datasets across different models are reported in Table 3. Overall, the prompt-based methods show consistent improvements over directly fine-tuned models, indicating that prompting helps the model capture entity-type information from a given context.

It is also observed that the magnitude of the improvement and the preferred prompt encoding strategy vary across datasets. The prompt-based method seems less effective on FEW-NERD than on the other two datasets, suggesting that its effect partially depends on the characteristics of the dataset and that different prompt designs may suit different data. Specifically, FEW-NERD is manually annotated and contains much less noise than the other two datasets, which benefits the FT method in learning classification with an extra linear layer. Moreover, on OntoNotes, soft encoding significantly outperforms hard encoding, while on the other two datasets the effect is reversed.
5.5 Results of Few-shot Entity Typing
Table 2 shows the results of few-shot entity typing. The prompt-based model outperforms fine-tuning by a large margin in the few-shot setting, especially when only 1~2 training instances per type are available. It should be noted
Dataset      Metric   PLET    PLET (S)
Few-NERD     Acc      17.55   23.99 (+6.44)
             MiF      28.39   47.98 (+19.59)
             MaF      28.39   47.98 (+19.59)
OntoNotes‡   Acc      25.10   28.27 (+3.17)
             MiF      33.61   49.79 (+16.18)
             MaF      37.91   49.95 (+12.04)
BBN          Acc      55.82   57.79 (+1.97)
             MiF      60.64   63.24 (+2.60)
             MaF      59.99   64.00 (+4.01)

Table 4: Results of zero-shot entity typing on FEW-NERD, OntoNotes, and BBN. ‡ means that we remove the "Other" class during testing. PLET denotes the prompt-learning pipeline and PLET (S) denotes self-supervised prompt-learning; both methods use BERT-base as the backbone encoder.
that for the OntoNotes and BBN datasets, sampling 16 instances per entity type already amounts to over 0.5% of the total training data. Meanwhile, some of the data in BBN are distantly supervised and potentially erroneous, which brings more randomness to few-shot training. The results support the idea that a well-designed prompt has much potential for mining the knowledge learned by pre-trained models, and thus yields better performance in few-shot settings. They also indicate that even when the number of entity types is large (46~86), the superiority of prompt-learning still holds.
5.6 Results of Zero-shot Entity Typing
Table 4 shows the results of the zero-shot entity typing task on the three datasets. We do not report the performance of the vanilla fine-tuning approach because it cannot produce reasonable results with a randomly initialized classifier. It should be noted that the prompt method without fine-tuning already outperforms random guessing, indicating that adding a prompt is informative for a model pre-trained with a masked-language-model objective (e.g., BERT) and can induce reasonable predictions in entity typing tasks. Second, the performance of the model improves by a large margin when trained on unlabeled data. This shows the effectiveness of the proposed self-supervised training approach and points to the potential of a pre-trained prompt-based model in the zero-shot setting, when no labeled data are available.

To explore the more subtle changes in performance, we carry out a case study for zero-shot entity typing. In Figure 4, we illustrate the zero-shot prediction distributions (the correct prediction and the other top-5 predictions) for four entity types in FEW-NERD: ORG-SPORTSLEAGUE, EVENT-ATTACK, MISC-CURRENCY, and LOC-MOUNTAIN. We observe that with self-supervised prompt-learning, PLET (S) can summarize entity type information and infer related words to a certain extent. In Figure 4 (a) and Figure 4 (b), the PLET model suffers from a severe bias and predicts almost no correct labels in the zero-shot setting, since such words are low-frequency. Although there is no explicit supervision in the self-supervised training stage of PLET (S), the model can still find the corresponding words that express the ORG-SPORTSLEAGUE and EVENT-ATTACK types. In Figure 4 (c), self-supervised learning increases the performance of the original encoder. Further, in Figure 4 (d), PLET is already able to make satisfying predictions for the type LOC-MOUNTAIN; in this case, self-supervised learning hardly weakens the performance, which suggests that automatically summarizing type information has little negative impact on high-confidence entity types.
5.7 Effect of Templates

As stated in previous studies (Gao et al., 2020; Zhao et al., 2021), the choice of templates may have a huge impact on performance in prompt-learning. In this section, we carry out experiments to investigate this influence. Experiments are conducted under the 8-shot setting on the FEW-NERD dataset, with 3 different hard-encoding templates and 4 soft-encoding templates (obtained by changing the number of prompt tokens l). The results (Table 5) demonstrate that the choice of templates exerts a considerable influence on the performance of prompt-based few-shot learning. For the hard-encoding templates, the phrase that describes the location, "In this sentence", contributes a remarkable improvement. For the soft-encoding templates, surprisingly, the prompt-learning model yields the best result with the fewest special tokens.
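The two template families compared here can be illustrated as input-layout builders. In the real model the [P] and [Pi] tokens of the soft encoding are trainable embeddings rather than literal strings, and the function names and exact hard-template wordings are our reconstruction.

```python
def hard_template(x, ent, variant=2):
    """Three hard-encoding template layouts (wording reconstructed from Table 5)."""
    forms = [
        f"{x} {ent} is [MASK].",
        f"{x} {ent} is a [MASK].",
        f"{x} In this sentence, {ent} is [MASK].",
    ]
    return forms[variant]

def soft_template(x, ent, l=2):
    """Soft encoding: l learnable pseudo tokens [P1]..[Pl] plus a shared [P]."""
    soft_tokens = " ".join(f"[P{i}]" for i in range(1, l + 1))
    return f"{x} [P] {ent} {soft_tokens} [MASK]."
```

The comparison above suggests the locating phrase "In this sentence" helps the hard templates, while the soft templates do best at the smallest l.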
6 Related Work
After a series of effective PLMs like GPT (Radford et al., 2018), BERT (Devlin et al., 2019), RoBERTa (Liu et al., 2019), and T5 (Raffel et al.,
(a) Zero-shot prediction distribution on ORG-SPORTSLEAGUE
(b) Zero-shot prediction distribution on EVENT-ATTACK
(c) Zero-shot prediction distribution on MISC-CURRENCY
(d) Zero-shot prediction distribution on LOC-MOUNTAIN
Figure 4: Zero-shot prediction distributions on four types in FEW-NERD. In each subgraph, the left part illustrates the results of PLET and the right part shows the results of PLET (S). The colors denote correct predictions, wrong predictions with correct coarse-grained types, and wrong predictions with wrong coarse-grained types.
Encoding Strategy   Template T(x)                                  Acc     MiF     MaF
Hard-encoding       x [Ent] is [MASK].                             54.45   67.34   67.34
                    x [Ent] is a [MASK].                           53.93   66.44   66.44
                    x In this sentence, [Ent] is [MASK].           55.75   68.74   68.74
Soft-encoding       x [P] [Ent] [P1] ... [Pl] [MASK]  (l = 2)      59.25   69.58   69.58
                    x [P] [Ent] [P1] ... [Pl] [MASK]  (l = 3)      53.66   66.06   66.06
                    x [P] [Ent] [P1] ... [Pl] [MASK]  (l = 4)      52.96   66.01   66.01
                    x [P] [Ent] [P1] ... [Pl] [MASK]  (l = 5)      55.44   68.39   68.39

Table 5: Effect of templates. The results are produced under the 8-shot setting on the FEW-NERD dataset by PLET.
2020), fine-tuned PLMs have demonstrated their effectiveness on various important NLP tasks, such as dialogue generation (Zhang et al., 2020), text summarization (Zhang et al., 2019; Liu and Lapata, 2019), question answering (Adiwardana et al., 2020), and text classification (Baldini Soares et al., 2019; Peng et al., 2020; Ding et al., 2021a).

Despite the success of fine-tuning PLMs, the huge gap in objective form between pre-training and fine-tuning still hinders the full use of pre-trained knowledge for downstream tasks (Liu et al., 2021b; Han et al., 2021b; Hu et al., 2021). To this end, prompt-learning has been proposed. In prompt-learning, by leveraging language prompts as contexts, downstream tasks can be expressed as cloze-style objectives similar to pre-training objectives. The seminal work that stimulated the development of prompt-learning is GPT-3 (Brown et al., 2020), which uses hand-crafted prompts for tuning and achieves very impressive performance on various tasks, especially under the few-shot learning setting.

Inspired by GPT-3, hand-crafted prompts have been widely explored in knowledge probing (Trinh and Le, 2018; Petroni et al., 2019; Davison et al., 2019), relation classification (Han et al., 2021b), sentiment classification, and natural language inference (Schick and Schütze, 2021; Liu et al., 2021b). To avoid labor-intensive prompt design, automatic prompt search has also been extensively explored to generate language phrases for prompts (Schick et al., 2020; Schick and Schütze, 2021; Shin et al., 2020; Gao et al., 2020; Liu et al., 2021a). Recently, continuous prompts have also been proposed (Li and Liang, 2021; Lester et al., 2021), which directly use a series of learnable continuous embeddings as prompts rather than discrete language phrases.

In this paper, we aim to stimulate PLMs with prompt-learning to capture the attribute information of entities. We take fine-grained entity typing, a crucial task in knowledge extraction that assigns entity types to entity mentions (Lin et al., 2012), as the foothold for developing prompt-learning strategies. In fact, Dai et al. (2021) use hypernym extraction patterns to enhance the context and apply masked language modeling to tackle the ultra-fine entity typing problem (Choi et al., 2018) with free-form labels, which shares a similar idea with prompt-learning. In our work, we mainly emphasize using prompt-learning to extract pre-defined entity types in low-data scenarios.
7 Conclusion
This work investigates the application of prompt-learning to fine-grained entity typing. More specifically, we propose a framework, PLET, that can deal with fine-grained entity typing in fully supervised, few-shot, and zero-shot scenarios. In PLET, we first introduce a simple and effective prompt-learning pipeline that can extract entity types with both sufficient and insufficient supervision. Furthermore, to handle the zero-shot setting, we propose a self-supervised prompt-learning approach that automatically learns and summarizes entity types based on unlabeled corpora and a pre-defined label schema. PLET utilizes prompts to take advantage of the prior knowledge distributed in PLMs, and it can learn pre-defined type information without overfitting by performing distribution-level optimization. In future work, along the direction of PLET (S), we will explore better prompt-learning approaches for automatically learning entity types from unlabeled data.
References

Daniel Adiwardana, Minh-Thang Luong, David R. So, Jamie Hall, Noah Fiedel, Romal Thoppilan, Zi Yang, Apoorv Kulshreshtha, Gaurav Nemade, Yifeng Lu, et al. 2020. Towards a human-like open-domain chatbot. arXiv preprint arXiv:2001.09977.

Livio Baldini Soares, Nicholas FitzGerald, Jeffrey Ling, and Tom Kwiatkowski. 2019. Matching the blanks: Distributional similarity for relation learning. In Proceedings of ACL, pages 2895–2905.

Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. In Proceedings of NIPS, pages 1877–1901.

Eunsol Choi, Omer Levy, Yejin Choi, and Luke Zettlemoyer. 2018. Ultra-fine entity typing. In Proceedings of ACL, pages 87–96.

Hongliang Dai, Yangqiu Song, and Haixun Wang. 2021. Ultra-fine entity typing with weak supervision from a masked language model. In Proceedings of ACL, pages 1790–1799.

Joe Davison, Joshua Feldman, and Alexander M. Rush. 2019. Commonsense knowledge mining from pre-trained models. In Proceedings of EMNLP-IJCNLP, pages 1173–1178.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of NAACL-HLT, pages 4171–4186.
Ning Ding, Xiaobin Wang, Yao Fu, Guangwei Xu, Rui Wang, Pengjun Xie, Ying Shen, Fei Huang, Hai-Tao Zheng, and Rui Zhang. 2021a. Prototypical representation learning for relation extraction. In Proceedings of ICLR.

Ning Ding, Guangwei Xu, Yulin Chen, Xiaobin Wang, Xu Han, Pengjun Xie, Hai-Tao Zheng, and Zhiyuan Liu. 2021b. Few-NERD: A few-shot named entity recognition dataset. In Proceedings of ACL, pages 3198–3213.

Tianyu Gao, Adam Fisch, and Danqi Chen. 2020. Making pre-trained language models better few-shot learners. arXiv preprint arXiv:2012.15723.

Xu Han, Zhengyan Zhang, Ning Ding, Yuxian Gu, Xiao Liu, Yuqi Huo, Jiezhong Qiu, Liang Zhang, Wentao Han, Minlie Huang, et al. 2021a. Pre-trained models: Past, present and future. arXiv preprint arXiv:2106.07139.

Xu Han, Weilin Zhao, Ning Ding, Zhiyuan Liu, and Maosong Sun. 2021b. PTR: Prompt tuning with rules for text classification. arXiv preprint arXiv:2105.11259.

John Hewitt and Christopher D. Manning. 2019. A structural probe for finding syntax in word representations. In Proceedings of NAACL, pages 4129–4138.

Shengding Hu, Ning Ding, Huadong Wang, Zhiyuan Liu, Juanzi Li, and Maosong Sun. 2021. Knowledgeable prompt-tuning: Incorporating knowledge into prompt verbalizer for text classification. arXiv preprint arXiv:2108.02035.

Ganesh Jawahar, Benoît Sagot, and Djamé Seddah. 2019. What does BERT learn about the structure of language? In Proceedings of ACL, pages 3651–3657.
Brian Lester, Rami Al-Rfou, and Noah Constant. 2021. The power of scale for parameter-efficient prompt tuning. arXiv preprint arXiv:2104.08691.

Xiang Lisa Li and Percy Liang. 2021. Prefix-tuning: Optimizing continuous prompts for generation. arXiv preprint arXiv:2101.00190.

Thomas Lin, Oren Etzioni, et al. 2012. No noun phrase left behind: Detecting and typing unlinkable entities. In Proceedings of EMNLP-CoNLL, pages 893–903.

Xiao Ling and Daniel S. Weld. 2012. Fine-grained entity recognition. In AAAI.

Pengfei Liu, Weizhe Yuan, Jinlan Fu, Zhengbao Jiang, Hiroaki Hayashi, and Graham Neubig. 2021a. Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing. arXiv preprint arXiv:2107.13586.

Xiao Liu, Yanan Zheng, Zhengxiao Du, Ming Ding, Yujie Qian, Zhilin Yang, and Jie Tang. 2021b. GPT understands, too. arXiv preprint arXiv:2103.10385.

Yang Liu and Mirella Lapata. 2019. Text summarization with pretrained encoders. In Proceedings of EMNLP-IJCNLP, pages 3730–3740.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.

Ilya Loshchilov and Frank Hutter. 2019. Decoupled weight decay regularization. In Proceedings of ICLR.
Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Köpf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. 2019. PyTorch: An imperative style, high-performance deep learning library. In Proceedings of NIPS, pages 8024–8035.

Hao Peng, Tianyu Gao, Xu Han, Yankai Lin, Peng Li, Zhiyuan Liu, Maosong Sun, and Jie Zhou. 2020. Learning from context or names? An empirical study on neural relation extraction. In Proceedings of EMNLP, pages 3661–3672.

Fabio Petroni, Tim Rocktäschel, Sebastian Riedel, Patrick Lewis, Anton Bakhtin, Yuxiang Wu, and Alexander Miller. 2019. Language models as knowledge bases? In Proceedings of EMNLP, pages 2463–2473.

Xipeng Qiu, Tianxiang Sun, Yige Xu, Yunfan Shao, Ning Dai, and Xuanjing Huang. 2020. Pre-trained models for natural language processing: A survey. Science China Technological Sciences, pages 1–26.

Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding by generative pre-training. OpenAI.

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. JMLR, 21:1–67.

Xiang Ren, Wenqi He, Meng Qu, Lifu Huang, Heng Ji, and Jiawei Han. 2016a. AFET: Automatic fine-grained entity typing by hierarchical partial-label embedding. In Proceedings of EMNLP, pages 1369–1378.

Xiang Ren, Wenqi He, Meng Qu, Clare R. Voss, Heng Ji, and Jiawei Han. 2016b. Label noise reduction in entity typing by heterogeneous partial-label embedding. In Proceedings of SIGKDD, pages 1825–1834.
Timo Schick, Helmut Schmid, and Hinrich Schütze. 2020. Automatically identifying words that can serve as labels for few-shot text classification. In Proceedings of COLING, pages 5569–5578.

Timo Schick and Hinrich Schütze. 2021. Exploiting cloze-questions for few-shot text classification and natural language inference. In Proceedings of EACL, pages 255–269.

Sonse Shimaoka, Pontus Stenetorp, Kentaro Inui, and Sebastian Riedel. 2017. Neural architectures for fine-grained entity type classification. In Proceedings of EACL, pages 1271–1280.

Taylor Shin, Yasaman Razeghi, Robert L. Logan IV, Eric Wallace, and Sameer Singh. 2020. AutoPrompt: Eliciting knowledge from language models using automatically generated prompts. In Proceedings of EMNLP, pages 4222–4235.

Trieu H. Trinh and Quoc V. Le. 2018. A simple method for commonsense reasoning. arXiv preprint arXiv:1806.02847.

Dong Wang, Ning Ding, Piji Li, and Haitao Zheng. 2021. CLINE: Contrastive learning with semantic negative examples for natural language understanding. In Proceedings of ACL.

Ralph Weischedel and Ada Brunstein. 2005. BBN Pronoun Coreference and Entity Type Corpus. Linguistic Data Consortium, Philadelphia.

Ralph Weischedel, Martha Palmer, Mitchell Marcus, Eduard Hovy, Sameer Pradhan, Lance Ramshaw, Nianwen Xue, Ann Taylor, Jeff Kaufman, Michelle Franchini, Mohammed El-Bachouti, Robert Belvin, and Ann Houston. 2013. OntoNotes Release 5.0. Abacus Data Network.
Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Remi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander Rush. 2020. Transformers: State-of-the-art natural language processing. In Proceedings of EMNLP, pages 38–45.

Jingqing Zhang, Yao Zhao, Mohammad Saleh, and Peter J. Liu. 2019. PEGASUS: Pre-training with extracted gap-sentences for abstractive summarization. In Proceedings of ICML, pages 11328–11339.

Yizhe Zhang, Siqi Sun, Michel Galley, Yen-Chun Chen, Chris Brockett, Xiang Gao, Jianfeng Gao, Jingjing Liu, and Bill Dolan. 2020. DialoGPT: Large-scale generative pre-training for conversational response generation. In Proceedings of ACL, pages 270–278.

Tony Z. Zhao, Eric Wallace, Shi Feng, Dan Klein, and Sameer Singh. 2021. Calibrate before use: Improving few-shot performance of language models. arXiv preprint arXiv:2102.09690.
Dataset Type Supervised Few-shot Zero-shot
|Dtrain| |Ddev| |Dtest| |Dtrain| |Ddev| |Dtest| |Dtrain| |Ddev| |Dtest|
Few-NERD 66 340382 48758 96901 66~1056 = |Dtrain| 96901 0 0 96901OntoNotes 86 253239 2200 8962 86~1376 = |Dtrain| 8962 0 0 8962BBN 46 86077 12824 12824 46~736 = |Dtrain| 12824 0 0 12824
Table 1 Statistics of FEW-NERD OntoNotes and BBN from three experimental settings It can be seen that forall three settings the test sets are identical For the training set of the few-shot setting we report the summationfrom 1-shot to 16-shot
Shot  Metric  Few-NERD                       OntoNotes                      BBN
      .       Fine-tuning  PLET              Fine-tuning  PLET              Fine-tuning  PLET
1     Acc     8.94         43.87 (+34.93)    3.70         38.97 (+35.27)    0.80         40.70 (+39.90)
      MiF     19.85        60.60 (+40.75)    18.98        59.91 (+40.93)    5.79         49.25 (+43.46)
      MaF     19.85        60.60 (+40.75)    19.43        61.42 (+41.99)    4.42         48.48 (+43.06)
2     Acc     20.83        47.78 (+26.95)    7.27         39.19 (+31.92)    6.68         41.33 (+34.65)
      MiF     32.67        62.09 (+29.42)    24.89        61.09 (+36.20)    13.70        54.00 (+40.30)
      MaF     32.67        62.09 (+29.42)    25.64        62.68 (+37.04)    13.23        51.97 (+38.74)
4     Acc     33.09        57.00 (+23.91)    11.15        38.39 (+27.24)    19.34        52.21 (+32.87)
      MiF     44.14        68.61 (+24.47)    27.69        59.81 (+32.12)    27.03        61.13 (+34.10)
      MaF     44.14        68.61 (+24.47)    28.26        60.89 (+32.63)    24.69        58.91 (+34.22)
8     Acc     46.44        55.75 (+9.31)     18.37        39.37 (+21.00)    27.01        44.30 (+17.29)
      MiF     57.76        68.74 (+10.98)    38.16        57.97 (+19.81)    40.19        56.21 (+16.02)
      MaF     57.76        68.74 (+10.98)    37.77        58.32 (+20.55)    39.50        55.15 (+15.65)
16    Acc     60.98        61.58 (+0.60)     32.26        42.29 (+10.03)    39.67        55.00 (+15.33)
      MiF     71.59        72.39 (+0.80)     51.40        60.79 (+9.39)     49.01        62.84 (+13.83)
      MaF     71.59        72.39 (+0.80)     51.45        61.80 (+10.35)    47.09        62.38 (+15.29)

Table 2: Results of few-shot entity typing on FEW-NERD, OntoNotes, and BBN. All methods use BERT-base with the same initialization weights as the backbone encoder. The training set and dev set have the same size.
5 Experiments
In this section, we conduct experiments to evaluate the effectiveness of our methods. We use FT to denote the BERT-based fine-tuning approach, PLET to denote the naive prompt-learning approach for entity typing in § 3, and PLET (S) to denote the self-supervised prompt-learning approach in § 4. Our experiments are carried out in fully supervised (§ 5.4), few-shot (§ 5.5), and zero-shot (§ 5.6) settings on three fine-grained entity typing datasets.
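As a rough illustration of the cloze-style pipeline described above, the sketch below wraps an entity mention in a template and maps the masked position's label-word probabilities to entity types through a verbalizer. The template string, verbalizer entries, and probability values are hypothetical stand-ins, not the paper's exact configuration; in practice the probabilities come from a masked language model such as BERT.

```python
def build_prompt(sentence: str, mention: str) -> str:
    """Wrap a sentence and an entity mention with a hard-encoding template."""
    return f"{sentence} In this sentence, {mention} is [MASK]."

# Hypothetical verbalizer: label words -> fine-grained entity types.
VERBALIZER = {
    "athlete": "PER-ATHLETE",
    "team": "ORG-SPORTSTEAM",
    "mountain": "LOC-MOUNTAIN",
}

def predict_type(mask_word_probs: dict) -> str:
    """Pick the type whose label word gets the highest [MASK] probability."""
    best_word = max(VERBALIZER, key=lambda w: mask_word_probs.get(w, 0.0))
    return VERBALIZER[best_word]

prompt = build_prompt("Messi joined Inter Miami.", "Messi")
# Toy numbers standing in for a masked LM's output distribution:
toy_probs = {"athlete": 0.62, "team": 0.21, "mountain": 0.01}
print(prompt)
print(predict_type(toy_probs))  # PER-ATHLETE
```

The decision rule here takes a single argmax over label words; richer verbalizers can aggregate several label words per type.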
5.1 Datasets
We use three fine-grained entity typing datasets: FEW-NERD, OntoNotes, and BBN.
FEW-NERD. We use FEW-NERD (Ding et al., 2021b) as the main dataset, which has the following advantages: (1) FEW-NERD is large-scale and fine-grained, containing 8 coarse-grained and 66 fine-grained entity types; (2) FEW-NERD is manually annotated, so we can precisely assess the capability of entity typing models. Specifically, we use the supervised setting of the dataset, FEW-NERD (SUP), and its official split to conduct our experiments.
OntoNotes. We also use the OntoNotes 5.0 dataset (Weischedel et al., 2013) in experiments. Following previous work on fine-grained entity typing, we adopt the 86-class version of OntoNotes, where each class has at most 3 levels of type hierarchy. The data split is identical to that of Shimaoka et al. (2017).
BBN. The BBN dataset is selected from the Penn Treebank corpus of Wall Street Journal texts and labeled by Weischedel and Brunstein (2005). We follow the version processed by Ren et al. (2016a) and the data split of Ren et al. (2016b). The dataset contains 46 types, and each type has a maximum hierarchy depth of 2.
5.2 Experimental Settings

The experiments are performed under three different settings to evaluate the effect of the prompt-learning method and self-supervised training. In Table 1 we show the statistics of all the settings on the three datasets.
Supervised Setting. In the fully supervised setting, all training data are used in the training phase. FT and PLET are used to train the model. We run the experiments on all three datasets with the BERT-base-cased backbone. Both hard and soft encodings are used for PLET.
Few-shot Setting. In the few-shot setting, we randomly sample 1, 2, 4, 8, or 16 instances for each entity type for training. We apply both the FT and PLET methods with hard encoding on all three datasets.
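The per-type sampling described above can be sketched as follows; the function and variable names are illustrative, not from our released code.

```python
import random

def sample_k_shot(dataset, k, seed=0):
    """Randomly sample k training instances per entity type.
    `dataset` is a list of (instance, type) pairs."""
    rng = random.Random(seed)  # fixed seed for reproducible support sets
    by_type = {}
    for inst, etype in dataset:
        by_type.setdefault(etype, []).append(inst)
    support = []
    for etype, insts in sorted(by_type.items()):
        chosen = rng.sample(insts, min(k, len(insts)))
        support.extend((inst, etype) for inst in chosen)
    return support

# Toy corpus: 100 sentences for each of three types.
data = [(f"sent-{i}-{t}", t) for i in range(100) for t in ("PER", "LOC", "ORG")]
print(len(sample_k_shot(data, k=4)))  # 3 types x 4 shots = 12
```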
Zero-shot Setting. In the zero-shot setting, no labeled training data are available. The model is required to infer the entity type without any supervised training. Since fine-tuning is not applicable in this setting, we only conduct experiments on PLET and PLET (S).
Metrics. For evaluation, we follow the widely used setting of Ling and Weld (2012), which includes strict accuracy (Acc), loose macro F1-score (MaF), and loose micro F1-score (MiF). The loose F1-scores take type labels at different granularities into account.
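As a reference, the three metrics of Ling and Weld (2012) can be computed over per-mention sets of (possibly multi-granular) type labels roughly as follows; this is an illustrative sketch, not the official evaluation script.

```python
def strict_acc(gold, pred):
    """Fraction of mentions whose predicted type set exactly matches gold."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)

def loose_macro_f1(gold, pred):
    """Average precision/recall per mention, then take the harmonic mean."""
    p = sum(len(g & q) / len(q) for g, q in zip(gold, pred) if q) / len(gold)
    r = sum(len(g & q) / len(g) for g, q in zip(gold, pred) if g) / len(gold)
    return 2 * p * r / (p + r) if p + r else 0.0

def loose_micro_f1(gold, pred):
    """Pool label intersections over all mentions before computing F1."""
    inter = sum(len(g & q) for g, q in zip(gold, pred))
    p = inter / sum(len(q) for q in pred)
    r = inter / sum(len(g) for g in gold)
    return 2 * p * r / (p + r) if p + r else 0.0

# Hierarchical labels: a prediction can be right at the coarse level only.
gold = [{"/person", "/person/athlete"}, {"/location"}]
pred = [{"/person", "/person/artist"}, {"/location"}]
print(strict_acc(gold, pred))                 # 0.5
print(round(loose_macro_f1(gold, pred), 3))   # 0.75
print(round(loose_micro_f1(gold, pred), 3))   # 0.667
```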
5.3 Experimental Details

We use BERT-base (Devlin et al., 2019) as the backbone structure of our model, initialized with the corresponding pre-trained cased weights.2 The hidden size is 768 and the number of layers is 12. Models are implemented with the PyTorch framework3 (Paszke et al., 2019) and Huggingface Transformers4 (Wolf et al., 2020). BERT models are optimized by AdamW (Loshchilov and Hutter, 2019) with a learning rate of 5e-5. The training batch size is 16 for all models. In the supervised setting, each model is trained for 10 epochs and evaluated on the dev set every 2000 steps. In the few-shot setting, each model is trained for 30 epochs and evaluated every 10–50 steps; each evaluation runs for 200 steps. For the methods
2 https://github.com/google-research/bert
3 https://pytorch.org
4 https://github.com/huggingface/transformers
with hard encoding, we report the experimental results of T3(·). For the soft-encoding method, we report the results of m = 2. Experiments are conducted with CUDA on NVIDIA Tesla V100 GPUs.
5.4 Results of Fully Supervised Entity Typing
Dataset    Metric  FT     PLET (H)  PLET (S)
Few-NERD   Acc     79.75  79.90     79.86
           MiF     85.74  85.84     85.76
           MaF     85.74  85.84     85.76
OntoNotes  Acc     59.71  60.37     65.68
           MiF     70.47  70.78     74.53
           MaF     76.57  76.42     79.77
BBN        Acc     62.39  65.92     63.11
           MiF     68.88  71.55     68.68
           MaF     67.37  70.82     67.81

Table 3: Fully supervised entity typing results. FT denotes the vanilla fine-tuning method, (H) denotes the hard-encoding strategy, and (S) denotes the soft-encoding strategy. All methods use BERT-base with the same initialization weights as the backbone encoder.
The results on all three datasets across different models are reported in Table 3. Overall, the prompt-based methods show certain improvements compared to directly fine-tuned models, which indicates that the prompt-based method does help with capturing entity-type information from a given context.

It is also observed that the magnitude of the improvement and the preferred prompt encoding strategy may vary across datasets. The prompt-based method seems less effective on the FEW-NERD dataset than on the other two. This indicates that the effect of the prompt-based method partially depends on the characteristics of the dataset and that different prompt designs may suit different data. Specifically, FEW-NERD is manually annotated and contains much less noise than the other two datasets, which benefits the FT method's learning of classification with an extra linear layer. Moreover, for the OntoNotes dataset, soft encoding significantly outperforms hard encoding, while for the other two datasets the effect seems reversed.
5.5 Results of Few-shot Entity Typing
Table 2 shows the results on few-shot entity typing. The prompt-based model outperforms fine-tuning by a large margin under the few-shot setting, especially when only 1–2 training instances per type are available. It should be noted
Dataset     Metric  PLET   PLET (S)
Few-NERD    Acc     17.55  23.99 (+6.44)
            MiF     28.39  47.98 (+19.59)
            MaF     28.39  47.98 (+19.59)
OntoNotes‡  Acc     25.10  28.27 (+3.17)
            MiF     33.61  49.79 (+16.18)
            MaF     37.91  49.95 (+12.04)
BBN         Acc     55.82  57.79 (+1.97)
            MiF     60.64  63.24 (+2.60)
            MaF     59.99  64.00 (+4.01)

Table 4: Results of zero-shot entity typing on FEW-NERD, OntoNotes, and BBN. ‡ means that we remove the "Other" class during testing. PLET denotes the prompt-learning pipeline and PLET (S) denotes self-supervised prompt-learning; both methods use BERT-base as the backbone encoder.
that for the OntoNotes and BBN datasets, sampling 16 instances for each entity type already amounts to over 0.5% of the total training data. Meanwhile, some of the data in BBN are distantly supervised and thus potentially erroneous, which brings more randomness to few-shot training. The results support the idea that a well-designed prompt has much potential in mining the learned knowledge in pre-trained models and thus yields better performance in few-shot settings. The results also indicate that even when the number of entity types is large (46–86), the superiority of prompt-learning still holds.
5.6 Results of Zero-shot Entity Typing
Table 4 shows the results on the zero-shot entity typing task. We do not report the performance of the vanilla fine-tuning approach because it cannot produce reasonable results with a randomly initialized classifier. It should also be noted that the prompt method without fine-tuning already outperforms random guessing, which indicates that adding a prompt is informative for a model pre-trained on the masked-language-model task (e.g., BERT) and can induce reasonable predictions in entity typing tasks. Second, the performance of the model improves by a large margin when trained on unlabeled data. This shows the effectiveness of the proposed self-supervised training approach and points to the potential of a pre-trained prompt-based model under the zero-shot setting, when no labeled data are available.
To explore the more subtle changes in performance, we carry out a case study for zero-shot entity typing. In Figure 4, we illustrate the zero-shot prediction distributions (the correct prediction and the other top-5 predictions) for four entity types in FEW-NERD: ORG-SPORTSLEAGUE, EVENT-ATTACK, MISC-CURRENCY, and LOC-MOUNTAIN. We observe that with self-supervised prompt-learning, PLET (S) can summarize entity type information and infer related words to a certain extent. In Figure 4 (a) and Figure 4 (b), the PLET model suffers from a severe bias and predicts almost no correct labels in the zero-shot setting, since such words are low-frequency. Although there is no explicit supervision in the pre-training stage of PLET (S), the model can still find the corresponding words that express the ORG-SPORTSLEAGUE and EVENT-ATTACK types. In Figure 4 (c), self-supervised learning increases the performance of the original encoder. Further, in Figure 4 (d), PLET is already able to make satisfying predictions for the type LOC-MOUNTAIN; in this case, the use of self-supervised learning hardly weakens the performance, which means that the process of automatically summarizing type information has little negative impact on high-confidence entity types.
5.7 Effect of Templates
As stated in previous studies (Gao et al., 2020; Zhao et al., 2021), the choice of templates may have a huge impact on performance in prompt-learning. In this section, we carry out experiments to investigate such influence. Experiments are conducted under the 8-shot setting on the FEW-NERD dataset, and we use 3 different hard-encoding templates and 4 soft-encoding templates (obtained by changing the number of prompt tokens). The results in Table 5 demonstrate that the choice of templates exerts a considerable influence on the performance of prompt-based few-shot learning. For the hard-encoding templates, the phrase that describes the location, "in this sentence", contributes a remarkable improvement in performance. For the soft-encoding templates, surprisingly, the prompt-learning model yields the best result with the fewest special tokens.
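To make the template variants concrete, the sketch below builds the three hard-encoding templates and a soft-encoding template with a configurable number of placeholder prompt tokens (which would correspond to trainable embeddings in the actual model). The helper names are illustrative, not from our released code.

```python
# Three hard-encoding template variants, in the spirit of Table 5.
HARD_TEMPLATES = [
    lambda x, ent: f"{x} {ent} is [MASK].",
    lambda x, ent: f"{x} {ent} is a [MASK].",
    lambda x, ent: f"{x} In this sentence, {ent} is [MASK].",
]

def soft_template(x: str, ent: str, m: int) -> str:
    """Surround the mention with m placeholder prompt tokens before [MASK].
    [P], [P1], ..., [Pm] stand for learnable soft-prompt embeddings."""
    prompts = " ".join(f"[P{i}]" for i in range(1, m + 1))
    return f"{x} [P] {ent} {prompts} [MASK]."

print(HARD_TEMPLATES[2]("He climbed K2.", "K2"))
print(soft_template("He climbed K2.", "K2", m=2))
```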
6 Related Work
After a series of effective PLMs like GPT (Radford et al., 2018), BERT (Devlin et al., 2019), RoBERTa (Liu et al., 2019), and T5 (Raffel et al.,
[Figure 4: Zero-shot prediction distributions (the correct prediction and the other top-5 predictions) on four types in FEW-NERD: (a) ORG-SPORTSLEAGUE, (b) EVENT-ATTACK, (c) MISC-CURRENCY, (d) LOC-MOUNTAIN. In each subgraph, the left part illustrates the results of PLET and the right part shows the results of PLET (S); bars mark correct predictions, wrong predictions with correct coarse-grained types, and wrong predictions with wrong coarse-grained types.]
Encoding Strategy  Template T(x)                              Acc    MiF    MaF
Hard-encoding      x [Ent] is [MASK].                         54.45  67.34  67.34
                   x [Ent] is a [MASK].                       53.93  66.44  66.44
                   x In this sentence, [Ent] is [MASK].       55.75  68.74  68.74
Soft-encoding      x [P] [Ent] [P1] ... [Pl] [MASK], l = 2    59.25  69.58  69.58
                   x [P] [Ent] [P1] ... [Pl] [MASK], l = 3    53.66  66.06  66.06
                   x [P] [Ent] [P1] ... [Pl] [MASK], l = 4    52.96  66.01  66.01
                   x [P] [Ent] [P1] ... [Pl] [MASK], l = 5    55.44  68.39  68.39

Table 5: Effect of templates. The results are produced under the 8-shot setting on the FEW-NERD dataset by PLET.
2020), fine-tuned PLMs have demonstrated their effectiveness on various important NLP tasks such as dialogue generation (Zhang et al., 2020), text summarization (Zhang et al., 2019; Liu and Lapata, 2019), question answering (Adiwardana et al., 2020), and text classification (Baldini Soares et al., 2019; Peng et al., 2020; Ding et al., 2021a).
Despite the success of fine-tuning PLMs, the huge gap in objective form between pre-training and fine-tuning still hinders the full use of pre-trained knowledge for downstream tasks (Liu et al., 2021b; Han et al., 2021b; Hu et al., 2021). To this end, prompt-learning has been proposed. In prompt-learning, by leveraging language prompts as contexts, downstream tasks can be expressed as cloze-style objectives similar to the pre-training objectives. The seminal work that stimulated the development of prompt-learning is the birth of GPT-3 (Brown et al., 2020), which uses hand-crafted prompts for tuning and achieves very impressive performance on various tasks, especially under the setting of few-shot learning.
Inspired by GPT-3, a series of hand-crafted prompts have been widely explored in knowledge probing (Trinh and Le, 2018; Petroni et al., 2019; Davison et al., 2019), relation classification (Han et al., 2021b), sentiment classification, and natural language inference (Schick and Schütze, 2021; Liu et al., 2021b). To avoid labor-intensive prompt design, automatic prompt search has also been extensively explored (Schick et al., 2020; Schick and Schütze, 2021; Shin et al., 2020; Gao et al., 2020; Liu et al., 2021a) to generate language phrases for prompts. Recently, some continuous prompts have also been proposed (Li and Liang, 2021; Lester et al., 2021), which directly use a series of learnable continuous embeddings as prompts rather than discrete language phrases.
In this paper, we aim to stimulate PLMs with prompt-learning to capture the attribute information of entities. We take fine-grained entity typing, a crucial task in knowledge extraction that assigns entity types to entity mentions (Lin et al., 2012), as the foothold to develop prompt-learning strategies. In fact, Dai et al. (2021) use hypernym extraction patterns to enhance the context and apply masked language modeling to tackle the ultra-fine entity typing problem (Choi et al., 2018) with free-form labels, which shares a similar idea with prompt-learning. In our work, we mainly emphasize using prompt-learning to extract pre-defined entity types in low-data scenarios.
7 Conclusion
This work investigates the application of prompt-learning to fine-grained entity typing. More specifically, we propose a framework, PLET, that deals with fine-grained entity typing in fully supervised, few-shot, and zero-shot scenarios. In PLET, we first introduce a simple and effective prompt-learning pipeline that can be used to extract entity types with both sufficient and insufficient supervision. Furthermore, to handle the zero-shot setting, we propose a self-supervised prompt-learning approach that automatically learns and summarizes entity types based on unlabeled corpora and a pre-defined label schema. PLET utilizes prompts to take advantage of prior knowledge distributed in PLMs and can learn pre-defined type information without overfitting by performing distribution-level optimization. In future work, along the direction of PLET (S), we will explore better prompt-learning approaches to automatically learn entity types from unlabeled data.
References

Daniel Adiwardana, Minh-Thang Luong, David R. So, Jamie Hall, Noah Fiedel, Romal Thoppilan, Zi Yang, Apoorv Kulshreshtha, Gaurav Nemade, Yifeng Lu, et al. 2020. Towards a human-like open-domain chatbot. arXiv preprint arXiv:2001.09977.

Livio Baldini Soares, Nicholas FitzGerald, Jeffrey Ling, and Tom Kwiatkowski. 2019. Matching the blanks: Distributional similarity for relation learning. In Proceedings of ACL, pages 2895–2905.

Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. In Proceedings of NIPS, pages 1877–1901.

Eunsol Choi, Omer Levy, Yejin Choi, and Luke Zettlemoyer. 2018. Ultra-fine entity typing. In Proceedings of ACL, pages 87–96.

Hongliang Dai, Yangqiu Song, and Haixun Wang. 2021. Ultra-fine entity typing with weak supervision from a masked language model. In Proceedings of ACL, pages 1790–1799.

Joe Davison, Joshua Feldman, and Alexander M. Rush. 2019. Commonsense knowledge mining from pre-trained models. In Proceedings of EMNLP-IJCNLP, pages 1173–1178.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of NAACL-HLT, pages 4171–4186.

Ning Ding, Xiaobin Wang, Yao Fu, Guangwei Xu, Rui Wang, Pengjun Xie, Ying Shen, Fei Huang, Hai-Tao Zheng, and Rui Zhang. 2021a. Prototypical representation learning for relation extraction. In Proceedings of ICLR.

Ning Ding, Guangwei Xu, Yulin Chen, Xiaobin Wang, Xu Han, Pengjun Xie, Hai-Tao Zheng, and Zhiyuan Liu. 2021b. Few-NERD: A few-shot named entity recognition dataset. In Proceedings of ACL, pages 3198–3213.

Tianyu Gao, Adam Fisch, and Danqi Chen. 2020. Making pre-trained language models better few-shot learners. arXiv preprint arXiv:2012.15723.

Xu Han, Zhengyan Zhang, Ning Ding, Yuxian Gu, Xiao Liu, Yuqi Huo, Jiezhong Qiu, Liang Zhang, Wentao Han, Minlie Huang, et al. 2021a. Pre-trained models: Past, present and future. arXiv preprint arXiv:2106.07139.

Xu Han, Weilin Zhao, Ning Ding, Zhiyuan Liu, and Maosong Sun. 2021b. PTR: Prompt tuning with rules for text classification. arXiv preprint arXiv:2105.11259.

John Hewitt and Christopher D. Manning. 2019. A structural probe for finding syntax in word representations. In Proceedings of NAACL, pages 4129–4138.

Shengding Hu, Ning Ding, Huadong Wang, Zhiyuan Liu, Juanzi Li, and Maosong Sun. 2021. Knowledgeable prompt-tuning: Incorporating knowledge into prompt verbalizer for text classification. arXiv preprint arXiv:2108.02035.

Ganesh Jawahar, Benoît Sagot, and Djamé Seddah. 2019. What does BERT learn about the structure of language? In Proceedings of ACL, pages 3651–3657.

Brian Lester, Rami Al-Rfou, and Noah Constant. 2021. The power of scale for parameter-efficient prompt tuning. arXiv preprint arXiv:2104.08691.

Xiang Lisa Li and Percy Liang. 2021. Prefix-tuning: Optimizing continuous prompts for generation. arXiv preprint arXiv:2101.00190.

Thomas Lin, Oren Etzioni, et al. 2012. No noun phrase left behind: Detecting and typing unlinkable entities. In Proceedings of EMNLP-CoNLL, pages 893–903.

Xiao Ling and Daniel S. Weld. 2012. Fine-grained entity recognition. In Proceedings of AAAI.

Pengfei Liu, Weizhe Yuan, Jinlan Fu, Zhengbao Jiang, Hiroaki Hayashi, and Graham Neubig. 2021a. Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing. arXiv preprint arXiv:2107.13586.

Xiao Liu, Yanan Zheng, Zhengxiao Du, Ming Ding, Yujie Qian, Zhilin Yang, and Jie Tang. 2021b. GPT understands, too. arXiv preprint arXiv:2103.10385.

Yang Liu and Mirella Lapata. 2019. Text summarization with pretrained encoders. In Proceedings of EMNLP-IJCNLP, pages 3730–3740.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.

Ilya Loshchilov and Frank Hutter. 2019. Decoupled weight decay regularization. In Proceedings of ICLR.

Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Köpf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. 2019. PyTorch: An imperative style, high-performance deep learning library. In Proceedings of NIPS, pages 8024–8035.

Hao Peng, Tianyu Gao, Xu Han, Yankai Lin, Peng Li, Zhiyuan Liu, Maosong Sun, and Jie Zhou. 2020. Learning from context or names? An empirical study on neural relation extraction. In Proceedings of EMNLP, pages 3661–3672.

Fabio Petroni, Tim Rocktäschel, Sebastian Riedel, Patrick Lewis, Anton Bakhtin, Yuxiang Wu, and Alexander Miller. 2019. Language models as knowledge bases? In Proceedings of EMNLP, pages 2463–2473.

Xipeng Qiu, Tianxiang Sun, Yige Xu, Yunfan Shao, Ning Dai, and Xuanjing Huang. 2020. Pre-trained models for natural language processing: A survey. Science China Technological Sciences, pages 1–26.

Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding by generative pre-training. OpenAI.

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. JMLR, 21:1–67.

Xiang Ren, Wenqi He, Meng Qu, Lifu Huang, Heng Ji, and Jiawei Han. 2016a. AFET: Automatic fine-grained entity typing by hierarchical partial-label embedding. In Proceedings of EMNLP, pages 1369–1378.

Xiang Ren, Wenqi He, Meng Qu, Clare R. Voss, Heng Ji, and Jiawei Han. 2016b. Label noise reduction in entity typing by heterogeneous partial-label embedding. In Proceedings of SIGKDD, pages 1825–1834.

Timo Schick, Helmut Schmid, and Hinrich Schütze. 2020. Automatically identifying words that can serve as labels for few-shot text classification. In Proceedings of COLING, pages 5569–5578.

Timo Schick and Hinrich Schütze. 2021. Exploiting cloze-questions for few-shot text classification and natural language inference. In Proceedings of EACL, pages 255–269.

Sonse Shimaoka, Pontus Stenetorp, Kentaro Inui, and Sebastian Riedel. 2017. Neural architectures for fine-grained entity type classification. In Proceedings of EACL, pages 1271–1280.

Taylor Shin, Yasaman Razeghi, Robert L. Logan IV, Eric Wallace, and Sameer Singh. 2020. AutoPrompt: Eliciting knowledge from language models using automatically generated prompts. In Proceedings of EMNLP, pages 4222–4235.

Trieu H. Trinh and Quoc V. Le. 2018. A simple method for commonsense reasoning. arXiv preprint arXiv:1806.02847.

Dong Wang, Ning Ding, Piji Li, and Haitao Zheng. 2021. CLINE: Contrastive learning with semantic negative examples for natural language understanding. In Proceedings of ACL.

Ralph Weischedel and Ada Brunstein. 2005. BBN pronoun coreference and entity type corpus. Linguistic Data Consortium, Philadelphia.

Ralph Weischedel, Martha Palmer, Mitchell Marcus, Eduard Hovy, Sameer Pradhan, Lance Ramshaw, Nianwen Xue, Ann Taylor, Jeff Kaufman, Michelle Franchini, Mohammed El-Bachouti, Robert Belvin, and Ann Houston. 2013. OntoNotes release 5.0. Abacus Data Network.

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander Rush. 2020. Transformers: State-of-the-art natural language processing. In Proceedings of EMNLP, pages 38–45.

Jingqing Zhang, Yao Zhao, Mohammad Saleh, and Peter J. Liu. 2019. PEGASUS: Pre-training with extracted gap-sentences for abstractive summarization. In Proceedings of ICML, pages 11328–11339.

Yizhe Zhang, Siqi Sun, Michel Galley, Yen-Chun Chen, Chris Brockett, Xiang Gao, Jianfeng Gao, Jingjing Liu, and Bill Dolan. 2020. DialoGPT: Large-scale generative pre-training for conversational response generation. In Proceedings of ACL, pages 270–278.

Tony Z. Zhao, Eric Wallace, Shi Feng, Dan Klein, and Sameer Singh. 2021. Calibrate before use: Improving few-shot performance of language models. arXiv preprint arXiv:2102.09690.
93
449
ORG-PO
LITICAL
PER-POLITICIA
NL
OC-ISLA
ND
LO
C-OTH
ER
MISC-L
AN
GU
AG
EPLET PLET (S)
81 81 3711
2
(b) Zero-shot prediction distribution on EVENT-ATTACK
0
500
1000
1500
2000
2500
ᤒ໒ 1
Misc-Currency 1196Person-Other 782
PROD-Car 666
Misc-Language 282
product-game 179Org-Company 116
MISC-C
URRENCY
PERSON-O
THER
PROD-C
AR
MISC-L
AN
GU
AG
EPRO
DU
CT-GA
ME
ORG-C
OM
PAN
Y
203
86 55 35
ᤒ໒ 1-1
Misc-Currency 2197
Loc-Island 194Org-Political 157Org-Company 138Loc-Mountain 108Misc-Language 75
MISC-C
URRENCY
59 48
670
LO
C-ISLAN
DO
RG-POLITICA
LO
RG-CO
MPA
NY
LO
C-MO
UN
TAIN
MISC-L
AN
GU
AG
E
PLET PLET (S)
42 33 23
365
239
1
(c) Zero-shot prediction distribution on MISC-CURRENCY
0
500
1000
1500
2000
2500
ᤒ໒ 1
Loc-Mountain 2229Misc-language 292Loc-Other 97Person-Other 87Misc-Living 69Prod-Car 22
LOC-M
OUNTAIN
MISC-L
AN
GU
AG
EL
OC-O
THER
PERSON-O
THER
MISC-L
IVIN
G
PROD-C
AR
10130 19 08
ᤒ໒ 1-1
Loc-Mountain 2212Loc-Island 247Person-Artist 52Person-Politician 45Misc-Language 42Prod-Ship 31
LOC-M
OUNTAIN
86 18
767
LO
C-ISLAN
DPERSO
N-ARTIST
PERSON-PO
LITICIAN
MISC-L
AN
GU
AG
EPRO
D-SHIP
PLET PLET (S)
16 15 1134
772
1
(d) Zero-shot prediction distribution on LOC-MOUNTAIN
Figure 4 Zero-shot prediction distribution on four types in FEW-NERD in each subgraph the left part illustratesthe results of PLET and the right part shows the results of PLET (S) denotes the correct predictions denotesthe wrong predictions with correct coarse-grained types and denotes the wrong predictions with wrong coarse-grained types
Encoding Strategy Template T(x) Acc MiF MaF
Hard-encodingx [Ent] is [MASK] 5445 6734 6734x [Ent] is a [MASK] 5393 6644 6644x In this sentence [E] is [MASK] 5575 6874 6874
Soft-encoding
x [P] [Ent] [P1] [Pl] [MASK] l = 2 5925 6958 6958x [P] [Ent] [P1] [Pl] [MASK] l = 3 5366 6606 6606x [P] [Ent] [P1] [Pl] [MASK] l = 4 5296 6601 6601x [P] [Ent] [P1] [Pl] [MASK] l = 5 5544 6839 6839
Table 5 Effect of templates The results are produced under 8-shot setting on FEW-NERD dataset by PLET
2020) fine-tuned PLMs have demonstrated theireffectiveness on various important NLP tasks suchas dialogue generation (Zhang et al 2020) textsummarization (Zhang et al 2019 Liu and Lap-ata 2019) question answering (Adiwardana et al2020) and text classification (Baldini Soares et al2019 Peng et al 2020 Ding et al 2021a)
Despite the success of fine-tuning PLMs thehuge objective form gap between pre-training andfine-tuning still hinders the full use of per-trainedknowledge for downstream tasks (Liu et al 2021bHan et al 2021b Hu et al 2021) To this endprompt-learning has been proposed In prompt-learning by leveraging language prompts as con-texts downstream tasks can be expressed as some
cloze-style objectives similar to those pre-trainingobjectives The seminal work that stimulates the de-velopment of prompt-learning is the birth of GPT-3 (Brown et al 2020) which uses hand-craftedprompts for tuning and achieves very impressiveperformance on various tasks especially under thesetting of few-shot learning
Inspired by GPT-3 a series of hand-craftedprompts have been widely explored in knowledgeprobing (Trinh and Le 2018 Petroni et al 2019Davison et al 2019) relation classification (Hanet al 2021b) entiment classification and naturallanguage inference (Schick and Schuumltze 2021 Liuet al 2021b) To avoid labor-intensive promptdesign automatic prompt search has also been
extensively explored Schick et al (2020) Schickand Schuumltze (2021) Shin et al (2020) Gao et al(2020) Liu et al (2021a) to generate languagephrases for prompts Recently some continuousprompts have also been proposed (Li and Liang2021 Lester et al 2021) which directly usea series of learnable continuous embeddings asprompts rather than discrete language phrases
In this paper we aim to stimulate PLMs withprompt-learning to capture the attribute informa-tion of entities We take fine-grained entity typinga crucial task in knowledge extraction to assignentity types to entity mentions (Lin et al 2012) asthe foothold to develop prompt-learning strategiesIn fact Dai et al (2021) use hypernym extractionpatterns to enhance the context and apply maskedlanguage modeling to tackle the ultra-fine entitytyping problem (Choi et al 2018) with free-formlabels which shares a similar idea with prompt-learning In our work we mainly emphasize usingprompt-learning to extract entity types that havebeen pre-defined in low-data scenarios
7 Conclusion
This work investigates the application of prompt-learning on fine-grained entity typing More specif-ically we proposes a framework PLET that coulddeal with fine-grained entity typing in fully super-vised few-shot and zero-shot scenarios In PLETwe first introduce a simple and effective prompt-learning pipeline that could be used to extract entitytypes with both sufficient and insufficient supervi-sion Furthermore to handle the zero-shot settingwe propose a self-supervised prompt-learning ap-proach that automatically learns and summarizesentity types based on unlabeled corpora and a pre-defined label schema PLET utilizes prompts totake advantage of prior knowledge distributed inPLMs and could learn pre-defined type informa-tion without overfitting by performing distribution-level optimization In our future work along the di-rection of PLET (S) we will explore better prompt-learning approaches to automatically learning en-tity types from unlabeled data
ReferencesDaniel Adiwardana Minh-Thang Luong David R So
Jamie Hall Noah Fiedel Romal Thoppilan Zi YangApoorv Kulshreshtha Gaurav Nemade Yifeng Luet al 2020 Towards a human-like open-domainchatbot arXiv preprint arXiv200109977
Livio Baldini Soares Nicholas FitzGerald JeffreyLing and Tom Kwiatkowski 2019 Matching theblanks Distributional similarity for relation learn-ing In Proceedings of the 57th Annual Meetingof the Association for Computational Linguisticspages 2895ndash2905 Florence Italy Association forComputational Linguistics
Tom B Brown Benjamin Mann Nick Ryder MelanieSubbiah Jared Kaplan Prafulla Dhariwal ArvindNeelakantan Pranav Shyam Girish Sastry AmandaAskell et al 2020 Language models are few-shotlearners In Proceedings of NIPS pages 1877ndash1901
Eunsol Choi Omer Levy Yejin Choi and Luke Zettle-moyer 2018 Ultra-fine entity typing In Proceed-ings of ACL pages 87ndash96
Hongliang Dai Yangqiu Song and Haixun Wang2021 Ultra-fine entity typing with weak supervisionfrom a masked language model In Proceedings ofACL pages 1790ndash1799
Joe Davison Joshua Feldman and Alexander M Rush2019 Commonsense knowledge mining from pre-trained models In Proceedings of EMNLP-IJCNLPpages 1173ndash1178
Jacob Devlin Ming-Wei Chang Kenton Lee andKristina Toutanova 2019 Bert Pre-training of deepbidirectional transformers for language understand-ing In Proceedings of NAACL-HLT pages 4171ndash4186
Ning Ding Xiaobin Wang Yao Fu Guangwei Xu RuiWang Pengjun Xie Ying Shen Fei Huang Hai-TaoZheng and Rui Zhang 2021a Prototypical repre-sentation learning for relation extraction In Pro-ceedings of ICLR
Ning Ding Guangwei Xu Yulin Chen Xiaobin WangXu Han Pengjun Xie Hai-Tao Zheng and ZhiyuanLiu 2021b Few-nerd A few-shot named entityrecognition dataset In Proceedings of ACL pages3198ndash3213
Tianyu Gao Adam Fisch and Danqi Chen 2020Making pre-trained language models better few-shotlearners arXiv preprint arXiv201215723
Xu Han Zhengyan Zhang Ning Ding Yuxian GuXiao Liu Yuqi Huo Jiezhong Qiu Liang ZhangWentao Han Minlie Huang et al 2021a Pre-trained models Past present and future arXivpreprint arXiv210607139
Xu Han Weilin Zhao Ning Ding Zhiyuan Liuand Maosong Sun 2021b Ptr Prompt tuningwith rules for text classification arXiv preprintarXiv210511259
John Hewitt and Christopher D Manning 2019 Astructural probe for finding syntax in word repre-sentations In Proceedings of NAACL pages 4129ndash4138
Shengding Hu Ning Ding Huadong Wang ZhiyuanLiu Juanzi Li and Maosong Sun 2021 Knowl-edgeable prompt-tuning Incorporating knowledgeinto prompt verbalizer for text classification arXivpreprint arXiv210802035
Ganesh Jawahar Benoicirct Sagot and Djameacute Seddah2019 What does bert learn about the structure oflanguage In Proceedings of ACL pages 3651ndash3657
Brian Lester Rami Al-Rfou and Noah Constant 2021The power of scale for parameter-efficient prompttuning arXiv preprint arXiv210408691
Xiang Lisa Li and Percy Liang 2021 Prefix-tuning Optimizing continuous prompts for genera-tion arXiv preprint arXiv210100190
Thomas Lin Oren Etzioni et al 2012 No noun phraseleft behind detecting and typing unlinkable entitiesIn Proceedings of EMNLP-CoNLL pages 893ndash903
Xiao Ling and Daniel S Weld 2012 Fine-grained en-tity recognition In AAAI
Pengfei Liu Weizhe Yuan Jinlan Fu Zhengbao JiangHiroaki Hayashi and Graham Neubig 2021a Pre-train prompt and predict A systematic survey ofprompting methods in natural language processingarXiv preprint arXiv210713586
Xiao Liu Yanan Zheng Zhengxiao Du Ming DingYujie Qian Zhilin Yang and Jie Tang 2021b Gptunderstands too arXiv preprint arXiv210310385
Yang Liu and Mirella Lapata 2019 Text summariza-tion with pretrained encoders In Proceedings ofthe 2019 Conference on Empirical Methods in Nat-ural Language Processing and the 9th InternationalJoint Conference on Natural Language Processing(EMNLP-IJCNLP) pages 3730ndash3740 Hong KongChina Association for Computational Linguistics
Yinhan Liu Myle Ott Naman Goyal Jingfei Du Man-dar Joshi Danqi Chen Omer Levy Mike LewisLuke Zettlemoyer and Veselin Stoyanov 2019Roberta A robustly optimized bert pretraining ap-proach arXiv preprint arXiv190711692
Ilya Loshchilov and Frank Hutter 2019 Decoupledweight decay regularization In Proceedings ofICLR
Adam Paszke Sam Gross Francisco Massa AdamLerer James Bradbury Gregory Chanan TrevorKilleen Zeming Lin Natalia Gimelshein LucaAntiga Alban Desmaison Andreas Koumlpf EdwardYang Zachary DeVito Martin Raison Alykhan Te-jani Sasank Chilamkurthy Benoit Steiner Lu FangJunjie Bai and Soumith Chintala 2019 PytorchAn imperative style high-performance deep learn-ing library In Proceedings of NIPS pages 8024ndash8035
Hao Peng Tianyu Gao Xu Han Yankai Lin PengLi Zhiyuan Liu Maosong Sun and Jie Zhou 2020Learning from Context or Names An EmpiricalStudy on Neural Relation Extraction In Proceed-ings of EMNLP pages 3661ndash3672
Fabio Petroni Tim Rocktaumlschel Sebastian RiedelPatrick Lewis Anton Bakhtin Yuxiang Wu andAlexander Miller 2019 Language models as knowl-edge bases In Proceedings of EMNLP pages2463ndash2473
Xipeng Qiu Tianxiang Sun Yige Xu Yunfan ShaoNing Dai and Xuanjing Huang 2020 Pre-trainedmodels for natural language processing A surveyScience China Technological Sciences pages 1ndash26
Alec Radford Karthik Narasimhan Tim Salimans andIlya Sutskever 2018 Improving language under-standing by generative pre-training OpenAI
Colin Raffel Noam Shazeer Adam Roberts KatherineLee Sharan Narang Michael Matena Yanqi ZhouWei Li and Peter J Liu 2020 Exploring the limitsof transfer learning with a unified text-to-text trans-former JMLR 211ndash67
Xiang Ren Wenqi He Meng Qu Lifu Huang HengJi and Jiawei Han 2016a AFET Automatic fine-grained entity typing by hierarchical partial-labelembedding In Proceedings of the 2016 Conferenceon Empirical Methods in Natural Language Process-ing pages 1369ndash1378 Austin Texas Associationfor Computational Linguistics
Xiang Ren Wenqi He Meng Qu Clare R Voss HengJi and Jiawei Han 2016b Label noise reduction inentity typing by heterogeneous partial-label embed-ding In Proceedings of SIGKDD page 1825ndash1834
Timo Schick Helmut Schmid and Hinrich Schuumltze2020 Automatically identifying words that canserve as labels for few-shot text classification InProceedings of COLING pages 5569ndash5578
Timo Schick and Hinrich Schuumltze 2021 Exploitingcloze-questions for few-shot text classification andnatural language inference In Proceedings of EACLpages 255ndash269
Sonse Shimaoka Pontus Stenetorp Kentaro Inui andSebastian Riedel 2017 Neural architectures forfine-grained entity type classification In Proceed-ings of the 15th Conference of the European Chap-ter of the Association for Computational LinguisticsVolume 1 Long Papers pages 1271ndash1280 ValenciaSpain Association for Computational Linguistics
Taylor Shin Yasaman Razeghi Robert L Logan IVEric Wallace and Sameer Singh 2020 AutopromptEliciting knowledge from language models using au-tomatically generated prompts In Proceedings ofEMNLP pages 4222ndash4235
Trieu H Trinh and Quoc V Le 2018 A simplemethod for commonsense reasoning arXiv preprintarXiv180602847
Dong Wang Ning Ding Piji Li and Haitao Zheng2021 CLINE Contrastive learning with semanticnegative examples for natural language understand-ing In Proceedings of ACL
Ralph Weischedel and Ada Brunstein 2005 BBN Pro-noun Coreference and Entity Type Corpus Linguis-tic Data Consortium Philadelphia
Ralph Weischedel Martha Palmer Mitchell MarcusEduard Hovy Sameer Pradhan Lance Ramshaw Ni-anwen Xue Ann Taylor Jeff Kaufman MichelleFranchini Mohammed El-Bachouti Robert Belvinand Ann Houston 2013 OntoNotes Release 50Abacus Data Network
Thomas Wolf Lysandre Debut Victor Sanh JulienChaumond Clement Delangue Anthony Moi Pier-ric Cistac Tim Rault Remi Louf Morgan Funtow-icz Joe Davison Sam Shleifer Patrick von PlatenClara Ma Yacine Jernite Julien Plu Canwen XuTeven Le Scao Sylvain Gugger Mariama DrameQuentin Lhoest and Alexander Rush 2020 Trans-formers State-of-the-art natural language process-ing In Proceedings of EMNLP pages 38ndash45
Jingqing Zhang Yao Zhao Mohammad Saleh and Pe-ter J Liu 2019 Pegasus Pre-training with ex-tracted gap-sentences for abstractive summarizationIn Proceedings of ICML pages 11328ndash11339
Yizhe Zhang Siqi Sun Michel Galley Yen-Chun ChenChris Brockett Xiang Gao Jianfeng Gao JingjingLiu and Bill Dolan 2020 Dialogpt Large-scalegenerative pre-training for conversational responsegeneration In Proceedings of ACL pages 270ndash278
Tony Z Zhao Eric Wallace Shi Feng Dan Klein andSameer Singh 2021 Calibrate before use Im-proving few-shot performance of language modelsarXiv preprint arXiv210209690
Dataset      Metric   PLET     PLET (S)
Few-NERD     Acc      17.55    23.99 (+6.44)
             MiF      28.39    47.98 (+19.59)
             MaF      28.39    47.98 (+19.59)
OntoNotes‡   Acc      25.10    28.27 (+3.17)
             MiF      33.61    49.79 (+16.18)
             MaF      37.91    49.95 (+12.04)
BBN          Acc      55.82    57.79 (+1.97)
             MiF      60.64    63.24 (+2.60)
             MaF      59.99    64.00 (+4.01)

Table 4: Results of zero-shot entity typing on FEW-NERD, OntoNotes, and BBN. ‡ means that we remove the "Other" class during testing. PLET denotes the prompt-learning pipeline and PLET (S) denotes self-supervised prompt-learning; both methods use BERT-base as the backbone encoder.
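The Acc, MiF, and MaF columns in the result tables follow the standard fine-grained entity typing protocol of strict accuracy plus loose micro-/macro-averaged F1 over predicted type sets (Ling and Weld, 2012); that the loose variants are used here is our assumption based on common practice for these benchmarks. A minimal sketch:

```python
# Sketch of standard fine-grained entity typing metrics over set-valued
# predictions (strict accuracy, loose macro-F1, loose micro-F1).
def strict_acc(golds, preds):
    # a mention counts as correct only if the full type set matches
    return sum(g == p for g, p in zip(golds, preds)) / len(golds)

def loose_macro_f1(golds, preds):
    # average the per-mention F1 between gold and predicted type sets
    def f1(g, p):
        overlap = len(g & p)
        if overlap == 0:
            return 0.0
        prec, rec = overlap / len(p), overlap / len(g)
        return 2 * prec * rec / (prec + rec)
    return sum(f1(g, p) for g, p in zip(golds, preds)) / len(golds)

def loose_micro_f1(golds, preds):
    # pool the overlaps over all mentions before computing F1
    overlap = sum(len(g & p) for g, p in zip(golds, preds))
    prec = overlap / sum(len(p) for p in preds)
    rec = overlap / sum(len(g) for g in golds)
    return 2 * prec * rec / (prec + rec)

# toy example with a coarse and a fine type per mention (hypothetical)
golds = [{"org", "org-company"}, {"loc"}]
preds = [{"org"}, {"loc"}]
```

On this toy pair, strict accuracy is 0.5 while loose micro-F1 is 0.8, which illustrates why Acc is consistently lower than MiF/MaF in Table 4.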
that for the OntoNotes and BBN datasets, sampling 16 instances for each entity type already amounts to over 0.5% of the total training data. Meanwhile, some of the data in BBN are distantly supervised and potentially erroneous, which brings more randomness to few-shot training. The results support the idea that a well-designed prompt has much potential in mining the learned knowledge in pre-trained models and thus yields better performance in few-shot settings. The results also indicate that even when the number of entity types is large (46~86), the superiority of prompt-learning still holds.
5.6 Results of Zero-shot Entity Typing
Table 4 shows the results of the zero-shot entity typing task on the FEW-NERD, OntoNotes, and BBN datasets. We do not report the performance of the vanilla fine-tuning approach because it cannot produce reasonable results with a randomly initialized classifier. First, it should be noted that the prompt method without any fine-tuning already outperforms random guessing, indicating that adding a prompt is informative for a model pre-trained with a masked-language-model objective (e.g., BERT) and can induce reasonable predictions in entity typing tasks. Second, the performance of the model improves by a large margin if it is trained on unlabeled data, which shows the effectiveness of the proposed self-supervised training approach and points to the potential of a pre-trained prompt-based model in the zero-shot setting, when no labeled data are available.
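To make this concrete, the sketch below reproduces the scoring step of a prompt-based zero-shot typer: each type is scored by the MLM probability mass that its label words receive at the [MASK] position. The verbalizer and the mocked logits are invented for illustration, not taken from the paper.

```python
import math

# Score entity types from [MASK]-position logits via a verbalizer that
# maps each type to a set of label words (both mocked for illustration).
def type_scores(mask_logits, verbalizer):
    # softmax over the (restricted) vocabulary at the [MASK] position
    denom = sum(math.exp(v) for v in mask_logits.values())
    probs = {tok: math.exp(v) / denom for tok, v in mask_logits.items()}
    # a type's score is the total probability of its label words
    return {t: sum(probs.get(w, 0.0) for w in words)
            for t, words in verbalizer.items()}

verbalizer = {
    "person-athlete": ["athlete", "player"],
    "org-sportsteam": ["team", "club"],
}
# pretend [MASK] logits produced by an MLM for a prompt ending
# in "... Steve Kerr is a [MASK]."
mask_logits = {"athlete": 4.1, "player": 3.8, "team": 1.2, "club": 0.5}

scores = type_scores(mask_logits, verbalizer)
predicted = max(scores, key=scores.get)
```

Even without fine-tuning, an MLM tends to place more mass on plausible hypernyms of the mention at the [MASK] slot, which is why the prompt alone already beats random guessing.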
To explore more subtle changes in performance, we carry out a case study for zero-shot entity typing. In Figure 4, we illustrate the zero-shot prediction distributions (the correct prediction and the other top-5 predictions) for four entity types in FEW-NERD: ORG-SPORTSLEAGUE, EVENT-ATTACK, MISC-CURRENCY, and LOC-MOUNTAIN. We observe that with self-supervised prompt-learning, PLET (S) can summarize entity type information and infer related words to a certain extent. In Figure 4 (a) and Figure 4 (b), the PLET model suffers from a severe bias and predicts almost no correct labels in the zero-shot setting, since the label words of these types are low-frequency. Yet although there is no explicit supervision in the pre-training stage of PLET (S), the model can still find the corresponding words that express the ORG-SPORTSLEAGUE and EVENT-ATTACK types. In Figure 4 (c), self-supervised learning increases the performance of the original encoder. Further, in Figure 4 (d), PLET already makes satisfying predictions for the type LOC-MOUNTAIN; in this case, self-supervised learning hardly weakens the performance, which indicates that the process of automatically summarizing type information has little negative impact on high-confidence entity types.
5.7 Effect of Templates
As stated in previous studies (Gao et al., 2020; Zhao et al., 2021), the choice of templates may have a huge impact on performance in prompt-learning. In this section, we carry out experiments to investigate this influence. Experiments are conducted under the 8-shot setting on the FEW-NERD dataset, and we use 3 different hard-encoding templates and 4 soft-encoding templates (obtained by changing the number of learnable prompt tokens l). The results in Table 5 demonstrate that the choice of templates exerts a considerable influence on the performance of prompt-based few-shot learning. For the hard-encoding templates, the phrase that describes the location, "In this sentence", contributes a remarkable improvement in performance. For the soft-encoding templates, surprisingly, the prompt-learning model yields the best result with the fewest special tokens.
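For concreteness, the sketch below renders the two template styles of Table 5 as input strings for a marked entity mention; in the actual model the [P]/[Pi] placeholders are learnable embeddings rather than literal tokens, so the string form here is only an illustrative assumption (the example sentence is also hypothetical).

```python
# Render a hard-encoding and a soft-encoding prompt for entity typing.
def hard_template(sentence, mention):
    # the best-performing hard template: "In this sentence, [Ent] is [MASK]."
    return f"{sentence} In this sentence, {mention} is [MASK]."

def soft_template(sentence, mention, l):
    # l learnable prompt tokens [P1]..[Pl] placed before the [MASK] slot
    soft = " ".join(f"[P{i}]" for i in range(1, l + 1))
    return f"{sentence} [P] {mention} {soft} [MASK]."

s = "Windermere is the largest natural lake in England."
hard = hard_template(s, "Windermere")
soft = soft_template(s, "Windermere", l=2)
```

The hard template injects a fixed natural-language context, while the soft template leaves the context to be learned; Table 5 compares both and varies l for the soft variant.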
6 Related Work
After a series of effective PLMs like GPT (Radford et al., 2018), BERT (Devlin et al., 2019), RoBERTa (Liu et al., 2019), and T5 (Raffel et al.,
[Figure 4 appears here as four bar-chart panels of zero-shot prediction distributions: (a) ORG-SPORTSLEAGUE, (b) EVENT-ATTACK, (c) MISC-CURRENCY, (d) LOC-MOUNTAIN. Correct-type prediction counts: (a) PLET 50 vs. PLET (S) 1929, (b) 47 vs. 1894, (c) 1196 vs. 2197, (d) 2229 vs. 2212.]

Figure 4: Zero-shot prediction distributions on four types in FEW-NERD. In each subgraph, the left part illustrates the results of PLET and the right part shows the results of PLET (S). The legend distinguishes correct predictions, wrong predictions with correct coarse-grained types, and wrong predictions with wrong coarse-grained types.
Encoding Strategy   Template T(x)                          Acc     MiF     MaF
Hard-encoding       x [Ent] is [MASK].                     54.45   67.34   67.34
                    x [Ent] is a [MASK].                   53.93   66.44   66.44
                    x In this sentence, [Ent] is [MASK].   55.75   68.74   68.74
Soft-encoding       x [P] [Ent] [P1] … [Pl] [MASK], l=2    59.25   69.58   69.58
                    x [P] [Ent] [P1] … [Pl] [MASK], l=3    53.66   66.06   66.06
                    x [P] [Ent] [P1] … [Pl] [MASK], l=4    52.96   66.01   66.01
                    x [P] [Ent] [P1] … [Pl] [MASK], l=5    55.44   68.39   68.39

Table 5: Effect of templates. The results are produced under the 8-shot setting on the FEW-NERD dataset by PLET.
2020), fine-tuned PLMs have demonstrated their effectiveness on various important NLP tasks such as dialogue generation (Zhang et al., 2020), text summarization (Zhang et al., 2019; Liu and Lapata, 2019), question answering (Adiwardana et al., 2020), and text classification (Baldini Soares et al., 2019; Peng et al., 2020; Ding et al., 2021a).
Despite the success of fine-tuning PLMs, the huge gap in objective forms between pre-training and fine-tuning still hinders the full use of pre-trained knowledge for downstream tasks (Liu et al., 2021b; Han et al., 2021b; Hu et al., 2021). To this end, prompt-learning has been proposed. In prompt-learning, by leveraging language prompts as contexts, downstream tasks can be expressed as cloze-style objectives similar to those pre-training objectives. The seminal work that stimulated the development of prompt-learning is the birth of GPT-3 (Brown et al., 2020), which uses hand-crafted prompts for tuning and achieves very impressive performance on various tasks, especially under the few-shot learning setting.
Inspired by GPT-3, a series of hand-crafted prompts have been widely explored in knowledge probing (Trinh and Le, 2018; Petroni et al., 2019; Davison et al., 2019), relation classification (Han et al., 2021b), sentiment classification, and natural language inference (Schick and Schütze, 2021; Liu et al., 2021b). To avoid labor-intensive prompt design, automatic prompt search has also been extensively explored (Schick et al., 2020; Schick and Schütze, 2021; Shin et al., 2020; Gao et al., 2020; Liu et al., 2021a) to generate language phrases for prompts. Recently, continuous prompts have also been proposed (Li and Liang, 2021; Lester et al., 2021), which directly use a series of learnable continuous embeddings as prompts rather than discrete language phrases.
In this paper, we aim to stimulate PLMs with prompt-learning to capture the attribute information of entities. We take fine-grained entity typing, a crucial task in knowledge extraction that assigns entity types to entity mentions (Lin et al., 2012), as the foothold to develop prompt-learning strategies. In fact, Dai et al. (2021) use hypernym extraction patterns to enhance the context and apply masked language modeling to tackle the ultra-fine entity typing problem (Choi et al., 2018) with free-form labels, which shares a similar idea with prompt-learning. In our work, we mainly emphasize using prompt-learning to extract pre-defined entity types in low-data scenarios.
7 Conclusion
This work investigates the application of prompt-learning to fine-grained entity typing. More specifically, we propose a framework, PLET, that deals with fine-grained entity typing in fully supervised, few-shot, and zero-shot scenarios. In PLET, we first introduce a simple and effective prompt-learning pipeline that can be used to extract entity types with both sufficient and insufficient supervision. Furthermore, to handle the zero-shot setting, we propose a self-supervised prompt-learning approach that automatically learns and summarizes entity types based on unlabeled corpora and a pre-defined label schema. PLET utilizes prompts to take advantage of the prior knowledge distributed in PLMs and can learn pre-defined type information without overfitting by performing distribution-level optimization. In future work, along the direction of PLET (S), we will explore better prompt-learning approaches for automatically learning entity types from unlabeled data.
References

Daniel Adiwardana, Minh-Thang Luong, David R. So, Jamie Hall, Noah Fiedel, Romal Thoppilan, Zi Yang, Apoorv Kulshreshtha, Gaurav Nemade, Yifeng Lu, et al. 2020. Towards a human-like open-domain chatbot. arXiv preprint arXiv:2001.09977.
Livio Baldini Soares, Nicholas FitzGerald, Jeffrey Ling, and Tom Kwiatkowski. 2019. Matching the blanks: Distributional similarity for relation learning. In Proceedings of ACL, pages 2895–2905.
Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. In Proceedings of NIPS, pages 1877–1901.
Eunsol Choi, Omer Levy, Yejin Choi, and Luke Zettlemoyer. 2018. Ultra-fine entity typing. In Proceedings of ACL, pages 87–96.
Hongliang Dai, Yangqiu Song, and Haixun Wang. 2021. Ultra-fine entity typing with weak supervision from a masked language model. In Proceedings of ACL, pages 1790–1799.
Joe Davison, Joshua Feldman, and Alexander M. Rush. 2019. Commonsense knowledge mining from pretrained models. In Proceedings of EMNLP-IJCNLP, pages 1173–1178.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of NAACL-HLT, pages 4171–4186.
Ning Ding, Xiaobin Wang, Yao Fu, Guangwei Xu, Rui Wang, Pengjun Xie, Ying Shen, Fei Huang, Hai-Tao Zheng, and Rui Zhang. 2021a. Prototypical representation learning for relation extraction. In Proceedings of ICLR.
Ning Ding, Guangwei Xu, Yulin Chen, Xiaobin Wang, Xu Han, Pengjun Xie, Hai-Tao Zheng, and Zhiyuan Liu. 2021b. Few-NERD: A few-shot named entity recognition dataset. In Proceedings of ACL, pages 3198–3213.
Tianyu Gao, Adam Fisch, and Danqi Chen. 2020. Making pre-trained language models better few-shot learners. arXiv preprint arXiv:2012.15723.
Xu Han, Zhengyan Zhang, Ning Ding, Yuxian Gu, Xiao Liu, Yuqi Huo, Jiezhong Qiu, Liang Zhang, Wentao Han, Minlie Huang, et al. 2021a. Pre-trained models: Past, present and future. arXiv preprint arXiv:2106.07139.
Xu Han, Weilin Zhao, Ning Ding, Zhiyuan Liu, and Maosong Sun. 2021b. PTR: Prompt tuning with rules for text classification. arXiv preprint arXiv:2105.11259.
John Hewitt and Christopher D. Manning. 2019. A structural probe for finding syntax in word representations. In Proceedings of NAACL, pages 4129–4138.
Shengding Hu, Ning Ding, Huadong Wang, Zhiyuan Liu, Juanzi Li, and Maosong Sun. 2021. Knowledgeable prompt-tuning: Incorporating knowledge into prompt verbalizer for text classification. arXiv preprint arXiv:2108.02035.
Ganesh Jawahar, Benoît Sagot, and Djamé Seddah. 2019. What does BERT learn about the structure of language? In Proceedings of ACL, pages 3651–3657.
Brian Lester, Rami Al-Rfou, and Noah Constant. 2021. The power of scale for parameter-efficient prompt tuning. arXiv preprint arXiv:2104.08691.
Xiang Lisa Li and Percy Liang. 2021. Prefix-tuning: Optimizing continuous prompts for generation. arXiv preprint arXiv:2101.00190.
Thomas Lin, Oren Etzioni, et al. 2012. No noun phrase left behind: Detecting and typing unlinkable entities. In Proceedings of EMNLP-CoNLL, pages 893–903.
Xiao Ling and Daniel S. Weld. 2012. Fine-grained entity recognition. In Proceedings of AAAI.
Pengfei Liu, Weizhe Yuan, Jinlan Fu, Zhengbao Jiang, Hiroaki Hayashi, and Graham Neubig. 2021a. Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing. arXiv preprint arXiv:2107.13586.
Xiao Liu, Yanan Zheng, Zhengxiao Du, Ming Ding, Yujie Qian, Zhilin Yang, and Jie Tang. 2021b. GPT understands, too. arXiv preprint arXiv:2103.10385.
Yang Liu and Mirella Lapata. 2019. Text summarization with pretrained encoders. In Proceedings of EMNLP-IJCNLP, pages 3730–3740.
Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.
Ilya Loshchilov and Frank Hutter. 2019. Decoupled weight decay regularization. In Proceedings of ICLR.
Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Köpf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. 2019. PyTorch: An imperative style, high-performance deep learning library. In Proceedings of NIPS, pages 8024–8035.
Hao Peng, Tianyu Gao, Xu Han, Yankai Lin, Peng Li, Zhiyuan Liu, Maosong Sun, and Jie Zhou. 2020. Learning from context or names? An empirical study on neural relation extraction. In Proceedings of EMNLP, pages 3661–3672.
Fabio Petroni, Tim Rocktäschel, Sebastian Riedel, Patrick Lewis, Anton Bakhtin, Yuxiang Wu, and Alexander Miller. 2019. Language models as knowledge bases? In Proceedings of EMNLP, pages 2463–2473.
Xipeng Qiu, Tianxiang Sun, Yige Xu, Yunfan Shao, Ning Dai, and Xuanjing Huang. 2020. Pre-trained models for natural language processing: A survey. Science China Technological Sciences, pages 1–26.
Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding by generative pre-training. OpenAI.
Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. JMLR, 21:1–67.
Xiang Ren, Wenqi He, Meng Qu, Lifu Huang, Heng Ji, and Jiawei Han. 2016a. AFET: Automatic fine-grained entity typing by hierarchical partial-label embedding. In Proceedings of EMNLP, pages 1369–1378.
Xiang Ren, Wenqi He, Meng Qu, Clare R. Voss, Heng Ji, and Jiawei Han. 2016b. Label noise reduction in entity typing by heterogeneous partial-label embedding. In Proceedings of SIGKDD, pages 1825–1834.
Timo Schick, Helmut Schmid, and Hinrich Schütze. 2020. Automatically identifying words that can serve as labels for few-shot text classification. In Proceedings of COLING, pages 5569–5578.
Timo Schick and Hinrich Schütze. 2021. Exploiting cloze-questions for few-shot text classification and natural language inference. In Proceedings of EACL, pages 255–269.
Sonse Shimaoka, Pontus Stenetorp, Kentaro Inui, and Sebastian Riedel. 2017. Neural architectures for fine-grained entity type classification. In Proceedings of EACL, pages 1271–1280.
Taylor Shin, Yasaman Razeghi, Robert L. Logan IV, Eric Wallace, and Sameer Singh. 2020. AutoPrompt: Eliciting knowledge from language models using automatically generated prompts. In Proceedings of EMNLP, pages 4222–4235.
Trieu H. Trinh and Quoc V. Le. 2018. A simple method for commonsense reasoning. arXiv preprint arXiv:1806.02847.
Dong Wang, Ning Ding, Piji Li, and Haitao Zheng. 2021. CLINE: Contrastive learning with semantic negative examples for natural language understanding. In Proceedings of ACL.
Ralph Weischedel and Ada Brunstein. 2005. BBN Pronoun Coreference and Entity Type Corpus. Linguistic Data Consortium, Philadelphia.
Ralph Weischedel, Martha Palmer, Mitchell Marcus, Eduard Hovy, Sameer Pradhan, Lance Ramshaw, Nianwen Xue, Ann Taylor, Jeff Kaufman, Michelle Franchini, Mohammed El-Bachouti, Robert Belvin, and Ann Houston. 2013. OntoNotes Release 5.0. Abacus Data Network.
Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Remi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander Rush. 2020. Transformers: State-of-the-art natural language processing. In Proceedings of EMNLP, pages 38–45.
Jingqing Zhang, Yao Zhao, Mohammad Saleh, and Peter J. Liu. 2019. PEGASUS: Pre-training with extracted gap-sentences for abstractive summarization. In Proceedings of ICML, pages 11328–11339.
Yizhe Zhang, Siqi Sun, Michel Galley, Yen-Chun Chen, Chris Brockett, Xiang Gao, Jianfeng Gao, Jingjing Liu, and Bill Dolan. 2020. DialoGPT: Large-scale generative pre-training for conversational response generation. In Proceedings of ACL, pages 270–278.
Tony Z. Zhao, Eric Wallace, Shi Feng, Dan Klein, and Sameer Singh. 2021. Calibrate before use: Improving few-shot performance of language models. arXiv preprint arXiv:2102.09690.
Ralph Weischedel and Ada Brunstein 2005 BBN Pro-noun Coreference and Entity Type Corpus Linguis-tic Data Consortium Philadelphia
Ralph Weischedel Martha Palmer Mitchell MarcusEduard Hovy Sameer Pradhan Lance Ramshaw Ni-anwen Xue Ann Taylor Jeff Kaufman MichelleFranchini Mohammed El-Bachouti Robert Belvinand Ann Houston 2013 OntoNotes Release 50Abacus Data Network
Thomas Wolf Lysandre Debut Victor Sanh JulienChaumond Clement Delangue Anthony Moi Pier-ric Cistac Tim Rault Remi Louf Morgan Funtow-icz Joe Davison Sam Shleifer Patrick von PlatenClara Ma Yacine Jernite Julien Plu Canwen XuTeven Le Scao Sylvain Gugger Mariama DrameQuentin Lhoest and Alexander Rush 2020 Trans-formers State-of-the-art natural language process-ing In Proceedings of EMNLP pages 38ndash45
Jingqing Zhang Yao Zhao Mohammad Saleh and Pe-ter J Liu 2019 Pegasus Pre-training with ex-tracted gap-sentences for abstractive summarizationIn Proceedings of ICML pages 11328ndash11339
Yizhe Zhang Siqi Sun Michel Galley Yen-Chun ChenChris Brockett Xiang Gao Jianfeng Gao JingjingLiu and Bill Dolan 2020 Dialogpt Large-scalegenerative pre-training for conversational responsegeneration In Proceedings of ACL pages 270ndash278
Tony Z Zhao Eric Wallace Shi Feng Dan Klein andSameer Singh 2021 Calibrate before use Im-proving few-shot performance of language modelsarXiv preprint arXiv210209690
(a) Zero-shot prediction distribution on ORG-SPORTSLEAGUE
(b) Zero-shot prediction distribution on EVENT-ATTACK
(c) Zero-shot prediction distribution on MISC-CURRENCY
(d) Zero-shot prediction distribution on LOC-MOUNTAIN

Figure 4: Zero-shot prediction distributions on four types in FEW-NERD. In each subgraph, the left part illustrates the results of PLET and the right part shows the results of PLET (S). Bars distinguish correct predictions, wrong predictions with correct coarse-grained types, and wrong predictions with wrong coarse-grained types.
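The three outcome categories in Figure 4 can be reproduced with a small helper. This is an illustrative sketch (function names are ours); it assumes FEW-NERD's "Coarse-FineGrained" label naming convention:

```python
# Sketch: categorizing zero-shot predictions the way Figure 4 does
# (correct; wrong fine-grained type but correct coarse-grained type;
# wrong coarse-grained type). Labels follow FEW-NERD's
# "Coarse-FineGrained" convention, e.g. "Org-SportsLeague".

def coarse(label: str) -> str:
    """Extract the coarse-grained part of a hyphenated type label."""
    return label.split("-")[0]

def categorize(gold: str, pred: str) -> str:
    if pred == gold:
        return "correct"
    if coarse(pred) == coarse(gold):
        return "correct-coarse"   # wrong fine type, right coarse type
    return "wrong-coarse"

print(categorize("Org-SportsLeague", "Org-SportsTeam"))  # correct-coarse
print(categorize("Org-SportsLeague", "Product-Game"))    # wrong-coarse
```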
Encoding Strategy | Template T(x)                             | Acc   | MiF   | MaF
------------------|-------------------------------------------|-------|-------|------
Hard-encoding     | x [Ent] is [MASK].                        | 54.45 | 67.34 | 67.34
                  | x [Ent] is a [MASK].                      | 53.93 | 66.44 | 66.44
                  | x In this sentence, [Ent] is [MASK].      | 55.75 | 68.74 | 68.74
Soft-encoding     | x [P] [Ent] [P1] ... [Pl] [MASK], l = 2   | 59.25 | 69.58 | 69.58
                  | x [P] [Ent] [P1] ... [Pl] [MASK], l = 3   | 53.66 | 66.06 | 66.06
                  | x [P] [Ent] [P1] ... [Pl] [MASK], l = 4   | 52.96 | 66.01 | 66.01
                  | x [P] [Ent] [P1] ... [Pl] [MASK], l = 5   | 55.44 | 68.39 | 68.39

Table 5: Effect of templates. The results are produced under the 8-shot setting on the FEW-NERD dataset by PLET.
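The templates in Table 5 can be sketched as plain string construction for the hard encoding, with placeholder tokens standing in for learnable embeddings in the soft encoding. A minimal sketch (function names are illustrative, not the authors' code):

```python
# Build the entity-oriented cloze templates from Table 5 for a sentence x
# and an entity mention. [MASK] is the slot the masked language model
# fills with a type-indicating word.

def hard_templates(sentence: str, ent: str) -> list[str]:
    """The three hard-encoding templates from Table 5."""
    return [
        f"{sentence} {ent} is [MASK].",
        f"{sentence} {ent} is a [MASK].",
        f"{sentence} In this sentence, {ent} is [MASK].",
    ]

def soft_template(sentence: str, ent: str, l: int = 2) -> str:
    """Soft encoding: [P] and [P1]..[Pl] are stand-ins for learnable
    continuous prompt embeddings rather than real words."""
    learnable = " ".join(f"[P{i}]" for i in range(1, l + 1))
    return f"{sentence} [P] {ent} {learnable} [MASK]."

print(hard_templates("London is a fascinating city.", "London")[0])
# London is a fascinating city. London is [MASK].
```

Note that in Table 5 the shortest soft template (l = 2) performs best, so longer learnable prompts do not automatically help in the few-shot regime.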
2020), fine-tuned PLMs have demonstrated their effectiveness on various important NLP tasks such as dialogue generation (Zhang et al., 2020), text summarization (Zhang et al., 2019; Liu and Lapata, 2019), question answering (Adiwardana et al., 2020), and text classification (Baldini Soares et al., 2019; Peng et al., 2020; Ding et al., 2021a).
Despite the success of fine-tuning PLMs, the huge gap in objective form between pre-training and fine-tuning still hinders the full use of pre-trained knowledge for downstream tasks (Liu et al., 2021b; Han et al., 2021b; Hu et al., 2021). To this end, prompt-learning has been proposed. In prompt-learning, by leveraging language prompts as contexts, downstream tasks can be expressed as cloze-style objectives similar to the pre-training objectives. The seminal work that stimulated the development of prompt-learning is the birth of GPT-3 (Brown et al., 2020), which uses hand-crafted prompts for tuning and achieves very impressive performance on various tasks, especially under the few-shot learning setting.
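The cloze-style formulation can be sketched end-to-end with a toy verbalizer: the masked LM's distribution over fill-in words is mapped to entity types by summing the probability mass on each type's label words. The label words and the sum aggregation below are illustrative assumptions, not the paper's exact verbalizer:

```python
# Toy masked-LM distribution over candidate words for the [MASK] slot in
# "London is a fascinating city. London is [MASK]." (made-up numbers).
MASK_PROBS = {"city": 0.40, "island": 0.10, "team": 0.05, "athlete": 0.03}

# Hypothetical verbalizer: each entity type is associated with label words.
VERBALIZER = {
    "location-city": ["city"],
    "location-island": ["island"],
    "organization-sportsteam": ["team"],
    "person-athlete": ["athlete"],
}

def type_scores(mask_probs, verbalizer):
    """Score each type by the total MLM probability of its label words."""
    return {t: sum(mask_probs.get(w, 0.0) for w in words)
            for t, words in verbalizer.items()}

scores = type_scores(MASK_PROBS, VERBALIZER)
print(max(scores, key=scores.get))  # location-city
```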
Inspired by GPT-3, a series of hand-crafted prompts have been widely explored in knowledge probing (Trinh and Le, 2018; Petroni et al., 2019; Davison et al., 2019), relation classification (Han et al., 2021b), sentiment classification, and natural language inference (Schick and Schütze, 2021; Liu et al., 2021b). To avoid labor-intensive prompt design, automatic prompt search has also been extensively explored (Schick et al., 2020; Schick and Schütze, 2021; Shin et al., 2020; Gao et al., 2020; Liu et al., 2021a) to generate language phrases for prompts. Recently, continuous prompts have also been proposed (Li and Liang, 2021; Lester et al., 2021), which directly use a series of learnable continuous embeddings as prompts rather than discrete language phrases.
In this paper, we aim to stimulate PLMs with prompt-learning to capture the attribute information of entities. We take fine-grained entity typing, a crucial task in knowledge extraction that assigns entity types to entity mentions (Lin et al., 2012), as the foothold to develop prompt-learning strategies. In fact, Dai et al. (2021) use hypernym extraction patterns to enhance the context and apply masked language modeling to tackle the ultra-fine entity typing problem (Choi et al., 2018) with free-form labels, which shares a similar idea with prompt-learning. In our work, we mainly emphasize using prompt-learning to extract entity types that have been pre-defined in low-data scenarios.
7 Conclusion
This work investigates the application of prompt-learning to fine-grained entity typing. More specifically, we propose a framework, PLET, that can deal with fine-grained entity typing in fully supervised, few-shot, and zero-shot scenarios. In PLET, we first introduce a simple and effective prompt-learning pipeline that can extract entity types under both sufficient and insufficient supervision. Furthermore, to handle the zero-shot setting, we propose a self-supervised prompt-learning approach that automatically learns and summarizes entity types based on unlabeled corpora and a pre-defined label schema. PLET utilizes prompts to take advantage of the prior knowledge distributed in PLMs, and can learn pre-defined type information without overfitting by performing distribution-level optimization. In future work, along the direction of PLET (S), we will explore better prompt-learning approaches to automatically learning entity types from unlabeled data.
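One plausible form of the distribution-level optimization mentioned above can be sketched as follows. This is a hedged illustration, not necessarily the paper's exact loss: instead of forcing each unlabeled example onto a hard label, the batch-averaged predicted type distribution is pulled toward a prior (here uniform), which discourages the model from collapsing onto a few types:

```python
# Illustrative distribution-level loss (assumption, not the authors' exact
# formulation): KL divergence between the batch-averaged predicted type
# distribution and a uniform prior over types.

import math

def kl_divergence(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def distribution_level_loss(batch_probs):
    """batch_probs: list of per-example distributions over entity types."""
    n_types = len(batch_probs[0])
    avg = [sum(ex[t] for ex in batch_probs) / len(batch_probs)
           for t in range(n_types)]
    prior = [1.0 / n_types] * n_types
    return kl_divergence(avg, prior)

# A batch collapsed onto one type is penalized more than a spread-out batch.
collapsed = [[0.9, 0.05, 0.05]] * 4
spread = [[0.9, 0.05, 0.05], [0.05, 0.9, 0.05],
          [0.05, 0.05, 0.9], [0.34, 0.33, 0.33]]
assert distribution_level_loss(spread) < distribution_level_loss(collapsed)
```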
References

Daniel Adiwardana, Minh-Thang Luong, David R. So, Jamie Hall, Noah Fiedel, Romal Thoppilan, Zi Yang, Apoorv Kulshreshtha, Gaurav Nemade, Yifeng Lu, et al. 2020. Towards a human-like open-domain chatbot. arXiv preprint arXiv:2001.09977.

Livio Baldini Soares, Nicholas FitzGerald, Jeffrey Ling, and Tom Kwiatkowski. 2019. Matching the blanks: Distributional similarity for relation learning. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2895–2905, Florence, Italy. Association for Computational Linguistics.

Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. In Proceedings of NIPS, pages 1877–1901.

Eunsol Choi, Omer Levy, Yejin Choi, and Luke Zettlemoyer. 2018. Ultra-fine entity typing. In Proceedings of ACL, pages 87–96.

Hongliang Dai, Yangqiu Song, and Haixun Wang. 2021. Ultra-fine entity typing with weak supervision from a masked language model. In Proceedings of ACL, pages 1790–1799.

Joe Davison, Joshua Feldman, and Alexander M. Rush. 2019. Commonsense knowledge mining from pretrained models. In Proceedings of EMNLP-IJCNLP, pages 1173–1178.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of NAACL-HLT, pages 4171–4186.

Ning Ding, Xiaobin Wang, Yao Fu, Guangwei Xu, Rui Wang, Pengjun Xie, Ying Shen, Fei Huang, Hai-Tao Zheng, and Rui Zhang. 2021a. Prototypical representation learning for relation extraction. In Proceedings of ICLR.

Ning Ding, Guangwei Xu, Yulin Chen, Xiaobin Wang, Xu Han, Pengjun Xie, Hai-Tao Zheng, and Zhiyuan Liu. 2021b. Few-NERD: A few-shot named entity recognition dataset. In Proceedings of ACL, pages 3198–3213.

Tianyu Gao, Adam Fisch, and Danqi Chen. 2020. Making pre-trained language models better few-shot learners. arXiv preprint arXiv:2012.15723.

Xu Han, Zhengyan Zhang, Ning Ding, Yuxian Gu, Xiao Liu, Yuqi Huo, Jiezhong Qiu, Liang Zhang, Wentao Han, Minlie Huang, et al. 2021a. Pre-trained models: Past, present and future. arXiv preprint arXiv:2106.07139.

Xu Han, Weilin Zhao, Ning Ding, Zhiyuan Liu, and Maosong Sun. 2021b. PTR: Prompt tuning with rules for text classification. arXiv preprint arXiv:2105.11259.

John Hewitt and Christopher D. Manning. 2019. A structural probe for finding syntax in word representations. In Proceedings of NAACL, pages 4129–4138.

Shengding Hu, Ning Ding, Huadong Wang, Zhiyuan Liu, Juanzi Li, and Maosong Sun. 2021. Knowledgeable prompt-tuning: Incorporating knowledge into prompt verbalizer for text classification. arXiv preprint arXiv:2108.02035.

Ganesh Jawahar, Benoît Sagot, and Djamé Seddah. 2019. What does BERT learn about the structure of language? In Proceedings of ACL, pages 3651–3657.

Brian Lester, Rami Al-Rfou, and Noah Constant. 2021. The power of scale for parameter-efficient prompt tuning. arXiv preprint arXiv:2104.08691.

Xiang Lisa Li and Percy Liang. 2021. Prefix-tuning: Optimizing continuous prompts for generation. arXiv preprint arXiv:2101.00190.

Thomas Lin, Oren Etzioni, et al. 2012. No noun phrase left behind: Detecting and typing unlinkable entities. In Proceedings of EMNLP-CoNLL, pages 893–903.

Xiao Ling and Daniel S. Weld. 2012. Fine-grained entity recognition. In AAAI.

Pengfei Liu, Weizhe Yuan, Jinlan Fu, Zhengbao Jiang, Hiroaki Hayashi, and Graham Neubig. 2021a. Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing. arXiv preprint arXiv:2107.13586.

Xiao Liu, Yanan Zheng, Zhengxiao Du, Ming Ding, Yujie Qian, Zhilin Yang, and Jie Tang. 2021b. GPT understands, too. arXiv preprint arXiv:2103.10385.

Yang Liu and Mirella Lapata. 2019. Text summarization with pretrained encoders. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3730–3740, Hong Kong, China. Association for Computational Linguistics.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.

Ilya Loshchilov and Frank Hutter. 2019. Decoupled weight decay regularization. In Proceedings of ICLR.

Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Köpf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. 2019. PyTorch: An imperative style, high-performance deep learning library. In Proceedings of NIPS, pages 8024–8035.

Hao Peng, Tianyu Gao, Xu Han, Yankai Lin, Peng Li, Zhiyuan Liu, Maosong Sun, and Jie Zhou. 2020. Learning from context or names? An empirical study on neural relation extraction. In Proceedings of EMNLP, pages 3661–3672.

Fabio Petroni, Tim Rocktäschel, Sebastian Riedel, Patrick Lewis, Anton Bakhtin, Yuxiang Wu, and Alexander Miller. 2019. Language models as knowledge bases? In Proceedings of EMNLP, pages 2463–2473.

Xipeng Qiu, Tianxiang Sun, Yige Xu, Yunfan Shao, Ning Dai, and Xuanjing Huang. 2020. Pre-trained models for natural language processing: A survey. Science China Technological Sciences, pages 1–26.

Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding by generative pre-training. OpenAI.

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. JMLR, 21:1–67.

Xiang Ren, Wenqi He, Meng Qu, Lifu Huang, Heng Ji, and Jiawei Han. 2016a. AFET: Automatic fine-grained entity typing by hierarchical partial-label embedding. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1369–1378, Austin, Texas. Association for Computational Linguistics.

Xiang Ren, Wenqi He, Meng Qu, Clare R. Voss, Heng Ji, and Jiawei Han. 2016b. Label noise reduction in entity typing by heterogeneous partial-label embedding. In Proceedings of SIGKDD, pages 1825–1834.

Timo Schick, Helmut Schmid, and Hinrich Schütze. 2020. Automatically identifying words that can serve as labels for few-shot text classification. In Proceedings of COLING, pages 5569–5578.

Timo Schick and Hinrich Schütze. 2021. Exploiting cloze-questions for few-shot text classification and natural language inference. In Proceedings of EACL, pages 255–269.

Sonse Shimaoka, Pontus Stenetorp, Kentaro Inui, and Sebastian Riedel. 2017. Neural architectures for fine-grained entity type classification. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, pages 1271–1280, Valencia, Spain. Association for Computational Linguistics.

Taylor Shin, Yasaman Razeghi, Robert L. Logan IV, Eric Wallace, and Sameer Singh. 2020. AutoPrompt: Eliciting knowledge from language models using automatically generated prompts. In Proceedings of EMNLP, pages 4222–4235.

Trieu H. Trinh and Quoc V. Le. 2018. A simple method for commonsense reasoning. arXiv preprint arXiv:1806.02847.

Dong Wang, Ning Ding, Piji Li, and Haitao Zheng. 2021. CLINE: Contrastive learning with semantic negative examples for natural language understanding. In Proceedings of ACL.

Ralph Weischedel and Ada Brunstein. 2005. BBN Pronoun Coreference and Entity Type Corpus. Linguistic Data Consortium, Philadelphia.

Ralph Weischedel, Martha Palmer, Mitchell Marcus, Eduard Hovy, Sameer Pradhan, Lance Ramshaw, Nianwen Xue, Ann Taylor, Jeff Kaufman, Michelle Franchini, Mohammed El-Bachouti, Robert Belvin, and Ann Houston. 2013. OntoNotes Release 5.0. Abacus Data Network.

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Remi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander Rush. 2020. Transformers: State-of-the-art natural language processing. In Proceedings of EMNLP, pages 38–45.

Jingqing Zhang, Yao Zhao, Mohammad Saleh, and Peter J. Liu. 2019. PEGASUS: Pre-training with extracted gap-sentences for abstractive summarization. In Proceedings of ICML, pages 11328–11339.

Yizhe Zhang, Siqi Sun, Michel Galley, Yen-Chun Chen, Chris Brockett, Xiang Gao, Jianfeng Gao, Jingjing Liu, and Bill Dolan. 2020. DialoGPT: Large-scale generative pre-training for conversational response generation. In Proceedings of ACL, pages 270–278.

Tony Z. Zhao, Eric Wallace, Shi Feng, Dan Klein, and Sameer Singh. 2021. Calibrate before use: Improving few-shot performance of language models. arXiv preprint arXiv:2102.09690.
extensively explored Schick et al (2020) Schickand Schuumltze (2021) Shin et al (2020) Gao et al(2020) Liu et al (2021a) to generate languagephrases for prompts Recently some continuousprompts have also been proposed (Li and Liang2021 Lester et al 2021) which directly usea series of learnable continuous embeddings asprompts rather than discrete language phrases
In this paper we aim to stimulate PLMs withprompt-learning to capture the attribute informa-tion of entities We take fine-grained entity typinga crucial task in knowledge extraction to assignentity types to entity mentions (Lin et al 2012) asthe foothold to develop prompt-learning strategiesIn fact Dai et al (2021) use hypernym extractionpatterns to enhance the context and apply maskedlanguage modeling to tackle the ultra-fine entitytyping problem (Choi et al 2018) with free-formlabels which shares a similar idea with prompt-learning In our work we mainly emphasize usingprompt-learning to extract entity types that havebeen pre-defined in low-data scenarios
7 Conclusion
This work investigates the application of prompt-learning on fine-grained entity typing More specif-ically we proposes a framework PLET that coulddeal with fine-grained entity typing in fully super-vised few-shot and zero-shot scenarios In PLETwe first introduce a simple and effective prompt-learning pipeline that could be used to extract entitytypes with both sufficient and insufficient supervi-sion Furthermore to handle the zero-shot settingwe propose a self-supervised prompt-learning ap-proach that automatically learns and summarizesentity types based on unlabeled corpora and a pre-defined label schema PLET utilizes prompts totake advantage of prior knowledge distributed inPLMs and could learn pre-defined type informa-tion without overfitting by performing distribution-level optimization In our future work along the di-rection of PLET (S) we will explore better prompt-learning approaches to automatically learning en-tity types from unlabeled data
ReferencesDaniel Adiwardana Minh-Thang Luong David R So
Jamie Hall Noah Fiedel Romal Thoppilan Zi YangApoorv Kulshreshtha Gaurav Nemade Yifeng Luet al 2020 Towards a human-like open-domainchatbot arXiv preprint arXiv200109977
Livio Baldini Soares Nicholas FitzGerald JeffreyLing and Tom Kwiatkowski 2019 Matching theblanks Distributional similarity for relation learn-ing In Proceedings of the 57th Annual Meetingof the Association for Computational Linguisticspages 2895ndash2905 Florence Italy Association forComputational Linguistics
Tom B Brown Benjamin Mann Nick Ryder MelanieSubbiah Jared Kaplan Prafulla Dhariwal ArvindNeelakantan Pranav Shyam Girish Sastry AmandaAskell et al 2020 Language models are few-shotlearners In Proceedings of NIPS pages 1877ndash1901
Eunsol Choi Omer Levy Yejin Choi and Luke Zettle-moyer 2018 Ultra-fine entity typing In Proceed-ings of ACL pages 87ndash96
Hongliang Dai Yangqiu Song and Haixun Wang2021 Ultra-fine entity typing with weak supervisionfrom a masked language model In Proceedings ofACL pages 1790ndash1799
Joe Davison Joshua Feldman and Alexander M Rush2019 Commonsense knowledge mining from pre-trained models In Proceedings of EMNLP-IJCNLPpages 1173ndash1178
Jacob Devlin Ming-Wei Chang Kenton Lee andKristina Toutanova 2019 Bert Pre-training of deepbidirectional transformers for language understand-ing In Proceedings of NAACL-HLT pages 4171ndash4186
Ning Ding Xiaobin Wang Yao Fu Guangwei Xu RuiWang Pengjun Xie Ying Shen Fei Huang Hai-TaoZheng and Rui Zhang 2021a Prototypical repre-sentation learning for relation extraction In Pro-ceedings of ICLR
Ning Ding Guangwei Xu Yulin Chen Xiaobin WangXu Han Pengjun Xie Hai-Tao Zheng and ZhiyuanLiu 2021b Few-nerd A few-shot named entityrecognition dataset In Proceedings of ACL pages3198ndash3213
Tianyu Gao Adam Fisch and Danqi Chen 2020Making pre-trained language models better few-shotlearners arXiv preprint arXiv201215723
Xu Han Zhengyan Zhang Ning Ding Yuxian GuXiao Liu Yuqi Huo Jiezhong Qiu Liang ZhangWentao Han Minlie Huang et al 2021a Pre-trained models Past present and future arXivpreprint arXiv210607139
Xu Han Weilin Zhao Ning Ding Zhiyuan Liuand Maosong Sun 2021b Ptr Prompt tuningwith rules for text classification arXiv preprintarXiv210511259
John Hewitt and Christopher D Manning 2019 Astructural probe for finding syntax in word repre-sentations In Proceedings of NAACL pages 4129ndash4138
Shengding Hu Ning Ding Huadong Wang ZhiyuanLiu Juanzi Li and Maosong Sun 2021 Knowl-edgeable prompt-tuning Incorporating knowledgeinto prompt verbalizer for text classification arXivpreprint arXiv210802035
Ganesh Jawahar Benoicirct Sagot and Djameacute Seddah2019 What does bert learn about the structure oflanguage In Proceedings of ACL pages 3651ndash3657
Brian Lester Rami Al-Rfou and Noah Constant 2021The power of scale for parameter-efficient prompttuning arXiv preprint arXiv210408691
Xiang Lisa Li and Percy Liang 2021 Prefix-tuning Optimizing continuous prompts for genera-tion arXiv preprint arXiv210100190
Thomas Lin Oren Etzioni et al 2012 No noun phraseleft behind detecting and typing unlinkable entitiesIn Proceedings of EMNLP-CoNLL pages 893ndash903
Xiao Ling and Daniel S Weld 2012 Fine-grained en-tity recognition In AAAI
Pengfei Liu Weizhe Yuan Jinlan Fu Zhengbao JiangHiroaki Hayashi and Graham Neubig 2021a Pre-train prompt and predict A systematic survey ofprompting methods in natural language processingarXiv preprint arXiv210713586
Xiao Liu Yanan Zheng Zhengxiao Du Ming DingYujie Qian Zhilin Yang and Jie Tang 2021b Gptunderstands too arXiv preprint arXiv210310385
Yang Liu and Mirella Lapata 2019 Text summariza-tion with pretrained encoders In Proceedings ofthe 2019 Conference on Empirical Methods in Nat-ural Language Processing and the 9th InternationalJoint Conference on Natural Language Processing(EMNLP-IJCNLP) pages 3730ndash3740 Hong KongChina Association for Computational Linguistics
Yinhan Liu Myle Ott Naman Goyal Jingfei Du Man-dar Joshi Danqi Chen Omer Levy Mike LewisLuke Zettlemoyer and Veselin Stoyanov 2019Roberta A robustly optimized bert pretraining ap-proach arXiv preprint arXiv190711692
Ilya Loshchilov and Frank Hutter 2019 Decoupledweight decay regularization In Proceedings ofICLR
Adam Paszke Sam Gross Francisco Massa AdamLerer James Bradbury Gregory Chanan TrevorKilleen Zeming Lin Natalia Gimelshein LucaAntiga Alban Desmaison Andreas Koumlpf EdwardYang Zachary DeVito Martin Raison Alykhan Te-jani Sasank Chilamkurthy Benoit Steiner Lu FangJunjie Bai and Soumith Chintala 2019 PytorchAn imperative style high-performance deep learn-ing library In Proceedings of NIPS pages 8024ndash8035
Hao Peng Tianyu Gao Xu Han Yankai Lin PengLi Zhiyuan Liu Maosong Sun and Jie Zhou 2020Learning from Context or Names An EmpiricalStudy on Neural Relation Extraction In Proceed-ings of EMNLP pages 3661ndash3672
Fabio Petroni Tim Rocktaumlschel Sebastian RiedelPatrick Lewis Anton Bakhtin Yuxiang Wu andAlexander Miller 2019 Language models as knowl-edge bases In Proceedings of EMNLP pages2463ndash2473
Xipeng Qiu Tianxiang Sun Yige Xu Yunfan ShaoNing Dai and Xuanjing Huang 2020 Pre-trainedmodels for natural language processing A surveyScience China Technological Sciences pages 1ndash26
Alec Radford Karthik Narasimhan Tim Salimans andIlya Sutskever 2018 Improving language under-standing by generative pre-training OpenAI
Colin Raffel Noam Shazeer Adam Roberts KatherineLee Sharan Narang Michael Matena Yanqi ZhouWei Li and Peter J Liu 2020 Exploring the limitsof transfer learning with a unified text-to-text trans-former JMLR 211ndash67
Xiang Ren Wenqi He Meng Qu Lifu Huang HengJi and Jiawei Han 2016a AFET Automatic fine-grained entity typing by hierarchical partial-labelembedding In Proceedings of the 2016 Conferenceon Empirical Methods in Natural Language Process-ing pages 1369ndash1378 Austin Texas Associationfor Computational Linguistics
Xiang Ren Wenqi He Meng Qu Clare R Voss HengJi and Jiawei Han 2016b Label noise reduction inentity typing by heterogeneous partial-label embed-ding In Proceedings of SIGKDD page 1825ndash1834
Timo Schick Helmut Schmid and Hinrich Schuumltze2020 Automatically identifying words that canserve as labels for few-shot text classification InProceedings of COLING pages 5569ndash5578
Timo Schick and Hinrich Schuumltze 2021 Exploitingcloze-questions for few-shot text classification andnatural language inference In Proceedings of EACLpages 255ndash269
Sonse Shimaoka Pontus Stenetorp Kentaro Inui andSebastian Riedel 2017 Neural architectures forfine-grained entity type classification In Proceed-ings of the 15th Conference of the European Chap-ter of the Association for Computational LinguisticsVolume 1 Long Papers pages 1271ndash1280 ValenciaSpain Association for Computational Linguistics
Taylor Shin Yasaman Razeghi Robert L Logan IVEric Wallace and Sameer Singh 2020 AutopromptEliciting knowledge from language models using au-tomatically generated prompts In Proceedings ofEMNLP pages 4222ndash4235
Trieu H Trinh and Quoc V Le 2018 A simplemethod for commonsense reasoning arXiv preprintarXiv180602847
Dong Wang Ning Ding Piji Li and Haitao Zheng2021 CLINE Contrastive learning with semanticnegative examples for natural language understand-ing In Proceedings of ACL
Ralph Weischedel and Ada Brunstein 2005 BBN Pro-noun Coreference and Entity Type Corpus Linguis-tic Data Consortium Philadelphia
Ralph Weischedel Martha Palmer Mitchell MarcusEduard Hovy Sameer Pradhan Lance Ramshaw Ni-anwen Xue Ann Taylor Jeff Kaufman MichelleFranchini Mohammed El-Bachouti Robert Belvinand Ann Houston 2013 OntoNotes Release 50Abacus Data Network
Thomas Wolf Lysandre Debut Victor Sanh JulienChaumond Clement Delangue Anthony Moi Pier-ric Cistac Tim Rault Remi Louf Morgan Funtow-icz Joe Davison Sam Shleifer Patrick von PlatenClara Ma Yacine Jernite Julien Plu Canwen XuTeven Le Scao Sylvain Gugger Mariama DrameQuentin Lhoest and Alexander Rush 2020 Trans-formers State-of-the-art natural language process-ing In Proceedings of EMNLP pages 38ndash45
Jingqing Zhang Yao Zhao Mohammad Saleh and Pe-ter J Liu 2019 Pegasus Pre-training with ex-tracted gap-sentences for abstractive summarizationIn Proceedings of ICML pages 11328ndash11339
Yizhe Zhang Siqi Sun Michel Galley Yen-Chun ChenChris Brockett Xiang Gao Jianfeng Gao JingjingLiu and Bill Dolan 2020 Dialogpt Large-scalegenerative pre-training for conversational responsegeneration In Proceedings of ACL pages 270ndash278
Tony Z Zhao Eric Wallace Shi Feng Dan Klein andSameer Singh 2021 Calibrate before use Im-proving few-shot performance of language modelsarXiv preprint arXiv210209690
Shengding Hu Ning Ding Huadong Wang ZhiyuanLiu Juanzi Li and Maosong Sun 2021 Knowl-edgeable prompt-tuning Incorporating knowledgeinto prompt verbalizer for text classification arXivpreprint arXiv210802035
Ganesh Jawahar Benoicirct Sagot and Djameacute Seddah2019 What does bert learn about the structure oflanguage In Proceedings of ACL pages 3651ndash3657
Brian Lester Rami Al-Rfou and Noah Constant 2021The power of scale for parameter-efficient prompttuning arXiv preprint arXiv210408691
Xiang Lisa Li and Percy Liang 2021 Prefix-tuning Optimizing continuous prompts for genera-tion arXiv preprint arXiv210100190
Thomas Lin Oren Etzioni et al 2012 No noun phraseleft behind detecting and typing unlinkable entitiesIn Proceedings of EMNLP-CoNLL pages 893ndash903
Xiao Ling and Daniel S Weld 2012 Fine-grained en-tity recognition In AAAI
Pengfei Liu, Weizhe Yuan, Jinlan Fu, Zhengbao Jiang, Hiroaki Hayashi, and Graham Neubig. 2021a. Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing. arXiv preprint arXiv:2107.13586.
Xiao Liu, Yanan Zheng, Zhengxiao Du, Ming Ding, Yujie Qian, Zhilin Yang, and Jie Tang. 2021b. GPT understands, too. arXiv preprint arXiv:2103.10385.
Yang Liu and Mirella Lapata. 2019. Text summarization with pretrained encoders. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3730–3740, Hong Kong, China. Association for Computational Linguistics.
Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.
Ilya Loshchilov and Frank Hutter. 2019. Decoupled weight decay regularization. In Proceedings of ICLR.
Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Köpf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. 2019. PyTorch: An imperative style, high-performance deep learning library. In Proceedings of NIPS, pages 8024–8035.
Hao Peng, Tianyu Gao, Xu Han, Yankai Lin, Peng Li, Zhiyuan Liu, Maosong Sun, and Jie Zhou. 2020. Learning from context or names? An empirical study on neural relation extraction. In Proceedings of EMNLP, pages 3661–3672.
Fabio Petroni, Tim Rocktäschel, Sebastian Riedel, Patrick Lewis, Anton Bakhtin, Yuxiang Wu, and Alexander Miller. 2019. Language models as knowledge bases? In Proceedings of EMNLP, pages 2463–2473.
Xipeng Qiu, Tianxiang Sun, Yige Xu, Yunfan Shao, Ning Dai, and Xuanjing Huang. 2020. Pre-trained models for natural language processing: A survey. Science China Technological Sciences, pages 1–26.
Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding by generative pre-training. OpenAI.
Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. JMLR, 21:1–67.
Xiang Ren, Wenqi He, Meng Qu, Lifu Huang, Heng Ji, and Jiawei Han. 2016a. AFET: Automatic fine-grained entity typing by hierarchical partial-label embedding. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1369–1378, Austin, Texas. Association for Computational Linguistics.
Xiang Ren, Wenqi He, Meng Qu, Clare R. Voss, Heng Ji, and Jiawei Han. 2016b. Label noise reduction in entity typing by heterogeneous partial-label embedding. In Proceedings of SIGKDD, pages 1825–1834.
Timo Schick, Helmut Schmid, and Hinrich Schütze. 2020. Automatically identifying words that can serve as labels for few-shot text classification. In Proceedings of COLING, pages 5569–5578.
Timo Schick and Hinrich Schütze. 2021. Exploiting cloze-questions for few-shot text classification and natural language inference. In Proceedings of EACL, pages 255–269.
Sonse Shimaoka, Pontus Stenetorp, Kentaro Inui, and Sebastian Riedel. 2017. Neural architectures for fine-grained entity type classification. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, pages 1271–1280, Valencia, Spain. Association for Computational Linguistics.
Taylor Shin, Yasaman Razeghi, Robert L. Logan IV, Eric Wallace, and Sameer Singh. 2020. AutoPrompt: Eliciting knowledge from language models using automatically generated prompts. In Proceedings of EMNLP, pages 4222–4235.
Trieu H. Trinh and Quoc V. Le. 2018. A simple method for commonsense reasoning. arXiv preprint arXiv:1806.02847.
Dong Wang, Ning Ding, Piji Li, and Haitao Zheng. 2021. CLINE: Contrastive learning with semantic negative examples for natural language understanding. In Proceedings of ACL.
Ralph Weischedel and Ada Brunstein. 2005. BBN Pronoun Coreference and Entity Type Corpus. Linguistic Data Consortium, Philadelphia.
Ralph Weischedel, Martha Palmer, Mitchell Marcus, Eduard Hovy, Sameer Pradhan, Lance Ramshaw, Nianwen Xue, Ann Taylor, Jeff Kaufman, Michelle Franchini, Mohammed El-Bachouti, Robert Belvin, and Ann Houston. 2013. OntoNotes Release 5.0. Abacus Data Network.
Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Remi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander Rush. 2020. Transformers: State-of-the-art natural language processing. In Proceedings of EMNLP, pages 38–45.
Jingqing Zhang, Yao Zhao, Mohammad Saleh, and Peter J. Liu. 2019. PEGASUS: Pre-training with extracted gap-sentences for abstractive summarization. In Proceedings of ICML, pages 11328–11339.
Yizhe Zhang, Siqi Sun, Michel Galley, Yen-Chun Chen, Chris Brockett, Xiang Gao, Jianfeng Gao, Jingjing Liu, and Bill Dolan. 2020. DialoGPT: Large-scale generative pre-training for conversational response generation. In Proceedings of ACL, pages 270–278.
Tony Z. Zhao, Eric Wallace, Shi Feng, Dan Klein, and Sameer Singh. 2021. Calibrate before use: Improving few-shot performance of language models. arXiv preprint arXiv:2102.09690.