Conversational Network in the Chinese Buddhist Canon

Tak-sum Wong and John Sie Yuen Lee
City University of Hong Kong

Conference on Digital Humanities 2015

Application of corpora

• Linguistic annotation is commonly applied to a corpus in order to study its language.

• Can we apply dependency relations to analyze the characters in a literary text?

Outline

• What is a treebank?
• Construction of our treebank
• Conversational network of Buddhist text
– Goddess of Mercy
– Who spoke most?
– Mahāyāna vs Hīnayāna

Syntactic tree

Treebanks: What?

• Many types of parse trees
– Example:

Stanford dependency parse tree

Treebanks: What?

• A treebank is a collection of syntactically analyzed sentences
– Typically in the form of parse trees

Treebanks: What?

• A dependency tree represents grammatical relations between words

‘Bills’ is the child of ‘submitted’ in the relation nsubjpass
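
A dependency tree can be stored as a list of labelled head-child edges. The sketch below is illustrative only: the `Dependency` class and the `children_of` helper are hypothetical names, and only the nsubjpass edge comes from the slide; the auxpass edge is an assumed addition so that the lookup returns more than one child.

```python
from typing import NamedTuple


class Dependency(NamedTuple):
    """One labelled edge of a dependency tree (hypothetical structure)."""
    relation: str  # grammatical relation label, e.g. "nsubjpass"
    head: str      # governing word
    child: str     # dependent word


# The first edge is the one quoted on the slide; the second is assumed.
edges = [
    Dependency("nsubjpass", "submitted", "Bills"),
    Dependency("auxpass", "submitted", "were"),
]


def children_of(head: str, tree: list[Dependency]) -> dict[str, str]:
    """Map each child of `head` to the relation linking it to `head`."""
    return {e.child: e.relation for e in tree if e.head == head}


print(children_of("submitted", edges))
```

Storing edges as flat tuples keeps searching simple: any question about the tree becomes a filter over the list.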

Treebanks: What?

• A tree also includes part-of-speech tags
– Critical for Chinese since it has no inflectional morphology

Dependency relation: ‘monk’ is a direct object of ‘see’

‘not’ ‘hear’ ‘Buddha’ ‘sutra’ ‘and’ ‘not’ ‘see’ ‘monk’

[He] has neither heard about the Buddhist scriptures nor seen any monk.

Part-of-speech tag: ‘monk’ is a noun

Treebanks: Why?

• Quickly find examples to support linguistic research
– E.g., in the passive structure 為…所…, 為 is sometimes dropped in Buddhist Chinese text
  • A feature of Buddhist Chinese
– Easy to search for passive sentences in a treebank

Treebanks: Why?

• Characterize the “profile” of a word

What can you pray ‘for’, and who can you pray ‘to’?

Word Sketch [Kilgarriff et al. 2004]
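
One way to picture a word profile is as the collocates of a headword, grouped by grammatical relation and ranked by frequency. The sketch below is a simplification under stated assumptions: real Word Sketches rank collocates by an association score rather than raw counts, and the triples, relation labels, and the `word_sketch` function are invented for illustration.

```python
from collections import Counter, defaultdict

# Toy (relation, head, child) triples; invented for illustration only.
triples = [
    ("prep_for", "pray", "peace"),
    ("prep_for", "pray", "rain"),
    ("prep_for", "pray", "peace"),
    ("prep_to", "pray", "God"),
    ("nsubj", "pray", "people"),
    ("dobj", "see", "monk"),
]


def word_sketch(headword, triples):
    """Group the collocates of `headword` by relation, most frequent first."""
    by_relation = defaultdict(Counter)
    for relation, head, child in triples:
        if head == headword:
            by_relation[relation][child] += 1
    return {rel: counts.most_common() for rel, counts in by_relation.items()}


sketch = word_sketch("pray", triples)
for relation, collocates in sketch.items():
    print(relation, collocates)
```

Here the `prep_for` group answers what one prays ‘for’ and the `prep_to` group answers who one prays ‘to’.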

Treebanks: Why?

• Sketch differences between ‘clever’ and ‘intelligent’

Compare the meaning of “clever” and “intelligent” by looking at adjectives that collocate with them

Word Sketch [Kilgarriff et al. 2004]

Treebank development

• Training data
– Small-scale treebank created by Lee & Kong (2014)
– 50k characters
– Finely tagged by Buddhist specialists
– POS-tag set: adapted from Penn Chinese Treebank
– Dependency labels: largely followed Stanford Dependencies for Chinese + 5 new relations for MC

Treebank development

• Pre-processing:
– Transplanted punctuation to the Tripiṭaka Koreana 高麗藏 from the Taishō edition 大正藏
• No parser for Classical Chinese
– Word segmentation, POS-tagging by CRF++
– Dependency parsing by MST parser
– External dictionaries
  • Soothill-Hodous Dictionary of Chinese Buddhist Terms
  • Person and Place Authority Databases from DDBC
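
CRF-based word segmentation is usually cast as character tagging. The slides do not say which tag scheme was used, so the B/I/E/S scheme below (begin, inside, end, single) is an assumption, as are the helper names `words_to_tags` and `tags_to_words`. The sketch shows only the encoding and decoding steps around the tagger, not CRF++ itself.

```python
def words_to_tags(words):
    """Encode a segmented sentence as one B/I/E/S tag per character."""
    tags = []
    for word in words:
        if len(word) == 1:
            tags.append("S")
        else:
            tags.extend(["B"] + ["I"] * (len(word) - 2) + ["E"])
    return tags


def tags_to_words(chars, tags):
    """Decode predicted tags back into words; a word ends on 'E' or 'S'."""
    words, current = [], ""
    for char, tag in zip(chars, tags):
        current += char
        if tag in ("E", "S"):
            words.append(current)
            current = ""
    if current:  # tolerate a tag sequence that ends mid-word
        words.append(current)
    return words


# Invented example: 'Buddha' 'tell' 'Subhūti'
segmented = ["佛", "告", "須菩提"]
tags = words_to_tags(segmented)
print(tags)
print(tags_to_words("".join(segmented), tags))
```

Training data is produced with the first function and tagger output is turned back into words with the second, so the two should round-trip on any well-formed sentence.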

Interesting problems

• How close are the characters in the Buddhist world?

• We aim to answer this question by examining the conversations in Buddhist texts.

Most Frequent Say verbs

• 言 yán ‘to say’ (10979)
• 告 gào ‘to tell; to announce to’ (5401)
• 白……言 bái… yán ‘to address … and say’ (5015)
• 答曰 dáyuē ‘to reply and say’ (2157)
• 曰 yuē ‘to say’ (2126)
• 問 wèn ‘to inquire’ (2091)
• 告……言 gào… yán ‘to tell… and say’ (737)
• 白 bái ‘to address’ (475)
• 語 yù ‘to say’ (453)

Extraction of speaker and listener
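
The slide's diagram is not preserved in this transcript. As a rough sketch of how speakers and listeners could be read off a dependency parse, the code below assumes that the speaker is the nsubj child of a say verb and the listener is its dobj child. These label choices, the toy parse, and the function name `extract_utterances` are assumptions, not the authors' stated rules.

```python
# Single-word say verbs taken from the frequency list.
SAY_VERBS = {"言", "告", "白", "曰", "問", "語"}

# Toy parse of 佛 告 須菩提 ('Buddha' 'tell' 'Subhūti') as (relation, head, child).
parse = [
    ("nsubj", "告", "佛"),
    ("dobj", "告", "須菩提"),
]


def extract_utterances(parse):
    """Return (speaker, verb, listener) for every say verb in the parse.

    Missing roles are reported as None. Verbs are keyed by their surface
    form, which is enough for a one-clause toy parse but would conflate
    repeated verbs in a longer sentence.
    """
    roles = {}
    for relation, head, child in parse:
        if head in SAY_VERBS and relation in ("nsubj", "dobj"):
            roles.setdefault(head, {})[relation] = child
    return [(r.get("nsubj"), verb, r.get("dobj")) for verb, r in roles.items()]


print(extract_utterances(parse))
```

Each extracted triple corresponds to one utterance, which is the unit counted in the statistics that follow.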

The case of the Goddess of Mercy

Kwun Yam, Gwan-eum, Kannon, Guānyīn, and Quan Âm 觀音

Avalokiteśvara

The case of the Goddess of Mercy

Distribution of listeners of the Goddess of Mercy (N=195)
• Buddha: 92%
• bodhisattva: 6%
• others: 2%

Distribution of types of saying verbs with the Goddess of Mercy as speaker (N=195)
• address: 86%
• unmarked: 11%
• reply: 2%
• tell: 1%

The case of the Goddess of Mercy

Distribution of speakers talking to the Goddess of Mercy (N=143)
• Buddha: 90%
• bodhisattva: 9%
• others: 1%

Distribution of types of saying verbs with the Goddess of Mercy as listener (N=143)
• tell: 55%
• unmarked: 41%
• ask/reply: 3%
• address: 1%

Visualization of conversational network

Conversational network

Conversational network of the Chinese Buddhist Canon (CBC), showing edges with 200 utterances or more
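
A conversational network of this kind can be thought of as a weighted directed graph whose edge weights are utterance counts, drawn only where the count reaches a threshold. The sketch below is a minimal illustration under that reading: the data is invented, the function name `build_network` is hypothetical, and the toy threshold of 2 stands in for the 200 used on the slide.

```python
from collections import Counter

# Invented (speaker, listener) pairs, one per utterance.
utterances = (
    [("Buddha", "Subhūti")] * 3
    + [("Subhūti", "Buddha")] * 2
    + [("Buddha", "Ānanda")] * 1
)


def build_network(utterances, min_utterances):
    """Count utterances per directed (speaker, listener) edge and keep
    only the edges with at least `min_utterances` utterances."""
    weights = Counter(utterances)
    return {edge: n for edge, n in weights.items() if n >= min_utterances}


network = build_network(utterances, min_utterances=2)
print(network)
```

Edges are directed, so Buddha speaking to Subhūti and Subhūti speaking to Buddha are counted separately.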

Protagonists

Who talked the most?
• Subhūti
• Maudgalyāyana
• Avalokiteśvara
• Śākyamuni Buddha

Protagonists

Interlocutors of the protagonists

Speak to Listen Ratio
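
The slide gives only the title, so the definition used below is an assumption: a character's speak-to-listen ratio is taken as the number of utterances they speak divided by the number addressed to them. The data is invented and `speak_to_listen_ratio` is a hypothetical name.

```python
from collections import Counter


def speak_to_listen_ratio(utterances):
    """Map each character to utterances spoken / utterances received.

    Characters who are never addressed get float('inf').
    """
    spoken = Counter(speaker for speaker, _ in utterances)
    heard = Counter(listener for _, listener in utterances)
    ratios = {}
    for name in set(spoken) | set(heard):
        ratios[name] = spoken[name] / heard[name] if heard[name] else float("inf")
    return ratios


# Invented (speaker, listener) pairs, one per utterance.
utterances = [
    ("Buddha", "Subhūti"),
    ("Buddha", "Subhūti"),
    ("Buddha", "Ānanda"),
    ("Subhūti", "Buddha"),
]
ratios = speak_to_listen_ratio(utterances)
print(ratios["Buddha"], ratios["Subhūti"], ratios["Ānanda"])
```

Under this definition a ratio above 1 marks a character who mostly speaks, and a ratio below 1 marks one who mostly listens.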

Buddhist network without Buddha

Mahāyāna and Hīnayāna

Mahāyāna section

Absolute fundamental reality

Perfection of wisdom

Theory of wisdom endowed with insight into emptiness

Conversational network of the Mahāyāna section of the Buddhist Canon, showing edges with more than 280 utterances.

Hīnayāna section

Conversational network of the Hīnayāna section of the Chinese Buddhist Canon, showing edges with 100 or more utterances

Mahāyāna and Hīnayāna

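• 690 彌勒菩薩 (Maitreya Bodhisattva)
• 878 帝釋天 (Śakra)
• 924 比丘 (bhikṣu, monk)
• 1333 天子 (devaputra)
• 1772 菩薩 (bodhisattva)
• 2702 阿難 (Ānanda)
• 3316 舍利弗 (Śāriputra)
• 3553 文殊菩薩 (Mañjuśrī Bodhisattva)
• 5013 須菩提 (Subhūti)
• 17457 釋迦牟尼佛 (Śākyamuni Buddha)

• 365 摩訶迦葉 (Mahākāśyapa)
• 456 優波離 (Upāli)
• 457 婆羅門 (brahmin)
• 564 王 (king)
• 612 舍利弗 (Śāriputra)
• 682 比丘尼 (bhikṣuṇī, nun)
• 833 人 (person)
• 2273 阿難 (Ānanda)
• 10096 比丘 (bhikṣu, monk)
• 15154 釋迦牟尼佛 (Śākyamuni Buddha)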

Conclusion

• Built the first corpus of the CBC, 46 million characters, semi-automatically from limited manually annotated data of 50k characters

• Demonstrated how to exploit linguistic annotations to analyze the characters in large-scale Chinese literary texts by using dependency relations
– Studied the conversational network
– Statistics, e.g. protagonists, interlocutors of protagonists
– Mahāyāna versus Hīnayāna

Thank you!

Q&A
