Buddha Ngram Viewer: a N-gram Visualization Tool of Chinese Buddhist Translations Jen-Jou (Joey) Hung @ PNC Annual Conference and Joint Meetings 2013, Kyoto University, DEC 10-12, 2013

Page 1: Buddha  Ngram Viewer: a N-gram Visualization Tool of Chinese Buddhist Translations

Buddha Ngram Viewer: a N-gram Visualization Tool of Chinese Buddhist Translations

Jen-Jou (Joey) Hung @ PNC Annual Conference and Joint Meetings 2013, Kyoto University, DEC 10-12, 2013

Page 2

Building the Digital Research Platform for Chinese Buddhist Literature


Page 3

Achievements of Digitized Chinese Buddhist Texts

CBETA (Chinese Buddhist Electronic Text Association) was founded in 1998.

In the last 15 years, CBETA has converted a substantial number of Chinese Buddhist scriptures to digital format.

The CBETA 2011 DVD contains more than 160 million Chinese characters.

Page 4

Statistics of CBETA Digitized Content

Time | Name of Collection | Works (部) | Fascicles | Characters
1998-2003 | Taishō Tripiṭaka (大正藏) | 2,373 | 8,982 | 78,770,000
2004-2007 | Shinsan Zokuzōkyō (卍續藏) | 1,229 | 5,066 | 71,220,000
2008-2011 | Passages concerning Buddhist activities from the Official History (正史佛教資料類編) | 1 | 10 | 333,000
2008-2011 | Buddhist texts not contained in the Tripiṭaka (藏外佛教文獻) | 77 | 136 | 1,663,000
2008-2011 | Selection of stone rubbings from Northern Dynasties (北朝佛教石刻拓片百品) | 100 | 100 | 74,000
2008-2011 | Supplement from other editions of Tripiṭaka (歷代藏經補輯) | 385 | 2,631 | 24,193,000
2012-2013 | Chinese Translations of Pali Canon (Based on Yuan Heng Temple Edition) | 36 | - | ca. 7,500,000
2012-2013 | Selections from the Taiwan National Central Library Buddhist Rare Book Collection | 64 | - | ca. 5,500,000
Total (總計) | | 4,265 | 16,925 | ca. 189,253,000
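The totals on this slide can be cross-checked by summing each column. The sketch below does that with the figures as printed; the shortened collection labels and the `column_total` helper are introduced here for illustration only. The two 2012-2013 collections list no fascicle count, so they are stored as `None`, and their character counts are the approximate ("ca.") values from the slide.

```python
# Rows copied from the statistics slide: (label, works, fascicles, characters).
# Labels are shortened for readability; fascicle counts are not given for the
# two 2012-2013 collections, hence None.
rows = [
    ("Taisho Tripitaka", 2373, 8982, 78_770_000),
    ("Shinsan Zokuzokyo", 1229, 5066, 71_220_000),
    ("Official History passages", 1, 10, 333_000),
    ("Texts not contained in the Tripitaka", 77, 136, 1_663_000),
    ("Northern Dynasties stone rubbings", 100, 100, 74_000),
    ("Supplement from other editions", 385, 2631, 24_193_000),
    ("Chinese translations of the Pali Canon", 36, None, 7_500_000),
    ("National Central Library rare books", 64, None, 5_500_000),
]


def column_total(index):
    """Sum one numeric column, skipping cells the slide leaves empty."""
    return sum(row[index] for row in rows if row[index] is not None)


works_total = column_total(1)
fascicles_total = column_total(2)
characters_total = column_total(3)

print(works_total, fascicles_total, characters_total)
```

The sums agree with the Total row (4,265 works, 16,925 fascicles, ca. 189,253,000 characters), which also indicates that the lone figures 36 and 64 in the last two rows are counts of works.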

Page 5

The Chance and Challenge with “BIG DATA” (I)

The rapid growth of digital resources lets scholars acquire more relevant materials in less time.

However, most digital resources are not integrated. Scholars have to find a more efficient way to master the large amount of data so as not to drown in the ocean of data.

Page 6

The Chance and Challenge with “BIG DATA” (II)

We also believe that this large body of digital resources will not only provide a convenient research environment but also help scholars gain new insights.

One very promising approach is to perform text analysis on the Buddhist electronic text corpus to uncover hidden patterns in the texts.

However, this sounds like a very difficult task for Buddhist scholars.

Page 7

Digital Research Platform for Chinese Buddhist Literature

Main Missions of the Digital Research Platform:

1. Data Providing: provide complete, integrated reference data in an easily accessible way.

2. Data Organizing: provide customization tools for users to organize materials into knowledge.

3. Data Analyzing: provide digital analysis tools for discovering hidden patterns.

Page 8

Project Information

A two-year project funded by the National Science Council (Digital Humanities Project). It consists of three sub-projects:

Sub-project 1: responsible for digitizing new resources to support this platform (directed by Aming TU).

Sub-project 2: responsible for developing new methodologies for analyzing the digital corpus, with a particular focus on phonology materials (directed by Chien-Kang Huang).

Sub-project 3: responsible for integrating project results, developing quantitative text analysis tools, and establishing the platform.

Page 9

Plan for the First Year

Target 1: build the platform for integrating resources

Design a good way to integrate digital resources, including CBETA full text, catalogues, dictionaries, phonology materials, and other digital resources created by DDBC.

Target 2: implement text analysis functions

Build up data sets for text analysis.

Create tools. Example: Buddha Ngram Viewer, which visualizes the occurrences over time of input phrases in Chinese Buddhist texts.

Page 10

Target 1: Building the Digital Research Platform

Page 11

Idea of the Research Platform

Our experience: in the last decade, we have executed more than 20 digital archive projects.

Every database has its own archive content, design principles, and media types.

The only overlap is perhaps the sutra text.

To integrate those resources, we decided to establish a feature-rich sutra reading interface and bind other related information to the text.

Page 12

Main Idea of Integration

[Diagram labels: Sutra Reading; Text Analysis Tools; Phonology Materials; Dictionaries; Word Segmentation Tools; Tripitaka Catalogue]

Page 13

Catalogue Data

Basic Information

Only critical apparatus and gaiji information are embedded.

Information from the sutra catalogue; clicking here leads to our catalogue project.

N-gram Information

婆羅 727, 如是 705, 比丘 694, 羅門 693, 沙門 614

世尊 477, 如來 469, 云何 428, 眾生 388, 由旬 387

爾時 384, 復有 358, 是為 346, 阿難 317, 無有 313
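A minimal sketch of how counts like these could be produced, assuming plain overlapping character bigrams with no word segmentation. The presence of both 婆羅 and 羅門 in the list suggests this, but the platform's actual method is not described in these slides. The function name `count_ngrams` and the sample sentence are illustrative only.

```python
from collections import Counter


def count_ngrams(text, n=2):
    """Count overlapping character n-grams made entirely of CJK ideographs."""
    counts = Counter()
    for i in range(len(text) - n + 1):
        gram = text[i:i + n]
        # Skip grams that touch punctuation or whitespace; only the basic
        # CJK Unified Ideographs block (U+4E00..U+9FFF) is accepted here.
        if all("\u4e00" <= ch <= "\u9fff" for ch in gram):
            counts[gram] += 1
    return counts


# Illustrative sample text, not quoted from a specific CBETA file.
sample = "爾時世尊告諸比丘。諸比丘白世尊。"
bigram_counts = count_ngrams(sample, n=2)

for gram, freq in bigram_counts.most_common(3):
    print(gram, freq)
```

Because the window slides one character at a time, a three-character word such as 婆羅門 contributes to two bigram counts (婆羅 and 羅門), so high counts do not necessarily correspond to standalone words.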

Other Related Sutra

• Other parallel translations

• List of commentaries

• Related research

Page 14

Catalogue Data

N-gram Information

婆羅 727, 如是 705, 比丘 694, 羅門 693, 沙門 614

世尊 477, 如來 469, 云何 428, 眾生 388, 由旬 387

爾時 384, 復有 358, 是為 346, 阿難 317, 無有 313

Other Related Sutra

• Other parallel translations

• List of commentaries

• Related research

Extra Information for Selected Terms

婆羅

Dictionary Lookup

Occurrences of 婆羅 in different time periods

婆羅《丁福保佛學大辭典》【職位】 Vihārapāla ,維那之別名,譯曰次第,司僧中之次第順序者。行事鈔下二曰:「維那出要律儀翻為寺護,又云悅眾。本正音婆邏,云次第。」

Information from our glossaries project; clicking here leads to the glossaries project website.

Word Segmentation Tools

Place Name, Person Name, and Calendar Lookup

This information is from Buddha Ngram Viewer.

Page 15

Target 2: Implement Text Analysis Functions

Page 16

What Is Text Analysis?

Text analysis: using computer software to analyze the textual content of a large corpus, e.g., CBETA. The objective is to discover hidden patterns and derive new insights from them.

The patterns could be:

• Words that are frequently used in one place but never appear anywhere else.

• High-frequency collocations in a group of documents.

• Special usage patterns of commonly used words.

• Other possible and meaningful patterns ……
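The first pattern type, words used in one place but never elsewhere, can be sketched with plain set operations. The groups and term sets below are invented toy data, and `exclusive_terms` is a hypothetical helper, not part of any tool described in these slides.

```python
def exclusive_terms(groups):
    """For each group, return the terms found in that group and in no other."""
    result = {}
    for name, terms in groups.items():
        # Collect every term that appears in any of the other groups.
        seen_elsewhere = set()
        for other_name, other_terms in groups.items():
            if other_name != name:
                seen_elsewhere |= set(other_terms)
        result[name] = set(terms) - seen_elsewhere
    return result


# Toy vocabulary per group of texts (invented for illustration, not real data).
toy_groups = {
    "group_a": {"泥洹", "比丘", "世尊"},
    "group_b": {"涅槃", "比丘", "世尊"},
}

unique_by_group = exclusive_terms(toy_groups)
print(unique_by_group)
```

In practice the term sets would come from n-gram counts with a frequency threshold, so that rare noise is not reported as a distinctive term; that threshold is a design choice left open here.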

Page 17

Difficulties in Applying Text Analysis to the CBETA Corpus

The data is too complex: the textual content and structure of Buddhist works are highly complex.

Analysis tools are very difficult to learn: using general-purpose text analysis tools requires some computer programming skills and advanced statistical knowledge.

How can we enable more humanities scholars to adopt text analysis techniques in addressing their research questions?

We create some easy-to-use tools.

Page 18

Buddha Ngram Viewer (under construction): http://dev.ddbc.edu.tw/BuddhaNgramViewer/

A tool that allows users to visualize the occurrences over time of input phrases in Chinese Buddhist texts.

Click any point in the chart to start.

Page 19

Idea of Buddha Ngram Viewer

Combine search results with the sutra translation dates from the Tripitaka catalogue.

Search results in CBReader

+

Sutra No. | Sutra Name | Dynasty
T01n0001 | 長阿含經 | 後秦
T01n0005 | 佛般泥洹經 | 西晉
T01n0023 | 大樓炭經 | 西晉

+

後秦 = C.E. 410

西晉 = C.E. 314

=

Number of occurrences of the search term in different time periods.
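The combination on this slide can be sketched as a simple join followed by a sum: look up each sutra's dynasty, map the dynasty to a year, and add up the hit counts per year. The sutra numbers, names, dynasties, and dynasty years below are the ones shown on the slide; the hit counts in `SEARCH_HITS` are invented placeholders, and the function name `occurrences_by_year` is hypothetical. How the real viewer stores or joins its data is not described here.

```python
from collections import defaultdict

# Dynasty-to-year mapping as shown on the slide.
DYNASTY_YEAR = {"後秦": 410, "西晉": 314}

# Rows shown on the slide: sutra number -> (sutra name, dynasty).
CATALOGUE = {
    "T01n0001": ("長阿含經", "後秦"),
    "T01n0005": ("佛般泥洹經", "西晉"),
    "T01n0023": ("大樓炭經", "西晉"),
}

# Hypothetical hit counts for one search term; NOT real CBETA figures.
SEARCH_HITS = {"T01n0001": 12, "T01n0005": 30, "T01n0023": 4}


def occurrences_by_year(hits, catalogue, dynasty_year):
    """Sum search hits per translation year via the sutra's dynasty."""
    totals = defaultdict(int)
    for sutra_no, count in hits.items():
        _name, dynasty = catalogue[sutra_no]
        totals[dynasty_year[dynasty]] += count
    # Sort by year so the result can be plotted directly as a time series.
    return dict(sorted(totals.items()))


series = occurrences_by_year(SEARCH_HITS, CATALOGUE, DYNASTY_YEAR)
print(series)  # {314: 34, 410: 12}
```

Running the same function once per search term yields one time series per term, which is what a chart comparing two phrases needs.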

Page 20

[Chart: occurrences of 泥洹 and 涅槃. Axis labels: Number of occurrences; Western Years; Chinese Dynasties]

Click this point to see the details of C.E. 401.

Page 21

Occurrences of 泥洹 and 涅槃 in the sutras translated in C.E. 401

Scroll down for more sutras

Occurrences in the 22 fascicles of T1 (長阿含經)

A quick way to understand the frequencies of selected terms in texts.

Click this point to see the details of the third fascicle of T1
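A rough sketch of the per-fascicle breakdown described on this slide: count each selected term in each fascicle of a text. The fascicle strings are short invented placeholders, not the actual text of T1, and `count_per_fascicle` is a hypothetical name.

```python
def count_per_fascicle(fascicles, terms):
    """Return {fascicle_no: {term: count}} using non-overlapping substring counts."""
    return {
        number: {term: text.count(term) for term in terms}
        for number, text in fascicles.items()
    }


# Placeholder strings standing in for fascicle texts (invented, not real data).
toy_fascicles = {
    1: "比丘般泥洹",
    2: "涅槃泥洹泥洹",
    3: "世尊告比丘",
}

per_fascicle = count_per_fascicle(toy_fascicles, ["泥洹", "涅槃"])

for number, counts in per_fascicle.items():
    print(number, counts)
```

Each inner dictionary corresponds to one point group in a per-fascicle chart, so a fascicle with zero hits still appears and the horizontal axis stays complete.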

Page 22

Shows the matched locations of 泥洹 and 涅槃 in the third fascicle of T1

Click here to display only matches of 泥洹

Page 23

Displays only matches of 泥洹 in the third fascicle of T1

Click to view this line in the CBETA text

Page 24

CBETA Full text of the selected line

Page 25

Integrate Buddha Ngram Viewer into the Research Platform

婆羅

Dictionary Lookup

Occurrences of 婆羅 over time

婆羅《丁福保佛學大辭典》【職位】 Vihārapāla ,維那之別名,譯曰次第,司僧中之次第順序者。行事鈔下二曰:「維那出要律儀翻為寺護,又云悅眾。本正音婆邏,云次第。」

Word Segmentation Tools

Place Name, Person Name, and Calendar Lookup

This information is from Buddha Ngram Viewer.

Page 26

Future Work

Page 27

Future Work

Keep adding temporal and spatial information about sutras:

• Taishō Shinshū Daizōkyō, Shōwa Hōbō Sōmokuroku.

• The Korean Buddhist Canon: A Descriptive Catalogue, by Dr. Lewis R. Lancaster, 1979.

Complete the sutra reading interface and continue to integrate more related information into the platform.

Keep bringing new ideas to the platform.

Ex

Page 28

Thank you for listening.

Q & A !!