Page 1: Supplementary Material for Hadoop Recognition of ...lik/publications/Wei-Ai-IEEE... · Hadoop Recognition of Biomedical Named Entity Using Conditional Random Fields Kenli Li, Wei


Supplementary Material for Hadoop Recognition of Biomedical Named Entity Using Conditional Random Fields

Kenli Li, Wei Ai*, Zhuo Tang, Fan Zhang, Lingang Jiang, Keqin Li, and Kai Hwang

1 CONDITIONAL RANDOM FIELDS MODEL

This section describes the CRF approach in more detail.

1.1 Conditional Random Fields

In Figure 1, we give a simple but concrete example to explain the two-phase biomedical named entity recognition using CRF. In Figure 1(a), B-protein in the output denotes the beginning of a protein entity, and I-protein denotes the remaining tokens of that entity. Because the final recognition result uses the BIO format, "peri-kappa B factor" in the input is recognized as a protein. Figure 1(b) gives the major components of the CRF model.
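The BIO convention described above can be illustrated with a small Python sketch that groups tagged tokens back into entities. The tokenization of "peri-kappa B factor" into three tokens, the surrounding words "the" and "binds", and the function name `bio_to_entities` are assumptions made for illustration; they are not taken from the paper.

```python
def bio_to_entities(tokens, tags):
    """Group BIO tags into (entity_text, entity_class) pairs."""
    entities = []
    current_tokens, current_class = [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A B- tag always opens a new entity, closing any open one.
            if current_tokens:
                entities.append((" ".join(current_tokens), current_class))
            current_tokens, current_class = [token], tag[2:]
        elif tag.startswith("I-") and current_class == tag[2:]:
            # An I- tag of the same class continues the open entity.
            current_tokens.append(token)
        else:
            # "O" (or an I- tag with no matching open entity) closes the entity.
            if current_tokens:
                entities.append((" ".join(current_tokens), current_class))
            current_tokens, current_class = [], None
    if current_tokens:
        entities.append((" ".join(current_tokens), current_class))
    return entities


# Hypothetical tokenization; only "peri-kappa B factor" comes from the text.
tokens = ["the", "peri-kappa", "B", "factor", "binds"]
tags = ["O", "B-protein", "I-protein", "I-protein", "O"]
print(bio_to_entities(tokens, tags))  # [('peri-kappa B factor', 'protein')]
```

Here the B-protein tag opens the entity and the two I-protein tags extend it, so the three tokens are reported as one protein mention.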

1.2 L-BFGS Algorithm for CRF

As shown in Figure 2, L-BFGS first sets the parameters to an initial value $\vec{\lambda}_0$. It then repeatedly improves the parameter estimates: $\vec{\lambda}_1, \vec{\lambda}_2, \ldots$. The whole process uses a loop with a given number $L$ of iterations. To go from $\vec{\lambda}_t$ to $\vec{\lambda}_{t+1}$, L-BFGS finds the search direction $\vec{P}_t$, decides the step length $a_t$, and moves in this direction, that is, $\vec{\lambda}_{t+1} = \vec{\lambda}_t + a_t \vec{P}_t$. The key to finding the search direction $\vec{P}_t$ is calculating the gradient vector $\nabla L_t$.
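The loop described above can be sketched in pure Python as follows. This is a minimal illustration, not the implementation used in the paper: the toy quadratic objective stands in for the CRF negative log-likelihood, and the backtracking line search, the memory size, and all function names (`lbfgs_direction`, `lbfgs_minimize`) are assumptions.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def axpy(a, x, y):
    """Return a * x + y, elementwise."""
    return [a * xi + yi for xi, yi in zip(x, y)]


def lbfgs_direction(grad, history):
    """Two-loop recursion: search direction P_t from the gradient and recent (s, y) pairs."""
    q = list(grad)
    alphas = []
    for s, y in reversed(history):          # newest pair first
        rho = 1.0 / dot(y, s)
        a = rho * dot(s, q)
        alphas.append((a, rho))
        q = axpy(-a, y, q)
    if history:                             # scale by an initial Hessian guess
        s, y = history[-1]
        gamma = dot(s, y) / dot(y, y)
        q = [gamma * qi for qi in q]
    for (s, y), (a, rho) in zip(history, reversed(alphas)):  # oldest pair first
        b = rho * dot(y, q)
        q = axpy(a - b, s, q)
    return [-qi for qi in q]


def lbfgs_minimize(f, grad_f, lam0, num_iters=50, memory=5, tol=1e-10):
    """Run at most num_iters iterations (the L of the text), starting from lam0."""
    lam, g, history = list(lam0), grad_f(lam0), []
    for _ in range(num_iters):
        if dot(g, g) < tol:
            break
        p = lbfgs_direction(g, history)     # search direction P_t
        a, f0, slope = 1.0, f(lam), dot(g, p)
        # Backtracking (Armijo) line search for the step length a_t.
        while a > 1e-12 and f(axpy(a, p, lam)) > f0 + 1e-4 * a * slope:
            a *= 0.5
        new_lam = axpy(a, p, lam)           # lambda_{t+1} = lambda_t + a_t * P_t
        new_g = grad_f(new_lam)
        s = [n - o for n, o in zip(new_lam, lam)]
        y = [n - o for n, o in zip(new_g, g)]
        if dot(s, y) > 1e-12:               # keep only positive-curvature pairs
            history = (history + [(s, y)])[-memory:]
        lam, g = new_lam, new_g
    return lam


# Toy objective (an assumption): a convex quadratic with minimum at (1, -2).
def f(lam):
    return (lam[0] - 1.0) ** 2 + 10.0 * (lam[1] + 2.0) ** 2


def grad_f(lam):
    return [2.0 * (lam[0] - 1.0), 20.0 * (lam[1] + 2.0)]


lam_hat = lbfgs_minimize(f, grad_f, [0.0, 0.0])
print(lam_hat)
```

The only problem-specific inputs are `f` and `grad_f`, which matches the observation that the gradient computation is the key step of each iteration.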

1.3 Viterbi Algorithm for CRF

We give a simple example in Figure 3 to explain the detailed steps of Algorithm 2. Figure 3(a) shows Steps 1–3, and Figure 3(b) illustrates Step 4.
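For readers without access to the figure, a generic Viterbi decoder can be sketched as a forward pass followed by backtracking. This is a textbook formulation, not a transcription of Algorithm 2: the names `alpha` (best score so far) and `phi` (back-pointer), the two-label setting, and all numeric scores below are illustrative assumptions.

```python
def viterbi(init_scores, trans_scores, emit_scores):
    """Return (best label path, best score).

    init_scores[j]     : score of starting in label j
    trans_scores[i][j] : score of moving from label i to label j
    emit_scores[t][j]  : score of label j at position t
    """
    n_steps, n_labels = len(emit_scores), len(init_scores)
    alpha = [[0.0] * n_labels for _ in range(n_steps)]  # best score ending in j at t
    phi = [[0] * n_labels for _ in range(n_steps)]      # best previous label

    # Initialization at the first position.
    for j in range(n_labels):
        alpha[0][j] = init_scores[j] * emit_scores[0][j]

    # Recursion: extend the best partial path into each label.
    for t in range(1, n_steps):
        for j in range(n_labels):
            best_i = max(range(n_labels),
                         key=lambda i: alpha[t - 1][i] * trans_scores[i][j])
            phi[t][j] = best_i
            alpha[t][j] = (alpha[t - 1][best_i] * trans_scores[best_i][j]
                           * emit_scores[t][j])

    # Termination: pick the best final label.
    last = max(range(n_labels), key=lambda j: alpha[-1][j])

    # Backtracking: follow the back-pointers from the last position.
    path = [last]
    for t in range(n_steps - 1, 0, -1):
        path.append(phi[t][path[-1]])
    path.reverse()
    return path, alpha[-1][last]


# Hypothetical scores for 2 labels over 4 positions.
init = [0.6, 0.4]
trans = [[0.7, 0.3], [0.4, 0.6]]
emit = [[0.5, 0.1], [0.4, 0.3], [0.1, 0.6], [0.1, 0.7]]
path, score = viterbi(init, trans, emit)
print(path, score)  # path is [0, 0, 1, 1]
```

Practical CRF decoders usually work with sums of log-scores rather than products to avoid underflow on long sequences; products are used here only to keep the example easy to check by hand.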

• Kenli Li, Wei Ai, Zhuo Tang, Lingang Jiang, and Keqin Li are with the College of Information Science and Engineering, Hunan University, Changsha, Hunan, China, 410082. E-mail: Kenli Li ([email protected]), Wei Ai ([email protected]), Zhuo Tang ([email protected]), Lingang Jiang ([email protected]).

• Fan Zhang is with Kavli Institute for Astrophysics and Space Research, Massachusetts Institute of Technology, Cambridge, MA 02139, USA. E-mail: f [email protected]

• Keqin Li is also with the Department of Computer Science, State University of New York, New Paltz, New York 12561, USA. E-mail: [email protected].

• Kai Hwang is with the Department of Electrical Engineering, University of Southern California, Los Angeles, CA 90089, USA. E-mail: [email protected].

• Corresponding author: Wei Ai

(a) The two-phase approach using CRF

(b) The CRF model

Fig. 1: An example of biomedical NER based on CRF

2 EXPERIMENTAL SOFTWARE AND HARDWARE CONFIGURATIONS

Tables 1 and 2 provide the software and hardware configurations in the Hadoop cluster.


Fig. 2: The process of L-BFGS algorithm

(a) Steps 1–3

(b) Step 4

Fig. 3: An example for the process of the Viterbi algorithm


TABLE 1: The software and hardware configurations in the Hadoop cluster

Node type    Operating system    CPU                 Memory    Quantity
NameNode     openSUSE 11         4-core, 3.07 GHz    8 GB      1
DataNode     openSUSE 11         4-core, 2.70 GHz    8 GB      40

TABLE 2: Key configuration in the Hadoop cluster

Configuration item    Configuration property                     Value
Map slots             mapred.tasktracker.map.tasks.maximum       4
Reduce slots          mapred.tasktracker.reduce.tasks.maximum    2
Copy threads          mapred.reduce.parallel.copies              5
HDFS replications     dfs.replication                            1
Input file size       dfs.block.size                             64 MB
DataNode heartbeat    dfs.heartbeat.interval                     3 s
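To make the scale of this configuration concrete, the short sketch below derives the cluster-wide slot counts from Tables 1 and 2. It assumes that every one of the 40 DataNodes also runs a TaskTracker and that one map task is created per HDFS block; neither assumption is stated in the tables. The 20 GB input size is purely hypothetical.

```python
import math

# Values taken from Tables 1 and 2.
NUM_DATANODES = 40
MAP_SLOTS_PER_NODE = 4        # mapred.tasktracker.map.tasks.maximum
REDUCE_SLOTS_PER_NODE = 2     # mapred.tasktracker.reduce.tasks.maximum
BLOCK_SIZE_MB = 64            # dfs.block.size


def total_slots(num_nodes, slots_per_node):
    """Cluster-wide slots, assuming every DataNode runs a TaskTracker."""
    return num_nodes * slots_per_node


def num_map_tasks(input_size_mb, block_size_mb=BLOCK_SIZE_MB):
    """Number of map tasks, assuming one task per HDFS block."""
    return math.ceil(input_size_mb / block_size_mb)


map_slots = total_slots(NUM_DATANODES, MAP_SLOTS_PER_NODE)
reduce_slots = total_slots(NUM_DATANODES, REDUCE_SLOTS_PER_NODE)

# Hypothetical 20 GB input, used only to illustrate the arithmetic.
hypothetical_input_mb = 20 * 1024
tasks = num_map_tasks(hypothetical_input_mb)
waves = math.ceil(tasks / map_slots)

print(map_slots, reduce_slots)  # 160 80
print(tasks, waves)             # 320 2
```

Under these assumptions the cluster can run 160 map tasks and 80 reduce tasks concurrently, so an input that splits into 320 blocks would need two waves of map tasks.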

