Implementing Web Page Classification Method by Anchor-related Text


DEWS2007 C7-5

Implementing Web Page Classification Method by Anchor-related Text

Masanori OTSUBO†, Bui QUANG HUNG†, Yoshinori HIJIKATA†, and Shogo NISHIDA†

† Graduate School of Engineering Science, Osaka University

1-3 Machikaneyama, Toyonaka, Osaka 560-8531, JAPAN

E-mail: †{otsubo,bqhung}@nishilab.sys.es.osaka-u.ac.jp, ††{hijikata,nishida}@sys.es.osaka-u.ac.jp

Abstract With the exponential growth of information on the Internet, web pages need to be categorized automatically. Many studies extract keywords from web pages and classify the pages using those keywords. Recent studies extract keywords not only from the target page to be categorized but also from the pages that link to it. However, these approaches apply the same extraction method even when the format of the web pages differs. In our research, we change the extraction method according to the format of each web page so that it adapts to that page.

Key words Web page classification, Anchor-related text, Local Semantic Portion, Upper-level Semantic Portion

1. ÐÒÑÔÓÖÕ×ÎØÎÙÛÚ�ÜÎÝßÞÎà�áãâåä�æÎçÎè�éßê�ëÎæÎìÎí�î�ïÌðòñòó�ôÌõöÙ

2005Øø÷Pù

Google úøûøüPý Þ3ÝoþøÞPÿ��øé 80 ��� æ Web� Þ�� ú���� � ó��� �Îç�è�æ�ê�ë������Ù��Þ��æ��������������î����������� ú�!#"�$�% ñòó����òæ ú ÷�&�'�(����)* % çÎè�æ�+ ,.-/��Þ�� ú ������çÎè�î 0 ü�1 ��æ�é2 !'�(3� ï54 ÙYahoo!6 Excite

�87:9 �8� � Web ý�;�<8= â> ú@?A$CBAD@�C� óA�@�C� " , " � � - æ Web ý�;�<C= â >��ô 1 � Web� Þ���æEF�é GH' %� � óKô'õöÙ 80 ��� æ� Þ��Îî�E�F�����Îé�I�J ú (���LK���'ÎÙ Web

� Þ���î�M�N"O/P EF " ?�Q /��� RS úT�U� ó�V�ï�WAX æAYZA'RéZÙEAFA[A\Kæ � Þ@�

( ]A^ Ù�ÝZÞA_�á'â � Þ�),`-aKÎæ � Þb���dce���efbgßÿ âih�E îejbk " ÙCK æb+òæ�lm�î D � ó Web

� Þ�� î OnP E�F " ó�� ï��:� � épo ÝßÞ�_ á

â � Þ���î q �?$ 9 " ó��� lm�éÎÝ�Þ_�á'â � Þ��+�(� r# /� Q/s�t � u�v��ïHZ'�(��� " , " Ý�Þ�_�á'â � Þ�+�Îé�Ùwx " � K�æ � Þ���î y�z��� ?�Q % {�|�ú (�� é�I#- % ���� ú E,�ñòó���� [1]�d} t�~ Ù TOYOTA

æ��ÎÞ� � Þ��

(http://toyota.com/)��é�Ù��@+�æ9 { ' “Cars”

6 “Trucks” ���ñßïAlAmRéC��ùA�@� � æKæKÙ�f@gZÿqâ�'9K÷

�e� ï OnPd� ��ce��� {d| é % ���C� ï Ù Aflac( ��� >a� Üd���� > Þ��A��)æ��Þ � � Þ��

(http://www.aflac.com/)+

�Îé“insurance”

n� Q lm�é�(� ú “health” 6 “cancer” %ðÅæl�m ú�% ��æ'�Ù������� % æ, O/P� ����� % æ,�é E�,.- % ��K�� '�Ý�Þ_�áÌâ � Þ�� '�é %$ ÙAK�æ � Þ��� > Ü =�" ó��� � Þ��( ]d^ Ù > Ü =�� � Þ�� )

����� � ��l�m�î D �Îó�EF����YZ ú ×�Ø�.� � � ó��� � ]^ ��Ù > Ü =�� � Þ��+æAlmRî D �KóEAF�î T.Q RSA�A}�îC���A� � Bulm-

[2]éZÙ

� Ü � ÞAf@gZÿqâ�� � 1 � æ��Rî D �KóAEAF " óA�@� ú Ù > Ü =C�� Þ�����Pæ:f�goÿ â î� Q �� Køæ���3é����� U - %, ñòïd�L� ïÎÙChakrabarti

-[3] 6 Furnkranz

-[4]éßÙ � Ü� Þf�g�ÿqâÅî ����������� 6 > Ü = � � Þ���æ0�k " î��ñóEF " ó��� ú Ù���� ú���$% �� /E�F���� ú���$% ñ�ó�ï3�

Glover-

[1]éoÙ > Ü =C� � Þ@�oæ � Ü � Þ���� 25

l3mî���� � Ü � Þf�g�ÿ$â ! �" " Ù � Ü � Þf@g�ÿ$â Ù���� �Ü � Þf�g�ÿqâ Ù#�f@g�ÿ$âÏæ Q�$ ÙA��x ��ú�%�& '�(A� ,�î'�( ��) " ó��� ��*#-�é���� � Ü � Þ�f�g�ÿ$â ú q � ��� ú? �. /� Q ���î�+ " ï�" , " � � - æ:R S3é Ù���,�Þ.-�á'â �:c U - x/ �0213 Þ 3 ' � Ü � Þ�4�5f�g�ÿ$âÅî j�k " ó�ô'õöÙ > Ü = � � Þ��æ���,ÎÞ�-�á'â ú�6 ì�'�(��� �î����. " ó��� ����æ�ï4ZÙ�798�: 6�; +�< % ð!=�> % ÿRÝZÚ 3 '�?A, � ï Web

� Þ���@ D ���� ÅÙ�A�cCB % f�g�ÿ$â hE�î j�k " ó " � Q�D�EF ú�G ��H RS'�é�Ù � Ü � Þ��c�I " ó��� D�E F æ G �f�g�ÿ$âhE�æ���î j�k " EF�� D ����� /'�Ù ? õ�����æ G � O/P EFRî@� JA�L��KK÷3YAZKæ

1 ü " óKÙ Web� Þ@�oæ

DOM LM î 0���� ú ����- � � � DOMéCN�?�æ�O�P�Q L M 6 N�?R æ ��=CS ÿ�YZÎÙTN#?�h�E R æ#U#VÎæ�Y�Z�îC #"��� � æ [5]'�ÙKÎæ�W�7���X = âåæZY�['õ�é�N�\ÎæZY�['õ (��]#��c�I

ú (�� _^ U� � ��`�>�é�a�= % ��,�Þ#-�á'â/��@�b�����ï�4��ÙDOM < þ 3 æ�N�? L M ,.- � Ü � Þ��c�I��� f@g�ÿ$âhdE( ]d^ Ù � Ü � ÞbccI�fbgòÿ â ) îejbke�b���b òîe�ed��b�e�

]�f Ù 2. g '���d���� � Ü � Þ�c�I�f�gßÿ$â nK�æj�k Y�Z� ü �Îó |�h Ù 3. g ' H RS�' D ��� SVM� ü ��ó�il�� yz��������ïÎÙ

4. g ' H HZ�î�j�k����ßï�4�CK�l " ïEFnmÿf � æ�o � î z#+��� � 5. g '�V�p " ï�EFZmßÿf � � ? �j�k�K#q�æ���î�+ " Ù 6. g '�# 4� �r#��æ�s�t�î ��� Ù Hu î�v�4 $�$ � �2. wyx{zZ|~}Z�~�Z�{���2. 1 �~������ Ü � Þ�c�If�g�ÿ$âÏæj�k Y���î���4��(�ï'õöÙ������[6]î T ñòï���� � éßÙd9 1

æ ?�Q �CK#��æ Web� Þ��Îî��

Table 1 Web pages used in the preliminary survey

                          Official Page   Personal Page
Target pages                    50              50
Link-source pages              752             356
Total link-source pages            1108

¦ �§Q � |§h � ÙA� x Ý Þ�_�á'â � Þ � " ó Official

Page� �

2 � Personal Page� �

3 � îòÙ K �e¨�� 50� Þb� x ü Open

© ª1 «�¬®­°¯�±³²�´ µ·¶�¸·¹»º�¼ ½¾µ ¿ÁÀ�Âô·µ ¶¾Ä�­°¯_±cÅ�Æ_ÇcȳÉËʾÌ#ÂÎÍÏ ¸ËÐ!ÑÓÒ»Ô�Õ Ö ×_Ø < Ahref = ”xxx.html” > Ù�Ù�Ù < /A > º»Ú�ÛÝÜ

È°Ù�Ù�Ù»Í Ï Ò© ª2 «�¬ßÞ_àÁ¸ Web ´ µ ¶�¸·áÁâÝã³äËÖ!ã°´ µ ¶�¸ËÐ Ñ© ª3 «�¬æå Ï ¸³çËè!ãcéÓê¾ë ì»Ü·í°î�ï·Ö!ã Web ´ µ ¶�¸·Ð Ñ

Directory [7], -�ðøÜñ � ��òó " ï:��� � - 100

� Þ�� æÝßÞ�_�áÌâ � Þ���K �#¨�� ��[���� > Ü = � � Þ��ÎîßÙ Googleæ > Ü = � Þ@�C�3� [8]

î D �Kó�ó54ZïA���Zæ��KÙ�ó@4A� > Ü=C� � Þ@��æKäAIRîõôRÝZÞA_�á'â � Þ@�A�C[ " 20

� Þ@�A� "ó������ æ ?�Q � " ó�ó�4�- � ï > Ü = � � Þ�� 1108

� Þ����e[ " Ùð æ�h#ö ú ÝßÞ�_ á â � Þ�����c���� {�|Îú (��ßæ�,#� � " ï����æ�� � � ? õöÙ � Ü � Þ�c�If�gZÿ$â/��é “Local Semantic

Portion(]�^ oLSP

r)”

“Upper-level Semantic Portion(]^ o

USPr

)”æ * V $ 2 ÷ F ú (���� ú E�,�ñòï�� ]�^ Ù��� ���,.-��� " ï�Ù LSP

USPæj�k YZ�î |�h � �

2.2 Local Semantic Portion (LSP)

An LSP is the text portion surrounding an anchor, including the anchor itself. In terms of document structure, it is the content located at the same level as the anchor node. The four LSP extraction methods are described below.

2. 2. 1 � Ü � Þ ú oü���< P >

rZý��(� ���� Ü � Þ ú �#��+���(��� �VßÙ��#��+��Cþ T Ý#: ú�%�1���~ Ù�����#��î j�k ��� �#����+��þ T Ý�: ú ( ��~ Ù#����+�æ�þ

T Ý�:ßæ�ö�ÿ�c�B�î s��." ój�k ������ö�ÿ�c�B�é�Ù#þ T Ý�:æ��#��� � Ü � Þ ú (� ��� �þ T Ý�:�æ������ f�g�ÿ$â ú (� �#� æ 2 ÷ F ú (� �eþ T Ý#:�æ��#��� � Ü � Þ ú (� ���é�Ù � Ü � Þ�æ�����(��þ T Ý�: ,#-���æ � Ü � ÞÎæ����(�þ T Ý�: �'�î j�k ��� ( � 1)��þ T Ý�:�æ������ f�g�ÿqâ ú(� �#� é�Ù���æ � Ü � Þ�æ����(���þ T Ý�: ,.- � Ü � Þ�æ���(���þ T Ý�: �'�î j�k���� ( � 2)

�1 LSP � ������ ����� � ¤����

�2 LSP ������ ��� ����� � ¤���

���'�Ù‘A’é�ÝßÞ_�áÌâ � Þ���æ � Ü � ÞÎÙ ‘a’

é K�æ���æ� Ü � ÞÎÙ ‘BR’

éCþ T î�9��L���ÎïÎÙ��#� ú f�Þ#7 3 æ S 3 ����� ����� (�ñ�ï ú Ù���4�ó���� " , ���." % �Îï�4�Ù s��" % ���� � " ï�2. 2. 2 � Ü � Þ ú o > ÿXâ

< OL >,< UL >, < DL >r

ý�(�� ���� Ü � Þ ú Ordered List(OL) 6 Unordered List(UL)

æ �Údf � 'b(d�� eVòÙ<LI>Ýe:e'dE 1 - � ï��`�»�e� îej�k " ( �

3)Ù

Definition List(DL)æ � ÚAf � '5(3�@ VZÙ <DT>

<DD>'E 1 - � ï��.�_�#��î�j�k���� ( � 4)

�Ordered List

æ � Úf � æ�+��Ù� $�ü ,�þ T Ý�: ú (� ����ú (�ñ�ï ú Ù���4�ó��� " , �#�."�% �Îï�4�Ù! �" %�s�� é�#x Ù!$n��ð� C0�=��ßÙ�.���#��î j�k ���

( � 5)�

�3 OL � UL� ¤����

�4 DL

� ¤���

�5 ��� ���������� � ¤���

2. 2. 3 � Ü � Þ ú o f Þ#7 3< TABLE >

r�ý��d(�� �e�� Ü � Þ ú føÞ7 3 ý3�:(:�� 8V ÙÌð æ $ - �øæ S 3 �ú ñ�ó����æ, s�� ����w� ú (� ��K�� '���æ � Ü � Þ�î 0ü�1 � ��' S 3 î���� " Ù���Ý�Þ�Ü� ? õ/j�k� ���î�� t �� � " ï��x�Ý�Þ�_�á'â R æ � Ü � Þ�î ��� S 3 î 0 ü�1 ï.-ÅÙ�K�æ��� æ S 3 � K�æ���æ � Ü � Þ ú (� ,�� h � �K�æ��Îæ � Ü� Þ ú�%�1 ��~ Ù�������æ S 3 � K�æ��Îæ � Ü � Þ ú (�� ,'ðQ ,�î�� h � � ��" 0n1 T æ S 3 � K�æ���æ � Ü � Þ ú�%1 �~ Ù�ä ^ æ S 3 R � ������� ��� Q ��ñßï�����î K�æ���æ � Ü� Þ ú 0 ü ,�� �'��'õ�� " Ù�Ý�Þ_�áÌâ � Þ�� R æ � Ü � Þ /K � � [�b���� f�g�ÿ$â hE�î�j�k ��� � ¦ �}�î � 6

�7� ���� �

‘A’é�Ý�Þ_�á'â � Þ@��æ � Ü � Þ�Ù ‘a’

é KKæ��æ � Ü � ÞÎÙ ‘T’

é f�g�ÿ$âÅî 9�L�

�6 � ������ ¤���

� 6æ ��� Ù 2 T 3 � � � � Ü � Þ ú (� ��x � Ü � Þ�æ��� æ S 3 î�� h Ù'ð $ - � f�g�ÿ$â '�(��� ú�� ñßï� �-��eKÎæ�� æ S 3 îC� h ( ��� ����� " ï �#� é��Îæ T R�� � )

Ù�Kæ � Ü � Þ�î 0 ü�1 ï���A�C'�Ù 2 T �Zî 0Kó����@ o � Ü� Þ� f�gßÿ$â r�æ�!�"�'�(� � ? ñ�óÎÙ�Ý�Þ_�á'â � Þ�� Ræ � Ü � Þ�î ��� S 3 ÙK�æ � �(��f�g�ÿ$â S 3 ú j�k [\. % � �

�7 � ������ ¤���

���ßÙ � 7î 0Îó#���� ÅÙ � Ü � Þ�é 2 T 4 � ���(�� �!$��ð �0C=�� K�æ ��� æ S 3 î�� h Ù���æ � Ü � Þ�î 0 ü�1 ��d' Ùõ��� " ód� $ ��òæd} æ �e� é o f�gòÿ â# � Ü � Þ�r�æ!�"�� % ñßó��� � ? ñßó�Ù 2 T �_��ó�î j�k ������ �� % � �

2.2.4 When the anchor is in a division <DIV>

When the LSP is inside a division, the same processing as in Section 2.2.1 is performed.

2.3 Upper-level Semantic Portion (USP)

A USP is a text portion that gives an upper-level description of the anchor. In terms of document structure, it is the content located at a level above the anchor node. The USP extraction method is described below.

First, the page title is regarded as related to the anchor. Headers such as H1, H2, ..., H6 are also related to the anchor. When there are several headers at the same level, the header located immediately above the anchor is the one regarded as related to it. Figure 8 shows an example: there are two <H2> headers, and the one immediately above the anchor (<A>) is extracted.
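The header rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation (the paper's system works on a DOM tree built with a Java parser): the class name `UspExtractor`, the function `extract_usp`, and the choice to drop deeper headers whenever a new header of a higher level starts are all hypothetical.

```python
from html.parser import HTMLParser

HEADER_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


class UspExtractor(HTMLParser):
    """Collects the page title and the headers located above a target anchor."""

    def __init__(self, target_href):
        super().__init__()
        self.target_href = target_href
        self.title = ""
        self.latest = {}    # header level -> text of the latest header seen
        self.usp = None     # filled in when the target anchor is reached
        self._open = None   # tag whose text is currently being captured
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag == "title" or tag in HEADER_TAGS:
            self._open, self._buf = tag, []
        elif tag == "a" and self.usp is None:
            if dict(attrs).get("href") == self.target_href:
                headers = [self.latest[k] for k in sorted(self.latest)]
                self.usp = [self.title] + headers if self.title else headers

    def handle_data(self, data):
        if self._open:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag != self._open:
            return
        text = " ".join("".join(self._buf).split())
        if tag == "title":
            self.title = text
        else:
            level = int(tag[1])
            # Assumption: a new header replaces the previous one at its level
            # and invalidates deeper headers of the previous section.
            self.latest = {k: v for k, v in self.latest.items() if k < level}
            self.latest[level] = text
        self._open = None


def extract_usp(html, target_href):
    """Return [title, header, ...] for the first anchor pointing at target_href."""
    parser = UspExtractor(target_href)
    parser.feed(html)
    return parser.usp or []


page = """<html><head><title>Car makers</title></head><body>
<h2>Japan</h2><p><a href="http://toyota.com/">Toyota</a></p>
<h2>Germany</h2><p><a href="http://bmw.example/">BMW</a></p>
</body></html>"""
print(extract_usp(page, "http://bmw.example/"))
```

With two `<H2>` headers on the page, only the one immediately above the matching anchor is returned together with the page title.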

�8 USP .�/�0 � ��1Z�2��3�4���5�� ¤����

�@� � Ü � Þ ú o f�Þ�7 3<TABLE>

r9ýA�A(� ��� æj@k YZ�î |#h � �f�Þ�7 3 *�á�ñ�Þ�é � Ü � Þ. c�B ú (� ú Ù >Ü = � � Þ���æ�6�7�é�f�Þ�7 3 *�á#ñ�Þ�î( �'õ� U�% ���� ú ���#� � 'E,�ñ�ï���8�/�Ù�6�7�é f�Þ�7 3 *�á#ñ�Þ�æ7 Uõ �ßÙ�K�æfÎÞ�7 3 æ

1 T � 6 ÙÎä#ö < þ 3 æf�Þ#7 3 î D �� � ? ñ�ó�Ù ��" f�Þ�7 3 æ S 3 � � Ü � Þ ú ( ��~ ÙK�æqä#� �e� Ü � Þ���cCBÎæ�(���fbgßÿ!â h�E�'�(��� s�t - � ���� � -Åæ�� � ? õöÙf�Þ�7 3 ý� � Ü � Þ ú (� ��� Ù���æ?�Q ��j�k ����� /��� �9

1 : f�Þ�7 3 *�á�ñ�Þ ú ( ��~ j�k ��� ( � 9)92 : f�Þ�7 3 æ

1 T � ú ��æ�� ? õ ��;�%1 �@~ Ù 1 T �î j�k����( � 10

�)9

3 : fKÞ�7 3 æ�ô T � � Ü � Þ ú (Rõ Ù 1 T � �Zæ�� � Ü� Þ ú�% � ��� Ù 1�.��î j�k ���

( � 10�

)94 : ä { 1< 3

æ�V>=Rî?�@�Zä�ö < þ 3 æAfKÞ�7 3 � ü �ó T.Q �

�9 � ������4���5�� ¤���

�10 � ����� 1 @�A�¤���

3. SVM

SVM [9] [10] stands for Support Vector Machine and is one of the machine learning algorithms. Other machine learning algorithms such as Naive Bayes are also known, but among them SVM is characterized in particular by its high generalization performance. The objects classified in this research are documents on the Web, which are represented as vectors whose dimensions are thousands to tens of thousands of words. For this reason, we considered SVM to be particularly suitable for this research. SVM performs classification in the following steps.

Step 1: The features of the training data are given to the SVM for learning.
Step 2: Margin maximization is performed to determine the separating hyperplane.
Step 3: The test data are applied to the obtained model.
Step 4: The test data are classified.

What distinguishes SVM from other machine learning algorithms is the margin maximization in the second step. Margin maximization takes as reference the training data of each class that lie near the boundary between the classes (the support vectors), and determines the separating hyperplane so that the margin between the support vectors and the hyperplane becomes as large as possible (Fig. 11).

Fig. 11 Margin maximization

Besides margin maximization, SVM uses techniques such as the "soft margin" and the "kernel trick" to handle data that are not linearly separable. The soft margin allows some training data to fall on the wrong side of the margin. The kernel trick maps the feature vectors into a higher-dimensional space and performs linear separation in that space.

4. 9;:=<Z�~��>H RdS�'ej#k#Keq�� D �b�#mòÿ�f � æ#o � î � 12

�C+��8� HmZÿAf � éZÙ�? � Þ@�

((�CEAF " ïA� � f � > ��A�@� � Þ�

) /K �] @ æ�A � Þ���æ URL

> ÿ$âÅî�B C ý Þ�Ý@ /��� �B�C ý Þ�Ý��é���� D�ý Þ�Ý� /f�ÿ$â D�ý Þ�Ý ú w�'�Ù����D�ý Þ�Ýßæ�������î � ��f�ÿ$â D�ý Þ�ÝßæEF����=DFE�÷�Gîk�C��� � ]�f é��ænm�ÿf � æ�o � � K�ñ�ó�Ù�K#��æ�K�q' T.Q V�=�î |�h � �

4. 1 H#I�J�K�L�M�N�O�P�QÝ�Þ_�á'â � Þ���æURL

æ > ÿ$â ú B�C � � �� ÅÙ Google

SOAP Search API (Beta) [11](]3^ Ù GoogleAPI)î B�D." ó

�12 R���S�¤�T#U

K ��¨�� æ�Ý�Þ_�áÌâ � Þ��� [��� > Ü = � � Þ���î�ò�ó�� ���oæ���ò�ó3�@� > Ü = � � Þ5�KéoÙ Google ú �A� > ÿqâî K�æ�� D ����� �é#x Ù ]^ æ�� ; 3 Ý�î D ��ó %�V�%> Ü =�� � Þ���æ�W�î�ò�ó��� ���æ�� ; 3 Ý��é�Ù�Ý�Þ_�á'â � Þ�� R æ � Ü � Þ ú ��ù "% � Web

� Þ���î#X �." ó " � Q �� �î�Y 1 ��Z�[ (H�!

4a) Ù � Þ��#ý > Ü = î " ó���� Web� Þ�� 6 0 ÷ æ Web

� Þ� � �

4 � îõa���X � " ó " � Q �@ Zî�Y 1 ��Z [ (H>!

3a,5a) ú(���� � Þ��#ý > Ü =�6 0 ÷ � Þ��ÎéßÙ Web' ?�$#\ ������]� % 0#t'�(� �

1) Extract the domain DT of the target page T.
2) Extract the domain DX(i) of the link-source page X(i) returned by GoogleAPI.
3a) If DT = DX(i), regard the link as an intra-site link and do not collect X(i). Add 1 to i and go to 2).
3b) If DT ≠ DX(i), obtain the source SX(i) of X(i) from the Google cache and go to 4). (The cache is used because a link-source page may have been updated so that its anchor to the target page no longer exists.)
4a) If SX(i) does not exist in the Google cache, or if SX(i) does not contain SDT, the representative domain name of DT (described later in this section), do not collect X(i). Add 1 to i and go to 2).
4b) If SDT is present in SX(i), go to 5).
5a) If DX(i) = DX(k) for an earlier page X(k), do not collect X(i). Add 1 to i and go to 2).
5b) If DX(i) matches none of DX(1) to DX(i−1), collect X(i). Add 1 to i and go to 2).

4a) ����� ��� i�j¡ �¢ k�l�w�£�¤�{�¥ ��¦#§�¨�©�ª ij= «¢ k l � u­¬  «¢ k l � }�®d¯�°�±�²d³�° xµ´ ¶ �«� � }d·� ¤�z ©�ª�¸ ¹º��� { v “www.” x�»�¤ � URL

� } ¼ ½�}¿¾À Á

4 ÂFà BLOG Ä�ŵÆÈÇÊÉ�Ë«ÌÎÍ«Ï�ÐÒѵÓÕÔÕÖ�×ÙØ�ÚÜÛ#ÝßÞ�àâá�Ääãæå!ç�èÕé�êëæìæíÊî�ï�ðÊñ Þäò�ËÕó�ô«õ�ö�êÊ÷�×ßøÊù�ç Web úæôæûæüþý�ÿ��À Á5 ÂFà ��� ñ�� Þ URL �#ÚFÍ!ô���� (“www” �µÚ���Ý� Îç ) ü���� î����Ï�Ë����æç ��� üFý«ÿ���� à “http://www.google.com” Ñ�Ú�� “google.com”ç����À Á6 ÂFà i � GoogleAPI ê���ÿ� �!À Á7 ÂFÃ#"Îô�$µÇµÐ úäôæû��µÚ%"Îô�$µÇµÐ úäôæû�&æç('ßË�)«ç ��� � “home &�*Ý ” � �,+�'ÕË�) ñ.- Ý�/!ô�0�ê«Ö����À Á8 ÂFà http://www.google.co.jp/help/features.html#cached ü21�3

���  �� � }���� � © ( � � www.abc.com � s�t v abc }��� ) ª������ v�� }�������� r � u� � ¦ ´ ¶#� © � � u#��� ¤ ª���� � ��������° p “www” � © � u�� �#� ¤ � ��� ª��z t “http://news.abc.com/” }�� � �! ���"�} ��# v �����° u “news” �� ©�ª$ � � v ������° w � ¦&% ¤ � s�© °�'�}�(�) x&*�+ ��ª () u v Yahoo ,.-0/�132 � [12] }�4 3 5�6 � � w#��78 s {�¤© 40,190 ���" x&9�:¿w&*�+ ��ª (�)�}<;�= v 10 ���"&><?�A@<B s { ¤ © �A�<� ° } �DC 2äxFE 2 w!G � ª�HAI<J �¿u v� s �ä}��<����° x�» ¤ ��K } URL } �!L v ¼�½ }�¾ �M�  � � }���� x&N j¡ �¢ k�l �MO�P�����ª

Q2 R&SMT&U�V�W&X&U�YMZ&[&\&]

arts, astro, chem, cs, education, fan, health, home,

homepage, honors, info, law, lib, library, math, med,

members, music, news, people, pharmacy, physics,

psych, sci, science, state, usembassy, users, web
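The collection procedure of steps 1) to 5b) can be sketched as a simple filter. This is a minimal illustration under assumptions, not the system itself: `cached_source` is a hypothetical callable standing in for the Google cache lookup, and `representative_name` uses an assumed rule (the label just left of the top-level domain, as in www.abc.com giving abc) that ignores the subdomain list of Table 2 and would not handle domains such as google.co.jp.

```python
from urllib.parse import urlsplit


def domain(url):
    """Host name without a leading 'www.' (DT and DX(i) in the steps)."""
    host = urlsplit(url).hostname or ""
    return host[4:] if host.startswith("www.") else host


def representative_name(url):
    """Assumed rule: the label just left of the top-level domain, e.g. 'abc'."""
    labels = domain(url).split(".")
    return labels[-2] if len(labels) >= 2 else labels[0]


def collect_link_sources(target_url, candidates, cached_source):
    """Filter candidate link-source URLs following steps 1) to 5b).

    cached_source(url) is a hypothetical lookup that returns the cached HTML
    source of url, or None when no cached copy exists.
    """
    dt = domain(target_url)                      # step 1
    sdt = representative_name(target_url)
    earlier_domains = set()
    collected = []
    for url in candidates:
        dx = domain(url)                         # step 2
        duplicate = dx in earlier_domains        # DX(i) equals some DX(k), k < i
        earlier_domains.add(dx)
        if dx == dt:                             # step 3a: intra-site link
            continue
        source = cached_source(url)              # step 3b
        if source is None or sdt not in source:  # step 4a
            continue
        if duplicate:                            # step 5a
            continue
        collected.append(url)                    # step 5b
    return collected


cache = {
    "http://fans.example.org/links.html": '<a href="http://www.abc.com/">ABC</a>',
    "http://fans.example.org/more.html": '<a href="http://abc.com/">ABC again</a>',
    "http://other.example.com/": "<p>no link here</p>",
}
candidates = [
    "http://www.abc.com/about.html",
    "http://fans.example.org/links.html",
    "http://fans.example.org/more.html",
    "http://nocache.example.net/",
    "http://other.example.com/",
]
print(collect_link_sources("http://www.abc.com/", candidates, cache.get))
```

Only one page per domain survives, and pages without a cached copy or without the representative domain name in their source are skipped.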

4. 2 ^ ��_&`�a�b�c�d�egf�h�i�j4. 1 k��mlmn ��� � lm1po m�q" }qrp� C }msmt u v “Cy-

berNeko HTML Parser [13] u v 9 w ” p�x�y;�!� DOM z x�{�|��© � � �A*+ � ª � }<}A�A~ u v HTML �A� }<�A� xF�A� �©F�A� u v 10 w xF��+�{ ¤ © } � v�� � ¦F�A� s { ¤ � ¤ HTML

r&� C w � 9������ ©�ª���DOM z } � � � � ��� � 2 ���" | }�� l�� � x�{�|

� ©#ª � }�� v � ��� � 2 ��"µ} URL � � l�� ��} URL x�<� w!�<��� ©�� URL p ���<� � v � l!� � x&�<� ��� � ¤��# p�� ¤ � �!p (�)���� � + �dª � ��� � 2 ��"#} URL

x “http://www.abc.com” ��� { >�� w URL } ��� ����� ��;� � ¤�� x&��� ©�ª

• http://www.abc.com/index.html (a file name that denotes the same URL is written explicitly)
• http://abc.com (the subdomain is omitted)
• http://dinamic.cgi?url=http://www.abc.com (the link is routed through another site)
• http://www.abc.org (the top-level part of the domain name differs)

In this system, these problems are handled by comparing URLs in stages when anchors are extracted. "In stages" means that the five comparison methods below are applied in order, starting from 1), to all anchors in the DOM tree.

1) Exact match with the URL of the target page
2) Comparison after removing "http://", a trailing '/', and the like
3) In addition to 2), comparison after removing the subdomain (www, etc.)
4) If the URL contains '?', extraction of the target URL from within the whole URL
5) Comparison by the representative domain name described in Section 4.1
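A staged comparison of this kind could look like the following sketch. It is an illustration only: the function names are hypothetical, treating a trailing index.html as removable in stage 2 is an assumption made so that the first example URL matches, and the representative domain name is approximated as the label left of the top-level domain.

```python
from urllib.parse import urlsplit, parse_qsl


def strip_scheme_and_slash(url):
    """Stage 2: drop the scheme, a trailing index file (assumption) and '/'."""
    url = url.strip()
    for prefix in ("http://", "https://"):
        if url.lower().startswith(prefix):
            url = url[len(prefix):]
    for suffix in ("index.html", "index.htm"):
        if url.lower().endswith("/" + suffix):
            url = url[: -len(suffix)]
    return url.rstrip("/").lower()


def strip_subdomain(url):
    """Stage 3: additionally drop a leading 'www.'."""
    bare = strip_scheme_and_slash(url)
    return bare[4:] if bare.startswith("www.") else bare


def embedded_urls(url):
    """Stage 4: URLs carried as query-string values after '?'."""
    query = urlsplit(url).query
    return [value for _, value in parse_qsl(query) if value.startswith("http")]


def representative_name(url):
    """Stage 5 (assumed rule): label just left of the top-level domain."""
    host = strip_subdomain(url).split("/")[0]
    labels = host.split(".")
    return labels[-2] if len(labels) >= 2 else labels[0]


def match_level(href, target):
    """Return 1..5 for the first stage that matches, or 0 for no match."""
    if href == target:
        return 1
    if strip_scheme_and_slash(href) == strip_scheme_and_slash(target):
        return 2
    if strip_subdomain(href) == strip_subdomain(target):
        return 3
    if any(match_level(u, target) in (1, 2, 3) for u in embedded_urls(href)):
        return 4
    if representative_name(href) == representative_name(target):
        return 5
    return 0


target = "http://www.abc.com"
for href in ("http://www.abc.com/index.html", "http://abc.com",
             "http://dinamic.cgi?url=http://www.abc.com", "http://www.abc.org"):
    print(href, match_level(href, target))
```

Returning the stage number rather than a boolean makes it easy to see how loose a match was, which matters because stage 5 can also accept wrong links.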

9 ÂFý¼D¾ Ô �À¿ÂÁ êÀÃDÄ Ã Java 1.1¿�Å Þ Xerces 2.0.0 [14]

¿MÅÀ Á10 ÂFà � à Web úæôæû�çÀÆÈÇ�êÀÉËÊÍÌ!Û îDÎ�Ï "äÌ�üÑÐÂÒÈÓ

Comparison method 5) can also extract wrong links, but because it is applied only after comparisons 1) to 4), it still achieves an accuracy of about 95%. The overall anchor extraction accuracy of 1) to 5) was about 97%. Based on the anchors to the target page obtained by the above processing, the anchor-related text is extracted using the methods described in Section 2.

4. 3 è�é�ê h�i�j4. 2 k }<ë�ì �<n � s � � l&� �íå&æ<ç�~ C 2�x SVM |<îï 8 � ©���Ú w v $ s ��}�ð�ñ�ò x#´�¶�� ©�ª ð�ñ�ò�} ´�¶ ë

ì u “ó�ô ��õ �MC 2 ë y ” � “ð�ñ�ò ´ ¶ ” } 2 ©�5 w�ö�� ¦� r � s�©�ª�÷ 13 w ð�ñ�ò ´�¶ }�ø s x&G�� ª

ù13 ú&û&ü&ý&þ�ÿ

a ) ó�ô ��õ �MC 2 ë y�A� v óAô �Aõ �ÂC 2 } o � � © �Aõ �ÂC 2 1 x ë y � © ª� s u�î ï w % ¤ © ³ { } � l�1&o ���" w� © v � l&� ��åæ�ç�~ C 2 � } ��õ x���� � ¦���¨ v�� ��õ }���Þ x��y ����MC 2F�� ©�ª� w v ��õ �MC 2 1

� }�����Þ õ x�»�» � v �õ �MC 2 2 x!n ©�ª�� w� ¿l 2�� ¾!� � C [1] x���� � v ¿l2�� ¾�� � C } ö0�#¤ ¶�� �È?�� t

��õ x���� ��õ ��� { v t o } ó�ô ��õ ��C 2!x#´ ¶#� ©#ª � } ó�ô ��õ ��C 2!u v SVM

} î ï�� ç C 2 }�� ¶ w % ¤ ©�ªb ) ð�ñ�ò ´ ¶ w v SVM w������ � �¿© ð<ñ<ò x�´¿¶�� ©�ª����º� u ¨��� v�� � l�1&o ���"� � w ð�ñ�ò x�´�¶ � v ��!� ��� �2 ��" | }�ð�ñ�ò x�#���� ©�ª £ � � ó�ô ��õ ��C 2�u ³�³� 1 £ � p v ð�ñ�ò u�� ��� � 2 ��"�}�!�� r ë y 8 s�©#ª" ³ ª �&ð�ñ�ò ´�¶ ¶ ¯ x §�¨�©�ª ��� v � l�1&o ���" �} ó�ô ��õ �MC 2�w ��� � © ��õ }���Þ x ! z ©�ª � ��õ }��Þ x ���" � }�³ ��õ ! ��#�+�{ ��$�% � v ��õ�& 1 2 � x&n©�ª��(' � ��§�¨���) � v ���&� ��� � 2 ���" | } � l�1&o m�²" } �mõ*& 1 2 � u ³ { y+��� © ª � � � {mn � s � �� ��� � 2 ���" w&9�� © ��õ�& 1 2 � p ð�ñ�ò � � © u v 12 w ª><� v 4. 3. 1 k � ���<Þ õ x�»<» � ©!¶ ¯ x § ¨ v 4. 3. 2 kÀ Á

11 ÂFà 4. 1 ,æÞ 4. 2 , � ø Ï.- +ÙÑ./�0�ü21 +âê«Þ 4. 1 , � “ 3�4 ñ Ñ�� Webúæôæûæü��#× ” �.� ü6587 �:9 î /�0 ñ.-�; Þ 4. 2 , � “ ÃÂÄ�Ñ����µüÕØ�< ð>=ÿ ”�.� ü?587 �:9 î /�0 ñ Þ�@6AÎê>BæÑ�Ý��À Á

12 ÂFÃDC?Eäç!Å�0�F:G ñ��IH�J úäôäû�çLK�üNM�O ��98P � ÝÕê�QLR J ç.S�T�çK�ü ¾�� P � Ý À UWV 7ÈÄ�X � ¾.� P �æÑ��! îZY Q>[�\:X>]�^�ѵÓÎü ¾.�«Û �à`_ J Ô� NM�a ñ ÊæÝ

��u� �l 2�� ¾&� � C } O�P x������ ©�ª4. 3. 1 ����Þ õ } »�»��� v���� x�� O � © � � w ��� v ���<Þ õ x�»�» � ©�ª >

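A word $f$ that satisfies the following condition is treated as a low-frequency word:

\[ \frac{\text{number of positive examples containing } f}{\text{number of all positive examples}} < \tau \;\cap\; \frac{\text{number of negative examples containing } f}{\text{number of all negative examples}} < \tau \]

Here $\tau$ is a threshold. Removing low-frequency words shortens the word list and is also expected to improve classification accuracy.

4.3.2 Entropy loss

Let $C$ and $f$ be
• $C$: the event that a page belongs to a given category
• $f$: the event that a page contains a given word
and let $\Pr()$ denote the probability that the event in parentheses occurs. The following quantities can then be defined.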

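• $\Pr(C)$: the probability that a page belongs to the category
• $\Pr(\bar{C})$: the probability that a page does not belong to the category
• $\Pr(f)$: the probability that a page contains the word
• $\Pr(\bar{f})$: the probability that a page does not contain the word

These are obtained as follows.

\[ \Pr(C) = \frac{\text{number of positive examples}}{\text{number of all examples}} \]

\[ \Pr(\bar{C}) = \frac{\text{number of negative examples}}{\text{number of all examples}} = 1 - \Pr(C) \]

\[ \Pr(f) = \frac{\text{number of examples containing } f}{\text{number of all examples}} \]

\[ \Pr(\bar{f}) = \frac{\text{number of examples not containing } f}{\text{number of all examples}} = 1 - \Pr(f) \]

Using these definitions, the prior entropy is expressed as

\[ e = -\Pr(C)\log\Pr(C) - \Pr(\bar{C})\log\Pr(\bar{C}) \tag{1} \]

Next, let the conditional probabilities $\Pr(C|f)$, $\Pr(\bar{C}|f)$, $\Pr(C|\bar{f})$ and $\Pr(\bar{C}|\bar{f})$ be

\[ \Pr(C|f) = \frac{\text{number of positive examples containing } f}{\text{number of examples containing } f} \]

\[ \Pr(\bar{C}|f) = \frac{\text{number of negative examples containing } f}{\text{number of examples containing } f} \]

\[ \Pr(C|\bar{f}) = \frac{\text{number of positive examples not containing } f}{\text{number of examples not containing } f} \]

\[ \Pr(\bar{C}|\bar{f}) = \frac{\text{number of negative examples not containing } f}{\text{number of examples not containing } f} \]

Then the posterior entropy $e_f$ when $f$ is contained and the posterior entropy $e_{\bar{f}}$ when it is not contained are expressed as

\[ e_f = -\Pr(C|f)\log\Pr(C|f) - \Pr(\bar{C}|f)\log\Pr(\bar{C}|f) \]

\[ e_{\bar{f}} = -\Pr(C|\bar{f})\log\Pr(C|\bar{f}) - \Pr(\bar{C}|\bar{f})\log\Pr(\bar{C}|\bar{f}) \]

Therefore the posterior entropy is

\[ e' = e_f\Pr(f) + e_{\bar{f}}\Pr(\bar{f}) \tag{2} \]

and, from (1) and (2), the entropy loss $E$ is obtained as

\[ E = (\text{prior entropy}) - (\text{posterior entropy}) = e - e' \]

Entropy loss takes a small value for words that appear in both the positive and the negative examples. Ranking words by entropy loss and keeping the top-ranked ones, as described in Section 4.3, is therefore based on the idea that a word appearing as often in the positive examples as in the negative examples is of little use for classification. Removing low-frequency words alone cannot eliminate such words; discarding words with small entropy loss removes them as well. This two-stage filtering is expected to improve classification accuracy.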
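The two-stage word selection (low-frequency removal followed by ranking on entropy loss) can be sketched as below. This is a minimal illustration, not the authors' code: each page is assumed to be represented as a set of words, the function names are hypothetical, the logarithm is taken to base 2 (the formulas do not state the base), and the defaults of 0.07 and 1000 are the values quoted in Section 5.2.

```python
import math


def _entropy(p):
    """Binary entropy -p log p - (1 - p) log (1 - p), taking 0 log 0 as 0."""
    return -sum(q * math.log2(q) for q in (p, 1.0 - p) if q > 0.0)


def is_low_frequency(word, positives, negatives, tau):
    """True when the word occurs in less than tau of both example sets."""
    pos_rate = sum(word in page for page in positives) / len(positives)
    neg_rate = sum(word in page for page in negatives) / len(negatives)
    return pos_rate < tau and neg_rate < tau


def entropy_loss(word, positives, negatives):
    """E = e - (e_f Pr(f) + e_fbar Pr(fbar)) for one word."""
    n_pos = len(positives)
    n = n_pos + len(negatives)
    pos_with = sum(word in page for page in positives)
    neg_with = sum(word in page for page in negatives)
    n_with = pos_with + neg_with
    n_without = n - n_with
    prior = _entropy(n_pos / n)
    e_f = _entropy(pos_with / n_with) if n_with else 0.0
    e_fbar = _entropy((n_pos - pos_with) / n_without) if n_without else 0.0
    posterior = e_f * n_with / n + e_fbar * n_without / n
    return prior - posterior


def select_words(positives, negatives, tau=0.07, top_k=1000):
    """Drop low-frequency words, then keep the top_k words by entropy loss."""
    vocabulary = set().union(*positives, *negatives)
    kept = [w for w in vocabulary
            if not is_low_frequency(w, positives, negatives, tau)]
    # Ties are broken alphabetically so that the result is deterministic.
    kept.sort(key=lambda w: (-entropy_loss(w, positives, negatives), w))
    return kept[:top_k]


positives = [{"jazz", "music"}, {"jazz", "band"}]
negatives = [{"music", "movie"}, {"game", "movie"}]
print(select_words(positives, negatives, top_k=2))
```

In this toy data "music" appears equally in both sets and gets an entropy loss of 0, while "jazz" and "movie" separate the sets perfectly and rank first.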

4.4 Learning and classification with SVM

Learning and classification are performed using the feature values obtained through the above processing. At test time, however, the selected word list is not created again during feature extraction; the word list created at learning time is used. SVM is used for learning and classification, and this system uses "LIBSVM [15]".

5. =?>A@?B4. ä�� §�¨�� � C ç"; x % ¤�{ vDC�E�Ø�F x&*0+ ��ª�H ä��u vâØ�F w % ¤ � , � ��G � 2 � Ø�F�H�I�v�Ø�F ;�= x §�¨ ©#ª

5.1 Data set

As target pages for the experiment, we collected web pages belonging to the following categories of Yahoo! Directory [12] (the number in parentheses is the number of collected pages).

Set 1 Science/Biology (634)
Set 2 Science/Biology/Zoology/Animals (1594)
Set 3 Entertainment/Music (757)
Set 4 Entertainment/Music/Genres/Rock&Pop (634)
Set 5 Recreation/Sports (1090)
Set 6 Recreation/Sports/Baseball/MajorLeague (713)
Set 7 Computers/Internet (953)
Set 8 Computers/Internet/WWW/Weblogs (698)

O�P ¶�Q u ¬ �� "DR l ��} � ç�� � x 2 £ � £�� 4 S�O�P� v , � � 1,3,5,7 u�T²8 2 � � v , � � 2,4,6,8 u�T²8 4 �� O�P�� © � � · � ©�ª "�R l � w � +�{ ����Ý�Þ w�U p ¶©&� v � ç�� � } T8�w � +�{ ����Ý�Þ w�U p ¶ ©�� v x ��Ú�© � ��p ��� © � �#w�O�P ����ª � � v } � ��� � 2 ��" � � { v Yahoo! Directory ³ ³ � �Â?   } � çN� �WV r {4942 ���" x�O�P ����ª w Google SOAP Search API (Beta) [11] x % ¤�{ v O�P��� 9 , � � ( , � � 1� 8 XY m, � � )

$ sm°�s } � �m� �2 ���" ( Û 12 Z ��í" ) w&9�� © � l�1!o ���" ( Û 200 [ ���" ) x�O�P ����ª

5.2 Experimental method

In this experiment, 150 pages were selected at random from each set of the target pages collected as described in Section 5.1; 50 pages were used as the training set and 100 pages as the test set. At most 50 link-source pages were used for each target page.

The threshold for low-frequency word removal described in Section 4.3.1 was set to 0.07, and the top 1000 words by the entropy loss described in Section 4.3.2 formed the selected word list. These are the same values that Glover et al. [1] used in their experiment.

5.3 Experimental results

The proposed method uses the "anchor-related text" extracted from the link-source pages described in Section 5.1. This method is referred to below as "STP". The method of Glover et al. [1], which classifies using the "25 words before and after the anchor", is used as the method for comparison. Among conventional web page classification methods, it achieves the highest classification accuracy. The comparison method is referred to below as "Fix".

Figure 14 shows the classification F-measure for each set. The F-measure is (2 × precision × recall)/(precision + recall). Precision indicates how many of the pages classified as positive by the classifier are truly positive target pages, and recall indicates how many of the positive target pages are classified as positive by the classifier.
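For reference, the F-measure defined above can be computed as in the following short sketch. The function name and the boolean-list input format are assumptions made for illustration.

```python
def f_measure(predicted, actual):
    """F = 2PR / (P + R) for boolean predictions against boolean labels."""
    true_positives = sum(p and a for p, a in zip(predicted, actual))
    if true_positives == 0:
        return 0.0
    precision = true_positives / sum(predicted)  # correct among predicted positive
    recall = true_positives / sum(actual)        # found among actual positive
    return 2 * precision * recall / (precision + recall)


predicted = [True, True, False, True]
actual = [True, False, True, True]
print(f_measure(predicted, actual))
```

The value lies between 0 and 1; multiplying by 100 gives the scale on which a score of 80 is read in the results.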

Fig. 14 Classification results for each set (F-measure)

For STP, the F-measure is 80 or higher for every set, whereas for Fix it varies from set to set. Looking more closely at the Fix results, the F-measure is higher for Set 2 than for Set 1 and for Set 4 than for Set 3; in general it is higher for Sets 2, 4, 6 and 8 than for Sets 1, 3, 5 and 7. As stated in Section 5.1, Sets 1, 3, 5 and 7 have a category depth of 2, and Sets 2, 4, 6 and 8 have a category depth of 4. This leads to the hypothesis that Fix is weaker than STP at classifying broad categories.

We examine this hypothesis. Figure 15 shows the portions extracted by STP (LSP, USP) and by Fix (25 words before and after the anchor). As the figure shows, Fix often extracts a wider range than the LSP. That is, Fix also extracts the text before and after the smallest, most closely related portion (the LSP), and the noise contained in that surrounding text adversely affects classification accuracy.

Fig. 15 Extracted portions

When an anchor points to a target page that belongs to a narrow category, the text before and after the LSP is close to the content of the target page. Conversely, when an anchor points to a target page that belongs to a broad category, the text before and after the LSP is likely to contain text that differs from the content of the target page. For example, if the target page is specific and narrow, such as a page about Jazz, the links before and after it are likely to be about Pops or Rock. In contrast, if the target page is broad, such as a page about Music in general, the links before and after it may be about Movies or Games. Not every page has this kind of structure, but we consider that such a tendency exists. From these observations, we consider that the proposed method can classify stably whether the target page belongs to a narrow category or a broad one.

5.4 Additional experiment

Next, a more detailed experiment is carried out on STP. As stated in Section 2.1, STP consists of LSP and USP; we examine which of the two contributes more to classification.

Fig. 16 Classification results by LSP/USP (F-measure)

Figure 16 shows the classification results when LSP and USP are extracted separately. Although the order is reversed for Set 3, USP generally gives higher classification accuracy than LSP. A likely reason is that USP extracts portions such as "Title" and "Header" that rarely contain noise, whereas LSP tends to extract portions that contain noise.

We therefore examine which of the LSP extraction methods (see Section 2.2) brings in this noise. For this purpose, the LSP extraction methods are divided into the following five.

• DIV: the anchor is in a division
• TBL: the anchor is in a table
• DLI: the anchor is in a definition list
• LST: the anchor is in a list
• PGF: the anchor is in a paragraph

Fig. 17 Classification results by LSP extraction method

The first panel of Figure 17 shows the classification results when only one LSP extraction method is used at a time. "LSP" is the result when all five extraction methods are used (adding up the numbers of words from DIV through PGF gives the number of words of LSP). The figure shows that the amount of words used for classification, as seen for DIV and DLI, affects classification accuracy more than the difference between LSP extraction methods does.

Next, USP is added to each LSP extraction method to secure a minimum number of words, and the classification accuracy is examined again. The result is shown in the second panel of Figure 17. Although DLI shows slightly lower classification accuracy, none of the LSP extraction methods has a markedly adverse effect on classification accuracy.

The findings of the additional experiment in this section are as follows.
• When LSP and USP are extracted separately and their classification accuracies are compared, USP classifies with higher accuracy.
• None of the LSP extraction methods has an adverse effect.
These findings suggest that, in web page classification, it is important to extract the words used for classification so that noise is not included, but it is also necessary to secure a certain amount of words. Further verification is needed, but this may be a characteristic of web page classification methods that use an SVM to classify into two classes.

6. Conclusion

In this paper, we proposed a web page classification method that uses anchor-related text, and described the definition of anchor-related text and its extraction method. Anchor-related text consists of two kinds, LSP and USP, and an extraction method suited to the HTML tags is defined for each. We also described how link-source pages are collected and how anchors are identified within a page, and showed how feature values are created from the extracted anchor-related text.

The evaluation experiment showed that, whereas the Fix method is weak at classifying pages that belong to broad categories, the proposed method classifies stably. When the LSP and USP of the proposed method were used separately for classification, USP gave higher classification accuracy, but this does not mean that LSP adversely affects classification.

As future work, the relationship between the number of words and classification accuracy mentioned in Section 5.4 needs further verification. We would also like to examine how classification accuracy changes when the threshold for low-frequency word removal and the number of top-ranked words selected by entropy loss, described in Section 5.2, are varied.

References

[1] Eric J. Glover, Kostas Tsioutsiouliklis, Steve Lawrence, David M. Pennock, Gary W. Flake: Using Web Structure for Classifying and Describing Web Pages, WWW2002, pages 562-569, Hawaii, USA, 2002.

[2] Avrim Blum, Tom Mitchell: Combining Labeled and Unlabeled Data with Co-Training, COLT98, pages 92-100, Madison, USA, 1998.

[3] Soumen Chakrabarti, Byron Dom, Piotr Indyk: Enhanced hypertext categorization using hyperlinks, SIGMOD'98, pages 307-318, Seattle, USA, 1998.

[4] Johannes Furnkranz: Exploiting Structural Information for Text Classification on the WWW, IDA'99, pages 487-497, Berlin, Germany, 1999.

[5] World Wide Web Consortium: Document Object Model (DOM) Level 1 Specification Version 1.0, 1998, http://www.w3.org/TR/REC-DOM-Level-1/

[6] Bui Quang Hung, Masanori Otsubo, Yoshinori Hijikata, Shogo Nishida: Extraction of Semantic Text Portion Related to Anchor Link, IEICE Transactions on Information and Systems, pages 1834-1847, Vol. E89-D, No. 6, June 2006.

[7] Netscape: dmoz - Open Directory Project (ODP), http://dmoz.org/

[8] Google Advanced Search, http://www.google.com/advanced_search

[9] Takio Kurita: Introduction to Support Vector Machines (in Japanese), http://www.neurosci.aist.go.jp/~kurita/lecture/svm/svm.html

[10] Support Vector Machine, http://www.bi.a.u-tokyo.ac.jp/~tak/svm.html

[11] Google SOAP Search API (Beta), http://code.google.com/apis/soapsearch/

[12] Yahoo! Directory, http://dir.yahoo.com/

[13] Andy Clark: CyberNeko HTML Parser, http://people.apache.org/~andyc/neko/doc/html/

[14] The Apache Software Foundation: Xerces2 Java Parser, http://xerces.apache.org/xerces2-j/

[15] Chih-Chung Chang, Chih-Jen Lin: LIBSVM: A Library for Support Vector Machines, http://www.csie.ntu.edu.tw/~cjlin/libsvm/