Web Semantics: Science, Services and Agents on the World Wide Web 3 (2005) 211–223

Flink: Semantic Web technology for the extraction and analysis of social networks

Peter Mika∗

Department of Computer Science, Vrije Universiteit Amsterdam (VUA), De Boelelaan 1081, 1081HV Amsterdam, The Netherlands

Received 24 May 2005; accepted 25 May 2005

Abstract

We present the Flink system for the extraction, aggregation and visualization of online social networks. Flink employs semantic technology for reasoning with personal information extracted from a number of electronic information sources including web pages, emails, publication archives and FOAF profiles. The acquired knowledge is used for the purposes of social network analysis and for generating a web-based presentation of the community. We demonstrate our novel approach to social science based on electronic data using the example of the Semantic Web research community. © 2005 Elsevier B.V. All rights reserved.

Keywords: Semantic Web; Social networks; Ontology extraction; Social ontology

1. Introduction

The possibility to publish and gather personal information (such as the interests, works and opinions of our friends and colleagues) has been a major factor in the success of the web from the beginning. Remarkably, it was only in the year 2003 that the web became an active space of socialization for the majority of users. That year saw the rapid emergence of a new breed of web sites, collectively referred to as social networking services (SNS). The first-mover Friendster1 attracted over 5 million registered users in the span of a few months [13], which was followed by Google and Microsoft starting or announcing similar services.

Although these sites feature much of the same content that appears on personal web pages, they provide a central point of access and bring structure to the process of personal information sharing and online socialization. Following registration, these sites allow users to post a profile with basic information, to invite others to register and to link to the profiles of their friends. The system also makes it possible to visualize and browse the resulting network in order to discover friends in common, friends thought to be lost or potential new friendships based on shared interests. (Thematic sites cater to more specific goals, such as establishing a business contact or finding a romantic relationship.)

∗ Tel.: +31 20 5987753; fax: +31 20 5987653. E-mail address: [email protected] URL: http://www.cs.vu.nl/pmika

1 http://www.friendster.com

1570-8268/$ – see front matter © 2005 Elsevier B.V. All rights reserved. doi:10.1016/j.websem.2005.05.006




The latest breed of social networking services combines social networks with the sharing of content such as bookmarks, documents, photos and reviews. The idea of network-based knowledge sharing is based on the sociological theory that social interaction creates similarity and, vice versa, similarity creates interaction (friends are likely to have acquired or to develop similar interests). Lately, the notions of ratings and social network-based trust have also been investigated as a filtering mechanism in loosely controlled environments.

Despite their early popularity, users have later discovered a number of drawbacks to centralized social networking services. First, the information is under the control of the database owner who has an interest in keeping the information bound to the site. The profiles stored in these systems cannot be exported in machine processable formats, and therefore the data cannot be transferred from one system to the next. (As a result, the data needs to be maintained separately at different services.) Second, centralized systems do not allow users to control the information they provide on their own terms. Although Friendster follow-ups offer several levels of sharing (e.g. public information versus only for friends), users often still find out the hard way that their information was used in ways that were not intended.

These problems have been addressed with the use of Semantic Web technology. The friend-of-a-friend (FOAF) project2 is a first attempt at a formal, machine processable representation of user profiles and friendship networks. Unlike with Friendster and similar sites, FOAF profiles are created and controlled by the individual user and shared in a distributed fashion.3 Much like the way web pages are linked to each other by anchors, these profiles link to the profiles of friends by using the rdfs:seeAlso relation, creating the so-called FOAF-web.

The alert reader may note that for the purposes described above, namely providing a structured representation of user profiles, the use of XML technologies would have sufficed. In fact, the real value of FOAF is that it represents an agreement on key terms and that it is described in a semantic format (namely, OWL full).

These properties make FOAF the ideal basis for the semantic integration of personal information extracted from heterogeneous knowledge sources.4

Flink, the system introduced in this paper, is to our knowledge the first that exploits FOAF for the purposes of social intelligence. By social intelligence, we mean the semantics-based integration and analysis of social knowledge extracted from electronic sources under diverse ownership or control. In our case, these sources are largely the natural byproducts of the daily work of a community: HTML pages on the web about people and events, emails and publications. From these sources, Flink extracts knowledge about the social networks of the community and consolidates what is learned using a common semantic representation, namely the FOAF ontology.

The raison d'etre of Flink can be summarized in three points. First, Flink is a demonstration of the latest Semantic Web technology (and as such a recipient of the Semantic Web Challenge Award of 2004). In this respect, Flink is interesting to all those who are planning to develop systems using Semantic Web technology for similar or different purposes. Second, Flink is intended as a portal for anyone who is interested to learn about the work of the Semantic Web community, as represented by the profiles, emails, publications and statistics. Hopefully Flink will also contribute to bootstrapping the nascent FOAF-web by allowing the export of the knowledge in FOAF format. This can be taken by the researchers as a starting point in setting up their own profiles, thereby contributing to the portal as well. Lastly, but perhaps most importantly, the data collected by Flink is used for the purposes of social network analysis, in particular learning about the nature of power and innovativeness in scientific communities.

In this paper the focus is on the first two aspects of Flink. We begin with the introduction of the system from a user perspective in Section 2. In Section 3, we describe the architecture of Flink in detail and discuss the lessons that have been learned while developing its components. We briefly introduce the idea of network analysis using Flink in Section 4. Related and future work are discussed in the last two sections of this paper.

2 http://www.foaf-project.org.

3 FOAF profiles are typically posted on the personal website of the user and linked from the user's homepage with the HTML LINK tag.

4 While FOAF carries a necessary level of commitment, the maintainers of the ontology are also careful not to overly restrict the interpretation of the ontology in order to keep its wide appeal to different communities and usage scenarios.


Fig. 1. Semantic Web researchers and their connections across the globe.

2. Flink: a who is who of the Semantic Web

Flink is a presentation of the professional work and social connectivity of Semantic Web researchers. For our purposes, we have defined this community as those researchers who have submitted publications or held an organizing role at any of the past International Semantic Web Conferences (ISWC02, ISWC03, ISWC04) or the Semantic Web Working Symposium (SWWS01).5 This means a community of 608 researchers from both academia and industry, covering much of the United States, Europe and to a lesser degree Japan and Australia (see Fig. 1).

Flink takes a network perspective on the Semantic Web community, which means that the navigation of the website is organized around the social network of researchers. Once the user has selected a starting point for the navigation, the system returns a summary page of the selected researcher, which includes profile information as well as links to other researchers that the given person might know. The immediate neighbourhood of the social network (the ego-network of the researcher) is also visualized in a graphical form (see Fig. 2).

The profile information and the social network are based on the analysis of webpages, emails, publications and self-created profiles. (See the following section for the technical details.) The displayed profile information includes the name, email, homepage, image, affiliation and geographic location of the researcher, as well as his interests, participation at Semantic Web related conferences, emails sent to public mailing lists and publications written on the topic of the Semantic Web. The full text of emails and publications can be accessed by following external links. At the time of writing,6 the system contained information about 5147 publications authored by members of the community and 8185 messages sent via five Semantic Web-related mailing lists.

The navigation from a profile can also proceed by clicking on the names of co-authors, addressees or others listed as known by this researcher. In this case, a separate page shows a summary of the relationship between the two researchers, in particular the evidence that the system has collected about the existence of this relationship. This includes the weight of the link, the physical distance, friends, interests and depictions in common as well as emails sent between the researchers and publications written together.

The information about the interests of researchers is also used to generate an ontology of the Semantic Web community. The concepts of this ontology are research topics, while the associations between the topics are based on the number of researchers who have an interest in the given pair of topics (see Fig. 3). An interesting feature of this ontology is that the associations created are specific to the community of researchers whose names are used in the experiment. This means that unlike similar lightweight ontologies created from a statistical analysis of generic web content, this ontology reflects the specific conceptualizations of the community that was used in the extraction process (see the following section). Also, the ontology naturally evolves as the relationships between research topics change (e.g. as certain fields of research move closer to each other). For a further discussion on the relation between sociability and semantics, we refer the reader to [17].

5 A common alternative way of defining the boundary of scientific communities is to look at the authorship of representative journals (see e.g. [10]). However, the Semantic Web has had a dedicated journal only since 2004 and many Semantic Web related publications appear in journals not entirely devoted to the Semantic Web.

6 May 14, 2005.

Fig. 2. The social network of a researcher.

The visitor of the website can also view some basic statistics of the social network. Degree, closeness and betweenness are common measures of importance or influence in social network analysis, while the degree distribution attests to a general characteristic of the network itself (see Fig. 4). Geographic visualizations of the Semantic Web offer another overview of the network by showing the places where researchers are located and the connections between them (see Fig. 1).
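To make the three importance measures named above concrete, the following is a minimal sketch in Python for a small unweighted, undirected network. It is an illustration only: the paper does not show how Flink computes these statistics, and the function names, the adjacency-list representation and the example network are assumptions made here. Betweenness is computed with Brandes' algorithm, which is one standard choice.

```python
from collections import deque

def bfs_distances(graph, source):
    """Hop counts from source to every node reachable from it."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbour in graph[node]:
            if neighbour not in dist:
                dist[neighbour] = dist[node] + 1
                queue.append(neighbour)
    return dist

def degree(graph):
    """Number of direct ties of each node."""
    return {node: len(neighbours) for node, neighbours in graph.items()}

def closeness(graph):
    """Reachable nodes divided by the sum of distances to them (0.0 if isolated)."""
    scores = {}
    for node in graph:
        dist = bfs_distances(graph, node)
        total = sum(dist.values())
        scores[node] = (len(dist) - 1) / total if total else 0.0
    return scores

def betweenness(graph):
    """Brandes' algorithm: number of shortest paths passing through each node."""
    scores = {node: 0.0 for node in graph}
    for source in graph:
        stack = []
        preds = {n: [] for n in graph}
        sigma = {n: 0 for n in graph}      # number of shortest paths from source
        sigma[source] = 1
        dist = {source: 0}
        queue = deque([source])
        while queue:
            v = queue.popleft()
            stack.append(v)
            for w in graph[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = {n: 0.0 for n in graph}
        while stack:                        # accumulate dependencies, farthest first
            w = stack.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != source:
                scores[w] += delta[w]
    # Each undirected pair was counted from both endpoints.
    return {n: s / 2 for n, s in scores.items()}

# Hypothetical ego-network: "b" is connected to everyone else.
example_network = {"a": ["b"], "b": ["a", "c", "d"], "c": ["b"], "d": ["b"]}
```

In the example network, "b" scores highest on all three measures, which matches the intuition that a researcher connecting otherwise unconnected colleagues occupies a central position.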

3. System design

Similarly to the design of most Semantic Web applications, the architecture of Flink can be divided into three layers concerned with metadata acquisition, storage and visualization, respectively. Fig. 5 shows an overview of the system architecture with the three layers arranged from top to bottom. In the following, we describe the layers in the same order.

3.1. Acquisition

This layer of the system concerns the acquisition of metadata. Flink uses four different types of knowledge sources: HTML pages from the web, FOAF profiles from the Semantic Web, public collections of emails and bibliographic data. Information from the different sources is collected in different ways but all the knowledge that is learned is represented according to the same ontology (see the following section). This ontology includes FOAF and minimal extensions required to represent additional information.

Fig. 3. The ontology of research topics.

The web mining component of Flink employs a co-occurrence analysis technique first applied to social network extraction in the work of Kautz et al. [14]. Given a set of names as input, this component of the system uses the search engine Google to obtain hit counts for the individual names as well as the co-occurrences. (The term "Semantic Web OR ontology" is added to the query for disambiguation.) The strength of association between individuals is then calculated by normalizing separately with the page counts of the individuals. The resulting value is a non-negative real number from a power-law distribution. We consider this value as evidence of a directed tie if it reaches a certain predefined threshold and the hit counts for the individuals are also above a certain minimum, in order to ensure that the support for the co-occurrence is high enough.
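The tie extraction step can be sketched as follows. This is an illustration under stated assumptions, not the paper's implementation: the text does not give the exact formula, so the strength of the tie from one person to another is taken here to be the co-occurrence count divided by the first person's own page count, and the threshold, the minimum hit count and all names and numbers are hypothetical.

```python
def tie_strength(hits_source, hits_both):
    """Share of the source person's pages that also mention the other person."""
    return hits_both / hits_source if hits_source else 0.0

def extract_ties(page_counts, cooccurrence_counts, threshold=0.2, min_hits=10):
    """Return directed ties {(source, target): strength}.

    page_counts maps a name to its hit count; cooccurrence_counts maps an
    unordered pair (a, b) to the hit count of pages mentioning both names.
    threshold and min_hits are hypothetical values, not taken from the paper.
    """
    ties = {}
    for (a, b), both in cooccurrence_counts.items():
        # Require enough pages for both individuals to support the evidence.
        if min(page_counts[a], page_counts[b]) < min_hits:
            continue
        # Normalize separately with each person's page count: two directions.
        for source, target in ((a, b), (b, a)):
            strength = tie_strength(page_counts[source], both)
            if strength >= threshold:
                ties[(source, target)] = strength
    return ties

# Hypothetical hit counts for three researchers.
page_counts = {"Alice": 200, "Bob": 40, "Carol": 5}
cooccurrence_counts = {("Alice", "Bob"): 30, ("Alice", "Carol"): 4}
ties = extract_ties(page_counts, cooccurrence_counts)
```

With these numbers only the tie from Bob to Alice is kept: most of Bob's pages mention Alice, while only a small share of Alice's pages mention Bob, and Carol has too few pages to be considered. This asymmetry is what makes the resulting ties directed.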

The web mining component also performs the additional task of finding topic interests, i.e. associating researchers with certain areas of research. Given a set of names and a list of interests (or any other kind of domain concept), the system calculates the so-called Google Mindshare for each researcher to determine whether a given person is associated with a certain interest or not. The Google Mindshare of a person with respect to an interest is simply the number of pages where the names of the interest and the person co-occur divided by the total number of pages about the person. Note that we do not factor in the page count of the interests, since we are only interested in the expertise of the individual relative to himself.7 The resulting measure is again a non-negative real number with a power-law distribution. We assign the expertise to an individual if the logarithm of this value is at least one standard deviation higher than the mean of the logarithmic values. (Note that we are following here a 'rule of thumb' in network analysis practice.)
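The Mindshare computation and the one-standard-deviation rule can be sketched as below. Several details are assumptions made for this illustration because the text leaves them open: the mean and standard deviation are taken over all person-interest pairs, the sample standard deviation is used, and pairs with zero co-occurrence are skipped since their logarithm is undefined. All names and counts are hypothetical.

```python
import math
from statistics import mean, stdev

def mindshare(person_hits, joint_hits):
    """Pages mentioning both person and interest, divided by pages about the person."""
    return joint_hits / person_hits if person_hits else 0.0

def assign_interests(person_hits, joint_hits):
    """Return the set of (person, interest) pairs that pass the rule of thumb.

    person_hits maps a person to a page count; joint_hits maps
    (person, interest) to the co-occurrence page count.
    """
    logs = {}
    for (person, interest), joint in joint_hits.items():
        share = mindshare(person_hits[person], joint)
        if share > 0:  # assumption: zero Mindshare never yields an interest
            logs[(person, interest)] = math.log(share)
    if len(logs) < 2:  # a standard deviation needs at least two values
        return set()
    cutoff = mean(logs.values()) + stdev(logs.values())
    return {pair for pair, value in logs.items() if value >= cutoff}

# Hypothetical counts: only A has a large share of pages on the topic.
person_hits = {"A": 1000, "B": 1000, "C": 1000, "D": 1000}
joint_hits = {
    ("A", "ontologies"): 500,
    ("B", "ontologies"): 10,
    ("C", "ontologies"): 10,
    ("D", "ontologies"): 10,
}
experts = assign_interests(person_hits, joint_hits)
```

Working on the logarithm rather than the raw value keeps a handful of very large Mindshare values from dominating the mean, which fits the power-law distribution mentioned in the text.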

7 By normalizing with the hit count of the interests, the measure would assign a relatively high score (and an overly large number of interests) to individuals with many pages on the web. Since we normalize only with the page count of the person involved, we cannot compare the association strength across interests. However, this is not necessary for our purposes.

Fig. 4. The architecture of Flink from metadata acquisition (top) to the user interface (bottom).

FOAF profiles are gathered from the Semantic Web in two steps. First, an RDF crawler (a so-called scutter) is started to collect profiles from the FOAF-web. A scutter works similarly to an HTML crawler in that it traverses a distributed network by following the links (rdfs:seeAlso properties) from one document to the next. Our scutter is focused in that it only collects potentially relevant statements, i.e. those triples where the predicate is in the RDF, RDF-S, FOAF or WGS-84 namespace. The scutter also has a mechanism to avoid large FOAF producers that are unlikely to provide relevant data, in particular blog sites. (The overwhelming presence of these sites also makes FOAF characterization difficult, see [11].) The scutter also discards documents that are simply too large, and therefore unlikely to contain a personal profile. These restrictions are necessary to limit the amount of data collected, which can easily reach millions of triples after running the scutter for only an hour. In a second step, the FOAF individuals found in the collection are matched against the profiles of the members of the target community to select the relevant profiles from the collection. (See the following section.)
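The crawling behaviour described above can be sketched as a breadth-first traversal over rdfs:seeAlso links. The sketch makes several assumptions for illustration: documents are already parsed into lists of (subject, predicate, object) tuples, the `fetch` callable is supplied by the caller (here a dictionary lookup over a hypothetical in-memory web), document size is approximated by the number of triples, and the blocked host and size limit are invented values.

```python
from collections import deque
from urllib.parse import urlparse

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
RDFS = "http://www.w3.org/2000/01/rdf-schema#"
FOAF = "http://xmlns.com/foaf/0.1/"
GEO = "http://www.w3.org/2003/01/geo/wgs84_pos#"
RELEVANT_NAMESPACES = (RDF, RDFS, FOAF, GEO)
SEE_ALSO = RDFS + "seeAlso"

def scutter(seed_urls, fetch, blocked_hosts=(), max_triples_per_doc=1000):
    """Collect relevant triples by following rdfs:seeAlso links.

    fetch(url) must return a list of triples, or None if the document
    cannot be retrieved. blocked_hosts and max_triples_per_doc stand in
    for the blog-site and document-size restrictions.
    """
    seen = set(seed_urls)
    queue = deque(seed_urls)
    collected = []
    while queue:
        url = queue.popleft()
        if urlparse(url).hostname in blocked_hosts:
            continue  # skip large FOAF producers such as blog sites
        triples = fetch(url)
        if triples is None or len(triples) > max_triples_per_doc:
            continue  # unreachable, or too large to be a personal profile
        for s, p, o in triples:
            if p.startswith(RELEVANT_NAMESPACES):
                collected.append((s, p, o))  # keep only focused namespaces
            if p == SEE_ALSO and o not in seen:
                seen.add(o)
                queue.append(o)
    return collected

# Hypothetical in-memory FOAF-web with one blog-hosted profile.
web = {
    "http://a.example/foaf.rdf": [
        ("_:a", FOAF + "name", "Alice"),
        ("_:a", "http://purl.org/dc/elements/1.1/title", "Dr"),
        ("_:a", SEE_ALSO, "http://b.example/foaf.rdf"),
        ("_:a", SEE_ALSO, "http://blog.example/u1.rdf"),
    ],
    "http://b.example/foaf.rdf": [("_:b", FOAF + "name", "Bob")],
    "http://blog.example/u1.rdf": [("_:c", FOAF + "name", "Blogger")],
}
crawled = scutter(["http://a.example/foaf.rdf"], web.get,
                  blocked_hosts={"blog.example"})
```

Passing `fetch` as a parameter keeps the traversal logic separate from retrieval and parsing, so the same loop could be driven by a real HTTP client and RDF parser.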

Information from emails is also processed in two steps. In this case, the first step requires that the emails are downloaded from a POP3 or IMAP store and the relevant header information is captured in an RDF format, where FOAF is used for representing information about senders and receivers of emails, in particular their name (as it appears in the header) and email address. The second step is then the same as above.
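The first step of the email processing can be sketched with the Python standard library. This is an assumption-laden illustration: it starts from a message that has already been downloaded, it emits plain (subject, predicate, object) tuples rather than a serialized RDF document, and the blank-node naming scheme is invented. Only the FOAF terms foaf:Person, foaf:name and foaf:mbox are used.

```python
from email import message_from_string
from email.utils import getaddresses

FOAF = "http://xmlns.com/foaf/0.1/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

def header_triples(raw_message):
    """Describe the senders and receivers of one email as FOAF triples."""
    msg = message_from_string(raw_message)
    headers = (msg.get_all("From", [])
               + msg.get_all("To", [])
               + msg.get_all("Cc", []))
    triples = []
    for index, (name, address) in enumerate(getaddresses(headers)):
        if not address:
            continue
        node = f"_:p{index}"  # hypothetical blank-node label
        triples.append((node, RDF_TYPE, FOAF + "Person"))
        triples.append((node, FOAF + "mbox", "mailto:" + address))
        if name:  # the display name is optional in mail headers
            triples.append((node, FOAF + "name", name))
    return triples

raw = (
    "From: Alice Example <alice@example.org>\n"
    "To: Bob Example <bob@example.org>, carol@example.org\n"
    "Subject: hello\n"
    "\n"
    "Message body.\n"
)
email_triples = header_triples(raw)
```

Because foaf:mbox is recorded for every person, these descriptions can later be matched against other sources through the identity reasoning described in Section 3.2.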

Lastly, bibliographic information is collected in a single step by querying Google Scholar with the names of individuals (plus the disambiguation term). From the results, we learn the title and locations of publications as well as the year of publication and the number of citations where available.8 This knowledge is represented in the SWRC ontology format (except for citation counts, which cannot be expressed). An alternative source of bibliographic information (used in previous versions of the system) is the Bibster peer-to-peer network [9], from which metadata can be exported directly in the SWRC ontology format.

8 Note that it is not possible to find co-authors using Google Scholar, since it suppresses the full list of authors in cases where the list would be too long. Fortunately, this is not necessary when the list of authors is known in advance.


Fig. 5. Simple network statistics such as the degree distribution of the network (shown on the bar chart) and the most common importance measures (shown in table) are available through the web. Other statistics can be computed by exporting the data to network analysis packages such as Pajek or UCINET.

3.2. Representation, inference and storage

This is the middle layer of our system with the primary role of storing and enhancing metadata through reasoning.

The network ties, the interest associations and other metadata are represented in RDF using terms from the FOAF vocabulary such as foaf:knows for relationships and foaf:topic_interest for research interests. (FOAF is the native format of profiles collected from the Semantic Web.) A reification-based extension of the FOAF model is necessary to represent association weights. (For a more detailed treatment of current issues in social ontology, we refer the reader to [19].)

Extensions to the FOAF model are also necessary to record the context of the statements collected.9 Currently, this is also expressed using the RDF reification mechanism, which significantly adds to the amount of data that needs to be handled. We hope that in the future our storage facility will provide native features for context support, which would improve the efficiency of storing and querying such information. This support would also be necessary to implement efficient updates of the information.

9 Context in our system consists of the source of a statement and the time it was collected.
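The reification-based extension used for weights and context can be illustrated with the standard RDF reification vocabulary (rdf:Statement, rdf:subject, rdf:predicate, rdf:object). The sketch below is an assumption, not Flink's actual schema: the paper does not name the properties that carry the weight, source and collection time, so the `EX` namespace and its three properties are hypothetical placeholders.

```python
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
FOAF = "http://xmlns.com/foaf/0.1/"
EX = "http://example.org/flink-sketch#"  # hypothetical extension namespace

def reify(statement_id, subject, predicate, obj,
          weight=None, source=None, collected=None):
    """Describe one statement with reification and attach optional metadata."""
    node = f"_:stmt{statement_id}"
    triples = [
        (node, RDF + "type", RDF + "Statement"),
        (node, RDF + "subject", subject),
        (node, RDF + "predicate", predicate),
        (node, RDF + "object", obj),
    ]
    if weight is not None:
        triples.append((node, EX + "weight", weight))        # association weight
    if source is not None:
        triples.append((node, EX + "source", source))        # where it was found
    if collected is not None:
        triples.append((node, EX + "collected", collected))  # when it was found
    return triples

reified = reify(1, "_:alice", FOAF + "knows", "_:bob",
                weight=0.75, source="web", collected="2005-05-14")
```

A single weighted foaf:knows tie with context becomes seven triples here, which illustrates why the text notes that reification significantly adds to the amount of data to be handled.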


The aggregated collection of RDF data is stored in a Sesame server. (For more information about the Sesame RDF storage and query facility, we refer to [4].) Note that since the model is a compatible extension of FOAF, the knowledge can be further processed from this point by any FOAF-compatible tool. An example of that is a generic component we incorporated for finding the geographical locations (latitude and longitude coordinates) of place names found in the FOAF profiles. This component invokes the ESRI Place Finder Sample Web Service, which provides geographic locations of over 3 million place names worldwide.10 Web Service invocation is facilitated by the Apache Web Service Invocation Framework, which uses the WSDL profile of a web service to generate the code required to interact with the service.

Besides storage, inference is another major task of the middle layer. Sesame applies the RDF closure rules to the data at upload time. This feature can be extended by defining domain-specific inference rules in Sesame's custom rule language. (Note that in the absence of a standard rule language for the Semantic Web, this remains a practical alternative.) We use this facility to express mappings and metaknowledge, for example that co-authors of publications and senders/receivers of emails know each other in the FOAF sense.
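The effect of one such domain-specific rule, that co-authors of a publication know each other, can be sketched in Python. This is not Sesame's rule syntax, which the paper does not show; it is a plain forward-chaining step over (subject, predicate, object) tuples, and the authorship predicate is a hypothetical placeholder passed in by the caller.

```python
from itertools import permutations

FOAF_KNOWS = "http://xmlns.com/foaf/0.1/knows"
AUTHOR = "http://example.org/flink-sketch#author"  # hypothetical predicate

def coauthors_know_each_other(triples, author_predicate=AUTHOR):
    """Infer foaf:knows between every ordered pair of co-authors.

    Expects triples of the form (publication, author_predicate, person)
    and returns only statements that are not already present.
    """
    authors_by_publication = {}
    for s, p, o in triples:
        if p == author_predicate:
            authors_by_publication.setdefault(s, set()).add(o)
    inferred = set()
    for authors in authors_by_publication.values():
        for a, b in permutations(sorted(authors), 2):
            inferred.add((a, FOAF_KNOWS, b))
    return inferred - set(triples)

publications = [
    ("_:pub1", AUTHOR, "_:alice"),
    ("_:pub1", AUTHOR, "_:bob"),
    ("_:pub2", AUTHOR, "_:carol"),
]
new_ties = coauthors_know_each_other(publications)
```

A single-author publication such as `_:pub2` yields no new statements, and the rule adds the tie in both directions for each co-author pair.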

Flink also makes use of the rule language for carrying out identity reasoning, otherwise known as smushing. Identity reasoning is required to determine the identity of instances (in this case individuals) across multiple information sources. The methods for smushing in Flink are based on name matching and object identification based on the inverse-functional properties (IFPs) of FOAF. IFP-based matching is directly axiomatized in the rule language. IFPs of the foaf:Person class include mailbox, mailbox checksum, homepage and several other properties. For example, if we find that instances A and B of the foaf:Person class have the same value for the foaf:mbox property, we can conclude A owl:sameAs B. Name matching is implemented in code and is based on the similarity of names as strings. (Differences in the last names are disallowed, however.) When matches are found, the match is again recorded using the owl:sameAs property.

The merging of profile information is based on the semantics of the owl:sameAs relation. Since Sesame has no built-in support for OWL equivalence, we axiomatize the owl:sameAs property using the rule language as well. The rule-based expansion of equivalence has the disadvantage that it requires the storage of the same information about all the equivalent instances. In principle, the repository could be 'cleaned' by removing all but one of the equivalent instances. However, the size of the repository is still moderate (also due to the filtering of irrelevant person instances) and the removal of statements would likely require significant additional processing.

From a scalability perspective, we are glad to note that the Sesame server offers very high performance in storing data on the scale of millions of triples, especially using native repositories. (Native storage refers to a file-system-based back-end as opposed to repositories built on top of relational databases.) Speed of upload is particularly important for the RDF crawler, which itself has a very high throughput. Unfortunately, the speed of upload drops significantly when custom rules need to be evaluated.

While the speed of uploads is important to keep up with other components that are producing data, the time required for resolving queries determines the responsiveness of the user interface. At the moment query optimization is still a significant challenge for the server. In many cases, the developer himself can improve the performance of a query by rewriting it manually, e.g. by reordering the terms or breaking the query in two. The trade-off between executing many small queries versus executing a single large query also requires the careful judgement of the developer. The trade-off is in terms of memory footprint versus communication overhead: small, targeted queries are inefficient due to the communication and parsing involved, while large queries produce large result sets that need to be further processed on the client side.

10 http://www.esri.com/software/arcwebservices/.

3.3. Browsing and visualization

The user interface of Flink is a pure Java web application based on the Model-View-Controller (MVC) paradigm. The key idea behind the MVC pattern is a separation of concerns among the components responsible for the data (the model), the application logic (controller) and the web interface (view). The Apache Struts Framework used by Flink helps programmers in writing web applications that respect the MVC pattern by providing abstract application components and logic for the pattern. The role of the programmer is to extend this skeletal application with domain and task specific objects.

The model objects of Flink use the graph model of the JUNG programming toolkit. JUNG11 is a Java library (API) that provides an object-oriented representation of networks as well as implementations of important measures and algorithms used in (social) network analysis. The model objects loosely map the underlying ontology and retrieve data dynamically from the RDF store as needed for the presentation.12 The network itself and the most commonly accessed objects are cached to improve performance.

In the view layer, servlets, Java Server Pages (JSP) and the Java Standard Tag Library (JSTL) are used to generate a front-end that hides much of the code from the designer of the front-end. This means that the design of the web interface may be easily changed without affecting the application and vice versa. In the current interface, Java applets are also used on parts of the site to allow the user to interact with the visualization.

We consider the flexibility of the interface important because there are many possibilities to present social networks to the user, and the best way of presentation may depend on the size of the community as well as other factors. The possibilities range from “text only” profiles (such as in the SNS Orkut13) to fully graphical browsing based on network visualization (as in the FOAFnaut14 browser). The uniqueness of presenting social networks is also the primary reason that we cannot benefit from using Semantic Web portal generators such as HayStack [5], which are primarily targeted at browsing more traditional object collections.

The user interface also provides mechanisms for exporting the data. For more advanced analysis and visualization options, the data can be downloaded in the format used by Pajek, a popular network analysis package [3]. Users can also download profiles for individuals in RDF/XML (FOAF) format. Lastly, we provide marker files for XPlanet, an application that visualizes geographic coordinates and geodesics by mapping them onto surface images of the Earth (see Fig. 1).

11 http://jung.sourceforge.net.

12 The danger of a close mapping between the ontology and the run-time model is that the application needs to be rewritten whenever the underlying ontology changes.

13 http://www.orkut.com.

14 http://www.foafnaut.org/.

4. Social network analysis

The information extraction in Flink is not only the basis of the web application described above, but also provides the data for a sociological study about the role of networks in scientific innovation.

Social network analysis [20,22] is a specialization of the study of networks [1] and it has been applied to a variety of social settings including networks of entrepreneurs, terrorist networks, health (sexual) networks, networks of innovation, etc. Network analysis provides the necessary techniques to test hypotheses (theories) that link network participation to effects on substantial outcomes such as the performance of an individual or groups of individuals.

A key idea in the structural approach to social science is that the way an actor (an individual or a group) is embedded in a network offers opportunities and imposes constraints on the actor. Occupying a favored position or having preferred kinds of personal connections means that the actor will have better access to valuable information, resources, social support, etc. and will be highly sought after for such opportunities by actors in less favorable positions. In short, social network participation (social capital) might explain a significant proportion of the differences in performance when looking at different, but comparable actors.
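To make the notion of a "favored position" concrete, the following is a minimal sketch of two standard positional measures from the network analysis literature, degree and closeness centrality. It is an illustration only and not part of Flink; the toy network and the actor names (`broker`, `a` to `d`) are hypothetical.

```python
from collections import deque


def degree_centrality(graph):
    """Normalized degree: the share of the other actors each actor is tied to."""
    n = len(graph)
    return {actor: len(nbrs) / (n - 1) for actor, nbrs in graph.items()}


def closeness_centrality(graph):
    """Closeness: reachable actors divided by the sum of shortest-path distances."""
    result = {}
    for source in graph:
        # Breadth-first search gives shortest path lengths in an unweighted graph.
        dist = {source: 0}
        queue = deque([source])
        while queue:
            node = queue.popleft()
            for nbr in graph[node]:
                if nbr not in dist:
                    dist[nbr] = dist[node] + 1
                    queue.append(nbr)
        total = sum(dist.values())
        result[source] = (len(dist) - 1) / total if total else 0.0
    return result


# Hypothetical toy network (undirected): "broker" ties two separate pairs together.
toy = {
    "broker": {"a", "b", "c", "d"},
    "a": {"broker", "b"},
    "b": {"broker", "a"},
    "c": {"broker", "d"},
    "d": {"broker", "c"},
}

degree = degree_centrality(toy)
closeness = closeness_centrality(toy)
```

In this toy network the broker scores highest on both measures, which is the kind of positional advantage that the structural approach relates to differences in performance.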

With our study of the Semantic Web community, our goal is to verify and extend existing theories that relate network participation to innovation in science. In context of the related work (see also Section 5), our methods offer a unique opportunity in terms of the size of the network, the amount of data available and the possibility to observe the dynamics of the network.

A couple of notes are in order about the quality of the data that we obtain, especially in light of using this data for the purposes of social network analysis:

• Interpretation of the networks
One might have noted already that the network obtained from mining the web is a multiplex network on its own, possibly reflecting the co-authorship network, the discussion networks obtained from emails or some other relationship. A closer look at the results for a single person (Frank van Harmelen) shows that 44 of the first 100 results returned (from a total of about 10,000) relate to publications and 9 to emails. (Note that the same publication may be referenced in different web pages.) Nevertheless, this network may complement the other networks for different types of relationships (such as informal relationships) and data missing from the other sources (e.g. we may not be aware of all mailing lists related to the Semantic Web).

• Errors in the extraction of specific cases
The network is also bound to contain errors due to the method of collection. The search for co-occurrence is carried out on the syntactic level and shows the typical drawbacks of internet search. For example, it is possible that some of the returned pages are about a different person than the one intended by the query. Ambiguity particularly affects people with common names, e.g. Martin Frank. This danger is mitigated by including the disambiguation term in the query.

Queries for researchers who commonly use different variations of their name (e.g. Jim Hendler versus James Hendler) or whose names contain international characters (e.g. Jérôme Euzenat) may return only a partial set of all relevant documents known to the search engine.15 Name ambiguity also affects Google Scholar. For example, the person “York Sure” is identified as a co-author of publications that are published in New York.

With respect to our use case, the situation is analogous to obtaining incorrect data on a network questionnaire for a part of the respondents, namely those with problematic names. However, this does not represent a problem in computing statistics if the fraction of the cases affected this way remains small.

• General noise
Information extraction will not only affect specific cases, but also create general noise. For example, a co-occurrence of names on a web page need not indicate any social relation in the sociological sense and may in fact be a pure coincidence (e.g. names in a phone directory). Reliability may also be affected by Google itself: the phenomenon of Google Dance can alter the measured association values depending on the time of the query. Such noise in the data, however, will not skew social network statistics as long as it is distributed in an independent manner.

15 It is worthwhile to note that the ambiguity of queries with respect to the content is precisely the problem addressed by Semantic Web technology, in particular FOAF for finding people.

Despite the above difficulties in data collection, we are confident that the quality of the data will allow us to use it for the purposes of network analysis. To verify our method, we also plan to execute a separate study, where we compare the results from a traditional questionnaire method to the acquisition methods described here.

The results of our study of the Semantic Web community may be of interest to both this community and the area of research policy in general; we therefore plan to report on this work in future publications.

5. Related work

Due to the interdisciplinary nature of the work, namely a technological innovation supporting a social science study, the related work is wide-ranging.

Semantic Web research has produced a number of demonstrations in the area of semantics-based knowledge management, in particular semantic portals for browsing large collections of documents or other objects. Ontology-based knowledge management was the focus of the European On-To-Knowledge project [18] and the more recent SEKT project.16 The specific area of ontology-based portals has been the subject (among others) of the early work on the SEAL portal generator [16] and the more recent development of the Haystack framework [5]. Flink shares a technological basis and architecture with these projects, with the difference that the “collection” to be presented is a set of persons and the links between them are provided by their social connectivity. The focus on these connections strongly influences the presentation. For example, the ties themselves are presented as individual objects on separate pages. Also, network visualizations (sociograms) are used to orient the user and to provide relevant context information.

16 http://www.sekt-project.com.

In traditional works of scientometrics, scientific networks are investigated by collecting data manually (through interviews or questionnaires), by investigating co-authoring and co-citation in scientific publications [6,2] using commercially available databases or by looking for other kinds of evidence of co-participation in research activities, such as public information about project grants [10,8].

Our approach to data collection is part of the more recent trend of applying methods of Computer Science to mining networks from electronic data. As these methods are advanced by computer scientists with an interest in networks, the focus of this literature is clearly on the methods of extraction or analysis rather than the social theory. Emails are the source of social networks in [7,21], while other projects extract networks from web pages with methods similar to ours [14,12] or, somewhat less successfully, by analyzing the linking structure of the web [10]. As the first to publish such a study, Paolillo and Wright offer a rough characterization of the FOAF-web in [11].

With our interdisciplinary approach, we hope to contribute both to the methods of network analysis and to the theory of research and innovation. We build our work on the possibilities offered by Semantic Web technology in the collection of data, in particular, the aggregation of information from heterogeneous sources. We complement this with the methodology of social network analysis to gain new insights about the role of the networks in the work of the community, thereby benefitting both network theory and the community under investigation.

Lastly, it is important to note with respect to Flink that the system is applicable to a broader range of communities than the one that is featured in the current application. The few comparative studies in webometrics (web-based scientometrics) suggest that real-world networks of largely academic research communities (such as the Semantic Web community) are closely reflected on the web [10,15]. This suggests that our system could be used to generate presentations of scientific communities in different areas, potentially on much larger scales. With different sources of data, the framework could also be used to visualize communities in areas other than science, e.g. communities of practice in a corporate setting.

6. Conclusions and future work

With the spread of the first computers we believed that as machines replace humans we will interact with them more than with each other, making the world less of a social space. Paradoxically, it seems that nothing could be less true: in the end we shaped our information systems to our form and made them the carriers of efficient forms of communication (from emails to blogs), which allowed us to move much of our social life into the electronic domain.

Our social connectivity might have even increased in importance in the last years simply by virtue of the information overload we are facing. Browsing the web has become almost futile: the likelihood of finding valuable information by simply following links from page to page has dropped considerably due to the sheer size of the web. Picking up the valuable pieces of information from the mailing lists or blogs that we pretend to follow would require reading them all. That is impossible, unless someone has informed us beforehand about the relevance of an item.

Our social connections not only direct our search in infospace by alerting us to relevant information, but also help to weigh the authority of the information. When forming a “first impression”, the content of a webpage is almost secondary to how we got there. Was it an email from someone we consider an expert? Was it a link from a website we came to trust? (In fact, this is the thinking behind Google’s PageRank algorithm: a webpage is only as authoritative as the ones referring to it.)

If we only had a way to program the underlying reasoning into our machines and provide them with the necessary background information, they could help us much further in distinguishing relevant from irrelevant, trustworthy from corrupt. However, as with most of the content in the electronic domain, almost none of the existing electronic information about our social connections is directly processable by our information systems. Most of it is locked in formats that were not chiefly intended to carry this information. This information needs to be extracted and represented in more formal ways. We may also need these representations to allow the users to enter additional information not directly accessible from an information system. (After all, much of our social life still occurs outside of our systems...)

Thus, the first challenge in the area of social software is the extraction, representation and aggregation of social knowledge. In this article, we have shown how advanced technologies from the Semantic Web domain (applied information extraction, knowledge representation, ontology mapping) can help in this process. While technology is important, keeping in touch with social science will be just as important in the future. For example, a practical question we encountered in our work concerns the multiplexity of social relations: a relationship between two individuals may have a different significance in different areas of social life. (The most trivial example is the occasional overlap between work and private relations.) Creating a social ontology that would allow social relationships to be classified along several dimensions is among the future work, and so is the finding of patterns for identifying these relationships using electronic data.

In terms of technology, the current bottleneck in scalability is the performance of aggregation (identity reasoning) due to the lack of standard query and rule languages and efficient implementations in RDF stores. Representing context information in a standard and efficient manner will also be necessary to exchange context information among servers.
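As an illustration of what identity reasoning involves, the sketch below merges person records that share a value for an identifying property, using a union-find structure in memory. It is a simplified sketch under stated assumptions and not the Flink implementation, which operates on an RDF store: the record identifiers, the sample values and the choice of `mbox_sha1sum` and `homepage` as identifying keys (modelled on FOAF properties of the same names) are assumptions made for the example.

```python
def merge_identities(records, keys=("mbox_sha1sum", "homepage")):
    """Group record ids that share a value for any identifying key."""
    parent = {rid: rid for rid in records}

    def find(x):
        # Follow parent links to the representative, shortening the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    seen = {}  # (key, value) -> first record id that carried this value
    for rid, props in records.items():
        for key in keys:
            value = props.get(key)
            if value is None:
                continue
            if (key, value) in seen:
                union(seen[(key, value)], rid)
            else:
                seen[(key, value)] = rid

    groups = {}
    for rid in records:
        groups.setdefault(find(rid), set()).add(rid)
    return sorted(groups.values(), key=lambda group: sorted(group))


# Hypothetical records collected from different sources (all values are made up).
records = {
    "web:1": {"name": "Jim Hendler", "mbox_sha1sum": "aaa"},
    "foaf:7": {"name": "James Hendler", "mbox_sha1sum": "aaa",
               "homepage": "http://example.org/jh"},
    "pub:3": {"name": "J. Hendler", "homepage": "http://example.org/jh"},
    "web:2": {"name": "York Sure", "mbox_sha1sum": "bbb"},
}

groups = merge_identities(records)
```

Here the three Hendler records end up in one group even though the first and third share no property directly; the equivalence is transitive through the second record. Matching on names alone is deliberately left out, since name variants and ambiguity are exactly the cases where it fails.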

The extracted and aggregated information, possibly complemented by additional input from the users in the form of a user profile, provides the valuable data needed for adding more intelligence to knowledge intensive applications, in particular improving the navigation of large information stores through collaborative filtering. In our work, the information is constituted by publications and emails, the works and communications of the community. However, networks themselves may also be the focus of interest. Network analysis can benefit communities by identifying the network effects on performance and helping to devise strategies for the individual or for the community accordingly.

In terms of social network analysis, the use of electronic data provides a unique opportunity to observe the dynamics of community development. (This is difficult, if not impossible, with the traditional questionnaire methods of data collection due to the amount of work required from both participants and researchers.) In the future, as our social lives become even more accurately traceable through ubiquitous, mobile and wearable computers, the opportunities for social science based on electronic data will only become more prominent.

We conclude by noting that the aggregation of social information from disparate sources without permission from the individuals involved is also likely to be the subject of much debate in the future, especially if these sources were originally created for a different purpose, and thus their integration could not have been foreseen. Standard representations, distributed storage and privacy mechanisms should provide the answer by providing protection over one’s own social information, but still allowing it to be exchanged with relative ease when required.

Acknowledgement

Funding for this research has been provided by the Vrije Universiteit Research School for Business Information Sciences (VUBIS).

References

[1] A.L. Barabasi, Linked: The New Science of Networks, Perseus Publishing, 2002.

[2] A.L. Barabasi, H. Jeong, Z. Neda, E. Ravasz, A. Schubert, T. Vicsek, Evolution of the social network of scientific collaborations, Physica A 311 (3–4) (2002) 590–614.

[3] V. Batagelj, A. Mrvar, Pajek - program for large network analysis, Connections 21 (2) (1998) 47–57.

[4] J. Broekstra, A. Kampman, F. van Harmelen, Sesame: An Architecture for Storing and Querying RDF and RDF Schema, Proceedings of the First International Semantic Web Conference (ISWC 2002), Number 2342 in Lecture Notes in Computer Science (LNCS), Springer-Verlag, 2002, pp. 54–68.

[5] D. Quan, D.R. Karger, How to Make a Semantic Web Browser, Proceedings of the 13th International World Wide Web Conference, New York, USA, 2004, pp. 255–265.

[6] D.J. deSolla Price, Networks of scientific papers: the pattern of bibliographic references indicates the nature of the scientific research front, Science 149 (3683) (1965) 510–515.

[7] P.A. Gloor, R. Laubacher, S.B.C. Dynes, Y. Zhao, Visualization of communication patterns in collaborative innovation networks - analysis of some w3c working groups, CIKM ’03: Proceedings of the Twelfth International Conference on Information and Knowledge Management, ACM Press, 2003, pp. 56–60.

[8] M. Grobelnik, D. Mladenic, Approaching Analysis of EU IST projects database, Proceedings of the International Conference on Information and Intelligent Systems (IIS-2002), 2002.

[9] P. Haase, J. Broekstra, M. Ehrig, M. Menken, P. Mika, M. Plechawski, P. Pyszlak, B. Schnizler, R. Siebes, S. Staab, C. Tempich, Bibster - a semantics-based bibliographic peer-to-peer system, in: S.A. McIlraith, D. Plexousakis, F. van Harmelen (Eds.), Proceedings of the Third International Semantic Web Conference (ISWC 2004), Hiroshima, Japan, Springer-Verlag, November 2004, pp. 122–136.


[10] G. Heimeriks, M. Hoerlesberger, P. van den Besselaar, Mapping communication and collaboration in heterogeneous research networks, Scientometrics 58 (2) (2003) 391–413.

[11] J.C. Paolillo, E. Wright, The Challenges of FOAF Characterization, Proceedings of the First Workshop on Friend of a Friend, Social Networking and the (Semantic) Web, 2004.

[12] J. Mori, Y. Matsuo, M. Ishizuka, B. Faltings, Keyword Extraction from the Web for FOAF Metadata, Proceedings of the First Workshop on Friend of a Friend, Social Networking and the (Semantic) Web, 2004.

[13] L. Kahney, Making Friendsters in High Places, Wired, July 2003.

[14] H. Kautz, B. Selman, M. Shah, The Hidden Web, AI Magazine 18 (2) (1997) 27–36.

[15] H. Kretschmer, I. Aguillo, Visibility of collaboration on the web, Scientometrics 61 (3) (2004) 405–426.

[16] A. Maedche, S. Staab, R. Studer, Y. Sure, R. Volz, SEAL - tying up information integration and web site management by ontologies, IEEE Data Eng. Bull. 25 (1) (2002) 10–17.

[17] P. Mika, Social networks and the semantic web: the next challenge, IEEE Intell. Syst. 20 (1) (January–February 2005) 82–85.

[18] P. Mika, V. Iosif, Y. Sure, H. Akkermans, Handbook on Ontologies in Information Systems, Chapter 24: Ontology-based Content Management in a Virtual Organization, International Handbooks on Information Systems, Springer-Verlag, 2003, pp. 447–471.

[19] P. Mika, A. Gangemi, Descriptions of Social Relations, Proceedings of the First Workshop on Friend of a Friend, Social Networking and the (Semantic) Web, 2004.

[20] J.P. Scott, Social Network Analysis: A Handbook, 2nd ed., Sage Publications, 2000.

[21] L.A. Adamic, E. Adar, Social Netw. 27 (3) (2005) 187–203.

[22] S. Wasserman, K. Faust, D. Iacobucci, M. Granovetter, Social Network Analysis: Methods and Applications, Cambridge University Press, 1994.