mining and supporting community structures in sensor network research
- Mining and supporting community structures in sensor network research. Alberto Pepe (University of California at Los Angeles), Marko A. Rodriguez (Los Alamos National Laboratory). CENS Friday Seminar | May 2, 2008
- Outline.
- Studying Collaboration at CENS
  - Introduction to Data Practices
  - Detection of Structural Communities
  - Data Set and Methods
  - Results
- Supporting Collaboration at CENS
  - Introduction to the Semantic Web
  - Semantic Networks and Graph Databases
  - Analyzing Semantic Networks
  - Demo
- Data practices group.
- Background research questions:
-
- What are CENS data?
-
- What context data is necessary to support interpretation during re-use?
-
- How can we automate the capture of context data?
-
- How can we link scholarly and scientific data into meaningful aggregations/chains?
-
- What are the social and academic settings that yield the production of scientific and engineering data/knowledge?
- Current study.
- Question: how do collaboration communities differ from socioacademic communities?
- Method : comparative analysis of coauthorship network community structure and selected socioacademic community structures (e.g. academic department, affiliation, country of origin, academic position)
- Steps of the study.
- Gather bibliographic and socioacademic data.
- Generate coauthorship network.
- Determine structural communities in the coauthorship network.
- Test for statistical independence between the structural and socioacademic communities.
- Gather data.
- Population data :
-
- Collected from eScholarship repository
-
- 291 CENS and non-CENS authors
-
- Multi-institutional and interdisciplinary
-
- 560 manuscripts (379 conference papers, 163 journal articles)
-
- Published over a ten year period (1998-2007)
-
- Gathered academic department, academic affiliation, country of origin, and academic position
- Generate coauthorship network.
- @article{
- author={Marko A. Rodriguez and Alberto Pepe},
- title={On the relationship },
- journal={Journal of Informetrics},
- year=2008,
- editor={Leo Egghe},
- }
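The step from bibliographic records to a coauthorship network can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function name `coauthorship_network` and the sample records (including the placeholder "Author C") are assumptions made for the example. Each manuscript contributes one weighted edge for every pair of its authors.

```python
from collections import Counter
from itertools import combinations


def coauthorship_network(author_lists):
    """Map each unordered author pair to the number of manuscripts they share."""
    edges = Counter()
    for authors in author_lists:
        # sorted(set(...)) drops repeated names and gives each pair a canonical order
        for a, b in combinations(sorted(set(authors)), 2):
            edges[(a, b)] += 1
    return edges


# Hypothetical author lists, one per manuscript (not the CENS data set)
records = [
    ["Marko A. Rodriguez", "Alberto Pepe"],
    ["Alberto Pepe", "Author C"],
    ["Marko A. Rodriguez", "Alberto Pepe", "Author C"],
]
network = coauthorship_network(records)
for pair, weight in sorted(network.items()):
    print(pair, weight)
```

Edge weights here count shared manuscripts; whether the study weighted edges this way is not stated in the slides.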
- CENS population statistics. Socioacademic communities
- Study model. [Figure: Alberto (Affiliation: UCLA, Department: IS, Origin: Italy, Position: PhD Student) connected by a coauthor edge to Marko (Affiliation: LANL, Department: CS, Origin: USA, Position: PostDoc).]
- Structural communities.
- Structural communities are cliquish subgraphs composed of groups of vertices that are highly connected to one another but poorly connected to other vertices.
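One common way to quantify "highly connected inside, poorly connected outside" is Newman's modularity Q, which compares the fraction of edges inside each community with the fraction expected from vertex degrees alone. The sketch below is an illustration under that assumption; the slides do not say that this exact measure was computed, and the toy graph is hypothetical.

```python
def modularity(edges, community):
    """Newman's modularity Q for an undirected, unweighted graph.

    edges: iterable of (u, v) pairs, each undirected edge listed once.
    community: dict mapping every vertex to a community label.
    """
    edges = list(edges)
    m = len(edges)
    degree = {}
    intra = {}  # number of edges with both endpoints in the same community
    for u, v in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1
        if community[u] == community[v]:
            intra[community[u]] = intra.get(community[u], 0) + 1
    degree_sum = {}  # total degree of the vertices in each community
    for vertex, d in degree.items():
        label = community[vertex]
        degree_sum[label] = degree_sum.get(label, 0) + d
    return sum(
        intra.get(label, 0) / m - (degree_sum[label] / (2 * m)) ** 2
        for label in degree_sum
    )


# Hypothetical graph: two triangles joined by a single bridge edge
toy_edges = [("a", "b"), ("b", "c"), ("a", "c"),
             ("d", "e"), ("e", "f"), ("d", "f"),
             ("c", "d")]
two_groups = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
one_group = {v: 0 for v in two_groups}
q_split = modularity(toy_edges, two_groups)
q_single = modularity(toy_edges, one_group)
```

Splitting the toy graph into its two triangles gives a positive Q, while putting every vertex in one community gives Q = 0.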
- Community detection methods.
- edge betweenness [1]
- walktrap (random walks) [2]
- spinglass [3]
- leading eigenvector [4]
- Coauthorship network map. 27 structural communities detected in the CENS network (LEV).
- Coauthorship network statistics.
- Typical clustering coefficients:
- mathematics: 0.34
- physics: 0.56
- biology: 0.60
- less-cliquish, sparse collaboration patterns
- CENS community fragmented in research agenda
- Newman, M. E. J., The structure and function of complex networks, SIAM Review, 45, 167, 2003.
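For readers who want to reproduce a clustering coefficient on their own network, here is a small sketch. It assumes the transitivity definition, C = 3 × (number of triangles) / (number of connected triples); the slides do not state which variant was used, and the toy edge list is hypothetical.

```python
from itertools import combinations


def transitivity(edges):
    """Global clustering coefficient of an undirected graph without self-loops."""
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)
    closed = 0   # connected triples whose two end vertices are also linked
    triples = 0  # all connected triples, counted once per center vertex
    for neighbors in adjacency.values():
        for a, b in combinations(neighbors, 2):
            triples += 1
            if b in adjacency[a]:
                closed += 1
    # Each triangle is seen three times (once per center), so this ratio
    # equals 3 * triangles / connected triples.
    return closed / triples if triples else 0.0


# Hypothetical graph: one triangle (a, b, c) with a pendant vertex d attached to c
toy_edges = [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d")]
coefficient = transitivity(toy_edges)
```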
- Chi square test.
- Chi square test determines whether two nominal/categorical properties are statistically independent.
- Chi square analysis. N.B. A p-value greater than 0.05 is treated as indicating that the two properties are statistically independent. Methods compared: leading eigenvector (LEV), walktrap (WT), edge betweenness (EB), spinglass (SG).
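The test statistic itself is simple to compute from a contingency table of structural community (rows) against a socioacademic category (columns). The sketch below is illustrative only: the counts are made up, the function name `chi_square` is an assumption, and it assumes every row and column total is nonzero.

```python
def chi_square(table):
    """Pearson chi-square statistic and degrees of freedom for a contingency table.

    table: list of rows of observed counts; all row and column totals must be > 0.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Count expected in this cell if rows and columns were independent
            expected = row_totals[i] * col_totals[j] / n
            statistic += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, dof


# Hypothetical counts: 2 structural communities x 2 departments
observed = [[10, 20],
            [20, 10]]
statistic, dof = chi_square(observed)
```

Turning the statistic into a p-value requires the chi-square distribution; if SciPy is available, `scipy.stats.chi2.sf(statistic, dof)` is one way to obtain it.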
- Anecdotal example.
- Remarks.
- Findings :
-
- Community structure is representative of department and affiliation
-
- Academic position and country of origin are independent of the structural community of the scholar.
- Generalization :
-
- Policy recommendations to increase interdisciplinarity
-
- Extension to other coauthorship networks and other socioacademic (demographic) variables
-
- Useful to predict or infer topological/socioacademic configuration when data is scarce
- Metadata reuse.
- Metadata can be used to support scholarly collaboration.
- Everything is metadata. [Figure: metadata graph with nodes Borgman, Pepe, Article1, Article2, JCDL, Italy, UCLA, CENS, Sensor Networks and edges writtenBy, member, country, attended, hasLab, cites, topic, researches, contains.]
- Introduction to the Semantic Web.
- The World Wide Web is used to link documents, where documents are given universal identifiers/locators called URIs (e.g. URL).
-
- The structure is machine processable, but the documents/elements are primarily human processable.
- The Semantic Web is used to link data, where data is given universal identifiers/locators called URIs (e.g. URL).
-
- The structure and the data are both human and machine processable.
- The Uniform Resource Identifier.
- Resource = Anything.
  - Anything that can be identified.
    - Some discrete entity.
- The Uniform Resource Identifier (URI):
  - <scheme> : <hierarchical part> [ ? <query> ] [ # <fragment> ]
    - http://www.lanl.gov
    - urn:uuid:550e8400-e29b-41d4-a716-446655440000
    - urn:issn:0892-3310
    - http://www.lanl.gov#MarkoRodriguez
      - prefix it to make it easier on the eyes -- lanl:MarkoRodriguez
- The Semantic Web
  - first identify it, then relate it!
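The parts of a URI and the prefix shorthand can be tried out with Python's standard library. This is a small sketch: the `PREFIXES` table and the helper names `expand` and `shorten` are assumptions for illustration, with the `lanl` prefix bound to `http://www.lanl.gov#` as the slide's example suggests.

```python
from urllib.parse import urlsplit

# Assumed prefix binding, following the lanl:MarkoRodriguez example
PREFIXES = {"lanl": "http://www.lanl.gov#"}


def expand(short_name, prefixes=PREFIXES):
    """Turn a prefixed name such as lanl:MarkoRodriguez into a full URI."""
    prefix, _, local = short_name.partition(":")
    return prefixes[prefix] + local


def shorten(uri, prefixes=PREFIXES):
    """Replace a known namespace with its prefix; return the URI unchanged otherwise."""
    for prefix, namespace in prefixes.items():
        if uri.startswith(namespace):
            return prefix + ":" + uri[len(namespace):]
    return uri


parts = urlsplit("http://www.lanl.gov#MarkoRodriguez")
print(parts.scheme, parts.netloc, parts.fragment)
print(urlsplit("urn:issn:0892-3310").scheme)
print(expand("lanl:MarkoRodriguez"))
```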
- The undirected network.
- There is the undirected network of common knowledge.
-
- Sometimes called an undirected single-relational network.
-
- e.g. vertex i and vertex j are related.
- The semantic of the edge denotes the network type.
-
- e.g. friendship network, collaboration network, etc.
- Example undirected network. [Figure: vertices Herbert, Marko, Aric, Ed, Zhiwu, Alberto, Jen, Johan, Luda, Stephan, Whenzong.]
- The directed network.
- Then there is the directed network of common knowledge.
-
- Sometimes called a directed single-relational network.
-
- For example, vertex i is related to vertex j , but j is not related to i .
- Example directed network. [Figure: vertices Muskrat, Bear, Fish, Fox, Meerkat, Lion, Human, Wolf, Deer, Beetle, Hyena.]
- The semantic network.
- Finally, there is the semantic network
-
- Sometimes called a directed multi-relational network.
-
- For example, vertex i is related to vertex j by the semantic s , but j is not related to i by the semantic s .
- Example semantic network. [Figure: vertices SantaFe, Marko, NewMexico, Ryan, California, UnitedStates, LANL, Cells, Atoms, Oregon, Arnold; edge labels livesIn, worksWith, cityOf, originallyFrom, stateOf, locatedIn, hasLab, madeOf, researches, southOf, northOf, hasResident, governerOf.]
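The three network types differ only in how an edge is represented, which a few lines of Python can make concrete. In this sketch the vertex and label names are borrowed from the example figures, but the specific edges are hypothetical, since the transcript does not preserve which vertices were connected.

```python
# Undirected single-relational network: an edge is an unordered pair.
undirected = {
    frozenset(("Marko", "Alberto")),
    frozenset(("Marko", "Johan")),
}

# Directed single-relational network: an edge is an ordered pair (tail, head).
directed = {
    ("Fox", "Fish"),
    ("Bear", "Fish"),
}

# Semantic (directed multi-relational) network: an edge is a triple
# (subject, semantic, object), so two vertices can be related in several ways.
semantic = {
    ("Marko", "livesIn", "SantaFe"),
    ("SantaFe", "cityOf", "NewMexico"),
    ("NewMexico", "stateOf", "UnitedStates"),
}


def related_undirected(i, j):
    """True if i and j are related, in either order."""
    return frozenset((i, j)) in undirected


def related_directed(i, j):
    """True only if there is an edge from i to j."""
    return (i, j) in directed


def related_semantic(i, s, j):
    """True only if i is related to j by the semantic s."""
    return (i, s, j) in semantic
```

The undirected check is symmetric, the directed check is not, and the semantic check additionally depends on the edge label.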
- The technologies of the Semantic Web.
- Resource Description Framework (RDF): The foundation technology of the Semantic Web. RDF is a distributed, semantic network data model. In RDF, URIs and literals (e.g. ints, doubles, strings) are related to one another in triples.
- RDF Schema (RDFS) and the Web Ontology Language (OWL): The ontology is to the Semantic Web as the schema is to the relational database.
-
- Anything of rdf:type lanl:Human can lanl:drive anything of rdf:type lanl:Car .
- Triple-Store : The triple-store is to semantic networks what the relational database is to the data table.
-
- a.k.a. semantic repository, graph database, RDF database.
- RDF and RDFS. [Figure: ontology and instance layers, with nodes lanl:marko, lanl:cookie, lanl:Human, lanl:Food and edges lanl:isEating, rdf:type, rdfs:domain, rdfs:range.]
- RDF is not a syntax. It's a data model. Various syntaxes exist to encode RDF, including RDF/XML, N-Triples, TriX, N3, etc.
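How an RDFS ontology constrains instance data can be sketched with plain tuples. The triples below are a guess at what the figure showed, reassembled from its labels (lanl:isEating with domain lanl:Human and range lanl:Food), so treat them as an assumption. The function applies the standard RDFS reading of rdfs:domain and rdfs:range, under which using a property implies the types of its subject and object; it assumes at most one domain and one range per property.

```python
RDF_TYPE = "rdf:type"

# Ontology layer (assumed reconstruction of the figure)
ontology = {
    ("lanl:isEating", "rdfs:domain", "lanl:Human"),
    ("lanl:isEating", "rdfs:range", "lanl:Food"),
}

# Instance layer (assumed reconstruction of the figure)
instance = {
    ("lanl:marko", "lanl:isEating", "lanl:cookie"),
}


def infer_types(ontology, instance):
    """Derive rdf:type triples implied by rdfs:domain and rdfs:range statements."""
    domain = {p: c for p, rel, c in ontology if rel == "rdfs:domain"}
    range_ = {p: c for p, rel, c in ontology if rel == "rdfs:range"}
    inferred = set()
    for s, p, o in instance:
        if p in domain:
            inferred.add((s, RDF_TYPE, domain[p]))  # subject gets the domain class
        if p in range_:
            inferred.add((o, RDF_TYPE, range_[p]))  # object gets the range class
    return inferred


types = infer_types(ontology, instance)
```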
- RDF, RDFS, and OWL. [Figure: ontology and instance layers, with nodes lanl:fluffy, lanl:marko, lanl:bob, lanl:Pet, lanl:Human, owl:Restriction, _:0123 and edges lanl:hasOwner, rdf:type, rdfs:domain, rdfs:range, rdfs:subClassOf, owl:onProperty, owl:maxCardinality (value 1).]
- General-purpose modeling. [Figure: List, Map, and Set structures modeled as graphs, with labels next, item, key, value, entry, el.]
- General-purpose computing. [Figure: a Program and a Virtual Machine modeled as graphs, with labels PC, heap, stack, next, value, test, item, el, true, false.]
- Rodriguez, M.A., General-Purpose Computing on a Semantic Network Substrate, in review, Journal of Web Semantics, LA-UR-07-2885, April 2007.
- A web of data and process. [Figure: hosts 127.0.0.0, 127.0.0.1, 127.0.0.2, 127.0.0.3.]
- The triple-store. SELECT ?a ?c WHERE { ?a type human . ?a wrote ?b . ?b type article . ?c wrote ?b . ?c type human . ?a != ?c }
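The query above asks for pairs of distinct humans who wrote the same article. To show what such graph pattern matching does, here is a minimal sketch that evaluates the same pattern over an in-memory set of triples; it is not how a real triple-store executes SPARQL, and the sample data and the name `coauthor_pairs` are hypothetical.

```python
# Hypothetical triples in the same vocabulary as the query (type, wrote)
triples = {
    ("pepe", "type", "human"),
    ("rodriguez", "type", "human"),
    ("article1", "type", "article"),
    ("pepe", "wrote", "article1"),
    ("rodriguez", "wrote", "article1"),
}


def coauthor_pairs(triples):
    """Return every (a, c) where distinct humans a and c wrote the same article."""
    humans = {s for s, p, o in triples if p == "type" and o == "human"}
    articles = {s for s, p, o in triples if p == "type" and o == "article"}
    wrote = [(s, o) for s, p, o in triples if p == "wrote"]
    return {
        (a, c)
        for a, b in wrote
        for c, b2 in wrote
        if b == b2 and a != c
        and a in humans and c in humans and b in articles
    }


pairs = coauthor_pairs(triples)
```

As in the query, each pair appears in both orders because ?a and ?c are only required to differ.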
- There are two primary ways to distribute information on the Semantic Web.
-
- 1.) publish a serialized RDF document on a web server.
-
- 2.) expose a public interface to an RDF triple-store.
- The triple store is to semantic networks what the relational database is to data tables.
-
- Storing and querying triples in a triple store.
-
- SPARQLUpdate query language.
-
-
- like SQL, but for triple-stores.
-
- Triple-store vs. relational database.
- Request: give me all pairs of people that are friends but who work for collaborating companies. Now!
- Triple-store (SPARQL interface): SELECT ?x1 ?x2 WHERE { ?x1 lanl:hasFriend ?x2 . ?x2 lanl:worksFor ?x3 . ?x3 lanl:collaboratesWith ?x4 . ?x4 lanl:hasEmployee ?x1 . }
- Relational database (SQL interface): SELECT friendTable.personId1, friendTable.personId2 FROM personTable, authorTable, articleTable, friendTable, hasEmployeeTable, organizationTable, worksForTable, collaboratesWithTable WHERE personTable.id = authorTable.personId AND personTable.id = friendTable.personId1 AND friendTable.personId2 = worksForTable.personId AND worksForTable.orgId = collaboratesWithTable.orgId2 AND collaboratesWithTable.orgId2 = personTable.id
- Triple-store and graph-analysis.
- Nearly all network analysis algorithms can be decomposed into a graph traversal problem.
-
- Spreading activation and the energy diffusion.
-
- PageRank and the random walker.
-
- Geodesics and breadth-first/depth-first search.
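As an example of an analysis reduced to traversal, a geodesic (shortest path) in an unweighted directed graph can be found with breadth-first search. This is a generic sketch; the function name `geodesic` and the toy edges are assumptions for illustration.

```python
from collections import deque


def geodesic(edges, source, target):
    """Return one shortest directed path from source to target, or None."""
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, []).append(v)
    parent = {source: None}  # also serves as the visited set
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            # Walk the parent links back to the source, then reverse
            path = []
            while node is not None:
                path.append(node)
                node = parent[node]
            return path[::-1]
        for neighbor in adjacency.get(node, []):
            if neighbor not in parent:
                parent[neighbor] = node
                queue.append(neighbor)
    return None


# Hypothetical directed edges
toy_edges = [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d")]
shortest = geodesic(toy_edges, "a", "d")
```

Breadth-first search visits vertices in order of distance from the source, so the first time the target is reached the path is a shortest one.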
- Relational database is not optimized for graph traversal.
-
- Indexes are not appropriate for graph traversal.
-
- Every traversal is a table join.
- Triple-store is better suited to graph analysis.
-
- Although the triple-store is optimized for graph pattern matching, it still handles graph traversal better than the relational database.
-
- Hybrid statement/linked-list databases are good at both pattern matching and traversal.
- Graph analysis can be used for ranking and recommendation.
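A ranking of this kind can be sketched with PageRank computed by power iteration, the "random walker" mentioned earlier. This is a simplified, generic version rather than the talk's implementation: the damping factor of 0.85, the fixed iteration count, and the toy citation edges are all assumptions.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Approximate PageRank scores for a directed graph by power iteration."""
    nodes = sorted({n for edge in edges for n in edge})
    out_links = {n: [] for n in nodes}
    for u, v in edges:
        out_links[u].append(v)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        # Every node receives the teleportation share first
        new_rank = {n: (1.0 - damping) / len(nodes) for n in nodes}
        for n in nodes:
            # A node with no out-links spreads its rank over all nodes
            targets = out_links[n] or nodes
            share = damping * rank[n] / len(targets)
            for t in targets:
                new_rank[t] += share
        rank = new_rank
    return rank


# Hypothetical citation edges: (citing article, cited article)
citations = [
    ("article2", "article1"),
    ("article3", "article1"),
    ("article3", "article2"),
]
scores = pagerank(citations)
ranking = sorted(scores, key=scores.get, reverse=True)
```

In the toy data the most cited article ends up ranked first, which is the behavior a recommendation service would build on.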
- Modeling the scholarly community.
- Agents : humans and groups.
- Artifacts : articles, books, journals, proceedings, conferences, datasets, software, websites, [sensors, deployments].
- Relationships : citations, authorship, publisher, contains, attends, coauthor, members.
- Demonstration.
- Conclusion.
- Thank you for coming. Good life.