scientific jargon and the flow of ideas thus reflect the inefficiency of jargon for...

9
Scientific jargon and the flow of ideas Daril A. Vilhena * , Jacob G. Foster , Martin Rosvall , Jevin D. West * , James A. Evans , and Carl T. Bergstrom * * Department of Biology, University of Washington, Seattle, Washington, United States of America, Department of Sociology, University of Chicago, Chicago, Illinois, United States of America, and Integrated Science Lab, Department of Physics, Umea University, Umea, Sweden Submitted to Proceedings of the National Academy of Sciences of the United States of America The study of scientific communication typically focuses on cita- tions. Yet scientific language has a substantial impact on how information flows through citation networks. For example, spe- cialist language or “jargon" allows experts to communicate effi- ciently. Using jargon with outsiders, however, makes communi- cation highly inefficient. To explore this duality, we developed a simple information-theoretic model of scientific communication, quantifying the presence and impact of jargon. We apply this measure to scientific papers from JSTOR. We find systematic dif- ferences between major domains of scholarship; the biological sciences have higher jargon barriers on average than the social sciences. We also demonstrate that citation distance does not always reflect jargon distance. We visualize the combination of citation and jargon information in a novel topographic map of sci- ence, in which distance reflects citation rates between fields and topographic height represents the jargon barriers between them. We find that the ability of two fields to communicate efficiently de- cays exponentially with the citation distance between them. We argue that this field-specific decay rate reflects the relative insu- larity of the field. Taken together, these rates reveal hidden pro- cesses of cohesion or fragmentation. For example, the ecological sciences are balkanized by jargon, reflecting distinct interests, while the biomolecular sciences are integrated with each other, despite high overall jargon. Our results highlight the promise of contextualizing the citation network of scholarship with semantic analysis. Scientists’ awareness of other work is independent of their ability to understand it. Together, these complementary per- spectives enhance our view of the complex relationship between scientific fields. sociology of science | scholarly communication | complex networks | jar- gon boundaries Introduction “E very science requires a special language,” wrote the French philosopher Etienne Bonnot de Condillac, “because every science has its own ideas" [1]. This maxim makes two points that are often overlooked in the quantitative analysis of scientific com- munication. First, each scientific discipline has an extensive mental catalogue of concepts that reflect distinct concerns (de Condillac’s “ideas"). Second, disciplines often develop specialized terms (jar- gon) to refer efficiently to the most common concepts. Discipline- specific jargon allows scientists to communicate with other members of their discipline more concisely and precisely than would be pos- sible using everyday language. When an evolutionary biologist uses the term “fitness landscape," this is a highly condensed shorthand for comparing expected relative reproductive success across multiple genotypes. An intertidal ecologist would require little further expla- nation to understand “fitness landscape," but a financial economist might need to spend a long time on Wikipedia. In other words, not only does every science have its own ideas, these ideas may or may not overlap with those dear to other disciplines. This simple observation has profound implications for the study of scientific communication. Information does not flow seamlessly from author to publication to reader; it is expressed in a particular lan- guage, with important consequences for its transmission [?]. Jargon allows scientists to communicate new results quickly and effectively within the context of discipline-specific paradigms, but inhibits effi- cient knowledge transfer in many other situations. To list just a few examples, specialist language should be avoided when communicat- ing medical information to a patient [2], when publishing material intended for public outreach [3], and when presenting technical in- formation to a multidisciplinary audience [4]. In other words, jargon accelerates the flow of information within disciplines by compress- ing language, but impedes communication between disciplines and makes knowledge transfer more difficult. This tradeoff between efficient communication within fields and inefficient communication between fields merits closer attention. No general framework has been available for exploring how jargon af- fects the structure of scientific communication and vice versa. In this paper, we introduce an information-theoretic model of scientific com- munication and develop a simple measure of jargon usage with a clear qualitative interpretation. We deploy this measure on a large collec- tion of scientific papers in the JSTOR corpus. These papers have both full-text and citation information. This allows us to contextualize our jargon measurement within a citation network. In such networks, nodes represent papers and directed links represent citations among papers. We use the citation network to extract the macroscopic struc- ture of scientific fields in JSTOR, building on the extensive work that maps science in this way, e.g., [5, 6, 7, 8]; see Materials and Methods 1 . Results Modeling scholarly communication. To quantify the effects of jar- gon, we construct a simple model of scientific communication. For reference, we first consider a model for optimal communication within each field, then we quantify the penalty for using these lan- guages between fields. Imagine a writer communicating with a reader through a channel [9]. Let X denote the space of all phrases they might use to communicate. The writer is characterized by a code- book Pi that maps from phrases to concepts; the subscript i denotes the writer’s field of science or scholarship. The codebook Pi has a corresponding probability distribution over a random variable Xi , with values x ∈X . This probability distribution tracks the impor- tance of each phrase in field i; important phrases are used frequently, e.g., “fitness landscape" in evolutionary biology. The writer generates a message by drawing phrases at random with probability pi (x). She then transcribes the phrases into whatever “language" is appropriate for her reader’s scholarly field, using that field’s codebook Pj . Most Reserved for Publication Footnotes 1 Although we apply our jargon measure to scientific fields as identified using the map equation, this approach can be applied to any meaningful grouping of documents, authors, or corpora, and can scale down to individual journals, authors, and articles. Indeed, a similar methodol- ogy is currently in use to screen submissions to the online preprint server ArXiv [P.Ginsparg, private communication]. Articles with excessive jargon distance from existing fields are likely to be produced by authors from outside of the mainstream research community, on topics from perpetual motion machines to novel theories of everything. www.pnas.org/cgi/doi/10.1073/pnas.0709640104 PNAS Issue Date Volume Issue Number 16

Upload: lamduong

Post on 25-Mar-2018

213 views

Category:

Documents


0 download

TRANSCRIPT

Scientific jargon and the flow of ideasDaril A. Vilhena ∗ , Jacob G. Foster † , Martin Rosvall ‡ , Jevin D. West ∗ , James A. Evans † , and Carl T. Bergstrom ∗

∗Department of Biology, University of Washington, Seattle, Washington, United States of America,†Department of Sociology, University of Chicago, Chicago, Illinois,United States of America, and ‡Integrated Science Lab, Department of Physics, Umea University, Umea, Sweden

Submitted to Proceedings of the National Academy of Sciences of the United States of America

The study of scientific communication typically focuses on cita-tions. Yet scientific language has a substantial impact on howinformation flows through citation networks. For example, spe-cialist language or “jargon" allows experts to communicate effi-ciently. Using jargon with outsiders, however, makes communi-cation highly inefficient. To explore this duality, we developed asimple information-theoretic model of scientific communication,quantifying the presence and impact of jargon. We apply thismeasure to scientific papers from JSTOR. We find systematic dif-ferences between major domains of scholarship; the biologicalsciences have higher jargon barriers on average than the socialsciences. We also demonstrate that citation distance does notalways reflect jargon distance. We visualize the combination ofcitation and jargon information in a novel topographic map of sci-ence, in which distance reflects citation rates between fields andtopographic height represents the jargon barriers between them.We find that the ability of two fields to communicate efficiently de-cays exponentially with the citation distance between them. Weargue that this field-specific decay rate reflects the relative insu-larity of the field. Taken together, these rates reveal hidden pro-cesses of cohesion or fragmentation. For example, the ecologicalsciences are balkanized by jargon, reflecting distinct interests,while the biomolecular sciences are integrated with each other,despite high overall jargon. Our results highlight the promise ofcontextualizing the citation network of scholarship with semanticanalysis. Scientists’ awareness of other work is independent oftheir ability to understand it. Together, these complementary per-spectives enhance our view of the complex relationship betweenscientific fields.

sociology of science | scholarly communication | complex networks | jar-gon boundaries

Introduction

“Every science requires a special language,” wrote the Frenchphilosopher Etienne Bonnot de Condillac, “because every

science has its own ideas" [1]. This maxim makes two points thatare often overlooked in the quantitative analysis of scientific com-munication. First, each scientific discipline has an extensive mentalcatalogue of concepts that reflect distinct concerns (de Condillac’s“ideas"). Second, disciplines often develop specialized terms (jar-gon) to refer efficiently to the most common concepts. Discipline-specific jargon allows scientists to communicate with other membersof their discipline more concisely and precisely than would be pos-sible using everyday language. When an evolutionary biologist usesthe term “fitness landscape," this is a highly condensed shorthandfor comparing expected relative reproductive success across multiplegenotypes. An intertidal ecologist would require little further expla-nation to understand “fitness landscape," but a financial economistmight need to spend a long time on Wikipedia. In other words, notonly does every science have its own ideas, these ideas may or maynot overlap with those dear to other disciplines.

This simple observation has profound implications for the studyof scientific communication. Information does not flow seamlesslyfrom author to publication to reader; it is expressed in a particular lan-guage, with important consequences for its transmission [?]. Jargonallows scientists to communicate new results quickly and effectivelywithin the context of discipline-specific paradigms, but inhibits effi-cient knowledge transfer in many other situations. To list just a fewexamples, specialist language should be avoided when communicat-

ing medical information to a patient [2], when publishing materialintended for public outreach [3], and when presenting technical in-formation to a multidisciplinary audience [4]. In other words, jargonaccelerates the flow of information within disciplines by compress-ing language, but impedes communication between disciplines andmakes knowledge transfer more difficult.

This tradeoff between efficient communication within fields andinefficient communication between fields merits closer attention. Nogeneral framework has been available for exploring how jargon af-fects the structure of scientific communication and vice versa. In thispaper, we introduce an information-theoretic model of scientific com-munication and develop a simple measure of jargon usage with a clearqualitative interpretation. We deploy this measure on a large collec-tion of scientific papers in the JSTOR corpus. These papers have bothfull-text and citation information. This allows us to contextualize ourjargon measurement within a citation network. In such networks,nodes represent papers and directed links represent citations amongpapers. We use the citation network to extract the macroscopic struc-ture of scientific fields in JSTOR, building on the extensive work thatmaps science in this way, e.g., [5, 6, 7, 8]; see Materials and Methods1.

ResultsModeling scholarly communication. To quantify the effects of jar-gon, we construct a simple model of scientific communication. Forreference, we first consider a model for optimal communicationwithin each field, then we quantify the penalty for using these lan-guages between fields. Imagine a writer communicating with a readerthrough a channel [9]. Let X denote the space of all phrases theymight use to communicate. The writer is characterized by a code-book Pi that maps from phrases to concepts; the subscript i denotesthe writer’s field of science or scholarship. The codebook Pi hasa corresponding probability distribution over a random variable Xi,with values x ∈ X . This probability distribution tracks the impor-tance of each phrase in field i; important phrases are used frequently,e.g., “fitness landscape" in evolutionary biology. The writer generatesa message by drawing phrases at random with probability pi(x). Shethen transcribes the phrases into whatever “language" is appropriatefor her reader’s scholarly field, using that field’s codebook Pj . Most

Reserved for Publication Footnotes

1Although we apply our jargon measure to scientific fields as identified using the map equation,this approach can be applied to any meaningful grouping of documents, authors, or corpora,and can scale down to individual journals, authors, and articles. Indeed, a similar methodol-ogy is currently in use to screen submissions to the online preprint server ArXiv [P.Ginsparg,private communication]. Articles with excessive jargon distance from existing fields are likelyto be produced by authors from outside of the mainstream research community, on topics fromperpetual motion machines to novel theories of everything.

www.pnas.org/cgi/doi/10.1073/pnas.0709640104 PNAS Issue Date Volume Issue Number 1–6

of the extra cost she incurs will come from phrases that are commonin her field but rare in her reader’s field, while non-technical phrases(“here we show") will tend to have comparable frequencies in bothfields.

We assume that the language of each scholarly field is optimizedbased on how frequently a given phrase is used; this assumptionis commonly used to explain the power-law distribution of Englishwords [10, 11]. The optimum codeword for a phrase x used in fieldi with probability pi(x) has length − log2 pi(x) in bits [12]. First,consider a scientist from field i sending a message to a reader fromthe same field. Writer and reader have the same probability distribu-tion and codebook, denoted by the blue boxes.

Phrase

Writer ReaderChannel

The writer selects phrase x with probability pi(x). Becauseshe and her reader have the same codebook, she encodes it with acodeword of length − log2 pi(x). The expected message length perphrase is simply the Shannon entropy of Xi given probability distri-bution pi(x):

H(Xi) = −∑x∈X

pi(x) log2 pi(x) [1]

This result follows directly from the Shannon source coding theorem[9]. Indeed, this is the most efficient encoding of messages generatedby sampling phrases x ∈ X with the writer’s probability distributionpi(x). In fields where many phrases are used with equal probabil-ity (i.e., probability mass is spread evenly across phrases) the en-tropy will be large, as will the average message length per phrase. Ifsome phrases are used very frequently, the entropy will be smaller.These two situations model the efficiency of jargon for communica-tion within fields.

Now consider the more interesting case, in which the writer andthe reader come from different fields and have different codebooks.This situation is represented below.

Phrase

Writer ReaderChannel

What is the average message length per phrase, given that thewriter now encodes phrases optimally for a reader in a different field?Denote the writer’s probability distribution over phrases pi(x) andthe reader’s probability distribution pj(x). The writer selects phrasesx with probability pi(x) and encodes them in codewords with length− log2 pj(x). The expected length of her message per phrase is thenthe cross entropy of the two distributions [13], i.e., the entropy of Xigiven pi(x) plus the Kullback-Leibler divergence between pi and pj[14]:

Q(pi||pj) = −∑x∈X

pi(x) log2 pj(x) [2]

This quantity will always be larger than the Shannon entropy; a mes-sage sent to a reader in a different field will, on average, be longerthan the same message sent to a reader in the same field. We canunderstand this result intuitively. If a particular phrase y is used veryfrequently in field i and very rarely in field j, the respective code-words will vary enormously in length,− log2 pi(y)� − log2 pj(y).Insofar as the length of codewords correspond to the time or energyrequired to decipher a message, an article written by someone in fieldi, will require a reader from field j to expend much more energyto understand it than a reader from field i. The different messagelengths thus reflect the inefficiency of jargon for communication be-tween fields. In the worst case scenario, the two fields concentratetheir probability mass on largely disjoint subsets of phrases, i.e., theyhave distinct jargon, forcing a reader from the one to look up nearlyevery phrase used by an author in the other.

With these results in hand, we can quantify the efficiency of com-munication from field i to field j as the ratio of the average messagelength within field i to the average message length between fields:

Eij =H(Xi)

Q(pi||pj)=−∑x∈X pi(x) log2 pi(x)

−∑x∈X pi(x) log2 pj(x)

[3]

and similarly define the jargon barrier for a reader from field j read-ing a paper in field i as Jij = 1−Eij . We denote the average jargonbarrier of field i as Ji =

∑j Jij/N , where we sum over theN = 60

potential “reading" fields.This operationalization of semantic distance, while simple, has

several advantages. First, unlike most attempts to include seman-tic information, ours is based on an explicit model of communicationand has firm information-theoretic foundations [13]. Second, becausewe are agnostic about what scholarly “phrases" are, we can incorpo-rate many kinds of jargon, e.g., mathematical formula or canonicalcases in legal scholarship. Finally, this framework is easily exten-sible to more complex models of phrase generation, to hierarchicalstructure in the codebook, etc.

Application to JSTOR. To illustrate this approach, we measure jar-gon barriers between scholarly fields in JSTOR. Fields are identifiedbased on patterns of citation between articles using the map equation[5]. Catalogues of phrases and their field-specific frequency distribu-tions were assembled by parsing full-text trigrams from a represen-tative subsample of articles in each field written between 1990-2010.See Materials and Methods for more technical details.

First, we find systematic patterns in the distribution of jargon bar-riers across the major domains of science in JSTOR. The biologicalsciences (including ecology, physiology, evolutionary, and molecularbiology) have higher jargon barriers on average (i.e., higher valuesof Ji) than the behavioral and social sciences, such as psychology,economics, sociology, political science, business, religious studiesand education (T-test, P < 10−12); see Fig. 1A. Further, the so-cial sciences are more accessible to the biological sciences than viceversa. If i is a social science field and j is a biological science field,then Jij < Jji on average, i.e., the jargon barrier for a biologicalscientist reading a social science paper tends to be lower than forthe social scientist reading a biological science paper (Jij < Jjifor 893 of 899 pairs; significant, see supporting information). Notethat this pattern does not follow trivially from the number of distinctphrases or technical terms (Fig. 1B); this property is field- ratherthan domain-specific, with some fields from each domain having fewterms and some having many. We find, however, that the ratio ofdistinct phrases (trigrams) to distinct words does vary systematicallyby domain. Social science fields tend to use fewer words in morecombinations, generating many distinct phrases from their stock ofdistinct words. The ratio is small for biological science fields, sug-gesting that the constraints on word combination are stronger thereand that fields with many distinct phrases have many distinct words(Fig. 1C). This domain-specific asymmetry may provide a partial ex-planation for the domain-level asymmetry in jargon barriers: distinctphrases from the biological sciences are likely to contain many wordspoorly represented in the social science codebooks.

Second, we confirm our intuition that the structure of scientificcommunication as traced by citations is distinct from the structuretraced by semantic difference, i.e., by jargon. Hierarchical clusteringanalysis (UPGMA) of the average shortest citation path distances be-tween fields yields a dendrogram that splits the biological from thesocial sciences at the domain level (Fig. 2A) [15]; see Materials andMethods. When these fields are clustered by jargon, however, thedendrogram does not reflect these major domains (Fig. 2B). We per-form this clustering using a symmetrized version of the jargon bar-rier: Jij = (Jij + Jji)/2. The subfields broadly corresponding toecology, evolution, and physiology are separated from the molecular

2 www.pnas.org/cgi/doi/10.1073/pnas.0709640104 Footline Author

fields (Figs. 2A and 2B, molecular fields shown in red). This sepa-ration reveals a language barrier within biology that cuts across theflow of the citation network. In economics, a smaller jargon barrier isrevealed: growth economics and consumer theory are clustered withportfolio theory, option pricing, and time series analysis by citation(Fig. 2A, shown in blue). When clustered by jargon, however, theygroup with subfields having to do with labor (Fig. 2B, also blue).

These findings suggest the need to combine citation and seman-tic information systematically. To visualize these linguistic barriersas they cut across the traditional map of science as defined by ci-tations, we embed scholarly fields in a two-dimensional topographydefined by citation and jargon (Fig. 3). The relative location of fieldsis determined by the citation flows between them. Fields with sub-stantial citation flow are placed close together, and those with littleflow are placed far apart. The topographical features of this land-scape, in turn, are determined by jargon. More specifically, the Xand Y coordinates of each field are assigned via principal coordinateanalysis of the citation distances. We measure dissimilarity betweenfields using the average shortest citation path between them (good-ness of fit = 0.25) [16]. The height of each pixel in the topographicoverlay is a weighted average of the symmetrized jargon barriers be-tween nearby fields; see Materials and Methods. High ridges andpeaks represent significant jargon barriers. Researchers in fields sep-arated by such barriers must invest substantial resources to learn theirneighbor’s “codebook" in order to translate their articles and incor-porate them into work within their field.

This visualization exposes several key features of scientific com-munication. First, maps of science based purely on citation are miss-ing a large part of the story. The numerous citation barriers crosscut-ting this landscape make it clear that the efficient flow of informationassumed by classical citation analysis is often impeded by jargon.Second, the large-scale structure of the map makes sense. The so-cial sciences cluster together on the left, the biological sciences onthe right. At least in JSTOR, statistics sits between the social and thebiological sciences, reflecting its role as a common resource. Interest-ingly, the semantic barrier between statistics and the social sciencesis lower than between statistics and the biological sciences. Withinthe social sciences, there is a relatively clear path, with small citationbarriers, from education to psychology to sociology to economics andbusiness. This reflects the relative coherence of the social sciences interms of jargon. In the biological sciences, by contrast, many smalljargon ridges separate clusters of fields, with a more substantial bar-rier between the ecological sciences (bottom right) and genetics, phy-logenetics, and systematics (middle right). Fields comprising molec-ular biology are far from the rest of biology in both citation distanceand jargon, with massive jargon barriers cutting molecular biologyoff from nearly every discipline in JSTOR.

Visual inspection of our topographical map suggests that thelandscape is more rugged in the biological sciences than the socialsciences; in other words, the biological sciences are more balkanizedby jargon. To make this intuition precise (and check that the result isnot an artifact of embedding the high-dimensional citation network intwo dimensions), we exploit the fact that the jargon barrier betweenfield i and j grows with the average shortest citation path betweenthem. Equivalently, the efficiency Eij = 1 − Jij decays with cita-tion distance. To control for possible differences in citation practicesbetween fields or domains, we normalize our measure of citation dis-tance: we divide the average shortest path from an article in field ito one in field j by the average shortest path between two articles infield i. We then model the decay in communication efficiency withcitation distance as:

Eij = 1− β(1− e−αdij ) [4]

where dij is the normalized citation distance, 1−β gives the asymp-totic efficiency for fields at infinite distance, and α controls the decayof Eij with distance. Fits were obtained via non-linear least squaresin R. Fig. 4A shows the field with the slowest decay (option pric-

ing) and the field with the fastest decay (environmental toxicology).These examples suggest the conditions under which slow and fastdecay occur. Fields with slow decay have substantial citation flowto several neighboring fields, and relatively efficient communicationwith those fields, i.e., low semantic barriers. Fields with fast decay,by contrast, may have several fields at similar proximity but muchless efficient communication with these fields, i.e., high semantic bar-riers. The decay rate α varies across fields (Fig. 4B). Verifying theintuition suggested by Fig. 3, we find that the decay rate is higher inthe biological sciences than the social sciences (Fisher’s exact test:top half vs. bottom half, P < 0.0001). There are a number of ex-ceptions to this pattern, however; most surprisingly, fields related tomolecular biology, which have relatively small values of α, so thatefficiency of communication falls off slowly with citation distance.These counterexamples further illustrate the fact that decay rate isnot systematically related to the overall amount of jargon in a specificfield. Although the molecular biology fields have the most jargon inJSTOR (see Fig. 1A), decay rates for these fields (excepting HIV) arecomparatively low.

DiscussionOur results suggest that combining the analysis of citation flows withexplicit models of communication processes (e.g., jargon) exposesimportant structural and organizational features of scholarly commu-nication. We further suggest that the interaction of jargon and citationreveals previously neglected social processes. For example, we ar-gue that the decay of communicative efficiency with citation distancereflects the relative insularity of scholarly fields. Fields with fasterdecay rates are less accessible to others close by. This suggests, fol-lowing our model, that scholars working in these fields make exten-sive use of jargon that is not shared with neighboring fields, makinginter-field communication less efficient. Readers from neighboringfields are sufficiently aware of this nearby knowledge to reference it,but can only understand it through potentially prohibitive study anddecoding. By contrast, scholars in fields with slow decay rates likelyuse jargon that is shared with neighboring fields—their probabilitydistribution over phrases is similar—making their work much moreaccessible through phrases of common concern.

This varying relationship between shortest citation paths andcommunicative efficiency helps us interpret between-field differencesin the total amount of jargon, Ji. Molecular biology fields have thehighest jargon measures Ji in JSTOR; see Fig. 1A. The decay rate ofthese fields is quite slow, however, in comparison with other fields inthe social and biological sciences (with the exception of HIV and en-vironmental toxicology). This surprising combination suggests thatmolecular biology does not have high jargon because it is insular orexclusionary, per se. Slow decay rates exclude this interpretation onthe field level. Indeed, the several molecular biology fields cluster to-gether in citation distance and have relative low jargon barriers withone another, reflecting the substantial overlap in matters of interest.Instead, molecular biology fields have high jargon simply becausethey are remote from most other fields in JSTOR, both in citation dis-tance and semantic distance, i.e., matters of concern. By contrast,vole research is close to many fields by citation (in the ecology clus-ter) but has a high decay rate. Vole research has only a moderateamount of jargon overall, but this overlaps little with its neighbors inmatters of interest. It is much more insular than the molecular biol-ogy fields. These cases illustrate that the decay rate can be used toassess the relative insularity of different fields, taking into accountthe distance between fields in the citation network as measured byshortest path.

These examples suggest an interesting interpretation of the conti-nuities and discontinuities between fields of science and scholarshipin JSTOR. First, JSTOR fields fall into four great camps: the socialsciences; the ecological and evolutionary branches of biology; themolecular branches of biology; and statistics. Concentrating on the

Footline Author PNAS Issue Date Volume Issue Number 3

substantive fields (social, ecological, and molecular), we note thateach of these clusters, tied closely together by citation flow, is alsounited by concern with a particular scale (macroscopic and human;macroscopic and non-human; microscopic). The social sciences tendto have lower jargon barriers, on average, and to communicate quiteefficiently with neighboring social sciences. This suggests that thesefields are not absorbed by particularities, but instead share many mat-ters of concern, remaining relatively integrated with one another. Theecological sciences, by contrast, have higher jargon barriers, andcommunicate inefficiently with neighboring ecological fields. Theexample of vole research suggests a broader principle: these fieldsare absorbed in many particularities which they do not share with oneanother (I am concerned with voles; you with bears; she with bats).Finally, the molecular biological sciences have high jargon barriersbut communicate efficiently with their immediate neighbors. This re-flects an orientation toward many shared particularities between themolecules and processes that form the physical substrate of life. It isespecially interesting that the ecological sciences—popularly asso-ciated with holistic, anti-reductionist thinking—are so balkanized intheir matters of concern, while the molecular sciences, despite divid-ing life into so many distinct building blocks, nevertheless seems tobe much more integrated at the semantic level. A deeper explanationof these patterns is beyond the scope of this paper, but note that noneof this would be apparent from a pure citation or semantic analysis.

ConclusionThe analysis of scientific communication remains one of the centralprojects in the science of science [17]. Some qualitative studies haveconsidered content of communication in conjunction with citations[18], but the vast majority of quantitative analyses rely exclusivelyon citations to map the structure and flow of scientific communica-tion [5]. In cases where content is addressed directly, it is viewedas a substitute for citation analysis [19, 20], or citation patterns areused to construct a measure of semantic similarity [21]. In this paper,we have demonstrated that scholarly semantic information is not asubstitute for citation analysis. Rather, the two sources of informa-tion are complementary, together revealing patterns that otherwiseremain invisible. We introduced an easily understood measure of se-mantic distance, grounded in a simple model of communication. Wethen showed that the social and biological sciences differ systemat-ically in their use of jargon. We demonstrated that science takes ona different shape when viewed through citation or semantic structurealone, and introduced two procedures for combining such informa-tion: first, an attractive and easily interpretable “topographical map"of science, in which citation structure establishes location and jargonestablishes topography; and second, a rigorous procedure for captur-ing how communication efficiency scales with citation distance. Us-ing these two procedures, we made several striking discoveries aboutthe structure of the scholarly fields in JSTOR: the coherence of thesocial sciences; the balkanization of ecological sciences by jargon;and the surprising coherence of molecular biology in the face of theirisolation from the other fields in our sample.

Our research has marked limitations. Sampled trigrams mapimperfectly to the phrases of most interest to scientists and schol-ars. Neither do we currently distinguish different types of phrasesor those of special scientific importance. More importantly, we can-not currently identify the cause behind the adoption of jargon andthe barriers it imposes: we cannot say when jargon was introducedto maximize communicative efficiency within a field without regardto outside audiences[22, 23, 10, 11], and when it was used to distin-guish the field and erect barriers that limit oversight, association andloss of status [24, 25, 26]. We believe future work that tags phrasesand citations with temporality, authorship, institution and semanticclass (e.g., methods) may begin to disentangle the origins of jargon.

Nevertheless, we believe that this investigation sheds substantial newlight on jargon’s distribution and consequences for science.

Our findings point to a broad new research program in the sci-ence of science. On the empirical front, scholars now have leanmachinery for assessing the efficiency of communication betweenfields, as well as their relative insularity. On the theoretical front,our framework can be extended to deal with complex syntacticalrules and richer models of scientific communication. Catalogues oftechnical terms can be expanded and structured to include term re-dundancy, hierarchy, and syntactic structure. Communication rulescan also be altered to capture the real social and cognitive processesthrough which scholars read and assess documents. Though our anal-ysis highlights major patterns in JSTOR and science at large, wehave barely scratched the surface of this landscape of possibility.Boundaries between fields constantly evolve as collaborative dynam-ics change. What happens to field-specific jargon when two fieldsmerge? Do technical terms follow standard evolutionary birth-deathdynamics, producing the field-level patterns we observe—or are theysubject to unknown social phenomena? Our results underline boththe exploding opportunities in the analysis of scientific communica-tion and the importance of building and interconnecting new modelsand sources of information as we seek to quantify, understand, andshape scientific behavior [27].

Materials and Methods

Field identification. We studied the 60 largest scientific fields in the JSTORcitation network (>1000 papers, 1990-2010). This network includes over 1.5 mil-lion interconnected papers. We identified these fields using the map equation [5],an algorithm for extracting the community structure of complex networks [28, 29].The map equation has been used extensively to delimit scientific fields in citationnetworks, e.g. [5]. In our case, the map equation tracks scholarly citation flow inthe JSTOR corpus. It partitions the network such that an idealized researcher,following citations at random, would spend a long time in a given field before tran-sitioning to a new field. Because citation networks are time directed, however,this idealized random walk approach can cause older papers to accumulate flowdisproportionately. To resolve this, we used the undirected version of the cita-tion network to infer the stationary distribution of this random walk (removing thetime-directionality and the problem of citation sinks), but evaluated the quality ofproposed partitions using the directed network. Field names were chosen man-ually, but we relied on a statistical approach to identify phrases that distinguishedeach cluster [30]. These phrases are available in the supporting information.

Codebooks. The phrase frequency distribution Pi for each scholarly field (its“codebook") was assembled using the empirical frequency of each triplet of con-secutive words (trigram) in a random sample of 500 papers in that field, publishedbetween 1990-2010. We chose 500 because it was the largest number of papersthat we could sample from every field; some fields had just over 500 papersfrom 1990-2010, making larger samples impossible. Samples of approximately100 papers yielded a consistent field-level entropy, so we are confident that alarger 500 paper sample is sufficient for our analysis. We removed some functionand stop words from the trigrams to decrease the overall number of terms in thedatabase. A complete list of these terms, and more detailed methodology, areavailable in the online supplement. To apply our model of scholarly communica-tion to real data, we need the codebooks for field i and j to contain the samephrases (domain[Pi]=domain[Pj ]) so that a concept from field i can always beexpressed in the codebook of field j. To ensure this, we introduce a teleportationparameterα, which merges a discipline codebookPi with the “corpus codebook"S generated from the papers of every field:

psi (x) = (1− α)pi(x) + αs(x)

psj(x) = (1− α)pj(x) + αs(x),

where the lowercase p and s indicate the probability distribution function overphrases for the appropriate codebooks. Intuitively, this means that with somesmall probability α the writer (or reader) consults not the discipline codebook Pibut the corpus codebook S. The value of this parameter does not change ourresults, so we chose the intuitive value of 1%.

4 www.pnas.org/cgi/doi/10.1073/pnas.0709640104 Footline Author

Measuring distance in the citation network. We measure distance in the ci-tation network between fields i and j as follows. For randomly selected pairsof papers, one in field i and one in field j, we calculate the number of links inthe shortest path between them, where topological distance is computed for theundirected version of the citation network. The average value of this quantity isthe average shortest path between these two fields.

Jargon topomap. To construct the topographical map in Fig. 3, we first embedthe 60 fields in two dimensions using principal coordinates analysis (PCoA), i.e.,multidimensional scaling, as implemented in R [16]. We define the elements ofthe dissimilarity matrix using the average shortest path distance between twofields in the citation network. We refer to the xy coordinates of field i as F ixand F iy . Jargon mountains were calculated as the distance-weighted sum of

pairwise, symmetrized jargon barriers Jij = (Jij + Jji)/2. The underlyinglogic is as follows. The height of each pixel in the map should be more affectedby closeby fields. Thus we define a weight vector−−−→wPxy for pixelPxy , at positionx and position y. The components of this vector determine how much a givenfield i contributes to the pixel height at position x, y:

wiPxy= (

√(x− F ix)2 + (y − F iy)2)−α,

where α is a parameter controlling how rapidly the influence of a given field fallsoff with distance. For Fig. 3, we used α = 1, which gives the inverse distance.Finally, we exclude the nearest field ( max(−−−→wPxy ) ) from calculations of thepixel height; inclusion of the nearest field leads to “holes" in the topographical

map centered on each field (the sum is dominated by the self-barrier Jii = 0).The height of the pixel Pxy is then given by:

Pxy =∑

i6=max(−−−→wPxy )

∑j 6=max(−−−→wPxy ),j 6=i

wiPxywjPxy

Jij .

ACKNOWLEDGMENTS. This work was supported in part by NSF grant SBE-0915005 to CTB, a WRF-Hall research fellowship to DAV, and a generous giftfrom JSTOR.

1. de Condillac E (1782) Cours d’etude pour l’instruction du Prince de Parme...2. Reach G (2009) Diabetologia 52:1461–1463.3. Richardson ML (2010) American Entomologist 56:11–13.4. Bischof N, Eppler MJ (2010) In Proceedings of the Tenth International Knowledge

Management Conference IKnow, volume 10.5. Rosvall M, Bergstrom CT (2008) Proceedings of the National Academy of Sciences

105:1118–1123.6. Leydesdorff L, Rafols I (2008) Journal of the American Society for Information Sci-

ence and Technology 60:348–362.7. Small H (1999) Journal of the American Society for Information Science and Tech-

nology 50:799–813.8. Boyack KW, Klavans R, Borner K (2005) Scientometrics 64:351–374.9. Shannon CE (1948) The mathematical theory of communication, volume 27.

10. Zipf GK (1935) The psycho-biology of language. (Houghton, Mifflin).11. Zipf GK (1949) Human behavior and the principle of least effort. (Addison-Wesley

Press).12. Cover TM, Thomas JA (2006) Elements of Information Theory: Chapter 5 (Wiley-

interscience).13. Cover TM, Thomas JA (2006) Elements of Information Theory (Wiley-interscience).14. Kullback S, Leibler RA (1951) The Annals of Mathematical Statistics 22:79–86.15. Sokal RR (1958) Univ Kans Sci Bull 38:1409–1438.

16. Borg I, Groenen PJ (2005) Modern multidimensional scaling: Theory and applica-tions (Springer).

17. de Solla Price DJ (1965) Science 149:510–515.18. Ceccarelli L (2001) Shaping science with rhetoric: The cases of Dobzhansky,

Schrodinger, and Wilson (University of Chicago Press).19. Landauer TK, Laham D, Derr M (2004) Proceedings of the National Academy of

Sciences 101:5214–5219.20. Gerrish S, Blei DM (2010) In Proceedings of the 26th International Conference on

Machine Learning.21. Moody J, Light R (2006) The American Sociologist 37:67–86.22. Kemp C, Regier T (2012) Science 336:1049–1054.23. Fawcett TW, Higginson AD (2012) Proceedings of the National Academy of Sciences

109:11735–11739.24. Bourdieu P, Thompson JB (1991) Language and symbolic power (Harvard Univer-

sity Press).25. Coleman H (1985) International Journal of the Sociology of Language 1985:105–130.26. Holmes J, Meyerhoff M (1999) Language in Society 28:173–183.27. Evans JA, Foster JG (2011) Science 331:721–725.28. Fortunato S (2010) Physics Reports 486:75–174.29. Lancichinetti A, Fortunato S (2009) Physical Review E 80:056117.30. Forman G (2003) The Journal of Machine Learning Research 3:1289–1305.

Footline Author PNAS Issue Date Volume Issue Number 5

Fig. 1. (A) The average jargon barrier Ji =∑j Jij/N across fields. Note that biological sciences have systematically larger jargon barriers than social sciences.

(B) The total number of phrases (distinct three word combinations/trigrams) in each codebook. Note that there is no pattern at the domain level, i.e., biological sciences(blue) and behavioral/social sciences (orange) do not vary systematically in the number of phrases. (C) The number of phrases per distinct word in each field. Notethe systematic difference between biological sciences (blue) and behavioral sciences (orange). This pattern suggests that social science fields use fewer words inmore combinations, while word combination in the biological sciences is more constrained.

6 www.pnas.org/cgi/doi/10.1073/pnas.0709640104 Footline Author

Jargon clustering

Citation clustering

B

A

Distance(jargon barrier)

0.25

0.30

0.35

0.40

0.45

4.5

5.5

6.5

7.5

Distance (cites)

Fig. 2. (A) Clustering the fields by shortest citation path. Note that the biological sciences cluster together on the left, and the social and behavioral sciences clustertogether on the right. (B) The biological sciences do not cluster together by symmetrized jargon Jij . The molecular and cell biology subfields (red) are distant fromthe other biological science fields and the social science fields by jargon. Similarly, economic fields clustered by citation (blue) are grouped with other fields by jargon.Growth economics and consumer theory cluster with portfolio theory, option pricing, and time-series analysis by citation but with labor related fields (gender and labor;unemployment) by jargon. All clustering performed using UPGMA, but result holds for different clustering methods.

Footline Author PNAS Issue Date Volume Issue Number 7

Fig. 3. Topographical map of science combining semantic and citation data. Fields that are close communicate frequently: positions in space are calculated byapplying Principal Coordinate Analysis to the matrix of shortest average citation paths and retaining the first and second principal coordinates. In this contour map,“mountains" (dark brown shading to green) represent the distance-weighted sum of symmetrized jargon barriers between fields, Jij ; see Materials and Methods.Despite substantial citation flow between fields separated by these mountains (e.g., survival analysis and medical outcomes), communication is inefficient. Note thatsocial sciences cluster together on the lefthand side and biological sciences on the righthand side, with statistics located between. Note also that molecular biologyfields (upper right) are separated from other fields, including other biological sciences, by huge jargon barriers, and that jargon barriers sensibly separate the remainingfields (smaller panels). In the small panels, labels have been slighted shifted to increase legibility.

8 www.pnas.org/cgi/doi/10.1073/pnas.0709640104 Footline Author

0.0 0.4 0.8 1.2

0.4

0.6

0.8

1.0

Relative shortest path

Effi

cien

cy o

f com

mun

icat

ion

option pricing

Deca

y ra

te

4

6

8

10

12

opti

on p

rici

ngm

athe

mat

ics

educ

atio

nm

edic

al o

utco

mes

cong

ress

iona

l ele

ctio

nsar

t edu

cati

ongr

owth

eco

nom

ics

men

tal h

ealt

hso

ciol

ogy

educ

atio

nge

nder

and

labo

rti

me

seri

es a

naly

sis

gene

rali

zed

line

ar m

odel

spl

ant p

atho

gens

myc

orrh

izal

bio

logy

soci

olog

y of

rel

igio

nsex

ecut

ive

com

pens

atio

nm

arit

al d

isru

ptio

nso

cial

mov

emen

tsco

mpu

tati

onal

bay

esia

n st

atis

tics

surv

ival

ana

lysi

sU

S c

onst

itut

iona

l law

extr

acel

lula

r m

atri

xfr

ugiv

ory

unem

ploy

men

tC

hild

hood

dev

elop

men

tm

embr

ane

cell

bio

logy

mer

gers

and

acq

uisi

tion

sne

stin

g ec

olog

ycy

tosk

elet

onpo

rtfo

lio

theo

ryre

prod

ucti

ve d

emog

raph

yco

nsum

er th

eory

mar

keti

ngtu

rtle

sin

tern

atio

nal r

elat

ions

stra

tegi

c m

anag

emen

tte

en s

exua

l beh

avio

rav

ian

bree

ding

eco

logy

amph

ibia

l lif

e hi

stor

yri

pari

an e

colo

gypo

llin

atio

n ec

olog

yle

af e

colo

gym

itoc

hond

rial

gen

etic

sw

ater

fow

lke

rnel

ana

lysi

sfo

rest

soi

l eco

logy

land

scap

e ec

olog

yba

tsbe

ars

daph

nia

inte

rtid

al e

colo

gyli

zard

ther

mor

egul

atio

nm

ass

exti

ncti

onpl

ant−

herb

ivor

e in

tera

ctio

nsbi

olog

ical

inva

sion

svo

les

rain

fore

st e

colo

gyph

ylog

enet

ic in

fere

nce

HIV

plan

t sys

tem

atic

sen

viro

nmen

tal t

oxic

olog

y

A B

Scientific field

Fig. 4. (A) Efficiency of communication from focal field i to target field j (Eij = 1− Jij ) decays with the distance to field j differently for different fields. Here weshow the slowest decay (option pricing, dashed line) and the fastest decay (environmental toxicology, solid line) out of the 60 fields (see Supplement for others). Fieldsare plotted by behavioral science (orange) or biological science (blue). Focal fields with fast decay tend to have few fields nearby with which communication is efficient.Those with slow decay have several neighbors with relatively low jargon barriers, i.e., efficient communication. Distance is computed using the normalized shortestpath: the average shortest path from a paper in field i to a paper in field j, divided by the average shortest path from a paper in field i to another paper in field i. Wesubtract 1 from this value so that the normalized shortest path from a field to itself is 0. This normalization allows us to account for differences in citation norms thatcause focal fields to be tightly or loosely connected, i.e., to have a short or long average within field path distance.

Footline Author PNAS Issue Date Volume Issue Number 9