Presented at EACL 2012.

Visualising Typological Relationships: Plotting WALS with Heat Maps

Richard Littauer¹, Rory Turnbull², Alexis Palmer¹

¹ Universität des Saarlandes, ² Ohio State University

Why?

• Data deluge in science

• Typology has been shown to be useful for linguistic studies (Greenberg 1963, Chomsky 2000, Dunn et al. 2001).

• Showing typological diversity visually can help cut down on research time and illuminate new areas of possible research.

Basic Overview

• Our visualisation technique combines:
– geographic
– phylogenetic
– linguistic data.

World Atlas of Language Structures (WALS) (Dryer and Haspelmath, 2011).

Previous Work

Similar visualisation work:
– Language typology: Mayer et al., 2010; Rohrdantz et al., 2010
– Phylogeny: Multitree, 2009
– Geographical variation: Wieling et al., 2011

Work with WALS:
– Daumé & Campbell 2007, Daumé 2009

Pruning

WALS:
– 2,678 languages
– 192 feature options (out of 144 features)
– 16% of the data filled

Pruning:
– 372 languages
– Average of 96 features
– Only languages with 30% or more of their features filled
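
The 30% threshold can be illustrated with a minimal Python sketch. The data layout (a dictionary of languages mapping feature ids to a value, or `None` when missing), the toy entries, and the names `fill_ratio` and `prune` are all assumptions made for illustration, not the authors' code.

```python
# Hypothetical toy table: language -> {feature id -> value, or None if missing}.
toy_table = {
    "lang_a": {"f1": "SVO", "f2": None, "f3": "yes", "f4": None},  # 2 of 4 filled
    "lang_b": {"f1": None, "f2": None, "f3": None, "f4": "no"},    # 1 of 4 filled
}


def fill_ratio(features):
    """Return the fraction of features that have a recorded value."""
    if not features:
        return 0.0
    filled = sum(1 for value in features.values() if value is not None)
    return filled / len(features)


def prune(table, min_fill=0.30):
    """Keep only languages whose fill ratio is at least min_fill."""
    return {
        lang: feats
        for lang, feats in table.items()
        if fill_ratio(feats) >= min_fill
    }


kept = prune(toy_table)
print(sorted(kept))
```

Here `lang_b` is dropped because only a quarter of its features are recorded.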

Phylogenetic Distance

WALS’ tree hierarchy:
– Three different levels:
  • Family: ‘Sino-Tibetan’;
  • Sub-family: ‘Tibeto-Burman’;
  • Genus: ‘Northern Naga’.
– Doesn’t take into account language contact.

– We used geographical proximity as a proxy for language contact.
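
One simple way to turn a three-level hierarchy into a distance is to count how many levels two languages share from the top down. The sketch below assumes that scoring; the slides do not state the exact metric used, and the `Lineage` type, the `tree_distance` function, and the placeholder names are hypothetical.

```python
from typing import NamedTuple


class Lineage(NamedTuple):
    """Three-level classification: family, sub-family, genus."""
    family: str
    subfamily: str
    genus: str


def tree_distance(a, b):
    """Assumed scoring: 0 = same genus, 1 = same sub-family,
    2 = same family only, 3 = no shared level."""
    shared = 0
    for level_a, level_b in zip(a, b):
        if level_a != level_b:
            break  # lower levels cannot match once a higher one differs
        shared += 1
    return len(a) - shared


# The first lineage is the example from the slide; the others are placeholders.
naga = Lineage("Sino-Tibetan", "Tibeto-Burman", "Northern Naga")
sister = Lineage("Sino-Tibetan", "Tibeto-Burman", "placeholder_genus")
outsider = Lineage("placeholder_family", "placeholder_subfamily", "placeholder_genus")

print(tree_distance(naga, sister), tree_distance(naga, outsider))
```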

Geographical Proximity Filtering

• Each language in WALS is associated with a geographical coordinate.

• Haversine formula

• Within limits: geography, fullness in WALS.
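
For reference, the haversine formula gives the great-circle distance between two latitude/longitude points. The sketch below is a standard Python rendering; the function name `haversine_km` and the mean Earth radius of 6371 km are assumptions, not taken from the authors' code.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points in degrees."""
    phi1 = math.radians(lat1)
    phi2 = math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    # Clamp to 1.0 to guard against rounding error near antipodal points.
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, a)))


# A quarter of the way around the equator.
print(round(haversine_km(0.0, 0.0, 0.0, 90.0), 1))
```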

• First approach:
– Arbitrary radius from centroid in order to create a decision boundary for clustering neighbouring languages.
– 500 kilometres provided a sufficient number of examples after cleaning WALS.
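
The radius approach can be sketched as a filter that keeps every language whose coordinate lies within 500 km of a chosen centre language. This is a minimal illustration: the coordinates are invented, and `haversine_km` and `within_radius` are hypothetical names rather than the authors' implementation.

```python
import math


def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Great-circle distance in kilometres (assumed mean Earth radius)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * radius_km * math.asin(math.sqrt(min(1.0, a)))


def within_radius(centre, coords, radius_km=500.0):
    """Names of languages within radius_km of the centre language, nearest first."""
    c_lat, c_lon = coords[centre]
    hits = []
    for name, (lat, lon) in coords.items():
        if name == centre:
            continue
        distance = haversine_km(c_lat, c_lon, lat, lon)
        if distance <= radius_km:
            hits.append((distance, name))
    return [name for _, name in sorted(hits)]


# Invented coordinates (latitude, longitude) in degrees.
coords = {
    "centre": (0.0, 0.0),
    "near": (0.0, 1.0),   # roughly 111 km away
    "mid": (3.0, 0.0),    # roughly 334 km away
    "far": (0.0, 10.0),   # roughly 1112 km away
}
print(within_radius("centre", coords))
```

Sorting by distance keeps the nearest neighbours first, which is convenient if the rows of a heat map are to be ordered by proximity to the centre language.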

• Second approach:
– Arbitrary lower bound for near languages.
– Sufficient remainder.
– Under-representative of contact languages.
– Not as good as the radius method.

WALS Languages and Sparsity

Geographically Focused Map

Phylogenetically Focused Map

More Maps

Conclusion

• A newly applied method for looking at sparse data

• Combines phylogenetic, geographic, and typological data

Final Remarks

Future work:
• Integrating Ethnologue or Multitree for language families.
• Further exploration showing more natural organisation of the linguistic features.

All code and visualisations available here: https://github.com/RichardLitt/visualizing-language