Page 1

Title Slide

“Darwinian Evolution, Symmetry, Conservation Principles and a box of chocolates"

Version 1.2: 06/Nov/2017

SEMINAR

Les Hatton

Emeritus Professor of Forensic Software Engineering, Kingston University, London

Protein work done with Professor Greg Warr, NSF and MUSC.

Page 2

So far, an 8-year slog during which …

Walked > 3,500 miles in Richmond Park

Produced 2,602 plots

Analysed 100 million lines of code hundreds of times

Analysed the 27Gb European Protein Database (uniprot.org) hundreds of times

Parsed the entire MusicXML library and significant chunks of Project Gutenberg several times

Wrote compiler front-ends for 7 programming languages in C, plus around 22,000 lines of Perl, 3,300 lines of R statistics scripts and a protein parser

Got more rejections (> 8 journals) than in the rest of my career put together

Page 3

Vocabulary

Component: a piece of a system made from a string of …

… Tokens: indivisible symbols chosen from an …

… Alphabet: the unique set of symbols from which every component is made.

Examples

Computer programs. Each function or subroutine is made from an alphabet of programming language tokens.

Proteins are components made of amino acids chosen from a unique alphabet of 22 amino acids (those coded directly from DNA), plus an increasingly large number of tweaked amino acids.

Page 4

Tokens in programming languages

Take an example from C:

    void bubble( int a[], int N)
    {
        int i, j, t;
        for( i = N; i >= 1; i--) {
            for( j = 2; j <= i; j++) {
                if ( a[j-1] > a[j] ) {
                    t = a[j-1]; a[j-1] = a[j]; a[j] = t;
                }
            }
        }
    }

Unique alphabet:

Fixed (18): void int ( ) [ ] { , ; for = >= -- <= ++ if > -

Variable (8): bubble a N i j t 1 2

Total size: Total (94)

Tokenising requires writing a compiler front-end for each language
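
A rough sketch of the counting on this slide, written in Python rather than C for brevity. It uses a regular expression in place of a real compiler front-end, so it only copes with the operators that appear in this one function; `TOKEN_RE`, `KEYWORDS` and `tokenise` are names invented for the illustration. It treats the closing brace as a symbol in its own right, so its fixed count need not agree exactly with the figure quoted on the slide.

```python
import re
from collections import Counter

SOURCE = """
void bubble( int a[], int N)
{
    int i, j, t;
    for( i = N; i >= 1; i--) {
        for( j = 2; j <= i; j++) {
            if ( a[j-1] > a[j] ) {
                t = a[j-1]; a[j-1] = a[j]; a[j] = t;
            }
        }
    }
}
"""

# Identifiers and numbers first, then two-character operators, then
# single-character punctuation. Assumption: enough for this example only.
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+|>=|<=|\+\+|--|[()\[\]{},;=<>+\-]")

# Language keywords used in the example; these belong to the fixed alphabet.
KEYWORDS = {"void", "int", "for", "if"}


def tokenise(text):
    """Split C-like source text into a flat list of tokens."""
    return TOKEN_RE.findall(text)


tokens = tokenise(SOURCE)
counts = Counter(tokens)

# Variable tokens: user-chosen names and constants. Fixed tokens: the rest.
variable = {t for t in counts if re.fullmatch(r"\w+", t) and t not in KEYWORDS}
fixed = set(counts) - variable

print("total size:", len(tokens))
print("variable tokens:", sorted(variable))
print("fixed tokens:", sorted(fixed))
```

A real front-end is still needed in practice, because comments, string literals, preprocessor directives and multi-character operators all defeat a pattern this simple.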

Page 5

A singular pattern …

Some of the many software ccdfs I analysed in 2010-11

Page 6

Distribution of software component lengths in programming tokens

600,000 functions from 80 million lines of C

Almost perfect power-law, $a_i \sim t_i^{-1/\beta}$

Weird pointy bit

ccdf: complementary cumulative distribution function

pdf: probability distribution function
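
For readers who have not met a ccdf before, here is a minimal sketch of how one could be computed from a list of component lengths. The function name `empirical_ccdf` and the sample data are made up for illustration, and the convention P(X >= x) is an assumption; some authors use P(X > x) instead.

```python
from collections import Counter


def empirical_ccdf(lengths):
    """Return (length, P(X >= length)) pairs in increasing order of length."""
    n = len(lengths)
    counts = Counter(lengths)
    remaining = n  # number of observations >= the current value
    ccdf = []
    for value in sorted(counts):
        ccdf.append((value, remaining / n))
        remaining -= counts[value]
    return ccdf


# Made-up component lengths in tokens, purely for illustration.
sample = [1, 1, 2, 4]
for value, probability in empirical_ccdf(sample):
    print(value, probability)
```

Plotting these pairs on log-log axes is what turns a power-law tail into a straight line.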

Page 7

Tokens in proteins

Protein      Sequence                                              Unique alphabet
VG22_BPT2    KAEEEVEKNK EEAEEEAEKK IAE                             KAVENI
PHI_MYTCA    AKAKRSPRKK KAAVKKSSKS KAKKPKSPKK KKAAKKPAPKK AAKKK    KAVRSP

Strings of 22 letters directly coded from DNA + thousands of tweaked ones (post-translational modification, PTM), through glycosylation and so on.
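
A small sketch of how the unique alphabet and total size of the two example proteins could be extracted. The helper name `alphabet_and_length` is invented here; the sequences are copied from the slide, and the alphabet is returned in sorted order rather than in the order shown on the slide.

```python
SEQUENCES = {
    "VG22_BPT2": "KAEEEVEKNK EEAEEEAEKK IAE",
    "PHI_MYTCA": "AKAKRSPRKK KAAVKKSSKS KAKKPKSPKK KKAAKKPAPKK AAKKK",
}


def alphabet_and_length(sequence):
    """Return (sorted unique residues, total residue count) for one protein."""
    residues = sequence.replace(" ", "")  # the spaces are only display grouping
    return "".join(sorted(set(residues))), len(residues)


for name, sequence in SEQUENCES.items():
    alphabet, total = alphabet_and_length(sequence)
    print(name, "alphabet:", alphabet, "size:", len(alphabet), "length:", total)
```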

Page 8

Distribution of protein lengths in amino acids

13,532,084 proteins built from 5,392,041,307 amino acids in the TrEMBL database 15-07.

Almost perfect power-law, $a_i \sim t_i^{-1/\beta}$

Weird pointy bit

Page 9

Computer programs and proteins

Why are the length distributions of such disparate systems functionally identical?

Page 10

Scientific things humans find it difficult to deal with

Time: Anything lasting more than a few years (a few minutes for news media). Especially natural selection and tectonic drift.

Scale: Anything much smaller than a midge and anything further away than, say, New Zealand. A grain of sand contains ~ $10^{19}$ atoms. The universe contains ~ $10^{24}$ stars, or ~ $10^{82}$ atoms.

Page 11

Natural Selection - time

The eye. Light-sensitive cells were finally shown to have originated in the brain by molecular fingerprinting of the brain of a “living fossil”, Platynereis dumerilii, a marine worm (EMBL 2004, Science), although the eye exists in many stages in the animal kingdom.

There are however some things which natural selection does not explain, for example the lengths of proteins.

Page 12

Emmy Noether’s amazing theorem (1918) - scale

For physical systems (close enough), every conservation principle is associated with a symmetry.

Energy -> invariance in time

Linear momentum -> invariance in displacement

Angular momentum -> invariance in direction.

So are there any symmetries here?

Page 13

Scale invariance - proteins

All life

All bacteria

Human

All data from TrEMBL genomic databases.

Page 14

Scale invariance - software

“Universe” of 7 languages

C language

GNU C compiler

All data from open source downloads.

Page 15

Emergent behaviour in 40 MSLOC

40 million lines of Ada, C, C++, Fortran, Java, Tcl-Tk from 80+ systems

Page 16

Playing with beads and sniffing for conservation principles

Heterogeneous – software, proteins, music, literature

Homogeneous – atomic elements, literature

Page 17

Hartley-Shannon Information is token-agnostic

Hartley-Shannon information is basically the log of the number of ways of arranging tokens without caring what they mean. For a unique alphabet A and a total size T, this is $\log(A^T) = T \log A$.

But what happens when we build systems to conserve this (Conservation of Hartley-Shannon Information, CoHSI)?
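
As a worked example of the quantity defined on this slide, the sketch below evaluates it for the C function shown earlier in the talk, taking the slide's own figures of 18 + 8 = 26 unique symbols and 94 tokens. The function name is invented, and natural logarithms are an assumption; any other base only rescales the result.

```python
import math


def hartley_shannon_information(alphabet_size, total_size):
    """Return log(A**T), computed as T * log(A) to avoid enormous integers."""
    if alphabet_size < 1 or total_size < 0:
        raise ValueError("need alphabet_size >= 1 and total_size >= 0")
    return total_size * math.log(alphabet_size)


# Figures taken from the bubble-sort slide: A = 18 + 8, T = 94.
information = hartley_shannon_information(26, 94)
print(f"log(26^94) = {information:.1f} nats")
```

Nothing in the calculation depends on what the 26 symbols are, which is the sense in which the measure is token-agnostic.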

Page 18

Statistical mechanics

My first effort (2011-14)

Conserve total beads

Conserve total Information, $\log(A^T)$

Boltzmann’s magical statistical mechanics machine gives distributions
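
The slide does not show the algebra, so what follows is only a sketch of the standard constrained-maximisation argument under stated assumptions, not necessarily the derivation used in the talk. It assumes M components, where component i holds $t_i$ tokens drawn from $a_i$ unique symbols, that each component contributes $\ln a_i^{t_i} = t_i \ln a_i$ to the total information I, and that Stirling's approximation applies. The multipliers $\alpha$ and $\beta$ are Lagrange multipliers introduced here for the two conserved quantities.

```latex
% Sketch only: maximise the number of arrangements W subject to two constraints.
\begin{align}
  \ln W &= \ln T! - \sum_{i=1}^{M} \ln t_i!
        && \text{(ways of distributing the beads)} \\
  \sum_{i=1}^{M} t_i &= T
        && \text{(conserve total beads)} \\
  \sum_{i=1}^{M} t_i \ln a_i &= I
        && \text{(conserve total information)} \\
  0 &= \frac{\partial}{\partial t_i}
       \Bigl[ \ln W - \alpha \sum_{j} t_j - \beta \sum_{j} t_j \ln a_j \Bigr]
     \approx -\ln t_i - \alpha - \beta \ln a_i \\
  t_i &= e^{-\alpha}\, a_i^{-\beta}
\end{align}
```

Under these assumptions the most likely distribution is a power law in the alphabet size $a_i$.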

Page 19

Enter a box of chocolates …

Page 20

2016: The chocolate box extension for heterogeneous systems

How many ways can we arrange a chocolate box of $t_i$ chocolates guaranteeing $a_i$ unique chocolates?

$N(t_i, a_i; a_i)$

(For $t_i \gg a_i$, this becomes $a_i^{t_i}$ as we need, to give the observed power-law.)

This is not trivial (as I first thought) and relies on recursion and the additive compositions of numbers …

Example: $N(5,2;2) = \frac{5!}{1!\,4!} + \frac{5!}{4!\,1!} + \frac{5!}{2!\,3!} + \frac{5!}{3!\,2!} = 30$
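
A hedged sketch of the counting problem on this slide, restricted to the case shown in the example, where the number of chocolate types available equals the number that must appear. The recursion picks how many positions the first type occupies and recurses on the rest, which is one way of summing over the additive compositions. The function names are invented, and the inclusion-exclusion formula is a standard cross-check rather than something taken from the talk.

```python
from math import comb


def arrangements(t, a):
    """Ordered boxes of t chocolates that use each of a types at least once."""
    if a == 0:
        return 1 if t == 0 else 0
    # Give the first type k >= 1 of the t positions, leaving at least one
    # chocolate for each of the remaining a - 1 types.
    return sum(
        comb(t, k) * arrangements(t - k, a - 1)
        for k in range(1, t - (a - 1) + 1)
    )


def arrangements_by_inclusion_exclusion(t, a):
    """Same count via inclusion-exclusion over the types left out."""
    return sum((-1) ** j * comb(a, j) * (a - j) ** t for j in range(a + 1))


print(arrangements(5, 2))                # the slide's example
print(arrangements(40, 3) / 3 ** 40)     # ratio to a**t when t >> a
```

The second printed value is close to 1, in line with the slide's remark that the count approaches $a_i^{t_i}$ once $t_i$ is much larger than $a_i$.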

Page 21

2016: The chocolate box extension

Page 22

Heterogeneous CoHSI Length Distribution

Chocolate Box (2016-7)

Proteins

Eureka !

First effort (2014)

Page 23

Playing with beads

Heterogeneous – software, proteins, music, literature

Homogeneous – atomic elements, literature

Page 24

2017: The homogeneous case

The resultant CoHSI pdf for homogeneous systems mutates directly into Zipf’s law at all scales,

$a_i \sim i^{-\ldots}$

where i is the rank order, and therefore serves as a proof of Zipf’s empirical law.
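
To make "rank order" concrete, here is a minimal sketch of building a rank-frequency table for the words of a text, which is the raw material for a Zipf plot. The function name `rank_frequency`, the tokenising pattern and the sample sentence are all illustrative assumptions, not the pipeline used for the talk.

```python
import re
from collections import Counter


def rank_frequency(text):
    """Return (rank, word, count) triples, most frequent word first."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    # Ties are broken alphabetically so the ordering is reproducible.
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [(rank, word, n) for rank, (word, n) in enumerate(ordered, start=1)]


for rank, word, n in rank_frequency("the cat and the dog and the bird"):
    print(rank, word, n)
```

On a real corpus, plotting count against rank on log-log axes gives the familiar near-straight Zipf line.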

Page 25

2017: The homogeneous case (and hybrid systems)

Homogeneous (word freq.)

Heterogeneous (letter freq.)

Three Men in a Boat

European Constitution

pdf ccdf

Page 26

2017: The homogeneous case (and wild speculation)

Distribution of elements in universe

Distribution of elements in sea water

Dark Energy (atomic number -1 ?)

Dark Matter (atomic number 0 ?)

Page 27

Structure and component size: Power-law => big components are inevitable

$t_i^{-1.5}$

10 decisions

50 decisions

In any software system, for every eleven 10-decision components there will on average be one 50-decision component (since $5^{1.5} \approx 11.2$). The bigger the system, the more accurate this becomes, and you have no control over this. For proteins, the exponent is around 1.6.
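
The "eleven to one" figure follows directly from the exponent. The sketch below works it out, assuming frequency falls off as size to the power of minus the exponent; `relative_frequency` is a name invented for the illustration, and applying the same size ratio of 5 to the protein exponent is only for comparison.

```python
def relative_frequency(small, big, exponent):
    """Expected number of size-`small` components per size-`big` component
    when frequency is proportional to size ** (-exponent)."""
    return (big / small) ** exponent


software = relative_frequency(10, 50, 1.5)  # 5 ** 1.5, a little over 11
proteins = relative_frequency(10, 50, 1.6)  # same size ratio, exponent 1.6

print(f"software: {software:.2f} small components per large one")
print(f"proteins: {proteins:.2f} small components per large one")
```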

Page 28

Music and alphabets ?

If notes are tokens, does including duration make a difference in CoHSI ?
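
To show what the two alphabets mean in practice, here is a toy sketch. The melody is made-up data, not taken from the MusicXML corpus, and the function name `alphabet_sizes` is hypothetical. With duration, a token is a (pitch, duration) pair; without it, a token is the pitch alone.

```python
# Made-up melody for illustration: (pitch, duration in beats).
melody = [
    ("C4", 1.0), ("D4", 0.5), ("E4", 0.5),
    ("C4", 0.5), ("E4", 1.0), ("C4", 1.0),
]


def alphabet_sizes(notes):
    """Return (total size, alphabet with duration, alphabet without duration)."""
    with_duration = set(notes)                      # (pitch, duration) tokens
    without_duration = {pitch for pitch, _ in notes}  # pitch-only tokens
    return len(notes), len(with_duration), len(without_duration)


total, with_duration, without_duration = alphabet_sizes(melody)
print(total, with_duration, without_duration)
```

The total size is the same either way; only the unique alphabet changes with the choice of token.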

Page 29

Music and alphabets ?

CoHSI predicts that consistent alphabets will be power-laws and power-laws of one another just as we observe above.

883 pieces of music, duration and no-duration alphabet

log-log duration and no-duration alphabet

Page 30

Summary

The length and alphabetic properties of proteins right down to the species level and software components down to individual packages can be explained by CoHSI without recourse to natural selection or human volition.

CoHSI appears to be a deeper principle setting bounds for all discrete systems.

CoHSI implies highly conserved average component length and unusually frequent larger components, exactly as observed.

CoHSI implies that all consistent categorisations are power-laws of each other.

Page 31

A stunning revelation …

The occurrence rate of the various letters in the amino acid alphabet suggests that approximately one protein in the entire TrEMBL fully annotated subset SwissProt will contain the words KING and ELVIS at the same time. As it happens, there is exactly one …

Protein        Sequence
PT111_YEAST    … LQENAHIHTR KINGGEDSSL SGFNAVVDFER FEFKKKKVSH NDVYGAELVIS NSLKEGIAP …
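
A small sketch of the kind of search behind this slide, run only on the excerpt quoted above and not on the full protein or database. The helper `find_words` is a hypothetical name; positions are zero-based offsets within the excerpt, not within the full sequence.

```python
# Excerpt as quoted on the slide; the full sequence is elided there.
EXCERPT = "LQENAHIHTR KINGGEDSSL SGFNAVVDFER FEFKKKKVSH NDVYGAELVIS NSLKEGIAP"


def find_words(sequence, words):
    """Map each word to its first offset in the sequence, or -1 if absent."""
    residues = sequence.replace(" ", "")  # the spaces are only display grouping
    return {word: residues.find(word) for word in words}


hits = find_words(EXCERPT, ["KING", "ELVIS"])
print(hits)
print(all(position >= 0 for position in hits.values()))
```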

Page 32

Reference

My writing site: http://www.leshatton.org/

Earlier results of this work appear in IEEE TSE, PLOS ONE and arXiv.

Photographs of scientific figures and the Bach chorale courtesy of Wikipedia under Creative Commons.