Workshop in BioInformatics (0382.3102)

Prof. Benny Chor
benny@cs.tau.ac.il

Tel-Aviv University

Spring Semester, 2005

© Benny Chor

Preliminaries

The course is a required course for students of the bioinformatics track, and is offered to 3rd year Computer Science students as well. Students from other disciplines should consult the instructor.

The course was also given in the fall semester (by another instructor), and is expected (but not guaranteed) to be given next fall semester.

Some biological background knowledge is useful, but is not absolutely necessary.

AdministraTrivia

Grade is based on a software project implementation (55-65% of total), and on an outline (10 min.) and a presentation (30 min.) of the project (35-45%).

Projects to be done in groups of size 1 or 2.

Outlines presented on March 28.

Projects’ presentations will take place during the last 2-3 weeks of the semester.

AdministraTrivia (2)

Presentations and outlines should use computerized tools (prosper LaTeX, PowerPoint, or any other software of your choice).

In addition, there will be two or three lectures on various relevant topics in software engineering, given by the Computer Science system staff.

Physical attendance at all presentations and lectures is mandatory.

Projects’ Descriptions

Analysis, design and implementation of combinatorial optimization algorithms with bioinformatics relevance.

Some contemporary problems in comparative genomics, DNA chip analysis, phylogenetic analysis, regulatory motif finding, and more.

Getting acquainted with publicly available bioinformatics databases and using them.

Conducting supervised research in computational biology.

Efficient implementation of algorithms in C, C++, Java or Matlab (if you insist, we will also consider COBOL or even Scheme).

Projects’ Requirements

Each group works on its own individual project.

Projects require studying a problem in depth (typically based on research publications), understanding a solution (or devising a new one), and implementing it.

Implementation will require coding a fairly large program, testing it on simulated and actual biological data, and analysing the results.

Tentative TimeTable

Specification released February 28th.

Choices sent to staff by March 6th.

Two-page written summary of intended project sent to staff no later than March 20th.

Short interviews with each group held during the week of March 20th to 27th.

Outlines presented on March 28th.

Projects’ presentations during the last 2-3 weeks of the semester.

A written report, accompanied by working and documented software, due on June 1st, 2005.

Important Remark

Projects are intended as mini research.

As such, they may chart some unexplored territories.

Last year, one of the projects led to a paper accepted to a prestigious and highly competitive conference.

This year, we’d like to see more such results.

However, research has risks: not every attempt will succeed.

You may still get a high grade in the workshop even if your project was not a success (research-wise)!

The BioTechnology Revolution

Biological sciences have gone through a revolution in the last decade.

To a large extent, this revolution was driven by advances in biotechnology.

One of the better known results of this revolution is the sequencing of the human genome.

Also sequenced: genomes of about two hundred other organisms (mouse, rice, fruit fly (Drosophila), worm (C. elegans), mosquito (Anopheles), malaria, bacteria (E. coli), ...), and thousands of viruses.

The BioTechnology Revolution

Robotic sequencers

DNA microarrays

2D gels

mass spectrometry

SNP genotyping

and very many other biotechnologies

produce massive amounts of data.

The task of analyzing, interpreting, and understanding this data is where bioinformatics comes in.

Definition (take 1)

Working definitions from NIH (US National Institutes of Health):

Bioinformatics: Research, development, or application of computational tools and approaches for expanding the use of biological, medical, behavioral or health data, including those to acquire, store, organize, archive, analyze, or visualize such data.

Computational Biology: The development and application of data-analytical and theoretical methods, mathematical modeling and computational simulation techniques to the study of biological, behavioral, and social systems.

A Computer Scientist's Perspective

Recombinant DNA technology has created a revolution in Molecular Biology in the last decade.

New computational problems arise from large genome projects and novel high-throughput biotechnologies.

Problems involve collection, assembly, organization and interpretation of genetic sequence data.

Novel algorithmic, mathematical and statistical tools are crucial for analyzing this flow of information and discovering new global structures in it.

Important BioInfo Topics

Algorithms and heuristics motivated by problems originating from molecular biology.

Sequence comparison and alignment.

Constructing phylogenetic (evolutionary) trees from sequence data.

Probabilistic models for classification and analysis of sequence data, e.g. for gene finding.

Finding regulatory motifs in DNA sequences.
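
As a rough illustration of the sequence comparison topic, the sketch below computes the score of an optimal global alignment using the classic Needleman-Wunsch dynamic program. The function name and the scoring values (match +1, mismatch -1, gap -1) are assumptions chosen for this example; they are not taken from the course material.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-1):
    """Score of the best global alignment of strings a and b (Needleman-Wunsch).

    The scoring parameters are illustrative assumptions, not course-specified values.
    """
    # prev[j] = best score for aligning the already processed prefix of a with b[:j].
    # Initially the prefix of a is empty, so b[:j] is aligned against j gaps.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        # Aligning a[:i] against an empty prefix of b costs i gaps.
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            # Option 1: align a[i-1] with b[j-1] (match or mismatch).
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Option 2: a[i-1] against a gap. Option 3: b[j-1] against a gap.
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]
```

Under these assumed scores, global_alignment_score("ACGT", "AGT") returns 2: three matches and one gap. Only two rows of the table are kept, so memory is linear in the length of b, at the price of not recovering the alignment itself.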

Structural BioInformatics

Deals mainly with the interplay between proteins’ 3-dimensional structure and function, and their relation to designing new medicines.

Applies many tools from computer vision and computational geometry.

Genomics and Proteomics

Attempts to understand function and interactions between families of genes and proteins within the cell, tissue, or organism.

Quite recently, new roles of small non-coding RNAs were discovered (Science magazine discovery of the year, 2002).

Molecular Biology Background

Two important linear molecules: DNA and proteins. These are strings over 4- and 20-letter alphabets, respectively.

Specific genes, substrings of DNA, code for specific proteins.

Protein sequence influences structure, which in turn determines its function.

Moral: Study of similarity in sequence, structure and function of biological strings gives clues to further discovery.
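
To make the "genes code for proteins" statement concrete, here is a minimal sketch that maps a DNA string (4-letter alphabet) to a protein string (20-letter alphabet) using the standard genetic code. The names CODON_TABLE and translate are assumptions made for this example, and the sketch simplifies the biology: it reads a single frame from position 0 and ignores introns, the reverse strand, and alternative genetic codes.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, listed in TCAG order for each codon position; '*' marks a stop codon.
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
# Map each of the 64 codons (triplets over the 4-letter DNA alphabet) to an amino acid.
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate a coding DNA string in frame 0, stopping at the first stop codon."""
    dna = dna.upper()
    protein = []
    # Step through complete codons only; a trailing partial codon is ignored.
    for i in range(0, len(dna) - 2, 3):
        aa = CODON_TABLE[dna[i:i + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)
```

For instance, translate("ATGGCCTAA") yields "MA": ATG codes for methionine, GCC for alanine, and TAA is a stop codon.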

Evolution

Biological systems evolved over time from simpler to more complex organisms.

History of evolution gives key clues to important changes and improvements in biological function.

Moral: Evolutionary history gives important leads to further discovery.

And Now

To a short tour of some specific topics and problems.

Suggested Topics

Employing string operators, influenced by information theoretic tools, for gene finding.

“Systems Biology”: Employing linear and probabilistic models to infer genetic networks, based on gene expression datasets.

Finding highly conserved segments among pairs and triplets of genome sequences.

Finding common and separating properties of regulatory and metabolic networks over different species.

Testing the hypothesis that there is a correlation between proximity of genes (on the chromosome) and their interaction.
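
One very simplified way to think about conserved segments in a pair of sequences is the longest exactly shared substring. The sketch below finds it with a standard dynamic program. The function name is hypothetical, and exact matching is an assumption made for brevity: real conserved regions tolerate mismatches and gaps, and genome-scale inputs would call for suffix-based indexes rather than this quadratic-time table.

```python
def longest_common_segment(a, b):
    """Return (length, start_in_a, start_in_b) of the longest exact shared substring.

    Returns (0, 0, 0) when the two sequences share no character.
    """
    best = (0, 0, 0)
    # prev[j] = length of the longest common suffix of a[:i-1] and b[:j].
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                # Extend the shared run that ended at the previous pair of positions.
                curr[j] = prev[j - 1] + 1
                if curr[j] > best[0]:
                    best = (curr[j], i - curr[j], j - curr[j])
        prev = curr
    return best
```

For example, longest_common_segment("TTACGTACGG", "CCACGTACAA") returns (6, 2, 2), corresponding to the shared segment "ACGTAC" starting at position 2 in both strings.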

Suggested Topics (cont.)

Finding regulatory motifs, employing an existing system (MeX) together with gene expression datasets and regions conserved across different genomes.

Finding associations among cancer related genes, using an existing tool for detecting linear separability of genes, and gene expression datasets.

Exploring properties of the likelihood function in phylogenetic (evolutionary) trees, for simulated and real sequences.

Other topics you would like to explore (after discussing them with us and getting our approval).
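
To give a feel for what a likelihood function on a tree looks like, the sketch below treats the smallest possible case: two aligned sequences joined by a single branch of length t, under the Jukes-Cantor (JC69) substitution model. The choice of JC69, the function names, and the restriction to two leaves are all assumptions made for illustration; an actual project would likely involve larger trees and richer models.

```python
import math


def jc69_log_likelihood(seq1, seq2, t):
    """Log-likelihood of two aligned sequences at distance t under JC69.

    t is the expected number of substitutions per site and must be positive.
    """
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned (equal length)")
    if t <= 0:
        raise ValueError("t must be positive")
    decay = math.exp(-4.0 * t / 3.0)
    p_same = 0.25 + 0.75 * decay   # probability a site shows the same base
    p_diff = 0.25 - 0.25 * decay   # probability of one specific different base
    total = 0.0
    for x, y in zip(seq1, seq2):
        # 0.25 is the stationary probability of the base at the first leaf.
        total += math.log(0.25 * (p_same if x == y else p_diff))
    return total


def jc69_ml_distance(seq1, seq2):
    """Closed-form maximum-likelihood estimate of t for two aligned sequences."""
    if len(seq1) != len(seq2) or not seq1:
        raise ValueError("sequences must be aligned and non-empty")
    p = sum(x != y for x, y in zip(seq1, seq2)) / len(seq1)
    if p >= 0.75:
        raise ValueError("too many differences: the estimate is undefined")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)
```

Evaluating jc69_log_likelihood over a grid of t values and comparing the peak with jc69_ml_distance is a simple first experiment on simulated or real sequences.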
