1 LSM2241 P1 & P2 – Extra Discussion Questions

Upload: bridget-jordan

Post on 02-Jan-2016


Page 1: 1 LSM2241 P1 & P2 – Extra Discussion Questions. Features of major databases (PubMed and NCBI Protein Db) 2


LSM2241

P1 & P2 – Extra Discussion Questions


Features of major databases (PubMed and NCBI Protein Db)


Anatomy of PubMed Db


Epub ahead of print and journal impact factor


How to get the impact factor of any journal:
1) Direct source – Web of Science database (free for NUS students)
2) Indirect sources, e.g. blogs, sites etc. (do a Google search)


Anatomy of a PubMed record

Extra information compared to slide 3 (Anatomy of PubMed Db)


Demo on downloading articles

See AccessingOnlineJournalArticles.ppt for details


Anatomy of a Protein Db


Popular data sources:
dbj – DDBJ (DNA Data Bank of Japan database)
emb – The European Molecular Biology Laboratory (EMBL) database
prf – Protein Research Foundation database
sp – SwissProt
gb – GenBank
pir – Protein Information Resource

Accession numbers and GenInfo Identifiers

Accession: NM_000546
Version: NM_000546.3
GI (or GenInfo Identifier): 120407067


Why do we need accession number and GI for one record?

1) What is the difference between accession and GI?

2) Why do we need these two when both seem to be accession numbers?


Why do we need accession numbers and GIs?

Q1) Which revision will NCBI show if you were to search by the accession only, without the version number?

ACCESSION    GI           VERSION
NM_000546    120407067    NM_000546.3
NM_000546    8400737      NM_000546.2
NM_000546    4507636      NM_000546.1

[Figure: accession NM_000546 across three revisions. Sequence_v1 (NM_000546.1, GI 4507636) → sequence update → Sequence_v2 (NM_000546.2, GI 8400737) → sequence update → Sequence_v3 (NM_000546.3, GI 120407067). The accession stays the same; the GI and version change.]


Accession numbers

- The unique identifier for a sequence record.

- An accession number applies to the complete record.

- Accession numbers do not change, even if information in the record is changed at the author's request.

- Sometimes, however, an original accession number might become secondary to a newer accession number, if the authors make a new submission that combines previous sequences, or if for some reason a new submission supersedes an earlier record.


GenInfo Identifiers

- GenInfo Identifier (GI): a sequence identification number

- If a sequence changes in any way, a new GI number will be assigned

- A separate GI number is also assigned to each protein translation within a nucleotide sequence record

- A new GI is assigned if the protein translation changes in any way

- GI sequence identifiers run parallel to the new accession.version system of sequence identifiers


Version

- A nucleotide sequence identification number that represents a single, specific sequence in the GenBank database.

- If there is any change to the sequence data (even a single base), the version number will be increased, e.g., U12345.1 → U12345.2, but the accession portion will remain stable.

- The accession.version system of sequence identifiers runs parallel to the GI number system, i.e., when any change is made to a sequence, it receives a new GI number AND an increase to its version number.

- A Sequence Revision History tool (http://www.ncbi.nlm.nih.gov/entrez/sutils/girevhist.cgi) is available to track the various GI numbers, version numbers, and update dates for sequences that appeared in a specific GenBank record.
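
The relationship between accession, version and GI can be sketched in a few lines of Python. This is only an illustration: the three revisions of NM_000546 are the ones listed on the earlier slide, while the names `REVISIONS`, `split_accession_version` and `find_revision` are made up for this sketch. Returning the highest version when no version is given is an assumption about lookup behaviour, not a description of how NCBI implements it.

```python
from typing import Optional, Tuple

# (accession, version number, GI) for the revisions of NM_000546 on the slide.
REVISIONS = [
    ("NM_000546", 1, 4507636),
    ("NM_000546", 2, 8400737),
    ("NM_000546", 3, 120407067),
]


def split_accession_version(identifier: str) -> Tuple[str, Optional[int]]:
    """Split 'NM_000546.3' into ('NM_000546', 3); version is None if absent."""
    accession, dot, version = identifier.partition(".")
    return accession, (int(version) if dot else None)


def find_revision(identifier: str) -> Tuple[str, int, int]:
    """Look up one revision. Assumption: no version means 'the latest one'."""
    accession, version = split_accession_version(identifier)
    matches = [rev for rev in REVISIONS if rev[0] == accession]
    if not matches:
        raise KeyError(identifier)
    if version is None:
        # Pick the revision with the highest version number.
        return max(matches, key=lambda rev: rev[1])
    for rev in matches:
        if rev[1] == version:
            return rev
    raise KeyError(identifier)


print(find_revision("NM_000546.2"))  # a specific revision
print(find_revision("NM_000546"))    # accession only
```

Note that every revision shares the same accession, while each sequence change produces both a new version number and a new GI.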


Anatomy of a Protein Db record


Fasta Sequence


Fasta Format

• Text-based format for representing nucleic acid sequences or peptide sequences (single letter codes).

• Sequences in this format are easy to manipulate and to parse with programs.

>SEQUENCE_1
MTEITAAMVKELRESTGAGMMDCKNALSETNGDFDKAVQLLREKGLGKAAKKADRLAAEGLVSVKVSDDFTIAAMRPSYLSYEDLDMTFVENEYKALVAELEKENEERRRLKDPNKPEHKIPQFASRKQLSDAILKEAEEKIKEELKAQGKPEKIWDNIIPGKMNSFIADNSQLDSKLTLMGQFYVMDDKKTVEQVIAEKEKEFGGKIKIVEFICFEVGEGLEKKTEDFAAEVAAQL
>SEQUENCE_2
SATVSEINSETDFVAKNDQFIALTKDTTAHIQSNSLQSVEELHSSTINGVKFEEYLKSQIATIGENLVVRRFATLKAGANGVVNGYIHTNGRVGVVIAAACDSAEVASKSRDLLRQICMH

In each record, the row starting with ">" is the description line/row; the rows that follow it are the sequence data line(s).


Fasta Format (cont.)

• Begins with a single-line description, followed by lines of sequence data.

• Description line
– Distinguished from the sequence data by a greater-than (">") symbol.
– The word following the ">" symbol in the same row is the identifier of the sequence.
– There should be no space between the ">" and the first letter of the identifier.
– Keep the identifier short and clear; some old programs only accept identifiers of up to 10 characters. For example: >gi|5524211|Human or >HumanP53

• Sequence line(s)
– Ensure that the sequence data starts in the row following the description row (be careful of the word wrap feature).
– The sequence ends if another line starting with a ">" appears; this indicates the start of another sequence.
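
The rules above can be sketched as a small parser. This is a minimal illustration rather than a replacement for an established library: the function name `parse_fasta` is made up for this sketch, and the example input uses only the first residues of the two sequences shown on the slide, wrapped over several rows to show that sequence data may span multiple lines.

```python
from typing import Dict, Iterable


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Return {identifier: sequence} for FASTA-formatted lines."""
    records: Dict[str, str] = {}
    current_id = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank rows
        if line.startswith(">"):
            # Description row: the first word after ">" is the identifier.
            fields = line[1:].split()
            if not fields:
                raise ValueError("description line has no identifier")
            current_id = fields[0]
            records[current_id] = ""
        elif current_id is None:
            raise ValueError("sequence data found before any description line")
        else:
            # Sequence rows are joined until the next ">" row appears.
            records[current_id] += line
    return records


example = """>SEQUENCE_1
MTEITAAMVKELRESTGAGM
MDCKNALSETNGDFDKAVQL
>SEQUENCE_2
SATVSEINSETDFVAKNDQF
"""

records = parse_fasta(example.splitlines())
for identifier, sequence in records.items():
    print(identifier, len(sequence))
```

A new record starts only when a row begins with ">", which is why an accidental line wrap inside the description row would corrupt the sequence that follows.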



Amino acids


IUPAC One Letter Amino Acid Code

Code  Amino acid                    Mnemonic
A     Alanine
B     Asx
C     Cysteine
D     Aspartic Acid                 Aspar(D)ic Acid
E     Glutamic Acid
F     Phenylalanine                 (F)enylalanine
G     Glycine
H     Histidine
I     Isoleucine
J     Xle
K     Lysine
L     Leucine
M     Methionine
N     Asparagine                    Asparagi(N)e
O     22nd (Pyl) Pyrrolysine        Pyrr(O)lysine
P     Proline
Q     Glutamine                     (Q)lutamine
R     Arginine                      (R)ginine
S     Serine
T     Threonine
U     21st (Sec) Selenocysteine
V     Valine
W     Tryptophan                    T(W)ptophan
X     Xaa
Y     Tyrosine                      T(Y)rosine
Z     Glx

B, J, X and Z are ambiguity codes.


Note

Amino acid                           Three-letter code   Single-letter code
Asparagine or aspartic acid          Asx                 B
Glutamine or glutamic acid           Glx                 Z
Leucine or isoleucine                Xle                 J
Unspecified or unknown amino acid    Xaa                 X
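
When working with protein sequences in a program, it can help to flag these ambiguity codes explicitly. The sketch below is one way to do that in Python. The names `UNAMBIGUOUS`, `AMBIGUOUS` and `ambiguous_positions` are made up for this illustration, and treating X as "any residue" is a simplifying assumption.

```python
from typing import List, Tuple

# The 20 standard one-letter codes plus U (Sec) and O (Pyl).
UNAMBIGUOUS = set("ACDEFGHIKLMNPQRSTVWY") | {"U", "O"}

# Ambiguity codes from the note: code -> residues it may stand for.
AMBIGUOUS = {
    "B": {"D", "N"},   # Asx: aspartic acid or asparagine
    "Z": {"E", "Q"},   # Glx: glutamic acid or glutamine
    "J": {"L", "I"},   # Xle: leucine or isoleucine
    "X": UNAMBIGUOUS,  # Xaa: unspecified or unknown (assumed: any residue)
}


def ambiguous_positions(sequence: str) -> List[Tuple[int, str]]:
    """Return (1-based position, code) for every ambiguity code found."""
    found = []
    for position, letter in enumerate(sequence.upper(), start=1):
        if letter in AMBIGUOUS:
            found.append((position, letter))
        elif letter not in UNAMBIGUOUS:
            raise ValueError(
                f"{letter!r} at position {position} is not an amino acid code"
            )
    return found


print(ambiguous_positions("MKBLXZ"))  # [(3, 'B'), (5, 'X'), (6, 'Z')]
```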


Advice

• We highly recommend that you memorize the amino acid codes and their structures (covered in lectures on 3D structures)

• Memorizing the codes and in particular the structures will be very useful for this module and other modules, especially for research purposes.

• It is not compulsory that you memorize these for this module.


Features of major database (Gene Db)


Anatomy of Gene Db


Anatomy of a Gene Db record


A section of a Gene Db record: Reference Sequences


mRNA Accession number

Protein Accession number


Questions


A) Problem Scenario

Mr. Tan Yong Liang, Benjamin has just joined Prof. Tan Tin Wee’s lab to do his PhD. He is to continue the project that was done by Dr. Asif M. Khan, who just graduated from Prof. Tan’s lab with a PhD. To better understand the project that Dr. Khan did, Prof. Tan asked Benjamin to read all the papers that Dr. Khan published. Benjamin, being a newbie to bioinformatics, needs your help in finding the papers. Can you help him answer the following questions?


A) Questions

Q1. Which database(s) should he search?

Q2. Help him formulate his search query based on the following available information:

1. Corresponding authors: Vladimir Brusic, Thomas J August, Tan Tin Wee

2. In one of the papers, Dr. Asif M. Khan’s name was incomplete: Asif Khan

3. Prof. August has a paper with Rosati M, which is also co-authored by someone with the same incomplete abbreviation as Dr. Khan

Q3. On the results page, you will see two tabs, “All” and “Review”. What is the difference between them?

Q4. Is PubMed comprehensive?
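
One possible way to assemble a query for Q2 is sketched below in Python. It is an illustration, not a model answer: `[Author]` is PubMed's author field tag, but the abbreviated author names (surname plus initials), the helper names `any_author` and `build_query`, and the decision to exclude Rosati M with NOT are all assumptions drawn from the hints above.

```python
from typing import List


def any_author(names: List[str]) -> str:
    """OR together author names, each tagged with the [Author] field."""
    return "(" + " OR ".join(f"{name}[Author]" for name in names) + ")"


def build_query(target: List[str], coauthors: List[str], exclude: List[str]) -> str:
    """Require one target name and one co-author, then drop excluded authors."""
    query = f"{any_author(target)} AND {any_author(coauthors)}"
    for name in exclude:
        query += f" NOT {name}[Author]"
    return query


query = build_query(
    target=["Khan AM", "Khan A"],                   # complete and incomplete name
    coauthors=["Brusic V", "August JT", "Tan TW"],  # corresponding authors
    exclude=["Rosati M"],                           # co-author of the unrelated paper
)
print(query)
```

Whether the resulting hits are all Dr. Khan's papers still needs to be checked by hand, since different people can share the same abbreviated name.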


B) Questions


p53: 12 records total

cancer: 15 records total

Both terms: 5 records

p53 AND cancer: returns how many records?
p53 OR cancer: returns how many records?
p53 NOT cancer: returns how many records?
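
The three Boolean operators behave like set operations, which can be checked with a small Python sketch. The record IDs are invented purely so that the counts match the figures above (12 records for p53, 15 for cancer, 5 containing both), and the sketch assumes that each total includes the 5 shared records.

```python
# Hypothetical record IDs chosen only to reproduce the counts above.
p53 = set(range(1, 13))     # 12 records: IDs 1..12
cancer = set(range(8, 23))  # 15 records: IDs 8..22

and_count = len(p53 & cancer)  # records containing both terms
or_count = len(p53 | cancer)   # records containing either term
not_count = len(p53 - cancer)  # records with p53 but without cancer

print(and_count, or_count, not_count)  # 5 22 7
```

The OR count is 12 + 15 - 5 because the shared records would otherwise be counted twice, and the NOT count is 12 - 5.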


C) Questions

Q1) When you perform a search for P53 in the protein database, you observe 4 tabs on top, namely All, Bacteria, RefSeq and Related Structures. What do you think is the difference between the “RefSeq” and “All” tabs?


D) Questions

Q1) Using the skills you have learned and the databases that have been introduced to you, can you find out where in the p53 protein the Nuclear Localization Signal is located? i.e., what is the sequence range?

Q2) Does the entry (P04637) belong to the RefSeq database? (Hint: analyze the alphanumeric identifiers of the entry)
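
The hint in Q2 can be turned into a small check. RefSeq accessions start with a two-letter prefix followed by an underscore (as in NM_000546), so a pattern test is enough for a first pass. The function name `looks_like_refseq` is made up for this sketch, and the pattern only checks the shape of the prefix, not whether the record actually exists.

```python
import re

# Two uppercase letters followed by an underscore, e.g. "NM_" or "NP_".
REFSEQ_PREFIX = re.compile(r"^[A-Z]{2}_")


def looks_like_refseq(accession: str) -> bool:
    """Return True if the accession has a RefSeq-style prefix."""
    return bool(REFSEQ_PREFIX.match(accession.strip()))


for accession in ("NM_000546.3", "P04637"):
    print(accession, looks_like_refseq(accession))
```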


Summary of items covered today

• Intro to Practicals – logistics

• Search strategies exercise and discussion

• Explored basic bioinformatics resources – exercise and discussion

• Tips/Tricks to improve productivity

• “Libproxy1” suffix shortcut

• WizFolio
