eb-eye back end

EBI is an Outstation of the European Molecular Biology Laboratory.
The new EBI search engine: EB-eye Backend: An overview of what is under the hood
Industry Workshop, 21-22 May 2007
Franck Valentin – External Services group

Upload: franck-valentin

Post on 19-Jun-2015


DESCRIPTION

The new EBI search engine: EB-eye Backend: An overview of what is under the hood.

TRANSCRIPT

1. The new EBI search engine: EB-eye Backend: An overview of what is under the hood

  • Franck Valentin, External Services group
  • Bioinformatics masters' students Open Day

2. Summary

  • What is available
  • Parsing
  • Indexing challenge
  • Software behind EB-eye
  • Acknowledgments

3. What is the data available?

  • Domains shown: ArrayExpress, Ligand, Interpro
  • > 20 domains
  • > 130M entries
  • > 550 Gb of data

4. What is the data available: formats

  • Domains shown: ArrayExpress, Ligand, Interpro
  • Record layouts shown: "ID: .. PARENT ID: .. RANK: .. ..." and repeated "ID ... AC ... DT ..." blocks

5. What is the data available: sizes

  • Sizes shown (in slide order; most domain labels did not survive): 43M; 4.2G; 1G; 8.4G; Interpro; 81M; 57Gb, >500 files; 374Gb, >600 files

6. Points to take into consideration

  • Our World
    • A variety of file formats
    • A large amount of data
    • A variety of file sizes
    • Data formats are changing
  • Our Quest
    • Index the data as fast as possible
    • Add and configure a new domain easily
    • Detect errors in the data
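
As a rough illustration of the goals above, the sketch below parses one EMBL-style flat-file entry using a per-domain mapping from line codes to field names, and reports lines it cannot map. It is only a sketch under stated assumptions: the slides say the real parsers are generated from ANTLR and ANTXR grammars, whereas this example swaps in plain string handling. The names `EMBL_FIELDS` and `parse_flat_entry`, the field names, and the `ZZ` line are hypothetical and exist only for illustration; the other sample lines follow the EMBL entry shown on slide 7.

```python
from collections import defaultdict

# Hypothetical per-domain configuration: line code -> field name.
# Adding a new flat-file domain would mean supplying another mapping.
EMBL_FIELDS = {
    "ID": "id",
    "AC": "accession",
    "DT": "dates",
    "DE": "description",
    "OS": "organism_species",
    "OC": "organism_classes",
}


def parse_flat_entry(lines, field_map):
    """Group the lines of one flat-file entry by configured field name.

    Returns (fields, errors). 'errors' lists every line whose two-letter
    code is not present in field_map.
    """
    fields = defaultdict(list)
    errors = []
    for number, line in enumerate(lines, start=1):
        code, value = line[:2], line[2:].strip()
        if code == "XX" or not line.strip():
            continue  # 'XX' lines only separate blocks in the entry
        name = field_map.get(code)
        if name is None:
            errors.append(f"line {number}: unknown line code {code!r}")
            continue
        fields[name].append(value)
    # Continuation lines of the same field are joined with a space.
    return {name: " ".join(values) for name, values in fields.items()}, errors


SAMPLE = """\
ID   AF030562; SV 1; linear; genomic DNA; STS; FUN; 852 BP.
XX
AC   AF030562;
XX
DT   04-DEC-1997 (Rel. 53, Created)
DT   03-MAR-2000 (Rel. 62, Last updated, Version 2)
XX
DE   Fusarium venenatum clone VEN-A RAPD band generated using Operon primer
DE   OPW-03, sequence tagged site.
XX
OS   Fusarium venenatum
ZZ   hypothetical line added to trigger the error path
""".splitlines()

if __name__ == "__main__":
    parsed, problems = parse_flat_entry(SAMPLE, EMBL_FIELDS)
    for name, value in parsed.items():
        print(f"{name}: {value}")
    for problem in problems:
        print("ERROR", problem)
```

Keeping the mapping as data rather than code is one simple way to make a changing format cheap to follow: a renamed or new line code shows up in the error list instead of being silently dropped.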

7. Parsing and indexing different formats

  • Inputs: Db, flat files, XML files, dump file (XML)
  • Parsers: Parser (ANTLR), Parser (ANTXR)
  • Grammars: EMBL grammar, Taxonomy grammar, Uniprot grammar, . . . ; Medline grammar, Interpro grammar, Dump file grammar, . . .
  • Indexer: Lucene API
  • Outputs: Uniprot Index, Embl Index, Taxonomy Index

Flat file example (EMBL entry). Fields highlighted on the slide: ID, AC, Creation date / Modification date, Description, Organism species, Organism classes, References.

    ID   AF030562; SV 1; linear; genomic DNA; STS; FUN; 852 BP.
    XX
    AC   AF030562;
    XX
    DT   04-DEC-1997 (Rel. 53, Created)
    DT   03-MAR-2000 (Rel. 62, Last updated, Version 2)
    XX
    DE   Fusarium venenatum clone VEN-A RAPD band generated using Operon primer
    DE   OPW-03, sequence tagged site.
    XX
    KW   STS.
    XX
    OS   Fusarium venenatum
    OC   Eukaryota; Fungi; Ascomycota; Pezizomycotina; Sordariomycetes;
    OC   Hypocreomycetidae; Hypocreales; mitosporic Hypocreales; Fusarium.
    XX
    RN   [1]
    RP   1-852
    RA   Yoder W.T., Christianson L.M.;
    RT   "Species-specific primers resolve members of the section Fusarium.
    RT   Taxonomic status of the edible 'Quorn' fungus re-evaluated";
    RL   Fungal Genet. Biol. 0:0-0(1997).
    XX
    RN   [2]
    RP   1-852
    RA   Yoder W.T., Christianson L.M.;
    RT   ;
    RL   Submitted (21-OCT-1997) to the EMBL/GenBank/DDBJ databases.
    RL   Microbiology, Novo Nordisk Biotech, Inc., 1445 Drew Ave., Davis, CA 95616,
    RL   USA
    XX
    FH   Key             Location/Qualifiers
    FH
    FT   source          1..852
    FT                   /organism="Fusarium venenatum"
    FT                   /strain="ATCC20334"
    . . .

XML file examples (markup lost in extraction; values only):

  • 1099793520001004
  • 14216186; 1965 02 01; 1996 12 01; 20070301; 0009-8981; 10; 1964Jul; Clinica chimica acta; international journal of clinical chemistry; Clin. Chim. Acta; . . .
  • Fields highlighted on the slide: ID, Creation Date, Modification Date, issn, volume, name

Dump file (XML) example (markup lost in extraction; values only):

  • IntAct.Experiment; Experimental procedures that allowed to; 1.0; 2007-Feb-16; 5697

8. Divide and Conquer the Indexing

  • Uniprot (>4M entries): 2 files, ~9.4G
  • Embl (>83M entries): >600 files, ~375G
  • Medline (>16M entries): >500 files, ~57G
  • Taxonomy (>0.37M entries): 1 file, ~81M
  • GO (>0.23M entries): 1 file, ~27M
  • Others (ArrayExpress, Ensembl, Intact)
  • Input types shown: XML, Db, XML dump
  • Indexes produced: Uniprot Index, Embl Index, Taxonomy Index, Medline Index, GO Index, ArrayExpress Index, Ensembl Index, Intact Index
  • Compute shown: 4 × 8 cpu

9. Libraries

  • Indexing
    • Lucene ( http://lucene.apache.org )
    • ANTLR ( http://www.antlr.org/ )
    • ANTXR ( http://javadude.com/tools/antxr/index.html )
    • JGroups ( http://www.jgroups.org )
  • Web
    • Tomcat ( http://tomcat.apache.org/ )
    • Spring Framework ( http://www.springframework.org )
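
To make the "divide and conquer" idea from slide 8 concrete, here is a minimal sketch of splitting a set of entries across workers, indexing each chunk independently, and merging the partial results. It is an assumption-laden toy, not the EB-eye implementation: it uses only the Python standard library (a thread pool and an in-memory inverted index) in place of Lucene and JGroups, and the function names, the whitespace tokenizer, and the third sample entry are hypothetical. The default of 8 workers simply echoes the "8 cpu" label on the slide.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def index_chunk(entries):
    """Build a partial inverted index (term -> set of entry ids) for one chunk."""
    partial = defaultdict(set)
    for entry_id, text in entries:
        for term in text.lower().split():
            partial[term].add(entry_id)
    return partial


def merge_indexes(partials):
    """Merge partial inverted indexes into a single index."""
    merged = defaultdict(set)
    for partial in partials:
        for term, ids in partial.items():
            merged[term] |= ids
    return merged


def build_index(entries, workers=8):
    """Split entries round-robin across workers, index each chunk, then merge."""
    chunks = [entries[i::workers] for i in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(index_chunk, chunks))
    return merge_indexes(partials)


ENTRIES = [
    ("AF030562", "Fusarium venenatum clone VEN-A"),
    ("14216186", "Clinica chimica acta"),
    ("TOY-1", "fusarium toy entry"),  # hypothetical entry, for illustration only
]

if __name__ == "__main__":
    index = build_index(ENTRIES, workers=2)
    print(sorted(index["fusarium"]))
```

Because each chunk is indexed without reference to the others, the same pattern extends from threads to separate processes or machines; only the merge step needs to see all partial results.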

10. Acknowledgements

  • Rodrigo Lopez
  • Janet Thornton and Graham Cameron
  • EMBL, EBI Industry Support Programme, European Patent Office and the EU.
  • All the data providers
  • Our colleagues in the External Services group
  • The System Group