BioPerl (Poster T02, ISMB 2010)


Upload: mark-jensen

Post on 10-May-2015



BioPerl at 15: New Features, New Directions

Christopher J. Fields, University of Illinois, [email protected]

*Mark A. Jensen, Fortinbras Research and SRA International, [email protected]

Jason E. Stajich, University of California at Riverside, [email protected]

The BioPerl Project, an open-source Perl toolkit for bioinformatics, was initiated in 1995 and became instrumental in the automated organization and analysis of original Human Genome Project data. Since then, BioPerl has become a complete object-oriented Perl environment for bioinformatics development, with modules to perform a wide range of bioinformatics functions, including multi-format parsing and translation, object-relational model databasing, EMBL and NCBI web service access, and external program execution. The BioPerl developer community is actively responding to the far-reaching changes in the field that have taken place over the last several years. Major goals are: (1) to provide new functionality useful to researchers at the cutting edge of bioinformatics, (2) to reorganize BioPerl into smaller application-oriented packages, (3) to deprecate older modules whose utility has declined substantially, and (4) to continue to expand and improve documentation, so that BioPerl remains useful and relevant in the years ahead.

Year  Sponsoring Institution  Student       Project                          Example Module
2008  NESCent                 Mira Han      PhyloXML parsing                 Bio::TreeIO::phyloxml
2009  NESCent                 Chase Miller  NeXML parsing                    Bio::Nexml
2010  OBF                     Jun Yin       Alignment subsystem refactoring  in progress

source: http://www.ohloh.net/p/bioperl

Google Summer of Code

BioPerl has provided mentorship for GSoC projects for the past three years. These have resulted in material additions to the codebase, and have been focused on expanding BioPerl's capabilities in format parsing and large file processing.

The BioPerl wiki (http://bioperl.org)

The wiki is now the central location for all BioPerl documentation: installation, module POD, HOWTO articles, code snippets, and personnel descriptions. It has played an important role as the new face of BioPerl and as a landing place for the developer discussions that are taking BioPerl forward.

BioPerl on GitHub (http://github.com/bioperl)

BioPerl recently migrated all active repositories from OBF-hosted Subversion to GitHub. The move to git brings decentralization and more fluid, independent development. We expect this to improve BioPerl's response time both to bugs and to new developments in the field, and to increase new developer recruitment and community participation.

Poster panels: Community participation and development; New features; New directions

Next-gen sequencing support

Bringing BioPerl up to speed for next-gen sequence data handling has led to efforts along three lines: file format standardization, common command-line tool wrapping, and BioPerl object system I/O integration tailored to next-gen data.

Formats

BioPerl and other Bio* projects recently published a collaborative effort to standardize FASTQ formats, including variants for Illumina and Solexa platforms. These formats are now in use across BioPerl and the Bio* projects.
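
As a rough illustration of why the variants have to be told apart, the sketch below converts quality strings to Phred scores in plain Perl. It assumes the commonly documented encodings (Sanger: Phred scores at ASCII offset 33; Illumina 1.3+: Phred scores at offset 64; Solexa: Solexa-scaled scores at offset 64). The helper name qual_to_phred is hypothetical and is not part of BioPerl.

```perl
use strict;
use warnings;

# ASCII offsets for the three FASTQ variants (an assumption taken from the
# commonly documented encodings, not from the poster itself).
my %OFFSET = (sanger => 33, illumina => 64, solexa => 64);

# Convert a quality string to a list of Phred scores.
# qual_to_phred is a hypothetical helper, not a BioPerl function.
sub qual_to_phred {
    my ($qual, $variant) = @_;
    my $offset = $OFFSET{$variant}
        or die "unknown FASTQ variant '$variant'\n";
    my @scores = map { ord($_) - $offset } split //, $qual;
    if ($variant eq 'solexa') {
        # Solexa scores use a different scale; map them onto the Phred scale:
        # Q_phred = 10 * log10(10 ** (Q_solexa / 10) + 1)
        @scores = map { 10 * log(10 ** ($_ / 10) + 1) / log(10) } @scores;
    }
    return @scores;
}

# The same Phred scores (40, 40, 20, 10) encoded in two variants.
my @sanger   = qual_to_phred('II5+', 'sanger');
my @illumina = qual_to_phred('hhTJ', 'illumina');
print "@sanger\n@illumina\n";
```

The two calls decode different characters to the same scores, which is the ambiguity a standardized variant label removes.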

Support for important binary formats (BAM, BigWig) is provided by wrappers for command-line tools and by the integration of fast XS-based Perl modules such as Lincoln Stein's Bio-SamTools and Bio-BigFile CPAN packages.

Wrappers

Enhancements to the Bio::Tools::Run::WrapperBase system have made it easier to add BioPerl wrapper modules for external programs, and to integrate these into other modules that implement pipelines using BioPerl sequence and alignment objects as I/O.

Tracking NCBI developments

In the past year, NCBI has released a fully updated BLAST toolkit, blast+†, and has been encouraging a move from their EUtilities RESTful interface to a newer SOAP interface‡.

BioPerl has responded with Bio::Tools::Run::StandAloneBlastPlus and Bio::DB::SoapEUtilities. These were designed not only to update the API interface, but also to add I/O layers that accept and parse messages into familiar BioPerl objects, and to build in straightforward methods for creating pipelines of blast+ program analyses or EUtilities fetches.

†ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST

‡http://eutils.ncbi.nlm.nih.gov/entrez/eutils/soap/v2.0/DOC/esoap_help.html

bedtools, bowtie, bwa, minimo, newbler, samtools

BioPerl object support: Bio::Assembly

The Bio::Assembly system has been extensively updated, to include reading and/or writing assemblies in MAQ, BAM, SAM, BWA, and other formats. Assembly object support is integrated into run wrappers for bwa, bedtools, maq, and samtools. Future work will incorporate new sequence objects that are optimized for large files (through the work of GSoC student Jun Yin).

use Bio::Tools::Run::Maq;
my $maq = Bio::Tools::Run::Maq->new();
my $assy_obj = $maq->run('read1.fastq', 'refseq.fas', 'read2.fastq');

maq assembly pipeline (figure):

Convert plain text sequence: fasta2bfa, fastq2bfq

Map reads to reference seq: map, mapmerge

Assemble map into consensus: assemble

Extract info from consensus: mapview, cns2fq
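
For orientation, the stages above correspond roughly to the maq command lines built below. The sketch only assembles argument lists and never runs maq. The intermediate file names and the helper maq_pipeline_commands are hypothetical, and the Bio::Tools::Run::Maq wrapper may order or parameterize its calls differently.

```perl
use strict;
use warnings;

# Build (but do not execute) maq command lines for one reference and one or
# two read files. maq_pipeline_commands is a hypothetical helper and all
# intermediate file names are placeholders.
sub maq_pipeline_commands {
    my ($ref_fasta, @read_fastqs) = @_;
    my @cmds;

    # 1. Convert plain text sequence files to maq's binary formats.
    push @cmds, [qw(maq fasta2bfa), $ref_fasta, 'ref.bfa'];
    my @bfqs;
    for my $i (0 .. $#read_fastqs) {
        my $bfq = 'reads' . ($i + 1) . '.bfq';
        push @cmds, [qw(maq fastq2bfq), $read_fastqs[$i], $bfq];
        push @bfqs, $bfq;
    }

    # 2. Map reads to the reference sequence.
    push @cmds, [qw(maq map aln.map ref.bfa), @bfqs];

    # 3. Assemble the map into a consensus.
    push @cmds, [qw(maq assemble consensus.cns ref.bfa aln.map)];

    # 4. Extract information from the consensus (maq writes it to STDOUT).
    push @cmds, [qw(maq cns2fq consensus.cns)];

    return @cmds;
}

print join(' ', @$_), "\n"
    for maq_pipeline_commands('refseq.fas', 'read1.fastq', 'read2.fastq');
```

Keeping each command as an array reference means it could later be handed to a list-form system call without shell quoting issues.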

Timeline

BioPerl has grown in its user and developer base since those early days. New developers and collaborations have contributed not only key modules, but also important design methodologies and refactoring over the years that have helped BioPerl to maintain its usefulness and relevance. Discontinuities followed by increases in lines of code over time reflect a high level of community flexibility and dedication in pursuit of DTWT.

General wrapper facility

A set of modules (Bio::Tools::WrapperMaker) is under development that will increase the responsiveness of BioPerl development by providing an XML-based way for users themselves to specify the interface for their favorite command-line programs, while creating a common, consistent API for executing those programs and accessing their output.
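
The idea of a declarative interface specification can be sketched as follows. This is not the Bio::Tools::WrapperMaker API: a Perl hash stands in for the XML description, and the spec layout, the program name 'mytool', and the helper build_command are all hypothetical.

```perl
use strict;
use warnings;

# Hypothetical interface description for one command-line program. In the
# scheme described above this would come from a user-written XML file.
my %spec = (
    program  => 'mytool',
    commands => {
        align => {
            options  => { threads => '-t', output => '-o' },  # flags taking a value
            switches => { quiet => '-q' },                     # boolean flags
        },
    },
);

# Turn a command name plus named parameters into an argument list,
# rejecting anything the spec does not declare.
sub build_command {
    my ($spec, $command, %params) = @_;
    my $def = $spec->{commands}{$command}
        or die "unknown command '$command'\n";
    my @argv = ($spec->{program}, $command);
    for my $name (sort keys %params) {
        if (my $flag = $def->{options}{$name}) {
            push @argv, $flag, $params{$name};
        }
        elsif ($flag = $def->{switches}{$name}) {
            push @argv, $flag if $params{$name};
        }
        else {
            die "parameter '$name' is not declared for '$command'\n";
        }
    }
    return @argv;
}

my @argv = build_command(\%spec, 'align',
    threads => 4, output => 'out.aln', quiet => 1);
print "@argv\n";
```

Because the calling code only sees named parameters, every program described this way is driven through the same interface, which is the consistency the paragraph above aims for.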

Intermediate layers for large file handling and generic parsing

BioPerl parsers generally take raw data to Perl objects with no intermediate layer. This induces prohibitive overhead when parsing large files, and can also limit user flexibility: parsing may be desired, but not the BioPerl objects. The first problem is being tackled by attaching backend handlers to container class constructors; these handlers are able to persist records of large files efficiently, creating BioPerl objects only as needed or desired. The second problem has led to experiments in generic parsing: data file records are parsed into a simple stream of hashes, which can then be directed wherever the user desires: into the creation of BioPerl objects as usual, or elsewhere.
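
A minimal sketch of the stream-of-hashes idea, using FASTA as the input format. It uses only core Perl. The iterator fasta_records and the hash keys (id, desc, seq) are hypothetical names chosen for illustration, not the experimental BioPerl interface.

```perl
use strict;
use warnings;

# Return an iterator (closure) that yields one hash reference per FASTA
# record, or undef once the input is exhausted. No sequence objects are built.
sub fasta_records {
    my ($fh) = @_;
    my $pending;    # header read ahead while finishing the previous record
    return sub {
        my ($header, $seq) = ($pending, '');
        undef $pending;
        while (defined(my $line = <$fh>)) {
            chomp $line;
            if ($line =~ /^>(.*)/) {
                if (defined $header) { $pending = $1; last; }
                $header = $1;
            }
            elsif (defined $header) {
                $seq .= $line;
            }
        }
        return unless defined $header;
        my ($id, $desc) = split ' ', $header, 2;
        return { id => $id, desc => $desc // '', seq => $seq };
    };
}

# Demonstration on an in-memory file.
my $fasta = ">seq1 first record\nACGT\nTTGA\n>seq2\nGGCC\n";
open my $fh, '<', \$fasta or die "cannot open in-memory file: $!";
my $next = fasta_records($fh);

# The consumer decides what to do with each record: here, only record lengths.
my %length;
while (my $rec = $next->()) {
    $length{ $rec->{id} } = length $rec->{seq};
}
print "$_: $length{$_}\n" for sort keys %length;
```

The same loop could instead pass each hash to an object constructor, which is the point of keeping the parsing step separate from object creation.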

Biome and BioPerl 6

BioPerl has been object-oriented from the beginning, but suffers the weaknesses of Perl 5 objects: very high overhead, loose encapsulation, limited object introspection, and the lack of built-in interfaces and roles, among other things.

These issues are being addressed in two ways: in Perl 5 through the Moose classes and dependencies, and in the creation of Perl 6. BioPerl is exploring both paths to true objects with the experimental Biome (BioPerl with Metaobject Extensions) and BioPerl 6 projects.

Figure: Biome role as interface. Diagram elements: Class, Role, main::. Annotations: class consumes role; must instantiate reqd abstract method; consuming class possesses role members; instance possesses concrete role methods.
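
The role-as-interface pattern in the figure can be approximated in plain Perl 5, as sketched below. This is a hand-rolled stand-in and not how Moose or Biome implement roles. The package names (Role::HasLength, My::Seq) and the helper consume_role are hypothetical.

```perl
use strict;
use warnings;

package Role::HasLength;
our @REQUIRES = ('seq');          # abstract: the consumer must implement these
our @PROVIDES = ('seq_length');   # concrete: copied into the consumer
sub seq_length { my ($self) = @_; return length $self->seq }

package My::Seq;
sub new { my ($class, %args) = @_; return bless {%args}, $class }
sub seq { return $_[0]{seq} }

package main;

# Compose a role into a class: verify the required methods exist, then
# install the role's concrete methods in the class's symbol table.
sub consume_role {
    my ($class, $role) = @_;
    no strict 'refs';
    for my $required (@{"${role}::REQUIRES"}) {
        die "$class must implement $required() to consume $role\n"
            unless $class->can($required);
    }
    for my $name (@{"${role}::PROVIDES"}) {
        *{"${class}::$name"} = \&{"${role}::$name"};
    }
    return 1;
}

consume_role('My::Seq', 'Role::HasLength');

# An instance now possesses the role's concrete method.
my $seq = My::Seq->new(seq => 'ACGTAC');
print $seq->seq_length, "\n";
```

A class that lacks seq() fails at composition time rather than at the first method call, which is the interface guarantee that plain Perl 5 inheritance does not provide.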

Shattering the Monolith

BioPerl continues to be distributed as just a handful of packages. The core package in particular has grown to 341 files, comprising 874 classes with 23,146 tests. Maintenance and installation issues are barriers to developers and users alike. We are in the process of splitting the core into reasonable, application-related chunks. This plus the git migration should significantly improve BioPerl management.

The BioPerl Core Development Team is Sendu Bala, Rob Buels, Christopher Fields, Mark Jensen, Hilmar Lapp, Heikki Lehväslaiho, Aaron Mackey, Dave Messina, Brian Osborne, Jason Stajich, and Lincoln Stein. Key support is provided by Chris Dagdigian and Mauricio Herrera Cuadra. Florent Angly and Dan Kortschak are lead developers of projects discussed here.