Understanding how to Best Leverage Open Source Data Management Software: A Roadmap

Chris A. Mattmann
Senior Computer Scientist, NASA Jet Propulsion Laboratory
Adjunct Assistant Professor, Univ. of Southern California
Member, Apache Software Foundation


Agenda
• Context: Earth Science Data Systems
• Concerns in the Open Source World
• Open Source Data Management Solutions
• Why Apache?
• Open Source War Stories
• Wrap up

22-Feb-12 BESSIG-2012


And you are?

• Apache Member involved in
  – OODT (VP, PMC), Tika (VP, PMC), Nutch (PMC), Incubator (PMC), SIS (Mentor), Lucy (Mentor) and Gora (Champion), MRUnit (Mentor), Airavata (Mentor)

• Senior Computer Scientist at NASA JPL in Pasadena, CA USA

• Software Architecture/Engineering Prof at Univ. of Southern California


The NASA ESDS Context


Where is open source most useful?

Which area should produce open source software?


Concerns in the Open Source World

• Licensing
  – GPL (v2, v3?), LGPL (v?), BSD, MIT, ASLv2
  – Your own custom license approved by OSS
    • NASA OSS license?
    • Caltech license?
  – Copy-left versus Copy-right
• Redistribution
  – Can you take open source product X and use it in your commercially interested software Y?
    • If so, do you have to pay for it?
  – Should others pay for your open source product if they use it in their commercial application?
• Open Source “Help Desk” Syndrome versus Community
  – Are you trying to simply make your open source software (releases) available for distribution (aka help desk)?
  – Are you trying to get others to “buy in” to your open source software?


Concerns in the Open Source World

• Intellectual Property
  – Who owns it?
  – How does the Open Source Software affect your IP?
• Open Source Ecosystems
  – Where can you find the “killer app” you need?
  – Which communities are conducive to longevity?
  – How relevant are “generic” open source software communities to NASA Earth Science Data Systems?
• Contributing
  – Are you even allowed to contribute to an OSS community?
  – Can you do it on “company” time?
  – What’s required?
  – What’s the governance?
• Responsiveness
  – How responsive is the OSS community to your project’s needs?


The NASA ESDS Context


The aforementioned OSS concerns cut across the whole ESDS enterprise!


Apache OODT
• Entered “incubation” at the Apache Software Foundation in 2010

• Selected as a top level Apache Software Foundation project in January 2011

• Developed by a community of participants from many companies, universities, and organizations

• Used for a diverse set of science data system activities in planetary science, earth science, radio astronomy, biomedicine, astrophysics, and more

OODT Development & user community includes:

http://oodt.apache.org


Apache

• Heavily influenced by the underlying
  – License used
  – OSS redistribution infrastructure
  – What “it” is?
• Case Study: Apache Hadoop

Who owns it? Apache owns the Hadoop trademark


Why develop OSS data systems at Apache?

• Most data system “frameworks”
  – Are not meant to be “turn key”
  – Attempt to exploit the boundary between bringing in capability vs. being overly rigid in science
• Apache is the elite open source community for software developers
  – Fewer than 100 projects have been promoted to top level (Apache Web Server, Tomcat, Solr, Hadoop)
  – Differs from other open source communities: it provides a governance and management structure, with an emphasis on community


Apache Maturity Model
• Start out with Incubation
• Grow community
• Make releases
• Gain interest
• Diversify
• When the project is ready, graduate into
  – Top-Level Project (TLP)
  – Sub-project of TLP
• Increasingly, Sub-projects are discouraged compared to TLPs


Apache Organization
• Apache is a meritocracy
  – You earn your keep and your credentials
• Start out as Contributor
  – Patches, mailing list comments, etc.
  – No commit access
• Move onto Committer
  – Commit access, evolve the code
• PMC Members
  – Have binding VOTEs on releases/personnel
• Officer (VP, Project)
  – PMC Chair
• ASF Member
  – Have binding VOTE in the state of the foundation
  – Elect Board of Directors
• Director
  – Oversight of projects, foundation activities
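
The binding VOTE that PMC members hold on releases can be sketched in a few lines of Python. This is only an illustration, assuming the commonly documented Apache release rule (at least three binding +1 votes, and more binding +1 than binding -1 votes); the `Vote` type, the `release_passes` function and the voter names are hypothetical and not part of any Apache tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Vote:
    voter: str
    value: int      # +1, 0 or -1
    binding: bool   # True when cast by a PMC member


def release_passes(votes):
    """Return True if the binding votes approve the release.

    Assumed rule: at least three binding +1 votes, and more binding
    +1 votes than binding -1 votes. Non-binding votes are advisory.
    """
    binding = [v.value for v in votes if v.binding]
    plus = sum(1 for value in binding if value == 1)
    minus = sum(1 for value in binding if value == -1)
    return plus >= 3 and plus > minus


votes = [
    Vote("pmc-a", +1, True),
    Vote("pmc-b", +1, True),
    Vote("pmc-c", +1, True),
    Vote("contributor-d", -1, False),  # advisory only, not counted
]
print(release_passes(votes))
```

The point of the sketch is the separation of roles: contributors and committers can voice an opinion, but only votes from PMC members decide the outcome.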


OSS Ecosystems
• Where should you go to for your open source project?
• Should your org have its own?
• Should your project (SIPS, DAAC, proposal) have its own?


SourceForge (a different model)
• Project Proposal
  – Accepted? Get going!
• No foundation-wide oversight
• Tons of dormant projects with no communities of interest
• Goal is to host infrastructure and host technologies
• Goal is not to build communities
• No foundation-wide rules or guidelines for committership or for project management
  – Dealt with locally by the progenitor of the project
  – Can lead to BDFL (benevolent dictator for life) syndrome
• No foundation-wide license requirements
  – BSD, GPL (v2, v3), MIT, LGPL, etc. all allowed


Insulation: one strategy
• Missions maintain their own local CMs
• Local mission CMs contain forks of existing OSS software
  – Forks can be patch based or CM based
• Changes found particularly effective are discussed within the community and eventually brought before a CCB that reviews their generality, etc.


Credit: D. Freeborn


Free as in beer
• “Something like the Apache foundation is the best place for released government software.

A previous attempt at release and public distribution via a private company was a truly dismal failure. OpenPBS (portable batch system) is supposed to be available to anyone that asks. However when you do ask a sales rep strings you along for more than a month trying to sell you something that they can't actually assure you will fit your requirements (and is no longer under development) even when the free one is documented as doing so. It was a truly stupid waste of the salesperson's time and mine that would have exceeded the price of providing the file for download or sending by email by several orders of magnitude and generated a lot of ill will. I'll go as far as saying it was blatant false advertising using a government funded open source product to do a bait and switch to try to sell me an unmaintained product they picked up in a corporate take over. My experience appears to have been identical to that of many that attempted to obtain this government funded open source software that NASA had declared was available for anyone. Eventually due to this open source project becoming closed the project just had to fork and the compatible Torque batch system was developed by people that had actually get hold of the original OpenPBS.”

– Slashdot user comment


NASA Open Source Summit


http://www.nasa.gov/open/source/


Recommendations
• Communication and Publicizing NASA’s Open Source Efforts
• Define what open source licenses can be used
• Remove barriers to involvement in open source
• Remove barriers to open source development models
• Define policy for dealing with ITAR/other restrictions
• Define policy for contributing to external open source projects
• Define governance model
• Develop NASA cooperative support structure
• Start projects out in open source
• Close feedback loop between developers, policy makers and users
• Hire open source talent (hint: it’s a specialized skill)
• Make open source software more accessible
• Unify open source and the office of the chief engineer


Why licenses are important

• “War Story”
  – Amazon EC2, S3
• Johnson and Johnson Pharm. R&D
  – At a recent conference I met the director of R&D for J&J. He presented a story in which J&J needed large, bursty processing and limited data storage for some drug tests they were conducting. They decided to use Amazon EC2. After reviewing Amazon’s licensing policy for EC2, J&J’s lawyers determined that Amazon claimed IP for any data or computational results produced on its cloud. Since the need for Amazon’s processing and cloud was limited to a few trials, and since the cost of standing up its own cluster for these experiments was so outrageous, J&J decided to forge ahead with Amazon with the understanding that its lawyers would “duke it out” with Amazon’s lawyers, should the need arise, over the FOU restrictions and IP claims induced by Amazon EC2.


Why licenses are important
• “War Story 2”
  – Oracle versus Google
• Is there really free Java and free lunch?
  – Although there is no definitive answer besides papers filed in court, Oracle’s claims in its lawsuit are based on perceived patents for the Java Virtual Machine and its associated IP. Sun originally filed patents on the Java Virtual Machine and its translation of programming language code into runtime executables. In order for a JVM to be a “certified” (read: trademark) JVM, the JVM must pass a “Technology Compatibility Kit” (TCK), which Oracle/Sun licenses at a cost to JVM vendors. By purchasing the TCK, a JVM builder is given IP knowledge of the JVM patents. If a company builds a JVM but does not purchase the TCK, Oracle/Sun loses licensing dollars.


Contribution case study
• “War Story 3”
  – Majority Rules
• The tyranny of the majority?
  – Company X decides to put a project Foo into open source. They go through the OSS Y Foundation, which encourages community building versus “Help Desk” syndrome. Project Foo decides to employ RTC (review-then-commit) practices to accept code contributions from the community, and chooses to weight code contributions higher than any other form of contribution. Since Project Foo is managed by a majority of people within Company X, and since those people have Company X’s best interests in mind (versus the OSS project Foo or its community), Project Foo only elects “friends” who are subordinate to their ideals, or members of Company X, as new code “committers”. When outside contributors provide patches, the patches sit in an issue tracking system for years, and are eventually turned into new project Foo features in Company X’s “distribution of project Foo”, a repackaging of open source Foo for Company X. After a few years working in this model, the non-Company X members of project Foo stop participating on the mailing lists, and leave.


Mutually Beneficial?
• “War Story 4”
  – Sharing software across missions
• I’ll use your software if you use mine
  – NASA mission X makes a fork of OSS software S and adds a great new component W, an improvement over an existing component M, to it. NASA mission Y, started around a similar time, also forks S, but uses M as is and doesn’t build a new component like W. NASA mission Z comes along and is interested in using OSS, but wants to make sure they are on the same baseline as missions X and Y. NASA missions X, Y and Z huddle up and try to decide how to sort out using OSS, and getting onto the same baseline.


Strategies for addressing the previous use case

• Option 1
  – Create a “mirror” of S shared by Missions X, Y and Z
  – Missions X, Y and Z may still maintain local repos
  – X checks new W into the shared repo
  – Y ignores W and uses M from the shared repo
  – Z uses the shared repo to stay on common baseline
  – Pros:
    • Insulate X, Y and Z from changes to S
    • Z has 1 place to get mission forks
    • Requires coordination between X, Y and Z
  – Cons:
    • Requires coordination between X, Y and Z!
    • If W touches M, or changes M in any way, then Y is not insulated from X’s changes
    • Requires representatives from X, Y and Z to “take ownership” over shared repo’s maintenance
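
The second con under Option 1 (Y is only insulated while W leaves M alone) comes down to a set-overlap check. The sketch below is a rough illustration, not part of the talk: the `is_insulated` helper and every file path in it are hypothetical, and it assumes a change set can be summarized as the set of files it modifies.

```python
def is_insulated(used_paths, changed_paths):
    """True when none of the files a mission relies on were changed."""
    return set(used_paths).isdisjoint(changed_paths)


# Hypothetical layout: files that make up component M, which Mission Y uses.
component_m = {"m/reader.py", "m/index.py"}

# Two hypothetical versions of Mission X's change set for component W.
w_standalone = {"w/reader.py", "w/cache.py"}    # adds W next to M
w_touching_m = {"w/reader.py", "m/index.py"}    # also edits a file of M

print(is_insulated(component_m, w_standalone))  # True
print(is_insulated(component_m, w_touching_m))  # False
```

A check like this could run before anything lands in the shared repo, so that Mission Y learns about a change to M before it arrives rather than after.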


Strategies for addressing the previous use case

• Option 2
  – Missions X, Y and Z may still maintain local repos
  – X checks new W into its local mission X repo as a patchset to the S baseline
  – Y ignores Mission X’s W and uses M from S
  – Z uses Mission X’s W patch to S to stay on common baseline
  – Pros:
    • Insulate X, Y and Z from changes to S
    • Z has 1 place to get mission forks
    • Requires NO coordination between X, Y and Z
  – Cons:
    • Requires understanding of patch maintenance, and patch merging into local repos
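
To make “patchset to the S baseline” concrete, the sketch below builds a unified diff with Python’s standard `difflib` module. It is only an illustration under assumptions: the file names `S/loader.py` and `X/loader.py` and both code snippets are made up, and a real mission would normally produce the patch with its CM tool rather than by hand.

```python
import difflib

# Hypothetical contents of one file in the S baseline.
baseline = [
    "def load(path):\n",
    "    return open(path).read()\n",
]

# Hypothetical Mission X version of the same file.
mission_x = [
    "def load(path):\n",
    "    with open(path) as handle:\n",
    "        return handle.read()\n",
]

# The patch is the line-level difference between baseline and fork.
patch = list(
    difflib.unified_diff(
        baseline,
        mission_x,
        fromfile="S/loader.py",
        tofile="X/loader.py",
    )
)
print("".join(patch))
```

The patch records only what Mission X changed relative to S, which is why another mission can pick it up without coordination but must know how to merge it into its own local repo.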


Strategies for addressing the previous use case

• Option 3
  – Missions X, Y and Z may still maintain local repos
  – X checks new W into its local mission X repo as a branch of the S baseline
  – Y ignores Mission X’s W and uses M from S
  – Z uses Mission X’s W branch to S to stay on common baseline
  – Pros:
    • Insulate X, Y and Z from changes to S
    • Z has 1 place to get mission forks
    • Requires NO coordination between X, Y and Z
    • Patch maintained as branch using CM tool’s super tools (history, changelog, etc., can even generate patches if desired)
  – Cons:
    • Requires understanding of branch maintenance, and branch merging into local repos


Wrapup

• Open source is critical to the strategy of the organization

• A great method of sustainability and longevity of software and community beyond organizational boundaries

• Different licenses, communities, development practices
  – Know what the differences mean


Alright, I’ll shut up now

• Any questions?

• THANK YOU!
  – [email protected]
  – @chrismattmann on Twitter
