How it's made: MyGet.org (AzureConf)

Post on 15-May-2015


TRANSCRIPT

THANK YOU, LOCAL ORGANIZERS!

Over 60 community-led Windows Azure training events worldwide!

http://globalwindowsazure.azurewebsites.net

Maarten Balliauw (@maartenballiauw)

How it’s made: MyGet.org

• Maarten Balliauw
  • Daytime: Technical Evangelist, JetBrains
  • Nighttime: Co-founder MyGet.org
• AZUG
• Focus on web
• Big passion: Windows Azure
• http://blog.maartenballiauw.be
• @maartenballiauw

Who am I?

Shameless self promotion: Pro NuGet - http://amzn.to/pronuget

• NuGet? MyGet?
• How we started
• What we did not know
• Our first architecture
• Our second architecture
• ACS
• Tough times (learning moments)
• Conclusion

Agenda

NuGet? MyGet?

• Safely store your IP with us
• Creating packages is hard. We have Build Services!
• Granular security
• Activity streams
• Symbol server
• Analytics

Why MyGet?

• Xavier Decoster (@xavierdecoster)
• Yves Goeleven (@yvesgoeleven)

• Also known as @MyGetTeam

I’m not alone!

How we started

The real beginning? May 09, 2011

• Using OData as their feeds
• Which is some sort of WCF…
• Multiple feeds?
• Exchanged some ideas with Xavier
• Prototyped something during TechDays 2011

NuPack!

Prototype online! May 31, 2011

• Windows Azure (yay, new toy!)
• Windows Azure Table Storage & Blob Storage (cheap in case we fail!)
• Windows Azure ACS (no way I’m typing another user registration)
• ASP.NET MVC 2
• MEF

Technologies used?

Best practices used?

• One web role
• One storage account

Architecture at the time?

What we did not know…

• Grew from 5 feeds to 70 feeds in a few weeks, thinking we hit our max.

Users would come!

• One user started pushing 1,300 packages worth 1 GB of storage.

• Others started pushing CI packages.

Data would come!

• A lot of refactoring done
• Using best practices
• SOLID and DRY (well, not everywhere but refactoring takes time)
• Running on two instances (availability, yay!)

ReSharper time!

• Someone mentioned they would pay for our service

• Business model
• Public site
• Volume of feeds kept going up
• Users in EU and US

We “started”

Our first architecture

[Diagram: EU-WEST and NORTH CENTRAL US datacenters, with WEB ROLE and STORAGE boxes]

• Datacenters near our users
• Centralized storage
• Packages on CDN for faster throughput
• DNS fail-over if one of the DCs went down

Awesome!

• Datacenters near our users? Or not?
• Centralized storage: speed of light! USA was slow!
• Packages on CDN for faster throughput: sync issues, downtime, …
• DNS fail-over if one of the DCs went down: seems not every ISP follows DNS standards

Not so awesome…

• Local caching in USA added
• 2 instances in EU, 1 in the USA
• Syncing data kept being slow
• Populating cache was a nightmare
• CDN kept having issues
• Of 3 instances, only 1 was being used with enough load (~60%)

We persisted!

• We had public subscription plans
• We added enterprise tenants (multi-tenancy added)
• Resulting in…
  • Architecture became complex
  • Caching and syncing became complex

We pivoted!

Our second architecture

• Managing feeds and packages
  • Doesn’t matter much where (who cares about a little latency)
• Downloading packages
  • May matter where, let the tenant decide on storage account location
• Builds
  • Who cares where!

Workloads

[Diagram labels: WEB ROLE and STORAGE (EU-WEST); STORAGE (EU-NORTH); STORAGE ACCT PER TENANT; OTHER DATACENTERS: STORAGE ACCT (SOME TENANTS), VIRTUAL MACHINE (BUILDS)]

• … was scaled across the globe
• … but as synchronous as it could be
• … prone to all issues with latency vs. synchrony

• Event Driven Architecture?*

*some concepts borrowed from EDA

Our first architecture…

• Some actions put an ICommand on a queue (ground rule: if it can’t be done in 1 write, use ICommand)

• All actions complete with an IEvent on a queue

• Handlers can subscribe to ICommand and IEvent

• Handlers are idempotent and do not depend on each other

EDA in MyGet
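The rule that handlers must be idempotent matters because a queue may deliver the same message more than once. MyGet is a .NET application and its handler code is not shown in these slides, so the following is only a Python sketch under assumptions: the `event_id` field, the `LoginCounterHandler` class, and the in-memory set of seen ids are hypothetical. It shows one common way to make a handler safe against redelivery.

```python
from dataclasses import dataclass
from typing import Dict, Set


@dataclass(frozen=True)
class UserLoggedInEvent:
    """Event name taken from the slides; the fields are assumptions."""
    event_id: str
    user: str


class LoginCounterHandler:
    """Hypothetical handler: counts logins per user, safe against redelivery."""

    def __init__(self) -> None:
        self.counts: Dict[str, int] = {}
        self._seen: Set[str] = set()

    def handle(self, event: UserLoggedInEvent) -> None:
        if event.event_id in self._seen:
            return  # duplicate delivery: the effect was already applied
        self._seen.add(event.event_id)
        self.counts[event.user] = self.counts.get(event.user, 0) + 1


handler = LoginCounterHandler()
event = UserLoggedInEvent(event_id="evt-1", user="maarten")
handler.handle(event)
handler.handle(event)  # simulated redelivery of the same message
handler.handle(UserLoggedInEvent(event_id="evt-2", user="maarten"))
```

Handling `evt-1` twice leaves the count unchanged; only the distinct second event increments it. In a real system the set of seen ids would have to be persisted, which this sketch leaves out.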

• 2 operations: 1 read, 1 write
  • Read the profile
  • Store the profile with LastLogin date
• No use of ICommand
• Finishes with UserLoggedInEvent

Example: log in

• Many operations!
  • Read two user profiles
  • Read current access rights
  • Change access rights
  • Push new privileges to SymbolSource.org

• One command, one event
  • ChangeFeedOwnerCommand
  • FeedOwnerChangedEvent

Example: change feed owner

[Diagram: ChangeFeedOwnerCommand → ChangeFeedOwnerCommandHandler → FeedOwnerChangedEvent → SymSrcHandler<FeedOwnerChangedEvent> (→ SymSrcEvent) and ActivityLogHandler<FeedOwnerChangedEvent>]
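The flow in the diagram can be sketched as one command handler that completes with an event, plus two independent subscribers to that event. This is a minimal Python illustration, not MyGet's .NET code: the `Bus` class, the message fields, and the in-memory lists standing in for SymbolSource.org and the activity log are all assumptions. Only the command and event names come from the slides.

```python
from collections import defaultdict, deque
from dataclasses import dataclass


@dataclass(frozen=True)
class ChangeFeedOwnerCommand:
    feed: str
    new_owner: str


@dataclass(frozen=True)
class FeedOwnerChangedEvent:
    feed: str
    new_owner: str


class Bus:
    """Hypothetical in-memory queue with type-based subscriptions."""

    def __init__(self):
        self.queue = deque()
        self.subscribers = defaultdict(list)

    def subscribe(self, message_type, handler):
        self.subscribers[message_type].append(handler)

    def publish(self, message):
        self.queue.append(message)

    def drain(self):
        # Process messages in arrival order, including ones published by handlers.
        while self.queue:
            message = self.queue.popleft()
            for handler in self.subscribers[type(message)]:
                handler(message)


bus = Bus()
owners = {"my-feed": "xavier"}   # sample data, made up for the example
symbol_source_updates = []       # stand-in for the push to SymbolSource.org
activity_log = []                # stand-in for the activity log


def change_feed_owner(command):
    owners[command.feed] = command.new_owner
    # Every action completes with an event on the queue.
    bus.publish(FeedOwnerChangedEvent(command.feed, command.new_owner))


bus.subscribe(ChangeFeedOwnerCommand, change_feed_owner)
bus.subscribe(FeedOwnerChangedEvent,
              lambda e: symbol_source_updates.append((e.feed, e.new_owner)))
bus.subscribe(FeedOwnerChangedEvent,
              lambda e: activity_log.append(f"{e.feed} is now owned by {e.new_owner}"))

bus.publish(ChangeFeedOwnerCommand("my-feed", "maarten"))
bus.drain()
```

The two event subscribers know nothing about each other or about the command handler, which is what makes adding a new subscriber cheap.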

• We now run on 2 instances, mostly for redundancy (coming from 3)

• Average CPU usage? 20% (coming from 60%)

• Way easier to implement new features!
  • New feature: activity log
  • Simply subscribe to events we want to see in that log

Gain?

• Why no relational database?

• With only PartitionKey (plus RowKey) as an index, how do you store a feed’s packages and versions in an optimal way?
  • Three important values: feed name, package id, package version
  • Table per feed
  • Package id = PartitionKey
  • Package version = RowKey

Storage
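The key layout described above (table per feed, package id as PartitionKey, package version as RowKey) can be sketched as a small mapping function. This Python sketch uses nested dictionaries as a stand-in for table storage; `EntityKey`, `store`, `get`, `versions`, and the sample package names are hypothetical and only illustrate the layout, not the Azure storage API.

```python
from typing import Dict, List, NamedTuple, Optional


class EntityKey(NamedTuple):
    table: str          # one table per feed
    partition_key: str  # package id
    row_key: str        # package version


def entity_key(feed_name: str, package_id: str, version: str) -> EntityKey:
    return EntityKey(table=feed_name, partition_key=package_id, row_key=version)


# In-memory stand-in for table storage: table -> partition -> row -> entity
tables: Dict[str, Dict[str, Dict[str, dict]]] = {}


def store(key: EntityKey, entity: dict) -> None:
    tables.setdefault(key.table, {}).setdefault(key.partition_key, {})[key.row_key] = entity


def get(key: EntityKey) -> Optional[dict]:
    """Point lookup: table, partition, and row are all known."""
    return tables.get(key.table, {}).get(key.partition_key, {}).get(key.row_key)


def versions(feed_name: str, package_id: str) -> List[str]:
    """All versions of a package sit in one partition (sorted as plain strings)."""
    return sorted(tables.get(feed_name, {}).get(package_id, {}))


store(entity_key("myfeed", "Sample.Package", "4.5.11"), {"size": 100})
store(entity_key("myfeed", "Sample.Package", "5.0.1"), {"size": 120})
store(entity_key("myfeed", "Other.Package", "1.9.1"), {"size": 90})
```

With this layout, "give me one exact version" and "give me all versions of one package in one feed" both stay within a single partition of a single table.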

• Reading 1,000 rows and deserializing them is SLOW (many seconds)

• We cache some tables on blob storage
  • 1,000 rows in serialized JSON = small
  • Loading one file = fast
  • Searching in memory through 1,000 rows = fast

• Cache update subscribed to IEvent

Storage
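The caching idea (serialize a table's rows to one JSON blob, load that single file, search in memory, refresh on events) can be sketched as follows. This is an assumption-laden Python illustration: a dictionary stands in for blob storage, and `write_cache`, `load_cache`, `find_versions`, and `on_package_added` are hypothetical names, not MyGet's code.

```python
import json
from typing import Dict, List

# In-memory stand-in for blob storage: blob name -> serialized text
blob_store: Dict[str, str] = {}


def write_cache(feed_name: str, rows: List[dict]) -> None:
    blob_store[f"{feed_name}.json"] = json.dumps(rows)


def load_cache(feed_name: str) -> List[dict]:
    # One read of one file instead of a thousand row reads.
    return json.loads(blob_store.get(f"{feed_name}.json", "[]"))


def find_versions(rows: List[dict], package_id: str) -> List[str]:
    # A linear scan over a thousand in-memory rows is cheap.
    return [row["RowKey"] for row in rows if row["PartitionKey"] == package_id]


def on_package_added(feed_name: str, row: dict) -> None:
    """Hypothetical event handler: keeps the cached blob in step with the table."""
    rows = load_cache(feed_name)
    rows.append(row)
    write_cache(feed_name, rows)


# Made-up sample data: 1,000 rows spread over 100 package ids.
rows = [{"PartitionKey": f"Package{i % 100}", "RowKey": f"1.0.{i}"} for i in range(1000)]
write_cache("myfeed", rows)
on_package_added("myfeed", {"PartitionKey": "Package7", "RowKey": "2.0.0"})

cached = load_cache("myfeed")
found = find_versions(cached, "Package7")
```

Because the cache update is just another event subscriber, the code that writes to the table does not need to know the cache exists.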

Windows Azure Access Control Service

• Multiple applications
  • www.myget.org
  • staging.myget.org
  • localhost:1196
  • Customer1.myget.org
  • Customer2.myget.org
  • …

• Multiple identity providers
  • Who wants Microsoft Account?
  • Google anyone?
  • Oh, your custom ADFS? Sure!

Imagine managing this!

[Diagram: production tenants (www.myget.org, *.customer.myget.org, other domain names), development (localhost:1196) and staging (myget-staging.cloudapp.net)]

Windows Azure Access Control Service

• Users typically have some identity that allows federation

• ACS gives us Microsoft Account, Yahoo!, Google & Facebook accounts*

• We only care about ACS in our code

*we built many others and are working on a spin-off http://socialsts.com

No more user registration!

• An identity is an identity, whether dev/staging/prod

• ACS handles subtle differences per environment

• Our app just gets and uses the claims

No difference between environments!

• Easy multi-tenant logins with different identity providers

• ACS decides how to log in based on the audience
  • www.myget.org
  • some.customers.myget.org

• Our app just gets and uses the claims

No difference between tenants!
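The idea that the login options depend on the audience while the application only consumes claims can be illustrated with a small lookup. This is not how ACS is configured and it does not use any ACS API; the mapping, the provider lists per audience, and the claim names below are assumptions made up for the sketch.

```python
from fnmatch import fnmatchcase
from typing import Dict, List

# Hypothetical configuration: audience pattern -> identity providers offered.
PROVIDERS_BY_AUDIENCE: Dict[str, List[str]] = {
    "www.myget.org": ["Microsoft Account", "Google", "Yahoo!", "Facebook"],
    "*.customers.myget.org": ["Custom ADFS"],
}


def providers_for(audience: str) -> List[str]:
    """Returns the providers for the first pattern matching the audience."""
    for pattern, providers in PROVIDERS_BY_AUDIENCE.items():
        if fnmatchcase(audience, pattern):
            return providers
    return []


def display_name(claims: Dict[str, str]) -> str:
    """The application reads claims only; it never asks which provider issued them."""
    return claims.get("name", claims.get("nameidentifier", "unknown"))


tenant_providers = providers_for("some.customers.myget.org")
public_providers = providers_for("www.myget.org")
```

The application code path (`display_name`) is identical for every tenant and environment; only the configuration keyed by audience differs.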

Tough times (learning moments)

• Symptoms:
  • Users complaining about “downtime”
  • No monitoring SMS alert
  • Half an hour later: “site up!”, “site down!”, “site up!”, “site down!” SMS alerts
  • No sign of issues in the Windows Azure Management portal

• But what’s the cause?
  • We just deployed our multi-tenant architecture
  • We just enabled storage analytics
  • ELMAH was showing storage throttling
  • 16,000 unprocessed commands and events in queue

Huge downtime on July 2nd, 2012
Full story at http://blog.myget.org/post/2012/07/02/Site-issues-on-July-2nd-2012.aspx

• One simple piece of code…
  • GetHashCode() on Package object faulty

• “If two string objects are equal, the GetHashCode method returns identical values. However, there is not a unique hash code value for each unique string value. Different strings can return the same hash code.”

• GetHashCode() used to track object in data context (new vs. update)

• 2 objects with the same hash code = UnhandledException

Huge downtime on July 2nd, 2012
Full story at http://blog.myget.org/post/2012/07/02/Site-issues-on-July-2nd-2012.aspx
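The underlying mistake, treating a hash code as a unique identity, is language independent. The Python sketch below is an illustration under assumptions, not the actual MyGet or .NET data context code: `Package`, `HashOnlyTracker`, and `EqualityTracker` are hypothetical, and the hash is made deliberately weak so that a collision is easy to reproduce.

```python
class Package:
    def __init__(self, package_id, version):
        self.package_id = package_id
        self.version = version

    def __eq__(self, other):
        return (isinstance(other, Package)
                and (self.package_id, self.version) == (other.package_id, other.version))

    def __hash__(self):
        # Deliberately weak: ignores the version, so different packages collide.
        return hash(self.package_id)


class HashOnlyTracker:
    """Faulty: assumes that equal hash codes mean the same object."""

    def __init__(self):
        self._by_hash = {}

    def track(self, package):
        code = hash(package)
        if code in self._by_hash and self._by_hash[code] is not package:
            raise RuntimeError("two different objects share a hash code")
        self._by_hash[code] = package


class EqualityTracker:
    """Correct: a set uses the hash to pick a bucket, then equality to compare."""

    def __init__(self):
        self._tracked = set()

    def track(self, package):
        self._tracked.add(package)

    def __len__(self):
        return len(self._tracked)


a = Package("Sample.Package", "1.0.0")
b = Package("Sample.Package", "1.0.1")

faulty = HashOnlyTracker()
faulty.track(a)
try:
    faulty.track(b)
    collision_detected = False
except RuntimeError:
    collision_detected = True

correct = EqualityTracker()
correct.track(a)
correct.track(b)
```

Equal objects must have equal hash codes, but the reverse does not hold, so any lookup keyed on the hash alone will eventually confuse two distinct objects.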

• Back then, we caught any Exception and blindly retried operations
  • Resulting in 16,000 commands and events being retried continuously
  • Causing storage throttling
  • Causing the website to retry reads
  • Causing more throttling
  • Starving IIS worker threads

• Lessons learned?
  • A simple bug can halt the entire application
  • Only retry transient errors
  • Our monitoring sucked
  • Bad, untested code (code from back when MyGet was a blog post…)

An exception killed the site? WTF?!?
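The lesson "only retry transient errors" can be sketched as a retry helper that gives up immediately on anything it does not recognize as transient and bounds the number of attempts. This is a minimal Python sketch, not MyGet's code: `TransientError`, `retry_transient`, and the sample operations are hypothetical names.

```python
import time


class TransientError(Exception):
    """Hypothetical marker for errors worth retrying (timeouts, throttling)."""


def retry_transient(operation, max_attempts=3, base_delay=0.0, sleep=time.sleep):
    """Runs operation; retries only TransientError, at most max_attempts times."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                raise  # give up instead of retrying forever
            sleep(base_delay * (2 ** (attempt - 1)))  # exponential backoff
        # Any other exception propagates immediately: retrying a bug will not fix it.


calls = {"flaky": 0, "buggy": 0}


def flaky():
    calls["flaky"] += 1
    if calls["flaky"] < 3:
        raise TransientError("storage busy")
    return "ok"


def buggy():
    calls["buggy"] += 1
    raise ValueError("programming error: retrying will not help")


flaky_result = retry_transient(flaky)

try:
    retry_transient(buggy)
    buggy_raised = False
except ValueError:
    buggy_raised = True
```

The flaky operation succeeds on its third attempt, while the buggy one is called exactly once. A blanket `except Exception` with unbounded retries is what turns a single bug into a retry storm.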

• Symptoms:
  • Everything down
  • Furious users on social media
  • Windows Azure Management Portal down
  • Furious tweets about #WindowsAzure

• The cause?
  • Global outage of Windows Azure due to an expired SSL certificate on storage

Huge downtime February 23rd, 2013

Full story at http://blog.myget.org/post/2013/02/24/We-were-down.aspx

• Move storage to HTTP instead of HTTPS?
• Windows Azure down globally impacts us quite a bit
• Fail-over to another solution costs money and lots of effort
• Decided against it for now

• Considering off-Windows Azure backups of at least all packages

Considerations and lessons learned

Full story at http://blog.myget.org/post/2013/02/24/We-were-down.aspx

• “Retention policies” introduced
• Seemed to be a success! 3+ million commands and events in queue
• Solution: scale out (20 instances did it in a few minutes)
• Solution for the future: feature toggling

One more! New features…
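Feature toggling, named above as the solution for the future, means a new feature can be switched off at runtime if it floods the queue. The Python sketch below is only an illustration: `FeatureToggles`, the `"retention-policies"` flag name, and `versions_to_delete` are hypothetical and do not describe how MyGet implemented it.

```python
from typing import Dict, List


class FeatureToggles:
    """Hypothetical runtime switchboard for features."""

    def __init__(self, initial: Dict[str, bool]) -> None:
        self._flags = dict(initial)

    def is_enabled(self, name: str) -> bool:
        return self._flags.get(name, False)  # unknown features default to off

    def set(self, name: str, enabled: bool) -> None:
        self._flags[name] = enabled


def versions_to_delete(toggles: FeatureToggles, versions: List[str], keep: int) -> List[str]:
    """Returns the oldest versions beyond `keep`, or nothing while toggled off."""
    if not toggles.is_enabled("retention-policies"):
        return []
    if keep <= 0:
        return list(versions)
    return versions[:-keep]


toggles = FeatureToggles({"retention-policies": False})
all_versions = ["1.0", "1.1", "1.2", "1.3"]   # oldest first, made-up sample data

while_off = versions_to_delete(toggles, all_versions, keep=2)
toggles.set("retention-policies", True)
while_on = versions_to_delete(toggles, all_versions, keep=2)
```

With the toggle off, no retention work is produced at all, so a misbehaving feature can be paused without a redeploy.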

But overall…

From: http://status.myget.org

Bonus tip

• “The Lean Startup” book says this
• Don’t build it yourself: Google Analytics

Measure everything, test assumptions

this is why we built username/password registration, seems a lot of people prefer typing instead of one click

we must keep investing in Build Services

feed discovery is more popular than we imagined from zero reactions on our blog and Twitter

the technical fear we had about “download as ZIP” consuming too much server resources? That thing doesn’t show up in our stats, that’s how successful it is…

Conclusion

• NuGet? MyGet?
• How we started
• What we did not know
• Our architecture
• ACS
• Tough times provide learning
• Measurement as well

Conclusion

Thank you!

http://blog.maartenballiauw.be

@maartenballiauw

http://amzn.to/pronuget

http://www.myget.org

http://aka.ms/AzureConf-MemberOffers

http://aka.ms/AzureConf-FreeTrial

Get started with a 90 day free trial

Or, use your existing benefits…
