How it's made - MyGet.org - AzureConf
THANK YOU, LOCAL ORGANIZERS!
Over 60 community-led Windows Azure training events worldwide!
http://globalwindowsazure.azurewebsites.net
Maarten Balliauw (@maartenballiauw)
How it’s made: MyGet.org
• Maarten Balliauw
  • Daytime: Technical Evangelist, JetBrains
  • Nighttime: Co-founder, MyGet.org
• AZUG
• Focus on web
• Big passion: Windows Azure
• http://blog.maartenballiauw.be
• @maartenballiauw
Who am I?
Shameless self promotion: Pro NuGet - http://amzn.to/pronuget
• NuGet? MyGet?
• How we started
• What we did not know
• Our first architecture
• Our second architecture
• ACS
• Tough times (learning moments)
• Conclusion
Agenda
NuGet? MyGet?
• Safely store your IP with us
• Creating packages is hard. We have Build Services!
• Granular security
• Activity streams
• Symbol server
• Analytics
Why MyGet?
• Xavier Decoster (@xavierdecoster)
• Yves Goeleven (@yvesgoeleven)
• Also known as @MyGetTeam
I’m not alone!
How we started
The real beginning? May 9, 2011
• Using OData as their feeds
• Which is some sort of WCF…
• Multiple feeds?
• Exchanged some ideas with Xavier
• Prototyped something during TechDays 2011
NuPack!
Prototype online! May 31, 2011
• Windows Azure (yay, new toy!)
• Windows Azure Table Storage & Blob Storage (cheap in case we fail!)
• Windows Azure ACS (no way I’m typing another user registration)
• ASP.NET MVC 2
• MEF
Technologies used?
Best practices used?
• One web role
• One storage account
Architecture at the time?
What we did not know…
• Grew from 5 feeds to 70 feeds in a few weeks, thinking we hit our max.
Users would come!
• One user started pushing 1,300 packages worth 1 GB of storage.
• Others started pushing CI packages.
Data would come!
• A lot of refactoring done
• Using best practices
• SOLID and DRY (well, not everywhere, but refactoring takes time)
• Running on two instances (availability, yay!)
ReSharper time!
• Someone mentioned they would pay for our service
• Business model
• Public site
• Volume of feeds kept going up
• Users in EU and US
We “started”
Our first architecture
[Diagram: web roles and storage across two datacenters, EU-WEST and NORTH CENTRAL US]
• Datacenters near our users
• Centralized storage
• Packages on CDN for faster throughput
• DNS fail-over if one of the DCs went down
Awesome!
• Datacenters near our users
  • Or not?
• Centralized storage
  • Speed of light! USA was slow!
• Packages on CDN for faster throughput
  • Sync issues, downtime, …
• DNS fail-over if one of the DCs went down
  • Seems not every ISP follows DNS standards
Not so awesome…
• Local caching in USA added
• 2 instances in EU, 1 in the USA
• Syncing data kept being slow
• Populating cache was a nightmare
• CDN kept having issues
• Of 3 instances, only 1 was being used with enough load (~60%)
We persisted!
• We had public subscription plans
• We added enterprise tenants (multi-tenancy added)
• Resulting in…
  • Architecture became complex
  • Caching and syncing became complex
We pivoted!
Our second architecture
• Managing feeds and packages
  • Doesn’t matter much where (who cares about a little latency)
• Downloading packages
  • May matter where, let the tenant decide on storage account location
• Builds
  • Who cares where!
Workloads
[Diagram: web role and storage in EU-WEST; storage in EU-NORTH; a storage account per tenant; other datacenters hold storage accounts for some tenants and a virtual machine for builds]
• … was scaled across the globe
• … but as synchronous as it could be
• … prone to all issues with latency vs. synchrony
• Event Driven Architecture?*
*some concepts borrowed from EDA
Our first architecture…
• Some actions put an ICommand on a queue (ground rule: if it can’t be done in 1 write, use ICommand)
• All actions complete with an IEvent on a queue
• Handlers can subscribe to ICommand and IEvent
• Handlers are idempotent and do not depend on other handlers
EDA in MyGet
• 2 operations: 1 read, 1 write
  • Read the profile
  • Store the profile with LastLogin date
• No use of ICommand
• Finishes with UserLoggedInEvent
Example: log in
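The ground rules and the log-in example above can be sketched in a few lines. This is a minimal illustration in Python rather than MyGet's actual .NET code: the `Bus` class, the `message_id` field and the `LoginCounter` handler are assumptions made for the example, and only `UserLoggedInEvent` is a name taken from the slides. It shows a handler subscribing to an event type and staying idempotent when the queue delivers the same message twice.

```python
from collections import defaultdict, deque
from dataclasses import dataclass


@dataclass(frozen=True)
class UserLoggedInEvent:
    message_id: str  # assumed field: lets a handler recognise a redelivery
    username: str


class Bus:
    """In-memory stand-in for a queue plus handler subscriptions."""

    def __init__(self):
        self._queue = deque()
        self._handlers = defaultdict(list)

    def subscribe(self, message_type, handler):
        self._handlers[message_type].append(handler)

    def publish(self, message):
        self._queue.append(message)

    def process(self):
        # Dispatch every queued message to the handlers of its type.
        while self._queue:
            message = self._queue.popleft()
            for handler in self._handlers[type(message)]:
                handler(message)


class LoginCounter:
    """Idempotent handler: a message that was already seen is ignored."""

    def __init__(self):
        self.seen = set()
        self.logins = 0

    def __call__(self, event):
        if event.message_id in self.seen:
            return
        self.seen.add(event.message_id)
        self.logins += 1


bus = Bus()
counter = LoginCounter()
bus.subscribe(UserLoggedInEvent, counter)

event = UserLoggedInEvent(message_id="evt-1", username="maarten")
bus.publish(event)
bus.publish(event)  # simulate the queue delivering the same message twice
bus.process()
print(counter.logins)
```

Because the handler keeps track of what it has processed, a duplicate delivery does not change the result, which is what makes retrying a message safe.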
• Many operations!
  • Read two user profiles
  • Read current access rights
  • Change access rights
  • Push new privileges to SymbolSource.org
• One command, one event
  • ChangeFeedOwnerCommand
  • FeedOwnerChangedEvent
Example: change feed owner
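As a rough sketch of how one command and one event could drive this example, the Python below chains a command handler and two event handlers over an in-memory queue. The message names come from the slides; the dictionaries and lists standing in for access-rights storage, SymbolSource.org and the activity log are hypothetical, and the real handlers are .NET classes that are not shown in the deck.

```python
from collections import defaultdict, deque
from dataclasses import dataclass


@dataclass(frozen=True)
class ChangeFeedOwnerCommand:
    feed: str
    new_owner: str


@dataclass(frozen=True)
class FeedOwnerChangedEvent:
    feed: str
    new_owner: str


queue = deque()
handlers = defaultdict(list)

owners = {"myfeed": "xavier"}  # hypothetical stand-in for access rights storage
symbol_source_pushes = []      # hypothetical stand-in for SymbolSource.org
activity_log = []


def change_feed_owner_handler(command):
    # Do the multi-step work, then complete with an event on the queue.
    owners[command.feed] = command.new_owner
    queue.append(FeedOwnerChangedEvent(command.feed, command.new_owner))


def symsrc_handler(event):
    symbol_source_pushes.append((event.feed, event.new_owner))


def activity_log_handler(event):
    activity_log.append(f"{event.feed}: owner changed to {event.new_owner}")


handlers[ChangeFeedOwnerCommand].append(change_feed_owner_handler)
handlers[FeedOwnerChangedEvent].append(symsrc_handler)
handlers[FeedOwnerChangedEvent].append(activity_log_handler)

queue.append(ChangeFeedOwnerCommand("myfeed", "maarten"))
while queue:
    message = queue.popleft()
    for handle in handlers[type(message)]:
        handle(message)

print(owners, symbol_source_pushes, activity_log)
```

The two event handlers know nothing about each other, so adding a third subscriber would not require touching the command handler.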
Example: change feed owner
[Diagram: ChangeFeedOwnerCommand is handled by ChangeFeedOwnerCommandHandler, which raises FeedOwnerChangedEvent; SymSrcHandler<FeedOwnerChangedEvent> (raising SymSrcEvent) and ActivityLogHandler<FeedOwnerChangedEvent> subscribe to it]
• We now run on 2 instances, mostly for redundancy (coming from 3)
• Average CPU usage? 20% (coming from 60%)
• Way easier to implement new features!
  • New feature: activity log
  • Simply subscribe to events we want to see in that log
Gain?
• Why no relational database?
• With only PartitionKey and RowKey as an index, how do you store a feed’s packages and versions in an optimal way?
  • Three important values: feed name, package id, package version
  • Table per feed
  • Package id = PartitionKey
  • Package version = RowKey
Storage
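The key layout above can be illustrated with a small in-memory model. This is only a sketch in Python: the nested dictionary stands in for Table Storage, and the function names, the package id and the entity fields are hypothetical. It shows why the mapping works: one table per feed, and a (package id, package version) pair that addresses exactly one row.

```python
from collections import defaultdict

# tables[feed][(partition_key, row_key)] -> entity; stand-in for Table Storage
tables = defaultdict(dict)


def store_package(feed, package_id, version, entity):
    # Table per feed, package id as PartitionKey, package version as RowKey.
    tables[feed][(package_id, version)] = entity


def get_package(feed, package_id, version):
    # A point lookup needs nothing more than the two keys.
    return tables[feed].get((package_id, version))


def list_versions(feed, package_id):
    # All versions of a package share one partition.
    return [row for (partition, row) in tables[feed] if partition == package_id]


store_package("myfeed", "Example.Package", "1.0.0", {"size": 1024})
store_package("myfeed", "Example.Package", "1.1.0", {"size": 2048})
store_package("otherfeed", "Example.Package", "1.0.0", {"size": 1024})

print(get_package("myfeed", "Example.Package", "1.1.0"))
print(list_versions("myfeed", "Example.Package"))
```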
• Reading 1,000 rows and deserializing them is SLOW (many seconds)
• We cache some tables on blob storage
  • 1,000 rows in serialized JSON = small
  • Loading one file = fast
  • Searching in memory through 1,000 rows = fast
• Cache update subscribed to IEvent
Storage
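A minimal sketch of that caching idea, assuming a dictionary as a stand-in for blob storage: the rows of one feed are serialized into a single JSON document, searches load that one document and filter in memory, and an event handler rebuilds the document when data changes. The function names and the `on_package_added` handler are hypothetical; the slides only say that the cache update subscribes to an IEvent.

```python
import json

blobs = {}  # stand-in for blob storage: blob name -> bytes


def rebuild_cache(feed, rows):
    # One small JSON document per feed instead of many table rows.
    blobs[f"{feed}.json"] = json.dumps(rows).encode("utf-8")


def search(feed, term):
    # Load one file, then search in memory.
    rows = json.loads(blobs[f"{feed}.json"])
    return [row for row in rows if term.lower() in row["id"].lower()]


def on_package_added(feed, rows, new_row):
    # Hypothetical event handler: keep the cached blob in sync with the table.
    rows.append(new_row)
    rebuild_cache(feed, rows)


rows = [{"id": "Example.Core", "version": "1.0.0"}]
rebuild_cache("myfeed", rows)
on_package_added("myfeed", rows, {"id": "Example.Web", "version": "2.0.0"})

print(search("myfeed", "web"))
```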
Windows Azure Access Control Service
• Multiple applications
  • www.myget.org
  • staging.myget.org
  • localhost:1196
  • Customer1.myget.org
  • Customer2.myget.org
  • …
• Multiple identity providers
  • Who wants Microsoft Account?
  • Google anyone?
  • Oh, your custom ADFS? Sure!
Imagine managing this!
[Diagram: production tenants (www.myget.org, *.customer.myget.org, other domain names), development (localhost:1196) and staging (myget-staging.cloudapp.net)]
Windows Azure Access Control Service
• Users typically have some identity that allows federation
• ACS gives us Microsoft Account, Yahoo!, Google & Facebook accounts*
• We only care about ACS in our code
*we built many others and are working on a spin-off http://socialsts.com
No more user registration!
• An identity is an identity, whether dev/staging/prod
• ACS handles subtle differences per environment
• Our app just gets and uses the claims
No difference between environments!
• Easy multi-tenant logins with different identity providers
• ACS decides how to log in based on the audience
  • www.myget.org
  • some.customers.myget.org
• Our app just gets and uses the claims
No difference between tenants!
Tough times (learning moments)
• Symptoms:
  • Users complaining about “downtime”
  • No monitoring SMS alert
  • Half an hour later: “site up!”, “site down!”, “site up!”, “site down!” SMS alerts
  • No sign of issues in the Windows Azure Management portal
• But what’s the cause?
  • We just deployed our multi-tenant architecture
  • We just enabled storage analytics
  • ELMAH was showing storage throttling
  • 16,000 unprocessed commands and events in queue
Huge downtime on July 2nd, 2012
Full story at http://blog.myget.org/post/2012/07/02/Site-issues-on-July-2nd-2012.aspx
• One simple piece of code…
  • GetHashCode() on Package object faulty
• “If two string objects are equal, the GetHashCode method returns identical values. However, there is not a unique hash code value for each unique string value. Different strings can return the same hash code.”
• GetHashCode() used to track object in data context (new vs. update)
• 2 objects with the same hashcode = UnhandledException
Huge downtime on July 2nd, 2012
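To make the collision problem concrete, here is a small Python sketch. It deliberately uses a toy hash function (`weak_hash`, an assumption for this example, not .NET's `GetHashCode()` and not MyGet's code) so that a collision is easy to produce. Tracking objects by hash code alone then treats two different package ids as the same object, while tracking by the id itself does not.

```python
def weak_hash(text):
    """Toy hash code: ignores character order, so 'ab' and 'ba' collide."""
    return sum(ord(ch) for ch in text) % 1024


def track_by_hash(package_ids):
    # Faulty: the hash code is used as the identity of the object.
    tracked = {}
    for package_id in package_ids:
        tracked[weak_hash(package_id)] = package_id  # a collision overwrites
    return tracked


def track_by_key(package_ids):
    # Sound: equal hash codes do not imply equal objects, so compare the key.
    return {package_id: package_id for package_id in package_ids}


ids = ["ab", "ba"]
print(weak_hash("ab") == weak_hash("ba"))
print(len(track_by_hash(ids)), len(track_by_key(ids)))
```

A hash code is useful for finding a bucket quickly; deciding whether two objects are the same still needs an equality check on the real key.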
• We caught any Exception and, back then, blindly retried operations
  • Resulting in 16,000 commands and events being retried continuously
  • Causing storage throttling
  • Causing the website to retry reads
  • Causing more throttling
  • Starving IIS worker threads
• Lessons learned?
  • A simple bug can halt the entire application
  • Only retry transient errors
  • Our monitoring sucked
  • Bad, untested code (code from back when MyGet was a blog post…)
An exception killed the site? WTF?!?
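The "only retry transient errors" lesson might look like the sketch below. It is an illustration in Python under stated assumptions: `TransientError` is a hypothetical exception type standing in for throttling or timeout failures, and `retry_transient` is not a MyGet or Windows Azure API. Anything that is not marked transient, such as a bug, propagates on the first attempt instead of being retried forever.

```python
import time


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (throttling, timeouts)."""


def retry_transient(operation, attempts=3, delay=0.0):
    """Run operation, retrying only TransientError, with exponential backoff."""
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == attempts:
                raise  # give up after the last attempt
            time.sleep(delay * 2 ** (attempt - 1))
        # Any other exception is not caught here and fails immediately.


calls = {"flaky": 0, "buggy": 0}


def flaky():
    calls["flaky"] += 1
    if calls["flaky"] < 3:
        raise TransientError("storage is throttling")
    return "ok"


def buggy():
    calls["buggy"] += 1
    raise ValueError("a real bug, retrying will not help")


print(retry_transient(flaky))
try:
    retry_transient(buggy)
except ValueError:
    print("bug surfaced after", calls["buggy"], "call")
```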
• Symptoms:
  • Everything down
  • Furious users on social media
  • Windows Azure Management Portal down
  • Furious tweets about #WindowsAzure
• The cause?
  • Global outage of Windows Azure due to an expired SSL certificate on storage
Huge downtime February 23rd, 2013
Full story at http://blog.myget.org/post/2013/02/24/We-were-down.aspx
• Move storage to HTTP instead of HTTPS?
• Windows Azure down globally impacts us quite a bit
• Fail-over to another solution costs money and lots of effort
• Decided against it for now
• Considering off-Windows Azure backups of at least all packages
Considerations and lessons learned
• “Retention policies” introduced
• Seemed to be a success! 3+ million commands and events in queue
• Solution: scale out (20 instances did it in a few minutes)
• Solution for the future: feature toggling
One more! New features…
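Feature toggling, named above as the solution for the future, can be sketched as a simple switch that is checked before work is queued. Everything in this Python example is hypothetical (the `FeatureToggles` class, the toggle name and the queued tuple); it only illustrates how a new feature could be shipped switched off and enabled gradually, so a flood of queue messages can be stopped without a redeploy.

```python
class FeatureToggles:
    """In-memory toggle set; a real one would likely live in configuration."""

    def __init__(self, enabled=()):
        self._enabled = set(enabled)

    def enable(self, name):
        self._enabled.add(name)

    def disable(self, name):
        self._enabled.discard(name)

    def is_enabled(self, name):
        return name in self._enabled


def on_package_pushed(feed, toggles, queue):
    """Only enqueue retention work while the feature is switched on."""
    if toggles.is_enabled("retention-policies"):
        queue.append(("ApplyRetentionPolicy", feed))


toggles = FeatureToggles()
queue = []

on_package_pushed("feed-a", toggles, queue)  # toggle off: nothing is queued
toggles.enable("retention-policies")
on_package_pushed("feed-b", toggles, queue)  # toggle on: one item is queued
print(queue)
```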
Bonus tip
• “The Lean Startup” book says this
• Don’t build it yourself: Google Analytics
Measure everything, test assumptions
This is why we built username/password registration: it seems a lot of people prefer typing over one click.
We must keep investing in Build Services.
Feed discovery is more popular than we imagined from the zero reactions on our blog and Twitter.
The technical fear we had about “download as ZIP” consuming too much server resources? That thing doesn’t show up in our stats, that’s how successful it is…
Conclusion
• NuGet? MyGet?
• How we started
• What we did not know
• Our architecture
• ACS
• Tough times provide learning
• Measurement as well
Conclusion
Thank you!
http://blog.maartenballiauw.be
@maartenballiauw
http://amzn.to/pronuget
http://www.myget.org
Get started with a 90 day free trial: http://aka.ms/AzureConf-FreeTrial
Or, use your existing benefits: http://aka.ms/AzureConf-MemberOffers