mvp showcast it - mensageria - exchange 2013 alta disponibilidade
TRANSCRIPT
SESSÃO: INFRAESTRUTURA
TRILHA: MENSAGERIA
© 2013, MVP ShowCast. Evento organizado por MVPs do Brasil com apoio da Microsoft.
MVP ShowCast 2013
Exchange 2013Alta Disponibilidade
Jorge Patricio Diaz Guzman
Exchange Server
Consultor SPC Chile e apresentador da Rádio Coringão e Rádio Livre Gaviões
@jorDiazMVP
Desafios de Storage• Discos• A capacidade aumentou, mas os IOPs não
• Databases• Os tamanhos devem ser administrados
• Cópias de Database• Reseeds devem ser rápidos e confiáveis• Os IOPs das cópias da database são ineficientes• Lagged copies precisam de storage assimétrico e precisam de cuidados manuais
• Multiplas Databases Por Volume• Autoreseed• Self-Recovery por falhas de Storage• Inovações no Lagged Copy
Storage - Inovações
Multiplas databases por volume
DB1 DB4DB3DB2
DB4
DB3
DB2
DB1
DB4
DB3
DB2
DB1
DB4
DB3
DB2
DB1 Passive
ActiveLagge
d
4-member DAG4 databases4 copies of each database4 databases per volume
Desenho simétrico
Multiple databases per volume
DB1 DB1DB1DB1
Passive
ActiveLagge
d
Single database copy/disk:Reseed 2TB Database = ~23 hrsReseed 8TB Database = ~93 hrs
20 MB/s
Multiple databases per volume
DB1 DB4DB3DB2
DB4
DB3
DB2
DB1
DB4
DB3
DB2
DB1
DB4
DB3
DB2
DB1 Passive
ActiveLagge
d
Single database copy/disk:Reseed 2TB Database = ~23 hrsReseed 8TB Database = ~93 hrs
4 database copies/disk:Reseed 2TB Disk = ~9.7 hrsReseed 8TB Disk = ~39 hrs
12 MB/s
12 MB/s
20 MB/s 20 MB/s
• Requerimentos• Um só disco lógico / partição por disco físico
• Recomendações• O número de databases por volume deve ser igual ao número de
copias por database• Mesmos vizinhos em todos os servidores (compartir o mesmo disco em
todos os servidores)
Multiplas databases por volume
• Os controladores de storage são basicamente mini-PCs• E por isso, eles podem congelar, cair, etc.., precisando de intervenção
administrativa
• Outras condições podem ocorrer• Perda de elementos vitais do sistema• Congelamento ou altíssima latência de IO
Desafios da recuperação
• As inovações do Exchange 2010 continuam• Novos comportamentos de recuperação foram
adicionados ao Exchange 2013• E mais ainda no Exchange 2013 CU1
Inovações na recuperação
Exchange Server 2010 Exchange Server 2013
ESE Database Hung IO (240s) System Bad State (302s)
Failure Item Channel Heartbeat (30s) Long I/O times (41s)
SystemDisk Heartbeat (120s) MSExchangeRepl.exe memory threshold (4GB)
Exchange Server 2013 CU1
Bus reset (event 129)
Replication service endpoints not responding
Cluster database hang (GUM updates blocked)
• Ativacao difícil• Lagged copies precisam de cuidados
manuais• Lagged copies não podem ser page
patched
Lagged Copy - Desafios
• Automatic log file replay• Espaco em disco insuficiente (habilitar no registry)• Page patching (habilitado by default)• Menos de 3 cópias saudáveis (habilitado no Active Directory;
configurado no registry)
• Integração com Safety Net• Não precisa de uma cirurgia de logs ou procurar o ponto de corrupção
Lagged Copy - Inovacoes
• Alta disponibilidade focada na saúde da database
• A melhor cópia selecionada é insuficiente para a nova arquitetura
• Configuração da rede do DAG continua manual
Desafios da alta disponibilidade
Inovações da alta disponibilidade• Managed Availability• Melhor cópia e seleção de servidor• Autoconfiguração da rede do DAG
Managed Availability• Doutrina principal do Exchange 2013• Access to a mailbox is provided by protocol stack on the Mailbox server
that hosts the active copy of the mailbox• If a protocol is down on a Mailbox server, all access to active databases
on that server via that protocol is lost
• Managed Availability foi introduzida para detectar e recuperar automaticamente para esses tipos de falhas• For most protocols, quick recovery is achieved via a restart action• If the restart action fails, a failover can be triggered
• An internal framework used by component teams
• Sequencing mechanism to control when recovery actions are taken versus alerting and escalation
• Enhances the Best Copy Selection algorithm by taking into account overall server health of source and target
Managed Availability
• MA failovers are recovery action from failure• Detected via a synthetic operation or live data• Throttled in time and across the DAG
• MA failovers can happen at database or server level• Database: Store-detected database failure can trigger database
failover• Server: Protocol failure can trigger server failover
• Single Copy Alert integrated into MA• ServerOneCopyInternalMonitorProbe (part of DataProtection Health Set)• Alert is per-server to reduce flow• Still triggered across all machines with copies• Logs 4138 (red) and 4139 (green) events
Managed Availability
• Exchange 2010 used several criteria• Copy queue length• Replay queue length• Database copy status – including activation blocked• Content index status
• Using just this criteria is not good enough for Exchange 2013, because protocol health is not considered
Best Copy Selection Challenges
• Still an Active Manager algorithm performed at *over time based on extracted health of the system
• Replication health still determined by same criteria and phases
• Criteria now includes health of entire protocol stack• Considers a prioritized protocol health set in the selection using four
priorities – critical, high, medium, low• Failover responders trigger added checks to select a “protocol not
worse” target
Best Copy and Server Selection
Managed Availability imposes 4 new constraints
on theBest Copy Selection
algorithm
Best Copy and Server Selection
All HealthyServer that has all health sets in a healthy state
Up to Normal HealthyServer that has all health sets Medium and above in a healthy state
All Better than SourceServer that has health sets in a state that is better than the server hosting the affected copy
Same as SourceServer that has health sets in a state that is the same as the server hosting the affected copy
1
2
3
4
• DAG networks must be manually collapsed in a multi-subnet deployment
• Small remaining administrative burden for deployment and initial configuration
DAG Network Challenges
• Automatically collapsed in multi-subnet environment
• Automatic or manual configuration• Default is Automatic• Requires specific settings on MAPI and Replication network interfaces
• Manual edits and EAC controls blocked by default• Set DAG to manual network setup to edit or change DAG networks
DAG Network Innovations
• Operationally complex• Mailbox and Client Access recovery
connected• Namespace is a SPOF
Site Resilience Challenges
Site Resilience Innovations• Key Characteristics• DNS resolves to multiple IP addresses• Almost all protocol access in Exchange 2013 is HTTP• HTTP clients have built-in IP failover capabilities• Clients skip past IPs that produce hard TCP failures• Admins can switchover by removing VIP from DNS• Namespace no longer a SPOF• No dealing with DNS latency
• Operationally simplified• Mailbox and Client Access recovery
independent• Namespace provides redundancy
Site Resilience Innovations
• Operationally Simplified• Previously loss of CAS, CAS array, VIP, LB, etc., required admin to
perform a datacenter switchover• In Exchange Server 2013, recovery happens automatically• The admin focuses on fixing the issue, instead of restoring service
Site Resilience
• Mailbox and CAS recovery independent• Previously, CAS and Mailbox server recovery were tied together in site
recoveries• In Exchange Server 2013, recovery is independent, and may come
automatically in the form of failover• This is dependent on business requirements and configuration
Site Resilience
• Namespace provides redundancy• Previously, the namespace was a single point of failure• In Exchange 2013, the namespace provides redundancy by leveraging
multiple A records and client’s OS/HTTP stack ability to failover
Site Resilience
SESSÃO: INFRAESTRUTURA
TRILHA: MENSAGERIA
© 2013, MVP ShowCast. Evento organizado por MVPs do Brasil com apoio da Microsoft.
Muito Obrigado
Rover Marinhowww.rovermarinho.org@rovermarinho
Marcelo Vighihttp://marcelovighi.wordpress.com @mvighi
Jorge@jorDiazMVP