mvp showcast it - mensageria - exchange 2013 alta disponibilidade

SESSÃO: INFRAESTRUTURA

TRILHA: MENSAGERIA

© 2013, MVP ShowCast. Evento organizado por MVPs do Brasil com apoio da Microsoft.

MVP ShowCast 2013

Exchange 2013Alta Disponibilidade

Jorge Patricio Diaz Guzman

Exchange Server

Consultor SPC Chile e apresentador da Rádio Coringão e Rádio Livre Gaviões

@jorDiazMVP

http://mvp.microsoft.com/pt-br/mvp/Jorge%20Patricio%20Diaz%20Guzman-4040004

https://twitter.com/jorDiazMVP

Agenda• Storage• Alta Disponibilidade• Site Resilience

Storage

Desafios de Storage• Discos• A capacidade aumentou, mas os IOPs não

• Databases• Os tamanhos devem ser administrados

• Cópias de Database• Reseeds devem ser rápidos e confiáveis• Os IOPs das cópias da database são ineficientes• Lagged copies precisam de storage assimétrico e precisam de cuidados manuais

• Multiplas Databases Por Volume• Autoreseed• Self-Recovery por falhas de Storage• Inovações no Lagged Copy

Storage - Inovações

Multiplas database por volume

Multiplas databases por volume

DB1 DB4DB3DB2

DB4

DB3

DB2

DB1

DB4

DB3

DB2

DB1

DB4

DB3

DB2

DB1 Passive

ActiveLagge

d

4-member DAG4 databases4 copies of each database4 databases per volume

Desenho simétrico

Multiple databases per volume

DB1 DB1DB1DB1

Passive

ActiveLagge

d

Single database copy/disk:Reseed 2TB Database = ~23 hrsReseed 8TB Database = ~93 hrs

20 MB/s

Multiple databases per volume

DB1 DB4DB3DB2

DB4

DB3

DB2

DB1

DB4

DB3

DB2

DB1

DB4

DB3

DB2

DB1 Passive

ActiveLagge

d

Single database copy/disk:Reseed 2TB Database = ~23 hrsReseed 8TB Database = ~93 hrs

4 database copies/disk:Reseed 2TB Disk = ~9.7 hrsReseed 8TB Disk = ~39 hrs

12 MB/s

12 MB/s

20 MB/s 20 MB/s

• Requerimentos• Um só disco lógico / partição por disco físico

• Recomendações• O número de databases por volume deve ser igual ao número de

copias por database• Mesmos vizinhos em todos os servidores (compartir o mesmo disco em

todos os servidores)

Multiplas databases por volume

Self-recovery por falhas de storage

• Os controladores de storage são basicamente mini-PCs• E por isso, eles podem congelar, cair, etc.., precisando de intervenção

administrativa

• Outras condições podem ocorrer• Perda de elementos vitais do sistema• Congelamento ou altíssima latência de IO

Desafios da recuperação

• As inovações do Exchange 2010 continuam• Novos comportamentos de recuperação foram

adicionados ao Exchange 2013• E mais ainda no Exchange 2013 CU1

Inovações na recuperação

Exchange Server 2010 Exchange Server 2013

ESE Database Hung IO (240s) System Bad State (302s)

Failure Item Channel Heartbeat (30s) Long I/O times (41s)

SystemDisk Heartbeat (120s) MSExchangeRepl.exe memory threshold (4GB)

Exchange Server 2013 CU1

Bus reset (event 129)

Replication service endpoints not responding

Cluster database hang (GUM updates blocked)

Lagged copy - Inovações

• Ativacao difícil• Lagged copies precisam de cuidados

manuais• Lagged copies não podem ser page

patched

Lagged Copy - Desafios

• Automatic log file replay• Espaco em disco insuficiente (habilitar no registry)• Page patching (habilitado by default)• Menos de 3 cópias saudáveis (habilitado no Active Directory;

configurado no registry)

• Integração com Safety Net• Não precisa de uma cirurgia de logs ou procurar o ponto de corrupção

Lagged Copy - Inovacoes

Alta Disponibilidade

• Alta disponibilidade focada na saúde da database

• A melhor cópia selecionada é insuficiente para a nova arquitetura

• Configuração da rede do DAG continua manual

Desafios da alta disponibilidade

Inovações da alta disponibilidade• Managed Availability• Melhor cópia e seleção de servidor• Autoconfiguração da rede do DAG

Managed Availability

Managed Availability• Doutrina principal do Exchange 2013• Access to a mailbox is provided by protocol stack on the Mailbox server

that hosts the active copy of the mailbox• If a protocol is down on a Mailbox server, all access to active databases

on that server via that protocol is lost

• Managed Availability foi introduzida para detectar e recuperar automaticamente para esses tipos de falhas• For most protocols, quick recovery is achieved via a restart action• If the restart action fails, a failover can be triggered

• An internal framework used by component teams

• Sequencing mechanism to control when recovery actions are taken versus alerting and escalation

• Enhances the Best Copy Selection algorithm by taking into account overall server health of source and target


• MA failovers are recovery action from failure• Detected via a synthetic operation or live data• Throttled in time and across the DAG

• MA failovers can happen at database or server level• Database: Store-detected database failure can trigger database

failover• Server: Protocol failure can trigger server failover

• Single Copy Alert integrated into MA• ServerOneCopyInternalMonitorProbe (part of DataProtection Health Set)• Alert is per-server to reduce flow• Still triggered across all machines with copies• Logs 4138 (red) and 4139 (green) events


Best Copy and Server Selection

• Exchange 2010 used several criteria• Copy queue length• Replay queue length• Database copy status – including activation blocked• Content index status

• Using just this criteria is not good enough for Exchange 2013, because protocol health is not considered

Best Copy Selection Challenges

• Still an Active Manager algorithm performed at *over time based on extracted health of the system

• Replication health still determined by same criteria and phases

• Criteria now includes health of entire protocol stack• Considers a prioritized protocol health set in the selection using four

priorities – critical, high, medium, low• Failover responders trigger added checks to select a “protocol not

worse” target


Managed Availability imposes 4 new constraints

on theBest Copy Selection

algorithm


All HealthyServer that has all health sets in a healthy state

Up to Normal HealthyServer that has all health sets Medium and above in a healthy state

All Better than SourceServer that has health sets in a state that is better than the server hosting the affected copy

Same as SourceServer that has health sets in a state that is the same as the server hosting the affected copy

1

2

3

4

DAG Network Innovations

• DAG networks must be manually collapsed in a multi-subnet deployment

• Small remaining administrative burden for deployment and initial configuration

DAG Network Challenges

• Automatically collapsed in multi-subnet environment

• Automatic or manual configuration• Default is Automatic• Requires specific settings on MAPI and Replication network interfaces

• Manual edits and EAC controls blocked by default• Set DAG to manual network setup to edit or change DAG networks

DAG Network Innovations

Site Resilience

• Operationally complex• Mailbox and Client Access recovery

connected• Namespace is a SPOF

Site Resilience Challenges

Site Resilience Innovations• Key Characteristics• DNS resolves to multiple IP addresses• Almost all protocol access in Exchange 2013 is HTTP• HTTP clients have built-in IP failover capabilities• Clients skip past IPs that produce hard TCP failures• Admins can switchover by removing VIP from DNS• Namespace no longer a SPOF• No dealing with DNS latency

• Operationally simplified• Mailbox and Client Access recovery

independent• Namespace provides redundancy

Site Resilience Innovations

• Operationally Simplified• Previously loss of CAS, CAS array, VIP, LB, etc., required admin to

perform a datacenter switchover• In Exchange Server 2013, recovery happens automatically• The admin focuses on fixing the issue, instead of restoring service

Site Resilience

• Mailbox and CAS recovery independent• Previously, CAS and Mailbox server recovery were tied together in site

recoveries• In Exchange Server 2013, recovery is independent, and may come

automatically in the form of failover• This is dependent on business requirements and configuration

Site Resilience

• Namespace provides redundancy• Previously, the namespace was a single point of failure• In Exchange 2013, the namespace provides redundancy by leveraging

multiple A records and client’s OS/HTTP stack ability to failover

Site Resilience


TRILHA: MENSAGERIA


Muito Obrigado

Rover Marinhowww.rovermarinho.org@rovermarinho

Marcelo Vighihttp://marcelovighi.wordpress.com @mvighi

Jorge@jorDiazMVP

http://www.rovermarinho.org/

http://marcelovighi.wordpress.com/

http://marcelovighi.wordpress.com/


TRILHA: MENSAGERIA


Perguntas & Respostas

mvp showcast it - mensageria - exchange 2013 alta disponibilidade

Technology

tb database

source server

database copiesdisk

gb exchange server

normal healthy server

single database copydisk

tb disk

ese database hung io