Performance Analysis and Optimization of the Weather Research and Forecasting Model (WRF) on Intel® Multicore and Manycore Architectures

Samuel Elliott – Advisor: Davide Del Vento
National Center for Atmospheric Research
University of Colorado, Boulder



[Figure: WRF Performance: Binding Threads to Socket. Performance (timesteps/s, 0-8) vs. number of nodes (0-10) for 8, 4, and 2 PPN, each with and without binding threads to a socket.]

Introduction

The Weather Research and Forecasting Model (WRF) is a widely used mesoscale numerical weather prediction system serving both research and operational forecasting needs. General performance portability is essential for any such community code to take advantage of ever-changing HPC architectures as we move into the exascale era of computing. Efficient shared-memory models are critical, as core counts on shared-memory systems will continue to increase. Intel's many integrated core (MIC) architecture is representative of such systems, so performance optimization of WRF on MIC architectures contributes directly to the overall performance portability of the model. In addition to performance portability for future systems, it is necessary to understand how the model behaves on current systems in order to make the best use of existing HPC resources.

Goals
1. Generalize optimizations and best practices to educate current WRF users on how to best utilize HPC resources.
2. Identify performance bottlenecks and issues with WRF hybrid parallelization on both Xeon CPUs and Xeon Phi coprocessors.

Benchmarking

All benchmarks in this study were run on the Texas Advanced Computing Center's Stampede supercomputer. Each node consists of two Intel Xeon E5-2680 CPUs and is equipped with Intel Xeon Phi SE10P coprocessors.

Xeon Phi Scaling

Although the initial strong-scaling results for WRF on Xeon Phi are poor (see the 'Xeon vs. Xeon Phi Strong Scaling Performance' plot), its performance approaches that of the host CPUs for extremely high workloads per core. To better understand this, we scaled up the grid size while running on a constant number of nodes. For either architecture we expect performance per grid point to be low while the workload per core is insufficient and to level off at a maximum once the workload per core is sufficient. We find that this maximum is reached at much larger workloads per core on Xeon than on Xeon Phi, and that beyond approximately 60,000 horizontal grid points per processor it actually becomes more efficient to use the Xeon Phi than the host Xeon CPUs (see the 'Scaling Problem Size' plot).
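To put that threshold in concrete terms (a hypothetical illustration, not one of the benchmark cases): a 500 x 500 horizontal grid divided across four CPUs or coprocessors works out to 500 * 500 / 4 = 62,500 horizontal grid points per device, just past the crossover where the Xeon Phi becomes the more efficient choice, while the same grid divided across sixteen devices gives roughly 15,600 points per device and falls well inside the regime where the host Xeons are more efficient.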

Conclusions: WRF Best Practices

•  Hybrid WRF parallelization gives consistent performance benefits, especially for MPI-bound workloads.
•  Ensure task/thread binding is in effect, using runtime scripts or runtime environment settings (the details are specific to each system and MPI implementation).
•  Ensure that the environment variable OMP_STACKSIZE is set appropriately for the memory limits of your system.
•  Parallel I/O with PnetCDF significantly decreases the time spent in I/O-related processes.
•  Using a separate output file for each MPI task makes writing history negligible: change the namelist option io_form_history = 102. This does require post-processing, but it is very worthwhile because I/O overhead dominates runtimes at larger core counts.
•  Understand the tradeoff between quick time to solution with low efficiency and high efficiency with slow time to solution.
•  Only use Xeon Phi if you are extremely focused on efficient utilization of HPC allocations on systems such as Stampede:
   •  Using optimal WRF I/O options is critical for effective use of Xeon Phi.
   •  Efficient utilization of Xeon Phi is limited to a very narrow window of workloads.
   •  For specific cases, symmetric execution can be used for efficient utilization of HPC resources.

Future Work

•  Further develop methods for better patching and tiling strategies.
•  Extend these studies to other systems, including the Knights Landing architecture, which is more representative of future manycore HPC systems.
•  Assess whether other shared-memory models, such as OpenMP task constructs, will further WRF's performance portability on current and future HPC architectures.
•  Better understand the performance issues associated with memory allocation that dramatically decrease memory utilization in hybrid WRF simulations.

Acknowledgements

I would first like to thank my advisor Davide Del Vento (NCAR), as well as Dave Gill (NCAR), John Michalakes (NCAR), Indraneil Gokhale (Intel), and Negin Sobhani (NCAR), for their guidance and discussions. I would also like to thank NCAR, the SIParCS internship program, and XSEDE for providing the opportunity, community, and resources to work on this project.

Hybrid WRF Scaling and Task/Thread Binding

The majority of WRF users write off hybrid parallelization after poor initial performance results, which are typically due to a lack of task/thread binding: threads spawned across sockets do not share L3 cache, so issues such as false sharing become more prevalent. The 'WRF Performance: Binding Threads to Socket' results show how critical thread binding within MPI ranks is. By using a hybrid implementation we also eliminate a significant portion of the MPI communication while still utilizing the same number of cores; the 'WRF Performance: Hybrid vs. MPI Scaling' results show how this allows WRF to scale more efficiently to a larger number of nodes.
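To make the binding advice concrete, the following is a minimal launch sketch for one possible geometry on a two-socket Stampede node: two MPI ranks per node, each bound to its own socket and running eight OpenMP threads. It assumes Intel MPI, the Intel OpenMP runtime, and TACC's ibrun launcher; the variable names are standard for those tools, but the values are illustrative and the equivalent settings differ on other systems and MPI implementations.

  export OMP_NUM_THREADS=8          # eight threads per MPI rank, matching the eight cores per socket
  export OMP_STACKSIZE=512M         # illustrative value; size it to your node's memory limits
  export KMP_AFFINITY=compact       # keep each rank's threads packed together on its socket
  export I_MPI_PIN_DOMAIN=socket    # pin each MPI rank, and the threads it spawns, to one socket
  ibrun ./wrf.exe                   # rank count and ranks-per-node come from the batch job settings

With this pinning in place no thread migrates across the socket boundary, so the L3-cache and false-sharing problems described above are avoided while every core on the node is still used.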

WRF OpenMP Tile Decomposition

WRF parallelizes by decomposing the grid into two-dimensional horizontal patches, which are distributed to the MPI tasks. Each OpenMP thread then solves and updates a subset of its patch, called an OpenMP tile. We found that WRF's patching and tiling strategies often cause load imbalance, and we categorized the problem cases as follows:

1. Number of patch rows > number of OpenMP threads. Each thread executes a single tile, but because the rows of a patch may not divide evenly into the number of tiles, some tiles contain more rows than others (up to a 2x imbalance between threads).

2. Number of patch rows < number of OpenMP threads. WRF decomposes the tiles two-dimensionally, each tile being a contiguous subset of a single row. The number of tiles then exceeds the number of threads, so some threads execute two tiles while others execute only one (always a 2x imbalance between threads).

3. Number of OpenMP threads is a multiple of the number of patch rows. This case should be optimal for thread balancing, but we found that WRF's default strategy over-decomposes the problem, causing unnecessary imbalances. We have since changed the tiling algorithm to fix this issue, and the fix will be included in the next release of WRF.

These issues are much more prevalent on Xeon Phi, where far more threads are available (up to 244) than on the Xeon CPUs (up to 8 on Stampede). The 'Thread Imbalance-Performance Relation' plot shows a case whose performance is hindered by the first and second issues; it demonstrates the correlation between execution time and the maximum workload per thread. The 'WRF Default Tiling Strategy Optimization' plot shows the performance benefit in the third case after changing the default tiling algorithm.
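One way to avoid the over-decomposition described in case 3 is to set the tile count explicitly rather than relying on WRF's default strategy. A minimal namelist sketch is shown below, assuming the numtiles option in the &domains record; the value is illustrative and should equal, or be an exact multiple of, the number of OpenMP threads per rank.

  &domains
   numtiles = 8
  /

With eight OpenMP threads per rank this assigns exactly one tile to each thread. WRF also exposes tile_sz_x and tile_sz_y for setting explicit tile dimensions, which can serve the same purpose.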


Node architecture (Stampede):

                        Xeon E5-2680 CPU      Xeon Phi SE10P Coprocessor
  Cores                 8                     61
  Hyperthreading        2-way                 4-way
  Vector registers      256-bit               512-bit
  Clock speed           2.7 GHz               1.1 GHz
  L1 cache              32 KB                 32 KB
  L2 cache              256 KB                512 KB
  L3 cache              20 MB (shared)        none
  Main memory           32 GB                 8 GB
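The thread counts quoted elsewhere on this poster follow directly from this table: 61 cores x 4 hardware threads per core gives up to 244 threads on a Xeon Phi card, while a host node offers 2 sockets x 8 cores = 16 cores (32 hardware threads with hyperthreading), of which at most 8 OpenMP threads per rank share a socket.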

[Diagram: MPI rank and thread placement across Socket 0 and Socket 1, contrasting unbound ranks (threads spread over both sockets) with Rank 0 and Rank 1 each bound to a single socket.]

[Figure: WRF Performance: Hybrid vs. MPI Scaling. Performance (timesteps/s, 0-25) vs. number of nodes (0-70) for pure MPI and for hybrid runs at 8, 4, and 2 PPN. Annotated regions: compute bound vs. MPI bound; high efficiency / slow time to solution vs. low efficiency / quick time to solution.]

I/O Considerations

WRF I/O, counting time spent in pure I/O as well as the MPI communication and data formatting that result directly from I/O reads and writes, can quickly dominate a simulation's runtime. This is especially significant for larger problem sizes and larger node counts. The following plot shows history write times using serial I/O, PnetCDF parallel I/O, and the namelist option io_form_history = 102, which writes a separate output file for each MPI rank. PnetCDF significantly reduces I/O time, and although the third option requires post-processing (very cheap relative to the I/O-related work during the simulation), writing to separate output files made history writes negligible during the simulation.
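As a concrete reference for the options compared above, a minimal &time_control sketch for the split-output case is shown below; io_form_history = 102 is the setting quoted in the conclusions, and all other history settings are left as they are in your existing namelist.

  &time_control
   io_form_history = 102
  /

For the PnetCDF runs the history format is instead set to WRF's parallel-netCDF value (commonly 11; check the I/O section of the WRF documentation for your version), and WRF must be built with PnetCDF support. The split-file option needs a post-processing step after the run to stitch the per-rank files back into a single history file.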

[Figure: Xeon vs. Xeon Phi Strong Scaling Performance. Performance (timesteps/s, 0-14) vs. number of nodes (0-50) for Xeon and Xeon Phi; Xeon Phi approaches Xeon performance for large workloads per core.]

Symmetric Xeon + Xeon Phi Performance

For large workloads per core, the low MPI overhead and near-constant efficiency make it possible to run well-balanced symmetric CPU-coprocessor WRF jobs that are more efficient than running on either homogeneous architecture alone. The following results demonstrate a case in which a speedup of more than 1.5x is achieved on the same number of nodes when running symmetrically.
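A hedged sketch of how such a symmetric run might be launched on Stampede is shown below. It assumes TACC's ibrun.symm wrapper and a separately built coprocessor-side binary (here called wrf.exe.mic); the wrapper name, flags, binary names, and thread counts are illustrative and should be checked against the system documentation rather than taken as the exact configuration behind these results.

  export OMP_NUM_THREADS=8          # threads per host-side MPI rank
  export MIC_OMP_NUM_THREADS=60     # threads per coprocessor-side MPI rank
  ibrun.symm -c ./wrf.exe -m ./wrf.exe.mic

The main tuning task is balancing the rank and thread counts so that neither the host side nor the coprocessor side becomes the straggler (compare the 'Consistent Balancing for Symmetric Runs' annotation on the problem-size plot).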

[Figure: Symmetric Xeon + Xeon Phi Performance. Performance (timesteps/s, 0-0.45) for Xeon CPU only, Xeon Phi coprocessor only, and symmetric Xeon + Xeon Phi runs.]

[Figure: Scaling Problem Size. Performance per grid point (millions of grid points computed per second) vs. horizontal grid dimension (0-700) for Xeon and Xeon Phi. Annotations: Xeon Phi exceeds Xeon performance above roughly 60,000 horizontal grid points per CPU/coprocessor; balancing is consistent for symmetric runs; Xeon Phi eventually hits its memory limits.]

[Figure: Thread Imbalance-Performance Relation. Compute time per timestep and maximum grid points per thread plotted against ranks per Xeon Phi (1-15), showing that compute time tracks the maximum per-thread workload.]

[Figure: WRF Default Tiling Strategy Optimization. Performance (timesteps/s, 0-2.0) vs. ranks per Xeon Phi (1-4) for the new default and the previous default tiling strategies.]

[Figure: Write Time Per History Interval (750x750x60 grid). Output write time per history interval vs. number of nodes (4-64) for serial I/O, PnetCDF parallel I/O, and separate output files per MPI task.]

References

Michalakes, J., J. Dudhia, D. Gill, J. Klemp, and W. Skamarock, 1999: Design of a next-generation weather research and forecast model. In Proceedings of the Eighth Workshop on the Use of Parallel Processors in Meteorology, European Centre for Medium-Range Weather Forecasts. World Scientific, Singapore.

Skamarock, W. C., J. B. Klemp, J. Dudhia, D. O. Gill, D. M. Barker, W. Wang, and J. G. Powers, 2005: A Description of the Advanced Research WRF Version 2. NCAR Technical Note NCAR/TN-468+STR, National Center for Atmospheric Research, Mesoscale and Microscale Meteorology Division, Boulder, CO.