Just Garbage... Garbage and Garbage... All Garbage!
But you will learn why garbage is so important for the BFUB application!!!
Clearing of Garbage!!!
Basic understanding of GC
I don't want to go into the details of GC, memory management and all that. But to make the results easier to follow, here is a quick refresher.
Conceptually, garbage collection (GC) creates the illusion of infinite free space.
– Java has a create (“new”) but no destroy
– Applications create objects as needed on the Heap
In reality, GC reclaims unused memory back to the free lists
– Finds objects that are no longer used
– Makes their storage available for allocation
All garbage collectors follow the same formula
• Find all live objects (Mark)
– Trace the object graph from a set of known starting points (e.g., Thread stacks). Known as “The Root Set”
• Recycle objects not found onto the free list (Sweep)
– Objects not visible in the live set are “dead”
• Optional: Move objects to reduce fragmentation (Compact)
– Free bits of memory here and there create holes
– A large object cannot be allocated even if the total free space is sufficient
– Converts many small holes into fewer large ones
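The mark/sweep formula above can be sketched as a toy Java model over an explicit object graph. Names like Node, heap, and rootSet are illustrative only; a real collector works on raw heap memory, not Java collections:

```java
import java.util.*;

// Toy mark-and-sweep over an explicit object graph.
class MarkSweepSketch {
    static class Node {
        final String name;
        final List<Node> refs = new ArrayList<>(); // outgoing references
        boolean marked = false;
        Node(String name) { this.name = name; }
    }

    // Mark: trace the object graph from the root set (e.g., thread stacks).
    static void mark(Collection<Node> rootSet) {
        Deque<Node> stack = new ArrayDeque<>(rootSet);
        while (!stack.isEmpty()) {
            Node n = stack.pop();
            if (n.marked) continue;
            n.marked = true;
            stack.addAll(n.refs);
        }
    }

    // Sweep: objects not found in the live set are dead; recycle their storage.
    static void sweep(List<Node> heap) {
        Iterator<Node> it = heap.iterator();
        while (it.hasNext()) {
            Node n = it.next();
            if (!n.marked) it.remove(); // back onto the free list
            else n.marked = false;      // reset the mark for the next GC cycle
        }
    }
}
```

Anything reachable from the root set survives the sweep; everything else is reclaimed, exactly as in the bullet points above.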
IBM Java GC has a number of selectable policies under which it will recycle objects
Why have many policies? Why not just “the best”?
– Cannot always dynamically determine what tradeoffs the user/application is willing to make
• Pause time vs. Throughput
• Footprint vs. Frequency
This is why we are tuning GC for the BFUB application.
Verbose GC logs and its analysis
Location to get verbose GC logs:
Application servers > server1 > Process definition > Java Virtual Machine
The -verbose:gc option is the main diagnostic available for runtime analysis of the Garbage Collector; it can be put in the generic JVM arguments.
The native_stderr.log file will be generated in the following location:
/ibm/WebSphere7/AppServer/profiles/AppSrv01/logs/server1
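For reference, enabling the log needs only the single flag below in the generic JVM arguments field. The optional -Xverbosegclog flag (supported by IBM JVMs; check your SDK's diagnostics guide) writes the output to a named file instead of native_stderr.log:

```
-verbose:gc
-Xverbosegclog:/tmp/verbosegc.log
```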
Analyzer
For the analysis, we can use following tools from IBM:
· IBM support Assistant - Garbage Collection and Memory Visualizer
· GC Analyzer – ga402
Please don't feed Sun JDK logs into the IBM GC analyzer, and vice versa. Why? The verbose GC log formats are different!!
Also, please don’t ask why I am using only IBM tools!!!
Why BankfusionUB needs optimal GC?
· If there is Java, there will be garbage. So how will BFUB, a high-throughput application, survive it?
· Although IBM ships tuned and intelligent default GC parameters, it still leaves leeway for an application like BFUB to optimize further.
· Don't we want Tier 1 banks!!!!
Always remember Our Aim!!
· Reduce GC pause time overhead as much as possible.
· To improve performance of BFUB and make it scalable.
How reducing GC pause time leads to improved performance, we will see in this document!!
Test case and Server specification
We need some BFUB application module to confirm the best GC settings!!!
Test case
Test case | Description |
Online | Teller module for 75 users |
Batch | Interest Accrual and Interest Accrual posting Batch process |
Also, Server specification is important for GC settings!!
Server specification
Server Type | Application | Configuration |
App Server | WAS 7 | 8 cores, 16 GB RAM, IBM 9133-55A, Power 5, 1.65 GHz, 64-bit |
DB Server | DB2 9.7 | 8 cores, 16 GB RAM, IBM 9133-55A, Power 5, 1.65 GHz, 64-bit |
Description of current GC settings and problems associated with that.
The current setting is the default: the initial heap size is 1/4th of the maximum heap, and the optthruput GC policy is used.
Policy | Option | Description |
Optimize for throughput | -Xgcpolicy:optthruput (optional) | The default policy. It is typically used for applications where raw throughput is more important than short GC pauses. The application is stopped each time that garbage is collected. |
It works as shown in the figure below:
Results
Online
GC time (sec) | Collection | Average GC Overhead % | Max GC Overhead % | Test type | Max CPU (%) | Avg. CPU (%) | % improvement in response time |
517 | 1399 | 19 | 41 | Min512Max2048-optthruput | 79.2 | 62.9 | Baseline |
Batch
GC time (sec) | Accrual (h:mm:ss) | Posting (h:mm:ss) | collection | Test type | Average GC Overhead % | Max GC Overhead% |
85 | 0:10:33 | 0:03:21 | 245 | Min512Max2048-optthruput | 6 | 90 |
Problem Statement 1
Low initial heap (initial heap set to 1/4th of the maximum heap)
Reasons
- Heap contraction and expansion overhead occurs during GC, which adds to GC pause time.
- Compaction occurs if the heap is too small, fragmented, or resized, and that also adds to GC pause time.
- More collections are observed because GC is running frequently; the overall average overhead is 19% in online and 6% in batch, which is too high.
Solution
Keep Initial memory equal to Maximum memory
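In WebSphere this means setting Initial heap size equal to Maximum heap size in the JVM panel, which corresponds to the following JVM arguments (2048m is the maximum heap used in these tests):

```
-Xms2048m -Xmx2048m
```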
Results
Online
GC time (sec) | Collection | Average GC Overhead % | Max GC Overhead % | Test type | Max CPU (%) | Avg. CPU (%) | % improvement in response time |
517 | 1399 | 19 | 41 | Min512Max2048-optthruput | 79.2 | 62.9 | Baseline |
283 | 777 | 11 | 69 | Min=Max2048-optthruput | 75.9 | 62.5 | 6.93849390 |
Batch
GC time (sec) | Accrual (h:mm:ss) | Posting (h:mm:ss) | collection | Test type | Average GC Overhead % | Max GC Overhead% |
85 | 0:10:33 | 0:03:21 | 245 | Min512Max2048-optthruput | 6 | 90 |
29 | 0:10:22 | 0:03:10 | 79 | Min=Max2048 -optthruput | 3 | 87 |
Observations with changed settings
- No contraction/expansion, so no time is wasted in heap resizing.
- No compaction time is wasted, since a contiguous heap is available throughout the test (no resizing is needed).
- Fewer collections were observed, and the overall average overhead is reduced to 11% in online and 3% in batch.
- Around 7% improvement in response time.
Although we have solved one problem, BFUB still carries an 11% overhead in the online process, so it is still in the red-hot zone. Now let's take up the next important aspect, the GC policy, which is simply a different way of doing GC.
Problem Statement 2
Which is best GC policy for BFUB?
Reasons
· GC overhead is around 19%, which is too high, so we cannot just rely on optimum throughput (the default optthruput GC policy).
· We need a GC policy that can utilize concurrent GC and can distinguish between short-lived and long-lived objects.
· Any performance improvement will come from reducing GC activity.
Solution
IBM has given us several GC policies to address our concern. Let's test them one after the other!!!
Optavgpause (should never be used for BFUB)
Policy | Option | Description |
Optimize for pause time | -Xgcpolicy:optavgpause | Trades high throughput for shorter GC pauses by performing some of the garbage collection concurrently. The application is paused for shorter periods. |
optavgpause is an alternative GC policy designed to keep pauses to a minimum. It does not guarantee a particular pause time, but pauses are shorter than those produced by the default GC policy. The idea is to perform some garbage collection work concurrently while the application is running. This is done in two places:
- Concurrent mark and sweep: Before the heap fills up, each mutator helps out and marks objects (concurrent mark). There is still a stop-the-world GC, but the pause is significantly shorter. After GC, the mutator threads help out and sweep objects (concurrent sweep).
- Background GC thread: One (or more) low-priority background GC threads perform marking while the application is idle.
Result
Online
GC time (sec) | Collection | Average GC Overhead % | Max GC Overhead % | Test type | Max CPU (%) | Avg. CPU (%) | % improvement in response time |
517 | 1399 | 19 | 41 | Min512Max2048-optthruput | 79.2 | 62.9 | Baseline |
283 | 777 | 11 | 69 | Min=Max2048-optthruput | 75.9 | 62.5 | 6.9384939 |
513 | 513 | 0 | 13 | Min=Max2048-optavgpause | 94.4 | 85.7 | -593.36253 |
Batch
GC time (sec) | Accrual (h:mm:ss) | Posting (h:mm:ss) | collection | Test type | Average GC Overhead % | Max GC Overhead% |
85 | 0:10:33 | 0:03:21 | 245 | Min512Max2048-optthruput | 6 | 90 |
29 | 0:10:22 | 0:03:10 | 79 | Min=Max2048 -optthruput | 3 | 87 |
11 | 0:10:11 | 0:03:18 | 84 | Min=Max2048 -optavgpause | 1 | 75 |
Observations with changed settings:
As IBM says, there is an obvious degradation in throughput (around 5-10%) because GC happens concurrently with the application threads (concurrent mark and sweep); that is the reason throughput degrades.
But the BFUB application shows far more than what IBM says: about 600% degradation in response time.
We need to see why response time degrades that much. The likely reason is that the concurrent mark and sweep threads consume more CPU, leaving fewer resources for the BFUB application.
Thus, as of now, a permanent goodbye to the optavgpause GC policy for the BFUB application.
Gencon (The best!!)
Policy | Option | Description |
Generational concurrent | -Xgcpolicy:gencon | Handles short-lived objects differently than objects that are long-lived. Applications that have many short-lived objects can see shorter pause times with this policy while still producing good throughput. |
A generational garbage collection strategy considers the lifetime of objects and places them in separate areas of the heap. In this way, it tries to overcome the drawbacks of a single heap in applications where most objects die young -- that is, where they do not survive many garbage collections.
With generational GC, objects that tend to survive for a long time are treated differently from short-lived objects. The heap is split into a nursery and a tenured area, as illustrated in Figure 4. Objects are created in the nursery and, if they live long enough, are promoted to the tenured area. Objects are promoted after having survived a certain number of garbage collections. The idea is that most objects are short-lived; by collecting the nursery frequently, these objects can be freed up without paying the cost of collecting the entire heap. The tenured area is garbage collected less often.
As you can see in Figure, the nursery is in turn split into two spaces: allocate and survivor. Objects are allocated into the allocate space and, when that fills up, live objects are copied into the survivor space or into the tenured space, depending on their age. The spaces in the nursery then switch use: allocate becomes survivor and survivor becomes allocate. The space occupied by dead objects can simply be overwritten by new allocations. Nursery collection is called a scavenge; Figure illustrates what happens during this process:
When the allocate space is full, garbage collection is triggered. Live objects are then traced and copied into the survivor space. This process is really inexpensive if most of the objects are dead. Furthermore, objects that have reached a copy threshold count are promoted into the tenured space. The object is then said to be tenured.
As the name Generational concurrent implies, the gencon policy has a concurrent aspect to it. The tenured space is concurrently marked with an approach similar to the one used in the optavgpause policy, except without concurrent sweep. All allocations pay a small throughput tax during the concurrent phase. With this approach, the pause time incurred from the tenure space collections is kept small.
Figure shows how the execution time maps out when running gencon GC:
Distribution of CPU time between mutators and GC threads in gencon
A scavenge is short (shown by the small red boxes). Gray indicates that concurrent tracing starts followed by a collection of the tenured space, some of which happens concurrently. This is called a global collection, and it includes both a scavenge and a tenure space collection. How often a global collection occurs depends on the heap sizes and object lifetimes. The tenured space collection should be relatively quick because most of it has been collected concurrently.
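To make the scavenge-and-tenure mechanics above concrete, here is a toy Java model of a nursery with allocate/survivor spaces and copy-threshold promotion. The class names, fields, and the threshold of 3 are illustrative only, not taken from the IBM implementation:

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of a gencon-style nursery scavenge (illustrative names only).
class GenconSketch {
    static final int TENURE_AGE = 3; // assumed copy-threshold count before promotion

    static class Obj {
        final String name;
        int age = 0;
        boolean live = true; // stands in for reachability from the root set
        Obj(String name) { this.name = name; }
    }

    List<Obj> allocate = new ArrayList<>(); // nursery: allocate space
    List<Obj> survivor = new ArrayList<>(); // nursery: survivor space
    List<Obj> tenured  = new ArrayList<>(); // tenured (old) area

    Obj alloc(String name) {
        Obj o = new Obj(name);
        allocate.add(o);
        return o;
    }

    // A scavenge: copy live objects out of the allocate space, then flip spaces.
    void scavenge() {
        for (Obj o : allocate) {
            if (!o.live) continue;                    // dead objects are simply overwritten
            o.age++;
            if (o.age >= TENURE_AGE) tenured.add(o);  // promoted: the object is "tenured"
            else survivor.add(o);                     // copied into the survivor space
        }
        allocate.clear();
        List<Obj> tmp = allocate;                     // flip: survivor becomes allocate
        allocate = survivor;
        survivor = tmp;
    }
}
```

If most objects die young, each scavenge copies very little, which is exactly why nursery collection is so cheap for workloads like this one.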
Results
Online
GC time (sec) | Collection | Average GC Overhead % | Max GC Overhead % | Test type | Max CPU (%) | Avg. CPU (%) | % improvement in response time |
517 | 1399 | 19 | 41 | Min512Max2048-optthruput | 79.2 | 62.9 | Baseline |
283 | 777 | 11 | 69 | Min=Max2048-optthruput | 75.9 | 62.5 | 6.9384939 |
513 | 513 | 0 | 13 | Min=Max2048-optavgpause | 94.4 | 85.7 | -593.3625 |
155 | 2206 | 6 | 100 | Min=Max2048-gencon | 76.2 | 50.9 | 39.198494 |
Batch
GC time (sec) | Accrual (h:mm:ss) | Posting (h:mm:ss) | collection | Test type | Average GC Overhead % | Max GC Overhead% |
85 | 0:10:33 | 0:03:21 | 245 | Min512Max2048-optthruput | 6 | 90 |
29 | 0:10:22 | 0:03:10 | 79 | Min=Max2048 -optthruput | 3 | 87 |
11 | 0:10:11 | 0:03:18 | 84 | Min=Max2048 -optavgpause | 1 | 75 |
20 | 0:10:00 | 0:03:10 | 280 | Min=Max2048 -GenCon | 2 | 100 |
Observations with changed settings
- The mean occupancy in the nursery is 3%. This is low, so the gencon policy is probably an optimal policy for this workload.
- Approximately 40% improvement in response time.
- Average GC overhead is reduced to 6% in online and 2% in batch.
- A reduction of around 12% in CPU usage.
Subpool (Ok but…)
Policy | Option | Description |
Subpooling | -Xgcpolicy:subpool | Uses an algorithm similar to the default policy's but employs an allocation strategy that is more suitable for multiprocessor machines. We recommend this policy for SMP machines with 16 or more processors. This policy is only available on IBM pSeries® and zSeries® platforms. Applications that need to scale on large machines can benefit from this policy. |
The subpool policy can help increase performance on multiprocessor systems. As I mentioned earlier, this policy is available only on IBM pSeries and zSeries machines. The heap layout is the same as that for the optthruput policy, but the structure of the free list is different. Rather than having one free list for the entire heap, there are multiple lists, known as subpools. Each pool has an associated size by which the pools are ordered. An allocation request of a certain size can quickly be satisfied by going to the pool with that size. Atomic (platform-dependent) high-performing instructions are used to pop a free list entry off the list, avoiding serialized access. Figure shows how the free chunks of storage are organized by size:
Subpool free chunks ordered by size
When the JVMs start or when a compaction has occurred, the subpools are not used because there are large areas of the heap free. In these situations, each processor gets its own dedicated mini-heap to satisfy requests. When the first garbage collection occurs, the sweep phase starts populating the subpools, and subsequent allocations mainly use subpools.
The subpool policy can reduce the time it takes to allocate objects. Atomic instructions ensure that allocations happen without acquiring a global heap lock. Mini-heaps local to a processor increase efficiency because cache interference is reduced. This has a direct effect on scalability, especially on multiprocessor systems. On platforms where subpool is not available, generational GC can provide similar benefits.
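The size-ordered free lists described above can be sketched with a size-keyed map: an allocation request is satisfied from the smallest pool whose chunk size fits. This is a toy single-threaded model; the real subpools additionally use atomic instructions to avoid a global heap lock:

```java
import java.util.Map;
import java.util.TreeMap;

// Toy model of subpool allocation: free chunks are kept in pools ordered by size.
class SubpoolSketch {
    // pool chunk size -> count of free chunks of that size
    private final TreeMap<Integer, Integer> pools = new TreeMap<>();

    // Sweep phase returns a reclaimed chunk to the pool of its size.
    void free(int chunkSize) {
        pools.merge(chunkSize, 1, Integer::sum);
    }

    // Allocate from the smallest pool whose chunk size >= request.
    // Returns the chunk size used, or -1 if no pool can satisfy the request.
    int alloc(int request) {
        Map.Entry<Integer, Integer> e = pools.ceilingEntry(request);
        if (e == null) return -1;
        if (e.getValue() == 1) pools.remove(e.getKey());
        else pools.put(e.getKey(), e.getValue() - 1);
        return e.getKey();
    }
}
```

Going straight to the right-sized pool is what makes allocation fast here: there is no scan over one long free list for the whole heap.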
Results
Online
GC time (sec) | Collection | Average GC Overhead % | Max GC Overhead % | Test type | Max CPU (%) | Avg. CPU (%) | % improvement in response time |
517 | 1399 | 19 | 41 | Min512Max2048-optthruput | 79.2 | 62.9 | Baseline |
283 | 777 | 11 | 69 | Min=Max2048-optthruput | 75.9 | 62.5 | 6.938493 |
513 | 513 | 0 | 13 | Min=Max2048-optavgpause | 94.4 | 85.7 | -593.3625 |
155 | 2206 | 6 | 100 | Min=Max2048-gencon | 76.2 | 50.9 | 39.19849 |
245 | 638 | 7 | 89 | Min=Max2048-subpool | 70.9 | 56.3 | 22.92960 |
Batch
GC time (sec) | Accrual (h:mm:ss) | Posting (h:mm:ss) | collection | Test type | Average GC Overhead % | Max GC Overhead% |
85 | 0:10:33 | 0:03:21 | 245 | Min512Max2048-optthruput | 6 | 90 |
29 | 0:10:22 | 0:03:10 | 79 | Min=Max2048 -optthruput | 3 | 87 |
11 | 0:10:11 | 0:03:18 | 84 | Min=Max2048 -optavgpause | 1 | 75 |
20 | 0:10:00 | 0:03:10 | 280 | Min=Max2048 -GenCon | 2 | 100 |
24 | 0:09:36 | 0:03:10 | 66 | Min=Max2048 -subpool | 3 | 94 |
Limitation for Subpool
- This policy is available only on IBM pSeries and zSeries machines.
- Overhead is higher (7%): compaction is happening, which increases AF/GC pause time during the test.
- On average, GC pause time is longer because more memory must be reclaimed per collection in subpool (375 ms) than in gencon (120 ms). [Look at the GC pause times in the graph below.]
Thus we have proved that gencon is the best GC policy for the BFUB application.
Now What!!!
Problem Statement 3
Tuning gencon policy
Reason
As discussed earlier, the nursery is where short-lived objects are stored; an object is moved to the tenured area if it survives a certain number of GCs. During our analysis we observed that BFUB has far more short-lived objects than long-lived ones, so playing with the nursery size should land us some positives.
Solution
Tuning nursery size (-Xmn)
- Default: nursery is 25% of the max heap size, the remaining 75% is tenured.
- Nursery at 50% of the max heap size, the remaining 50% tenured.
- Nursery at 75% of the max heap size, the remaining 25% tenured.
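For the 2048 MB heap used in these tests, the three settings correspond to the following -Xmn values (512/1024/1536 MB are simply 25%, 50%, and 75% of 2048 MB):

```
-Xmn512m    (25% of the 2048m heap: the default ratio)
-Xmn1024m   (50% of the 2048m heap)
-Xmn1536m   (75% of the 2048m heap)
```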
Results
Online
GC time (sec) | Collection | Average GC Overhead % | Max GC Overhead % | Test type | Max CPU (%) | Avg.CPU (%) | % improvement in response time |
155 | 2206 | 6 | 100 | Min=Max2048-gencon (default 25% nursery size) | 76.2 | 50.9 | 39.19849 |
107 | 1082 | 3 | 100 | Min=Max2048-gencon (50% nursery size) | 65.6 | 50.6 | 46.18440 |
92 | 724 | 2 | 100 | Min=Max2048-gencon (75% nursery size) | 64.6 | 49.9 | 25.32519 |
Observations with changed settings
- Continuously high nursery heap occupancy was observed with the 25% nursery size setting.
- On average, GC pause time is longer with the 75% nursery setting (137 ms) than with the 50% setting (120 ms), because more memory must be reclaimed per collection.
- The mean occupancy of the tenured area is 87% with the 75% nursery setting, which is high.
- The best result (46% improvement in response time) was observed with the 50% nursery size setting.
- Fewer global garbage collections were observed with the 50% nursery size (4) than with 75% (9).
Problem Statement 4
Remove explicit GC
Reasons
The use of System.gc() is generally not recommended, since it can cause long pauses and does not allow the garbage collection algorithms to optimize themselves.
Solution
-Xdisableexplicitgc. With this option we give the BFUB code no chance to trigger collections through occurrences of System.gc().
Thus we saw a great improvement in CPU utilization and transaction response time in online mode, and a significant reduction in GC pause time. On the other hand, the elapsed time of the batch process is more or less the same (as GC overhead was already low), but the optimized GC settings still reduced GC pause time.
Problem Statement 5
Why not test with full M&D (Teller with ATM, BPW, Lending, collateral and core) and see the impact?
Reasons
We should know the impact when all the modules are running, which is the real-world scenario in banks.
Solution
GC time (sec) | Collection | Average GC Overhead % | Max GC Overhead % | Test type | Max CPU % | Avg. CPU % | % improvement in response time |
570 | 1676 | 23 | 95 | Min512Max2048-optthruput | 85.6 | 67.2 | Base |
162 | 1431 | 3 | 100 | Min=Max2048 -GenCon/Xmn1024 | 71 | 53.3 | 46 |
Problem Statement 6
Why not run full EOD and see the impact?
It is reasonable to expect that we will be able to improve overall EOD timings this way.
Still Grey areas in Bankfusion UB
Finalizer
Using finalizers is not recommended, as they can slow garbage collection and waste space in the heap. We should consider reviewing the BFUB application for occurrences of the finalize() method. We can use the ISA tool add-on IBM Monitoring and Diagnostic Tools for Java - Memory Analyzer to list objects that are only retained through finalizers.
LOA
LOA is the Large Object Area. A large object is one that occupies more than 64 KB in the heap. The more large objects there are, the longer the GC pause time, so large objects should be minimized or avoided.
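As a rough sketch of the 64 KB threshold quoted above, the check below estimates whether an array allocation counts as "large". The 16-byte array header is an assumed, JVM-dependent value; the point is only that payloads around 64 KB and up qualify:

```java
// Rough estimate of whether an allocation exceeds the 64 KB large-object
// threshold mentioned in the text. Header size and alignment are assumptions.
class LoaSketch {
    static final long LARGE_OBJECT_BYTES = 64 * 1024;
    static final long ARRAY_HEADER_BYTES = 16; // assumed, varies by JVM/platform

    static long approxArraySize(long elements, long elementBytes) {
        long raw = ARRAY_HEADER_BYTES + elements * elementBytes;
        return (raw + 7) / 8 * 8; // align to 8 bytes
    }

    static boolean isLarge(long sizeInBytes) {
        return sizeInBytes > LARGE_OBJECT_BYTES;
    }
}
```

So, for example, a byte[100000] buffer lands well above the threshold, while a typical few-kilobyte transaction object does not.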
There are no dumb questions in BankfusionUB
Q1. How do GC settings change with the system and the transactions?
From the above discussion, it is clear that the product team has to decide whether it needs response time improvement, throughput improvement, or a trade-off. Thus the transaction requirements matter a lot for GC settings. The system also plays an important role: GC is a CPU-intensive activity, and the number of GC threads depends on the configuration. Thus GC settings should be set according to the system configuration.
Q2. Why not give the maximum heap available on the system?
No, it is a myth that we should simply give the maximum possible heap. GC settings should always be recommended by iteratively testing and comparing results. The problem with a large heap is that it may need to clear a big pile of garbage, which pauses the system for a longer duration; also, if the heap is fragmented, the GC threads may need longer to mark the objects.
Q3. Is there a magic formula to give to customers so that BFUB can be implemented in any customer environment?
No magic formula, but a magic analysis can be done according to the system requirements and transaction peak, with some of our magic tools!
Conclusion/Best Practices
Hope GC settings explained in the document will take care of everything!!
But still GC settings are dependent on system and transaction volume.
Thus, Final GC settings:
Initial heap size = Maximum heap size
GC Policy: gencon (-Xgcpolicy:gencon)
Nursery size = 50% of Maximum heap size (-Xmn1024m)
Disable explicit GC= -Xdisableexplicitgc
To print the GC parameters = -verbose:sizes
Maximum heap size will depend on the system configuration and throughput requirement.
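Putting the recommendations above together, the generic JVM arguments would look roughly like this (the 2048m value matches the test configuration used here; size the heap for your own system and throughput requirement):

```
-Xms2048m -Xmx2048m -Xmn1024m -Xgcpolicy:gencon -Xdisableexplicitgc -verbose:gc
```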
Sources
http://www.iecc.com/gclist/GC-faq.html
http://www.ibm.com/developerworks/ibm/library/i-gctroub/
http://www.ibm.com/developerworks/java/library/j-ibmjava2/
http://www.ibm.com/developerworks/java/library/j-ibmjava3/
http://www.performancewiki.com/was-tuning.html
http://publib.boulder.ibm.com/infocenter/ieduasst/v1r1m0/index.jsp?topic=/com.ibm.iea.was_v7/was/7.0/ProblemDetermination/WASv7_GCMVOverview/player.html
http://publib.boulder.ibm.com/infocenter/javasdk/v6r0/index.jsp?topic=/com.ibm.java.doc.diagnostics.60/diag/appendixes/cmdline/cmdline_gc.html