2014-05-29

Java Performance - JVM Overview

Book

Addison Wesley - Java Performance (2012)
Chapter 3 - JVM Overview


HotSpot VM High Level Architecture



Responsibilities of the HotSpot VM Runtime

  1. Parsing of command line arguments
  2. VM life cycle
  3. Class loading
  4. Byte code interpreter
  5. Exception handling
  6. Synchronization
  7. Thread management
  8. Java Native Interface
  9. VM fatal error handling
  10. C++ heap management

Command Line Options

  1. Standard Options
  2. Nonstandard Options: start with "-X"
  3. Developer Options: start with "-XX"
    format 1: -XX:{+/-}{OptionName} (boolean flags, turned on with + and off with -)
    format 2: -XX:{OptionName}={Number}{k/m/g} (flags that take a value)
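
A hedged example showing both formats on one command line (flag choices are illustrative):

  java -XX:+UseParallelGC -XX:MaxHeapSize=512m -version

-XX:+UseParallelGC is a boolean flag enabled with "+", while -XX:MaxHeapSize=512m takes a value with a k/m/g suffix.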

HotSpot VM Life Cycle

  1. Parse the command line options
  2. Establish the Java heap sizes and the JIT compiler type
  3. Establish the environment variables
  4. Use JNI_CreateJavaVM to create the HotSpot VM
  5. Load the Java Main-Class (or find it in the jar's manifest file)
  6. Pass the remaining command line arguments to the Main-Class
  7. Finally, use JNI_DestroyJavaVM to shut down the HotSpot VM
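
A hedged illustration of steps 5 and 6 (jar and class names are made up): running

  java -cp myapp.jar com.example.Main one two

lets the launcher consume -cp, create the VM via JNI_CreateJavaVM, load com.example.Main, and pass "one" and "two" to its main(String[]) method.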


HotSpot Class Loader

  1. A class goes through the phases of loading, linking, and initializing.
  2. Class loading APIs
    • Class.forName()
    • ClassLoader.loadClass()
    • Reflection APIs
    • JNI FindClass
  3. A class's superclass and interfaces must be loaded before the class itself.
  4. Class loader delegation.
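
A minimal, hedged Java sketch of the class loading APIs and of class loader delegation (class names are just examples):

  // Demonstrates explicit class loading and the parent-delegation chain.
  public class LoaderDemo {
      public static void main(String[] args) throws Exception {
          // Loads, links, and initializes the class (static initializers run).
          Class<?> c1 = Class.forName("java.util.ArrayList");

          // Loads the class but defers initialization until first use.
          ClassLoader appLoader = LoaderDemo.class.getClassLoader();
          Class<?> c2 = appLoader.loadClass("java.util.ArrayList");

          // Delegation: the application loader asks its parent before trying
          // to load the class itself.
          System.out.println("app loader   : " + appLoader);
          System.out.println("parent loader: " + appLoader.getParent());
          // Core classes come from the bootstrap loader, reported as null.
          System.out.println("ArrayList was loaded by: " + c1.getClassLoader());
      }
  }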


Interpreter

  1. The HotSpot VM uses a template-based interpreter.
  2. The interpreter is not a hand-written C++ loop; it is generated at VM startup from a table of assembly-code templates, one per bytecode.
  3. Use -XX:+PrintInterpreter to inspect the generated interpreter code, but do not use it in a production environment.
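
A hedged example of inspecting the generated interpreter (assumes a HotSpot build where PrintInterpreter is exposed as a diagnostic option and the hsdis disassembler plugin is installed):

  java -XX:+UnlockDiagnosticVMOptions -XX:+PrintInterpreter -version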


Synchronization

  1. A mechanism that prevents, avoids, or recovers from inopportune interleavings, commonly called "races".
  2. In Java, a monitor is either locked or unlocked, and only one thread may own the monitor at any one time.
  3. In Java, critical sections are referred to as synchronized blocks.
  4. HotSpot synchronization uses three kinds of locks: biased locking, CAS-based (lightweight) locking, and heavyweight (inflated) locking.
  5. The mark word stores a Java object's synchronization state: Neutral, Biased, Stack-Locked, or Inflated.
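
A minimal Java sketch of a synchronized block as a critical section guarded by an object's monitor (class and field names are illustrative):

  public class CounterDemo {
      private final Object lock = new Object();
      private long count;

      public void increment() {
          synchronized (lock) {        // acquire the monitor (lock)
              count++;                 // critical section
          }                            // release the monitor (unlock)
      }

      public synchronized long get() { // same idea, with "this" as the monitor
          return count;
      }
  }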


Thread Management

  1. Types of thread
    • java.lang.Thread represents a thread in Java code.
    • The C++ JavaThread represents the java.lang.Thread instance inside the VM and contains a pointer to the java.lang.Thread object. It also contains additional information to track the state of the thread.
    • OSThread represents an OS thread and contains OS level information.
  2. When a java.lang.Thread is started, the HotSpot VM creates the associated JavaThread, OSThread, and native thread. Once created, the native thread executes a startup method that leads to the execution of the java.lang.Thread's run() method.
  3. When an existing native thread attaches to the HotSpot VM:
    • The native thread is attached via the JNI AttachCurrentThread call.
    • A JavaThread and an OSThread are created as part of that call.
    • Finally, a java.lang.Thread object is created for the attached thread.
  4. Internal VM Threads (C++ JavaThread)
    • VM thread
    • Periodic task thread
    • Garbage collection threads
    • JIT compile threads
    • Signal dispatcher thread
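
A minimal Java sketch of point 2: starting a java.lang.Thread, behind which the VM creates the JavaThread, OSThread, and native thread whose startup method invokes run():

  public class ThreadDemo {
      public static void main(String[] args) throws InterruptedException {
          Thread worker = new Thread(new Runnable() {
              public void run() {
                  System.out.println("running in " + Thread.currentThread().getName());
              }
          }, "worker-1");
          worker.start();   // native thread is created; run() executes on it
          worker.join();    // wait for the worker thread to terminate
      }
  }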

C++ heap management

  1. In addition to the Java heap, the HotSpot VM also uses a C/C++ heap for storage of its internal objects and data.
  2. A class called Arena and its subclasses are used to manage the HotSpot VM's C++ heap operations.
  3. Arenas are thread-local objects that cache a certain amount of memory storage, which allows fast-path allocation without a global shared lock.
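
A rough Java analogy of the arena idea, not the VM's actual Arena implementation: each thread keeps its own chunk and bumps a pointer on the fast path, touching shared state only when the chunk is exhausted (all names and sizes are made up):

  import java.util.concurrent.atomic.AtomicLong;

  public class ArenaAnalogy {
      private static final int CHUNK = 64 * 1024;                  // per-thread chunk size
      private static final AtomicLong reserved = new AtomicLong(); // shared slow path

      // Each thread caches [offset, limit] of its current chunk.
      private static final ThreadLocal<long[]> local =
              ThreadLocal.withInitial(() -> new long[] {0, 0});

      static long allocate(int size) {                 // assumes size <= CHUNK
          long[] arena = local.get();
          if (arena[0] + size > arena[1]) {            // chunk exhausted: slow path
              long base = reserved.getAndAdd(CHUNK);   // the only shared operation
              arena[0] = base;
              arena[1] = base + CHUNK;
          }
          long addr = arena[0];
          arena[0] += size;                            // fast path: bump the pointer
          return addr;
      }
  }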


Java Native Interface

  1. It allows Java code running inside a JVM to interoperate with applications and libraries written in other programming languages, such as C, C++, and assembly.
  2. If you use JNI, your application may lose two benefits of the Java platform:
    • No more "write once, run anywhere"
    • The Java programming language is type-safe and secure; native languages such as C and C++ are not.
  3. The command line option -Xcheck:jni can be used to help debug JNI problems.
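
A minimal, hedged JNI sketch on the Java side (library and method names are made up); running it with java -Xcheck:jni NativeHello adds extra JNI argument checking:

  public class NativeHello {
      static {
          System.loadLibrary("hello");   // loads libhello.so / hello.dll
      }

      // Implemented in C/C++ against the JNI headers (Java_NativeHello_greet).
      public static native void greet(String name);

      public static void main(String[] args) {
          greet("JNI");
      }
  }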


VM Fatal Error Handling

  1. A common VM fatal error is an OutOfMemoryError.
  2. HotSpot's error log, hs_err_pid<pid>.log, contains detailed information about the error. A memory map is included in the log to make it easy to see how memory was laid out at the time of the crash.
  3. Command line options:
    • -XX:ErrorFile=<path>
    • -XX:+ShowMessageBoxOnError
    • -XX:+HeapDumpOnOutOfMemoryError
    • -XX:HeapDumpPath=<path>
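
A hedged example combining these options (paths are illustrative; %p expands to the process id in the error file name):

  java -XX:ErrorFile=/var/log/app/hs_err_pid%p.log \
       -XX:+HeapDumpOnOutOfMemoryError \
       -XX:HeapDumpPath=/var/dumps \
       -jar myapp.jar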


Generational Garbage Collection

  1. The weak generational hypothesis has two parts:
    • Most allocated objects become unreachable quickly.
    • Few references from older objects to young objects exist.
  2. The young generation: Most newly allocated objects are allocated in the young generation. It is typically small and collected frequently. In general, minor GCs are efficient.
  3. The old generation: Objects that are longer-lived are eventually promoted, or tenured, to the old generation. It is typically larger than the young generation, and its occupancy grows more slowly. In general, major GCs are infrequent.
  4. The permanent generation: It should not be seen as part of the generation hierarchy. It is used only by the HotSpot VM itself to hold metadata, such as class data structures, interned strings, and so on.
  5. Because old-to-young references exist and minor GCs must be kept short, HotSpot tracks them with a data structure called the card table: the old generation is divided into small cards, and a write barrier marks a card dirty whenever a reference field in it is updated, so a minor GC only needs to scan the dirty cards rather than the whole old generation (see pages 82-83 for details, and the sketch below).
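
A rough Java sketch of the card table idea, not HotSpot source (the card size follows HotSpot's 512-byte cards; everything else is simplified):

  public class CardTableSketch {
      static final int CARD_SHIFT = 9;   // 512-byte cards
      static boolean[] dirty;            // one flag per card of the old generation

      // Conceptually run by the write barrier after every reference store
      // such as holder.field = value.
      static void postWriteBarrier(long fieldAddress) {
          dirty[(int) (fieldAddress >>> CARD_SHIFT)] = true;
      }

      // During a minor GC only dirty cards are scanned for old-to-young
      // references, then cleaned.
      static void scanDirtyCards() {
          for (int card = 0; card < dirty.length; card++) {
              if (dirty[card]) {
                  // scan the objects covered by this card for young-gen refs
                  dirty[card] = false;
              }
          }
      }
  }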





The Young Generation

  1. The eden: This is where most new objects are allocated. Eden is almost always empty after a minor GC.
  2. The two survivor spaces: These hold objects that have survived at least one minor GC.
  3. The used and unused survivor spaces swap roles at the end of each minor GC.
  4. A minor GC works as follows: live objects in eden and the occupied survivor space that are still young enough to get another chance at being reclaimed in the young generation are copied to the unused survivor space, while live objects deemed "old enough" are promoted to the old generation.
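
A hedged example of the flags that shape this behavior (values are illustrative): -Xmn sets the young generation size, -XX:SurvivorRatio the ratio of eden to a single survivor space, and -XX:MaxTenuringThreshold how many minor GCs an object may survive before promotion:

  java -Xmn256m -XX:SurvivorRatio=8 -XX:MaxTenuringThreshold=15 -jar myapp.jar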


Types of Garbage Collectors

  1. The Serial GC
  2. The Parallel GC: Throughput Matters!
  3. The Mostly-Concurrent GC (CMS GC): Latency Matters!
  4. The Garbage-First GC (G1): CMS Replacement
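
Each collector is selected with its own command line flag; a hedged summary:

  -XX:+UseSerialGC           Serial GC
  -XX:+UseParallelGC         Parallel GC (add -XX:+UseParallelOldGC for a parallel full GC)
  -XX:+UseConcMarkSweepGC    CMS GC
  -XX:+UseG1GC               G1 GC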

The Serial GC
  1. Both minor and full GC take place in a stop-the-world fashion. Only after GC has finished is the application restarted.
  2. The young generation operates as described earlier. The old generation is managed by a sliding compacting mark-sweep (mark-compact) GC:
    • First, it identifies which objects are still live in the old generation.
    • Then, it slides them toward the beginning of the heap.
    • This leaves the free space in a single contiguous chunk at the end of the heap.
  3. It suits applications that do not have low pause time requirements and run on client-style machines.
  4. It takes advantage of only a single virtual processor for GC work.


The Parallel GC
  1. Both minor and full GC take place in a stop-the-world fashion. Only after GC has finished is the application restarted.
  2. It takes advantage of all available processor resources to decrease GC overhead and hence increase application throughput on server-style machines.
  3. The Parallel GC, compared to the Serial GC, improves overall GC efficiency, and as a result improves application throughput.
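
A hedged tuning note: the number of parallel GC worker threads can be set with -XX:ParallelGCThreads=<n>; by default it is derived from the number of hardware threads on the machine.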


The Mostly-Concurrent GC: Latency Matters!
  1. The Mostly-Concurrent GC is also known as the Concurrent Mark-Sweep (CMS) GC.
  2. It manages its young generation the same way the Parallel and Serial GCs do.
  3. Its old generation is managed by an algorithm that performs most of its work concurrently, imposing only two short pauses per GC cycle.
  4. A GC cycle proceeds as follows:
    • It starts with a short pause, "initial mark", that identifies the set of objects immediately reachable from outside the old generation.
    • Then, during the concurrent marking phase, it marks all live objects transitively reachable from that set.
    • Pre-cleaning can reduce, sometimes dramatically, the number of objects that need to be visited during the remark phase, so it is very effective in reducing the duration of the remark pause.
    • The application is stopped again for a second pause, "remark", which finalizes the marking information by revisiting any objects that were modified during the concurrent marking phase.
    • The final phase of the GC cycle is the concurrent sweeping phase, which sweeps over the Java heap, deallocating garbage objects without relocating the live ones.
  5. Compared to the Parallel GC, CMS decreases old-generation pauses, sometimes dramatically, at the expense of slightly longer young generation pauses, some reduction in throughput, and extra heap size requirements.
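
A hedged example for observing the cycle with the pre-JDK 9 GC logging flags (the initial mark, concurrent mark, remark, and concurrent sweep phases all appear in the resulting log):

  java -XX:+UseConcMarkSweepGC -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -jar myapp.jar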


The Garbage-First GC: CMS Replacement
  1. It is a parallel, concurrent, and incrementally compacting low-pause GC intended to be the long-term replacement of CMS.
  2. It splits the Java heap into equal-sized chunks called "regions".
  3. Each generation is a set of regions, which allows G1 to resize the young generation in a flexible way.
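
A hedged example of G1-specific tuning flags (values are illustrative): -XX:MaxGCPauseMillis sets the pause-time goal and -XX:G1HeapRegionSize the region size (a power of two between 1MB and 32MB):

  java -XX:+UseG1GC -XX:MaxGCPauseMillis=200 -XX:G1HeapRegionSize=4m -jar myapp.jar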

Comparisons



2014-05-19

Linux tuning when testing

# Managed by Jerry Meng

# Controls the size of the connection listening queue (default: 128)
net.core.somaxconn = 4096

# Increase Linux autotuning TCP buffer limits
# Set max to 16M for 1GE and 32M (33554432) or 54M (56623104) for 10GE
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
net.ipv4.tcp_rmem = 4096 87380 16777216
net.ipv4.tcp_wmem = 4096 16384 16777216

# Make room for more TIME_WAIT sockets due to more clients,
# and allow them to be reused if we run out of sockets
# Also increase the max packet backlog
# (defaults: netdev_max_backlog = 1000, tcp_max_syn_backlog = 1024, tcp_max_tw_buckets = 180000)
net.core.netdev_max_backlog = 50000
net.ipv4.tcp_max_syn_backlog = 30000
net.ipv4.tcp_max_tw_buckets = 2000000
net.ipv4.tcp_tw_reuse = 1
net.ipv4.tcp_fin_timeout = 10

# Disable TCP slow start on idle connections
net.ipv4.tcp_slow_start_after_idle = 0

# Disable source routing and redirects
net.ipv4.conf.all.send_redirects = 0
net.ipv4.conf.all.accept_redirects = 0
net.ipv4.conf.all.accept_source_route = 0

# Log packets with impossible addresses for security
net.ipv4.conf.all.log_martians = 1

# Additional settings - these settings can improve the network
# security of the host and prevent against some network attacks
# including spoofing attacks and man in the middle attacks through
# redirection. Some network environments, however, require that these
# settings are disabled so review and enable them as needed.
#
# Do not accept ICMP redirects (prevent MITM attacks)
net.ipv4.conf.all.accept_redirects = 0
net.ipv6.conf.all.accept_redirects = 0

# _or_
# Accept ICMP redirects only for gateways listed in our default
# gateway list (enabled by default)
# net.ipv4.conf.all.secure_redirects = 1
#
# Do not send ICMP redirects (we are not a router)
net.ipv4.conf.all.send_redirects = 0
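
These settings normally live in /etc/sysctl.conf (or a file under /etc/sysctl.d/) and can be applied without a reboot with sysctl -p; an individual value can be tried first with, e.g., sysctl -w net.core.somaxconn=4096 and read back with sysctl net.core.somaxconn.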


2014-05-15

Notes: How not to use Cassandra (part 2)



C* Summit 2013: How Not to Use Cassandra by Axel Liljencrantz
http://www.slideshare.net/planetcassandra/8-axel-liljencrantz-23204252

https://www.youtube.com/watch?v=0u-EKJBPrj8



<= How not to delete data =>

Key points:
Tombstones can only be deleted once all non-tombstone values have been deleted.
Tombstones can only be deleted if all values for the specified row are being compacted together.

So for wide rows, minor compactions can almost never reclaim tombstones.

Additional notes:
Size-tiered compaction
The default compaction strategy: it compacts N SSTables of roughly similar size (4 by default) and merges them into one new, larger SSTable (which is usually bigger than the inputs).
Because of this, the larger an SSTable becomes, the less likely it is to be compacted again.

Leveled compaction
SSTables are organized into levels, each level a fixed factor (10x by default) larger than the previous one, e.g. L0 5MB, L1 50MB, L2 500MB.
L0 defaults to 5MB.
When L0 reaches 5MB, it is compacted and the overflow is merged into L1.
When L1 exceeds 50MB, it is compacted and the excess is merged into L2, and so on.
In theory most rows end up in a single SSTable (theoretically ~90%, but in practice it may be only 50-80%).


Compared to size-tiered compaction, leveled compaction is better suited for:
- workloads that need lower read latency
- read-heavy, write-light workloads
- wide rows, or rows that are frequently updated

It is not a good fit when:
- the machine's I/O cannot keep up
- the workload is write-heavy and read-light
- data is never updated after it is written
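
A hedged CQL example of switching a table to leveled compaction (the table name is illustrative):

cql: ALTER TABLE playlists WITH compaction = {'class': 'LeveledCompactionStrategy'};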



TTL:ed data =

Key points:
Overwritten data could theoretically bounce back.
If TTLed data overwrites another column's value, then once the TTLed data expires, the old value can reappear as long as it has not yet been removed by compaction.

Additional notes:
-TTLed data and compaction
  
-CASSANDRA-3442 TTL histogram for sstable metadata (for size tiered compaction)
CASSANDRA-4234  Add tombstone-removal compaction to LCS (Cassandra 1.2.0 b1)
Since Cassandra 1.2, Cassandra tracks the tombstone-droppable time for all TTLed/deleted columns and performs a standalone compaction on any SSTable whose ratio of droppable tombstones to all columns exceeds a certain threshold. The threshold defaults to 20% (0.2) and can be configured with the compaction parameter tombstone_threshold when creating the column family.


table options:
  • tombstone_compaction_interval
The minimum age an SSTable must reach before it is considered for a tombstone compaction; once that interval has passed and the droppable-tombstone ratio exceeds tombstone_threshold, Cassandra triggers a tombstone compaction.

  • tombstone_threshold
If the ratio of garbage-collectable columns in an SSTable exceeds this threshold (ratio), the SSTable is compacted on its own to purge its tombstones.
"Garbage-collectable" means the data is at least past gc_grace.

- CASSANDRA-5228: Drop entire SSTables when all columns are expired (Cassandra 2.0 b1)
A separate compaction strategy that doesn't bother merging SSTables and just throws out expired ones.
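
A hedged CQL example of setting the tombstone options above (table name and values are illustrative; the interval is in seconds):

cql: ALTER TABLE playlists WITH compaction = {'class': 'SizeTieredCompactionStrategy', 'tombstone_threshold': 0.2, 'tombstone_compaction_interval': 86400};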




<= The playlist service =>



Tombstone hell =

Key points:
They expected tombstones to be deleted after 30 days, but all tombstones from the past 1.5 years were still there.
Rows existed in 4+ SSTables, so tombstones were never deleted by minor compactions.

Solution 1: run a major compaction.
Solution 2: run repairs Monday-Friday and a major compaction on Saturday-Sunday.

-> Don't use Cassandra to store queues

Cassandra counters =
Distributed counters work pretty well.



create a column family with default_validation_class=CounterColumnType

cli: keyspace.prepareColumnMutation(CF_COUNTER1, rowKey, "CounterColumn1").incrementCounterColumn(1).execute();

cql: UPDATE counters SET c1 = c1 + 3, c2 = c2 - 4 WHERE key = row2;
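
A hedged CQL3 equivalent for creating the counter table used above (names are illustrative):

cql: CREATE TABLE counters (key text PRIMARY KEY, c1 counter, c2 counter);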


-For each write, only one of the replicas has to perform a read, even with many replicas.
-- The read is part of the write path, so the client does not observe it.



-If an SSTable or disk gets corrupted, the counter column family has to be rebuilt (it cannot be fixed by repair).
-Counter columns and non-counter columns cannot coexist in the same column family: https://issues.apache.org/jira/browse/CASSANDRA-2614
-No TTL for counter columns.
-Counter removal has some limitations: if you only want to reset a counter, a quick incr-delete-incr sequence may cause the delete to be skipped; a better approach is to subtract the current value.
Deletes are best reserved for permanently removing a counter.
-If a counter write times out, the client has no way to know whether the write actually succeeded (it cannot simply retry, because if the original write did succeed, the retry would double-count).