BitUs DT: AMQ-4853 - ActiveMQ-5.9.0 吃滿 CPU 的問題

Warning

這篇文章提出來的 fix, 等於 patch ActiveMQ-5.8.0 的程式, 而不是 ActiveMQ-5.9.0 的程式.

Reference

ActiveMQ-5.9.1 release note

https://issues.apache.org/jira/browse/AMQ-4853

Problem

ActiveMQ 從 5.5.1 升級到 5.9.0 之後過一個禮拜,

ActiveMQ 把一台 server 的八核 CPU 都吃滿, 達到 100% 使用率.

查 ActiveMQ-5.9.1 的 release note, 發現在 AMQ-4853 有在解相關的問題.

這個問題主要是說 ActiveMQ 在 5.9.0 有個修改. 這個修改的 comment 是:

Just use a concurrent linked queue, makes the code much simpler and
doesn't hurt performance overall

目的是為了讓程式變簡單, 理論上不影響效能. (不過看起來的確 hurt performance)

Reproduce

部屬 ActiveMQ 兩台 cluster (一台不容易重現)

執行程式去打 ActiveMQ

package test;

import java.util.Enumeration;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

import javax.jms.Connection;
import javax.jms.JMSException;
import javax.jms.QueueBrowser;
import javax.jms.Session;
import javax.jms.TextMessage;

import org.apache.activemq.ActiveMQConnectionFactory;
import org.junit.Test;

public class OccupyCPUTest {

    @Test
    public void test() throws JMSException, InterruptedException {
        final ActiveMQConnectionFactory f1 = new ActiveMQConnectionFactory("tcp://192.168.0.1:61616");
        final ActiveMQConnectionFactory f2 = new ActiveMQConnectionFactory("tcp://192.168.0.2:61616");
        ExecutorService e = Executors.newCachedThreadPool();
        for ( int i = 0; i < 40; i++ ) {
            newRun(f1, e);
        }
        for ( int i = 0; i < 40; i++ ) {
            newRun(f2, e);
        }
        TimeUnit.DAYS.sleep(1);
    }

    protected void newRun(final ActiveMQConnectionFactory f, ExecutorService e) {
        e.execute(new Runnable(){
            @Override
            public void run() {
                try {
                    Connection c = f.createConnection("id", "pwd");
                    c.start();
                    Session session = c.createSession(false, Session.AUTO_ACKNOWLEDGE);
                    while (true) {
                        QueueBrowser b = session.createBrowser(session.createQueue("test"), "key='val'");
                        Enumeration<textmessage> enumeration = b.getEnumeration();
                        while (enumeration.hasMoreElements()) {
                            enumeration.nextElement();
                        }        
                        b.close();
                        TimeUnit.MILLISECONDS.sleep(50);
                    }
                } catch (Throwable ex) {
                    ex.printStackTrace();
                }
            }});
    }
    
}

用 jconsole 或 jvisualvm 監控兩台 ActiveMQ 的 CPU, 可以看到 CPU 慢慢往上升, 直到 100%

What Happened?

這個測試程式其實就是起 80個 thread, 每個 thread 每 50 millis 去 browse 一次 ActiveMQ.

而每次 browse message, ActiveMQ 都會建立內部的 consumer, browse 之後 ActiveMQ 又會 removeConsumer.

回到 AMQ-4853, 以前的 AdvisorySupport#removeConsumer 會從 ConcurrentSkipListMap put or remove ConsumerIdKey & ConsumerInfo. 注意 ConsumerIdKey 有保留 Consumer 進入的時間先後順序, 程式的 comment 提到:

// replay consumer advisory messages in the order in which they arrive - allows duplicate suppression in

// mesh networks with ttl>1

意思是說這段程式的目的是要確保 consumer 的 advisory message 會照順序的被發出去.
不過這樣的處理方式, 後來被覺得程式比較複雜, 所以直接將 ConcurrentSkipListMap 改為 ConcurrentLinkedQueue, 然後把一堆 ConsumerIdKey 的處理拿掉.

結果問題出現, 當 consumer 頻繁的上下線, 會造成 ConcurrentLinkedQueue#remove 被呼叫很多次, 造成 CPU 節節升高. 只要把 jvisualvm 打開 monitor 就可以發現 bottleneck 就在 #remove 這個 method 上.

當這個問題在 issue 上被提出來質疑 ConcurrentLinkedQueue 是個不好的做法時, 該 developer - Timothy Bish 認為 ActiveMQ 的使用者不該讓 consumer 太常上下線.

You should investigate why all the consumers are coming and going so often -Timothy Bish

最後有個 user 研究後發現 ConcurrentLinkedQueue<ConsumerInfo> 的 ConsumerInfo 沒有實作 equals & hashCode, 這樣會讓 ConcurrentLinkedQueue#remove 變很慢, 所以 Tim 就把這個 improvement 放到 trunk & 5.9.1. 結束 AMQ-4853.

ActiveMQ Classpath

我們使用 ActiveMQ 的 batch 檔開 ActiveMQ.

ActiveMQ 使用 wrapper service 起動, 起動方式會先載入 activemq.jar, 裡面有個 main class, 這個 class 會去載相關的 library.

把這個 main class decompile 後看到

app.addClassPathList(System.getProperty("activemq.classpath"));

看起來可以設定 activemq.classpath, 改了 AdvisoryBroker 後包成一個小 jar 檔, 然後在 wrapper.conf 裡面設定 -Dactivemq.classpath=${patch jar path} 就可以了.

Problem Fixed?

放上這個 issue 的 fix (就是實作 ConsumerInfo 的 equals & hashCode) 測一次, 效能比較好的那台機器似乎 CPU 維持平穩, 但比較差的那台機器就撐不過去.

我後來把 ConcurrentLinkedQueue 改為 ConcurrentHashMap 就可以通過測試.

Final

原本有點擔心 ConcurrentHashMap 的實作無法滿足 replay consumer advisory message 的需求, 不過查一下 ActiveMQ-5.8.0 的程式, 發現本來就是使用 ConcurrentHashMap. 看來 ConcurrentSkipListMap 與 ConcurrentLinkedQueue 都是在 ActiveMQ-5.9.0 加 & refactor 的.

在 ActiveMQ-5.9.0 把 ConcurrentHashMap 改為 ConcurrentSkipListMap & ConcurrentLinkedQueue 的原因是 AMQ-2327 - Consumers left hanging in the large network of brokers

在我們的 case 裡面, 還沒有那麼多台 ActiveMQ, 而且使用上也可以分成很多獨立的 ActiveMQ 使用, 看起來使用 ConcurrentHashMap 在我們的環境不會遇到 AMQ-2327 的 issue 才對. 所以最後我就選擇把 ConcurrentLinkedQueue 改回使用 ConcurrentHashMap 了.

BitUs DT

2014-06-05

AMQ-4853 - ActiveMQ-5.9.0 吃滿 CPU 的問題