Why Is It So Slow?

Yesterday we nearly triggered a serious production bug. This post is a record of it.

Background

We currently use Logstash's JDBC input plugin to sync MySQL data into our index.
The feature in this release needed to join one more table when querying data for the index, which shows up directly as an extra left join in the SQL defined in the Logstash JDBC config file.

Symptoms

After the application was deployed, Tomcat took a very long time to start.
Once inside the application, we rebuilt the index and then, as usual, ran the Logstash script for a full reindex — but the index stayed incomplete for a very long time.

Troubleshooting

Slow Tomcat startup


The culprit was found in the logs:

Creation of SecureRandom instance for session ID generation using [SHA1PRNG] took [*,*] milliseconds.

Because the page wouldn't open at first, we refreshed several times; session ID generation kept blocking until every session ID was created, which took nearly 4 minutes.
The official Tomcat documentation explains this:

Tomcat 7+ heavily relies on SecureRandom class to provide random values for its session ids and in other places. Depending on your JRE it can cause delays during startup if entropy source that is used to initialize SecureRandom is short of entropy. You will see warning in the logs when this happens, e.g.:

<DATE> org.apache.catalina.util.SessionIdGenerator createSecureRandom
INFO: Creation of SecureRandom instance for session ID generation using [SHA1PRNG] took [5172] milliseconds.
There is a way to configure JRE to use a non-blocking entropy source by setting the following system property: -Djava.security.egd=file:/dev/./urandom

Note the "/./" characters in the value. They are needed to work around known Oracle JRE bug #6202721. See also JDK Enhancement Proposal 123. It is known that implementation of SecureRandom was improved in Java 8 onwards.

Also note that replacing the blocking entropy source (/dev/random) with a non-blocking one actually reduces security because you are getting less-random data. If you have a problem generating entropy on your server (which is common), consider looking into entropy-generating hardware products such as "EntropyKey".

Tomcat also gives a fix: add -Djava.security.egd=file:/dev/./urandom to the JVM startup parameters, while warning that this approach carries some security trade-offs.
After adding the JVM parameter, startup returned to normal.
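One common place to add the flag, assuming a standard Tomcat tarball install, is a `bin/setenv.sh` file (the file name follows Tomcat's own convention; `catalina.sh` sources it automatically on startup):

```shell
# bin/setenv.sh -- sourced by catalina.sh before Tomcat starts
# Use the non-blocking entropy source so SecureRandom init doesn't stall
CATALINA_OPTS="$CATALINA_OPTS -Djava.security.egd=file:/dev/./urandom"
```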

Slow index creation: the database

Suspecting that the extra joined table had increased database query time, we checked the RDS monitoring console and found a CPU spike:

This matched the explanation an Alibaba Cloud engineer had given during an earlier RDS CPU spike:

The high CPU is mainly caused by these slow queries; we recommend optimizing the slow queries first and then observing.

(Side note: to our alarm, we discovered the database address in the config file had been mistakenly changed from the replica to the primary.)
The first step was to use MySQL's SHOW PROCESSLIST to find the connection running the index-build SQL and KILL it, so that other business traffic would not be affected.
Then, in the slow query log, we confirmed that the index-build SQL was indeed the long runner — it had been executing for 492s:
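The kill step can be sketched from the mysql client like this; the host, user, and the connection id 12345 are placeholders — the real id has to be read off the `Id` column of the PROCESSLIST output first:

```shell
# List running connections and find the long-running index-build SQL
mysql -h <rds-host> -u admin -p -e "SHOW FULL PROCESSLIST;"

# Kill that connection by its Id column (12345 is a placeholder)
mysql -h <rds-host> -u admin -p -e "KILL 12345;"
```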

Next, analyzing the SQL, we suspected that the newly joined table had no index on the join column, so we immediately created one on that column.
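A sketch of that fix, with hypothetical table and column names since the real schema isn't shown here; running `EXPLAIN` on the query before and after should show the join switching from a full table scan to an index lookup:

```shell
# Add an index on the join column of the newly joined table
# (database, table and column names are placeholders)
mysql -h <rds-host> -u admin -p mydb -e \
  "ALTER TABLE car_extra ADD INDEX idx_car_id (car_id);"

# Verify the join now uses the index instead of a full table scan
mysql -h <rds-host> -u admin -p mydb -e \
  "EXPLAIN SELECT ... FROM car LEFT JOIN car_extra ON car.car_id = car_extra.car_id;"
```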
We then re-ran the full index build, which produced the following log:

[2017-06-08T19:36:22,802][INFO ][logstash.pipeline        ] Starting pipeline {"id"=>"main", "pipeline.workers"=>2, "pipeline.batch.size"=>125, "pipeline.batch.delay"=>5}
[2017-06-08T19:36:22,850][INFO ][logstash.pipeline ] Pipeline main started
[2017-06-08T19:36:22,920][INFO ][logstash.agent ] Successfully started Logstash API endpoint {:port=>9600}
[2017-06-08T19:36:32,053][INFO ][logstash.inputs.jdbc ] (9.767000s) SELECT car.car_id AS car_id,
...
[2017-06-08T19:42:51,136][WARN ][logstash.agent ] stopping pipeline {:id=>"main"}

The Logstash log shows the SQL itself took 9.7s. That is still a bit long, but given the multi-table join plus full table scan, we considered the database side settled for now.
That still left the question of why index creation was so slow:
a single full sync took 6 minutes 20 seconds, while plenty of system resources sat idle.

Slow index creation: Logstash

The log output above shows that Logstash was running with its default pipeline settings:

pipeline.workers => 2       # number of parallel worker threads
pipeline.batch.size => 125  # max number of events a single worker processes per batch
pipeline.batch.delay => 5   # flush anyway if the batch isn't filled within this many milliseconds

See the official documentation for more settings.

We tried changing the configuration to:

pipeline.workers => 4       # 2 × number of CPU cores
pipeline.batch.size => 20000
pipeline.batch.delay => 3

The resulting log:

[2017-06-09T15:11:09,088][INFO ][logstash.pipeline        ] Starting pipeline {"id"=>"main", "pipeline.workers"=>4, "pipeline.batch.size"=>20000, "pipeline.batch.delay"=>3, "pipeline.max_inflight"=>80000}
[2017-06-09T15:11:09,088][WARN ][logstash.pipeline ] CAUTION: Recommended inflight events max exceeded! Logstash will run with up to 80000 events in memory in your current configuration. If your message sizes are large this may cause instability with the default heap size. Please consider setting a non-standard heap size, changing the batch size (currently 20000), or changing the number of pipeline workers (currently 4)
[2017-06-09T15:11:09,109][INFO ][logstash.pipeline ] Pipeline main started
[2017-06-09T15:11:09,154][INFO ][logstash.agent ] Successfully started Logstash API endpoint {:port=>9601}
[2017-06-09T15:11:18,651][INFO ][logstash.inputs.jdbc ] (10.100000s) SELECT car.car_id AS car_id,
...
[2017-06-09T15:15:49,619][WARN ][logstash.agent ] stopping pipeline {:id=>"main"}
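The CAUTION line above warns that up to 80000 in-flight events may not fit in the default heap. With Logstash 5.x, one way to raise the heap without editing `config/jvm.options` is the `LS_JAVA_OPTS` environment variable, which is appended after the defaults; the 2g value and the config file name here are only examples, not something we measured:

```shell
# Give the Logstash JVM a larger fixed-size heap before starting it
# (2g is a placeholder; size it to your batch and event sizes)
export LS_JAVA_OPTS="-Xms2g -Xmx2g"
bin/logstash -f jdbc.conf
```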

A single full sync now took 4 minutes 30 seconds. Peak load of the Docker container (via the docker stats command):

The corresponding peak system load (via the vmstat -n 1 command):

This had basically squeezed every bit of compute out of the ECS instance.
We tried a few more configurations:

Configuration          Time
-w 4 -b 60000 -u 3     4m 25s
-w 6 -b 20000 -u 3     4m 13s
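The `-w/-b/-u` flags in the table are the command-line equivalents of the pipeline settings shown earlier; a run looks like this (the config file name is hypothetical):

```shell
# -w: pipeline.workers, -b: pipeline.batch.size, -u: pipeline.batch.delay
bin/logstash -f jdbc.conf -w 6 -b 20000 -u 3
```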

We found that neither raising the worker count nor enlarging the batch improved throughput any further — we were basically at the limit of this 2-core ECS instance, and at the same time the ES cluster (two 2-core / 4 GB nodes) was also hitting CPU peaks:

So, under the current compute resources, we'll have to live with this full-sync duration for now. Next we'll keep tuning the ES configuration and see whether that brings any improvement.