Friday, September 16, 2022

Heap dump of Apache Flink in YARN mode

It sometimes happens that an Apache Flink YARN container is killed by YARN because it exceeds its memory limits. The diagnostics then look like this:

2017-02-08 16:04:53,272 INFO  org.apache.flink.yarn.YarnJobManager                          - Container container_e38_1486041380317_0381_01_000002 is completed with diagnostics: Container [pid=31161,containerID=container_e38_1486041380317_0381_01_000002] is running beyond physical memory limits. Current usage: 2.8 GB of 2 GB physical memory used; 4.3 GB of 4.2 GB virtual memory used. Killing container.
Dump of the process-tree for container_e38_1486041380317_0381_01_000002 :
    |- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS) SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE
        |- 31235 31161 31161 31161 (java) 1697 274 4501712896 729895 /usr/share/java/jdk1.7.0_67/bin/java -Xms1448m -Xmx1448m -XX:MaxDirectMemorySize=1448m -Dlog.file=/hadoop/yarn/log/application_1486041380317_0381/container_e38_1486041380317_0381_01_000002/taskmanager.log -Dlogback.configurationFile=file:logback.xml -Dlog4j.configuration=file:log4j.properties org.apache.flink.yarn.YarnTaskManagerRunner --configDir . 
    |- 31161 31159 31161 31161 (bash) 1 1 108654592 308 /bin/bash -c /usr/share/java/jdk1.7.0_67/bin/java -Xms1448m -Xmx1448m -XX:MaxDirectMemorySize=1448m  -Dlog.file=/hadoop/yarn/log/application_1486041380317_0381/container_e38_1486041380317_0381_01_000002/taskmanager.log -Dlogback.configurationFile=file:logback.xml -Dlog4j.configuration=file:log4j.properties org.apache.flink.yarn.YarnTaskManagerRunner --configDir . 1> /hadoop/yarn/log/application_1486041380317_0381/container_e38_1486041380317_0381_01_000002/taskmanager.out 2> /hadoop/yarn/log/application_1486041380317_0381/container_e38_1486041380317_0381_01_000002/taskmanager.err 

Container killed on request. Exit code is 143
Container exited with a non-zero exit code 143

It is very difficult to find anything useful in the logs, even after enabling debug logging.

The cause is much easier to track down with a heap dump of the YARN job.

First, enable heap dumps in flink-conf.yaml by adding the following line:

env.java.opts: -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=<LOG_DIR>/flink_heap_dump.hprof
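
With this setting the JVM writes flink_heap_dump.hprof when it hits an OutOfMemoryError; <LOG_DIR> is the placeholder that YARN expands to the container's log directory. If you need a heap dump of a TaskManager that is still running, one can also be taken manually with jmap from the JDK (the PID below is only an example, taken from the process tree in the diagnostics above):

jmap -dump:format=b,file=/tmp/flink_heap_dump.hprof 31235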

Once the Apache Flink YARN job has crashed, collect the YARN application log, which now also contains the heap dump:


yarn logs -applicationId application_1391047358335_0041 > application_1391047358335_0041.log
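
The resulting .hprof file can be opened with any standard heap analyzer, for example Eclipse MAT, VisualVM, or the jhat tool that ships with JDK 7. As a minimal sketch, jhat parses the dump and serves a browsable view on http://localhost:7000 (the -J-Xmx4g option just gives jhat enough heap of its own to load the dump):

jhat -J-Xmx4g flink_heap_dump.hprof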