[jboss-jira] [JBoss JIRA] (WFLY-10047) OOM caused by jgroups objects UNICAST3$SenderEntry#1

Mon Mar 19 08:52:00 EDT 2018

    [ https://issues.jboss.org/browse/WFLY-10047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13547775#comment-13547775 ] 

Erich Duda edited comment on WFLY-10047 at 3/19/18 8:51 AM:
------------------------------------------------------------

In the log I can see several messages \[1\] before the OOM happened. I assumed they were expected, since node-2 had been killed.
Both servers are running on a single host and communicate over localhost. Do you know what could have caused these warnings to be logged?

\[1\]
{code}
13:31:33,794 WARN  [org.jgroups.protocols.UDP] (TQ-Bundler-4,ejb,node-2) JGRP000032: node-2: no physical address for bb5fc2e1-deb1-30b6-0f2d-90a2b5239c6c, dropping message
13:31:35,804 WARN  [org.jgroups.protocols.UDP] (TQ-Bundler-4,ejb,node-2) JGRP000032: node-2: no physical address for bb5fc2e1-deb1-30b6-0f2d-90a2b5239c6c, dropping message
13:31:37,804 WARN  [org.jgroups.protocols.UDP] (TQ-Bundler-4,ejb,node-2) JGRP000032: node-2: no physical address for bb5fc2e1-deb1-30b6-0f2d-90a2b5239c6c, dropping message
13:31:39,804 WARN  [org.jgroups.protocols.UDP] (TQ-Bundler-4,ejb,node-2) JGRP000032: node-2: no physical address for bb5fc2e1-deb1-30b6-0f2d-90a2b5239c6c, dropping message
13:31:41,804 WARN  [org.jgroups.protocols.UDP] (TQ-Bundler-4,ejb,node-2) JGRP000032: node-2: no physical address for bb5fc2e1-deb1-30b6-0f2d-90a2b5239c6c, dropping message
13:31:43,804 WARN  [org.jgroups.protocols.UDP] (TQ-Bundler-4,ejb,node-2) JGRP000032: node-2: no physical address for bb5fc2e1-deb1-30b6-0f2d-90a2b5239c6c, dropping message
{code}

bq. The analysis is very likely incorrect – the SenderEntry are entries kept for retransmit – none of this is indicative of a leak (note that this size can be adjusted with xmit_table_msgs_per_row and xmit_table_num_rows).

If the OOM was caused by the "cache" for retransmits, shouldn't the default size of that cache be lowered so that it cannot cause an OOM?
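
For reference, a minimal sketch of how the two properties named in the quote might be lowered in the WildFly JGroups subsystem. The stack name "udp" and both values are assumptions chosen for illustration, not recommendations or the shipped defaults, and I have not verified that lowering them avoids the OOM.

{code:xml}
<!-- Sketch only: the stack name "udp" and both values are assumptions for illustration -->
<stack name="udp">
    <!-- transport and other protocols of the stack left unchanged -->
    <protocol type="UNICAST3">
        <!-- number of rows in the retransmit table (illustrative value) -->
        <property name="xmit_table_num_rows">50</property>
        <!-- number of message slots per row (illustrative value) -->
        <property name="xmit_table_msgs_per_row">512</property>
    </protocol>
    <!-- remaining protocols of the stack left unchanged -->
</stack>
{code}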

> OOM caused by jgroups objects UNICAST3$SenderEntry#1
> ----------------------------------------------------
>
>                 Key: WFLY-10047
>                 URL: https://issues.jboss.org/browse/WFLY-10047
>             Project: WildFly
>          Issue Type: Bug
>          Components: Clustering
>    Affects Versions: 13.0.0.Beta1
>            Reporter: Erich Duda
>            Assignee: Paul Ferraro
>            Priority: Blocker
>         Attachments: heapdump.png
>
>
> JGroups objects UNICAST3$SenderEntry#1 caused an OOM on the WildFly server during boot. See the attached picture.  !heapdump.png|thumbnail! 
> *User impact:* If users use JGroups for clustering, the server may run out of memory, which can cause undefined behavior.
> The *blocker* priority was set because this is a regression against previous versions of WildFly and the OOM is a serious error that prevents the server from working properly.
> The issue was hit in the following scenario.
> # start two servers (nodes) in a cluster with one queue
> # producer starts to send messages to the queue on node-1
> # node-2 is killed and restarted while messages are being sent <---- *Here the test failed, when node-2 was started again after it had been killed.*
> # start consumer on node-2 which reads messages from the queue
> # servers are stopped
> WildFly was built from the following source code:
> repo: https://github.com/jmesnil/wildfly
> branch: WFLY-9407_upgrade_artemis_2.5.0
> commit SHA: 06c878a313d3cad323889d017e60fd5533204d1a
> JGroups version: 4.0.10.Final
