[jboss-jira] [JBoss JIRA] (JGRP-1669) UDP should not stop message receiver thread after SocketException is caught

Fri Jul 26 18:37:26 EDT 2013

     [ https://issues.jboss.org/browse/JGRP-1669?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Aleksandr Korostov updated JGRP-1669:
-------------------------------------

    Description: 
One of our customers is getting the following error sporadically:
{code}
java.net.SocketException: socket closed
        at java.net.PlainDatagramSocketImpl.receive0(Native Method)
        at java.net.PlainDatagramSocketImpl.receive(Unknown Source)
        at java.net.DatagramSocket.receive(Unknown Source)
        at org.jgroups.protocols.UDP.run(UDP.java:262)
        at java.lang.Thread.run(Unknown Source)
{code}

(note that I modified the UDP class to log the full stack trace)

I'm sure that the socket was not closed by any Java code: the sender thread keeps running and sending messages via the same socket (if mcast_socket.close() had been called, mcast_socket.send() would have thrown a "socket closed" exception too, but it did not).
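
To illustrate the reasoning above, here is a minimal sketch of how a closed socket behaves. It assumes a plain java.net.DatagramSocket bound to the loopback address as a stand-in for mcast_socket; the class and method names are hypothetical and are not part of JGroups. It also uses InetAddress.getLoopbackAddress() and UncheckedIOException, so it needs a newer JDK than the 1.6 in the report.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.DatagramPacket;
import java.net.DatagramSocket;
import java.net.InetAddress;
import java.net.SocketException;

// Hypothetical demo class, not JGroups code.
class ClosedSocketDemo {
    /** Returns true if send() on a closed DatagramSocket throws SocketException. */
    static boolean sendFailsAfterClose() {
        InetAddress loopback = InetAddress.getLoopbackAddress();
        try {
            DatagramSocket sock = new DatagramSocket(0, loopback); // ephemeral port
            byte[] payload = {1};
            // Address the packet to the socket's own port so it has a valid destination.
            DatagramPacket packet =
                new DatagramPacket(payload, payload.length, loopback, sock.getLocalPort());
            sock.send(packet);      // works while the socket is open
            sock.close();
            try {
                sock.send(packet);  // expected to fail once close() has been called
                return false;
            } catch (SocketException expected) {
                return sock.isClosed();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("send fails after close: " + sendFailsAfterClose());
    }
}
```

If close() had really been called on the shared socket, the sender thread should hit the same kind of exception, which is not what we observe.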

The main problem is that this error stops the receiver thread, so the node stops receiving UDP messages from other nodes in the cluster.

We cannot reproduce the error in our environment, but I found similar cases reported on the Internet:

* https://forums.oracle.com/thread/2190450 (read the last post). It looks like a time-to-live-exceeded ICMP packet could force such an error on a multicast socket.
* https://github.com/elasticsearch/elasticsearch/pull/2783 Here the same problem was encountered, but the developers could not find the root cause, so they ended up implementing socket re-creation on error.

Given all that, I think the UDP receiver thread should check mcast_socket.isClosed() before exiting. If the socket is still open, the thread should continue running (though it may be safer to close the old socket and create a new one).
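
The proposal could be sketched roughly as follows. This is not the actual UDP.run() code: the class, the Handler callback, and the MAX_CONSECUTIVE_ERRORS cap are hypothetical names introduced only for illustration. The simulate() method fakes one spurious "socket closed" error on an unbound socket, so no real network traffic is involved.

```java
import java.io.IOException;
import java.net.DatagramPacket;
import java.net.DatagramSocket;
import java.net.SocketAddress;
import java.net.SocketException;

// Hypothetical sketch, not JGroups code.
class ResilientReceiver {
    /** Hypothetical callback for received datagrams (not a JGroups type). */
    interface Handler {
        void handle(byte[] data, int length);
    }

    // Hypothetical cap so a persistently failing socket cannot spin forever.
    static final int MAX_CONSECUTIVE_ERRORS = 10;

    /**
     * Receives until the socket is really closed. Returns the number of
     * exceptions that were survived because the socket was still open.
     */
    static int receiveLoop(DatagramSocket sock, Handler handler) {
        byte[] buf = new byte[65535];
        int survived = 0;
        int consecutive = 0;
        while (consecutive < MAX_CONSECUTIVE_ERRORS) {
            DatagramPacket packet = new DatagramPacket(buf, buf.length);
            try {
                sock.receive(packet);
                consecutive = 0;
                handler.handle(packet.getData(), packet.getLength());
            } catch (IOException e) {   // includes SocketException
                if (sock.isClosed()) {
                    break;              // genuine close: let the thread exit
                }
                survived++;             // socket still open: keep receiving
                consecutive++;
            }
        }
        return survived;
    }

    /** Simulation: one spurious failure, one datagram, then a real close. */
    static int[] simulate() {
        final int[] bytesHandled = {0};
        DatagramSocket flaky;
        try {
            // Unbound socket whose receive() is overridden to inject failures.
            flaky = new DatagramSocket((SocketAddress) null) {
                private int calls = 0;

                @Override
                public void receive(DatagramPacket p) throws IOException {
                    calls++;
                    if (calls == 2) {   // pretend a 3-byte datagram arrived
                        p.setLength(3);
                        return;
                    }
                    if (calls > 2) {
                        close();        // from now on the socket really is closed
                    }
                    throw new SocketException("socket closed");
                }
            };
        } catch (SocketException e) {
            throw new IllegalStateException(e);
        }
        int survived = receiveLoop(flaky, new Handler() {
            @Override
            public void handle(byte[] data, int length) {
                bytesHandled[0] += length;
            }
        });
        return new int[] {survived, bytesHandled[0]};
    }

    public static void main(String[] args) {
        int[] result = simulate();
        System.out.println("survived errors: " + result[0] + ", bytes handled: " + result[1]);
    }
}
```

The cap on consecutive errors is a design choice of this sketch: if the socket stays open but keeps failing, the loop gives up instead of spinning. Closing the old socket and creating a new one at that point, as suggested above, would be the alternative.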

> UDP should not stop message receiver thread after SocketException is caught
> ---------------------------------------------------------------------------
>
>                 Key: JGRP-1669
>                 URL: https://issues.jboss.org/browse/JGRP-1669
>             Project: JGroups
>          Issue Type: Enhancement
>    Affects Versions: 2.6.13
>         Environment: Windows 2008 R2, Oracle JDK 1.6.0.24
>            Reporter: Aleksandr Korostov
>            Assignee: Bela Ban
