[Design of Clustering on JBoss (Clusters/JBoss)] - Re: Handling of 'deployments taking ~1 minute' scenario
by galder.zamarreno@jboss.com
Yeah, that bug was fixed. In this latest case, the root cause was log4j logging over NFS, which was having issues on some cluster nodes. Logging over NFS is never a good idea, but I needed to be 100% sure that this was the cause of the slow deployments and not another JGroups bug ;).
I was lucky: while digging through the case, I was able to match the nodes for which the RPC call failed with the two nodes whose logs showed log4j issues ("stale NFS handle").
anonymous wrote : In some other case where a remote node "isn't responding" all you could do would be to send a message to "commit suicide" -- there's no mechanism to evict a node from the group outside of JGroups' own failure detection. But if the node isn't responding to RPCs, it likely wouldn't respond to the "commit suicide" either.
If it wasn't responding to RPCs, FD/FD_SOCK would eventually discover that the node is not responding. In this case, though, the failure detection layer was OK, so the cluster was not dismantled, but something was disrupting an otherwise healthy cluster. The customer was concerned about such a scenario.
anonymous wrote : Logically, I could see some benefit in some sort of self-healing approach where cluster members detect faults and restart themselves or send commands to others telling them to restart. But this will take a lot of thought.
I'll file a JIRA tomorrow to track this.
View the original post : http://www.jboss.com/index.html?module=bb&op=viewtopic&p=4120248#4120248
18 years, 2 months
[Design of Clustering on JBoss (Clusters/JBoss)] - Re: Thread dumps generated by triggers/events?
by galder.zamarreno@jboss.com
There are 4 ways to generate a thread dump. I took the output from the same run using each method:
1.- kill -3. The output can be parsed by tools like TDA, https://tda.dev.java.net/ (simply one of the most useful tools I've found out there!). Example:
"JBossLifeThread" prio=1 tid=0x0a690cc0 nid=0x1741 in Object.wait() [0x867f2000..0x867f30b0]
| at java.lang.Object.wait(Native Method)
| - waiting on <0x9247e200> (a java.lang.Object)
| at java.lang.Object.wait(Object.java:474)
| at org.jboss.system.server.ServerImpl$LifeThread.run(ServerImpl.java:940)
| - locked <0x9247e200> (a java.lang.Object)
2.- JMX via org.jboss.system.server.ServerInfo.listThreadDump(). Is it the same as what Thread.getAllStackTraces() would produce? This needs a JIRA to get to the bottom of it. Here's some output:
Thread: JBossLifeThread : priority:5, demon:false, threadId:58, threadState:WAITING, lockName:java.lang.Object@10cff6b
|
| java.lang.Object.wait(Native Method)
| java.lang.Object.wait(Object.java:474)
| org.jboss.system.server.ServerImpl$LifeThread.run(ServerImpl.java:940)
The lock name (java.lang.Object@10cff6b) differs from the lock monitor reported by kill -3 (<0x9247e200>). It doesn't look like it prints previously locked monitors either. It cannot be parsed by TDA because it's in a JBoss-specific format. Useless.
3.- Thread.getAllStackTraces().
Haven't tested the output from this. What would it print? I couldn't find an MBean that uses/parses its output, but I suspect it's the same as method 2, especially since http://jira.jboss.com/jira/browse/JBAS-1448 has already been fixed.
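For reference, here is a minimal sketch of what a dump built purely from Thread.getAllStackTraces() can contain. The class name StackTracesDump and the output layout are my own assumptions, not an existing JBoss MBean. The API only hands back each Thread and its StackTraceElement[] frames, so there is nothing to produce kill -3's "waiting on <...>" / "locked <...>" lines from:

```java
import java.util.Map;

/**
 * Hypothetical helper: formats a thread dump using only Thread.getAllStackTraces().
 * Name, priority, daemon flag, state and frames are available; monitor/lock
 * ownership is not, so no "locked <...>" lines can be printed.
 */
public class StackTracesDump {

    public static String dumpAll() {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<Thread, StackTraceElement[]> e : Thread.getAllStackTraces().entrySet()) {
            Thread t = e.getKey();
            sb.append('"').append(t.getName()).append('"')
              .append(" prio=").append(t.getPriority())
              .append(t.isDaemon() ? " daemon" : "")
              .append(" state=").append(t.getState())
              .append('\n');
            // Only the frames are known; there is no per-frame lock information here
            for (StackTraceElement frame : e.getValue()) {
                sb.append("    at ").append(frame).append('\n');
            }
            sb.append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.print(dumpAll());
    }
}
```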
4.- JConsole Threads view:
Name: JBossLifeThread
| State: WAITING on java.lang.Object@10cff6b
| Total blocked: 0 Total waited: 1
|
| Stack trace:
| java.lang.Object.wait(Native Method)
| java.lang.Object.wait(Object.java:474)
| org.jboss.system.server.ServerImpl$LifeThread.run(ServerImpl.java:940)
Seems to print the same info as JBoss' ServerInfo.
--
This has been bugging me for ages; I had to get to the bottom of it :). So far, there's nothing like "killing" JBoss.
View the original post : http://www.jboss.com/index.html?module=bb&op=viewtopic&p=4120243#4120243
[Design of Clustering on JBoss (Clusters/JBoss)] - Re: Thread dumps generated by triggers/events?
by galder.zamarreno@jboss.com
"bstansberry(a)jboss.com" wrote : So are you thinking about a service that generates thread dumps (or tweaking the existing one), with configuration options to control where the dumps go?
Something along those lines, but I was thinking more of a service living in the management console that responds to specific JMX events and then requests the generation of thread dumps. It could be a standard service in AS, but it seems to fit the monitoring area better.
This service should be able to instruct not only the local node but also other cluster members to generate thread dumps. For example, if the TE happened during a sync repl or a sync JGroups RPC call, it would ask the node where the sync repl/RPC failed to generate a thread dump.
Information that would be needed:
- timestamp (kill -3 does not provide a timestamp for the thread dump!)
- thread dump
- some kind of unique ID shared by all the thread dumps, on all nodes, that were generated from a specific failure.
- some information to match the thread dump(s) to the failure in the logs.
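A minimal sketch of a record carrying the items above, assuming a hypothetical ThreadDumpRecord class (nothing like it exists in AS today). The failure ID is a random UUID generated once per failure and reused by every node asked for a dump, and the header line is meant to be easy to grep for across the logs of all nodes:

```java
import java.time.Instant;
import java.util.UUID;

/** Hypothetical: one node's thread dump plus the metadata needed to correlate it. */
public class ThreadDumpRecord {
    private final String failureId;      // shared by every dump triggered by the same failure
    private final String nodeName;       // cluster member that produced this dump
    private final long timestampMillis;  // when the dump was taken (kill -3 output has none)
    private final String failureInfo;    // text used to match the dump to the failure in the logs
    private final String threadDump;     // the dump itself

    public ThreadDumpRecord(String failureId, String nodeName, long timestampMillis,
                            String failureInfo, String threadDump) {
        this.failureId = failureId;
        this.nodeName = nodeName;
        this.timestampMillis = timestampMillis;
        this.failureInfo = failureInfo;
        this.threadDump = threadDump;
    }

    /** Generated once per failure and handed to every node asked for a dump. */
    public static String newFailureId() {
        return UUID.randomUUID().toString();
    }

    /** Greppable header; the time is printed as an ISO-8601 UTC instant. */
    public String header() {
        return "[thread-dump failureId=" + failureId
                + " node=" + nodeName
                + " time=" + Instant.ofEpochMilli(timestampMillis)
                + "] " + failureInfo;
    }

    public String format() {
        return header() + "\n" + threadDump;
    }

    public static void main(String[] args) {
        String id = newFailureId();  // e.g. created when machine A sees the TE
        for (String node : new String[] {"B", "C", "D"}) {
            ThreadDumpRecord r = new ThreadDumpRecord(id, node, System.currentTimeMillis(),
                    "TE on sync RPC from A (suspected=false)", "<thread dump here>");
            System.out.println(r.format());
        }
    }
}
```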
The aim is for someone to be able to say something like this: "Machine A reported a TE (with suspected=false), and these are the thread dumps taken immediately afterwards on machines B, C and D in the cluster that are associated with this TE. I already have the GC logs in case the TE was due to a long garbage collection."
"bstansberry(a)jboss.com" wrote : With hooks to inject the service into other interested services, e.g. HAPartition? HAPartition would then decide whether an event (e.g. a timeout on an RPC) justifies calling into the thread dump service.
|
| In the case of an RPC timeout, only the caller knows it happened.
That could work as well. My comment above on JMX notifications amounts to pretty much this: HAPartition generates a JMX notification upon an RPC timeout, and a service in the monitoring tool does the job.
Where do you think such a service would fit better?
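A minimal sketch of that split using plain javax.management classes. The notification type jboss.ha.rpc.timeout, the PartitionEmitter/ThreadDumpTrigger classes and putting the unresponsive nodes in the user data are all assumptions for illustration; HAPartition does not emit anything like this today. The default NotificationBroadcasterSupport constructor delivers notifications on the sending thread, which keeps the example deterministic:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import javax.management.Notification;
import javax.management.NotificationBroadcasterSupport;
import javax.management.NotificationListener;

public class RpcTimeoutNotificationSketch {

    /** Hypothetical notification type; not an existing HAPartition constant. */
    public static final String RPC_TIMEOUT = "jboss.ha.rpc.timeout";

    /** Stand-in for the partition side: only the caller knows the RPC timed out. */
    public static class PartitionEmitter extends NotificationBroadcasterSupport {
        private long sequence = 0;

        public void rpcTimedOut(String methodName, List<String> unresponsiveNodes) {
            Notification n = new Notification(RPC_TIMEOUT, this, ++sequence,
                    System.currentTimeMillis(), "RPC " + methodName + " timed out");
            n.setUserData(unresponsiveNodes);
            sendNotification(n);  // delivered synchronously to registered listeners
        }
    }

    /** Stand-in for the monitoring service that decides to request thread dumps. */
    public static class ThreadDumpTrigger implements NotificationListener {
        public final List<String> requestedDumps = new ArrayList<>();

        @Override
        public void handleNotification(Notification n, Object handback) {
            if (!RPC_TIMEOUT.equals(n.getType())) {
                return;
            }
            @SuppressWarnings("unchecked")
            List<String> nodes = (List<String>) n.getUserData();
            for (String node : nodes) {
                // A real service would ask this node for a thread dump here;
                // the sketch just records node@timestamp of the notification.
                requestedDumps.add(node + "@" + n.getTimeStamp());
            }
        }
    }

    public static void main(String[] args) {
        PartitionEmitter partition = new PartitionEmitter();
        ThreadDumpTrigger trigger = new ThreadDumpTrigger();
        partition.addNotificationListener(trigger, null, null);
        partition.rpcTimedOut("someMethod", Arrays.asList("B", "C"));
        System.out.println(trigger.requestedDumps);
    }
}
```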
One thing I need to get to the bottom of regarding thread dumps is that Thread.getAllStackTraces() does not provide the same information as kill -3. Some lock information seems to be missing from Thread.getAllStackTraces(), which is why I recommend against the JMX method for generating thread dumps.
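If the JVM is Java 6 or later, java.lang.management.ThreadMXBean.dumpAllThreads(true, true) is one way to get the lock information back through a Java API. This is a swapped-in technique, not something discussed above, and the output layout is my own; it has not been checked against TDA. Each ThreadInfo carries the locked monitors together with the stack depth at which they were acquired, so "- locked" lines can be placed under the right frame, roughly like kill -3 does:

```java
import java.lang.management.LockInfo;
import java.lang.management.ManagementFactory;
import java.lang.management.MonitorInfo;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

/** Hypothetical helper: thread dump with locked monitors via ThreadMXBean (Java 6+). */
public class LockAwareThreadDump {

    public static String dumpAll() {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        // Only ask for what the JVM supports, otherwise dumpAllThreads throws
        boolean monitors = mx.isObjectMonitorUsageSupported();
        boolean synchronizers = mx.isSynchronizerUsageSupported();
        StringBuilder sb = new StringBuilder();
        for (ThreadInfo ti : mx.dumpAllThreads(monitors, synchronizers)) {
            if (ti == null) {
                continue;
            }
            sb.append('"').append(ti.getThreadName()).append("\" state=").append(ti.getThreadState());
            if (ti.getLockName() != null) {
                sb.append(" on ").append(ti.getLockName());  // e.g. java.lang.Object@10cff6b
            }
            sb.append('\n');
            StackTraceElement[] frames = ti.getStackTrace();
            MonitorInfo[] locked = ti.getLockedMonitors();
            for (int depth = 0; depth < frames.length; depth++) {
                sb.append("    at ").append(frames[depth]).append('\n');
                // Print each monitor under the frame that acquired it
                for (MonitorInfo mi : locked) {
                    if (mi.getLockedStackDepth() == depth) {
                        sb.append("    - locked ").append(mi).append('\n');
                    }
                }
            }
            // Monitors whose acquiring frame is unknown (depth < 0)
            for (MonitorInfo mi : locked) {
                if (mi.getLockedStackDepth() < 0) {
                    sb.append("    - locked ").append(mi).append(" (frame unknown)\n");
                }
            }
            for (LockInfo li : ti.getLockedSynchronizers()) {
                sb.append("    - owns ").append(li).append('\n');
            }
            sb.append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.print(dumpAll());
    }
}
```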
View the original post : http://www.jboss.com/index.html?module=bb&op=viewtopic&p=4120230#4120230
[Design of Clustering on JBoss (Clusters/JBoss)] - Re: JBAS-4919 - ha singletons in heterogenous topologies
by galder.zamarreno@jboss.com
"bstansberry(a)jboss.com" wrote : Crap. I see the following didn't make it into the interface:
|
|     /**
|      * Called by the HASingleton to set the name with which the singleton
|      * service is registered with the HAPartition.
|      */
|     public void setServiceName(String serviceName);
|
|     public String getServiceName();
|
| Those need to be there, otherwise an impl has no idea how to interact with the DRM.
I guess that's what setManagedSingleton/getManagedSingleton was trying to achieve:
/**
| * Called by the HASingleton to provide the election policy a reference to
| * itself. A policy that was designed to elect a particular kind of singleton
| * could downcast this object to a particular type and then access the
| * singleton for state information needed for the election decision.
| */
| void setManagedSingleton(Object singleton);
|
| Object getManagedSingleton();
Or maybe not. Judging from the javadoc, the aim seems to be different. This method is actually not used anywhere in the code. Shall we replace it with setServiceName/getServiceName, or keep it and add the service name getter/setter?
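If we go for keeping it and adding the accessors, the interface could look roughly like the sketch below. Everything here is hypothetical: ElectionPolicySketch, elect() and the position-based PositionElectionPolicy are illustrations, not the actual JBAS-4919 interface. Candidates are plain strings standing in for cluster nodes; in a real policy the service name is what it would use to ask the DRM which nodes host the singleton:

```java
import java.util.Arrays;
import java.util.List;

public class ElectionPolicySketchDemo {

    /** Hypothetical policy interface: existing accessors kept, service-name accessors added. */
    public interface ElectionPolicySketch {
        /** Reference to the HASingleton itself, for policies that need its state. */
        void setManagedSingleton(Object singleton);
        Object getManagedSingleton();

        /** Name under which the singleton service is registered with the HAPartition. */
        void setServiceName(String serviceName);
        String getServiceName();

        /** Hypothetical: pick the master among the nodes currently running the service. */
        String elect(List<String> candidates);
    }

    /** Hypothetical policy electing the node at a fixed position in the candidate list. */
    public static class PositionElectionPolicy implements ElectionPolicySketch {
        private final int position;
        private Object managedSingleton;
        private String serviceName;

        public PositionElectionPolicy(int position) {
            this.position = position;
        }

        @Override public void setManagedSingleton(Object singleton) { this.managedSingleton = singleton; }
        @Override public Object getManagedSingleton() { return managedSingleton; }
        @Override public void setServiceName(String serviceName) { this.serviceName = serviceName; }
        @Override public String getServiceName() { return serviceName; }

        @Override
        public String elect(List<String> candidates) {
            // Without the service name, the policy has no way to look the service up in the DRM
            if (serviceName == null) {
                throw new IllegalStateException("setServiceName() was never called");
            }
            if (candidates.isEmpty()) {
                return null;
            }
            // Positions wrap around the list; negative positions count from the end
            int size = candidates.size();
            return candidates.get(((position % size) + size) % size);
        }
    }

    public static void main(String[] args) {
        List<String> nodes = Arrays.asList("A", "B", "C");
        PositionElectionPolicy first = new PositionElectionPolicy(0);
        first.setServiceName("MySingletonService");  // hypothetical service name
        PositionElectionPolicy last = new PositionElectionPolicy(-1);
        last.setServiceName("MySingletonService");
        System.out.println(first.elect(nodes) + " " + last.elect(nodes));  // prints "A C"
    }
}
```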
View the original post : http://www.jboss.com/index.html?module=bb&op=viewtopic&p=4120220#4120220