Jonathon Anderson
Use Case Sharing
Thursday, 18 July 2019
For the past three years of RMACC Summit's production life (that is, since the beginning of its deployment) we've had problems with unexplained GPFS expels on our Omni-Path fabric. These have usually affected 2-3 nodes each day: a pair of nodes loses IP connectivity with each other, and one asks the cluster manager to expel the other.
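If you're trying to gauge how often this happens on your own cluster, the expels show up in the GPFS daemon log. Here's a minimal sketch; `count_expels` is a hypothetical helper name, and the log path shown is the standard GPFS daemon log location, but adjust both for your site:

```shell
# A minimal sketch: count expel-related lines in a GPFS daemon log.
# "count_expels" is a hypothetical helper; pass the log path explicitly.
count_expels() {
    grep -ci 'expel' "$1"
}

# Usage, e.g. on the cluster manager node:
# count_expels /var/adm/ras/mmfs.log.latest
```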

I'll be honest: I spent so much time and energy trying to figure out what was going wrong (including with assistance from Intel and DDN), with no success, that I had eventually given up.

Recently we upgraded OPA to 10.9, the latest release that Intel promised would fix our issue. Instead, we started experiencing significantly worse expel events, with 200+ nodes being expelled simultaneously when they failed to renew their disk leases. (As before, the nodes rejoined the cluster automatically after about a minute.)

I reached out to Intel again, and we discovered that our one internally-managed OPA switch had an IPoIB port configured. (Dell configured this during the original deployment to serve as a management port, redundant with the out-of-band Ethernet management port.) There appears to be a problem with this feature, though. According to Intel:

[*] It appears this is the device that is registering as type IB instead of type OPA.

Enabling IPoIB on an internally-managed switch is not common.

I immediately disabled this IPoIB port, and (fingers crossed) we haven't seen another unexplained expel event since, going on a week now. We _have_ seen two expels, but both are directly explained by a GPFS client becoming unresponsive for unrelated reasons.

I've heard chatter that other sites may be having problems with GPFS expels on OPA fabrics as well. If that's you, I'd love to hear whether it's still a problem for you and, if so, whether you have any internally-managed switches with an internal IPoIB interface configured. The log signature to look for on your fabric manager is:

Jun 17 17:17:50 sfabric1 fm0_sm[1936]: ERROR[async]: MAI: mai_send_stl_timeout: mai_send_stl: Invalid MAD base version: 1
Jun 17 17:17:50 sfabric1 fm0_sm[1936]: ERROR[async]: APP: cs_cntxt_send_mad_nolock: status 29 sending REPORT[NOTICE] MAD length 80 in context entry[10136] to LID[0xa], TID 0x0000000025BD5D7A
Jun 17 17:17:50 sfabric1 fm0_sm[1936]: ERROR[async]: APP: cs_cntxt_send_mad: can't send MAD rc: 29: invalid MAD
Jun 17 17:17:50 sfabric1 fm0_sm[1936]: ERROR[async]: SA: sa_Trap_Forward: sa_Trap_Forward: can't send MAD reliably rc: 29: invalid MAD
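To check whether your fabric manager is logging these errors, a simple grep is enough. A minimal sketch, assuming the FM logs to a file; `fm_mad_errors` is a hypothetical helper name, and the FM log location varies (syslog, or the journal on systemd hosts), so pass the path explicitly:

```shell
# A minimal sketch: filter a fabric manager log for the MAD errors shown
# above. "fm_mad_errors" is a hypothetical helper; adjust the log path
# for your site (or pipe from journalctl on systemd hosts).
fm_mad_errors() {
    grep -E 'Invalid MAD base version|invalid MAD' "$1"
}

# Usage, e.g. against syslog on the fabric manager node:
# fm_mad_errors /var/log/messages
```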
About OPUG

The OPUG is an independent users group that provides a forum for the free exchange of information and ideas that enhance the usability and efficiency of scientific applications running on large HPC systems using the Intel Omni-Path Architecture fabric.
