Hi there,

Don't know if this is the right place to ask, but let's just give it a try.

So I have this network: 8 racks with servers (web/red5/mysql/nexenta, etc.) and 14 switches, all Cisco:

- 2x 3750: stacked
- 12x 2960: each with 2 or 4 cat6 cables, evenly spread over the 3750 stack, trunking the internal and external VLANs

One of our file servers has, besides the two onboard network adapters, an Intel e1000 server adapter; the file server has been running on a single internal interface (1 Gbit). Spread across the racks are ~50 web servers, which all have an NFS (UDP) mount to the file server. Both connectors of the e1000 in the file server are connected to a 2960, because I wanted to change it to a 2 Gbit aggregate so I'd have 2 Gbit of bandwidth available to/from the server.

So today...

- I created a port-channel on the Cisco 2960, mode active (LACP), for the two Gi ports the file server was connected to (only VLAN 200, which is internal):

Code:
port-channel load-balance src-dst-ip
!
interface Port-channel2
 switchport access vlan 200
!
interface GigabitEthernet1/0/36
 switchport access vlan 200
 channel-group 2 mode active
!
interface GigabitEthernet1/0/38
 switchport access vlan 200
 channel-group 2 mode active
!

lw-r2-core#sh etherchannel summary

Code:
2      Po2(SU)         LACP      Gi1/0/36(P) Gi1/0/38(P)

- On the file server I added a link aggregation with dladm (I've sketched the create command a bit further down):

Code:
root@alcor:~# dladm show-aggr
LINK    POLICY    ADDRPOLICY    LACPACTIVITY    LACPTIMER    FLAGS
aggr0   L3,L4     auto          active         short        -----

So all looked fine: web servers could connect over the aggregate, etc. All good and well... until Nagios started sending SMS messages: high load on 15-20 web servers. Hmm, let's have a look:

Code:
[183613.720649] nfs: server 192.168.5.181 OK
[183613.721477] nfs: server 192.168.5.181 OK
[183673.596026] nfs: server 192.168.5.181 not responding, still trying
[183677.996026] nfs: server 192.168.5.181 not responding, still trying
[183677.996033] nfs: server 192.168.5.181 not responding, still trying
[183677.996659] nfs: server 192.168.5.181 OK
[183677.997555] nfs: server 192.168.5.181 OK
[183677.997563] nfs: server 192.168.5.181 OK
[183687.588027] nfs: server 192.168.5.181 not responding, still trying
[183687.590185] nfs: server 192.168.5.181 OK

Aw crap. Of the ~50 servers, about 40% got a high load (high wait state), while the other servers, using the exact same NFS mount and sharing racks with the affected machines, didn't mind the aggregate upgrade at all.

So, where to look? tcpdump showed A LOT of UDP packets being resent from the file server to the affected web servers, while that didn't happen for the web servers that were working perfectly. The MTU is 1500 across my whole network, and the packet length on the wire was 1514 (1500 plus the 14-byte Ethernet header). OK, so fragmentation that's causing output socket buffers to fill up, or something?

The thing is: when I remounted the NFS mount over TCP, the load went away and everything ran smoothly. BUT that's just a workaround; it should work with UDP just as well as it does with TCP.

So, 40% of the servers with high load... hmm, "could it be..." Yes: it seems the web servers with the high load were all assigned to Gi1/0/38 on the switch, the others to Gi1/0/36. Disable port Gi1/0/38, et voilà! The UDP-mounted servers with high load started to run smoothly again, with low load.

So I'm kind of stuck on where to look now, and on what could be causing this behavior. I'm thinking about the UDP fragmentation: when I lowered the MTU on the file server, there were fewer fragmented packets being sent, but still, it didn't really help. ALL switches in the network have src-dst-ip load balancing enabled.
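For reference, since I only pasted the show-aggr output above: the aggregate on the file server was created with dladm create-aggr, roughly like this (the e1000g2/e1000g3 interface names are placeholders, not necessarily ours; check dladm show-link for the real names on your box):

Code:
# Interface names below are examples -- list yours with: dladm show-link
# -L active  : run LACP in active mode (matches "mode active" on the switch)
# -T short   : use the short LACP timer
# -P L3,L4   : hash outbound traffic on IP addresses + TCP/UDP ports
dladm create-aggr -L active -T short -P L3,L4 -l e1000g2 -l e1000g3 aggr0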
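And for anyone who wants to check which member link a given client hashes to on the switch side: IOS on at least the 3750s can compute the hash decision for you (syntax from memory, the .23 client address is just an example, and not every platform supports this command):

Code:
lw-r2-core#test etherchannel load-balance interface port-channel 2 ip 192.168.5.23 192.168.5.181
Would select Gi1/0/38 of Po2

That's a way to work out which web servers land on Gi1/0/38 without pulling cables.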
So in short:

- File server, single uplink:
  - NFS over UDP: 100% of servers working
  - NFS over TCP: 100% of servers working
- File server, LACP link:
  - NFS over UDP: 50% of servers working
  - NFS over TCP: 100% of servers working
- File server, LACP link, one cable pulled:
  - NFS over UDP: 100% of servers working
  - NFS over TCP: 100% of servers working

Tomorrow I'm going to disable Gi1/0/36 and enable Gi1/0/38 instead, to see if it's the physical switch port.

Anyone got any pointers? (The fragment capture I mentioned is sketched below.)
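In case it helps anyone reproduce this, here's the fragment storm in numbers. If the mounts use a 32k rsize/wsize (an assumption on my part -- check your actual mount options), a single NFS read reply over UDP is one ~32 KB datagram, which at MTU 1500 gets chopped into roughly 23 IP fragments; if any single fragment is lost, the client discards the whole datagram and retransmits the entire RPC after a timeout. A capture along these lines should show the fragments (eth0 is a placeholder for the real interface):

Code:
# (ip[6:2] & 0x3fff) != 0 matches packets with the MF bit set or a
# non-zero fragment offset, i.e. every part of a fragmented datagram.
# Filter on host, not port: only the first fragment carries the UDP header.
tcpdump -ni eth0 'host 192.168.5.181 and ((ip[6:2] & 0x3fff) != 0)'

One wrinkle I can't rule out: because only the first fragment carries the UDP ports, an L3,L4 hash policy on the server's aggregate could in principle put the first fragment on a different member link than the rest of the datagram, which would only start to matter once the second link goes active.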
Here's the "solution": what we eventually did was remount everything to TCP. At some point the amount of data being sent needs to be controlled/checked, and UDP becomes too unstable at those volumes. So, in short: use NFS over TCP.
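For completeness, the remount to TCP on a client looks roughly like this (the export path, mount point, and rsize/wsize values are placeholders, not our actual config):

Code:
# Paths below are examples -- adjust to your setup.
# proto=tcp forces the transport; hard keeps I/O retrying instead of failing.
umount /mnt/web
mount -t nfs -o proto=tcp,hard,intr,rsize=32768,wsize=32768 \
    192.168.5.181:/export/web /mnt/web

# or permanently in /etc/fstab:
# 192.168.5.181:/export/web  /mnt/web  nfs  proto=tcp,hard,intr  0  0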