Hello, I've followed the tutorial on how to set up a load-balanced MySQL cluster and everything seemed to be working fine, but when I recently checked up on the services, one of the MySQL nodes wasn't being recognized by ndb_mgm. I've had this problem twice before; I thought I had misconfigured it, so I reinstalled the whole system on VMs. I thought I had solved it, but it seems to recur a few days after completing the setup.

Here is my configuration for the 5 machines (note: all VMs):

sql-1   172.30.0.7   (runs ndbd and mysql)
sql-2   172.30.0.8   (runs ndbd and mysql)
loadb-1 172.30.0.110 (runs lb1 and ndb_mgm) [active]
loadb-2 172.30.0.9   (runs lb2) [passive]
virtual IP for cluster: 172.30.0.111

I can ping the virtual IP, and I can access the MySQL databases directly on 0.7 and 0.8, but when I try from 0.111 I get a connection error. Here's the output from "show" in ndb_mgm:

Cluster Configuration
---------------------
[ndbd(NDB)] 2 node(s)
id=2 @172.30.0.7 (Version: 4.1.21, Nodegroup: 0, Master)
id=3 @172.30.0.8 (Version: 4.1.21, Nodegroup: 0)

[ndb_mgmd(MGM)] 1 node(s)
id=1 @172.30.0.110 (Version: 4.1.21)

[mysqld(API)] 2 node(s)
id=4 (not connected, accepting connect from any host)
id=5 @172.30.0.8 (Version: 4.1.21)

I've restarted mysql on 0.7 and it seems to run fine, but ndb_mgm doesn't see it. And even though 0.8 is running fine, I still can't connect through the virtual IP. Everything worked last week when I completed the setup, and I don't know what else I could check to find out what is failing. loadb-1 is the active load balancer and it should direct queries to sql-2, but it doesn't seem to. I ran all the checks found on http://www.howtoforge.com/loadbalanced_mysql_cluster_debian_p8 and everything checks out fine, and the active loadb-1 has 172.30.0.111 as the virtual IP. If anyone has experienced this or could shed some light on what I might be doing wrong, that would be great.
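One way to watch for this recurring from the management node is to count unconnected API slots in the "show" output. A minimal sketch; the sample string below is taken from the output above, and in practice you'd feed it the live output of "ndb_mgm -e show" run on loadb-1 instead:

```shell
# Sample [mysqld(API)] lines from the 'show' output above;
# live, replace with: out=$(ndb_mgm -e show)
out='id=4 (not connected, accepting connect from any host)
id=5 @172.30.0.8 (Version: 4.1.21)'
missing=$(printf '%s\n' "$out" | grep -c 'not connected')
echo "$missing API slot(s) not connected"
```

Running this from cron and mailing yourself when the count is non-zero would at least tell you how soon after setup the slot drops out.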
As I said, everything worked 100% when I completed the initial install, and I even tested what happens when a single cluster node and a load balancer go down, and it worked as the tutorial stated.
Can you run the tests from http://www.howtoforge.com/loadbalanced_mysql_cluster_debian_p8 and post the results here? Also, are there any errors in the logs?
Command "ip addr sh eth0"

loadb-1:
2: eth0: <BROADCAST,MULTICAST,UP> mtu 1500 qdisc pfifo_fast qlen 1000
    link/ether 00:0c:29:a7:30:cf brd ff:ff:ff:ff:ff:ff
    inet 172.30.0.110/24 brd 172.30.0.255 scope global eth0
    inet 172.30.0.111/24 brd 172.30.0.255 scope global secondary eth0

loadb-2:
2: eth0: <BROADCAST,MULTICAST,UP> mtu 1500 qdisc pfifo_fast qlen 1000
    link/ether 00:0c:29:1f:46:fd brd ff:ff:ff:ff:ff:ff
    inet 172.30.0.9/24 brd 172.30.0.255 scope global eth0

Command "ldirectord ldirectord.cf status"

loadb-1:
ldirectord for /etc/ha.d/ldirectord.cf is running with pid: 919

loadb-2:
ldirectord is stopped for /etc/ha.d/ldirectord.cf

Command "ipvsadm -L -n"

loadb-1:
IP Virtual Server version 1.0.11 (size=4096)
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port       Forward Weight ActiveConn InActConn
TCP  172.30.0.111:3306 wrr
  -> 172.30.0.8:3306          Route   0      0          0
  -> 172.30.0.7:3306          Route   0      0          0

loadb-2:
IP Virtual Server version 1.0.11 (size=4096)
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port       Forward Weight ActiveConn InActConn

Command "/etc/ha.d/resource.d/LVSSyncDaemonSwap master status"

loadb-1:
master running (ipvs_syncmaster pid: 1046)

loadb-2:
master stopped

Everything seems to check out, but I'm still unable to connect. When I first installed and tested with ndb_mgm, both NDB nodes showed up, the MGM node showed up, and so did both MYSQLD nodes. Now when I run show, all I get is the following:

[ndbd(NDB)] 2 node(s)
id=2 @172.30.0.7 (Version: 4.1.21, Nodegroup: 0)
id=3 @172.30.0.8 (Version: 4.1.21, Nodegroup: 0, Master)

[ndb_mgmd(MGM)] 1 node(s)
id=1 @172.30.0.110 (Version: 4.1.21)

[mysqld(API)] 2 node(s)
id=4 @172.30.0.8 (Version: 4.1.21)
id=5 (not connected, accepting connect from any host)

You can see that the mysqld on 172.30.0.7 isn't showing up, even though it's running on 0.7 and I can access MySQL directly from it.
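One detail worth a second look in that ipvsadm listing is the Weight column: ldirectord (with its default quiescent behavior) takes a real server out of rotation by setting its LVS weight to 0 when the health check fails, and a VIP whose real servers all sit at weight 0 won't forward any connections. A small sketch that flags weight-0 entries; the sample is the loadb-1 listing above, and live you'd pipe in "ipvsadm -L -n":

```shell
# Sample real-server lines from 'ipvsadm -L -n' above; awk fields are:
# "->"(1) address(2) forward-method(3) weight(4) active(5) inactive(6)
sample='  -> 172.30.0.8:3306 Route 0 0 0
  -> 172.30.0.7:3306 Route 0 0 0'
quiesced=$(printf '%s\n' "$sample" | awk '$1 == "->" && $4 == 0 {print $2}')
printf 'weight 0 (health check failing?): %s\n' $quiesced
```

If both real servers show up here, the load balancer is doing exactly what it is told, and the real question becomes why ldirectord's MySQL check against sql-1 and sql-2 is failing.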
What's the output of
Code:
netstat -tap
and
Code:
df -h
on 172.30.0.7? Are there any errors in the logs on 172.30.0.7?
sql-1:~# netstat -tap
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address            Foreign Address      State       PID/Program name
tcp        0      0 *:mysql                  *:*                  LISTEN      27158/mysqld
tcp        0      0 *:www                    *:*                  LISTEN      813/apache2
tcp        0      0 *:ssh                    *:*                  LISTEN      800/sshd
tcp        0      0 sql-1.localdomain:2202   *:*                  LISTEN      27099/ndbd
tcp        0      0 sql-1.localdomain:35463  172.30.0.110:1186    ESTABLISHED 27098/ndbd
tcp        0      0 sql-1.localdomain:35466  172.30.0.110:1186    ESTABLISHED 27158/mysqld
tcp        0      0 sql-1.localdomain:mysql  172.30.0.110:56547   TIME_WAIT   -
tcp        0      0 sql-1.localdomain:2202   172.30.0.8:49152     ESTABLISHED 27099/ndbd
tcp        0      0 sql-1.localdomain:mysql  172.30.0.110:56521   TIME_WAIT   -
tcp        0    148 sql-1.localdomain:ssh    172.30.0.2:1800      ESTABLISHED 18132/0
tcp        0      0 sql-1.localdomain:35465  172.30.0.110:2202    ESTABLISHED 27099/ndbd
tcp        0      0 sql-1.localdomain:2202   172.30.0.8:49149     ESTABLISHED 27099/ndbd
tcp        0      0 sql-1.localdomain:35468  172.30.0.8:2202      ESTABLISHED 27158/mysqld

sql-1:~# df -h
Filesystem            Size  Used Avail Use% Mounted on
/dev/sda1             883M  424M  412M  51% /
tmpfs                 126M     0  126M   0% /dev/shm

(The SQL data I'm storing will be < 1 MB in total; it's just users' FTP login information.) I've checked the logs and nothing seems out of place; there are no errors being thrown.
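For what it's worth, the netstat above does show mysqld on sql-1 holding an ESTABLISHED connection to the management server (172.30.0.110:1186), so the TCP side looks alive even while ndb_mgm reports the API slot as unconnected. A small sketch that pulls those lines out; the sample is taken from the paste above, and live you'd pipe "netstat -tap" into the same awk:

```shell
# 1186 is the ndb_mgmd port; sample lines from the netstat output above.
# awk fields: proto(1) recvq(2) sendq(3) local(4) foreign(5) state(6) pid/prog(7)
sample='tcp 0 0 sql-1.localdomain:35463 172.30.0.110:1186 ESTABLISHED 27098/ndbd
tcp 0 0 sql-1.localdomain:35466 172.30.0.110:1186 ESTABLISHED 27158/mysqld'
mgm_conns=$(printf '%s\n' "$sample" | awk '$5 ~ /:1186$/ && $6 == "ESTABLISHED" {print $7}')
echo "processes connected to ndb_mgmd:" $mgm_conns
```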
What's in /etc/fstab? I could imagine it's a problem with your disk space or memory as a MySQL cluster needs lots of memory...
sql-1:~# cat /etc/fstab
# /etc/fstab: static file system information.
#
# <file system> <mount point>   <type>  <options>                 <dump> <pass>
proc            /proc           proc    defaults                  0      0
/dev/sda1       /               ext3    defaults,errors=remount-ro 0     1
/dev/sda5       none            swap    sw                        0      0
/dev/hda        /media/cdrom0   iso9660 ro,user,noauto            0      0
/dev/fd0        /media/floppy0  auto    rw,user,noauto            0      0
You don't have much swap (only 126MB). And if your memory is low, that could cause a problem... What's the output of
Code:
cat /proc/meminfo
?
sql-1:~# cat /proc/meminfo
        total:     used:     free:   shared:  buffers:   cached:
Mem:  263208960 256610304   6598656        0  25546752  80146432
Swap:  82210816         0  82210816
MemTotal:       257040 kB
MemFree:          6444 kB
MemShared:           0 kB
Buffers:         24948 kB
Cached:          78268 kB
SwapCached:          0 kB
Active:          59388 kB
Inactive:       163688 kB
HighTotal:           0 kB
HighFree:            0 kB
LowTotal:       257040 kB
LowFree:          6444 kB
SwapTotal:       80284 kB
SwapFree:        80284 kB

So you think I should bump up the memory? I give these VMs about 256 MB of RAM by default. I didn't think the cluster would require much, since it's not holding much information.
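That MemFree figure is worth comparing against what an NDB data node wants at startup. A rough sanity check, assuming the MySQL 4.1 defaults of DataMemory=80M and IndexMemory=18M in config.ini (substitute your own values if you set them); the free_kb below is the MemFree value from the paste, and live you'd read it from /proc/meminfo:

```shell
# MemFree taken from the /proc/meminfo paste above;
# live: free_kb=$(awk '/^MemFree:/ {print $2}' /proc/meminfo)
free_kb=6444
need_kb=$(( (80 + 18) * 1024 ))   # DataMemory + IndexMemory, in kB
if [ "$free_kb" -lt "$need_kb" ]; then
    echo "low memory: ${free_kb} kB free, ndbd wants roughly ${need_kb} kB plus overhead"
fi
```

With only ~6 MB free, a restarted ndbd or mysqld could easily be failing or swapping even though the data set itself is tiny.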
So I bumped up the memory on both sql-1 and sql-2 to 512 MB of RAM.

sql-1:~# cat /proc/meminfo
        total:     used:     free:   shared:  buffers:   cached:
Mem:  528752640 223784960 304967680        0  14512128  72069120
Swap:  82210816         0  82210816
MemTotal:       516360 kB
MemFree:        297820 kB
MemShared:           0 kB
Buffers:         14172 kB
Cached:          70380 kB
SwapCached:          0 kB
Active:          40808 kB
Inactive:       161348 kB
HighTotal:           0 kB
HighFree:            0 kB
LowTotal:       516360 kB
LowFree:        297820 kB
SwapTotal:       80284 kB
SwapFree:        80284 kB

Still no change.
So I've looked into the issue a bit more. When I try to access the connectioncheck table, I get the following message:

ERROR 1105 (HY000): Failed to open 'connectioncheck', error while unpacking from engine

Also, since I'm running VMs, I always ssh to the machine and didn't realize there was an error being printed to the console:

DBI connect('database=ldirectordb;host=172.30.0.140;port=3306','ldirector',...) failed: Unknown database 'ldirectordb' at /etc/ha.d/resource.d/ldirector line 1950

I saw that people were having this issue after restarting their cluster: http://forums.mysql.com/read.php?25,80009,80009 I wasn't sure if you've seen this before, because when you start from scratch it works, but after one reboot it seems the database somehow gets corrupted or something. I've tried dropping the database, but it still doesn't work.
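If the NDB dictionary and the on-disk .frm files have gotten out of sync after the restart (which matches the "error while unpacking from engine" symptom on these old 4.1 builds), dropping and recreating the check database from scratch is worth a try even if a plain drop didn't help. A hedged sketch; the ldirectordb/connectioncheck names come from the howtoforge tutorial, the single-column table layout is an assumption on my part, and the statements are only printed here for review rather than executed:

```shell
# Recreate the ldirectord check table in the NDB engine so both SQL
# nodes see it. Printed for review; pipe into mysql on ONE SQL node
# to actually run it (NDB replicates the table to the other node).
SQL='DROP DATABASE IF EXISTS ldirectordb;
CREATE DATABASE ldirectordb;
USE ldirectordb;
CREATE TABLE connectioncheck (i INT) ENGINE=NDBCLUSTER;
INSERT INTO connectioncheck VALUES (1);'
printf '%s\n' "$SQL"
# e.g.: printf '%s\n' "$SQL" | mysql -u root -p
```

If the CREATE TABLE itself fails with the same unpacking error, the stale definition is living in the data nodes, and a rolling restart of ndbd (or an initial restart, at the cost of reloading data) would be the next thing to look at.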
The Perl DBI modules are the latest version. This really sucks, because it works when I first set it up. It's only after a restart that the SQL database seems to get corrupted and the active load balancer starts throwing the error about the connectioncheck table.