(Continued from part II where I tried to fix a bug and found out that the affected part of the code had been rewritten, so the bug didn't exist anymore.)
Magnus gives a helpful hint...
<magnus> hingo: maybe https://bugs.mysql.com/bug.php?id=42920 is some thing for you? :)
<magnus> part III
<hingo> magnus: thanks :-) I was thinking someone will suggest something...
<hingo> magnus: I was really disappointed that #32658 was not valid anymore. Would have been fun to send a patch that only fixes a message string and no code :-)
<magnus> hingo: I'm quite sure this is not much more :)
<magnus> hingo: just a little tricky to find why it behaves like it does.
<hingo> magnus: Shh... No spoilers :-)
<magnus> :)
So here we go...
$ bzr branch mysql-cluster-7.0 bug-42920
The bug already contains some tests and descriptions of what happens (thanks gustaf), so I'll skip repeating those tests. (But the bug is filed against 6.3 and I'm using 7.0...)
Then what? In school very very long ago they thought me how to use gdb. In real life I just used Visual Studio debugger, which is kind of fun because you just point at various lines in the code to set breakpoints and then you see what happens - no need to understand what you're doing, just like Microsoft products usually are.
I decide to start with plain old reading through the code...
~/hacking/bzr/mysql-cluster-7.0/bug-42920/storage/ndb/src$ ls
CMakeLists.txt cw kernel Makefile.am mgmclient ndbapi
common external libndb.ver.in mgmapi mgmsrv
mgmapi, mgmsrv, mgmclient... which one? Should be mgmsrv that is behaving badly...
~/hacking/bzr/mysql-cluster-7.0/bug-42920/storage/ndb/src$ ls mgmsrv/
CMakeLists.txt ConfigSubscriber.hpp main.cpp ndb_mgmd_error.h
Config.cpp Defragger.hpp Makefile.am Services.cpp
Config.hpp DirIterator.cpp mgm_ndbinfo.cpp Services.hpp
ConfigInfo.cpp DirIterator.hpp MgmtSrvr.cpp testConfig.cpp
ConfigInfo.hpp ERROR_codes.txt MgmtSrvr.hpp
ConfigManager.cpp InitConfigFileParser.cpp MgmtThread.hpp
ConfigManager.hpp InitConfigFileParser.hpp mkconfig
:~/hacking/bzr/mysql-cluster-7.0/bug-42920/storage/ndb/src$ kwrite mgmsrv/main.cpp
main.cpp just seems like the natural place to start...
int main(int argc, char** argv)
{
return my_daemon_init(argc, argv, mgmd_main, mgmd_stop,
"ndb_mgmd", "MySQL Cluster Management Server");
}
Yepp, I can understand what that one is doing...
The my_daemon_init calls a function called run(). Right. So try grep that in the mysql source code. At this point I'm thinking I should perhaps install an IDE that could tell me in which #included file this particular run() function is defined... Ah, but run is just a function pointer here, so the real name of the function called is ndb_mgmd(). I'm fine grepping that, and the discovery continues...
I also realise I should use 6.3 after all. The startup sequence of ndb_mgmd has changed in 7.0 and it stores a copy of the config.ini for itself - somehow I feel these could affect the bug occuring and I'm potentially wasting my time with the mainline 7.0 branch. Now that I know how to use bzr this is easy. ...and sure enough, main.cpp is completely different (seems like a lot around starting the server has changed to add Win32 support for 7.0)
// These lines in 6.3 main() seem interesting...
int main(int argc, char** argv)
{
...
glob= new MgmGlobals;
...
glob->socketServer = new SocketServer();
MgmApiService * mapi = new MgmApiService();
glob->mgmObject = new MgmtSrvr(glob->socketServer,
opt_config_filename,
opt_connect_str);
...
if (glob->mgmObject->init())
goto error_end;
...
while(!glob->socketServer->tryBind(glob->port, _bind_address)){
if (--count > 0) {
NdbSleep_MilliSleep(1000);
continue;
}
ndbout_c("Unable to setup port: %s:%d!\n"
"Please check if the port is already used,\n"
"(perhaps a ndb_mgmd is already running),\n"
"and if you are executing on the correct computer",
(_bind_address ? _bind_address : "*"), glob->port);
goto error_end;
...then it continues with trying to bind to a port to listen to, becoming a daemon, etc...
Observation: So you are allowed to use goto
in NDB code. Just like in Linux too :-)
I continue reading a couple hours, following the function calls, starting with the MgmtSrvr() constructor, until I reach line 2459 in mgmapi.cpp:
err.assfmt("Could not alloc node id at %s port %d: %s",
hostname, port, buf);
prop->get("error_code", &error_code);
That is the error message in the bug report. We are getting closer.
But I'm not sure the code path I've followed is the same as in the bug report. I write a small test program to verify that I remember the short-circuiting of boolean operations correctly:
#include
int out(int i){
printf("out: %d\n", i);
return i;
}
void main(){
if((out(0) || out(0)) && out(2))
out(4);
}
-EOF-
$ ./a.out
out: 0
out: 0
On the bright side, I had absolutely no problems compiling this one, like the ones I had when building MySQL Cluster in part 1. On the embarrasing side, my fingers want to type $dollar signs in front of the i variable - too much PHP and Perl in recent years :-(
Ok, so that means that indeed the do_connect() is not executed here (in MgmtSvr.cpp), because a config filename is given and otoh a connect string is not (see the bug report for details):
// if connect_string explicitly given or
// no config filename is given then
// first try to allocate nodeid from another management server
if ((connect_string || config_filename == NULL) &&
(m_config_retriever->do_connect(0,0,0) == 0))
So I continue reading... Hmm... There are no more calls to do_connect().
Come to think of it, of course it cannot be right that no connect-string was provided. If you want to have 2 management servers (which is the case in this bug), how do you suppose they are going to find and connect to each other, if you don't provide a connect-string? Of course there is a connect-string! It was not provided on the command line, but (by mistake) from a my.cnf file. Remember that all MySQL Cluster processes read the my.cnf file too, not just mysqld nodes. The problem is, if you have 1 mysql installation that has it's configuration file in the default /etc/mysql/my.cnf location, then all of your mysql applications want to read that file whether you want them to or not.
I have to say I'm disappointed. I've now fixed 2 bugs simply by reading the related source code, and still didn't get to actually write a patch, use gdb or anything. On the bright side, I now know how the ndb_mgmd process starts. So I learned something.
Also on the bright side, this if anything shows that you can fix MySQL [Cluster] bugs whether you are a developer or not.
[To be clear, my theory of this bug has not been verified yet, and the bug is not officially closed yet.]
- Log in to post comments
- 29734 views
Jack verifies there is a bug
Update: Jack Andrews, one of the "real" MySQL Cluster developers, has now verified that there actually is a bug. Nobody has yet found out where it is though.