Despite the title of the blog, this article is about a prime example of why implementing your own data structures is probably a bad idea and will come back to haunt you. Please bear with me: I want to start from the very beginning and show how I found a bug in a custom hash table implementation that is part of an important Linux tool called kmod. The bug is subtle and unfortunately symptomatic of coding in C.

There's a TL;DR at the end if you're not interested in the full story.

Please note that I'm using RTAI 4.0 and kmod 9 on Linux 2.6.32 and 3.10. Newer versions of the respective software might contain fixes for the described problem.

What is kmod?

kmod is a drop-in replacement for the dead module-init-tools, a project that provided user space tools like depmod and modprobe that are needed for dealing with kernel modules. kmod is a fully compatible redesign built around the libkmod library that "gives early boot tools, installers, udev, and others an easy way to query and control kernel modules [...] [without] using modprobe." [1] It has been shipped by default in all major distributions (Debian, Ubuntu, Arch Linux, Gentoo, ...) for about two years, so it can be considered one of the core tools of a current Linux distribution.

When a drop-in replacement isn't drop-in

I didn't even know about kmod's existence until I prepared an upgrade from Debian Squeeze to Wheezy for an embedded system. After upgrading the userland I couldn't load a certain kernel module using modprobe. The kernel was still the same, so it couldn't be a regression in the kernel. The module I was trying to load was part of RTAI. RTAI provides a Real Time Application Interface "which lets you write applications with strict timing constraints" [2]. It consists of a patched Linux kernel and a few kernel modules. I was trying to load one of them using modprobe but I got an "unknown symbol" error after the upgrade:

# modprobe rtai_sem
ERROR: could not insert 'rtai_sem': Unknown symbol in module, or unknown parameter (see dmesg)

Doing as I was told:

# dmesg | tail -n 16
[431523.692006] rtai_sem: Unknown symbol wake_up_srq (err 0)
[431523.692033] rtai_sem: Unknown symbol rt_get_time (err 0)
[431523.692049] rtai_sem: Unknown symbol rt_register (err 0)
[431523.692065] rtai_sem: Unknown symbol rtheap_alloc (err 0)
[431523.692080] rtai_sem: Unknown symbol rt_smp_current (err 0)
[431523.692093] rtai_sem: Unknown symbol rtheap_free (err 0)
[431523.692110] rtai_sem: Unknown symbol rt_smp_linux_task (err 0)
[431523.692126] rtai_sem: Unknown symbol rt_task_delete (err 0)
[431523.692139] rtai_sem: Unknown symbol rt_smp_time_h (err 0)
[431523.692153] rtai_sem: Unknown symbol rt_drg_on_adr_cnt (err 0)
[431523.692171] rtai_sem: Unknown symbol set_rt_fun_entries (err 0)
[431523.692185] rtai_sem: Unknown symbol rtai_global_heap (err 0)
[431523.692205] rtai_sem: Unknown symbol boot_epoch (err 0)
[431523.692219] rtai_sem: Unknown symbol rt_get_adr_cnt (err 0)
[431523.692238] rtai_sem: Unknown symbol reset_rt_fun_entries (err 0)
[431523.692252] rtai_sem: Unknown symbol rt_schedule (err 0)

What the kernel tried to tell me is that the module I was trying to load needs symbols that it couldn't resolve. A kernel module exports symbols to other kernel modules just like a library (e.g. a shared object) exposes functions to those who link against it. The module that exports the symbols declares them in the source code using the C macro EXPORT_SYMBOL(...). Since most of the unknown symbols start with rt_ I figured they must also be part of RTAI. To find out which kernel module exports the missing symbols I grepped through the RTAI source code:

% grep -R 'EXPORT_SYMBOL(wake_up_srq)' *
base/sched/api.c:EXPORT_SYMBOL(wake_up_srq);

Guessing from the folder that the C file is in, it seems the symbol is exported by an RTAI kernel module with the name sched in it:

% find /lib/modules/$(uname -r)/rtai -name '*sched*'
/lib/modules/2.6.32-rtai/rtai/rtai_ksched.ko
/lib/modules/2.6.32-rtai/rtai/rtai_sched.ko

Seems we're lucky and even have two promising results. Let's try to modprobe the first one:

# modprobe rtai_ksched

No errors and promising entries in dmesg:

# dmesg | tail -n 5
[432403.876154] RTAI[malloc]: global heap size = 2097152 bytes, <BSD>.
[432403.876186] , kstacks pool size = 524288 bytes.
[432403.876190] RTAI[sched]: hard timer type/freq = APIC/49883000(Hz); default timing: periodic; linear timed lists.
[432403.876194] RTAI[sched]: Linux timer freq = 250 (Hz), TimeBase freq = 1596265000 hz.
[432403.876197] RTAI[sched]: timer setup = 20 ns, resched latency = 3923 ns.

Now if we try to load rtai_sem:

# modprobe rtai_sem

No output which means that worked too!

Dependency hell

It seems that we have a dependency problem introduced by the upgrade to Wheezy. Normally dependencies are taken care of by modprobe, which loads the modules in the right order so symbols are resolved correctly. How does modprobe know which module to load in which order? The tool depmod, which is part of kmod, generates the file /lib/modules/$(uname -r)/modules.dep (and its binary representation modules.dep.bin, but for the sake of simplicity I'll always refer to modules.dep even if I mean both of them). It contains all dependencies of all the modules installed for that kernel version and is parsed by modprobe and others. Let's see what we can find in there for rtai_sem.ko:

% grep '^rtai/rtai_sem\.ko' /lib/modules/$(uname -r)/modules.dep
rtai/rtai_sem.ko: rtai/rtai_hal.ko

The part before the colon is a relative path to the kernel module, the part after the colon lists the modules that need to be loaded before it. Since we needed to load rtai_ksched.ko before we could finally load rtai_sem.ko, a dependency on rtai_sched.ko is clearly missing.
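To make the line format concrete, here is a minimal sketch in C that splits one modules.dep line into the module path and its dependencies. The helper parse_dep_line() is hypothetical and not part of kmod; it only assumes the "module: dep1 dep2 ..." layout described above.

```c
#include <string.h>

/* Hypothetical helper, not part of kmod: splits one modules.dep line of the
 * form "module: dep1 dep2 ..." in place. Returns the number of dependencies
 * stored in deps[] (at most max_deps), or -1 if the line has no colon. */
int parse_dep_line(char *line, char **module, char **deps, int max_deps)
{
    char *colon = strchr(line, ':');
    if (colon == NULL)
        return -1;
    *colon = '\0';              /* terminate the module path */
    *module = line;

    int n = 0;
    char *p = colon + 1;
    while (n < max_deps) {
        p += strspn(p, " \t\n");    /* skip separators */
        if (*p == '\0')
            break;
        deps[n++] = p;              /* start of the next dependency */
        p += strcspn(p, " \t\n");   /* advance to the end of the token */
        if (*p != '\0')
            *p++ = '\0';
    }
    return n;
}
```

Fed with the line from the grep above, the sketch would report a single dependency (rtai/rtai_hal.ko), which is exactly the problem: rtai/rtai_sched.ko should be in that list as well.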

To confirm that theory we can use lsmod which prints reverse dependencies for kernel modules for all loaded modules:

% lsmod | grep '^Module\|rtai_sem'
Module                Size  Used by
rtai_sem               18431  0
rtai_sched             61014  1 rtai_sem
rtai_hal              227711  2 rtai_sem,rtai_sched

The last column lists all kernel modules that depend on the given module being loaded. We can see that rtai_sem uses rtai_sched and rtai_hal, so rtai_sem clearly needs a dependency on both of them, but the modules.dep file only contains the one on rtai_hal. But there's another interesting thing: we've loaded rtai_ksched (note the k) but lsmod lists rtai_sched without the k. Is rtai_sched an alias for rtai_ksched (or vice versa) and both point to the same file on disk? Let's use modinfo to find out:
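If you want to check the "Used by" column programmatically, a rough sketch in C could look like this. The struct lsmod_row and the function parse_lsmod_row() are made up for illustration; they only assume the whitespace-separated columns shown in the lsmod output above.

```c
#include <stdio.h>

/* Hypothetical record for one row of lsmod output. */
struct lsmod_row {
    char name[64];
    unsigned long size;
    unsigned int refcnt;
    char used_by[256];   /* comma-separated list, empty if nobody uses it */
};

/* Parses a row like "rtai_hal  227711  2 rtai_sem,rtai_sched".
 * Returns 0 on success and -1 if the line doesn't look like a module row
 * (for example the "Module Size Used by" header). */
int parse_lsmod_row(const char *line, struct lsmod_row *row)
{
    row->used_by[0] = '\0';
    int n = sscanf(line, "%63s %lu %u %255s",
                   row->name, &row->size, &row->refcnt, row->used_by);
    return n >= 3 ? 0 : -1;
}
```

The used_by field stays empty for a row such as rtai_sem, which no other module depends on.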

% modinfo rtai_ksched | grep filename
filename:       /lib/modules/2.6.32-rtai/rtai/rtai_ksched.ko
% modinfo rtai_sched | grep filename
filename:       /lib/modules/2.6.32-rtai/rtai/rtai_sched.ko

Nope, it's not a "conventional" module alias compiled into the module using the C macro MODULE_ALIAS because the filename field points to two different locations. But could it be a symlink?

# ls -l
lrwxrwxrwx 1 root root     13 Apr 30 12:12 rtai_ksched.ko -> rtai_sched.ko
-rw-r--r-- 1 root root 100925 Apr 30 12:12 rtai_sched.ko

Yes, it's as simple as that. rtai_ksched.ko just points to rtai_sched.ko. In fact, several other symlinks point to rtai_sched.ko. base/sched/GNUmakefile.am in the RTAI repository tells us that these symlinks are names of legacy schedulers that have been incorporated into one module. This means the symlinks are only there for compatibility reasons. While it would be somewhat reasonable for depmod to ignore symlinks, it's certainly not reasonable to ignore modules that are the destination of symlinks.

Our explorations lead us to a hypothesis: depmod, which is responsible for generating modules.dep, fails to include a module as a dependency if it is the destination of a symlink.

How can we verify that hypothesis now? Unfortunately modprobe only uses the binary file modules.dep.bin and ignores modules.dep despite what the man page says (this has been fixed in the man page of kmod 17). Therefore it's not possible to just patch modules.dep and see if the dependencies are then resolved correctly by modprobe. However, we can try to verify our symlink hypothesis by just moving them out of the way and rerunning depmod:

# find . -type l -execdir mv '{}' '{}'.bak \;

Now all symlinks have been renamed to *.bak and as such they're ignored by kmod:

# ls -l
lrwxrwxrwx 1 root root 13 Apr 30 12:12 rtai_ksched.ko.bak -> rtai_sched.ko
lrwxrwxrwx 1 root root 13 Apr 30 12:12 rtai_lxrt.ko.bak -> rtai_sched.ko
lrwxrwxrwx 1 root root 13 Apr 30 12:12 rtai_mup.ko.bak -> rtai_sched.ko
lrwxrwxrwx 1 root root 13 Apr 30 12:12 rtai_smp.ko.bak -> rtai_sched.ko
lrwxrwxrwx 1 root root 13 Apr 30 12:12 rtai_up.ko.bak -> rtai_sched.ko

Now we can try to regenerate modules.dep:

# depmod -a

Let's see if it changed anything:

# grep '^rtai/rtai_sem\.ko' /lib/modules/$(uname -r)/modules.dep
rtai/rtai_sem.ko: rtai/rtai_sched.ko rtai/rtai_hal.ko

Hurray! modules.dep now really contains the right dependencies. Before we investigate any further we must be certain that a dependency on rtai_sched.ko in modules.dep actually fixes our problem. So we unload the modules (or reboot the machine) and load rtai_sem again. Be sure to specify the modules on the command line in the right order. The module that isn't used by any other module must be given first and so on:

# modprobe -r rtai_sem rtai_sched rtai_hal
# modprobe rtai_sem

Hurray again! That really fixed our problem. To be absolutely sure, we should rename the symlinks back and try it again:

# rename 's/\.bak$//' *

Unloading and loading is now left as an exercise for the reader.

Summing up what we know

Let's summarize what we know:

  • rtai_sem can't be loaded because it misses a dependency to another module.
  • depmod fails to generate correct dependencies for rtai_sem.
  • It's somehow connected to the fact that symlinks point to the module that should be a dependency.

Increasing verbosity

Let's see why depmod doesn't generate the correct dependencies. Before we fire up gdb we should increase the verbosity level of depmod's output to DEBUG. Its man page tells us that depmod accepts a number of -v arguments to increase the verbosity and -n for a dry run that prints the dependencies to stdout instead of writing them to modules.dep. It's also possible to limit the dependency calculation to the kernel modules given as arguments on the command line. Limiting the calculation has the advantage that it's much faster and that the output is less polluted with dependencies on other modules:

# depmod -vvv -n /lib/modules/$(uname -r)/rtai/rtai_{ksched,sched,hal,sem}.ko

The command generates a lot of output since the log level is set to DEBUG. Let's examine what's going on:

[...]
DEBUG: add 0xb8f61dd8 kmod=0xb8f61d50, path=/lib/modules/2.6.32-rtai/rtai/rtai_ksched.ko
[...]
DEBUG: add 0xb8f61ff8 kmod=0xb8f61f70, path=/lib/modules/2.6.32-rtai/rtai/rtai_sched.ko
[...]
DEBUG: add 0xb8f62210 kmod=0xb8f62188, path=/lib/modules/2.6.32-rtai/rtai/rtai_hal.ko
[...]
DEBUG: add 0xb8f62428 kmod=0xb8f623a0, path=/lib/modules/2.6.32-rtai/rtai/rtai_sem.ko

Every module that is given as an argument to depmod is loaded into memory. Then the symbols for those four modules are loaded:

DEBUG: load symbols (4 modules)
libkmod: DEBUG ../libkmod/libkmod-module.c:709 kmod_module_get_path: name='rtai_sched' path='/lib/modules/2.6.32-rtai/rtai/rtai_sched.ko'
DEBUG: add 0xb8f62810 sym=rt_get_base_linux_task, owner=0xb8f61ff8 /lib/modules/2.6.32-rtai/rtai/rtai_sched.ko
DEBUG: add 0xb8f627f0 sym=rt_drg_on_name, owner=0xb8f61ff8 /lib/modules/2.6.32-rtai/rtai/rtai_sched.ko
DEBUG: add 0xb8f62788 sym=rt_set_period, owner=0xb8f61ff8 /lib/modules/2.6.32-rtai/rtai/rtai_sched.ko
[...]

All symbols that are part of that module are loaded into memory (added). The first module that gets its symbols loaded is rtai_sched. The next is rtai_ksched:

libkmod: DEBUG ../libkmod/libkmod-module.c:709 kmod_module_get_path: name='rtai_ksched' path='/lib/modules/2.6.32-rtai/rtai/rtai_ksched.ko'
DEBUG: free 0xb8f62810 sym=rt_get_base_linux_task, owner=0xb8f61ff8 /lib/modules/2.6.32-rtai/rtai/rtai_sched.ko
DEBUG: add 0xb8f639d0 sym=rt_get_base_linux_task, owner=0xb8f61dd8 /lib/modules/2.6.32-rtai/rtai/rtai_ksched.ko
DEBUG: free 0xb8f627f0 sym=rt_drg_on_name, owner=0xb8f61ff8 /lib/modules/2.6.32-rtai/rtai/rtai_sched.ko
DEBUG: add 0xb8f65350 sym=rt_drg_on_name, owner=0xb8f61dd8 /lib/modules/2.6.32-rtai/rtai/rtai_ksched.ko
DEBUG: free 0xb8f62788 sym=rt_set_period, owner=0xb8f61ff8 /lib/modules/2.6.32-rtai/rtai/rtai_sched.ko
DEBUG: add 0xb8f627f0 sym=rt_set_period, owner=0xb8f61dd8 /lib/modules/2.6.32-rtai/rtai/rtai_ksched.ko
DEBUG: free 0xb8f62e50 sym=start_rt_apic_timers, owner=0xb8f61ff8 /lib/modules/2.6.32-rtai/rtai/rtai_sched.ko
DEBUG: add 0xb8f62810 sym=start_rt_apic_timers, owner=0xb8f61dd8 /lib/modules/2.6.32-rtai/rtai/rtai_ksched.ko
[...]

Have you noticed the difference from rtai_sched? Suddenly for every line that adds a symbol there's also a free before it! rtai_ksched is special because it's only a symlink to rtai_sched, so it exports the same symbols since it's the same file in the end. This means depmod frees the reference to the old symbol when there's a new symbol with the same name. Since the free seems to come before the add this doesn't look too wrong. Let's skip the remaining symbols and go to the calculated dependencies:

DEBUG: loaded dependencies (4 modules, 231 symbols)
DEBUG: calculate dependencies and ordering (4 modules)
DEBUG: calculated dependencies and ordering (0 loops, 4 modules)
rtai/rtai_sched.ko: rtai/rtai_hal.ko
rtai/rtai_ksched.ko: rtai/rtai_hal.ko
rtai/rtai_hal.ko:
rtai/rtai_sem.ko: rtai/rtai_hal.ko
[...]

The dependency on rtai/rtai_sched.ko is missing for rtai/rtai_sem.ko, so we have reproduced the bug.

Debugging kmod

Before we can fire up the debugger, we have to download the source code from https://www.kernel.org/pub/linux/utils/kernel/kmod/ and look for the line that prints free 0xb8f62810 sym=rt_get_base_linux_task ...:

$ tar xf kmod-17.tar.gz
$ grep -R 'free.*sym=' *
tools/depmod.c:DBG("free %p sym=%s, owner=%p %s\n", sym, sym->name, sym->owner,

Let's see what we can find in tools/depmod.c:

static void symbol_free(struct symbol *sym)
{
        DBG("free %p sym=%s, owner=%p %s\n", sym, sym->name, sym->owner,
            sym->owner != NULL ? sym->owner->path : "");
        free(sym);
}

Now we have a function on which we can set a breakpoint. But first let's compile kmod:

$ ./autogen.sh
$ ./configure CFLAGS="-g" --prefix=/usr --sysconfdir=/etc --libdir=/usr/lib --enable-debug
$ make

If you're on a recent enough Debian/Ubuntu you can install the build dependencies using apt-get:

# apt-get build-dep kmod

Now fire up the debugger and put a breakpoint on function symbol_free():

$ gdb tools/depmod
(gdb) break symbol_free
Breakpoint 1 at 0x804e576: file tools/depmod.c, line 985.

Now run it:

(gdb) run -vvv -n /lib/modules/2.6.32-rtai/rtai/rtai_ksched.ko /lib/modules/2.6.32-rtai/rtai/rtai_sched.ko /lib/modules/2.6.32-rtai/rtai/rtai_hal.ko /lib/modules/2.6.32-rtai/rtai/rtai_sem.ko
[...]
depmod: DEBUG: libkmod/libkmod-module.c:726 kmod_module_get_path()
name='rtai_ksched'
path='/lib/modules/2.6.32-rtai/rtai/rtai_ksched.ko'

Breakpoint 1, symbol_free (sym=0x8078a48) at tools/depmod.c:985
985             DBG("free %p sym=%s, owner=%p %s\n", sym, sym->name, sym->owner,
(gdb) bt
#0  symbol_free (sym=0x8078a48) at tools/depmod.c:985
#1  0x08053493 in hash_add (hash=0x8071d28, key=0x808c104 "rt_get_base_linux_task", value=0x808c0f8) at libkmod/libkmod-hash.c:171
#2  0x0804f86f in depmod_symbol_add (depmod=0xbfffe25c, name=0x8077e18 "rt_get_base_linux_task", prefix_skipped=false, crc=3309780553, owner=0x8077e58) at tools/depmod.c:1440
#3  0x0804fa3a in depmod_load_modules (depmod=0xbfffe25c) at tools/depmod.c:1485
#4  0x080501e0 in depmod_load (depmod=0xbfffe25c) at tools/depmod.c:1670
#5  0x08052704 in do_depmod (argc=7, argv=0xbffff574) at tools/depmod.c:2666
#6  0x08049d94 in handle_kmod_compat_commands (argc=7, argv=0xbffff574) at tools/kmod.c:153
#7  0x08049df3 in main (argc=7, argv=0xbffff574) at tools/kmod.c:166

We're in symbol_free() now. From the backtrace we can see that a new symbol called rt_get_base_linux_task should be added to a hash that probably contains all currently existing symbols. Let's see what hash_add() actually does and in which cases symbol_free() is called:

(gdb) list hash_add
143      *
144      * none of key or value are copied, just references are remembered as is,
145      * make sure they are live while pair exists in hash!
146      */
147     int hash_add(struct hash *hash, const char *key, const void *value)
148     {
149             unsigned int keylen = strlen(key);
150             unsigned int hashval = hash_superfast(key, keylen);
151             unsigned int pos = hashval & (hash->n_buckets - 1);
152             struct hash_bucket *bucket = hash->buckets + pos;
(gdb) list +
153             struct hash_entry *entry, *entry_end;
154
155             if (bucket->used + 1 >= bucket->total) {
156                     unsigned new_total = bucket->total + hash->step;
157                     size_t size = new_total * sizeof(struct hash_entry);
158                     struct hash_entry *tmp = realloc(bucket->entries, size);
159                     if (tmp == NULL)
160                             return -errno;
161                     bucket->entries = tmp;
162                     bucket->total = new_total;
(gdb) list +
163             }
164
165             entry = bucket->entries;
166             entry_end = entry + bucket->used;
167             for (; entry < entry_end; entry++) {
168                     int c = strcmp(key, entry->key);
169                     if (c == 0) {
170                             if (hash->free_value)
171                                     hash->free_value((void *)entry->value);
172                             entry->value = value;
(gdb) list +
173                             return 0;
174                     } else if (c < 0) {
175                             memmove(entry + 1, entry,
176                                     (entry_end - entry) * sizeof(struct hash_entry));
177                             break;
178                     }
179             }
180
181             entry->key = key;
182             entry->value = value;

In line 171 we can see that a function pointer free_value() that is a member of struct hash is called. This pointer probably points to our symbol_free() function. We can verify that by looking at the source code:

$ grep -R symbol_free *
tools/depmod.c: depmod->symbols = hash_new(2048, (void (*)(void *))symbol_free);

depmod uses symbol_free() as a parameter to hash_new(). But the more interesting question is in which cases does hash_add() free the symbol? And why is the symbol still lost if it's readded after being freed?

The first question is easy to answer from the source code above. The for loop in line 167 iterates over all entries in the bucket and the strcmp in line 168 compares each key it finds with the key to be added. Since a key in a hash must be unique, the value stored for that key in the bucket is replaced with the value given to hash_add(). To avoid a memory leak the old value in the bucket, entry->value, is freed first (line 171) and the pointer to the value in the bucket is then set to the new value (line 172). Immediately after that the function returns.
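The insert-or-replace logic described above can be sketched in a few lines of C. This is a simplified stand-in, not kmod's code: the fixed-capacity struct bucket, bucket_add() and the counting callback count_free() are names I made up, and the callback only counts calls instead of freeing anything.

```c
#include <stddef.h>
#include <string.h>

#define BUCKET_CAPACITY 8   /* fixed size keeps the sketch short */

struct entry { const char *key; const void *value; };

struct bucket {
    struct entry entries[BUCKET_CAPACITY];   /* kept sorted by key */
    size_t used;
    void (*free_value)(void *value);
};

int freed_count;   /* how often the free callback has been invoked */

void count_free(void *value) { (void)value; freed_count++; }

/* Returns 1 if an existing key was replaced, 0 if a new entry was
 * inserted and -1 if the bucket is full. */
int bucket_add(struct bucket *b, const char *key, const void *value)
{
    size_t i;
    for (i = 0; i < b->used; i++) {
        int c = strcmp(key, b->entries[i].key);
        if (c == 0) {
            if (b->free_value)
                b->free_value((void *)b->entries[i].value);
            /* only the value is replaced, the stored key stays as it is */
            b->entries[i].value = value;
            return 1;
        }
        if (c < 0)
            break;   /* insert position found */
    }
    if (b->used == BUCKET_CAPACITY)
        return -1;
    memmove(&b->entries[i + 1], &b->entries[i],
            (b->used - i) * sizeof(struct entry));
    b->entries[i].key = key;
    b->entries[i].value = value;
    b->used++;
    return 0;
}
```

Adding the same key twice calls the free callback once for the old value and leaves the number of entries unchanged, which matches the free/add pairs in the debug output.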

At first glance this seems right. The only thing that bothers me is that the comment for that function in lines 144 to 145 says that key and value are not copied (i.e. only their references are stored) and must be "live" (i.e. must be dereferenceable) while a pair exists in the hash. The call to free_value() violates that requirement because one half of the pair (the value) is no longer available while the other (the key) is still referenced.

We're still in symbol_free() in gdb so we go out of free_value() and back into hash_add():

(gdb) fin
Run till exit from #0  symbol_free (sym=0x8078a48) at tools/depmod.c:985
depmod: DEBUG: free 0x8078a48 sym=rt_get_base_linux_task, owner=0x80780f8 /lib/modules/2.6.32-rtai/rtai/rtai_sched.ko
hash_add (hash=0x8071d28, key=0x808c104 "rt_get_base_linux_task", value=0x808c0f8) at libkmod/libkmod-hash.c:172
172                             entry->value = value;

Let's see what should be added as a key value pair to the hash and what already exists:

(gdb) print key
$1 = 0x808c104 "rt_get_base_linux_task"
(gdb) print value
$2 = (const void *) 0x808c0f8
(gdb) p entry->value
$3 = (const void *) 0x8078a48
(gdb) p entry->key
$4 = 0x8078a54 "rt_get_base_linux_task"

Since value is a void pointer we need to cast it before we can get anything meaningful out of it. Since symbol_free() expects a struct symbol, value is probably of the same type:

(gdb) print *(struct symbol*)value
$5 = {owner = 0x8077e58, crc = 3309780553, name = 0x808c104 "rt_get_base_linux_task"}

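Interesting! Look at the address of name (0x808c104) and compare it with the one of key. They're the same! The key passed to hash_add() is just a pointer to the name stored inside the value. The same holds for the existing entry: entry->key (0x8078a54) lies only 12 bytes after entry->value (0x8078a48). So once entry->value is freed (line 171), entry->key is a dangling pointer into an already freed memory region. Because hash_add() keeps the existing entry->key when it compares equal to the given key instead of taking over the new one, the entry ends up with a key that can point to garbage.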
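The relationship between key and value is easier to see in a small sketch. I'm assuming here that the symbol name is stored inline at the end of the structure, which is what the addresses in the gdb session suggest; struct symbol_sketch and both helper functions are illustrative names, not kmod's.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Assumed layout for illustration: the name lives inside the same
 * allocation as the rest of the symbol. */
struct symbol_sketch {
    void *owner;
    uint64_t crc;
    char name[];
};

/* Allocates a symbol with room for a copy of its name. */
struct symbol_sketch *symbol_sketch_new(const char *name, uint64_t crc)
{
    size_t len = strlen(name) + 1;
    struct symbol_sketch *sym = malloc(sizeof(*sym) + len);
    if (sym == NULL)
        return NULL;
    sym->owner = NULL;
    sym->crc = crc;
    memcpy(sym->name, name, len);
    return sym;
}

/* Returns 1 if key points into the memory block that holds sym. */
int key_lives_inside(const struct symbol_sketch *sym, const char *key)
{
    uintptr_t start = (uintptr_t)sym;
    uintptr_t end = start + sizeof(*sym) + strlen(sym->name) + 1;
    uintptr_t k = (uintptr_t)key;
    return k >= start && k < end;
}
```

With such a layout, using sym->name as the hash key means the key dies together with the value. In the gdb session both pairs are 0xc bytes apart (0x808c104 - 0x808c0f8 and 0x8078a54 - 0x8078a48), which would fit a name stored right after a pointer and a 64 bit CRC on a 32 bit machine.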

Putting the pieces together

"Fine," you might say, "it's a bug in the implementation but how is it connected with our symlink problem?"

Let's recapitulate again:

  • If a symlink points to a module whose symbols have already been loaded they are added again.
  • If a symbol is added whose name already exists in the hash that contains all currently known symbols, it replaces the already known symbol.
  • This replacement mechanism is broken. The new symbol is added under the key of the old symbol (which is equal but not identical because the keys are pointers to different memory regions!) and so it happens that the key suddenly points to freed data in memory.

Now one must know that freed data does not immediately turn into garbage. The pointer still points to valid data until that memory region is reassigned and someone overwrites it. Until this happens our symbol is still in the hash and valid. We can use hash_find() to see if a key is present in the hash:

(gdb) p hash_find(hash, "rt_get_base_linux_task")
$6 = (void *) 0x808c0f8

But after continuing for a few times (i.e. after some other symbols have been added) our symbol is lost:

(gdb) continue
[...]
(gdb) continue
[...]
(gdb) continue
[...]
(gdb) continue
[...]
(gdb) continue
[...]
(gdb) break hash_add
Breakpoint 2 at 0x805338a: file libkmod/libkmod-hash.c, line 149.
(gdb) continue
Continuing.
depmod: DEBUG: add 0x808c140 sym=rt_get_adr, owner=0x8077e58 /lib/modules/2.6.32-rtai/rtai/rtai_ksched.ko
Breakpoint 2, hash_add (hash=0x8071d28, key=0x8079534 "put_current_on_cpu", value=0x8079528) at libkmod/libkmod-hash.c:149
149             unsigned int keylen = strlen(key);
(gdb) p hash_find(hash, "rt_get_base_linux_task")
$7 = (void *) 0x0

This happens when the memory region the key pointed to is reassigned by a call to malloc() and thus overwritten. The entry can still be found manually, though, by iterating over the entries in the hash's buckets:

(gdb) set $i = 0
(gdb) while ($i < hash->n_buckets)
 >if (hash->buckets[$i]->entries != 0 && strcmp(((struct symbol*)hash->buckets[$i]->entries->value)->name, "rt_get_base_linux_task") == 0)
  >print hash->buckets[$i]->entries->key
  >print *(struct symbol*)hash->buckets[$i]->entries->value
  >end
 >set $i = $i + 1
 >end
$8 = 0x8078a54 "start_rt_apic_timers"
$9 = {owner = 0x8077e58, crc = 3309780553, name = 0x808c104 "rt_get_base_linux_task"}

But hash_find() doesn't find it under its key rt_get_base_linux_task:

(gdb) p hash_find(hash, "rt_get_base_linux_task")
$10 = (void *) 0x0

That is because, as we can see from the output above, hash_find() compares against the key at address 0x8078a54, which now reads start_rt_apic_timers, and not against rt_get_base_linux_task, which is at 0x808c104. Do you remember the first address? It's the one that was used for the first occurrence of the symbol:

[...]
(gdb) p entry->key
$4 = 0x8078a54 "rt_get_base_linux_task"

So now that we understand the problem all there's left to do is a) fix RTAI to use MODULE_ALIAS() instead of symlinking modules and b) fix kmod to set entry->key to the given key.

If it's broken, fix it

kmod

The change to kmod is just a one liner. Instead of using the old key we use the new one given as an argument to the hash_add() function:

diff --git a/libkmod/libkmod-hash.c b/libkmod/libkmod-hash.c
index c751d2d..eb7afb7 100644
--- a/libkmod/libkmod-hash.c
+++ b/libkmod/libkmod-hash.c
@@ -169,6 +169,7 @@ int hash_add(struct hash *hash, const char *key, const void *value)
                if (c == 0) {
                        if (hash->free_value)
                                hash->free_value((void *)entry->value);
+                       entry->key = key;
                        entry->value = value;
                        return 0;
                } else if (c < 0) {

Just one line with such an impact.
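To see the effect of that line in isolation, here is a small sketch of just the replacement step. It's not kmod's code: struct sym, struct entry, entry_replace() and the fix flag are invented for this example, and instead of really freeing memory the callback only marks the old symbol as released, so the sketch never touches freed memory.

```c
#include <stddef.h>

/* Illustrative stand-in for a symbol whose name doubles as the hash key. */
struct sym {
    int released;     /* set by the fake free callback */
    char name[32];
};

struct entry { const char *key; const void *value; };

/* Fake free_value callback: marks the symbol instead of freeing it. */
void mark_released(void *value)
{
    ((struct sym *)value)->released = 1;
}

/* Replacement step for an entry whose key compares equal to key.
 * With fix == 0 only the value is replaced and the old key is kept;
 * with fix != 0 the new key is adopted as well, like in the patch. */
void entry_replace(struct entry *e, const char *key, const void *value,
                   void (*free_value)(void *), int fix)
{
    if (free_value)
        free_value((void *)e->value);
    if (fix)
        e->key = key;
    e->value = value;
}

/* Returns 1 if the entry's key points at the name of its current value. */
int key_owned_by_value(const struct entry *e)
{
    const struct sym *s = e->value;
    return e->key == s->name;
}
```

Without the fix the entry keeps a key that belongs to the released symbol. With the fix, key and value come from the same, still living symbol again.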

One last test with the new kmod (after recompiling it):

$ ./tools/depmod -n /lib/modules/$(uname -r)/rtai/rtai_{ksched,sched,hal,sem}.ko | grep '^rtai/rtai_sem\.ko'
rtai/rtai_sem.ko: rtai/rtai_ksched.ko rtai/rtai_hal.ko

It works! What a relief. I've posted a patch to the kmod mailing list.

RTAI

Fixing RTAI is quite straightforward. One has to remove the generation of symlinks from the build system and add some MODULE_ALIAS() lines to base/sched/sched.c.

I've posted a patch to the mailing list that addresses that problem. It will be part of the upcoming RTAI 4.1.

TL;DR

I ran into a problem in kmod, a set of command line tools built around a library to deal with kernel modules. kmod has a bug in its hash table implementation that causes modules that are the destination of a symlink to not be recognized as dependencies. However, symlinks in the /lib/modules/$(uname -r) directory are not used for kernel modules that are part of the kernel tree. Since this bug is triggered only by out-of-tree builds of kernel modules, as developer Lucas De Marchi points out on the mailing list, it's not a critical issue from a practical point of view. However, even if it doesn't affect the average user, it's still a use-after-free error. If you implement your own data structures you'll run into bugs, especially if your code doesn't contain any unit tests for these data structures.

Update: Lucas De Marchi pointed out that "[...] The hash implementation was not written from scratch, but rather parts were taken from existing implementations". I've rephrased some paragraphs to reflect that objection.

[1]https://lwn.net/Articles/472354/
[2]https://www.rtai.org