[Dnsmasq-discuss] Seg. fault in cache.c after commt b6f926fb

Discussion:

Kristian Evensen

2018-09-18 10:28:47 UTC

Hello,

I recently updated one of my x86-based OpenWRT-routers to the latest
nightly, which bumped dnsmasq to 2.80test6. After the update, I see
that dnsmasq sometimes enters a crash loop. The crash occurs right
startup (SIGSEV), and the backtrace, looks as follows:

0x0000000000406a78 in make_non_terminals
(source=***@entry=0x7ffff7ffefa0) at cache.c:1437
1437 cache.c: No such file or directory.
(gdb) bt
#0 0x0000000000406a78 in make_non_terminals
(source=***@entry=0x7ffff7ffefa0) at cache.c:1437
#1 0x0000000000407192 in add_hosts_entry (cache=0x7ffff7ffefa0,
addr=0x7fffffffea28, addrlen=4, index=2, rhash=<optimized out>,
hashsz=<optimized out>) at cache.c:911
#2 0x00000000004074e1 in read_hostsfile
(filename=***@entry=0x425a8a "/etc/hosts", index=***@entry=2,
cache_size=***@entry=150, rhash=0x635020,
hashsz=***@entry=641) at cache.c:1040
#3 0x0000000000407810 in cache_reload () at cache.c:1185
#4 0x0000000000418037 in clear_cache_and_reload
(now=***@entry=1537265437) at dnsmasq.c:1547
#5 0x0000000000405ec3 in async_event (now=1537265437, pipe=15) at
dnsmasq.c:1310
#6 main (argc=<optimized out>, argv=<optimized out>) at dnsmasq.c:1077

My current work-around is to check if crecp is NULL and then return
from make_non_terminals(). However, I don't know the code well enough
to know if this change will break anything else (though everything
seems to work fine).

There is nothing special with my hosts file, it looks as follows:

127.0.0.1 localhost

::1 localhost ip6-localhost ip6-loopback
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
192.168.5.1 myrouter.lan

If checking for NULL is acceptable, I will be more than happy to send
my patch. If not, I wonder if anyone has any hints on how to proceed
to solve this issue?

BR,
Kristian

Simon Kelley

2018-09-18 22:22:59 UTC

Permalink

Post by Kristian Evensen
Hello,
I recently updated one of my x86-based OpenWRT-routers to the latest
nightly, which bumped dnsmasq to 2.80test6. After the update, I see
that dnsmasq sometimes enters a crash loop. The crash occurs right
0x0000000000406a78 in make_non_terminals
1437 cache.c: No such file or directory.
(gdb) bt
#0 0x0000000000406a78 in make_non_terminals
#1 0x0000000000407192 in add_hosts_entry (cache=0x7ffff7ffefa0,
addr=0x7fffffffea28, addrlen=4, index=2, rhash=<optimized out>,
hashsz=<optimized out>) at cache.c:911
#2 0x00000000004074e1 in read_hostsfile
#3 0x0000000000407810 in cache_reload () at cache.c:1185
#4 0x0000000000418037 in clear_cache_and_reload
#5 0x0000000000405ec3 in async_event (now=1537265437, pipe=15) at
dnsmasq.c:1310
#6 main (argc=<optimized out>, argv=<optimized out>) at dnsmasq.c:1077
My current work-around is to check if crecp is NULL and then return
from make_non_terminals(). However, I don't know the code well enough
to know if this change will break anything else (though everything
seems to work fine).
127.0.0.1 localhost
::1 localhost ip6-localhost ip6-loopback
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
192.168.5.1 myrouter.lan
If checking for NULL is acceptable, I will be more than happy to send
my patch. If not, I wonder if anyone has any hints on how to proceed
to solve this issue?

Thanks for the report. The obvious explanation is that whine_malloc() is
returning NULL, and the code should handle that. whine_malloc only
returns NULL if the system cannot allocate any more memory, which is
possible, but unlikely. Is your router very short on memory?
whine_malloc() is just a wrapper around malloc that logs an error if the
malloc fails (hence the name) Are you seeing messages like

failed to allocate %d bytes

just before the crashes? That would make me more comfortable that we
understand the problem. I think the best solution is to wrap all of

*crecp = *source;
crecp->flags &= ~(F_IPV4 | F_IPV6 | F_CNAME | F_DNSKEY | F_DS |
F_REVERSE);
crecp->flags |= F_NAMEP;
crecp->name.namep = name;

cache_hash(crecp);

with

if (crecp)
{
}

and I'll commit that to the git repo now.

Cheers,

Simon

Kristian Evensen

2018-09-19 07:59:12 UTC

Permalink

Hi Simon,

Thanks for a quick reply.

Post by Simon Kelley
Thanks for the report. The obvious explanation is that whine_malloc() is
returning NULL, and the code should handle that. whine_malloc only
returns NULL if the system cannot allocate any more memory, which is
possible, but unlikely. Is your router very short on memory?

No, the router has plenty of memory (2GB) and I don't see the "failed
to allocate"-message, so I guess whine_malloc() can't be the culprit.
Since I am using OpenWRT, there could be some defines affecting the
line numbers. I tried to read up on how ifdefs affects line numbers in
gdb backtraces to see if the error could be somewhere else than the
"default" line 1437, but I unfortunately couldn't find anything.
Probably my google-foo is a bit rusty.

When looking over my notes, I see that I have made the following
observations related to this bug:

* Crash happens quite rarely.
* I have only seen the bug right after boot.
* When the bug strikes, dnsmasq will enter a crash loop and never
recover. I.e., I can restart dnsmasq as many times as I like, crash
always happens.
* If I start dnsmasq manually and run it in the foreground after a
crash, I also see the error.

So there seems to be something in the system causing this error, but I
can't figure out what.

Post by Simon Kelley
I think the best solution is to wrap all of
*crecp = *source;
crecp->flags &= ~(F_IPV4 | F_IPV6 | F_CNAME | F_DNSKEY | F_DS |
F_REVERSE);
crecp->flags |= F_NAMEP;
crecp->name.namep = name;
cache_hash(crecp);
with
if (crecp)
{
}

Thanks, this is basically the same as my current fix, so I can already
report that it is good :)

BR,
Kristian

Kevin Darbyshire-Bryant

2018-09-19 08:22:33 UTC

Permalink

Post by Kristian Evensen
Hi Simon,
Thanks for a quick reply.

No, the router has plenty of memory (2GB) and I don't see the "failed
to allocate"-message, so I guess whine_malloc() can't be the culprit.
Since I am using OpenWRT, there could be some defines affecting the
line numbers. I tried to read up on how ifdefs affects line numbers in
gdb backtraces to see if the error could be somewhere else than the
"default" line 1437, but I unfortunately couldn't find anything.
Probably my google-foo is a bit rusty.
When looking over my notes, I see that I have made the following
* Crash happens quite rarely.
* I have only seen the bug right after boot.
* When the bug strikes, dnsmasq will enter a crash loop and never
recover. I.e., I can restart dnsmasq as many times as I like, crash
always happens.
* If I start dnsmasq manually and run it in the foreground after a
crash, I also see the error.
So there seems to be something in the system causing this error, but I
can't figure out what.

Thanks, this is basically the same as my current fix, so I can already
report that it is good :)
BR,
Kristian

And I backported the fix into openwrt master this morning.

Cheers,

Kevin D-B

012C ACB2 28C6 C53E 9775 9123 B3A2 389B 9DE2 334A

Simon Kelley

2018-09-19 11:44:03 UTC

Permalink

Post by Kristian Evensen
Hi Simon,
Thanks for a quick reply.

No, the router has plenty of memory (2GB) and I don't see the "failed
to allocate"-message, so I guess whine_malloc() can't be the culprit.
Since I am using OpenWRT, there could be some defines affecting the
line numbers. I tried to read up on how ifdefs affects line numbers in
gdb backtraces to see if the error could be somewhere else than the
"default" line 1437, but I unfortunately couldn't find anything.
Probably my google-foo is a bit rusty.
When looking over my notes, I see that I have made the following
* Crash happens quite rarely.
* I have only seen the bug right after boot.
* When the bug strikes, dnsmasq will enter a crash loop and never
recover. I.e., I can restart dnsmasq as many times as I like, crash
always happens.
* If I start dnsmasq manually and run it in the foreground after a
crash, I also see the error.
So there seems to be something in the system causing this error, but I
can't figure out what.

Thanks, this is basically the same as my current fix, so I can already
report that it is good :)

This all makes me slightly uneasy. I think the "out of memory"
explanation for the crashes you are seeing is not a good one.

The patch is definitely needed. The philosophy for memory allocation
failures is that the code should be not terminate: it logs an error and
continues, but not necessarily behaving correctly. That's about the best
you can do.

Unfortunately, with the patch, if crecp is NULL at this point for some
other reason, it will now fail silently, and behave wrongly (the
non-terminal cache entry will not be created)

It would be really good to find out what actually causes this, and
preferably a way to reproduce it.

Cheers,

Simon.

Kristian Evensen

2018-09-19 12:30:39 UTC

Permalink

Post by Simon Kelley
This all makes me slightly uneasy. I think the "out of memory"
explanation for the crashes you are seeing is not a good one.

No, I agree. I have compiled an OpenWRT image without the fix and
installed it on my device, and I am trying to reproduce the issue.
Will let you know when or if I figure out something.

BR,
Kristian