[Dnsmasq-discuss] dnsmasq stops receiving packets after network restart
Kristian Evensen
2018-09-24 18:12:25 UTC
Hello,

I have some routers running OpenWRT (latest nightly) that I have
to access remotely (using reverse SSH). When I restart networking
(/etc/init.d/network restart), clients on the LAN can no longer obtain
an IP address using DHCP. If I restart networking locally, DHCP works
as expected after the network is back up.

In order to try and figure out what is going on, I have checked/tried
the following:

* I started out by checking if dnsmasq has been restarted and if the
DHCP socket has been created. I can always see the socket in netstat.
* I then took a look at the firewall. I can see the DHCP packets in
the INPUT chain in filter, which according to my understanding of
Netfilter-internals is the last stop before a packet is delivered to a
socket.
* I then instrumented dnsmasq and added some logging in dhcp_packet()
in dhcp.c. This function is never called, as none of my log-messages
are written to syslog. I checked that the logging works by checking
for my messages when DHCP is working.
* Restarting dnsmasq makes DHCP work again. I can't see any difference
in for example netstat-output.

Does anyone have any idea on what to try or where to look next? After
having spent a couple of days on this issue, I am quickly starting to
run out of ideas.

Thanks in advance for any help,
Kristian
Simon Kelley
2018-09-26 16:52:16 UTC
Post by Kristian Evensen
Hello,
I have some routers running OpenWRT (latest nightly) that I have
to access remotely (using reverse SSH). When I restart networking
(/etc/init.d/network restart), clients on the LAN can no longer obtain
an IP address using DHCP. If I restart networking locally, DHCP works
as expected after the network is back up.
In order to try and figure out what is going on, I have checked/tried
* I started out by checking if dnsmasq has been restarted and if the
DHCP socket has been created. I can always see the socket in netstat.
* I then took a look at the firewall. I can see the DHCP packets in
the INPUT chain in filter, which according to my understanding of
Netfilter-internals is the last stop before a packet is delivered to a
socket.
* I then instrumented dnsmasq and added some logging in dhcp_packet()
in dhcp.c. This function is never called, as none of my log-messages
are written to syslog. I checked that the logging works by checking
for my messages when DHCP is working.
* Restarting dnsmasq makes DHCP work again. I can't see any difference
in for example netstat-output.
Does anyone have any idea on what to try or where to look next? After
having spent a couple of days on this issue, I am quickly starting to
run out of ideas.
I wonder if this is caused by dnsmasq using the BINDTODEVICE sockopt on
the DHCP socket. If the networking restart takes down and re-creates the
network interface, then that socket may remain bound to the old
interface.

This comment in whichdevice() in dhcp-common.c describes the condition
under which the binding happens.

/* If we are doing DHCP on exactly one interface, and running linux, do SO_BINDTODEVICE
   to that device. This is for the use case of (eg) OpenStack, which runs a new
   dnsmasq instance for each VLAN interface it creates. Without the BINDTODEVICE,
   individual processes don't always see the packets they should.
   SO_BINDTODEVICE is only available Linux.

   Note that if wildcards are used in --interface, or --interface is not used at all,
   or a configured interface doesn't yet exist, then more interfaces may arrive later,
   so we can't safely assert there is only one interface and proceed.
*/

Simplest test is to make whichdevice always return NULL, and see if that
helps.
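
The condition in that comment can be sketched as a small pure function. This is an illustration only, not the real whichdevice(): the name single_bind_device() and its parameters are hypothetical, a wildcard is assumed to be a name ending in '*', and the "configured interface doesn't yet exist" case is left out to keep the sketch free of system calls.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the decision described in the comment.
   names: the configured --interface values, count: how many there are.
   Returns the one device to bind to, or NULL when binding is not safe. */
const char *single_bind_device(const char *const *names, size_t count)
{
    const char *name;
    size_t len;

    if (count != 1)                /* --interface unused, or several given */
        return NULL;

    name = names[0];
    assert(name != NULL);
    len = strlen(name);
    if (len == 0 || name[len - 1] == '*')
        return NULL;               /* wildcard: more interfaces may arrive */

    return name;                   /* exactly one concrete interface */
}
```

The test suggested above corresponds to making this function return NULL unconditionally, so that the DHCP socket is never bound to a single device.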


Cheers,

Simon.
Kristian Evensen
2018-09-27 13:42:06 UTC
Hi Simon,
Post by Simon Kelley
Simplest test is to make whichdevice always return NULL, and see if that
helps.
Making whichdevice() always return NULL makes the issue go away.
Without the change, DHCP after a network restart (which triggers
recreating devices) only works after I manually restart dnsmasq. With
the change, DHCP works fine. Changing dnsmasq to use two interfaces
also makes the issue disappear. I unfortunately do not know what has
suddenly triggered this error. I see that the code in whichdevice() is
from 2012/2013, so it must be something in a different component.

Carrying a local patch is no problem for me, but I guess a generic
solution is desirable. Would a patch adding a configuration option be
acceptable?

BR,
Kristian
Simon Kelley
2018-09-27 19:53:34 UTC
Post by Kristian Evensen
Hi Simon,
Post by Simon Kelley
Simplest test is to make whichdevice always return NULL, and see if that
helps.
Making whichdevice() always return NULL makes the issue go away.
Without the change, DHCP after a network restart (which triggers
recreating devices) only works after I manually restart dnsmasq. With
the change, DHCP works fine. Changing dnsmasq to use two interfaces
also makes the issue disappear. I unfortunately do not know what has
suddenly triggered this error. I see that the code in whichdevice() is
from 2012/2013, so it must be something in a different component.
Progress. AFAIK, the dnsmasq behaviour around this has not changed at all
in that time period. I think it's likely that the change is in the
OpenWRT network infrastructure, maybe hotplug/coldplug stuff that now
destroys and re-creates the kernel-level network device, rather than
just reloading its configuration.

I run the bleeding edge dnsmasq code (we suffer so you don't have to!)
on an old, stable Chaos-calmer OpenWRT install, and I'm not seeing this
effect, which adds weight to the theory that the change is elsewhere.
Post by Kristian Evensen
Carrying a local patch is no problem for me, but I guess a generic
solution is desirable. Would a patch adding a configuration option be
acceptable?
Dnsmasq is quite clever at handling changes in kernel network level
devices under its feet, maybe there's a way to re-bind when that
happens? I'll have a look. A configuration option would be the last
resort here: adding "pull this lever to make it work" options is
something I try and avoid.



Cheers,

Simon.
Kristian Evensen
2018-09-27 20:07:07 UTC
Hi,
Post by Simon Kelley
Progress. AFAIK, the dnsmasq behaviour around this has not changed at all
in that time period. I think it's likely that the change is in the
OpenWRT network infrastructure, maybe hotplug/coldplug stuff that now
destroys and re-creates the kernel-level network device, rather than
just reloading its configuration.
I run the bleeding edge dnsmasq code (we suffer so you don't have to!)
on an old, stable Chaos-calmer OpenWRT install, and I'm not seeing this
effect, which adds weight to the theory that the change is elsewhere.
Yes, I agree. I also haven't seen this error up until recently, so
there is something else that has broken. I will try to dig a bit when
or if I have time, and see if I can discover something.
Post by Simon Kelley
Dnsmasq is quite clever at handling changes in kernel network level
devices under its feet, maybe there's a way to re-bind when that
happens? I'll have a look. A configuration option would be the last
resort here: adding "pull this lever to make it work" options is
something I try and avoid.
I agree here as well. I checked if there was a socket event we were
missing, but at least no event was received on my boxes. I guess the
most elegant approach would be to monitor RTNLGRP_LINK for DELLINK,
and close the socket when DELLINK arrives. The socket could then be
recreated on NEWLINK, or, probably even better, NEWADDR.
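
A rough sketch of that idea, assuming Linux rtnetlink and not taken from dnsmasq: subscribe to link and IPv4 address notifications, then map each message type to an action on the DHCP socket. The enum and the functions action_for(), open_link_monitor() and scan_messages() are hypothetical names; RTMGRP_LINK is the legacy bitmask form of RTNLGRP_LINK. A real implementation would presumably also compare the interface index carried in each message against the bound device, which is omitted here.

```c
#include <assert.h>
#include <linux/netlink.h>
#include <linux/rtnetlink.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

enum dhcp_sock_action { SOCK_KEEP, SOCK_CLOSE, SOCK_REOPEN };

/* Map one rtnetlink message type to the action proposed above:
   close on DELLINK, re-create on NEWADDR, otherwise do nothing. */
enum dhcp_sock_action action_for(unsigned short nlmsg_type)
{
    switch (nlmsg_type) {
    case RTM_DELLINK: return SOCK_CLOSE;
    case RTM_NEWADDR: return SOCK_REOPEN;
    default:          return SOCK_KEEP;
    }
}

/* Open a netlink socket subscribed to link and IPv4 address events.
   Returns the descriptor, or -1 on error. */
int open_link_monitor(void)
{
    struct sockaddr_nl addr;
    int fd = socket(AF_NETLINK, SOCK_RAW, NETLINK_ROUTE);

    if (fd == -1)
        return -1;
    memset(&addr, 0, sizeof(addr));
    addr.nl_family = AF_NETLINK;
    addr.nl_groups = RTMGRP_LINK | RTMGRP_IPV4_IFADDR;
    if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) == -1) {
        close(fd);
        return -1;
    }
    return fd;
}

/* Walk one buffer as filled by recv() on the monitor socket and
   return the last action that is not SOCK_KEEP, if there is one. */
enum dhcp_sock_action scan_messages(void *buf, int len)
{
    enum dhcp_sock_action result = SOCK_KEEP;
    struct nlmsghdr *h;

    assert(buf != NULL);
    for (h = buf; NLMSG_OK(h, len); h = NLMSG_NEXT(h, len)) {
        enum dhcp_sock_action a = action_for(h->nlmsg_type);
        if (a != SOCK_KEEP)
            result = a;
    }
    return result;
}
```

An event loop would poll the descriptor from open_link_monitor() alongside the DHCP socket, pass each received buffer to scan_messages(), and close or re-create the DHCP socket according to the result.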

BR,
Kristian
