Discussion:
[Dnsmasq-discuss] dnsmasq socket file disappearing
Zi Dvbelju
2018-01-31 21:12:58 UTC
Permalink
I’m experiencing an issue where all DNS resolutions sent to dnsmasq
time out, but only after the dnsmasq service has been running
successfully for a period of time (anecdotally, after a few weeks).
After a lot of digging, I’ve discovered that dnsmasq’s UDP socket file
will eventually “disappear”. The issue can be resolved by restarting
the dnsmasq service.

I haven’t been able to reproduce it yet, but it has happened numerous
times on servers which are running dozens of docker containers. From
what I know, nothing should be removing this socket file, and I can’t
find anything relevant in the dnsmasq logs. Is anyone aware of any
situations that can cause socket files to disappear?

Environment
Ubuntu 16.04.3 LTS
8 cores, 16GB of RAM
Dnsmasq 2.75-1ubuntu0.16.04.4

Background
I’m using dnsmasq to forward requests to Consul
<https://www.consul.io/docs/guides/forwarding.html>, which is used for
service discovery. The Consul agent listens on port 8600 and is
configured to bind to all interfaces (the relevant interface here is
172.17.0.1, which docker creates).

Resolv.conf
```
# Dynamic resolv.conf(5) file for glibc resolver(3) generated by resolvconf(8)
nameserver 127.0.0.1
```

Dnsmasq.conf
```
server=/consul/172.17.0.1#8600
server=/10.in-addr.arpa/172.17.0.1#8600
bind-dynamic
```

Systemd config for Docker
```
ExecStart=/usr/bin/dockerd --bip=172.17.0.1/24 --dns=172.17.0.1 -H fd://
```

While investigating the servers in the broken state, I observed the
following:

- nslookup / dig DNS resolutions are timing out
- Docker logs show containers are also timing out on DNS resolutions
- Systemd reports that dnsmasq is still running; the pid still exists
- DNS resolutions sent directly to the consul agent (127.0.0.1:8600) succeed
- DNS resolutions sent to system[dnsmasq] (127.0.0.1:53) time out
- IPv6 UDP (::1) resolutions sent to dnsmasq succeeded
- Netstat shows that the IPv4 UDP socket file for dnsmasq is missing
- No relevant messages in the kernel log (specifically, no dnsmasq OOM kill events)
- File descriptor usage for the entire server was normal
- File descriptor usage for the individual dnsmasq process was normal
- CPU, RAM, and storage all look good

Thanks in advance for any discussion at all - I've been really
struggling with this one for a while now.
Zach
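P.S. The check I've been doing by hand on a broken box can be scripted.
This is just a sketch (the `consul` name and the ports are the ones from
my setup above; `build_query` and `probe` are hypothetical helpers, and
`probe` sends a single UDP query and reports whether anything answers
before the timeout):

```python
import socket
import struct

def build_query(name, qtype=6, qid=0x1234):
    """Build a minimal DNS query packet (qtype 6 = SOA, class IN)."""
    # Header: id, flags (RD bit set), 1 question, 0 answer/authority/additional.
    header = struct.pack(">HHHHHH", qid, 0x0100, 1, 0, 0, 0)
    qname = b"".join(
        bytes([len(label)]) + label.encode()
        for label in name.strip(".").split(".")
    ) + b"\x00"
    return header + qname + struct.pack(">HH", qtype, 1)

def probe(server, port=53, timeout=2.0):
    """Send one UDP query and report whether any reply arrives in time."""
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    s.settimeout(timeout)
    try:
        s.sendto(build_query("consul"), (server, port))
        s.recvfrom(512)
        return True
    except socket.timeout:
        return False
    finally:
        s.close()

# Expected on a broken node: probe("127.0.0.1", 8600) answers (Consul,
# direct) while probe("127.0.0.1", 53) times out (dnsmasq's IPv4 UDP
# socket is gone).
```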
Shankar Unni
2018-02-03 01:36:42 UTC
Permalink
I, too, have observed this, occasionally, under OpenWrt. If you do figure
this out, or get a solution/explanation off-line, we would really
appreciate an update.
_______________________________________________
Dnsmasq-discuss mailing list
http://lists.thekelleys.org.uk/mailman/listinfo/dnsmasq-discuss
Simon Kelley
2018-02-04 15:13:26 UTC
Permalink
You're using --bind-dynamic, so dnsmasq creates a set of listener
sockets, one for each of the addresses on the machine's interfaces. When
such an address is removed or created, dnsmasq gets an event via
netlink, and it does two things:

1) Enumerates the current set of addresses on the machine's interfaces.
2) Updates the set of listening sockets to reflect that set: creating
new sockets for new addresses, and deleting sockets for addresses which
are no longer in use.

That process is an obvious potential source of the behaviour you're seeing.
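For anyone who wants to watch those events as they arrive, the same
netlink subscription can be opened from userspace. A rough Linux-only
Python sketch, to be clear, not dnsmasq's actual C code:

```python
import socket
import struct

# Multicast groups and message types from <linux/rtnetlink.h>.
RTMGRP_IPV4_IFADDR = 0x10
RTMGRP_IPV6_IFADDR = 0x100
RTM_NEWADDR, RTM_DELADDR = 20, 21

def open_addr_monitor():
    """Subscribe to the address add/remove events dnsmasq reacts to."""
    s = socket.socket(socket.AF_NETLINK, socket.SOCK_RAW, socket.NETLINK_ROUTE)
    s.bind((0, RTMGRP_IPV4_IFADDR | RTMGRP_IPV6_IFADDR))
    return s

def watch(s):
    """Print one line per address change; run alongside dnsmasq to log churn."""
    while True:
        data = s.recv(65535)
        off = 0
        while off + 16 <= len(data):
            # struct nlmsghdr: len, type, flags, seq, pid (16 bytes).
            length, mtype, _flags, _seq, _pid = struct.unpack_from("=IHHII", data, off)
            if length < 16:
                break
            if mtype == RTM_NEWADDR:
                print("address added")
            elif mtype == RTM_DELADDR:
                print("address removed")
            off += (length + 3) & ~3  # NLMSG_ALIGN
```

Running `watch(open_addr_monitor())` while forcing container churn
should show exactly the events that trigger dnsmasq's socket rebuild.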

I'm slightly confused that your description talks about THE socket file:
there should be one for each address possessed by the machine. To try
and get a handle on what's happening, we need to see what's happening
to all the members of that set.

"DNS resolutions sent to system[dnsmasq] (127.0.0.1:53) time out"

implies that the socket listening on 127.0.0.1:53 is going, but
are there still sockets listening on port 53 for other addresses, or are
all the UDP sockets going?

There is a second set of sockets listening on the same addresses/ports
for TCP connections. It would be interesting to see whether the TCP
sockets go as well, or if it's only the UDP sockets that disappear.
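One way to check both sets without netstat's truncated columns is to
read /proc/net directly. A quick sketch, assuming the standard
/proc/net/{udp,tcp} layout and IPv4 only:

```python
import socket
import struct

def port53_listeners(path):
    """Return local IPv4 addresses with a port-53 socket, parsed from a
    /proc/net/udp- or /proc/net/tcp-format file."""
    addrs = set()
    with open(path) as f:
        next(f)  # skip the header line
        for line in f:
            hexaddr, hexport = line.split()[1].split(":")
            if int(hexport, 16) != 53:
                continue
            # /proc stores the IPv4 address as little-endian hex.
            addrs.add(socket.inet_ntoa(struct.pack("<I", int(hexaddr, 16))))
    return addrs

# Comparing port53_listeners("/proc/net/udp") with
# port53_listeners("/proc/net/tcp") on a broken node would show whether
# the TCP listeners survive when the UDP ones go.
```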

Is the network config of the docker containers changed at any time? If
so, forcing that is an obvious way of trying to reproduce this problem.
Given that 127.0.0.1 is an address which disappears, anything which
fiddles with the lo interface is particularly suspect.



Cheers,


Simon.
Zi Dvbelju
2018-02-05 15:43:05 UTC
Permalink
Hey Simon,

Thanks for looking into this. It just happened again and I have more data.

To clarify a few things: docker is configured to send *all* DNS requests
to 172.17.0.1:53 (docker itself creates the 172.17.0.1 interface). I
implemented dnsmasq with bind-dynamic because docker would typically
start *after* dnsmasq, so dnsmasq would never get a chance to bind to
172.17.0.1. The IPv6 sockets that dnsmasq binds to (shown below) are
never actually utilized by anything in my system; they're just a side
effect of the current configuration.

Sample ifconfig
```
docker0   Link encap:Ethernet  HWaddr 02:42:ea:b9:37:ce
          inet addr:172.17.0.1  Bcast:172.17.0.255  Mask:255.255.255.0
          inet6 addr: fe80::42:eaff:feb9:37ce/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:1137922568 errors:0 dropped:0 overruns:0 frame:0
          TX packets:1183858585 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:694264821785 (694.2 GB)  TX bytes:731462199754 (731.4 GB)

ens5      Link encap:Ethernet  HWaddr 02:f3:39:28:91:08
          inet addr:10.3.4.228  Bcast:10.3.4.255  Mask:255.255.255.0
          inet6 addr: fe80::f3:39ff:fe28:9108/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:9001  Metric:1
          RX packets:1414749050 errors:0 dropped:0 overruns:0 frame:0
          TX packets:1409286954 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:746355077483 (746.3 GB)  TX bytes:716453625946 (716.4 GB)

lo        Link encap:Local Loopback
          inet addr:127.0.0.1  Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:65536  Metric:1
          RX packets:39175242 errors:0 dropped:0 overruns:0 frame:0
          TX packets:39175242 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1
          RX bytes:15795890457 (15.7 GB)  TX bytes:15795890457 (15.7 GB)

```

Failed Node (netstat)
```
udp6 0 0 fe80::888a:23ff:fe7a:53 :::*
1359/dnsmasq
udp6 0 0 fe80::d817:59ff:feb4:53 :::*
1359/dnsmasq
udp6 0 0 fe80::7c83:e5ff:fe73:53 :::*
1359/dnsmasq
udp6 0 0 fe80::387c:fff:fe8b::53 :::*
1359/dnsmasq
udp6 0 0 fe80::844c:15ff:fe94:53 :::*
1359/dnsmasq
udp6 0 0 fe80::8402:6bff:fefa:53 :::*
1359/dnsmasq
udp6 0 0 fe80::740e:c1ff:feef:53 :::*
1359/dnsmasq
udp6 0 0 fe80::f841:8fff:fefb:53 :::*
1359/dnsmasq
udp6 0 0 fe80::d0b3:42ff:fe20:53 :::*
1359/dnsmasq
udp6 0 0 fe80::984b:45ff:fe2b:53 :::*
1359/dnsmasq
udp6 0 0 fe80::e4cc:c4ff:fe87:53 :::*
1359/dnsmasq
udp6 0 0 fe80::704d:afff:fec4:53 :::*
1359/dnsmasq
udp6 0 0 fe80::707c:7eff:fea2:53 :::*
1359/dnsmasq
udp6 0 0 fe80::c8c3:70ff:fe8c:53 :::*
1359/dnsmasq
udp6 0 0 fe80::506b:10ff:fe1a:53 :::*
1359/dnsmasq
udp6 0 0 fe80::88c:2eff:fe38::53 :::*
1359/dnsmasq
udp6 0 0 fe80::4478:d5ff:fea0:53 :::*
1359/dnsmasq
udp6 0 0 fe80::1c34:96ff:fe2c:53 :::*
1359/dnsmasq
udp6 0 0 fe80::1079:21ff:fe36:53 :::*
1359/dnsmasq
udp6 0 0 fe80::4dc:53ff:fe8a::53 :::*
1359/dnsmasq
udp6 0 0 fe80::c851:e0ff:fe4c:53 :::*
1359/dnsmasq
udp6 0 0 fe80::4c0:e3ff:fe15::53 :::*
1359/dnsmasq
udp6 0 0 fe80::fcca:afff:fe23:53 :::*
1359/dnsmasq
udp6 0 0 fe80::a067:8cff:fefb:53 :::*
1359/dnsmasq
udp6 0 0 fe80::1081:54ff:feca:53 :::*
1359/dnsmasq
udp6 0 0 fe80::fc67:89ff:fe1a:53 :::*
1359/dnsmasq
udp6 0 0 fe80::a892:65ff:fe5a:53 :::*
1359/dnsmasq
udp6 0 0 fe80::689e:e5ff:fe4d:53 :::*
1359/dnsmasq
udp6 0 0 fe80::9cc5:59ff:fe94:53 :::*
1359/dnsmasq
udp6 0 0 fe80::f031:faff:feb8:53 :::*
1359/dnsmasq
udp6 0 0 fe80::ccf2:e4ff:fe3f:53 :::*
1359/dnsmasq
udp6 0 0 fe80::5c66:efff:fe89:53 :::*
1359/dnsmasq
udp6 0 0 fe80::2caf:fcff:fe07:53 :::*
1359/dnsmasq
udp6 0 0 fe80::ac43:e1ff:fe44:53 :::*
1359/dnsmasq
udp6 0 0 fe80::1c3f:ccff:fe94:53 :::*
1359/dnsmasq
udp6 0 0 fe80::2ca6:27ff:feac:53 :::*
1359/dnsmasq
udp6 0 0 fe80::dc16:3bff:fe2b:53 :::*
1359/dnsmasq
udp6 0 0 fe80::58e8:ddff:fe94:53 :::*
1359/dnsmasq
udp6 0 0 fe80::9c60:f7ff:fe12:53 :::*
1359/dnsmasq
udp6 0 0 fe80::d02d:a9ff:fe52:53 :::*
1359/dnsmasq
udp6 0 0 fe80::42:eaff:feb9:3:53 :::*
1359/dnsmasq
udp6 0 0 fe80::c840:a8ff:fe6d:53 :::*
1359/dnsmasq
udp6 0 0 ::1:53 :::*
1359/dnsmasq
udp6 0 0 fe80::f3:39ff:fe28:9:53 :::*
1359/dnsmasq
```

Healthy Node (netstat)
```
udp        0      0 127.0.0.1:53            0.0.0.0:*       1400/dnsmasq
udp        0      0 10.3.1.79:53            0.0.0.0:*       1400/dnsmasq
udp        0      0 172.17.0.1:53           0.0.0.0:*       1400/dnsmasq
udp6 0 0 fe80::54d7:eff:fe7a::53 :::*
1400/dnsmasq
udp6 0 0 fe80::6ccd:80ff:fe39:53 :::*
1400/dnsmasq
udp6 0 0 fe80::b870:1dff:fe4f:53 :::*
1400/dnsmasq
udp6 0 0 fe80::b8f9:20ff:fe9b:53 :::*
1400/dnsmasq
udp6 0 0 fe80::7416:9bff:fed7:53 :::*
1400/dnsmasq
udp6 0 0 fe80::94d7:29ff:fe2a:53 :::*
1400/dnsmasq
udp6 0 0 fe80::b4e1:a0ff:fe7e:53 :::*
1400/dnsmasq
udp6 0 0 fe80::84b8:20ff:feb2:53 :::*
1400/dnsmasq
udp6 0 0 fe80::603f:dbff:feec:53 :::*
1400/dnsmasq
udp6 0 0 fe80::284f:5fff:fe4f:53 :::*
1400/dnsmasq
udp6 0 0 fe80::100b:7aff:fe23:53 :::*
1400/dnsmasq
udp6 0 0 fe80::b088:7eff:fe20:53 :::*
1400/dnsmasq
udp6 0 0 fe80::4434:10ff:fe9e:53 :::*
1400/dnsmasq
udp6 0 0 fe80::f052:16ff:fe10:53 :::*
1400/dnsmasq
udp6 0 0 fe80::789b:19ff:fefd:53 :::*
1400/dnsmasq
udp6 0 0 fe80::c66:baff:fe98::53 :::*
1400/dnsmasq
udp6 0 0 fe80::f08c:f7ff:fe7d:53 :::*
1400/dnsmasq
udp6 0 0 fe80::9898:dff:fe21::53 :::*
1400/dnsmasq
udp6 0 0 fe80::3847:85ff:fe30:53 :::*
1400/dnsmasq
udp6 0 0 fe80::98cd:23ff:fe7f:53 :::*
1400/dnsmasq
udp6 0 0 fe80::c9b:6bff:fed6::53 :::*
1400/dnsmasq
udp6 0 0 fe80::b002:f6ff:fe8a:53 :::*
1400/dnsmasq
udp6 0 0 fe80::8cf8:63ff:fe9a:53 :::*
1400/dnsmasq
udp6 0 0 fe80::4095:1eff:fe09:53 :::*
1400/dnsmasq
udp6 0 0 fe80::e8cf:10ff:fe32:53 :::*
1400/dnsmasq
udp6 0 0 fe80::fca4:12ff:fed7:53 :::*
1400/dnsmasq
udp6 0 0 fe80::d4e6:d4ff:fe7e:53 :::*
1400/dnsmasq
udp6 0 0 fe80::8c74:57ff:fe24:53 :::*
1400/dnsmasq
udp6 0 0 fe80::c0bb:f3ff:fe49:53 :::*
1400/dnsmasq
udp6 0 0 fe80::b462:beff:fe75:53 :::*
1400/dnsmasq
udp6 0 0 fe80::c816:f6ff:fec8:53 :::*
1400/dnsmasq
udp6 0 0 fe80::409f:6dff:feb8:53 :::*
1400/dnsmasq
udp6 0 0 fe80::30b8:caff:fec8:53 :::*
1400/dnsmasq
udp6 0 0 fe80::42:12ff:fe82:5:53 :::*
1400/dnsmasq
udp6 0 0 ::1:53 :::*
1400/dnsmasq
udp6 0 0 fe80::468:88ff:fe00::53 :::*
1400/dnsmasq

```

The 'udp' (not udp6) sockets above are actually the only ones I care
about - these are the ones that are mysteriously disappearing.
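To spot this faster than eyeballing netstat, the listener set can be
diffed against what I expect. A hypothetical helper; the line format is
`netstat -ulpn`'s:

```python
def missing_ipv4_listeners(netstat_lines, expected):
    """Return the expected IPv4 addresses that lack a UDP port-53
    listener in `netstat -ulpn`-style output."""
    seen = set()
    for line in netstat_lines:
        fields = line.split()
        if len(fields) < 4 or fields[0] != "udp":
            continue  # skip udp6 rows and headers
        addr, _, port = fields[3].rpartition(":")
        if port == "53":
            seen.add(addr)
    return [a for a in expected if a not in seen]

# Against the healthy node's output this returns []; against the failed
# node's it returns all three of 127.0.0.1, 10.3.4.228 and 172.17.0.1.
```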

Failed Node Logs
```
/var/log/syslog.48:2018-02-03T13:36:23.540231+00:00 dnsmasq[1359]: failed
to create listening socket for fe80::fc42:b2ff:fe85:7702: No such device
/var/log/syslog.48:2018-02-03T13:36:23.540442+00:00 dnsmasq[1359]: failed
to create listening socket for fe80::fc42:b2ff:fe85:7702: No such device
/var/log/syslog.69:2018-02-02T16:56:16.907906+00:00 dnsmasq[1359]: failed
to create listening socket for fe80::dcb4:e5ff:fe94:9b31: No such device
/var/log/syslog.69:2018-02-02T16:56:16.908069+00:00 dnsmasq[1359]: failed
to create listening socket for fe80::dcb4:e5ff:fe94:9b31: No such device
/var/log/syslog.7:2018-02-05T06:05:31.066648+00:00 dnsmasq[1359]: failed to
create listening socket for 172.17.0.1: Address already in use
/var/log/syslog.7:2018-02-05T06:05:31.066813+00:00 dnsmasq[1359]: failed to
create listening socket for 10.3.4.228: Address already in use
/var/log/syslog.7:2018-02-05T06:05:31.066917+00:00 dnsmasq[1359]: failed to
create listening socket for 127.0.0.1: Address already in use
```
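To correlate these with anything else in the rotated syslogs, a small
scanner helps; the regex matches the message format above (the scanner
itself is just a convenience I wrote, not part of dnsmasq):

```python
import re

# Matches the dnsmasq lines quoted above, with or without the
# syslog filename/timestamp prefix.
PATTERN = re.compile(
    r"dnsmasq\[(\d+)\]: failed to create listening socket for (\S+): (.+)$"
)

def socket_failures(lines):
    """Extract (pid, address, reason) tuples from syslog lines."""
    out = []
    for line in lines:
        m = PATTERN.search(line)
        if m:
            out.append((int(m.group(1)), m.group(2), m.group(3)))
    return out
```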

Hopefully this helps clarify what I'm seeing. Please let me know if you
need any additional information.

Thanks,
Zach
Simon Kelley
2018-02-06 18:26:47 UTC
Permalink
That's useful, thank you. I think we're on the right lines that
dnsmasq's response to a change in the interface configuration is where
it's all going wrong.


Am I correct that this event is happening long after dnsmasq starts?

Do you have anything in the log just before this happens to indicate
that some other part of the system may be reconfiguring the networking,
ie adding or removing addresses from interfaces?


Cheers,

Simon.
Post by Zi Dvbelju
Hey Simon,
Thanks for looking into this. It just happened again and I have more data. 
To clarify a few things, docker is configured to send *all* dns requests
to 172.17.0.1:53 <http://172.17.0.1:53> (docker itself creates the
172.17.0.1 interface)*.* I implemented dnsmasq with bind-dynamic because
docker would typically start /after/dnsmasq; dnsmasq would never get a
chance to bind to 172.17.0.1. The ipv6 sockets that dnsmasq binds to
(shown below) are never actually utilized by anything in my system, just
a side effect of the current configuration. 
Sample ifconfig
```
docker0   Link encap:Ethernet  HWaddr 02:42:ea:b9:37:ce
          inet addr:172.17.0.1  Bcast:172.17.0.255  Mask:255.255.255.0
          inet6 addr: fe80::42:eaff:feb9:37ce/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:1137922568 errors:0 dropped:0 overruns:0 frame:0
          TX packets:1183858585 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:694264821785 (694.2 GB)  TX bytes:731462199754 (731.4 GB)
ens5      Link encap:Ethernet  HWaddr 02:f3:39:28:91:08
          inet addr:10.3.4.228  Bcast:10.3.4.255  Mask:255.255.255.0
          inet6 addr: fe80::f3:39ff:fe28:9108/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:9001  Metric:1
          RX packets:1414749050 errors:0 dropped:0 overruns:0 frame:0
          TX packets:1409286954 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:746355077483 (746.3 GB)  TX bytes:716453625946 (716.4 GB)
lo        Link encap:Local Loopback
          inet addr:127.0.0.1  Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:65536  Metric:1
          RX packets:39175242 errors:0 dropped:0 overruns:0 frame:0
          TX packets:39175242 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1
          RX bytes:15795890457 (15.7 GB)  TX bytes:15795890457 (15.7 GB)
```
Failed Node (netstat)
```
udp6       0      0 fe80::888a:23ff:fe7a:53 :::*                       
        1359/dnsmasq
udp6       0      0 fe80::d817:59ff:feb4:53 :::*                       
        1359/dnsmasq
udp6       0      0 fe80::7c83:e5ff:fe73:53 :::*                       
        1359/dnsmasq
udp6       0      0 fe80::387c:fff:fe8b::53 :::*                       
        1359/dnsmasq
udp6       0      0 fe80::844c:15ff:fe94:53 :::*                       
        1359/dnsmasq
udp6       0      0 fe80::8402:6bff:fefa:53 :::*                       
        1359/dnsmasq
udp6       0      0 fe80::740e:c1ff:feef:53 :::*                       
        1359/dnsmasq
udp6       0      0 fe80::f841:8fff:fefb:53 :::*                       
        1359/dnsmasq
udp6       0      0 fe80::d0b3:42ff:fe20:53 :::*                       
        1359/dnsmasq
udp6       0      0 fe80::984b:45ff:fe2b:53 :::*                       
        1359/dnsmasq
udp6       0      0 fe80::e4cc:c4ff:fe87:53 :::*                       
        1359/dnsmasq
udp6       0      0 fe80::704d:afff:fec4:53 :::*                       
        1359/dnsmasq
udp6       0      0 fe80::707c:7eff:fea2:53 :::*                       
        1359/dnsmasq
udp6       0      0 fe80::c8c3:70ff:fe8c:53 :::*                       
        1359/dnsmasq
udp6       0      0 fe80::506b:10ff:fe1a:53 :::*                       
        1359/dnsmasq
udp6       0      0 fe80::88c:2eff:fe38::53 :::*                       
        1359/dnsmasq
udp6       0      0 fe80::4478:d5ff:fea0:53 :::*                       
        1359/dnsmasq
udp6       0      0 fe80::1c34:96ff:fe2c:53 :::*                       
        1359/dnsmasq
udp6       0      0 fe80::1079:21ff:fe36:53 :::*                       
        1359/dnsmasq
udp6       0      0 fe80::4dc:53ff:fe8a::53 :::*                       
        1359/dnsmasq
udp6       0      0 fe80::c851:e0ff:fe4c:53 :::*                       
        1359/dnsmasq
udp6       0      0 fe80::4c0:e3ff:fe15::53 :::*                       
        1359/dnsmasq
udp6       0      0 fe80::fcca:afff:fe23:53 :::*                       
        1359/dnsmasq
udp6       0      0 fe80::a067:8cff:fefb:53 :::*                       
        1359/dnsmasq
udp6       0      0 fe80::1081:54ff:feca:53 :::*                       
        1359/dnsmasq
udp6       0      0 fe80::fc67:89ff:fe1a:53 :::*                       
        1359/dnsmasq
udp6       0      0 fe80::a892:65ff:fe5a:53 :::*                       
        1359/dnsmasq
udp6       0      0 fe80::689e:e5ff:fe4d:53 :::*                       
        1359/dnsmasq
udp6       0      0 fe80::9cc5:59ff:fe94:53 :::*                       
        1359/dnsmasq
udp6       0      0 fe80::f031:faff:feb8:53 :::*                       
        1359/dnsmasq
udp6       0      0 fe80::ccf2:e4ff:fe3f:53 :::*                       
        1359/dnsmasq
udp6       0      0 fe80::5c66:efff:fe89:53 :::*                       
        1359/dnsmasq
udp6       0      0 fe80::2caf:fcff:fe07:53 :::*                       
        1359/dnsmasq
udp6       0      0 fe80::ac43:e1ff:fe44:53 :::*                       
        1359/dnsmasq
udp6       0      0 fe80::1c3f:ccff:fe94:53 :::*                       
        1359/dnsmasq
udp6       0      0 fe80::2ca6:27ff:feac:53 :::*                       
        1359/dnsmasq
udp6       0      0 fe80::dc16:3bff:fe2b:53 :::*                       
        1359/dnsmasq
udp6       0      0 fe80::58e8:ddff:fe94:53 :::*                       
        1359/dnsmasq
udp6       0      0 fe80::9c60:f7ff:fe12:53 :::*                       
        1359/dnsmasq
udp6       0      0 fe80::d02d:a9ff:fe52:53 :::*                       
        1359/dnsmasq
udp6       0      0 fe80::42:eaff:feb9:3:53 :::*                       
        1359/dnsmasq
udp6       0      0 fe80::c840:a8ff:fe6d:53 :::*                       
        1359/dnsmasq
udp6       0      0 ::1:53                  :::*                       
                  1359/dnsmasq
udp6       0      0 fe80::f3:39ff:fe28:9:53 :::*                       
        1359/dnsmasq
```
Healthy Node (netstat)
```
*udp        0      0 127.0.0.1:53 <http://127.0.0.1:53>           
0.0.0.0:*                           1400/dnsmasq*
*udp        0      0 10.3.1.79:53 <http://10.3.1.79:53>           
0.0.0.0:*                           1400/dnsmasq*
*udp        0      0 172.17.0.1:53 <http://172.17.0.1:53>         
 0.0.0.0:*                           1400/dnsmasq*
udp6       0      0 fe80::54d7:eff:fe7a::53 :::*                       
        1400/dnsmasq
udp6       0      0 fe80::6ccd:80ff:fe39:53 :::*                       
        1400/dnsmasq
udp6       0      0 fe80::b870:1dff:fe4f:53 :::*                       
        1400/dnsmasq
udp6       0      0 fe80::b8f9:20ff:fe9b:53 :::*                       
        1400/dnsmasq
udp6       0      0 fe80::7416:9bff:fed7:53 :::*                       
        1400/dnsmasq
udp6       0      0 fe80::94d7:29ff:fe2a:53 :::*                       
        1400/dnsmasq
udp6       0      0 fe80::b4e1:a0ff:fe7e:53 :::*                       
        1400/dnsmasq
udp6       0      0 fe80::84b8:20ff:feb2:53 :::*                       
        1400/dnsmasq
udp6       0      0 fe80::603f:dbff:feec:53 :::*                       
        1400/dnsmasq
udp6       0      0 fe80::284f:5fff:fe4f:53 :::*                       
        1400/dnsmasq
udp6       0      0 fe80::100b:7aff:fe23:53 :::*                       
        1400/dnsmasq
udp6       0      0 fe80::b088:7eff:fe20:53 :::*                       
        1400/dnsmasq
udp6       0      0 fe80::4434:10ff:fe9e:53 :::*                       
        1400/dnsmasq
udp6       0      0 fe80::f052:16ff:fe10:53 :::*                       
        1400/dnsmasq
udp6       0      0 fe80::789b:19ff:fefd:53 :::*                       
        1400/dnsmasq
udp6       0      0 fe80::c66:baff:fe98::53 :::*                       
        1400/dnsmasq
udp6       0      0 fe80::f08c:f7ff:fe7d:53 :::*                       
        1400/dnsmasq
udp6       0      0 fe80::9898:dff:fe21::53 :::*                       
        1400/dnsmasq
udp6       0      0 fe80::3847:85ff:fe30:53 :::*                       
        1400/dnsmasq
udp6       0      0 fe80::98cd:23ff:fe7f:53 :::*                       
        1400/dnsmasq
udp6       0      0 fe80::c9b:6bff:fed6::53 :::*                       
        1400/dnsmasq
udp6       0      0 fe80::b002:f6ff:fe8a:53 :::*                       
        1400/dnsmasq
udp6       0      0 fe80::8cf8:63ff:fe9a:53 :::*                       
        1400/dnsmasq
udp6       0      0 fe80::4095:1eff:fe09:53 :::*                       
        1400/dnsmasq
udp6       0      0 fe80::e8cf:10ff:fe32:53 :::*                       
        1400/dnsmasq
udp6       0      0 fe80::fca4:12ff:fed7:53 :::*                       
        1400/dnsmasq
udp6       0      0 fe80::d4e6:d4ff:fe7e:53 :::*                       
        1400/dnsmasq
udp6       0      0 fe80::8c74:57ff:fe24:53 :::*                       
        1400/dnsmasq
udp6       0      0 fe80::c0bb:f3ff:fe49:53 :::*                       
        1400/dnsmasq
udp6       0      0 fe80::b462:beff:fe75:53 :::*                       
        1400/dnsmasq
udp6       0      0 fe80::c816:f6ff:fec8:53 :::*                       
        1400/dnsmasq
udp6       0      0 fe80::409f:6dff:feb8:53 :::*                       
        1400/dnsmasq
udp6       0      0 fe80::30b8:caff:fec8:53 :::*                       
        1400/dnsmasq
udp6       0      0 fe80::42:12ff:fe82:5:53 :::*                       
        1400/dnsmasq
udp6       0      0 ::1:53                  :::*                       
                 1400/dnsmasq
udp6       0      0 fe80::468:88ff:fe00::53 :::*                       
        1400/dnsmasq
```
The above 'udp' interfaces (not udp6) are actually the only ones I care
about - these are the ones that are mysteriously disappearing.
Failed Node Logs
```
failed to create listening socket for fe80::fc42:b2ff:fe85:7702: No such
device
failed to create listening socket for fe80::fc42:b2ff:fe85:7702: No such
device
failed to create listening socket for fe80::dcb4:e5ff:fe94:9b31: No such
device
failed to create listening socket for fe80::dcb4:e5ff:fe94:9b31: No such
device
/var/log/syslog.7:2018-02-05T06:05:31.066648+00:00 dnsmasq[1359]: failed
to create listening socket for 172.17.0.1: Address already in use
/var/log/syslog.7:2018-02-05T06:05:31.066813+00:00 dnsmasq[1359]: failed
to create listening socket for 10.3.4.228: Address already in use
/var/log/syslog.7:2018-02-05T06:05:31.066917+00:00 dnsmasq[1359]: failed
to create listening socket for 127.0.0.1: Address already in use
```
Hopefully this helps clarify what I'm seeing. Please let me know if you
need any additional information.
Thanks,
Zach
You're using --bind-dynamic, so dnsmasq creates a set of listener
sockets, one for each of the addresses on the machine's interfaces. When
such an address is removed or created, dnsmasq gets an event via
netlink, and it does two things:
1) Enumerates the current set of addresses on the machines interfaces.
2) Updates the set of listening sockets to reflect that set: creating
new sockets for new addresses, and deleting sockets for addresses which
are no longer in use.
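The two steps above amount to a set reconciliation. As an illustration only (not dnsmasq's actual C code), a minimal Python sketch, where `open_socket` and `close_socket` are hypothetical helpers:

```python
def reconcile(listeners, current_addrs, open_socket, close_socket):
    """Bring the set of listening sockets in line with the addresses
    currently configured on the interfaces.

    `listeners` maps address -> socket; open_socket/close_socket are
    hypothetical helpers standing in for the real bind/close calls."""
    for addr in current_addrs - set(listeners):
        listeners[addr] = open_socket(addr)   # new address: create a socket
    for addr in set(listeners) - current_addrs:
        close_socket(listeners.pop(addr))     # address gone: drop its socket
    return listeners
```

If either half of this goes wrong on a netlink event (say, the create fails while the old socket has already been closed), a listener can vanish even though its address is still in use.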
That process is an obvious potential source of the behaviour you're
seeing. There should be one socket for each address possessed by the
machine, so to try and get a handle on what's happening, we need to see
what's happening to all the members of that set.
"DNS resolutions sent to system[dnsmasq] (127.0.0.1:53) time out"
implies that the socket listening on 127.0.0.1:53 is going, but are
there still sockets listening on port 53 for other addresses, or are
all the UDP sockets going?
There is a second set of sockets listening on the same addresses/ports
for TCP connections. It would be interesting to see if the TCP sockets
go as well, or if it's only the UDP sockets that disappear.
Is the network config of docker containers changed at any time? If so,
forcing that is an obvious way of trying to reproduce this problem.
Given that 127.0.0.1 is an address which disappears, anything which
fiddles with the lo interface comes under particular suspicion.
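To catch the culprit in the act, it might help to watch for the same netlink address events dnsmasq reacts to. A rough Linux-only sketch (constants copied from <linux/rtnetlink.h>; attribute parsing and timestamps omitted):

```python
import socket
import struct

# From <linux/rtnetlink.h>
RTMGRP_IPV4_IFADDR = 0x10   # multicast group for IPv4 address changes
RTM_NEWADDR = 20
RTM_DELADDR = 21

def nlmsg_type(data):
    """Message type from a struct nlmsghdr (u32 len, u16 type, ...)."""
    _, msg_type = struct.unpack_from("=IH", data)
    return msg_type

def watch_addr_events():
    """Print a line whenever an IPv4 address is added or removed."""
    s = socket.socket(socket.AF_NETLINK, socket.SOCK_RAW,
                      socket.NETLINK_ROUTE)
    s.bind((0, RTMGRP_IPV4_IFADDR))
    while True:
        t = nlmsg_type(s.recv(65535))
        if t == RTM_NEWADDR:
            print("address added")
        elif t == RTM_DELADDR:
            print("address deleted")
```

`ip monitor address` reports the same events from the command line; correlating either with the times the dnsmasq sockets vanish would show whether something really is touching lo or docker0.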
Cheers,
Simon.
I’m experiencing an issue where all DNS resolutions sent to dnsmasq
time out, but only after the dnsmasq service has been successfully
running for a period of time (anecdotally, after a few weeks of time).
After a lot of digging, I’ve discovered that dnsmasq’s UDP socket file
will eventually “disappear”. The issue can be resolved by restarting the
dnsmasq service.
I haven’t been able to reproduce it yet, but it has happened numerous
times on servers which are running dozens of docker containers. From
what I know, nothing should be removing this socket file and I can’t
find anything relevant in the dnsmasq logs. Is anyone aware of any
situations that can cause socket files to disappear?
Environment
Ubuntu 16.04.3 LTS
8 Cores, 16GB of RAM
Dnsmasq 2.75-1ubuntu0.16.04.4
Background
I’m using dnsmasq to forward requests to Consul
<https://www.consul.io/docs/guides/forwarding.html>, which is used for
service discovery. The Consul agent listens on port 8600 and is
configured to bind to all interfaces (the relevant interface here is
172.17.0.1, which docker creates).
Resolv.conf
```
# Dynamic resolv.conf(5) file for glibc resolver(3) generated by
resolvconf(8)
nameserver 127.0.0.1
```
Dnsmasq.conf
```
server=/consul/172.17.0.1#8600
server=/10.in-addr.arpa/172.17.0.1#8600
bind-dynamic
```
Systemd config for Docker
```
ExecStart=/usr/bin/dockerd --bip=172.17.0.1/24 --dns=172.17.0.1 -H fd://
```
While investigating the servers in the broken state, I observed the
following:
- nslookup / dig DNS resolutions are timing out
- Docker logs show containers are also timing out on DNS resolutions
- Systemd reports that dnsmasq is still running, pid still exists
- DNS resolutions sent directly to the consul agent (127.0.0.1:8600) succeed
- DNS resolutions sent to system[dnsmasq] (127.0.0.1:53) time out
- IPV6 UDP (::1) resolutions sent to dnsmasq succeeded
- Netstat shows that the IPV4 UDP socket file for dnsmasq is missing
- No relevant messages in kernel log (specifically, no dnsmasq OOM kill events)
- File descriptor usage for the entire server was normal
- File descriptor usage for the individual dnsmasq process was normal
- CPU, RAM, and storage all look good
Thanks in advance for any discussion at all - I've been really
struggling with this one for a while now.
Zach
_______________________________________________
Dnsmasq-discuss mailing list
http://lists.thekelleys.org.uk/mailman/listinfo/dnsmasq-discuss
Simon Kelley
2018-02-06 18:35:36 UTC
Permalink
Post by Zi Dvbelju
Healthy Node (netstat)
```
udp        0      0 127.0.0.1:53            0.0.0.0:*                           1400/dnsmasq
udp        0      0 10.3.1.79:53            0.0.0.0:*                           1400/dnsmasq
udp        0      0 172.17.0.1:53           0.0.0.0:*                           1400/dnsmasq
/var/log/syslog.7:2018-02-05T06:05:31.066648+00:00 dnsmasq[1359]: failed
to create listening socket for 172.17.0.1: Address already in use
/var/log/syslog.7:2018-02-05T06:05:31.066813+00:00 dnsmasq[1359]: failed
to create listening socket for 10.3.4.228: Address already in use
/var/log/syslog.7:2018-02-05T06:05:31.066917+00:00 dnsmasq[1359]: failed
to create listening socket for 127.0.0.1: Address already in use
```
Further to my previous question, I can see that dnsmasq _was_
listening on 127.0.0.1, 172.17.0.1 and 10.3.1.79, and then it attempts
to listen on 127.0.0.1, 172.17.0.1 and 10.3.4.228, so that implies a
network configuration change, with something changing 10.3.1.79 ->
10.3.4.228. Any idea what that's about? If you can find what causes that
change, we might, at the very least, have a way to reproduce this at will.



Cheers,

Simon.
Zi Dvbelju
2018-02-06 20:42:37 UTC
Permalink
Hey Simon,

In my previous email, "Healthy Node" and "Failed Node" are two entirely
different servers, completely independent of each other (which explains
why they have different IP addresses).

Sorry - I should have made that clearer! Just wanted to illustrate the
difference in output between a known working server and a failed server.
Your thought process did spark my interest in DHCP logs; however, the IP
address has been consistent throughout the issues.

Thanks,
Zach
Post by Simon Kelley
Post by Zi Dvbelju
Healthy Node (netstat)
```
udp 0 0 127.0.0.1:53 0.0.0.0:* 1400/dnsmasq
udp 0 0 10.3.1.79:53 0.0.0.0:* 1400/dnsmasq
udp 0 0 172.17.0.1:53 0.0.0.0:* 1400/dnsmasq
/var/log/syslog.7:2018-02-05T06:05:31.066648+00:00 dnsmasq[1359]: failed
to create listening socket for 172.17.0.1: Address already in use
/var/log/syslog.7:2018-02-05T06:05:31.066813+00:00 dnsmasq[1359]: failed
to create listening socket for 10.3.4.228: Address already in use
/var/log/syslog.7:2018-02-05T06:05:31.066917+00:00 dnsmasq[1359]: failed
to create listening socket for 127.0.0.1: Address already in use
```
Further to my previous question, I can see that dnsmasq _was_
listening on 127.0.0.1, 172.17.0.1 and 10.3.1.79, and then it attempts
to listen on 127.0.0.1, 172.17.0.1 and 10.3.4.228, so that implies a
network configuration change, with something changing 10.3.1.79 ->
10.3.4.228. Any idea what that's about? If you can find what causes that
change, we might, at the very least, have a way to reproduce this at will.
Cheers,
Simon.
Simon Kelley
2018-02-14 21:02:28 UTC
Permalink
Understood. Nevertheless, it looks like _something_ is causing dnsmasq
to re-read the set of addresses associated with the interfaces, and then
botch opening new sockets to handle those. Is there any clue in the log
just before the

/var/log/syslog.7:2018-02-05T06:05:31.066813+00:00 dnsmasq[1359]: failed
to create listening socket for 10.3.4.228: Address already in use

messages as to what that might be?
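For what it's worth, "Address already in use" is exactly what you'd get if the replacement socket is bound while the old one is still open on the same address/port. A minimal demonstration (an ephemeral port stands in for the privileged :53):

```python
import errno
import socket

old = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
old.bind(("127.0.0.1", 0))          # kernel picks a free port
addr = old.getsockname()

new = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
try:
    new.bind(addr)                  # old socket still holds the address
except OSError as e:
    assert e.errno == errno.EADDRINUSE
    print("bind failed: Address already in use")
finally:
    new.close()
    old.close()
```

So the log suggests the old UDP sockets were still open at the moment dnsmasq tried to create their replacements, which makes it all the stranger that they later turn out to be gone.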



Also, there should be tcp listening sockets for the same set of
addresses as the udp ones. When the udp sockets go, do the tcp ones
disappear also?



Cheers,

Simon.
Post by Zi Dvbelju
Hey Simon,
In my previous email, "Healthy Node" and "Failed Node" are two entirely
different servers, completely independent of each other (which explains
why they have different IP addresses).
Sorry - I should have made that clearer! Just wanted to illustrate the
difference in output between a known working server and a failed server.
Your thought process did spark my interest in DHCP logs; however, the IP
address has been consistent throughout the issues.
Thanks,
Zach
Post by Zi Dvbelju
Healthy Node (netstat)
```
udp        0      0 127.0.0.1:53            0.0.0.0:*                           1400/dnsmasq
udp        0      0 10.3.1.79:53            0.0.0.0:*                           1400/dnsmasq
udp        0      0 172.17.0.1:53           0.0.0.0:*                           1400/dnsmasq
/var/log/syslog.7:2018-02-05T06:05:31.066648+00:00 dnsmasq[1359]: failed
to create listening socket for 172.17.0.1: Address already in use
/var/log/syslog.7:2018-02-05T06:05:31.066813+00:00 dnsmasq[1359]: failed
to create listening socket for 10.3.4.228: Address already in use
/var/log/syslog.7:2018-02-05T06:05:31.066917+00:00 dnsmasq[1359]: failed
to create listening socket for 127.0.0.1: Address already in use
```
Further to my previous question, I can see that dnsmasq _was_
listening on 127.0.0.1, 172.17.0.1 and 10.3.1.79, and then it attempts
to listen on 127.0.0.1, 172.17.0.1 and 10.3.4.228, so that implies a
network configuration change, with something changing 10.3.1.79 ->
10.3.4.228. Any idea what that's about? If you can find what causes that
change, we might, at the very least, have a way to reproduce this at will.
Cheers,
Simon.