That's useful, thank you. I think we're on the right lines that
it's all going wrong.
Simon.
Post by Zi DvbeljuHey Simon,
Thanks for looking into this. It just happened again and I have more data.
To clarify a few things, docker is configured to send *all* dns requests
to 172.17.0.1:53 <http://172.17.0.1:53> (docker itself creates the
172.17.0.1 interface)*.* I implemented dnsmasq with bind-dynamic because
docker would typically start /after/dnsmasq; dnsmasq would never get a
chance to bind to 172.17.0.1. The ipv6 sockets that dnsmasq binds to
(shown below) are never actually utilized by anything in my system, just
a side effect of the current configuration.
Sample ifconfig
```
docker0 Link encap:Ethernet HWaddr 02:42:ea:b9:37:ce
inet addr:172.17.0.1 Bcast:172.17.0.255 Mask:255.255.255.0
inet6 addr: fe80::42:eaff:feb9:37ce/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:1137922568 errors:0 dropped:0 overruns:0 frame:0
TX packets:1183858585 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:694264821785 (694.2 GB) TX bytes:731462199754 (731.4 GB)
ens5 Link encap:Ethernet HWaddr 02:f3:39:28:91:08
inet addr:10.3.4.228 Bcast:10.3.4.255 Mask:255.255.255.0
inet6 addr: fe80::f3:39ff:fe28:9108/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:9001 Metric:1
RX packets:1414749050 errors:0 dropped:0 overruns:0 frame:0
TX packets:1409286954 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:746355077483 (746.3 GB) TX bytes:716453625946 (716.4 GB)
lo Link encap:Local Loopback
inet addr:127.0.0.1 Mask:255.0.0.0
inet6 addr: ::1/128 Scope:Host
UP LOOPBACK RUNNING MTU:65536 Metric:1
RX packets:39175242 errors:0 dropped:0 overruns:0 frame:0
TX packets:39175242 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1
RX bytes:15795890457 (15.7 GB) TX bytes:15795890457 (15.7 GB)
```
Failed Node (netstat)
```
udp6 0 0 fe80::888a:23ff:fe7a:53 :::*
1359/dnsmasq
udp6 0 0 fe80::d817:59ff:feb4:53 :::*
1359/dnsmasq
udp6 0 0 fe80::7c83:e5ff:fe73:53 :::*
1359/dnsmasq
udp6 0 0 fe80::387c:fff:fe8b::53 :::*
1359/dnsmasq
udp6 0 0 fe80::844c:15ff:fe94:53 :::*
1359/dnsmasq
udp6 0 0 fe80::8402:6bff:fefa:53 :::*
1359/dnsmasq
udp6 0 0 fe80::740e:c1ff:feef:53 :::*
1359/dnsmasq
udp6 0 0 fe80::f841:8fff:fefb:53 :::*
1359/dnsmasq
udp6 0 0 fe80::d0b3:42ff:fe20:53 :::*
1359/dnsmasq
udp6 0 0 fe80::984b:45ff:fe2b:53 :::*
1359/dnsmasq
udp6 0 0 fe80::e4cc:c4ff:fe87:53 :::*
1359/dnsmasq
udp6 0 0 fe80::704d:afff:fec4:53 :::*
1359/dnsmasq
udp6 0 0 fe80::707c:7eff:fea2:53 :::*
1359/dnsmasq
udp6 0 0 fe80::c8c3:70ff:fe8c:53 :::*
1359/dnsmasq
udp6 0 0 fe80::506b:10ff:fe1a:53 :::*
1359/dnsmasq
udp6 0 0 fe80::88c:2eff:fe38::53 :::*
1359/dnsmasq
udp6 0 0 fe80::4478:d5ff:fea0:53 :::*
1359/dnsmasq
udp6 0 0 fe80::1c34:96ff:fe2c:53 :::*
1359/dnsmasq
udp6 0 0 fe80::1079:21ff:fe36:53 :::*
1359/dnsmasq
udp6 0 0 fe80::4dc:53ff:fe8a::53 :::*
1359/dnsmasq
udp6 0 0 fe80::c851:e0ff:fe4c:53 :::*
1359/dnsmasq
udp6 0 0 fe80::4c0:e3ff:fe15::53 :::*
1359/dnsmasq
udp6 0 0 fe80::fcca:afff:fe23:53 :::*
1359/dnsmasq
udp6 0 0 fe80::a067:8cff:fefb:53 :::*
1359/dnsmasq
udp6 0 0 fe80::1081:54ff:feca:53 :::*
1359/dnsmasq
udp6 0 0 fe80::fc67:89ff:fe1a:53 :::*
1359/dnsmasq
udp6 0 0 fe80::a892:65ff:fe5a:53 :::*
1359/dnsmasq
udp6 0 0 fe80::689e:e5ff:fe4d:53 :::*
1359/dnsmasq
udp6 0 0 fe80::9cc5:59ff:fe94:53 :::*
1359/dnsmasq
udp6 0 0 fe80::f031:faff:feb8:53 :::*
1359/dnsmasq
udp6 0 0 fe80::ccf2:e4ff:fe3f:53 :::*
1359/dnsmasq
udp6 0 0 fe80::5c66:efff:fe89:53 :::*
1359/dnsmasq
udp6 0 0 fe80::2caf:fcff:fe07:53 :::*
1359/dnsmasq
udp6 0 0 fe80::ac43:e1ff:fe44:53 :::*
1359/dnsmasq
udp6 0 0 fe80::1c3f:ccff:fe94:53 :::*
1359/dnsmasq
udp6 0 0 fe80::2ca6:27ff:feac:53 :::*
1359/dnsmasq
udp6 0 0 fe80::dc16:3bff:fe2b:53 :::*
1359/dnsmasq
udp6 0 0 fe80::58e8:ddff:fe94:53 :::*
1359/dnsmasq
udp6 0 0 fe80::9c60:f7ff:fe12:53 :::*
1359/dnsmasq
udp6 0 0 fe80::d02d:a9ff:fe52:53 :::*
1359/dnsmasq
udp6 0 0 fe80::42:eaff:feb9:3:53 :::*
1359/dnsmasq
udp6 0 0 fe80::c840:a8ff:fe6d:53 :::*
1359/dnsmasq
udp6 0 0 ::1:53 :::*
1359/dnsmasq
udp6 0 0 fe80::f3:39ff:fe28:9:53 :::*
1359/dnsmasq
```
Healthy Node (netstat)
```
*udp 0 0 127.0.0.1:53 <http://127.0.0.1:53>
0.0.0.0:* 1400/dnsmasq*
*udp 0 0 10.3.1.79:53 <http://10.3.1.79:53>
0.0.0.0:* 1400/dnsmasq*
*udp 0 0 172.17.0.1:53 <http://172.17.0.1:53>
0.0.0.0:* 1400/dnsmasq*
udp6 0 0 fe80::54d7:eff:fe7a::53 :::*
1400/dnsmasq
udp6 0 0 fe80::6ccd:80ff:fe39:53 :::*
1400/dnsmasq
udp6 0 0 fe80::b870:1dff:fe4f:53 :::*
1400/dnsmasq
udp6 0 0 fe80::b8f9:20ff:fe9b:53 :::*
1400/dnsmasq
udp6 0 0 fe80::7416:9bff:fed7:53 :::*
1400/dnsmasq
udp6 0 0 fe80::94d7:29ff:fe2a:53 :::*
1400/dnsmasq
udp6 0 0 fe80::b4e1:a0ff:fe7e:53 :::*
1400/dnsmasq
udp6 0 0 fe80::84b8:20ff:feb2:53 :::*
1400/dnsmasq
udp6 0 0 fe80::603f:dbff:feec:53 :::*
1400/dnsmasq
udp6 0 0 fe80::284f:5fff:fe4f:53 :::*
1400/dnsmasq
udp6 0 0 fe80::100b:7aff:fe23:53 :::*
1400/dnsmasq
udp6 0 0 fe80::b088:7eff:fe20:53 :::*
1400/dnsmasq
udp6 0 0 fe80::4434:10ff:fe9e:53 :::*
1400/dnsmasq
udp6 0 0 fe80::f052:16ff:fe10:53 :::*
1400/dnsmasq
udp6 0 0 fe80::789b:19ff:fefd:53 :::*
1400/dnsmasq
udp6 0 0 fe80::c66:baff:fe98::53 :::*
1400/dnsmasq
udp6 0 0 fe80::f08c:f7ff:fe7d:53 :::*
1400/dnsmasq
udp6 0 0 fe80::9898:dff:fe21::53 :::*
1400/dnsmasq
udp6 0 0 fe80::3847:85ff:fe30:53 :::*
1400/dnsmasq
udp6 0 0 fe80::98cd:23ff:fe7f:53 :::*
1400/dnsmasq
udp6 0 0 fe80::c9b:6bff:fed6::53 :::*
1400/dnsmasq
udp6 0 0 fe80::b002:f6ff:fe8a:53 :::*
1400/dnsmasq
udp6 0 0 fe80::8cf8:63ff:fe9a:53 :::*
1400/dnsmasq
udp6 0 0 fe80::4095:1eff:fe09:53 :::*
1400/dnsmasq
udp6 0 0 fe80::e8cf:10ff:fe32:53 :::*
1400/dnsmasq
udp6 0 0 fe80::fca4:12ff:fed7:53 :::*
1400/dnsmasq
udp6 0 0 fe80::d4e6:d4ff:fe7e:53 :::*
1400/dnsmasq
udp6 0 0 fe80::8c74:57ff:fe24:53 :::*
1400/dnsmasq
udp6 0 0 fe80::c0bb:f3ff:fe49:53 :::*
1400/dnsmasq
udp6 0 0 fe80::b462:beff:fe75:53 :::*
1400/dnsmasq
udp6 0 0 fe80::c816:f6ff:fec8:53 :::*
1400/dnsmasq
udp6 0 0 fe80::409f:6dff:feb8:53 :::*
1400/dnsmasq
udp6 0 0 fe80::30b8:caff:fec8:53 :::*
1400/dnsmasq
udp6 0 0 fe80::42:12ff:fe82:5:53 :::*
1400/dnsmasq
udp6 0 0 ::1:53 :::*
1400/dnsmasq
udp6 0 0 fe80::468:88ff:fe00::53 :::*
1400/dnsmasq
```
The above 'udp' interfaces (not udp6) are actually the only ones I care
about - these are the ones that are mysteriously disappearing.
Failed Node Logs
```
failed to create listening socket for fe80::fc42:b2ff:fe85:7702: No such
device
failed to create listening socket for fe80::fc42:b2ff:fe85:7702: No such
device
failed to create listening socket for fe80::dcb4:e5ff:fe94:9b31: No such
device
failed to create listening socket for fe80::dcb4:e5ff:fe94:9b31: No such
device
/var/log/syslog.7:2018-02-05T06:05:31.066648+00:00 dnsmasq[1359]: failed
to create listening socket for 172.17.0.1 <http://172.17.0.1>: Address
already in use
/var/log/syslog.7:2018-02-05T06:05:31.066813+00:00 dnsmasq[1359]: failed
to create listening socket for 10.3.4.228 <http://10.3.4.228>: Address
already in use
/var/log/syslog.7:2018-02-05T06:05:31.066917+00:00 dnsmasq[1359]: failed
to create listening socket for 127.0.0.1 <http://127.0.0.1>: Address
already in use
```
Hopefully this helps clarify what I'm seeing. Please let me know if you
need any additional information.
Thanks,
Zach
You're using --bind-dynamic, so dnsmasq creates a set of listener
sockets, one for each if the addresses on the machine's interfaces. When
such an address is removed or created, dnsmasq gets an event via
netlink, and it does two things,
1) Enumerates the current set of addresses on the machines interfaces.
2) Updates the set of listening sockets to reflect that set: creating
new sockets for new addresses, and deleting sockets for addresses which
are no longer in use.
That process is an obvious potential source of the behaviour you're seeing.
there should be one for each address possessed by the machine. To try
and get a handle on what's happening, we need to see what's happening
to all the members of that set.
"DNS resolutions sent to system[dnsmasq] (127.0.0.1:53
<http://127.0.0.1:53>) time out"
implies that 127.0.1:53 the socket listening on 127.0.0.1 is going, but
are there still sockets listening on port 53 for other addresses, or are
all the UDP sockets going?
There is a second set of sockets listening on the same addresses/ports
for TCP connections. It would be interesting to see if the TCP sockets
go as well, or is it's only the UDP socket that disappears?
Is the network config of docker containers changed at any time? If so,
forcing that is an obvious way of trying to reproduce this problem.
Given that 127.0.0.1 is an address which dispappears, anything which
fiddles with the lo interface is of particular suspicion.
Cheers,
Simon.
*
I’m experiencing an issue where allDNS resolutions sent to dnsmasq
timeout, but only after the dnsmasq service has been successfullyrunning
for a period of time (anecdotally, after a few weeks of time). After a
lot of digging, I’ve discovered that dnsmasq’s UDP socket file will
eventually “disappear”. The issue can be resolved by restarting the
dnsmasq service.
I haven’t been able to reproduce it yet, but it has happened numerous
times on servers which are running dozens of docker containers. From
what I know, nothing should be removing this socket file and I can’t
find anything relevant in the dnsmasq logs. Is anyone aware of any
situations that can cause socket files to disappear?
Environment
Ubuntu 16.04.3 LTS
8 Cores, 16GB of RAM
Dnsmasq 2.75-1ubuntu0.16.04.4
Background
I’m using dnsmasq to forward requests to Consul
<https://www.consul.io/docs/guides/forwarding.html
<https://www.consul.io/docs/guides/forwarding.html>>, which is used for
service discovery. The Consul agent listens on port 8600 and is
configured to bind to allinterfaces (the relevant interface here is
172.17.0.1, which docker creates).
Resolv.conf
```
# Dynamic resolv.conf(5) file for glibc resolver(3) generated by
resolvconf(8)
nameserver 127.0.0.1
```
Dnsmasq.conf
```
server=/consul/172.17.0.1#8600 <http://172.17.0.1#8600>
<http://172.17.0.1/#8600>
server=/10.in-addr.arpa/172.17.0.1#8600 <http://172.17.0.1#8600>
<http://172.17.0.1/#8600>
bind-dynamic
```
Systemd config for Docker
```
ExecStart=/usr/bin/dockerd --bip=172.17.0.1/24
<http://172.17.0.1/24> <http://172.17.0.1/24>
--dns=172.17.0.1 -H fd://
```
While investigating the servers in the broken state, I observed the
*
nslookup / dig DNS resolutions are timing out
*
Docker logs show containers are also timing out on DNS resolutions
*
Systemd reports that dnsmasq is still running, pid still exists
*
DNS resolutions sent directly to the consul agent (127.0.0.1:8600 <http://127.0.0.1:8600>
<http://127.0.0.1:8600/>) succeed
*
DNS resolutions sent to system[dnsmasq] (127.0.0.1:53 <http://127.0.0.1:53>
<http://127.0.0.1:53/>) time out
*
IPV6 UDP (::1) resolutions sent to dnsmasq succeeded
*
Netstat shows that the IPV4 UDP socket file for dnsmasq is missing
*
No relevant messages in kernel log (specifically, no dnsmasq OOM
kill events)
*
File descriptor usage for the entire server was normal
*
File descriptor usage for the individual dnsmasq process was normal
*
CPU, RAM, and storage all look good
Thanks in advance for any discussion at all - I've been really
struggling with this one for a while now.
Zach
*
_______________________________________________
Dnsmasq-discuss mailing list
http://lists.thekelleys.org.uk/mailman/listinfo/dnsmasq-discuss
<http://lists.thekelleys.org.uk/mailman/listinfo/dnsmasq-discuss>
_______________________________________________
Dnsmasq-discuss mailing list
http://lists.thekelleys.org.uk/mailman/listinfo/dnsmasq-discuss
<http://lists.thekelleys.org.uk/mailman/listinfo/dnsmasq-discuss>