Warner, Andrew C [CTO]
2018-08-14 17:05:52 UTC
Subject: DNSMASQ failing to return SRV records with loss of communication to a single DNS server
Issue: We have SIP SRV records for a domain that can be served by two DNS servers in our environment. During testing we noticed that if one of the DNS servers is unreachable, the request for the SRV records via dnsmasq times out.
This only happens when the query originates from outside the box where dnsmasq is running. That is, if we issue the SRV query from the dnsmasq server itself, the SRV records are returned. If we issue the request from a client VM that is set to resolve queries against our dnsmasq host, the request times out.
Note: some of the information below, such as IP addresses and domain names, has been changed or replaced with xxx for security reasons.
dnsmasq.conf has the following entries, which forward requests for labdomain.net to 10.xx.xx.12 and 10.xx.xx.20.
server=/labdomain.net/10.xx.xx.12
server=/labdomain.net/10.xx.xx.20
The VM making the SRV queries is 10.xx.xx.99.
When we query our dnsmasq server (10.xx.xx.5) for an SRV record with the unreachable DNS server (10.xx.xx.12) commented out, we receive a response to the SRV query.
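To make it easy to confirm which upstream servers are active for a domain in each test, here is a minimal Python sketch that reads `server=/domain/address` lines and skips commented-out ones. The function name `upstreams_by_domain` is hypothetical, and the parser only handles the domain-specific form shown above; it is not a full dnsmasq.conf parser.

```python
from collections import defaultdict


def upstreams_by_domain(conf_text):
    """Map each domain to the upstream servers listed in server=/domain/addr lines."""
    result = defaultdict(list)
    for raw in conf_text.splitlines():
        line = raw.strip()
        # Skip blank lines and commented-out entries such as "#server=/.../..."
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        # Only the domain-specific form "server=/domain/addr" is handled here.
        if key.strip() != "server" or not value.startswith("/"):
            continue
        # "/labdomain.net/10.xx.xx.12" -> ["", "labdomain.net", "10.xx.xx.12"]
        *domains, server = value.split("/")[1:]
        for domain in domains:
            result[domain].append(server)
    return dict(result)


if __name__ == "__main__":
    conf = (
        "server=/labdomain.net/10.xx.xx.12\n"
        "server=/labdomain.net/10.xx.xx.20\n"
    )
    print(upstreams_by_domain(conf))
```

Running it against the configuration with the first entry commented out should list only 10.xx.xx.20 for labdomain.net.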
#server=/labdomain.net/10.xx.xx.12
server=/labdomain.net/10.xx.xx.20
[***@f5-test ~]$ dig srv _sip._udp.scscf.sprout.lp.labdomain.net @10.xx.xx.5
;; Truncated, retrying in TCP mode.
; <<>> DiG 9.8.2rc1-RedHat-9.8.2-0.62.rc1.el6_9.5 <<>> srv _sip._udp.scscf.sprout.lp.labdomain.net @10.xx.xx.5
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 14584
;; flags: qr aa; QUERY: 1, ANSWER: 5, AUTHORITY: 0, ADDITIONAL: 5
;; QUESTION SECTION:
;_sip._udp.scscf.sprout.lp.labdomain.net. IN SRV
;; ANSWER SECTION:
_sip._udp.scscf.sprout.lp.labdomain.net. 15 IN SRV 10 50 5054 ovpklp-viscscf-spn-05.labdomain.net.
_sip._udp.scscf.sprout.lp.labdomain.net. 15 IN SRV 10 50 5054 ovpklp-viscscf-spn-01.labdomain.net.
_sip._udp.scscf.sprout.lp.labdomain.net. 15 IN SRV 10 50 5054 ovpklp-viscscf-spn-02.labdomain.net.
_sip._udp.scscf.sprout.lp.labdomain.net. 15 IN SRV 10 50 5054 ovpklp-viscscf-spn-03.labdomain.net.
_sip._udp.scscf.sprout.lp.labdomain.net. 15 IN SRV 10 50 5054 ovpklp-viscscf-spn-04.labdomain.net.
;; ADDITIONAL SECTION:
ovpklp-viscscf-spn-05.labdomain.net. 43200 IN A 10.xx.xx.18
ovpklp-viscscf-spn-01.labdomain.net. 43200 IN A 10.xx.xx.14
ovpklp-viscscf-spn-02.labdomain.net. 43200 IN A 10.xx.xx.15
ovpklp-viscscf-spn-03.labdomain.net. 43200 IN A 10.xx.xx.16
ovpklp-viscscf-spn-04.labdomain.net. 43200 IN A 10.xx.xx.17
;; Query time: 2 msec
;; SERVER: 10.xx.xx.5#53(10.xx.xx.5)
;; WHEN: Mon Aug 13 16:34:40 2018
;; MSG SIZE rcvd: 528
When we query our dnsmasq server (10.xx.xx.5) for an SRV record with both the good and the unreachable DNS server configured, the SRV query times out, even though 10.xx.xx.20 is fully capable of answering it.
server=/labdomain.net/10.xx.xx.12 <-- not reachable
server=/labdomain.net/10.xx.xx.20
[***@f5-test ~]$ dig srv _sip._udp.scscf.sprout.lp.labdomain.net @10.xx.xx.5
;; Truncated, retrying in TCP mode.
; <<>> DiG 9.8.2rc1-RedHat-9.8.2-0.62.rc1.el6_9.5 <<>> srv _sip._udp.scscf.sprout.lp.labdomain.net @10.xx.xx.5
;; global options: +cmd
;; connection timed out; no servers could be reached
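In both runs dig printed "Truncated, retrying in TCP mode", which means the UDP reply had the TC (truncated) bit set in its header and dig repeated the query over TCP. As a rough aid for inspecting captured replies, here is a small Python sketch that decodes the fixed 12-byte DNS header as laid out in RFC 1035. The function name `parse_dns_header` is hypothetical, and the bytes would have to come from a packet capture, which is not shown here.

```python
import struct

# Bit masks for the 16-bit flags word of the DNS header (RFC 1035, section 4.1.1)
QR_MASK = 0x8000  # response flag
AA_MASK = 0x0400  # authoritative answer
TC_MASK = 0x0200  # truncated: client should retry over TCP
RCODE_MASK = 0x000F


def parse_dns_header(message):
    """Decode the first 12 bytes of a DNS message into a dict."""
    if len(message) < 12:
        raise ValueError("DNS header needs at least 12 bytes")
    ident, flags, qdcount, ancount, nscount, arcount = struct.unpack(
        "!6H", message[:12]
    )
    return {
        "id": ident,
        "qr": bool(flags & QR_MASK),
        "aa": bool(flags & AA_MASK),
        "tc": bool(flags & TC_MASK),
        "rcode": flags & RCODE_MASK,
        "qdcount": qdcount,
        "ancount": ancount,
        "nscount": nscount,
        "arcount": arcount,
    }


if __name__ == "__main__":
    # Synthetic header resembling a truncated UDP reply (qr + aa + tc set).
    sample = struct.pack("!6H", 1, QR_MASK | AA_MASK | TC_MASK, 1, 0, 0, 0)
    print(parse_dns_header(sample))
```

A reply with `tc` set is what triggers the TCP retry, so the timeout reported by dig applies to the TCP query, not the initial UDP one.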
Dnsmasq logging shows:
Aug 14 16:22:14 vsmslp-az2-dev-dnsmasq1-mgt dnsmasq[5161]: query[SRV] _sip._udp.scscf.sprout.lp.labdomain.net from 10.xx.xx.99
Aug 14 16:22:14 vsmslp-az2-dev-dnsmasq1-mgt dnsmasq[5161]: forwarded _sip._udp.scscf.sprout.lp.labdomain.net to 10.xx.xx.12
Aug 14 16:22:14 vsmslp-az2-dev-dnsmasq1-mgt dnsmasq[5161]: forwarded _sip._udp.scscf.sprout.lp.labdomain.net to 10.xx.xx.20
Aug 14 16:22:14 vsmslp-az2-dev-dnsmasq1-mgt dnsmasq[5161]: nameserver 10.xx.xx.20 refused to do a recursive query
Aug 14 16:22:14 vsmslp-az2-dev-dnsmasq1-mgt dnsmasq[5172]: query[SRV] _sip._udp.scscf.sprout.lp.labdomain.net from 10.xx.xx.99
Aug 14 16:22:24 vsmslp-az2-dev-dnsmasq1-mgt dnsmasq[5173]: query[SRV] _sip._udp.scscf.sprout.lp.labdomain.net from 10.xx.xx.99
Aug 14 16:22:34 vsmslp-az2-dev-dnsmasq1-mgt dnsmasq[5174]: query[SRV] _sip._udp.scscf.sprout.lp.labdomain.net from 10.xx.xx.99
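In the log above, the first query and its forwards are logged under PID 5161, while the later queries appear under PIDs 5172, 5173 and 5174, ten seconds apart, with no forward or reply logged for them. To make that pattern easier to see in longer logs, here is a minimal Python sketch that groups dnsmasq syslog lines by PID. The function name `group_by_pid` and the regular expression are assumptions based only on the line format shown here. My guess that the extra PIDs are child processes handling the TCP retries is also unconfirmed.

```python
import re
from collections import defaultdict

# Matches lines like:
# "Aug 14 16:22:14 host dnsmasq[5161]: query[SRV] name from 10.xx.xx.99"
LINE_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s[\d:]{8})\s(?P<host>\S+)\s"
    r"dnsmasq\[(?P<pid>\d+)\]:\s(?P<msg>.*)$"
)


def group_by_pid(log_text):
    """Return {pid: [(timestamp, message), ...]} for dnsmasq log lines."""
    groups = defaultdict(list)
    for line in log_text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            groups[int(match["pid"])].append((match["ts"], match["msg"]))
    return dict(groups)


if __name__ == "__main__":
    sample = (
        "Aug 14 16:22:14 host dnsmasq[5161]: forwarded example to 10.xx.xx.12\n"
        "Aug 14 16:22:24 host dnsmasq[5173]: query[SRV] example from 10.xx.xx.99\n"
    )
    for pid, entries in sorted(group_by_pid(sample).items()):
        print(pid, entries)
```

A PID that logs a query but never a "forwarded" or "reply" line is a good place to look next, for example with a packet capture on the dnsmasq host.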
I could use some ideas on how to further troubleshoot this issue.
Andy Warner
Telecom Design Engineer
O: 406-752-3330 / M: 913-972-7521
***@sprint.com