[Dnsmasq-discuss] Bug forward upstream SERVFAIL

Discussion:

Martin Wetterwald

2016-11-22 15:03:03 UTC

Hello,

At OVH, we use dnsmasq in our product OverTheBox, an OpenWRT based
router.

We found what we think is a bug (or at least unwanted behaviour), but
judging by commits 4ace25c5 and 51967f980 (pasted at the end of this
email), it seems to be a deliberate feature.

Suppose you have, say, 4 upstreams and one of them has a problem: it
always returns SERVFAIL responses to dnsmasq. The problem is that
dnsmasq immediately forwards the SERVFAIL response to the client,
even if the other upstreams are working (provided the SERVFAIL answer
is the first to arrive).

Isn't the point of having several upstreams to make dnsmasq more robust?
Shouldn't dnsmasq try as much as possible to be independent of upstream
errors?

You will find my Pull Request here:
https://github.com/MartinWetterwald/dnsmasq/pull/1/files

You could cherry-pick my commit if you agree with this behaviour.

Best Regards, Martin Wetterwald

commit 51967f9807665dae403f1497b827165c5fa1084b
Author: Simon Kelley <***@thekelleys.org.uk>
Date: Tue Mar 25 21:07:00 2014 +0000

SERVFAIL is an expected error return, don't try all servers.

commit 4ace25c5d6c30949be9171ff1c524b2139b989d3
Author: Chris Novakovic <***@chrisn.me.uk>
Date: Mon Jan 25 21:54:35 2016 +0000

Treat REFUSED (not SERVFAIL) as an unsuccessful upstream response

Commit 51967f9807665dae403f1497b827165c5fa1084b began treating SERVFAIL
as a successful response from an upstream server (thus ignoring future
responses to the query from other upstream servers), but a typo in that
commit means that REFUSED responses are accidentally being treated as
successful instead of SERVFAIL responses.

This commit corrects this typo and provides the behaviour intended by
commit 51967f9: SERVFAIL responses are considered successful (and will
be sent back to the requester), while REFUSED responses are considered
unsuccessful (and dnsmasq will wait for responses from other upstream
servers that haven't responded yet).
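
The two commit messages above can be read as a simple acceptance rule. Here is a minimal sketch of it in Python; it is not dnsmasq code. The function name first_accepted is hypothetical, and treating every rcode other than REFUSED as acceptable is an assumption (the commit messages only speak about SERVFAIL and REFUSED). What happens when every upstream refuses is left out.

```python
from enum import IntEnum
from typing import Iterable, Optional


class Rcode(IntEnum):
    """DNS response codes (values from RFC 1035)."""
    NOERROR = 0
    SERVFAIL = 2
    NXDOMAIN = 3
    REFUSED = 5


def first_accepted(replies: Iterable[Rcode]) -> Optional[Rcode]:
    """Return the first reply, in arrival order, that is not REFUSED.

    Assumption: anything other than REFUSED counts as "successful",
    so a SERVFAIL that arrives first is what the client gets.
    """
    for rcode in replies:
        if rcode != Rcode.REFUSED:
            return rcode
    return None  # every upstream refused; not modelled here
```

With replies arriving as [SERVFAIL, NOERROR] the client gets SERVFAIL, whereas [REFUSED, NOERROR] yields NOERROR.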

Chris Novakovic

2016-11-22 16:18:55 UTC

Post by Martin Wetterwald
We found what we think is a bug (or at least unwanted behaviour), but
it seems it's actually a feature, when looking at commits 4ace25c5 and
51967f980 (pasted at the end of this email).

4ace25c5 is a red herring: that provides REFUSED responses with the
behaviour you're looking for. Whether the same behaviour ought to be
applied to SERVFAIL responses is for Simon to decide: the commit message
for 51967f980 isn't clear about why SERVFAIL should be considered a
"successful" upstream response, but I'm sure there was a reason, and I'm
sure he can fill us in.

/dev/rob0

2016-11-22 18:02:14 UTC

Post by Chris Novakovic

Post by Martin Wetterwald
We found what we think is a bug (or at least unwanted
behaviour), but it seems it's actually a feature, when looking at
commits 4ace25c5 and 51967f980 (pasted at the end of this email).

4ace25c5 is a red herring: that provides REFUSED responses with the
behaviour you're looking for. Whether the same behaviour ought to
be applied to SERVFAIL responses is for Simon to decide: the commit
message for 51967f980 isn't clear about why SERVFAIL should be
considered a "successful" upstream response, but I'm sure there was
a reason, and I'm sure he can fill us in.

SERVFAIL can sometimes be considered "successful" depending on
circumstances.

If all the authoritative NS hosts for a zone are returning SERVFAIL
for queries, then indeed, that's the best that can be done.

But the problem could be on the recursive resolver, such as [for one
example] cache poisoning causing DNSSEC validation failure.

Unfortunately dnsmasq is not in a position to know which it is.

I think the most prudent thing for dnsmasq to do on SERVFAIL is to
attempt the query with other upstream servers, if possible. But an
answer needs to be provided to the client before its own timeout
value.

--
http://rob0.nodns4.us/
Offlist GMX mail is seen only if "/dev/rob0" is in the Subject:

Martin Wetterwald

2016-11-23 12:04:36 UTC

Yes, the behaviour I had in mind is to only forward SERVFAIL to the
client if we didn't have any "better" answer (NOERROR) from any other
upstream.

That way, DNS resolution with several upstreams stays reliable even if
some of them SERVFAIL.

Does that seem reasonable? Does that still respect the RFC definition
of "SERVFAIL"?

Martin
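
The selection rule proposed in this message, combined with the earlier point that the client must get an answer before its own timeout, can be sketched as follows. This is an illustration only, not the code from the pull request: choose_reply, the (arrival time, rcode) tuples and deadline_ms are all hypothetical names, and only NOERROR is treated as a "better" answer, as in the message.

```python
from typing import Optional, Sequence, Tuple

Reply = Tuple[float, str]  # (arrival time in ms, rcode name)


def choose_reply(replies: Sequence[Reply], deadline_ms: float) -> Optional[Reply]:
    """Prefer the first NOERROR that arrives before the deadline.

    If no NOERROR arrives in time, fall back to the earliest other
    reply (e.g. SERVFAIL). Returns None if nothing arrived in time.
    """
    fallback: Optional[Reply] = None
    for reply in sorted(replies):       # process in arrival order
        arrival, rcode = reply
        if arrival > deadline_ms:
            break                       # too late for the client
        if rcode == "NOERROR":
            return reply                # a "better" answer wins
        if fallback is None:
            fallback = reply            # remember the first failure
    return fallback
```

For example, choose_reply([(5, "SERVFAIL"), (40, "NOERROR")], 100) picks the NOERROR reply, while a NOERROR arriving at 400 ms would lose to the SERVFAIL. The sketch only decides which reply to forward, not when; a real forwarder could answer as soon as every upstream has replied.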

_______________________________________________
Dnsmasq-discuss mailing list
http://lists.thekelleys.org.uk/mailman/listinfo/dnsmasq-discuss

Simon Kelley

2016-12-16 21:05:38 UTC

The rationale behind this change is that SERVFAIL is the expected
reply if DNSSEC checking is turned on, and the upstream server cannot
validate the DNSSEC chain-of-trust for the requested record. This
change went in as part of the dnsmasq DNSSEC implementation, because it
was expected that the "check DNSSEC" bit would be set on queries, so a
return of SERVFAIL implies a DNSSEC problem, which is not recoverable.

If SERVFAIL is an expected error return which means DNSSEC validation
failure, you don't necessarily want to spend time forwarding the query
to other servers and waiting for them to reply.

I'm not sure the above is a cast-iron argument for the current
behaviour, and even if it is, the behaviour could be modulated
depending on whether the "check DNSSEC" bit was set in the query.

Was the original problem DNSSEC related, or does the SERVFAIL originate
from some other error?

Cheers,

Simon.
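
The idea of modulating the behaviour on the "check DNSSEC" bit could look roughly like this. It is a sketch under assumptions: the thread does not say which header or EDNS bit is meant, so it is modelled as a plain boolean, and the function name should_try_other_servers is hypothetical.

```python
def should_try_other_servers(rcode: str, dnssec_check_requested: bool) -> bool:
    """Decide whether a SERVFAIL should trigger a wait for other upstreams.

    Assumption: when the query asked for DNSSEC checking, SERVFAIL is
    taken to mean a validation failure, which is not recoverable, so it
    is final. Without that bit, SERVFAIL may be a local upstream problem
    and other servers are worth waiting for.
    """
    if rcode != "SERVFAIL":
        return False  # only SERVFAIL handling is modelled here
    return not dnssec_check_requested
```

So should_try_other_servers("SERVFAIL", False) is true, and the same call with True keeps the current forward-immediately behaviour.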

Martin Wetterwald

2017-01-03 13:42:41 UTC

Hi and happy new year :)

We don't use DNSSEC, the problem doesn't seem DNSSEC related.

But even if DNSSEC is enabled, a SERVFAIL answer should be forwarded by
dnsmasq to the client only if all upstreams fail DNSSEC chain-of-trust
validation and all send a SERVFAIL to dnsmasq.

What do you think about this behaviour?
Why not forward one upstream answer that passed chain-of-trust validation,
even if others failed?

However, our case is not DNSSEC related and can be reproduced by setting up
two upstreams, with one always replying with SERVFAIL answers and the other
one working normally.

- The first request made to dnsmasq yields SERVFAIL, because the
SERVFAIL answer arrives faster than the not-yet-cached good answer
(NOERROR) from the other upstream.
- The following requests made to dnsmasq always yield the good NOERROR
answer, because it is now cached and faster than the
SERVFAIL-replying upstream.

Martin
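
The reproduction described above can be modelled with a toy forwarder. This is not dnsmasq code, only a sketch of the reported sequence. The latencies are invented, and two points are assumptions taken from the description: both upstreams are queried at once, and a NOERROR that arrives after SERVFAIL has already been forwarded still populates the cache.

```python
from typing import List, Set, Tuple

Upstream = Tuple[str, float, str]  # (name, latency in ms, rcode it returns)

UPSTREAMS: List[Upstream] = [
    ("broken", 5.0, "SERVFAIL"),   # always fails, but answers quickly
    ("working", 50.0, "NOERROR"),  # answers correctly, but more slowly
]


class ToyForwarder:
    """First-reply-wins forwarder with a trivial positive cache."""

    def __init__(self, upstreams: List[Upstream]) -> None:
        self.upstreams = upstreams
        self.cache: Set[str] = set()

    def query(self, name: str) -> str:
        if name in self.cache:
            return "NOERROR (cached)"
        # Replies arrive in order of upstream latency.
        replies = sorted(self.upstreams, key=lambda u: u[1])
        # Assumption: a late NOERROR is still cached for next time.
        if any(rcode == "NOERROR" for _, _, rcode in replies):
            self.cache.add(name)
        return replies[0][2]  # the first reply is what the client sees
```

Calling query("example.org") twice on ToyForwarder(UPSTREAMS) gives SERVFAIL the first time and the cached NOERROR the second time, matching the two steps listed above.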

Kurt H Maier

2017-01-03 19:02:55 UTC

Post by Martin Wetterwald
However, our case is not DNSSEC related and can be reproduced by setting up
two upstreams, with one always replying with SERVFAIL answers and the other
one working normally.

You can 'reproduce' all kinds of stuff by setting up bizarre and invalid
configurations. The code is currently doing the right thing; if you
need it to do something wrong I wish you'd maintain a local patchset
instead of trying to get dnsmasq to behave badly.

khm