Discussion:
[Dnsmasq-discuss] Dnsmasq not resolving addresses for an hour
John Knight
2016-10-12 23:50:11 UTC
Permalink
Hi,

I think I may know what the issue is... it appears that the time may be changed by ntp in my failure scenario as suggested by the URLs referencing ntp in the dnsmasq.log file. There are numerous references to difftime in dnsmasq code. One of which acknowledges potential problem if the clock goes backwards... and is handled by comparing last_resolv >1.0 and < -1.0 to accommodate such a possibility. However, in function poll_resolv(), the difftime() call checks for > 0.0, assuming the modification time of the file is greater than the last_change time. If the time had changed on the router, then its possible that the modification time of the /etc/resolv.conf could be less than that of the last_change. I think this needs to be a check for != 0. If the time is changed negatively, then the existing code will not work properly me thinks. Its imperative that latest gets set in order for the reload_servers() code to run... and if the time is not right, then the reload_servers() won't get called. This specific code (poll_resolv) hasn't changed, and if I am right, it is also broken in 2.76.

What do you think? I am going to make the change locally and re-test and see if I can make it fail again. Unfortunately, it doesn't always fail, but I have reproduced it twice now, hopefully it will happen again if my fix is not right.

Best Regards,

John Knight

__________________________________________________________________ Confidential This e-mail and any files transmitted with it are the property of Belkin International, Inc. and/or its affiliates, are confidential, and are intended solely for the use of the individual or entity to whom this e-mail is addressed. If you are not one of the named recipients or otherwise have reason to believe that you have received this e-mail in error, please notify the sender and delete this message immediately from your computer. Any other use, retention, dissemination, forwarding, printing or copying of this e-mail is strictly prohibited. Pour la version fran?aise: http://www.belkin.com/email-notice/French.html F?r die deutsche ?bersetzung: http://www.belkin.com/email-notice/German.html __________________________________________________________________
Albert ARIBAUD
2016-10-13 09:16:08 UTC
Permalink
Hi,

Just a generic comment: from what I can see, all absolute times in
dnsmasq are returned bu dnsmasq_time() which calls either times() or
time(). This, IIUC, corresponds to CLOCK_REALTIME in clock_gettime(),
which is indeed affected when (re)setting the time.

Maybe a fix to time jump issues would be (in Linux at least) to replace
time() with clock_gettime(CLOCK_BOOTTIME,...) which seems to have been
designed to get around discontinuities caused by settimeofday().

Note: maybe dates used for logging purposes should still use time() or
clock_gettime(CLOCK_REALTIME) in order to remain comparable to other
logs in the same system -- or maybe not.

Sources: man times, man time, man clock_gettime.

HTH,

Amicalement,
Albert.

Le Wed, 12 Oct 2016 23:50:11 +0000
Post by John Knight
Hi,
I think I may know what the issue is... it appears that the time may
be changed by ntp in my failure scenario as suggested by the URLs
referencing ntp in the dnsmasq.log file. There are numerous
references to difftime in dnsmasq code. One of which acknowledges
potential problem if the clock goes backwards... and is handled by
comparing last_resolv >1.0 and < -1.0 to accommodate such a
possibility. However, in function poll_resolv(), the difftime() call
checks for > 0.0, assuming the modification time of the file is
greater than the last_change time. If the time had changed on the
router, then its possible that the modification time of
the /etc/resolv.conf could be less than that of the last_change. I
think this needs to be a check for != 0. If the time is changed
negatively, then the existing code will not work properly me thinks.
Its imperative that latest gets set in order for the reload_servers()
code to run... and if the time is not right, then the
reload_servers() won't get called. This specific code (poll_resolv)
hasn't changed, and if I am right, it is also broken in 2.76.
What do you think? I am going to make the change locally and re-test
and see if I can make it fail again. Unfortunately, it doesn't
always fail, but I have reproduced it twice now, hopefully it will
happen again if my fix is not right.
Best Regards,
John Knight
__________________________________________________________________
Confidential This e-mail and any files transmitted with it are the
property of Belkin International, Inc. and/or its affiliates, are
confidential, and are intended solely for the use of the individual
or entity to whom this e-mail is addressed. If you are not one of the
named recipients or otherwise have reason to believe that you have
received this e-mail in error, please notify the sender and delete
this message immediately from your computer. Any other use,
retention, dissemination, forwarding, printing or copying of this
http://www.belkin.com/email-notice/French.html F?r die
deutsche ?bersetzung: http://www.belkin.com/email-notice/German.html
__________________________________________________________________
Amicalement,
--
Albert.
John Knight
2016-10-13 17:38:03 UTC
Permalink
Hi Albert,

That sounds like a very good idea to use CLOCK_BOOTTIME. Good suggestion.

When I did a search for difftime in the source code... there are quite a few calls... each one is a potential issue with respect to time going backwards. I only see one instance that actually considers the case if time goes backwards and that is in dnsmasq.c where it does difftime(now, daemon->last_resolv) and compares the result to both > 1.0 and < -1.0. So in general, I am somewhat concerned about possible affects of changing time on dnsmasq. We have seen some issues in the past which we suspected were probably caused by changing the time, so your suggested change could potentially fix some other issues.

Thanks!

John




-----Original Message-----
From: Albert ARIBAUD [mailto:***@free.fr]
Sent: Thursday, October 13, 2016 2:16 AM
To: John Knight
Cc: dnsmasq-***@lists.thekelleys.org.uk
Subject: Re: [Dnsmasq-discuss] Dnsmasq not resolving addresses for an hour

Hi,

Just a generic comment: from what I can see, all absolute times in dnsmasq are returned bu dnsmasq_time() which calls either times() or time(). This, IIUC, corresponds to CLOCK_REALTIME in clock_gettime(), which is indeed affected when (re)setting the time.

Maybe a fix to time jump issues would be (in Linux at least) to replace
time() with clock_gettime(CLOCK_BOOTTIME,...) which seems to have been designed to get around discontinuities caused by settimeofday().

Note: maybe dates used for logging purposes should still use time() or
clock_gettime(CLOCK_REALTIME) in order to remain comparable to other logs in the same system -- or maybe not.

Sources: man times, man time, man clock_gettime.

HTH,

Amicalement,
Albert.

Le Wed, 12 Oct 2016 23:50:11 +0000
Post by John Knight
Hi,
I think I may know what the issue is... it appears that the time may
be changed by ntp in my failure scenario as suggested by the URLs
referencing ntp in the dnsmasq.log file. There are numerous
references to difftime in dnsmasq code. One of which acknowledges
potential problem if the clock goes backwards... and is handled by
comparing last_resolv >1.0 and < -1.0 to accommodate such a
possibility. However, in function poll_resolv(), the difftime() call
checks for > 0.0, assuming the modification time of the file is
greater than the last_change time. If the time had changed on the
router, then its possible that the modification time of the
/etc/resolv.conf could be less than that of the last_change. I think
this needs to be a check for != 0. If the time is changed negatively,
then the existing code will not work properly me thinks.
Its imperative that latest gets set in order for the reload_servers()
code to run... and if the time is not right, then the
reload_servers() won't get called. This specific code (poll_resolv)
hasn't changed, and if I am right, it is also broken in 2.76.
What do you think? I am going to make the change locally and re-test
and see if I can make it fail again. Unfortunately, it doesn't always
fail, but I have reproduced it twice now, hopefully it will happen
again if my fix is not right.
Best Regards,
John Knight
__________________________________________________________________
Confidential This e-mail and any files transmitted with it are the
property of Belkin International, Inc. and/or its affiliates, are
confidential, and are intended solely for the use of the individual or
entity to whom this e-mail is addressed. If you are not one of the
named recipients or otherwise have reason to believe that you have
received this e-mail in error, please notify the sender and delete
this message immediately from your computer. Any other use, retention,
dissemination, forwarding, printing or copying of this e-mail is
http://www.belkin.com/email-notice/French.html F?r die deutsche
?bersetzung: http://www.belkin.com/email-notice/German.html
__________________________________________________________________
Amicalement,
--
Albert.

__________________________________________________________________ Confidential This e-mail and any files transmitted with it are the property of Belkin International, Inc. and/or its affiliates, are confidential, and are intended solely for the use of the individual or entity to whom this e-mail is addressed. If you are not one of the named recipients or otherwise have reason to believe that you have received this e-mail in error, please notify the sender and delete this message immediately from your computer. Any other use, retention, dissemination, forwarding, printing or copying of this e-mail is strictly prohibited. Pour la version française: http://www.belkin.com/email-notice/French.html Für die deutsche Übersetzung: http://www.belkin.com/email-notice/German.html __________________________________________________________________
Vladislav Grishenko
2016-10-13 20:15:15 UTC
Permalink
Hi,
Why not just use existing HAVE_BROKEN_RTC?CLOCK_BOOTIME is linux-specific, non-portable, absent in older (but still running) kernels and logically is the same as CLOCK_MONOTONIC except counting suspended/sleep time. In turn using CLOCK_MONOTONIC is already there in times() form when HAVE_BROKEN_RTC is enabled.

Best Regards, Vladislav Grishenko

_____________________________
From: John Knight <***@belkin.com>
Sent: четверг, Пктября 13, 2016 11:00 ПП
Subject: Re: [Dnsmasq-discuss] Dnsmasq not resolving addresses for an hour
To: Albert ARIBAUD <***@free.fr>
Cc: <dnsmasq-***@lists.thekelleys.org.uk>


Hi Albert,

That sounds like a very good idea to use CLOCK_BOOTTIME. Good suggestion.

When I did a search for difftime in the source code... there are quite a few calls... each one is a potential issue with respect to time going backwards. I only see one instance that actually considers the case if time goes backwards and that is in dnsmasq.c where it does difftime(now, daemon->last_resolv) and compares the result to both > 1.0 and < -1.0. So in general, I am somewhat concerned about possible affects of changing time on dnsmasq. We have seen some issues in the past which we suspected were probably caused by changing the time, so your suggested change could potentially fix some other issues.

Thanks!

John




-----Original Message-----
From: Albert ARIBAUD [mailto:***@free.fr]
Sent: Thursday, October 13, 2016 2:16 AM
To: John Knight
Cc: dnsmasq-***@lists.thekelleys.org.uk
Subject: Re: [Dnsmasq-discuss] Dnsmasq not resolving addresses for an hour

Hi,

Just a generic comment: from what I can see, all absolute times in dnsmasq are returned bu dnsmasq_time() which calls either times() or time(). This, IIUC, corresponds to CLOCK_REALTIME in clock_gettime(), which is indeed affected when (re)setting the time.

Maybe a fix to time jump issues would be (in Linux at least) to replace
time() with clock_gettime(CLOCK_BOOTTIME,...) which seems to have been designed to get around discontinuities caused by settimeofday().

Note: maybe dates used for logging purposes should still use time() or
clock_gettime(CLOCK_REALTIME) in order to remain comparable to other logs in the same system -- or maybe not.

Sources: man times, man time, man clock_gettime.

HTH,

Amicalement,
Albert.

Le Wed, 12 Oct 2016 23:50:11 +0000
Post by John Knight
Hi,
I think I may know what the issue is... it appears that the time may
be changed by ntp in my failure scenario as suggested by the URLs
referencing ntp in the dnsmasq.log file. There are numerous
references to difftime in dnsmasq code. One of which acknowledges
potential problem if the clock goes backwards... and is handled by
comparing last_resolv >1.0 and < -1.0 to accommodate such a
possibility. However, in function poll_resolv(), the difftime() call
checks for > 0.0, assuming the modification time of the file is
greater than the last_change time. If the time had changed on the
router, then its possible that the modification time of the
/etc/resolv.conf could be less than that of the last_change. I think
this needs to be a check for != 0. If the time is changed negatively,
then the existing code will not work properly me thinks.
Its imperative that latest gets set in order for the reload_servers()
code to run... and if the time is not right, then the
reload_servers() won't get called. This specific code (poll_resolv)
hasn't changed, and if I am right, it is also broken in 2.76.
What do you think? I am going to make the change locally and re-test
and see if I can make it fail again. Unfortunately, it doesn't always
fail, but I have reproduced it twice now, hopefully it will happen
again if my fix is not right.
Best Regards,
John Knight
__________________________________________________________________
Confidential This e-mail and any files transmitted with it are the
property of Belkin International, Inc. and/or its affiliates, are
confidential, and are intended solely for the use of the individual or
entity to whom this e-mail is addressed. If you are not one of the
named recipients or otherwise have reason to believe that you have
received this e-mail in error, please notify the sender and delete
this message immediately from your computer. Any other use, retention,
dissemination, forwarding, printing or copying of this e-mail is
http://www.belkin.com/email-notice/French.html F?r die deutsche
?bersetzung: http://www.belkin.com/email-notice/German.html
__________________________________________________________________
Amicalement,
--
Albert.

__________________________________________________________________ Confidential This e-mail and any files transmitted with it are the property of Belkin International, Inc. and/or its affiliates, are confidential, and are intended solely for the use of the individual or entity to whom this e-mail is addressed. If you are not one of the named recipients or otherwise have reason to believe that you have received this e-mail in error, please notify the sender and delete this message immediately from your computer. Any other use, retention, dissemination, forwarding, printing or copying of this e-mail is strictly prohibited. Pour la version française: http://www.belkin.com/email-notice/French.html FÃŒr die deutsche Übersetzung: http://www.belkin.com/email-notice/German.html __________________________________________________________________
Albert ARIBAUD
2016-10-13 21:39:52 UTC
Permalink
Hi,

I think it is preferable not to use HAVE_BROKEN_RTC for at least two
reasons, in increasing order of importance:

1. HAVE_BROKEN_RTC should be used for, well, broken RTCs. Here, we
are not dealing with broken RTC.

2. The man mage for times() states that "a portable application would be
wise to avoid using [the] value [returned by times()]. To measure
changes in elapsed time, use clock_gettime(2) instead".

But you are right that CLOCK_BOOTTIME is Linux specific (I did mention
that, in fact).

So my proposal would become:

- if CLOCK_BOOTTIME is defined as compile time, and if
clock_gettime(CLOCK_BOOTTIME,...) succeeds at run time, use that;

- otherwise, if CLOCK_MONOTONIC is defined (it should always) and if
clock_gettime(CLOCK_MONOTONIC,...) succeeds at run time, use that;

- otherwise, if CLOCK_REALTIME is defined (it should always) and if
clock_gettime(CLOCK_REALTIME,...) succeeds at run time, use that;

- otherwise, as a last resort, use times().

Amicalement,
Albert.

Le Thu, 13 Oct 2016 20:15:15 +0000 (UTC)
Post by Vladislav Grishenko
Hi,
Why not just use existing HAVE_BROKEN_RTC?CLOCK_BOOTIME is
linux-specific, non-portable, absent in older (but still running)
kernels and logically is the same as CLOCK_MONOTONIC except counting
suspended/sleep time. In turn using CLOCK_MONOTONIC is already there
in times() form when HAVE_BROKEN_RTC is enabled.
Best Regards, Vladislav Grishenko
_____________________________
Sent: четверг, октября 13, 2016 11:00 ПП
Subject: Re: [Dnsmasq-discuss] Dnsmasq not resolving addresses for an
Hi Albert,
That sounds like a very good idea to use CLOCK_BOOTTIME. Good
suggestion.
When I did a search for difftime in the source code... there are
quite a few calls... each one is a potential issue with respect to
time going backwards. I only see one instance that actually
considers the case if time goes backwards and that is in dnsmasq.c
where it does difftime(now, daemon->last_resolv) and compares the
result to both > 1.0 and < -1.0. So in general, I am somewhat
concerned about possible affects of changing time on dnsmasq. We
have seen some issues in the past which we suspected were probably
caused by changing the time, so your suggested change could
potentially fix some other issues.
Thanks!
John
-----Original Message-----
Sent: Thursday, October 13, 2016 2:16 AM
To: John Knight
Subject: Re: [Dnsmasq-discuss] Dnsmasq not resolving addresses for an hour
Hi,
Just a generic comment: from what I can see, all absolute times in
dnsmasq are returned bu dnsmasq_time() which calls either times() or
time(). This, IIUC, corresponds to CLOCK_REALTIME in clock_gettime(),
which is indeed affected when (re)setting the time.
Maybe a fix to time jump issues would be (in Linux at least) to
replace time() with clock_gettime(CLOCK_BOOTTIME,...) which seems to
have been designed to get around discontinuities caused by
settimeofday().
Note: maybe dates used for logging purposes should still use time() or
clock_gettime(CLOCK_REALTIME) in order to remain comparable to other
logs in the same system -- or maybe not.
Sources: man times, man time, man clock_gettime.
HTH,
Amicalement,
Albert.
Le Wed, 12 Oct 2016 23:50:11 +0000
Post by John Knight
Hi,
I think I may know what the issue is... it appears that the time may
be changed by ntp in my failure scenario as suggested by the URLs
referencing ntp in the dnsmasq.log file. There are numerous
references to difftime in dnsmasq code. One of which acknowledges
potential problem if the clock goes backwards... and is handled by
comparing last_resolv >1.0 and < -1.0 to accommodate such a
possibility. However, in function poll_resolv(), the difftime()
call checks for > 0.0, assuming the modification time of the file is
greater than the last_change time. If the time had changed on the
router, then its possible that the modification time of the
/etc/resolv.conf could be less than that of the last_change. I
think this needs to be a check for != 0. If the time is changed
negatively, then the existing code will not work properly me thinks.
Its imperative that latest gets set in order for the
reload_servers() code to run... and if the time is not right, then
the reload_servers() won't get called. This specific code
(poll_resolv) hasn't changed, and if I am right, it is also broken
in 2.76.
What do you think? I am going to make the change locally and
re-test and see if I can make it fail again. Unfortunately, it
doesn't always fail, but I have reproduced it twice now, hopefully
it will happen again if my fix is not right.
Best Regards,
John Knight
__________________________________________________________________
Confidential This e-mail and any files transmitted with it are the
property of Belkin International, Inc. and/or its affiliates, are
confidential, and are intended solely for the use of the individual
or entity to whom this e-mail is addressed. If you are not one of
the named recipients or otherwise have reason to believe that you
have received this e-mail in error, please notify the sender and
delete this message immediately from your computer. Any other use,
retention, dissemination, forwarding, printing or copying of this
http://www.belkin.com/email-notice/French.html F?r die deutsche
?bersetzung: http://www.belkin.com/email-notice/German.html
__________________________________________________________________
Amicalement,
--
Albert.
__________________________________________________________________
Confidential This e-mail and any files transmitted with it are the
property of Belkin International, Inc. and/or its affiliates, are
confidential, and are intended solely for the use of the individual
or entity to whom this e-mail is addressed. If you are not one of the
named recipients or otherwise have reason to believe that you have
received this e-mail in error, please notify the sender and delete
this message immediately from your computer. Any other use,
retention, dissemination, forwarding, printing or copying of this
http://www.belkin.com/email-notice/French.html Für die deutsche
Übersetzung: http://www.belkin.com/email-notice/German.html
__________________________________________________________________
_______________________________________________ Dnsmasq-discuss
http://lists.thekelleys.org.uk/mailman/listinfo/dnsmasq-discuss
Amicalement,
--
Albert.
Vladislav Grishenko
2016-10-14 06:52:33 UTC
Permalink
Hi, Albert,
1. HAVE_BROKEN_RTC should be used for, well, broken RTCs. Here, we are
not dealing with broken RTC.
One of which acknowledges potential problem if the clock goes backwards...
As for me it's indeed broken RTC behavior, not?
2. The man mage for times() states that "a portable application would be
wise to avoid using [the] value [returned by times()]. To measure
changes in elapsed time, use clock_gettime(2) instead".
Because start value of posix's times() return value may vary across kernel versions & UNIX impl., combined with possibility of value overflow the clock_t range.
Since we don’t care neither about initial boot value nor about sleep/suspended time (files can't be modified when suspended, right?) - the only possible issue is overflow.
Since times() not counting CPU ticks in sleep/supspended mode, suggestion clock_gettime here is about using CLOCK_MONOTONIC which is almost the same, but with no-overflow API.
- otherwise, if CLOCK_MONOTONIC is defined (it should always) and if
clock_gettime(CLOCK_MONOTONIC,...) succeeds at run time, use that;
Even with defined CLOCK_MONOTONIC, the real presence of this clock source can be retrieved from kernel in runtime only.
Yes, there're old running kernels with no CLOCK_MONOTONIC, clock_gettime() returns EINVAL. Same check is true for CLOCK_BOOTTIME.

Best Regards, Vladislav Grishenko
-----Original Message-----
Sent: Friday, October 14, 2016 2:40 AM
Subject: Re: [Dnsmasq-discuss] Dnsmasq not resolving addresses for an hour
Hi,
I think it is preferable not to use HAVE_BROKEN_RTC for at least two
1. HAVE_BROKEN_RTC should be used for, well, broken RTCs. Here, we are
not dealing with broken RTC.
2. The man mage for times() states that "a portable application would be
wise to avoid using [the] value [returned by times()]. To measure
changes in elapsed time, use clock_gettime(2) instead".
But you are right that CLOCK_BOOTTIME is Linux specific (I did mention that,
in fact).
- if CLOCK_BOOTTIME is defined as compile time, and if
clock_gettime(CLOCK_BOOTTIME,...) succeeds at run time, use that;
- otherwise, if CLOCK_MONOTONIC is defined (it should always) and if
clock_gettime(CLOCK_MONOTONIC,...) succeeds at run time, use that;
- otherwise, if CLOCK_REALTIME is defined (it should always) and if
clock_gettime(CLOCK_REALTIME,...) succeeds at run time, use that;
- otherwise, as a last resort, use times().
Amicalement,
Albert.
Le Thu, 13 Oct 2016 20:15:15 +0000 (UTC) Vladislav Grishenko
Post by Vladislav Grishenko
Hi,
Why not just use existing HAVE_BROKEN_RTC?CLOCK_BOOTIME is
linux-specific, non-portable, absent in older (but still running)
kernels and logically is the same as CLOCK_MONOTONIC except counting
suspended/sleep time. In turn using CLOCK_MONOTONIC is already there
in times() form when HAVE_BROKEN_RTC is enabled.
Best Regards, Vladislav Grishenko
_____________________________
Sent: четверг, октября 13, 2016 11:00 ПП
Subject: Re: [Dnsmasq-discuss] Dnsmasq not resolving addresses for an
Hi Albert,
That sounds like a very good idea to use CLOCK_BOOTTIME. Good suggestion.
When I did a search for difftime in the source code... there are quite
a few calls... each one is a potential issue with respect to time
going backwards. I only see one instance that actually considers the
case if time goes backwards and that is in dnsmasq.c where it does
difftime(now, daemon->last_resolv) and compares the
result to both > 1.0 and < -1.0. So in general, I am somewhat
concerned about possible affects of changing time on dnsmasq. We have
seen some issues in the past which we suspected were probably caused
by changing the time, so your suggested change could potentially fix
some other issues.
Thanks!
John
-----Original Message-----
Sent: Thursday, October 13, 2016 2:16 AM
To: John Knight
Subject: Re: [Dnsmasq-discuss] Dnsmasq not resolving addresses for an hour
Hi,
Just a generic comment: from what I can see, all absolute times in
dnsmasq are returned bu dnsmasq_time() which calls either times() or
time(). This, IIUC, corresponds to CLOCK_REALTIME in clock_gettime(),
which is indeed affected when (re)setting the time.
Maybe a fix to time jump issues would be (in Linux at least) to
replace time() with clock_gettime(CLOCK_BOOTTIME,...) which seems to
have been designed to get around discontinuities caused by
settimeofday().
Note: maybe dates used for logging purposes should still use time() or
clock_gettime(CLOCK_REALTIME) in order to remain comparable to other
logs in the same system -- or maybe not.
Sources: man times, man time, man clock_gettime.
HTH,
Amicalement,
Albert.
Le Wed, 12 Oct 2016 23:50:11 +0000
Post by John Knight
Hi,
I think I may know what the issue is... it appears that the time may
be changed by ntp in my failure scenario as suggested by the URLs
referencing ntp in the dnsmasq.log file. There are numerous
references to difftime in dnsmasq code. One of which acknowledges
potential problem if the clock goes backwards... and is handled by
comparing last_resolv >1.0 and < -1.0 to accommodate such a
possibility. However, in function poll_resolv(), the difftime()
call checks for > 0.0, assuming the modification time of the file is
greater than the last_change time. If the time had changed on the
router, then its possible that the modification time of the
/etc/resolv.conf could be less than that of the last_change. I
think this needs to be a check for != 0. If the time is changed
negatively, then the existing code will not work properly me thinks.
Its imperative that latest gets set in order for the
reload_servers() code to run... and if the time is not right, then
the reload_servers() won't get called. This specific code
(poll_resolv) hasn't changed, and if I am right, it is also broken
in 2.76.
What do you think? I am going to make the change locally and
re-test and see if I can make it fail again. Unfortunately, it
doesn't always fail, but I have reproduced it twice now, hopefully
it will happen again if my fix is not right.
Best Regards,
John Knight
__________________________________________________________
________
Post by Vladislav Grishenko
Post by John Knight
Confidential This e-mail and any files transmitted with it are the
property of Belkin International, Inc. and/or its affiliates, are
confidential, and are intended solely for the use of the individual
or entity to whom this e-mail is addressed. If you are not one of
the named recipients or otherwise have reason to believe that you
have received this e-mail in error, please notify the sender and
delete this message immediately from your computer. Any other use,
retention, dissemination, forwarding, printing or copying of this
http://www.belkin.com/email-notice/French.html F?r die deutsche
?bersetzung: http://www.belkin.com/email-notice/German.html
__________________________________________________________
________
Post by Vladislav Grishenko
Amicalement,
--
Albert.
__________________________________________________________
________
Post by Vladislav Grishenko
Confidential This e-mail and any files transmitted with it are the
property of Belkin International, Inc. and/or its affiliates, are
confidential, and are intended solely for the use of the individual or
entity to whom this e-mail is addressed. If you are not one of the
named recipients or otherwise have reason to believe that you have
received this e-mail in error, please notify the sender and delete
this message immediately from your computer. Any other use, retention,
dissemination, forwarding, printing or copying of this e-mail is
http://www.belkin.com/email-notice/French.html Für die deutsche
Übersetzung: http://www.belkin.com/email-notice/German.html
__________________________________________________________
________
Post by Vladislav Grishenko
_______________________________________________ Dnsmasq-
discuss
Post by Vladislav Grishenko
http://lists.thekelleys.org.uk/mailman/listinfo/dnsmasq-discuss
Amicalement,
--
Albert.
w***@gmail.com
2016-10-14 11:22:03 UTC
Permalink
Post by Vladislav Grishenko
Hi, Albert,
1. HAVE_BROKEN_RTC should be used for, well, broken RTCs. Here, we are
not dealing with broken RTC.
One of which acknowledges potential problem if the clock goes backwards...
As for me it's indeed broken RTC behavior, not?
not... what defines a "broken RTC"? the time can easily be set back during a NTP
update... how far back can it be set before there's a problem? 1 millisecond? 1
hundredth of a second? 1 tenth of a second? 1 second?


case in point: recovering from Hurricane Matthew... three days without power...
while bring up the network, several machines had reverted to their default time
settings in the BIOS... on several of them, that was back in 2002... the people
bringing the machines up set the BIOS time manually and allowed the boot to
continue... on some machines, NTP syncing is run from cron at some time
period... others use NTP and adjust the clock by drift... manually setting the
time and then allowing NTP to set it more accurately can easily result in the
clock being set back by NTP... we won't even mention the problem of setting the
clock to local time and the machine using UTC so when NTP syncs, there can be a
huge (4, 5, 6, 7, 8+) hour jump backwards...
--
NOTE: No off-list assistance is given without prior approval.
*Please keep mailing list traffic on the list* unless
private contact is specifically requested and granted.
Albert ARIBAUD
2016-10-14 14:03:48 UTC
Permalink
Hi Vladislav,

Le Fri, 14 Oct 2016 11:52:33 +0500
Post by Vladislav Grishenko
Hi, Albert,
Post by Albert ARIBAUD
1. HAVE_BROKEN_RTC should be used for, well, broken RTCs. Here, we
are not dealing with broken RTC.
One of which acknowledges potential problem if the clock goes backwards...
As for me it's indeed broken RTC behavior, not?
The clock and the RTC are two different things.

The RTC is what keeps time in between power-offs. It has backup power
(frequently on PCs it is a CR20xx battery) and a slow autonomous clock
source (typically a 32 kHz crystal). IT is not set or read frequently
while the system is on, because it is usually slow to access.

So while it is on, the system keeps track of the time not through the
RTC but through some internal clock reference which is easier and
faster to read and set. It is this clock reference which gets read
every time a process needs to known what time is "now", and it is what
gets affected by NTP etc.
Post by Vladislav Grishenko
Post by Albert ARIBAUD
2. The man mage for times() states that "a portable application
would be wise to avoid using [the] value [returned by times()]. To
measure changes in elapsed time, use clock_gettime(2) instead".
Because start value of posix's times() return value may vary across
kernel versions & UNIX impl., combined with possibility of value
overflow the clock_t range. Since we don’t care neither about initial
boot value nor about sleep/suspended time (files can't be modified
when suspended, right?)
But timeouts can occur, TTLs can get past, etc. To treat those properly,
dnsmasq needs to know how much time has flown while it was sleeping (if
it ever does, of course).
Post by Vladislav Grishenko
- the only possible issue is overflow. Since
times() not counting CPU ticks in sleep/supspended mode, suggestion
clock_gettime here is about using CLOCK_MONOTONIC which is almost the
same, but with no-overflow API.
Hm... I believe that's what my proposal amounts to on systems where
CLOCK_BOOTTIME is not available.
Post by Vladislav Grishenko
Post by Albert ARIBAUD
- otherwise, if CLOCK_MONOTONIC is defined (it should always) and if
clock_gettime(CLOCK_MONOTONIC,...) succeeds at run time, use that;
Even with defined CLOCK_MONOTONIC, the real presence of this clock
source can be retrieved from kernel in runtime only.
Indeed, hence my double test suggestion: one must test the existence
of CLOCK_MONOTONIC and only call clock_gettime(CLOCK_MONOTONIC...) if
it exists, and mone must test the return value of clock_gettime() in
case dnsmasq runs on an older kernel/glic than it was compiled and
built against.
Post by Vladislav Grishenko
Yes, there're
old running kernels with no CLOCK_MONOTONIC, clock_gettime() returns
EINVAL. Same check is true for CLOCK_BOOTTIME.
Exact, hence my "returns success" conditional.
Post by Vladislav Grishenko
Best Regards, Vladislav Grishenko
Amicalement,
--
Albert.
Vladislav Grishenko
2016-10-14 14:46:13 UTC
Permalink
Hi Albert,
Post by Albert ARIBAUD
The clock and the RTC are two different things.
Of course they are.
In my opinion, RTC in dnsmasq terms shouldn't be literally treated as it exactly is, but as main (original) reason leads to local time available to application is not equal to real time.
If platform has no RTC chip, NTP is the almost only time source, so time shifts are expeted.
If platform has RTC chip, but system clock source is inaccurate and NTP "fixes" it from time to time - time shifts are expeted.
If platform has RTC chip, but is has wrong value (battery failure/etc) - time shifts are expected.
All expected time shifts cases require HAVE_BROKEN_RTC to be define for normal flow.
There's nothing to fix here, because here nothing is broken.
Post by Albert ARIBAUD
But timeouts can occur, TTLs can get past, etc. To treat those properly,
dnsmasq needs to know how much time has flown while it was sleeping (if
it ever does, of course).
It does (actually not due sleeping, more likely to timer granulation), can be easily checked in code.
Despite the fact that sleeping DHCP/DNS servers looks quite ridiculous, if dnsmasq code without HAVE_BROKEN_RTC has issue with time goes backward too much, it needs to be fixed.
Not with changing the clock source, because it'll just mask the problem, but with proper dealing with such kind of time values.
And, seems John have already found last_change variable wasn't static, didn’t check it by myself yet.

Best Regards, Vladislav Grishenko
Post by Albert ARIBAUD
-----Original Message-----
Sent: Friday, October 14, 2016 7:04 PM
Subject: Re: [Dnsmasq-discuss] Dnsmasq not resolving addresses for an hour
Hi Vladislav,
Le Fri, 14 Oct 2016 11:52:33 +0500
Post by Vladislav Grishenko
Hi, Albert,
Post by Albert ARIBAUD
1. HAVE_BROKEN_RTC should be used for, well, broken RTCs. Here, we
are not dealing with broken RTC.
One of which acknowledges potential problem if the clock goes backwards...
As for me it's indeed broken RTC behavior, not?
The clock and the RTC are two different things.
The RTC is what keeps time in between power-offs. It has backup power
(frequently on PCs it is a CR20xx battery) and a slow autonomous clock
source (typically a 32 kHz crystal). IT is not set or read frequently
while the system is on, because it is usually slow to access.
So while it is on, the system keeps track of the time not through the
RTC but through some internal clock reference which is easier and
faster to read and set. It is this clock reference which gets read
every time a process needs to known what time is "now", and it is what
gets affected by NTP etc.
Post by Vladislav Grishenko
Post by Albert ARIBAUD
2. The man mage for times() states that "a portable application
would be wise to avoid using [the] value [returned by times()]. To
measure changes in elapsed time, use clock_gettime(2) instead".
Because start value of posix's times() return value may vary across
kernel versions & UNIX impl., combined with possibility of value
overflow the clock_t range. Since we don’t care neither about initial
boot value nor about sleep/suspended time (files can't be modified
when suspended, right?)
But timeouts can occur, TTLs can get past, etc. To treat those properly,
dnsmasq needs to know how much time has flown while it was sleeping (if
it ever does, of course).
Post by Vladislav Grishenko
- the only possible issue is overflow. Since
times() not counting CPU ticks in sleep/supspended mode, suggestion
clock_gettime here is about using CLOCK_MONOTONIC which is almost the
same, but with no-overflow API.
Hm... I believe that's what my proposal amounts to on systems where
CLOCK_BOOTTIME is not available.
Post by Vladislav Grishenko
Post by Albert ARIBAUD
- otherwise, if CLOCK_MONOTONIC is defined (it should always) and if
clock_gettime(CLOCK_MONOTONIC,...) succeeds at run time, use that;
Even with defined CLOCK_MONOTONIC, the real presence of this clock
source can be retrieved from kernel in runtime only.
Indeed, hence my double test suggestion: one must test the existence
of CLOCK_MONOTONIC and only call clock_gettime(CLOCK_MONOTONIC...) if
it exists, and mone must test the return value of clock_gettime() in
case dnsmasq runs on an older kernel/glic than it was compiled and
built against.
Post by Vladislav Grishenko
Yes, there're
old running kernels with no CLOCK_MONOTONIC, clock_gettime() returns
EINVAL. Same check is true for CLOCK_BOOTTIME.
Exact, hence my "returns success" conditional.
Post by Vladislav Grishenko
Best Regards, Vladislav Grishenko
Amicalement,
--
Albert.
Albert ARIBAUD
2016-10-14 17:09:31 UTC
Permalink
Hi,

Le Fri, 14 Oct 2016 19:46:13 +0500
Post by Vladislav Grishenko
Post by Albert ARIBAUD
But timeouts can occur, TTLs can get past, etc. To treat those
properly, dnsmasq needs to know how much time has flown while it
was sleeping (if it ever does, of course).
It does (actually not due sleeping, more likely to timer
granulation), can be easily checked in code. Despite the fact that
sleeping DHCP/DNS servers looks quite ridiculous, if dnsmasq code
without HAVE_BROKEN_RTC has issue with time goes backward too much,
it needs to be fixed.
I think we can agree on this. :)
Post by Vladislav Grishenko
Not with changing the clock source, because
it'll just mask the problem
Well, IIUC, here the source of the problem *is* the clock source --
namely, that CLOCK_REALTIME is used for measuring elapsed time but is
not monotonic and therefore ill-suited for measuring elapsed time.
Switching to CLOCK_MONOTONIC, which is, well, monotonic, is certainly
not "masking the problem". Switching to CLOCK_BOOTTIME, which is "more
monotonic" yet, is an even better solution if applicable.
Post by Vladislav Grishenko
, but with proper dealing with such kind
of time values. And, seems John have already found last_change
variable wasn't static, didn’t check it by myself yet.
How exactly is that second is totally related to how dnsmasq
handles time?
Post by Vladislav Grishenko
Best Regards, Vladislav Grishenko
Amicalement,
--
Albert.
Albert ARIBAUD
2016-10-14 18:14:31 UTC
Permalink
Hi,

Le Fri, 14 Oct 2016 19:09:31 +0200
Post by Albert ARIBAUD
How exactly is that second is totally related to how dnsmasq
handles time?
Ahem. Rolling that back.

How is that second issue related to now dnsmasq handles time?

With apologies.

Amicalement,
--
Albert.
John Knight
2016-10-17 21:15:54 UTC
Permalink
Hi Albert,

Actually the second issue relating to the variable last_change not being a static is NOT specifically related to how dnsmasq handles time. Without the static definition of last_change, the "if(difftime(statbuf.st_mtime, last_change)" ALWAYS evaluates to > 0, which is not what the code was intended to do. I believe the intent of this code is to avoid unnecessary calls to reload_servers()... in other words, if there has been no time elapsed since we last reloaded the servers, then we shouldn't do it again. The result of the difftime() in this case should be the number of seconds since dnsmasq last did the reload_servers. With the broken code, the difftime() ALWAYS resulted in the number of seconds between time 0 (the initialized value of last_change) and the timestamp on the /etc/resolv.conf file... and consequently, this would ALWAYS result in a call to reload_servers(), whether it was needed or not. I don't consider this a major issue.. with the broken code, it just means that there may be an extra reload_servers done here and there.

The issue with the current time being potentially less than current timestamps in dnsmasq() is an issue in dnsmasq in general and comes into play whenever difftime() is used to determine elapsed time. In most uses of difftime, the code assumes current time is greater than any cached time; when current time is less than the cached timestamp, the code has issues. As I pointed out previously, I saw only one use of difftime() where the code actually took into consideration that the current time might be less than cached timestamp... all other uses of difftime MAY have issues when result of difftime is negative. With the impact of ntp, it is possible that the time goes backwards on the router, such as might happen with daylight savings time in the fall, when clocks are adjusted backwards in the USA by one hour... or when the clock on the router is not right and ntp corrects it.

I hope this clarifies what I was trying to say.

John


__________________________________________________________________ Confidential This e-mail and any files transmitted with it are the property of Belkin International, Inc. and/or its affiliates, are confidential, and are intended solely for the use of the individual or entity to whom this e-mail is addressed. If you are not one of the named recipients or otherwise have reason to believe that you have received this e-mail in error, please notify the sender and delete this message immediately from your computer. Any other use, retention, dissemination, forwarding, printing or copying of this e-mail is strictly prohibited. Pour la version française: http://www.belkin.com/email-notice/French.html Für die deutsche Übersetzung: http://www.belkin.com/email-notice/German.html __________________________________________________________________
John Knight
2016-10-14 01:17:49 UTC
Permalink
Hi All,

Well, I believe I have found another issue with the function poll_resolv(). The variable last_change needs to be static. Otherwise, the difftime(statbuf.st_mtime, last_change) always is a rather large number... the number of seconds since time 0 (the initialized value of last_change at the top of the function). By making this variable static, the last_change truly becomes the elapsed time in seconds since the last change for /etc/resolv.conf. The last_change variable really represents the timestamp of when reload_servers() was called last.

To see the issue, I added a my_syslog to log that a change is detected and what the time difference is:

if (force || (statbuf.st_mtime != res->mtime))
{
res->mtime = statbuf.st_mtime;

if (difftime(statbuf.st_mtime, last_change) != 0.0)
{
my_syslog(LOG_INFO, _("change detected for %s; time difference is %f"),
res->name, difftime(statbuf.st_mtime, last_change));
last_change = statbuf.st_mtime;
latest = res;
}
}
}

Without the static for last_change variable, the time difference reported is rather large... number of seconds since epoch (0).
static time_t last_change = 0;

I added the static in the above statement at the top of the function.

John

__________________________________________________________________ Confidential This e-mail and any files transmitted with it are the property of Belkin International, Inc. and/or its affiliates, are confidential, and are intended solely for the use of the individual or entity to whom this e-mail is addressed. If you are not one of the named recipients or otherwise have reason to believe that you have received this e-mail in error, please notify the sender and delete this message immediately from your computer. Any other use, retention, dissemination, forwarding, printing or copying of this e-mail is strictly prohibited. Pour la version fran?aise: http://www.belkin.com/email-notice/French.html F?r die deutsche ?bersetzung: http://www.belkin.com/email-notice/German.html __________________________________________________________________
John Knight
2016-10-12 21:49:50 UTC
Permalink
Hi,

I am having an issue with dnsmasq version 2.55.

First a little background about when the issue occurs: In our router, we have both an ether wan port and a 3G dongle. If the wan cable is plugged in, the connectivity to the internet is via the wan cable. However, if the wan cable is unplugged and there is a dongle plugged into the router, connectivity to the internet now switches to the 3g dongle. In essence a switchover occurs. What we have found is that the dnsmasq's dns sometimes fails after the switchover. The /etc/resolv.conf file in this case remains the same before and after the switchover.

I have invoked dnsmasq with the debugging option -q, and I see that at time 22:09:33 a message "no servers found in /etc/resolv.conf, will retry". Presumably this is because at the time dnsmasq found that it could not connect to the google dns server 8.8.8.8, probably while the connection was being switched over. For nearly an hour according to the dnsmasq.log file, I see various queries to dnsmasq asking for IP addresses associated with the given URLs. Unfortunately, no response is given to these queries. During this period after the switchover, I can ping the google dns server at 8.8.8.8 and it responds; I can NOT ping google.com as it fails to resolve the IP address for google.com. I can also ping other IP addresses successfully after the switchover as long as I ping with the IP address and not the URL. Nslookup consistently fails to resolve the IP addresses. So it is pretty certain that we have connectivity to the internet, just no DNS server to help return the IP addresses for the URLS.

Now back to the dnsmasq.log, at timestamp 23:08:02 we finally see the message "reading /etc/resolv.conf". After this dnsmasq is now successfully answering the queries from the clients, resolving their addresses once again. From my perspective, it appears that there is something wrong with dnsmasq not doing the retry to connect to the upstream dns server (8.8.8.8) in a timely manner. I have now seen this anomaly twice, both times with a delay of almost exactly an hour for the dnsmasq to come back to life. The immediate question is what is driving how often does dnsmasq do the retry? I would think that if the upstream dns server is not accessible, it would retry and a fairly short time interval... but alas, I don't see this in the dnsmasq.log file. How often should dnsmasq be retrying? And what would do you think would make it retry at almost exactly an hour later?

Is this a known issue? I have looked at the CHANGELOG and I see only a few changes that from their description might be related... changes in version 2.72:

version 2.72
Fix race condition which could lock up dnsmasq when an
interface goes down and up rapidly. Thanks to Conrad
Kostecki for helping to chase this down.

Fix bug which caused dnsmasq to become unresponsive if it
failed to send packets due to a network interface disappearing.
Thanks to Niels Peen for spotting this.

But alas, I don't know if these fixes have anything to do with the issue I am seeing; there is not enough info here to know how the issues manifested themselves.
I have attached the dnsmasq.log file for you review.
Please let me know what you think. I need to get to the bottom of this asap.
Best Regards,
John Knight




__________________________________________________________________ Confidential This e-mail and any files transmitted with it are the property of Belkin International, Inc. and/or its affiliates, are confidential, and are intended solely for the use of the individual or entity to whom this e-mail is addressed. If you are not one of the named recipients or otherwise have reason to believe that you have received this e-mail in error, please notify the sender and delete this message immediately from your computer. Any other use, retention, dissemination, forwarding, printing or copying of this e-mail is strictly prohibited. Pour la version fran?aise: http://www.belkin.com/email-notice/French.html F?r die deutsche ?bersetzung: http://www.belkin.com/email-notice/German.html __________________________________________________________________
John Knight
2016-10-18 22:36:07 UTC
Permalink
Hi All,

Still trying to figure this one out, At this point, I don't think the issue is caused by time going backwards, although I am convinced that there are issues with dnsmasq when time does go backwards. There is a lot of previous discussion on the backwards time issue.

What I am pursuing at the moment is the "retry later" that is supposed to occur after reload_servers() fails because the /etc/resolv.conf is being updated at the same time as dnsmasq is trying to read it.

One thing I have noticed is that dnsmasq does NOT call poll_resolv() very often. I would expect that if retry later condition occurs, dnsmasq should be calling poll_resolv() frequently as nothing can really be done until the upstream dns server is identified.

Now for some questions:

The main while(1) loop uses select() to determine if it has work to do. In most cases, it appears to use timeout of 0, which I believe means just wait indefinitely for work on the file descriptors. Other times, it appears that the timeout is set to a quarter second when doing a tftp transfer or polling the dbus.

Now what concerns me is that when a "retry later" condition occurs, we may get stuck on the select() for a long period of time. Alas, I do not know how frequent one might expect to see work arrive on the file descriptors that select is watching, so I don't really know if this is a long time or not. It seems though that in this failure scenario, the poll_resolv() function does NOT get called very often at all.

My gut feeling is that there always needs to be a timeout on the select call as the poll_resolv() should be called fairly frequently. The code that exists today where poll_resolv() normally is called from this loop suggests a poll rate of about once a second. This definitely does not happen today. By just adding a my_syslog() message to the top of poll_resolv(), it is very clear from the logfile that it is not called often, and way to infrequently to resolve the "retry later" condition in a timely manner.

Going forward, as the next thing for me to try, I am going to add a timeout for the select... perhaps a modest once a second or two.

But I would like to know what you all of think of this... does this make sense to do? Is there ever a case where we might not get any work on the files select is monitoring for nearly an hour? I am trying to make sense of this issue.

Thanks,

John Knight

__________________________________________________________________ Confidential This e-mail and any files transmitted with it are the property of Belkin International, Inc. and/or its affiliates, are confidential, and are intended solely for the use of the individual or entity to whom this e-mail is addressed. If you are not one of the named recipients or otherwise have reason to believe that you have received this e-mail in error, please notify the sender and delete this message immediately from your computer. Any other use, retention, dissemination, forwarding, printing or copying of this e-mail is strictly prohibited. Pour la version fran?aise: http://www.belkin.com/email-notice/French.html F?r die deutsche ?bersetzung: http://www.belkin.com/email-notice/German.html __________________________________________________________________
Albert ARIBAUD
2016-10-19 05:53:04 UTC
Permalink
Hi John,

Le Tue, 18 Oct 2016 22:36:07 +0000
Post by John Knight
Hi All,
The main while(1) loop uses select() to determine if it has work to
do. In most cases, it appears to use timeout of 0, which I believe
means just wait indefinitely for work on the file descriptors. Other
times, it appears that the timeout is set to a quarter second when
doing a tftp transfer or polling the dbus.
Now what concerns me is that when a "retry later" condition occurs,
we may get stuck on the select() for a long period of time. Alas, I
do not know how frequent one might expect to see work arrive on the
file descriptors that select is watching, so I don't really know if
this is a long time or not. It seems though that in this failure
scenario, the poll_resolv() function does NOT get called very often
at all.
Actually, if dnsmasq does not receive any request from clients, it does
not need to poll servers, so I would ask: does the select() include
descriptors for client requests (either UDP datagrams received, or TCP
connections opened)? If so, I think it will exit just when necessary
and no tiemout is needed; otherwise, you are right that a timeout is
required.

Also, it may be improbable that select() does not return for a whole
hour; but then, is every return from select() followed by a resolv file
poll, or can select() return and then be entered again without polling
the resolv files? I am thinking, for instance, about cached answers
which do not need servers if their TTL is long enough.
Post by John Knight
My gut feeling is that there always needs to be a timeout on the
select call as the poll_resolv() should be called fairly frequently.
The code that exists today where poll_resolv() normally is called
from this loop suggests a poll rate of about once a second. This
definitely does not happen today. By just adding a my_syslog()
message to the top of poll_resolv(), it is very clear from the
logfile that it is not called often, and way to infrequently to
resolve the "retry later" condition in a timely manner.
Can you compare when poll_resolv() is called wrt when the select() is
exited -- and for what reason?
Post by John Knight
Going forward, as the next thing for me to try, I am going to add a
timeout for the select... perhaps a modest once a second or two.
I would personally investigate further on a gut feeling without
changing the code behavior, because my changes might have unwanted
effects which can actually hide the root cause I am looking for -- but
to each his/her own.
Post by John Knight
But I would like to know what you all of think of this... does this
make sense to do? Is there ever a case where we might not get any
work on the files select is monitoring for nearly an hour? I am
trying to make sense of this issue.
Not entirely sure what you mean with "Is there ever a case where we
might not get any work on the files select is monitoring for nearly an
hour"; I will assume you mean "Is there a normal case where dnsmasq
would not poll for changes in resolv files for an hour". If so, then I
would say it depends on how much traffic dnsmasq receives and how much
of it can be answered from cache.
Post by John Knight
Thanks,
John Knight
Amicalement,
--
Albert.
John Knight
2016-10-19 19:56:58 UTC
Permalink
Hi Albert,

My comments inline.

John
Post by John Knight
Hi All,
The main while(1) loop uses select() to determine if it has work to
do. In most cases, it appears to use timeout of 0, which I believe
means just wait indefinitely for work on the file descriptors. Other
times, it appears that the timeout is set to a quarter second when
doing a tftp transfer or polling the dbus.
Now what concerns me is that when a "retry later" condition occurs, we
may get stuck on the select() for a long period of time. Alas, I do
not know how frequent one might expect to see work arrive on the file
descriptors that select is watching, so I don't really know if this is
a long time or not. It seems though that in this failure scenario,
the poll_resolv() function does NOT get called very often at all.
Albert: Actually, if dnsmasq does not receive any request from clients, it does not need to poll servers, so I would ask: does the select() include descriptors for client requests (either UDP datagrams received, or TCP connections opened)? If so, I think it will exit just when necessary and no tiemout is needed; otherwise, you are right that a timeout is required.

Albert: Also, it may be improbable that select() does not return for a whole hour; but then, is every return from select() followed by a resolv file poll, or can select() return and then be entered again without polling the resolv files? I am thinking, for instance, about cached answers which do not need servers if their TTL is long enough.

John: I have made a simple change that provides a one second timeout for select. I have found that dnsmasq is much more responsive now to changes made to /etc/resolv.conf. With code that calls poll_resolv, it rate limits the calls to once every two seconds, which I believe is fine and responsive enough.

John: Given I am testing this in a lab situation and just me on the console and one idle PC connected to the router, there is little use of DNS. In my experience since the initial failure, I believe I did see poll_resolv polled in one case at an interval of about 20 minutes. I don’t think this poll interval should be driven by how active the users are and how much they use dns; just my personal feeling about that.

John: It should be noted that if I had been doing a tftp transfer, the code would set the select timeout for 250ms. I am not sure why the tftp transfer being active would warrant the much quicker timeout? Anyhow, what I did was an else statement... if tftp transfer, set timeout to 250ms else set timeout to 1 second.

John: I don't know dnsmasq well enough to answer your other questions about select and what all of the file descriptors are associated with. Perhaps someone more knowledgeable can chime in. My change was made in response to the situation where a "retry later" situation was pending, and not getting poll_resolv was not getting polled again in a reasonable time period to do the retry.

John: I believe on our router, dhcp entries have an hour TTL and we do use dnsmasq for dhcp. On an idle PC, would it have any reason to initiate a dnsmasq query? Occasionally if the browser is up and running, I do see the browser query the address of its update server, but I haven't generally speaking had my browser running on the PC while doing my dnsmasq testing. So it seems to me that the two possible sources to cause dnsmasq activity (ie. Browser and dhcp) may be idle for at least an hour... so this seems like a possibility that poll_resolv() may not be getting called in this scenario for a long time.
Post by John Knight
My gut feeling is that there always needs to be a timeout on the
select call as the poll_resolv() should be called fairly frequently.
The code that exists today where poll_resolv() normally is called from
this loop suggests a poll rate of about once a second. This
definitely does not happen today. By just adding a my_syslog()
message to the top of poll_resolv(), it is very clear from the logfile
that it is not called often, and way to infrequently to resolve the
"retry later" condition in a timely manner.
Albert: Can you compare when poll_resolv() is called wrt when the select() is exited -- and for what reason?

John: What I did to see relative times between select and calls to poll_resolv was to add calls to my_syslog() before the select and at the top of poll_resolv(). The timestamp in the dnsmasq logfile was used to see how much time between calls. I don't know what the reason for exiting select is... indeed, for what I was doing, I really didn't care... I just needed to know when poll_resolv() was getting called and how often.
Post by John Knight
Going forward, as the next thing for me to try, I am going to add a
timeout for the select... perhaps a modest once a second or two.
Albert: I would personally investigate further on a gut feeling without changing the code behavior, because my changes might have unwanted effects which can actually hide the root cause I am looking for -- but to each his/her own.

John: My boss is on my case to getting this resolved asap. Based on my trying of the select timeout, this appears to have at least solved part of the problem... poll_resolv() not getting called back in a reasonable timeframe after a "retry later" issue. I need to keep moving forward; not sure I have the time for indepth investigation. I do know other code does set select timeout, so I do know this code path is not unprecedented, so risk should be low.
Post by John Knight
But I would like to know what you all of think of this... does this
make sense to do? Is there ever a case where we might not get any
work on the files select is monitoring for nearly an hour? I am
trying to make sense of this issue.
Albert: Not entirely sure what you mean with "Is there ever a case where we might not get any work on the files select is monitoring for nearly an hour"; I will assume you mean "Is there a normal case where dnsmasq would not poll for changes in resolv files for an hour". If so, then I would say it depends on how much traffic dnsmasq receives and how much of it can be answered from cache.

John: Your interpretation is correct. Thanks for the info and your help Albert. I am glad I have someone listening. When I am done, I will forward the diffs for the changes I have made to dnsmasq for your review.
Post by John Knight
Thanks,
John Knight
Amicalement,
--
Albert.

__________________________________________________________________ Confidential This e-mail and any files transmitted with it are the property of Belkin International, Inc. and/or its affiliates, are confidential, and are intended solely for the use of the individual or entity to whom this e-mail is addressed. If you are not one of the named recipients or otherwise have reason to believe that you have received this e-mail in error, please notify the sender and delete this message immediately from your computer. Any other use, retention, dissemination, forwarding, printing or copying of this e-mail is strictly prohibited. Pour la version française: http://www.belkin.com/email-notice/French.html Für die deutsche Übersetzung: http://www.belkin.com/email-notice/German.html __________________________________________________________________
John Knight
2016-10-24 17:57:03 UTC
Permalink
Hi Albert,

I have finished making my changes to dnsmasq 2.55 and I have a patch file. However, I am not sure how to submit it... do I send it to the discussion list?

Thanks,

John Knight


-----Original Message-----
From: John Knight
Sent: Wednesday, October 19, 2016 12:57 PM
To: 'Albert ARIBAUD'
Cc: dnsmasq-***@lists.thekelleys.org.uk
Subject: RE: [Dnsmasq-discuss] Dnsmasq not resolving addresses for an hour

Hi Albert,

My comments inline.

John
Post by John Knight
Hi All,
The main while(1) loop uses select() to determine if it has work to
do. In most cases, it appears to use timeout of 0, which I believe
means just wait indefinitely for work on the file descriptors. Other
times, it appears that the timeout is set to a quarter second when
doing a tftp transfer or polling the dbus.
Now what concerns me is that when a "retry later" condition occurs, we
may get stuck on the select() for a long period of time. Alas, I do
not know how frequent one might expect to see work arrive on the file
descriptors that select is watching, so I don't really know if this is
a long time or not. It seems though that in this failure scenario,
the poll_resolv() function does NOT get called very often at all.
Albert: Actually, if dnsmasq does not receive any request from clients, it does not need to poll servers, so I would ask: does the select() include descriptors for client requests (either UDP datagrams received, or TCP connections opened)? If so, I think it will exit just when necessary and no tiemout is needed; otherwise, you are right that a timeout is required.

Albert: Also, it may be improbable that select() does not return for a whole hour; but then, is every return from select() followed by a resolv file poll, or can select() return and then be entered again without polling the resolv files? I am thinking, for instance, about cached answers which do not need servers if their TTL is long enough.

John: I have made a simple change that provides a one second timeout for select. I have found that dnsmasq is much more responsive now to changes made to /etc/resolv.conf. With code that calls poll_resolv, it rate limits the calls to once every two seconds, which I believe is fine and responsive enough.

John: Given I am testing this in a lab situation and just me on the console and one idle PC connected to the router, there is little use of DNS. In my experience since the initial failure, I believe I did see poll_resolv polled in one case at an interval of about 20 minutes. I don’t think this poll interval should be driven by how active the users are and how much they use dns; just my personal feeling about that.

John: It should be noted that if I had been doing a tftp transfer, the code would set the select timeout for 250ms. I am not sure why the tftp transfer being active would warrant the much quicker timeout? Anyhow, what I did was an else statement... if tftp transfer, set timeout to 250ms else set timeout to 1 second.

John: I don't know dnsmasq well enough to answer your other questions about select and what all of the file descriptors are associated with. Perhaps someone more knowledgeable can chime in. My change was made in response to the situation where a "retry later" situation was pending, and not getting poll_resolv was not getting polled again in a reasonable time period to do the retry.

John: I believe on our router, dhcp entries have an hour TTL and we do use dnsmasq for dhcp. On an idle PC, would it have any reason to initiate a dnsmasq query? Occasionally if the browser is up and running, I do see the browser query the address of its update server, but I haven't generally speaking had my browser running on the PC while doing my dnsmasq testing. So it seems to me that the two possible sources to cause dnsmasq activity (ie. Browser and dhcp) may be idle for at least an hour... so this seems like a possibility that poll_resolv() may not be getting called in this scenario for a long time.
Post by John Knight
My gut feeling is that there always needs to be a timeout on the
select call as the poll_resolv() should be called fairly frequently.
The code that exists today where poll_resolv() normally is called from
this loop suggests a poll rate of about once a second. This
definitely does not happen today. By just adding a my_syslog()
message to the top of poll_resolv(), it is very clear from the logfile
that it is not called often, and way to infrequently to resolve the
"retry later" condition in a timely manner.
Albert: Can you compare when poll_resolv() is called wrt when the select() is exited -- and for what reason?

John: What I did to see relative times between select and calls to poll_resolv was to add calls to my_syslog() before the select and at the top of poll_resolv(). The timestamp in the dnsmasq logfile was used to see how much time between calls. I don't know what the reason for exiting select is... indeed, for what I was doing, I really didn't care... I just needed to know when poll_resolv() was getting called and how often.
Post by John Knight
Going forward, as the next thing for me to try, I am going to add a
timeout for the select... perhaps a modest once a second or two.
Albert: I would personally investigate further on a gut feeling without changing the code behavior, because my changes might have unwanted effects which can actually hide the root cause I am looking for -- but to each his/her own.

John: My boss is on my case to getting this resolved asap. Based on my trying of the select timeout, this appears to have at least solved part of the problem... poll_resolv() not getting called back in a reasonable timeframe after a "retry later" issue. I need to keep moving forward; not sure I have the time for indepth investigation. I do know other code does set select timeout, so I do know this code path is not unprecedented, so risk should be low.
Post by John Knight
But I would like to know what you all of think of this... does this
make sense to do? Is there ever a case where we might not get any
work on the files select is monitoring for nearly an hour? I am
trying to make sense of this issue.
Albert: Not entirely sure what you mean with "Is there ever a case where we might not get any work on the files select is monitoring for nearly an hour"; I will assume you mean "Is there a normal case where dnsmasq would not poll for changes in resolv files for an hour". If so, then I would say it depends on how much traffic dnsmasq receives and how much of it can be answered from cache.

John: Your interpretation is correct. Thanks for the info and your help Albert. I am glad I have someone listening. When I am done, I will forward the diffs for the changes I have made to dnsmasq for your review.
Post by John Knight
Thanks,
John Knight
Amicalement,
--
Albert.

__________________________________________________________________ Confidential This e-mail and any files transmitted with it are the property of Belkin International, Inc. and/or its affiliates, are confidential, and are intended solely for the use of the individual or entity to whom this e-mail is addressed. If you are not one of the named recipients or otherwise have reason to believe that you have received this e-mail in error, please notify the sender and delete this message immediately from your computer. Any other use, retention, dissemination, forwarding, printing or copying of this e-mail is strictly prohibited. Pour la version française: http://www.belkin.com/email-notice/French.html Für die deutsche Übersetzung: http://www.belkin.com/email-notice/German.html __________________________________________________________________
John Knight
2016-10-24 21:09:56 UTC
Permalink
Hi Albert,

I took a look at 2.76 before even working on 2.55. In this particular area, the code bases are virtually the same... so the changes would apply to the latest version as well. I will submit patch to the mail list as separate email.

John

-----Original Message-----
From: Albert ARIBAUD [mailto:***@free.fr]
Sent: Monday, October 24, 2016 1:56 PM
To: John Knight
Cc: dnsmasq-***@lists.thekelleys.org.uk
Subject: Re: [Dnsmasq-discuss] Dnsmasq not resolving addresses for an hour

Hi John,

Yes, you can submit patches to the list.

However, 2.55 is quite old with respect to the current release of dnsmasq, which is 2.76 IIRC.

Amicalement,
Albert.

__________________________________________________________________ Confidential This e-mail and any files transmitted with it are the property of Belkin International, Inc. and/or its affiliates, are confidential, and are intended solely for the use of the individual or entity to whom this e-mail is addressed. If you are not one of the named recipients or otherwise have reason to believe that you have received this e-mail in error, please notify the sender and delete this message immediately from your computer. Any other use, retention, dissemination, forwarding, printing or copying of this e-mail is strictly prohibited. Pour la version française: http://www.belkin.com/email-notice/French.html Für die deutsche Übersetzung: http://www.belkin.com/email-notice/German.html __________________________________________________________________
John Knight
2016-10-24 21:16:46 UTC
Permalink
Hi All,

Here is the patch file I generated to fix the issue with poll_resolv that I was seeing with dnsmasq 2.55:

----------------- Begin Patch -------------
diff -Naur dnsmasq-2.55/src/dnsmasq.c dnsmasq-2.55_new/src/dnsmasq.c
--- dnsmasq-2.55/src/dnsmasq.c2016-10-20 10:43:23.609655958 -0700
+++ dnsmasq-2.55_new/src/dnsmasq.c2016-10-20 10:42:47.745654627 -0700
@@ -587,6 +587,14 @@ int main (int argc, char **argv)
t.tv_usec = 250000;
tp = &t;
}
+ else
+{
+ /* we need to call poll_resolv at least once a second...
+ so we should not hang on select for more than a second */
+ t.tv_sec = 1;
+ t.tv_usec = 0;
+ tp = &t;
+}

#ifdef HAVE_DBUS
set_dbus_listeners(&maxfd, &rset, &wset, &eset);
@@ -904,7 +912,7 @@ void poll_resolv(int force, int do_reloa
{
struct resolvc *res, *latest;
struct stat statbuf;
- time_t last_change = 0;
+ static time_t last_change = 0;
/* There may be more than one possible file.
Go through and find the one which changed _last_.
Warn of any which can't be read. */
@@ -939,8 +947,9 @@ void poll_resolv(int force, int do_reloa
if (force || (statbuf.st_mtime != res->mtime))
{
res->mtime = statbuf.st_mtime;
- if (difftime(statbuf.st_mtime, last_change) > 0.0)
+ if (difftime(statbuf.st_mtime, last_change) != 0.0)
{
+ my_syslog(LOG_INFO, _("change detected for %s; time difference is %f"), res->name, difftime(statbuf.st_mtime, last_change));
last_change = statbuf.st_mtime;
latest = res;
}
@@ -963,6 +972,8 @@ void poll_resolv(int force, int do_reloa
else
{
latest->mtime = 0;
+ /* force reload next time poll_resolv is called */
+ last_change = 0;
if (!warned)
{
my_syslog(LOG_WARNING, _("no servers found in %s, will retry"), latest->name);
__________________________________________________________________ Confidential This e-mail and any files transmitted with it are the property of Belkin International, Inc. and/or its affiliates, are confidential, and are intended solely for the use of the individual or entity to whom this e-mail is addressed. If you are not one of the named recipients or otherwise have reason to believe that you have received this e-mail in error, please notify the sender and delete this message immediately from your computer. Any other use, retention, dissemination, forwarding, printing or copying of this e-mail is strictly prohibited. Pour la version française: http://www.belkin.com/email-notice/French.html FÃŒr die deutsche Übersetzung: http://www.belkin.com/email-notice/German.html __________________________________________________________________
Vladislav Grishenko
2016-10-24 22:02:25 UTC
Permalink
This post might be inappropriate. Click to display it.
Loading...