[PATCH] Fix GDB hang with remote after error from resume

classic Classic list List threaded Threaded
10 messages Options
Reply | Threaded
Open this post in threaded view
|

[PATCH] Fix GDB hang with remote after error from resume

Andreas Arnez-2
Since this commit --

  Fix PR18360 - internal error when using "interrupt -a"
  (https://sourceware.org/git/gitweb.cgi?p=binutils-gdb.git;a=commit;h=c65d6b55)

-- the testsuite shows long delays on s390 with native-gdbserver when
executing certain tests, such as watchpoints.exp.  These hangs have been
discussed before in the context of buildbot problems, see here:

  https://sourceware.org/ml/gdb-patches/2017-12/msg00413.html

The problem can easily be triggered by stopping on a breakpoint, then
setting impossible watchpoints, and finally doing "continue".  Then, after
having set the step-over state (in keep_going_pass_signal in infrun.c),
GDB tries to insert breakpoints and watchpoints into the inferior.  This
fails, and the "continue" command is aborted.  But the step-over state is
not cleared in this case, which causes future step-over attempts to be
skipped since GDB thinks that "we already have an in-line step-over
operation ongoing" (see start_step_over in infrun.c).  Thus the next
"continue" just goes on to wait for events from the remote, which will
never occur.

The problem can also be reproduced on amd64 with native-gdbserver, using
the following change to watchpoints.exp:

-- >8 --
--- a/gdb/testsuite/gdb.base/watchpoints.exp
+++ b/gdb/testsuite/gdb.base/watchpoints.exp
@@ -61,2 +61,3 @@ with_test_prefix "before inferior start" {
     gdb_test "watch ival3" ".*" ""
+    gdb_test "watch *(char \[256\] *) main"

-- >8 --

To fix the hang, this patch clears the step-over info when
insert_breakpoints has failed.  Of course, with native-gdbserver the
watchpoints.exp test case still causes many FAILs on s390, because
gdbserver does not support watchpoints for that target.  This is a
separate issue.

gdb/ChangeLog:

        * infrun.c (keep_going_pass_signal): Clear step-over info when
        insert_breakpoints fails.
---
 gdb/infrun.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/gdb/infrun.c b/gdb/infrun.c
index 7e8d8da..720443b 100644
--- a/gdb/infrun.c
+++ b/gdb/infrun.c
@@ -7751,6 +7751,7 @@ keep_going_pass_signal (struct execution_control_state *ecs)
  {
   exception_print (gdb_stderr, e);
   stop_waiting (ecs);
+  clear_step_over_info ();
   return;
  }
       END_CATCH
--
2.5.0

Reply | Threaded
Open this post in threaded view
|

[pushed + testcase] Re: [PATCH] Fix GDB hang with remote after error from resume

Pedro Alves-7
On 01/10/2018 04:56 PM, Andreas Arnez wrote:

> Since this commit --
>
>   Fix PR18360 - internal error when using "interrupt -a"
>   (https://sourceware.org/git/gitweb.cgi?p=binutils-gdb.git;a=commit;h=c65d6b55)
>
> -- the testsuite shows long delays on s390 with native-gdbserver when
> executing certain tests, such as watchpoints.exp.  These hangs have been
> discussed before in the context of buildbot problems, see here:
>
>   https://sourceware.org/ml/gdb-patches/2017-12/msg00413.html
>
> The problem can easily be triggered by stopping on a breakpoint, then
> setting impossible watchpoints, and finally doing "continue".  Then, after
> having set the step-over state (in keep_going_pass_signal in infrun.c),
> GDB tries to insert breakpoints and watchpoints into the inferior.  This
> fails, and the "continue" command is aborted.  But the step-over state is
> not cleared in this case, which causes future step-over attempts to be
> skipped since GDB thinks that "we already have an in-line step-over
> operation ongoing" (see start_step_over in infrun.c).  Thus the next
> "continue" just goes on to wait for events from the remote, which will
> never occur.
>

Thanks much for the fix.

> The problem can also be reproduced on amd64 with native-gdbserver, using
> the following change to watchpoints.exp:
>
> -- >8 --
> --- a/gdb/testsuite/gdb.base/watchpoints.exp
> +++ b/gdb/testsuite/gdb.base/watchpoints.exp
> @@ -61,2 +61,3 @@ with_test_prefix "before inferior start" {
>      gdb_test "watch ival3" ".*" ""
> +    gdb_test "watch *(char \[256\] *) main"
>
> -- >8 --
>

One question I had with this is why would it only trigger with
native-gdbserver.  After debugging a bit, it was obvious -- the reason
is simply that native debugging uses displaced stepping by default, unlike
remote debugging, because native debugging enables all-stop-on-top-of-non-stop
by default.  I never got around to flipping that
on ("maint set target-non-stop on") by default with remote debugging.

I've pushed in your patch to both master and 8.1, along with this
follow up commit adding a testcase.

From 1d17025506de70cb1d9d5b7a5654e40ce689bf26 Mon Sep 17 00:00:00 2001
From: Pedro Alves <[hidden email]>
Date: Fri, 12 Jan 2018 18:59:40 +0000
Subject: [PATCH] Add testcase for GDB hang fixed by previous commit

This adds a testcase for the previous commit.  The regression was
related to in-line step overs.  The reason we didn't see it on native
x86-64/s390 GNU/Linux testing is that native debugging uses displaced
stepping by default (because native debugging defaults to "maint set
target-non-stop on"), unlike remote debugging.

So in order to trigger the bug with native debugging as well, the
testcase disables displaced stepping explicitly.

Also, instead of using watchpoints to trigger the regression, the
testcase uses a breakpoint at address 0, which should be more
portable.

gdb/testsuite/ChangeLog:
2018-01-12  Pedro Alves  <[hidden email]>

        * gdb.base/continue-after-aborted-step-over.c: New.
        * gdb.base/continue-after-aborted-step-over.exp: New.
---
 gdb/testsuite/ChangeLog                            |  5 ++
 .../gdb.base/continue-after-aborted-step-over.c    | 29 ++++++++
 .../gdb.base/continue-after-aborted-step-over.exp  | 87 ++++++++++++++++++++++
 3 files changed, 121 insertions(+)
 create mode 100644 gdb/testsuite/gdb.base/continue-after-aborted-step-over.c
 create mode 100644 gdb/testsuite/gdb.base/continue-after-aborted-step-over.exp

diff --git a/gdb/testsuite/ChangeLog b/gdb/testsuite/ChangeLog
index 7d7c389d98d..90ffb4fa443 100644
--- a/gdb/testsuite/ChangeLog
+++ b/gdb/testsuite/ChangeLog
@@ -1,3 +1,8 @@
+2018-01-12  Pedro Alves  <[hidden email]>
+
+ * gdb.base/continue-after-aborted-step-over.c: New.
+ * gdb.base/continue-after-aborted-step-over.exp: New.
+
 2018-01-11  Pedro Alves  <[hidden email]>
 
  PR remote/22597
diff --git a/gdb/testsuite/gdb.base/continue-after-aborted-step-over.c b/gdb/testsuite/gdb.base/continue-after-aborted-step-over.c
new file mode 100644
index 00000000000..7252648084e
--- /dev/null
+++ b/gdb/testsuite/gdb.base/continue-after-aborted-step-over.c
@@ -0,0 +1,29 @@
+/* This testcase is part of GDB, the GNU debugger.
+
+   Copyright 2018 Free Software Foundation, Inc.
+
+   This program is free software; you can redistribute it and/or modify
+   it under the terms of the GNU General Public License as published by
+   the Free Software Foundation; either version 3 of the License, or
+   (at your option) any later version.
+
+   This program is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+   GNU General Public License for more details.
+
+   You should have received a copy of the GNU General Public License
+   along with this program.  If not, see <http://www.gnu.org/licenses/>.  */
+
+void
+function (void)
+{
+}
+
+int
+main ()
+{
+  function ();
+
+  return 0;
+}
diff --git a/gdb/testsuite/gdb.base/continue-after-aborted-step-over.exp b/gdb/testsuite/gdb.base/continue-after-aborted-step-over.exp
new file mode 100644
index 00000000000..297cb638587
--- /dev/null
+++ b/gdb/testsuite/gdb.base/continue-after-aborted-step-over.exp
@@ -0,0 +1,87 @@
+# Copyright 2018 Free Software Foundation, Inc.
+
+# This program is free software; you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation; either version 3 of the License, or
+# (at your option) any later version.
+#
+# This program is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program.  If not, see <http://www.gnu.org/licenses/>.
+
+# This testcase is a regression test for a regression in the in-line
+# step-over machinery.  If a resumption that starts a step-over
+# failed, a following resumption would make GDB hang forever:
+#
+#  (gdb) b *0
+#  Breakpoint 2 at 0x0
+#  continue
+#  Continuing.
+#  Warning:
+#  Cannot insert breakpoint 2.
+#  Cannot access memory at address 0x0
+#
+#  Command aborted.
+#  delete breakpoints
+#  Delete all breakpoints? (y or n) y
+#  (gdb) b function
+#  Breakpoint 3 at 0x40048b: file test.c, line 33.
+#  continue
+#  Continuing.
+#  *GDB hangs forever*
+
+standard_testfile
+
+if {[build_executable "failed to prepare" $testfile $srcfile debug]} {
+    return -1
+}
+
+# DISPLACED indicates whether to use displaced-stepping.
+proc do_test {displaced} {
+    global gdb_prompt decimal
+    global srcfile binfile
+
+    clean_restart $binfile
+
+    gdb_test_no_output "set displaced-stepping $displaced"
+
+    if ![runto_main] {
+ fail "run to main"
+ return -1
+    }
+
+    # We rely on not being able to set a breakpoint at 0, as proxy for
+    # any kind of breakpoint insertion failure.  If we can examine
+    # what's at memory address 0, it is possible that we could also
+    # execute it.
+    if [is_address_zero_readable] {
+ untested "memory at address 0 is possibly executable"
+ return
+    }
+
+    # Set a breakpoint that fails to insert.
+    gdb_test "b *0" "Breakpoint $decimal at 0x0"
+
+    gdb_test "continue" \
+ "Command aborted\\." \
+ "continue aborts"
+
+    # Delete the "bad" breakpoint and try continuing again.
+    delete_breakpoints
+    gdb_test "b function" "Breakpoint $decimal .*$srcfile.*"
+
+    gdb_test "continue" \
+ "Breakpoint $decimal, function \\(\\) at .*$srcfile:.*" \
+ "continue to function"
+}
+
+# This testcase exercises a regression with the in-line step-over
+# machinery.  So make sure this runs with displaced stepping disabled,
+# and for good measure, also try with displaced stepping enabled.
+foreach_with_prefix displaced-stepping {"off" "on"} {
+    do_test ${displaced-stepping}
+}
--
2.14.3

Reply | Threaded
Open this post in threaded view
|

Re: [pushed + testcase] Re: [PATCH] Fix GDB hang with remote after error from resume

Andreas Arnez-2
On Fri, Jan 12 2018, Pedro Alves wrote:

> I've pushed in your patch to both master and 8.1, along with this
> follow up commit adding a testcase.

Great, thanks!

Sergio, now that this is in, we should be able to re-activate the
reporting for the s390 (and Power) Buildbots, right?

--
Andreas

Reply | Threaded
Open this post in threaded view
|

Re: [pushed + testcase] Re: [PATCH] Fix GDB hang with remote after error from resume

Yao Qi
On Mon, Jan 15, 2018 at 2:43 PM, Andreas Arnez <[hidden email]> wrote:
>
> Sergio, now that this is in, we should be able to re-activate the
> reporting for the s390 (and Power) Buildbots, right?
>

I am not sure ppc64 issue is resolved.  See
https://sourceware.org/ml/gdb-patches/2017-12/msg00414.html
Note that I had a patch to fix fails in
gdb.threads/process-dies-while-handling-bp.exp
in this thread, but the tests with gdbserver still hang in some watchpoint
tests.

--
Yao (齐尧)
Reply | Threaded
Open this post in threaded view
|

Re: [pushed + testcase] Re: [PATCH] Fix GDB hang with remote after error from resume

Sergio Durigan Junior
In reply to this post by Andreas Arnez-2
On Monday, January 15 2018, Andreas Arnez wrote:

> On Fri, Jan 12 2018, Pedro Alves wrote:
>
>> I've pushed in your patch to both master and 8.1, along with this
>> follow up commit adding a testcase.
>
> Great, thanks!
>
> Sergio, now that this is in, we should be able to re-activate the
> reporting for the s390 (and Power) Buildbots, right?

Thanks for taking care of this, Andreas.

I had to cancel all pending builds for the 3 s390x builders, because
they had more than 130 builds in the queue each.  Now, I'll monitor the
next builds and check if the problem still persists.  If not, then I can
certainly re-activate the notifications for s390x.

As for the PPC builders, Yao mentioned that there are probably other
issues to be solved.  I know Edjunior is also testing your fix on PPC as
I write this e-mail, so I'll wait for their opinions here.

Thanks!

--
Sergio
GPG key ID: 237A 54B1 0287 28BF 00EF  31F4 D0EB 7628 65FC 5E36
Please send encrypted e-mail if possible
http://sergiodj.net/
Reply | Threaded
Open this post in threaded view
|

Re: [pushed + testcase] Re: [PATCH] Fix GDB hang with remote after error from resume

Andreas Arnez-2
In reply to this post by Yao Qi
On Mon, Jan 15 2018, Yao Qi wrote:

> On Mon, Jan 15, 2018 at 2:43 PM, Andreas Arnez <[hidden email]> wrote:
>>
>> Sergio, now that this is in, we should be able to re-activate the
>> reporting for the s390 (and Power) Buildbots, right?
>>
>
> I am not sure ppc64 issue is resolved.  See
> https://sourceware.org/ml/gdb-patches/2017-12/msg00414.html
> Note that I had a patch to fix fails in
> gdb.threads/process-dies-while-handling-bp.exp
> in this thread, but the tests with gdbserver still hang in some watchpoint
> tests.

Hm, too bad.  Well, since this type of hang doesn't seem to occur on
s390, we can at least try re-activating s390 then.

--
Andreas

Reply | Threaded
Open this post in threaded view
|

Re: [pushed + testcase] Re: [PATCH] Fix GDB hang with remote after error from resume

Yao Qi
On Mon, Jan 15, 2018 at 3:22 PM, Andreas Arnez <[hidden email]> wrote:
>
> Hm, too bad.  Well, since this type of hang doesn't seem to occur on
> s390, we can at least try re-activating s390 then.
>

Yeah, I agree!

--
Yao (齐尧)
Reply | Threaded
Open this post in threaded view
|

Re: [pushed + testcase] Re: [PATCH] Fix GDB hang with remote after error from resume

Edjunior Machado
In reply to this post by Yao Qi
Hi all,

On Mon, Jan 15, 2018 at 3:52 PM, Yao Qi <[hidden email]> wrote:

> On Mon, Jan 15, 2018 at 2:43 PM, Andreas Arnez <[hidden email]>
> wrote:
> >
> > Sergio, now that this is in, we should be able to re-activate the
> > reporting for the s390 (and Power) Buildbots, right?
> >
>
> I am not sure ppc64 issue is resolved.  See
> https://sourceware.org/ml/gdb-patches/2017-12/msg00414.html
> Note that I had a patch to fix fails in
> gdb.threads/process-dies-while-handling-bp.exp
> in this thread, but the tests with gdbserver still hang in some watchpoint
> tests.
>
>
Andreas' patch does fix the gdb.base/watchpoint.exp hang on ppc64*
native-gdbserver, but as Yao mentioned, it is still hanging on
gdb.threads/process-dies-while-handling-bp.exp without his fix (
https://sourceware.org/ml/gdb-patches/2017-12/msg00434.html).

That said, just tested upstream testsuite, with Andreas' patch included,
plus Yao's aforementioned patch and native-gdbserver testrun no longer
hangs. These are the results I got with native-gdbserver on Fedora26:

ppc64le:
                === gdb Summary ===

# of expected passes            50410
# of unexpected failures        550
# of unexpected successes       3
# of expected failures          58
# of unknown successes          4
# of known failures             64
# of unresolved testcases       5
# of untested testcases         49
# of unsupported tests          171
/home/fedora/binutils-gdb.git/build/gdb/gdb version  8.1.50.20180115-git
-nw -nx -data-directory
/home/fedora/binutils-gdb.git/build/gdb/testsuite/../data-directory  -ex
"set auto-connect-native-target off"

ppc64:
                === gdb Summary ===

# of expected passes            50133
# of unexpected failures        1109
# of unexpected successes       3
# of expected failures          58
# of unknown successes          4
# of known failures             65
# of unresolved testcases       6
# of untested testcases         51
# of unsupported tests          162
/home/fedora/binutils-gdb.git/build/gdb/gdb version  8.1.50.20180115-git
-nw -nx -data-directory
/home/fedora/binutils-gdb.git/build/gdb/testsuite/../data-directory  -ex
"set auto-connect-native-target off"

(FWIW, at first glance, gdb.linespec/cpcompletion.exp in ppc64 BE seems to
be one of the main culprits for the whole testrun taking longer (~1h) and
with considerably more failures than in ppc64le, aparently due to the
ppc64's function descriptor.)

Please let me know if you need any additional testing.

Thanks and regards,
Edjunior
Reply | Threaded
Open this post in threaded view
|

Re: [pushed + testcase] Re: [PATCH] Fix GDB hang with remote after error from resume

Pedro Alves-7
On 01/15/2018 05:37 PM, Edjunior Machado wrote:
> (FWIW, at first glance, gdb.linespec/cpcompletion.exp in ppc64 BE seems to
> be one of the main culprits for the whole testrun taking longer (~1h) and
> with considerably more failures than in ppc64le, aparently due to the
> ppc64's function descriptor.)

Sergio, would it possible to include the time it took to run the
testsuite in the messages sent to gdb-testers?  As is, such new
long cascading timeouts are introduced without one realizing,
unless you go look at the builds' pages one by one.

Ideally we'd even get a notification if time jumped
seemingly too much between builds.

And IWBN if the gdb buildbot web frontend had some kind of graph
tracking/plotting test run duration over time (build# for a
given builder), so we could just open some URL and
notice the spikes.  </pie in the sky>.

Thanks,
Pedro Alves
Reply | Threaded
Open this post in threaded view
|

Re: [pushed + testcase] Re: [PATCH] Fix GDB hang with remote after error from resume

Sergio Durigan Junior
On Monday, January 15 2018, Pedro Alves wrote:

> On 01/15/2018 05:37 PM, Edjunior Machado wrote:
>> (FWIW, at first glance, gdb.linespec/cpcompletion.exp in ppc64 BE seems to
>> be one of the main culprits for the whole testrun taking longer (~1h) and
>> with considerably more failures than in ppc64le, aparently due to the
>> ppc64's function descriptor.)
>
> Sergio, would it possible to include the time it took to run the
> testsuite in the messages sent to gdb-testers?  As is, such new
> long cascading timeouts are introduced without one realizing,
> unless you go look at the builds' pages one by one.

I'll see about that.  There must be a way, because the web interface
provides this information, but I don't remember seeing it before.

> Ideally we'd even get a notification if time jumped
> seemingly too much between builds.

This would take more time to implement, but if there is a way to obtain
the build time, then it should be possible.

> And IWBN if the gdb buildbot web frontend had some kind of graph
> tracking/plotting test run duration over time (build# for a
> given builder), so we could just open some URL and
> notice the spikes.  </pie in the sky>.

I'll leave this pie in the sky :-).

--
Sergio
GPG key ID: 237A 54B1 0287 28BF 00EF  31F4 D0EB 7628 65FC 5E36
Please send encrypted e-mail if possible
http://sergiodj.net/