Commit 3398744

[lldb][Docs] Additions to debugging LLDB page (#65635)

Adds the following:

* A note that you can use attaching to debug the right lldb-server process,
  though there are drawbacks.
* A section on debugging the remote protocol.
* Reducing bugs, including reducing ptrace bugs to remove the need for LLDB.

I've added a standalone ptrace program to the examples folder because:

* There's no better place to put it.
* Adding it to the page seems like wasting space, and would be harder to update.
* I link to Eli Bendersky's classic blog on the subject, but we are safer with
  our own example as well.
* Eli's example is for 32 bit Intel; AArch64 is more common these days.
* It's easier to show the software breakpoint steps in code than explain it
  (though I still do that in the text).
* It was living on my laptop not helping anyone, so I think it's good to have
  it upstream for others, including future me.
1 parent 96122b5 commit 3398744

File tree

2 files changed: +428 -0 lines changed

lldb/docs/resources/debugging.rst

Lines changed: 322 additions & 0 deletions
@@ -195,6 +195,11 @@ automatically debug the ``gdbserver`` process as it's created. However this
author has not been able to get either to work in this scenario so we suggest
making a more specific command wherever possible instead.

Another option is to let ``lldb-server`` start up, then attach to the process
that's interesting to you. It's less automated and won't work if the bug occurs
during startup. However, it is a good way to confirm you've found the right
process; you can then take its command line and run that directly.

Output From ``lldb-server``
***************************

@@ -258,3 +263,320 @@ then ``lldb B`` to trigger ``lldb-server B`` to go into that code and hit the
breakpoint. ``lldb-server A`` is only here to let us debug ``lldb-server B``
remotely.

Debugging The Remote Protocol
-----------------------------

LLDB mostly follows the `GDB Remote Protocol <https://sourceware.org/gdb/onlinedocs/gdb/Remote-Protocol.html>`_.
Where there are differences it tries to handle both LLDB and GDB behaviour.

LLDB does have extensions to the protocol which are documented in
`lldb-gdb-remote.txt <https://github.com/llvm/llvm-project/blob/main/lldb/docs/lldb-gdb-remote.txt>`_
and `lldb/docs/lldb-platform-packets.txt <https://github.com/llvm/llvm-project/blob/main/lldb/docs/lldb-platform-packets.txt>`_.
Logging Packets
***************

If you just want to observe packets, you can enable the ``gdb-remote packets``
log channel.

::

  (lldb) log enable gdb-remote packets
  (lldb) run
  lldb             <   1> send packet: +
  lldb             history[1] tid=0x264bfd <   1> send packet: +
  lldb             <  19> send packet: $QStartNoAckMode#b0
  lldb             <   1> read packet: +

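As the log shows, each packet is framed as ``$<payload>#<checksum>``, where the
checksum is the modulo-256 sum of the payload's bytes, written as two hex
digits. A minimal sketch in Python (the function name is ours, not an LLDB or
GDB API):

```python
def gdb_remote_frame(payload: str) -> str:
    """Frame a GDB remote payload as $<payload>#<checksum>.

    The checksum is the sum of the payload's ASCII bytes, modulo 256,
    rendered as two lowercase hex digits.
    """
    checksum = sum(payload.encode("ascii")) % 256
    return f"${payload}#{checksum:02x}"

# Matches the packet seen in the log above.
print(gdb_remote_frame("QStartNoAckMode"))  # $QStartNoAckMode#b0
```

This is handy when you want to hand-craft a packet to send with a tool like
``netcat``, or to check that a packet you see in a log is well formed.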
You can do this on the ``lldb-server`` end as well by passing the option
``--log-channels "gdb-remote packets"``. Then you'll see both sides of the
connection.

Some packets may be printed in a nicer way than others. For example, XML
packets will print the literal XML and some binary packets may be decoded.
Others will just be printed unmodified, so do check what format you expect;
a common one is hex encoded bytes.

You can enable this logging even when you are connecting to an ``lldb-server``
in platform mode; this protocol is used for that too.

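Hex encoded bytes can be decoded with everyday tooling. For example, a
little-endian 64-bit register value might arrive as the payload
``20f3fef7ffff0000`` (a hypothetical payload, used only for illustration):

```python
payload = "20f3fef7ffff0000"  # hypothetical hex-encoded register payload

# GDB remote register payloads are raw target bytes; on a little-endian
# target the least significant byte comes first.
raw = bytes.fromhex(payload)
value = int.from_bytes(raw, byteorder="little")

print(f"0x{value:016x}")  # 0x0000fffff7fef320
```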
Debugging Packet Exchanges
**************************

Say you want to make ``lldb`` send a packet to ``lldb-server``, then debug
how the latter builds its response. Maybe even see how ``lldb`` handles it once
it's sent back.

That all takes time, so LLDB will likely time out and think the remote has gone
away. You can change the ``plugin.process.gdb-remote.packet-timeout`` setting
to prevent this.

Here's an example. First we'll start an ``lldb-server`` being debugged by
``lldb``, placing a breakpoint on a packet handler we know will be hit once
another ``lldb`` connects.
::

  $ lldb -- lldb-server gdbserver :1234 -- /tmp/test.o
  <...>
  (lldb) b GDBRemoteCommunicationServerCommon::Handle_qSupported
  Breakpoint 1: where = <...>
  (lldb) run
  <...>

Next we connect another ``lldb`` to this, with a timeout of 5 minutes:

::

  $ lldb /tmp/test.o
  <...>
  (lldb) settings set plugin.process.gdb-remote.packet-timeout 300
  (lldb) gdb-remote 1234

Doing so triggers the breakpoint in ``lldb-server``, bringing us back into
``lldb``. Now we've got 5 minutes to do whatever we need before LLDB decides
the connection has failed.

::

  * thread #1, name = 'lldb-server', stop reason = breakpoint 1.1
      frame #0: 0x0000aaaaaacc6848 lldb-server<...>
  lldb-server`lldb_private::process_gdb_remote::GDBRemoteCommunicationServerCommon::Handle_qSupported:
  ->  0xaaaaaacc6848 <+0>: sub    sp, sp, #0xc0
  <...>
  (lldb)

Once you're done, simply ``continue`` the ``lldb-server``. Back in the other
``lldb``, the connection process will continue as normal.

::

  Process 2510266 stopped
  * thread #1, name = 'test.o', stop reason = signal SIGSTOP
      frame #0: 0x0000fffff7fcd100 ld-2.31.so`_start
  ld-2.31.so`_start:
  ->  0xfffff7fcd100 <+0>: mov    x0, sp
  <...>
  (lldb)

Reducing Bugs
-------------

This section covers reducing a bug that happens in LLDB itself, or where you
suspect that LLDB causes something else to behave abnormally.

Since bugs vary wildly, the advice here is general and incomplete. Let your
instincts guide you and don't feel the need to try everything before reporting
an issue or asking for help. This is simply inspiration.

Reduction
*********

The first step is to reduce unneeded complexity where it is cheap to do so. If
something is easily removed or frozen to a certain value, do so. The goal is to
keep the failure mode the same, with fewer dependencies.

This includes, but is not limited to:

* Removing test cases that don't crash.
* Replacing dynamic lookups with constant values.
* Replacing supporting functions with stubs that do nothing.
* Moving the test case to a less unique system. If your machine has an exotic
  extension, try it on a readily available commodity machine.
* Removing irrelevant parts of the test program.
* Reproducing the issue without using the LLDB test runner.
* Converting a remote debugging scenario into a local one.

Now we hopefully have a smaller reproducer than we started with. Next we need to
find out what components of the software stack might be failing.

Some examples are listed below with suggestions for how to investigate them.

* Debugger

  * Use a `released version of LLDB <https://github.com/llvm/llvm-project/releases>`_.

  * If on macOS, try the system ``lldb``.

  * Try GDB or any other system debugger you might have, e.g. Microsoft Visual
    Studio.

* Kernel

  * Start a virtual machine running a different version. ``qemu-system`` is
    useful here.

  * Try a different physical system running a different version.

  * Remember that for most kernels, userspace crashing the kernel is always a
    kernel bug, even if the userspace program is doing something
    unconventional. So it could be a bug in both the application and the
    kernel.

* Compiler and compiler options

  * Try other versions of the same compiler or your system compiler.

  * Emit older versions of DWARF info, particularly DWARFv4 instead of v5;
    some tools did or do not understand the newer constructs.

  * Reduce optimisation options as much as possible.

  * Try all the language modes, e.g. C++17/20 for C++.

  * Link against LLVM's libcxx if you suspect a bug involving the system C++
    library.

  * For languages other than C/C++, e.g. Rust, try making an equivalent program
    in C/C++. LLDB tends to try to fit other languages into a C/C++ mould, so
    porting the program can make triage and reporting much easier.

* Operating system

  * Use Docker to try various versions of Linux.

  * Use ``qemu-system`` to emulate other operating systems, e.g. FreeBSD.

* Architecture

  * Use `QEMU user space emulation <https://www.qemu.org/docs/master/user/main.html>`_
    to quickly test other architectures. Note that ``lldb-server`` cannot be
    used with this as the ptrace APIs are not emulated.

  * If you need to test a big endian system, use QEMU to emulate s390x (user
    space emulation for just ``lldb``, ``qemu-system`` for testing
    ``lldb-server``).

.. note:: When using QEMU you may need to use the built-in GDB stub, instead of
   ``lldb-server``. For example, if you wanted to debug ``lldb`` running
   inside ``qemu-user-s390x`` you would connect to the GDB stub provided
   by QEMU.

   The same applies if you want to see how ``lldb`` would debug a test
   program that is running on s390x. It's not totally accurate because
   you're not using ``lldb-server``, but this is fine for features that
   are mostly implemented in ``lldb``.

   If you are running a full system using ``qemu-system``, you likely
   want to connect to the ``lldb-server`` running within the userspace
   of that system.

   If your test program is bare metal (meaning it requires no supporting
   operating system) then connect to the built-in GDB stub. This can be
   useful when testing embedded systems or kernel debugging.

Reducing Ptrace Related Bugs
****************************

This section is written with Linux in mind, but the same can likely be done on
other Unix or Unix-like operating systems.

Sometimes you will find ``lldb-server`` doing something with ptrace that causes
a problem. A reproducer that involves running ``lldb`` as well is not going to
go over well with kernel developers, and is generally more difficult to explain
if you want to get help with it.

If you think you can get your point across without this, no need. But if you're
pretty sure you have, for example, found a Linux Kernel bug, doing this greatly
increases the chances it'll get fixed.

We'll remove the LLDB dependency by making a smaller standalone program that
does the same actions. Start with a skeleton program that forks and debugs
the inferior process.

The program presented `here <https://eli.thegreenplace.net/2011/01/23/how-debuggers-work-part-1>`_
(`source <https://github.com/eliben/code-for-blog/blob/master/2011/simple_tracer.c>`_)
is a great starting point. There is also an AArch64 specific example in
`the LLDB examples folder <https://github.com/llvm/llvm-project/tree/main/lldb/examples/ptrace_example.c>`_.

For either, you'll need to modify it to fit your architecture. A tip for this:
take any constants used in it, find the function(s) in which they are used in
LLDB, and you'll then find the equivalent constants in the same LLDB functions
for your architecture.

Once that is running as expected, we can convert ``lldb-server``'s ptrace calls
into calls in this program. To get a log of those, run ``lldb-server`` with
``--log-channels "posix ptrace"``. You'll see output like:

::

  $ lldb-server gdbserver :1234 --log-channels "posix ptrace" -- /tmp/test.o
  1694099878.829990864 <...> ptrace(16896, 2659963, 0x0000000000000000, 0x000000000000007E, 0)=0x0
  1694099878.830722332 <...> ptrace(16900, 2659963, 0x0000FFFFD14BF7CC, 0x0000FFFFD14BF7D0, 16)=0x0
  1694099878.831967115 <...> ptrace(16900, 2659963, 0x0000FFFFD14BF66C, 0x0000FFFFD14BF630, 16)=0xffffffffffffffff
  1694099878.831982136 <...> ptrace() failed: Invalid argument
  Launched '/tmp/test.o' as process 2659963...

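If you have a lot of these lines to triage, pulling the fields out
mechanically can help. A small Python sketch, assuming only the log format
shown above (the regex is ours, not anything LLDB provides):

```python
import re

# Matches lines of the form:
#   <timestamp> <...> ptrace(<request>, <pid>, <remaining args...>)=<result>
LOG_RE = re.compile(r"ptrace\((\d+), (\d+), ([^)]*)\)=(\S+)")

line = ("1694099878.830722332 <...> ptrace(16900, 2659963, "
        "0x0000FFFFD14BF7CC, 0x0000FFFFD14BF7D0, 16)=0x0")

m = LOG_RE.search(line)
request, pid, rest, result = m.groups()
print(request, pid, result)  # 16900 2659963 0x0

# The remaining arguments, still as logged.
args = [a.strip() for a in rest.split(",")]
print(args)  # ['0x0000FFFFD14BF7CC', '0x0000FFFFD14BF7D0', '16']
```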
Each call is logged with its parameters and its result after the ``=`` at the
end.

From here you will need to use a combination of the `ptrace documentation <https://man7.org/linux/man-pages/man2/ptrace.2.html>`_
and Linux Kernel headers (``uapi/linux/ptrace.h`` mainly) to figure out what
the calls are.

The most important parameter is the first, which is the request number. In the
example above ``16896``, which is hex ``0x4200``, is ``PTRACE_SETOPTIONS``.

Luckily, you don't usually have to figure out all those early calls. Our
skeleton program will be doing all that, successfully we hope.

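That decoding can be sketched in a few lines of Python. The request values come
from Linux's ``uapi/linux/ptrace.h``; the table below is deliberately
incomplete, extend it as you meet new requests:

```python
# A few Linux ptrace request numbers (from uapi/linux/ptrace.h).
PTRACE_REQUESTS = {
    0x4200: "PTRACE_SETOPTIONS",
    0x4204: "PTRACE_GETREGSET",
    0x4205: "PTRACE_SETREGSET",
}

def decode_request(request: int) -> str:
    """Render a logged request number with its hex form and name."""
    name = PTRACE_REQUESTS.get(request, "unknown")
    return f"{request} (0x{request:x}) = {name}"

print(decode_request(16896))  # 16896 (0x4200) = PTRACE_SETOPTIONS
print(decode_request(16900))  # 16900 (0x4204) = PTRACE_GETREGSET
```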
What you should do is record just the part that's interesting to you. Let's say
something odd is happening when you read the ``tpidr`` register (this is an
AArch64 register, just for example purposes).

First, go to the ``lldb-server`` terminal and press enter a few times to put
some blank lines after the last logging output.

Then go to your ``lldb`` and:

::

  (lldb) register read tpidr
       tpidr = 0x0000fffff7fef320

You'll see this from ``lldb-server``:

::

  <...> ptrace(16900, 2659963, 0x0000FFFFD14BF6CC, 0x0000FFFFD14BF710, 8)=0x0

If you don't see that, it may be because ``lldb`` has cached it. The easiest way
to clear that cache is to step. Remember that some registers are read on every
step, so you'll have to adjust depending on the situation.

Assuming you've got that line, you would look up what ``16900`` is. This is
``0x4204`` in hex, which is ``PTRACE_GETREGSET``, as we expected.

The following parameters are not as we might expect, because what we log is a
bit different from the literal ptrace call. See your platform's definition of
``PtraceWrapper`` for the exact form.

The point of all this is that by doing a single action you can get a few
isolated ptrace calls, then fill in the blanks and write equivalent calls in
the skeleton program.

The final piece of this is likely breakpoints. Assuming your bug does not
require a hardware breakpoint, you can get software breakpoints by inserting
a break instruction into the inferior's code at compile time, usually by using
an architecture specific assembly statement, as you will need to know exactly
how many instructions to overwrite later.

Doing it this way instead of exactly copying what LLDB does will save a few
ptrace calls. The AArch64 example program shows how to do this.

* The inferior contains ``BRK #0`` then ``NOP``.
* Two 4 byte instructions means 8 bytes of data to replace, which matches the
  minimum size you can write with ``PTRACE_POKETEXT``.
* The inferior runs to the ``BRK``, which brings us into the debugger.
* The debugger reads ``PC`` and writes ``NOP`` then ``NOP`` to the location
  pointed to by ``PC``.
* The debugger then single steps the inferior to the next instruction
  (this is not required in this specific scenario, you could just continue, but
  it is included because this more closely matches what ``lldb`` does).
* The debugger then continues the inferior.
* The inferior exits, and the whole program exits.

Using this technique you can emulate the usual "run to main, do a thing" type
of reproduction steps.

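The arithmetic behind that 8 byte write can be sketched as follows. The
instruction encodings are the standard AArch64 ones (``BRK #0`` is
``0xd4200000``, ``NOP`` is ``0xd503201f``); the variable names are ours:

```python
import struct

# AArch64 instructions are fixed 4-byte words, stored little-endian.
BRK_0 = 0xD4200000  # BRK #0
NOP = 0xD503201F    # NOP

# The inferior was compiled with BRK #0 followed by NOP: 8 bytes total,
# matching the minimum unit PTRACE_POKETEXT writes on a 64-bit system.
original = struct.pack("<II", BRK_0, NOP)
replacement = struct.pack("<II", NOP, NOP)

assert len(original) == 8 == len(replacement)
# This is the 8-byte word the debugger pokes over the BRK/NOP pair.
print(replacement.hex())  # 1f2003d51f2003d5
```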
Finally, that "thing" is the ptrace calls you got from the ``lldb-server``
logs. Add those to the debugger function and you now have a reproducer that
doesn't need any part of LLDB.
