thecodingidiot.com

It Crashes

The program works on test.txt. Try it on a filename that does not exist:

./sort missing.txt
Segmentation fault (core dumped)

A segmentation fault (SIGSEGV) happens when a program tries to read or write memory it is not allowed to touch. The kernel kills the process and the shell prints the message above. The message gives no clue where in the code the crash happened. That is what gdb is for.

Running under gdb

gdb is the GNU debugger. Launch it on the binary:

gdb ./sort

It opens an interactive prompt:

(gdb)

This is gdb's own shell. From here you control the program: tell it when to start, where to pause, what to print, when to continue. The five commands we need to find a crash:

Command          What it does
run (r)          Start the program. Arguments after run go to the program.
backtrace (bt)   Show the chain of function calls that led to where the program is now. Most useful at a crash.
frame N (f N)    Switch the prompt's context to frame N from the backtrace, so print and other commands act on that frame's variables.
print (p)        Print the value of a variable or expression.
quit (q)         Exit gdb.

gdb has many more commands: break to pause at a chosen line, step and next to walk a running program one line at a time, continue to resume after a breakpoint, and dozens more. The c-tier returns to them when the goal is to step through code that runs without crashing. For a segfault, the five above are enough.

For a crash the only command we need right now is run. Tell gdb to start the program with the same argument we used at the shell:

(gdb) run missing.txt

The program runs until it crashes, and gdb catches the crash mid-flight:

Program received signal SIGSEGV, Segmentation fault.
0x00007ffff7e1a234 in _IO_fgets () from /lib/x86_64-linux-gnu/libc.so.6

Two things to notice. The signal is SIGSEGV — the kernel explicitly told the process it touched bad memory. And the function gdb shows is _IO_fgets, deep inside libc. We did not crash in our code — we crashed in a standard-library call we made.

The backtrace

backtrace (or bt for short) shows the chain of function calls that led to the current location:

(gdb) bt
#0  0x00007ffff7e1a234 in _IO_fgets () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x000055555555525e in main (argc=2, argv=0x7fffffffe148) at sort.c:31

The numbered lines are the call stack. Frame #0 is where the crash happened, and each higher-numbered frame is the caller of the frame listed just above it. The frame we care about is #1: that is our code. sort.c:31 is the line that made the call to fgets.

Switch to that frame to look at the source:

(gdb) frame 1
#1  0x000055555555525e in main (argc=2, argv=0x7fffffffe148) at sort.c:31
31      while (fgets(buf, sizeof(buf), in)) {

This is the line in our code. We called fgets(buf, sizeof(buf), in). It crashed inside the libc implementation. So either buf is bad, sizeof(buf) is bad (it is not — it is a constant), or in is bad. Let us print them:

(gdb) print buf
$1 = '\000' <repeats 4095 times>
(gdb) print in
$2 = (FILE *) 0x0

buf is fine: fgets is allowed to write into uninitialised memory; that is what we are asking it to do. in is 0x0, which is NULL.

The bug

Look back at the code:

in = fopen(argv[1], "r");

What does fopen do when the file cannot be opened? Linux has a built-in answer to questions like that: the man command. Type man <name> and you get the manual page for any standard tool or library function. Pages live in numbered sections: section 1 is shell commands (man 1 ls), section 3 is the C standard library (man 3 fopen), and there are a handful of others. From here on you will reach for man constantly as you write C — every standard-library function has its own page describing what it takes, what it returns, and what can go wrong.

man opens the page in a pager — less, the same one f01/03 introduced you to alongside vim, with the same vi-style keys: j and k to scroll line by line, Space to page forward, / to search for a word, and q to quit. The same keys work in any pager you meet on Linux, including the one git log opens.

Run man 3 fopen now and scroll to the RETURN VALUE section:

Upon successful completion fopen()... return a FILE pointer.
Otherwise, NULL is returned and errno is set to indicate the
error.

There is the answer. fopen returns NULL when it cannot open the file: a wrong path, missing permissions, or any other failure. We never checked the return value. We handed a NULL FILE * straight to fgets, which tried to read from it as if it were a real file, and the program crashed.

This is a beginner mistake every C programmer makes. The fix is a short check right after the call:

in = fopen(argv[1], "r");
if (!in) {
    fprintf(stderr, "cannot open %s\n", argv[1]);
    return (1);
}

!in is shorthand for "the pointer is NULL." If it is, print a sensible error and exit with status 1.
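The man page excerpt also says that errno is set when fopen fails. As an optional refinement, here is a minimal sketch of a helper that includes the reason in the message. The name open_or_explain is hypothetical and not part of sort.c; strerror and errno are standard C.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper (not from sort.c): open path for reading.
 * On failure, print the reason to stderr and return NULL.
 */
FILE *open_or_explain(const char *path)
{
    FILE *in = fopen(path, "r");

    if (!in) {
        /* Copy errno first: later library calls are allowed to change it. */
        int saved = errno;

        fprintf(stderr, "cannot open %s: %s\n", path, strerror(saved));
        return (NULL);
    }
    return (in);
}
```

In main, the call would replace the bare fopen: in = open_or_explain(argv[1]); followed by if (!in) return (1);. The exact wording after the colon comes from the C library, so it can differ between systems.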

Quit gdb with quit (or just hit Ctrl+D) and recompile:

gcc -Wall -Wextra -g sort.c -o sort
./sort missing.txt

Now the program prints cannot open missing.txt and exits cleanly. Run it on test.txt again and the sorted output still works.
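To see the whole pattern in one place (open, check, read with fgets, close), here is a small sketch. It is not the sort program: count_lines is a hypothetical helper, and the 4096-byte buffer is an assumption inferred from the gdb output above.

```c
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper (not from sort.c): count the lines in a file.
 * Returns -1 if the file cannot be opened.
 */
long count_lines(const char *path)
{
    char buf[4096];        /* assumed size, see the note above */
    long lines = 0;
    int at_line_start = 1; /* true when the next chunk begins a new line */
    size_t len;
    FILE *in = fopen(path, "r");

    if (!in)               /* the check the crashing version skipped */
        return (-1);
    while (fgets(buf, sizeof(buf), in)) {
        if (at_line_start)
            lines++;
        /* A chunk without a trailing newline means the line continues. */
        len = strlen(buf);
        at_line_start = (len > 0 && buf[len - 1] == '\n');
    }
    fclose(in);
    return (lines);
}
```

Because the NULL check comes before the loop, fgets only ever sees a valid FILE *. A caller treats a return value of -1 the same way main treats a failed fopen: print an error and exit with a non-zero status.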

One bug down, two to go.

What you actually learned

gdb is a giant tool with hundreds of commands. The handful you just used is enough to handle most crashes:

  1. gdb ./binary to enter.
  2. run <args> to start.
  3. When it crashes, bt to see where.
  4. frame N to switch to your code's frame.
  5. print <var> to inspect what went wrong.
  6. quit when done.

You will use this same loop every time something segfaults for the rest of your career. The c-tier covers gdb again with breakpoints and stepping when it matters. For now, the entire skill is this: the program crashed, gdb showed you the line, and you fixed it.