Skip to content

Improve diagnostics and stability during BUGCHECK #7270

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 5 commits into
base: master
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from 4 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
4 changes: 2 additions & 2 deletions src/jrd/cch.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -2094,7 +2094,7 @@ bool CCH_rollover_to_shadow(thread_db* tdbb, Database* dbb, jrd_file* file, cons
}


void CCH_shutdown(thread_db* tdbb)
void CCH_shutdown(thread_db* tdbb, bool bugcheck)
{
/**************************************
*
Expand Down Expand Up @@ -2144,7 +2144,7 @@ void CCH_shutdown(thread_db* tdbb)
bcb_repeat* tail = bcb->bcb_rpt;
const bcb_repeat* const end = tail + bcb->bcb_count;

if (tail && tail->bcb_bdb)
if (tail && tail->bcb_bdb && !bugcheck)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It also avoids flush, is it intended ? If yes, is it really good thought out ?

Copy link
Contributor Author

@ilya071294 ilya071294 Aug 26, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CCH_flush wouldn't be called anyway because DBB_bugcheck flag is already set at the moment, and LongJump::raise() is called in this case.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok.
In this case - is bugcheck agrument really necessary ?
Does DBB_bugcheck flag might be used instead ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, it is not clear what happens with backup lock if clear_dirty_flag_and_nbak_state() not called

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does DBB_bugcheck flag might be used instead ?

I guess no because later CCH_shutdown will be called again in JRD_shutdown_database. This time with bugcheck == false, and clear_dirty_flag_and_nbak_state() will be finally called.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTW, could you explain - what problem with precedence relationship you trying to fix here ?

BDB_db_dirty and BDB_dirty are cleared but pages are not flushed. In concurrent environment other threads may want to write some pages according to the precedence and they may assume that these pages are written but actually they are not.

Did you test it explicitly ?
For example, if backup state remains locked in CS - it could hung all other attachments.

I did and clear_dirty_flag_and_nbak_state() is certainly called during JRD_shutdown_database (both CS and SS). Is there any case where we can get better results from calling clear_dirty_flag_and_nbak_state() earlier? The old code seems good for CS but for SS I don't see how it can be safely called before other attachments/threads are stopped.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, this made me to look at the issue from another side.

AFAIU, the goal of the immediate call of CCH_shutdown() when database is bug-checked is to stop all database IO ASAP and not allow to write anything. Note, page cache is not flushed and database file is closed. Thus, even in concurrent environment, no other thread should be able to write any page. So, the real problem is - how to effectively disable IO after bugcheck (and don't spam firebird.log with messages about invalid file handle), IMHO.

On Windows, I would try to use jrd_file::fil_ext_lock to block all database IO. Of course, PIO should add check for DBB_bugcheck flag after acquiring jrd_file::fil_ext_lock.
POSIX implementation have no this lock, but I see no problem to add it there.
Probably, PIO_read() should be allowed after bugcheck and database file should not be closed immediately (as we already prohibit any writes).

Looks like this is out of scope of this ticket, btw. But it is always better to fix root issue, IMO

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, the real problem is - how to effectively disable IO after bugcheck (and don't spam firebird.log with messages about invalid file handle), IMHO.

How about abort()? Core dump will be a bonus.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, the real problem is - how to effectively disable IO after bugcheck (and don't spam firebird.log with messages about invalid file handle), IMHO.

How about abort()? Core dump will be a bonus.

abort() is called when BugcheckAbort is set to true

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AFAIU, the goal of the immediate call of CCH_shutdown() when database is bug-checked is to stop all database IO ASAP and not allow to write anything. Note, page cache is not flushed and database file is closed.

  1. I wonder why a database file is closed after releasing the locks. Can we change the order in a case of BUGCHECK and close it before?
  2. If we close a database file without synchronizing with other threads, can we consider it safe? Especially when other threads are executing PIO_write at the moment.

POSIX implementation have no this lock, but I see no problem to add it there.

If we really need this lock then changes from #8146 may help with it.

{
try
{
Expand Down
2 changes: 1 addition & 1 deletion src/jrd/cch_proto.h
Original file line number Diff line number Diff line change
Expand Up @@ -70,7 +70,7 @@ bool CCH_prefetch_pages(Jrd::thread_db*);
void CCH_release(Jrd::thread_db*, Jrd::win*, const bool);
void CCH_release_exclusive(Jrd::thread_db*);
bool CCH_rollover_to_shadow(Jrd::thread_db* tdbb, Jrd::Database* dbb, Jrd::jrd_file*, const bool);
void CCH_shutdown(Jrd::thread_db*);
void CCH_shutdown(Jrd::thread_db*, bool bugcheck = false);
void CCH_unwind(Jrd::thread_db*, const bool);
bool CCH_validate(Jrd::win*);
void CCH_flush_ast(Jrd::thread_db*);
Expand Down
53 changes: 32 additions & 21 deletions src/jrd/err.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -53,7 +53,7 @@ using namespace Firebird;
//#define JRD_FAILURE_UNKNOWN "<UNKNOWN>" // Used when buffer fails


static void internal_error(ISC_STATUS status, int number, const TEXT* file = NULL, int line = 0);
static void get_internal_error_msg(TEXT* errmsg_buf, size_t errmsg_buf_size, int number, const TEXT* file = NULL, int line = 0);
static void post_nothrow(const unsigned lenToAdd, const ISC_STATUS* toAdd, FbStatusVector* statusVector);


Expand All @@ -69,13 +69,10 @@ void ERR_bugcheck(int number, const TEXT* file, int line)
* Things seem to be going poorly today.
*
**************************************/
thread_db* const tdbb = JRD_get_thread_data();
Database* const dbb = tdbb->getDatabase();

dbb->dbb_flags |= DBB_bugcheck;
CCH_shutdown(tdbb);
TEXT errmsg[MAX_ERRMSG_LEN + 1];
get_internal_error_msg(errmsg, sizeof(errmsg), number, file, line);

internal_error(isc_bug_check, number, file, line);
ERR_bugcheck_msg(errmsg);
}


Expand All @@ -94,10 +91,23 @@ void ERR_bugcheck_msg(const TEXT* msg)
thread_db* const tdbb = JRD_get_thread_data();
Database* const dbb = tdbb->getDatabase();

Arg::StatusVector status_vector;
status_vector << Arg::Gds(isc_bug_check) << Arg::Str(msg);
FbLocalStatus status;
status_vector.copyTo(&status);

// It's important to put the message into the log before any
// further actions because there is always a risk of crashing.
iscDbLogStatus(dbb->dbb_filename.nullStr(), &status);

dbb->dbb_flags |= DBB_bugcheck;
CCH_shutdown(tdbb);
CCH_shutdown(tdbb, true);

ERR_post(Arg::Gds(isc_bug_check) << Arg::Str(msg));
if (Config::getBugcheckAbort())
abort();

ERR_post_nothrow(status_vector);
status_exception::raise(tdbb->tdbb_status_vector);
}


Expand All @@ -117,7 +127,10 @@ void ERR_soft_bugcheck(int number, const TEXT* file, int line)
**************************************/

fb_assert(false);
internal_error(isc_bug_check, number, file, line);
TEXT errmsg[MAX_ERRMSG_LEN + 1];
get_internal_error_msg(errmsg, sizeof(errmsg), number, file, line);

ERR_post(Arg::Gds(isc_bug_check) << Arg::Str(errmsg));
}


Expand All @@ -133,8 +146,10 @@ void ERR_corrupt(int number)
* Things seem to be going poorly today.
*
**************************************/
TEXT errmsg[MAX_ERRMSG_LEN + 1];
get_internal_error_msg(errmsg, sizeof(errmsg), number);

internal_error(isc_db_corrupt, number);
ERR_post(Arg::Gds(isc_db_corrupt) << Arg::Str(errmsg));
}


Expand Down Expand Up @@ -413,7 +428,7 @@ void ERR_build_status(FbStatusVector* status_vector, const Arg::StatusVector& v)
}


static void internal_error(ISC_STATUS status, int number, const TEXT* file, int line)
static void get_internal_error_msg(TEXT* errmsg_buf, size_t errmsg_buf_size, int number, const TEXT* file, int line)
{
/**************************************
*
Expand All @@ -425,12 +440,10 @@ static void internal_error(ISC_STATUS status, int number, const TEXT* file, int
* Things seem to be going poorly today.
*
**************************************/
TEXT errmsg[MAX_ERRMSG_LEN + 1];
if (gds__msg_lookup(0, FB_IMPL_MSG_FACILITY_JRD_BUGCHK, number, errmsg_buf_size, errmsg_buf, NULL) < 1)
strcpy(errmsg_buf, "Internal error code");

if (gds__msg_lookup(0, FB_IMPL_MSG_FACILITY_JRD_BUGCHK, number, sizeof(errmsg), errmsg, NULL) < 1)
strcpy(errmsg, "Internal error code");

const size_t len = strlen(errmsg);
const size_t len = strlen(errmsg_buf);

if (file)
{
Expand All @@ -444,12 +457,10 @@ static void internal_error(ISC_STATUS status, int number, const TEXT* file, int
break;
}
}
fb_utils::snprintf(errmsg + len, sizeof(errmsg) - len,
fb_utils::snprintf(errmsg_buf + len, errmsg_buf_size - len,
" (%d), file: %s line: %d", number, ptr, line);
}
else {
fb_utils::snprintf(errmsg + len, sizeof(errmsg) - len, " (%d)", number);
fb_utils::snprintf(errmsg_buf + len, errmsg_buf_size - len, " (%d)", number);
}

ERR_post(Arg::Gds(status) << Arg::Str(errmsg));
}
19 changes: 11 additions & 8 deletions src/jrd/os/posix/unix.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -127,7 +127,7 @@ static jrd_file* seek_file(jrd_file*, BufferDesc*, FB_UINT64*, FbStatusVector*);
static jrd_file* setup_file(Database*, const PathName&, const int, const bool, const bool, const bool);
static void lockDatabaseFile(int& desc, const bool shareMode, const bool temporary,
const char* fileName, ISC_STATUS operation);
static bool unix_error(const TEXT*, const jrd_file*, ISC_STATUS, FbStatusVector* = NULL);
static bool unix_error(const TEXT*, const jrd_file*, ISC_STATUS, FbStatusVector* = NULL, bool print_errno = true);
static bool block_size_error(const jrd_file*, off_t, FbStatusVector* = NULL);
#if !(defined HAVE_PREAD && defined HAVE_PWRITE)
static SLONG pread(int, SCHAR*, SLONG, SLONG);
Expand Down Expand Up @@ -480,7 +480,7 @@ ULONG PIO_get_number_of_pages(const jrd_file* file, const USHORT pagesize)
**************************************/

if (file->fil_desc == -1)
unix_error("fstat", file, isc_io_access_err);
unix_error("PIO_get_number_of_pages", file, isc_io_access_err, NULL, false);

struct STAT statistics;
if (os_utils::fstat(file->fil_desc, &statistics))
Expand Down Expand Up @@ -546,7 +546,7 @@ void PIO_header(thread_db* tdbb, UCHAR* address, int length)
jrd_file* file = pageSpace->file;

if (file->fil_desc == -1)
unix_error("PIO_header", file, isc_io_read_err);
unix_error("PIO_header", file, isc_io_read_err, NULL, false);

for (i = 0; i < IO_RETRY; i++)
{
Expand Down Expand Up @@ -752,7 +752,7 @@ bool PIO_read(thread_db* tdbb, jrd_file* file, BufferDesc* bdb, Ods::pag* page,
FB_UINT64 offset;

if (file->fil_desc == -1)
return unix_error("read", file, isc_io_read_err, status_vector);
return unix_error("PIO_read", file, isc_io_read_err, status_vector, false);

Database* const dbb = tdbb->getDatabase();

Expand Down Expand Up @@ -804,7 +804,7 @@ bool PIO_write(thread_db* tdbb, jrd_file* file, BufferDesc* bdb, Ods::pag* page,
FB_UINT64 offset;

if (file->fil_desc == -1)
return unix_error("write", file, isc_io_write_err, status_vector);
return unix_error("PIO_write", file, isc_io_write_err, status_vector, false);

Database* const dbb = tdbb->getDatabase();

Expand Down Expand Up @@ -860,7 +860,7 @@ static jrd_file* seek_file(jrd_file* file, BufferDesc* bdb, FB_UINT64* offset,

if (file->fil_desc == -1)
{
unix_error("lseek", file, isc_io_access_err, status_vector);
unix_error("seek_file", file, isc_io_access_err, status_vector, false);
return 0;
}

Expand Down Expand Up @@ -1015,7 +1015,7 @@ static void lockDatabaseFile(int& desc, const bool share, const bool temporary,

static bool unix_error(const TEXT* string,
const jrd_file* file, ISC_STATUS operation,
FbStatusVector* status_vector)
FbStatusVector* status_vector, bool print_errno)
{
/**************************************
*
Expand All @@ -1030,7 +1030,10 @@ static bool unix_error(const TEXT* string,
**************************************/
Arg::Gds err(isc_io_error);
err << string << file->fil_string <<
Arg::Gds(operation) << Arg::Unix(errno);
Arg::Gds(operation);

if (print_errno)
err << Arg::Unix(errno);

if (!status_vector)
ERR_post(err);
Expand Down