Skip to content

Classic queue v2 can crash when used with mirroring #6176

Closed
@mkuratczyk

Description

@mkuratczyk

In 3.11, when classic queues v2 are used with mirroring, they can crash like this:

** Reason for termination ==
** {{badmatch,-1},
    [{rabbit_mirror_queue_slave,update_delta,2,
                                [{file,"rabbit_mirror_queue_slave.erl"},
                                 {line,1081}]},
     {rabbit_mirror_queue_slave,process_instruction,2,
                                [{file,"rabbit_mirror_queue_slave.erl"},
                                 {line,1009}]},
     {rabbit_mirror_queue_slave,handle_cast,2,
                                [{file,"rabbit_mirror_queue_slave.erl"},
                                 {line,299}]},
     {gen_server2,handle_msg,2,[{file,"gen_server2.erl"},{line,1067}]},
     {proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,240}]}]}

  crasher:
    initial call: rabbit_prequeue:init/1
    pid: <0.12215.0>
    registered_name: []
    exception exit: {{badmatch,-1},
                     [{rabbit_mirror_queue_slave,update_delta,2,
                          [{file,"rabbit_mirror_queue_slave.erl"},
                           {line,1081}]},
                      {rabbit_mirror_queue_slave,process_instruction,2,
                          [{file,"rabbit_mirror_queue_slave.erl"},
                           {line,1009}]},
                      {rabbit_mirror_queue_slave,handle_cast,2,
                          [{file,"rabbit_mirror_queue_slave.erl"},{line,299}]},
                      {gen_server2,handle_msg,2,
                          [{file,"gen_server2.erl"},{line,1067}]},
                      {proc_lib,init_p_do_apply,3,
                          [{file,"proc_lib.erl"},{line,240}]}]}
      in function  gen_server2:terminate/3 (gen_server2.erl, line 1183)
    ancestors: [<0.12062.0>,<0.1605.0>,<0.1588.0>,<0.1587.0>,
                  rabbit_vhost_sup_sup,rabbit_sup,<0.1313.0>]
    message_queue_len: 0
    messages: []
    links: [<0.12062.0>,<0.12216.0>]
    dictionary: [{credit_blocked,[]},
                  {process_name,
                      {rabbit_mirror_queue_slave,
                          {resource,<<"/">>,queue,
                              <<"cmq-publish_then_consume">>}}},
                  {guid,{{548389537,174457376,3517217523,4084119289},1}},
                  {explicit_gc_run_operation_threshold,1000},
                  {rand_seed,
                      {#{jump => #Fun<rand.3.34006561>,
                         max => 288230376151711743,
                         next => #Fun<rand.5.34006561>,type => exsplus},
                       [247167938309581976|185153570766432085]}},
                  {credit_flow_default_credit,{400,200}}]
    trap_exit: true
    status: running
    heap_size: 18481566
    stack_size: 35
    reductions: 56778258
  neighbours:
    neighbour:
      pid: <0.12216.0>
      registered_name: []
      initial call: gm:init/1
      current_function: {gen_server2,process_next_msg,1}
      ancestors: [<0.12215.0>,<0.12062.0>,<0.1605.0>,<0.1588.0>,<0.1587.0>,
                  rabbit_vhost_sup_sup,rabbit_sup,<0.1313.0>]
      message_queue_len: 0
      links: [<0.12215.0>]
      trap_exit: false
      status: waiting
      heap_size: 610
      stack_size: 11
      reductions: 115417
      current_stacktrace: [{gen_server2,process_next_msg,1,
                               [{file,"gen_server2.erl"},{line,673}]},
                  {proc_lib,init_p_do_apply,3,
                            [{file,"proc_lib.erl"},{line,240}]}]
    supervisor: {<0.12062.0>,rabbit_amqqueue_sup}
    errorContext: child_terminated
    reason: {{badmatch,-1},
             [{rabbit_mirror_queue_slave,update_delta,2,
                  [{file,"rabbit_mirror_queue_slave.erl"},{line,1081}]},
              {rabbit_mirror_queue_slave,process_instruction,2,
                  [{file,"rabbit_mirror_queue_slave.erl"},{line,1009}]},
              {rabbit_mirror_queue_slave,handle_cast,2,
                  [{file,"rabbit_mirror_queue_slave.erl"},{line,299}]},
              {gen_server2,handle_msg,2,
                  [{file,"gen_server2.erl"},{line,1067}]},
              {proc_lib,init_p_do_apply,3,
                  [{file,"proc_lib.erl"},{line,240}]}]}
    offender: [{pid,<0.12215.0>},
               {id,rabbit_amqqueue},
               {mfargs,
                   {rabbit_prequeue,start_link,
                       [{amqqueue,
                            {resource,<<"/">>,queue,
                                <<"cmq-publish_then_consume">>},
                            true,false,none,
                            [{<<"x-max-length">>,long,100000},
                             {<<"x-queue-mode">>,longstr,<<"lazy">>},
                             {<<"x-queue-version">>,long,2}],
                            <48209.8417.0>,[],[],[],
                            [{vhost,<<"/">>},
                             {name,<<"ha">>},
                             {pattern,<<"cmq.*">>},
                             {'apply-to',<<"queues">>},
                             {definition,
                                 [{<<"ha-mode">>,<<"exactly">>},
                                  {<<"ha-params">>,3},
                                  {<<"ha-sync-mode">>,<<"automatic">>}]},
                             {priority,1}],
                            undefined,
                            [{<48209.8420.0>,<48209.8417.0>}],
                            [],live,0,[],<<"/">>,
                            #{user => <<"guest">>},
                            rabbit_classic_queue,#{}},
                        slave,<0.12061.0>]}},
               {restart_type,intrinsic},
               {shutdown,600000},
               {child_type,worker}]

Steps to reproduce:

  1. git checkout v3.11.x
  2. bazel run start-cluster
  3. rabbitmqctl -n rabbit-0 set_policy --apply-to queues --priority 1 ha "cmq.*" '{"ha-mode":"exactly", "ha-params": 3, "ha-sync-mode": "automatic"}'
  4. Publish messages to a queue and then consume them:
perf-test -x 1 -y 0 -c 3000 -u cmq-publish_then_consume -qa x-max-length=100000,x-queue-version=2,x-queue-mode=lazy -C 100000 -ad false -f persistent
perf-test -x 0 -y 1 -u cmq-publish_then_consume -qa x-max-length=100000,x-queue-version=2,x-queue-mode=lazy -ad false -f persistent

The issue doesn't seem to occur with CQv1. With v2, it occurs both with lazy and non-lazy, but it seems easier to reproduce with lazy. In both cases, given enough messages in the queue (the above 100k is generally not sufficient but just change it to 5M or something), it leads to a memory spike. Especially with non-lazy CQv2, that spike often leads to a memory alarm or even OOMkill. It looks like this:
crash

I was not able to reproduce using main.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions