# Prerequisites
Please answer the following questions for yourself before submitting an issue.

- [x] I am running the latest code. Development is very rapid so there are no tagged versions as of now.
- [x] I carefully followed the README.md.
- [x] I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
- [x] I reviewed the Discussions, and have a new bug or useful enhancement to share.
# Feature Description
Databricks just released two new models, DBRX Base and DBRX Instruct. They use their own architecture:
```json
{
  "architectures": [
    "DbrxForCausalLM"
  ],
  "attn_config": {
    "clip_qkv": 8,
    "kv_n_heads": 8,
    "model_type": "",
    "rope_theta": 500000
  },
  "auto_map": {
    "AutoConfig": "configuration_dbrx.DbrxConfig",
    "AutoModelForCausalLM": "modeling_dbrx.DbrxForCausalLM"
  },
  "d_model": 6144,
  "emb_pdrop": 0.0,
  "ffn_config": {
    "ffn_hidden_size": 10752,
    "model_type": "",
    "moe_jitter_eps": 0,
    "moe_loss_weight": 0.05,
    "moe_num_experts": 16,
    "moe_top_k": 4
  },
  "initializer_range": 0.02,
  "max_seq_len": 32768,
  "model_type": "dbrx",
  "n_heads": 48,
  "n_layers": 40,
  "output_router_logits": false,
  "resid_pdrop": 0.0,
  "router_aux_loss_coef": 0.05,
  "tie_word_embeddings": false,
  "torch_dtype": "bfloat16",
  "transformers_version": "4.38.2",
  "use_cache": true,
  "vocab_size": 100352
}
```
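
To put those numbers in perspective: with `moe_top_k = 4` of `moe_num_experts = 16`, only a quarter of the expert weights run per token. A back-of-the-envelope parameter count from the config (my own estimate, assuming a gated three-matrix MLP per expert and a fused GQA QKV projection; not an official breakdown) lands close to the 132B-total / 36B-active figures Databricks advertises:

```python
# Rough parameter estimate from config.json above (sanity check only).
d_model, n_layers, vocab = 6144, 40, 100352
n_heads, kv_n_heads = 48, 8
ffn_hidden, n_experts, top_k = 10752, 16, 4

head_dim = d_model // n_heads                           # 128
attn = d_model * (d_model + 2 * kv_n_heads * head_dim)  # fused QKV (GQA)
attn += d_model * d_model                               # output projection
expert = 3 * d_model * ffn_hidden                       # w1, v1, w2 (assumed gated MLP)

emb = 2 * vocab * d_model                               # tie_word_embeddings is false
total = n_layers * (attn + n_experts * expert) + emb
active = n_layers * (attn + top_k * expert) + emb
print(f"total ~{total / 1e9:.0f}B, active ~{active / 1e9:.0f}B")  # ~132B / ~36B
```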
# Motivation
These models outperform predecessors such as Llama-2 and Mixtral (even though they are larger); the community could really benefit from them and from the fine-tuned models that will follow.
https://huggingface.co/databricks/dbrx-instruct
# Possible Implementation
Both existing conversion scripts currently fail on the DBRX checkpoints:
```
python llama.cpp/convert-hf-to-gguf.py
Traceback (most recent call last):
  File "/llama.cpp/convert-hf-to-gguf.py", line 2099, in <module>
    main()
  File "/llama.cpp/convert-hf-to-gguf.py", line 2079, in main
    model_class = Model.from_model_architecture(hparams["architectures"][0])
  File "/llama.cpp/convert-hf-to-gguf.py", line 215, in from_model_architecture
    raise NotImplementedError(f'Architecture {arch!r} not supported!') from None
NotImplementedError: Architecture 'DbrxForCausalLM' not supported!
```
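
The first failure is straightforward: the architecture simply isn't registered in the converter. Support would start with a new `Model` subclass in `convert-hf-to-gguf.py`; here is a minimal sketch, assuming the `Model.register` mechanism the traceback above goes through. `MODEL_ARCH.DBRX`, the tensor mappings, and the C++ graph would all still need to be added, so everything DBRX-specific below is hypothetical:

```python
# Hypothetical sketch -- none of this exists yet for DBRX.
@Model.register("DbrxForCausalLM")
class DbrxModel(Model):
    model_arch = gguf.MODEL_ARCH.DBRX  # new enum value, would need to be added to gguf-py

    def set_gguf_parameters(self):
        hp = self.hparams
        self.gguf_writer.add_context_length(hp["max_seq_len"])
        self.gguf_writer.add_embedding_length(hp["d_model"])
        self.gguf_writer.add_block_count(hp["n_layers"])
        self.gguf_writer.add_head_count(hp["n_heads"])
        self.gguf_writer.add_head_count_kv(hp["attn_config"]["kv_n_heads"])
        self.gguf_writer.add_rope_freq_base(hp["attn_config"]["rope_theta"])
        self.gguf_writer.add_clamp_kqv(hp["attn_config"]["clip_qkv"])
        self.gguf_writer.add_feed_forward_length(hp["ffn_config"]["ffn_hidden_size"])
        self.gguf_writer.add_expert_count(hp["ffn_config"]["moe_num_experts"])
        self.gguf_writer.add_expert_used_count(hp["ffn_config"]["moe_top_k"])
```

The harder part is the tensor mapping, since the checkpoint layout seems to differ from Mixtral's, as the second failure shows.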
```
python llama.cpp/convert.py
Traceback (most recent call last):
  File "/llama.cpp/convert.py", line 1486, in <module>
    main()
  File "/llama.cpp/convert.py", line 1422, in main
    model_plus = load_some_model(args.model)
  File "/llama.cpp/convert.py", line 1291, in load_some_model
    model_plus = merge_multifile_models(models_plus)
  File "/llama.cpp/convert.py", line 747, in merge_multifile_models
    model = merge_sharded([mp.model for mp in models_plus])
  File "/llama.cpp/convert.py", line 726, in merge_sharded
    return {name: convert(name) for name in names}
  File "/llama.cpp/convert.py", line 726, in <dictcomp>
    return {name: convert(name) for name in names}
  File "/llama.cpp/convert.py", line 701, in convert
    lazy_tensors: list[LazyTensor] = [model[name] for model in models]
  File "/llama.cpp/convert.py", line 701, in <listcomp>
    lazy_tensors: list[LazyTensor] = [model[name] for model in models]
KeyError: 'transformer.blocks.0.ffn.experts.mlp.w1'
```
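
The `KeyError` is the more interesting failure: `transformer.blocks.0.ffn.experts.mlp.w1` suggests that DBRX stores all 16 experts of a layer fused into a single tensor, rather than as separate per-expert modules the way Mixtral does, so the converter never sees the names it expects. If that reading is right, the converter would need to split (or directly re-emit) the fused tensor; a sketch under the assumption of an `(n_experts * ffn_hidden_size, d_model)` layout, which I have not verified against `modeling_dbrx.py`:

```python
import torch

# ASSUMPTION: each layer's experts are concatenated along dim 0 of one tensor.
# Shapes come from config.json; expert order and transpose convention must be
# checked against modeling_dbrx.py before trusting this.
n_experts, ffn_hidden, d_model = 16, 10752, 6144
w1 = torch.empty(n_experts * ffn_hidden, d_model)  # stand-in for the checkpoint tensor

experts = w1.view(n_experts, ffn_hidden, d_model)  # split into per-expert weights
for i, w in enumerate(experts):
    # Tensor naming here is illustrative only, not an agreed GGUF convention.
    print(f"blk.0.ffn_gate.{i}.weight", tuple(w.shape))  # (10752, 6144)
```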
DBRX is a mixture-of-experts (MoE) model in which each FFN layer is divided into 16 experts, of which only 4 are active for any given token. According to Databricks, it builds on MegaBlocks: https://github.com/databricks/megablocks
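
For reference, top-4-of-16 routing is the standard MoE gating pattern; a toy sketch of the token-level expert selection (illustrative only, not DBRX's actual router code):

```python
import torch
import torch.nn.functional as F

# Toy top-4-of-16 router for a single token (illustrative only).
d_model, n_experts, top_k = 6144, 16, 4
router = torch.nn.Linear(d_model, n_experts, bias=False)
x = torch.randn(1, d_model)

logits = router(x)                           # (1, 16) router scores
weights, chosen = torch.topk(logits, top_k)  # indices of the 4 best experts
weights = F.softmax(weights, dim=-1)         # renormalize over the chosen 4
# The token's FFN output is the weighted sum of the 4 chosen experts' outputs;
# the other 12 experts are never evaluated for this token.
print(chosen.tolist(), weights.tolist())
```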