
Enhancement: Codellama FIM tokenization #2818

Closed
@apaz-cli

I assume the project will want to support fill-in-the-middle (FIM) tokenization so that it works with the Code Llama models. How should this be accomplished?

Reading the codellama paper (https://arxiv.org/abs/2308.12950), here's what they say about FIM:

We extend Llama 2’s tokenizer with four special tokens that mark the beginning of the prefix,
the middle part or the suffix, and the end of the infilling span. To limit the distribution shift
between autoregressive and infilling training, we suppress the implicit leading space that
SentencePiece tokenizers add upon encoding the middle part and the suffix (Kudo & Richardson, 2018).
In SPM format, we concatenate the prefix and the middle part before encoding to tokens.
Note that our model doesn’t encounter split subtokens in the SPM format while it does in the PSM format.

In the addendum of the paper, the authors suggest using PSM (prefix-suffix-middle) format rather than SPM (suffix-prefix-middle) format, or using SPM format together with token healing. PSM seems the more sensible starting point to me.

So, about those four tokens. They are defined here:
https://github.com/facebookresearch/codellama/blob/cb51c14ec761370ba2e2bc351374a79265d0465e/llama/tokenizer.py#L28-L31
According to the tokenizer, their ids are:

self.prefix_id = 32007
self.middle_id = 32009
self.suffix_id = 32008
self.eot_id = 32010
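
To make the layout concrete, here is a minimal sketch of how a PSM or SPM prompt could be assembled from these ids. It is an illustration under stated assumptions, not the reference implementation: `build_fim_tokens`, `take_until_eot`, and `toy_encode` are hypothetical names, `BOS_ID = 1` is assumed from Llama 2, the byte-level `toy_encode` merely stands in for a real SentencePiece encoder, and the SPM ordering shown is my reading of the format rather than something confirmed in this issue.

```python
from typing import Callable, List, Optional

# Special-token ids as reported in this issue.
PREFIX_ID = 32007
MIDDLE_ID = 32009
SUFFIX_ID = 32008
EOT_ID = 32010
BOS_ID = 1  # assumption: same BOS id as Llama 2

Encoder = Callable[[str], List[int]]


def build_fim_tokens(
    prefix: str,
    suffix: str,
    encode: Encoder,
    encode_infill: Optional[Encoder] = None,
    suffix_first: bool = False,
) -> List[int]:
    """Assemble an infilling prompt as a list of token ids.

    encode_infill should be an encoder that suppresses the implicit
    leading space, as the paper describes for the suffix; it falls
    back to encode if not given.
    """
    encode_infill = encode_infill or encode
    pre = encode(prefix)
    suf = encode_infill(suffix)
    if suffix_first:
        # SPM layout (assumed): <PRE> <SUF> suffix <MID> prefix
        return [BOS_ID, PREFIX_ID, SUFFIX_ID] + suf + [MIDDLE_ID] + pre
    # PSM layout: <PRE> prefix <SUF> suffix <MID>
    return [BOS_ID, PREFIX_ID] + pre + [SUFFIX_ID] + suf + [MIDDLE_ID]


def take_until_eot(generated: List[int]) -> List[int]:
    """Keep generated ids up to, but not including, the first EOT id."""
    out: List[int] = []
    for tok in generated:
        if tok == EOT_ID:
            break
        out.append(tok)
    return out


def toy_encode(text: str) -> List[int]:
    """Stand-in encoder: one id per UTF-8 byte (not a real tokenizer)."""
    return list(text.encode("utf-8"))


psm = build_fim_tokens("ab", "c", toy_encode)
spm = build_fim_tokens("ab", "c", toy_encode, suffix_first=True)
```

The model's continuation after the final token is the middle part, so generation would stop at the first `EOT_ID`, which is what `take_until_eot` models.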

With this, it should be possible to stitch together FIM functionality from the project's existing capabilities. I'm working on it, and a PR is probably forthcoming.

My questions are:

  • Are there any considerations for what the FIM API should look like?
    • How do you select the middle? Do you embed markers such as <MID> in the query string, or pass an offset into a list of tokens or characters?
  • What is the llama.cpp equivalent of sp_model.piece_to_id("▁<MID>")?
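
For the API question, the two options can be compared with a small sketch. Both reduce to producing a (prefix, suffix) pair that is then tokenized. Everything here is hypothetical: the function names and the `<FILL_ME>` placeholder are chosen for illustration only and are not part of any existing project API.

```python
from typing import Tuple

FILL_MARKER = "<FILL_ME>"  # hypothetical placeholder string


def split_at_cursor(text: str, offset: int) -> Tuple[str, str]:
    """Option A: the caller passes a character offset into the text."""
    if not 0 <= offset <= len(text):
        raise ValueError("offset out of range")
    return text[:offset], text[offset:]


def split_at_marker(text: str, marker: str = FILL_MARKER) -> Tuple[str, str]:
    """Option B: the caller embeds exactly one marker in the query string."""
    if text.count(marker) != 1:
        raise ValueError("expected exactly one fill marker")
    prefix, _, suffix = text.partition(marker)
    return prefix, suffix


by_cursor = split_at_cursor("abcd", 2)
by_marker = split_at_marker("ab<FILL_ME>cd")
```

The offset form avoids any collision between the marker and the user's own text, while the marker form is easier to use from a plain prompt string.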
