Adding EOS Tokens to Qwen Models #2512

Open · ariG23498 wants to merge 8 commits into main

Conversation

ariG23498

Fix #2481

Current status 👇

from torchtune.models.qwen2 import Qwen2Tokenizer
from torchtune.data import Message

messages = [
    Message(role="user", content="Hello world!", masked=True),
    Message(role="assistant", content="How are you?", masked=False),
]

tokenizer = Qwen2Tokenizer(
    path="Qwen2-0.5B-Instruct/vocab.json",
    merges_file="Qwen2-0.5B-Instruct/merges.txt",
)
tokenized_text = tokenizer.tokenize_messages(messages)
print(tokenized_text)


# (
# [None, 151644, 872, 198, 9707, 1879, 0, 151645, 198, 151644, 77091, 198, 4340, 525, 498, 30, 151645, 198, 151645],
# [True, True, True, True, True, True, True, True, True, False, False, False, False, False, False, False, False, False, True]
# )

I am not very sure why None shows up as a token. Am I missing something obvious here? Any help would be great.

After the initial review, I will add tests.


pytorch-bot bot commented Mar 18, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchtune/2512

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 3d979b7 with merge base fb6d4cd:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot added the CLA Signed label on Mar 18, 2025
@ariG23498
Author

A gentle ping on this.

@joecummings
Contributor

Thanks for the ping, sorry this fell through the cracks!

The None is coming from the misuse of the bos_id here. bos_id is not actually defined for Qwen2, so it's showing up as None.
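
In other words, a minimal sketch of the guard being described (the function name is illustrative, not the actual torchtune code):

# Only prepend a BOS id when the tokenizer actually defines one, so an
# undefined bos_id never shows up in the token list as None.
def maybe_prepend_bos(tokens, bos_id):
    if bos_id is not None:
        return [bos_id] + tokens
    return tokens

print(maybe_prepend_bos([151644, 872], None))  # [151644, 872] -- nothing prepended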

@ariG23498
Author

Interesting.

Can I get a quick review on the changes I have made so far, just to be sure about my implementation?

@joecummings
Contributor

> Interesting.
>
> Can I get a quick review on the changes I have made so far, just to be sure about my implementation?

Code looks nice and clean and I don't see any obvious errors (with the exception of the bos_id thing I already pointed out to you). Of course, the real test will be the unit tests :)
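
For reference, a minimal sketch of the kind of unit test meant here; the tokenizer setup mirrors the snippet at the top of this PR, while the paths and assertions are illustrative rather than torchtune's actual test code:

from torchtune.data import Message
from torchtune.models.qwen2 import Qwen2Tokenizer


def test_tokenize_messages_appends_eos():
    # Placeholder paths; a real test would point at checked-in test assets.
    tokenizer = Qwen2Tokenizer(
        path="Qwen2-0.5B-Instruct/vocab.json",
        merges_file="Qwen2-0.5B-Instruct/merges.txt",
    )
    messages = [
        Message(role="user", content="Hello world!", masked=True),
        Message(role="assistant", content="How are you?", masked=False),
    ]
    tokens, mask = tokenizer.tokenize_messages(messages)
    assert len(tokens) == len(mask)
    assert tokens[-1] == tokenizer.eos_id
    assert None not in tokens  # no undefined bos_id leaking in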

@ysurs

ysurs commented May 4, 2025

Hey @ariG23498, @joecummings, I am new to torchtune and OSS, and I was thinking of taking up Qwen2_5. Will this PR solve the issue for both Qwen2 and Qwen2_5?

I think Qwen2_5's tokenizer inherits from Qwen2's, so there is no need to implement it separately for Qwen2_5?

@ariG23498
Author

@joecummings I think the PR is in a good state for review. The tests pass on my end.

@ysurs I checked the Qwen 2.5 tokenizer; it does inherit from Qwen 2's tokenizer, but I guess the edits still need to be made, as the current tokenizer API would have changed.
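
A quick way to sanity-check the inheritance point above; the Qwen 2.5 import path and class name below are assumptions and should be verified against the actual module:

from torchtune.models.qwen2 import Qwen2Tokenizer
# Assumed export name for the Qwen 2.5 tokenizer; check the module if it differs.
from torchtune.models.qwen2_5 import Qwen2_5Tokenizer

print(issubclass(Qwen2_5Tokenizer, Qwen2Tokenizer))
# If tokenize_messages is not overridden in the subclass, the Qwen 2 change
# propagates to Qwen 2.5 automatically.
print("tokenize_messages" in vars(Qwen2_5Tokenizer))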

@codecov-commenter

codecov-commenter commented May 7, 2025

Codecov Report

Attention: Patch coverage is 93.75000% with 2 lines in your changes missing coverage. Please review.

Project coverage is 60.70%. Comparing base (fb6d4cd) to head (3d979b7).

Files with missing lines               Patch %   Lines
torchtune/models/qwen2/_tokenizer.py   93.33%    2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2512      +/-   ##
==========================================
- Coverage   64.26%   60.70%   -3.56%     
==========================================
  Files         427      427              
  Lines       25988    25997       +9     
==========================================
- Hits        16700    15781     -919     
- Misses       9288    10216     +928     

☔ View full report in Codecov by Sentry.

@ariG23498
Author

Hi @joecummings,

The PR had some incorrect docstrings, which I have fixed. Can you trigger the tests, so that I can be certain about the changes?

@joecummings
Contributor

> Hi @joecummings,
>
> The PR had some incorrect docstrings, which I have fixed. Can you trigger the tests, so that I can be certain about the changes?

If you look at the contributing guide, you should be able to run these tests very quickly on your local machine to ensure that you're passing the linter.

@ebsmothers left a comment
Contributor

Thanks for the PR @ariG23498! Two small comments, but other than that this looks good to me.

mask.append(mask[-1])
if add_end_tokens:
    tokens = tokens + [self.eos_id]
    mask = mask + [mask[-1] if mask else True]
Contributor

Noob q: why is this different than what we're doing for Llama3? ref

@ariG23498 May 13, 2025
Author

If we appended True to the mask whenever there is an end token, this part of the test fails:

expected_mask = [True] * 67 + [False] * 121

because the expected mask ends with False (see the toy illustration below).

Thanks for catching this; I wanted to ask about the test itself but forgot.

Do you think I should change this test?
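
Toy illustration of the two masking choices discussed above (the EOS id follows the example output at the top of this PR; the other token ids are made up):

eos_id = 151645
tokens = [11, 22, 33]
mask = [True, False, False]  # trailing assistant tokens are unmasked (False)

# Behavior in this PR: the appended EOS inherits the mask of the last token,
# so a mask that ends in False keeps ending in False.
mask_pr = mask + [mask[-1] if mask else True]  # [True, False, False, False]

# Alternative (always append True): the mask would end in True, which breaks
# expected_mask = [True] * 67 + [False] * 121.
mask_alt = mask + [True]  # [True, False, False, True]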

@joecummings requested a review from ebsmothers May 13, 2025 13:55
@ariG23498
Author

Hi folks!

Let me know if I need to work on anything to get this over the finish line 🤗

Labels
CLA Signed: This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed.

Successfully merging this pull request may close these issues: Add add_end_token to the Qwen Models

6 participants