
New embedding quant fusion #10325


Merged
merged 3 commits into pytorch:main on Apr 23, 2025
Conversation

metascroy (Contributor)

Summary:
The diff adds new quant fusion passes to recognize 2-, 4-, and 8-bit quantized embeddings (per-group and per-channel) and fuse them into ExecuTorch kernels. This makes torchao's quantize_ integrate with ExecuTorch:

```
quantize_(
    model,
    IntxWeightOnlyConfig(weight_dtype=torch.int4, granularity=PerGroup(32)),
    lambda m, fqn: isinstance(m, torch.nn.Embedding),
)

# lower model to executorch
```

For the model to lower, we need to run QuantFusionPass. For sub-byte widths, we also need to run constant_prop_pass (see the new unit tests for examples). In follow-up diffs, we will enable these passes by default in to_executorch, before the memory-planning and out-variant passes.
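For reference, a minimal sketch of that flow based on the new unit tests; the torchao import paths and the exact way the passes are invoked here are assumptions and may differ from the final API:

```
import torch
from executorch.exir import to_edge
from executorch.exir.passes.constant_prop_pass import constant_prop_pass
from executorch.exir.passes.quant_fusion_pass import QuantFusionPass
from torchao.quantization.granularity import PerGroup
from torchao.quantization.quant_api import IntxWeightOnlyConfig, quantize_

# Toy model with an embedding table whose weights we quantize to 4 bits.
model = torch.nn.Sequential(torch.nn.Embedding(100, 64))
indices = torch.randint(0, 100, (8,))

quantize_(
    model,
    IntxWeightOnlyConfig(weight_dtype=torch.int4, granularity=PerGroup(32)),
    lambda m, fqn: isinstance(m, torch.nn.Embedding),
)

# QuantFusionPass rewrites the dequantize + embedding pattern into the
# quantized_decomposed embedding ops; constant_prop_pass folds the
# weight-packing ops into constants (needed for the sub-byte variants).
ep = torch.export.export(model, (indices,))
edge = to_edge(ep).transform([QuantFusionPass()])
fused_ep = constant_prop_pass(edge.exported_program())
```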

Differential Revision: D73381542


pytorch-bot bot commented Apr 21, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/10325

Note: Links to docs will display an error until the docs builds have been completed.

❌ 2 New Failures

As of commit d88601f with merge base ad1b154:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot added the CLA Signed label on Apr 21, 2025
@facebook-github-bot (Contributor)

This pull request was exported from Phabricator. Differential Revision: D73381542

@facebook-github-bot (Contributor)

@metascroy has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

weight_1 = weight_view[:, :, 1] << 2
weight_2 = weight_view[:, :, 2] << 4
weight_3 = weight_view[:, :, 3] << 6
packed_weight = weight_0 + weight_1 + weight_2 + weight_3
Contributor
nit: Just do bitwise OR.
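As a standalone sketch of what the OR-based version looks like (hypothetical helper name; the packed bytes are identical since the shifted 2-bit fields never overlap):

```
import torch

def pack_2bit(weight_q: torch.Tensor) -> torch.Tensor:
    # weight_q: uint8 tensor of shape (rows, cols) holding values in [0, 3];
    # four 2-bit values are packed per byte, element k into bits [2k, 2k+1].
    view = weight_q.view(weight_q.shape[0], weight_q.shape[1] // 4, 4)
    return (
        view[:, :, 0]
        | (view[:, :, 1] << 2)
        | (view[:, :, 2] << 4)
        | (view[:, :, 3] << 6)
    )
```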

weight_view = weight_range_shifted.view(
weight.shape[0], weight.shape[1] // 2, 2
)
weight_even = weight_view[:, :, 0] * 16 # left shift 4
Contributor
Why * 16 here but shift by 4 above?

Contributor Author
Isn't * 16 the same as shift by 4?

I just copied the code from here: https://github.com/pytorch/executorch/blob/main/examples/models/llama/source_transformation/quantize.py#L659-L683

But I can clean it up and change it to use a left shift.

Contributor
Yeah, I was just highlighting it for consistency.

)
weight_even = weight_view[:, :, 0] * 16 # left shift 4
weight_odd = weight_view[:, :, 1]
packed_weight = weight_even + weight_odd
Contributor
bitwise OR
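Similarly, a standalone sketch of the 4-bit packing with both suggestions applied, i.e. an explicit left shift instead of * 16 and bitwise OR instead of addition (hypothetical helper name):

```
import torch

def pack_4bit(weight_q: torch.Tensor) -> torch.Tensor:
    # weight_q: uint8 tensor of shape (rows, cols) holding values in [0, 15];
    # two 4-bit values are packed per byte, the even element in the high nibble.
    view = weight_q.view(weight_q.shape[0], weight_q.shape[1] // 2, 2)
    return (view[:, :, 0] << 4) | view[:, :, 1]
```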

@@ -296,6 +362,22 @@ def embedding_2bit_dtype(
return torch.ops.aten.embedding.default(weight, indices)


@register_fake("quantized_decomposed::embedding_2bit.dtype")
Contributor
this is needed?

Contributor Author
I think so? You cannot dynamo trace code that doesn't have meta kernels registered.

Contributor
Yeah, sorry; I think I asked because I wasn't sure how it was working before without it. But yes, you do need a meta kernel.
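For illustration, a rough sketch of the kind of fake (meta) kernel registration being discussed; the op name matches the diff, but the argument list and output-shape logic below are assumptions, not the code in this PR:

```
import torch
from torch.library import register_fake

# torch.export/dynamo traces custom ops through their fake impl, which must
# compute output shape and dtype without touching real data.
@register_fake("quantized_decomposed::embedding_2bit.dtype")
def _(weight, weight_scales, weight_zero_points, weight_quant_min, weight_quant_max, indices, dtype=None):
    out_dtype = dtype if dtype is not None else weight_scales.dtype
    # 2-bit values are packed four per byte, so the unpacked embedding dim is
    # 4x the packed dim (assumption based on the packing scheme above).
    packed_dim = weight.shape[1]
    return torch.empty(*indices.shape, packed_dim * 4, dtype=out_dtype, device=weight.device)
```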

Comment on lines +548 to +550
num_embeddings, packed_embedding_dim = weight.shape
embedding_dim = packed_embedding_dim * 2
embedding = torch.nn.Embedding(num_embeddings, embedding_dim, device=weight.device)
Contributor
Very small nit: a small refactor could abstract these 3 lines out.
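One possible shape for that helper (the name and the values_per_byte parameter are hypothetical):

```
import torch

def _make_embedding_for_packed_weight(weight: torch.Tensor, values_per_byte: int) -> torch.nn.Embedding:
    # The packed weight stores `values_per_byte` quantized values per byte
    # (4 for 2-bit, 2 for 4-bit, 1 for 8-bit), so the unpacked embedding dim
    # is the packed dim scaled by that factor.
    num_embeddings, packed_embedding_dim = weight.shape
    embedding_dim = packed_embedding_dim * values_per_byte
    return torch.nn.Embedding(num_embeddings, embedding_dim, device=weight.device)
```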

def embedding_byte_dtype_pattern(
indices, int_data, group_size, scale, zero_point, output_dtype
):
dq = torch.ops.torchao.dequantize_affine.default(
Contributor
what does "INT" mean here?

Contributor Author
It's the zero point domain. It means the zero points are integers. But this is an arg torchao is going to change in their quant primitives when they clean them up.
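For context, this is the standard affine dequantization with an integer-domain zero point (illustrative numbers only, not the torchao API):

```
import torch

# "INT" zero point domain: the zero point is an integer subtracted in
# quantized space before scaling back to floating point.
q = torch.tensor([3, 7, 12], dtype=torch.int8)
scale, zero_point = 0.05, 8
dequant = (q - zero_point) * scale  # tensor([-0.2500, -0.0500,  0.2000])
```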

return torch.ops.aten.embedding.default(dq, indices)

def embedding_byte_replacement(indices, int_data, group_size, scale, zero_point):
zero_point_dtype_cast = torch.ops.aten.to.dtype(zero_point, scale.dtype)
Contributor
embedding byte ops take float zero point?

@@ -298,6 +298,8 @@ python_unittest(
"//caffe2:torch",
"//executorch/exir:lib",
"//executorch/exir/passes:quant_fusion_pass",
"//pytorch/ao:torchao",
"//executorch/exir/passes:constant_prop_pass",
Contributor
Shouldn't you call const prop in the QuantFusionPass?

Contributor Author (@metascroy, Apr 22, 2025)
Unfortunately, QuantFusionPass works on a graph module and const_prop_pass works on an exported_program. I thought we could enable both in to_executorch by default, rather than having users call the passes separately as is done in the unit test?

Contributor
I think const prop has some nuances which may make it non-trivial, for example for q/dq nodes.

My concern here would be the perf cliff for the uninitiated.

Can we not just const prop manually, instead of even introducing this op in the graph in the first place?

Contributor Author
Manual const propagation would still require updating the signature of the exported program because we're changing the weights. So it couldn't be done on the graph module.

In terms of perf cliff, it will not lower to ExecuTorch without const propagation because the pack embedding op has no out-variant.

Contributor
I see, OK. But let's plan to have this const prop done appropriately. I highly doubt that this can be done transparently; for example, quantized models have dq ops on weights that might get const propagated.

self._test_embedding_torchao(bit_width, test_dtype_variant, test_per_group)

def _test_embedding_torchao(
self, bit_width: int, test_dtype_variant: bool, test_per_group: bool
Contributor
test_dtype_variant as a bool feels like a bit of a misnomer.

Contributor Author
I'll change to use_dtype_variant?

Contributor
I think I misunderstood this, probably because it was a bit late when I was reviewing it. You just meant to check the .dtype variant of the op, and I thought you were checking fp16 vs fp32 vs other dtype variants.

@kimishpatel (Contributor) left a comment
Sending back for a question on const prop.

metascroy and others added 3 commits April 22, 2025 19:41
@facebook-github-bot (Contributor)
@metascroy has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@kimishpatel (Contributor) left a comment
On the const prop question, let's follow up in a separate PR.

@facebook-github-bot merged commit 28ee6f5 into pytorch:main on Apr 23, 2025 (82 of 86 checks passed).
Labels: CLA Signed, fb-exported, topic: not user facing