k quant #2169
Closed
Commits (12):
4b610c7 k quant (jiafatom)
6015feb [pre-commit.ci] auto fixes from pre-commit.com hooks (pre-commit-ci[bot])
c3318cf k quant (jiafatom)
115dee1 Merge branch 'k_quant' of https://github.com/jiafatom/neural-compress… (jiafatom)
1b3518a [pre-commit.ci] auto fixes from pre-commit.com hooks (pre-commit-ci[bot])
4542a33 Merge branch 'k_quant' of https://github.com/jiafatom/neural-compress… (jiafatom)
d91b4e5 Merge branch 'k_quant' of https://github.com/jiafatom/neural-compress… (jiafatom)
de4f7f0 Merge branch 'k_quant' of https://github.com/jiafatom/neural-compress… (jiafatom)
0a1a0d4 test (jiafatom)
99f10df Merge branch 'k_quant' of https://github.com/jiafatom/neural-compress… (jiafatom)
903604f Merge branch 'int8_new' into k_quant (jiafatom)
0440905 [pre-commit.ci] auto fixes from pre-commit.com hooks (pre-commit-ci[bot])
@@ -247,6 +247,169 @@ def quant_tensor(data, num_bits=4, group_size=32, scheme="asym", dtype="int", ra
    return q_weight, scale, zero_point

def quant_tensor_k_quant_cpu(data, num_bits=4, group_size=32):
    """Quantize tensor per group based on k quant.

    Ref: https://github.com/ggml-org/llama.cpp/blob/64eda5deb9859e87a020e56bab5d2f9ca956f1de/ggml/src/ggml-quants.c

    Args:
        data : input weight
        num_bits (int, optional): num_bits. Defaults to 4.
        group_size (int, optional): how many elements share one scale/zp. Defaults to 32.

    Returns:
        output: quantized weight
        scale: scale
        zero_point: zero point
    """
    data = np.reshape(data, (-1, group_size)).astype(np.float32)  # (nb, group_size)
    maxq = 2**num_bits - 1
    minq = 0
    sum_x2 = np.sum(data**2, axis=1, keepdims=True)  # (nb, 1)
    av_x = np.sqrt(sum_x2 / group_size)  # (nb, 1)
    weights = np.add(av_x, np.abs(data))  # (nb, group_size)
    rmin = np.min(data, axis=1, keepdims=True)  # (nb, 1)
    rmax = np.max(data, axis=1, keepdims=True)  # (nb, 1)
    sum_w = np.sum(weights, axis=1, keepdims=True)  # (nb, 1)
    sum_x = np.sum(weights * data, axis=1, keepdims=True)  # (nb, 1)
    iscale = np.ones(rmax.shape, dtype=data.dtype)  # (nb, 1)
    mask = rmin != rmax
    iscale[mask] = (maxq - minq) / (rmax[mask] - rmin[mask])
    scale = 1 / iscale
    quant_data = np.clip(np.round(iscale * (data - rmin)), minq, maxq)  # (nb, group_size)
    diff = scale * quant_data + rmin - data  # (nb, group_size)
    best_mad = np.sum(weights * diff**2, axis=1, keepdims=True)  # (nb, 1)
    nstep = 20
    rdelta = 0.1
    # nstep * rdelta = -2 * rrmin, maxq - minq = 2**num_bits - 1
    rrmin = -1
    for is_ in range(nstep):
        iscale_new = np.ones(rmax.shape, dtype=data.dtype)  # (nb, 1)
        factor = np.array([rrmin + rdelta * is_ + maxq - minq]).astype(data.dtype)[0]
        mask = rmin != rmax
        iscale_new[mask] = factor / (rmax[mask] - rmin[mask])
        quant_data_new = np.clip(np.round(iscale_new * (data - rmin)), minq, maxq)  # (nb, group_size)
Review comment on the line above: I think maybe there's an issue with the algorithm. Since GGUF supports float zero points, rmin is subtracted in this line. However, in INC, only integer zero points are supported, so I think rmin should be replaced by the zero point (zp).
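A minimal sketch of that suggestion, reusing the variables above (illustrative only, not part of this diff):

        # hypothetical integer-zero-point variant (INC-style), replacing the float minimum rmin
        zp = np.clip(np.round(-rmin / scale), minq, maxq)  # (nb, 1) integer zero point
        quant_data_new = np.clip(np.round(data / scale + zp), minq, maxq)  # quantize against zp
        recon = scale * (quant_data_new - zp)  # dequantized estimate of data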
        mul_weights_quant_data_new = weights * quant_data_new
        sum_l = np.sum(mul_weights_quant_data_new, axis=1, keepdims=True)  # (nb, 1)
        sum_l2 = np.sum(mul_weights_quant_data_new * quant_data_new, axis=1, keepdims=True)  # (nb, 1)
        sum_xl = np.sum(mul_weights_quant_data_new * data, axis=1, keepdims=True)  # (nb, 1)
        D = np.subtract(sum_w * sum_l2, sum_l**2)  # (nb, 1)

        this_scale = (sum_w * sum_xl - sum_x * sum_l) / D  # (nb, 1)
        this_min = (sum_l2 * sum_x - sum_l * sum_xl) / D  # (nb, 1)

        diff = this_scale * quant_data_new + this_min - data  # (nb, group_size)
        mad = np.sum(weights * diff**2, axis=1, keepdims=True)  # (nb, 1)

        mad_1 = np.array(mad)
        best_mad_1 = np.array(best_mad)
        idx_to_replace = np.where(mad_1 < best_mad_1)[0]
        quant_data[idx_to_replace, :] = quant_data_new[idx_to_replace, :]
        best_mad[idx_to_replace] = mad[idx_to_replace]
        scale[idx_to_replace] = this_scale[idx_to_replace]
        rmin[idx_to_replace] = this_min[idx_to_replace]

    zero_point = np.clip(((-rmin) / scale).round(), 0, maxq).astype("uint8")
    scale = scale.astype(np.float64)
    q_weight = np.empty_like(data, dtype=scale.dtype)
    np.divide(data, scale, out=q_weight)
    np.add(q_weight, zero_point, out=q_weight)
    np.round(q_weight, out=q_weight)
    np.clip(q_weight, minq, maxq, out=q_weight)

    return q_weight, scale, zero_point
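For reference, this_scale and this_min in the loop are the closed-form solution of a per-group weighted least-squares fit (a sketch of the derivation, not part of the PR; the notation maps onto the variables in the code). Each group minimizes

    E(s, m) = \sum_i w_i (s q_i + m - x_i)^2

over the scale s and offset m, where q_i are the candidate quantized levels. Setting \partial E / \partial s = \partial E / \partial m = 0 yields the 2x2 normal equations, and Cramer's rule gives

    D = (\sum_i w_i)(\sum_i w_i q_i^2) - (\sum_i w_i q_i)^2
    s = [(\sum_i w_i)(\sum_i w_i q_i x_i) - (\sum_i w_i x_i)(\sum_i w_i q_i)] / D
    m = [(\sum_i w_i q_i^2)(\sum_i w_i x_i) - (\sum_i w_i q_i)(\sum_i w_i q_i x_i)] / D

with sum_w = \sum_i w_i, sum_l = \sum_i w_i q_i, sum_l2 = \sum_i w_i q_i^2, sum_x = \sum_i w_i x_i, and sum_xl = \sum_i w_i q_i x_i.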
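A minimal usage sketch for the CPU path (the weight shape and the round-trip check are illustrative assumptions, not from the PR):

    import numpy as np

    # hypothetical round-trip check for quant_tensor_k_quant_cpu
    w = np.random.randn(128, 256).astype(np.float32)  # fake weight
    q, scale, zp = quant_tensor_k_quant_cpu(w, num_bits=4, group_size=32)
    deq = (q - zp) * scale  # dequantize each group: scale * (q - zp)
    err = np.abs(deq - w.reshape(-1, 32)).mean()  # mean absolute reconstruction error
    print(f"mean abs error: {err:.4f}")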
def quant_tensor_k_quant_cuda(data, num_bits=4, group_size=32):
    """Quantize tensor per group based on k quant.

    Ref: https://github.com/ggml-org/llama.cpp/blob/64eda5deb9859e87a020e56bab5d2f9ca956f1de/ggml/src/ggml-quants.c

    Args:
        data : input weight
        num_bits (int, optional): num_bits. Defaults to 4.
        group_size (int, optional): how many elements share one scale/zp. Defaults to 32.

    Returns:
        output: quantized weight
        scale: scale
        zero_point: zero point
    """
    try:
        import cupy as cp
        import torch

        if torch.cuda.is_available():
            data = cp.asarray(data)
            data = data.reshape((-1, group_size)).astype(np.float32)  # (nb, group_size)
            nb = data.shape[0]
            maxq = 2**num_bits - 1
            minq = 0
            sum_x2 = np.sum(data**2, axis=1, keepdims=True)  # (nb, 1)
            av_x = np.sqrt(sum_x2 / group_size)  # (nb, 1)
            weights = np.add(av_x, np.abs(data))  # (nb, group_size)
            rmin = np.min(data, axis=1, keepdims=True)  # (nb, 1)
            rmax = np.max(data, axis=1, keepdims=True)  # (nb, 1)
            sum_w = np.sum(weights, axis=1, keepdims=True)  # (nb, 1)
            sum_x = np.sum(weights * data, axis=1, keepdims=True)  # (nb, 1)
            iscale = cp.ones(rmax.shape, dtype=data.dtype)  # (nb, 1)
            mask = rmin != rmax
            iscale[mask] = (maxq - minq) / (rmax[mask] - rmin[mask])
            scale = 1 / iscale
            quant_data = np.clip(np.round(iscale * (data - rmin)), minq, maxq)  # (nb, group_size)
            diff = scale * quant_data + rmin - data  # (nb, group_size)
            best_mad = np.sum(weights * diff**2, axis=1, keepdims=True)  # (nb, 1)
            nstep = 20
            rdelta = 0.1
            rrmin = -1
            for is_ in range(nstep):
                iscale_new = cp.ones(rmax.shape, dtype=data.dtype)  # (nb, 1)
                factor = cp.array([rrmin + rdelta * is_ + maxq - minq]).astype(data.dtype)[0]
                mask = rmin != rmax
                iscale_new[mask] = factor / (rmax[mask] - rmin[mask])
                quant_data_new = np.clip(np.round(iscale_new * (data - rmin)), minq, maxq)  # (nb, group_size)
                mul_weights_quant_data_new = weights * quant_data_new
                sum_l = np.sum(mul_weights_quant_data_new, axis=1, keepdims=True)  # (nb, 1)
                sum_l2 = np.sum(mul_weights_quant_data_new * quant_data_new, axis=1, keepdims=True)  # (nb, 1)
                sum_xl = np.sum(mul_weights_quant_data_new * data, axis=1, keepdims=True)  # (nb, 1)
                D = np.subtract(sum_w * sum_l2, sum_l**2)  # (nb, 1)

                this_scale = (sum_w * sum_xl - sum_x * sum_l) / D  # (nb, 1)
                this_min = (sum_l2 * sum_x - sum_l * sum_xl) / D  # (nb, 1)

                diff = this_scale * quant_data_new + this_min - data  # (nb, group_size)
                mad = np.sum(weights * diff**2, axis=1, keepdims=True)  # (nb, 1)

                mad_1 = cp.array(mad)
                best_mad_1 = cp.array(best_mad)
                idx_to_replace = np.where(mad_1 < best_mad_1)[0]
                quant_data[idx_to_replace, :] = quant_data_new[idx_to_replace, :]
                best_mad[idx_to_replace] = mad[idx_to_replace]
                scale[idx_to_replace] = this_scale[idx_to_replace]
                rmin[idx_to_replace] = this_min[idx_to_replace]

            zero_point = np.clip(((-rmin) / scale).round(), 0, maxq).astype("uint8")
            scale = scale.astype(np.float64)
            q_weight = np.empty_like(data, dtype=scale.dtype)
            np.divide(data, scale, out=q_weight)
            np.add(q_weight, zero_point, out=q_weight)
            np.round(q_weight, out=q_weight)
            np.clip(q_weight, minq, maxq, out=q_weight)

            return q_weight.get(), scale.get(), zero_point.get()
        else:
            logger.warning(
                "Tried to use k-quant quantization on CUDA, but CUDA is not available. "
                "Falling back to k-quant quantization on CPU."
            )
            return quant_tensor_k_quant_cpu(data, num_bits, group_size)
    except ImportError:
        logger.info(
            "Now we are using k-quant quantization on CPU, which is time consuming. "
            "Please consider installing cupy to speed up on CUDA (see https://cupy.dev/). "
            "Please also install torch to check CUDA availability."
        )
        return quant_tensor_k_quant_cpu(data, num_bits, group_size)
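Either path returns plain numpy arrays (the CUDA branch copies results back from the device with .get()), so callers can use the two functions interchangeably. A sketch, reusing w from the earlier example:

    # hypothetical: same call shape as the CPU variant, with automatic fallback
    q, scale, zp = quant_tensor_k_quant_cuda(w, num_bits=4, group_size=32)
    assert all(isinstance(a, np.ndarray) for a in (q, scale, zp))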
def qdq_tensor(data, num_bits=4, group_size=32, scheme="asym", dtype="int", ratio=1.0):
    """Quant dequant tensor per group.

@@ -299,6 +462,7 @@ def rtn_quantize(
    ratios={},
    accuracy_level=0,
    providers=["CPUExecutionProvider"],
    algorithm="rtn",
):
    """Quant the model with round to nearest method.

@@ -372,9 +536,13 @@ def rtn_quantize(
        ):  # pragma: no cover
            # MatMulFpQ4 support 4 bits and 32 group_size with ort 1.16.0 and 1.16.1 versions, supported by CPU EP
            # MatMulNBits supports 4 bits and 2^n group_size with ort > 1.16.1, supported by CPU EP AND CUDA EP
-           q_weight, scale, zp = quant_tensor(
-               weight.T, num_bits, group_size, scheme, "uint", ratios.get(node.input[1], 1)
-           )
+           if algorithm == "k_quant":
+               q_weight, scale, zp = quant_tensor_k_quant_cuda(weight.T, num_bits, group_size)
+           else:
+               q_weight, scale, zp = quant_tensor(
+                   weight.T, num_bits, group_size, scheme, "uint", ratios.get(node.input[1], 1)
+               )

            q_matmul_node, new_inits = make_matmul_weight_only_node(
                node=node,
                weight_shape=org_w_shape,