Skip to content

Don't encode negative ints as bcf_*_vector_end or other reserved values #766

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Sep 12, 2018

Conversation

jmarshall
Copy link
Member

@jmarshall jmarshall commented Sep 7, 2018

The VCF v4.3 spec reserved a bunch of negative integers back in 2014 to allow for future expansion of missing / end-of-vector / etc special values:

Integers may be encoded as 8, 16, or 32 bit values, in little-endian order. It is up to the encoder to determine the appropriate ranged value to use when writing the BCF2 file. […] In total, eight values are reserved for future use: 0x80-0x87, 0x8000-0x8007, 0x80000000-0x80000007.

but it appears that this was never implemented in HTSlib.

In addition the previous code was avoiding bcf_int8/16_missing correctly but was wrongly encoding -127 and -32767 as bcf_int8/16_vector_end.

This patch is sufficient to fix samtools/bcftools#874 but it's possible that other code also needs modifying to properly reserve these values.

Also adds a test case exercising -127 and -128.

@pd3
Copy link
Member

pd3 commented Sep 11, 2018

Thanks for looking at this. I think the same should be done in bcf_enc_vint(). Currently it works because the condition checks for the vector_end rather than missing, but the 8 reserved values are not respected.

@jmarshall
Copy link
Member Author

You are quite right. I thought I looked at that function too, but apparently not carefully enough!
I'll amend the PR.

BCF_BT_INT8 values in the range 0x80-0x87 are reserved in VCFv4.3,
so such integers must be encoded as BCF_BT_INT16 rather than BCF_BT_INT8.
Similarly 0x8000-0x8007 is reserved in BCF_BT_INT16 so such integers must
be encoded as BCF_BT_INT32.

The range 0x80000000-0x80000007 is reserved in BCF_BT_INT32, but this
commit does not add an error case if such integers present themselves.

In particular, the previous bcf_enc_int1() code was avoiding
bcf_int8/16_missing correctly but was wrongly encoding -127 and -32767
as bcf_int8/16_vector_end.  Fixes samtools/bcftools#874.

Add a test case exercising -127 and -128.
@jmarshall
Copy link
Member Author

jmarshall commented Sep 11, 2018

bcf_enc_vint() also now fixed.

This is probably NEWS-worthy, as previously bcftools has been writing slightly incorrect .bcf files. (Previously-written files remain readable with bcftools after this bug fix.)

@pd3
Copy link
Member

pd3 commented Sep 11, 2018

Thank you

@daviesrob daviesrob merged commit 52368d6 into samtools:develop Sep 12, 2018
daviesrob added a commit that referenced this pull request Sep 12, 2018
@daviesrob
Copy link
Member

Thanks, merged with NEWS update.

@jmarshall
Copy link
Member Author

Thanks. Um, I think it's -121…-128 and -32761…-32768 that get bumped up to the next length.

Re the second NEWS paragraph: BTW I haven't investigated fully when it prints out -127 in non-vector context as hoped for and when (like in the filter expression that exposed this) it sees it as vector_end. But it looks like plain converting does indeed work. Usually, hopefully 😄

@jmarshall jmarshall deleted the reserved-negints branch September 12, 2018 16:01
@daviesrob
Copy link
Member

Yes, you're right. 0x88 and 0x8008 escape unchanged. I'll modify the text.

I checked what happens in both scalar and vector contexts. It does appear to work in both as far as writing out VCF. Probably not safe to claim anything more than that though...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Strange bug when filtering if SVLEN=-127
3 participants