Added Burrows-Wheeler transform algorithm. #1029
Changes from 15 commits
401ed00
d13eb2c
0e4e788
8c4c820
9fa20b3
8b17220
6c11811
cf4a9cf
b69db30
efa4bad
abfee10
304c09c
aa87466
f0e1631
c08c4b5
cf3df56
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,165 @@ | ||
""" | ||
https://en.wikipedia.org/wiki/Burrows%E2%80%93Wheeler_transform | ||
|
||
The Burrows–Wheeler transform (BWT, also called block-sorting compression) | ||
rearranges a character string into runs of similar characters. This is useful | ||
for compression, since it tends to be easy to compress a string that has runs | ||
of repeated characters by techniques such as move-to-front transform and | ||
run-length encoding. More importantly, the transformation is reversible, | ||
without needing to store any additional data except the position of the first | ||
original character. The BWT is thus a "free" method of improving the efficiency | ||
of text compression algorithms, costing only some extra computation. | ||
""" | ||
|
||
|
||
def all_rotations(string): | ||
    """ | ||
    :param str string: The string that will be rotated len(string) times. | ||
    :return: A list with len(string) rotations of the parameter string. | ||
    :rtype: list[str] | ||
    :raises TypeError: If the string parameter type is not str. | ||
Review comment: Type hints plus a textual description might actually be more helpful than reStructuredText, but if you stick with reStructuredText then avoid repeating things that are already in the type hints. |
||
    Examples: | ||
|
||
    >>> all_rotations("^BANANA|") | ||
Review comment: This comment is a doctest directive that allows you to lose the backslashes and indent the way you want, because PEP8 does not advocate the use of backslashes. It says:
|
||
    ['^BANANA|', 'BANANA|^', 'ANANA|^B', 'NANA|^BA', 'ANA|^BAN', 'NA|^BANA',\ | ||
    'A|^BANAN', '|^BANANA'] | ||
    >>> all_rotations("a_asa_da_casa") | ||
    ['a_asa_da_casa', '_asa_da_casaa', 'asa_da_casaa_', 'sa_da_casaa_a',\ | ||
    'a_da_casaa_as', '_da_casaa_asa', 'da_casaa_asa_', 'a_casaa_asa_d',\ | ||
    '_casaa_asa_da', 'casaa_asa_da_', 'asaa_asa_da_c', 'saa_asa_da_ca',\ | ||
    'aa_asa_da_cas'] | ||
    >>> all_rotations("panamabanana") | ||
    ['panamabanana', 'anamabananap', 'namabananapa', 'amabananapan',\ | ||
    'mabananapana', 'abananapanam', 'bananapanama', 'ananapanamab',\ | ||
    'nanapanamaba', 'anapanamaban', 'napanamabana', 'apanamabanan'] | ||
    >>> all_rotations(5) | ||
    Traceback (most recent call last): | ||
    ... | ||
    TypeError: The parameter string type must be str. | ||
    """ | ||
    if not (type(string) is str): | ||
Review comment: PEP8 says
flake8 E721 will find these automatically. |
||
        raise TypeError("The parameter string type must be str.") | ||
|
||
    return [string[i:] + string[:i] for i in range(len(string))] | ||
|
||
|
||
def bwt_transform(string): | ||
    """ | ||
    :param str string: The string that will be used at bwt algorithm | ||
    :return: A dictionary with the bwt result, the string composed of the last | ||
        char of each row of the ordered rotations list and the index of the | ||
        original string at ordered rotations list | ||
    :rtype: dict | ||
    :raises TypeError: If the string parameter type is not str | ||
    :raises ValueError: If the string parameter is empty | ||
    Examples: | ||
|
||
    >>> bwt_transform("^BANANA") | ||
    {'bwt_string': 'BNN^AAA', 'idx_original_string': 6} | ||
    >>> bwt_transform("a_asa_da_casa") | ||
    {'bwt_string': 'aaaadss_c__aa', 'idx_original_string': 3} | ||
    >>> bwt_transform("panamabanana") | ||
    {'bwt_string': 'mnpbnnaaaaaa', 'idx_original_string': 11} | ||
    >>> bwt_transform(4) | ||
    Traceback (most recent call last): | ||
    ... | ||
    TypeError: The parameter string type must be str. | ||
    >>> bwt_transform('') | ||
    Traceback (most recent call last): | ||
    ... | ||
    ValueError: The parameter string must not be empty. | ||
    """ | ||
    if not (type(string) is str): | ||
Review comment: isinstance() |
||
        raise TypeError("The parameter string type must be str.") | ||
    if not string: | ||
        raise ValueError("The parameter string must not be empty.") | ||
|
||
    rotations = all_rotations(string) | ||
    rotations.sort()  # sort the list of rotations in alphabetical order | ||
    # make a string composed of the last char of each rotation | ||
    return { | ||
        "bwt_string": "".join([word[-1] for word in rotations]), | ||
        "idx_original_string": rotations.index(string), | ||
    } | ||
|
||
|
||
def reverse_bwt(bwt_string, idx_original_string): | ||
    """ | ||
    :param str bwt_string: The string returned from bwt algorithm execution | ||
    :param int idx_original_string: The index of the string that was used to | ||
        generate bwt_string at ordered rotations list | ||
    :return: The string used to generate bwt_string when bwt was executed | ||
    :rtype: str | ||
    :raises TypeError: If the bwt_string parameter type is not str | ||
    :raises ValueError: If the bwt_string parameter is empty | ||
    :raises TypeError: If the idx_original_string type is not int or if not | ||
        possible to cast it to int | ||
    :raises ValueError: If the idx_original_string value is lower than 0 | ||
|
||
    >>> reverse_bwt("BNN^AAA", 6) | ||
    '^BANANA' | ||
    >>> reverse_bwt("aaaadss_c__aa", 3) | ||
    'a_asa_da_casa' | ||
    >>> reverse_bwt("mnpbnnaaaaaa", 11) | ||
    'panamabanana' | ||
    >>> reverse_bwt(4, 11) | ||
    Traceback (most recent call last): | ||
    ... | ||
    TypeError: The parameter bwt_string type must be str. | ||
    >>> reverse_bwt("", 11) | ||
    Traceback (most recent call last): | ||
    ... | ||
    ValueError: The parameter bwt_string must not be empty. | ||
    >>> reverse_bwt("mnpbnnaaaaaa", "asd") | ||
    Traceback (most recent call last): | ||
    ... | ||
    TypeError: The parameter idx_original_string type must be int or passive of cast to int. | ||
    >>> reverse_bwt("mnpbnnaaaaaa", -1) | ||
    Traceback (most recent call last): | ||
    ... | ||
    ValueError: The parameter idx_original_string must not be lower than 0. | ||
    >>> reverse_bwt("mnpbnnaaaaaa", 11.0) | ||
    'panamabanana' | ||
    >>> reverse_bwt("mnpbnnaaaaaa", 11.4) | ||
    'panamabanana' | ||
    """ | ||
    if not (type(bwt_string) is str): | ||
        raise TypeError("The parameter bwt_string type must be str.") | ||
    if not bwt_string: | ||
        raise ValueError("The parameter bwt_string must not be empty.") | ||
    try: | ||
        idx_original_string = int(idx_original_string) | ||
    except ValueError: | ||
        raise TypeError( | ||
            "The parameter idx_original_string type must be int or passive of cast to int." | ||
        ) | ||
    if idx_original_string < 0: | ||
        raise ValueError( | ||
            "The parameter idx_original_string must not be lower than 0." | ||
        ) | ||
|
||
    # rebuild the sorted rotation table one column at a time | ||
    ordered_rotations = [""] * len(bwt_string) | ||
    for x in range(len(bwt_string)): | ||
        for i in range(len(bwt_string)): | ||
            ordered_rotations[i] = bwt_string[i] + ordered_rotations[i] | ||
        ordered_rotations.sort() | ||
    return ordered_rotations[idx_original_string] | ||
|
||
|
||
if __name__ == "__main__": | ||
    string = input("Provide a string that I will generate its BWT transform: ") | ||
Review comment: Consider doing input().strip() to remove leading and/or trailing whitespace from user input. |
||
    result = bwt_transform(string) | ||
    print( | ||
        "Burrows Wheeler transform for string '{}' results in '{}'".format( | ||
            string, result["bwt_string"] | ||
        ) | ||
    ) | ||
    original_string = reverse_bwt( | ||
        result["bwt_string"], result["idx_original_string"] | ||
    ) | ||
    print( | ||
        ( | ||
            "Reversing Burrows Wheeler transform for entry '{}' we get original" | ||
            " string '{}'" | ||
        ).format(result["bwt_string"], original_string) | ||
    ) | ||
brunohadlich marked this conversation as resolved.
Use Python type hints https://mypy.readthedocs.io/en/stable/cheat_sheet_py3.html They take some getting used to but can catch bugs. Our automated testing already runs mypy on all pull requests.
Unfortunately, string is the name of a module in the Python Standard Library and str is the name of a Python data type so to avoid shadowing them, do not use those words as variable names. This is one of the very few places where I would advocate using a single letter variable name: s.
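A minimal sketch of how the suggestions in this review might look when applied together to the first two functions. It is not the code that was merged. It assumes Python 3.9+ (for the built-in `list[str]` annotation), and `BWTResult` is a hypothetical name introduced here only to give the returned dictionary a type. Note that `isinstance()` also accepts subclasses of `str`, which the original `type(...) is str` check does not.

```python
from typing import TypedDict


class BWTResult(TypedDict):
    """Hypothetical return type; the PR itself returns a plain dict."""

    bwt_string: str
    idx_original_string: int


def all_rotations(s: str) -> list[str]:
    """
    Return every rotation of s, starting with s itself.

    The NORMALIZE_WHITESPACE directive lets the expected output wrap
    across lines without backslash continuations.

    >>> all_rotations("^BANANA|")  # doctest: +NORMALIZE_WHITESPACE
    ['^BANANA|', 'BANANA|^', 'ANANA|^B', 'NANA|^BA',
     'ANA|^BAN', 'NA|^BANA', 'A|^BANAN', '|^BANANA']
    """
    # isinstance() instead of comparing type(...) directly (flake8 E721).
    if not isinstance(s, str):
        raise TypeError("The parameter s type must be str.")
    return [s[i:] + s[:i] for i in range(len(s))]


def bwt_transform(s: str) -> BWTResult:
    """
    Return the last column of the sorted rotations and the row of s.

    >>> bwt_transform("^BANANA")
    {'bwt_string': 'BNN^AAA', 'idx_original_string': 6}
    """
    if not isinstance(s, str):
        raise TypeError("The parameter s type must be str.")
    if not s:
        raise ValueError("The parameter s must not be empty.")
    # Sort a copy of the rotations, then read off the last character of each.
    rotations = sorted(all_rotations(s))
    return {
        "bwt_string": "".join(rotation[-1] for rotation in rotations),
        "idx_original_string": rotations.index(s),
    }
```

Naming the parameter `s` avoids shadowing both the `string` module and the `str` type, and the annotations give mypy something to check on every pull request.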