Description
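awk 'length($1) < 6' table.txt                       # rows whose first field is shorter than 6
echo 'αλεπού' | awk '{print length()}'               # characters, when the locale is UTF-8
echo 'αλεπού' | awk -b '{print length()}'            # gawk byte mode: bytes
echo 'αλεπού' | LC_ALL=C awk '{print length()}'      # C locale: bytes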
One doesn't need to use LC_ALL=C
or activate byte mode (-b)
just to count the exact bytes of the input.
Even in gawk's Unicode mode,
use
- length(str)
to count UTF-8 characters, and
- match(str, /$/) - 1
to count bytes.
This works because the regex requests a match of the empty string at the very end of the input; since no characters are actually matched along the way, gawk reports the position as a byte offset rather than a character offset. The minus 1 is essential because RSTART
would otherwise point 1 virtual byte beyond the end of the input string.
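The two calls can be compared side by side with a minimal sketch. It assumes that awk resolves to gawk and that the shell runs in a UTF-8 locale; charcount and bytecount are hypothetical helper names, not part of gawk.

```shell
# Hypothetical helpers; assume awk is gawk running in a UTF-8 locale.
charcount() { awk '{ print length($0) }'; }           # characters per record
bytecount() { awk '{ print match($0, /$/) - 1 }'; }   # bytes per record

echo 'αλεπού' | charcount
echo 'αλεπού' | bytecount
```

The word has six letters, each encoded as two bytes in UTF-8, so under those assumptions the first line should report 6 and the second 12. In a byte-oriented awk, or under LC_ALL=C, both report 12.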
You can throw binary files such as .MP3, .MP4, .XZ, or .PNG directly at it,
and gawk in Unicode mode will give you the byte count without any error messages.
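A small way to try this without a real media file, as a sketch only: the hypothetical sample function below emits two bytes that are not valid UTF-8, followed by abc, and the result is compared with wc -c. It again assumes awk is gawk in a UTF-8 locale. The trailing newline is consumed as the record separator, so 1 is subtracted from the wc figure.

```shell
# Hypothetical sample input: 0xFF 0xFE (invalid as UTF-8), then "abc" and a newline.
sample() { printf '\377\376abc\n'; }

# Byte count of the record via the empty match at the end of the string.
sample | awk '{ print match($0, /$/) - 1 }'

# Reference figure from wc; subtract 1 for the newline that awk strips.
echo $(( $(sample | wc -c) - 1 ))
```

Both commands should print the same number. For a real binary file the per-record counts and the record separators would still have to be summed, which this sketch leaves out.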
That said, only the match(str, /$/) form stays silent when you throw binary data at gawk in Unicode mode. length( ) will DEFINITELY scream, as will match(str, /.$/) (note the dot . right before the $): on valid UTF-8 input that call is equivalent to length( ), but on random bytes it will DEFINITELY give you the locale error message.
So you can't use it to circumvent length( )'s error message on pure binary input; one needs to code up an alternative approach to count it, e.g. via gsub( ).
It took me a while to code it up myself, but now I can get byte mode to count UTF-8 characters, get Unicode mode to take in binary data directly, and have it report a count identical to GNU wc.
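The gsub( ) code itself is not shown above, so the following is only one possible sketch of the idea, not the code referred to: in byte mode, a UTF-8 character count can be obtained by subtracting the continuation bytes (0x80 to 0xBF) from the byte count. utf8chars is a hypothetical name, and the input is assumed to be valid UTF-8.

```shell
# Hypothetical helper: count UTF-8 characters per record while awk works on bytes.
utf8chars() {
    LC_ALL=C awk '{
        bytes = length($0)                  # byte count, since the locale is C
        cont  = gsub(/[\200-\277]/, "")     # strip and count continuation bytes
        print bytes - cont                  # one lead (or ASCII) byte per character
    }'
}

echo 'αλεπού' | utf8chars
```

For the sample word this is 12 bytes minus 6 continuation bytes, i.e. 6 characters. Malformed input would need extra handling that is not attempted here.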