On 7/24/2004 12:43:00 PM, Christian Ziemski wrote:
>Ian:
>>I have thought about how to detect with no BOM, [...]
>>
>>Testing 00 0D 00 0A would be unlikely to succeed as
>>the most likely pattern is 00 0D 00 0A 00.
>>[..] but looking for 00 0D 00 0A on an odd byte boundary
>>would seem a better bet.
>
>I incorporated your suggestion into my
>solution.
>Together it seems to be reliable, at
>least a bit ;-)
>
>The check for UNIX and Mac files is
>implemented too.
Why are you testing line ends in the first place?
There are quite a few combinations to test.
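Wouldn't it be easier to check, say, the first 50 Unicode characters, and count how many zero bytes fall on odd and on even byte positions (counting the first byte of the file as byte 1)?
If most odd bytes are zero, it is Big Endian; if most even bytes are zero, it is Little Endian; otherwise it is probably not Unicode. (This relies on the text being mostly ASCII or Latin-1, where the high byte of each 16-bit character is zero.)
Here is a macro that does the above test. It leaves the result in #104: 1 = Big Endian, 2 = Little Endian, 0 = probably not Unicode. (Note: I compare the counters to CP (Cur_Pos) rather than to a fixed number so that it will work with short files, too.)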
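#104 = 0                // result: 0 = probably not Unicode
#1 = 0                  // zero bytes seen at the 1st, 3rd, 5th... byte
#2 = 0                  // zero bytes seen at the 2nd, 4th, 6th... byte
BOF
Repeat(50) {            // examine 50 two-byte characters
    if (Cur_Char==0) { #1++ }
    char
    if (Cur_Char==0) { #2++ }
    char
}
// CP (Cur_Pos) is now the number of bytes scanned
if (#1*3 > CP && #2*5 < CP) {
    #104 = 1            // Big endian
} else {
    if (#2*3 > CP && #1*5 < CP) {
        #104 = 2        // Little endian
    }
}
BOF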
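
For anyone who wants to try the same heuristic outside the editor, here is a rough Python sketch of the test above. It is only a port of the idea, not part of the macro: the function name guess_utf16_endianness and the max_chars parameter are made up for illustration, and the return values simply mirror what the macro stores in #104.

```python
def guess_utf16_endianness(data: bytes, max_chars: int = 50) -> int:
    """Return 1 for big endian, 2 for little endian, 0 if probably not UTF-16.

    Hypothetical helper mirroring the macro: it counts zero bytes in the
    first and second byte of each 16-bit unit and compares the counts to
    the number of bytes actually examined.
    """
    sample = data[: 2 * max_chars]      # at most 50 two-byte characters
    n = len(sample)                     # plays the role of CP in the macro
    first = sample[0::2].count(0)       # zeros at the 1st, 3rd, 5th... byte
    second = sample[1::2].count(0)      # zeros at the 2nd, 4th, 6th... byte

    if first * 3 > n and second * 5 < n:
        return 1                        # high byte comes first: big endian
    if second * 3 > n and first * 5 < n:
        return 2                        # high byte comes second: little endian
    return 0                            # no clear pattern


if __name__ == "__main__":
    print(guess_utf16_endianness("Hello, world".encode("utf-16-be")))
    print(guess_utf16_endianness("Hello, world".encode("utf-16-le")))
    print(guess_utf16_endianness(b"Hello, world"))
```

Since each counter can reach at most half of the bytes examined, the thresholds mean that more than two thirds of the characters must have a zero in one position and fewer than two fifths in the other. Like the macro, this will not recognize text that is mostly non-Latin, because there the high bytes are rarely zero.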
--
Pauli