
Topic:	Unicode to ANSI conversion (1 of 103), Read 167 times, 3 File Attachments
Conf:	Converting, Translating
From:	Ian Binnie
Date:	Friday, May 21, 2004 03:11 AM

I have been playing with Unicode to ANSI conversion for some time, but it always seemed too hard.

Notepad (W2K, XP versions) can convert files ANSI<=>UTF-16<=>UTF-8<=>UTF-16 big endian.

You need to use a suitable font, such as Lucida Console or Lucida Sans Unicode (a proportional font).

For those with MS VC++ or MSDN, Microsoft provides a sample (uconvert), and there are a number of APIs to perform conversions.

Prompted by Pauli's comments, I decided that Windows-1252 was all I really needed, and I would implement this in Vedit.

I was still stumped for a way to do an indexed table lookup in Vedit (which is how I would do it in C).

It suddenly struck me that I don't need to do this.

I could use Christian's asc-unic.vdm macro as a first pass, then fix the characters (0x80 - 0x9F) that did not map.

In the other direction (my real interest) I translate Unicode codepoints corresponding to ANSI 0x80 - 0x9F then use Christian's unic-asc.vdm to delete the unwanted bytes.

I would have preferred to do the processing in a single pass, but Vedit is so fast that it really doesn't matter, especially as the later passes don't have much to do and replace strings with strings of the same size.

(As an aside I wonder if this makes a difference in Vedit - I know when I wrote my first editor in the mid '70s on a 1MHz processor this was a major issue.)

UTF_ANSI.vdm performs Unicode (UTF-16) to ANSI conversion.
If the Unicode file contains characters for which there is no mapping the result is undefined.

ANSI-UTF.VDM performs ANSI to Unicode (UTF-16) conversion.
Characters for which there is no mapping (0x7F, 0x81, 0x8D, 0x8F, 0x90, 0x9D) appear unchanged as pseudo UTF characters.
These are not displayed by Notepad.
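
As a rough illustration of the ANSI to UTF-16 direction described above, here is a minimal Python sketch (not VEDIT macro code). It assumes Python's built-in `cp1252` codec as the Windows-1252 table and UTF-16 little-endian output without a BOM; the function name `ansi_to_utf16le` is hypothetical. Bytes that the codec leaves undefined are written out with the same numeric value, imitating the "appear unchanged" behaviour.

```python
def ansi_to_utf16le(data: bytes) -> bytes:
    """Convert Windows-1252 bytes to UTF-16LE (no BOM)."""
    out = bytearray()
    for b in data:
        try:
            # Look up the Unicode code point for this byte.
            code = ord(bytes([b]).decode("cp1252"))
        except UnicodeDecodeError:
            # Assumption: unmapped bytes pass through with the same value.
            code = b
        out += code.to_bytes(2, "little")
    return bytes(out)
```

For example, the Euro sign (byte 0x80) becomes the UTF-16LE byte pair AC 20, while the unmapped byte 0x81 becomes 81 00.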

ANSI1252test.ansi is an ANSI test file, which exercises the macro.
(Convert to Unicode to test the reverse.)

ANSI1252TEST.ZIP (4KB)
ANSI_UTF.VDM (4KB)
UTF_ANSI.VDM (5KB)

Topic:	Unicode to ANSI conversion (2 of 103), Read 126 times, 1 File Attachment
Conf:	Converting, Translating
From:	Ian Binnie
Date:	Sunday, May 23, 2004 10:35 PM

Updated UTF_ANSI.vdm with better Unicode detection and error reporting

UTF_ANSI(1).VDM (5KB)

Topic:	Re: Unicode to ANSI conversion (3 of 103), Read 129 times
Conf:	Converting, Translating
From:	Ted Green
Date:	Thursday, May 27, 2004 12:26 AM

At 10:35 PM 5/23/2004, you wrote:
>Updated UTF_ANSI.vdm with better Unicode detection and error reporting

Ian:

Thank you for the updated macros. Sorry that they didn't make it into the release of VEDIT 6.12.1, but I will add them very soon.

Ted.

Topic:	Re: Unicode to ANSI conversion (4 of 103), Read 127 times, 1 File Attachment
Conf:	Converting, Translating
From:	Ian Binnie
Date:	Monday, May 31, 2004 02:26 AM

On 5/27/2004 12:26:10 AM, Ted Green wrote:
>At 10:35 PM 5/23/2004, you
>wrote:

>Ian:
>
>Thank you for the updated
>macros. Sorry that they didn't
>make it into the release of
>VEDIT 6.12.1, but I will add
>them very soon.
>
>Ted.
>

The UTF_ANSI.vdm had an error (preventing it from deleting BOM).

This is fixed.
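
For readers unfamiliar with the BOM (byte order mark), the following Python sketch shows the general idea of detecting and deleting it. It is only an illustration and says nothing about how the macro itself does it; the function name `strip_utf16_bom` is hypothetical.

```python
import codecs

def strip_utf16_bom(data: bytes):
    """Return (data without BOM, detected encoding or None)."""
    if data.startswith(codecs.BOM_UTF16_LE):   # bytes FF FE
        return data[2:], "utf-16-le"
    if data.startswith(codecs.BOM_UTF16_BE):   # bytes FE FF
        return data[2:], "utf-16-be"
    return data, None                          # no BOM present
```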

UTF_ANSI(2).VDM (5KB)

Topic:	Re: Unicode to ANSI conversion (5 of 103), Read 133 times
Conf:	Converting, Translating
From:	Ted Green
Date:	Tuesday, June 22, 2004 12:50 AM

At 03:12 AM 5/21/2004, you wrote:
>It suddenly struck me that I don't need to do this.
>I could use Christian's asc-unic.vdm macro as a first pass, then fix the characters (0x80 - 0x9F) that did not map.
>
>In the other direction (my real interest) I translate Unicode codepoints corresponding to ANSI 0x80 - 0x9F then use Christian's unic-asc.vdm to delete the unwanted bytes.

Thank you for the greatly improved UTF16 to/from ANSI macros. They will be included in future versions of VEDIT.
(I did take your May 31 version UTF-ANSI.VDM.)

I considered simply replacing unic-asc.vdm with your macro, but I'm a bit troubled that your macro processes the entire file 33 times - 32 times for the special characters and then the final conversion.

Speed is still reasonable for local files. I created a 47 meg UTF-16 file by exporting my registry. The unic-asc.vdm macro took one minute to convert it to ANSI; your UTF-ANSI.VDM macro took about 4 minutes. However, converting a remote file over a LAN or WAN might be very slow.

Perhaps a clever macro could perform the conversion in just one or two passes. For example, it might check if the upper UTF-16 byte is 01, 02, 20 or 21, and then check the 32 possibilities. I don't really know what the fastest method would be.
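
The idea of checking the upper byte first can be sketched in Python (not VEDIT macro code). This sketch assumes Python's `cp1252` codec as the source of the special-character table; that codec defines 27 of the 32 positions in 0x80 - 0x9F. The names `FIXUPS`, `HIGH_BYTES` and `lookup` are hypothetical.

```python
def build_fixup_table():
    """Map Unicode code point -> ANSI byte for the range 0x80 - 0x9F."""
    table = {}
    for b in range(0x80, 0xA0):
        try:
            table[ord(bytes([b]).decode("cp1252"))] = b
        except UnicodeDecodeError:
            continue  # position not defined in the codec
    return table

FIXUPS = build_fixup_table()
# The distinct upper bytes that occur among the special characters.
HIGH_BYTES = sorted({cp >> 8 for cp in FIXUPS})

def lookup(lo: int, hi: int):
    """Return the ANSI byte for one UTF-16LE pair, or None if unmapped."""
    if hi == 0:
        return lo                      # plain 8-bit character
    if hi not in HIGH_BYTES:
        return None                    # cheap rejection before the table
    return FIXUPS.get((hi << 8) | lo)
```

With this table, `HIGH_BYTES` comes out as 01, 02, 20 and 21 (hex), matching the four values mentioned above.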

Until I have studied this more, I will include your very useful macros in the VEDIT\USER-MAC directory.

Thank you again.

Ted.

Topic:	Re: Unicode to ANSI conversion (6 of 103), Read 132 times, 1 File Attachment
Conf:	Converting, Translating
From:	Christian Ziemski
Date:	Tuesday, June 22, 2004 07:00 AM

On 6/22/2004 12:50:05 AM, Ted Green wrote:
>
>Speed is still reasonable for local files.
>[...]
>Perhaps a clever macro could perform the conversion in just
>one or two passes. For example, it might check if the
>upper UTF-16 byte is 01, 02, 20 or 21, and then check the
>32 possibilities. I don't really know what the fastest
>method would be.

Here is an attempt at some speed optimization.
The macro now needs only two passes.

But it does not always seem to be faster. For small files Ian's version is faster; for big ones, this new one.
Unfortunately I'm not able to do more tests now (and I don't have Unicode files with those special characters in it to test...).

The checking line in the main loop
#104|=CC // check null byte
should be tested too.
It slows down the loop to produce a warning "only".
Perhaps it makes sense to change/delete this?

BTW: Ian has changed the initial NO/yes dialog to a YES/no dialog. That should be checked.

Christian

UTF_ANSI-Z.VDM (9KB)

Topic:	Re: Unicode to ANSI conversion (7 of 103), Read 133 times
Conf:	Converting, Translating
From:	Ted Green
Date:	Tuesday, June 22, 2004 11:54 AM

At 07:01 AM 6/22/2004, you wrote:
>Here is a try to do some speed optimization.
>The macro now needs two passes only.
>
>But it seems not to be always faster. For small files Ian's version is faster, for big ones this new one.
>Unfortunately I'm not able to do more tests now (and I don't have Unicode files with those special characters in it to test...).

That is very interesting and clever programming Christian! Especially how you create the internal conversion table, first in hex and then in binary.
As I am more concerned about speed in huge files, I like your version better. Especially since it only needs two passes.

You can easily create a Unicode file by exporting your Windows 2000/XP registry - run regedit.exe, select Registry -> Export Registry. Then select "All", enter a filename and select "Save".

(This is my preferred way of studying the registry, as VEDIT's searching is much much faster than regedit.)

>The checking line in the main loop
> #104|=CC // check null byte
>should be tested too.
>It slows down the loop to produce a warning "only".
>Perhaps it makes sense to change/delete this?

For now, I think the warning is a good idea. It is a rather clever way of checking for special Unicode characters.

Since this macro contains both new code and Ian's original code (commented out), it should be cleaned up before it is released. However, we should include Ian's character descriptions, e.g. "LATIN CAPITAL LETTER S WITH CARON".

Eventually I would like to build more of the Unicode translation into VEDIT (for speed), but it is easier if the algorithm is first optimized in macro code.

Does anyone want to tackle converting UTF-8 to/from ANSI? :-))

Ted.
-------------------------------------------------------------------------
Ted Green (ted@...) Greenview Data, Inc.
Web: www.... PO Box 1586, Ann Arbor, MI 48106
Tel: (734) 996-1300 Fax: (734) 996-1308 VEDIT - Text/Data/Binary Editor
-------------------------------------------------------------------------
Spam problems? www.SpamStopsHere.com blocks 99% of spam for businesses.

Topic:	Re: Unicode to ANSI conversion (9 of 103), Read 133 times
Conf:	Converting, Translating
From:	Ian Binnie
Date:	Tuesday, June 22, 2004 09:55 PM

On 6/22/2004 11:54:58 AM, Ted Green wrote:
>At 07:01 AM 6/22/2004, you
>wrote:

>Does anyone want to tackle
>converting UTF-8 to/from ANSI?
>:-))

As I am sure you are aware this is MUCH harder - too much bit shifting. It could be done in Vedit macros but would be slow.

The easy way would be to use the MultiByteToWideChar API, using the Vedit buffer as source, and creating a new buffer to receive the translated code.

To date I have seen little UTF-8, apart from email, and most of this is ANSI anyway as special characters are not used. European users may find more application.
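
To show what the "bit shifting" involves, here is a minimal Python sketch (not VEDIT macro code) that decodes UTF-8 by hand, next to a version that leaves the work to a library codec. The codec version only stands in for the Windows API route mentioned above; it does not call MultiByteToWideChar. All function names are hypothetical, only 1 to 3 byte sequences are handled, and continuation bytes are not validated.

```python
def decode_utf8(data: bytes) -> str:
    """Decode UTF-8 manually (1 to 3 byte sequences only)."""
    chars = []
    i = 0
    while i < len(data):
        b0 = data[i]
        if b0 < 0x80:                       # 0xxxxxxx
            cp, n = b0, 1
        elif b0 & 0xE0 == 0xC0:             # 110xxxxx 10xxxxxx
            cp = ((b0 & 0x1F) << 6) | (data[i + 1] & 0x3F)
            n = 2
        elif b0 & 0xF0 == 0xE0:             # 1110xxxx 10xxxxxx 10xxxxxx
            cp = (((b0 & 0x0F) << 12)
                  | ((data[i + 1] & 0x3F) << 6)
                  | (data[i + 2] & 0x3F))
            n = 3
        else:
            raise ValueError("unsupported lead byte at offset %d" % i)
        chars.append(chr(cp))
        i += n
    return "".join(chars)

def utf8_to_ansi_manual(data: bytes) -> bytes:
    # Characters without a Windows-1252 equivalent become "?".
    return decode_utf8(data).encode("cp1252", errors="replace")

def utf8_to_ansi_codec(data: bytes) -> bytes:
    # The same conversion, leaving the bit shifting to the library.
    return data.decode("utf-8").encode("cp1252", errors="replace")
```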

Topic:	Re: Unicode to ANSI conversion (10 of 103), Read 134 times
Conf:	Converting, Translating
From:	Ted Green
Date:	Tuesday, June 22, 2004 11:31 PM

At 09:55 PM 6/22/2004, you wrote:
>>Does anyone want to tackle converting UTF-8 to/from ANSI?
>
>As I am sure you are aware this is MUCH harder - too much bit shifting. It could be done in Vedit macros but would be slow.
>
>The easy way would be to use the MultiByteToWideChar API, using the Vedit buffer as source, and creating a new buffer to receive the translated code.

Ian:

You are correct; we should simply use the Windows API to perform the conversion.

Similarly we will build the UTF-16 to ANSI code into VEDIT later this year, using the Windows API. We will still have a UTF-ANSI.VDM macro, but we will replace your special conversion table (and Christian's fancy implementation of it) with the Windows API.

Ted.

Topic:	Re: Unicode to ANSI conversion (15 of 103), Read 130 times
Conf:	Converting, Translating
From:	Christian Ziemski
Date:	Wednesday, June 23, 2004 01:00 PM

On Tue, 22 Jun 2004 23:31:00 -0400, Ted Green wrote:

>Ian:
>
>You are correct; we should simply use the Windows API to perform the
>conversion.
>
>Similarly we will build the UTF-16 to ANSI code into VEDIT later this
>year, using the Windows API. We will still have a UTF-ANSI.VDM macro,
>but we will replace your special conversion table (and Christian's
>fancy implementation of it) with the Windows API.

I interpret that as:

"Don't spend much more work into the current UTF-ANSI.VDM.
Only some cleaning up."

Do you (Ian and Ted) agree?

Christian

Topic:	Re: Unicode to ANSI conversion (16 of 103), Read 132 times
Conf:	Converting, Translating
From:	Ted Green
Date:	Wednesday, June 23, 2004 04:52 PM

At 01:01 PM 6/23/2004, you wrote:
>I interpret that as:
>
>"Don't spend much more work into the current UTF-ANSI.VDM.
> Only some cleaning up."

That is correct.
On the other hand, the last posted macro here is not
ready for release even as a "user supplied" macro.

Ted.

Topic:	Re: Unicode to ANSI conversion (17 of 103), Read 135 times, 2 File Attachments
Conf:	Converting, Translating
From:	Christian Ziemski
Date:	Wednesday, June 23, 2004 05:03 PM

On 6/23/2004 4:52:14 PM, Ted Green wrote:
>
>On the other hand, the last posted macro here is not
>ready for release even as a "user supplied" macro.

Of course. Work in progress!

Here is my "final" version of UTF_ANSI-Z.vdm.

Both conversion methods are now included; the macro picks one depending on
the size of the file (switching at 10MB).

Fixed: The Unicode-Check at the beginning could fail
(e.g. if only one line)
Fixed: The confirmation dialog at beginning misbehaved when user
pressed ESC.

Fixed some other minor things.

Unfortunately I found that the conversion may trigger on false positives!

Example:
The text "Kylix™ 2" (in Unicode K.y.l.i.x."! .2. , where the dot represents a NULL) (see line 1 in attached file utf1.txt)

isn't converted correctly: the part "! " is misinterpreted as a Unicode character and converted. But in fact those bytes are the last byte of one character and the first byte of the next character.

Of course the Replace() command isn't able to see the character boundaries!
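
The false positive can be reproduced outside VEDIT. The following Python sketch (an illustration only, assuming UTF-16 little-endian without a BOM) searches the bytes of "Kylix™ 2" for the byte pair 21 20, which is the encoding of U+2021 (DOUBLE DAGGER), one of the special characters. A plain byte search finds it at an odd offset, straddling two characters; a search restricted to even offsets does not. The helper name `find_aligned` is hypothetical.

```python
data = "Kylix\u2122 2".encode("utf-16-le")     # K.y.l.i.x."! .2.
pattern = "\u2021".encode("utf-16-le")         # bytes 21 20

# Plain byte search: matches the second byte of the trademark sign
# followed by the first byte of the space.
naive_hit = data.find(pattern)

def find_aligned(haystack: bytes, needle: bytes, start: int = 0) -> int:
    """Like bytes.find, but only accept matches at even offsets."""
    pos = haystack.find(needle, start)
    while pos != -1 and pos % 2 != 0:
        pos = haystack.find(needle, pos + 1)
    return pos

aligned_hit = find_aligned(data, pattern)
```

Here `naive_hit` is an odd offset (a false positive), while `aligned_hit` is -1 because the text contains no real double dagger.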

With my conversion table, which happens to be sorted differently from Ian's, that doesn't happen with this combination, but it does with others...

So this method of conversion does not seem to be reliable :-((

Any thoughts?

Christian

UTF_ANSI-Z(1).VDM (9KB)
UTF1.TXT (1KB)

Topic:	Re: Unicode to ANSI conversion (18 of 103), Read 134 times
Conf:	Converting, Translating
From:	Ted Green
Date:	Wednesday, June 23, 2004 05:25 PM

At 04:57 PM 6/23/2004, you wrote:
>So that way of conversion seems to be not reliable :-((
>Any thoughts?

Christian:

There may be a solution. Remember how you didn't like the line:

#104|=CC // check null byte (slow!)

Well, all the special fixups are only needed when the 2nd byte
is NOT null. Therefore, when the 2nd byte is not null, you could
perform Ian's fixups.

It might be fastest to put the fixup code into a separate text
register and call it with the Call(r) command. This may or may
not be faster than a long in-line if() { ... } command.

It would also run in one pass!

This assumes that only a few percent of the unicode characters
require fixups.
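
A minimal Python sketch of this one-pass idea (not VEDIT macro code): the table is consulted only when the second byte of a pair is not null. The two-entry `SAMPLE_FIXUPS` table, the function name, the "?" replacement byte and the returned counter are all assumptions made for the illustration.

```python
# Hypothetical sample table: Unicode code point -> ANSI byte.
SAMPLE_FIXUPS = {0x20AC: 0x80, 0x2122: 0x99}

def utf16le_to_ansi(data: bytes, fixups=SAMPLE_FIXUPS, unknown=0x3F):
    """One pass over UTF-16LE data; returns (ansi bytes, unmapped count)."""
    out = bytearray()
    untranslated = 0
    for i in range(0, len(data) - 1, 2):
        lo, hi = data[i], data[i + 1]
        if hi == 0:
            out.append(lo)              # common case: no table lookup
        else:
            ansi = fixups.get((hi << 8) | lo)
            if ansi is None:            # no mapping: count and replace
                untranslated += 1
                ansi = unknown
            out.append(ansi)
    return bytes(out), untranslated
```

Because the loop steps through the data two bytes at a time, it always respects character boundaries.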

Ted.

Topic:	Re: Unicode to ANSI conversion (20 of 103), Read 133 times
Conf:	Converting, Translating
From:	Christian Ziemski
Date:	Wednesday, June 23, 2004 05:31 PM

Ted:

On 6/23/2004 5:25:41 PM, Ted Green wrote:
>
>There may be a solution.
>Remember how you didn't like the line:

I saw your posting too late, after sending my answer to myself...

I'll think over both tomorrow.

Christian

Topic:	Re: Unicode to ANSI conversion (22 of 103), Read 135 times, 1 File Attachment
Conf:	Converting, Translating
From:	Christian Ziemski
Date:	Thursday, June 24, 2004 07:53 AM

On 6/23/2004 5:25:41 PM, Ted Green wrote:
>
>There may be a solution.
>Remember how you didn't like the line:
>
>#104|=CC // check null byte
>
>Well, all the special fixups are only needed when the 2nd
>byte is NOT null. Therefore, when the 2nd byte is not null, you
>could perform Ian's fixups.

Good idea!

It's done and attached (but needs some stress tests).

An advantage of this new technique is that the translation table can be easily changed or expanded. No more fiddling with that Search() command. And almost no slowdown when expanding the translation table!

Additionally I added a counter for untranslated Unicode characters and fixed some register usage bugs etc.

Ted:
In the original macro DI_1() uses T-Reg 123.
Isn't that a bit dangerous since @123 holds the Tools menu?
I changed it here to 121.

Christian

UTF_ANSI-Z(3).VDM (9KB)

Topic:	Re: Unicode to ANSI conversion (23 of 103), Read 136 times
Conf:	Converting, Translating
From:	Ted Green
Date:	Thursday, June 24, 2004 12:18 PM

At 07:54 AM 6/24/2004, you wrote:

>Good idea!
>It's done and attached (but needs some stress tests).

Very nice! (As always)

>Ted:
>In the original macro DI_1() uses T-Reg 123.
>Isn't that a bit dangerous since @123 holds the Tools menu?
>I changed it here to 121.

I missed that in the original macro, but you are correct.

I will test this macro on some larger Unicode files.

Ted.

Topic:	Re: Unicode to ANSI conversion (25 of 103), Read 136 times, 1 File Attachment
Conf:	Converting, Translating
From:	Christian Ziemski
Date:	Friday, June 25, 2004 08:09 AM

Of course I couldn't resist and still worked on the macro.

1) Added a choice to force translation, even if the file
does not seem(!) to be Unicode.
Otherwise e.g. very short files couldn't be translated.

2) Implemented a new format for the translation table:
// One line per character:
// Two hex bytes for the source UTF-16 character
// One hex byte for the target ANSI character
// An optional description

This way it's no longer necessary to have two tables:
One with comments and the real one.
So it's much easier to maintain.

3) Some messages modified
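
To make the one-line-per-character table format concrete, here is a small Python sketch that parses such a table. The sample lines, the byte order of the two source bytes (high byte first) and the name `parse_table` are assumptions for the illustration; the actual table in the macro may differ.

```python
# Hypothetical sample in the described format:
# two hex bytes (UTF-16 source), one hex byte (ANSI target), description.
SAMPLE = """\
// source  target  description
20 AC 80 EURO SIGN
01 60 8A LATIN CAPITAL LETTER S WITH CARON
21 22 99
"""

def parse_table(text: str) -> dict:
    """Return {code point: (ansi byte, description)}."""
    table = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("//"):
            continue                     # skip blank and comment lines
        parts = line.split(None, 3)      # description may contain spaces
        if len(parts) < 3:
            continue
        hi, lo, ansi = (int(p, 16) for p in parts[:3])
        desc = parts[3] if len(parts) > 3 else ""
        table[(hi << 8) | lo] = (ansi, desc)
    return table
```

Keeping the description on the same line as the values means a single table serves both as documentation and as data.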

A question:
Should we really still support DOS VEDIT in this macro?

The macro seems to be finished now...

Christian

UTF_ANSI-Z(4).VDM (8KB)

Topic:	Re: Unicode to ANSI conversion (26 of 103), Read 135 times
Conf:	Converting, Translating
From:	Ted Green
Date:	Friday, June 25, 2004 10:50 AM

At 08:26 AM 6/25/2004, you wrote:
>1) Added a choice to force translation, even if it
> seems(!) to be no Unicode file.
> Otherwise e.g. very short files couldn't be translated.

Sounds reasonable.

>2) Implemented a new format for the translation table:
> // One line per character:
> // Two hex bytes for the source UTF-16 character
> // One hex byte for the target ANSI character
> // An optional description
>
> This way it's no longer necessary to have two tables:
> One with comments and the real one.
> So it's much easier to maintain.

Yes, excellent.

>A question:
>Should we really still support DOS VEDIT in this macro?

Not the DOS version, but it should support the "VEDIT OEM"
character set. In other words, based on FontCharset, it should
load one of two different translation tables.
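
The choice between two tables can be sketched in Python as a lookup keyed on the character set. This is only an illustration: it assumes that "ANSI" corresponds to Windows-1252 and "OEM" to code page 437, and it uses Python's codecs in place of the macro's tables. The names `CODECS` and `utf16le_to_8bit` are hypothetical.

```python
# Assumed correspondence between character set and code page.
CODECS = {"ANSI": "cp1252", "OEM": "cp437"}

def utf16le_to_8bit(data: bytes, charset: str) -> bytes:
    """Convert UTF-16LE data using the table for the given charset."""
    codec = CODECS[charset]
    # Characters missing from the target code page become "?".
    return data.decode("utf-16-le").encode(codec, errors="replace")
```

The same Unicode character can land on different byte values: "é" is E9 in Windows-1252 but 82 in code page 437.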

Ted.

Topic:	Re: Unicode to ANSI conversion (27 of 103), Read 136 times, 1 File Attachment
Conf:	Converting, Translating
From:	Christian Ziemski
Date:	Friday, June 25, 2004 12:49 PM

On Fri, 25 Jun 2004 10:50:00 -0400, Ted Green wrote:

>At 08:26 AM 6/25/2004, you wrote:
>
>>A question:
>>Should we really still support DOS VEDIT in this macro?
>
>Not the DOS version, but it should support the "VEDIT OEM"
>character set. In other words, based on FontCharset, it should
>load one of two different translation tables.

I implemented all the preparations for the new translation table.

But the table itself isn't filled with life yet.
** Any volunteers? **

BTW:
- I found that the commands Font_Charset and Code_Page are
missing in the syntax file.
- The character 0H9E should(?) be LATIN SMALL LETTER Z WITH CARON
but is displayed in VEDIT-ANSI as character "Pt".
Who is right?

Christian

UTF_ANSI-Z(5).VDM (8KB)

Topic:	Re: Unicode to ANSI conversion (28 of 103), Read 136 times, 1 File Attachment
Conf:	Converting, Translating
From:	Christian Ziemski
Date:	Friday, June 25, 2004 12:58 PM

PS:

Here is a file in Unicode format with all the characters and descriptions
from Ian's translation table used in the macro.

Christian

UTF16-1.TXT (1KB)

Topic:	Re: Unicode to ANSI conversion (31 of 103), Read 134 times
Conf:	Converting, Translating
From:	Ian Binnie
Date:	Friday, June 25, 2004 07:16 PM

On 6/25/2004 12:58:19 PM, Christian Ziemski wrote:
>PS:
>
>Here a file in Unicode format
>with all the characters and
>descriptions
>from Ian's translation table
>used in the macro.
>
This only contains one line for the Euro symbol.

One of my earlier posts had my test file:-
ANSI1252test.ansi

Topic:	Re: Unicode to ANSI conversion (37 of 103), Read 135 times
Conf:	Converting, Translating
From:	Christian Ziemski
Date:	Saturday, June 26, 2004 02:53 AM

On Fri, 25 Jun 2004 19:16:00 -0400, Ian Binnie wrote:

>On 6/25/2004 12:58:19 PM, Christian Ziemski wrote:
>>
>>Here a file in Unicode format with all the characters and
>>descriptions from Ian's translation table
>>used in the macro.
>>
>This only contains one line for the Euro symbol.

Oops, the upload seems to have destroyed it due to NULL bytes.

Here it is in ZIP format.

>One of my earlier posts had my test file: ANSI1252test.ansi

That's a helpful table too!

Christian

PS: Even the ZIP file got corrupted when uploaded via NNTP.
Now I'll try it via http...

PPS: That didn't work either?!?!??!? Very strange!

PPPS: Then this way:
http://ziemski.privat.t-online.de/vedit/macros/utf16-1.zip

Topic:	Re: Unicode to ANSI conversion (29 of 103), Read 135 times
Conf:	Converting, Translating
From:	Ted Green
Date:	Friday, June 25, 2004 06:10 PM

At 12:50 PM 6/25/2004, you wrote:
>I implemented all the preparations for the new translation table.
>
>But the table itself isn't filled with life yet.
>** Any volunteers? **

I have optimized Christian's macro for speed; it is now barely slower than the current (old) unic-asc.vdm macro. When done, the new macro will replace the old macro.

I am currently working on the Unicode to "OEM" character set translation; this is a bit more complicated than the "ANSI" translation because any Unicode character above 127 needs to be translated with a table.

I will post a new macro within a few days; so everyone should hold off until then.

Ted.

Topic:	Re: Unicode to ANSI conversion (32 of 103), Read 133 times
Conf:	Converting, Translating
From:	Ian Binnie
Date:	Friday, June 25, 2004 07:33 PM

On 6/25/2004 6:10:45 PM, Ted Green wrote:
>At 12:50 PM 6/25/2004, you
>wrote:
>>I implemented all the preparations for the new translation table.
>>
>>But the table itself isn't filled with life yet.
>>** Any volunteers? **

I was about to volunteer.
I have tables for Unicode to Code page 437 & 850, from my abandoned attempt to write an OEM to Unicode translator.
The motivation here was to allow files with box-drawing characters to be displayed in Windows.

I may try again using Christian's approach, but will wait to see what Ted does.
>
>I am currently working on the
>Unicode to "OEM" character set
>translation; this is a bit
>more complicated than the
>"ANSI" translation because any
>Unicode character over 128
>needs to be translated with a
>table.
>
>I will post a new macro within
>a few days; so everyone should
>hold off until then.
>
>Ted.
>

Topic:	Re: Unicode to ANSI conversion (34 of 103), Read 132 times, 1 File Attachment
Conf:	Converting, Translating
From:	Ted Green
Date:	Friday, June 25, 2004 09:40 PM

At 07:34 PM 6/25/2004, you wrote:
>I was about to volunteer.
>I have tables for Unicode to Code page 437 & 850, from my abandoned attempt to write an OEM to Unicode translator.
>The motivation here was to allow display files with box drawing characters in Windows.

OK, I have attached a preliminary macro.

Notice that the OEM table (code page 437) needs entries for values 80h thru FFh.

The main processing loop has been speed optimized - there are actually two loops - one for ANSI and one for OEM.

For ANSI, we only need a translate table for Unicodes greater than 255.
For OEM, we need a translate table for Unicodes greater than 127.
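
The reason for the different thresholds can be checked with a short Python sketch, assuming Python's `cp1252` and `cp437` codecs as stand-ins for the ANSI and OEM tables. It lists the bytes from 0x80 upward whose Unicode code point equals the byte value, i.e. the characters that need no table. The function name `identity_bytes` is hypothetical.

```python
def identity_bytes(codec: str) -> list:
    """Bytes >= 0x80 whose Unicode code point equals the byte value."""
    same = []
    for b in range(0x80, 0x100):
        try:
            ch = bytes([b]).decode(codec)
        except UnicodeDecodeError:
            continue                    # position undefined in this codec
        if ord(ch) == b:
            same.append(b)
    return same

ansi_same = identity_bytes("cp1252")    # A0 through FF
oem_same = identity_bytes("cp437")      # none
```

In Windows-1252 every byte from A0 to FF matches its code point, so only code points above 255 need a table. In code page 437 no byte above 127 matches, so every such character has to be looked up.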

My OEM table is nearly complete, but I ran out of time today for about 20 values which are marked with "??". Perhaps Ian can fill them in for me. :-)

Ted.

UTF-ANSI.VDM (10KB)

Topic:	Re: Unicode to ANSI conversion (39 of 103), Read 133 times, 2 File Attachments
Conf:	Converting, Translating
From:	Ian Binnie
Date:	Saturday, June 26, 2004 09:15 AM

I have inserted my CodePage437 values.
Apart from the ?? items these agree with your table except for E1, E4, EA.

I have not yet tested the macro.

The attached OEM437.txt is a UTF file which can be viewed in Notepad with a
suitable font e.g. Lucida Console

OEM437.TXT (2KB)
UTF-ANSIIB1.VDM (10KB)

Topic:	Re: Unicode to ANSI conversion (33 of 103), Read 131 times
Conf:	Converting, Translating
From:	Ian Binnie
Date:	Friday, June 25, 2004 07:40 PM

On 6/25/2004 12:49:47 PM, Christian Ziemski wrote:
>- The character 0H9E should(?) be LATIN
>SMALL LETTER Z WITH CARON
>but is displayed in VEDIT-ANSI as
>character "Pt".
> Who is right?
This displays as SMALL LETTER Z WITH CARON using Lucida Console.

Topic:	Re: Unicode to ANSI conversion (35 of 103), Read 132 times
Conf:	Converting, Translating
From:	Christian Ziemski
Date:	Saturday, June 26, 2004 02:36 AM

On Fri, 25 Jun 2004 19:40:00 -0400, Ian Binnie wrote:

>On 6/25/2004 12:49:47 PM, Christian Ziemski wrote:
>>- The character 0H9E should(?) be LATIN
>>SMALL LETTER Z WITH CARON
>>but is displayed in VEDIT-ANSI as
>>character "Pt".
>> Who is right?
>This displays as SMALL LETTER Z WITH CARON using Lucida Console.

Yes, and with other fonts too.

That's what puzzled me a bit.
I don't know much about fonts and character sets ...

Christian

Topic:	Re: Unicode to ANSI conversion (40 of 103), Read 133 times
Conf:	Converting, Translating
From:	Ted Green
Date:	Saturday, June 26, 2004 11:26 AM

At 07:40 PM 6/25/2004, you wrote:
>>- The character 0H9E should(?) be LATIN SMALL LETTER Z WITH CARON
>>but is displayed in VEDIT-ANSI as character "Pt".
>> Who is right?
>This displays as SMALL LETTER Z WITH CARON using Lucida Console.

I have fixed the VEDIT ANSI fonts.

Ted.

Topic:	Re: Unicode to ANSI conversion (41 of 103), Read 135 times, 1 File Attachment
Conf:	Converting, Translating
From:	Ted Green
Date:	Saturday, June 26, 2004 11:42 AM

Thank you Ian.

The attached macro appears to work correctly.
As suggested, it converts unknown Unicodes to value 127 to prevent nulls and strange characters in the file.
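
The replacement of unknown characters by value 127 can be sketched in Python as follows. This is an illustration using Python's codecs rather than the macro's table; it assumes UTF-16 little-endian input without a BOM, and the function name `utf16_to_ansi` is hypothetical.

```python
def utf16_to_ansi(data: bytes, codec: str = "cp1252",
                  fallback: int = 0x7F) -> bytes:
    """Convert UTF-16LE to an 8-bit code page, unknowns become 127."""
    out = bytearray()
    for ch in data.decode("utf-16-le"):
        try:
            out += ch.encode(codec)
        except UnicodeEncodeError:
            out.append(fallback)        # no mapping: write value 127
    return bytes(out)
```

Writing a fixed single byte for each unknown character keeps nulls and stray high bytes out of the result.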

I fixed several characters in the VEDIT-ANSI font - small and capital Z with caron. These will be in the next release.

I noticed that the largest VEDIT-ANSI font is very rough for codes 128 - 255; hopefully I can fix that soon.

Ideally, the ANSI-UTF.VDM macro should be updated to work in one pass (special conversion for values > 127) and handle both ANSI and OEM characters.

Then, I will replace the old unic-asc.vdm and asc-unic.vdm with the new macros, probably renaming them to utf-asc.vdm and asc-utf.vdm.

Ted.

UTF-ANSI(1).VDM (10KB)

Topic:	Re: Unicode to ANSI conversion (42 of 103), Read 131 times, 1 File Attachment
Conf:	Converting, Translating
From:	Ian Binnie
Date:	Sunday, June 27, 2004 09:49 AM

On 6/26/2004 11:42:08 AM, Ted Green wrote:
>Thank you Ian.
>
>The attached macro appears to
>be work correctly.
>As suggested, it converts
>unknown Unicodes to value 127
>to prevent nulls and strange
>characters in the file.
>
>I fixed several characters in
>the VEDIT-ANSI font - small
>and capital Z with caron.
>These will be in the next
>release.
>
I have modified the macro to include the Description for OEM characters from Windows Glyph List 4 (WGL4).

In the process I have realised I do not understand the whole process of Code Pages & Fonts in Windows.

I use Code Page 850, the standard for Australia (and most of Western Europe).
The first difference between this and Code Page 437 is character 9B

This is Unicode 00A2 "cent sign" in Code Page 437
This is Unicode 00F8 "Latin small letter o with stroke" in Code Page 850
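
The differences between the two code pages can be listed with a short Python sketch, assuming Python's `cp437` and `cp850` codecs match the Windows code pages. The function name `codepage_differences` is hypothetical.

```python
def codepage_differences(a: str, b: str) -> list:
    """Return (byte, char in a, char in b) for every differing byte."""
    diffs = []
    for byte in range(256):
        raw = bytes([byte])
        ch_a, ch_b = raw.decode(a), raw.decode(b)
        if ch_a != ch_b:
            diffs.append((byte, ch_a, ch_b))
    return diffs

diffs = codepage_differences("cp437", "cp850")
first = diffs[0]    # the lowest byte value where the pages differ
```

The first entry is byte 9B: cent sign (U+00A2) in code page 437 and o with stroke (U+00F8) in code page 850.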

Vedit OEM font displays "cent sign" even though my default is Multilingual Code Page 850
Terminal font displays "Latin small letter o with stroke"

In normal day-to-day use this makes little difference in Australia: most of the characters in Code Page 850 are not used, because Australian usage is close to US usage and English (despite differences in pronunciation and spelling) does not require most of the extra accented characters. This would not be the case for non-English speakers.

When I look at Christian's umlaute-weg.vdm (Vedit OEM) character E4 displays as "Latin small letter a with diaeresis" which looks correct, but not as expected for either Code Page.
Using the same font in Notepad it appears as "Greek capital letter Sigma" but is OK in "System".

How do fonts support Code Pages?

UTF-ANSI(2).VDM (14KB)

Topic:	Re: Unicode to ANSI conversion (43 of 103), Read 132 times
Conf:	Converting, Translating
From:	Christian Ziemski
Date:	Sunday, June 27, 2004 02:02 PM

On Sun, 27 Jun 2004 09:49:00 -0400, Ian Binnie wrote:

>When I look at Christian's umlaute-weg.vdm (Vedit OEM) character E4
>displays as "Latin small letter a with diaeresis" which looks
>correct, but not as expected for either Code Page.
>Using the same font in Notepad it appears as "Greek capital letter
>Sigma" but is OK in "System".

That umlaute-weg.vdm is coded for VEDIT ANSI and not for VEDIT OEM!

>How do fonts support Code Pages?

I don't know either...

Christian

Topic:	Re: Unicode to ANSI conversion (44 of 103), Read 133 times
Conf:	Converting, Translating
From:	Christian Ziemski
Date:	Sunday, June 27, 2004 02:30 PM

I enhanced the old umlaute-weg.vdm now to support both ANSI and OEM.

http://ziemski.privat.t-online.de/vedit/macros/umlaute-weg.vdm

And fixed an error in there: The CASE option was missing and so the
Replace() might go wrong.

The same problem is in the original macros\umlauts.vdm written by
Johannes Loefler. I fixed it too.

http://ziemski.privat.t-online.de/vedit/macros/umlauts.vdm

Christian

Topic:	Re: Unicode to ANSI conversion (48 of 103), Read 133 times, 1 File Attachment
Conf:	Converting, Translating
From:	Christian Ziemski
Date:	Monday, June 28, 2004 03:58 PM

I found a problem in the macro:

If the file-type isn't set to "0" (CR/LF) the macro failed due to
"wrong" line ends while building the internal translation table.

I fixed it by determining and setting the file type before working
with the above table. Additionally the file type is now set LOCAL at
end of the macro.

New version is attached.

Christian

UTF-ANSI(3).VDM (14KB)

Topic:	Re: Unicode to ANSI conversion (51 of 103), Read 133 times
Conf:	Converting, Translating
From:	Ted Green
Date:	Monday, June 28, 2004 11:24 PM

At 03:59 PM 6/28/2004, you wrote:
>I found a problem in the macro:
>
>If the file-type isn't set to "0" (CR/LF) the macro failed due to
>"wrong" line ends while building the internal translation table.
>Attachment:
>http://webboard..../upload/utf%2Dansi%283%29.vdm (14KB)

Is this fix also needed in ansi-utf.vdm ?

Ted.

Topic:	Re: Unicode to ANSI conversion (57 of 103), Read 131 times
Conf:	Converting, Translating
From:	Ted Green
Date:	Friday, July 02, 2004 01:13 PM

At 09:50 AM 6/27/2004, you wrote:
>In the process I have realised I do not understand the whole process of Code Pages & Fonts in Windows.

Ian, I think you understand it better than anyone else in this forum; certainly better than me. To most programmers in the US, a "code page" is a program printed on paper. ;-)

>Vedit OEM font displays "cent sign" even though my default is Multilingual Code Page 850
>Terminal font displays "Latin small letter o with stroke"

Well, I created the VEDIT fonts pixel-by-pixel and it is certainly possible/likely that I made some mistakes. You already noticed errors in the VEDIT Ansi font, which I have corrected but not yet released.

>How do fonts support Code Pages?

Since the VEDIT fonts are "system" fonts with static pixels, they do not support code pages. True-Type fonts do have some support for code pages built in, in other words the character returned by a true-type font depends upon the system's code page. However, I believe that at least some non-English versions of Windows are supplied with a different set of fonts.

That is all I know, and hopefully all I need to know. :-)

BTW- As I have stated before, anyone is welcome to work on the VEDIT fonts; e.g. in case you want a smaller or bigger one than is currently supplied, or want some custom characters. It is very tedious, eye-straining work.

Ted.

Topic:	Re: Unicode to ANSI conversion (45 of 103), Read 135 times, 1 File Attachment
Conf:	Converting, Translating
From:	Ian Binnie
Date:	Monday, June 28, 2004 03:54 AM

On 6/26/2004 11:42:08 AM, Ted Green wrote:
>Ideally, the ANSI-UTF.VDM
>macro should be updated to
>work in one pass (special
>conversion for values > 127)
>and handle both ANSI and OEM
>characters.

I have had a go at this.

I decided to use identical translation tables, and obviously had to use a different search strategy, as we are looking for a single byte.

This uses columnar blocks in the translation table.
I have never used these in macros before, and suspect I have not done this the best way - I welcome any suggestions.

ANSI_UTF(1).VDM (12KB)

Topic:	Re: Unicode to ANSI conversion (46 of 103), Read 135 times
Conf:	Converting, Translating
From:	Christian Ziemski
Date:	Monday, June 28, 2004 07:07 AM

On 6/28/2004 3:54:16 AM, Ian Binnie wrote:

>[ANSI-UTF.VDM]
>
>I have had a go at this.
>...
>This uses columnar blocks in translation table.
>I have never used these in macros before, and suspect I have not got this
>the best way - I welcome any suggestions.

It may be easier not to set the block beforehand but to use
Search_Block("ss",p,q,COLSET,c1, c2)
instead of your
Search_Block("|@(103)",Block_Begin,Block_End,BEGIN+NOERR)

Christian

Topic:	Re: Unicode to ANSI conversion (47 of 103), Read 134 times
Conf:	Converting, Translating
From:	Ted Green
Date:	Monday, June 28, 2004 10:39 AM

At 07:07 AM 6/28/2004, you wrote:
>>This uses columnar blocks in translation table.
>>I have never used these in macros before, and suspect I have not got this
>>the best way - I welcome any suggestions.
>
>It may be easier not to set the block beforehand but to use
> Search_Block("ss",p,q,COLSET,c1, c2)
>instead of your
> Search_Block("|@(103)",Block_Begin,Block_End,BEGIN+NOERR)

Ian, Christian:

Columnar blocks are much slower.
Here is my suggestion:

1. Rewrite the table conversion (hex to binary) to delete everything (comments)
after the 3rd hex number.
2. Then Search("|@(103)|>") because the 8-bit character should be at
the end of the line in the table.

Ted.

Topic:	Re: Unicode to ANSI conversion (49 of 103), Read 135 times, 1 File Attachment
Conf:	Converting, Translating
From:	Christian Ziemski
Date:	Monday, June 28, 2004 03:58 PM

On Mon, 28 Jun 2004 10:39:00 -0400, Ted Green wrote:

>Columnar blocks are much slower.
>Here is my suggestion:
>
>1. Rewrite the table conversion (hex to binary) to delete
> everything (comments) after the 3rd hex number.
>2. Then Search("|@(103)|>") because the 8-bit character
> should be at the end of the line in the table.

Thanks for the suggestion.

I've done it and fixed some things in the version Ian sent earlier.

It's attached.

Christian

ANSI-UTF.VDM (12KB)

Topic:	Re: Unicode to ANSI conversion (50 of 103), Read 135 times
Conf:	Converting, Translating
From:	Ted Green
Date:	Monday, June 28, 2004 11:21 PM

At 03:59 PM 6/28/2004, you wrote:
>>Columnar blocks are much slower.
>>Here is my suggestion:
>>
>>1. Rewrite the table conversion (hex to binary) to delete
>> everything (comments) after the 3rd hex number.
>>2. Then Search("|@(103)|>") because the 8-bit character
>> should be at the end of the line in the table.
>
>Thanks for the suggestion.
>
>I've done it and fixed some things in the version Ian sent earlier.
>It's attached.

Thank you for the new macro Ian and Christian.
With Christian's (fixed) version of this, I will use these two new
macros to replace the old Unicode macros.
This is a great improvement!

I think this issue can be considered finished.

Ted.

Topic:	Re: Unicode to ANSI conversion (52 of 103), Read 134 times
Conf:	Converting, Translating
From:	Christian Ziemski
Date:	Tuesday, June 29, 2004 04:51 AM

On 6/28/2004 11:21:40 PM, Ted Green wrote:

>[...] I will use these two new
>macros to replace the old Unicode macros.
>
>I think this issue can be considered finished.

Ooooh - and I thought we would hit the "100 messages" line in this thread ... ;-)

Christian

Topic:	Re: Unicode to ANSI conversion (53 of 103), Read 135 times
Conf:	Converting, Translating
From:	Ian Binnie
Date:	Tuesday, June 29, 2004 05:34 AM

On 6/29/2004 4:51:11 AM, Christian Ziemski wrote:
>On 6/28/2004 11:21:40 PM, Ted Green
>wrote:
>
>>[...] I will use these two new
>>macros to replace the old Unicode macros.
>>
>>I think this issue can be considered finished.
>
>Ooooh - and I thought we would hit the
>"100 messages" line in this thread ...
>;-)
>
>Christian
>
I agree the macros are effectively finished.

There are a few minor documentation issues.
Christian has done a good job of documenting the macro.

I do not fully understand why "// To be save (otherwise macro may fail):" is there, but this is at worst harmless (except "save" should be "safe" - Christian's English is normally excellent).

The routine to build the translation table should be identical in both directions.

My ansi-utf.vdm routine did an overwrite + insert, Christian does a delete + insert 2.
I would have expected the former to be more efficient.

The checking for Is_Auto_Execution is not consistent.

Topic:	Re: Unicode to ANSI conversion (54 of 103), Read 135 times
Conf:	Converting, Translating
From:	Christian Ziemski
Date:	Tuesday, June 29, 2004 08:39 AM

On 6/29/2004 5:34:52 AM, Ian Binnie wrote:

>I do not fully understand why
>"// To be save (otherwise macro may fail):"

Without the code that determines and sets the file type before building the translation table, the following could happen (this is how I found the issue):

For example, translating a text with ansi-utf.vdm leaves the file type set to "1" (globally(!) in the older version).
Then translating it back with utf-ansi.vdm tries to build the internal table. This table is coded with CR/LF but may be built in a buffer with file type "1" (LF).
That doesn't work correctly and breaks the macro with the "BAD INPUT" message.
Perhaps there is an easier way to handle this?

> (except save should be safe - Christian's English is
>normally excellent)

Thanks. And thanks for the correction.
(Learning English, lesson thirty-seven.)

>My ansi-utf.vdm routine did an overwrite + insert, Christian does a delete + insert 2.
>I would have expected the former to be more efficient.

I have seen the difference but I don't know which one is faster.
And additionally I don't understand why Ted's main loop with the call(104) should be faster than an inline if() there.

Hopefully I'll find some time to compare the execution times for the different versions.
The macro for that is almost ready...

>The checking for Is_Auto_Execution is not consistent.

Oh, yes, the remaining OS_Type() thing. Should be deleted.

Which of us will do those little remaining tasks?
Should I?
We should coordinate this to avoid double work.

Christian

Topic:	Re: Unicode to ANSI conversion (55 of 103), Read 137 times
Conf:	Converting, Translating
From:	Ted Green
Date:	Tuesday, June 29, 2004 08:54 PM

At 08:39 AM 6/29/2004, you wrote:
>For example translating a text with ansi-utf.vdm leaves the file type set to "1" (globally(!) in the older version).
>Then translating it back with utf-ansi.vdm tries to build the internal table. This one is coded with CR/LF but may be build in a buffer with file type "1" (LF).
>That doesn't work correctly and breaks the macro with the "BAD INPUT" message.
>Perhaps there is an easier way to handle this?

In general, macros should work with either file type 0 or 1.
Of course, if a temp buffer needs a special file type, it should be
set LOCAL-ly.

>>My ansi-utf.vdm routine did an overwrite + insert, Christian does a delete + insert 2. I would have expected the former to be more efficient.
>
>I have seen the difference but I don't know which one is faster.

I think an overwrite + insert is better.

>And additionally I don't understand why Ted's main loop with the call(104) should be faster than an inline if() there.

In an if-then-else, the code must search for the closing "} else" and "}", which is CPU intensive. A Call() is extremely efficient. Someday, when the macro language is internally compiled, the if-then-else will be faster.

>Hopefully I'll find some time to compare the execution times for the different versions. The macro for that is almost ready...

The old uni-asc.vdm macro took only 1 second per megabyte.
The new utf-ansi.vdm takes 1.5 seconds/Meg for Ansi and 4 seconds/Meg for OEM characters.

Ted.

Topic:	Re: Unicode to ANSI conversion (56 of 103), Read 133 times
Conf:	Converting, Translating
From:	Christian Ziemski
Date:	Wednesday, June 30, 2004 11:21 AM

On Tue, 29 Jun 2004 20:54:00 -0400, Ted Green wrote:

>At 08:39 AM 6/29/2004, you wrote:
>>For example translating a text with ansi-utf.vdm leaves the file type set to
>"1" (globally(!) in the older version).
>>Then translating it back with utf-ansi.vdm tries to build the internal table.
>This one is coded with CR/LF but may be build in a buffer with file type "1"
>(LF).
>>That doesn't work correctly and breaks the macro with the "BAD INPUT" message.
>>Perhaps there is an easier way to handle this?
>
>In general, macros should work with either file type 0 or 1.
>Of course, if a temp buffer needs a special file type, it should be
>set LOCAL-ly.

I'll have a look at that later. It really was a bit strange...

>I think an overwrite + insert is better.

I've done some benchmarks today and found interesting results.
But it's too early to post them. In a few days...
(And then I'll fix the two macros too.)

>>And additionally I don't understand why Ted's main loop with the call(104)
>should be faster than an inline if() there.
>
>In an if-then-else, the code must search for the closing "} else" and "}",
>which is CPU intensive. A Call() is extremely efficient. Someday, when the
>macro language is internally compiled, the if-then-else will be faster.

Does that mean that searching for the closing "}" is so CPU expensive
that every saved (and not executed, simply read over) byte between the
{} is faster than the overhead of the Call() itself?
If yes: Good to know. I never would have thought of it that way.

>The old uni-asc.vdm macro took only 1 second per megabyte.
>The new utf-ansi.vdm takes 1.5 seconds/Meg for Ansi and 4 seconds/Meg
>for OEM characters.

What CPU are you using??? On my 800 MHz PC it took ten times
as long :-(

Christian

Topic:	Re: Unicode to ANSI conversion (58 of 103), Read 136 times, 1 File Attachment
Conf:	Converting, Translating
From:	Ted Green
Date:	Tuesday, July 06, 2004 05:01 PM

We have enhanced the utf-ansi.vdm macro to optionally run in "quiet" mode without displaying an application window. The macro now also returns with return codes to indicate how well the conversion ran. This allows shelling from another program and checking for errors. These changes were commissioned by a customer.

To convert a file, e.g. "unicode.dat", creating the ANSI file "ascii.dat", shell out with this command:

c:\vedit\vpw.exe -q -x utf-ansi.vdm unicode.dat -a ascii.dat

It will run in "quiet" mode (no application displayed), convert the file and return with a return code. Unknown Unicode characters are converted to hex 7F (decimal 127).

If you want to convert unknown Unicode characters to something else, change line 398 to the desired value. For example, to change it to "#", change the "IC(127,OVERWRITE)" line to:

IC(35,OVERWRITE)
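
For readers who want to see the substitution behaviour outside VEDIT, here is a minimal Python sketch. It assumes the ANSI target is Windows-1252 (Python's `cp1252` codec) and that the input is UTF-16LE; the function name `utf16le_to_ansi` and its `unknown` parameter are illustrative and not part of the macro.

```python
def utf16le_to_ansi(data, unknown=0x7F):
    """Convert UTF-16LE bytes to Windows-1252.

    Characters with no Windows-1252 mapping are replaced by the byte
    `unknown` (0x7F by default, mirroring the macro's default).
    Returns (converted_bytes, number_of_replaced_characters).
    """
    # Skip a leading byte order mark (FF FE) if present.
    if data[:2] == b"\xff\xfe":
        data = data[2:]
    text = data.decode("utf-16-le", errors="replace")
    out = bytearray()
    replaced = 0
    for ch in text:
        try:
            out += ch.encode("cp1252")
        except UnicodeEncodeError:
            # No mapping in the target code page: substitute.
            out.append(unknown)
            replaced += 1
    return bytes(out), replaced
```

Passing `unknown=35` corresponds to the "#" substitution described above.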

The return codes are:

0 - Success - fewer than 2% of the Unicode characters were unknown.

1 - Marginal - between 2% and 10% of the Unicode characters were unknown.

2 - Marginal - between 10% and 15% of the Unicode characters were unknown.

3 - Marginal - between 15% and 33% of the Unicode characters were unknown.

10 - Fail - More than 33% of the characters were unknown and the macro therefore
aborted. Most likely not a UTF-16 Unicode file.

11 - Internal error - Internal conversion table was incorrectly edited.
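
A calling program could interpret these codes along the following lines. This is only a sketch in Python: the command line is the one given above, but the helper names (`describe`, `conversion_usable`, `run_conversion`) and the decision to treat codes 0 to 3 as usable are assumptions made for illustration.

```python
import subprocess

# Return codes as listed in the post; the wording of each entry is paraphrased.
RETURN_CODES = {
    0: "success: fewer than 2% of characters unknown",
    1: "marginal: 2% to 10% unknown",
    2: "marginal: 10% to 15% unknown",
    3: "marginal: 15% to 33% unknown",
    10: "fail: more than 33% unknown, conversion aborted",
    11: "internal error: conversion table incorrectly edited",
}


def describe(code):
    """Return a readable description for a utf-ansi.vdm return code."""
    return RETURN_CODES.get(code, "unrecognized return code %d" % code)


def conversion_usable(code):
    """Assumption: success and the 'marginal' codes (0-3) yield usable output."""
    return code in (0, 1, 2, 3)


def run_conversion(src, dst, vpw=r"c:\vedit\vpw.exe"):
    """Shell out to VEDIT in quiet mode and return its exit code."""
    cmd = [vpw, "-q", "-x", "utf-ansi.vdm", src, "-a", dst]
    return subprocess.run(cmd).returncode
```

A caller would then check, for example, `conversion_usable(run_conversion("unicode.dat", "ascii.dat"))` before using the output file.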

Tom's changes are a bit complex, but allow the return code to distinguish between a Unicode file with a few unknown characters and a non-UTF16 or a Chinese Unicode file which did not convert correctly.

The attached macro should now be used for any additional enhancements to utf-ansi.vdm. This macro will soon replace unic-asc.vdm.

Ted.

UTF-ANSI(4).VDM (16KB)

Topic:	Re: Unicode to ANSI conversion (59 of 103), Read 137 times, 2 File Attachments
Conf:	Converting, Translating
From:	Ted Green
Date:	Wednesday, July 07, 2004 12:38 AM

The last utf-ansi.vdm (with new error return codes) didn't work as advertised. :-(
Sorry. This is now hopefully corrected in the attached version.

Any auto-execution ("-x") of the macro now returns an (error) return code, independently of the "-q" quiet mode.

The attached test.bat file is a way to test the error return codes; it assumes a UTF-16 sample file called test.utf.

Thank you Pauli, Ian and Christian for all your work on this macro.

Ted.

TEST.BAT (1KB)
UTF-ANSI(5).VDM (17KB)

Topic:	Re: Unicode to ANSI conversion (60 of 103), Read 142 times, 1 File Attachment
Conf:	Converting, Translating
From:	Christian Ziemski
Date:	Wednesday, July 07, 2004 04:14 AM

On 7/7/2004 12:38:10 AM, Ted Green wrote:

>The last utf-ansi.vdm (with new error return codes) didn't
>work as advertised. :-( Sorry.
>This is now hopefully corrected in the attached version.

I tested the new macro a bit. Unfortunately the interactive mode is broken now.

I fixed that and some register usage too.

Another thing that should still be tested:
The feature of aborting the translation if more than 33% are not translatable requires a relatively complicated if() in the conversion. I tried to speed that up, but without success. The slow down should be tested IMHO.

Christian

UTF-ANSI(6).VDM (17KB)

Topic:	Re: Unicode to ANSI conversion (61 of 103), Read 134 times
Conf:	Converting, Translating
From:	Ian Binnie
Date:	Friday, July 09, 2004 02:09 AM

On 7/7/2004 4:14:23 AM, Christian Ziemski wrote:
>On 7/7/2004 12:38:10 AM, Ted Green
>wrote:
>
>>The last utf-ansi.vdm (with new error return codes) didn't
>>work as advertised. :-( Sorry.
>>This is now hopefully corrected in the attached version.
>
>I tested the new macro a bit.
>Unfortunately the interactive mode is
>broken now.
>
>I fixed that and some register usage
>too.
>
>Another thing that should still be
>tested:
>The feature of aborting the translation
>if more than 33% are not translatable
>requires a relatively complicated if()
>in the conversion. I tried to speed that
>up, but without success. The slow down
>should be tested IMHO.
>
>Christian

The latest version of UTF_ANSI.vdm is much poorer at detecting UTF-16LE files.

The check for BOM FF FE is omitted - this is contained in most files and should be retained as the primary test. This may be satisfied for UNIX or Mac files (the latter is less probable, as the byte order is more likely to be big-endian).

In my experience all files contain BOM except for log files.

The check that there are 5 "|H0D|000|H0A|000" is MUCH weaker than my test that the first 5 0x0D are part of "|H0D|000|H0A|000" strings, and may result in false positives. In particular it may flag binary files with Unicode strings embedded.

While the changes to the macro may be handy for approximate conversions, I think the original functionality should be retained.
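
The two-stage test described here (BOM first, then the stricter CR/LF check) can be sketched in Python as below. The function name `looks_like_utf16le` is hypothetical, and accepting a file that has fewer than five CR bytes, as long as all of them match, is an assumption rather than something the macro is known to do.

```python
def looks_like_utf16le(data, samples=5):
    """Heuristic check for UTF-16LE text.

    Primary test: the file starts with the byte order mark FF FE.
    Fallback: each of the first `samples` 0x0D bytes must begin the
    sequence 0D 00 0A 00 (a UTF-16LE CR/LF pair).
    """
    if data.startswith(b"\xff\xfe"):
        return True
    pos = 0
    found = 0
    while found < samples:
        pos = data.find(b"\x0d", pos)
        if pos < 0:
            break  # no more CR bytes in the data
        if data[pos:pos + 4] != b"\x0d\x00\x0a\x00":
            return False  # a CR that is not part of a UTF-16LE CR/LF
        found += 1
        pos += 4
    # Assumption: at least one matching CR/LF is required.
    return found > 0
```

Checking that every early 0x0D belongs to a full CR/LF sequence is what makes this stricter than merely counting five occurrences of the sequence anywhere in the file.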

Topic:	Re: Unicode to ANSI conversion (62 of 103), Read 135 times
Conf:	Converting, Translating
From:	Ted Green
Date:	Friday, July 09, 2004 08:55 AM

At 02:09 AM 7/9/2004, you wrote:
>The latest version of UTF_ANSI.vdm is much poorer at detecting UTF-16LE files.

As we had no examples of the BOM bytes, Tom left that code out because he thought it had a logical error (would never trigger).

If you send us a few UTF-16LE sample files, Tom will make your suggested improvements.

Thank you again for your help with UTF files.

Ted.

Topic:	Re: Unicode to ANSI conversion (63 of 103), Read 137 times, 1 File Attachment
Conf:	Converting, Translating
From:	Ian Binnie
Date:	Saturday, July 10, 2004 08:21 PM

On 7/9/2004 8:55:32 AM, Ted Green wrote:
>At 02:09 AM 7/9/2004, you
>wrote:
>>The latest version of UTF_ANSI.vdm is much poorer at detecting UTF-16LE files.
>
>As we had no examples of the
>BOM bytes, Tom left that code
>out because he thought it had
>a logical error (would never
>trigger).
>
>If you send us a few UTF-16LE
>sample files, Tom will make
>your suggested improvements.
>
>Thank you again for your help
>with UTF files.
>
>Ted.
>

I am surprised that Tom cannot find any files.
I have more than 400 on my XP system drive.
These are mainly log, xml or ini files.

I have attached one small example. (Many are very large files.)

NOTE that it is possible to generate samples using Notepad, SaveAs and selecting Encoding.

I originally got into this because I wanted to use Vedit rather than Notepad to view these files.

MODEMLOG_TOSHIBA SOFTWARE MODEM AMR.TXT (4KB)

Topic:	Re: Unicode to ANSI conversion (64 of 103), Read 132 times
Conf:	Converting, Translating
From:	Christian Ziemski
Date:	Monday, July 12, 2004 03:53 AM

On 7/9/2004 8:55:32 AM, Ted Green wrote:
>
>As we had no examples of the BOM bytes, Tom left that code
>out because he thought it had a logical error (would never
>trigger).
>
>If you send us a few UTF-16LE sample files, Tom will make
>your suggested improvements.

Since Windows Unicode seems to be always little-endian with a BOM of FF FE (the bytes of U+FEFF in little-endian order), the macro checked for that BOM.

But what about the big-endian format? If I understand it correctly, big-endian should be default in Unicode (architecture independent).

So the macro should support big-endian files too, IMHO.

It shouldn't be difficult to enhance the macro; I had already implemented that in one of my older versions of unic-asc.vdm.

Ted, if you agree to have this included in the new utf-ansi.vdm I'll do it.

Christian

Topic:	Re: Unicode to ANSI conversion (65 of 103), Read 132 times
Conf:	Converting, Translating
From:	Ian Binnie
Date:	Monday, July 12, 2004 09:25 AM

On 7/12/2004 3:53:32 AM, Christian Ziemski wrote:
>On 7/9/2004 8:55:32 AM, Ted Green wrote:
>>
>Since Windows Unicode seems to be always
>little-endian with a BOM of 0xFFFE the
>macro checked that BOM.

Many Windows applications produce files without BOM.

>But what about the big-endian format? If
>I understand it correctly, big-endian
>should be default in Unicode
>(architecture independent).

I do not see this in Unicode documentation which expresses no preference. This is dependent on architecture, and big-endian is largely confined to Mac implementations. The preferred coding is probably UTF-8, which is widely used in MIME encoding, and is independent of processor architecture, but much harder to decode.

>So the macro should support big-endian
>files too, IMHO.

I thought of this, but it seemed to be of little practical value to Vedit users.

>It shouldn't be difficult to enhance the
>macro, in one of my older versions of
>unic-asc.vdm I already had implemented
>that.

This would require either a big-endian translation table, or a significant change to the code to construct the table or lookup code points.

Topic:	Re: Unicode to ANSI conversion (66 of 103), Read 134 times
Conf:	Converting, Translating
From:	Christian Ziemski
Date:	Monday, July 12, 2004 10:09 AM

On 7/12/2004 9:25:17 AM, Ian Binnie wrote:
>On 7/12/2004 3:53:32 AM, Christian Ziemski wrote:
>
>>But what about the big-endian format? If I understand it correctly,
>>big-endian should be default in Unicode (architecture independent).
>
>I do not see this in Unicode documentation which expresses no
>preference.

I'm not exactly sure. The Unicode docu is fascinating complex...

>preferred coding is probably UTF-8, which is widely used in MIME
>encoding, and is independent of processor architecture, but much
>harder to decode.

Yes, we shouldn't try to do this via macro.
But that was already discussed some weeks ago.

>>So the macro should support big-endian files too, IMHO.
>
>I thought of this, but it seemed to be of little practical value
>to Vedit users.

Hmmm, no. Only if the user mainly uses x86 architecture.

But we don't know what files are edited with VEDIT...
I often transfer files from hosts and edit them locally. And later,
when VEDIT's FTP support is finished, users may edit remote
files from other architectures.

>>It shouldn't be difficult to enhance the macro, in one of my older
>>versions of unic-asc.vdm I already had implemented that.
>
>This would require either a big-endian translation table, or a
>significant change to the code to construct the table or lookup
>code points.

I think it's relatively easy: Building the translation table only has
to swap two bytes per line.

And the main loop has to be shifted one byte.
It *should* be no problem.
I'll see.
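
The idea of one table served in either byte order can be illustrated with a short Python sketch. It is not the macro's code: the three-entry table is only a sample of the Windows-1252 0x80-0x9F range, and `build_table` and `convert` are hypothetical names.

```python
# A few Windows-1252 bytes in 0x80-0x9F and their Unicode code points.
CP1252_SAMPLE = {0x80: 0x20AC, 0x85: 0x2026, 0x99: 0x2122}


def build_table(byteorder):
    """Map each 2-byte UTF-16 sequence to its ANSI byte.

    byteorder is "little" or "big"; switching it just swaps the two
    bytes of every key, which is the only change the table needs.
    """
    return {cp.to_bytes(2, byteorder): bytes([ansi])
            for ansi, cp in CP1252_SAMPLE.items()}


def convert(data, byteorder, unknown=b"\x7f"):
    """Convert BOM-less UTF-16 bytes to ANSI using the sample table."""
    table = build_table(byteorder)
    out = bytearray()
    for i in range(0, len(data) - 1, 2):
        pair = data[i:i + 2]
        cp = int.from_bytes(pair, byteorder)
        if cp < 0x80 or 0xA0 <= cp <= 0xFF:
            out.append(cp)  # same value in Windows-1252
        else:
            out += table.get(pair, unknown)
    return bytes(out)
```

With this layout, `convert(data, "little")` and `convert(data, "big")` share all logic except the order in which the two bytes are read.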

Christian

Topic:	Re: Unicode to ANSI conversion (67 of 103), Read 135 times
Conf:	Converting, Translating
From:	Ted Green
Date:	Monday, July 12, 2004 10:26 AM

At 10:09 AM 7/12/2004, you wrote:

>I'm not exactly sure. The Unicode docu is fascinating complex...

And the VEDIT macro is becoming fascinatingly complex too. :-)

>I think it's relatively easy: Building the translation table only has
>to swap two bytes per line.
>And the main loop has to be shifted one byte.
>It *should* be no problem.
>I'll see.

Good luck; but don't spend too much time on it.

I would suggest separate main loops for the new cases, so that each loop can be speed optimized. (I implemented separate loops for the ANSI and OEM cases.) Therefore the macro might have four main loops, one for each combination of big/little endian and OEM/ANSI character set.

Ted.

Topic:	Re: Unicode to ANSI conversion (68 of 103), Read 135 times
Conf:	Converting, Translating
From:	Christian Ziemski
Date:	Monday, July 12, 2004 10:37 AM

On 7/12/2004 10:26:46 AM, Ted Green wrote:

>Good luck; but don't spend too much time on it.

Yes, of course! ;-)

Earlier I wrote that I don't like character sets and fonts.
That is still true, but these macros are technically interesting and a good way to learn something about Unicode.
Since Unicode is becoming more and more important that seems to be a must.

Much to do for us byte-orientated people...

Christian

Topic:	Re: Unicode to ANSI conversion (69 of 103), Read 134 times, 1 File Attachment
Conf:	Converting, Translating
From:	Christian Ziemski
Date:	Monday, July 12, 2004 04:40 PM

On Mon, 12 Jul 2004 10:26:00 -0400, Ted Green wrote:

>>[Big-endian UTF-files]
>
>Good luck; but don't spend too much time on it.
>
>I would suggest separate main loops for the new cases

I adopted your suggestion.

The new macro is ready and attached.

Done:
- Added translation from big-endian UTF-16
- Reimplemented checking for BOM
- Added retcode 12 if in non-interactive mode and not a UTF-16 file!
- Changed behavior and dialog of the checks above

To do:
- How to guess endianness when no BOM?
- What about line-end != CR/LF when checking UTF format?

I tested it with some files, but not yet intensively.
It's late here for today...

It would be nice if others could proofread and test it too (Ted, Ian?)

Christian

UTF-ANSI(7).VDM (20KB)

Topic:	Re: Unicode to ANSI conversion (70 of 103), Read 139 times, 2 File Attachments
Conf:	Converting, Translating
From:	Christian Ziemski
Date:	Tuesday, July 13, 2004 02:54 PM

I found a bug in the translation:
It could end up being case-insensitive and thus wrong.

That is fixed in the attached files.

Christian

UTF-ANSI(8).VDM (20KB)
ANSI-UTF(1).VDM (12KB)

Topic:	Re: Unicode to ANSI conversion (71 of 103), Read 136 times, 3 File Attachments
Conf:	Converting, Translating
From:	Ian Binnie
Date:	Thursday, July 22, 2004 09:55 PM

On 7/12/2004 4:40:08 PM, Christian Ziemski wrote:
>On Mon, 12 Jul 2004 10:26:00 -0400, Ted
>Green wrote:
>
>The new macro is ready and attached.
>
>Done:
>- Added translation from big-endian
>UTF-16
>- Reimplemented checking for BOM
>- Added retcode 12 if in non-interactive
>mode and not an UTF-16 file !
>- Changed behavior and dialog of
>checkings above
>
>To do:
>- How to guess endianess when no BOM?
>- What about line-end != CR/LF when
>checking UTF format?
>
>
>I tested it with some files, but not yet
>intensively.
>It's late here for today...
>
>It would be nice if others could
>proofread and test it too (Ted, Ian?)
>
Christian,

I have been using the macro (actually the version after this). It looks OK and seems to work well.

I have also tested it on a couple of big files (exporting my registry to text produces a 114Mbyte file) and while this takes a while to convert it is acceptable. (I am intrigued at the 270 characters it could not convert, but have not had the time to find these - I expected straight ANSI from registry.)

I made a couple of very short test files in LE, BE & UTF-8.

I have never seen a native UTF-16BE file, so I do not know how it would handle these.

I have thought about how to detect with no BOM, and can speculate, but without real files to check this seems futile.

Testing 00 0D 00 0A would be unlikely to succeed as the most likely pattern is 00 0D 00 0A 00. It would be possible to do some pattern checking on the first & last byte of the file, but looking for 00 0D 00 0A on an odd byte boundary would seem a better bet.

You would also need to check for 00 0D & 00 0A for Mac & xNIX files from a non-Intel CPU. I expect that any pure text files would have a BOM; it is only internal files, e.g. log files, which don't seem to have one.

Incidentally the test for files without BOM should probably be local:-
Search("|H0D",NOERR+ERRBREAK+LOCAL)
If it can't find these in the buffered text they probably don't exist.

TESTUTF-16LE.TXT (1KB)
TESTUTF-8.TXT (1KB)
TESTUTF-16BE.TXT (1KB)

Topic:	Re: Unicode to ANSI conversion (72 of 103), Read 140 times, 1 File Attachment
Conf:	Converting, Translating
From:	Christian Ziemski
Date:	Friday, July 23, 2004 05:02 AM

Ian:

You wrote:

>I have been using the macro (actually the version after this).
>It looks OK and seems to work well.

Good. Thanks.

>[big test files] I am intrigued at the 270 characters it could not
>convert, but have not had the time to find these

I added code to collect the characters that couldn't be translated.
Since there shouldn't be too many of them that code will not slow
down the entire macro.
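
Collecting the untranslatable characters can be pictured with this small Python sketch. It is not the macro's implementation; it assumes Windows-1252 as the target (Python's `cp1252` codec), and the function name `untranslatable` is made up for the example.

```python
from collections import Counter


def untranslatable(text, encoding="cp1252"):
    """Count the characters in `text` that have no mapping in `encoding`."""
    missing = Counter()
    for ch in text:
        try:
            ch.encode(encoding)
        except UnicodeEncodeError:
            missing[ch] += 1
    return missing
```

Because only failed lookups touch the counter, the extra bookkeeping costs little when most characters convert cleanly.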

>I have thought about how to detect with no BOM, and can speculate,
>but without real files to check this seems futile.
>
>Testing 00 0D 00 0A would be unlikely to succeed as the most likely
>pattern is 00 0D 00 0A 00. It would be possible to do some pattern
>checking on the first & last byte of the file, but looking for
>00 0D 00 0A on an odd byte boundary would seem a better bet.
>
>You would also need to check for 00 0D & 00 0A for Mac & xNIX files
>from a non intel CPU.

This topic is a bit tricky... That's why it is still "to do" ;-)

>Incidentally the test for files without BOM should probably be local:
>Search("|H0D",NOERR+ERRBREAK+LOCAL)
>If it can't find these in the buffered text they probably don't exist.

I agree. Done.

The new version is attached.

Christian

UTF-ANSI(9).VDM (21KB)

Topic:	Re: Unicode to ANSI conversion (73 of 103), Read 134 times
Conf:	Converting, Translating
From:	Ian Binnie
Date:	Friday, July 23, 2004 10:42 PM

On 7/23/2004 5:02:03 AM, Christian Ziemski wrote:
>I added code to collect the
>characters that couldn't be
>translated.
>Since there shouldn't be too
>many of them that code will
>not slow
>down the entire macro.

I looked at the characters, and now think I understand why they are there.

I found 90 90 after the end of a REG_MULTI_SZ value i.e. junk, not displayed by regedit.

Most of the rest was in strings for which I don't have language support loaded e.g. in HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Speech\Voices\Preview

>The new version is attached.

This took ~7 minutes to convert a 114 Mbyte file on my Pentium 4-M 1.7 GHz. Notepad took over 1 minute to load the same file.

Topic:	Re: Unicode to ANSI conversion (74 of 103), Read 136 times, 1 File Attachment
Conf:	Converting, Translating
From:	Christian Ziemski
Date:	Saturday, July 24, 2004 12:43 PM

Ian:

On Thu, 22 Jul 2004 21:55:00 -0400, you wrote:

>I have thought about how to detect with no BOM, [...]
>
>Testing 00 0D 00 0A would be unlikely to succeed as
>the most likely pattern is 00 0D 00 0A 00.
>[..] but looking for 00 0D 00 0A on an odd byte boundary
>would seem a better bet.

I incorporated your suggestion into my solution.
Together it seems to be reliable, at least a bit ;-)

The check for UNIX and Mac files is implemented too.

To be able to test the algorithm easily I wrote an extra macro for it;
it's attached.

If it proves o.k. I'll merge it into the main macro utf-ansi.vdm.

Christian

UTF-CHECK.VDM (7KB)

Topic:	Re: Unicode to ANSI conversion (75 of 103), Read 138 times
Conf:	Converting, Translating
From:	Pauli Lindgren
Date:	Tuesday, July 27, 2004 11:37 AM

On 7/24/2004 12:43:00 PM, Christian Ziemski wrote:
>Ian:
>>I have thought about how to detect with no BOM, [...]
>>
>>Testing 00 0D 00 0A would be unlikely to succeed as
>>the most likely pattern is 00 0D 00 0A 00.
>>[..] but looking for 00 0D 00 0A on an odd byte boundary
>>would seem a better bet.
>
>I incorporated your suggestion into my
>solution.
>Together it seems to be reliable, at
>least a bit ;-)
>
>The check for UNIX and Mac files is
>implemented too.

Why are you testing line-ends in the first place?
There are quite a few combinations to test.

Wouldn't it be easier to check, say, the first 50 unicode characters, and count how many zeroes are in even and odd byte positions?
If most odd bytes (1st, 3rd, ...) are zero, it is Big Endian; if most even bytes (2nd, 4th, ...) are zero, it is Little Endian; else it is probably not Unicode.

Here is a macro that does the above test. (Note: I compare the counters to CP (Cur_Pos) so that it will work with short files, too.)

      #104 = 0                          // Result: 0 = unknown
      #1 = 0                            // Zero count, 1st byte of each pair
      #2 = 0                            // Zero count, 2nd byte of each pair
      BOF
      Repeat(50) {
          if (Cur_Char==0) { #1++ }
          char
          if (Cur_Char==0) { #2++ }
          char
      }
      if (#1*3 > CP && #2*5 < CP) {
          #104 = 1                      // Big endian
      } else {
          if (#2*3 > CP && #1*5 < CP) {
              #104 = 2                  // Little endian
          }
      }
      BOF
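
For comparison, the same zero-counting heuristic can be written in Python. This is a sketch that mirrors the thresholds of the macro above; the function name `guess_utf16_byte_order` and the string return values are choices made for the example.

```python
def guess_utf16_byte_order(data, max_chars=50):
    """Guess UTF-16 byte order by counting zero bytes.

    Returns "big", "little", or None when the data does not look like
    UTF-16 text made up mostly of Latin characters.
    """
    n = min(max_chars, len(data) // 2)   # number of 2-byte units examined
    first_zero = sum(1 for i in range(n) if data[2 * i] == 0)
    second_zero = sum(1 for i in range(n) if data[2 * i + 1] == 0)
    scanned = 2 * n                      # plays the role of CP in the macro
    if first_zero * 3 > scanned and second_zero * 5 < scanned:
        return "big"      # high (zero) byte comes first
    if second_zero * 3 > scanned and first_zero * 5 < scanned:
        return "little"   # high (zero) byte comes second
    return None
```

Scaling the thresholds by the number of bytes actually scanned is what lets the test work on files shorter than 50 characters.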

--
Pauli

Topic:	Re: Unicode to ANSI conversion (76 of 103), Read 136 times
Conf:	Converting, Translating
From:	Ted Green
Date:	Tuesday, July 27, 2004 11:49 AM

At 11:36 AM 7/27/2004, you wrote:
>Why are you testing line-ends in the first place?
>There are quite many combinations to test.
>
>Wouldn't it be easier to check, say, the first 50 unicode characters, and count how many zeroes are in even and odd byte positions?
>If most odd bytes are zero, it is Big Endian; if most even bytes are zero, it is Little Endian; else it is probably not Unicode.

Thank you for another approach to this.
I'm going to let you guys figure this out.

I do plan on releasing a VEDIT 6.13 very soon; I will include the "latest and greatest" version of these macros.

Tom just finished a very extensive macro which converts flat files into CSV format with field definitions, different types of field delimiters and various options. This and the new Unicode macros will be the major enhancements for 6.13.

Ted.

Topic:	Re: Unicode to ANSI conversion (81 of 103), Read 132 times
Conf:	Converting, Translating
From:	Christian Ziemski
Date:	Wednesday, July 28, 2004 11:30 AM

On Tue, 27 Jul 2004 11:49:00 -0400, Ted Green wrote:

>I do plan on releasing a VEDIT 6.13 very soon; I will
>include the "latest and greatest" version of these macros.

Ted:

Can you give us a hint please:
Is "very soon" == "this week" or later?

Knowing that would make life easier while finishing the macro.

Christian

Topic:	Re: Unicode to ANSI conversion (82 of 103), Read 135 times
Conf:	Converting, Translating
From:	Ted Green
Date:	Wednesday, July 28, 2004 03:52 PM

At 11:30 AM 7/28/2004, you wrote:
>Can you give us a hint please:
>Is "very soon" == "this week" or later?

I estimate it is next week.
I will post Tom's completed work in a few days.

Ted.

Topic:	Re: Unicode to ANSI conversion (77 of 103), Read 136 times
Conf:	Converting, Translating
From:	Christian Ziemski
Date:	Tuesday, July 27, 2004 12:20 PM

On Tue, 27 Jul 2004 11:37:00 -0400, Pauli Lindgren wrote:

>Why are you testing line-ends in the first place?

Because Ian came up with it.
I only wrote some more code around that idea...

>There are quite many combinations to test.
>
>Wouldn't it be easier to check, say, the first 50 unicode
>characters, and count how many zeroes are in even and
>odd byte positions?

That should be o.k. too, yes.
At least with UTF files in languages like English and German.

But since the macro translates UTF16-to-ANSI there is only little
chance to have files in Chinese or so ;-)

Your way seems to be sufficient, and so the better one.
Additionally I like it due to its ability to test small files.

If Ian also agrees, I'll use it in utf-ansi.vdm.

Or do you want to implement it there, Pauli?

Christian

Topic:	Re: Unicode to ANSI conversion (78 of 103), Read 143 times, 1 File Attachment
Conf:	Converting, Translating
From:	Christian Ziemski
Date:	Wednesday, July 28, 2004 12:59 AM

Here is the test macro with Pauli's algorithm.

I added a Min() command to be able to check files smaller than 50
characters too.

Christian

UTF-CHECK2.VDM (3KB)

Topic:	Re: Unicode to ANSI conversion (79 of 103), Read 142 times
Conf:	Converting, Translating
From:	Ian Binnie
Date:	Tuesday, July 27, 2004 09:12 PM

On 7/27/2004 12:20:20 PM, Christian Ziemski wrote:
>On Tue, 27 Jul 2004 11:37:00 -0400,
>Pauli Lindgren wrote:
>
>>Why are you testing line-ends in the first place?
>
>Because Ian came up with it.
>I only wrote some more code around that
>idea...

I originally was looking at a way to detect Windows UTF-16LE text files without BOM - mostly log files.

The logic was to look for CR/LF to ensure that it was a TEXT file, at the same time checking byte order.

The macro has moved on a long way since then. You tend to get stuck in the original rut.

>>Wouldn't it be easier to check, say, the first 50 unicode
>>characters, and count how many zeroes are in even and
>>odd byte positions?

This logic assumes that the file is Unicode, but would be a better way of distinguishing between LE & BE.

It would probably fail if applied to random files, e.g. .exe.

>That should be o.k. too, yes.
>At least with UTF files in languages
>like English and German.
>
>But since the macro translates
>UTF16-to-ANSI there is only little
>chance to have files in Chinese or so
>;-)

Agreed - if you are not using CP1252 the macro (and indeed Vedit) would not be much use.

>Your way seems to be the sufficient and
>so the better one.
>Additionally I like it due to its
>ability to test small files.
>
>If Ian also agrees, I'll use it in
>utf-ansi.vdm.

This seems like a better approach. I have looked at Christian's latest, and it looks OK, and works, but is a bit complex.

I would still like to include text file checking.
Maybe that could be part of the file mode check.

Topic:	Re: Unicode to ANSI conversion (80 of 103), Read 137 times
Conf:	Converting, Translating
From:	Pauli Lindgren
Date:	Wednesday, July 28, 2004 11:26 AM

On 7/27/2004 9:12:03 PM, Ian Binnie wrote:
>
>This logic assumes that the file is Unicode, but would be a better way of
>distinguishing between LE & BE.
>
>It would probably fail if applied to random files, e.g. .exe.

In case of binary files, the algorithm would most likely return #104=0, i.e. it detects the file as "non Unicode file".
This even works with files that contain lots of NUL bytes, because there are tests for minimum counter values, too.

You could adjust the min/max limit values (by adjusting the multipliers) for best accuracy.
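
For readers without VEDIT, the zero-counting idea can be sketched in Python. This is only an illustration under stated assumptions, not the code from UTF-CHECK2.VDM: the function name `guess_utf16_order`, the 50-character sample, and the 0.8 threshold are all made up for the example. Offsets are 0-based here, so the "odd" bytes of the discussion above are the first byte of each pair.

```python
def guess_utf16_order(data: bytes, max_chars: int = 50, ratio: float = 0.8) -> str:
    """Guess the byte order of BOM-less UTF-16 text by counting zero bytes.

    Returns "LE", "BE", or "none" (probably not UTF-16 text).
    The threshold `ratio` is an assumed value, not taken from the macro.
    """
    # Look at no more than max_chars 16-bit units, fewer for small files.
    n_chars = min(max_chars, len(data) // 2)
    if n_chars == 0:
        return "none"

    # The first byte of each pair is the high byte in big-endian files
    # and the low byte in little-endian files.
    first_zero = sum(1 for i in range(n_chars) if data[2 * i] == 0)
    second_zero = sum(1 for i in range(n_chars) if data[2 * i + 1] == 0)

    needed = ratio * n_chars
    # Require one position to be mostly zero AND the other mostly non-zero,
    # so that files full of NUL bytes are not mistaken for text.
    if second_zero >= needed and first_zero <= n_chars - needed:
        return "LE"
    if first_zero >= needed and second_zero <= n_chars - needed:
        return "BE"
    return "none"
```

Checking both a minimum and a maximum count is what keeps binary files with long runs of NUL bytes from being reported as Unicode; raising or lowering `ratio` trades false positives against false negatives.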

--
Pauli

Topic:	Re: Unicode to ANSI conversion (83 of 103), Read 139 times
Conf:	Converting, Translating
From:	Ian Binnie
Date:	Thursday, July 29, 2004 03:44 AM

On 7/28/2004 11:26:56 AM, Pauli Lindgren wrote:
>On 7/27/2004 9:12:03 PM, Ian Binnie
>wrote:
>>
>In case of binary files, the algorithm
>would most likely return #104=0, i.e. it
>detects the file as "non Unicode file".
>This even works with files that contain
>lots of NUL bytes, because there are
>tests for minimum counter values, too.
>
I agree.

I wrote my comments after reading the post and before testing the macro (Christian's latest).
It seems to work well, and should not have false positives.

Topic:	Re: Unicode to ANSI conversion (84 of 103), Read 140 times
Conf:	Converting, Translating
From:	Christian Ziemski
Date:	Thursday, July 29, 2004 06:44 AM

On 7/29/2004 3:44:19 AM, Ian Binnie wrote:
>It seems to work well, and should not have false positives.

O.k. I'll put it into utf-ansi.vdm then.

Christian

Topic:	Re: Unicode to ANSI conversion (85 of 103), Read 141 times, 1 File Attachment
Conf:	Converting, Translating
From:	Christian Ziemski
Date:	Saturday, July 31, 2004 02:03 AM

I've put the new checking/determining routine into the main macro.

It's finished now - if that is really possible...

Nevertheless: Please do some final tests, folks!

Christian

UTF-ANSI(10).VDM (20KB)

Topic:	Re: Unicode to ANSI conversion (86 of 103), Read 142 times, 1 File Attachment
Conf:	Converting, Translating
From:	Ian Binnie
Date:	Sunday, August 01, 2004 10:03 PM

On 7/31/2004 2:03:57 AM, Christian Ziemski wrote:
>
>I've put the new
>checking/determining routine
>into the main macro.
>
>It's finished now - if that is
>really possible...
>
>Nevertheless: Please do some
>final tests, folks!
>
>
>Christian
>
This is quite elegant.

I ran a few tests and it works OK.

I found the display of untranslated characters confusing.
Previously I had viewed these in Notepad (as Unicode), and thought of formatting this as UTF-16, but decided HEX would be more useful in Vedit.

This is included in the attached.

UTF-ANSI(11).VDM (21KB)

Topic:	Re: Unicode to ANSI conversion (87 of 103), Read 139 times, 1 File Attachment
Conf:	Converting, Translating
From:	Christian Ziemski
Date:	Monday, August 02, 2004 03:16 AM

On 8/1/2004 10:03:36 PM, Ian Binnie wrote:

>I found the display of untranslated characters confusing.
>Previously I had viewed these in Notepad (as Unicode), and thought of formatting
>this as UTF-16, but decided HEX would be more useful in Vedit.

And I added a space between the two hex bytes now, do you agree?

Additionally I fixed some possible problems regarding file type and overwrite mode.

New version is attached.

Christian

UTF-ANSI(12).VDM (21KB)

Topic:	Re: Unicode to ANSI conversion (88 of 103), Read 143 times
Conf:	Converting, Translating
From:	Ian Binnie
Date:	Monday, August 02, 2004 08:08 AM

On 8/2/2004 3:16:58 AM, Christian Ziemski wrote:
>On 8/1/2004 10:03:36 PM, Ian Binnie
>wrote:
>
>>I found the display of untranslated characters confusing.
>>Previously I had viewed these in Notepad (as Unicode), and thought of formatting
>>this as UTF-16, but decided HEX would be more useful in Vedit.
>
>And I added a space between the two hex
>bytes now, do you agree?

No problems

>
>Additionally I fixed some possible
>problems regarding file type and
>overwrite mode.

Again I do not see any problems, but feel that the code:-

#98=File_Check("|(VEDIT_TEMP)\utf-ansi.err")
...

is unnecessarily complex, and offers no benefit.

I find File_Open works as well, and is much clearer.

Topic:	Re: Unicode to ANSI conversion (89 of 103), Read 148 times, 1 File Attachment
Conf:	Converting, Translating
From:	Christian Ziemski
Date:	Monday, August 02, 2004 09:05 AM

On 8/2/2004 8:08:14 AM, Ian Binnie wrote:
>>
>>Additionally I fixed some possible problems regarding file type and overwrite mode.
>
>Again I do not see any problems,

For example it happened if I translated a file with utf-ansi and then {File, Reload}ed it for a next try.
Now the file has most often a file type of "record/binary" and is in overwrite-only mode. That prevented another translation run.

Perhaps it's a bit dependent on the VEDIT configuration.

>but feel that the code:-
>
>#98=File_Check("|(VEDIT_TEMP)\utf-ansi.err") ...
>
>is unnecessarily complex, and offers no benefit.
>
>I find File_Open works as well, and is much clearer.

You are absolutely correct.
Sometimes I'm running on the wrong and complicated road.

The new version is attached

Christian

UTF-ANSI(13).VDM (21KB)

Topic:	Re: Unicode to ANSI conversion (90 of 103), Read 156 times
Conf:	Converting, Translating
From:	Ian Binnie
Date:	Monday, August 02, 2004 08:39 PM

On 8/2/2004 9:05:41 AM, Christian Ziemski wrote:
>On 8/2/2004 8:08:14 AM, Ian Binnie
>wrote:
>>>
>>>Additionally I fixed some possible problems regarding file type and overwrite mode.
>>
>>Again I do not see any problems,
>
>For example it happened if I translated
>a file with utf-ansi and then {File,
>Reload}ed it for a next try.
>Now the file has most often a file type
>of "record/binary" and is in
>overwrite-only mode. That prevented
>another translation run.

Christian, I think you misinterpreted my colloquial English.

I saw no problem with your change. (I looked at the code, and ran some tests.)

I admit I have never experienced the issue you mention, but it is good to protect from this.

>The new version is attached

and works well.

Topic:	Re: Unicode to ANSI conversion (91 of 103), Read 166 times
Conf:	Converting, Translating
From:	Christian Ziemski
Date:	Tuesday, August 03, 2004 03:38 PM

Ian:

On Mon, 02 Aug 2004 20:39:00 -0400, you wrote:

>Christian, I think you misinterpreted my colloquial English.

No, it's o.k. I like dialog and discussion.
Working alone in one's chamber only for oneself is not so much fun.

And especially when programming, discussion is really constructive!

>I admit I have never experienced the issue you mention,
>but it is good to protect from this.

I'm trying to write safe code, perhaps sometimes a bit
over-complicated...

Fortunately Pauli, you or someone else is stopping me from time to
time. ;-)

Christian

Topic:	Re: Unicode to ANSI conversion (30 of 103), Read 136 times
Conf:	Converting, Translating
From:	Ian Binnie
Date:	Friday, June 25, 2004 06:57 PM

On 6/25/2004 8:09:41 AM, Christian Ziemski wrote:
>
>Of course I couldn't resist
>and still worked on the macro.
>
>The macro seems to be finished
>now...
>
Christian,

You are too fast.

I had thought of doing some of the changes you proposed, but you got there first.

I have tested your previous version, and it seems to work OK.

I had only one observation, that when there is an untranslatable Unicode character it leaves the Least Significant Byte, e.g. 2500 (Box Drawing Light Horizontal) becomes 00. My macro did the same.

It may be preferable to replace with a default. One of the characters which is not supported in Unicode e.g. 127 may be reasonable.
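
The suggested behaviour can be sketched in Python. This is not the macro's code: the function name `utf16le_to_cp1252` is hypothetical, the input is assumed to be well-formed UTF-16LE of even length without a BOM, and Python's own `cp1252` codec stands in for the macro's conversion table.

```python
def utf16le_to_cp1252(data: bytes, default: int = 0x7F) -> bytes:
    """Convert UTF-16LE text to Windows-1252, one character at a time.

    Characters with no Windows-1252 equivalent become `default`
    (127, as suggested above) instead of leaving a stray low byte.
    """
    out = bytearray()
    for ch in data.decode("utf-16-le"):
        try:
            out += ch.encode("cp1252")
        except UnicodeEncodeError:
            # No mapping: emit the default byte rather than e.g. 0x00.
            out.append(default)
    return bytes(out)
```

With this rule U+2500 comes out as a single 0x7F byte, so the converted text never contains a NUL left over from the low half of an unmapped character.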

Topic:	Re: Unicode to ANSI conversion (36 of 103), Read 137 times
Conf:	Converting, Translating
From:	Christian Ziemski
Date:	Saturday, June 26, 2004 02:36 AM

On Fri, 25 Jun 2004 18:57:00 -0400, Ian Binnie wrote:

>I had only one observation, that when there is an untranslatable
>Unicode character it leaves the Least Significant Byte i.e. 2500
>(Box Drawing Light Horizontal) becomes 00. My macro did the same.
>
>It may be preferable to replace with a default. One of the
>characters which is not supported in Unicode e.g. 127 may be
>reasonable.

That would require another if() in the loop.
But you are right, it's not good to leave a NULL byte in the text!

Christian

Topic:	Re: Unicode to ANSI conversion (38 of 103), Read 135 times
Conf:	Converting, Translating
From:	Christian Ziemski
Date:	Saturday, June 26, 2004 06:51 AM

Ian:

On Fri, 25 Jun 2004 18:57:00 -0400, you wrote:

>You are too fast.
>
>I had thought of doing some of the changes you proposed,
>but you got there first.

Oh, I'm really sorry. ;-)

But this weekend I'll hold back myself!
So you are safe.

Christian

Topic:	Re: Unicode to ANSI conversion (19 of 103), Read 135 times, 1 File Attachment
Conf:	Converting, Translating
From:	Christian Ziemski
Date:	Wednesday, June 23, 2004 05:28 PM

To answer my own question (or at least try it).

>Of course the Replace() command isn't
>able to see the character boundaries!
>...
>So that way of conversion seems to be not reliable :-((
>
>Any thoughts?

One solution could be to check whether the cursor is on a double-byte boundary. If not: ignore that occurrence.

I've done it with
if (Cur_Pos & 1) {
Char(1)
Continue
}
in my conversion loop.
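
The same boundary check, sketched in Python for illustration (the function name `replace_aligned` is hypothetical, and this is not a translation of the attached macro): a two-byte pattern is replaced only when the match starts at an even offset, and matches that straddle two characters are skipped.

```python
def replace_aligned(data: bytes, old: bytes, new: bytes) -> bytes:
    """Replace 2-byte sequences only where they start on an even offset."""
    if len(old) != 2 or len(new) != 2:
        raise ValueError("old and new must both be exactly 2 bytes")
    out = bytearray(data)
    pos = out.find(old)
    while pos != -1:
        if pos & 1:
            # Odd offset: the match straddles two characters.
            # Step one byte forward and keep searching.
            pos = out.find(old, pos + 1)
            continue
        out[pos:pos + 2] = new
        pos = out.find(old, pos + 2)
    return bytes(out)
```

For example, in the UTF-16LE bytes `AC 20 00 01 20 00` the pattern `20 00` first appears at offset 1, spanning two characters; only the occurrence at offset 4 is a real space and gets replaced.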

The new version is attached and has to be tested!!!

But not now. It's late here.

Christian

UTF_ANSI-Z(2).VDM (9KB)

Topic:	Re: Unicode to ANSI conversion (21 of 103), Read 132 times
Conf:	Converting, Translating
From:	Ian Binnie
Date:	Wednesday, June 23, 2004 08:57 PM

On 6/23/2004 1:00:51 PM, Christian Ziemski wrote:
>On Tue, 22 Jun 2004 23:31:00 -0400, Ted
>Green wrote:
>
>>Ian:
>>
>>You are correct; we should simply use the Windows API to perform the
>>conversion.
>>
>>Similarly we will build the UTF-16 to ANSI code into VEDIT later this
>>year, using the Windows API. We will still have a UTF-ANSI.VDM macro,
>>but we will replace your special conversion table (and Christian's
>>fancy implementation of it) with the Windows API.
>
>I interpret that as:
>
>"Don't spend much more work into the
>current UTF-ANSI.VDM.
> Only some cleaning up."
>
>Do you (Ian and Ted) agree?
>
>
>Christian

I agree.

I mainly got into this by a desire to view Unicode log files in Vedit and decided Windows-1252 was all I really needed (actually more than really necessary).

It is worthwhile tidying up the current macros, but not get too carried away.

There are other options for industrial strength conversions, and the API approach will allow Vedit to process quite large files efficiently, and handle different UTF codings.

I note in one of your other posts you ran into character boundary problems. I found the same when I tried to write an OEM => Unicode macro, which ran into circular translation which is insoluble with my simple Replace() approach.

Topic:	Re: Unicode to ANSI conversion (24 of 103), Read 135 times
Conf:	Converting, Translating
From:	Christian Ziemski
Date:	Thursday, June 24, 2004 04:54 PM

On Wed, 23 Jun 2004 20:57:00 -0400, Ian Binnie wrote:
>
>I mainly got into this by a desire to view Unicode log files in Vedit and
>decided Windows-1252 was all I really needed (actually more than really
>necessary).

Ian:
I don't have VC++6.0 for the sample UCONVERT.C you mentioned in
another posting. And in fact, I haven't needed Unicode (until now).

>It is worthwhile tidying up the current macros, but not get too carried away.

Yes, but it was an interesting task.

Christian

Topic:	Re: Unicode to ANSI conversion (8 of 103), Read 134 times
Conf:	Converting, Translating
From:	Ian Binnie
Date:	Tuesday, June 22, 2004 09:41 PM

On 6/22/2004 7:00:37 AM, Christian Ziemski wrote:
>On 6/22/2004 12:50:05 AM, Ted Green
>wrote:
>>
>Here is a try to do some speed
>optimization.
>The macro now needs two passes only.

I did explore other options, which seemed too complex.
This is a clever idea.
>
>But it seems not to be always faster.
>For small files Ian's version is faster,
>for big ones this new one.

I did try a few moderately large files, and found the speed acceptable, although I am also concerned at the multiple passes. This is of course because the files are effectively buffered in memory by Windows.

I did try some tests only performing passes on the Vedit buffered memory, but this did not seem to significantly improve things. PS the last line in the macro is probably superfluous now.

>Unfortunately I'm not able to do more
>tests now (and I don't have Unicode
>files with those special characters in
>it to test...).

If you have VC++6.0 there is a sample UCONVERT.C
This converts most Unicode options (even OEM => Unicode).
The user interface is terrible.

>
>The checking line in the main loop
> #104|=CC // check null byte
>should be tested too.
>It slows down the loop to produce a
>warning "only".
>Perhaps it makes sense to change/delete
>this?
>
>
>BTW: Ian has changed the initially
>NO/yes dialog to a YES/no dialog. That
>should be checked.

Christian, This was deliberate. I always found your default to be counter intuitive - this is not how most Windows dialog boxes work. Unfortunately I can't change the inbuilt OEM/ANSI.

For a long time I thought it didn't work, because I (hangs head in shame) didn't read the dialog box - like most users.

Topic:	Re: Unicode to ANSI conversion (11 of 103), Read 135 times
Conf:	Converting, Translating
From:	Christian Ziemski
Date:	Wednesday, June 23, 2004 03:57 AM

On 6/22/2004 9:41:27 PM, Ian Binnie wrote:

>I did try a few moderately large files, and found the speed acceptable,
>although I am also concerned at the multiple passes.
>This is of course because the files are effectively buffered in
>memory by Windows.

>I did try some tests only performing passes on the Vedit buffered
>memory, but this did not seem to significantly improve things.
>PS the last line in the macro is probably superfluous now.

Perhaps we should leave both ways in the macro:

if (File_Size < 10000000 ) {
// use Ian's multi-pass
} else {
// use Christian's dual-pass
}

Hopefully this evening I'll find time to test it a bit more.

>>BTW: Ian has changed the initially NO/yes dialog to a YES/no dialog.
>>That should be checked.

>Christian, This was deliberate.
>I always found your default to be counter intuitive
>- this is not how most Windows dialog boxes work.

It is not *my* default! I don't like it either.
But several dialogs in VEDIT dealing with big actions are that way.
And so I made the above note.

>Unfortunately I can't change the inbuilt OEM/ANSI.
>For a long time I thought it didn't work, because I (hangs head in
>shame) didn't read the dialog box - like most users.

Yes, I know that effect. It caught me too. ;-)

Christian

Topic:	Re: Unicode to ANSI conversion (14 of 103), Read 136 times
Conf:	Converting, Translating
From:	Ted Green
Date:	Wednesday, June 23, 2004 10:01 AM

At 03:58 AM 6/23/2004, you wrote:
>>>BTW: Ian has changed the initially NO/yes dialog to a YES/no dialog.
>>>That should be checked.
>
>>Christian, This was deliberate.
>>I always found your default to be counter intuitive
>>- this is not how most Windows dialog boxes work.
>
>It is not *my* default! I don't like it too.
>But several dialogs in VEDIT dealing with big actions are that way.
>And so I made the above note.

We can change the default button to be "Yes".

Ted.

Topic:	Re: Unicode to ANSI conversion (12 of 103), Read 135 times
Conf:	Converting, Translating
From:	Ian Binnie
Date:	Wednesday, June 23, 2004 05:06 AM

Christian,

I have had more time to study your macro.

This is quite clever.

There is a lot of unnecessary code in the Hex-Bin conversion.
When you are supplying your own table there is no need to test for lower case or non-hex characters. This won't make much difference to speed, but it will be shorter and clearer.

Topic:	Re: Unicode to ANSI conversion (13 of 103), Read 139 times
Conf:	Converting, Translating
From:	Christian Ziemski
Date:	Wednesday, June 23, 2004 07:26 AM

On 6/23/2004 5:06:10 AM, Ian Binnie wrote:
>
>There is a lot of unnecessary code in the Hex-Bin
>conversion. When you are supplying your
>own table there is no need to test for lower case or non-hex
>characters. This won't make much difference to speed, but
>it will be shorter and clearer.

Ian:

Feel free to change it. It's your macro. ;-)

Christian

Topic:	Unicode to ANSI conversion (92 of 103), Read 84 times, 1 File Attachment
Conf:	Converting, Translating
From:	Ian Binnie
Date:	Thursday, April 03, 2008 10:53 PM

I had always rejected UTF-8 to ANSI conversion as too hard, and suggested Vedit use the inbuilt API.

It occurred to me the UTF-8 to UTF-16 conversion is a simple algorithm, which could be followed by Unicode (UTF-16) to ANSI conversion.

I have written a standalone UTF-8 to UTF-16 macro, although this could be built in to the other converters.
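
To show why the algorithm is simple, here is a rough Python sketch of UTF-8 to UTF-16LE conversion. It is not a port of UTF8CONV.VDM; the function name `utf8_to_utf16le` is hypothetical, no BOM is written, and the error checks are deliberately minimal (overlong forms and encoded surrogates are not rejected).

```python
def utf8_to_utf16le(data: bytes) -> bytes:
    """Decode UTF-8 by hand and emit UTF-16LE (no BOM)."""
    out = bytearray()
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:                 # 0xxxxxxx: single byte
            cp, extra = b, 0
        elif 0xC0 <= b < 0xE0:       # 110xxxxx: one continuation byte
            cp, extra = b & 0x1F, 1
        elif 0xE0 <= b < 0xF0:       # 1110xxxx: two continuation bytes
            cp, extra = b & 0x0F, 2
        elif 0xF0 <= b < 0xF8:       # 11110xxx: three continuation bytes
            cp, extra = b & 0x07, 3
        else:
            raise ValueError(f"invalid lead byte at offset {i}")
        tail = data[i + 1:i + 1 + extra]
        if len(tail) != extra or any((t & 0xC0) != 0x80 for t in tail):
            raise ValueError(f"invalid continuation at offset {i}")
        for t in tail:
            cp = (cp << 6) | (t & 0x3F)   # each continuation adds 6 bits
        i += 1 + extra
        if cp >= 0x10000:            # outside the BMP: surrogate pair
            cp -= 0x10000
            units = (0xD800 | (cp >> 10), 0xDC00 | (cp & 0x3FF))
        else:
            units = (cp,)
        for u in units:
            out += u.to_bytes(2, "little")
    return bytes(out)
```

The output of this step can then be fed to a UTF-16 to ANSI converter, which is the two-stage approach described above.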

UTF8CONV.VDM (2KB)

Topic:	Unicode to ANSI conversion (93 of 103), Read 88 times, 1 File Attachment
Conf:	Converting, Translating
From:	Ian Binnie
Date:	Friday, April 04, 2008 08:18 AM

On 4/3/2008 10:53:01 PM, Ian Binnie wrote:
>I had always rejected UTF-8 to
>ANSI conversion as too hard,
>and suggested Vedit use the
>inbuilt API.
>
>It occurred to me the UTF-8 to
>UTF-16 conversion is a simple
>algorithm, which could be
>followed by Unicode (UTF-16)
>to ANSI conversion.
>
>I have written a standalone
>UTF-8 to UTF-16 macro,
>although this could be built
>in to the other converters.

Minor corrections and Error detection added.

UTF8CONV(1).VDM (3KB)

Topic:	Unicode to ANSI conversion (94 of 103), Read 76 times
Conf:	Converting, Translating
From:	Peter Rejto
Date:	Thursday, June 05, 2008 02:42 PM

Gentlemen,

Thank you for an interesting discussion. I have learned a lot.

I was also glad to see that my Vedit 6.15 does have the utf-ansi.vdm macro in the macros directory.

I also performed an experiment: I used my regedit.cmd command to export my Windows XP registry to a file, registry.reg_pr. (I added the _pr string to the extension to emphasize that this file is my experimental file.) Then I used the Vedit {Edit, Translate.Unicode(UTF-16) to ANSI } Menu command to translate the file to ANSI. I got the following error message:

This file contained 206 Unicode
character(s)
(0%) that could not be translated to ANSI.

Possibly the (0%) is an error. So I thought there was no harm in mentioning it.

I also looked at the file utf-ansi.err that it generated. The header line of this file refers to UTF-16LE. In other words it added the suffix "LE". One of the codes listed in this file was 6E 30.

I tried to look up this code on the website that you listed in the macro file. I did not succeed. I have a hunch that I am missing a hexadecimal number.

Thanks,

-peter

Topic:	Unicode to ANSI conversion (95 of 103), Read 85 times
Conf:	Converting, Translating
From:	Christian Ziemski
Date:	Thursday, June 05, 2008 04:45 PM

On 6/5/2008 2:42:49 PM, Peter Rejto wrote:
>
>I also performed an experiment: I used my
>regedit.cmd command to export my Windows XP registry to a
>file, registry.reg_pr.
>Then I used the Vedit {Edit, Translate.Unicode(UTF-16) to
>ANSI } Menu command to translate the file to ANSI. I
>got the following error message:
>
>This file contained 206 Unicode character(s)
>(0%) that could not be translated to ANSI.
>
>Possibly the (0%) is an error.

How big is your registry export?
I assume more than several MB.
206 characters are not so many then ...

Anyway, there seems to be an error:
It's calculated by:
#105 = File_Size
#104 = #121*100/#105 // #104= %(percentage) of unknown chars

But the filesize is the UTF source one. Twice too big...
So IMHO it should read:
#104 = #121*100/#88 // #104= %(percentage) of unknown chars
because #88 already contains the target filesize.

(Not really tested, I only looked at the old code for now.)

>I also looked at the file utf-ansi.err that it
>generated. The header line of this file refers to UTF-16LE.
>In other words it added the suffix "LE".

Yes, "Little endian".
It is described for example here:
http://en.wikipedia.org/wiki/Little_endian

PCs (Intel/AMD) are using this byte order.

Christian

Topic:	Unicode to ANSI conversion (98 of 103), Read 74 times, 1 File Attachment
Conf:	Converting, Translating
From:	Christian Ziemski
Date:	Saturday, June 07, 2008 12:14 PM

On 6/5/2008 4:45:09 PM, I wrote:
>On 6/5/2008 2:42:49 PM, Peter Rejto wrote:
>>
>>Possibly the (0%) is an error.
>
>Anyway, there seems to be an error:
>It's calculated by:
>#105 = File_Size
>#104 = #121*100/#105 // #104=%(percentage) of unknown chars
>But the filesize is the UTF source one.
>Twice too big...
>So IMHO it should read:
>[...]

I have to correct myself: The above was and is correct.

But I enhanced the output of the percentage:
If it's less than 1% it's displayed as exactly that ("<1%") and not as "0%".
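
The display rule can be sketched in Python with integer-only arithmetic, mirroring the macro's constraint. The function name `percent_label` and its parameters are hypothetical, not the macro's registers.

```python
def percent_label(unknown: int, total: int) -> str:
    """Format the share of untranslated characters using integer math only."""
    if total <= 0:
        return "0%"
    pct = unknown * 100 // total   # truncating integer division
    if pct == 0 and unknown > 0:
        # Some characters failed, but fewer than 1%: say so explicitly.
        return "<1%"
    return f"{pct}%"
```

With a few hundred failures in a file of many megabytes the truncated quotient is 0, which is why the earlier output read "0%".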

And I added the link Ian mentioned to the macro's comments.

Christian

UTF-ANSI(14).VDM (21KB)

Topic:	Unicode to ANSI conversion (99 of 103), Read 80 times
Conf:	Converting, Translating
From:	Peter Rejto
Date:	Sunday, June 08, 2008 01:18 AM

On 6/7/2008 12:14:05 PM, Christian Ziemski wrote:
>On 6/5/2008 4:45:09 PM, I wrote:
>>On 6/5/2008 2:42:49 PM, Peter Rejto wrote:
>>>
>>>Possibly the (0%) is an error.
>>
>>Anyway, there seems to be an error:
>>It's calculated by:
>>#105 = File_Size
>>#104 = #121*100/#105 // #104=%(percentage) of unknown chars
>>But the filesize is the UTF source one.
>>Twice too big...
>>So IMHO it should read:
>>[...]
>
>I have to correct myself: The above was
>and is correct.
>
>But I enhanced the output of the
>percentage:
>If it's less than 1% it's displayed as
>exactly that ("<1%") and not as "0%".
>
>And I added the link Ian mentioned to
>the macro's comments.
>
>Christian
>

Great,

Aha, the previous utf-ansi.vdm message was approximate and not exact. Possibly, I am the only Vedit user who is fussy about the approximate versus exact issue. I certainly do appreciate the more specific message.

I just saved utf-ansi(14). I am sure glad version 1 is also available. I plan to go back to it and use it as a tutorial on counters. If you know of a simpler example, let me know.

Now something related. I would like to save your message with the webboard message number. However, I could not. The reason is that the webboard "bottom" prompt no longer works on this message. In other words, webboard does not display the "bottom" messages in this thread. Hence I can not use the cursor to get the number of these messages.

-peter.

Topic:	Unicode to ANSI conversion (100 of 103), Read 86 times
Conf:	Converting, Translating
From:	Christian Ziemski
Date:	Sunday, June 08, 2008 02:21 AM

On 6/8/2008 1:18:11 AM, Peter Rejto wrote:
>
>Aha, the previous utf-ansi.vdm message
>was approximate and not exact. Possibly,
>I am the only Vedit user who is fussy
>about approximate versus exact issue.

Vedit only has integer arithmetic and no floating point!
So such calculations are always approximate.

>I just saved utf-ansi(14). I am sure
>glad version 1 is also available. I plan
>to go back to it and use it as a
>tutorial on counters. If you know of
>a simpler example, let me know.

Since utf-ansi is a relatively complex macro it may be a bad tutorial for simple things.
What do you exactly mean with "counters" here?

>Now something related. I would like to
>save your message with the webboard
>number. However, I could not.

No unusual behavior here. Perhaps your browser is somehow confused? Closing and clearing its cache may help.

Christian

Topic:	Unicode to ANSI conversion (101 of 103), Read 92 times
Conf:	Converting, Translating
From:	Ian Binnie
Date:	Sunday, June 08, 2008 08:19 AM

On 6/8/2008 2:21:32 AM, Christian Ziemski wrote:

>Vedit only has integer arithmetic and no
>floating point!
>So such calculations are always
>approximate.

No Excuse ;)

Even back in the days of 8 bit microprocessors I used integer arithmetic to perform high precision calculations rather than relying on approximate floating point packages.

I did this in assembler and c, and it could be done in vedit.
Not that I think it is needed in this case.

Topic:	Unicode to ANSI conversion (102 of 103), Read 99 times
Conf:	Converting, Translating
From:	Christian Ziemski
Date:	Sunday, June 08, 2008 08:59 AM

On 6/8/2008 8:19:01 AM, Ian Binnie wrote:
>On 6/8/2008 2:21:32 AM, Christian Ziemski wrote:
>
>>Vedit only has integer arithmetic and no
>>floating point!
>>So such calculations are always approximate.
>
>No Excuse ;)

O.k. ;-)

>Even back in the days of 8 bit microprocessors I used integer
>arithmetic to perform high precision calculations rather than relying on
>approximate floating point packages.
>
>I did this in assembler and c, and it could be done in vedit.

That was tried some years ago:

In "VEDIT Macro Language Support"
"Simple math using N-registers" 4/16/2001

Direct link: http://webboard..../read?4016,30

>Not that I think it is needed in this case.

Me too!

Christian

Topic:	Unicode to ANSI conversion (103 of 103), Read 71 times
Conf:	Converting, Translating
From:	Peter Rejto
Date:	Tuesday, June 17, 2008 10:33 AM

On 6/7/2008 12:14:05 PM, Christian Ziemski wrote:

>But I enhanced the output of the
>percentage:
>If it's less than 1% it's displayed as
>exactly that ("<1%") and not as "0%".
>
>And I added the link Ian mentioned to
>the macro's comments.

Christian,

I tried your macro, UTF-ANSI(14).VDM and it works like a charm. Here are the details:

1.: On my AMD ATHLON 64, 3000+ machine, running Windows-XP-64x, I translated my registry file. This file was 62 MB and it took about 90 seconds to run. (This time I got the message that 429 characters were not translated (<1%).)

2.: On my AMD ATHLON 3200+ machine, running Windows-XP, it also took 90 seconds to do the same translation. (In other words what I lost on the CPU speed, I gained back on the 64 bit operating system.)

Thanks again to both of you for this remarkable macro.

-peter

Topic:	Unicode to ANSI conversion (96 of 103), Read 79 times
Conf:	Converting, Translating
From:	Ian Binnie
Date:	Thursday, June 05, 2008 09:32 PM

On 6/5/2008 2:42:49 PM, Peter Rejto wrote:
>I also performed an
>experiment: I used my
>regedit.cmd command to export
>my Windows XP registry to a
>file, registry.reg_pr.
>
>This file contained 206
>Unicode
> character(s)
>(0%) that could not be
>translated to ANSI.

The XP registry contains lots of character strings which won't map. These won't cause you any problems, as they are International strings for Asian/Eastern language support.

>
>I also looked at the file
>utf-ansi.err that it
>generated. The header line of
>this file refers to UTF-16LE.
>In other words it added the
>suffix "LE". One of the codes
>listed in this file was 6E 30.
>
>I tried to look up this code
>on the website that you listed
>in the macro file. I did not
>succeed. I have a hunch that I
>am missing a hexadecimal
>number.

The links are for WGL4, which only supports Western European languages.

If you want the full reference look at http://www.unicode.org/charts

306E is a Hiragana character
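
The lookup works because utf-ansi.err lists the bytes in file order, and in UTF-16LE the low byte comes first, so "6E 30" has to be read as 306E. A small Python sketch of that step (the helper name `describe_utf16le_pair` is hypothetical; the name lookup uses Python's standard `unicodedata` module):

```python
import unicodedata


def describe_utf16le_pair(hex_pair: str) -> str:
    """Turn a little-endian hex byte pair such as "6E 30" into a code point."""
    raw = bytes.fromhex(hex_pair)               # "6E 30" -> b"\x6e\x30"
    if len(raw) != 2:
        raise ValueError("expected exactly two hex bytes")
    code_point = int.from_bytes(raw, "little")  # low byte first -> 0x306E
    name = unicodedata.name(chr(code_point), "<unnamed>")
    return f"U+{code_point:04X} {name}"


print(describe_utf16le_pair("6E 30"))  # U+306E HIRAGANA LETTER NO
```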

Topic:	Unicode to ANSI conversion (97 of 103), Read 81 times
Conf:	Converting, Translating
From:	Peter Rejto
Date:	Friday, June 06, 2008 12:13 AM

A big thank you to each of you gentlemen,

My original registry file was about 20 MB and your macro went through it in less than a minute. At the same time, the resulting file is about half the size of the original one. I tried to visually compare the Notepad version of the original file and the Vedit version of the translated file. They look pretty much the same.

This was my first experience with a huge file. Now I understand what you mean by saying that Vedit works with huge files.

Finally, an uninformed question. Would it be possible to display the cursor position on the status line? I have a hunch that it would convey the feeling that the macro is working hard.

Also, thanks to each of you for the .org reference.
I shall add it to my version of your macro.

-peter