Topic: UTF-ANSI.VDM with UTF-8 conversion (1 of 18), Read 49 times, 1 File Attachment
Conf: Converting, Translating
From: Pauli Lindgren
Date: Thursday, February 26, 2009 12:26 PM

I have now merged Ian's UTF8conv.VDM (from 2008-04-04), which converts UTF-8 into UTF-16, and the UTF-8 detection from Christian's UTF-8.vdm macro, with the utf-ansi.vdm macro.

Now you can use this single macro to convert both UTF-16 and UTF-8 files into ANSI or OEM.

I added radio buttons to the dialog box so that the user can manually select the input file format in case the automatic detection does not work correctly. This also makes the "force conversion" dialog box unnecessary.

In addition, I added an option to convert unknown characters into HTML numeric codes (such as &#1044; for Д).
Thus, if you are editing an HTML file, all the characters in the converted ANSI file are still readable in a web browser.
In addition, it would be possible to convert the file back to UTF without losing any characters.
(That would require similar modifications to ansi-utf.vdm. So far, I have done just some quick tests with Ian's new utf16-8conv.vdm macro.)
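The preserve-as-entities idea can be sketched outside VEDIT. This is illustrative Python only, not the macro's VDM code, and `latin-1` here stands in for the actual Windows ANSI code page:

```python
def to_ansi_with_entities(text: str) -> bytes:
    """Encode text for an 8-bit editor: characters outside the code page
    become HTML numeric character references such as &#1044;."""
    out = []
    for ch in text:
        try:
            out.append(ch.encode("latin-1"))   # stand-in for the ANSI code page
        except UnicodeEncodeError:
            out.append(f"&#{ord(ch)};".encode("ascii"))  # preserve in-band
    return b"".join(out)
```

For example, `to_ansi_with_entities("A\u0414")` yields `b"A&#1044;"`, so the Cyrillic Д survives as a reference a web browser still renders, and a reverse conversion can restore it.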


--
Pauli

 
UTF-ANSI(15).VDM (22KB)

 


Topic: UTF-ANSI.VDM with UTF-8 conversion (2 of 18), Read 31 times
Conf: Converting, Translating
From: Ian Binnie
Date: Thursday, February 26, 2009 04:07 PM

On 2/26/2009 12:26:35 PM, Pauli Lindgren wrote:
>I have now merged Ians
>UTF8conv.VDM (from 2008-04-04)
>that converts UTF-8 into
>UTF-16, and the UTF-8
>detection from Cristians
>UTF-8.vdm macro, with
>utf-ansi.vdm macro.
>
>Now you can use this single
>macro to convert both UTF-16
>and UTF-8 files into ANSI or
>OEM.

I had a similar thought, and have a macro that works, but am still tidying up and testing.

My original thought was to do a 2 pass conversion, as Pauli has done, but I realised that the UTF-8 translation could be done a character at a time, and included in the existing single pass loop.

I will post this shortly.

 


Topic: UTF-ANSI.VDM with UTF-8 conversion (3 of 18), Read 30 times, 2 File Attachments
Conf: Converting, Translating
From: Ian Binnie
Date: Thursday, February 26, 2009 05:57 PM

I have enhanced the original utf-ansi.vdm so it now converts UTF-8.

UTF-8 will be detected if BOM is found, and the user can force the conversion.

My original thought was to do a 2 pass conversion, but I realised that the UTF-8 translation could be done a Unicode character at a time, and included in the existing single pass loop.
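Decoding one code point per loop iteration is the core of the single-pass approach. As an illustration of the byte arithmetic only (Python, not VDM; well-formed input is assumed, as in the original, and the 4-byte branch corresponds to the extension discussed later in the thread):

```python
def next_code_point(buf: bytes, i: int):
    """Decode the UTF-8 sequence starting at offset i.
    Returns (code_point, offset_of_next_sequence)."""
    b0 = buf[i]
    if b0 < 0x80:          # 0xxxxxxx: plain ASCII, one byte
        return b0, i + 1
    if b0 < 0xE0:          # 110xxxxx: two-byte sequence
        return ((b0 & 0x1F) << 6) | (buf[i + 1] & 0x3F), i + 2
    if b0 < 0xF0:          # 1110xxxx: three-byte sequence
        return (((b0 & 0x0F) << 12) | ((buf[i + 1] & 0x3F) << 6)
                | (buf[i + 2] & 0x3F)), i + 3
    # 11110xxx: four-byte sequence (code points 0x10000-0x10FFFF)
    return (((b0 & 0x07) << 18) | ((buf[i + 1] & 0x3F) << 12)
            | ((buf[i + 2] & 0x3F) << 6) | (buf[i + 3] & 0x3F)), i + 4
```

Once the code point is in hand, the existing translation table can map it to ANSI or OEM in the same loop iteration, which is what makes the second pass unnecessary.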

I converted the main loop by using abbreviated commands and no whitespace as in the original, although I am not convinced this actually saves much in a single pass operation.

I have also included a longer version, with comments.

 
UTF-ANSIFULL.VDM (21KB)
 
UTF-ANSI(16).VDM (21KB)

 


Topic: UTF-ANSI.VDM with UTF-8 conversion (7 of 18), Read 36 times, 2 File Attachments
Conf: Converting, Translating
From: Pauli Lindgren
Date: Monday, March 02, 2009 12:45 PM

On 2/26/2009 5:57:05 PM, Ian Binnie wrote:
>
>My original thought was to do a 2 pass conversion, but I
>realised that the UTF-8 translation could be done a
>Unicode character at a time,and included in the existing
>single pass loop.

I had the same thought. Even if the speed is usually not an issue, it is unnecessary work to convert 8-bit characters to 16-bit and then back to 8-bit.

Here is my version of UTF-ANSI.VDM with Ian's single pass conversion.

I added 4 byte sequence support to the UTF-8 conversion to avoid data loss (even though 4 byte sequences are probably rare).

The UTF-8 part of the code is not "compressed" yet. Maybe that is not necessary with UTF-8? I used Search("|G") to quickly skip ASCII characters, which should speed things up in most cases.

In addition, here is the first implementation of ANSI-UTF.VDM that combines ANSI to UTF-16 and ANSI to UTF-8 conversions. It is still a two-pass conversion, and there is no support for UTF-16BE (maybe it is not needed?).

I have done some tests by converting UTF-8 files (many of which contain non-ANSI characters) into ANSI and back to UTF-8. In all the tests, the resulting file matches the original, so it seems that the conversions are working.
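That round trip can be modelled in Python (a sketch of the idea, not the macros themselves; `latin-1` stands in for the ANSI code page, and `&#NNNN;` is the Preserve encoding described earlier):

```python
import re

def utf8_to_ansi(data: bytes) -> bytes:
    """UTF-8 in, 8-bit out; non-ANSI code points become &#NNNN; references."""
    return "".join(ch if ord(ch) < 0x100 else f"&#{ord(ch)};"
                   for ch in data.decode("utf-8")).encode("latin-1")

def ansi_to_utf8(data: bytes) -> bytes:
    """Reverse conversion: restore the referenced code points."""
    text = re.sub(r"&#(\d+);", lambda m: chr(int(m.group(1))),
                  data.decode("latin-1"))
    return text.encode("utf-8")

original = "Grüße, Привет".encode("utf-8")
assert ansi_to_utf8(utf8_to_ansi(original)) == original  # round trip matches
```

(As Ian points out later in the thread, this is in-band signalling: a file that already contains a literal `&#NNNN;` string will come back altered.)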

--
Pauli

 
ANSI-UTF(2).VDM (15KB)
 
UTF-ANSI(17).VDM (23KB)

 


Topic: UTF-ANSI.VDM with UTF-8 conversion (8 of 18), Read 31 times, 1 File Attachment
Conf: Converting, Translating
From: Ian Binnie
Date: Tuesday, March 03, 2009 02:14 AM

On 3/2/2009 12:45:38 PM, Pauli Lindgren wrote:
>On 2/26/2009 5:57:05 PM, Ian Binnie
>wrote:
>Here is my version of UTF-ANSI.VDM with
>Ians single pass conversion.

>I added 4 byte sequence support to the
>UTF-8 conversion to avoid data loss
>(even if the 4 byte sequences are
>probably rare).

This looks like it translates 4 byte sequences into the correct Unicode code points (0x10000 - 0x10FFFF),
but unless you have an interest in ancient languages or the more esoteric CJK characters it seems pointless.

It is just extra, and obviously untested, code.

You make no effort to do the same for UTF-16, which encodes the same code points as surrogate pairs. These will be retained as 2 unconverted 16-bit codes.
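For reference, the UTF-16 counterpart of a 4-byte UTF-8 sequence is a surrogate pair; combining one into a code point is only a few lines. This is a hedged sketch of the standard arithmetic, not code from either macro:

```python
def combine_surrogates(high: int, low: int) -> int:
    """Combine a UTF-16 surrogate pair into a code point in 0x10000-0x10FFFF.
    high must be in 0xD800-0xDBFF, low in 0xDC00-0xDFFF."""
    assert 0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)
```

For example, the pair 0xD800, 0xDF48 combines to U+10348, the same code point a 4-byte UTF-8 sequence would carry.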

The original code was only intended to perform ANSI/OEM translation, and omitted code which would not be needed for this.


>The UTF-8 part of the code is not
>"compressed" yet. Maybe that is not
>necessary with UTF-8?

I really couldn't see the point, I think Christian just got into an optimisation mode.

>I used
>Search("|G") to quickly skip ASCII
>characters, which should speed things up
>in most cases.

This is a good idea, I have included in my version.

I still stand by my original comments.
In particular, the conversion to html, no matter how useful you may find it, is not Unicode/ANSI conversion but a separate task.

I also disagree with offering the user an option to change the conversion when it has been robustly determined, especially when it confuses the UI.

The single dialog box is a good idea, so I have merged the forced conversion box with the confirmation.

The macro appears almost identical to the original, with the addition of UTF-8.

PS I have added code for OEM code points 01-1F. These are the original IBM display codes which are still used.

>I have done some tests by converting
>UTF-8 files (many of which contain
>non-ANSI characters) into ANSI and back
>to UTF-8. In all the tests, the
>resulting file matches with the
>original, so it seems that the
>conversions are working.

I have test files for all ANSI, OEM & WGL4 code points which I use to test my macros (and other Unicode tasks).

 
UTF-ANSIFULL(1).VDM (23KB)

 


Topic: UTF-ANSI.VDM with UTF-8 conversion (10 of 18), Read 27 times
Conf: Converting, Translating
From: Pauli Lindgren
Date: Tuesday, March 03, 2009 01:39 PM

On 3/3/2009 2:14:24 AM, Ian Binnie wrote:
>
>This looks like it translates 4 byte
>sequences into the correct Unicode code
>point (0x10000 - 0x10FFFF),
>but unless you have an interest in
>ancient languages or the more esoteric
>CJK characters seems pointless.

Those characters may occur in a file, so it is better not to destroy them.

>You make no effort to do the same for
>UTF-16 which include the same code
>points as 4 byte sequences. These will
>be retained as 2 unconverted code
>sequences.

What do you mean by "unconverted"?
I believe the 2x2 byte UTF-16 character codes are handled just like any other UTF-16 characters. If you convert UTF-16 to ANSI and then back to UTF-16, you will get a file that matches the original file.

But I did not say this was a finished macro. There is still work to do.

>
>The original code was only intended to
>perform ANSI/OEM translation, and
>omitted code which would not be needed
>for this.

And that was a severe flaw in the old macro.
The main purpose of the macro is to enable editing Unicode files in Vedit, since Vedit does not have Unicode support.
But with the old macro, any file that contains non-ANSI characters is corrupted and therefore cannot be edited with Vedit.

>
>>The UTF-8 part of the code is not
>>"compressed" yet. Maybe that is not
>>necessary with UTF-8?
>
>I really couldn't see the point, I think
>Christian just got into an optimisation
>mode.

I wonder if someone has tested how much speed improvement the compression gives?

I guess the speed was critical when converting Registry files. But those are UTF-16, so the speed of UTF-8 conversion may not be as critical.

>I still stand by my original comments.
>In particular the conversion to html, no
>matter how useful you may find it is not
>Unicode/ANSI conversion, but a separate
>task.

And I totally and absolutely disagree with you.
Preserving the non-ANSI characters is NOT something that could be done separately. If the characters are destroyed in the conversion, they are already lost.

>
>I also disagree with offering the user
>an option to change the conversion when
>it has been robustly determined,
>especially when it confuses the UI.

You mean when BOM exists? That is why the radio buttons are grayed in that case (only it does not work).

--
Pauli

 


Topic: Re: UTF-ANSI.VDM with UTF-8 conversion (11 of 18), Read 28 times
Conf: Converting, Translating
From: Ted Green
Date: Tuesday, March 03, 2009 02:52 PM

At 01:40 PM 3/3/2009, you wrote:
>From: "Pauli Lindgren"
>
>On 3/3/2009 2:14:24 AM, Ian Binnie wrote:
>>
>>This looks like it translates 4 byte
>>sequences into the correct Unicode code
>>point (0x10000 - 0x10FFFF),
>>but unless you have an interest in
>>ancient languages or the more esoteric
>>CJK characters seems pointless.
>
>Those characters may occur in a file, so it is better not to destroy them.

Thank you for working out the details of this macro. I will assume that when done, this will replace the existing UTF macros.

BTW - Current VEDIT status:

Gabe has mostly finished the .CHM version of the online help.

Gabe has started working on creating an .msi installation program; mostly just evaluating InstallShield and other commercial products.

Ted.

 


Topic: UTF-ANSI.VDM with UTF-8 conversion (13 of 18), Read 21 times
Conf: Converting, Translating
From: Pauli Lindgren
Date: Tuesday, March 10, 2009 11:47 AM

On 3/3/2009 2:14:24 AM, Ian Binnie wrote:
>
>The macro appears almost identical to
>the original, with the addition of
>UTF-8.
>
>PS I have added code for OEM code points
>01-1F. These are the original IBM
>display codes which are still used.

Ian,

Some comments about your version of the macro:

- You moved the if(Is_Quiet) { Exit(12) } command out of the if block. As a result, in Quiet mode the macro does not convert files that have a BOM.

- If UTF-8 is detected with what you call "weak detection", your dialog box claims that this is not a Unicode file, even though it really is UTF-8.

- Is the UTF-16 detection "reliable"? At least not if the file is short and/or contains lots of non-ANSI characters.

- After UTF-8 conversion, the cursor is moved to a totally wrong place.

- An alert sound is given when the dialog box is opened, even if a BOM was found. This gives the false impression that something is wrong.


In addition to the above, there are the problems mentioned earlier, such as:

- No Preserve option, which means that all non-ANSI (or non-OEM) characters are destroyed.

- 4 byte UTF-8 characters are not handled at all (not even skipped). If the file contains them, the macro works incorrectly.

--
Pauli

 


Topic: UTF-ANSI.VDM with UTF-8 conversion (14 of 18), Read 21 times, 2 File Attachments
Conf: Converting, Translating
From: Pauli Lindgren
Date: Tuesday, March 10, 2009 11:57 AM

Here are new versions of UTF-ANSI.VDM and ANSI-UTF.VDM.

UTF-ANSI.VDM

- Added the OEM characters 0x01 to 0x1F in the table.

- I now have a separate dialog box in case there is a BOM.
However, this is not nice. It is confusing when the same operation suddenly opens a totally different dialog box.
Graying the radio buttons would be a much better option, if only it worked.

- Alert sound is given when trying to convert an empty file.

- If file is read only, show error dialog and exit.

- Re-organized the file format detection. BOM check and guessing the format done separately.
(These can be later moved to subroutines).

- UTF-16 test did not work well with UTF files that contain OEM characters (such as line drawing symbols).
Therefore I increased the number of characters to test from 50 to 200 and adjusted the constants (now only 40% of the characters need to be ASCII instead of 2/3).

- Created a "stronger" UTF-8 detection. Now it checks the first 5 non-ASCII characters. UTF-8 is assumed if all of those are valid UTF-8 characters, and at least one non-ASCII character is found.
In addition, the check now recognizes 4 byte sequences, too.
Added LOCAL parameter to the search so that it will not cause long delay in large files.
This routine could be used to validate a UTF-8 file by replacing "Repeat(5)" with "Repeat(ALL)" and removing the LOCAL option.

- Use Cur_Line to restore position instead of Cur_Pos. This works better when the file size changes.

- If "Preserve" option is not used, 4-byte UTF-8 characters are converted into character 127.
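The "stronger" UTF-8 detection described above (probe the first few non-ASCII bytes and require each to start a well-formed sequence) can be modelled in Python. This paraphrases the macro's logic rather than quoting it:

```python
def looks_like_utf8(buf: bytes, probes: int = 5) -> bool:
    """Check up to `probes` non-ASCII bytes; every one must begin a valid
    2-4 byte UTF-8 sequence, and at least one must be present."""
    i, found = 0, 0
    while i < len(buf) and found < probes:
        b = buf[i]
        if b < 0x80:                       # ASCII: skip quickly
            i += 1
            continue
        if 0xC2 <= b <= 0xDF:
            need = 1                       # 2-byte sequence
        elif 0xE0 <= b <= 0xEF:
            need = 2                       # 3-byte sequence
        elif 0xF0 <= b <= 0xF4:
            need = 3                       # 4-byte sequence
        else:
            return False                   # bare continuation or illegal lead
        tail = buf[i + 1:i + 1 + need]
        if len(tail) < need or any(not 0x80 <= t <= 0xBF for t in tail):
            return False                   # truncated or bad continuation
        found += 1
        i += 1 + need
    return found > 0
```

As in the macro, raising the probe count (or scanning the whole buffer) turns this heuristic into a validator.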


ANSI-UTF.VDM

- This version performs the conversion in a single pass. It was somewhat more complex than I thought.
The main loop of conversion is built in T-Reg(106). The content varies depending on the options selected.

- Added the OEM characters 0x01 to 0x1F in the table, and the necessary changes to the code.
CR and LF are not converted, so that line breaks will stay visible.

- Default values for dialog box options are set at the beginning of the macro for easy customization.

- If file is read only, show error dialog and exit.

- Use Cur_Line to restore position instead of Cur_Pos. This works better when the file size changes.

- Set file type to 0 only if file type >4


--
Pauli

 
ANSI-UTF(3).VDM (17KB)
 
UTF-ANSI(18).VDM (25KB)

 


Topic: UTF-ANSI.VDM with UTF-8 conversion (15 of 18), Read 21 times, 1 File Attachment
Conf: Converting, Translating
From: Ian Binnie
Date: Thursday, March 12, 2009 02:43 AM

I tried your new macros.
At least the prompts are in a more logical order.

I tried a round trip conversion, using your html code, and this failed.

My standard Windows Glyph List 4 test file contains the following:-
"£ pound sign [lira sign is (&#8356;)]"

This of course was incorrectly converted, a typical side effect of simple in-band conversions.


You questioned whether the UTF-16 detection is "reliable".
Yes it is, because the chance of false positives is low, although there are files which will generate a false negative - leading to a prompt.

The only files which will generate a false positive are those containing a preponderance of xx00 character sequences.
These are not common in Unicode (most blocks do not use these), and there are only 2 in WGL4.
A file with lots of '2500 - box drawings light horizontal' may generate a false positive.

NOTE that it is impossible to reliably detect file format, and all tests are statistical in nature.
It would be possible to produce a more robust test, but merely extending the existing test to 200 characters does not significantly increase the reliability (but doesn't hurt). It would make more sense to use the whole of the currently loaded data, which slightly but not significantly increases the reliability. The original limit of 50 was an historical accident, and the additional time to test the whole buffer is negligible.

You are correct to state that this test requires no more than 33% non-ASCII characters but weakening to 60% would increase the risk of binary files triggering false positives. You are presumably concentrating on the UTF-16LE vs UTF-16BE and not other file types.
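The statistical UTF-16LE test under discussion (count how many 16-bit units have a zero high byte, i.e. ASCII characters stored as xx 00 pairs) might look like this in Python, with the 40% threshold Pauli chose as the default. A sketch of the idea, not the macro:

```python
def looks_like_utf16le(buf: bytes, ascii_fraction: float = 0.40) -> bool:
    """True if at least `ascii_fraction` of the 16-bit units look like
    ASCII characters (low byte xx, high byte 00)."""
    units = len(buf) // 2
    if units == 0:
        return False
    # High bytes sit at the odd offsets in little-endian order.
    zero_high = sum(1 for i in range(1, units * 2, 2) if buf[i] == 0)
    return zero_high / units >= ascii_fraction
```

Lowering the threshold admits text with more non-ASCII characters (box drawing, Cyrillic, etc.) but, as noted above, also raises the risk that binary files trigger a false positive.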

A reliable test for UTF-8 is impossible, although the current test will detect certain illegal UTF-8 files. It will certainly allow some ANSI or OEM files to be incorrectly identified as UTF-8.


I still do not understand your fixation with 4 byte UTF-8 characters.
My original macro indicated that these are not supported.

You are extremely unlikely to encounter these, as they are not supported in standard Windows installations, and specialised software and fonts are needed to handle them. Anyone using these would almost certainly have suitable software, and would not be using an ANSI editor to edit them.

It probably does no harm to include the extra code. This looks OK, and probably doesn't slow the macro appreciably, but I have an aversion to code which will never be executed, and is untested.


I still maintain that the task of 'converting a Unicode file for editing in an ANSI editor' is NOT Unicode to ANSI file conversion, which is what I want to do, but Ted has expressed a desire for a common macro. It is unfortunate that vedit does not provide a convenient reliable method of passing parameters to macros (as discussed in another related thread), or of customising its menu. I have no desire to see a superfluous dialog box asking unnecessary questions, so have added an option to bypass this altogether. Maybe this should prompt if it can't determine file type, but this is not an issue for me.

I haven't really looked at ANSI-UTF.VDM, but unless this can be driven from a menu, bypassing prompts, I have no interest. (I will modify my version to bypass the existing annoying confirmation prompt. I put up with it for years, automatically hitting Return, because this seemed to be the vedit style, but there is no need to continue.)

I disagree with the inclusion of OEM codes 01-1F. I included these in the reverse translation, as there is an unambiguous Unicode to OEM mapping, but each OEM character maps to 2 Unicode characters, and the use of control codes is far more common.
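The ambiguity can be made concrete. In this hypothetical helper (sample CP437 assignments quoted from memory; verify against a CP437 table), Unicode-to-OEM is many-to-one and therefore well defined, while OEM-to-Unicode must pick either the control code or the glyph:

```python
# A few of the classic CP437 display glyphs for code points 01-1F.
GLYPH_TO_OEM = {0x263A: 0x01,   # WHITE SMILING FACE
                0x2665: 0x03,   # BLACK HEART SUIT
                0x266A: 0x0D,   # EIGHTH NOTE (0x0D is also carriage return!)
                0x2191: 0x18}   # UPWARDS ARROW

def unicode_to_oem(cp: int) -> int:
    """Unambiguous direction: both the control code and the CP437 glyph
    collapse onto the same OEM byte."""
    if cp < 0x20:
        return cp
    return GLYPH_TO_OEM[cp]

# Both U+0003 (control) and U+2665 (heart suit) map to OEM 0x03:
assert unicode_to_oem(0x0003) == unicode_to_oem(0x2665) == 0x03
```

Going the other way, a converter has to decide whether OEM 0x03 means END OF TEXT or ♥, which is the two-to-one problem Ian describes.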

PS You should include your name in these macros to share the blame.

 
UTF-ANSI(19).VDM (25KB)

 


Topic: UTF-ANSI.VDM with UTF-8 conversion (16 of 18), Read 21 times
Conf: Converting, Translating
From: Ian Binnie
Date: Thursday, March 12, 2009 02:45 AM

On 3/12/2009 2:43:28 AM, Ian Binnie wrote:
>I tried your new macros.
>At least the prompts are in a
>more logical order.
>
>I tried a round trip
>conversion, using your html
>code, and this failed.
>
>My standard Windows Glyph List
>4 test file contains the
>following:-
>"£ pound sign [lira sign
>is (&#8356;)]"

This was the html code - the Forum converted it.

 


Topic: Re: UTF-ANSI.VDM with UTF-8 conversion (17 of 18), Read 21 times
Conf: Converting, Translating
From: Ted Green
Date: Thursday, March 12, 2009 10:13 AM

At 02:44 AM 3/12/2009, vtech-convert Listmanager wrote:
>From: Ian Binnie
>
>I tried your new macros.
>At least the prompts are in a more logical order.
>
>I tried a round trip conversion, using your html code, and this failed.
>
>My standard Windows Glyph List 4 test file contains the following:-
>"£ pound sign [lira sign is (&#8356;)]"
>
>This of course was incorrectly converted, a typical side effect of simple in-band conversions.

Thank you Pauli and Ian for trying to work out the details of the Unicode conversions, as I have no experience with Unicode at all.

But, PLEASE be kind to each other. I will be very hurt to see two people get into a fight over VEDIT macros. :-O
The work each of you has done has really benefited VEDIT and its users.

You two seem to have very different ideas and preferences in how the conversion should be done. Perhaps a Checkbox (with memory) could accommodate these two styles.

Thanks again.

Ted.

 


Topic: Re: UTF-ANSI.VDM with UTF-8 conversion (18 of 18), Read 23 times
Conf: Converting, Translating
From: Ian Binnie
Date: Thursday, March 12, 2009 07:48 PM

On 3/12/2009 10:13:42 AM, Ted Green wrote:
>At 02:44 AM 3/12/2009, vtech-convert
>Listmanager wrote:
>>From: Ian Binnie
>>
>Thank you Pauli and Ian for trying to
>work out the details of the Unicode
>conversions, as I have no experience
>with Unicode at all.
>
>But, PLEASE be kind to each other. I
>will be very hurt to see two people get
>into a fight over VEDIT macros. :-O
>The work each of you had done has really
>benefited VEDIT and its users.
>
>You two seem to have very different
>ideas and preferences in how the
>conversion should be done. Perhaps a
>Checkbox (with memory) could accommodate
>these two styles.

Ted,

I hope my comments have been polite, although I confess to being passionate and some mistake my sense of humour for rudeness.

I see no problem in a robust exchange of views.

I wasn't going to post any more on this, but Pauli's last macro had a number of improvements, and I saw a way to improve it (as well as including a flag to turn off dialogues).

Of course line 155 should be:-
Repeat(EOB_Pos/2-1) { // Check characters in current buffer

The macro includes all of Pauli's code, even though he and I will never agree about dialogue boxes. I don't think he has written a macro without one, I have never used one.

 


Topic: UTF-ANSI.VDM with UTF-8 conversion (4 of 18), Read 32 times
Conf: Converting, Translating
From: Ian Binnie
Date: Thursday, February 26, 2009 06:00 PM

On 2/26/2009 12:26:35 PM, Pauli Lindgren wrote:
>Now you can use this single
>macro to convert both UTF-16
>and UTF-8 files into ANSI or
>OEM.
>
>I added radio buttons to the
>dialog box so that user can
>manually select the input file
>format in case the automatic
>detection does not work
>correctly. This also makes the
>"force conversion" dialog box
>unnecessary.
>
>In addition, I added an option
>to convert unknown characters
>into HTML numeric codes (such
>as &#1044; for Д).
>Thus, if you are editing an
>HTML file, all the characters
>in the converted ANSI file are
>still readable in a web
>browser.
>In addition, it would be
>possible to convert the file
>back to UTF without losing any
>characters.
>(That would require the
>similar modifications to
>ansi-utf.vdm. So far, I have
>done just some quick test with
>Ians new utf16-8conv.vdm
>macro.)

Pauli,

I tried this macro, and have a couple of comments.

1. The dialog box presents too much information, and detracts from its primary purpose of WARNING the user that he is about to perform a massive irreversible change to a file.

This is a matter of taste. You seem to like dialog boxes, I don't.
In my experience most people don't read the on screen information presented.

2. Your attempt to detect a UTF-8 file without BOM is clever, but weak (it only looks at the 1st graphic character).
I agree that failure of this test would guarantee that the file is NOT UTF-8.

I did not include a test in my macro, because I have never seen a UTF-8 file without a BOM.
It is almost impossible to detect a UTF-8 file - after all, a file containing only ASCII is valid UTF-8.
It is possible to test a file for conformance to UTF-8 (i.e. that it contains no illegal sequences).

3. You included a partial test for invalid UTF-8, from my original UTF8conv.VDM.
I decided to exclude this. The original macro did not test for invalid UTF-16 codes, and it seemed unnecessary for the purposes of converting the file - presumably from a valid source. It would be possible to construct a UTF validator if this was useful.
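A conformance check of the kind described here is easy to model in Python, since the standard library's strict UTF-8 decoder already rejects illegal lead bytes, truncated sequences, overlong encodings, and surrogates. A sketch of the concept, not a VDM validator:

```python
def is_conformant_utf8(data: bytes) -> bool:
    """True if data contains no illegal UTF-8 sequences.
    Note that a pure-ASCII file is, by definition, valid UTF-8."""
    try:
        data.decode("utf-8", errors="strict")
        return True
    except UnicodeDecodeError:
        return False
```

This is the distinction made above: conformance can be tested exactly, but a conformant file is not necessarily *intended* as UTF-8.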

 


Topic: UTF-ANSI.VDM with UTF-8 conversion (5 of 18), Read 30 times
Conf: Converting, Translating
From: Pauli Lindgren
Date: Friday, February 27, 2009 01:05 PM

On 2/26/2009 6:00:19 PM, Ian Binnie wrote:
>
>1. The dialog box presents too much
>information, and detracts from its
>primary purpose of WARNING the user that
>he is about to perform a massive
>irreversible change to a file.
>
>This is a matter of taste. You seem to
>like dialog boxes, I don't.
>In my experience most people don't read
>the on screen information presented.

How can there be too much information?

The important thing is that there IS a dialog box so that the user can cancel the operation for example if he accidentally pressed a wrong key.

But it is of course better if the dialog box tells what is going on and gives useful information. For example, it is better that the dialog box title is "Unicode to ANSI" instead of "Confirmation". And I think it is really useful to see which format the file is, and to have option to manually change the selection in case the automatic detection does not work.

And the option to preserve non-ANSI characters is really important. Maybe it should be ON by default?

If some people do not read the information, they can just click OK. Just like they do with the dialog that does not give any information.

I tried to gray the radio buttons in case there is BOM. (In that case it is probably not necessary to change the input format.) But that seems not to work.

>
>2. Your attempt to detect a UTF-8 file
>without BOM is clever, but weak (it only
>looks at the 1st graphic character).
>I agree that failure of this test would
>guarantee that the file is NOT UTF-8.

The detection is from Christian's macro. I, too, thought that it might be necessary to check more than one graphic character. But so far, it has worked correctly with all the files I have tested.

Anyway, it does not need to be foolproof. After all, the user can manually change the selection. And since the UTF-8 detection is done after the UTF-16 detection, UTF-16 files cannot cause false positives.

>
>I did not include a test in my macro,
>because I have never seen a UTF-8 file
>without BOM.

The only UTF-8 files I have seen that *do* contain BOM are those I have converted with Notepad.

UTF-8 is widely used in web pages. An HTML file does not contain a BOM. Instead, the format is detected by using HTTP headers or Meta tags.

In addition, I often edit Wiki pages with Vedit using Firefox plugin "It's All Text". The plugin fetches text from any text input box in a HTML form into Vedit, and automatically copies it back when you close the file.
The advantage compared to copy-paste is that UTF-8 is *not* converted to ANSI. When Windows does the conversion, all the non-ANSI characters are lost. Which is really bad. But the plugin does not generate BOM.

>
>3. You included a partial test for
>invalid UTF-8, from my original
>UTF8conv.VDM.

I included that just for debugging. I had some problems and thought there was something wrong with the UTF-8 file, but it was a bug elsewhere in the macro.
I removed most of my debug code before posting, but I forgot this one.

Some checking might be useful (without stopping the conversion), but maybe it slows down the operation too much.

--
Pauli

 


Topic: UTF-ANSI.VDM with UTF-8 conversion (6 of 18), Read 30 times
Conf: Converting, Translating
From: Ian Binnie
Date: Friday, February 27, 2009 06:59 PM

On 2/27/2009 1:05:24 PM, Pauli Lindgren wrote:
>On 2/26/2009 6:00:19 PM, Ian Binnie
>wrote:
>How can there be too much information?

You can't, but if the data is not telling the user anything new, it is not information (at least according to Information Theory), just noise.

>The important thing is that there IS a
>dialog box so that the user can cancel
>the operation for example if he
>accidentally pressed a wrong key.

Agreed.

>But it is of course better if the dialog
>box tells what is going on and gives
>useful information. For example, it is
>better that the dialog box title is
>"Unicode to ANSI" instead of
>"Confirmation".

I would agree with this, I have never actually looked at it.
Why not use the macro to replace Unicode with UTF-x?

I find even the current prompt annoying, but accept that this is how most vedit macros work.
If it was my macro it would not be there, because if I make a mistake I just reload the file.

> And I think it is really
>useful to see which format the file is,
>and to have option to manually change
>the selection in case the automatic
>detection does not work.

You should not have the option if BOM is found.

>And the option to preserve non-ANSI
>characters is really important. Maybe it
>should be ON by default?

Again this is a matter of taste.
It does NOT "preserve non-ANSI", but corrupts the file with html.

This is explained by your comments in other posts.
I mainly operate on files generated by programs.
UTF-8 is becoming more common for .NET applications.

Virtually all convert with no problems, but those which do not convert cleanly have a single symbol inserted, which is easily detected.
Html has nothing to do with these files, it would just make them harder to read.

You obviously use html files, so this makes sense for you.

>If some people do not read the
>information, they can just click OK.
>Just like they do with the dialog that
>does not give any information.

This is not the way people work, unfortunately.
Human Factors engineering puts the most important information first, and makes it more prominent.

If users habitually have to respond to a prompt, it becomes automatic.

Your comment also illustrates another difference in our modus operandi. I use the keyboard, and rarely touch the mouse. I prefer menus to dialog boxes for this reason.

If you are happy with the mouse, you can use Notepad, which supports Unicode, without all the problems of conversion.

Incidentally I find the vedit limitation on menu length frustrating, particularly when the error message is "INVALID MENU". Vedit needs to move beyond the Windows 3 single level menus, and allow sub-menus for Tools etc.

>I tried to gray the radio buttons in
>case there is BOM. (In that case it is
>probably not necessary to change the
>input format.) But that seems not to
>work.

Why not just have 2 dialog boxes?
After all if a BOM is found, the user can't improve things.

>>2. Your attempt to detect a UTF-8 file
>>without BOM is clever, but weak (it only
>>looks at the 1st graphic character).
>>I agree that failure of this test would
>>guarantee that the file is NOT UTF-8.
>
>The detection is from Christians macro.
>I, too, thought that it might be
>necessary to check more than one graphic
>character. But so far, it has worked
>correctly with all the files I have
>tested.

The test should be restricted to the currently loaded buffer.

>UTF-8 is widely used in web pages. A
>HTML file does not contain BOM. Instead,
>the format is detected by using HTTP
>headers or Meta tags.

As recommended by RFC 3629.

>In addition, I often edit Wiki pages
>with Vedit using Firefox plugin "It's
>All Text". The plugin fetches text from
>any text input box in a HTML form into
>Vedit, and automatically copies it back
>when you close the file.
>The advantage compared to copy-paste is
>that UTF-8 is *not* converted to ANSI.
>When Windows does the conversion, all
>the non-ANSI characters are lost. Which
>is really bad. But the plugin does not
>generate BOM.

Windows only converts to ANSI because vedit asks it to.
If vedit requested CF_UNICODETEXT this is what it would get.

If you use Notepad you will get Unicode.

 


Topic: UTF-ANSI.VDM with UTF-8 conversion (9 of 18), Read 34 times
Conf: Converting, Translating
From: Pauli Lindgren
Date: Tuesday, March 03, 2009 01:19 PM

On 2/27/2009 6:59:28 PM, Ian Binnie wrote:
>On 2/27/2009 1:05:24 PM, Pauli Lindgren
>>How can there be too much information?
>
>You can't, but if the data is not
>telling the user anything new it is not
>information (at least according to
>Information Theory), and just noise.

Are you suggesting that the useful information in the dialog "is not telling anything new"?

>If it was my macro it would not be
>there, because if I make a mistake I
>just reload the file.

You can reload the file only if you saved it before calling the macro. Most people do not save the file before making mistakes.

>
>> And I think it is really
>>useful to see which format the file is,
>>and to have option to manually change
>>the selection in case the automatic
>>detection does not work.
>
>You should not have the option if BOM is
>found.

It is useful to see the format even if there is BOM.

In addition, a different dialog box for that situation would mean a more complex macro. But if the graying option cannot be made to work in Vedit, maybe the different dialog box could be used.

>
>>And the option to preserve non-ANSI
>>characters is really important. Maybe it
>>should be ON by default?
>
>Again this is a matter of taste.
>It does NOT "preserve non-ANSI", but
>corrupts the file with html.

That is absolute bullshit.
The preserve option DOES preserve the information and allows reverse conversion without data loss. Without this option, all the non-ANSI characters are corrupted and destroyed irreversibly. So the truth is the exact opposite of what you claim.

Obviously you have some problem with HTML.
Would you be happier if some non-standard coding was used to store the non-ANSI characters? That was my original plan, but then I realized that using existing HTML coding has many advantages.
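The round trip Pauli describes maps directly onto what Python calls `xmlcharrefreplace`: characters outside the target code page become HTML numeric references such as `&#1044;`, and the reverse conversion restores them exactly. A minimal sketch, assuming cp1252 as the ANSI code page (a literal `&` in the source text would need escaping first for the reverse step to be fully safe):

```python
import html

text = "Дом café"

# Forward: non-cp1252 characters become HTML numeric codes,
# cp1252-representable ones (like é) are kept as-is.
ansi = text.encode("cp1252", errors="xmlcharrefreplace")
assert ansi == b"&#1044;&#1086;&#1084; caf\xe9"

# Reverse: expand the numeric codes back to Unicode — lossless.
restored = html.unescape(ansi.decode("cp1252"))
assert restored == text
```

With `errors="replace"` instead, the same text would come back as `??? café`, which is the irreversible loss being argued about.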

>I mainly operate on files generated by programs.
>UTF-8 is becoming more common for .NET applications.

Why do you think that if the file is "generated by programs", the data can be freely destroyed?
If the files contain non-ANSI characters, obviously those characters do contain some necessary information.

>Html has nothing to do with these files,
>it would just make them harder to read.

Why should HTML have anything to do with those files?
It is just a method used to preserve the information.
When you convert the file back to UTF, the HTML codes are replaced with UTF characters.

If html codes are "hard" to read, destroyed characters are IMPOSSIBLE to read. And more importantly, impossible to convert back to Unicode.

>If users habitually have to respond to a
>prompt, it becomes automatic.

That has nothing to do with this macro.
The habit of automatically accepting anything is caused by web browsers and other similar programs that constantly pop up all kinds of dialog boxes without the user asking for them. Often those dialog boxes really do not contain any usable information.

However, when the user enters a command to perform a complex operation, he expects to get some response.

>
>Your comment also illustrates another
>difference in our modus operandi. I use
>the keyboard, and rarely touch the
>mouse. I prefer menus to dialog boxes
>for this reason.

Why? You can select the options from a dialog box with the keyboard just as easily, and often with fewer key presses.

The menus have very little room, so it is not possible to include all the different options in them: for example, a separate menu item for every possible combination of Unicode conversion options.

>
>If you are happy with the mouse, you can
>use Notepad, which supports Unicode,
>without all the problems of conversion.

You are the one who favors Notepad. So feel free to use it.
And you can use Notepad with keyboard as well.

>
>Incidentally I find the vedit limitation
>on menu length frustrating, particularly
>when the error message is "INVALID
>MENU". Vedit needs to move beyond the
>Windows 3 single level menus, and allow
>sub-menus for Tools etc.

Sub-menus in the User and Tools menus have been on my wish list for ages, too. But I don't think this has anything to do with Windows 3. The other menus (File, Edit etc.) have had sub-menus all along, even in the DOS version.

>
>Windows only converts to ANSI because
>vedit asks it to.
>If vedit requested CF_UNICODETEXT this
>is what it would get.

This is not something specific to Vedit.
All applications that do not support Unicode have the same effect. And even if the application does support Unicode, pasting converts the text to ANSI if Unicode mode has not been selected in the application.

For example, in Notepad2, if you just paste Unicode text into an empty window, it is converted to ANSI. You must first manually create a new file and save it as UTF-xx; then you can paste the Unicode data correctly.

I wonder if it is even possible in Windows to paste as "raw", without needing to know in which format the data is.

>
>If you use Notepad you will get Unicode.

This is because Notepad internally uses Unicode even with 8-bit files. If it is an ANSI file, it is converted to Unicode; if it is a Unicode file, it stays in Unicode.

If you load a 100MB ASCII text file in Notepad, it uses at least 200MB of memory.
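The doubling is simply UTF-16 spending two bytes on every ASCII character. A quick sketch:

```python
# 100 ASCII characters occupy 100 bytes on disk as ANSI/ASCII...
ascii_text = "A" * 100

# ...but 200 bytes once held internally as UTF-16 (2 bytes/char).
utf16 = ascii_text.encode("utf-16-le")
assert len(utf16) == 2 * len(ascii_text)
```

Characters outside the Basic Multilingual Plane take four bytes (a surrogate pair), so 2x is the minimum for ASCII input, not a universal ratio.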

--
Pauli

 


Topic: UTF-ANSI.VDM with UTF-8 conversion (12 of 18), Read 33 times
Conf: Converting, Translating
From: Ian Binnie
Date: Tuesday, March 03, 2009 07:15 PM

Pauli,

This appears to be becoming personal, and nothing further will be gained by pursuing the detail.

I understand your position, I ask that you try to understand mine.
I know I am not going to change your mind.

After all, this is customisable. You could use your macro, regardless of what is included in Vedit, as could I.
It is a case of including the most appropriate tool in Vedit.

Just for the record, I wrote this macro to perform a specific task, which could not easily be done in any other application.
Specifically converting an OEM file to UTF-8.

I thought others might find it useful, so I shared it.

I use Vedit as a programming editor, and for its powerful macro language, which lets me write programs with less effort than rolling out a C++ program.

If I want to edit a Unicode file, I use a Unicode enabled editor.
I only use Vedit if I need to perform some more complex task, although I like its clean, uncluttered interface without popups and all the other Windows paraphernalia.

Vedit, unfortunately, is getting old; it hasn't had a substantial upgrade in 3 years (I was promised 7.0 in 2001).
It is no longer the universal tool it was.
It needs Unicode to be viable.

I have written/converted a number of programs to Unicode over the last few years.
It really isn't hard once you get started (although I admit I have been using C++, not assembler).