Topic: Eliminating duplicates in mail files with VEDIT (1 of 2), Read 28 times
Conf: VEDIT Macro Library
From: G. Wilson
Date: Wednesday, November 19, 2003 11:40 PM


Can VEDIT be used to remove duplicate messages from mail files?

The problem arises when messages are downloaded from the server more than once, or when mail files are concatenated. Mail packages such as Eudora are notoriously short of mail file utilities; for example, they have no duplicate eliminator.

Mail files (e.g. *.MBX, RFC822) can contain a mixture of:

- single mail messages (not duplicated);

- duplicated messages that are identical and contiguously repeating in blocks within the file;

- duplicated messages that are identical but interspersed throughout the file;

- duplicates where the message body is identical but the mail headers are not, e.g. the header may come from forwarded, cc or bcc mail, or may have been damaged or stripped (bad indexing etc.);

- duplicates where the header (or part of it) is identical but the message body is not.

Essentially, the editor needs to compare headers and message bodies separately and jointly.
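
To make the cases listed above concrete, here is a rough sketch in Python (not VEDIT macro code) of comparing headers and bodies separately and jointly. It assumes each message has already been cut out of the mail file as a single string and that the header ends at the first blank line. The function names are made up for illustration.

```python
import hashlib


def split_message(raw):
    """Split one raw message into (header, body) at the first blank line."""
    raw = raw.replace("\r\n", "\n")
    header, _, body = raw.partition("\n\n")
    return header, body


def digest(text):
    """Hash text after trimming trailing whitespace on each line."""
    lines = [line.rstrip() for line in text.strip().split("\n")]
    return hashlib.sha1("\n".join(lines).encode("utf-8")).hexdigest()


def classify(first, second):
    """Say how two messages relate: header and body are compared separately."""
    header1, body1 = split_message(first)
    header2, body2 = split_message(second)
    same_header = digest(header1) == digest(header2)
    same_body = digest(body1) == digest(body2)
    if same_header and same_body:
        return "identical"
    if same_body:
        return "same body, different header"
    if same_header:
        return "same header, different body"
    return "different"
```

This only compares whole headers. Matching on parts of a header (say, only Subject and Date) would need the header lines parsed individually.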

Are there any VEDIT macros that can tackle such a job? Or is there any comparatively easy automated way to remove these duplicates using VEDIT's editing facilities?

 


Topic: Eliminating duplicates in mail files with VEDIT (2 of 2), Read 40 times
Conf: VEDIT Macro Library
From: Pauli Lindgren
Date: Friday, November 21, 2003 05:31 AM

What type of duplicates do you want to delete?
Do you want to delete forwarded messages? Or messages that you received separately but that have identical contents (e.g. spam)?

Messages that are true duplicates should have an identical Message-Id header, so the Message-Id can be used to identify them.
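
Purely as an illustration of the idea, and outside VEDIT, the same check can be sketched in a few lines of Python. The sketch assumes the whole mailbox fits in one string and that each Message-Id header sits on a single line with its value. The function name is hypothetical.

```python
import re
from collections import Counter

# "Message-Id:" at the start of a line, followed by the ID on the same line.
# Header names are case-insensitive; the ID value itself is compared as-is.
MESSAGE_ID = re.compile(r"^Message-ID:[ \t]*(\S+)", re.IGNORECASE | re.MULTILINE)


def duplicate_message_ids(mailbox_text):
    """Return the sorted list of Message-Id values that occur more than once."""
    counts = Counter(MESSAGE_ID.findall(mailbox_text))
    return sorted(mid for mid, n in counts.items() if n > 1)
```

Note that this also picks up a "Message-Id:" line that starts a line inside a message body (e.g. in a forwarded message), and it misses IDs folded onto a continuation line, so the result is a list of candidates to check rather than a final answer.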

I made a simple VEDIT macro that finds duplicates by their Message-Id. To use it, first open your mailbox file in VEDIT, then run the macro. When it finishes, a new edit buffer is open containing the list of Message-Id headers that occur more than once. You can then use each header as a search string to find the duplicates in the actual mailbox file.

I tested this macro with Pegasus Mail. The macro code follows:
----

// Find duplicate mail messages

#80 = Buf_Num // #80 = Buffer for Mailbox file
Buf_Switch(#81=Buf_Free) // #81 = Buffer for message ID list
Ins_Newline
#82 = Reg_Free() // #82 = T-reg for tmp data
#83 = 0 // number of messages
#84 = 0 // number of duplicates

// Create list of message ID's
Statline_Message("Reading message ID's")
Buf_Switch(#80)
BOF
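Repeat(ALL) {
    // Search pattern assumed: "Message-Id:" at the start of a line ("|<")
    Search("|<Message-Id:",ERRBREAK)
    Reg_Copy(#82,1)         // copy the header line
    Buf_Switch(#81)
    Reg_Ins(#82)            // append it to the ID list
    Buf_Switch(#80)
    Line(1,ERRBREAK)
    #83++
}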

// Sort ID list
Statline_Message("Sorting")
Buf_Switch(#81)
Sort_Merge("1,200",0,File_Size,NOCOLLATE)

// Find duplicates (delete all lines except duplicates)
BOF
Repeat(ALL) {
    Reg_Copy_Block(#82,BOL_pos,EOL_pos)
    Line(1,ERRBREAK)
    if(Compare(#82,CASE) != 0) { // no duplicate
        Line(-1) // delete previous ID
        Del_Line(1)
    } else {
        #84++
    }
}

// Show info on status line
Num_Str(#83,#82)
Reg_Set(#82," messages,",APPEND)
Num_Str(#84,#82, APPEND)
Reg_Set(#82," duplicates",APPEND)
Statline_Message(@(#82))


---
Pauli