Friday, July 29, 2011

Fun with Mail (part 2)

Some more details on the mail filtering I'd talked about here. To recap, I'd set up a Linux server in the corner of the laundry room, and was using it for IMAP, with Thunderbird on a Mac as the client. But I was finding Thunderbird's filtering very unreliable. I guessed that might be because it was IMAP rather than local folders. So I decided to do some filtering in Smalltalk. I set up a cron job as

export VISUALWORKS=/home/aknight/vw7.7.1nc
/home/aknight/vw7.7.1nc/bin/linux86/visual /home/aknight/bin/imap.im -nogui -evaluate "10 seconds wait. Net.Filter new run"

To do the filtering, I wrote a simple class called Filter. I put it in the Net namespace because that way it would see all the Net classes I wanted to use, and because I was too lazy to make my own namespace for just one class.

When one of these filter objects is created we also set up an IMAP client, as

"Security.X509.X509Registry default 
   addTrusted: Security.X509.AlansGlobal."
"client useSecureConnection."
client := IMAPClient host: '192.168.1.5'.
[client connect] 
   on: Security.SSLWarning 
   do: [:ex | ex proceed].
client user: (Net.Settings defaultIdentity).
client login.
client select: 'Inbox'.

You'll notice the first two statements are commented out, because after I'd set up the security I decided I didn't really need it for a process running on the same machine, within my home network. But I left the code there because it might be important in other circumstances. The remaining lines create an IMAPClient, tell it which host to use, connect (proceeding past any SSL warning), tell it to use the identity that I'd entered in the settings, have it log in to the server, and issue the select: command to look at the Inbox.

One thing that's important is that when we're done, we should be careful to close the connection, doing

client close.
client disconnect.

or else the server gets too many connections after a while and complains. Not everything has a nice garbage collector to clean up for us.
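One way to make sure that cleanup always happens, even when a filter blows up partway through, is to wrap the work in ensure:. The following is only a workspace sketch of that idea, not the actual Filter code: the log collection and the deliberate 1 / 0 are stand-ins, and in the real thing the ensure: block would hold the client close and client disconnect calls.

```smalltalk
| log |
log := OrderedCollection new.
[[log add: 'run filters'.
  1 / 0]                      "stand-in for a filter that fails"
    ensure: [log add: 'close'; add: 'disconnect']]
        on: ZeroDivide
        do: [:ex | ex return: nil].
log
```

Evaluating this leaves log holding 'run filters', 'close' and 'disconnect', in that order: the ensure: block runs while the stack unwinds, so the cleanup is not skipped when the work fails.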

Once we're connected, we need to get the messages.

messages
  | unseen tempMessages result notDeleted |

  unseen := client searchMessages: 'UNSEEN'.
  notDeleted := client searchMessages: 'NOT DELETED'.
  notDeleted isEmpty ifTrue: [^Dictionary new].
  tempMessages := client fetchMessages: notDeleted.
  client markAsUnSeen: unseen.
  result := Dictionary new.
  tempMessages do: [:each |
    result at: 
      (Integer readFrom: each key readStream) 
      put: each value].
  ^result.

The searchMessages: API lets you search on the server by particular criteria, which are pretty self-evident. One thing that I'm working around here is that using these APIs marks the messages as read. So what I'm doing is finding all the unread messages and keeping a list of their ids, then fetching all of the non-deleted messages, and then marking the ones that were previously unread as unread again. Not very elegant, but it worked OK. There's probably a race condition there if new messages arrive in between the steps, but the worst thing that happens is that those messages show up as read when they haven't been.

Once we've got all the messages, we loop over them and run filters. Much of the code for that is actually error handling. Martin Kobetic, who wrote a lot of our Net code, says that spam is a wonderful source of edge cases for the various protocols and formats. The main part of the loop looks like

messages keysAndValuesDo: [:key :eachMessage |
   message := [[MailMessage readFrom: eachMessage first readStream]
      on: KeyNotFoundError
      do: [:ex | ex receiver = StreamEncoder encoderDirectory
         ifTrue: [#undecodeablejunk]
         ifFalse: [ex pass]]]
      on: ParsingSimpleBodyError
      do: [:ex | #undecodeablejunk].
   message = #undecodeablejunk ifTrue: [
      Transcript cr; show: index printString, ' is undecodeable'.
      WindowingSystem isHeadless ifFalse: [eachMessage inspect]].

Messages are keyed by integers (message number in the particular mailbox) on the server. So we loop over the key (message number) and the message itself. Well, the message is actually an array with one element with the message body. We need to read that and extract the various header fields. But the message might be in an encoding we don't have. That comes up as a KeyNotFoundError, meaning we didn't find the encoding name, say, Big10. I chose to interpret that as meaning the message wasn't important, so I just return the special symbol #undecodeablejunk and log it. If I'm running interactively, I inspect the message, so I can validate that. I did have some valid messages get flagged as junk that way, but not a lot.

Even if we've got the encoding, the message may be malformed in interesting ways, and we may get a ParsingSimpleBodyError, so I catch that and also mark things non-decodeable.

Then we want to actually run the filters. I defined a pragma for filters, filter:, whose argument determines the order in which they run, so what I have is a bunch of methods that look like

filtervwnc
   "self new run"
   
   ^self matchRecipient: 'vwnc@cs.uiuc.edu' andMoveTo: 'INFO.vwnc'

where matchRecipient:andMoveTo: looks like

matchRecipient: recipient andMoveTo: mailbox
   message to, message cc do: [:eachRecipient |
      ('*', recipient, '*' match: eachRecipient) ifTrue: [
         ^self moveTo: mailbox]]
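
To see what that wildcard test does on its own, here is a small workspace sketch. The recipient strings are made up for illustration; the only thing taken from the method above is the way the pattern is built and sent match:.

```smalltalk
| pattern |
"Same pattern construction as in matchRecipient:andMoveTo:"
pattern := '*', 'vwnc@cs.uiuc.edu', '*'.
pattern match: 'VWNC list <vwnc@cs.uiuc.edu>'.    "true"
pattern match: 'someone@example.com'.             "false"
```

Wrapping the address in asterisks means it matches whether the header holds a bare address or one with a display name and angle brackets around it.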

The filters are run by iterating over the collection of pragmas we get from

filters
   ^(Pragma allNamed: #filter: from: self class to: self class)
      sorted: [:a :b | (a argumentAt: 1) <= (b argumentAt: 1)].

If any filter's return value is the symbol #stop then we don't run any other filters; otherwise we keep going until the end. So, for example, I put in a filter that says if an email was directly addressed to one of my email addresses, don't run any of the other filters and leave it in the Inbox. And once any filter has tagged a particular message, we move the message to the appropriate place and then stop.
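
The run-in-order-until-#stop logic can be sketched in a workspace without any of the mail machinery. This is an illustration under assumptions, not the real loop: the blocks stand in for filter methods, the numbers stand in for the pragma arguments, and the names are invented.

```smalltalk
| ran filters |
ran := OrderedCollection new.
filters := OrderedCollection new.
"Each entry is order -> block; the blocks are hypothetical stand-ins for filter methods."
filters add: 30 -> [ran add: 'catch-all'. nil].
filters add: 10 -> [ran add: 'direct'. nil].
filters add: 20 -> [ran add: 'vwnc'. #stop].
"Run the filters in ascending order; detect: quits at the first one answering #stop."
(filters asSortedCollection: [:a :b | a key <= b key])
    detect: [:each | each value value == #stop]
    ifNone: [nil].
ran
```

Afterwards ran holds 'direct' and 'vwnc' only. The 'catch-all' entry sorts last and never runs, because the filter before it answered #stop.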

Finally, there's moving the messages. The actual move is just a copy and delete in terms of IMAP operations.

move: messageIdentifier to: mailbox

 | result1 result2 |
 result1 := client copy: messageIdentifier to: mailbox.
 result2 := client markForDelete: messageIdentifier.

But just to make doubly sure I'm not running into trouble I put in some checking ahead of that.

moveTo: mailbox

 | checkMessage |
 "First, check that the message is what we thought it was."
 checkMessage := (client fetchMessages: (Array with: index)) first value first.
 client markAsUnSeen: (Array with: index).
 (messages at: index) first = checkMessage ifFalse: [^#stop].
 self move: (Array with: index) to: mailbox.
 Transcript cr; show: 'Moving ', index printString, ' to ', mailbox.
 message isSymbol
     ifTrue: [Transcript cr; show: message]
     ifFalse: [Transcript cr; show: (message from first, '   ', message subject)].
 ^#stop.

In the end, with a bit of fighting with things that are hard to debug when the right thing just doesn't happen, I got this pretty much working. It had a few issues. One is that even though I was carefully running the filters a few seconds after each fetch, there was often some delay in the filters running, so I'd have things that should get redirected to mailing lists left in the Inbox for a couple of minutes. Another was that every once in a while it'd get stuck on a message that was malformed in a new and interesting way, and I had to go look at the error processing again. Some messages did get falsely caught - there are people sending legitimate emails who used some very peculiar encodings or header formats.

The biggest issue, though, is that in the end this proved mostly unnecessary, because I switched to an email client where the filters work on the Mac (Postbox), and it has a number of other advantages over Thunderbird as well. I'm still using the Smalltalk filtering. It has the advantage that it doesn't require the mail client on my main computer to be running in order for filtering to happen. I've switched some of the most common filters to just use Postbox's filtering, mostly the ones for mailing lists that generate a lot of traffic, but I've still got most of my filters in Smalltalk, and nowadays the need for me to check them is pretty rare. And it was definitely an interesting experience writing it.