Since we're currently just trying to stitch together In-Reply-To
and References headers we don't need that much sophistication.
It's when we later add full-text searching that GMime will be
useful.
So for now, even though my own code here is surely very buggy
compared to GMime it's also a lot faster. And speed is what
we're after for the initial index creation.
This is the beginning of the notmuch library as well, with its
interface in notmuch.h. So far we've got create, open, close, and
add_message (all with a notmuch_database prefix).
The current add_message function has already been whittled down from
what we have in notmuch-index-message to add only references,
message-id, and thread-id to the index, (that is---just enough to do
thread-linkage but nothing for full-text searching).
The concept here is to do something quickly so that the user can get
some data into notmuch and start using it. (The most interesting stuff
is then thread-linkage and labels like inbox and unread.) We can
defer the full-text indexing of the body of the messages for later,
(such as in the background while the user is reading mail).
The initial thread-stitching step is still slower than I would like.
We may have to stop using libgmime for this step as its overhead is
not worth it for the simple case of just parsing the message-id,
references, and in-reply-to headers.
This was for some time testing, (to see how fast xapian could be
if we were strictly adding documents and not doing any other IO
or computation). The answer is that xapian is quite fast, (on
the order of 1000 documents per second).
Of course, there's not much that this program does yet. It's got
some structure for some sub-commands that don't do anything. And
it has a main command that prints some explanatory text and then
counts all the regular files in your mail archive.
These were more significant than the previous leak because these were
in the loop and leaking memory for every message being parsed. It
turns out that g_hash_table_new should probably be named
g_hash_table_new_and_leak_memory_please. The actually useful function
is g_hash_table_new_full which lets us pass a free function, (to free
keys when inserting duplicates into the hash table). And after all,
weeding out duplicates is the only reason we are using this hash table
in the first place.
It almost goes without saying, valgrind found these leaks.
This was a single object in main outside any loops, so there was
no impact on performance or anything, but obviously we still want
to patch this.
Of course, valgrind gets the credit for seeing this.
When looking for a trailing ':' to introduce a quotation we peek at
the last character before a newline. But for blank lines, that's not
where we want to look. And when the first line in our buffer is a
blank line, we're underrunning our buffer. The fix is easy---just
bail early on blank lines since they have no terms anyway.
Thanks to valgrind for pointing out this error.
Previously, we used as the thread-id the message-id of the first
message in the thread that we happened to find. In fact, this is a
totally arbitrary identifier, so it might as well be random. And an
advantage of actually using a random identifier is that we now have
fixed-length thead identifiers, (and the way is open to even allow
abbreviated identifiers like git does---though we're less likely to
show these identifiers to actual users).
Instead of always showing the overall rate, we wait until the end
to show that. Then, on incremental updates we show the rate over the
last increment. This makes it much easier to actually watch what's
happening, (and it's easy to see the efect of xapian's internal
10,000 document flush).
We could (and probably should) reparse and index all the headers from
the embedded message, but I'm not choosing to do that now---I'm just
indexing the body of the embedded message.
As can be seen here, there are not a lot of differences. I've verified
this by using sup-sync to import a month of mail from the sup mailing
list, and comparing the database term-by-term, value-by-value, and
data-by-data with that created by notmuch. There are no differences
other than those documented here.
Here's another change which I'm making for sup compatibility against
my better judgment. It seems that sup never indexes content from
mime parts with content-disposition of attachment. But these
attachments are often very indexable, (for example, the first one
I encountered was a small shell script).
So I'll have to think a bit about whether or not I want to revert
this commit. To do this properly we would really want to distinguish
between attachments that are indexable, (such as text), and those
that aren't, (such as binaries). I know the mime-type alone isn't
alwas sufficient here as even this little plaintext shell script
was attached as octet-stream.
And if we wanted to get really fancy we could run things like antiword
to generate text from non-text attachments and index their output.
Instead of using the recursive "foreach" method, we implement our
own recursive function. This allows us to ignore the signature
component of a multipart/signed message, (which we certainly
don't need to index).
Ignoring this whitespace seems like a good idea to me, but it's
interfering with my comparisons with sup since sup doesn't do this.
This might be a commit worth dropping in the future since it exists
only for pedantic consistency with sup and not for any reason of its
own.
Here's another instance where I "knew" gmime must have support for
some functionality, but not finding it, I rolled my own. Now that
I found g_mime_references_decode I'm glad to drop my ugly code.
This cleans up some old code that was very ugly, (separately opening
the mail file and seeking to the end of the headers to parse the
body). I knew gmime must have had support for transparently decoding
mime content, but I just couldn't find it previously.
Note: Multipart and MultipartSigned parts are not handled yet.
Things are quite happy now. The few differences I see with sup are:
1. sup forces email address domains to lowercase, (I don't think I care)
2. sup and notmuch disagree on ordering of multiple thread_id values
(another thing that's of no concern)
We are still doing one thing wrong when a message belongs to multiple
threads. We've got a nice comma-separated thread-value just like sup,
but then we're also putting in a comma-separated thread-term where
sup does multiple thread terms. That should be an easy fix.
Beyond that, sup and notmuch are still disagreeing on the term lists
for some messages, (I think attachment vs. inline content-disposition
is at least one piece of this). But there are likley still differences
in the heuristics for which chunks of the message body to index. I'll
be looking into this more.
Currently we're looking up all parents (based on In-reply-to and
References header) and using the list of all thread_id values
from those as our thread_id value. We're missing one step which
sup does which is to also look up any children in the database
that have reference our message ID. So we'll need to do that next.
This is in preparation for doing a couple of passes over the references,
(one to add terms to the database, and a second to find the thread_id).
We also now parse the In-reply-to header which we were missing before.
We treat it identically to the References header.
We identify it based on a trailing ':' on the line before a quote
begins.
At this point the database-dump diff between sup and notmuch is
getting very, very small, (at least for our one test message).
At this point, we're achieving a result that is *very* close to
what sup does. The only difference is that we are still indexing
the "excerpts from message ..." line, and we are not yet indexing
references.
Most of this code is fairly clean and works well. One part is
fairly painful---namely extracting the body of an email message
from libgmime. Currently, I'm just extracting the offset to
the end of the headers, and then separately opening the message.
Surely there's a better way.
Anyway, with that the results are looking very similar to sup-sync
now, (as verified by xapian-dump). The only substantial difference
I'm seeing now is that sup does not seem to index quoted portions
of messages nor signatures. I'm not actually sure whether I want
to follow sup's lead in that or not.
At the same time, I've started hacking up sup with a new NotmuchIndex
class in the place of the previous XapianIndex class. The new class
stores only the source_info field in the document data, (rather than
a serialized ruby hash with a bunch of data that can be found in the
original message).
Eventually, I plan to replace source_info with a relative filename for
the message, (or even a list of filenames for when multiple messages
in the database share a common message ID).