Currently we're looking up all parents (based on the In-reply-to and
References headers) and using the list of all thread_id values
from those as our thread_id value. We're missing one step that
sup performs, which is to also look up any children in the database
that reference our message ID. So we'll need to do that next.
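
As a rough illustration of the two lookups described above (parents first,
then children), here is a minimal sketch. It is not the actual notmuch or
sup code: the `db` dictionary is a hypothetical stand-in for the database,
and the function name and record layout are assumptions made for the example.

```python
def resolve_thread_ids(message_id, parent_ids, db):
    """Collect thread_id values from known parents and known children.

    db is a hypothetical in-memory stand-in for the database: a dict
    mapping a message ID to {"thread_ids": set, "references": set}.
    """
    thread_ids = set()

    # Step 1: parents named in our In-reply-to / References headers.
    for parent in parent_ids:
        entry = db.get(parent)
        if entry is not None:
            thread_ids |= entry["thread_ids"]

    # Step 2: children already in the database that reference our ID.
    # A linear scan keeps the sketch short; a real database would
    # presumably use an indexed lookup instead.
    for entry in db.values():
        if message_id in entry["references"]:
            thread_ids |= entry["thread_ids"]

    return sorted(thread_ids)


example_db = {
    "parent@example.com": {"thread_ids": {"t1"}, "references": set()},
    "child@example.com": {"thread_ids": {"t2"}, "references": {"new@example.com"}},
    "other@example.com": {"thread_ids": {"t3"}, "references": {"parent@example.com"}},
}
found = resolve_thread_ids(
    "new@example.com", ["parent@example.com", "missing@example.com"], example_db
)
```

Here the new message picks up one thread_id from its parent and one from a
child that arrived before it. The unrelated message and the missing parent
contribute nothing.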
This is in preparation for doing a couple of passes over the references,
(one to add terms to the database, and a second to find the thread_id).
We also now parse the In-reply-to header which we were missing before.
We treat it identically to the References header.
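
To make "treated identically" concrete, here is a small sketch using
Python's standard email module rather than the parser we actually use. The
function name and the regular expression for message IDs are assumptions for
illustration only.

```python
import re
from email import message_from_string

# Assumed, simplified pattern: anything between angle brackets is an ID.
MESSAGE_ID = re.compile(r"<([^<>]+)>")


def parent_ids(raw_message):
    """Return message IDs from In-reply-to and References, without duplicates."""
    msg = message_from_string(raw_message)
    ids = []
    # Both headers go through exactly the same extraction.
    for header in ("In-Reply-To", "References"):
        for value in msg.get_all(header, []):
            for mid in MESSAGE_ID.findall(str(value)):
                if mid not in ids:
                    ids.append(mid)
    return ids


raw = (
    "From: a@example.com\n"
    "In-Reply-To: <b@example.com>\n"
    "References: <a@example.com>\n"
    " <b@example.com>\n"
    "\n"
    "body\n"
)
ids = parent_ids(raw)
```

Header names are matched case-insensitively, and an ID that appears in both
headers is kept only once.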
We identify the line that introduces a quote by a trailing ':' on
the line just before the quote begins.
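
A minimal sketch of that heuristic, assuming the goal is to leave the
introductory line out of the index while keeping everything else. The
function name, the use of '>' as the quote marker, and the whitespace
handling are all assumptions for the example.

```python
def drop_quote_introductions(lines):
    """Drop any line that ends in ':' and is directly followed by a quote."""
    kept = []
    for i, line in enumerate(lines):
        next_line = lines[i + 1] if i + 1 < len(lines) else ""
        introduces_quote = (
            line.rstrip().endswith(":")
            and next_line.lstrip().startswith(">")  # assumed quote marker
        )
        if not introduces_quote:
            kept.append(line)
    return kept


body = [
    "Hi,",
    "Excerpts from message of Mon Oct 19:",
    "> old text",
    "Note:",
    "not a quote",
]
indexable = drop_quote_introductions(body)
```

A line such as "Note:" survives because the line after it is not a quote.
The quoted text itself is left alone here.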
At this point the database-dump diff between sup and notmuch is
getting very, very small, (at least for our one test message).
At this point, we're achieving a result that is *very* close to
what sup does. The only difference is that we are still indexing
the "excerpts from message ..." line, and we are not yet indexing
references.
Most of this code is fairly clean and works well. One part is
fairly painful---namely extracting the body of an email message
from libgmime. Currently, I'm just extracting the offset to
the end of the headers, and then separately opening the message.
Surely there's a better way.
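
For reference, the offset-based approach looks roughly like the sketch
below. It does not use libgmime at all: the offset is found by searching for
the first blank line, as a stand-in for whatever the parser reports, and
both function names are made up for the example.

```python
def header_end_offset(data):
    """Return the byte offset just past the blank line that ends the headers."""
    candidates = []
    for sep in (b"\r\n\r\n", b"\n\n"):
        pos = data.find(sep)
        if pos != -1:
            candidates.append(pos + len(sep))
    # With no blank line at all, the whole message is headers.
    return min(candidates, default=len(data))


def read_body(path, offset):
    """Open the message file separately and read everything after the headers."""
    with open(path, "rb") as f:
        f.seek(offset)
        return f.read()
```

Typical use would be to compute the offset once while parsing the headers
and then call `read_body(path, offset)` when the body text is needed for
indexing.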
Anyway, with that the results are looking very similar to sup-sync
now, (as verified by xapian-dump). The only substantial difference
I'm seeing now is that sup does not seem to index quoted portions
of messages nor signatures. I'm not actually sure whether I want
to follow sup's lead in that or not.