This is based on the old notmuch-index-message.cc from early in
the history of notmuch, but considerably cleaned up now that
we have some experience with Xapian and know just what we want
to index, (rather than just blindly trying to index exactly
what sup does).
This does slow down notmuch_database_add_message a *lot*, but I've
got some ideas for getting some time back.
Somehow this naming with an underscore crept in, (but only in the
private header, so notmuch.c was compiling with no prototype). Fix
to be the notmuch_thread_get_subject originally intended.
The previous functions were always called together, so we might as
well just have one function for this. Also, the reset() name was
poor, and prepare_iterator() is much more descriptive.
We want to be able to iterate over tags stored in various ways, so
the previous TermIterator-based tags object just wasn't general
enough. The new interface is nice and simple, and involves only
C datatypes.
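As a rough illustration of an iterator that involves only C
datatypes, here is a minimal sketch backed by a linked list. All
names below (tag_node_t, tags_iter_*, count_tag) are hypothetical
and are not the notmuch API; the has_more/get/advance shape is an
assumption about what "nice and simple" might look like.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical names throughout; this is not the notmuch API. */
typedef struct tag_node {
    const char *tag;
    struct tag_node *next;
} tag_node_t;

typedef struct {
    const tag_node_t *current;   /* NULL once iteration is finished */
} tags_iter_t;

/* Start iterating over a singly linked list of tags. */
static void
tags_iter_init (tags_iter_t *iter, const tag_node_t *head)
{
    iter->current = head;
}

/* Non-zero while there is still a tag to read. */
static int
tags_iter_has_more (const tags_iter_t *iter)
{
    return iter->current != NULL;
}

/* Return the current tag; only valid while has_more is true. */
static const char *
tags_iter_get (const tags_iter_t *iter)
{
    assert (iter->current != NULL);
    return iter->current->tag;
}

/* Step to the next tag, if any. */
static void
tags_iter_advance (tags_iter_t *iter)
{
    if (iter->current != NULL)
        iter->current = iter->current->next;
}

/* Example consumer: count how many tags equal `wanted`. */
static int
count_tag (const tag_node_t *head, const char *wanted)
{
    tags_iter_t iter;
    int count = 0;

    for (tags_iter_init (&iter, head);
         tags_iter_has_more (&iter);
         tags_iter_advance (&iter))
    {
        if (strcmp (tags_iter_get (&iter), wanted) == 0)
            count++;
    }
    return count;
}
```

The caller never sees how the tags are stored, so the same three
calls could sit on top of a different backing store.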
We've now got a new notmuch_query_search_threads and a
notmuch_threads_result_t iterator. The thread object itself
doesn't do much yet, (just allows one to get the thread_id),
but that's at least enough to see that "notmuch search" is
actually doing something now, (since it has been converted
to print thread IDs instead of message IDs).
And maybe that's all we need. Getting the messages belonging
to a thread is as simple as a notmuch_query_search_messages
with a string of "thread:<thread-id>".
Though it would be convenient to add notmuch_thread_get_messages
which could use the existing notmuch_message_results_t iterator.
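Building the "thread:<thread-id>" string mentioned above is a
one-liner. The helper below is a hypothetical sketch (the name
thread_query_string is not part of notmuch); the result would then
be handed to notmuch_query_search_messages by whatever route the
query API provides.

```c
#include <stdio.h>

/* Hypothetical helper: write "thread:<thread-id>" into buf.
 * Returns 0 on success, -1 if the string does not fit. */
static int
thread_query_string (char *buf, size_t size, const char *thread_id)
{
    int needed = snprintf (buf, size, "thread:%s", thread_id);

    return (needed >= 0 && (size_t) needed < size) ? 0 : -1;
}
```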
Now we just need an implementation of "notmuch show" and we'll
have something somewhat usable.
Instead of supporting multiple thread IDs, we now merge together
thread IDs if one message is ever found to belong to more than one
thread. This allows for constructing complete threads when, for
example, a child message doesn't include a complete list of References
headers back to the beginning of the thread.
It also simplifies dealing with mapping a message ID to a thread ID
which is now a simple get_thread_id just like get_message_id, (and no
longer an iterator-based thing like get_tags).
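To illustrate the merging idea only, here is a sketch using a
small in-memory union-find over integer thread IDs. This is a
stand-in technique, not how the database stores or rewrites
thread IDs; the names (threads_init, thread_find, thread_merge),
the integer IDs, and the fixed MAX_THREADS limit are all
assumptions made for the example.

```c
#include <assert.h>

#define MAX_THREADS 64   /* arbitrary limit for this sketch */

/* parent[i] == i means thread i is the surviving ID of its merged set. */
static int parent[MAX_THREADS];

/* Every thread starts out as its own set. */
static void
threads_init (void)
{
    for (int i = 0; i < MAX_THREADS; i++)
        parent[i] = i;
}

/* Follow parent links to the surviving thread ID, shortening the
 * chain as we go (path halving). */
static int
thread_find (int id)
{
    assert (id >= 0 && id < MAX_THREADS);

    while (parent[id] != id) {
        parent[id] = parent[parent[id]];
        id = parent[id];
    }
    return id;
}

/* Merge the threads containing a and b. The smaller surviving ID
 * wins, and is returned. */
static int
thread_merge (int a, int b)
{
    int ra = thread_find (a);
    int rb = thread_find (b);

    if (ra == rb)
        return ra;
    if (ra < rb) {
        parent[rb] = ra;
        return ra;
    }
    parent[ra] = rb;
    return rb;
}
```

After any sequence of merges, each message maps to exactly one
thread ID via thread_find, which is what makes a simple
get_thread_id possible.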
We were previously just doing fprintf;exit at each point, but I
wanted to add file and line-number details to all messages, so it
makes sense to use a single macro for that.
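A minimal sketch of such a macro follows. The macro name, the
helper format_internal_error, and the exact message layout are
hypothetical choices for this example, not necessarily what the
code base uses. The formatting is split into a function so it can
be exercised without exiting the process.

```c
#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: format "Internal error: <message> (<file>:<line>)"
 * into buf. Overlong messages are truncated. */
static void
format_internal_error (char *buf, size_t size,
                       const char *file, int line,
                       const char *format, ...)
{
    char message[256];
    va_list args;

    va_start (args, format);
    vsnprintf (message, sizeof (message), format, args);
    va_end (args);

    snprintf (buf, size, "Internal error: %s (%s:%d)\n",
              message, file, line);
}

/* Print the message with the caller's file and line, then exit.
 * The first argument must be a printf-style format string. */
#define INTERNAL_ERROR(...)                                        \
    do {                                                           \
        char internal_error_buf[512];                              \
        format_internal_error (internal_error_buf,                 \
                               sizeof (internal_error_buf),        \
                               __FILE__, __LINE__, __VA_ARGS__);   \
        fputs (internal_error_buf, stderr);                        \
        exit (1);                                                  \
    } while (0)
```

A call site then shrinks from an fprintf/exit pair to something
like INTERNAL_ERROR ("unexpected value %d", value).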
The generic notmuch_terms_t iterator should provide support for
notmuch_thread_ids_t when we switch as well, (And it would be
interesting to see if we could reasonably make this support a
PostingIterator too. Time will tell.)

First, it's nice that for now we don't have any users yet, so we
can make incompatible changes to the database layout like this
without causing trouble. ;-)
There are a few reasons for this change. First, we now use value 0
uniformly as a timestamp for both mail and timestamp documents, (which
lets us clean up an ugly and fragile bare 0 in the add_value and
get_value calls in the timestamp code).
Second, I want to drop the thread value entirely, so putting it at the
end of the list means we can drop it as a compatible change in the
future. (I almost want to drop the message-ID value too, but it's nice
to be able to sort on it to get diff-able output from "notmuch dump".)
But the thread value we never use as a value, (we would never sort on
it, for example). And it's totally redundant with the thread terms we
store already. So expect it to disappear soon.
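The resulting layout can be pictured as an enum of value slots.
The names below are hypothetical, and the sketch assumes the three
values mentioned here (timestamp, message ID, thread) are the only
ones; only the ordering is taken from the text above.

```c
/* Hypothetical names; only the ordering comes from the description. */
typedef enum {
    VALUE_TIMESTAMP = 0,  /* shared by mail and timestamp documents */
    VALUE_MESSAGE_ID,     /* kept so "notmuch dump" output can be sorted */
    VALUE_THREAD,         /* last, so dropping it renumbers nothing else */
    VALUE_COUNT           /* number of slots currently in use */
} value_slot_t;
```

With a named VALUE_TIMESTAMP, the add_value and get_value calls no
longer need a bare 0, and removing the final slot later leaves the
earlier numbers unchanged.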
We'll be using this for storing really long terms in the database
when we just need to look them up, (and never read back the
original data directly from the database). For example, storing
arbitrarily long directory paths in the database along with
mtime timestamps.
Note that if we did want to store arbitrarily long terms and also
be able to read them back, the Xapian folks recommend splitting
the term off with multiple prefixes. See the note near the end
of this page:
http://trac.xapian.org/wiki/FAQ/UniqueIds
Here's the second big fix to message-ID handling, (the first was to
generate message IDs when an email contained none). Now, with no
document missing a message ID, and no two documents having the same
message ID, we have a nice consistent database where the message ID
can be used as a unique key.
This is the last piece needed for add_message to be able to properly
support a message with a duplicate message ID. This function creates
a new notmuch_message_t object but one that may reference an existing
document in the database.
This will support the add_message function in incrementally creating
state in a new notmuch_message_t. The new functions are
_notmuch_message_set_filename
_notmuch_message_add_thread_id
_notmuch_message_ensure_thread_id
_notmuch_message_set_date
_notmuch_message_sync
This is important as we're using the message ID as the unique key
in our database. So previously, all messages with no message ID
would be treated as the same message---not good at all.
We actually need this before the include of xutil.h, but
it was previously stuck randomly among various system
includes. Instead, put it at the top, right after including
the notmuch.h header that defines it.
With this function, and the recently added support for
notmuch_message_get_thread_ids, we now recode the find_thread_ids
function to work just the way we expect a user of the public
notmuch API to work. Not too bad really.
The motivation here is that our top-level notmuch.c main program
wants to start using these, but we don't want it to see into
notmuch-private.h, (since our main program is a test vehicle
for the "public" notmuch interface in notmuch.h).
To properly support sorting in notmuch_query we now use an
Enquire object. We also throw in a QueryParser, so we're
really close to being able to support arbitrary full-text
searches.
I took a look at the supported QueryParser syntax and chose
a set of flags for everything I like, (such as supporting
Boolean operators in either case ("AND" or "and"), supporting
phrase searching, supporting + and - to include/preclude terms,
and supporting a trailing * on any term as a wildcard).
This is a fairly big milestone for notmuch. It's our first command
to do anything besides building the index, so it proves we can
actually read valid results out from the index.
It also puts in place almost all of the API and infrastructure we
will need to allow searching of the database.
Finally, with this change we are now using talloc inside of notmuch
which is truly a delight to use. And now that I figured out how
to use C++ objects with talloc allocation, (it requires grotty
parts of C++ such as "placement new" and "explicit destructors"),
we are valgrind-clean for "notmuch dump", (as in "no leaks are
possible").
This is in preparation for a new, public notmuch_message_t.
Eventually, the public notmuch_message_t is going to grow enough
features to need to be file-backed and will likely need everything
that's now in message-file.c. So we may fold these back into one
object/implementation in the future.
The line-based parsing can be a bit awkward when wanting to peek
ahead, (say, for folded header values), but being able to trust
that a string terminator exists on every line is so convenient
that it cleans up the code considerably.
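The peek-ahead for folded header values can be sketched as
follows. This assumes the lines are already available as an array
of NUL-terminated strings with the newline removed; the names
is_continuation and unfold_header are hypothetical, and joining
the pieces with a single space is a simplification chosen for the
example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* A continuation line of a folded header starts with a space or tab. */
static int
is_continuation (const char *line)
{
    return line[0] == ' ' || line[0] == '\t';
}

/* Copy the header starting at lines[*index] into out, appending any
 * continuation lines that follow, each joined by a single space.
 * On return *index names the first line after the header. Output
 * that does not fit in out_size bytes is truncated. */
static void
unfold_header (const char *const *lines, size_t count, size_t *index,
               char *out, size_t out_size)
{
    size_t used;

    assert (*index < count && out_size > 0);

    out[0] = '\0';
    strncat (out, lines[*index], out_size - 1);
    (*index)++;

    /* Peek ahead: keep consuming lines while they continue this header. */
    while (*index < count && is_continuation (lines[*index])) {
        const char *rest = lines[*index];

        while (*rest == ' ' || *rest == '\t')
            rest++;                     /* skip the folding whitespace */

        used = strlen (out);
        if (used + 1 < out_size) {
            out[used] = ' ';
            out[used + 1] = '\0';
            strncat (out, rest, out_size - strlen (out) - 1);
        }
        (*index)++;
    }
}
```

Because every line is terminated, checking line[0] is always safe,
even for an empty line.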
Since we're currently just trying to stitch together In-Reply-To
and References headers we don't need that much sophistication.
It's when we later add full-text searching that GMime will be
useful.
So for now, even though my own code here is surely very buggy
compared to GMime, it's also a lot faster. And speed is what
we're after for the initial index creation.
This is the beginning of the notmuch library as well, with its
interface in notmuch.h. So far we've got create, open, close, and
add_message (all with a notmuch_database prefix).
The current add_message function has already been whittled down from
what we have in notmuch-index-message to add only references,
message-id, and thread-id to the index, (that is---just enough to do
thread-linkage but nothing for full-text searching).
The concept here is to do something quickly so that the user can get
some data into notmuch and start using it. (The most interesting stuff
is then thread-linkage and labels like inbox and unread.) We can
defer the full-text indexing of the body of the messages for later,
(such as in the background while the user is reading mail).
The initial thread-stitching step is still slower than I would like.
We may have to stop using libgmime for this step as its overhead is
not worth it for the simple case of just parsing the message-id,
references, and in-reply-to headers.