We're preparing to deal with files with duplicate
message IDs here. The plan is to create a notmuch_message_t object in
add_message that may or may not reference a document that exists in
the database. So to do this, we have to find the message ID before we
do any manipulation of the doc.
I still don't like the name message_file at all, but we're about
to start using a notmuch_message_t in this function so we need
to do something to keep the identifiers separate for now.
Eventually, it probably makes sense to push the message-parsing
code from database.cc to message.cc.
We recently started discarding files as "not email" if they have none
of Subject, From, or To. Apparently, my mail collection contains a
number of messages that I sent, that are saved without Subject and
From, (perhaps these were drafts?).
Anyway, it's fortunate I had those since they alerted me to this bug,
where we were not parsing the "To" header in some cases.
This is important as we're using the message ID as the unique key
in our database. So previously, all messages with no message ID
would be treated as the same message---not good at all.
I'm glad that when I implemented "notmuch restore" I went through the
extra effort to split the code I had written in one sitting into over a
dozen commits. Sure enough, I hadn't tested well enough and had
totally broken "notmuch setup", (segfaults and bogus thread_id
values).
With the little commits I had made, git bisect saved the day, and I
went back to make the fixes right on top of the commits that
introduced the bugs. So now we octopus merge those in.
We deleted this in favor of our fancy new thread_ids iterator
from the message object. But one of the previous callers of
insert_thread_id isn't using notmuch_message_t yet. I made
the mistake of thinking I could just call g_hash_table_insert
directly, but the problem was that nobody was splitting
up the thread_id string at its commas.
So with this, we were inserting bogus comma-separated IDs
into the hash table, so thread_id values were ballooning
out of control. Should be much better now.
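The splitting step described above can be sketched roughly as
follows. This is only an illustration in plain C: the helper name
split_thread_ids and its array-based interface are hypothetical, and
the real code may well differ. Each ID it returns could then be
inserted into the hash table on its own.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper (not the actual notmuch code): split a
 * comma-separated thread_id string such as "a,b,c" into individual
 * IDs. Returns the number of IDs stored in ids[] (at most max_ids).
 * Each stored ID is a freshly malloc'ed string that the caller
 * must free. */
static size_t
split_thread_ids (const char *thread_ids, char **ids, size_t max_ids)
{
    size_t count = 0;
    const char *start = thread_ids;

    assert (thread_ids != NULL && ids != NULL);

    while (*start != '\0' && count < max_ids) {
        const char *comma = strchr (start, ',');
        size_t len = comma ? (size_t) (comma - start) : strlen (start);

        if (len > 0) {          /* skip empty fields, as in "a,,b" */
            char *id = malloc (len + 1);
            if (id == NULL)
                break;
            memcpy (id, start, len);
            id[len] = '\0';
            ids[count++] = id;
        }

        if (comma == NULL)
            break;
        start = comma + 1;      /* continue after the comma */
    }

    return count;
}
```

Inserting the pieces one at a time, rather than the whole
comma-separated string, is what keeps the table keyed on single
thread IDs.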
With this function, and the recently added support for
notmuch_message_get_thread_ids, we now recode the find_thread_ids
function to work just the way we expect a user of the public
notmuch API to work. Not too bad really.
I'm too lazy to see what the RFC says, but I know that having
whitespace inside a message-ID is sure to confuse things. And
besides, this makes things more compatible with sup so that
I have some hope of importing sup labels.
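As a rough sketch of the idea, whitespace can be squeezed out of a
message ID in place. The function name below is hypothetical, and
dropping every space, tab, CR, and LF is an assumption about what
"whitespace" should mean here, not a statement of what the RFC or
sup requires.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: remove every ASCII space, tab, CR, and LF
 * from id, in place. Returns the length of the resulting string. */
static size_t
strip_message_id_whitespace (char *id)
{
    char *src, *dst;

    assert (id != NULL);

    for (src = dst = id; *src != '\0'; src++) {
        if (*src == ' ' || *src == '\t' || *src == '\r' || *src == '\n')
            continue;           /* drop whitespace */
        *dst++ = *src;          /* keep everything else */
    }
    *dst = '\0';

    return (size_t) (dst - id);
}
```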
To properly support sorting in notmuch_query we now use an
Enquire object. We also throw in a QueryParser too, so we're
really close to being able to support arbitrary full-text
searches.
I took a look at the supported QueryParser syntax and chose
a set of flags for everything I like: Boolean operators in
either case ("AND" or "and"), phrase searching, + and - to
include/exclude terms, and a trailing * on any term as a
wildcard.
This is a fairly big milestone for notmuch. It's our first command
to do anything besides building the index, so it proves we can
actually read valid results out from the index.
It also puts in place almost all of the API and infrastructure we
will need to allow searching of the database.
Finally, with this change we are now using talloc inside of notmuch
which is truly a delight to use. And now that I figured out how
to use C++ objects with talloc allocation, (it requires grotty
parts of C++ such as "placement new" and "explicit destructors"),
we are valgrind-clean for "notmuch dump", (as in "no leaks are
possible").
This is in preparation for a new, public notmuch_message_t.
Eventually, the public notmuch_message_t is going to grow enough
features to need to be file-backed and will likely need everything
that's now in message-file.c. So we may fold these back into one
object/implementation in the future.
We were properly freeing this memory when the thread-ids list was not
empty, but leaking it when it was.
Thanks, of course, to valgrind along with the G_SLICE=always-malloc
environment variable which makes leak checking with glib almost
bearable.
I was incorrectly using the return value of stat (-1) instead of
errno (ENOENT) to try to construct the error message here.
Also, while we're here, reword the error message to not have
"stat" in it, which in spite of what a Unix programmer will
tell you, is not actually a word.
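For illustration, the corrected pattern looks roughly like this.
The helper name check_path and the exact message wording are
assumptions made for this sketch. The point is only that stat
returns -1 on failure and the actual reason is in errno.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>

/* Hypothetical helper: returns 0 if path can be examined. On failure
 * it writes a human-readable message into msg and returns the errno
 * value (for example ENOENT). */
static int
check_path (const char *path, char *msg, size_t msg_size)
{
    struct stat st;

    if (stat (path, &st) == -1) {
        int err = errno;    /* save before another call clobbers it */

        /* strerror wants the errno value, not stat's -1 return. */
        snprintf (msg, msg_size, "Error reading %s: %s",
                  path, strerror (err));
        return err;
    }

    return 0;
}
```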
When documenting these functions I described support for a
NOTMUCH_BASE environment variable to be consulted in the case
of a NULL path. Only, I had forgotten to actually write the
code.
This code exists now, with a new, exported function:
notmuch_database_default_path
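A minimal sketch of the lookup, assuming the environment variable
simply overrides some fallback location. The name
example_default_path and the fallback parameter are hypothetical and
are not the signature of notmuch_database_default_path.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical sketch: return a malloc'ed copy of the base path,
 * taken from NOTMUCH_BASE if it is set and non-empty, otherwise
 * from the caller-supplied fallback (an assumption of this sketch).
 * Returns NULL if allocation fails. */
static char *
example_default_path (const char *fallback)
{
    const char *base = getenv ("NOTMUCH_BASE");
    const char *chosen;
    size_t len;
    char *copy;

    assert (fallback != NULL);

    chosen = (base != NULL && *base != '\0') ? base : fallback;
    len = strlen (chosen);

    copy = malloc (len + 1);
    if (copy != NULL)
        memcpy (copy, chosen, len + 1);   /* includes the final NUL */

    return copy;
}
```

A caller handed a NULL path would call something like this first
and then free the result when done.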
This is helpful for things like indexes that other mail programs
may have left around. It also means we can make the initial
instructions much easier, (the user need not worry about moving
away auxiliary files from some other email program).
Looks like we can copy in a hash-table implementation, (from cairo,
say), and then a few _ascii_ functions from glib, (we'll need to
switch a few current uses of things like isspace, etc. to
locale-independent versions as well). So not too hard to free ourselves
of glib for now, (until we add GMime back in later, of course).
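A locale-independent replacement for isspace can be as small as the
sketch below. The name ascii_isspace is made up for this example
(glib's equivalent is g_ascii_isspace), and the set of characters
is the usual "C" locale whitespace set.

```c
/* Hypothetical helper: locale-independent test for ASCII whitespace.
 * Unlike isspace, the result never depends on the current locale. */
static int
ascii_isspace (unsigned char c)
{
    return c == ' '  || c == '\t' || c == '\n' ||
           c == '\v' || c == '\f' || c == '\r';
}
```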
Since we're currently just trying to stitch together In-Reply-To
and References headers we don't need that much sophistication.
It's when we later add full-text searching that GMime will be
useful.
So for now, even though my own code here is surely very buggy
compared to GMime, it's also a lot faster. And speed is what
we're after for the initial index creation.
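To give a feel for how little parsing is needed, here is a sketch
that pulls the "<...>" message IDs out of a References or
In-Reply-To header value. The function name next_reference and its
cursor-style interface are hypothetical. It is not the parser in
this commit, and it ignores comments and quoting entirely.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copy the next "<...>" message ID found at or
 * after *cursor into out (without the angle brackets) and advance
 * *cursor past it. Returns 1 if an ID was found, 0 otherwise. */
static int
next_reference (const char **cursor, char *out, size_t out_size)
{
    const char *open, *close;
    size_t len;

    assert (cursor != NULL && *cursor != NULL);
    assert (out != NULL && out_size > 0);

    open = strchr (*cursor, '<');
    if (open == NULL)
        return 0;

    close = strchr (open + 1, '>');
    if (close == NULL)
        return 0;               /* unterminated ID: give up */

    len = (size_t) (close - open - 1);
    if (len >= out_size)
        len = out_size - 1;     /* truncate rather than overflow */
    memcpy (out, open + 1, len);
    out[len] = '\0';

    *cursor = close + 1;
    return 1;
}
```

Calling it in a loop until it returns 0 walks every ID in the
header, which is all that thread stitching needs.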
This is the beginning of the notmuch library as well, with its
interface in notmuch.h. So far we've got create, open, close, and
add_message (all with a notmuch_database prefix).
The current add_message function has already been whittled down from
what we have in notmuch-index-message to add only references,
message-id, and thread-id to the index, (that is---just enough to do
thread-linkage but nothing for full-text searching).
The concept here is to do something quickly so that the user can get
some data into notmuch and start using it. (The most interesting stuff
is then thread-linkage and labels like inbox and unread.) We can
defer the full-text indexing of the body of the messages for later,
(such as in the background while the user is reading mail).
The initial thread-stitching step is still slower than I would like.
We may have to stop using libgmime for this step as its overhead is
not worth it for the simple case of just parsing the message-id,
references, and in-reply-to headers.