First, it's nice that for now we don't have any users yet, so we
can make incompatible changes to the database layout like this
without causing trouble. ;-)
There are a few reasons for this change. First, we now use value 0
uniformly as a timestamp for both mail and timestamp documents, (which
lets us cleanup an ugly and fragile bare 0 in the add_value and
get_value calls in the timestamp code).
Second, I want to drop the thread value entirely, so putting it at the
end of the list means we can drop it as compatible change in the
future. (I almost want to drop the message-ID value too, but it's nice
to be able to sort on it to get diff-able output from "notmuch dump".)
But the thread value we never use as a value, (we would never sort on
it, for example). And it's totally redundant with the thread terms we
store already. So expect it to disappear soon.
We're now dropping all pretense of keeping the database directly
compatible with sup's current xapian backend. (But perhaps someone
might write a new nothmuch backend for sup in the future.)
In coming up with the prefix values here, I tried to follow the
conventions of http://xapian.org/docs/omega/termprefixes.html as
closely as makes sense, (with some domain translation from "web"
to "email archive").
The idea here is that only some of the prefix names (such as "id" and
"tag") actually make sense in external user-supplied query
strings. Other things like "type" are internal implementation details
of how we store things in the database. So internal machinery will add
those terms to the database and we don't need to support them in the
string itself.
With this, we can now simply loop over the external prefix values to
let the quiery parser know about them. So as we add prefixes in the
future, we'll only need to add them to this list.
The key for this is call add_boolean_prefix on the QueryParser
object. That tells the query parser to take something like "tag:inbox"
and transform it into the "Linbox" term and do what it needs to do to
make this term a requirement of the search. We're starting to have a
real system here.
Also, I didn't want to expose the ugly name of "msgid" to the user, so
we add a prefix name of simply "id" instead.
The previous code was only correct as long as the timestamp prefix
was only a single character. But with the recent change to a
multi-character prefix, this broke. So fix it now.
I've decided not to try for sup compatibility at the leve of the
xapian datbase. There's just too much about sup's usage of the
database that I don't like, (beyond the embedded ruby data structures
there is redundant storage of message IDs, thread IDs, and dates (in
both terms and values)).
I'm going to fix that up in the database of notmuch, with some other
changes as well. (I plan to drop "reference" terms once linkage to a
thread ID through the reference is established. I also plan to add
actual documents to represent threads.)
So with all that incompatibility, I might as well make my own prefix
values. And while doing that, I should try to be as compatible as
possible with the conventions described here:
http://xapian.org/docs/omega/termprefixes.html
And document that notmuch_database_add_message can return this
value. This pushes the hard decision of what to do with duplicate
messages out to the user, but that's OK. (We weren't really doing
anything with these ourselves, and this way the user is at least
informed of the issue, rather than it just getting papered over
internally.)
Again preferring notmuch_database_t* over Xapian::Database*.
Also, we're standardizing on "doc_id" rather than "docid" locally, (as
an analoge to "message_id"), in spite of the "Xapian::docid" name,
(which, fortunately, we can ignore and just us "unsigned int" instead).
This name is a more accurate description of what it does, and
the more general naming will make sense as we start storing
non-message documents in the database (such as directory
timestamps).
Also, don't pass around a Xapian::Database where it's more our
style to pass a notmuch_database_t*.
Here's the second big fix to message-ID handling, (the first was to
generate message IDs when an email contained none). Now, with no
document missing a message ID, and no two documents having the same
message ID, we have a nice consistent database where the message ID
can be used as a unique key.
We're preparing for being able to deal with files with duplicate
message IDs here. The plan is to create a notmuch_message_t object in
add_message that may or may not reference a document that exists in
the database. So to do this, we have to find the message ID before we
do any manipulation of the doc.
I still don't like the name message_file at all, but we're about
to start using a notmuch_message_t in this function so we need
to do something to keep the identifiers separate for now.
Eventually, it probably makes sense to push the message-parsing
code from database.cc to message.cc.
We recently started discarding files as "not email" if they have none
of Subject, From, nor To. Apaprently, my mail collection contains a
number of messages that I sent, that are saved without Subject and
From, (perhaps these were drafts?).
Anyway, it's fortunate I had those since they alerted me to this bug,
where we were not parsing the "To" header in some cases.
This is important as we're using the message ID as the unique key
in our database. So previously, all messages with no message ID
would be treated as the same message---not good at all.
I'm glad that when I implemented "notmuch restore" I went through the
extra effort to take the code I had written in one sitting into over a
dozen commits. Sure enough, I hadn't tested well enough and had
totally broken "notmuch setup", (segfaults and bogus thread_id
values).
With the little commits I had made, git bisect saved the day, and I
went back to make the fixes right on top of the commits that
introduced the bugs. So now we octopus merge those in.
We deleted this in favor of our fancy new thread_ids iterator
from the message object. But one of the previous callers of
insert_thread_id isn't using notmuch_message_t yet. I made
the mistake of thinking I could just call g_hash_table_insert
directly, but the problem was that nobody was splitting
up the thread_id string at its commas.
So with this, we were inserting bogus comma-separated IDs
into the hash table, so thread_id values were ballooning
out of control. Should be much better now.
With this function, and the recently added support for
notmuch_message_get_thread_ids, we now recode the find_thread_ids
function to work just the way we expect a user of the public
notmuch API to work. Not too bad really.
I'm too lazy to see what the RFC says, but I know that having
whitespace inside a message-ID is sure to confuse things. And
besides, this makes things more compatible with sup so that
I have some hope of importing sup labels.
To properly support sorting in notmuch_query we know use an
Enquire object. We also throw in a QueryParser too, so we're
really close to being able to support arbitrary full-text
searches.
I took a look at the supported QueryParser syntax and chose
a set of flags for everything I like, (such as supporting
Boolean operators in either case ("AND" or "and"), supporting
phrase searching, supporting + and - to include/preclude terms,
and supporting a trailing * on any term as a wildcard).
This is a fairly big milestone for notmuch. It's our first command
to do anything besides building the index, so it proves we can
actually read valid results out from the index.
It also puts in place almost all of the API and infrastructure we
will need to allow searching of the database.
Finally, with this change we are now using talloc inside of notmuch
which is truly a delight to use. And now that I figured out how
to use C++ objects with talloc allocation, (it requires grotty
parts of C++ such as "placement new" and "explicit destructors"),
we are valgrind-clean for "notmuch dump", (as in "no leaks are
possible").
This is in preparation for a new, public notmuch_message_t.
Eventually, the public notmuch_message_t is going to grow enough
features to need to be file-backed and will likely need everything
that's now in message-file.c. So we may fold these back into one
object/implementation in the future.
We were properly feeing this memory when the thread-ids list was not
empty, but leaking it when it was.
Thanks, of course, to valgrind along with the G_SLICE=always-malloc
environment variable which makes leak checking with glib almost
bearable.
I was incorrectly using the return value of stat (-1) instead of
errno (ENOENT) to try to construct the error message here.
Also, while we're here, reword the error message to not have
"stat" in it, which in spite of what a Unix programmer will
tell you, is not actually a word.
When documenting these functions I described support for a
NOTMUCH_BASE environment variable to be consulted in the case
of a NULL path. Only, I had forgotten to actually write the
code.
This code exists now, with a new, exported function:
notmuch_database_default_path
This is helpful for things like indexes that other mail programs
may have left around. It also means we can make the initial
instructions much easier, (the user need not worry about moving
away auxiliary files from some other email program).
Looks like we can copy in a hash-table implementation, (from cairo,
say), and then a few _ascii_ functions from glib, (we'll need to
switch a few current uses if things like isspace, etc. to locale-
independent versions as well). So not too hard to free ourselves
of glib for now, (until we add GMime back in later, of course).
Since we're currently just trying to stitch together In-Reply-To
and References headers we don't need that much sophistication.
It's when we later add full-text searching that GMime will be
useful.
So for now, even though my own code here is surely very buggy
compared to GMime it's also a lot faster. And speed is what
we're after for the initial index creation.
This is the beginning of the notmuch library as well, with its
interface in notmuch.h. So far we've got create, open, close, and
add_message (all with a notmuch_database prefix).
The current add_message function has already been whittled down from
what we have in notmuch-index-message to add only references,
message-id, and thread-id to the index, (that is---just enough to do
thread-linkage but nothing for full-text searching).
The concept here is to do something quickly so that the user can get
some data into notmuch and start using it. (The most interesting stuff
is then thread-linkage and labels like inbox and unread.) We can
defer the full-text indexing of the body of the messages for later,
(such as in the background while the user is reading mail).
The initial thread-stitching step is still slower than I would like.
We may have to stop using libgmime for this step as its overhead is
not worth it for the simple case of just parsing the message-id,
references, and in-reply-to headers.