It's nice that Xapian provides a little function to print a textual
representation of the entire query tree. So now, if you compile
like so:
make CFLAGS=-DDEBUG_QUERY
then you get a nice output of the query string received by the query
module, and the final query actually being sent to Xapian.
This isn't behaving at all like it's documented yet, (for example,
it's returning message IDs not thread IDs[*]). In fact, the output
code is just a copy of the body of "notmuch dump", so all you
get for now is message ID and tags.
But this should at least be enough to start exercising the query
functionality, (which is currently very buggy).
[*] I'll want to convert the database to store thread documents
before fixing that.
The current problem is that when this function fails the caller
doesn't get any information about what the particular failure
was, (something in the filesystem? or in Xapian?). We should fix
that.
With some recent testing, the timestamp was failing, (overflowing
the term limit), and reporting an error, but the top-level notmuch
command was still returning a success return value.
I think it's high time to add a test suite, (and the code base is
small enough that if we add it now it shouldn't be *too* hard to
shoot for a very high coverage percentage).
The previous code was only correct as long as the timestamp prefix
was only a single character. But with the recent change to a
multi-character prefix, this broke. So fix it now.
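
As a rough illustration of this class of bug (not the actual notmuch
code), here is a minimal sketch. The prefix value and both function
names are hypothetical, chosen only for the example. The fragile
version skips a hard-coded single character, while the robust version
skips however many characters the prefix actually has.

```cpp
#include <cstring>
#include <string>

// Hypothetical multi-character prefix, used only for illustration.
static const char *const TIMESTAMP_PREFIX = "KTS";

// Fragile: silently assumes the prefix is exactly one character long.
std::string strip_prefix_single_char(const std::string &term)
{
    return term.substr(1);
}

// Robust: skips the real length of the prefix. The caller is expected
// to pass a term that actually begins with the prefix (substr throws
// std::out_of_range if the term is shorter than the prefix).
std::string strip_prefix(const std::string &term, const char *prefix)
{
    return term.substr(std::strlen(prefix));
}
```

With a term such as "KTS/home/mail", the fragile version leaves "TS"
stuck to the front of the path, while the robust version returns just
"/home/mail".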
I've decided not to try for sup compatibility at the level of the
Xapian database. There's just too much about sup's usage of the
database that I don't like, (beyond the embedded ruby data structures
there is redundant storage of message IDs, thread IDs, and dates (in
both terms and values)).
I'm going to fix that up in the database of notmuch, with some other
changes as well. (I plan to drop "reference" terms once linkage to a
thread ID through the reference is established. I also plan to add
actual documents to represent threads.)
So with all that incompatibility, I might as well make my own prefix
values. And while doing that, I should try to be as compatible as
possible with the conventions described here:
http://xapian.org/docs/omega/termprefixes.html
With this, "notmuch new" is now plenty fast even with large archives
spanning many sub-directories. Document this both in "notmuch help"
and also in the output of notmuch setup.
Finally, I can get new messages into my notmuch database without
having to run a complete "notmuch setup" again. This takes
advantage of the recent timestamp capabilities in the database
to avoid looking into directories that haven't changed since the
last time "notmuch new" was run.
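
The check described above might look roughly like the following
sketch. The function name and the convention that a stored timestamp
of 0 means "never scanned" are assumptions made for the example, not
necessarily what notmuch does.

```cpp
#include <ctime>

// Hypothetical helper: decide whether a directory must be rescanned.
//   dir_mtime        - the directory's current modification time
//                      (for example, st_mtime from stat()).
//   stored_timestamp - the time recorded in the database at the last
//                      scan, or 0 if the directory was never scanned
//                      (an assumed convention).
bool directory_needs_scan(std::time_t dir_mtime, std::time_t stored_timestamp)
{
    return stored_timestamp == 0 || dir_mtime > stored_timestamp;
}
```

Note that a directory's mtime only changes when entries are added to
or removed from that directory itself, so each sub-directory still
needs its own check.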
Get rid of a useless leading 0 on the seconds value, and make a
distinction between "files" and "messages", (we process many
files, but not all of them are recognized as messages). Finally,
add a summary line at the end saying how many unique messages
were added to the database. Since this comes right after the
total number of files, it gives the user at least a hint as
to how many messages were encountered with duplicate message IDs.
The notmuch_database_get_default_path function is unique in not
accepting a notmuch_database_t* (nor creating one). So list it
outside the other notmuch_database functions.
Some people might argue for more initializers to be "safer",
but I actually prefer to leave things this way. It saves
typing, but the real benefit is that the things that do
require initialization stand out so we know to watch them
carefully. And with valgrind, we actually get to catch
errors earlier if we *don't* initialize them. So that can
be "safer" ironically enough.
And document that notmuch_database_add_message can return this
value. This pushes the hard decision of what to do with duplicate
messages out to the user, but that's OK. (We weren't really doing
anything with these ourselves, and this way the user is at least
informed of the issue, rather than it just getting papered over
internally.)
These were just unclean, (an invisible sort of uncleanliness, but still
they are liable to make for ugly diffs). Oh, wait, like this one!
But at least it's not sprinkled among code changes.
Again preferring notmuch_database_t* over Xapian::Database*.
Also, we're standardizing on "doc_id" rather than "docid" locally, (as
an analogue to "message_id"), in spite of the "Xapian::docid" name,
(which, fortunately, we can ignore and just use "unsigned int" instead).
This name is a more accurate description of what it does, and
the more general naming will make sense as we start storing
non-message documents in the database (such as directory
timestamps).
Also, don't pass around a Xapian::Database where it's more our
style to pass a notmuch_database_t*.
We'll be using this for storing really long terms in the database
and when we just need to look them up, (and never read back the
original data directly from the database). For example, storing
arbitrarily long directory paths in the database along with
mtime timestamps.
Note that if we did want to store arbitrarily long terms and also
be able to read them back, the Xapian folks recommend splitting
the term off with multiple prefixes. See the note near the end
of this page:
http://trac.xapian.org/wiki/FAQ/UniqueIds
This helps the user gauge the severity of the error.
For example, when restoring my sup tags I see a bunch of tags missing
for message IDs of the form "sup-faked-...". That's not surprising
since I know that sup generates these with the md5sum of the message
header while notmuch uses the sha-1 of the entire message. But how
much will this hurt?
Well, now that I can see that most of the missing tags are just
"attachment", then I'm not concerned, (I'll be automatically creating
that tag in the future based on the message contents). But if a
missing tag is "inbox" then that's more concerning because that's data
that I can't easily regenerate outside of sup.
With the recent improvements to the handling of message IDs we
"know" that a NULL message ID is impossible, (so we simply
abort if the impossible happens).
Here's the second big fix to message-ID handling, (the first was to
generate message IDs when an email contained none). Now, with no
document missing a message ID, and no two documents having the same
message ID, we have a nice consistent database where the message ID
can be used as a unique key.
This is the last piece needed for add_message to be able to properly
support a message with a duplicate message ID. This function creates
a new notmuch_message_t object but one that may reference an existing
document in the database.
This function is only supposed to be called with a doc_id that
was queried from the database already. So there's an internal
error if no document with that doc_id can be found in the database.
In that case, return NULL.
This will support the add_message function in incrementally creating
state in a new notmuch_message_t. The new functions are
_notmuch_message_set_filename
_notmuch_message_add_thread_id
_notmuch_message_ensure_thread_id
_notmuch_message_set_date
_notmuch_message_sync
This is a new public function to find the filename of the original
email message for a message-object that was found in the database.
We may change this function in the future to support returning a
list of filenames, (for messages with duplicate message IDs).
We're preparing for being able to deal with files with duplicate
message IDs here. The plan is to create a notmuch_message_t object in
add_message that may or may not reference a document that exists in
the database. So to do this, we have to find the message ID before we
do any manipulation of the doc.
The idea here is to allow internal users to see a non-synced message
object, (for example, while parsing a message file and incrementally
adding terms, etc.). We're willing to take the care to get the
improved performance.
But for the public interface, keeping everything synced will be much
less confusing, (reference lots of sup bugs that happen due to
message state being altered by the user but not synced to the database).
I still don't like the name message_file at all, but we're about
to start using a notmuch_message_t in this function so we need
to do something to keep the identifiers separate for now.
Eventually, it probably makes sense to push the message-parsing
code from database.cc to message.cc.
It's easy enough to check if a "missing" header was accidentally
left off the list in the call to restrict_headers. (And it's
cheap since we only check in case no such header was found in the
message.)
We recently started discarding files as "not email" if they have none
of Subject, From, or To. Apparently, my mail collection contains a
number of messages that I sent, that are saved without Subject and
From, (perhaps these were drafts?).
Anyway, it's fortunate I had those since they alerted me to this bug,
where we were not parsing the "To" header in some cases.
This is important as we're using the message ID as the unique key
in our database. So previously, all messages with no message ID
would be treated as the same message---not good at all.
This way both the .c and .h files have the same name, and all of the
code imported from the "libsha1" implementation is in filenames
matching libsha1.*.
This also gives me room to make my own notmuch_sha1 wrapper functions
in sha1.c.
I'm glad that when I implemented "notmuch restore" I went through the
extra effort to split the code I had written in one sitting into over a
dozen commits. Sure enough, I hadn't tested well enough and had
totally broken "notmuch setup", (segfaults and bogus thread_id
values).
With the little commits I had made, git bisect saved the day, and I
went back to make the fixes right on top of the commits that
introduced the bugs. So now we octopus merge those in.
We deleted this in favor of our fancy new thread_ids iterator
from the message object. But one of the previous callers of
insert_thread_id isn't using notmuch_message_t yet. I made
the mistake of thinking I could just call g_hash_table_insert
directly, but the problem was that nobody was splitting
up the thread_id string at its commas.
So with this, we were inserting bogus comma-separated IDs
into the hash table, so thread_id values were ballooning
out of control. Should be much better now.
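
To make the splitting step concrete, here is a minimal sketch. It
uses a std::set as a stand-in for the GLib hash table mentioned
above, and the function name is hypothetical. It is not the real
insert_thread_id.

```cpp
#include <set>
#include <string>

// Hypothetical helper: split a comma-separated thread_id string and
// insert each individual ID into `seen`. Empty pieces (from doubled
// or trailing commas) are skipped.
void insert_thread_ids(std::set<std::string> &seen,
                       const std::string &thread_ids)
{
    std::string::size_type start = 0;
    while (start <= thread_ids.size()) {
        std::string::size_type comma = thread_ids.find(',', start);
        if (comma == std::string::npos)
            comma = thread_ids.size();
        if (comma > start)
            seen.insert(thread_ids.substr(start, comma - start));
        start = comma + 1;
    }
}
```

Inserting "t1,t2" and then "t2,t3" this way leaves three entries in
the set, whereas inserting the unsplit strings would leave two bogus
entries that match no real thread ID.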
Here's more evidence that C++ is a nightmare to program---or that
I'm smart enough to realize that C++ is more clever than I will
ever be.
Most of my issues with C++ have to do with it hiding things from
me that I'd really like to and expect to be aware of as a C
programmer.
For example, the specific problem here is that there's a
short-lived std::string, from which I just want to copy
the C string. I try to do that on the next line, but before
I can, C++ has already called the destructor on the std::string.
Now, C++ isn't alone in doing garbage collecting like this.
But in a *real* garbage-collecting system, everything would
work that way. For example, here, I'm still holding a pointer
to the C string contents, so if the garbage collector were
aware of that reference, then it might clean up the std::string
container and leave the data I'm still using.
But that's not what we get with C++. Instead, some things are
reference counted and collected, (like the std::string), and
some things just aren't (like the C string it contains). The
end result is that it's very fragile. It forces me to be aware
of the timing of hidden functions. In a "real" system I wouldn't
have to be aware of that timing, and in C the function just
wouldn't be hidden.
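
For the record, the pattern that bit me looks roughly like the sketch
below. The function names are made up for the example, and
get_value() merely stands in for any call that returns a std::string
by value. The temporary is destroyed at the end of the full
expression, so a pointer obtained from c_str() dangles by the next
line. Keeping the std::string in a named variable until the copy is
made avoids the problem.

```cpp
#include <cstdlib>
#include <cstring>
#include <string>

// Stand-in for a library call that returns a std::string by value.
std::string get_value()
{
    return "example-thread-id";
}

// Broken pattern, shown only as a comment: the temporary std::string
// dies at the end of the first statement, so `p` dangles afterwards.
//
//     const char *p = get_value().c_str();
//     char *copy = strdup(p);          /* undefined behavior */

// Safe pattern: the named variable keeps the string alive until the
// C string has been copied. The caller frees the result with free().
char *copy_value()
{
    std::string value = get_value();
    char *copy = static_cast<char *>(std::malloc(value.size() + 1));
    if (copy != nullptr)
        std::memcpy(copy, value.c_str(), value.size() + 1);
    return copy;
}
```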