This is a rebase and cleanup of Istvan Marko's patch from
id:m3pqnj2j7a.fsf@zsu.kismala.com
Search retrieves these headers for every message in the search
results. Previously, this required opening and parsing every message
file. Storing them directly in the database significantly reduces IO
and computation, speeding up search by between 50% and 10X.
Taking full advantage of this requires a database rebuild, but it will
fall back to the old behavior for messages that do not have headers
stored in the database.
We keep the lib/xutil.c version. As a consequence, also factor out
_internal_error and associated macros. It might be overkill to make a
new file error_util.c for this, but _internal_error does not really
belong in database.cc.
Previously, notmuch_database_remove_message would remove the message
file name, sync the change to the message document, re-find the
message document, and then delete it if there were no more file names.
An interruption after sync'ing would result in a file-name-less,
permanently un-removable zombie message that would produce errors and
odd results in searches. We could wrap this in an atomic section, but
it's much simpler to eliminate the round-about approach and just
delete the message document instead of sync'ing it if we removed the
last filename.
As of gcc 4.6, there are new warnings from -Wattributes along the lines of:
warning: ‘_notmuch_messages’ declared with greater visibility
than the type of its field ‘_notmuch_messages::iterator’
[-Wattributes]
To squelch these, we decorate all such containing structs with
__attribute__((visibility("default"))). We take care to let only the
C++ compiler see this, (since the C compiler would otherwise warn
about ignored visibility attributes on types).
This replaces the guts of the filename list and tag list, making those
interfaces simple iterators over the generic string list. The
directory, message filename, and tags-related code now build generic
string lists and then wraps them in specific iterators. The real wins
come in later patches, when we use these for even more generic
functionality.
As a nice side-effect, this also eliminates the annoying dependency on
GList in the tag list.
A new "folder:" prefix in the query string can now be used to match
the directories in which mail files are stored.
The addition of this feature causes the recently added
search-by-folder tests to now pass.
This reduces thread search's 1+2t Xapian queries (where t is the
number of matched threads) to 1+t queries and constructs exactly one
notmuch_message_t for each message instead of 2 to 3.
notmuch_query_search_threads eagerly fetches the docids of all
messages matching the user query instead of lazily constructing
message objects and fetching thread ID's from term lists.
_notmuch_thread_create takes a seed docid and the set of all matched
docids and uses a single Xapian query to expand this docid to its
containing thread, using the matched docid set to determine which
messages in the thread match the user query instead of using a second
Xapian query.
This reduces the amount of time required to load my inbox from 4.523
seconds to 3.025 seconds (1.5X faster).
Some people use notmuch with non-maildir files, (for example, email
messages in MH format, or else cool things like using sluk[*] to suck
down feeds into a format that notmuch can index).
To better support uses like that, don't do any renaming for files that
are not in a directory named either "new" or "cur".
[*] https://github.com/krl/sluk/
This augments the existing notmuch_message_get_filename by allowing
the caller access to all filenames in the case of multiple files for a
single message.
To support this, we split the iterator (notmuch_filenames_t) away from
the list storage (notmuch_filename_list_t) where previously these were
a single object (notmuch_filenames_t). Then, whenever the user asks
for a file or filename, the message object lazily creates a complete
notmuch_filename_list_t and then:
For notmuch_message_get_filename, returns the first filename
in the list.
For notmuch_message_get_filenames, creates and returns a new
iterator for the filename list.
The new implementation is simply a talloc-based list of strings. The
former support (a list of database terms with a common prefix) is
implemented by simply pre-iterating over the terms and populating the
list. This should provide no performance disadvantage as callers of
thigns like notmuch_directory_get_child_files are very likely to
always iterate over all filenames anyway.
This new implementation of notmuch_filenames_t is in preparation for
adding API to query all of the filenames for a single message.
Instead of having an API for setting a library-wide flag for
synchronization (notmuch_database_set_maildir_sync) we instead
implement maildir synchronization with two new library functions:
notmuch_message_maildir_flags_to_tags
and notmuch_message_tags_to_maildir_flags
These functions are nicely documented here, (though the implementation
does not quite match the documentation yet---as plainly evidenced by
the current results of the test suite).
This patch allows bi-directional synchronization between maildir
flags and certain tags. The flag-to-tag mapping is defined by flag2tag
array.
The synchronization works this way:
1) Whenever notmuch new is executed, the following happens:
o New messages are tagged with configured new_tags.
o For new or renamed messages with maildir info present in the file
name, the tags defined in flag2tag are either added or removed
depending on the flags from the file name.
2) Whenever notmuch tag (or notmuch restore) is executed, a new set of
flags based on the tags is constructed for every message and a new
file name is prepared based on the old file name but with the new
flags. If the flags differs and the old message was in 'new'
directory then this is replaced with 'cur' in the new file name. If
the new and old file names differ, the file is renamed and notmuch
database is updated accordingly.
The rename happens before the database is updated. In case of crash
between rename and database update, the next run of notmuch new
brings the database in sync with the mail store again.
This prevents any of the private functions from being leaked out
through the library interface (at least when compiling with a
recent-enough gcc to support the visibility pragma).
Scott Henson reported an internal error that occurred when he tried to
add a message that referenced another message with a message ID well
over 300 characters in length. The bug here was running into a Xapian
limit for the length of metadata key names, (which is even more
restrictive than the Xapian limit for the length of terms).
We fix this by noticing long message ID values and instead using a
message ID of the form "notmuch-sha1-<sha1_sum_of_message_id>". That
is, we use SHA1 to generate a compressed, (but still unique), version
of the message ID.
We add support to the test suite to exercise this fix. The tests add a
message referencing the long message ID, then add the message with the
long message ID, then finally add another message referencing the long
ID. Each of these tests exercise different code paths where the
special handling is implemented.
A final test ensures that all three messages are stitched together
into a single thread---guaranteeing that the three code paths all act
consistently.
Previously we were using Xapian's add_document to allocate document ID
values for notmuch_message_t objects. This had the drawback of adding
a partially constructed mail document to the database. If notmuch was
subsequently interrupted before fully populating this document, then
later runs would be quite confused when seeing the partial documents.
There are reports from the wild of people hitting internal errors of
the form "Message ... has no thread ID" for example, (which is
currently an unrecoverable error).
We fix this by manually allocating document IDs without adding
documents. With this change, we never call Xapian's add_document
method, but only replace_document with either the current document ID
of a message or a new one that we have allocated.
With this patch the Received: header becomes special in the way
we treat headers - this is the only header for which we concatenate
all the instances we find (instead of just returning the first one).
This will be used in the From guessing code for replies as we need to
be able to walk ALL of the Received: headers in a message to have a
good chance to guess which mailbox this email was delivered to.
Signed-off-by: Dirk Hohndel <hohndel@infradead.org>
message->authors contains the author's name (as we want to print it)
get / set methods are declared in notmuch-private.h
Signed-off-by: Dirk Hohndel <hohndel@infradead.org>
At the moment all threads are named based on the name of the first message
in the thread. However, this can cause problems if people either start
new threads by replying-all (as unfortunately, many out there do) or
change the subject of their mails to reflect a shift in a thread on a
list.
This patch names threads based on (a) matches for the query, and (b) the
search order. If the search order is oldest-first (as in the default
inbox) it chooses the oldest matching message as the subject. If the
search order is newest-first it chooses the newest one.
Reply prefixes ("Re: ", "Aw: ", "Sv: ", "Vs: ") are ignored
(case-insensitively) so a Re: won't change the subject.
Note that this adds a "sort" argument to _notmuch_thread_create and
_thread_add_matched_message, so that when constructing the thread we can
be aware of the sort order.
Signed-off-by: Jesse Rosenthal <jrosenthal@jhu.edu>
We rename 'has_more' to 'valid' so that it can function whether
iterating in a forward or reverse direction. We also rename
'advance' to 'move_to_next' to setup parallel naming with
the proposed functions 'move_to_first', 'move_to_last', and
'move_to_previous'.
The first phase copies data from the old format to the new format
without deleting anything. This allows an old notmuch to still use the
database if the upgrade process gets interrupted. The second phase
performs the deletion (after updating the database version number). If
the second phase is interrupted, there will be some unused data in the
database, but it shouldn't cause any actual harm.
The recent support for renames in the database is our first time
(since notmuch has had more than a single user) that we have a
database format change. To support smooth upgrades we now encode a
database format version number in the Xapian metadata.
Going forward notmuch will emit a warning if used to read from a
database with a newer version than it natively supports, and will
refuse to write to a database with a newer version.
The library also provides functions to query the database format
version:
notmuch_database_get_version
to ask if notmuch wants a newer version than that:
notmuch_database_needs_upgrade
and a function to actually perform that upgrade:
notmuch_database_upgrade
Previously we had NOTMUCH_DATABASE_MODE_READ_ONLY but
NOTMUCH_STATUS_READONLY_DATABASE which was ugly and confusing. Rename
the latter to NOTMUCH_STATUS_READ_ONLY_DATABASE for consistency.
Previously, many checks were deep in the library just before a cast
operation. These have now been replaced with internal errors and new
checks have instead been added at the beginning of all top-levelentry
points requiring a read-write database.
The new checks now also use a single function for checking and
printing the error message. This will give us a convenient location to
extend the check, (such as based on database version as well).
This new directory ojbect provides all the infrastructure needed to
detect when files or directories are deleted or renamed. There's still
code needed on top of this (within "notmuch new") to actually do that
detection.
The code to map a filename to a direntry is something that we're going
to want in a future _remove_message function, so put it in a new
function _notmuch_database_filename_to_direntry .
The library interface is unchanged so far, (still just
notmuch_database_add_message), but internally, the old
_set_filename function is now _add_filename instead.
Instead of storing the complete message filename in the data portion
of a mail document we now store a 'direntry' term that contains the
document ID of a directory document and also the basename of the
message filename within that directory. This will allow us to easily
store multple filenames for a single message, and will also allow us
to find mail documents for files that previously existed in a
directory but that have since been deleted.
Some pending commits want the _split_path functionality separate from
mapping a directory to a document ID. The split_path function now
returns the basename as well as the directory name.
We'll soon have mail documents referring to their parent directory's
directory documents, so we'll need access to _find_parent_id in files
such as message.cc.
We'll soon be having multiple entry points that accept a filename
path, so we want common code for getting a relative path from a
potentially absolute path.
And fix the initialization such that the private enum will always have
distinct values from the public enum even if we similarly miss the
addition of a new public value in the future.
The function _notmuch_message_add_thread_id has been removed
from the private interface of notmuch. There's no reason for
one to keep a declaration of its prototype in the code base.
Also, lets update a commentary that referenced that function
and escaped from previous scrutiny.
Signed-off-by: Fernando Carrijo <fcarrijo@yahoo.com.br>
Xapian provides an interator-based interface to all search results.
So it was natural to make notmuch_messages_t be iterator-based as
well. Which we did originally.
But we ran into a problem when we added two APIs, (_get_replies and
_get_toplevel_messages), that want to return a messages iterator
that's *not* based on a Xapian search result. My original compromise
was to use notmuch_message_list_t as the basis for all returned
messages iterators in the public interface.
This had the problem of introducing extra latency at the beginning
of a search for messages, (the call would block while iterating over
all results from Xapian, converting to a message list).
In this commit, we remove that initial conversion and instead provide
two alternate implementations of notmuch_messages_t (one on top of a
Xapian iterator and one on top of a message list).
With this change, I tested a "notmuch search" returning *many* results
as previously taking about 7 seconds before results started appearing,
and now taking only 2 seconds.
We only rarely need to actually open the database for writing, but we
always create a Xapian::WritableDatabase. This has the effect of
preventing searches and like whilst updating the index.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Acked-by: Carl Worth <cworth@cworth.org>
We had exposed this to the internal implementation for a short time,
(only while we had the silly code fetching In-Reply-To values from
message files instead of from the database). Make this private again
as it should be.
The message file header parsing code parses only enough of the file to
find the desired header fields, then it leaves the file open until the
next header parsing call or when the message is no longer in use. If a
large number of messages end up being active, this will quickly run
out of file descriptors.
Here, we add support to explicitly close the message file within a
message, (_notmuch_message_close) and call that from thread
construction code.
Signed-off-by: Keith Packard <keithp@keithp.com>
Edited-by: Carl Worth <cworth@cworth.org>:
Many portions of Keith's original patch have since been solved other
ways, (such as the code that changed the handling of the In-Reply-To
header). So the final version is clean enough that I think even Keith
would be happy to have his name on it.
This function has only one caller, and that one caller was passing the
same value for both talloc_owner and the notmuch database. Dropping
the redundant argument simplifies the documentation of this function
considerably.
We now properly analyze the in-reply-to headers to create a proper
tree representing the actual thread and present the messages in this
correct thread order. Also, there's a new "depth:" value added to the
"message{" header so that clients can format the thread as desired,
(such as by indenting replies).
The existing notmuch_message_get_header is *almost* good enough for
this, except that we also need to remove the '<' and '>'
delimiters. We'll probably want to implement this function with
database storage in the future rather than loading the email message.
This prototype has been sitting around for a while with no function
implementing it. I wonder if there's a compiler warning I could turn
on to catch these things.
Previously, the notmuch_messages_t object was a linked list built on
top of a linked-list node with the odd name of notmuch_message_list_t.
Now, we've got much more sane naming with notmuch_message_list_t being
a list built on a linked-list node named notmuch_message_node_t. And
now the public notmuch_messages_t object is a separate iterator based
on notmuch_message_node_t. This means the interfaces for the new
notmuch_message_list_t object are now made available to the library
internals.
The new object is simply a linked-list of notmuch_message_t objects,
(unlike the old object which contained a couple of Xapian iterators).
This works now by the query code immediately iterator over all results
and creating notmuch_message_t objects for them, (rather than waiting
to create the objects until the notmuch_messages_get call as we did
earlier).
The point of this change is to allow other instances of lists of
messages, (such as in notmuch_thread_t), that are not directly related
to Xapian search results.
Note that we don't print the number of *unread* messages, but instead
the number of messages that matched the search terms. This is in
keeping with our philosophy that the inbox is nothing more than a
search view. If we search for messages with an inbox tag, then that's
what we'll get a count of. (And if somebody does want to see unread
counts, then they can search for the "unread" tag.)
Getting the number of matched messages is really nice when doing
historical searches. For example in a search like:
notmuch search tag:sent
(where the "sent" tag has been applied to all messages originating
from the user's email address)---here it's really nice to be able to
see a thread where the user just mentioned one point [1/13] vs. really
getting involved in the discussion [10/29].
We've now expanded the notmuch_thread_create function to fire off a
secondary database query to find all the messages that belong to this
particular thread. This allows us to now have the complete authors'
list for the thread, and will also make it trivial to print accurate
message counts for threads in the future.