This is intended to fix the slow behaviour of "notmuch new" (and possibly
"notmuch reindex") when large numbers of files are deleted.
The underlying issue [1] seems to be the Xapian glass backend spending
a large amount of time in db.has_positions when running queries with
large-ish amounts of unflushed changes.
This commit removes two uses of Xapian queries [2], and replaces them with
an approximation of what Xapian would do after optimizing the
queries. This avoids the calls to has_positions (which are in any case
un-needed because we are only using boolean terms here).
[1] Thanks to "andres" on IRC for narrowing down the performance
bottleneck.
[2] Thanks to Olly Betts of Xapian fame for talking me a through a fix
that does not require people to update Xapian.
Introduce a new configuration value for the mail root, and use it to
locate mail messages in preference to the database.path (which
previously implied the mail messages were also in this location.
Initially only a subset of the CLI is tested in a split
configuration. Further changes will be needed for the remainder of the
CLI to work in split configurations.
Include the (currently unused) mode argument which will specify which
mode to re-open the database in. Functionality and docs to be
finalized in a followup commit.
static_cast is a bit tricky to understand and error prone, so add a
second pointer to (potentially the same) Xapian database object that
we know has the right subclass.
The original generic handler had an extra '%s' in the format
string. Update tests that failed to catch this because the template to
print status strings checked 'stat', which was not set.
It's not very nice to return FALSE for an error, so provide
notmuch_message_get_flag_st as a migration path.
Bump LIBNOTMUCH_MINOR_VERSION because the API is extended.
Currently I don't know of a good way of testing this, but at least in
principle a Xapian exception in _notmuch_message_{add,remove}_term
would cause an abort in the library.
This should not change functionality, but does slightly reduce code
duplication. Perhaps more importantly it allows consistent changes to
all of the similar exception handling in message.cc.
strncmp looks for a prefix that matches, which is very much not what
we want here. This fixes the bug reported by Franz Fellner in
id:1588595993-ner-8.651@TPL520
When indexing the cleartext of an encrypted message, record any
protected subject in the database, which should make it findable and
visible in search.
Signed-off-by: Daniel Kahn Gillmor <dkg@fifthhorseman.net>
This should not change the indexing process yet as nothing calls
_notmuch_message_gen_terms with a user prefix name. On the other hand,
it should not break anything either.
_notmuch_database_prefix does a linear walk of the list of (built-in)
prefixes, followed by a logarithmic time search of the list of user
prefixes. The latter is probably not really noticable.
Previously this functioned scanned every term attached to a given
Xapian document. It turns out we know how to read only the terms we
need to preserve (and we might have already done so). This commit
replaces many calls to Xapian::Document::remove_term with one call to
::clear_terms, and a (typically much smaller) number of calls to
::add_term. Roughly speaking this is based on the assumption that most
messages have more text than they have tags.
According to the performance test suite, this yields a roughly 40%
speedup on "notmuch reindex '*'"
The new `body:` field (in Xapian terms) or prefix (in slightly
sloppier notmuch) terms allows matching terms that occur only in the
body.
Unprefixed query terms should continue to match anywhere (header or
body) in the message.
This follows a suggestion of Olly Betts to use the facility (since
Xapian 1.0.4) to add the same field with multiple prefixes. The double
indexing of previous versions is thus replaced with a query time
expension of unprefixed query terms to the various prefixed
equivalent.
Reindexing will be needed for 'body:' searches to work correctly;
otherwise they will also match messages where the term occur in
headers (demonstrated by the new tests in T530-upgrade.sh)
For non-root messages, this should not should anything currently, as
the messages are already added in date order. In the future we will
add some non-root messages in a second pass out of order and the
sorting will be useful. It does fix the order of multiple
root-messages (although it is overkill for that).
This is technically an API change, but it is not an ABI change, and
it's merely a statement that limits what the library can do.
This is in parallel to notmuch_query_get_database(), which also takes
a const pointer.
We've had _notmuch_message_database() internally for a while, and it's
useful. It turns out to be useful on the other side of the library
interface as well (i'll use it later in this series for "notmuch
show"), so we expose it publicly now.
The observation is that we are only using the messages to get there
thread_id, which is kindof a pessimal access pattern for the current
notmuch_message_get_thread_id
There are some situations where the user wants to get rid of the
cleartext index of a message. For example, if they're indexing
encrypted messages normally, but suddenly they run across a message
that they really don't want any trace of in their index.
In that case, the natural thing to do is:
notmuch reindex --decrypt=false id:whatever@example.biz
But of course, clearing the cleartext index without clearing the
stashed session key is just silly. So we do the expected thing and
also destroy any stashed session keys while we're destroying the index
of the cleartext.
Note that stashed session keys are stored in the xapian database, but
xapian does not currently allow safe deletion (see
https://trac.xapian.org/ticket/742).
As a workaround, after removing session keys and cleartext material
from the database, the user probably should do something like "notmuch
compact" to try to purge whatever recoverable data is left in the
xapian freelist. This problem really needs to be addressed within
xapian, though, if we want it fixed right.
If we see index options that ask us to decrypt when indexing a
message, and we encounter an encrypted part, we'll try to descend into
it.
If we can decrypt, we add the property index.decryption=success.
If we can't decrypt (or recognize the encrypted type of mail), we add
the property index.decryption=failure.
Note that a single message may have both values of the
"index.decryption" property: "success" and "failure". For example,
consider a message that includes multiple layers of encryption. If we
manage to decrypt the outer layer ("index.decryption=success"), but
fail on the inner layer ("index.decryption=failure").
Because of the property name, this will be automatically cleared (and
possibly re-set) during re-indexing. This means it will subsequently
correspond to the actual semantics of the stored index.
This allows us to create new properties that will be automatically set
during indexing, and cleared during re-indexing, just by choice of
property name.
C99 stdbool turned 18 this year. There really is no reason to use our
own, except in the library interface for backward
compatibility. Convert the lib internally to stdbool.
I considered a higher level interface where the caller passes a tag
name rather than a flag character, but the role of the "unread" tag is
particularly confusing with such an interface.
There are at least three places in notmuch that can trigger an
indexing action:
* notmuch new
* notmuch insert
* notmuch reindex
I have plans to add some indexing options (e.g. indexing the cleartext
of encrypted parts, external filters, automated property injection)
that should properly be available in all places where indexing
happens.
I also want those indexing options to be exposed by (and constrained
by) the libnotmuch C API.
This isn't yet an API break because we've never made a release with
notmuch_param_t.
These indexing options are relevant in the listed places (and in the
libnotmuch analogues), but they aren't relevant in the other kinds of
functionality that notmuch offers (e.g. dump/restore, tagging, search,
show, reply).
So i think a generic "param" object isn't well-suited for this case.
In particular:
* a param object sounds like it could contain parameters for some
other (non-indexing) operation. This sounds confusing -- why would
i pass non-indexing parameters to a function that only does
indexing?
* bremner suggests online a generic param object would actually be
passed as a list of param objects, argv-style. In this case (at
least in the obvious argv implementation), the params might be some
sort of generic string. This introduces a problem where the API of
the library doesn't grow as new options are added, which means that
when code outside the library tries to use a feature, it first has
to test for it, and have code to handle it not being available.
The indexopts approach proposed here instead makes it clear at
compile time and at dynamic link time that there is an explicit
dependency on that feature, which allows automated tools to keep
track of what's needed and keeps the actual code simple.
My proposal adds the notmuch_indexopts_t as an opaque struct, so that
we can extend the list of options without causing ABI breakage.
The cost of this proposal appears to be that the "boilerplate" API
increases a little bit, with a generic constructor and destructor
function for the indexopts struct.
More patches will follow that make use of this indexopts approach.
This new function asks the database to reindex a given message.
The parameter `indexopts` is currently ignored, but is intended to
provide an extensible API to support e.g. changing the encryption or
filtering status (e.g. whether and how certain non-plaintext parts are
indexed).
This operation is relatively inexpensive, as the needed metadata is
already computed by our lazy metadata fetching. The goal is to support
better UI for messages with multipile files.