When indexing the cleartext of an encrypted message, record any
protected subject in the database, which should make it findable and
visible in search.
Signed-off-by: Daniel Kahn Gillmor <dkg@fifthhorseman.net>
This should not change the indexing process yet as nothing calls
_notmuch_message_gen_terms with a user prefix name. On the other hand,
it should not break anything either.
_notmuch_database_prefix does a linear walk of the list of (built-in)
prefixes, followed by a logarithmic time search of the list of user
prefixes. The latter is probably not really noticable.
Previously this functioned scanned every term attached to a given
Xapian document. It turns out we know how to read only the terms we
need to preserve (and we might have already done so). This commit
replaces many calls to Xapian::Document::remove_term with one call to
::clear_terms, and a (typically much smaller) number of calls to
::add_term. Roughly speaking this is based on the assumption that most
messages have more text than they have tags.
According to the performance test suite, this yields a roughly 40%
speedup on "notmuch reindex '*'"
The new `body:` field (in Xapian terms) or prefix (in slightly
sloppier notmuch) terms allows matching terms that occur only in the
body.
Unprefixed query terms should continue to match anywhere (header or
body) in the message.
This follows a suggestion of Olly Betts to use the facility (since
Xapian 1.0.4) to add the same field with multiple prefixes. The double
indexing of previous versions is thus replaced with a query time
expension of unprefixed query terms to the various prefixed
equivalent.
Reindexing will be needed for 'body:' searches to work correctly;
otherwise they will also match messages where the term occur in
headers (demonstrated by the new tests in T530-upgrade.sh)
For non-root messages, this should not should anything currently, as
the messages are already added in date order. In the future we will
add some non-root messages in a second pass out of order and the
sorting will be useful. It does fix the order of multiple
root-messages (although it is overkill for that).
This is technically an API change, but it is not an ABI change, and
it's merely a statement that limits what the library can do.
This is in parallel to notmuch_query_get_database(), which also takes
a const pointer.
We've had _notmuch_message_database() internally for a while, and it's
useful. It turns out to be useful on the other side of the library
interface as well (i'll use it later in this series for "notmuch
show"), so we expose it publicly now.
The observation is that we are only using the messages to get there
thread_id, which is kindof a pessimal access pattern for the current
notmuch_message_get_thread_id
There are some situations where the user wants to get rid of the
cleartext index of a message. For example, if they're indexing
encrypted messages normally, but suddenly they run across a message
that they really don't want any trace of in their index.
In that case, the natural thing to do is:
notmuch reindex --decrypt=false id:whatever@example.biz
But of course, clearing the cleartext index without clearing the
stashed session key is just silly. So we do the expected thing and
also destroy any stashed session keys while we're destroying the index
of the cleartext.
Note that stashed session keys are stored in the xapian database, but
xapian does not currently allow safe deletion (see
https://trac.xapian.org/ticket/742).
As a workaround, after removing session keys and cleartext material
from the database, the user probably should do something like "notmuch
compact" to try to purge whatever recoverable data is left in the
xapian freelist. This problem really needs to be addressed within
xapian, though, if we want it fixed right.
If we see index options that ask us to decrypt when indexing a
message, and we encounter an encrypted part, we'll try to descend into
it.
If we can decrypt, we add the property index.decryption=success.
If we can't decrypt (or recognize the encrypted type of mail), we add
the property index.decryption=failure.
Note that a single message may have both values of the
"index.decryption" property: "success" and "failure". For example,
consider a message that includes multiple layers of encryption. If we
manage to decrypt the outer layer ("index.decryption=success"), but
fail on the inner layer ("index.decryption=failure").
Because of the property name, this will be automatically cleared (and
possibly re-set) during re-indexing. This means it will subsequently
correspond to the actual semantics of the stored index.
This allows us to create new properties that will be automatically set
during indexing, and cleared during re-indexing, just by choice of
property name.
C99 stdbool turned 18 this year. There really is no reason to use our
own, except in the library interface for backward
compatibility. Convert the lib internally to stdbool.
I considered a higher level interface where the caller passes a tag
name rather than a flag character, but the role of the "unread" tag is
particularly confusing with such an interface.
There are at least three places in notmuch that can trigger an
indexing action:
* notmuch new
* notmuch insert
* notmuch reindex
I have plans to add some indexing options (e.g. indexing the cleartext
of encrypted parts, external filters, automated property injection)
that should properly be available in all places where indexing
happens.
I also want those indexing options to be exposed by (and constrained
by) the libnotmuch C API.
This isn't yet an API break because we've never made a release with
notmuch_param_t.
These indexing options are relevant in the listed places (and in the
libnotmuch analogues), but they aren't relevant in the other kinds of
functionality that notmuch offers (e.g. dump/restore, tagging, search,
show, reply).
So i think a generic "param" object isn't well-suited for this case.
In particular:
* a param object sounds like it could contain parameters for some
other (non-indexing) operation. This sounds confusing -- why would
i pass non-indexing parameters to a function that only does
indexing?
* bremner suggests online a generic param object would actually be
passed as a list of param objects, argv-style. In this case (at
least in the obvious argv implementation), the params might be some
sort of generic string. This introduces a problem where the API of
the library doesn't grow as new options are added, which means that
when code outside the library tries to use a feature, it first has
to test for it, and have code to handle it not being available.
The indexopts approach proposed here instead makes it clear at
compile time and at dynamic link time that there is an explicit
dependency on that feature, which allows automated tools to keep
track of what's needed and keeps the actual code simple.
My proposal adds the notmuch_indexopts_t as an opaque struct, so that
we can extend the list of options without causing ABI breakage.
The cost of this proposal appears to be that the "boilerplate" API
increases a little bit, with a generic constructor and destructor
function for the indexopts struct.
More patches will follow that make use of this indexopts approach.
This new function asks the database to reindex a given message.
The parameter `indexopts` is currently ignored, but is intended to
provide an extensible API to support e.g. changing the encryption or
filtering status (e.g. whether and how certain non-plaintext parts are
indexed).
This operation is relatively inexpensive, as the needed metadata is
already computed by our lazy metadata fetching. The goal is to support
better UI for messages with multipile files.
Commit d5523ead90 ("Mark some structures in the library interface
with visibility=default attribute.") fixed some mixed visibility
issues with structs. With the symbol default visibility reversed, this
is no longer a problem.
The index(3) function has been deprecated in POSIX since 2001 and
removed in 2008, and most code in notmuch already calls strchr(3).
This fixes a compilation error on Android whose libc does not have
index(3).
This function was deprecated in notmuch 0.21. We re-use the name for
a status returning version, and deprecate the _st name. One or two
remaining uses of the (removed) non-status returning version fixed at
the same time
The object where pointer to `data` was received was deleted before
it was used in _notmuch_string_list_append().
Relevant Coverity messages follow:
3: extract
Assigning: data = std::__cxx11::string(message->doc.()).c_str(),
which extracts wrapped state from temporary of type std::__cxx11::string.
4: dtor_free
The internal representation of temporary of type std::__cxx11::string
is freed by its destructor.
5: use after free:
Wrapper object use after free (WRAPPER_ESCAPE)
Using internal representation of destroyed object local data.
For reasons not completely understood at this time, gmime (as of
2.6.22) is returning a date before 1900 on bad date input. Since this
confuses some other software, we clamp such dates to 0,
i.e. 1970-01-01.
The retries are hardcoded to a small number, and error handling aborts
than propagating errors from notmuch_database_reopen. These are both
somewhat justified by the assumption that most things that can go
wrong in Xapian::Database::reopen are rare and fatal. Here's the brief
discussion with Xapian upstream:
24-02-2017 08:12:57 < bremner> any intuition about how likely
Xapian::Database::reopen is to fail? I'm catching a
DatabaseModifiedError somewhere where handling any further errors is
tricky, and wondering about treating a failed reopen as as "the
impossible happened, stopping"
24-02-2017 16:22:34 < olly> bremner: there should not be much scope for
failure - stuff like out of memory or disk errors, which are probably a
good enough excuse to stop
Many of the external links found in the notmuch source can be resolved
using https instead of http. This changeset addresses as many as i
could find, without touching the e-mail corpus or expected outputs
found in tests.
Cleaned the following whitespace in lib/* files:
lib/index.cc: 1 line: trailing whitespace
lib/database.cc 5 lines: 8 spaces at the beginning of line
lib/notmuch-private.h: 4 lines: 8 spaces at the beginning of line
lib/message.cc: 1 line: trailing whitespace
lib/sha1.c: 1 line: empty lines at the end of file
lib/query.cc: 2 lines: 8 spaces at the beginning of line
lib/gen-version-script.sh: 1 line: trailing whitespace
To fully complete the ghost-on-removal-when-shared-thread-exists
proposal, we need to clear all ghost messages when the last active
message is removed from a thread.
Amended by db: Remove the last test of T530, as it no longer makes sense
if we are garbage collecting ghost messages.
There is no need to add a ghost message upon deletion if there are no
other active messages in the thread.
Also, if the message being deleted was a ghost already, we can just go
ahead and delete it.
implement ghost-on-removal, the solution to T590-thread-breakage.sh
that just adds a ghost message after removing each message.
It leaks information about whether we've ever seen a given message id,
but it's a fairly simple implementation.
Note that _resolve_message_id_to_thread_id already introduces new
message_ids to the database, so i think just searching for a given
message ID may introduce the same metadata leakage.
This adds a new document value that stores the revision of the last
modification to message metadata, where the revision number increases
monotonically with each database commit.
An alternative would be to store the wall-clock time of the last
modification of each message. In principle this is simpler and has
the advantage that any process can determine the current timestamp
without support from libnotmuch. However, even assuming a computer's
clock never goes backward and ignoring clock skew in networked
environments, this has a fatal flaw. Xapian uses (optimistic)
snapshot isolation, which means reads can be concurrent with writes.
Given this, consider the following time line with a write and two read
transactions:
write |-X-A--------------|
read 1 |---B---|
read 2 |---|
The write transaction modifies message X and records the wall-clock
time of the modification at A. The writer hangs around for a while
and later commits its change. Read 1 is concurrent with the write, so
it doesn't see the change to X. It does some query and records the
wall-clock time of its results at B. Transaction read 2 later starts
after the write commits and queries for changes since wall-clock time
B (say the reads are performing an incremental backup). Even though
read 1 could not see the change to X, read 2 is told (correctly) that
X has not changed since B, the time of the last read. In fact, X
changed before wall-clock time A, but the change was not visible until
*after* wall-clock time B, so read 2 misses the change to X.
This is tricky to solve in full-blown snapshot isolation, but because
Xapian serializes writes, we can use a simple, monotonically
increasing database revision number. Furthermore, maintaining this
revision number requires no more IO than a wall-clock time solution
because Xapian already maintains statistics on the upper (and lower)
bound of each value stream.
Previously, we updated the database copy of a message on every call to
_notmuch_message_sync, even if nothing had changed. In particular,
this always happens on a thaw, so a freeze/thaw pair with no
modifications between still caused a database update.
We only modify message documents in a handful of places, so keep track
of whether the document has been modified and only sync it when
necessary. This will be particularly important when we add message
revision tracking.
You may wonder why _notmuch_message_file_open_ctx has two parameters.
This is because we need sometime to use a ctx which is a
notmuch_message_t. While we could get the database from this, there is
no easy way in C to tell type we are getting.
This is not supposed to change any functionality from an end user
point of view. Note that it will eliminate some output to stderr. The
query debugging output is left as is; it doesn't really fit with the
current primitive logging model. The remaining "bad" fprintf will need
an internal API change.
Tamas Szakaly points out [1] that the bug fixed in 51b073c still
exists in at least one place. This change follows the suggestion of
[2] and creates a block scope temporary std::string to avoid the rules
of iterators temporaries.
[1]: id:20141226113755.GA64154@pamparam
[2]: id:20141226230655.GA41992@pamparam
This updates the message abstraction to support ghost messages: it
adds a message flag that distinguishes regular messages from ghost
messages, and an internal function for initializing a newly created
(blank) message as a ghost message.