As we prepare to handle S/MIME-encrypted PKCS#7 EnvelopedData (which
is not multipart), we don't want to be limited to passing only
GMimeMultipartEncrypted MIME parts to _notmuch_crypto_decrypt.
There is no functional change here, just a matter of adjusting how we
pass arguments internally.
Signed-off-by: Daniel Kahn Gillmor <dkg@fifthhorseman.net>
When we are indexing, we should treat SignedData parts the same way
that we treat a multipart object, indexing the wrapped part as a
distinct MIME object.
Unfortunately, this means doing some sort of cryptographic
verification whose results we throw away, because GMime doesn't offer
us any way to unwrap without doing signature verification.
I've opened https://github.com/jstedfast/gmime/issues/67 to request
the capability from GMime but for now, we'll just accept the
additional performance hit.
As we do this indexing, we also apply the "signed" tag, by analogy
with how we handle multipart/signed messages. These days, that kind
of change should probably be done with a property instead, but that's
a different set of changes. This one is just for consistency.
Note that we are currently *only* handling signedData parts, which are
basically clearsigned messages. PKCS#7 parts can also be
envelopedData and authEnvelopedData (which are effectively encryption
layers), and compressedData (which afaict isn't implemented anywhere,
i've never encountered it). We're laying the groundwork for indexing
these other S/MIME types here, but we're only dealing with signedData
for now.
Signed-off-by: Daniel Kahn Gillmor <dkg@fifthhorseman.net>
When encountering a message that has been mangled in the "mixed up"
way by an intermediate MTA, notmuch should instead repair it and index
the repaired form.
When it does this, it also associates the index.repaired=mixedup
property with the message. If a problem is found with this repair
process, or an improved repair process is proposed later, this should
make it easy for people to reindex the relevant message. The property
will also hopefully make it easier to diagnose this particular problem
in the future.
Signed-off-by: Daniel Kahn Gillmor <dkg@fifthhorseman.net>
When we notice a legacy-display part during indexing, it makes more
sense to avoid indexing it as part of the message body.
Given that the protected subject will already be indexed, there is no
need to index this part at all, so we skip over it.
If this happens during indexing, we set a property on the message:
index.repaired=skip-protected-headers-legacy-display
Signed-off-by: Daniel Kahn Gillmor <dkg@fifthhorseman.net>
Our _notmuch_message_crypto_potential_payload implementation could
only return a failure if bad arguments were passed to it. It is an
internal function, so if that happens it's an entirely internal bug
for notmuch.
It will be more useful for this function to return whether or not the
part is in fact a cryptographic payload, so we dispense with the
status return.
If some future change suggests adding a status return back, there are
only a handful of call sites, and no pressure to retain a stable API,
so it could be changed easily. But for now, go with the simpler
function.
We will use this return value in future patches, to make different
decisions based on whether a part is the cryptographic payload or not.
But for now, we just leave the places where it gets invoked marked
with (void) to show that the result is ignored.
Signed-off-by: Daniel Kahn Gillmor <dkg@fifthhorseman.net>
When indexing the cleartext of an encrypted message, record any
protected subject in the database, which should make it findable and
visible in search.
Signed-off-by: Daniel Kahn Gillmor <dkg@fifthhorseman.net>
This means dropping GMimeCryptoContext and notmuch_config arguments.
All the argument changes are to internal functions, so this is not an
API or ABI break.
We also get to drop the #define for g_mime_3_unused.
signed-off-by: Daniel Kahn Gillmor <dkg@fifthhorseman.net>
In _index_mime_part, we don't need to extract the content-type from
the part until just before we use it, so we also defer it lazily.
Signed-off-by: Daniel Kahn Gillmor <dkg@fifthhorseman.net>
Use explicit labels for GTypeInfo member initializers, rather than
relying on comments and ordering. This is both easier to read, and
harder to screw up. This also makes it clear that we're mis-casting
GObject class initializers for gcc.
Without this patch, g++ 8.2.0-7 produces this warning:
CXX -g -O2 lib/index.o
lib/index.cc: In function ‘GMimeFilter* notmuch_filter_discard_non_term_new(GMimeContentType*)’:
lib/index.cc:252:23: warning: cast between incompatible function types from ‘void (*)(NotmuchFilterDiscardNonTermClass*)’ {aka ‘void (*)(_NotmuchFilterDiscardNonTermClass*)’} to ‘GClassInitFunc’ {aka ‘void (*)(void*, void*)’} [-Wcast-function-type]
(GClassInitFunc) notmuch_filter_discard_non_term_class_init,
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The definition of GClassInitFunc in
/usr/include/glib-2.0/gobject/gtype.h suggests that this function will
always be called with the class_data member of the GTypeInfo. We set
that value to NULL in both GObject definitions in notmuch. So we mark
it as explicitly unused.
There is no functional change here, just code cleanup.
We've had _notmuch_message_database() internally for a while, and it's
useful. It turns out to be useful on the other side of the library
interface as well (i'll use it later in this series for "notmuch
show"), so we expose it publicly now.
If you're going to store the cleartext index of an encrypted message,
in most situations you might just as well store the session key.
Doing this storage has efficiency and recoverability advantages.
Combined with a schedule of regular OpenPGP subkey rotation and
destruction, this can also offer security benefits, like "deletable
e-mail", which is the store-and-forward analog to "forward secrecy".
But wait, i hear you saying, i have a special need to store cleartext
indexes but it's really bad for me to store session keys! Maybe
(let's imagine) i get lots of e-mails with incriminating photos
attached, and i want to be able to search for them by the text in the
e-mail, but i don't want someone with access to the index to be
actually able to see the photos themselves.
Fret not, the next patch in this series will support your wacky
uncommon use case.
In our consolidation of _notmuch_crypto_decrypt, the callers lost
track a little bit of whether any actual decryption was attempted.
Now that we have the more-subtle "auto" policy, it's possible that
_notmuch_crypto_decrypt could be called without having any actual
decryption take place.
This change lets the callers be a little bit smarter about whether or
not any decryption was actually attempted.
This new automatic decryption policy should make it possible to
decrypt messages that we have stashed session keys for, without
incurring a call to the user's asymmetric keys.
Future patches in this series will introduce new policies; this merely
readies the way for them.
We also convert --try-decrypt to a keyword argument instead of a boolean.
When doing any decryption, if the notmuch database knows of any
session keys associated with the message in question, try them before
defaulting to using default symmetric crypto.
This changeset does the primary work in _notmuch_crypto_decrypt, which
grows some new parameters to handle it.
The primary advantage this patch offers is a significant speedup when
rendering large encrypted threads ("notmuch show") if session keys
happen to be cached.
Additionally, it permits message composition without access to
asymmetric secret keys ("notmuch reply"); and it permits recovering a
cleartext index when reindexing after a "notmuch restore" for those
messages that already have a session key stored.
Note that we may try multiple decryptions here (e.g. if there are
multiple session keys in the database), but we will ignore and throw
away all the GMime errors except for those that come from last
decryption attempt. Since we don't necessarily know at the time of
the decryption that this *is* the last decryption attempt, we'll ask
for the errors each time anyway.
This does nothing if no session keys are stashed in the database,
which is fine. Actually stashing session keys in the database will
come as a subsequent patch.
We will use this centralized function to consolidate the awkward
behavior around different gmime versions.
It's only invoked from two places: mime-node.c's
node_decrypt_and_verify() and lib/index.cc's
_index_encrypted_mime_part().
However, those two places have some markedly distinct logic, so the
interface for this _notmuch_crypto_decrypt function is going to get a
little bit clunky. It's worthwhile, though, for the sake of keeping
these #if directives reasonably well-contained.
If we see index options that ask us to decrypt when indexing a
message, and we encounter an encrypted part, we'll try to descend into
it.
If we can decrypt, we add the property index.decryption=success.
If we can't decrypt (or recognize the encrypted type of mail), we add
the property index.decryption=failure.
Note that a single message may have both values of the
"index.decryption" property: "success" and "failure". For example,
consider a message that includes multiple layers of encryption. If we
manage to decrypt the outer layer ("index.decryption=success"), but
fail on the inner layer ("index.decryption=failure").
Because of the property name, this will be automatically cleared (and
possibly re-set) during re-indexing. This means it will subsequently
correspond to the actual semantics of the stored index.
C99 stdbool turned 18 this year. There really is no reason to use our
own, except in the library interface for backward
compatibility. Convert the lib internally to stdbool.
This is a logical followup to "lib: index the content type of
signature parts", which will make it easier to record the message
structure of all messages.
It's useful (*) to be able to easily find messages with certain types
of signatures. Having the mimetype: prefix searches fail for some
content types is also genuinely surprising (*). Index the content type
of signature parts.
While at it, switch to the gmime convenience constants for content and
signature part indexes.
*) At least for developers of email software!
'g_object_newv' is deprecated, and prints annoying warnings. The
warnings suggest using 'g_object_new_with_properties', but that's only
available since glib 2.55 (i.e. a month ago as of this writing).
Since we don't actuall pass any properties, it seems we can just call
'g_object_new'.
We want to reuse the scanner definition with a different table. This
is mainly code movement, and making the state table part of the filter
struct/class.
Many of the external links found in the notmuch source can be resolved
using https instead of http. This changeset addresses as many as i
could find, without touching the e-mail corpus or expected outputs
found in tests.
Cleaned the following whitespace in lib/* files:
lib/index.cc: 1 line: trailing whitespace
lib/database.cc 5 lines: 8 spaces at the beginning of line
lib/notmuch-private.h: 4 lines: 8 spaces at the beginning of line
lib/message.cc: 1 line: trailing whitespace
lib/sha1.c: 1 line: empty lines at the end of file
lib/query.cc: 2 lines: 8 spaces at the beginning of line
lib/gen-version-script.sh: 1 line: trailing whitespace
Per RFC 2183, the values for Content-Disposition values are not
case-sensitive. While at it, use the gmime function for getting at the
disposition string instead of referencing the field directly.
This fixes "attachment" tagging and filename term generation for
attachments while indexing.
This is not supposed to change any functionality from an end user
point of view. Note that it will eliminate some output to stderr. The
query debugging output is left as is; it doesn't really fit with the
current primitive logging model. The remaining "bad" fprintf will need
an internal API change.
This adds the indexing support for the "mimetype:" term and removes
the broken test flag. The indexing is probablistic in Xapian terms,
which gives a better experience to end users. Standard content-types
of the form "foo/bar" are automatically interpreted as phrases in
Xapian due to the embedded slash.
Assume, separate messages with application/pdf and application/x-pdf
are indexed, then:
- mimetype:application/x-pdf will find only the application/x-pdf
- mimetype:application/pdf will find only the application/pdf
- mimetype:pdf will find both of the messages
Previously, we indexed the name and address parts of from/to headers
with two calls to _notmuch_message_gen_terms. In general, this
indicates that these parts are separate phrases. However, because of
an implementation quirk, the two calls to _notmuch_message_gen_terms
generated adjacent term positions for the prefixed terms, which
happens to be the right thing to do in this case, but the wrong thing
to do for all other calls. Furthermore, _notmuch_message_gen_terms
produced potentially overlapping term positions for the un-prefixed
copies of the terms, which is simply wrong.
This change indexes both the name and address in a single call to
_notmuch_message_gen_terms, indicating that they should be part of a
single phrase. This masks the problem with the un-prefixed terms
(fixing the two known-broken tests) and puts us in a position to fix
the unintentionally phrases generated by other calls to
_notmuch_message_gen_terms.
The notmuch library includes a full blown message header parser. Yet
the same message headers are parsed by gmime during indexing. Switch
to gmime parsing completely.
These are the main changes:
* Gmime stops header parsing at the first invalid header, and presumes
the message body starts from there. The current parser is quite
liberal in accepting broken headers. The change means we will be
much pickier about accepting invalid messages.
* The current parser converts tabs used in header folding to
spaces. Gmime preserve the tabs. Due to a broken python library used
in mailman, there are plenty of mailing lists that produce headers
with tabs in header folding, and we'll see plenty of tabs. (This
change has been mitigated in preparatory patches.)
* For pure header parsing, the current parser is likely faster than
gmime, which parses the whole message rather than just the
headers. Since we parse the message and its headers using gmime for
indexing anyway, this avoids and extra header parsing round when
adding new messages. In case of duplicate messages, we'll end up
parsing the full message although just headers would be
sufficient. All in all this should still speed up 'notmuch new'.
* Calls to notmuch_message_get_header() may be slightly slower than
previously for headers that are not indexed in the database, due to
parsing of the whole message. Within the notmuch code base, notmuch
reply is the only such user.
We've supported mbox files containing a single message for historical
reasons, but the support has been deprecated, with a warning message
while indexing, since Notmuch 0.15. Finally drop the support, and
consider all mbox files non-email.
As explained by Jeffrey Stedfast, the author of GMime, quoted in [1]:
> Passing the GMIME_ENABLE_RFC2047_WORKAROUNDS flag to g_mime_init()
> *should* solve the decoding problem mentioned in the thread. This
> flag should be safe to pass into g_mime_init() without any bad side
> effects and my unit tests do test that code-path.
The thread being referred to is [2].
[1] id:87bo56viyo.fsf@nikula.org
[2] id:08cb1dcd-c5db-4e33-8b09-7730cb3d59a2@gmail.com
Apparently as of GMime 2.4, you don't need to call
internet_address_list_destroy anymore, but you still need to call
g_object_unref (from the GMime Changelog).
On the medium performance corpus, valgrind shows "possibly lost"
leakage in "notmuch new" dropping from 7M to 300k.
Previously, we would treat multi-message mboxes as one giant email,
which, besides the obvious incorrect indexing, often led to
out-of-memory errors for archival mboxes. Now we explicitly reject
multi-message mboxes. For historical reasons, we retain support for
single-message mboxes, but official deprecate this behavior.
This fixes a bug that didn't allow to search for non-ASCII words such
parts. The code here was copied from show_text_part_content(), because
the show command already does the needed conversion when showing the
message.
It appears to be an oversight that encrypted parts were indexed
previously. The terms generated from encrypted parts are meaningless
and do nothing but add bloat to the database. It is not worth
indexing the encrypted content, just as it's not worth indexing the
signatures in signed parts.