notmuch/test/T590-thread-breakage.sh

test thread breakage when messages are removed and re-added

This test (T590-thread-breakage.sh) has known-broken subtests.

If you have a two-message thread where message "B" is in-reply-to
"A", notmuch rightly sees this as a single thread.

But if you:

 * remove "A" from the message store
 * run "notmuch new"
 * add "A" back into the message store
 * re-run "notmuch new"

Then notmuch sees the messages as distinct threads.

This happens because if you insert "B" initially (before anything is
known about "A"), then a "ghost message" gets added to the database in
reference to "A" that is in the same thread, which "A" takes over when
it appears.

But if "A" is subsequently removed, no ghost message is retained, so
when "A" appears, it is treated as a new thread.

I see a few options to fix this:

ghost-on-removal
----------------

We could unilaterally add a ghost upon message removal. This has a
few disadvantages: the message index would leak information about what
messages the user has ever been exposed to, and we also create a
perpetually-growing dataset -- the ghosts can never be removed.

ghost-on-removal-when-shared-thread-exists
------------------------------------------

We could add a ghost upon message removal iff there are other
non-ghost messages with the same thread ID.

We'd also need to remove all ghost messages that share a thread when
the last non-ghost message in that thread is removed.

This still has a bit of information leakage, though: the message index
would reveal that i've seen a newer message in a thread, even if i had
deleted it from my message store

track-dependencies
------------------

rather than a simple "ghost-message" we could store all the (A,B)
message-reference pairs internally, showing which messages A reference
which other messages B.

Then removal of message X would require deleting all message-reference
pairs (X,B), and only deleting a ghost message if no (A,X) reference
pair exists.

This requires modifying the database by adding a new and fairly weird
table that would need to be indexed by both columns. I don't know
whether xapian has nice ways to do that.

scan-dependencies
-----------------

Without modifying the database, we could do something less efficient.

Upon removal of message X, we could scan the headers of all non-ghost
messages that share a thread with X. If any of those messages refers
to X, we would add a ghost message. If none of them do, then we would
just drop X entirely from the table.

---------------------

One risk of attempted fixes to this problem is that we could fail to
remove the search term indexes entirely. This test contains
additional subtests to guard against that.

This test also ensures that the right number of ghost messages exist
in each situation; this will help us ensure we don't accumulate ghosts
indefinitely or leak too much information about what messages we've
seen or not seen, while still making it easy to reassemble threads
when messages come in out-of-order.

2016-04-09 03:54:47 +02:00
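The scan-dependencies option described in the commit message can be illustrated with a rough sketch. This is not notmuch code: `ghost_needed` is a hypothetical helper that works on raw message files in a directory rather than on the Xapian database, and it assumes one message per file with unfolded In-Reply-To: and References: headers. It only shows the decision "does any remaining message still refer to X?".

```shell
#!/usr/bin/env bash
# Hypothetical helper (an assumption for illustration, not part of notmuch).
# Usage: ghost_needed DIR MSGID
# Succeeds if any file in DIR names <MSGID> in its In-Reply-To: or
# References: header, i.e. a ghost for MSGID would still be useful.
ghost_needed() {
    local dir=$1 msgid=$2 file
    for file in "$dir"/*; do
        [ -f "$file" ] || continue
        # Look only at the header block (everything up to the first blank
        # line). Folded continuation lines are not handled in this sketch.
        if sed '/^$/q' "$file" |
            grep -i -E '^(In-Reply-To|References):' |
            grep -q -F "<$msgid>"; then
            return 0
        fi
    done
    return 1
}
```

In the scenario this test exercises, calling `ghost_needed "${MAIL_DIR}/cur" a@example.net` after `cur/a` has been deleted would succeed while `cur/b` is still present, because `b` lists `<a@example.net>` in its In-Reply-To: header. A real fix would have to consult the index for messages sharing a thread with X instead of scanning a directory.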
#!/usr/bin/env bash
#
# Copyright (c) 2016 Daniel Kahn Gillmor
#
test_description='thread breakage during reindexing'
# notmuch uses ghost documents to track messages we have seen references
# to but have never seen. Regardless of the order of delivery, message
# deletion, and reindexing, the list of ghost messages for a given
# stored corpus should not vary, so that threads can be reassembled
# cleanly.
#
# In practice, we accept a small amount of variation (and therefore
# traffic pattern metadata leakage to be stored in the index) for the
# sake of efficiency.
#
# This test also embeds some subtests to ensure that indexing actually
# works properly and attempted fixes to threading issues do not break
# the expected contents of the index.
. "$(dirname "$0")/test-lib.sh" || exit 1
# Write message A (the thread root) into the maildir.
message_a() {
    mkdir -p "${MAIL_DIR}/cur"
    cat > "${MAIL_DIR}/cur/a" <<EOF
Subject: First message
Message-ID: <a@example.net>
From: Alice <alice@example.net>
To: Bob <bob@example.net>
Date: Thu, 31 Mar 2016 20:10:00 -0400

This is the first message in the thread.
Apple
EOF
}
# Write message B (a reply to A) into the maildir.
message_b() {
    mkdir -p "${MAIL_DIR}/cur"
    cat > "${MAIL_DIR}/cur/b" <<EOF
Subject: Second message
Message-ID: <b@example.net>
In-Reply-To: <a@example.net>
References: <a@example.net>
From: Bob <bob@example.net>
To: Alice <alice@example.net>
Date: Thu, 31 Mar 2016 20:15:00 -0400

This is the second message in the thread.
Banana
EOF
}
# Expect $2 threads matching search term $1; $3 is an optional subtest name.
test_content_count() {
    test_begin_subtest "${3:-looking for $2 instance(s) of '$1'}"
    count=$(notmuch count --output=threads "$1")
    test_expect_equal "$count" "$2"
}
# Expect $1 threads in the whole database; $2 is an optional subtest name.
test_thread_count() {
    test_begin_subtest "${2:-Expecting $1 thread(s)}"
    count=$(notmuch count --output=threads)
    test_expect_equal "$count" "$1"
}
# Expect $1 ghost documents in the database; $2 is an optional subtest name.
test_ghost_count() {
    test_begin_subtest "${2:-Expecting $1 ghost(s)}"
    ghosts=$("$NOTMUCH_BUILDDIR/test/ghost-report" "${MAIL_DIR}/.notmuch/xapian")
    test_expect_equal "$ghosts" "$1"
}
notmuch new >/dev/null
test_thread_count 0 'There should be no threads initially'
test_ghost_count 0 'There should be no ghosts initially'
message_a
notmuch new >/dev/null
test_thread_count 1 'One message in: one thread'
test_content_count apple 1
test_content_count banana 0
test_ghost_count 0
message_b
notmuch new >/dev/null
test_thread_count 1 'Second message in the same thread: one thread'
test_content_count apple 1
test_content_count banana 1
test_ghost_count 0
rm -f "${MAIL_DIR}/cur/a"
notmuch new >/dev/null
test_thread_count 1 'First message removed: still only one thread'
test_content_count apple 0
test_content_count banana 1
test_ghost_count 1 'should be one ghost after first message removed'
message_a
notmuch new >/dev/null
test_thread_count 1 'First message reappears: should return to the same thread'
test_content_count apple 1
test_content_count banana 1
test_ghost_count 0
rm -f "${MAIL_DIR}/cur/b"
notmuch new >/dev/null
test_thread_count 1 'Removing second message: still only one thread'
test_content_count apple 1
test_content_count banana 0
test_begin_subtest 'No ghosts should remain after deletion of second message'
# this is known to fail; we are leaking ghost messages deliberately
test_subtest_known_broken
ghosts=$("$NOTMUCH_BUILDDIR/test/ghost-report" "${MAIL_DIR}/.notmuch/xapian")
test_expect_equal "$ghosts" "0"
rm -f "${MAIL_DIR}/cur/a"
notmuch new >/dev/null
test_thread_count 0 'All messages gone: no threads'
test_content_count apple 0
test_content_count banana 0
test_ghost_count 0 'No ghosts should remain after full thread deletion'
test_done