The bug here was that we would see that the database did not know
anything about a directory, so we would get results from the filesystem in
inode rather than strcmp order.
However, we wouldn't actually ask for the list of files from the
database until after recursing into the sub-directories. So by the
time we traverse the filenames looking for deletions, the database
*does* have entries and we end up detecting erroneous deletions
because our filename list from the filesystem isn't in strcmp order.
So ask for the list of names from the database before doing any
additions to avoid this problem.
Previously we only scanned the list of filenames in the filesystem and
detected a deletion whenever that scan skipped a name that existed in
the database. That much was fine, but we *also* need to continue
walking the list of names from the database when the filesystem list
is exhausted.
Without this, removing the last file or directory within any
particular directory would go undetected.
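
The deletion scan described above can be sketched as a parallel walk over two strcmp-sorted name lists. This is only an illustration under stated assumptions: find_deleted_names and its parameters are hypothetical names, not the code in "notmuch new", and both lists are assumed to be sorted already.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: walks two strcmp-sorted name lists in parallel and
 * stores in `deleted` every name present in db_names but absent from
 * fs_names. Returns the number of deletions found. The caller must make
 * `deleted` large enough to hold num_db entries. */
size_t
find_deleted_names (const char **fs_names, size_t num_fs,
                    const char **db_names, size_t num_db,
                    const char **deleted)
{
    size_t f = 0, d = 0, n = 0;

    while (f < num_fs) {
        /* Any database name sorting before the current filesystem name
         * was skipped by the filesystem scan, so it is a deletion. */
        while (d < num_db && strcmp (db_names[d], fs_names[f]) < 0)
            deleted[n++] = db_names[d++];

        /* Name present in both lists: step past it in the database list. */
        if (d < num_db && strcmp (db_names[d], fs_names[f]) == 0)
            d++;

        f++;
    }

    /* The filesystem list is exhausted. Whatever remains in the database
     * list no longer exists on disk. */
    while (d < num_db)
        deleted[n++] = db_names[d++];

    return n;
}
```

The final loop is the part that was missing: without it, a name that sorts after every remaining filesystem name is never reported.
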
As described in the previous commit message, we introduced multiple
symlink-based regressions in commit
3df737bc4addfce71c647792ee668725e5221a98.
Here, we fix the case of symlinks to regular files by doing an extra
stat of any DT_LNK files to determine if they do, in fact, link to
regular files.
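
A minimal sketch of the extra stat for DT_LNK entries might look like the following. The helper name entry_is_regular_file, the fixed-size path buffer, and the choice to treat a dangling link as "not a regular file" are all assumptions made for illustration, not the actual change.

```c
#define _DEFAULT_SOURCE /* for DT_REG and DT_LNK on glibc */
#include <dirent.h>
#include <stddef.h>
#include <stdio.h>
#include <sys/stat.h>

/* Hypothetical helper: returns 1 if `entry` within `dir_path` is a regular
 * file or a symlink that resolves to one, and 0 otherwise. */
int
entry_is_regular_file (const char *dir_path, const struct dirent *entry)
{
    char path[4096];
    struct stat st;
    int len;

    if (entry->d_type == DT_REG)
        return 1;

    if (entry->d_type != DT_LNK)
        return 0;

    len = snprintf (path, sizeof (path), "%s/%s", dir_path, entry->d_name);
    if (len < 0 || (size_t) len >= sizeof (path))
        return 0; /* path did not fit in the buffer */

    /* stat() follows the symlink, so st describes the link target. */
    if (stat (path, &st) != 0)
        return 0; /* dangling link or other error */

    return S_ISREG (st.st_mode) ? 1 : 0;
}
```

Only symlinks pay for the additional stat call; regular files are still identified from d_type alone. Note that some filesystems report DT_UNKNOWN in d_type, which this sketch does not handle.
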
In commit 3df737bc4addfce71c647792ee668725e5221a98 we switched from
using stat() to using the d_type field in the result of scandir() to
determine whether a filename is a regular file or a directory. This
change introduced a regression in that the recursion would no longer
traverse through a symlink to a directory. (Since stat() would resolve
the symlink but with scandir() we see a distinct DT_LNK value in
d_type).
We fix this for directories by allowing both DT_DIR and DT_LNK values
to recurse, and then downgrading the existing not-a-directory check
within the recursion to not be an error. We also add a new
not-a-directory check outside the recursion that is an error.
Similar to the return value of notmuch_database_add_message, we now
enhance the return value of notmuch_database_remove_message to
indicate whether the message document was entirely removed (SUCCESS)
or whether only this filename was removed and the document exists
under other filenames (DUPLICATE_MESSAGE_ID).
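
The two return values can be illustrated with a toy in-memory model. Everything here is hypothetical (the model_* names, the fixed-size array, the swap-with-last removal); it only mirrors the SUCCESS versus DUPLICATE_MESSAGE_ID distinction described above and is not the library's implementation.

```c
#include <string.h>

#define MODEL_MAX_FILENAMES 8

/* Hypothetical status values modeled on the two outcomes named above. */
typedef enum {
    MODEL_STATUS_SUCCESS,
    MODEL_STATUS_DUPLICATE_MESSAGE_ID
} model_status_t;

/* Toy message: a single message ID and the filenames that carry it. */
typedef struct {
    const char *filenames[MODEL_MAX_FILENAMES];
    int num_filenames;
} model_message_t;

/* Removes `filename` from the message if present. Returns SUCCESS when no
 * filename remains (the message itself is gone) and DUPLICATE_MESSAGE_ID
 * when the message still exists under other filenames. */
model_status_t
model_remove_filename (model_message_t *message, const char *filename)
{
    int i;

    for (i = 0; i < message->num_filenames; i++) {
        if (strcmp (message->filenames[i], filename) == 0) {
            /* Fill the removed slot with the last entry. */
            message->num_filenames--;
            message->filenames[i] =
                message->filenames[message->num_filenames];
            break;
        }
    }

    return message->num_filenames == 0
        ? MODEL_STATUS_SUCCESS
        : MODEL_STATUS_DUPLICATE_MESSAGE_ID;
}
```

A caller can use the distinction to decide whether any per-message cleanup is needed after a file disappears.
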
Previously, adding a filename with the same message ID as an existing
message would do nothing. But we recently fixed this to instead add
the new filename to the existing message document. So update the
documentation to match.
In the presentation we often omit citations and signatures, but this
is not content that should be omitted from the index, (especially
when the citation detection is wrong---see cases where a line
beginning with "From" is corrupted to ">From" by mail processing
tools).
The "notmuch new" command will now efficiently notice if any files or
directories have been removed from the mail store and will
appropriately update its database.
Any given mail message (as determined by the message ID) may have
multiple corresponding filenames, and notmuch will return one of
them. When a file is deleted, the corresponding filename will be
removed from the message in the database. When the last filename is
removed from a message, that message will be entirely removed from the
database.
All file additions are handled before any file removals so that rename
is supported properly.
It is essential to defer the actual removal of any filenames from the
database until we are entirely done adding any new files. This is to
avoid any information loss from the database in the case of a renamed
file or directory.
Note that we're *still* not actually doing any removal---still just
printing messages indicating the filenames that were detected as
removed. But we're at least now printing those messages at a time when
we actually *can* do the actual removal.
This takes advantage of the notmuch_directory_t interfaces added
recently (with corresponding storage of directory documents in the
database) to detect when files or entire directories are deleted or
renamed within the mail store.
This also fixes the recent regression where *all* files would be
processed by every run of "notmuch new", (now only new files are
processed once again).
The deleted files and directories are only detected so far. They
aren't properly removed from the database.
Previously, we were re-scanning the entire list of entries for every
directory entry. Instead, we can simply check if the entries look like
a maildir once, up-front.
We now do two scans over the entries returned from scandir. The first
scan is looking for directories (and making the recursive call). The
second scan is looking for new files to add to the database.
This is easier to read than the previous code which had a single loop
and some if statements with ridiculously long bodies. It also has the
advantage that once the directory scan is complete we can do a single
comparison of the filesystem and database mtimes and entirely skip the
second scan if it's not needed.
Previously we had an array named "namelist" and its count named
"num_entries". We now use an array name of "fs_entries" and a count
named "num_fs_entries" to try to preserve sanity.
We were previously using the stat for two reasons. One was to obtain
the mtime of the file. This usage was removed in the previous commit,
(since the mtime is unreliable in the case of a file being moved into
the mail store).
The second reason was to identify regular and directory file
types. But this information is already available in the result we get
from scandir.
What's left is simply a stat for each directory in the mailstore,
(which we are still using to compare filesystem mtime with the mtime
stored in the database).
This check was buggy in that moving a pre-existing file into the mail
store, (where the file existed before the last run of "notmuch new"),
does not update the mtime of the file. So the message would never be
added to the database.
The fix here is not practical in the long run, (since it causes *all*
files in the mail store to be processed in every run of "notmuch new"
(!)). But this change will let us drop a stat() call that we don't
otherwise need and will help move us toward proper database-backed
detection of new files, (which will fix the bug without the
performance impact of the current fix).
The previous name of "path_mtime" was very ambiguous. The new names
are much more obvious (fs_mtime is the mtime from the filesystem and
db_mtime is the mtime from the database).
This was a very dangerous bug. An interrupted "notmuch new" session
would still update the timestamp for the directory in the
database. This would result in mail files that were not processed due
to the original interruption *never* being picked up by future runs of
"notmuch new". Yikes!
This new directory object provides all the infrastructure needed to
detect when files or directories are deleted or renamed. There's still
code needed on top of this (within "notmuch new") to actually do that
detection.
This commit contains my changes to the API proposed by Keith. Nothing
is dramatically different. There are minor things like changing
notmuch_files_t to notmuch_filenames_t and then various things needed
for completeness as noticed while implementing this, (such as
notmuch_directory_destroy and notmuch_directory_set_mtime).
This will allow applications to support the removal of messages, (such
as when a file is deleted from the mail store). No removal support is
provided yet in commands such as "notmuch new".
The existing find_doc_ids function is convenient when the caller
doesn't want to be bothered constructing a term. But when the caller
*does* have the term already, that interface is just wasteful. So we
export a lower-level interface that maps a pre-constructed term to a
document-ID iterator.
The code to map a filename to a direntry is something that we're going
to want in a future _remove_message function, so put it in a new
function _notmuch_database_filename_to_direntry.
The library interface is unchanged so far, (still just
notmuch_database_add_message), but internally, the old
_set_filename function is now _add_filename instead.
Instead of storing the complete message filename in the data portion
of a mail document we now store a 'direntry' term that contains the
document ID of a directory document and also the basename of the
message filename within that directory. This will allow us to easily
store multiple filenames for a single message, and will also allow us
to find mail documents for files that previously existed in a
directory but that have since been deleted.
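
To make the idea concrete, a direntry value could be built by joining the directory document ID and the basename. The "<id>:<basename>" layout and the helper name format_direntry below are assumptions for illustration only; the commit message does not spell out the exact encoding.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: formats a direntry value as "<directory_id>:<name>"
 * into buf. Returns 0 on success, or -1 if the result did not fit. */
int
format_direntry (char *buf, size_t size,
                 unsigned int directory_id, const char *name)
{
    int n = snprintf (buf, size, "%u:%s", directory_id, name);

    return (n >= 0 && (size_t) n < size) ? 0 : -1;
}
```

Because the value names the directory by document ID, all files that were ever recorded in a directory share a common prefix, and one message can carry several such values.
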
Some pending commits want the _split_path functionality separate from
mapping a directory to a document ID. The split_path function now
returns the basename as well as the directory name.
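
A rough sketch of a split that returns both pieces, assuming a path with no trailing slash. The name sketch_split_path and its malloc-based signature are hypothetical and differ from the internal function.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: splits `path` at its last '/'.
 * On success returns 0, stores a newly allocated directory string in
 * *directory (the caller frees it) and a pointer into `path` in *base.
 * A path with no '/' yields an empty directory. Returns -1 if the
 * allocation fails. */
int
sketch_split_path (const char *path, char **directory, const char **base)
{
    const char *slash = strrchr (path, '/');
    size_t dir_len;

    if (slash == NULL) {
        dir_len = 0;
        *base = path;
    } else {
        dir_len = (size_t) (slash - path);
        *base = slash + 1;
    }

    *directory = malloc (dir_len + 1);
    if (*directory == NULL)
        return -1;

    memcpy (*directory, path, dir_len);
    (*directory)[dir_len] = '\0';

    return 0;
}
```

Keeping the split separate from the directory-to-document-ID lookup lets callers that only need the basename avoid touching the database.
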
We're planning to have mail documents refer to directory documents for
the path of the containing directory. To support this, we need the
path in the data, (since the path in the 'directory' term can be
irretrievable as it will be the SHA1 sum of the path for a very long
path).
We'll soon have mail documents referring to their parent directory's
directory documents, so we'll need access to _find_parent_id in files
such as message.cc.
Storing the document ID of the parent of each directory document will
allow us to find all child-directory documents for a given directory
document. We will need this in order to detect directories that have
been removed from the mail store, (though we aren't yet doing this).
The recent change from storing absolute paths to relative paths means
that new directory documents will already be created, (and the old
ones will just linger stale in the database). Given that, we might as
well put a clean name on the term in the new documents, (and no real
flag day is needed).
We were already storing relative mail filenames, so this is consistent
with that. Additionally, it means that directory documents remain
valid even if the database is relocated within its containing
filesystem.
We'll soon be having multiple entry points that accept a filename
path, so we want common code for getting a relative path from a
potentially absolute path.
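
Such common code might look roughly like this. The helper name sketch_relative_path is hypothetical, and the sketch assumes that a path outside the database directory should simply be returned unchanged.

```c
#include <string.h>

/* Hypothetical helper: if `path` is an absolute path inside `db_path`,
 * returns a pointer to the portion of `path` relative to `db_path`.
 * Otherwise returns `path` unchanged. No memory is allocated. */
const char *
sketch_relative_path (const char *db_path, const char *path)
{
    size_t db_len = strlen (db_path);
    const char *rel;

    /* Ignore trailing slashes on the database path. */
    while (db_len > 0 && db_path[db_len - 1] == '/')
        db_len--;

    if (path[0] != '/' || strncmp (path, db_path, db_len) != 0)
        return path;

    rel = path + db_len;

    /* The match must end on a path-component boundary, so that
     * "/home/u/mailbox" is not treated as inside "/home/u/mail". */
    if (*rel != '/' && *rel != '\0')
        return path;

    while (*rel == '/')
        rel++;

    return rel;
}
```
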
This was really the last thing keeping the initial run of "notmuch
new" different from all other runs. And I'm taking a fresh
look at the performance of "notmuch new" anyway, so I think we can
safely drop this optimization.
And fix the initialization such that the private enum will always have
distinct values from the public enum even if we similarly miss the
addition of a new public value in the future.
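
One way to get that guarantee, sketched with hypothetical SKETCH_* names rather than the real status enums, is to end the public enum with a sentinel and start the private-only values from it:

```c
/* Hypothetical public enum. The final entry is not a real status; it
 * must stay last so that it always equals the number of public values. */
typedef enum {
    SKETCH_STATUS_SUCCESS = 0,
    SKETCH_STATUS_OUT_OF_MEMORY,
    SKETCH_STATUS_FILE_ERROR,
    SKETCH_STATUS_LAST_STATUS
} sketch_status_t;

/* Hypothetical private enum. Public values are mirrored explicitly, and
 * private-only values begin at the public sentinel, so they cannot
 * collide with any public value, including ones added later that someone
 * forgets to mirror here. */
typedef enum {
    SKETCH_PRIVATE_STATUS_SUCCESS = SKETCH_STATUS_SUCCESS,
    SKETCH_PRIVATE_STATUS_OUT_OF_MEMORY = SKETCH_STATUS_OUT_OF_MEMORY,
    SKETCH_PRIVATE_STATUS_FILE_ERROR = SKETCH_STATUS_FILE_ERROR,

    SKETCH_PRIVATE_STATUS_TERM_TOO_LONG = SKETCH_STATUS_LAST_STATUS,
    SKETCH_PRIVATE_STATUS_NO_DOCUMENT_FOUND
} sketch_private_status_t;
```

Adding a new public status before the sentinel shifts the sentinel, and with it every private-only value, without any edit to the private enum.
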
Several people complained that the humor wore thin very quickly. The
most significant case of "not much mail" is when counting the user's
initial mail collection. We've promised on the web page that no matter
how much mail the user has, notmuch will consider it to be "not much"
so let's say so. (This message was in place very early on, but was
inadvertently dropped at some point.)
The in-development version of Xapian provides a config program named
xapian-config-1.1 while the released version provides a program named
xapian-config instead. By default, we now try each of these in turn,
and we also allow the user to set a XAPIAN_CONFIG environment variable
to explicitly specify a particular program.
We've received a user report that the hidden citations were annoying
since the user couldn't tell what was being referred to by subsequent
text. Apparently it wasn't obvious enough that the hidden citation
could be revealed by clicking or by pressing Enter. So make the button
text say as much.
In the message mentioned in the previous commit, an ASCII diagram was
included in which '>' was used as the first non-whitespace character
in a line. Notmuch previously (and mistakenly) regarded this as a
citation.
We fix this by only regarding a '>' in the first column of an email as
introducing a citation.
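
The difference between the old and new rules can be shown with two small predicates. They are written in C purely for illustration, with hypothetical names; they are not the code that performs citation detection.

```c
#include <ctype.h>

/* Old rule (hypothetical rendering): a line is a citation if its first
 * non-whitespace character is '>'. */
int
citation_old_rule (const char *line)
{
    while (isspace ((unsigned char) *line))
        line++;

    return *line == '>';
}

/* New rule (hypothetical rendering): a line is a citation only if '>'
 * appears in the very first column. */
int
citation_new_rule (const char *line)
{
    return line[0] == '>';
}
```

An indented diagram line such as "    >---+" matches the old rule but not the new one, while an ordinary quoted line such as "> text" matches both.
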