This is a simplified version of a patch originally by Michal Sojka
<sojkam1@fel.cvut.cz> which is designed to have the same performance
benefits. Michal said the following:
When notmuch new is run for the first time, it is not necessary to
defer maildir flags synchronization to later because we already know
that no files will be removed.
Performing the maildinr flag synchronization immediately after the
message is added to the database has the advantage that the message
is likely hot in the disk cache so the synchronization is faster.
Additionally, we also save one database query for each message,
which must be performed when the operation is deferred.
Without this patch, the first notmuch new of 200k messages (3 GB)
took 1h and 46m out of which 20m was maildir flags
synchronization. With this patch, the whole operation took only 1h
and 36m.
Unlike Michal's patch, this version does the deferral for any new
message, rather than doing it only on the first run of "notmuch new".
Previously, we would only scan a directory if the filesystem
modification time was strictly newer than the database modification
time for the directory. This would cause a problem for systems with an
unstable clock, (if a new mail was added to the filesystem, then the
system clock rolled backward, "notmuch new" would not find the message
until the clock caught up and the directory was modified again).
Now, we always scan the directory if the modification time of the
directory is not exactly the same between the filesystem and the
database. This avoids the problem described above even with an
unstable system clock.
When a file in the mailstore is renamed, this appears to "notmuch new"
as both an added file and a removed file (for the same message). We
want the synchronization of the maildir_flags to reflect the final
state, (after the rename is complete). Therefore, it's incorrect to
perform the synchronization immediately after adding a new
file. Instead we queue up these synchronizations (by message ID[*])
and perform them after the removals are complete.
With this change, the "dump/restore" case of the maildir-sync tests,
as well as the recent "remove 'S'" case both now pass where they were
failing before.
Interestingly, the "remove info" test was passing before, but now
fails. This is actually due to a separate bug, (and the bug just fixed
was masking it, by preventing the test from performing as desired).
[*] It's important to queue by message ID---queueing actual message
objects does not work since the message objects will retain stale data
such as the old filenames.
Instead of having an API for setting a library-wide flag for
synchronization (notmuch_database_set_maildir_sync) we instead
implement maildir synchronization with two new library functions:
notmuch_message_maildir_flags_to_tags
and notmuch_message_tags_to_maildir_flags
These functions are nicely documented here, (though the implementation
does not quite match the documentation yet---as plainly evidenced by
the current results of the test suite).
Since the name of the configuration parameter here is:
maildir.synchronize_flags
the convention is that the functions to get and set this parameter
should match it in name. Hence:
notmuch_config_get_maildir_synchronize_flags
etc. (as opposed to notmuch_config_get_maildir_sync).
This adds group [maildir] and key 'synchronize_flags' to the
configuration file. Its value enables (true) or diables (false) the
synchronization between notmuch tags and maildir flags. By default,
the synchronization is disabled.
This patch allows bi-directional synchronization between maildir
flags and certain tags. The flag-to-tag mapping is defined by flag2tag
array.
The synchronization works this way:
1) Whenever notmuch new is executed, the following happens:
o New messages are tagged with configured new_tags.
o For new or renamed messages with maildir info present in the file
name, the tags defined in flag2tag are either added or removed
depending on the flags from the file name.
2) Whenever notmuch tag (or notmuch restore) is executed, a new set of
flags based on the tags is constructed for every message and a new
file name is prepared based on the old file name but with the new
flags. If the flags differs and the old message was in 'new'
directory then this is replaced with 'cur' in the new file name. If
the new and old file names differ, the file is renamed and notmuch
database is updated accordingly.
The rename happens before the database is updated. In case of crash
between rename and database update, the next run of notmuch new
brings the database in sync with the mail store again.
Add a new_tags option in the [messages] section of the configuration
file to allow the user to specify which tags should be added to new
messages by notmuch new.
When Ctrl-C is pressed in a wrong time during notmuch new, it can lead
to removal of messages from the database even if the files were not
removed.
It happened at least once to me.
Signed-off-by: Michal Sojka <sojkam1@fel.cvut.cz>
We rename 'has_more' to 'valid' so that it can function whether
iterating in a forward or reverse direction. We also rename
'advance' to 'move_to_next' to setup parallel naming with
the proposed functions 'move_to_first', 'move_to_last', and
'move_to_previous'.
Such as reiserfs or xfs. This has been broken since the merge of
support for rename and deletion of files from the mail store.
Here's the original justification for the patch:
A review of notmuch-new.c shows three uses of ->d_type:
Near line 153, in _entries_resemble_maildir() we can simply allow for
DT_UNKNOWN. This would fail if people have MH-style folders which have
three folders called "new" "cur" and "tmp", but that seems unlikely, in
which case the "tmp" folder would simply not be scanned.
Near line 273 in add_files_recursive() we have another check. If
DT_UNKNOWN, we fall through, then add_files_recursive() does a stat
almost immediately, returning with success if the path isn't a
directory.
Thus, the fallback is already written.
Finally, near line 343, in add_files_recursive() (a long function) we
have another check. Here we can simply treat DT_UNKNOWN as DT_LNK, since
the logic for the stat() results are the same.
Previously we were printing a number of messages upgraded so far. The
original motivation for this was to accurately reflect the fact that
there are two passes, (so each message is processed twice and it's not
accurate to represent with a single count). But as it turns out, the
second pass takes zero time (relatively speaking) so we're still not
accounting for it.
If nothing else, the percentage-based reporting makes for a cleaner
API for the progress_notify function.
Our signal handler is designed to quickly flush out changes and then
exit. But if a database upgrade is in progress when the user
interrupts, then we just want to immediately abort. We could do
something fancy like add a return value to our progress_notify
function to allow it to tell the upgrade process to abort. But it's
actually much cleaner and robust to delay the installation of our
signal handler so that the default abort happens on SIGINT.
This takes advantage of the recently added library support to detect
if the database needs to be upgraded and then automatically performs
that upgrade, (with a nice progress report).
Previously, when notmuch detected that a directory had been deleted it
was only removing files immediately in that directory. We now
correctly recurse to also remove any directories (and files, etc.)
within sub-directories, etc.
Previously we had NOTMUCH_DATABASE_MODE_READ_ONLY but
NOTMUCH_STATUS_READONLY_DATABASE which was ugly and confusing. Rename
the latter to NOTMUCH_STATUS_READ_ONLY_DATABASE for consistency.
When we know that we are adding a new directory to the database, (and
we therefore are using inode rather than strcmp-based sorting of the
filenames), then we *never* want to see any names from the
database. If we get any names that could only make us inadvertently
remove files that we just added.
Since it's not obvious from the Xapian documentation whether new terms
being added as part of new documents will appear in the in-progress
all-terms iteration we are using, (and this might differ based on
Xapian backend and also might differ based on how many new directories
are added and whether a flush threshold is reached).
For all of these reasons, we play it safe and use NULL rather than a
real notmuch_filenames_t iterator in this case to avoid any problem.
The bug here was that we would see that the database did not know
anything about a directory so would get results from the filesystem in
inode rather than strcmp order.
However, we wouldn't actually ask for the list of files from the
database until after recursing into the sub-directories. So by the
time we traverse the filenames looking for deletions, the database
*does* have entries and we end up detecting erroneous deletions
because our filename list from the filesystem isn't in strcmp order.
So ask for the list of names from the database before doing any
additions to avoid this problem.
Previously we only scanned the list of filenames in the filesystem and
detected a deletion whenever that scan skipped a name that existed in
the database. That much was fine, but we *also* need to continue
walking the list of names from the database when the filesystem list
is exhausted.
Without this, removing the last file or directory within any
particular directory would go undetected.
As described in the previous commit message, we introduced multiple
symlink-based regressions in commit
3df737bc4addfce71c647792ee668725e5221a98
Here, we fix the case of symlinks to regular files by doing an extra
stat of any DT_LNK files to determine if they do, in fact, link to
regular files.
In commit 3df737bc4addfce71c647792ee668725e5221a98 we switched from
using stat() to using the d_type field in the result of scandir() to
determine whether a filename is a regular file or a directory. This
change introduced a regression in that the recursion would no longer
traverse through a symlink to a directory. (Since stat() would resolve
the symlink but with scandir() we see a distinct DT_LNK value in
d_type).
We fix this for directories by allowing both DT_DIR and DT_LNK values
to recurse, and then downgrading the existing not-a-directory check
within the recursion to not be an error. We also add a new
not-a-directory check outside the recursion that is an error.
The "notmuch new" command will now efficiently notice if any files or
directories have been removed from the mail store and will
appropriately update its database.
Any given mail message (as determined by the message ID) may have
multiple corresponding filenames, and notmuch will return one of
them. When a filen is deleted, the corresponding filename will be
removed from the message in the database. When the last filename is
removed from a message, that message will be entirely removed from the
database.
All file additions are handled before any file removals so that rename
is supported properly.
It is essential to defer the actual removal of any filenames from the
database until we are entirely done adding any new files. This is to
avoid any information loss from the database in the case of a renamed
file or directory.
Note that we're *still* not actually doing any removal---still just
printing messages indicating the filenames that were detected as
removed. But we're at least now printing those messages at a time when
we actually *can* do the actual removal.
This takes advantage of the notmuch_directory_t interfaces added
recently (with cooresponding storage of directory documents in the
database) to detect when files or entire directories are deleted or
renamed within the mail store.
This also fixes the recent regression where *all* files would be
processed by every run of "notmuch new", (now only new files are
processed once again).
The deleted files and directories are only detected so far. They
aren't properly removed from the database.
Previously, we were re-scanning the entire list of entries for every
directory entry. Instead, we can simply check if the entries look like
a maildir once, up-front.
We now do two scans over the entries returned from scandir. The first
scan is looking for directories (and making the recursive call). The
second scan is looking for new files to add to the database.
This is easier to read than the previous code which had a single loop
and some if statements with ridiculously long bodies. It also has the
advantage that once the directory scan is complete we can do a single
comparison of the filesystem and database mtimes and entirely skip the
second scan if it's not needed.
Previously we had an array named "namelist" and its count named
"num_entries". We now use an array name of "fs_entries" and a count
named "num_fs_entries" to try to preserve sanity.
We were previousl using the stat for two reasons. One was to obtain
the mtime of the file. This usage was removed in the previous commit,
(since the mtime is unreliable in the case of a file being moved into
the mail store).
The second reason was to identify regular and directory file
types. But this information is already available in the result we get
from scandir.
What's left is simply a stat for each directory in the mailstore,
(which we are still using to compare filesystem mtime with the mtime
stored in the database).
This check was buggy in that moving a pre-existing file into the mail
store, (where the file existed before the last run of "notmuch new"),
does not update the mtime of the file. So the message would never be
added to the database.
The fix here is not practical in the long run, (since it causes *all*
files in the mail store to be processed in every run of "notmuch new"
(!)). But this change will let us drop a stat() call that we don't
otherwise need and will help move us toward proper database-backed
detection of new files, (which will fix the bug without the
performance impact of the current fix).
The previous name of "path_mtime" was very ambiguous. The new names
are much more obvious (fs_mtime is the mtime from the filesystem and
db_mtime is the mtime from the database).
This was a very dangerous bug. An interrupted "notmuch new" session
would still update the timestamp for the directory in the
database. This would result in mail files that were not processed due
to the original interruption *never* being picked up by future runs of
"notmuch new". Yikes!
This new directory ojbect provides all the infrastructure needed to
detect when files or directories are deleted or renamed. There's still
code needed on top of this (within "notmuch new") to actually do that
detection.
This was really the last thing keeping the initial run of "notmuch
new" being different from all other runs. And I'm taking a fresh
look at the performance of "notmuch new" anyway, so I think we can
safely drop this optimization.
Several people complained that the humor wore thin very quickly. The
most significant case of "not much mail" is when counting the user's
initial mail collection. We've promised on the web page that no matter
how much mail the user has, notmuch will consider it to be "not much"
so let's say so. (This message was in place very early on, but was
inadvertently dropped at some point.)
Glibc (at least) provides the warn_unused_result attribute on write,
(if optimizing and _FORTIFY_SOURCE is defined). So we explicitly
ignore the return value in our signal handler, where we couldn't do
anything anyway.
Compile with:
make CFLAGS="-O -D_FORTIFY_SOURCE"
before this commit to see the warning.
Currently we assume that all errors on stat() a dname is fatal (but
continue anyway and report the error at the end). However, some errors
reported by stat() such as a missing file or insufficient privilege,
we can simply ignore and skip the file. For the others, such as a fault
(unlikely!) or out-of-memory, we handle like the other fatal errors by
jumping to the end.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
'notmuch new' skips directory entries with the name 'tmp'. This is to
prevent notmuch from processing possibly incomplete Maildir messages
stored in that directory.
This patch attempts to refine the feature. If "tmp" entry is found,
it first checks if the containing directory looks like a Maildir
directory. This is done by searching for other common Maildir
subdirectories. If they exist and if the entry "tmp" is a directory
then it is skipped.
Files and subdirectories with the name "tmp" that do not look like
Maildir will still be processed by 'notmuch new'.
Signed-off-by: Jan Janak <jan@ryngle.com>
We look at the modified time of the database and the directory
to decide whether we need to look at only the subdirectories.
ie, if directory modified time is < database modified time
then we have already looking at all the files withing the
directory. So we just need to iterate through the subdirectories
But with symlinks we need to make sure we follow them even if
the directory modified time is less than database modified time
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
When running "notmuch new --verbose", ANSI escapes are used. This may not be
desirable when the output of the command is *not* being sent to a terminal
(e.g. when piping output into another command). In that case each file
processed is printed in a new line and ANSI escapes are not used at all.