
3424 commits

Author SHA1 Message Date
Carl Worth
371091139a Rework message parsing to use getline rather than mmap.
The line-based parsing can be a bit awkward when wanting to peek
ahead, (say, for folded header values), but it's so convenient
to be able to trust that a string terminator exists on every
line so it cleans up the code considerably.
2009-10-19 16:38:44 -07:00
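A minimal sketch of the getline-based approach described in 371091139a, assuming a POSIX system. The helper name count_headers is hypothetical and not taken from notmuch; it only illustrates how folded header values (continuation lines starting with a space or tab) can be recognized line by line while relying on the terminator that getline guarantees.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper (not from notmuch): count the logical header
 * fields at the start of a message read from `stream`. A line that
 * begins with a space or tab is a folded continuation of the previous
 * header and is not counted. Parsing stops at the blank line that
 * ends the header block, or at end of file. */
size_t
count_headers (FILE *stream)
{
    char *line = NULL;
    size_t line_size = 0;
    size_t count = 0;

    /* getline always NUL-terminates the buffer, so indexing line[0]
     * and line[1] below never reads past the string. */
    while (getline (&line, &line_size, stream) != -1) {
        /* A blank line (LF or CRLF) ends the headers. */
        if (line[0] == '\n' || (line[0] == '\r' && line[1] == '\n'))
            break;

        /* Folded header value: belongs to the previous header. */
        if (line[0] == ' ' || line[0] == '\t')
            continue;

        count++;
    }

    free (line);
    return count;
}
```

The awkwardness mentioned in the commit shows up here: a continuation line is only identified once it has already been read, so a parser that needs the complete value must buffer the previous header until the next non-continuation line arrives.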
Carl Worth
45f0d7bcab Don't hash headers we won't end up using.
Just saving a little work here.
2009-10-19 13:48:13 -07:00
Carl Worth
c5eea2b77e Document which pieces of glib we're still using.
Looks like we can copy in a hash-table implementation, (from cairo,
say), and then a few _ascii_ functions from glib, (we'll need to
switch a few current uses of things like isspace, etc. to
locale-independent versions as well). So not too hard to free ourselves
of glib for now, (until we add GMime back in later, of course).
2009-10-19 13:40:56 -07:00
Carl Worth
fa562fa22b Hook up our fancy new notmuch_parse_date function.
With all the de-glib-ification out of the way, we can now use it
to allow for date-based sorting of Xapian search results.
2009-10-19 13:35:29 -07:00
Carl Worth
401c6cc579 notmuch_parse_date: Handle a NULL date string gracefully.
The obvious thing to do is to treat a missing date as the beginning
of time. Also, remove a useless cast from another return of 0.
2009-10-19 13:24:12 -07:00
Carl Worth
8e4e0559e7 date.c: Rename function to notmuch_parse_date
Now completing the process of making this function "our own".

The documentation is deleted here, because we already have
the documentation we want in notmuch-private.h.
2009-10-19 13:24:07 -07:00
Carl Worth
747f610901 date.c: Add hard-coded definition of HAVE_TIMEZONE
The original code expected this to be set by running configure.
We'll just manually set it here for now. This isn't as portable
as if we were doing some compile-time examination of the current
system, but I don't need portability now.

When someone comes along that wants to port notmuch to another
system, they will already have all the #ifdefs in place and
will simply need to add the appropriate machinery to set the
defines.
2009-10-19 13:19:37 -07:00
Carl Worth
c2c50d50c5 date.c: Don't use glib's slice allocator.
This change is gratuitous. For now, notmuch is still linking
against glib, so I don't have any requirement to remove this,
(unlike the last few changes where good taste really did
require the changes).

The motivation here is two-fold:

1. I'm considering switching away from all glib-based allocation
soon so that I can more easily verify that the memory management
is solid. I want valgrind to say "no leaks are possible" not
"there is tons of memory still allocated, but probably reachable
so who knows if there are leaks or not?". And glib seems to make
that impossible.

2. I don't think there's anything performance-sensitive about the
allocation here. (In fact, if there is, then the right answer
would be to do this parsing without any allocation whatsoever.)
2009-10-19 13:14:37 -07:00
Carl Worth
c777524834 date.c: Remove occurrences of gboolean.
While this is surely one of the most innocent typedefs, it still
annoys me to have basic types like 'int' re-defined like this.
It just makes it harder to copy the code between projects, with
very little benefit in readability.

For readability, predicate functions and variables should be
obviously Boolean-natured by their actual *names*.
2009-10-19 13:11:57 -07:00
Carl Worth
dbadca9a63 date.c: Remove all occurrences of g_return_val_if_fail
That's got to be one of the hardest macro names to read, ever,
(it's phrased with an implicit negative in the condition,
rather than something simple like "assert").

Plus, it's evil, since it's a macro with a return in it.

And finally, it's actually *longer* than just typing "if"
and "return". So what's the point of this ugly idiom?
2009-10-19 13:09:19 -07:00
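To illustrate the complaint in dbadca9a63, here is a small sketch comparing a macro with a hidden return against the explicit form. RETURN_VAL_IF_FAIL is a hypothetical stand-in written for this example, not glib's actual definition (the real g_return_val_if_fail also logs a critical warning), and both function names are invented.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a g_return_val_if_fail-style macro.
 * Note the `return` hidden inside the macro body, and that the
 * condition names what must hold, not what triggers the return. */
#define RETURN_VAL_IF_FAIL(cond, val) \
    do { if (!(cond)) return (val); } while (0)

/* Macro style: the early exit is invisible at the call site. */
size_t
length_with_macro (const char *s)
{
    RETURN_VAL_IF_FAIL (s != NULL, 0);
    return strlen (s);
}

/* Explicit style: the same check written with "if" and "return". */
size_t
length_explicit (const char *s)
{
    if (s == NULL)
        return 0;
    return strlen (s);
}
```

Both functions behave identically in this sketch; the difference is only in how visible the control flow is to a reader.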
Carl Worth
9f3649370c date.c: Keep the comments clean.
Never know when the children might be reading over my shoulder,
for example. :-)
2009-10-19 13:07:58 -07:00
Carl Worth
f638fbf8d6 date.c: Change headers/defines to work within notmuch.
We can't rely on any gmime-internal headers, (and fortunately we
don't need to). We also aren't burdened with any autoconf machinery
so don't reference any of that.
2009-10-19 13:06:55 -07:00
Carl Worth
e26a2bf48b date.c: Remove a bunch of undesired code.
We're only interested in the date-parsing code here.
2009-10-19 13:06:37 -07:00
Carl Worth
4f9aa77d80 date.c: Convert from LGPL-2+ to GPL-3+
As authorized by LGPL-2 term (3).
2009-10-19 13:02:17 -07:00
Carl Worth
f5f8dcf2a0 date.c: Add new file directly from gmime2.4-2.4.6/gmime/gmime-utils.c
We're sucking in one gmime implementation file just to get the
piece that parses an RFC 822 date, because I don't want to go
through the pain of replicating that.
2009-10-19 13:00:51 -07:00
Carl Worth
0e777a8f80 notmuch: Switch from gmime to custom, ad-hoc parsing of headers.
Since we're currently just trying to stitch together In-Reply-To
and References headers we don't need that much sophistication.
It's when we later add full-text searching that GMime will be
useful.

So for now, even though my own code here is surely very buggy
compared to GMime it's also a lot faster. And speed is what
we're after for the initial index creation.
2009-10-19 13:00:43 -07:00
Carl Worth
9bc4253fa8 notmuch: Ignore .notmuch when counting files.
We were correctly ignoring this when adding files, but not when
doing the initial count. Clearly we need better code sharing
here.
2009-10-19 12:52:46 -07:00
Carl Worth
10c176ba0e notmuch: Start actually adding messages to the index.
This is the beginning of the notmuch library as well, with its
interface in notmuch.h. So far we've got create, open, close, and
add_message (all with a notmuch_database prefix).

The current add_message function has already been whittled down from
what we have in notmuch-index-message to add only references,
message-id, and thread-id to the index, (that is---just enough to do
thread-linkage but nothing for full-text searching).

The concept here is to do something quickly so that the user can get
some data into notmuch and start using it. (The most interesting stuff
is then thread-linkage and labels like inbox and unread.)  We can
defer the full-text indexing of the body of the messages for later,
(such as in the background while the user is reading mail).

The initial thread-stitching step is still slower than I would like.
We may have to stop using libgmime for this step as its overhead is
not worth it for the simple case of just parsing the message-id,
references, and in-reply-to headers.
2009-10-18 20:56:30 -07:00
Carl Worth
512f7bb0f6 xapian-dump: Rewrite to generate C code as output.
This was for some timing tests, (to see how fast xapian could be
if we were strictly adding documents and not doing any other IO
or computation). The answer is that xapian is quite fast, (on
the order of 1000 documents per second).
2009-10-18 20:49:43 -07:00
Carl Worth
36640b303e Start a new top-level executable: notmuch.
Of course, there's not much that this program does yet. It's got
some structure for some sub-commands that don't do anything. And
it has a main command that prints some explanatory text and then
counts all the regular files in your mail archive.
2009-10-17 08:26:58 -07:00
Carl Worth
9c3807e688 Fix more memory leaks.
These were more significant than the previous leak because these were
in the loop and leaking memory for every message being parsed. It
turns out that g_hash_table_new should probably be named
g_hash_table_new_and_leak_memory_please. The actually useful function
is g_hash_table_new_full which lets us pass a free function, (to free
keys when inserting duplicates into the hash table). And after all,
weeding out duplicates is the only reason we are using this hash table
in the first place.

It almost goes without saying, valgrind found these leaks.
2009-10-16 13:45:17 -07:00
Carl Worth
28c0691ab9 Fix a one-time memory leak.
This was a single object in main outside any loops, so there was
no impact on performance or anything, but obviously we still want
to patch this.

Of course, valgrind gets the credit for seeing this.
2009-10-16 13:41:37 -07:00
Carl Worth
dcebf35ec9 Avoid reading a byte just before our allocated buffer.
When looking for a trailing ':' to introduce a quotation we peek at
the last character before a newline. But for blank lines, that's not
where we want to look. And when the first line in our buffer is a
blank line, we're underrunning our buffer. The fix is easy---just
bail early on blank lines since they have no terms anyway.

Thanks to valgrind for pointing out this error.
2009-10-16 13:38:43 -07:00
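The fix in dcebf35ec9 can be sketched roughly as follows. The function name and signature are assumptions made for this example and are not the actual indexer code; `len` is taken to be the line length excluding the newline.

```c
#include <stddef.h>

/* Hypothetical helper (not the actual notmuch code): report whether
 * a line ends with ':' and so may introduce a quotation. `line`
 * points at the first character of the line and `len` is its length
 * without the trailing newline. */
int
line_introduces_quotation (const char *line, size_t len)
{
    /* Blank line: there is no last character to inspect. Without
     * this check, line[len - 1] would read the byte before the line,
     * which lies outside the buffer when the line is the first one. */
    if (len == 0)
        return 0;

    return line[len - 1] == ':';
}
```

Because `len` is unsigned, `len - 1` on a blank line would also wrap around to a huge index, so the early return guards against both an underrun and a wild read.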
Carl Worth
387a28281c Generate random thread IDs instead of using an arbitrary Message-ID.
Previously, we used as the thread-id the message-id of the first
message in the thread that we happened to find. In fact, this is a
totally arbitrary identifier, so it might as well be random. And an
advantage of actually using a random identifier is that we now have
fixed-length thread identifiers, (and the way is open to even allow
abbreviated identifiers like git does---though we're less likely to
show these identifiers to actual users).
2009-10-16 13:33:39 -07:00
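One way to produce fixed-length random identifiers like those described in 387a28281c is sketched below. The commit message does not say how long the IDs are or where the randomness comes from, so the 8-byte length, the use of /dev/urandom, and the name generate_thread_id are all assumptions for illustration.

```c
#include <stdio.h>

/* Assumed sizes: 8 random bytes rendered as 16 hex characters. */
#define THREAD_ID_BYTES 8
#define THREAD_ID_HEX_LEN (2 * THREAD_ID_BYTES)

/* Hypothetical helper: fill `id` with a NUL-terminated random hex
 * string of exactly THREAD_ID_HEX_LEN characters.
 * Returns 0 on success, -1 if random bytes could not be read. */
int
generate_thread_id (char id[THREAD_ID_HEX_LEN + 1])
{
    unsigned char bytes[THREAD_ID_BYTES];
    FILE *urandom = fopen ("/dev/urandom", "rb");
    size_t i;

    if (urandom == NULL)
        return -1;

    if (fread (bytes, 1, sizeof (bytes), urandom) != sizeof (bytes)) {
        fclose (urandom);
        return -1;
    }
    fclose (urandom);

    /* Each byte becomes two hex digits; the final snprintf call
     * writes the terminating NUL at id[THREAD_ID_HEX_LEN]. */
    for (i = 0; i < THREAD_ID_BYTES; i++)
        snprintf (id + 2 * i, 3, "%02x", (unsigned int) bytes[i]);

    return 0;
}
```

Since every identifier has the same length, a git-style abbreviation would amount to comparing only a leading prefix of the hex string.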
Carl Worth
5fbdbeb333 Change progress report to show "instantaneous" rate. Also print total time.
Instead of always showing the overall rate, we wait until the end
to show that. Then, on incremental updates we show the rate over the
last increment. This makes it much easier to actually watch what's
happening, (and it's easy to see the effect of xapian's internal
10,000 document flush).
2009-10-15 09:04:31 -07:00
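The difference between the overall rate and the "instantaneous" rate from 5fbdbeb333 comes down to which starting point is used for the division. A small sketch, with a hypothetical function name and with timestamps passed in as seconds:

```c
/* Hypothetical helper: messages per second between two samples.
 * Passing the previous progress update as (count_prev, time_prev)
 * gives the rate over the last increment; passing the start of the
 * run (0 messages at the start time) gives the overall rate. */
double
rate_over_interval (long count_now, long count_prev,
                    double time_now, double time_prev)
{
    double elapsed = time_now - time_prev;

    /* Avoid dividing by zero if two samples share a timestamp. */
    if (elapsed <= 0.0)
        return 0.0;

    return (count_now - count_prev) / elapsed;
}
```

For example, 2000 messages after 3 seconds with the previous update at 1000 messages and 1 second gives 500 messages per second for the last increment, while the overall rate from the start is roughly 667 per second.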
Keith Packard
a2c467242a Protect against missing message id while indexing files 2009-10-14 21:46:54 -07:00
Keith Packard
8f3ccda00f Walk address groups and parse each address separately
Signed-off-by: Keith Packard <keithp@keithp.com>
2009-10-14 21:17:39 -07:00
Carl Worth
5166406bef Reduce the verbosity of the progress indicator.
It's fast enough that we can wait for 1000 messages before updating.
2009-10-14 17:26:28 -07:00
Carl Worth
a5865d0574 Add support for message-part mime parts.
We could (and probably should) reparse and index all the headers from
the embedded message, but I'm not choosing to do that now---I'm just
indexing the body of the embedded message.
2009-10-14 17:25:20 -07:00
Carl Worth
914df660c4 Avoid segfault on message with no subject.
It's fun how turning a program loose on 500,000 messages will find
lots of little corner cases.
2009-10-14 17:24:28 -07:00
Carl Worth
d643f7d776 Add some sort of progress indicator.
It's nice to let the user know that something is happening.
2009-10-14 17:10:14 -07:00
Carl Worth
71bd250cb6 Avoid complaints about messages with empty mime parts. 2009-10-14 17:09:56 -07:00
Carl Worth
48d2e2dc44 Avoid complaints about empty address lists. 2009-10-14 17:09:30 -07:00
Carl Worth
bae1ce09a3 Document the little details separating the sup and notmuch indexes.
As can be seen here, there are not a lot of differences. I've verified
this by using sup-sync to import a month of mail from the sup mailing
list, and comparing the database term-by-term, value-by-value, and
data-by-data with that created by notmuch. There are no differences
other than those documented here.
2009-10-14 16:49:26 -07:00
Carl Worth
784779fb67 Avoid trimming initial whitespace while looking for signatures.
I ran into a message with an indented stack trace that my indexer
was mistaking for a signature.
2009-10-14 16:38:21 -07:00
Carl Worth
30ed705fda Index an attachment's filename extension as well.
I hadn't realized that sup used a special term for this. But there
you go.
2009-10-14 16:35:03 -07:00
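Extracting the extension from an attachment filename, as in 30ed705fda, might look like the sketch below. The helper name is hypothetical, and the treatment of edge cases (dotfiles, a trailing dot) is an assumption; the commit message does not say how sup or notmuch handles them.

```c
#include <string.h>

/* Hypothetical helper: return a pointer to the extension of
 * `filename` (the text after the last '.'), or NULL if there is
 * none. Assumed edge cases: a leading dot (".bashrc") and a
 * trailing dot ("archive.") are treated as having no extension. */
const char *
filename_extension (const char *filename)
{
    const char *dot = strrchr (filename, '.');

    if (dot == NULL || dot == filename || dot[1] == '\0')
        return NULL;

    return dot + 1;
}
```

The returned pointer refers into the original string, so no allocation is needed before handing the extension to the indexer.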
Carl Worth
29974af08f Index the filename of any attachment. 2009-10-14 16:28:07 -07:00
Carl Worth
653ff260f5 [sup-compat] Don't index mime parts with content-disposition of attachment
Here's another change which I'm making for sup compatibility against
my better judgment. It seems that sup never indexes content from
mime parts with content-disposition of attachment. But these
attachments are often very indexable, (for example, the first one
I encountered was a small shell script).

So I'll have to think a bit about whether or not I want to revert
this commit. To do this properly we would really want to distinguish
between attachments that are indexable, (such as text), and those
that aren't, (such as binaries). I know the mime-type alone isn't
always sufficient here as even this little plaintext shell script
was attached as octet-stream.

And if we wanted to get really fancy we could run things like antiword
to generate text from non-text attachments and index their output.
2009-10-14 16:20:45 -07:00
Carl Worth
7c9dbbad40 Add label "attachment" when an attachment is seen. 2009-10-14 16:20:26 -07:00
Carl Worth
870b398726 Split thread_id value on commas before inserting into hash.
One thread_id value may have multiple thread IDs in it so we need
to separate them out before inserting into our hash.
2009-10-14 16:04:25 -07:00
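The splitting step from 870b398726 can be sketched without glib as follows. The callback type and function names are hypothetical; in the indexer the callback would insert each ID into the hash table, whereas this example only prints it.

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical callback type: receives one thread ID as a pointer
 * plus length (the ID is NOT NUL-terminated at `len`). */
typedef void (*thread_id_func) (const char *id, size_t len, void *closure);

/* Split a comma-separated thread_id value and call `func` once per
 * non-empty ID. `func` may be NULL to just count. The input string
 * is not modified. Returns the number of IDs found. */
size_t
for_each_thread_id (const char *value, thread_id_func func, void *closure)
{
    const char *start = value;
    size_t count = 0;

    while (*start != '\0') {
        /* Length of the run up to the next comma or end of string. */
        size_t len = strcspn (start, ",");

        if (len > 0) {
            if (func != NULL)
                func (start, len, closure);
            count++;
        }

        start += len;
        if (*start == ',')
            start++;
    }

    return count;
}

/* Example callback: print each ID on its own line. */
void
print_thread_id (const char *id, size_t len, void *closure)
{
    (void) closure;
    printf ("%.*s\n", (int) len, id);
}
```

Calling for_each_thread_id (value, print_thread_id, NULL) on a value holding several comma-separated IDs prints one ID per line, which is the granularity the hash insertion needs.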
Carl Worth
27c01802c8 Add missing null terminator before using byte-array contents as string.
Thanks to valgrind for spotting this one.
2009-10-14 15:55:07 -07:00
Carl Worth
7878175ed9 notmuch-index-message: Add explicit support for multipart mime.
Instead of using the recursive "foreach" method, we implement our
own recursive function. This allows us to ignore the signature
component of a multipart/signed message, (which we certainly
don't need to index).
2009-10-14 15:36:13 -07:00
Carl Worth
6363ab32ea [sup-compat] Don't trim trailing whitespace on line introducing quotation.
Ignoring this whitespace seems like a good idea to me, but it's
interfering with my comparisons with sup since sup doesn't do this.

This might be a commit worth dropping in the future since it exists
only for pedantic consistency with sup and not for any reason of its
own.
2009-10-14 14:06:06 -07:00
Carl Worth
736bad40ac notmuch-index-message: Fix handling of thread_id terms.
We now emit one term per thread_id, rather than the comma-separated
super-term we were doing previously.
2009-10-14 14:00:10 -07:00
Carl Worth
535b14dcba notmuch-index-message: Use local-part of email address in lieu of name.
If there's no name given, take the portion of the email address
before the '@' sign.

One step closer to matching sup's terms in the database.
2009-10-14 13:47:18 -07:00
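The fallback described in 535b14dcba, taking the text before the '@' sign when no name is present, can be sketched like this. The function name, the caller-supplied buffer, and the behavior for an address without '@' (copy the whole string) are assumptions for this example.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copy the local part of `address` (everything
 * before the first '@') into `out`, truncating to fit `out_size`
 * and always NUL-terminating. If there is no '@', the whole address
 * is copied. Returns the number of characters stored. */
size_t
email_local_part (const char *address, char *out, size_t out_size)
{
    const char *at = strchr (address, '@');
    size_t len;

    if (out_size == 0)
        return 0;

    len = at != NULL ? (size_t) (at - address) : strlen (address);
    if (len >= out_size)
        len = out_size - 1;

    memcpy (out, address, len);
    out[len] = '\0';

    return len;
}
```

For an address such as cworth@example.org, the buffer ends up holding "cworth", which then serves as the name term.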
Carl Worth
be72bf3070 Use gmime's own reference-parsing code.
Here's another instance where I "knew" gmime must have support for
some functionality, but not finding it, I rolled my own. Now that
I found g_mime_references_decode I'm glad to drop my ugly code.
2009-10-14 13:30:33 -07:00
Carl Worth
1c63ec7031 notmuch-index-message: Correctly parse and index encoded mime parts.
This cleans up some old code that was very ugly, (separately opening
the mail file and seeking to the end of the headers to parse the
body). I knew gmime must have had support for transparently decoding
mime content, but I just couldn't find it previously.

Note: Multipart and MultipartSigned parts are not handled yet.

Things are quite happy now. The few differences I see with sup are:

1. sup forces email address domains to lowercase, (I don't think I care)

2. sup and notmuch disagree on ordering of multiple thread_id values
   (another thing that's of no concern)

We are still doing one thing wrong when a message belongs to multiple
threads. We've got a nice comma-separated thread-value just like sup,
but then we're also putting in a comma-separated thread-term where
sup does multiple thread terms. That should be an easy fix.

Beyond that, sup and notmuch are still disagreeing on the term lists
for some messages, (I think attachment vs. inline content-disposition
is at least one piece of this). But there are likely still differences
in the heuristics for which chunks of the message body to index. I'll
be looking into this more.
2009-10-14 13:29:52 -07:00
Carl Worth
9ab2447e89 notmuch-index-message: Lookup children for thread_id as well.
This provides the thread_id linkage for when a child message is
indexed before the parent.
2009-10-14 10:34:05 -07:00
Carl Worth
ed320cb45b notmuch-index-message: Use more meaningful variable names.
The abuse of the generic "value" name was getting very hard to read.
2009-10-14 09:57:59 -07:00
Carl Worth
7d1227c4a8 notmuch-index-message: Start generating correct thread_id values.
Currently we're looking up all parents (based on In-reply-to and
References headers) and using the list of all thread_id values
from those as our thread_id value. We're missing one step which
sup does which is to also look up any children in the database
that reference our message ID. So we'll need to do that next.
2009-10-14 09:54:05 -07:00