Commit graph

17 commits

Author SHA1 Message Date
Carl Worth
1c63ec7031 notmuch-index-message: Correctly parse and index encoded mime parts.
This cleans up some old code that was very ugly, (separately opening
the mail file and seeking to the end of the headers to parse the
body). I knew gmime must have had support for transparently decoding
mime content, but I just couldn't find it previously.

Note: Multipart and MultipartSigned parts are not handled yet.

Things are quite happy now. The few differences I see with sup are:

1. sup forces email address domains to lowercase, (I don't think I care)

2. sup and notmuch disagree on ordering of multiple thread_id values
   (another thing that's of no concern)

We are still doing one thing wrong when a message belongs to multiple
threads. We've got a nice comma-separated thread-value just like sup,
but then we're also putting in a comma-separated thread-term where
sup does multiple thread terms. That should be an easy fix.

Beyond that, sup and notmuch are still disagreeing on the term lists
for some messages, (I think attachment vs. inline content-disposition
is at least one piece of this). But there are likely still differences
in the heuristics for which chunks of the message body to index. I'll
be looking into this more.
2009-10-14 13:29:52 -07:00
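
The thread-term fix described in 1c63ec7031 (one term per thread ID instead of a single comma-separated term) could look roughly like the sketch below. The names `for_each_thread_id`, `thread_term_fn` and `print_thread_term` are hypothetical, not taken from notmuch-index-message, and the callback only prints where the real program would add a term to the Xapian document.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical callback: receives one thread ID (not NUL-terminated)
 * together with its length. */
typedef void (*thread_term_fn)(const char *id, size_t len, void *closure);

/* Call add_term once for every non-empty ID in a comma-separated
 * thread_id value. Returns the number of IDs visited. */
int
for_each_thread_id(const char *thread_value, thread_term_fn add_term,
                   void *closure)
{
    int count = 0;
    const char *p = thread_value;

    while (*p) {
        size_t len = strcspn(p, ",");   /* length up to next ',' or end */

        if (len > 0) {
            add_term(p, len, closure);
            count++;
        }
        p += len;
        if (*p == ',')
            p++;                        /* step over the separator */
    }
    return count;
}

/* Stand-in for adding a term to the index: just print it. */
void
print_thread_term(const char *id, size_t len, void *closure)
{
    (void) closure;
    printf("thread term: %.*s\n", (int) len, id);
}
```

The comma-separated value stays as it is; only the term side is split, which is the behaviour the message attributes to sup.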
Carl Worth
9ab2447e89 notmuch-index-message: Lookup children for thread_id as well.
This provides the thread_id linkage for when a child message is
indexed before the parent.
2009-10-14 10:34:05 -07:00
Carl Worth
ed320cb45b notmuch-index-message: Use more meaningful variable names.
The abuse of the generic "value" name was getting very hard to read.
2009-10-14 09:57:59 -07:00
Carl Worth
7d1227c4a8 notmuch-index-message: Start generating correct thread_id values.
Currently we're looking up all parents (based on the In-reply-to and
References headers) and using the list of all thread_id values
from those as our thread_id value. We're missing one step which
sup does, which is to also look up any children in the database
that reference our message ID. So we'll need to do that next.
2009-10-14 09:54:05 -07:00
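
As an illustration of 7d1227c4a8, collecting the thread_id values of all parents into one comma-separated value might be sketched as below. This assumes duplicates are dropped and the value lives in a fixed-size buffer; `list_contains` and `add_thread_id` are hypothetical helpers, and the database lookup that yields each parent's thread_id is left out.

```c
#include <string.h>

/* Return 1 if id appears as a whole element of the comma-separated
 * list, 0 otherwise. */
int
list_contains(const char *list, const char *id)
{
    size_t id_len = strlen(id);
    const char *p = list;

    while (*p) {
        size_t len = strcspn(p, ",");

        if (len == id_len && strncmp(p, id, len) == 0)
            return 1;
        p += len;
        if (*p == ',')
            p++;
    }
    return 0;
}

/* Append id to the comma-separated list held in a buffer of the given
 * size, unless it is empty or already present. Returns 0 on success,
 * -1 if the ID does not fit. */
int
add_thread_id(char *list, size_t size, const char *id)
{
    size_t used = strlen(list);
    size_t id_len = strlen(id);

    if (id_len == 0 || list_contains(list, id))
        return 0;
    /* Space needed: existing text, optional comma, the ID, the NUL. */
    if (used + (used ? 1 : 0) + id_len + 1 > size)
        return -1;
    if (used)
        list[used++] = ',';
    memcpy(list + used, id, id_len + 1);
    return 0;
}
```

The same merge would apply to thread_id values found on children, which is the step 9ab2447e89 adds.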
Carl Worth
5cbdcbbec5 Factor out parsing of reference-header values and pickup In-reply-to.
This is in preparation for doing a couple of passes over the references,
(one to add terms to the database, and a second to find the thread_id).

We also now parse the In-reply-to header which we were missing before.
We treat it identically to the References header.
2009-10-14 08:02:27 -07:00
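
A factored-out reference parser of the kind 5cbdcbbec5 describes might look like this minimal sketch. It assumes message IDs are simply the text between '<' and '>' and ignores RFC 5322 comments and quoting; `parse_references` and `MAX_ID_LEN` are hypothetical names, not the ones in the actual source.

```c
#include <string.h>

#define MAX_ID_LEN 256

/* Copy up to max_ids message IDs (without the angle brackets) out of
 * a References or In-Reply-To header value. Returns the number of
 * IDs stored. */
int
parse_references(const char *header, char ids[][MAX_ID_LEN], int max_ids)
{
    int count = 0;
    const char *p = header;

    while (count < max_ids && (p = strchr(p, '<')) != NULL) {
        const char *end = strchr(p + 1, '>');
        size_t len;

        if (end == NULL)
            break;                      /* unterminated ID: stop here */
        len = (size_t) (end - (p + 1));
        if (len > 0 && len < MAX_ID_LEN) {
            memcpy(ids[count], p + 1, len);
            ids[count][len] = '\0';
            count++;
        }
        p = end + 1;
    }
    return count;
}
```

Because the result is a plain array, the caller can walk it twice: once to add terms and once to look up thread_id values, and it can pass either header to the same function.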
Carl Worth
09f765ce18 notmuch-index-message: Ignore more signature patterns.
Getting more sup-compatible all the time.
2009-10-14 07:24:28 -07:00
Carl Worth
c0da89a8e0 notmuch-index-message: Avoid crashing when a message has no references.
It's obviously an innocent-enough message, and the right thing is
so easy to do.
2009-10-13 21:15:12 -07:00
Carl Worth
3922bb4cfd notmuch-index-message: Read message filenames from stdin
This allows for indexing an arbitrary number of messages with a
single invocation rather than just a single message on the command
line.
2009-10-13 21:15:07 -07:00
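
The stdin loop from 3922bb4cfd can be sketched as follows, assuming one filename per line. `index_files_from_stream` and `print_filename` are hypothetical; the callback stands in for the real index_file function, whose signature is not shown in this log.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical indexing callback: returns 0 on success. */
typedef int (*index_fn)(const char *filename);

/* Read one filename per line from the stream and hand each non-empty
 * line to index_file. Returns the number of files indexed
 * successfully. Lines longer than the buffer are split, which a real
 * implementation would want to handle properly. */
int
index_files_from_stream(FILE *in, index_fn index_file)
{
    char line[4096];
    int indexed = 0;

    while (fgets(line, sizeof line, in) != NULL) {
        line[strcspn(line, "\r\n")] = '\0';     /* strip the newline */
        if (line[0] == '\0')
            continue;                           /* skip blank lines */
        if (index_file(line) == 0)
            indexed++;
    }
    return indexed;
}

/* Stand-in for index_file: print the name and report success. */
int
print_filename(const char *filename)
{
    printf("indexing %s\n", filename);
    return 0;
}
```

Calling `index_files_from_stream(stdin, print_filename)` from main() would then process whatever list of message files is piped in.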
Carl Worth
3253954233 Move index_file out from main() into its own function.
This is a step toward having a program that will index many messages
with a single invocation.
2009-10-13 21:15:06 -07:00
Carl Worth
c4812dae16 notmuch-index-message: Index References as well.
We're basically matching sup now! (As long as one uses sup with my
special notmuch_index.rb file).
2009-10-13 21:15:01 -07:00
Carl Worth
dceb501e44 Minor code re-ordering for clarity.
Pull the "constant" source_id value out from among several calls
that set a value based on the Message ID.
2009-10-13 21:15:00 -07:00
Carl Worth
1479b99b50 notmuch-index-message: Don't index the "re:" prefix in subjects.
Getting closer to sup results all the time.
2009-10-13 21:14:55 -07:00
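
One way to skip the "re:" prefix mentioned in 1479b99b50 before generating subject terms is sketched below. Matching case-insensitively and stripping repeated prefixes ("Re: Re: ...") are assumptions made here; the commit message does not say how the real code treats those cases. `skip_re_prefix` is a hypothetical name.

```c
#include <ctype.h>
#include <string.h>

/* Return a pointer into subject just past any leading whitespace and
 * any number of "re:" prefixes (matched case-insensitively). */
const char *
skip_re_prefix(const char *subject)
{
    const char *s = subject;

    for (;;) {
        while (isspace((unsigned char) *s))
            s++;
        if (tolower((unsigned char) s[0]) == 'r' &&
            tolower((unsigned char) s[1]) == 'e' &&
            s[2] == ':')
            s += 3;             /* drop one "re:" and look again */
        else
            return s;
    }
}
```

The function returns a pointer into the original string, so no allocation is needed and the subject header itself is left untouched.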
Carl Worth
9bf3cda34c notmuch-index-message: Don't index the line introducing a quote.
We identify it based on a trailing ':' on the line before a quote
begins.

At this point the database-dump diff between sup and notmuch is
getting very, very small, (at least for our one test message).
2009-10-13 21:14:50 -07:00
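
The trailing-':' heuristic from 9bf3cda34c amounts to a one-line lookahead, sketched here. It assumes a quote is any line starting with '>' and that trailing whitespace should be ignored when looking for the colon; all three function names are hypothetical.

```c
#include <string.h>

/* A line counts as quoted if it starts with '>' (assumption). */
int
is_quoted(const char *line)
{
    return line[0] == '>';
}

/* Return 1 if the last non-whitespace character of line is ':'. */
int
ends_with_colon(const char *line)
{
    size_t len = strlen(line);

    while (len > 0 && (line[len - 1] == ' ' || line[len - 1] == '\t' ||
                       line[len - 1] == '\r' || line[len - 1] == '\n'))
        len--;
    return len > 0 && line[len - 1] == ':';
}

/* A line introduces a quote if it ends with ':' and the following
 * line is quoted. next_line may be NULL at the end of the body. */
int
is_quote_intro(const char *line, const char *next_line)
{
    return next_line != NULL && is_quoted(next_line) &&
           ends_with_colon(line);
}
```

Requiring the next line to be quoted keeps ordinary sentences that happen to end in a colon in the index.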
Carl Worth
048b8aec11 notmuch-index-message: Don't index quoted lines and signatures.
At this point, we're achieving a result that is *very* close to
what sup does. The only difference is that we are still indexing
the "excerpts from message ..." line, and we are not yet indexing
references.
2009-10-13 21:14:44 -07:00
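
Skipping quoted lines and signatures, as 048b8aec11 does, might be sketched like this. The sketch assumes a quoted line starts with '>' and a signature starts at a conventional "-- " delimiter line; the patterns the real code and sup recognise are not listed in this log, and the function names are hypothetical.

```c
#include <string.h>

/* Conventional signature delimiter: "-- " (also accepted here with
 * the trailing space stripped). */
int
is_signature_start(const char *line)
{
    return strcmp(line, "-- ") == 0 || strcmp(line, "--") == 0;
}

/* Store in out the lines worth indexing: everything before the
 * signature that is not quoted. out must have room for n pointers.
 * Returns the number of lines kept. */
size_t
select_indexable_lines(const char **lines, size_t n, const char **out)
{
    size_t kept = 0;
    size_t i;

    for (i = 0; i < n; i++) {
        if (is_signature_start(lines[i]))
            break;                  /* ignore the rest of the body */
        if (lines[i][0] == '>')
            continue;               /* skip quoted text */
        out[kept++] = lines[i];
    }
    return kept;
}
```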
Carl Worth
9dbb1facfb notmuch-index-message: Separate gen_terms_body into its own function
This one is complex enough to deserve its own treatment.
2009-10-13 21:14:33 -07:00
Carl Worth
f69215d41f notmuch-index-message: Add code to actually create a Xapian index
Most of this code is fairly clean and works well. One part is
fairly painful---namely extracting the body of an email message
from libgmime. Currently, I'm just extracting the offset to
the end of the headers, and then separately opening the message.
Surely there's a better way.

Anyway, with that the results are looking very similar to sup-sync
now, (as verified by xapian-dump). The only substantial difference
I'm seeing now is that sup does not seem to index quoted portions
of messages nor signatures. I'm not actually sure whether I want
to follow sup's lead in that or not.
2009-10-13 15:59:57 -07:00
Carl Worth
c55c34f4a0 Rename g_mime_test to notmuch-index-message
In preparation for actually creating a Xapian index from the
message, (not that we're doing that quite yet).
2009-10-13 13:31:17 -07:00
Renamed from g_mime_test.c