notmuch

mirror of https://git.notmuchmail.org/git/notmuch synced 2024-11-21 18:38:08 +01:00

Author	SHA1	Message	Date
Carl Worth	71bd250cb6	Avoid complaints about messages with empty mime parts.	2009-10-14 17:09:56 -07:00
Carl Worth	48d2e2dc44	Avoid complaints about empty address lists.	2009-10-14 17:09:30 -07:00
Carl Worth	bae1ce09a3	Document the little details separating the sup and notmuch indexes. As can be seen here, there are not a lot of differences. I've verified this by using sup-sync to import a month of mail from the sup mailing list, and comparing the database term-by-term, value-by-value, and data-by-data with that created by notmuch. There are no differences other than those documented here.	2009-10-14 16:49:26 -07:00
Carl Worth	784779fb67	Avoid trimming initial whitespace while looking for signatures. I ran into a message with an indented stack trace that my indexer was mistaking for a signature.	2009-10-14 16:38:21 -07:00
Carl Worth	30ed705fda	Index an attachment's filename extension as well. I hadn't realized that sup used a special term for this. But there you go.	2009-10-14 16:35:03 -07:00
Carl Worth	29974af08f	Index the filename of any attachment.	2009-10-14 16:28:07 -07:00
Carl Worth	653ff260f5	[sup-compat] Don't index mime parts with content-disposition of attachment. Here's another change which I'm making for sup compatibility against my better judgment. It seems that sup never indexes content from mime parts with content-disposition of attachment. But these attachments are often very indexable, (for example, the first one I encountered was a small shell script). So I'll have to think a bit about whether or not I want to revert this commit. To do this properly we would really want to distinguish between attachments that are indexable, (such as text), and those that aren't, (such as binaries). I know the mime-type alone isn't always sufficient here as even this little plaintext shell script was attached as octet-stream. And if we wanted to get really fancy we could run things like antiword to generate text from non-text attachments and index their output.	2009-10-14 16:20:45 -07:00
Carl Worth	7c9dbbad40	Add label "attachment" when an attachment is seen.	2009-10-14 16:20:26 -07:00
Carl Worth	870b398726	Split thread_id value on commas before inserting into hash. One thread_id value may have multiple thread IDs in it so we need to separate them out before inserting into our hash.	2009-10-14 16:04:25 -07:00
Carl Worth	27c01802c8	Add missing null terminator before using byte-array contents as string. Thanks to valgrind for spotting this one.	2009-10-14 15:55:07 -07:00
Carl Worth	7878175ed9	notmuch-index-message: Add explicit support for multipart mime. Instead of using the recursive "foreach" method, we implement our own recursive function. This allows us to ignore the signature component of a multipart/signed message, (which we certainly don't need to index).	2009-10-14 15:36:13 -07:00
Carl Worth	6363ab32ea	[sup-compat] Don't trim trailing whitespace on line introducing quotation. Ignoring this whitespace seems like a good idea to me, but it's interfering with my comparisons with sup since sup doesn't do this. This might be a commit worth dropping in the future since it exists only for pedantic consistency with sup and not for any reason of its own.	2009-10-14 14:06:06 -07:00
Carl Worth	736bad40ac	notmuch-index-message: Fix handling of thread_id terms. We now emit one term per thread_id, rather than the comma-separated super-term we were doing previously.	2009-10-14 14:00:10 -07:00
Carl Worth	535b14dcba	notmuch-index-message: Use local-part of email address in lieu of name. If there's no name given, take the portion of the email address before the '@' sign. One step closer to matching sup's terms in the database.	2009-10-14 13:47:18 -07:00
Carl Worth	be72bf3070	Use gmime's own reference-parsing code. Here's another instance where I "knew" gmime must have support for some functionality, but not finding it, I rolled my own. Now that I found g_mime_references_decode I'm glad to drop my ugly code.	2009-10-14 13:30:33 -07:00
Carl Worth	1c63ec7031	notmuch-index-message: Correctly parse and index encoded mime parts. This cleans up some old code that was very ugly, (separately opening the mail file and seeking to the end of the headers to parse the body). I knew gmime must have had support for transparently decoding mime content, but I just couldn't find it previously. Note: Multipart and MultipartSigned parts are not handled yet. Things are quite happy now. The few differences I see with sup are: 1. sup forces email address domains to lowercase, (I don't think I care) 2. sup and notmuch disagree on ordering of multiple thread_id values (another thing that's of no concern) We are still doing one thing wrong when a message belongs to multiple threads. We've got a nice comma-separated thread-value just like sup, but then we're also putting in a comma-separated thread-term where sup does multiple thread terms. That should be an easy fix. Beyond that, sup and notmuch are still disagreeing on the term lists for some messages, (I think attachment vs. inline content-disposition is at least one piece of this). But there are likely still differences in the heuristics for which chunks of the message body to index. I'll be looking into this more.	2009-10-14 13:29:52 -07:00
Carl Worth	9ab2447e89	notmuch-index-message: Lookup children for thread_id as well. This provides the thread_id linkage for when a child message is indexed before the parent.	2009-10-14 10:34:05 -07:00
Carl Worth	ed320cb45b	notmuch-index-message: Use more meaningful variable names. The abuse of the generic "value" name was getting very hard to read.	2009-10-14 09:57:59 -07:00
Carl Worth	7d1227c4a8	notmuch-index-message: Start generating correct thread_id values. Currently we're looking up all parents (based on In-reply-to and References header) and using the list of all thread_id values from those as our thread_id value. We're missing one step which sup does which is to also look up any children in the database that reference our message ID. So we'll need to do that next.	2009-10-14 09:54:05 -07:00
Carl Worth	5cbdcbbec5	Factor out parsing of reference-header values and pickup In-reply-to. This is in preparation for doing a couple of passes over the references, (one to add terms to the database, and a second to find the thread_id). We also now parse the In-reply-to header which we were missing before. We treat it identically to the References header.	2009-10-14 08:02:27 -07:00
Carl Worth	09f765ce18	notmuch-index-message: Ignore more signature patterns. Getting more sup-compatible all the time.	2009-10-14 07:24:28 -07:00
Carl Worth	c0da89a8e0	notmuch-index-message: Avoid crashing when a message has no references. It's obviously an innocent-enough message, and the right thing is so easy to do.	2009-10-13 21:15:12 -07:00
Carl Worth	3922bb4cfd	notmuch-index-message: Read message filenames from stdin. This allows for indexing an arbitrary number of messages with a single invocation rather than just a single message on the command line.	2009-10-13 21:15:07 -07:00
Carl Worth	3253954233	Move index_file out from main() into its own function. This is a step toward having a program that will index many messages with a single invocation.	2009-10-13 21:15:06 -07:00
Carl Worth	c4812dae16	notmuch-index-message: Index References as well. We're basically matching sup now! (As long as one uses sup with my special notmuch_index.rb file).	2009-10-13 21:15:01 -07:00
Carl Worth	dceb501e44	Minor code re-ordering for clarity. Pull the "constant" source_id value out from among several calls that set a value based on the Message ID.	2009-10-13 21:15:00 -07:00
Carl Worth	1479b99b50	notmuch-index-message: Don't index the "re:" prefix in subjects. Getting closer to sup results all the time.	2009-10-13 21:14:55 -07:00
Carl Worth	9bf3cda34c	notmuch-index-message: Don't index the line introducing a quote. We identify it based on a trailing ':' on the line before a quote begins. At this point the database-dump diff between sup and notmuch is getting very, very small, (at least for our one test message).	2009-10-13 21:14:50 -07:00
Carl Worth	048b8aec11	notmuch-index-message: Don't index quoted lines and signatures. At this point, we're achieving a result that is very close to what sup does. The only difference is that we are still indexing the "excerpts from message ..." line, and we are not yet indexing references.	2009-10-13 21:14:44 -07:00
Carl Worth	9dbb1facfb	notmuch-index-message: Separate gen_terms_body into its own function. This one is complex enough to deserve its own treatment.	2009-10-13 21:14:33 -07:00
Carl Worth	f69215d41f	notmuch-index-message: Add code to actually create a Xapian index. Most of this code is fairly clean and works well. One part is fairly painful---namely extracting the body of an email message from libgmime. Currently, I'm just extracting the offset to the end of the headers, and then separately opening the message. Surely there's a better way. Anyway, with that the results are looking very similar to sup-sync now, (as verified by xapian-dump). The only substantial difference I'm seeing now is that sup does not seem to index quoted portions of messages nor signatures. I'm not actually sure whether I want to follow sup's lead in that or not.	2009-10-13 15:59:57 -07:00
Carl Worth	c55c34f4a0	Rename g_mime_test to notmuch-index-message. In preparation for actually creating a Xapian index from the message, (not that we're doing that quite yet).	2009-10-13 13:31:17 -07:00
Carl Worth	a68a023d47	xapian-dump: Add a little more indentation. Just to make it easier to visually identify where one document ends and the next begins.	2009-10-13 13:21:47 -07:00
Carl Worth	1a6d88697b	Include document data in the dump. At the same time, I've started hacking up sup with a new NotmuchIndex class in the place of the previous XapianIndex class. The new class stores only the source_info field in the document data, (rather than a serialized ruby hash with a bunch of data that can be found in the original message). Eventually, I plan to replace source_info with a relative filename for the message, (or even a list of filenames for when multiple messages in the database share a common message ID).	2009-10-13 13:18:32 -07:00
Carl Worth	ea96cb694f	xapian-dump: Add support to unserialize values. The interface for this is cheesy, (bare integer value numbers on the command line indicating that unserialization is desired for those value numbers). But this at least lets us print sup databases with human-readable output for the date values.	2009-10-13 09:36:25 -07:00
Carl Worth	96a706383f	Add .gitignore file to ignore compiled binaries.	2009-10-13 08:57:02 -07:00
Carl Worth	76e15cf673	xapian-dump: Add values to the dump as well.	2009-10-13 08:54:43 -07:00
Carl Worth	c8532ce25d	xapian-dump: Fix to dump all terms for each document ID.	2009-10-13 08:54:35 -07:00
Carl Worth	26795d64e6	xapian-dump: Actually dump document IDs. It's not a complete tool yet, but it at least does something now.	2009-10-13 08:53:34 -07:00
Carl Worth	287ffc828d	Remove unused variable. Compiling with -Wall considered useful.	2009-10-13 08:53:28 -07:00
Carl Worth	11f99eb8ea	Add the beginnings of a xapian-dump program. This will (when it is finished) make a much more reliable way to ensure that notmuch's sync program behaves identically to sup-sync. It doesn't actually do anything yet.	2009-10-13 08:53:14 -07:00
Carl Worth	5986cfe5e7	Add sup-compatible prefixes and achieve sup-compatible print output. What I've done here is to instrument sup-sync to print the text and terms objects it constructs just before indexing a message. Then I've made my g_mime_test program achieve (nearly) identical output for an example email message, (just missing the body text). Next we can start shoving this data into a Xapian index.	2009-10-13 08:52:34 -07:00
Carl Worth	7d0886352c	Initial commit of a test program to form the basis of notmuch. Basically just playing with some simple code using libgmime to parse an email message.	2009-10-13 08:52:02 -07:00

7543 commits