Parse emails with RegEx

A few days ago I wrote a parser for MIME mails. MIME extends the structure of simple emails (RFC 2822) to allow non-ASCII text and binary attachments, and to ship alternative versions of the content (e.g. HTML for rich clients and plaintext for text-only clients).

The parser is written in JavaScript, so all code examples are JS, but the regular expressions can of course be adapted to other languages.

The target

MIME mails are identified by a version header that is added to the other headers. It is always 'MIME-Version: 1.0' as there is no other version. The bigger changes apply to the body structure. Where simple mails start directly with the actual message, MIME mails declare additional headers that describe the body contents (content type and encoding) and, for multipart messages, its structure.
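The version header check can be sketched as follows. This is a minimal sketch and not part of the original parser: the function name isMimeMail is made up for illustration, and it assumes that the header block ends at the first empty line.

```javascript
// Hypothetical helper (not part of the original parser): returns true if
// the header block of the raw message declares the MIME version header.
function isMimeMail(mail) {
    // assumption: headers and body are separated by the first empty line
    var headerEnd = mail.indexOf("\r\n\r\n");
    var headerBlock = headerEnd === -1 ? mail : mail.substring(0, headerEnd);
    // 'm' lets ^ match at every line start, 'i' ignores the case of the name
    return /^MIME-Version:[ \t]*1\.0/im.test(headerBlock);
}

var sample = "From: a@example.com\r\nMIME-Version: 1.0\r\nSubject: Hi\r\n\r\nHello";
var isMime = isMimeMail(sample);
```

Restricting the test to the header block avoids a false positive when the text 'MIME-Version: 1.0' merely appears inside the message body.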

The code box below shows a multipart email with two alternative variants of the content. The HTML subpart is itself a multipart container which carries the HTML for the message and an image referenced by it. If you want to try the mail yourself, make sure that each line ends with \r\n (CRLF) and that no whitespace is accidentally added.

Example email with embedded image

by example (Postfix) with ESMTPSA id FFFFFFFFFF
for <test@example.com>; Thu, 23 Jun 2011 11:26:44 +0200 (CEST)
Message-ID: <FFFFFFFF.5555555@example.com>
Date: Thu, 23 Jun 2011 11:26:29 +0200
From: Example Sender <test@example.com>
User-Agent: Mozilla/5.0 Gecko/20110616 Thunderbird/3.1.11
MIME-Version: 1.0
To: Example Receiver <test@example.com>
Subject: HTML Mail with embedded picture
Content-Type: multipart/alternative;
 boundary="------------060102080402030702040100"

This is a multi-part message in MIME format.
--------------060102080402030702040100
Content-Type: text/plain; charset=ISO-8859-15; format=flowed
Content-Transfer-Encoding: 7bit

Hello,
this is an HTML mail, it has *bold*, /italic /and _underlined_ text.
And then we have a table here:
Cell(1,1)
Cell(2,1)
Cell(1,2) Cell(2,2)
And we put a picture here:
Image Alt Text
That's it.
--------------060102080402030702040100
Content-Type: multipart/related;
 boundary="------------030904080004010009060206"

--------------030904080004010009060206
Content-Type: text/html; charset=ISO-8859-15
Content-Transfer-Encoding: 7bit

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<meta http-equiv="content-type" content="text/html;
charset=ISO-8859-15">
</head>
<body bgcolor="#ffffff" text="#000000">
Hello,<br>
<br>
this is an HTML mail, it has <b>bold</b>, <i>italic </i>and <u>underlined</u>
text.<br>
And then we have a table here:<br>
<table border="1" cellpadding="2" cellspacing="2" height="62"
width="401">
<tbody>
<tr>
<td valign="top">Cell(1,1)<br>
</td>
<td valign="top">Cell(2,1)</td>
</tr>
<tr>
<td valign="top">Cell(1,2)</td>
<td valign="top">Cell(2,2)</td>
</tr>
</tbody>
</table>
<br>
And we put a picture here:<br>
<br>
<img alt="Image Alt Text"
src="cid:part1.FFFFFFFF.5555555@example.com" height="79"
width="98"><br>
<br>
That's it.<br>
<br>
</body>
</html>
--------------030904080004010009060206
Content-Type: image/jpeg;
 name="picture.jpg"
Content-Transfer-Encoding: base64
Content-ID: <part1.FFFFFFFF.5555555@example.com>
Content-Disposition: inline;
 filename="picture.jpg"

/9j/4AAQSkZJRgABAQEAYABgAAD/2wBDAAIBAQIBAQICAgICAgICAwUDAwMDAwYEBAMFBwYH
BwcGBwcICQsJCAgKCAcHCg0KCgsMDAwMBwkODw0MDgsMDAz/2wBDAQICAgMDAwYDAwYMCAcI
DAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAz/wAAR
CABPAGIDASIAAhEBAxEB/8QAHwAAAQUBAQEBAQEAAAAAAAAAAAECAwQFBgcICQoL/8QAtRAA
AgEDAwIEAwUFBAQAAAF9AQIDAAQRBRIhMUEGE1FhByJxFDKBkaEII0KxwRVS0fAkM2JyggkK
FhcYGRolJicoKSo0NTY3ODk6Q0RFRkdISUpTVFVWV1hZWmNkZWZnaGlqc3R1dnd4eXqDhIWG
h4iJipKTlJWWl5iZmqKjpKWmp6ipqrKztLW2t7i5usLDxMXGx8jJytLT1NXW19jZ2uHi4+Tl
5ufo6erx8vP09fb3+Pn6/8QAHwEAAwEBAQEBAQEBAQAAAAAAAAECAwQFBgcICQoL/8QAtREA
AgECBAQDBAcFBAQAAQJ3AAECAxEEBSExBhJBUQdhcRMiMoEIFEKRobHBCSMzUvAVYnLRChYk
NOEl8RcYGRomJygpKjU2Nzg5OkNERUZHSElKU1RVVldYWVpjZGVmZ2hpanN0dXZ3eHl6goOE
hYaHiImKkpOUlZaXmJmaoqOkpaanqKmqsrO0tba3uLm6wsPExcbHyMnK0tPU1dbX2Nna4uPk
5ebn6Onq8vP09fb3+Pn6/9oADAMBAAIRAxEAPwD5rooor8DP9oAooooAK5H4/f8AJCPG3/YB
vv8A0nkrrq5H4/f8kI8bf9gG+/8ASeSujCfx4eq/M8XiT/kUYr/r3P8A9JZ5v/wRB/5qd/3C
v/b2vvavgn/giD/zU7/uFf8At7X3tX03EP8AyMKny/8ASUf83nij/wAlPif+3P8A03AKKKK8
U+ACiiigAooooA+f6KKK8s/6nAooooAK5H4/f8kI8bf9gG+/9J5K66uR+P3/ACQjxt/2Ab7/
ANJ5K6MJ/Hh6r8zxeJP+RRiv+vc//SWeb/8ABEH/AJqd/wBwr/29r72r4J/4Ig/81O/7hX/t
7X3tX03EP/IwqfL/ANJR/wA3nij/AMlPif8Atz/03AKKKK8U+ACiiigAooooA/Pb/h5j4E/6
BPi7/wABbf8A+P0f8PMfAn/QJ8Xf+Atv/wDH6/av/iHQ/Y2/6I7/AOXZrn/yZR/xDofsbf8A
RHf/AC7Nc/8Akyvwj/iaDw5/6BMZ/wCA0v8A5ef7S/8AEbvEL/n9Q/8AAP8AgH4qf8PMfAn/
AECfF3/gLb//AB+j/h5j4E/6BPi7/wABbf8A+P1+1f8AxDofsbf9Ed/8uzXP/kyj/iHQ/Y2/
6I7/AOXZrn/yZR/xNB4c/wDQJjP/AAGl/wDLw/4jd4hf8/qH/gH/AAD8VP8Ah5j4E/6BPi7/
AMBbf/4/WH8T/wDgoX4L8a/DXxDo1rpniiO61bTLmyhaW2gEavJEyKWImJAywzgE47Gv3H/4
h0P2Nv8Aojv/AJdmuf8AyZXlv7cf/BBb9k/4PfsU/GDxd4c+FP8AZ3iHwt4I1rV9Lu/+Em1i
b7LdW9hNLDJsku2Rtrop2spU4wQRxXdlf0lvD7E4yjhqOFxanOUYq8aVrtpK/wC+el99GcmP
8Z+Pq2GqUa1Wjyyi07Q1s1Z9Ox+Lv/BOj9s7wv8Asif8Jj/wklhr99/wkP2L7N/ZkEMuzyft
G7f5kqYz5q4xnoenGfpn/h858L/+gD4+/wDAK0/+Sa8D/wCCWf7Mfgf9o7/hOv8AhM9E/tn+
xv7P+x/6ZcW/k+b9p3/6p0zny065xjjGTX1v/wAOxvgd/wBCR/5WNQ/+P1/f2WeCmKz7DRza
nOCVS+8pJ+63HZQa+z3P8xPEPifgrCcQYjD5vQryrrk5nDk5XeEWrXmn8LV9Frc4T/h858L/
APoA+Pv/AACtP/kmj/h858L/APoA+Pv/AACtP/kmu7/4djfA7/oSP/KxqH/x+j/h2N8Dv+hI
/wDKxqH/AMfr0P8AiXHG/wDPyn/4HP8A+Vnxf+uvhx/0DYr/AMk/+WHCf8PnPhf/ANAHx9/4
BWn/AMk0f8PnPhf/ANAHx9/4BWn/AMk13f8Aw7G+B3/Qkf8AlY1D/wCP0f8ADsb4Hf8AQkf+
VjUP/j9H/EuON/5+U/8AwOf/AMrD/XXw4/6BsV/5J/8ALDhP+Hznwv8A+gD4+/8AAK0/+SaK
7v8A4djfA7/oSP8Aysah/wDH6KP+Jccb/wA/Kf8A4HP/AOVh/rr4cf8AQNiv/JP/AJYfv1RR
RX/Osf6chRRRQAV4l/wUv/5RxftAf9k28Rf+mu5r22vEv+Cl/wDyji/aA/7Jt4i/9NdzXv8A
Cn/I7wf/AF9p/wDpaMcT/Bn6P8j+fj/giL/zU7/uFf8At7X3pXwX/wAERf8Amp3/AHCv/b2v
vSv+ofwv/wCSZw3/AG//AOnJn+THjh/yW2N/7h/+maYUUUV98flAUUUUAFFFFAH7AUUUV/yN
n+4gUUUUAFeJf8FL/wDlHF+0B/2TbxF/6a7mvba8S/4KX/8AKOL9oD/sm3iL/wBNdzXv8Kf8
jvB/9faf/paMcT/Bn6P8j+fj/giL/wA1O/7hX/t7X3pXwX/wRF/5qd/3Cv8A29r70r/qH8L/
APkmcN/2/wD+nJn+THjh/wAltjf+4f8A6ZphRRRX3x+UBRRRQAUUUUAf/9k=
--------------030904080004010009060206--
--------------060102080402030702040100--

Parsing the mail

First I determine where the body part of the email starts. Then I take everything before this and fetch the headers from there. The following code shows what to do:

Parsing email headers

// 'mail' is a variable holding the email string
// RegEx matching first occurrence of 'Content-Type: ' - start of mail body
var eohRegEx = /^Content-Type: /im;
// search() returns the pos of C in 'Content-Type: ' (-1 if there is none)
var bodyPos = mail.search(eohRegEx);
var headerLines = mail.substring(0, bodyPos); // ends with \r\n
// now determine the present headers and store them in an object
var headers = {};
// the name group is lazy so that it stops at the first ': '
var headerRegExp = /^(.+?): ((.|\r\n\s)+)\r\n/mg;
var h;
while ((h = headerRegExp.exec(headerLines)) !== null)
    headers[h[1]] = h[2];
// on a website you now can annoy the visitor with all headers and their values
for (var field in headers)
    alert(field + ": " + headers[field]);

The header RegEx

The RegEx to match single headers is /^(.+?): ((.|\r\n\s)+)\r\n/mg. The first group lazily captures every character up to the first colon followed by a space, which is the header’s name. The second group for the value captures every character until a CRLF occurs. There is one exception, which requires the \r\n\s alternative in the group: long values may be split over several lines, which is indicated by a leading whitespace character on the next line.
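To see the header RegEx at work, here is a small sketch that runs the lazy variant from the header parsing code on two made-up header lines: one value is split over two lines, the other contains a second colon. The sample headers and the unfolding step (removing the CRLF in front of a continuation line) are my additions for illustration.

```javascript
// made-up sample headers: a folded Subject and a value with a second ': '
var headerLines =
    "Subject: A subject that is long enough\r\n" +
    " to be folded\r\n" +
    "X-Report-Abuse: Report abuse here: http://example.com\r\n";

var headerRegExp = /^(.+?): ((.|\r\n\s)+)\r\n/mg;
var headers = {};
var h;
while ((h = headerRegExp.exec(headerLines)) !== null) {
    // unfold: drop the CRLF that precedes a continuation line
    headers[h[1]] = h[2].replace(/\r\n(?=[ \t])/g, "");
}

// for comparison: a greedy name group runs up to the LAST ': ' of the line
var greedyName = /^(.+): /m.exec(headerLines.split("\r\n")[2])[1];
```

With the lazy group the second header is stored under 'X-Report-Abuse', whereas the greedy comparison yields 'X-Report-Abuse: Report abuse here' as the name.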

Decomposing the body

Having the body’s starting position (in bodyPos), I fetch its content and look at the first Content-Type header. If it contains 'multipart/foo', the body consists of a MIME container holding the inner MIME fragments (usually at least two). Otherwise there is only one fragment, which is either plaintext or HTML without embedded images.

Determine the body's outer structure

// 'bodyPos' and 'mail' from above
var bodyLines = mail.substring(bodyPos);
var contentType = (/^Content-Type: (.+)$/im).exec(bodyLines)[1];
// check whether body is multipart (grouping not further used here)
var mpRegEx = /^multipart\/(.+);/i;
// if multipart then the body is a container otherwise a fragment
var parsedBody;
if (mpRegEx.test(contentType))
    parsedBody = parseMimeContainer(bodyLines);
else
    parsedBody = parseMimeFragment(bodyLines);
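The Content-Type value inspected above also carries parameters such as charset or boundary. The following is a minimal sketch of how the value could be split into its media type and parameters. The helper parseContentType is hypothetical and not part of the parser, and it assumes that no quoted parameter value contains a semicolon.

```javascript
// Hypothetical helper: splits a (possibly folded) Content-Type value into
// its media type and a map of parameters.
// Limitation: a ';' inside a quoted parameter value is not supported.
function parseContentType(value) {
    // unfold continuation lines first, then split at the semicolons
    var parts = value.replace(/\r\n[ \t]+/g, " ").split(";");
    var result = { mediaType: parts[0].trim().toLowerCase(), params: {} };
    for (var i = 1; i < parts.length; i++) {
        // name=value, where the value is either quoted or a bare token
        var m = /^\s*([^=\s]+)\s*=\s*(?:"([^"]*)"|(\S*))\s*$/.exec(parts[i]);
        if (m)
            result.params[m[1].toLowerCase()] = (m[2] !== undefined) ? m[2] : m[3];
    }
    return result;
}

var outer = parseContentType(
    'multipart/alternative;\r\n boundary="------------060102080402030702040100"');
var plain = parseContentType("text/plain; charset=ISO-8859-15; format=flowed");
```

Here outer.params.boundary holds the boundary without the surrounding quotes, and plain.params.charset is 'ISO-8859-15'.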

Parsing MIME fragments

In the mails I considered, fragments always have Content-Type and Content-Transfer-Encoding headers. Additionally, Content-ID (for referencing images in HTML) and Content-Disposition may be present. All lines following the MIME headers form the actual content of a fragment. The content type header defines what kind of data is at hand (e.g. text/plain or image/jpeg), and the transfer encoding specifies the given representation (binary data is usually base64 encoded).
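Once the transfer encoding is known, the contents can be turned back into the original data. The sketch below is not part of the parser; decodeContents is a hypothetical helper that handles base64 and quoted-printable and returns a string with one character per byte. It uses atob where available (browsers) and falls back to Node's Buffer otherwise.

```javascript
// Hypothetical helper: decodes the contents of a fragment according to
// its Content-Transfer-Encoding; the result has one character per byte.
function decodeContents(contents, encoding) {
    switch (encoding.trim().toLowerCase()) {
    case "base64":
        // remove the line breaks inside the base64 data before decoding
        var clean = contents.replace(/\s+/g, "");
        return (typeof atob === "function")
            ? atob(clean)
            : Buffer.from(clean, "base64").toString("latin1");
    case "quoted-printable":
        return contents
            .replace(/=\r\n/g, "") // soft line breaks
            .replace(/=([0-9A-Fa-f]{2})/g, function (match, hex) {
                return String.fromCharCode(parseInt(hex, 16));
            });
    default:
        // 7bit, 8bit and binary need no decoding
        return contents;
    }
}

var decoded = decodeContents("SGVs\r\nbG8=", "base64");
```

For text fragments the decoded bytes still have to be interpreted in the charset named in the Content-Type header, which this sketch leaves out.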

Function to parse MIME fragments

function parseMimeFragment(fragmentStr){
    var result = {};
    // each fragment has a content type and encoding
    result["contentType"] = (/^Content-Type: (.+)$/m).exec(fragmentStr)[1];
    result["encoding"] =
                (/^Content-Transfer-Encoding: (.+)$/m).exec(fragmentStr)[1];
    // not all fragments have a disposition or an id
    var disposition =
        (/^Content-Disposition: ((.|\r\n )+)\r\n/m).exec(fragmentStr);
    if (disposition)
        result["contentDisposition"] = disposition[1];
    var contentId = (/^Content-ID: ((.|\r\n\t)+)\r\n/m).exec(fragmentStr);
    if (contentId)
        result["contentId"] = contentId[1];
    // in my case there were two \r\n between MIME headers and content;
    // there may be only one if there are problems with this code
    result["contents"] = (/^.*\r\n\r\n([\s\S]+)\r\n/m).exec(fragmentStr)[1];
    result["isFragment"] = true;
    return result;
}

Parsing MIME containers

I treat all MIME parts of type multipart as containers because they are composed of MIME fragments or inner containers (usually two or more) which are separated by a boundary value. Because containers can be nested, the resulting structure can be arbitrarily deep. To parse a container, I check each of its parts in sequence for its type and either call the above function to handle a fragment or make a recursive call to the following function.
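The nesting is easiest to see when walking a parsed result. The following sketch is not part of the parser: listFragmentTypes is a hypothetical helper, and the structure is built by hand to mirror the example mail, assuming that the 'contents' of a container holds its parts as an array.

```javascript
// Hypothetical helper: collects the content types of all fragments in
// depth-first order; containers are descended into recursively.
function listFragmentTypes(part, found) {
    found = found || [];
    if (part.isFragment) {
        found.push(part.contentType);
    } else {
        for (var i = 0; i < part.contents.length; i++)
            listFragmentTypes(part.contents[i], found);
    }
    return found;
}

// hand-built structure that mirrors the example mail (assumed shape)
var parsed = {
    isFragment: false, contentType: "multipart/alternative", contents: [
        { isFragment: true, contentType: "text/plain" },
        { isFragment: false, contentType: "multipart/related", contents: [
            { isFragment: true, contentType: "text/html" },
            { isFragment: true, contentType: "image/jpeg" }
        ]}
    ]
};

var types = listFragmentTypes(parsed);
```

The recursion follows the same pattern as the container parsing: fragments end a branch, containers lead one level deeper.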

Function to parse MIME containers

function parseMimeContainer(aContainerStr){
    // each container has a content type and boundary
    var cType = (/^Content-Type: ((.|\r\n )+)\r\n/m).exec(aContainerStr)[1];
    var boundary = (/^ boundary="(.+)"/m).exec(cType)[1];
    // escape RegEx metacharacters, a boundary may contain e.g. '+' or '?'
    var escBoundary = boundary.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
    var result = {};
    result["contentType"] = cType;
    result["boundary"] = boundary;
    result["isFragment"] = false;
    // next fetch contents (everything after the first use of boundary)
    var contentRegEx =
        new RegExp("--" + escBoundary + "\\r\\n([\\s\\S]+--" + escBoundary + ")--");
    var containerContents = contentRegEx.exec(aContainerStr)[1];
    // RegEx below determines where the next part ends
    var boundaryRegEx = new RegExp("^([\\s\\S]+?)--" + escBoundary, "m");
    // the parts are collected in an array (a plain object has no push())
    var contents = [];
    // as long as there are more parts
    while (boundaryRegEx.test(containerContents)){
        // fetch next part, remove it from the remaining contents to be handled
        var nextPart = boundaryRegEx.exec(containerContents)[1];
        // + 4 is for the two dashes preceding the boundary value and \r\n
        containerContents =
            containerContents.substring(nextPart.length + boundary.length + 4);
        // if the next part is of type multipart, we have a container
        if ((/^Content-Type: multipart\/(.+);/i).test(nextPart))
            contents.push(parseMimeContainer(nextPart));
        else
            contents.push(parseMimeFragment(nextPart));
    }
    result["contents"] = contents;
    return result;
}

Final remarks

These four snippets show the essential building blocks required to parse MIME mails. If the code doesn’t work in some cases, check whether there is one CRLF too many or too few. Another issue may be nonconforming emails which don’t use \r\n as the newline sequence. Still, I guess most deviations will be due to whitespace, so an editor which can display whitespace characters, such as Notepad++ (Windows) or SciTE (Linux), is a good tool to identify these spots.
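One way to deal with differing newline sequences is to normalize them before parsing. This is a minimal sketch under the assumption that every lone \r or \n in the source is meant as a line break; normalizeNewlines is a made-up name and not part of the parser.

```javascript
// Hypothetical helper: converts all newline variants to \r\n so that the
// regular expressions of the parser find what they expect.
function normalizeNewlines(mail) {
    // '\r\n' is listed first so that an existing CRLF is kept as one unit
    return mail.replace(/\r\n|\r|\n/g, "\r\n");
}

var normalized = normalizeNewlines("Subject: Hi\nTo: a@example.com\r\n\nHello\r");
```

The function would be applied once to the raw mail string before any of the snippets above run.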

Note that I only considered emails from Thunderbird while developing the parser. This is important because Thunderbird always writes the Content-Type header immediately in front of the body, and my parser exploits this (Outlook, for example, behaves differently).
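A way to become independent of the header order is to split the mail at the first empty line, which separates the headers from the body. The sketch below illustrates this alternative and is not what my parser does; splitMail is a hypothetical helper. Note that with this split the top-level Content-Type ends up among the headers, so the body functions above would have to be given its value separately.

```javascript
// Hypothetical helper: splits a raw mail at the first empty line instead
// of relying on Content-Type being the last header.
function splitMail(mail) {
    var pos = mail.indexOf("\r\n\r\n");
    if (pos === -1)
        return { headerLines: mail, body: "" }; // no body present
    return {
        headerLines: mail.substring(0, pos + 2), // keep CRLF of last header
        body: mail.substring(pos + 4)            // skip the empty line
    };
}

// here Content-Type is NOT the last header
var parts = splitMail("Content-Type: text/plain\r\nSubject: Hi\r\n\r\nHello\r\n");
```

The resulting headerLines still ends with \r\n, so it can be fed to the header RegEx unchanged.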

I also created a demo from the above code which you can find here. It parses the header fields and the body structure from a given email’s source.

August 9th, 2011 in Programming | tags: Email, JavaScript, Parsing

Comments (6)

Neha, September 9th, 2011 at 14:45

Hi,

I am also looking for something which parses emails and gives me the body, subject and date in Java. Are there any parsers available?

jonas, September 10th, 2011 at 14:08

I don’t know whether there is something similar available in Java, but I guess if you have the raw email string at hand and know how to use RegEx in Java, you could easily write your own parser based on my JavaScript version. Otherwise you need to familiarize yourself with RegEx first.

DIego, September 15th, 2014 at 18:53

How can you put that email into a javascript variable when it contains both double quotes and single quotes??

jonas, September 15th, 2014 at 19:13

Well that depends on the source from which you load it into the variable. I guess, you want to literally paste an email into your code for playing around as in
var mail = "[the mail text]"
In this case you need to escape either the single or the double quotes in your string literal with a backslash: var escaped_quote = "This \" is escaped.".

Greg, September 25th, 2014 at 22:53

It seems like there might be some issues parsing the header field "Received-SPF", for example:

Received-SPF: pass (google.com: domain of bounce-md_XXXXXXXX@google.com.com designates 198.X.XXX.X as permitted sender) client-ip=198.X.XXX.X;

Returns the header field of "Received-SPF: pass (google.com" instead of "Received-SPF"

It seems like it might have some problems with multiple ":" in the header lines.

Another example is "X-Report-Abuse: You can also report abuse here: http://google.com". It returns the field as "X-Report-Abuse: You can also report abuse here"

jonas, September 27th, 2014 at 11:18

You are right, the header name group in the regex needs to be lazy:
/^(.+?): ((.|\r\n\s)+)\r\n/mg.

Jonas' weblog: Foobar et al.
