Parse emails with RegEx

A few days ago I wrote a parser for MIME mails. MIME extends the structure of simple emails (RFC 2822) to allow for non-ASCII text and binary attachments, and for shipping alternative versions of the content (e.g. HTML for rich clients and plaintext for text-only clients).

The parser is written in JavaScript, so all code examples are in JS, but the regular expressions can of course be adapted for other languages.

The target

MIME mails are characterized by the version header, which is added to the other headers. It is always 'MIME-Version: 1.0' as there is no other version. More changes apply to the body structure. Where simple mails start directly with the actual message, MIME mails declare additional headers to describe the body contents (content type and encoding) and, in the case of multipart messages, also their structure.

The code box below shows a multipart email with two alternative variants of the content. The HTML subpart is itself a multipart container, which carries the HTML for the message and an image referenced by it. If you want to experiment with the mail yourself, make sure that each line ends with \r\n (CRLF) and that no whitespace is accidentally added.

    Return-Path: <test@example.com>
    X-Original-To: test@example.com
    Delivered-To: test@example.com
    Received: from [127.0.0.1] (127-0-0-1-dynip.superkabel.de [127.0.0.1])
     by example (Postfix) with ESMTPSA id FFFFFFFFFF
     for <test@example.com>; Thu, 23 Jun 2011 11:26:44 +0200 (CEST)
    Message-ID: <FFFFFFFF.5555555@example.com>
    Date: Thu, 23 Jun 2011 11:26:29 +0200
    From: Example Sender <test@example.com>
    User-Agent: Mozilla/5.0 Gecko/20110616 Thunderbird/3.1.11
    MIME-Version: 1.0
    To: Example Receiver <test@example.com>
    Subject: HTML Mail with embedded picture
    Content-Type: multipart/alternative;
     boundary="------------060102080402030702040100"

    This is a multi-part message in MIME format.
    --------------060102080402030702040100
    Content-Type: text/plain; charset=ISO-8859-15; format=flowed
    Content-Transfer-Encoding: 7bit

    Hello,
    this is an HTML mail, it has *bold*, /italic /and _underlined_ text.
    And then we have a table here:
    Cell(1,1)
    Cell(2,1)
    Cell(1,2) Cell(2,2)
    And we put a picture here:
    Image Alt Text
    That's it.
    --------------060102080402030702040100
    Content-Type: multipart/related;
     boundary="------------030904080004010009060206"

    --------------030904080004010009060206
    Content-Type: text/html; charset=ISO-8859-15
    Content-Transfer-Encoding: 7bit

    <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
    <html>
    <head>
    <meta http-equiv="content-type" content="text/html;
    charset=ISO-8859-15">
    </head>
    <body bgcolor="#ffffff" text="#000000">
    Hello,<br>
    <br>
    this is an HTML mail, it has <b>bold</b>, <i>italic </i>and <u>underlined</u>
    text.<br>
    And then we have a table here:<br>
    <table border="1" cellpadding="2" cellspacing="2" height="62"
    width="401">
    <tbody>
    <tr>
    <td valign="top">Cell(1,1)<br>
    </td>
    <td valign="top">Cell(2,1)</td>
    </tr>
    <tr>
    <td valign="top">Cell(1,2)</td>
    <td valign="top">Cell(2,2)</td>
    </tr>
    </tbody>
    </table>
    <br>
    And we put a picture here:<br>
    <br>
    <img alt="Image Alt Text"
    src="cid:part1.FFFFFFFF.5555555@example.com" height="79"
    width="98"><br>
    <br>
    That's it.<br>
    <br>
    </body>
    </html>
    --------------030904080004010009060206
    Content-Type: image/jpeg;
     name="picture.jpg"
    Content-Transfer-Encoding: base64
    Content-ID: <part1.FFFFFFFF.5555555@example.com>
    Content-Disposition: inline;
     filename="picture.jpg"

    /9j/4AAQSkZJRgABAQEAYABgAAD/2wBDAAIBAQIBAQICAgICAgICAwUDAwMDAwYEBAMFBwYH
    BwcGBwcICQsJCAgKCAcHCg0KCgsMDAwMBwkODw0MDgsMDAz/2wBDAQICAgMDAwYDAwYMCAcI
    DAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAz/wAAR
    CABPAGIDASIAAhEBAxEB/8QAHwAAAQUBAQEBAQEAAAAAAAAAAAECAwQFBgcICQoL/8QAtRAA
    AgEDAwIEAwUFBAQAAAF9AQIDAAQRBRIhMUEGE1FhByJxFDKBkaEII0KxwRVS0fAkM2JyggkK
    FhcYGRolJicoKSo0NTY3ODk6Q0RFRkdISUpTVFVWV1hZWmNkZWZnaGlqc3R1dnd4eXqDhIWG
    h4iJipKTlJWWl5iZmqKjpKWmp6ipqrKztLW2t7i5usLDxMXGx8jJytLT1NXW19jZ2uHi4+Tl
    5ufo6erx8vP09fb3+Pn6/8QAHwEAAwEBAQEBAQEBAQAAAAAAAAECAwQFBgcICQoL/8QAtREA
    AgECBAQDBAcFBAQAAQJ3AAECAxEEBSExBhJBUQdhcRMiMoEIFEKRobHBCSMzUvAVYnLRChYk
    NOEl8RcYGRomJygpKjU2Nzg5OkNERUZHSElKU1RVVldYWVpjZGVmZ2hpanN0dXZ3eHl6goOE
    hYaHiImKkpOUlZaXmJmaoqOkpaanqKmqsrO0tba3uLm6wsPExcbHyMnK0tPU1dbX2Nna4uPk
    5ebn6Onq8vP09fb3+Pn6/9oADAMBAAIRAxEAPwD5rooor8DP9oAooooAK5H4/f8AJCPG3/YB
    vv8A0nkrrq5H4/f8kI8bf9gG+/8ASeSujCfx4eq/M8XiT/kUYr/r3P8A9JZ5v/wRB/5qd/3C
    v/b2vvavgn/giD/zU7/uFf8At7X3tX03EP8AyMKny/8ASUf83nij/wAlPif+3P8A03AKKKK8
    U+ACiiigAooooA+f6KKK8s/6nAooooAK5H4/f8kI8bf9gG+/9J5K66uR+P3/ACQjxt/2Ab7/
    ANJ5K6MJ/Hh6r8zxeJP+RRiv+vc//SWeb/8ABEH/AJqd/wBwr/29r72r4J/4Ig/81O/7hX/t
    7X3tX03EP/IwqfL/ANJR/wA3nij/AMlPif8Atz/03AKKKK8U+ACiiigAooooA/Pb/h5j4E/6
    BPi7/wABbf8A+P0f8PMfAn/QJ8Xf+Atv/wDH6/av/iHQ/Y2/6I7/AOXZrn/yZR/xDofsbf8A
    RHf/AC7Nc/8Akyvwj/iaDw5/6BMZ/wCA0v8A5ef7S/8AEbvEL/n9Q/8AAP8AgH4qf8PMfAn/
    AECfF3/gLb//AB+j/h5j4E/6BPi7/wABbf8A+P1+1f8AxDofsbf9Ed/8uzXP/kyj/iHQ/Y2/
    6I7/AOXZrn/yZR/xNB4c/wDQJjP/AAGl/wDLw/4jd4hf8/qH/gH/AAD8VP8Ah5j4E/6BPi7/
    AMBbf/4/WH8T/wDgoX4L8a/DXxDo1rpniiO61bTLmyhaW2gEavJEyKWImJAywzgE47Gv3H/4
    h0P2Nv8Aojv/AJdmuf8AyZXlv7cf/BBb9k/4PfsU/GDxd4c+FP8AZ3iHwt4I1rV9Lu/+Em1i
    b7LdW9hNLDJsku2Rtrop2spU4wQRxXdlf0lvD7E4yjhqOFxanOUYq8aVrtpK/wC+el99GcmP
    8Z+Pq2GqUa1Wjyyi07Q1s1Z9Ox+Lv/BOj9s7wv8Asif8Jj/wklhr99/wkP2L7N/ZkEMuzyft
    G7f5kqYz5q4xnoenGfpn/h858L/+gD4+/wDAK0/+Sa8D/wCCWf7Mfgf9o7/hOv8AhM9E/tn+
    xv7P+x/6ZcW/k+b9p3/6p0zny065xjjGTX1v/wAOxvgd/wBCR/5WNQ/+P1/f2WeCmKz7DRza
    nOCVS+8pJ+63HZQa+z3P8xPEPifgrCcQYjD5vQryrrk5nDk5XeEWrXmn8LV9Frc4T/h858L/
    APoA+Pv/AACtP/kmj/h858L/APoA+Pv/AACtP/kmu7/4djfA7/oSP/KxqH/x+j/h2N8Dv+hI
    /wDKxqH/AMfr0P8AiXHG/wDPyn/4HP8A+Vnxf+uvhx/0DYr/AMk/+WHCf8PnPhf/ANAHx9/4
    BWn/AMk0f8PnPhf/ANAHx9/4BWn/AMk13f8Aw7G+B3/Qkf8AlY1D/wCP0f8ADsb4Hf8AQkf+
    VjUP/j9H/EuON/5+U/8AwOf/AMrD/XXw4/6BsV/5J/8ALDhP+Hznwv8A+gD4+/8AAK0/+SaK
    7v8A4djfA7/oSP8Aysah/wDH6KP+Jccb/wA/Kf8A4HP/AOVh/rr4cf8AQNiv/JP/AJYfv1RR
    RX/Osf6chRRRQAV4l/wUv/5RxftAf9k28Rf+mu5r22vEv+Cl/wDyji/aA/7Jt4i/9NdzXv8A
    Cn/I7wf/AF9p/wDpaMcT/Bn6P8j+fj/giL/zU7/uFf8At7X3pXwX/wAERf8Amp3/AHCv/b2v
    vSv+ofwv/wCSZw3/AG//AOnJn+THjh/yW2N/7h/+maYUUUV98flAUUUUAFFFFAH7AUUUV/yN
    n+4gUUUUAFeJf8FL/wDlHF+0B/2TbxF/6a7mvba8S/4KX/8AKOL9oD/sm3iL/wBNdzXv8Kf8
    jvB/9faf/paMcT/Bn6P8j+fj/giL/wA1O/7hX/t7X3pXwX/wRF/5qd/3Cv8A29r70r/qH8L/
    APkmcN/2/wD+nJn+THjh/wAltjf+4f8A6ZphRRRX3x+UBRRRQAUUUUAf/9k=
    --------------030904080004010009060206--
    --------------060102080402030702040100--
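
Because the regular expressions in this post rely on CRLF line endings, a mail that was pasted from an editor can be normalized before parsing. The following is only a minimal sketch; the helper name normalizeLineEndings and the sample string are made up for illustration and are not part of the parser.

```javascript
// Hypothetical helper: turn every CRLF, lone CR or lone LF into CRLF.
// The alternation tries \r\n first, so an existing CRLF is not doubled.
function normalizeLineEndings(text) {
    return text.replace(/\r\n|\r|\n/g, "\r\n");
}

// made-up sample with mixed line endings
var pasted = "Subject: Test\nMIME-Version: 1.0\r\n\nHello";
var mail = normalizeLineEndings(pasted);
console.log(JSON.stringify(mail));
```

After this step the variable mail uses \r\n throughout, which is what the snippets below expect.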

Parsing the mail

First I determine where the body part of the email starts. Then I take everything before this and fetch the headers from there. The following code shows what to do:

    // 'mail' is a variable holding the email string
    // RegEx matching first occurrence of 'Content-Type: ' - start of mail body
    var eohRegEx = /^Content-Type: /im;
    // the match's index is the position of the C in 'Content-Type: '
    var bodyPos = eohRegEx.exec(mail).index;
    var headerLines = mail.substring(0, bodyPos); // ends with \r\n
    // now determine the present headers and store them in an object
    var headers = {};
    var headerRegExp = /^(.+?): ((.|\r\n\s)+)\r\n/mg;
    var h;
    // note: a header that occurs several times (e.g. Received) keeps only its last value
    while ((h = headerRegExp.exec(headerLines)) !== null)
        headers[h[1]] = h[2];
    // on a website you can now annoy the visitor with all headers and their values
    for (var field in headers)
        alert(field + ": " + headers[field]);

The header RegEx

The RegEx to match single headers is /^(.+?): ((.|\r\n\s)+)\r\n/mg. The first group lazily captures every character up to the first colon followed by a space, which is the header's name. It has to be lazy (.+? instead of .+) because a value may itself contain ': ', and a greedy group would then swallow part of the value into the name. The second group captures the value, that is every character until a CRLF occurs. There is one exception, which is why the group needs the \r\n\s alternative: long values may be split over several lines, which is indicated by a leading whitespace character on the next line.
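
To see why the name group matters, the following sketch runs a greedy and a lazy variant of the header RegEx over two made-up header lines, one of which contains ': ' inside its value and one of which is folded. The helper names headerNames and unfold are assumptions for this illustration only. Unfolding, that is removing the CRLF in front of a continuation line, is optional but makes the captured value easier to work with.

```javascript
// made-up header block: a value containing ': ' and a folded Content-Type
var sampleHeaders =
    "X-Report-Abuse: You can also report abuse here: http://example.com\r\n" +
    "Content-Type: multipart/alternative;\r\n" +
    " boundary=\"xyz\"\r\n";

// Hypothetical helper: collect the header names found by a given RegEx
function headerNames(regEx, text) {
    var names = [];
    var h;
    while ((h = regEx.exec(text)) !== null) {
        names.push(h[1]);
    }
    return names;
}

var greedyNames = headerNames(/^(.+): ((.|\r\n\s)+)\r\n/mg, sampleHeaders);
var lazyNames = headerNames(/^(.+?): ((.|\r\n\s)+)\r\n/mg, sampleHeaders);

console.log(greedyNames[0]); // X-Report-Abuse: You can also report abuse here
console.log(lazyNames[0]);   // X-Report-Abuse

// Hypothetical helper: drop the CRLF that precedes a continuation line
function unfold(value) {
    return value.replace(/\r\n(?=[ \t])/g, "");
}

console.log(unfold("multipart/alternative;\r\n boundary=\"xyz\""));
// multipart/alternative; boundary="xyz"
```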

Decomposing the body

Having the body's starting position (in bodyPos), I fetch its content and look at the first Content-Type header. If it contains 'multipart/foo', the body is a MIME container with at least two inner MIME fragments. Otherwise there is only one fragment, which is either plaintext or HTML without embedded images.

    // 'bodyPos' and 'mail' from above
    var bodyLines = mail.substring(bodyPos);
    var contentType = (/^Content-Type: (.+)$/im).exec(bodyLines)[1];
    // check whether body is multipart (grouping not further used here)
    var mpRegEx = /^multipart\/(.+);/i;
    // if multipart then the body is a container, otherwise a fragment
    var parsedBody;
    if (mpRegEx.test(contentType))
        parsedBody = parseMimeContainer(bodyLines);
    else
        parsedBody = parseMimeFragment(bodyLines);
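
A Content-Type value carries more than the media type: parameters such as charset or boundary follow after semicolons. The sketch below splits a value into these pieces. The function name parseContentType and the shape of its result are assumptions for this example, and it deliberately ignores quoted parameter values that contain a semicolon.

```javascript
// Hypothetical helper: split a Content-Type value into media type and parameters
function parseContentType(value) {
    // unfold first, so that parameters on continuation lines are found
    var unfolded = value.replace(/\r\n(?=[ \t])/g, "");
    var pieces = unfolded.split(";");
    var result = { mediaType: pieces[0].trim().toLowerCase(), params: {} };
    for (var i = 1; i < pieces.length; i++) {
        // name=value, where the value may be wrapped in double quotes
        var param = /^\s*([^=\s]+)\s*=\s*"?([^"]*?)"?\s*$/.exec(pieces[i]);
        if (param) {
            result.params[param[1].toLowerCase()] = param[2];
        }
    }
    return result;
}

var parsedType = parseContentType("text/plain; charset=ISO-8859-15; format=flowed");
console.log(parsedType.mediaType);      // text/plain
console.log(parsedType.params.charset); // ISO-8859-15
```

With such a helper the multipart check could compare the media type directly instead of matching the raw header value.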

Parsing MIME fragments

Fragments usually have Content-Type and Content-Transfer-Encoding headers (strictly speaking both are optional and default to text/plain and 7bit). In addition, Content-ID (for referencing images in HTML) and Content-Disposition may be present. All lines following the MIME headers form the actual content of a fragment. The content type header defines what kind of data is at hand (e.g. text/plain or image/jpeg), and the transfer encoding specifies the given representation (binary data is usually base64 encoded).

    function parseMimeFragment(fragmentStr) {
        var result = {};
        // each fragment has a content type and encoding
        result["contentType"] = (/^Content-Type: (.+)$/m).exec(fragmentStr)[1];
        result["encoding"] =
            (/^Content-Transfer-Encoding: (.+)$/m).exec(fragmentStr)[1];
        // not all fragments have a disposition or an id
        var disposition =
            (/^Content-Disposition: ((.|\r\n )+)\r\n/m).exec(fragmentStr);
        if (disposition) {
            result["contentDisposition"] = disposition[1];
        }
        var contentId = (/^Content-ID: ((.|\r\n\t)+)\r\n/m).exec(fragmentStr);
        if (contentId) {
            result["contentId"] = contentId[1];
        }
        // in my case there were two \r\n between the MIME headers and the content;
        // there may be only one if there are problems with this code
        result["contents"] = (/^.*\r\n\r\n([\s\S]+)\r\n/m).exec(fragmentStr)[1];
        result["isFragment"] = true;
        return result;
    }
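
The contents of a fragment are still in their transfer encoding after parsing. As a rough sketch, the base64 case could be decoded as shown below. The helper names base64ToBinaryString and decodeContents are made up for this example; it assumes a fragment object with the encoding and contents fields from parseMimeFragment, and it does not handle quoted-printable.

```javascript
// Hypothetical helper: decode base64 text to a binary string.
// Uses atob where it exists (browsers, newer Node.js) and Buffer otherwise.
function base64ToBinaryString(text) {
    var compact = text.replace(/\s+/g, ""); // drop the line breaks of the MIME body
    if (typeof atob === "function") {
        return atob(compact);
    }
    return Buffer.from(compact, "base64").toString("binary");
}

// Hypothetical helper: turn the contents of a parsed fragment into byte values
function decodeContents(fragment) {
    var encoding = fragment.encoding.trim().toLowerCase();
    var text = fragment.contents;
    if (encoding === "base64") {
        text = base64ToBinaryString(text);
    }
    // every other encoding is taken over unchanged in this sketch
    var bytes = [];
    for (var i = 0; i < text.length; i++) {
        bytes.push(text.charCodeAt(i) & 0xff);
    }
    return bytes;
}

// the first characters of the picture in the sample mail, split over two lines
var jpegStart = decodeContents({ encoding: "base64", contents: "/9j/4AAQ\r\nSkZJ" });
console.log(jpegStart.length); // 9
```

The first two decoded bytes are 0xFF and 0xD8, the usual start of a JPEG file, which is a quick way to check that the decoding works.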

Parsing MIME containers

I treat all MIME parts of type multipart as containers because they are composed of two or more MIME fragments or inner containers, which are separated by a boundary value. Because containers can be nested, the resulting structure can be arbitrarily deep. To parse a container, I check each of its parts in sequence for its type and either call the function above to handle a fragment or make a recursive call to the following function.

    function parseMimeContainer(aContainerStr) {
        // each container has a content type and boundary
        var cType = (/^Content-Type: ((.|\r\n )+)\r\n/m).exec(aContainerStr)[1];
        var boundary = (/^ boundary="(.+)"/m).exec(cType)[1];
        // escape characters of the boundary that have a special meaning in a RegEx
        var escBoundary = boundary.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
        var result = {};
        result["contentType"] = cType;
        result["boundary"] = boundary;
        result["isFragment"] = false;
        // next fetch contents (everything after the first use of boundary,
        // including the closing boundary but without its two trailing dashes)
        var contentRegEx =
            new RegExp("--" + escBoundary + "\r\n([\\s\\S]+--" + escBoundary + ")--");
        var containerContents = contentRegEx.exec(aContainerStr)[1];
        // RegEx below determines where the next part ends
        var boundaryRegEx = new RegExp("^([\\s\\S]+?)--" + escBoundary, "m");
        // the parts are collected in an array
        var contents = [];
        // as long as there are more parts
        while (boundaryRegEx.test(containerContents)) {
            // fetch next part, remove it from the remaining contents to be handled
            var nextPart = boundaryRegEx.exec(containerContents)[1];
            // + 4 is for the two dashes preceding the boundary value and \r\n
            containerContents =
                containerContents.substring(nextPart.length + boundary.length + 4);
            // if the next part is of type multipart, we have a container
            if ((/^Content-Type: multipart\/(.+);/i).test(nextPart))
                contents.push(parseMimeContainer(nextPart));
            else
                contents.push(parseMimeFragment(nextPart));
        }
        result["contents"] = contents;
        return result;
    }
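
Once a mail is parsed, the nested result can be walked recursively, for instance to collect all fragments of a certain type. The sketch below assumes the result shape used in this post (isFragment, contentType, and for containers a contents array of parts); the function name findFragments and the hand-written sample object are made up for illustration.

```javascript
// Hypothetical helper: collect all fragments whose content type
// starts with the given media type prefix, at any nesting depth
function findFragments(part, mediaType, found) {
    found = found || [];
    if (part.isFragment) {
        if (part.contentType.toLowerCase().indexOf(mediaType) === 0) {
            found.push(part);
        }
    } else {
        for (var i = 0; i < part.contents.length; i++) {
            findFragments(part.contents[i], mediaType, found);
        }
    }
    return found;
}

// hand-written stand-in for the parsed body of the sample mail
var sampleBody = {
    isFragment: false,
    contentType: "multipart/alternative;",
    contents: [
        { isFragment: true, contentType: "text/plain; charset=ISO-8859-15", contents: "Hello" },
        {
            isFragment: false,
            contentType: "multipart/related;",
            contents: [
                { isFragment: true, contentType: "text/html; charset=ISO-8859-15", contents: "<html></html>" },
                { isFragment: true, contentType: "image/jpeg;", contents: "/9j/" }
            ]
        }
    ]
};

console.log(findFragments(sampleBody, "text/").length);  // 2
console.log(findFragments(sampleBody, "image/").length); // 1
```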

Final remarks

These four snippets show the essential building blocks required to parse MIME mails. If the code doesn't work in some cases, check whether there is one CRLF too many or too few. Another issue may be nonconforming emails that don't use \r\n as the newline sequence. I guess most deviations will be due to whitespace, though, so an editor that can display whitespace characters is a good tool for spotting them (for example Notepad++ on Windows or SciTE on Linux).

However, I only considered emails from Thunderbird while developing the parser. This matters because TB always writes the Content-Type header immediately in front of the body, and my parser exploits this (Outlook, for example, behaves differently).
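
A split that does not depend on the position of the Content-Type header could use the empty line that, per RFC 5322, separates the header section from the body. The following is only a sketch of that alternative, not what the parser in this post does; the function name splitHeadersAndBody is an assumption. Note that with this split the top-level Content-Type ends up in the header section, so the container and fragment functions above would need the content type passed in rather than reading it from the body string.

```javascript
// Hypothetical alternative: split header section and body at the first empty line
function splitHeadersAndBody(mail) {
    var separatorPos = mail.indexOf("\r\n\r\n");
    if (separatorPos === -1) {
        // no empty line: the whole mail consists of headers
        return { headerLines: mail, body: "" };
    }
    return {
        headerLines: mail.substring(0, separatorPos + 2), // keep the final \r\n
        body: mail.substring(separatorPos + 4)
    };
}

// made-up minimal mail
var splitMail = splitHeadersAndBody("Subject: Hi\r\nMIME-Version: 1.0\r\n\r\nHello\r\n");
console.log(JSON.stringify(splitMail.headerLines)); // "Subject: Hi\r\nMIME-Version: 1.0\r\n"
console.log(JSON.stringify(splitMail.body));        // "Hello\r\n"
```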

I also created a demo from the above code which you can find here. It parses the header fields and the body structure from a given email’s source.

Comments (6)

Neha, September 9th, 2011 at 14:45

Hi,

I am also looking for something that parses emails and gives me the body, subject and date in Java. Are there any parsers available?

jonas, September 10th, 2011 at 14:08

I don't know whether there is something similar available in Java, but I guess if you have the raw email string at hand and know how to use RegEx in Java, you could easily write your own parser based on my JavaScript version. Otherwise you need to familiarize yourself with RegEx first.

DIego, September 15th, 2014 at 18:53

How can you put that email into a javascript variable when it contains both double quotes and single quotes??

jonas, September 15th, 2014 at 19:13

Well, that depends on the source from which you load it into the variable. I guess you want to paste an email literally into your code for playing around, as in
var mail = "[the mail text]"
In this case you need to escape either the single or the double quotes in your string literal with a backslash: var escaped_quote = "This \" is escaped.".

Greg, September 25th, 2014 at 22:53

It seems like there might be some issues parsing the header field "Received-SPF", for example:

Received-SPF: pass (google.com: domain of bounce-md_XXXXXXXX@google.com.com designates 198.X.XXX.X as permitted sender) client-ip=198.X.XXX.X;

This returns the header field "Received-SPF: pass (google.com" instead of "Received-SPF".

It seems like it might have some problems with multiple ":" in the header lines.

Another example is "X-Report-Abuse: You can also report abuse here: http://google.com". It returns the field as "X-Report-Abuse: You can also report abuse here".

jonas, September 27th, 2014 at 11:18

You are right, the header name group in the regex needs to be lazy:
/^(.+?): ((.|\r\n\s)+)\r\n/mg.
