Parse emails with RegEx
A few days ago I wrote a parser for MIME mails. MIME extends the structure of simple emails (RFC 2822) to allow for non-ASCII / binary attachments and for shipping alternative versions of the content (e.g. HTML for rich clients and plaintext for text clients).
The parser is written in JavaScript, so all code examples are JS, but the regular expressions can of course be adapted to other languages.
The target
MIME mails are characterized by the version header, which is added to the other headers. It is always 'MIME-Version: 1.0', as there is no other version. The larger changes apply to the body structure. Where simple mails start directly with the actual message, MIME mails declare additional headers to describe the body contents (content type and encoding) and, in the case of multipart messages, also its structure.
The code box below shows a multipart email which comes with two alternative variants of the content. The HTML subpart is itself a multipart container which carries the HTML for the message and an image referenced by it. If you want to examine the mail yourself, make sure that each line ends with \r\n (CRLF) and that no whitespace is accidentally added.
- Return-Path: <test@example.com>
- X-Original-To: test@example.com
- Delivered-To: test@example.com
- Received: from [127.0.0.1] (127-0-0-1-dynip.superkabel.de [127.0.0.1])
- by example (Postfix) with ESMTPSA id FFFFFFFFFF
- for <test@example.com>; Thu, 23 Jun 2011 11:26:44 +0200 (CEST)
- Message-ID: <FFFFFFFF.5555555@example.com>
- Date: Thu, 23 Jun 2011 11:26:29 +0200
- From: Example Sender <test@example.com>
- User-Agent: Mozilla/5.0 Gecko/20110616 Thunderbird/3.1.11
- MIME-Version: 1.0
- To: Example Receiver <test@example.com>
- Subject: HTML Mail with embedded picture
- Content-Type: multipart/alternative;
- boundary="------------060102080402030702040100"
- This is a multi-part message in MIME format.
- --------------060102080402030702040100
- Content-Type: text/plain; charset=ISO-8859-15; format=flowed
- Content-Transfer-Encoding: 7bit
- Hello,
- this is an HTML mail, it has *bold*, /italic /and _underlined_ text.
- And then we have a table here:
- Cell(1,1)
- Cell(2,1)
- Cell(1,2) Cell(2,2)
- And we put a picture here:
- Image Alt Text
- That's it.
- --------------060102080402030702040100
- Content-Type: multipart/related;
- boundary="------------030904080004010009060206"
- --------------030904080004010009060206
- Content-Type: text/html; charset=ISO-8859-15
- Content-Transfer-Encoding: 7bit
- <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
- <html>
- <head>
- <meta http-equiv="content-type" content="text/html;
- charset=ISO-8859-15">
- </head>
- <body bgcolor="#ffffff" text="#000000">
- Hello,<br>
- <br>
- this is an HTML mail, it has <b>bold</b>, <i>italic </i>and <u>underlined</u>
- text.<br>
- And then we have a table here:<br>
- <table border="1" cellpadding="2" cellspacing="2" height="62"
- width="401">
- <tbody>
- <tr>
- <td valign="top">Cell(1,1)<br>
- </td>
- <td valign="top">Cell(2,1)</td>
- </tr>
- <tr>
- <td valign="top">Cell(1,2)</td>
- <td valign="top">Cell(2,2)</td>
- </tr>
- </tbody>
- </table>
- <br>
- And we put a picture here:<br>
- <br>
- <img alt="Image Alt Text"
- src="cid:part1.FFFFFFFF.5555555@example.com" height="79"
- width="98"><br>
- <br>
- That's it.<br>
- <br>
- </body>
- </html>
- --------------030904080004010009060206
- Content-Type: image/jpeg;
- name="picture.jpg"
- Content-Transfer-Encoding: base64
- Content-ID: <part1.FFFFFFFF.5555555@example.com>
- Content-Disposition: inline;
- filename="picture.jpg"
- /9j/4AAQSkZJRgABAQEAYABgAAD/2wBDAAIBAQIBAQICAgICAgICAwUDAwMDAwYEBAMFBwYH
- BwcGBwcICQsJCAgKCAcHCg0KCgsMDAwMBwkODw0MDgsMDAz/2wBDAQICAgMDAwYDAwYMCAcI
- DAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAz/wAAR
- CABPAGIDASIAAhEBAxEB/8QAHwAAAQUBAQEBAQEAAAAAAAAAAAECAwQFBgcICQoL/8QAtRAA
- AgEDAwIEAwUFBAQAAAF9AQIDAAQRBRIhMUEGE1FhByJxFDKBkaEII0KxwRVS0fAkM2JyggkK
- FhcYGRolJicoKSo0NTY3ODk6Q0RFRkdISUpTVFVWV1hZWmNkZWZnaGlqc3R1dnd4eXqDhIWG
- h4iJipKTlJWWl5iZmqKjpKWmp6ipqrKztLW2t7i5usLDxMXGx8jJytLT1NXW19jZ2uHi4+Tl
- 5ufo6erx8vP09fb3+Pn6/8QAHwEAAwEBAQEBAQEBAQAAAAAAAAECAwQFBgcICQoL/8QAtREA
- AgECBAQDBAcFBAQAAQJ3AAECAxEEBSExBhJBUQdhcRMiMoEIFEKRobHBCSMzUvAVYnLRChYk
- NOEl8RcYGRomJygpKjU2Nzg5OkNERUZHSElKU1RVVldYWVpjZGVmZ2hpanN0dXZ3eHl6goOE
- hYaHiImKkpOUlZaXmJmaoqOkpaanqKmqsrO0tba3uLm6wsPExcbHyMnK0tPU1dbX2Nna4uPk
- 5ebn6Onq8vP09fb3+Pn6/9oADAMBAAIRAxEAPwD5rooor8DP9oAooooAK5H4/f8AJCPG3/YB
- vv8A0nkrrq5H4/f8kI8bf9gG+/8ASeSujCfx4eq/M8XiT/kUYr/r3P8A9JZ5v/wRB/5qd/3C
- v/b2vvavgn/giD/zU7/uFf8At7X3tX03EP8AyMKny/8ASUf83nij/wAlPif+3P8A03AKKKK8
- U+ACiiigAooooA+f6KKK8s/6nAooooAK5H4/f8kI8bf9gG+/9J5K66uR+P3/ACQjxt/2Ab7/
- ANJ5K6MJ/Hh6r8zxeJP+RRiv+vc//SWeb/8ABEH/AJqd/wBwr/29r72r4J/4Ig/81O/7hX/t
- 7X3tX03EP/IwqfL/ANJR/wA3nij/AMlPif8Atz/03AKKKK8U+ACiiigAooooA/Pb/h5j4E/6
- BPi7/wABbf8A+P0f8PMfAn/QJ8Xf+Atv/wDH6/av/iHQ/Y2/6I7/AOXZrn/yZR/xDofsbf8A
- RHf/AC7Nc/8Akyvwj/iaDw5/6BMZ/wCA0v8A5ef7S/8AEbvEL/n9Q/8AAP8AgH4qf8PMfAn/
- AECfF3/gLb//AB+j/h5j4E/6BPi7/wABbf8A+P1+1f8AxDofsbf9Ed/8uzXP/kyj/iHQ/Y2/
- 6I7/AOXZrn/yZR/xNB4c/wDQJjP/AAGl/wDLw/4jd4hf8/qH/gH/AAD8VP8Ah5j4E/6BPi7/
- AMBbf/4/WH8T/wDgoX4L8a/DXxDo1rpniiO61bTLmyhaW2gEavJEyKWImJAywzgE47Gv3H/4
- h0P2Nv8Aojv/AJdmuf8AyZXlv7cf/BBb9k/4PfsU/GDxd4c+FP8AZ3iHwt4I1rV9Lu/+Em1i
- b7LdW9hNLDJsku2Rtrop2spU4wQRxXdlf0lvD7E4yjhqOFxanOUYq8aVrtpK/wC+el99GcmP
- 8Z+Pq2GqUa1Wjyyi07Q1s1Z9Ox+Lv/BOj9s7wv8Asif8Jj/wklhr99/wkP2L7N/ZkEMuzyft
- G7f5kqYz5q4xnoenGfpn/h858L/+gD4+/wDAK0/+Sa8D/wCCWf7Mfgf9o7/hOv8AhM9E/tn+
- xv7P+x/6ZcW/k+b9p3/6p0zny065xjjGTX1v/wAOxvgd/wBCR/5WNQ/+P1/f2WeCmKz7DRza
- nOCVS+8pJ+63HZQa+z3P8xPEPifgrCcQYjD5vQryrrk5nDk5XeEWrXmn8LV9Frc4T/h858L/
- APoA+Pv/AACtP/kmj/h858L/APoA+Pv/AACtP/kmu7/4djfA7/oSP/KxqH/x+j/h2N8Dv+hI
- /wDKxqH/AMfr0P8AiXHG/wDPyn/4HP8A+Vnxf+uvhx/0DYr/AMk/+WHCf8PnPhf/ANAHx9/4
- BWn/AMk0f8PnPhf/ANAHx9/4BWn/AMk13f8Aw7G+B3/Qkf8AlY1D/wCP0f8ADsb4Hf8AQkf+
- VjUP/j9H/EuON/5+U/8AwOf/AMrD/XXw4/6BsV/5J/8ALDhP+Hznwv8A+gD4+/8AAK0/+SaK
- 7v8A4djfA7/oSP8Aysah/wDH6KP+Jccb/wA/Kf8A4HP/AOVh/rr4cf8AQNiv/JP/AJYfv1RR
- RX/Osf6chRRRQAV4l/wUv/5RxftAf9k28Rf+mu5r22vEv+Cl/wDyji/aA/7Jt4i/9NdzXv8A
- Cn/I7wf/AF9p/wDpaMcT/Bn6P8j+fj/giL/zU7/uFf8At7X3pXwX/wAERf8Amp3/AHCv/b2v
- vSv+ofwv/wCSZw3/AG//AOnJn+THjh/yW2N/7h/+maYUUUV98flAUUUUAFFFFAH7AUUUV/yN
- n+4gUUUUAFeJf8FL/wDlHF+0B/2TbxF/6a7mvba8S/4KX/8AKOL9oD/sm3iL/wBNdzXv8Kf8
- jvB/9faf/paMcT/Bn6P8j+fj/giL/wA1O/7hX/t7X3pXwX/wRF/5qd/3Cv8A29r70r/qH8L/
- APkmcN/2/wD+nJn+THjh/wAltjf+4f8A6ZphRRRX3x+UBRRRQAUUUUAf/9k=
- --------------030904080004010009060206--
- --------------060102080402030702040100--
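When a copy of the mail above passes through an editor or the clipboard, its line endings may get converted to a bare \n. A minimal sketch for restoring CRLF before parsing could look like this; the function name normalizeLineEndings is chosen for illustration only and is not part of the parser.

```javascript
// Hypothetical helper: convert every line ending (CRLF, lone CR, lone LF) to CRLF.
function normalizeLineEndings(text) {
  // \r\n is listed first so that an existing CRLF is not doubled
  return text.replace(/\r\n|\r|\n/g, "\r\n");
}

// made-up input with mixed line endings
var pasted = "MIME-Version: 1.0\nSubject: Test\r\n\nHello";
var normalized = normalizeLineEndings(pasted);
```

Running the normalization on a mail that already uses CRLF throughout leaves it unchanged, so it should be safe to apply unconditionally.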
Parsing the mail
First I determine where the body part of the email starts. Then I take everything before this and fetch the headers from there. The following code shows what to do:
- // 'mail' is a variable holding the email string
- // RegEx matching first occurrence of 'Content-Type: ' - start of mail body
- var eohRegEx = /^Content-Type: /im;
- var bodyPos = mail.search(eohRegEx); // holds pos of C in 'Content-Type: '
- var headerLines = mail.substring(0, bodyPos); // ends with \r\n
- // now determine the present headers and store them in an object
- var headers = new Object();
- var headerRegExp = /^(.+?): ((.|\r\n\s)+)\r\n/mg;
- var h;
- while (h = headerRegExp.exec(headerLines))
- headers[h[1]] = h[2];
- // on a website you now can annoy the visitor with all headers and their values
- for(var field in headers)
- alert(field + ": " + headers[field]);
The header RegEx
The RegEx to match single headers is /^(.+?): ((.|\r\n\s)+)\r\n/mg. The first group lazily captures every character up to the first colon followed by a space, which gives the header's name. The second group, for the value, captures every character until a CRLF occurs. There is one exception, which requires the \r\n\s alternative in the group: long values may be split over several lines, which is indicated by a leading whitespace character on the next line.
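To illustrate the folding rule, the following sketch runs the header RegEx against the folded Content-Type header from the sample mail. The unfolding step (replacing each CRLF plus the following whitespace with a single space) is an addition for illustration and an assumption about how one might want to post-process the value; the parser above stores the value as matched.

```javascript
var sampleHeaderLines =
  "Subject: HTML Mail with embedded picture\r\n" +
  "Content-Type: multipart/alternative;\r\n" +
  " boundary=\"------------060102080402030702040100\"\r\n";

var foldRegExp = /^(.+?): ((.|\r\n\s)+)\r\n/mg;
var unfolded = {};
var m;
while ((m = foldRegExp.exec(sampleHeaderLines)) !== null) {
  // a folded value still contains the CRLF and the leading whitespace,
  // so collapse each fold into a single space
  unfolded[m[1]] = m[2].replace(/\r\n\s+/g, " ");
}
```

After the loop, unfolded holds two entries: the Subject value as is, and the Content-Type value with the boundary parameter joined onto the same line.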
Decomposing the body
Having the body's starting position (in bodyPos), I fetch its content and look at the first Content-Type header. If it contains 'multipart/foo', the body consists of a MIME container, typically with two or more inner MIME fragments. Otherwise there is only one fragment, which is either plaintext or HTML without embedded images.
- // 'bodyPos' and 'mail' from above
- var bodyLines = mail.substring(bodyPos);
- var contentType = (/^Content-Type: (.+)$/im).exec(bodyLines)[1];
- // check whether body is multipart (grouping not further used here)
- var mpRegEx = /^multipart\/(.+);/i;
- // if multipart then the body is a container otherwise a fragment
- var parsedBody;
- if (mpRegEx.test(contentType))
- parsedBody = parseMimeContainer(bodyLines);
- else
- parsedBody = parseMimeFragment(bodyLines);
Parsing MIME fragments
Fragments always have Content-Type and Content-Transfer-Encoding headers; in addition, Content-ID (for referencing images in HTML) and Content-Disposition may be present. All lines following the MIME headers form the actual content of a fragment. The content type header defines what kind of data is at hand (e.g. text/plain or image/jpeg), and the transfer encoding specifies the given representation (binary data is usually base64 encoded).
- function parseMimeFragment(fragmentStr){
- var result = new Object();
- // each fragment has a content type and encoding
- result["contentType"] = (/^Content-Type: (.+)$/m).exec(fragmentStr)[1];
- result["encoding"] =
- (/^Content-Transfer-Encoding: (.+)$/m).exec(fragmentStr)[1];
- // not all fragments have a disposition or an id
- if ((/^Content-Disposition: ((.|\r\n )+)\r\n/mg).test(fragmentStr)){
- result["contentDisposition"] =
- (/^Content-Disposition: ((.|\r\n )+)\r\n/mg).exec(fragmentStr)[1];
- }
- if ((/^Content-ID: ((.|\r\n\t)+)\r\n/mg).test(fragmentStr)){
- result["contentId"] =
- (/^Content-ID: ((.|\r\n\t)+)\r\n/mg).exec(fragmentStr)[1];
- }
- // in my case there were 2 \r\n between MIME headers and content,
- // try only one if there are problems with this code
- result["contents"] = (/^.*\r\n\r\n([\s\S]+)\r\n/m).exec(fragmentStr)[1];
- result["isFragment"] = true;
- return result;
- }
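The contents field of a parsed fragment still holds the encoded representation. As a rough sketch of a possible follow-up step, base64 contents could be decoded as shown below. The helper name decodeFragmentContents is hypothetical, the code assumes an environment that offers either atob (browsers) or Buffer (Node.js), and all other encodings are simply passed through unchanged.

```javascript
// Hypothetical helper, not part of the parser above.
function decodeFragmentContents(fragment) {
  var encoding = fragment["encoding"].trim().toLowerCase();
  if (encoding !== "base64") {
    // 7bit, 8bit, quoted-printable etc. are returned as they are
    return fragment["contents"];
  }
  // base64 bodies are wrapped into lines, so drop all whitespace first
  var compact = fragment["contents"].replace(/\s+/g, "");
  if (typeof atob === "function") {
    return atob(compact); // "binary string": one character per byte
  }
  return Buffer.from(compact, "base64").toString("binary");
}

// hand-written fragment in the shape returned by parseMimeFragment
var exampleFragment = {
  contentType: "text/plain",
  encoding: "base64",
  contents: "SGVsbG8s\r\nIFdvcmxkIQ==",
  isFragment: true
};
var decoded = decodeFragmentContents(exampleFragment);
```

For an image fragment the returned binary string would still have to be turned into whatever the application needs, for example a data URI for the src attribute of an img element.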
Parsing MIME containers
I treat all MIME parts of type multipart as containers because they are composed of MIME fragments or inner containers which are separated by a boundary value. Because containers can be nested, the resulting structure can have arbitrary depth. To parse a container I sequentially check each part in it for its type and either call the above function to handle a fragment or make a recursive call to the following function.
- function parseMimeContainer(aContainerStr){
- // each container has a content type and boundary
- var cType = (/^Content-Type: ((.|\r\n )+)\r\n/m).exec(aContainerStr)[1];
- var boundary = (/^ boundary="(.+)"/m).exec(cType)[1];
- var result = new Object();
- result["contentType"] = cType;
- result["boundary"] = boundary;
- result["isFragment"] = false;
- // next fetch contents (everything after the first use of boundary)
- var contentRegEx =
- new RegExp("--" + boundary + "\r\n(((.|\\s)+)--" + boundary + ")--","img");
- var containerContents = contentRegEx.exec(aContainerStr)[1];
- // RegEx below determines where the next part ends
- var boundaryRegEx = new RegExp("^([\\s\\S]+?)--" + boundary, "m");
- var contents = []; // must be an array because parts are appended with push()
- // as long as there are more parts
- while(boundaryRegEx.test(containerContents)){
- // fetch next part, remove it from the remaining contents to be handled
- var nextPart = boundaryRegEx.exec(containerContents)[1];
- // + 4 is for the two dashes preceding the boundary value and \r\n
- containerContents =
- containerContents.substring(nextPart.length + boundary.length + 4);
- // if the next part is of type multipart, we have a container
- if ((/^Content-Type: multipart\/(.+);/i).test(nextPart))
- contents.push(parseMimeContainer(nextPart));
- else
- contents.push(parseMimeFragment(nextPart));
- }
- result["contents"] = contents;
- return result;
- }
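The result is a tree in which every node carries the isFragment flag. As an illustration of how such a structure could be consumed, the sketch below collects the content types of all fragments in document order. The function name collectFragmentTypes and the hand-written sample object are made up for this example; the sample only mimics the shape produced by the two parse functions and assumes that the contents of a container is an array of parts, as the push() calls imply.

```javascript
// Hypothetical helper: depth-first walk over a parsed MIME structure.
function collectFragmentTypes(part, found) {
  found = found || [];
  if (part["isFragment"]) {
    found.push(part["contentType"]);
  } else {
    // containers hold their parts in an array, so recurse into each of them
    for (var i = 0; i < part["contents"].length; i++) {
      collectFragmentTypes(part["contents"][i], found);
    }
  }
  return found;
}

// hand-written object mimicking the structure of the sample mail
var sampleTree = {
  isFragment: false,
  contentType: "multipart/alternative",
  contents: [
    { isFragment: true, contentType: "text/plain; charset=ISO-8859-15; format=flowed" },
    {
      isFragment: false,
      contentType: "multipart/related",
      contents: [
        { isFragment: true, contentType: "text/html; charset=ISO-8859-15" },
        { isFragment: true, contentType: "image/jpeg;" }
      ]
    }
  ]
};
var fragmentTypes = collectFragmentTypes(sampleTree);
```

For the sample structure this yields three entries: the plaintext part, the HTML part and the image.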
Final remarks
These four snippets show the essential building blocks required to parse MIME mails. If it doesn't work in some cases, check whether there is an additional CRLF or one too few. Another issue may be nonconforming emails which don't use \r\n as the newline sequence. Most deviations will likely be due to whitespace, so an editor which can display whitespace characters is a good tool to identify these spots (Notepad++ on Windows or SciTE on Linux).
Note that I only considered emails from Thunderbird during development of the parser. This is important because TB always writes the Content-Type header immediately in front of the body, and my parser exploits this (Outlook, for example, behaves differently).
I also created a demo from the above code which you can find here. It parses the header fields and the body structure from a given email’s source.
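Regarding the Thunderbird-specific assumption mentioned above: the mail format itself separates the header section from the body by an empty line, i.e. the first occurrence of \r\n\r\n. A sketch of a splitting step that does not depend on the position of the Content-Type header could look as follows. The function name splitHeadersAndBody is hypothetical, and this is an alternative to the approach described in this post, not what the parser above does.

```javascript
// Hypothetical helper: split a raw mail at the first empty line.
function splitHeadersAndBody(mail) {
  var pos = mail.indexOf("\r\n\r\n");
  if (pos === -1) {
    // no empty line found: treat everything as headers
    return { headerLines: mail, body: "" };
  }
  return {
    headerLines: mail.substring(0, pos + 2), // keep the CRLF ending the last header
    body: mail.substring(pos + 4)            // skip the empty line
  };
}

// made-up mail in which Content-Type is not the last header
var split = splitHeadersAndBody(
  "Content-Type: text/plain\r\nSubject: Hi\r\n\r\nHello,\r\nthis is the body.\r\n"
);
```

With this kind of split the top-level Content-Type ends up among the parsed headers, so the body functions above would have to receive it from there instead of searching for it at the start of the body string.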
Hi,
I am also looking for something which parses emails and gives me the body, subject and date in Java. Are there any parsers available?
I don't know whether there is something similar available in Java, but I guess if you have the raw email string at hand and know how to use RegEx in Java, you could easily write your own parser based on my JavaScript version. Otherwise you need to familiarize yourself with RegEx first.
How can you put that email into a javascript variable when it contains both double quotes and single quotes??
Well that depends on the source from which you load it into the variable. I guess, you want to literally paste an email into your code for playing around as in
var mail = "[the mail text]"
In this case you need to escape either the single or the double quotes in your string literal with a backslash:
var escaped_quote = "This \" is escaped.";
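If you paste a mail this way, writing it as an array of lines joined with \r\n keeps the literal readable and guarantees CRLF line endings; only the quote character used as the string delimiter needs escaping. A small sketch (the variable name and the sample lines are made up for illustration):

```javascript
var pastedMail = [
  "From: Example Sender <test@example.com>",
  "Subject: It's a \"quoted\" subject", // only the double quotes need a backslash
  "Content-Type: text/plain; charset=ISO-8859-15",
  "Content-Transfer-Encoding: 7bit",
  "", // the empty line between headers and content
  "Hello, that's it."
].join("\r\n") + "\r\n";
```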
It seems like there might be some issues parsing the header field "Received-SPF", for example:
Received-SPF: pass (google.com: domain of bounce-md_XXXXXXXX@google.com.com designates 198.X.XXX.X as permitted sender) client-ip=198.X.XXX.X;
It returns the header field "Received-SPF: pass (google.com" instead of "Received-SPF".
It seems like it might have some problems with multiple ":" in the header lines.
Another example is "X-Report-Abuse: You can also report abuse here: http://google.com". It returns the field as "X-Report-Abuse: You can also report abuse here".
You are right, the header name group in the regex needs to be lazy:
/^(.+?): ((.|\r\n\s)+)\r\n/mg
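The difference can be checked with the X-Report-Abuse line from the comment. The sketch below applies both variants of the RegEx to that single line; the variable names are only for illustration.

```javascript
var abuseLine =
  "X-Report-Abuse: You can also report abuse here: http://google.com\r\n";

// greedy name group: backtracks from the end of the line to the LAST ": "
var greedyMatch = /^(.+): ((.|\r\n\s)+)\r\n/m.exec(abuseLine);
// lazy name group: stops at the FIRST ": "
var lazyMatch = /^(.+?): ((.|\r\n\s)+)\r\n/m.exec(abuseLine);

var greedyName = greedyMatch[1];
var lazyName = lazyMatch[1];
var lazyValue = lazyMatch[2];
```

With the greedy group the name comes out as "X-Report-Abuse: You can also report abuse here", which is the behaviour reported above, while the lazy group yields "X-Report-Abuse" and leaves the rest of the line to the value.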