rfc2822 for mere mortals
DESCRIPTION
This is a presentation I did years ago, but I heard that there are still people using it as a reference. So here it is, slightly cleaned up. If you are writing systems that process email addresses in some form or another, you might want to read this.
TRANSCRIPT
What's in an Email Address?
RFC2822 Em@il @ddresses for Mere Mortals
Schalk W. Cronjé, @ysb33r
Why This Topic?
● Recurring bugs in software we build
● Lack of understanding at all levels
– Developers
– Testers
– Support People
● Assumptions made, without reading RFCs
● Understanding RFCs is not straightforward
– RTFM is difficult when TFM cannot be found
● We require a basic reference
Content
● Overview
● Local-part
● Domain-part
● Valid or not?
● The real world
Brave, brave RFC World
● RFC821, RFC2821: Restrictions on email addresses at protocol levels.
● RFC822, RFC2822: Specifies layout of email transmitted over internet. Specifies format of email address.
● RFC1034, RFC1035: Domain name specification.
● RFC1123: Requirements for internet hosts (partially updated by RFC2821)
● RFC2047: Encoding of 8-bit in RFC2822 header fields
● RFC3490: Encoding international domain names
Address Format
Modern format: local-part @ domain-part
Historic format (RFC821/RFC2821): source-route : local-part @ domain-part
RFC2822 Local Parts
● Unrestricted characters
0..9 a..z A..Z ! # $ % & ' * + - / = ? ^ _ ` | { } ~ .
● Quotable characters (quoted by " or \)
< [ ( : @ ; ) ] > , non-ws-ctrl
● Illegal characters
All 8-bit.
● Whitespace
– ws-ctrl illegal, only used for folding in headers
– space character is valid if quoted
[ RFC2821: 4.1.2; RFC2822: 3.2, 3.4 ]
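The character classes above can be sketched as a small classifier. This is a minimal sketch, not a validator: the function name `classify` is hypothetical, and treating the space, the double quote and the backslash as quotable is an assumption drawn from the quoting rules rather than from the list itself. The dot counts as unrestricted here, but its position is still constrained (no leading, trailing or consecutive dots in an unquoted local-part).

```python
import string

# Characters that may appear unquoted in a local-part.
UNRESTRICTED = set(string.ascii_letters + string.digits + "!#$%&'*+-/=?^_`|{}~.")
# Whitespace controls: only legal for header folding, never inside an address.
WS_CTRL = {"\t", "\n", "\r"}


def classify(ch: str) -> str:
    """Classify a single character as 'unrestricted', 'quotable' or 'illegal'."""
    code = ord(ch)
    if code > 127 or code == 0 or ch in WS_CTRL:
        return "illegal"
    if ch in UNRESTRICTED:
        return "unrestricted"
    # Everything left is 7-bit: < [ ( : @ ; ) ] > , the space, the two
    # quoting characters (assumption) and the non-whitespace controls.
    return "quotable"


def classes_used(local_part: str) -> set:
    """Return the set of character classes that occur in a local-part."""
    return {classify(ch) for ch in local_part}
```

A local-part whose `classes_used` result is exactly `{"unrestricted"}` is a candidate for the unquoted form; anything containing `"illegal"` cannot be represented at RFC2822 level at all.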
Local Payload
● Routing characters
– ! % have been used for local-routing in legacy systems, including UUCP and MHS.
– Can be used to bypass routing in mis-configured systems.
● Shell exploits
– | / ` $ have been used to attempt remote command execution
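One way to defuse such payloads, sketched here under the assumption that an address really must reach a POSIX shell, is to quote it with Python's standard `shlex.quote` first. The helper name `shell_safe` and the sample address are illustrative only; passing the address as a separate argument to a program started without a shell avoids the problem altogether.

```python
import shlex


def shell_safe(address: str) -> str:
    """Quote an address so a POSIX shell treats it as one literal word."""
    return shlex.quote(address)


# Hypothetical hostile local-part: harmless as data, dangerous if a shell
# ever evaluates the backticks.
address = '"`id`"@ntaba.biz'
command = "echo " + shell_safe(address)

# shlex.split mimics POSIX word splitting: the address survives as one token.
assert shlex.split(command) == ["echo", address]
```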
Does Case Matter?
● Case is ignored in domain
NTABA.BIZ == ntaba.biz
● Strictly-speaking case matters in local-parts
[email protected] != [email protected]
– Most MTAs ignore case
– RFC2821 discourages use of case as a distinguishing factor
● Case ignored in source-routes
[ RFC2821: 2.4 ]
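The case rules can be sketched as a comparison key that always folds the domain-part and folds the local-part only on request. The function name `comparison_key` and the `fold_local_part` switch are hypothetical; the switch exists because most MTAs ignore case in local-parts even though the strict reading does not.

```python
def comparison_key(address: str, fold_local_part: bool = False) -> str:
    """Build a key for comparing two addresses.

    The domain-part is always lower-cased. The local-part is left alone
    unless the caller opts in to case-insensitive matching.
    """
    # The last "@" separates the parts; a domain-part never contains one.
    local_part, at, domain_part = address.rpartition("@")
    if not at:
        raise ValueError("address has no @ separator")
    if fold_local_part:
        local_part = local_part.lower()
    return local_part + "@" + domain_part.lower()
```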
Does Size Matter?
● RFC2821 places limitations on length of local-part and domain-part
– 64 characters for local-part
– 255 characters for domain-part
● This is normally not a problem for messages transmitted across the internet, but can be problematic for in-house applications or encoded email addresses such as X.400.
● Many MTAs will now ignore this length restriction as long as the overall SMTP protocol line length restriction is not exceeded.
[ RFC2821: 4.5.3.1 ]
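A minimal sketch of the length check follows, assuming the address has already been split into its two parts. The function name `length_problems` and the `enforce` switch are hypothetical; the switch reflects that many MTAs no longer apply these limits.

```python
MAX_LOCAL_PART = 64    # characters, per RFC2821 4.5.3.1
MAX_DOMAIN_PART = 255  # characters, per RFC2821 4.5.3.1


def length_problems(local_part: str, domain_part: str, enforce: bool = True) -> list:
    """Return a list of human-readable length violations (empty if none)."""
    if not enforce:
        return []
    problems = []
    if len(local_part) > MAX_LOCAL_PART:
        problems.append("local-part longer than %d characters" % MAX_LOCAL_PART)
    if len(domain_part) > MAX_DOMAIN_PART:
        problems.append("domain-part longer than %d characters" % MAX_DOMAIN_PART)
    return problems
```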
Domain Parts
● Can either be a RFC1035 domain or an address literal
● Valid characters for domain names:
a..z A..Z 0..9 -
● Subdomains separated by dot character.
● Subdomain may not start or end with dash.
● 255 characters max length.
● 63 characters max per subdomain.
● Cannot start or end in dot.
● Restriction of subdomain starting with digit has been relaxed.
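The rules above translate fairly directly into a check. This is a sketch with a hypothetical function name, `is_valid_domain_name`; it covers domain names only, not address literals, and it applies the relaxed rule that a subdomain may start with a digit.

```python
import re

# One subdomain: letters, digits and dashes, 1 to 63 characters,
# not starting or ending with a dash.
_SUBDOMAIN = re.compile(r"[A-Za-z0-9](?:[A-Za-z0-9-]{0,61}[A-Za-z0-9])?")


def is_valid_domain_name(domain: str) -> bool:
    """Check a domain-part against the domain-name rules listed above."""
    if not domain or len(domain) > 255:
        return False
    # A leading, trailing or doubled dot produces an empty subdomain,
    # which the pattern rejects.
    return all(_SUBDOMAIN.fullmatch(label) for label in domain.split("."))
```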
Address Literals
● Workarounds for when host names cannot be resolved.
– @[protocol:host-address]
– IPv4: @[192.1.1.1]
– IPv6: @[IPv6:fe80::a00:20ff:fec2:2ef4]
● Protocol must be registered with IANA.
[ RFC2821: 4.1.3 ]
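Recognising the two common literal forms can be sketched with Python's standard `ipaddress` module. The function name `parse_address_literal` is hypothetical, and the sketch deliberately ignores any other registered protocol tags.

```python
import ipaddress


def parse_address_literal(domain_part: str):
    """Return an IPv4Address or IPv6Address for a bracketed literal, else None."""
    if not (domain_part.startswith("[") and domain_part.endswith("]")):
        return None
    body = domain_part[1:-1]
    try:
        if body.lower().startswith("ipv6:"):
            return ipaddress.IPv6Address(body[len("ipv6:"):])
        return ipaddress.IPv4Address(body)
    except ValueError:
        # Not a well-formed address of the expected family.
        return None
```

A bare `192.168.1.1` without brackets is not a literal, so the function returns `None` for it.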
International Domain Names
● Domain names not representable in US-ASCII can be registered
● Such domain names cannot be handled by DNS or existing protocols
● RFC 3490 describes the encoding/decoding of such domain names from presentation to protocol:
exämple.com => xn--exmple-cua.com
● Potential for phishing
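Python ships an `idna` codec that implements RFC 3490, so the conversion between presentation and protocol form can be sketched in a few lines. The helper names `to_protocol_form` and `to_presentation_form` are hypothetical.

```python
def to_protocol_form(domain: str) -> str:
    """Encode a presentation-level domain name into its ASCII (xn--) form."""
    return domain.encode("idna").decode("ascii")


def to_presentation_form(domain: str) -> str:
    """Decode an ASCII (xn--) domain name back into its presentation form."""
    return domain.encode("ascii").decode("idna")
```

Comparing addresses on the protocol form rather than on what the user sees is one way to reduce the phishing risk from look-alike names.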
Valid or not?
● Valid even under strict RFC2822 interpretation
● Most punctuation is valid in the local part, including:
{$cha?k*cr%nje}@ntaba.biz
Valid or not?
● Yes, the domain part is an address-literal
● Acceptance of address-literals should be configurable
– They can be security risks
– RFC2821 prefers usage of MX-based deliveries.
schalk_cronje@[192.168.1.1]
Valid or not?
● No, it is not an address-literal nor a valid domain name.
● Some systems will attempt to deliver this by passing the 192.168.1.1 to the domain resolving subsystem, which in turn will simply return the IP address.
– This violates RFC1123
– This is a potential security risk.
[ RFC1123: 2.1 ]
Valid or not?
● Not valid according to RFC1035
● Limitation lifted in RFC1123.
[ RFC1123: 2.1 ]
Valid or not?
● Valid in RFC821 for compatibility with non-TCP/IP networks.
● Outlawed by RFC2821.
● Not supported by any modern MTA.
schalk_cronje@#192168
[ RFC821: 4.1.2; RFC2821: F.4 ]
Valid or not?
● No, strictly RFC2822 states that domain-part may not end with a dot.
● RFC1034 uses the dot-ending to indicate absolute domains (FQDN) in resource records.
● Most systems will accept, resolve and deliver this.
[ RFC2822: 3.2.4; RFC1034: 3.1]
Valid or not?
● No, consecutive dots are not allowed in domain parts.
[ RFC2822: 3.2.4; RFC1034: 3.1]
Valid or not?
● No.
– Local-parts may not start with a dot.
– Consecutive dots are not allowed in local parts.
● Pragmatically, many known MTAs don’t care
[email protected]@ntaba.biz
[ RFC2822: 3.2.4]
Valid or not?
● No, _ is not valid in domain names
● Some DNS servers will support this.
● Some sites do use the _ for internal systems.
● It remains illegal for internet operations
schalk_cronje@lon_eng.ntaba.biz
[ RFC2821: 4.1.3 ]
Valid or not?
● No, @ cannot be used unquoted in local parts
"schalk_cronje@lon_eng"@ntaba.biz
schalk_cronje\@[email protected]
schalk_cronje@[email protected]
[ RFC2822: 3.2.5, 3.4 ]
Local-part Quoting
● Quoting should only be used where absolutely necessary
● Where a quoted form has an unquoted form...
– The two forms are equivalent
– The unquoted form should be used for transmission
● Quoting is performed by enclosing local-part in quotes or preceding a character by backslash.
[ RFC2821: 4.1.2 ]
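The "quote only where necessary" rule can be sketched as follows. The function name `quote_local_part` is hypothetical, and escaping an embedded double quote with a backslash is the RFC2822 reading; some systems may not accept it. The sketch assumes a 7-bit local-part.

```python
import re

# Characters that may appear unquoted (RFC2822 atext); dots may only
# separate non-empty runs of these characters.
_ATEXT = r"A-Za-z0-9!#$%&'*+/=?^_`{|}~-"
_DOT_ATOM = re.compile(rf"[{_ATEXT}]+(?:\.[{_ATEXT}]+)*")


def quote_local_part(local_part: str) -> str:
    """Return the local-part unchanged if it is legal unquoted, else quote it."""
    if _DOT_ATOM.fullmatch(local_part):
        return local_part
    # Inside quotes, the backslash and the double quote must be escaped.
    escaped = local_part.replace("\\", "\\\\").replace('"', '\\"')
    return '"' + escaped + '"'
```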
Valid or not?
● No, this is an envelope for email addresses
● The following is valid:
"<schalk_cronje>"@ntaba.biz
Valid or not?
● This is debatable
● Neither RFC2821, nor RFC2822, is completely clear whether the double quote is valid if escaped
Note that the backslash, "\", is a quote character, which is used to indicate that the next character is to be used literally
"schalk_O\"cronje"@ntaba.biz
[ RFC2821: 4.1.2 ]
Valid or not?
● Not at RFC2821/RFC2822 levels - contains at least one 8-bit character
● Can be completely valid at the presentation level
– Email client can take care of translation between a user-readable form and a level suitable for transmission
● There is NO agreed standard for encoding non-US-ASCII in local parts
schalk_cronjé@ntaba.biz
My 8-bit's Worth
● Custom encoding is valid, when both the sender and receiver will know about the encoding
– Intermediate relays will simply pass it through
● UTF-7: [email protected]
● RFC2047 (adapted): [email protected]
● Storing email addresses with 8-bit content in XML is problematic – requires encoding.
The 8-bit Legacy
● RFC822 was written in a 7-bit world
– It can be misread as saying that 8-bit is legal.
● Some MTAs will actually transmit 8-bit characters in email addresses
● In-house systems might have a requirement for 8-bit
● An email system must be able to allow, block, quarantine or filter on 8-bit characters.
Valid or not?
● Valid even under strict RFC2822 interpretation
● Quoting allows for spaces and | to be used
● Imagine if this was passed to a shell script in a badly configured system!
"`echo haX0r | /usr/bin/passwd root --stdin`"@ntaba.biz
Valid or not?
● Valid even under strict RFC2822 interpretation
● Quoting allows for @ : , to be used
"@lon-eng,@scm-eng:schalk_cronje"@ntaba.biz
Valid or not?
● Valid even under strict RFC2822 interpretation
● This is an example of a source-route.
● Usage is deprecated
● It is best to remove them, before relaying.
@lon-eng,@scm-eng:[email protected]
[ RFC2821: 3.7, C, F.2 ]
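Removing a source-route before relaying can be sketched as below. The function name `strip_source_route` is hypothetical, and the pattern assumes every hop is a plain host name rather than an address literal. A quoted local-part that merely looks like a route starts with a double quote and is left untouched.

```python
import re

# One or more "@host" hops separated by commas, terminated by a colon.
# Assumption: hops are plain host names, not address literals.
_SOURCE_ROUTE = re.compile(r"@[A-Za-z0-9.-]+(?:,@[A-Za-z0-9.-]+)*:")


def strip_source_route(path: str) -> str:
    """Drop a leading source-route and return the remaining address."""
    match = _SOURCE_ROUTE.match(path)  # anchored at the start of the path
    return path[match.end():] if match else path
```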
Practical Validation
● Address validation cannot purely be performed against the RFC
● Context is very important
● Validation at user-level will differ from that at protocol-level.
RFC rule of thumb: Be as lenient as possible in what you accept, but as strict as possible in what you send out.
Validation Context
● Context places additional demands on validation algorithms
● Validation algorithms must be configurable
– Allows for specifics in user environments
– Allows for adaptability within various code subsystems
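Configurable validation could look something like the sketch below. The class `ValidationPolicy`, its field names and the function `policy_violations` are all hypothetical, and the checks cover only a few site-specific relaxations rather than full address syntax.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class ValidationPolicy:
    """Per-deployment switches; the defaults follow the strict reading."""
    allow_address_literals: bool = False
    allow_underscore_in_domain: bool = False
    allow_8bit_local_part: bool = False
    max_local_part_length: Optional[int] = 64  # None disables the check


def policy_violations(local_part: str, domain_part: str,
                      policy: ValidationPolicy) -> List[str]:
    """Return the policy rules that an already-split address breaks."""
    problems = []
    if domain_part.startswith("[") and not policy.allow_address_literals:
        problems.append("address literal not accepted")
    if "_" in domain_part and not policy.allow_underscore_in_domain:
        problems.append("underscore in domain-part")
    if any(ord(ch) > 127 for ch in local_part) and not policy.allow_8bit_local_part:
        problems.append("8-bit character in local-part")
    limit = policy.max_local_part_length
    if limit is not None and len(local_part) > limit:
        problems.append("local-part too long")
    return problems
```

An internet-facing gateway might keep the defaults, while an in-house subsystem could pass `ValidationPolicy(allow_underscore_in_domain=True)` to the same code.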
Pattern Matching
● DOS-patterns (*?) are useful, but not good enough
● Regex is a better way to perform complex pattern matches
– Not all users understand regex
– It is therefore good to give users the option of an input notation, but use regex internally to perform the matching
The *? Problem
schalk*[email protected]
● The above is a valid email address
● Was the intention to filter for this exact address?
● Or was the intention to filter for addresses such as [email protected]
● Regex:
– schalk\*[email protected]
– schalk.*[email protected]
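One way to resolve the ambiguity is to let users type DOS-style patterns, give them a backslash to mark a literal `*` or `?`, and compile the result to a regex internally. The function name `dos_pattern_to_regex` and the backslash convention are assumptions of this sketch, and the addresses in the usage lines are made up for illustration.

```python
import re


def dos_pattern_to_regex(pattern: str) -> "re.Pattern":
    """Compile a DOS-style pattern: * = any run, ? = any one character.

    A backslash makes the following character literal, so the pattern
    can still express an address that really contains * or ?.
    """
    parts = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\" and i + 1 < len(pattern):
            parts.append(re.escape(pattern[i + 1]))
            i += 2
            continue
        if ch == "*":
            parts.append(".*")
        elif ch == "?":
            parts.append(".")
        else:
            # Everything else, including the dot, is matched literally.
            parts.append(re.escape(ch))
        i += 1
    return re.compile("".join(parts))


wildcard = dos_pattern_to_regex("schalk*@ntaba.biz")
exact = dos_pattern_to_regex(r"schalk\*@ntaba.biz")
assert wildcard.fullmatch("schalk_cronje@ntaba.biz")
assert exact.fullmatch("schalk_cronje@ntaba.biz") is None
```

Using `fullmatch` keeps the pattern anchored at both ends, which is what a user typing a DOS-style pattern would normally expect.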
Lists of Addresses
● RFC2822 uses the comma for separating address lists in headers
● A common misconception is that it is easy to delimit addresses using ; or ,.
● Although it is possible, it is no trivial task to parse lists such as
[email protected], "s,c,h,a,l,k"@ntaba.biz ,s\,\\cha\,[email protected] , "sch\",alk"@ntaba.biz
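A splitter that survives such lists has to track quoting state rather than split blindly on the comma. The sketch below, with the hypothetical name `split_address_list`, handles double quotes and backslash escapes only; it does not attempt comments, display names, angle brackets or group syntax.

```python
def split_address_list(value: str) -> list:
    """Split a comma-separated address list, honouring quotes and backslashes."""
    addresses = []
    current = []
    in_quotes = False
    escaped = False
    for ch in value:
        if escaped:
            # The character after a backslash is always literal.
            current.append(ch)
            escaped = False
        elif ch == "\\":
            current.append(ch)
            escaped = True
        elif ch == '"':
            current.append(ch)
            in_quotes = not in_quotes
        elif ch == "," and not in_quotes:
            addresses.append("".join(current).strip())
            current = []
        else:
            current.append(ch)
    addresses.append("".join(current).strip())
    # Drop empty entries left by stray or trailing commas.
    return [address for address in addresses if address]
```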
Real World Violations
● Use of _ in domain-part
● Domain part starts with dot
● Domain part ends in dot
● 4000 characters in local part
● 8-bit characters in local-part
What can we do?
● Developers should never make any assumptions as to what the customer might need or to what the customer's infrastructure might be
– Code to be as RFC-compliant as possible, but allow for configurability as and when needed.
– User interfaces should be context-sensitive.
● Testers should ensure that nobody makes such assumptions
Questions ?
Handling email addresses is an extraordinarily complex matter for something that looks very simple.
Next time you enter an email address...
...you might not want to take it for granted