Internationalizing Big Backend Web & Mail Applications
Adrian D. Havill
original edition by Red Hat, Inc.
Copyright ©2001 Red Hat, Inc. Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.1 or any later version published by the Free Software Foundation; with the Invariant Sections being just "UTF-8 for ASCII Hackers", with one Front-Cover Text: "original edition by Red Hat, Inc." and with one Back-Cover Text: "Additional documentation and support for iMIME can be obtained from Red Hat, Inc.".
iMIME is a C library (with some other language interfaces) that is used to input and query mail messages and HTTP submissions that have non-ASCII and non-text input.
The iconv library is used, keeping this code small and free of tables.
iMIME was written to correctly parse messages using the syntax defined in the most popular MIME- and HTTP-related RFCs. Special emphasis was placed on decoding I18N-encoded messages painlessly and supporting as many I18N-related standards and specifications as possible.
While not yet standard, many message submitting web clients have extensions related to I18N that iMIME supports.
iMIME isn't yet designed to do everything. In particular, the following known limitations exist.
The code is ©2001 by Red Hat, Inc. You may use the code, binary and source, in accordance with the GPL. Like most free software, the library has no warranty or support. Then again, there's a lot of non-free software out there that refuses to provide support or a warranty! Red Hat support and service products are available for open source software.
This documentation is ©2001 by Red Hat, Inc. You may use this documentation in accordance with the FDL.
The most current source code and documentation for iMIME is distributed via anonymous FTP from <URL:ftp://people.redhat.com/havill/imime.tar.gz>.
RPMs are available in the same directory for more convenient package oriented installation of the source and IA-32 binaries.
The same directory should also hold older versions of iMIME.
The following program reads one MIME message from standard input and then runs header queries against the header information. For simplicity, no error checking is done, but proper memory deallocation is shown.

    #include <stdio.h>
    #include <stdlib.h>
    #include "mime.h"

    int main(int argc, char *argv[])
    {
        /* read a MIME message (headers present) from the stdin stream */
        mime_state *state = mime_init(NULL, mime_mime, NULL, 1, NULL, stdin);
        mime_msg *messages = mime_parse(NULL, state);

        /* treat each command line argument as a header query */
        while (--argc > 0) {
            char *s = mime_get_header_info(messages->headers, argv[argc]);

            mime_fputs(s, stdout);
            puts("");
            free(s);
        }
        mime_free_msg_list(messages);
        mime_free(state);
        return 0;
    }

To parse MIME messages with iMIME, most applications will need to go through the following steps, as shown above.
1. Initialize a parser state with mime_init. If unsuccessful, the function returns NULL.
   - The first parameter is NULL when the user application calls it.
   - If the character encoding parameter is NULL, it will query the global locale to get the encoding.
   - If the temporary file name parameter is NULL, it will use the default location.
   - If file is true, the next parameter is a pointer to FILE. Otherwise, the next parameter is a pointer to a string, and the parameter after that is a size_t indicating the size of the previous string.
2. Parse the input with mime_parse(). The returned value will be the head of the linked list of messages passed to it, or a new list. If unsuccessful, it will return NULL.
   - The first parameter is the existing message list, or NULL if no previous list exists. This is for parsing more than one message and linking all the messages together.
   - The second parameter is the state returned by mime_init. The contents of the limits field may be altered before calling mime_parse, and in some parsing cases, the where field may be altered as well.
3. Take the mime_msg returned and work with the data. In this example, we run header queries passed from the command line and print the UTF-8 string.
4. Call mime_free_msg_list to free the message list and everything it references, then free up the dynamically allocated state information with mime_free. Note that mime_free_msg_list will also attempt to remove temporary files it creates, so to keep such a file, the filename must be changed to a string of zero length, or the file copied or moved to a different location. If you copy or move the temporary file, be sure to preserve the file access and modification timestamps which the parser may set.
Feed the following message (make sure the newlines are CRLF and that no other space precedes the headers or blank lines) to the above program on standard input. To keep the CRLF newlines from getting converted, you will also need a Standard C environment where text input streams are read verbatim: the MS Windows Visual C++ CRT (C Runtime Library) converts CRLF to '\n', while Cygnus/Red Hat's cygwin can leave it as is.

    To: =?iso-8859-8?b?7eXs+SDv4SDp7Oj08A==?= <santa@nowhere.com>
    From: =?ISO-8859-1?Q?Olle_J=E4rnefors?= <havill@redhat.com>
    Subject: =?ISO-2022-JP?B?N0obJEIyfkR7SEckTkZiTUYbKEI=?=
    Comments: (=?ISO-8859-1?Q?Patrik_F=E4ltstr=F6m?=)
    Content-Transfer-Encoding: quoted-printable
    Content-Type: text/plain; charset=ISO-2022-JP

    This is a test=
    to see how it works =41=42=43 H=65llo!

When run with the argument Subject, the program should output a string that resembles C/C++/Java.
This is "7J改訂版の内容".
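The body of the sample message is marked Content-Transfer-Encoding: quoted-printable. As a rough illustration of what that encoding involves, here is a minimal, standalone sketch of a quoted-printable decoder. It does not use iMIME and is not how the library is implemented; the names qp_decode and hex_value are made up for this example, and the input is assumed to be a NUL-terminated string.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Value of one hexadecimal digit, or -1 if c is not a hex digit. */
static int hex_value(int c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    return -1;
}

/*
 * Decode a quoted-printable string in place and return the decoded length.
 * "=XX" becomes the byte 0xXX, and "=" followed by CRLF is a soft line
 * break that is dropped. Malformed escapes are copied through unchanged.
 * (Hypothetical helper, not part of iMIME.)
 */
size_t qp_decode(char *s)
{
    size_t len, in = 0, out = 0;

    assert(s != NULL);
    len = strlen(s);
    while (in < len) {
        if (s[in] == '=' && in + 2 < len) {
            int hi = hex_value((unsigned char)s[in + 1]);
            int lo = hex_value((unsigned char)s[in + 2]);

            if (hi >= 0 && lo >= 0) {           /* "=XX" escape */
                s[out++] = (char)(hi * 16 + lo);
                in += 3;
                continue;
            }
            if (s[in + 1] == '\r' && s[in + 2] == '\n') {
                in += 3;                        /* soft line break */
                continue;
            }
        }
        s[out++] = s[in++];                     /* ordinary byte */
    }
    s[out] = '\0';
    return out;
}
```

Decoding can be done in place because the output is never longer than the input. Applied to the escapes in the sample body, "=41=42=43" becomes "ABC" and "H=65llo!" becomes "Hello!".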
To use the library, you need to manipulate the contents of data structures defined in mime.h. The query interface can manipulate the header and message structures, and large scale software engineering principles frown on directly accessing the contents of abstract data types (because the contents could very well change when iMIME is improved). Even so, it can be beneficial to scan the structures directly for performance, or when you want to convert the data structures to a format used by the host application.
Strings are multibyte to make things easier for applications that don't use wide characters. Strings are also stored and transferred between internal memory and external files, which are normally 8-bit in modern environments, so multibyte strings make moving text objects from memory to disk and back easier. Functions are provided to transform UTF-8 strings to wide strings.
UTF-8 is an 8-bit octet transformation of 21-bit per character Unicode. If you never cared about Unicode and want to know as little as possible because you're monolingual, then all you need to know about Unicode (also known as ISO-10646) the character set and UTF-8 the encoding can be summarized by the following points.
Standard C string functions can be applied to UTF-8 strings: you can call strchr(utf, 'a'), strtok(utf, " "), strcmp("key", utf) and strstr(utf, "key") and get correct results.
You've probably figured out that this system can be expanded up to one head byte and five tail bytes, for a total of 31 bits, which was the original maximum length of ISO-10646 (2 billion characters that fit into a Standard C long without using the sign bit). It was later figured out that 31 bits is overkill (all modern, dead, archaic, and artificial languages are estimated by linguists to need no more than one million characters), and the standard was revised to say that only 21 bits would be necessary.
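The head byte and tail byte scheme can be sketched in a few lines of standalone C. This is only an illustration of the 21-bit form of UTF-8 under the usual bit layout (0xxxxxxx, 110xxxxx, 1110xxxx, 11110xxx head bytes and 10xxxxxx tail bytes); the function names utf8_encode and utf8_count are invented for this example and are not iMIME functions.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Encode one Unicode code point (U+0000..U+10FFFF, i.e. at most 21 bits)
 * as UTF-8. Writes 1 to 4 bytes into out and returns how many were
 * written, or 0 if cp is out of range or a surrogate (U+D800..U+DFFF).
 * (Hypothetical helper, not part of iMIME.)
 */
size_t utf8_encode(unsigned long cp, unsigned char out[4])
{
    assert(out != NULL);
    if (cp < 0x80) {                    /* ASCII: one byte, 8th bit zero */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                   /* head 110xxxxx + 1 tail byte */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {                 /* head 1110xxxx + 2 tail bytes */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {               /* head 11110xxx + 3 tail bytes */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}

/*
 * Count the characters in a NUL-terminated UTF-8 string by counting every
 * byte that is not a tail byte (10xxxxxx), that is, ASCII and head bytes.
 */
size_t utf8_count(const char *s)
{
    size_t n = 0;

    assert(s != NULL);
    for (; *s != '\0'; s++)
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;
    return n;
}
```

For instance, U+00E4 (the "ä" in the sample From header) encodes as the two bytes 0xC3 0xA4, so a name containing it is one byte longer than its character count.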
This well thought out system allows for the following properties:
- strcmp() on UTF-8 works exactly like wcscmp(), in that if wide character string "A" numerically sorts before wide string "B", it will also numerically sort before it in UTF-8.
- strlen() returns the number of bytes in a string, not the number of Unicode characters. To count characters, count the ASCII (8th bit zero) and head bytes. Latin-1 strings in a worst case scenario will expand to 200% of their size in UTF-8 (highly unlikely, although Cyrillic languages used to KOI8-R and ISO-8859-5 will see close to 200% expansion). Japanese EUC-JP strings will in an average case scenario expand to around 150%.
- On platforms with a 32-bit wchar_t, you should avoid surrogates as they're a complication that is necessary only for platforms with 16-bit wide characters that also need to address characters beyond U+FFFF (which is unlikely in the first place).
Occasionally you will hear criticism from a minority of Han/CJK character (aka Kangxi/Chinese, Kanji/Japanese, Hanja/Korean) users who claim their language cannot be properly represented by Unicode. As a general rule of thumb, Unicode can round-trip encode/decode all national standard characters. If they're using computers now with their language, they can use it with Unicode, and there's plenty of room for expansion with Unicode should additional characters be discovered to be useful.
Clarifying the confusion regarding Han characters and Unicode involves understanding that Unicode, unlike some other Han character sets, tries to not mix characters and glyphs (font variants of the same character), unless a national standard also encoded the glyphs (so that round-trip conversion is possible). This is a kludge which acknowledges that Unicode never could have become accepted unless it supported legacy charset conversion, no matter how broken the legacy set was. Unicode tries not to encode glyph variants except for compatibility because, while this makes rendering (questionably) easier, it makes the processing of character information (text searching and normalizing) unscalable as the glyph variant count rises.
Most of the characters they point out that are allegedly not in Unicode are in fact in Unicode but not the base glyph variant.
Variants of the same character are properly handled by a higher level. Although it will rarely be needed except by scholars and polyglots, the MIME decoders provided allow the specification of language to select a set of font glyphs for the Unicode characters, and Unicode allows for language tags via special characters in plane 14 should the higher level protocol not support language specification or the stream must be plain text.
If one header has multiple string values (for example, if a message contains two "X-Subject" lines), these strings will be concatenated together under the one header "X-Subject". In between the two strings will be the byte 0xFF, which never appears in UTF-8 strings.
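An application that needs the individual values back can split the joined string on the 0xFF byte. The following standalone sketch shows one way to do that; value_count and value_get are hypothetical helpers written for this example (they are not part of iMIME), and they assume a NUL-terminated text string rather than a free-form one.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* The byte that separates multiple values; it never occurs in UTF-8. */
#define VALUE_SEPARATOR '\xFF'

/* Number of values in a joined string (one more than the separators). */
size_t value_count(const char *joined)
{
    size_t n = 1;

    assert(joined != NULL);
    for (; *joined != '\0'; joined++)
        if (*joined == VALUE_SEPARATOR)
            n++;
    return n;
}

/*
 * Copy the index-th value (counting from 0) into out, which holds outsz
 * bytes. Returns 1 on success, or 0 if there is no such value or it does
 * not fit. (Hypothetical helper, not part of iMIME.)
 */
int value_get(const char *joined, size_t index, char *out, size_t outsz)
{
    const char *start = joined, *end;
    size_t len;

    assert(joined != NULL && out != NULL);
    while (index > 0) {                 /* skip the preceding values */
        start = strchr(start, VALUE_SEPARATOR);
        if (start == NULL)
            return 0;
        start++;
        index--;
    }
    end = strchr(start, VALUE_SEPARATOR);
    if (end == NULL)
        end = start + strlen(start);
    len = (size_t)(end - start);
    if (len >= outsz)
        return 0;
    memcpy(out, start, len);
    out[len] = '\0';
    return 1;
}
```

With two "X-Subject" lines, value_count would report 2 and value_get with index 0 and 1 would return the first and second line's text respectively.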
Strings are either free-form (and may include null characters) or text. If the text was converted from a character set to UTF-8 format, the first three bytes of the string will be 0xEF 0xBB 0xBF. These three bytes represent the Unicode character U+FEFF, which is a zero-width no-break space.
In a renderer that understands Unicode, this character should do nothing when printed. However, applications using the library should test for the presence of this character and remove it if it is at the front of the string (but not further occurrences of the character within the same string), as it is a magic character that is not part of the conversion. If the string itself began with a U+FEFF character, the first six bytes will be 0xEF 0xBB 0xBF 0xEF 0xBB 0xBF; the first three bytes should be ignored, but not the second three bytes.
This magic is put at the front of the strings because application environments that are not internationalized will need to determine which multibyte strings are Unicode and which are in other encodings. Also, MIME messages sometimes do not contain character set/encoding information, and one needs to know which text strings are ambiguous and which are definitely UTF-8.
You can modify the library behavior to make UTF-8 strings non-prefixed with a U+FEFF if your application does not need to distinguish between UTF-8 and non-UTF-8 strings.
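Testing for and skipping the leading marker takes only a few lines. The sketch below is standalone and does not call iMIME; skip_utf8_marker is a name invented for this example. It relies on the standard UTF-8 encoding of U+FEFF, which is the byte sequence 0xEF 0xBB 0xBF, and assumes a NUL-terminated text string.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* U+FEFF encoded as UTF-8. */
static const char utf8_marker[] = "\xEF\xBB\xBF";

/*
 * If s begins with the U+FEFF marker, return a pointer just past it,
 * otherwise return s unchanged. When was_marked is not NULL, it receives
 * 1 if the marker was present (the string is known to be UTF-8) and 0 if
 * not. Only the first occurrence is skipped; a second U+FEFF right after
 * it belongs to the text. (Hypothetical helper, not part of iMIME.)
 */
const char *skip_utf8_marker(const char *s, int *was_marked)
{
    int marked;

    assert(s != NULL);
    marked = strncmp(s, utf8_marker, 3) == 0;
    if (was_marked != NULL)
        *was_marked = marked;
    return marked ? s + 3 : s;
}
```

The returned pointer refers into the original string, so it must not be passed to free(); keep the original pointer for deallocation.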
Headers are the series of colon-separated Header: information pairs at the top of the message, continuing to the first blank line.
If a header is non-structured (the freeform field evaluates to true), the text to the right of the colon is not interpreted as structured and the string is put inside of a mime_header with no pointer to a mime_data. Otherwise, the comma-separated values are stored in a mime_data list.
Each value can have zero or more parameters (separated by semicolons) associated with it to the right of the value. These are stored in the mime_param list. A parameter has an attribute name and optionally a value. If the value is present, it is separated from the name by an equals sign.
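To make the value; attribute=value layout concrete, here is a standalone sketch that looks up one parameter in the text of a single header value. It is a simplification written for this document, not iMIME's parser: find_param and equal_nocase are hypothetical names, and quoted strings, comments, and encoded parameter values are deliberately not handled.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Compare n bytes of a and b, ignoring ASCII case. */
static int equal_nocase(const char *a, const char *b, size_t n)
{
    size_t i;

    for (i = 0; i < n; i++)
        if (tolower((unsigned char)a[i]) != tolower((unsigned char)b[i]))
            return 0;
    return 1;
}

/*
 * Look up the parameter named attr in a header value such as
 * "text/plain; charset=iso-2022-jp" and copy its value into out (outsz
 * bytes). Returns 1 if found and it fits, 0 otherwise.
 * (Hypothetical helper, not part of iMIME.)
 */
int find_param(const char *field, const char *attr, char *out, size_t outsz)
{
    size_t alen;
    const char *p;

    assert(field != NULL && attr != NULL && out != NULL);
    alen = strlen(attr);
    p = strchr(field, ';');             /* parameters follow the value */
    while (p != NULL) {
        const char *end;

        p++;                            /* step over the semicolon */
        while (isspace((unsigned char)*p))
            p++;
        end = strchr(p, ';');
        if (end == NULL)
            end = p + strlen(p);
        if ((size_t)(end - p) > alen && p[alen] == '='
            && equal_nocase(p, attr, alen)) {
            const char *v = p + alen + 1;
            size_t vlen = (size_t)(end - v);

            while (vlen > 0 && isspace((unsigned char)v[vlen - 1]))
                vlen--;                 /* trim trailing whitespace */
            if (vlen >= outsz)
                return 0;
            memcpy(out, v, vlen);
            out[vlen] = '\0';
            return 1;
        }
        p = (*end == ';') ? end : NULL; /* next parameter, if any */
    }
    return 0;
}
```

Attribute names are compared without regard to case, so looking up "charset" or "CHARSET" in "text/plain; charset=iso-2022-jp" both yield "iso-2022-jp".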
Messages hold the body of the message, whether it be a pointer to a file on disk (the body.filename field), a pointer to an encapsulated message (the body.multipart field), or a string in memory (the body.s field). As the body can have embedded '\0' characters in it, the length field holds the actual length in bytes.
The pointers to strings in the info structure are convenience fields which point to various header values and parameters which are often used. If the information is not available, they are set to a constant string of length zero, not NULL, so it's always safe to pass one of these to a Standard C <string.h> function. Common time & date header values and parameters are also parsed and decoded for ease of use.
Finally, the headers field points to a linked list of headers associated with this message.
In general, the state machine should be considered opaque. Helper functions for options should be used to manipulate the structure so applications will work with future versions of iMIME.
Although the message data type and the header data type are not opaque and their contents are visible, generic convenient query mechanisms exist to retrieve data from these objects as if they were an abstract type. This interface also makes it easier to build wrappers and bridges from other languages that may not interface easily to native C types and functions.
A simple syntax for retrieving header information and message bodies exists. The message body query syntax includes the header query syntax.
The EBNF for query strings used by mime_get_header_info() is as follows.
symbol | expansion |
---|---|
hdr | (empty) |
hdr | str |
hdr | str.val |
hdr | str.val.att |
Given the header Content-Type: text/plain; charset=iso-2022-jp, str would be Content-Type, val would be text/plain, and att would be charset.
If no att is specified, all the parameter names are returned in one string, separated by the string delimiter. If no val is specified, all values are returned in one string. If no str is specified, all headers are returned, separated in the same way.
The EBNF for query strings used by mime_get_msg() is as follows, with the type of the allocated return value, if any, in the last column.
symbol | expansion | meaning | allocated return type |
---|---|---|---|
cmd | ref^hdr | queries the headers | char* |
cmd | ref? | gets content type | mime_body_type* |
cmd | ref= | gets content body | mime_msg* or char* |
cmd | ref# | content length | size_t* |
cmd | ref | get message | mime_msg* |
cmd | ref:cmd | dereference multipart contents | |
ref | <id> | part with Message-ID or Content-ID of id | |
ref | !ref | do not recurse/descend into multipart | |
ref | index | #index */* | |
ref | {index} | #index text/* | |
ref | [index] | #index text/plain | |
ref | (index) | #index file (determined by the Content-Disposition) | |
ref | "name" or 'name' | part labeled with name in Content-Disposition | |
index | integer | integer is a whole number referring to the message parts, descending into multipart children | |
When a message is parsed, the multipart segments are appended in order to the linked list of messages. Assuming that the linked-list consists of the following contents:

    Message-ID: <0123.4567@imime>
    MIME-Version: 1.0 (This is a test)
    Content-Type: multipart/mixed; boundary=a
    Content-Disposition: inline; name="a"

    --a
    Content-Type: multipart/alternate; boundary=b
    Content-Disposition: inline; name="b"
    Content-ID: < 89AB.CDEF@imime>

    This is the prologue. It will be ignored.
    --b
    Content-Type: text/plain
    Content-Description: this is the content for non-rich text viewers
    Content-ID: <FEDC.BA98@imime>

    This is a test
    --b
    Content-Type: text/html
    Content-ID: < DEAD.BEAF@imime >
    Content-Description: this is the content for browser e-mail clients

    <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN">
    <TITLE>Test</TITLE>
    <P>This is a test
    --b--
    This is the epilogue. It will be ignored.
    --a
    Content-Type: application/octet-stream
    Content-ID: <0000.0000@imime >
    Content-Disposition: attachment; filename=test.sh; name="a2"
    Content-Description: This is a bash shell script

    #!/bin/bash
    echo "This is a test!"
    --a--

The following example commands show which part would be returned.
query (auto-descending into multipart) | part returned | query (not descending into multipart) | part returned |
---|---|---|---|
0 | multipart/mixed | !0 | multipart/mixed |
1 | multipart/alternate | !0:0 | multipart/alternate |
2 | text/plain | !0:!0:!0 | text/plain |
3 | text/html | !0:!0:!1 | text/html |
4 | application/octet-stream | !0:!1 | application/octet-stream |
5 | NULL | !1 | NULL |
{0} | text/plain | {!0} | NULL |
{1} | text/html | {!1} | NULL |
{2} | NULL | {!2} | NULL |
[0] | text/plain | [!0] | NULL |
[1] | NULL | [!1] | NULL |
(0) | application/octet-stream | (!0) | NULL |
(1) | NULL | (!1) | NULL |
The following is a complete program for allowing one to test the syntax of queries. It accepts zero or more messages stored in files as arguments to the program, then reads query commands from standard input until the end of file is reached (usually with Control-D on Linux and Control-Z on Windows).

    #include <ctype.h>
    #include <stdlib.h>
    #include <stdio.h>
    #include <string.h>
    #include "mime.h"

    int main(int argc, char *argv[])
    {
        int i;
        mime_msg *msgs = NULL;

        /* parse every file named on the command line into one message list */
        for (i = 1; i < argc; i++) {
            FILE *file = fopen(argv[i], "rb");
            mime_state *state;

            if (file == NULL) {
                perror(argv[i]);
                exit(EXIT_FAILURE);
            }
            state = mime_init(NULL, mime_mime, NULL, 1, NULL, file);
            msgs = mime_parse(msgs, state);
            mime_free(state);
            if (fclose(file) == EOF) {
                perror(argv[i]);
                exit(EXIT_FAILURE);
            }
        }

        /* read one query per line from standard input */
        while (!feof(stdin)) {
            char line[132], *s;
            size_t n;

            if (fgets(line, sizeof(line), stdin) == NULL) {
                if (ferror(stdin)) {
                    perror(NULL);
                    mime_free_msg_list(msgs);
                    exit(EXIT_FAILURE);
                } else
                    break;
            }
            /* strip trailing whitespace; n ends up indexing the last character */
            n = strlen(line);
            while (n != 0 && isspace((unsigned char) line[--n])) {
                line[n] = '\0';
            }
            s = mime_get_msg(msgs, line);
            if (s == NULL) {
                puts("NULL");
                continue;
            }
            /* the last character of the query selects how to print the result */
            switch (line[n]) {
                mime_body_type *type;

            case '#':
                printf("length is %zu\n", *((size_t *) s));
                break;
            case '?':
                switch (*((mime_body_type *) s)) {
                case mime_in_memory: puts("type is in_memory"); break;
                case mime_filename:  puts("type is filename");  break;
                case mime_multipart: puts("type is multipart"); break;
                default:             puts("type is UNKNOWN");
                }
                break;
            case '=':
                /* ask for the body type first to know how to print the body */
                line[n] = '?';
                type = mime_get_msg(msgs, line);
                switch (*type) {
                case mime_in_memory:
                    fputs("obj is ", stdout);
                    mime_fputs(*((char **) s), stdout);
                    puts("");
                    break;
                case mime_filename:
                    printf("file is %s\n", *((char **) s));
                    break;
                case mime_multipart:
                    printf("multi is %p\n", *((void **) s));
                    break;
                default:
                    printf("????? is %p\n", *((void **) s));
                }
                free(type);
                break;
            default:
                if (strchr(line, '^') != NULL) {
                    /* header query: the result is a string */
                    mime_fputs((char *) s, stdout);
                    puts("");
                } else
                    printf("body is %p\n", *((void **) s));
            }
            free(s);
        }
        mime_free_msg_list(msgs);
        return 0;
    }

For convenience, query strings allow one or more wildcards per identifier. Note that the wildcard has many restrictions regarding its use.

    {
        char *s = mime_get_msg(msgs, "'submit*'^content-type.*.charset");

        printf("The encoding for the 1st submit form control is %s\n", s);
        free(s);
    }

While the combination of mime_init() then mime_parse() is enough for most basic message decoding tasks, additional functions exist for controlling the parser behavior.
When a state is initialized, size limits are disabled and thresholds are set to their default behavior. These limits and thresholds can be modified at any time after initializing the state object and before using it to parse messages.
iMIME differentiates between internal "in memory" objects and objects that are considered to be external "attachments" or files, and this judgement is made based on the Content-Type and Content-Disposition.
The default settings within the state cause iMIME to save all external attachments to disk and all internal objects are kept in memory, but you can modify this behavior. For example, you may want small file based objects to be kept in memory for performance. Or you may want to write excessively large "internal" objects to disk so excessive amounts of RAM memory are not allocated.
The cutoff point at which an object, whether it be classified as an attachment or internal, is moved from memory to disk is the threshold. Until the number of bytes in the body of the object reaches or exceeds the threshold, the object will be kept in memory. After that, the object will be written to a temporary file. If information about the modification date and/or the access date is present in the Content-Disposition, the temporary file will have these set accordingly.
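The threshold behavior can be pictured with a small standalone sketch: data is appended to a memory buffer until the total reaches the threshold, at which point everything is moved to a temporary file. This is only an illustration of the idea, not iMIME's implementation. The struct spool type and the spool_* functions are invented for this example, and the Standard C tmpfile() function stands in for the named temporary files that iMIME creates.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical body buffer: kept in memory until it reaches the threshold. */
struct spool {
    size_t threshold;   /* spill to a file once len >= threshold */
    size_t len;         /* total bytes received so far */
    char *mem;          /* in-memory copy, NULL once spilled */
    FILE *file;         /* temporary file, NULL while in memory */
};

void spool_init(struct spool *sp, size_t threshold)
{
    assert(sp != NULL);
    sp->threshold = threshold;
    sp->len = 0;
    sp->mem = NULL;
    sp->file = NULL;
}

/* Append n bytes of data. Returns 0 on success, -1 on failure. */
int spool_write(struct spool *sp, const void *data, size_t n)
{
    assert(sp != NULL && (data != NULL || n == 0));
    if (sp->file == NULL && sp->len + n >= sp->threshold) {
        /* threshold reached: move what is buffered to a temporary file */
        sp->file = tmpfile();
        if (sp->file == NULL)
            return -1;
        if (sp->len > 0 && fwrite(sp->mem, 1, sp->len, sp->file) != sp->len)
            return -1;
        free(sp->mem);
        sp->mem = NULL;
    }
    if (sp->file != NULL) {
        if (n > 0 && fwrite(data, 1, n, sp->file) != n)
            return -1;
    } else if (n > 0) {
        char *grown = realloc(sp->mem, sp->len + n);

        if (grown == NULL)
            return -1;
        memcpy(grown + sp->len, data, n);
        sp->mem = grown;
    }
    sp->len += n;
    return 0;
}

/* Release the buffer; a tmpfile() file is removed when it is closed. */
void spool_free(struct spool *sp)
{
    assert(sp != NULL);
    free(sp->mem);
    if (sp->file != NULL)
        fclose(sp->file);
    sp->mem = NULL;
    sp->file = NULL;
    sp->len = 0;
}
```

With a threshold of 64, a 10 byte body stays in memory, while appending another 100 bytes moves all 110 bytes to the file.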
To change the thresholds of a mime_state *, modify the threshold fields.

    {
        mime_state *state = mime_init(NULL, mime_mime, NULL, 1, NULL, stdin);
        mime_limit *limits = mime_get_limits(state);

        limits->write_alloc_threshold = 10240;
        limits->write_file_threshold = 64;
    }

When using iMIME to parse objects coming from an unknown and/or untrusted source, the size of the objects being sent cannot be known in advance. To reduce DoS problems where a third party sends an object, intentionally or unintentionally, of an unreasonable size, or sends an unreasonable number of objects, iMIME allows you to configure the state machine to stop accepting input after certain limits are reached.
Limits can be set per object or for all objects combined, and you can control both the number of bytes and the number of message bodies received. The library can also separate the limits between external objects and internal objects, as the handling and body resource consumption depends on thresholds and whether they are considered to be external or internal.

    {
        mime_state *state = mime_init(NULL, mime_mime, NULL, 1, NULL, stdin);
        mime_limit *limits = mime_get_limits(state);

        /* all objects combined */
        limits->max_total_size = 32768;
        limits->max_total_objects = 5;
        /* external objects (files) */
        limits->max_total_file_size = 16384;
        limits->max_file_size = 8196;
        limits->max_file_objects = 3;
        /* internal objects (in memory) */
        limits->max_total_alloc_size = 8192;
        limits->max_alloc_size = 4096;
        limits->max_alloc_objects = 3;
    }

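How per-object and combined limits interact can be sketched independently of the library. The types and the accept_object function below are hypothetical and written only for this illustration; in particular, treating 0 as "no limit" is an assumption of the sketch, not documented iMIME behavior.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical limits for one class of objects; 0 means "no limit". */
struct size_limits {
    size_t max_object_size;     /* largest single object accepted */
    size_t max_total_size;      /* all accepted objects combined */
    size_t max_objects;         /* number of objects accepted */
};

/* Running totals of what has been accepted so far. */
struct tally {
    size_t total_size;
    size_t objects;
};

/*
 * Decide whether one more object of the given size may be accepted.
 * Returns 1 and updates the tally if it fits within every limit,
 * or 0 (leaving the tally untouched) if any limit would be exceeded.
 */
int accept_object(const struct size_limits *lim, struct tally *t, size_t size)
{
    assert(lim != NULL && t != NULL);
    if (lim->max_object_size != 0 && size > lim->max_object_size)
        return 0;
    if (lim->max_objects != 0 && t->objects >= lim->max_objects)
        return 0;
    /* written to avoid overflow in total_size + size */
    if (lim->max_total_size != 0
        && (size > lim->max_total_size
            || t->total_size > lim->max_total_size - size))
        return 0;
    t->total_size += size;
    t->objects++;
    return 1;
}
```

With limits of 4096 bytes per object, 8192 bytes in total, and 3 objects, two 4096 byte objects are accepted and any further non-empty object is refused because the combined size limit has been reached.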
You can query the default limits that are set during the state initialization with the function void mime_set_default_limits(mime_limit *limits), which will store the default values in the struct pointed to by limits.
Objects with sizes over certain thresholds will be saved to a file with a name that is set by the fifth parameter to mime_init(). If this parameter is NULL, the filename used will be a combination of the default temporary file directory (usually /tmp or /var/tmp on Linux systems) plus the MIME_TMP_BASE plus six characters making it unique.

    #include <stdio.h>
    #include "mime.h"

    int main(int argc, char *argv[])
    {
        /* the fifth parameter is the base name for the temporary files */
        mime_state *state = mime_init(NULL, mime_mime, NULL, 1, "data.", stdin);
        mime_msg *messages = mime_parse(NULL, state);

        mime_free_msg_list(messages);
        mime_free(state);
        return 0;
    }

The above program will save attachments and files from the message input into the program to the current working directory, with names similar to "data.jd5hYt", "data.65hTrd", and "data.JH76ya".
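Names of this shape (a base followed by six unique characters) are what the POSIX mkstemp() function produces from a template ending in "XXXXXX". Whether iMIME itself uses mkstemp() is not stated here, so the following standalone sketch is only an assumption about how such names can be generated on a POSIX system; make_unique_file is a name invented for this example.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/*
 * Create and open a uniquely named file whose name is base followed by
 * six unique characters, for example "data." -> "data.jd5hYt".
 * The generated name is stored in path (pathsz bytes). Returns an open
 * file descriptor, or -1 on failure. (Hypothetical helper, POSIX only.)
 */
int make_unique_file(const char *base, char *path, size_t pathsz)
{
    int n;

    assert(base != NULL && path != NULL);
    n = snprintf(path, pathsz, "%sXXXXXX", base);
    if (n < 0 || (size_t)n >= pathsz)
        return -1;              /* base too long for the buffer */
    return mkstemp(path);       /* replaces XXXXXX and creates the file */
}
```

The caller owns the returned descriptor and the file: close() the descriptor when done, and unlink() the path if the file should not be kept.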
If the message to be parsed is already in memory, you can specify that the parse occur with data from memory rather than an open readable binary stream.
Because the string may contain embedded zeros (if it includes MIME attachments with Content-Transfer-Encoding: binary), you must also specify the length of the string.

    #include <stdio.h>
    #include <string.h>
    #include "mime.h"

    const char *msg =
        "Content-Type: text/plain\r\n"
        "Content-Disposition: attachment; filename=test.txt\r\n"
        "\r\n"
        "This is a test file.\r\n";

    int main(int argc, char *argv[])
    {
        /* fourth argument 0: parse from a string, followed by its length */
        mime_state *state = mime_init(NULL, mime_mime, NULL, 0, NULL, msg, strlen(msg));
        mime_msg *messages = mime_parse(NULL, state);

        if (messages != NULL && messages->type == mime_filename) {
            printf("data saved to '%s'\n", messages->body.filename);
            /* don't let mime_free_msg_list() remove the temp */
            messages->body.filename[0] = '\0';
        }
        mime_free_msg_list(messages);
        mime_free(state);
        return 0;
    }

The above example should create a file in the temporary directory with the contents "This is a test file.".
Parsing from a string is useful for CGI applications handling form data passed with GET, as the data will be in memory, retrieved by C code similar to getenv("QUERY_STRING").
mime_parse() reports one type of error that may be useful to report to the submitter of the message: whether or not the message was greater than the set limits (mime_limits_error). If mime_no_error is returned by mime_get_error(), no error related to message size occurred.
For absolute robustness, you will want to test for other errors that can occur.
- If a malloc(), calloc(), or realloc() call fails, errno will be set to ENOMEM.
- If an I/O error occurs, errno will be set appropriately.
To test for these errors, set errno to 0 before mime_init(), and test for a non-zero value after both mime_init() and mime_parse.

    #include <errno.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include "mime.h"

    int main(int argc, char *argv[])
    {
        mime_state *state;
        mime_msg *messages;

        errno = 0;
        state = mime_init(NULL, mime_mime, NULL, 1, NULL, stdin);
        if (errno) {
            perror(NULL);   /* out of memory */
            exit(EXIT_FAILURE);
        } else {
            mime_limit *limits = mime_get_limits(state);

            limits->max_total_size = 32768;     /* <= 32K of body */
            messages = mime_parse(NULL, state);
            if (errno) {
                perror(NULL);   /* no memory or io problem */
                exit(EXIT_FAILURE);
            }
        }
        switch (mime_get_error(state)) {
        case mime_no_error:
            break;
        case mime_limits_error:
            puts("message too large");
            break;
        default:
            puts("unknown error!");
        }
        mime_free_msg_list(messages);
        mime_free(state);
        return 0;
    }

All data that comes from an untrusted source needs to be carefully evaluated by the application program. Because iMIME was designed to process information coming from outside sources, you should always regard this data as untrusted unless it comes from an authenticated trusted source.
iMIME may occasionally return a NULL pointer in certain data structures when the data could not be converted to UTF-8, an illegal UTF-8 sequence is found, or the message is malformed or prematurely terminated. It will also attempt to gracefully bail out of error situations without leaking memory to reduce (not eliminate) the chance of DoS exploits.
The application using the library will need to perform additional checks should it be accepting files and non-text data to determine if they are valid.
The library tries not to crash on bad input, but it does nothing to reject bad data that does not crash the parser and stays within the limits. This is the responsibility of the library user.
If the library creates files, these files may have access and modification timestamps that were set from information received from the MIME data. They cannot be trusted as accurate. As always, the contents of files received from unknown sources should not be trusted. This includes not just the obvious executable binary, but scripts and less obvious sources (PostScript®) that have executable properties.
The security notes in the RFCs that iMIME supports should be consulted and taken into account when using iMIME.
iMIME is designed to help software applications with the following needs.
Mailing list software and other automated mail responders proliferate around the internet. With interactive responders, a special command is sent to an address and the mail daemon returns a specific reply (for example, sending a "help" or "subscribe" command to a mailing list daemon). There are a few problems with current software that can be solved by modifying mailing list software to use this library to preprocess input.
iMIME was designed especially for handling new generation web form submissions. HTML 4 goes into great detail about the multipart/form-data format and its recommended use for non-ASCII submissions, file uploads, and large amounts of data.
Unfortunately, the sheer complexity of the standard, plus the fact that existing libraries handle the multipart/form-data with a completely different API means that web designers had to maintain two different code bases for web submissions.
iMIME can handle the most complicated web form submissions, as well as the older application/x-www-form-urlencoded with the same API, making it easy to migrate from the older system and/or be backwards compatible with older pages and older browsers that can't support the new type.
<FORM action=example.cgi>
- The first parameter is NULL.
- Set the type parameter of mime_init() to mime_urlencoded, saying that no header information is present and that we will be parsing the body immediately.
- If the character encoding parameter is NULL, the library will query the run-time locale to determine the character set/encoding. While standards dictate that the lack of a "charset" for a web page indicates that the page defaults to using ISO-8859-1, in reality this is abused, and web browsers may send non-Latin-1 data if the encoding is not set explicitly by the HTTP server. If a charset parameter has been set in the Content-Type (some newer browsers such as Mozilla can do this), it will override this parameter and the locale's character encoding.
- The string parameter is the URL-encoded data, for example the value returned by getenv("QUERY_STRING").
- The last parameter is a size_t unsigned integer that gives the length of the preceding string parameter. If that string contains no embedded '\0' characters (which it shouldn't if it's valid), this will be the same as the value returned by strlen(url_encoded_string).
The data parsed with mime_parse() will resemble a MIME multipart message. Ampersands and semicolons will look like MIME multipart boundaries. The data to the left of the equals sign will look like the data to the right of "Content-Disposition; name=".
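For readers unfamiliar with the application/x-www-form-urlencoded format itself, the following standalone sketch shows the decoding that has to happen before the data can be treated like message parts: pairs are split on ampersands and semicolons, "+" means a space, and "%XX" is a hexadecimal escape. This is not iMIME code. The names url_decode, url_get and hex_digit are invented for this example, the input is assumed to be a NUL-terminated string, and control names are assumed to appear unescaped in the query.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Value of one hexadecimal digit, or -1 if c is not a hex digit. */
static int hex_digit(int c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    return -1;
}

/*
 * Decode a URL-encoded string in place: '+' becomes a space and "%XX"
 * becomes the byte 0xXX. Malformed escapes are left as they are.
 * Returns the decoded length. (Hypothetical helper, not part of iMIME.)
 */
size_t url_decode(char *s)
{
    size_t len, in = 0, out = 0;

    assert(s != NULL);
    len = strlen(s);
    while (in < len) {
        if (s[in] == '+') {
            s[out++] = ' ';
            in++;
        } else if (s[in] == '%' && in + 2 < len
                   && hex_digit((unsigned char)s[in + 1]) >= 0
                   && hex_digit((unsigned char)s[in + 2]) >= 0) {
            s[out++] = (char)(hex_digit((unsigned char)s[in + 1]) * 16
                              + hex_digit((unsigned char)s[in + 2]));
            in += 3;
        } else {
            s[out++] = s[in++];
        }
    }
    s[out] = '\0';
    return out;
}

/*
 * Find the first "name=value" pair with the given name and copy its
 * decoded value into out (outsz bytes). Pairs are separated by '&' or
 * ';'. Returns 1 if found and it fits, 0 otherwise.
 */
int url_get(const char *query, const char *name, char *out, size_t outsz)
{
    const char *p = query;
    size_t nlen;

    assert(query != NULL && name != NULL && out != NULL);
    nlen = strlen(name);
    while (*p != '\0') {
        size_t plen = strcspn(p, "&;");     /* length of this pair */

        if (plen > nlen && p[nlen] == '=' && strncmp(p, name, nlen) == 0) {
            size_t vlen = plen - nlen - 1;

            if (vlen >= outsz)
                return 0;
            memcpy(out, p + nlen + 1, vlen);
            out[vlen] = '\0';
            url_decode(out);
            return 1;
        }
        p += plen;
        if (*p != '\0')
            p++;                            /* step over the separator */
    }
    return 0;
}
```

The decoded bytes are still in whatever character encoding the browser used, which is why the encoding parameter described above matters: "J%E4rnefors" decodes to the byte 0xE4, which is "ä" only if the data is Latin-1.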
<FORM method=post action=example.cgi>
As the data type is the same, handling an HTML form with method="post" is the same as handling the form above with iMIME, with the exception that the application/x-www-form-urlencoded data is read from a stream (usually stdin with CGI) and not a string.
- The input parameter is a <stdio.h> style FILE pointer.
- Use a FILE pointer that has been fopen()ed with a binary reading mode such as "rb". This is not important with glibc based systems because binary files are the same as text files, but on other platforms failing to set binary mode will cause linefeed information which is critical to iMIME to be altered. Note that stdin is a text stream, not a binary stream. The library does not fclose() the stream.
Note that should the CGI be working in the obsolete "nph" mode and receiving the headers, you should use the standard parsing, which is the same as handling a file upload. The Content-Type will cause iMIME to process the data properly.
<FORM method=post enctype="multipart/form-data" action=example.cgi>
Handling a file upload is identical to a POSTed form or converting mail from the perspective of the application.
- Set the type parameter of mime_init() to mime_mime, saying that header information is present.
In the case of multipart/form-data POSTs, iMIME behaves slightly differently than with regular MIME mail.
If the browser supports uploading multiple files within one control as per HTML 4, the multiple files will be in a multipart/mixed object.
For new forms, you should always use multipart/form-data instead of application/x-www-form-urlencoded, because MIME has mechanisms built in for specifying the character encoding for every control, where the older method must either hardcode/coordinate the encoding between the web page and the script, or rely on a kludge where the encoding is passed in as a hidden control.
In the past, the new form was frowned upon due to lack of browser support, but now all modern browsers support new form submissions.
iMIME can use the following methods for determining the character encoding for older forms:
iMIME supports User-Agents (web browsers) such as Mozilla that can specify the encoding in the charset parameter in the Content-Type for old forms.
Content-Type: application/x-www-form-urlencoded; charset=iso-8859-2
name=noone&submit=ok
This method has the following advantages:
mime_init. Thus, as browsers become newer, they will automatically obsolete the kludges.
This method has the following disadvantages:
When the accept-charset attribute is set to UNKNOWN (the default value), a form is submitted in the same encoding as the web page containing the form.
If the HTML 4 accept-charset attribute is set, the character set used should be one from the specified list. Few browsers currently support this very new feature, but as it is standard, support is expected to increase.
<FORM action=example.cgi accept-charset="utf-8, euc-kr" method=post>
charset.
If accept-charset=UNKNOWN (the default) and the page is not in ISO-8859-1 (Latin-1, the Western European character set), then the character encoding of the page must be set by the web server. Web pages that assume the locale default (such as Japanese pages with no charset parameter) are wrong according to standards and are not guaranteed to work correctly. This also includes the updated version of Latin-1, ISO-8859-15, which includes the Euro currency symbol ("€") and some French characters missing from Latin-1. There are three ways to tell the User-Agent (web browser) what character encoding the page is in, and thus what character set it should use to submit the form when accept-charset=UNKNOWN.
<META http-equiv=content-type content="text/html; charset=big5">
This has the following advantages:
The <META> tag solution has the following disadvantages:
charset parameter to be modified many different ways.
encoding" attribute AND the XML parser supports encodings other than UTF-8/UTF-16 (standards mandate that parsers are only required to support those two), the character encoding may be specified there.
<?xml version="1.0" encoding="iso-8859-5"?>
iMIME has a special variable that may be passed in form controls called "charset-enc". When a form control of this name appears, the value is used as the character encoding for all subsequent controls. Thus you must make sure that this control appears before all text controls.
<FORM method=post action=example.cgi>
<INPUT type=hidden name=charset-enc value=iso-8859-2>
<!-- the above must be the first control -->
<INPUT name=full-name>
</FORM>
The advantages of this method are:
The disadvantages of this method are:
<META http-equiv=content-type> tag.
It is possible to have more than one charset-enc in a form. The subsequent variables will override and replace the previously set value, but will not cause reconversion of the previous text controls.
Doing this doesn't make much sense with legacy forms, though, as only one character encoding is used for all the form controls. In general, you don't want to do this, but one special case comes to mind: you want some characters not to be converted when using a 7-bit stateful encoding. In this case, the normal character set would be something similar to ISO-2022-JP, ISO-2022-KR, or ISO-2022-CN, and a few controls would be set to decode as US-ASCII so the escape/shift sequences will be ignored.
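The ordering rule for charset-enc described above (a value applies only to the controls that follow it, and a later value replaces an earlier one) can be sketched as follows. The function encoding_for_control is hypothetical and is not part of iMIME; percent-decoding is omitted, so names and values are assumed to be plain ASCII.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not part of iMIME: walk the name=value pairs of
 * an application/x-www-form-urlencoded body in order and report which
 * encoding is in effect when the control named `control` is reached.
 * `initial` is the encoding assumed before any charset-enc is seen.
 * Returns 1 if the control was found, 0 otherwise. */
int encoding_for_control(const char *body, const char *control,
                         const char *initial, char *out, size_t outsz)
{
    const char *p = body;

    if (outsz == 0)
        return 0;
    strncpy(out, initial, outsz - 1);
    out[outsz - 1] = '\0';

    while (*p != '\0') {
        size_t pairlen = strcspn(p, "&");   /* length of "name=value" */
        size_t namelen = strcspn(p, "=&");  /* length of "name" */
        const char *val = p + namelen;
        size_t vallen = 0;

        if (*val == '=') {
            val++;
            vallen = pairlen - namelen - 1;
        }
        if (namelen == strlen(control) && strncmp(p, control, namelen) == 0)
            return 1;                       /* out holds the encoding in effect */
        if (namelen == 11 && strncmp(p, "charset-enc", 11) == 0) {
            if (vallen >= outsz)
                vallen = outsz - 1;
            memcpy(out, val, vallen);       /* later values replace earlier ones */
            out[vallen] = '\0';
        }
        p += pairlen;
        if (*p == '&')
            p++;
    }
    return 0;
}
```

With the body "a=1&charset-enc=koi8-r&b=2" and an initial encoding of "us-ascii", the control a is still reported as us-ascii while b is reported as koi8-r, which is why the hidden control has to come first.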
The simple method is to leave the encoding to the web backend script team, and tell the international web contents and translation team what character set the pages with forms must be in.
{
mime_state *state;
state = mime_init(NULL, mime_urlencoded, "koi8-r", 1, NULL, stdin);
}
The above sets the parser to expect the form data to be in a popular encoding for Russian Cyrillic. The page submitting the form must also be in KOI8-R.
The advantages of this method are:
The disadvantages of this method are:
All three techniques for determining the character encoding can be used at once. If two or more sources are available for determining the character encoding, precedence is set as follows, with the first listed method having the highest precedence.
mime_init() function
It's a good idea not to rely on the character encoding being set by the browser alone as most older browsers do not send this information.
iMIME contains some extra routines not directly related to parsing or querying messages, but to help with the debugging and conversion of UTF-8, HTML ampersand escapes, and wide strings.
Most environments still do not have UTF-8 consoles that can be used to view raw UTF-8 data. Also, the sheer number of characters in Unicode and the complexity of Unicode mean that you often want to see the raw hex codes for each character instead of the actual representation. iMIME provides two helper functions that can display UTF-8 encoded and wide strings on an ASCII terminal.
mime_fputs(MIME_UTF_BOM "\xE6\x97\xA5\xE6\x9C\xAC\xE8\xAA\x9E\x0A", stdout);
produces the string:
which is 日本語 with a newline at the end. (The string means "Japanese language")
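To make the idea of an ASCII-safe rendering concrete, here is a minimal sketch that decodes UTF-8 and writes each non-ASCII character as a hex escape. The function names utf8_decode and utf8_to_ascii_escapes are hypothetical, and the escape format (\uXXXX, with \UXXXXXXXX above U+FFFF) is an assumption; the exact output of iMIME's own mime_fputs() may differ.

```c
#include <stddef.h>
#include <stdio.h>

/* Decode one UTF-8 sequence starting at s into *cp.
 * Returns the number of bytes consumed, or 0 if the bytes are malformed.
 * Overlong forms and encoded surrogates are not rejected in this sketch. */
size_t utf8_decode(const unsigned char *s, unsigned long *cp)
{
    size_t i, len;

    if (s[0] < 0x80) {
        *cp = s[0];
        return 1;
    } else if ((s[0] & 0xE0) == 0xC0) {
        *cp = s[0] & 0x1F;
        len = 2;
    } else if ((s[0] & 0xF0) == 0xE0) {
        *cp = s[0] & 0x0F;
        len = 3;
    } else if ((s[0] & 0xF8) == 0xF0) {
        *cp = s[0] & 0x07;
        len = 4;
    } else
        return 0;
    for (i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80)      /* also stops at a premature '\0' */
            return 0;
        *cp = (*cp << 6) | (s[i] & 0x3F);
    }
    return len;
}

/* Write an ASCII-only rendering of the UTF-8 string src into dst.
 * Returns 0 on success, -1 on malformed input or a full buffer. */
int utf8_to_ascii_escapes(const char *src, char *dst, size_t dstsz)
{
    const unsigned char *s = (const unsigned char *) src;
    size_t used = 0;

    if (dstsz == 0)
        return -1;
    dst[0] = '\0';
    while (*s != '\0') {
        unsigned long cp;
        size_t n = utf8_decode(s, &cp);
        int w;

        if (n == 0)
            return -1;
        if (cp == '\n')
            w = snprintf(dst + used, dstsz - used, "\\n");
        else if (cp >= 0x20 && cp < 0x7F && cp != '\\' && cp != '"')
            w = snprintf(dst + used, dstsz - used, "%c", (int) cp);
        else if (cp <= 0xFFFF)
            w = snprintf(dst + used, dstsz - used, "\\u%04lX", cp);
        else
            w = snprintf(dst + used, dstsz - used, "\\U%08lX", cp);
        if (w < 0 || (size_t) w >= dstsz - used)
            return -1;
        used += (size_t) w;
        s += n;
    }
    return 0;
}
```

Escaping the double quote as \u0022 mirrors the default behavior mentioned for MIME_BACKSLASH_QUOTE; escaping the backslash the same way is a choice made for this sketch only.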
NULL strings are allowed.
mbstowcs() is provided by Standard C to convert from multibyte strings to wide strings, but suffers from some problems that make it inadequate for use with this library.
LC_CTYPE locale setting.
wchar_t is not necessarily Unicode/ISO-10646 for some environments.
#include <locale.h>
#include <stdio.h>
#include <stdlib.h>
#include <wchar.h>
#include "mime.h"
int main(int argc, char *argv[])
{
    wchar_t *s;
    setlocale(LC_CTYPE, "ja_JP.eucJP");
    /* convert UTF-8 char * to wide even though LC_CTYPE is set
     * so that char * should be EUC-JP. */
    s = mime_utf_conv(MIME_UTF_BOM "\xe6\xbc\xa2\xe5\xad\x97\x0a");
    if (s != NULL) {
        fputws(s, stdout);
        /* stdout has been changed to wide orientation so output the
         * non-wide string to stderr. */
        mime_fputws(s, stderr); /* output EUC-JP */
        fputc('\n', stderr);
        free(s);
    } else
        exit(EXIT_FAILURE);
    return 0;
}
free()d after use.
sizeof(wchar_t) is two, then the character is converted to two surrogates.
MIME_UTF_BOM then the function behaves as if one allocated a string then called the Standard C function mbstowcs(), which is sensitive to the global locale.
mime_utf_conv() will convert from UTF-8 to a wchar_t string where each character contains an ISO-10646 (Unicode) code.
漢字
/*0x8055180*/ L"\u6F22\u5B57\n" /*3*/
If you are using a terminal which can display Japanese, you should see the word "kanji" ("Chinese Character") in Japanese style ideographs followed by an escaped ASCII version. This is prefixed by the pointer location on your system and suffixed by the character (not the byte) count if debugging is enabled.
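The surrogate conversion mentioned above, for environments where sizeof(wchar_t) is two, follows the standard UTF-16 rule. The sketch below shows that rule in isolation; the function name to_utf16 is hypothetical and this is not iMIME's own code.

```c
#include <stddef.h>

/* Split a Unicode code point into UTF-16 code units.
 * Returns the number of units written (1 or 2), or 0 if cp is not a
 * valid Unicode scalar value. */
size_t to_utf16(unsigned long cp, unsigned short out[2])
{
    if (cp > 0x10FFFFUL || (cp >= 0xD800UL && cp <= 0xDFFFUL))
        return 0;
    if (cp < 0x10000UL) {
        out[0] = (unsigned short) cp;       /* fits in one unit */
        return 1;
    }
    cp -= 0x10000UL;                        /* 20 bits remain */
    out[0] = (unsigned short) (0xD800UL | (cp >> 10));      /* high surrogate */
    out[1] = (unsigned short) (0xDC00UL | (cp & 0x3FFUL));  /* low surrogate */
    return 2;
}
```

Characters in the Basic Multilingual Plane, such as the two ideographs in the example above, need only one unit; only code points above U+FFFF are split.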
iMIME post-processes all strings converted to UTF-8 through an HTML entity processor and numeric character reference processor. It understands all general entities in HTML 4.01 (including €) and understands decimal as well as hexadecimal character references.
mime_decode_html() is called multiple times as characters become available and are processed by a state machine.
NULL.
free()d after use.
#include <stdio.h>
#include <stdlib.h>
#include "mime.h"
int main(int argc, char *argv[])
{
    int i;
    for (i = 1; i < argc; i++) {
        mime_html_state state;
        char *s;
        mime_init_html(&state, 0);
        s = mime_decode_html(argv[i], &state);
        mime_fputs(s, stdout);
        puts("");
        free(s);
        s = mime_decode_html(NULL, &state);
        mime_fputs(s, stdout);
        puts("");
        free(s);
    }
    return 0;
}
You must initialize the HTML conversion state machine with mime_init_html() before calling mime_decode_html().
Hello World! ♥
as the first argument produces the following output:
... which is "Hello World! ♥". Note that although mime_decode_html() outputs UTF-8, it does not prefix the string with MIME_UTF_BOM because the magic is added for converted encodings only.
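For readers unfamiliar with numeric character references, the following sketch decodes a single decimal or hexadecimal reference into UTF-8. The names utf8_encode and decode_ncr are hypothetical; unlike mime_decode_html(), this sketch keeps no state between calls and knows nothing about named entities.

```c
#include <stddef.h>

/* Encode a code point as UTF-8 into out (at least 4 bytes).
 * Returns the number of bytes written, or 0 for an invalid code point. */
size_t utf8_encode(unsigned long cp, char *out)
{
    if (cp < 0x80) {
        out[0] = (char) cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (char) (0xC0 | (cp >> 6));
        out[1] = (char) (0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        if (cp >= 0xD800 && cp <= 0xDFFF)   /* surrogates are not characters */
            return 0;
        out[0] = (char) (0xE0 | (cp >> 12));
        out[1] = (char) (0x80 | ((cp >> 6) & 0x3F));
        out[2] = (char) (0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {
        out[0] = (char) (0xF0 | (cp >> 18));
        out[1] = (char) (0x80 | ((cp >> 12) & 0x3F));
        out[2] = (char) (0x80 | ((cp >> 6) & 0x3F));
        out[3] = (char) (0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}

/* Decode one reference of the form "&#NNNN;" or "&#xHHHH;" into a
 * NUL-terminated UTF-8 string in out (at least 5 bytes).
 * Returns the UTF-8 length, or 0 if ref is not a valid reference. */
size_t decode_ncr(const char *ref, char *out)
{
    unsigned long cp = 0;
    unsigned long base = 10;
    size_t digits = 0, n;

    out[0] = '\0';
    if (ref[0] != '&' || ref[1] != '#')
        return 0;
    ref += 2;
    if (*ref == 'x' || *ref == 'X') {
        base = 16;
        ref++;
    }
    while (*ref != ';') {
        unsigned long d;

        if (*ref >= '0' && *ref <= '9')
            d = (unsigned long) (*ref - '0');
        else if (base == 16 && *ref >= 'a' && *ref <= 'f')
            d = (unsigned long) (*ref - 'a' + 10);
        else if (base == 16 && *ref >= 'A' && *ref <= 'F')
            d = (unsigned long) (*ref - 'A' + 10);
        else
            return 0;                       /* includes a premature '\0' */
        cp = cp * base + d;
        if (cp > 0x10FFFF)
            return 0;
        digits++;
        ref++;
    }
    if (digits == 0)
        return 0;
    n = utf8_encode(cp, out);
    out[n] = '\0';
    return n;
}
```

Both the decimal and the hexadecimal spelling of the same code point yield the same three UTF-8 bytes for the heart character used in the example above.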
When the library is compiled with debugging enabled through macros, the following additional general entities will be available.
Some people who are conservative about security believe it is not wise to let outsiders know what version of software you're running lest script kiddies use known exploits for older versions, so the thinking is that live production code with debugging code removed should disable this entity, which it does.
iMIME understands both asctime() style time strings ("Sun Nov 6 08:49:37 1994") and RFC 850 style time strings. It also understands most of the U.S. timezone, military zone (A to Z), UTC/GMT zone, and some far east timezone (Japan and South Korea) labels.
#include <stdio.h>
#include <time.h>
#include "mime.h"
int main(int argc, char *argv[])
{
    int i;
    for (i = 1; i < argc; i++) {
        time_t tod = mime_parse_time(argv[i]);
        fputs(asctime(gmtime(&tod)), stdout);
        fputs(asctime(localtime(&tod)), stdout);
    }
    return 0;
}
mime_parse_time() will return an encoded time, or (time_t) -1 if the time couldn't be parsed.
The last thing the world needs is another Quoted-Printable and Base64 decoder, but in case you do need it, iMIME's internal routines are exported for application use.
The parsers use a state object so you don't have to buffer the lines yourself. Lines can be arbitrarily long. To initialize the state machine, pass a pointer to an initialized or uninitialized structure to the following function.
Characters after the '=' padding are permitted in Base64 mode.
#include <stdio.h>
#include "mime.h"
int main(int argc, char *argv[])
{
    char s[] = "SGVsbG8gV29ybGQh"; /* "Hello World!" */
    mime_enc_state state;
    size_t n;
    mime_init_enc(&state, mime_base64);
    n = mime_64_decode(s, s, &state);
    s[n] = '\0';
    puts(s);
    return 0;
}
The example above should print "Hello World!" on a line.
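To show why a state object removes the need to buffer whole lines, here is a minimal streaming Base64 decoder that can be fed input in arbitrary pieces. The struct b64_state and the functions b64_init and b64_decode are hypothetical names invented for this sketch; the layout of iMIME's mime_enc_state is not documented here and may well differ.

```c
#include <stddef.h>

/* Hypothetical state for this sketch only. */
struct b64_state {
    unsigned long bits;  /* accumulated 6-bit groups */
    int nbits;           /* number of not yet emitted bits in `bits` */
    int done;            /* set once '=' padding has been seen */
};

void b64_init(struct b64_state *st)
{
    st->bits = 0;
    st->nbits = 0;
    st->done = 0;
}

/* Map a Base64 alphabet character to its value, or -1 (ASCII assumed). */
static int b64_value(int c)
{
    if (c >= 'A' && c <= 'Z') return c - 'A';
    if (c >= 'a' && c <= 'z') return c - 'a' + 26;
    if (c >= '0' && c <= '9') return c - '0' + 52;
    if (c == '+') return 62;
    if (c == '/') return 63;
    return -1;
}

/* Decode the NUL-terminated chunk `in`, writing bytes to `out`.
 * `out` may alias `in` because output never overtakes input.
 * Leftover bits stay in *st until the next call.
 * Returns the number of bytes written. */
size_t b64_decode(const char *in, char *out, struct b64_state *st)
{
    size_t n = 0;

    for (; *in != '\0'; in++) {
        int v;

        if (st->done)
            continue;                   /* ignore everything after padding */
        if (*in == '=') {
            st->done = 1;
            continue;
        }
        v = b64_value((unsigned char) *in);
        if (v < 0)
            continue;                   /* skip line breaks and other noise */
        st->bits = (st->bits << 6) | (unsigned long) v;
        st->nbits += 6;
        if (st->nbits >= 8) {
            st->nbits -= 8;
            out[n++] = (char) ((st->bits >> st->nbits) & 0xFF);
        }
    }
    return n;
}
```

Because the leftover bits live in the state, the input can be split anywhere, even in the middle of a four-character group, and the concatenated output is the same as decoding the whole string at once.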
Content negotiation is a standard feature of HTTP/1.1 and is supported by most modern servers such as Apache. However, applications other than web servers may also want a convenient way to figure out which language is most appropriate to return, given a set of available languages with varying levels of quality matched against the user's desired languages.
Suppose we received a mail or HTTP request with the following header:
X-Accept-Language: fr-CA, fr; q=0.999, en; q=0.8, de; q=0.500, *; q=0
This could mean: "I am from Montreal and Canadian French is my native language. Given a choice, I prefer this dialect, but "standard" French is almost just as acceptable. I am bilingual though and understand English as well. I studied German and will take that if you have it, but only if you can't deliver in my two comfortable languages. I don't know any other languages, and don't even think about sending these to me."
{
    /* quality values range from 0 to 1 */
    const char *have = "Content-Language: en, de; q=0.9, ja; q=0.5, fr; q=0.1";
    char *negotiated = mime_negotiate_content((mime_msg *) msg, have);
    if (negotiated == NULL)
        puts("NULL");
    else {
        puts(negotiated);
        free(negotiated);
    }
}
This example could be interpreted as follows: "The web page was written in English, but one of our web team members is German, knows our products and is familiar with our marketing pitch, and we trust his translations. Every once in a while we send the English page to a Japanese translation firm, and this page is not up-to-date, and the translator is not a technical person and not familiar with computer vocabulary. We received a French translation of our page once from a college student studying French translation."
fr
The above code will return "fr", even though it is the least well translated page. This is because the default macroed negotiation algorithm always gives preference to what the client wants to accept, and only uses the quality of the content when the client prefers two or more available resources equally.
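The rule just described (client preference first, content quality only as a tie-breaker) can be sketched on already parsed lists. The type lang_q and the functions same_tag, wanted_q, and negotiate are hypothetical names for this illustration; header parsing is left out, and tags are matched exactly or through the "*" wildcard only, which is an assumption about matching rather than a statement of what mime_negotiate_content() does.

```c
#include <ctype.h>
#include <stddef.h>

/* One language tag with its quality value, already parsed from a header. */
struct lang_q {
    const char *tag;
    double q;
};

/* Case-insensitive comparison of two language tags. */
int same_tag(const char *a, const char *b)
{
    while (*a != '\0' && *b != '\0') {
        if (tolower((unsigned char) *a) != tolower((unsigned char) *b))
            return 0;
        a++;
        b++;
    }
    return *a == *b;
}

/* Quality the client assigns to tag: an exact match wins, otherwise the
 * "*" wildcard applies, otherwise the language is not acceptable (0). */
double wanted_q(const struct lang_q *want, size_t nwant, const char *tag)
{
    double star = 0.0;
    size_t i;

    for (i = 0; i < nwant; i++) {
        if (same_tag(want[i].tag, tag))
            return want[i].q;
        if (same_tag(want[i].tag, "*"))
            star = want[i].q;
    }
    return star;
}

/* Pick the available language the client wants most; the quality of the
 * content only breaks ties. Returns NULL if nothing is acceptable. */
const char *negotiate(const struct lang_q *want, size_t nwant,
                      const struct lang_q *have, size_t nhave)
{
    const char *best = NULL;
    double best_want = 0.0, best_have = 0.0;
    size_t i;

    for (i = 0; i < nhave; i++) {
        double w = wanted_q(want, nwant, have[i].tag);

        if (w <= 0.0)
            continue;                   /* q=0 means "not acceptable" */
        if (best == NULL || w > best_want ||
            (w == best_want && have[i].q > best_have)) {
            best = have[i].tag;
            best_want = w;
            best_have = have[i].q;
        }
    }
    return best;
}
```

Fed the wanted and available lists from the example above, this sketch also settles on "fr", because the client's 0.999 for French outranks its 0.8 for English regardless of how poor the French page is.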
As iMIME will allow an application to easily receive and decode files, chances are that an interactive web application will want to immediately use that data, whether it be an image (e.g. a face shot of the user submitted through a registration form) or the pronunciation of one's name.
Handling image, sound, and video is complicated as there are numerous standard formats. Many libraries exist to handle these files and iMIME does not bother to duplicate the functionality.
Modern GUI mail programs often default to outputting MIME formatted text/html unless modified to output text/plain. To deal with this common format, a convenience conversion filter is provided.
Mail and web clients whose data is processed automatically nowadays send some SGML application, such as DocBook, HTML, or XML. All of these formats generally mark up normal text with tags such as <SAMPLE>. iMIME converts marked up text to plain text via the following methods:
With most markup languages, the removal of tags will cause the loss of too much information to be useful to humans, but the information could be useful for a text search engine or some other processor that needs only the unformatted raw text stream.
#include <stdlib.h>
#include <stdio.h>
#include "mime.h"
int main(int argc, char *argv[])
{
    int i, result = EXIT_SUCCESS;
    mime_markup_state *state;
    if ((state = mime_init_markup("alt, longdesc")) == NULL)
        exit(EXIT_FAILURE);
    for (i = 1; i <= argc; i++) {
        const char *sgml = i == argc ? NULL : argv[i];
        char *s = mime_transform_markup(sgml, state);
        if (s != NULL) {
            printf("%s%c", s, i == argc ? '\n' : ' ');
            free(s);
        } else {
            result = EXIT_FAILURE;
            break;
        }
    }
    mime_free_markup(state);
    return result;
}
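The simplest form of tag removal can be sketched as a two-state scanner. The function strip_tags below is hypothetical and much cruder than the library routine shown above: it keeps no attribute values (so nothing like the "alt, longdesc" list is honored), and it does not handle comments, CDATA sections, entities, or a '>' inside a quoted attribute value.

```c
#include <stddef.h>

/* Copy src to dst with everything between '<' and '>' removed.
 * dst is always NUL-terminated when dstsz > 0; text that does not fit
 * is silently dropped. Returns the length of the result. */
size_t strip_tags(const char *src, char *dst, size_t dstsz)
{
    size_t n = 0;
    int in_tag = 0;

    if (dstsz == 0)
        return 0;
    for (; *src != '\0'; src++) {
        if (in_tag) {
            if (*src == '>')
                in_tag = 0;             /* tag ends, resume copying */
        } else if (*src == '<') {
            in_tag = 1;                 /* tag starts, stop copying */
        } else if (n + 1 < dstsz) {
            dst[n++] = *src;
        }
    }
    dst[n] = '\0';
    return n;
}
```

A single flag is all the state there is, so the same loop could be fed input a chunk at a time if in_tag were kept between calls.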
The library is written in Standard C and works best when called by C or C++ routines, but wrapper APIs have been written for popular scripting languages, especially scripting languages used for server side dynamic web content parsing, as the library is designed to work well with HTTP client data.
As this library is open source, you can modify it in accordance with the license. The following areas have been set up to allow for trivial modification.
The C source code has plenty of macros which are designed to allow the developer to customize the operation of the library. Define and set these in your Makefile to override default behavior.
Note that changing these macros will probably cause the library to misbehave and not function properly as-is; additional hacking on the code will be necessary.
DMALLOC
MIME_REJECT_IDENT_ICONV
iconv_open() to not be called and an error returned when you try to convert from UTF-8 to UTF-8 (which does nothing except error check the stream).
MIME_PARSE_KEYWORDS
MIME_UNPARSED_RCVD
NDEBUG and DEBUG
NDEBUG is defined, the pointer address and character count will not be output for each mime_fputs(). Alternatively, you can enable debugging code by setting the macro DEBUG to true (non-zero).
MIME_BACKSLASH_QUOTE
mime_fputs() and mime_fputws() will be printed as \" instead of \u0022.
MIME_USE_UFFFF
MIME_HTML_2BYTE_UTF_U0000
MIME_CASE_INSENSITIVE_NAME_QUERY
MIME_USE_GLOBAL_LOCALE
LC_CTYPE will be used instead of thread-safe substitutes that are hard-coded to the "C" locale. Since message data from the internet does not vary in format, there is no reason to use locale sensitive parsing.
MIME_PREFER_HAVE_OVER_WANT
The following macros are available to applications that #include "mime.h".
MIME_UTF
iconv_open(). The wide character converter and debug routines only understand UTF-8.
MIME_UTF_BOM
MIME_UTF encoding. If you don't want a prefix, define this to be "".
MIME_UTF_DELIM
MIME_WIDE_DELIM
MIME_UTF_DELIM, except it is used by mime_utf_conv() and it is only one character and not a string. Usually set to U+FFFC, which is a Unicode Object Replacement character.
MIME_TMP_BASE
MIME_CHARSET_ENC
name" attribute in a <INPUT name="charset-enc" type="hidden" value="encoding"> control. All controls after this are converted to MIME_UTF from encoding. Used to let the form itself set the encoding.
Strings of type mime_string are dynamically preallocated with a set amount of characters. When all the preallocated space is used up, a realloc() is called with a size either twice the previous (if the inc field is zero), or a size that is increased by the value of the field inc.
Allocate too little space, and performance degrades due to continuous unnecessary realloc() calls. Allocate too much space, and the library consumes more free store than it needs to.
The const size_t global variables at the top of mime.c may be tuned for a specific application.
mime_token_bufsiz
mime_response_bufsiz
mime_smtp_bufsiz
mime_b64qp_bufsiz
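The growth policy described for mime_string (double the size when inc is zero, otherwise grow by inc) can be sketched as follows. The struct grow_buf and its functions are hypothetical stand-ins written for this illustration; the real mime_string layout is not shown in this document and may differ.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for this sketch only. */
struct grow_buf {
    char *data;
    size_t len;  /* bytes in use */
    size_t cap;  /* bytes allocated */
    size_t inc;  /* 0 = double on growth, otherwise grow by inc */
};

/* Preallocate `initial` bytes. Returns 0 on success, -1 on failure. */
int grow_buf_init(struct grow_buf *b, size_t initial, size_t inc)
{
    if (initial == 0)
        initial = 1;
    b->data = malloc(initial);
    if (b->data == NULL)
        return -1;
    b->len = 0;
    b->cap = initial;
    b->inc = inc;
    return 0;
}

/* Append one byte, growing the buffer when the preallocated space is
 * used up. Returns 0 on success, -1 on failure. */
int grow_buf_putc(struct grow_buf *b, char c)
{
    if (b->len == b->cap) {
        size_t newcap = b->inc == 0 ? b->cap * 2 : b->cap + b->inc;
        char *p;

        if (newcap <= b->cap)           /* size_t overflow */
            return -1;
        p = realloc(b->data, newcap);
        if (p == NULL)
            return -1;                  /* the old block is still valid */
        b->data = p;
        b->cap = newcap;
    }
    b->data[b->len++] = c;
    return 0;
}
```

Starting from four bytes, a fifth byte grows the buffer to eight with inc set to zero and to seven with inc set to three, which shows the trade-off behind the tunable sizes listed above: a larger initial size means fewer realloc() calls but more idle memory.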
Within the source code, keyword strings are embedded in the comments to point out parts that you may wish to consider fixing or modifying.
Happy Hacking!
Additional documentation and support for iMIME can be obtained from Red Hat, Inc.