W3cubDocs

/DOM

WindowBase64.Base64 encoding and decoding

Base64 is a group of similar binary-to-text encoding schemes that represent binary data in an ASCII string format by translating it into a radix-64 representation. The term Base64 originates from a specific MIME content transfer encoding.

Base64 encoding schemes are commonly used when there is a need to encode binary data that needs to be stored and transferred over media that are designed to deal with textual data. This is to ensure that the data remain intact without modification during transport. Base64 is commonly used in a number of applications including email via MIME, and storing complex data in XML.

In JavaScript there are two functions respectively for decoding and encoding base64 strings:

The atob() function decodes a string of data which has been encoded using base-64 encoding. Conversely, the btoa() function creates a base-64 encoded ASCII string from a "string" of binary data.

Both atob() and btoa() work on strings. If you want to work on ArrayBuffers, please, read this paragraph.

Documentation

data URIs
data URIs, defined by RFC 2397, allow content creators to embed small files inline in documents.
Base64
Wikipedia article about Base64 encoding.
WindowOrWorkerGlobalScope mixin
Specifies the atob and btoa methods and states that they encode to base64 as specified by RFC 4648.
RFC 4648
Specifies the base64 algorithm in section 4, and also defines an alternate base64url algorithm for URLs in section 5 (which is not the one used by atob/btoa).
atob()
Decodes a string of data which has been encoded using base-64 encoding.
btoa()
Creates a base-64 encoded ASCII string from a "string" of binary data.
The "Unicode Problem"
In most browsers, calling btoa() on a Unicode string will cause a Character Out Of Range exception. This paragraph shows some solutions.
URIScheme
List of Mozilla supported URI schemes
StringView
In this article is published a library of ours whose aims are:
  • creating a C-like interface for strings (i.e. array of characters codes — ArrayBufferView in JavaScript) based upon the JavaScript ArrayBuffer interface,
  • creating a collection of methods for such string-like objects (since now: stringViews) which work strictly on array of numbers rather than on immutable JavaScript strings,
  • working with other Unicode encodings, different from default JavaScript's UTF-16 DOMStrings,

View All...

Tools

View All...

The "Unicode Problem"

Since DOMStrings are 16-bit-encoded strings, in most browsers calling window.btoa on a Unicode string will cause a Character Out Of Range exception if a character exceeds the range of a 8-bit byte (0x00~0xFF). There are two possible methods to solve this problem:

  • the first one is to escape the whole string (with UTF-8, see encodeURIComponent) and then encode it;
  • the second one is to convert the UTF-16 DOMString to an UTF-8 array of characters and then encode it.

Here is the prefered method along with two other "possible" methods.

Solution #1 –UTF-16 => binary UTF8-in-16

The fastest, most widely useable, most standard, and most future-ready way to solve the unicode problem is by transforming the UTF16 into UTF8-in-16 then use btoa on the UTF8-in-16 string. To understand this deeper, UTF16 uses 16 bits as the smallest unit of measure whereas UTF8 uses 8 bits as the smallest unit of measure. This means that the smallest a UTF16 character can be is 16 bits (2 bytes) meanwhile the smallest a UTF8 character can be is 8 bits (1 byte). Then, UTF8-in-16 is making the string encoded as UTF-8, but storing it as a UTF-16 string by only using 8 (from UTF8) of the 16 bits in each unit of measure of a UTF-16 string. This method is incredibly efficient because all it takes is a single simple super fast String.prototype.replace to convert between UTF-16 and UTF8-in-16. There is no need to encode the string as an array buffer then reencode it as a string. Rather, the below code simply and straightforwardly converts it in a single String.prototype.replace.

(function(window){
    "use strict";
    var log = Math.log;
    var LN2 = Math.LN2;
    var clz32 = Math.clz32 || function(x) {return 31 - log(x >>> 0) / LN2 | 0};
    var fromCharCode = String.fromCharCode;
    var originalAtob = atob;
    var originalBtoa = btoa;
    function btoaReplacer(nonAsciiChars){
        // make the UTF string into a binary UTF-8 encoded string
        var point = nonAsciiChars.charCodeAt(0);
        if (point >= 0xD800 && point <= 0xDBFF) {
            var nextcode = nonAsciiChars.charCodeAt(1);
            if (nextcode !== nextcode) // NaN because string is 1 code point long
                return fromCharCode(0xef/*11101111*/, 0xbf/*10111111*/, 0xbd/*10111101*/);
            // https://mathiasbynens.be/notes/javascript-encoding#surrogate-formulae
            if (nextcode >= 0xDC00 && nextcode <= 0xDFFF) {
                point = (point - 0xD800) * 0x400 + nextcode - 0xDC00 + 0x10000;
                if (point > 0xffff)
                    return fromCharCode(
                        (0x1e/*0b11110*/<<3) | (point>>>18),
                        (0x2/*0b10*/<<6) | ((point>>>12)&0x3f/*0b00111111*/),
                        (0x2/*0b10*/<<6) | ((point>>>6)&0x3f/*0b00111111*/),
                        (0x2/*0b10*/<<6) | (point&0x3f/*0b00111111*/)
                    );
            } else return fromCharCode(0xef, 0xbf, 0xbd);
        }
        if (point <= 0x007f) return inputString;
        else if (point <= 0x07ff) {
            return fromCharCode((0x6<<5)|(point>>>6), (0x2<<6)|(point&0x3f));
        } else return fromCharCode(
            (0xe/*0b1110*/<<4) | (point>>>12),
            (0x2/*0b10*/<<6) | ((point>>>6)&0x3f/*0b00111111*/),
            (0x2/*0b10*/<<6) | (point&0x3f/*0b00111111*/)
        );
    }
    window["btoaUTF8"] = function(inputString, BOMit){
        return originalBtoa((BOMit ? "\xEF\xBB\xBF" : "") + inputString.replace(
            /[\x80-\uD7ff\uDC00-\uFFFF]|[\uD800-\uDBFF][\uDC00-\uDFFF]?/g, btoaReplacer
        ));
    }
    //////////////////////////////////////////////////////////////////////////////////////
    function atobReplacer(encoded){
        var codePoint = encoded.charCodeAt(0) << 24;
        var leadingOnes = clz32(~codePoint);
        var endPos = 0, stringLen = encoded.length;
        var result = "";
        if (leadingOnes < 5 && stringLen >= leadingOnes) {
            codePoint = (codePoint<<leadingOnes)>>>(24+leadingOnes);
            for (endPos = 1; endPos < leadingOnes; ++endPos)
                codePoint = (codePoint<<6) | (encoded.charCodeAt(endPos)&0x3f/*0b00111111*/);
            if (codePoint <= 0xFFFF) { // BMP code point
            result += fromCharCode(codePoint);
            } else if (codePoint <= 0x10FFFF) {
            // https://mathiasbynens.be/notes/javascript-encoding#surrogate-formulae
            codePoint -= 0x10000;
            result += fromCharCode(
                (codePoint >> 10) + 0xD800,  // highSurrogate
                (codePoint & 0x3ff) + 0xDC00 // lowSurrogate
            );
            } else endPos = 0; // to fill it in with INVALIDs
        }
        for (; endPos < stringLen; ++endPos) result += "\ufffd"; // replacement character
        return result;
    }
    window["atobUTF8"] = function(inputString, keepBOM){
        if (!keepBOM && inputString.substring(0,3) === "\xEF\xBB\xBF")
            inputString = inputString.substring(3); // eradicate UTF-8 BOM
        // 0xc0 => 0b11000000; 0xff => 0b11111111; 0xc0-0xff => 0b11xxxxxx
        // 0x80 => 0b10000000; 0xbf => 0b10111111; 0x80-0xbf => 0b10xxxxxx
        return originalAtob(inputString).replace(/[\xc0-\xff][\x80-\xbf]*/g, atobReplacer);
    };
})(typeof global == "" + void 0 ? typeof self == "" + void 0 ? this : self : global);

This solution may look scary big, but no need to worry too much: after minification, the uncompressed raw file is scarcely over a kilobyte in size. This solution is very standard and very widely applicable because it can be used to encode high code point characters in a way that the browser can understand it as seen on the demo page here. The full github repository is AnonyCo's BestBase64EncoderDecoder.

Solution #2 – escaping the string before encoding it

function b64EncodeUnicode(str) {
    // first we use encodeURIComponent to get percent-encoded UTF-8,
    // then we convert the percent encodings into raw bytes which
    // can be fed into btoa.
    return btoa(encodeURIComponent(str).replace(/%([0-9A-F]{2})/g,
        function toSolidBytes(match, p1) {
            return String.fromCharCode('0x' + p1);
    }));
}

b64EncodeUnicode('✓ à la mode'); // "4pyTIMOgIGxhIG1vZGU="
b64EncodeUnicode('\n'); // "Cg=="

To decode the Base64-encoded value back into a String:

function b64DecodeUnicode(str) {
    // Going backwards: from bytestream, to percent-encoding, to original string.
    return decodeURIComponent(atob(str).split('').map(function(c) {
        return '%' + ('00' + c.charCodeAt(0).toString(16)).slice(-2);
    }).join(''));
}

b64DecodeUnicode('4pyTIMOgIGxhIG1vZGU='); // "✓ à la mode"
b64DecodeUnicode('Cg=='); // "\n"

Unibabel implements common conversions using this strategy.

Solution #3 – rewrite the DOMs atob() and btoa() using JavaScript's TypedArrays and UTF-8

Use a TextEncoder polyfill such as TextEncoding (also includes legacy windows, mac, and ISO encodings), TextEncoderLite, combined with a Buffer and a Base64 implementation such as base64-js.

When a native TextEncoder implementation is not available, the most light-weight solution would be to use TextEncoderLite with base64-js. Use the browser implementation when you can.

The following function implements such a strategy. It assumes base64-js imported as <script type="text/javascript" src="base64js.min.js"/>. Note that TextEncoderLite only works with UTF-8.

function Base64Encode(str, encoding = 'utf-8') {
    var bytes = new (TextEncoder || TextEncoderLite)(encoding).encode(str);        
    return base64js.fromByteArray(bytes);
}

function Base64Decode(str, encoding = 'utf-8') {
    var bytes = base64js.toByteArray(str);
    return new (TextDecoder || TextDecoderLite)(encoding).decode(bytes);
}

In some cases, the above conversion to UTF-8 and then to Base64 will not be very space efficient. UTF-8 produces longer output than UTF-16 when the text contains a large percentage of characters in the range U+0800-U+FFFF, which are encoded with three bytes in UTF-8 but two in UTF-16. In the case where the JavaScript string contains evenly-distributed UTF-16 code points, one might consider encoding to UTF-16 instead of UTF-8 before the conversion to Base64, for a 40% reduction in size.

© 2005–2018 Mozilla Developer Network and individual contributors.
Licensed under the Creative Commons Attribution-ShareAlike License v2.5 or later.
https://developer.mozilla.org/en-US/docs/Web/API/WindowBase64/Base64_encoding_and_decoding