viernes, 19 de junio de 2009

Export to CSV in Unicode format so Excel can read Japanese characters in the right way

To make this history short, here is the problem I had. The Client wants to export the data from a report (basically a table) in csv format. This data could contain Unicode characters (basically Japanese characters). This file was intended to be used to move the data to other programs like excel or others.

Well to solve that requirement i made this method, that basically receive the data ready to be put in the csv format and save it to the response.

protected void ExportToCSV(object sender, EventArgs e)
 {
    if (String.IsNullOrEmpty(CSVData.Value)) return;

    var strFileNameexport = FileNameHidden.Value.Replace(" ", "_");
    Response.Clear();
    Response.ClearContent();
    Response.Buffer = true;
    Response.ContentType = "text/comma-separated-values";

    Response.ContentEncoding = Encoding.UTF8;
    Response.Charset = Encoding.UTF8.HeaderName;
    var data = CSVData.Value;    

    Response.AddHeader("Content-Disposition", string.Format("attachment;Filename={0}.csv", strFileNameexport));

    Response.Write(data);
    Response.End();
}

As you can see as far to this point everything seemed to be working fine. But suddenly when we try to open a exported file that contains Japanese characters, Excel was not able to display those characters. It replace those characters with garbage

So to Solve the Problem we...

  • found that excel expects the csv file to be in "utf-16le" (UCS-2 Little Endian) (weird, really weird).
  • when you put the data in this format, the "," character is not longer a valid column separator (even more weird, weird!!!!) so you need to use other character like "\t" (I know, I know... if you use \t as a separator the mime type should tab separated value files, but Excel seems to don't complain about it.
  • It is also recomendable that you enclose the data for each cell using quotes, this way the values inside won't be interpreted wrong if a separator character is present there.
  • then realize that we were receiving the data in utf-8 format so we first need to convert the bytes from utf-8 to utf-16
  • And the most important note: YOU NEED TO PUT THE MARKER BYTES SO THE PROGRAMS COULD RECOGNIZE THE ORDER OF THE BYTES IN THE FILE TO KNOW IF THEY ARE IN LITTLE ENDIAN OR IN BIG ENDIAN you can found a lot of information about it in the internet. but no one really tells you that this is necessary to make the file recognizeable by Excel as a csv file.
private static byte[] GetUCSStringFromData(string value)
{
    if (String.IsNullOrEmpty(value)) return new byte[] { };

    value = value.Replace("\",\"", "\"\t\"");

    var utf8Encoding = Encoding.UTF8;
    var utfBytes = utf8Encoding.GetBytes(value);

    var ucs2Encoding = Encoding.GetEncoding("utf-16le");
    var ucs2Bytes = Encoding.Convert(utf8Encoding, ucs2Encoding, utfBytes);

    var bytesFinal = new List {0xff, 0xfe};
    foreach (var b in ucs2Bytes)
    {
        bytesFinal.Add(b);
    }

    return bytesFinal.ToArray(); // ucs2Encoding.GetString(ucs2Bytes);

}

protected void ExportToCSVForExcel(object sender, EventArgs e)
{
    if (String.IsNullOrEmpty(CSVData.Value)) return;

    var strFileNameexport = FileNameHidden.Value.Replace(" ", "_");
    Response.Clear();
    Response.ClearContent();
    Response.Buffer = true;
    Response.ContentType = "text/comma-separated-values";

    //Those lines makes it possible to work with UCS2
    Response.ContentEncoding = Encoding.GetEncoding("utf-16le");
    Response.Charset = Response.ContentEncoding.HeaderName;

    var data = GetUCSStringFromData(CSVData.Value);

    Response.AddHeader("Content-Disposition", string.Format("attachment;Filename={0}.csv", strFileNameexport));
    
    Response.BinaryWrite(data);
    Response.End();
} 

So never forget it!