Preserving Character Encodings of a DSpace Metadata Export using MS Excel 2011 on OS X

Stencil Alphabet The problem I recently ran into was updating the metadata for a particular collection that was being moved from TDL’s repository into A&M’s repository. I able to quickly move the collection into the new repository using OAI-PMH harvesting with ORE support. However, the metadata needed a bit of cleaning up for it’s new repository home, such as changing to and inconsistent formats used in other fields. This is a perfect task for Stuart’s Bulk Metadata Export tool. This DSpace feature allows an administrator to download a Comma Separate Values (CSV) file of all the metadata in a particular collection, then open it up in MS Excel and edit the metadata naturally. Finally once the metadata is ready to go you can upload it back to the repository and all the fields will be updated correctly. It is a very nice feature that can save a ton of time.

The Problem

When I opened the file in Excel some of the characters were not showing up correctly. Specifically characters in titles and names which used non-English marks, in this case there were all from the extended Latin character set. If you ignore these problems, later when you try to upload the CSV file DSpace will pick up on these changes and cause the garbled characters to be introduced into the repository.

DSpace uses UTF-8 file encoding for everything as does almost all well-behaving application out there. Somewhere between downloading the file and opening it with Excel the encoding is being mis-interpreted. Excel has a CSV import tool where you can specify the character encoding but that tool does not work with DSpace’s export because the columns become miss-aligned if there are any new lines or charge returns in the metadata.

The Solution

The solution to this problem turned out to be using a command line tool I was not familiar with: “iconv“. It’s apparently a standard Linux command that is available with the default install of OS X. After some experimentation I found that when you “double click” open a CSV file Excel wants the file to be encoded using the default encoding found on Windows machines: Windows-1252. This encoding is very similar to ISO-8859-1 but has some key differences. Use the iconv command to convert the file into this encoding and you should be able to open the file with Excel OS X with just a double click. All the characters should be preserved.

iconv -f UTF-8 -t WINDOWS-1252 original.csv > excel.csv

However, you can’t upload this encoding back to DSpace because it won’t know what to do with it. DSpace expects it to be in the standard UTF-8 encoding. Once you are finished editing the file to your satisfaction in Excel you will need to convert it to UTF-8. When you save the file in Excel you will have two valid formating options you may select from the entire list shown below: “Comma Separated Values (.csv)” or “Windows Comma Separated (.csv)“. The key difference between these two formats is what character encoding Excel will use when saving the file.

If you went with the default Comma Separated Values format then you will need to run following command to convert the file back to UTF-8.

iconv -f MAC -t UTF-8 excel.csv > upload.csv

However if you chose the other valid option and saved the file as a “Windows Comma Separated (.csv)” then you will need to use the alternate command to convert the file back to UTF-8:

iconv -f WINDOWS-1252 -t UTF-8 excel.csv > upload.csv

You are ready to upload the resulting “upload.csv” file to DSpace or use command line tool for bulk editing.

Tags: , , , ,

Leave a Reply