Boilard: Reading UTF8 encoded CSV and converting to Unicode

I'm reading in a CSV file that has UTF8 encoding:
ifile = open(fname, "r")
for row in csv.reader(ifile):
    name = row[0]
    print repr(row[0])
This works fine and prints what I expect, a UTF8 encoded str:
> '\xc3\x81lvaro Salazar'
> '\xc3\x89lodie Yung'
...
Furthermore, when I simply print the str (as opposed to its repr()), the
output displays correctly, which I don't understand either. Shouldn't this
cause an error?
> Álvaro Salazar
> Élodie Yung
but when I try to convert my UTF8 encoded strs to unicode:
ifile = open(fname, "r")
for row in csv.reader(ifile):
    name = row[0]
    print unicode(name, 'utf-8')  # or name.decode('utf-8')
I get the infamous:
Traceback (most recent call last):
  File "scripts/script.py", line 33, in <module>
    print unicode(fullname, 'utf-8')
UnicodeEncodeError: 'ascii' codec can't encode character u'\xc1' in position 0: ordinal not in range(128)
So I looked at the unicode strings that are created:
ifile = open(fname, "r")
for row in csv.reader(ifile):
    name = row[0]
    unicode_name = unicode(name, 'utf-8')
    print repr(unicode_name)
and the output is
> u'\xc1lvaro Salazar'
> u'\xc9lodie Yung'
So now I'm totally confused, as these seem to be mangled hex values. I've
read this question:
Reading a UTF8 CSV file with Python
and it appears I am doing everything correctly, which leads me to believe
that my file is not actually UTF8. But when I initially print out the repr
values of the cells, they appear to be correct UTF8 hex values. Can anyone
either point out my problem or indicate where my understanding is breaking
down? (I'm starting to get lost in the jungle of encodings.)
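
A small check may help separate the two representations compared above. This is only a sketch, written in Python 3 syntax (an assumption; the code in this post is Python 2), and the variable names are illustrative. It decodes the UTF8 bytes shown by repr() and inspects the first character: the decoded value holds a code point, while the original str holds that code point's UTF8 byte sequence. It also encodes the text with the 'ascii' codec, which is likely what Python 2's print attempts when stdout reports no usable encoding.

```python
# Sketch in Python 3 syntax (the post itself uses Python 2).
raw = b'\xc3\x81lvaro Salazar'   # UTF8 bytes, as shown by repr() in the post

text = raw.decode('utf-8')       # bytes -> text made of code points
first = text[0]

print(hex(ord(first)))           # code point of the first character
print(first.encode('utf-8'))     # the UTF8 byte sequence for that code point
print(len(raw), len(text))       # number of bytes vs. number of characters

# Encoding non-ASCII text with the 'ascii' codec raises UnicodeEncodeError,
# the same exception type that appears in the traceback above.
try:
    text.encode('ascii')
    ascii_ok = True
except UnicodeEncodeError as exc:
    ascii_ok = False
    print(exc)
```

If this holds, u'\xc1' is the code point U+00C1 rather than a damaged byte, and the error would come from printing the unicode object, not from decoding it.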

As an aside, I believe I could use codecs to open the file and read it
directly into unicode objects, but the csv module doesn't support unicode
natively, so I can't use this approach.
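
For comparison, here is a minimal sketch of reading the same kind of file as text. It assumes Python 3, where the csv module accepts text files directly; that is not the environment used in this post. The function name read_utf8_rows and the sample data are hypothetical. In Python 2, the usual workaround is instead to let csv parse the byte strings and call .decode('utf-8') on each cell afterward.

```python
import csv
import os
import tempfile


def read_utf8_rows(path):
    """Yield each CSV row as a list of text strings decoded from UTF8."""
    # newline='' lets the csv module handle line endings itself.
    with open(path, 'r', encoding='utf-8', newline='') as ifile:
        for row in csv.reader(ifile):
            yield row


# Build a small sample file so the sketch runs on its own.
sample = '\u00c1lvaro Salazar,1\n\u00c9lodie Yung,2\n'
fd, fname = tempfile.mkstemp(suffix='.csv')
with os.fdopen(fd, 'wb') as out:
    out.write(sample.encode('utf-8'))

names = [row[0] for row in read_utf8_rows(fname)]
os.remove(fname)
print(names)
```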

Boilard

Wednesday, 28 August 2013
