
Garbled CSV output: how do I read a CSV file that contains � characters?


I have a CSV file (in French) whose lines of text look like this:

"Vend, 21 sept, 2018", "43326370894332743328177832888443325333815370", "NX", "651-2141652-1309NON666-3778692-2229581-300-6525622-9439NON581-998-8765827-3937STOPNON653-2541Toronto", "RoyRoyHoudeOuelletFecteauRenaudBergeronLeclercBadeaux", "Louise-AndréeAndréRichardAlexandraPaulineElianeCharles-EugèneGuyJacqueline", "Vendredi, 21 septembre, 2018", "", "", "3", "37089", "", "100", "", "204-7584", "MIller ", "claudia", "8:30 pt ne s'est pas présenté (gastro) veut un autre rdv", "370892192018", "581-309-1309660-3064fille254-6560cel650-4556"

I read it in Python with the following code:

import csv
filepath = 'RDV.csv'
try:
    with open(filepath, 'rU') as file:
        try:
            reader = csv.reader(x.replace('\0', '') for x in file)
            for row in reader:
                try:
                    print(row)
                except Exception as ee:
                    print ee
        except Exception as eee:
            print eee
except Exception as e:
    print e

The output is as follows:

['Vend, 21 sept, 2018', '43326\x1d\x1d37089\x1d\x1d43327\x1d43328\x1d17783\x1d28884\x1d\x1d\x1d43325\x1d33381\x1d\x1d\x1d\x1d\x1d\x1d\x1d\x1d\x1d\x1d\x1d\x1d\x1d\x1d\x1d\x1d5370', '\x1d\x1d\x1d\x1dNX', '651-2141\x1d\x1d652-1309\x1dNON\x1d666-3778\x1d692-2229\x1d581-300-6525\x1d622-9439\x1d\x1dNON\x1d581-998-8765\x1d827-3937\x1dSTOP\x1d\x1d\x1d\x1d\x1d\x1d\x1d\x1d\x1d\x1d\x1d\x1d\x1d\x1dNON\x1d653-2541\x1d\x1d\x1dToronto', 'Roy\x1d\x1dRoy\x1d\x1dHoude\x1dOuellet\x1dFecteau\x1dRenaud\x1d\x1d\x1dBergeron\x1dLeclerc\x1d\x1d\x1d\x1d\x1d\x1d\x1d\x1d\x1d\x1d\x1d\x1d\x1d\x1d\x1d\x1dBadeaux', 'Louise-Andr\x8ee\x1d\x1dAndr\x8e\x1d\x1dRichard\x1dAlexandra\x1dPauline\x1dEliane\x1d\x1d\x1dCharles-Eug\x8fne\x1dGuy\x1d\x1d\x1d\x1d\x1d\x1d\x1d\x1d\x1d\x1d\x1d\x1d\x1d\x1d\x1d\x1dJacqueline', 'Vendredi, 21 septembre, 2018', '', '', '3', '37089', '', '100', '', '204-7584', 'MIller ', 'claudia', "8:30 pt ne s'est pas pr\x8esent\x8e (gastro) veut un autre rdv\x0b", '370892192018', '\x1d\x1d581-309-1309\x1d\x1d\x1d\x1d660-3064fille\x1d254-6560cel\x1d\x1d\x1d\x1d\x1d\x1d\x1d\x1d\x1d\x1d\x1d\x1d\x1d\x1d\x1d\x1d\x1d\x1d\x1d\x1d650-4556']
  1. How can I read this as plain text instead of those escaped characters?
  2. How can I find the character � inside a value, for example:

    Louise-Andr�eAndr�RichardAlexandraPaulineElianeCharles-Eug�neGuyJacqueline

Edit:

I tried the code from snakecharmerb's answer, but got the following error:

Traceback (most recent call last):
  File "<input>", line 20, in <module>
  File "<input>", line 9, in unicode_csv_reader
  File "<input>", line 15, in utf_8_encoder
  File "/Users/simran/Documents/abc/venv/lib/python2.7/codecs.py", line 701, in next
    return self.reader.next()
  File "/Users/simran/Documents/abc/venv/lib/python2.7/codecs.py", line 632, in next
    line = self.readline()
  File "/Users/simran/Documents/abc/venv/lib/python2.7/codecs.py", line 547, in readline
    data = self.read(readsize, firstline=True)
  File "/Users/simran/Documents/abc/venv/lib/python2.7/codecs.py", line 494, in read
    newchars, decodedbytes = self.decode(data, self.errors)
  File "/Users/simran/Documents/abc/venv/lib/python2.7/encodings/utf_16.py", line 112, in decode
    raise UnicodeError, "UTF-16 stream does not start with BOM"
UnicodeError: UTF-16 stream does not start with BOM

#1


The file is probably encoded as UTF-16.

>>> s = '"Vend, 21 sept, 2018", "43326370894332743328177832888443325333815370", "NX", "651-2141652-1309NON666-3778692-2229581-300-6525622-9439NON581-998-8765827-3937STOPNON653-2541Toronto", "RoyRoyHoudeOuelletFecteauRenaudBergeronLeclercBadeaux", "Louise-AndréeAndréRichardAlexandraPaulineElianeCharles-EugèneGuyJacqueline", "Vendredi, 21 septembre, 2018", "", "", "3", "37089", "", "100", "", "204-7584", "MIller ", "claudia", "8:30 pt ne s\'est pas présenté (gastro) veut un autre rdv", "370892192018", "581-309-1309660-3064fille254-6560cel650-4556"'
>>> import csv, io
>>> buf = io.BytesIO(s.decode('utf-8').encode('utf-16'))
>>> next(csv.reader(buf))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
_csv.Error: line contains NULL byte

Python 2's csv module does not handle UTF-16, and apparently neither does the unicodecsv package. However, we can adapt the unicode_csv_reader example from the documentation:

import codecs
import csv 


def unicode_csv_reader(unicode_csv_data, dialect=csv.excel, **kwargs):
    # csv.py doesn't do Unicode; encode temporarily as UTF-8:
    csv_reader = csv.reader(utf_8_encoder(unicode_csv_data), dialect=dialect, **kwargs)
    for row in csv_reader:
        # decode UTF-8 back to Unicode, cell by cell:
        yield [unicode(cell, 'utf-8') for cell in row]


def utf_8_encoder(unicode_csv_data):
    for line in unicode_csv_data:
        yield line.encode('utf-8')


with codecs.open('french2.csv', 'rU', encoding='utf-16') as f:
    for row in unicode_csv_reader(f):
        for cell in row:
            print cell

The code produces the following output (each cell is printed on its own line only to show the accented characters):

Vend, 21 sept, 2018
43326370894332743328177832888443325333815370
NX
651-2141652-1309NON666-3778692-2229581-300-6525622-9439NON581-998-8765827-3937STOPNON653-2541Toronto
RoyRoyHoudeOuelletFecteauRenaudBergeronLeclercBadeaux
Louise-AndréeAndréRichardAlexandraPaulineElianeCharles-EugèneGuyJacqueline
Vendredi, 21 septembre, 2018


3
37089

100

204-7584
MIller 
claudia
8:30 pt ne s'est pas présenté (gastro) veut un autre rdv
370892192018
581-309-1309660-3064fille254-6560cel650-4556

In Python 3 none of this is necessary; you can simply do:

with open(myfile, 'r', newline='', encoding='utf-16') as f:
    reader = csv.reader(f)
    for row in reader:
        ...

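Commentary

Identifying the encoding

There is no general solution; guessing an unknown encoding is a hard problem. In this case we know that the encoded text contains null bytes, and that removing the null bytes leaves hex escapes where we would expect accented European characters, while unaccented characters are unchanged. That is reasonable evidence that the file may be encoded as UTF-16. For characters in the ASCII range, the UTF-16 encoding effectively adds a null byte either before or after the ASCII character.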

>>> u = u'André'
>>> s = u.encode('utf-16-le')
>>> s
'A\x00n\x00d\x00r\x00\xe9\x00'

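UTF-16 can be big-endian or little-endian; the byte order determines whether the null byte comes before or after the ASCII character. The bytes may begin with a byte order mark (BOM) that indicates the byte order; in that case the encoding can be specified as UTF-16 and Python will select the correct byte order. If there is no BOM, you must specify utf-16-le or utf-16-be explicitly.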
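
As a rough illustration of the BOM rule, the following Python 3 sketch inspects some bytes and suggests a codec name. The helper name guess_utf16_codec and its null-byte fallback are assumptions made for this example, not part of the original answer, and the fallback only makes sense for text that is mostly ASCII.

```python
import codecs


def guess_utf16_codec(raw):
    """Suggest a UTF-16 codec name for raw bytes, or None if undecided.

    Hypothetical helper, for illustration only.
    """
    # A BOM settles the question: the generic 'utf-16' codec reads it
    # and picks the byte order itself.
    if raw.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return 'utf-16'
    # No BOM: for mostly-ASCII text, little-endian puts the null bytes at
    # odd offsets and big-endian puts them at even offsets.
    even_nulls = raw[0::2].count(b'\x00')
    odd_nulls = raw[1::2].count(b'\x00')
    if odd_nulls > even_nulls:
        return 'utf-16-le'
    if even_nulls > odd_nulls:
        return 'utf-16-be'
    return None


sample = 'André'
for codec in ('utf-16', 'utf-16-le', 'utf-16-be'):
    raw = sample.encode(codec)
    guess = guess_utf16_codec(raw)
    print(codec, '->', guess, '->', raw.decode(guess))
```

If the guess is None, the data is probably not UTF-16 at all and another encoding should be considered.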

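The character � (u'\ufffd') is the Unicode replacement character. It is shown in place of data that cannot be decoded or displayed in the chosen encoding (for example, when bytes are decoded with the errors argument set to 'replace', explicitly or implicitly).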

>>> print s
Andr�
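
To look for the replacement character inside parsed values, one option is to search each decoded cell for '\ufffd'. A minimal Python 3 sketch, assuming the rows have already been decoded to str; the function name cells_with_replacement_char and the sample rows are made up for illustration:

```python
REPLACEMENT_CHAR = '\ufffd'  # U+FFFD, rendered as �


def cells_with_replacement_char(rows):
    """Yield (row_index, column_index, cell) for every cell containing U+FFFD."""
    for r, row in enumerate(rows):
        for c, cell in enumerate(row):
            if REPLACEMENT_CHAR in cell:
                yield r, c, cell


# Made-up rows: the second cell was decoded with errors='replace',
# so its undecodable byte became U+FFFD.
bad_cell = b'Andr\x8e'.decode('utf-8', errors='replace')
rows = [['Roy', bad_cell], ['Houde', 'Richard']]

for r, c, cell in cells_with_replacement_char(rows):
    print(r, c, repr(cell))
```

Note that U+FFFD only appears in the data if something inserted it while decoding; the file itself contains whatever bytes the original encoding produced.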

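Reading the csv

Python 2's csv module does not cope well with non-ASCII encodings. To work around its limitations, the code above

  • decodes the file contents from utf-16 to unicode
  • re-encodes them as utf-8 (to avoid null bytes)
  • decodes the contents of each cell from utf-8 back to unicode

Once the contents have been returned to the program as unicode, they can be processed without problems until they are encoded again to be written to a file or printed.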
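
The reason for the UTF-8 detour can be seen at the byte level. A short Python 3 sketch (the sample string is only an illustration) showing that the UTF-16 bytes contain null bytes while the re-encoded UTF-8 bytes do not:

```python
text = 'André'

# Step 1: the bytes as they would appear in a UTF-16 (little-endian) file.
utf16_bytes = text.encode('utf-16-le')

# Step 2: decode to text, then re-encode as UTF-8.
utf8_bytes = utf16_bytes.decode('utf-16-le').encode('utf-8')

# Step 3: decode the UTF-8 bytes back to text.
round_tripped = utf8_bytes.decode('utf-8')

print(b'\x00' in utf16_bytes)   # null bytes are present in UTF-16
print(b'\x00' in utf8_bytes)    # and absent after re-encoding
print(round_tripped == text)    # the text itself is unchanged
```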

In Python 3, handling non-ASCII text is much simpler; this code does all the work:

import csv

with open('french.csv', newline='', encoding='utf-16') as f:
    reader = csv.reader(f)
    # Iterate over the csv reader, not the file object, so each row is parsed.
    for row in reader:
        print(row)
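
To try the Python 3 approach without the original file, the same decode-then-parse idea can be run against an in-memory buffer. This is only a sketch: the row below is a shortened, made-up stand-in for the real data.

```python
import csv
import io

# Build UTF-16 encoded CSV bytes in memory (stand-in for the real file).
text = '"Vend, 21 sept, 2018","Louise-Andrée","pt ne s\'est pas présenté"\r\n'
raw = text.encode('utf-16')  # the generic codec writes a BOM

# Decode the bytes to text first, then let csv parse the text.
with io.TextIOWrapper(io.BytesIO(raw), encoding='utf-16', newline='') as f:
    rows = list(csv.reader(f))

print(rows)
```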