Fixing UnicodeDecodeError: 'cp950' codec can't decode in Python

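When I process data in Python, reading a file that contains Chinese text sometimes fails with UnicodeDecodeError: 'cp950' codec can't decode byte 0xe8 in position 7: illegal multibyte sequence. Every time it happens I end up searching for the fix again because I can never remember it, so I'm just writing it down!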

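I was sure the data file was encoded in UTF-8, so why was Python complaining about cp950? It turns out the problem is not the data file but the encoding Python picks by default: when open() is called in text mode without an encoding argument, it uses the system locale's encoding, which is cp950 (Big5) on Traditional Chinese Windows. So all we need to do is tell Python that the file we are reading is UTF-8.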
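To see the mismatch without touching any file, here is a minimal sketch. It prints the encoding that open() would fall back to on the current machine, then checks whether the UTF-8 bytes of one Chinese character can be decoded as cp950. The helper name decodes_as and the sample character are made up for illustration; they are not part of the original script.

```python
import locale

# The encoding open() falls back to in text mode when encoding= is omitted.
# On Traditional Chinese Windows this is typically 'cp950'.
default_encoding = locale.getpreferredencoding(False)
print(default_encoding)


def decodes_as(data: bytes, encoding: str) -> bool:
    """Return True if data can be decoded with the given encoding."""
    try:
        data.decode(encoding)
    except UnicodeDecodeError:
        return False
    return True


# Hypothetical sample: one Chinese character stored as UTF-8 (3 bytes).
utf8_bytes = "中".encode("utf-8")

print(decodes_as(utf8_bytes, "utf-8"))  # the bytes are valid UTF-8
print(decodes_as(utf8_bytes, "cp950"))  # the same bytes are not valid cp950
```

If the first print shows cp950 on your machine, every open() call without encoding= will try to read files as cp950, no matter how the file was actually saved.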

Example error output

Traceback (most recent call last):
  File ".\main.py", line 4, in <module>
    line = f.read()
UnicodeDecodeError: 'cp950' codec can't decode byte 0xe8 in position 7: illegal multibyte sequence

Code that triggers the error

import re

f = open("data1109.txt", "r")  # note this line: no encoding given, so the locale default (cp950) is used
line = f.read()

arr = re.findall(r"[0-9]+:[0-9]+\s", line)
print(len(arr))

Fixed code

Adding encoding="utf-8" to the open() call fixes it.

import re

# Pass the encoding explicitly; "with" also closes the file when done
with open("data1109.txt", "r", encoding="utf-8") as f:  # note this line
    line = f.read()

arr = re.findall(r"[0-9]+:[0-9]+\s", line)
print(len(arr))
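The same fix can be wrapped in a small reusable function. This is only a sketch: the function name count_time_tokens and the sample text are invented for illustration, and a temporary file stands in for data1109.txt so the snippet runs anywhere.

```python
import re
import tempfile
from pathlib import Path


def count_time_tokens(path, encoding="utf-8"):
    """Count matches like '12:30 ' in a text file read with an explicit encoding."""
    # read_text() opens, reads, and closes the file in one call.
    text = Path(path).read_text(encoding=encoding)
    return len(re.findall(r"[0-9]+:[0-9]+\s", text))


# Demo with a throwaway file (hypothetical contents, not the original data).
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.txt"
    sample.write_text("會議 09:30 開始\n午餐 12:00 結束\n", encoding="utf-8")
    print(count_time_tokens(sample))  # prints 2
```

Keeping encoding as a parameter with a UTF-8 default means a file that really is saved as Big5 can still be read by passing encoding="cp950".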