Python Encoding Conversion and Chinese Text Handling

Published: 2019-09-15 10:03:51 | Editor: auto | Views: 1955

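Unicode handling in Python (Python 2 in particular) is confusing and hard to get right. UTF-8 is one way of encoding Unicode as bytes, while Unicode, GBK, and GB2312 are coded character sets.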

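decode interprets an ordinary (byte) string according to the encoding named in its argument and produces the corresponding unicode object. encode goes the other way: it turns a unicode object into a byte string in the named encoding.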
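To make the decode/encode relationship concrete, here is a minimal sketch. It is not from the original post: the variable names are made up for illustration, and it is written so that it should run on Python 3 as well as Python 2.7 (where the text type is called unicode and the byte type str).

```python
# -*- coding: utf-8 -*-
# Illustrative sketch only; the names below are hypothetical.
text = u'天气'                         # text (unicode) object, 2 characters

utf8_bytes = text.encode('utf-8')     # text -> bytes, UTF-8 encoding
gbk_bytes = text.encode('gbk')        # text -> bytes, GBK encoding

# The same text has a different byte representation in each encoding:
# 2 characters, 6 UTF-8 bytes, 4 GBK bytes.
print(len(text), len(utf8_bytes), len(gbk_bytes))

# decode reverses encode, provided the same encoding name is used.
from_utf8 = utf8_bytes.decode('utf-8')
from_gbk = gbk_bytes.decode('gbk')
print(from_utf8 == text, from_gbk == text)
```

Decoding with the wrong encoding name is what produces either an exception or garbled output, which is the theme of the rest of this post.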

A Chinese encoding problem I ran into while writing Python:

➜  /test sudo vim test.py
#!/usr/bin/python
# -*- coding: utf-8 -*-
def weather():
    import time
    import re
    import urllib2
    import itchat
    # Pretend to be a browser
    headers = "User-Agent","Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36"
    url = "https://tianqi.moji.com/weather/china/guangdong/shantou"    # weather forecast page to scrape
    par = '(<meta name="description" content=")(.*?)(">)'    # regex that captures the content we want
    # Create an opener and install it globally
    opener = urllib2.build_opener()
    opener.addheaders = [headers]
    urllib2.install_opener(opener)
    # Fetch the page and decode the bytes into a unicode object
    html = urllib2.urlopen(url).read().decode("utf-8")
    # Extract the content we want
    data = re.search(par, html).group(2)
    print type(data)
    # Note: this has no effect, because the encoded result is discarded
    data.encode('gb2312')
    # b is a byte string (str) holding UTF-8 bytes
    b = '天气预报'
    print type(b)
    # str + unicode: Python 2 implicitly decodes b as ASCII, which fails here
    c = b + '\n' + data
    print c
weather()

➜  /test sudo python test.py
<type 'unicode'>
<type 'str'>
Traceback (most recent call last):
  File "test.py", line 30, in <module>
    weather()
  File "test.py", line 28, in weather
    c = b + '\n' + data
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe5 in position 0: ordinal not in range(128)
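The traceback comes from Python 2 silently decoding the byte string with the default ASCII codec before concatenating it with the unicode object. The sketch below is an illustration added here, not the post's own code: it performs that decode step explicitly so the same error can be observed on Python 2.7 or Python 3. The names label and message are hypothetical.

```python
# -*- coding: utf-8 -*-
# Illustrative sketch: make Python 2's implicit ASCII decode explicit.
label = u'天气预报'.encode('utf-8')    # UTF-8 bytes, like the str literal in test.py

# The first byte is outside the ASCII range (0-127).
first_byte = hex(bytearray(label)[0])
print(first_byte)

try:
    # Python 2 does the equivalent of this when evaluating str + unicode.
    label.decode('ascii')
    message = None
except UnicodeDecodeError as exc:
    message = str(exc)

print(message)
```

The reported byte and position match the traceback above: the very first byte of the UTF-8 text cannot be interpreted as ASCII.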

Solution:

➜  /test sudo vim test.py
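#!/usr/bin/python
# -*- coding: utf-8 -*-
import sys
reload(sys)
# Since Python 2.5, sys.setdefaultencoding is deleted once the interpreter
# has started, so sys must be reloaded to get it back (Python 2 only)
sys.setdefaultencoding('utf-8')
def weather():
    import time
    import re
    import urllib2
    import itchat
    # Pretend to be a browser
    headers = "User-Agent","Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36"
    url = "https://tianqi.moji.com/weather/china/guangdong/shantou"    # weather forecast page to scrape
    par = '(<meta name="description" content=")(.*?)(">)'    # regex that captures the content we want
    # Create an opener and install it globally
    opener = urllib2.build_opener()
    opener.addheaders = [headers]
    urllib2.install_opener(opener)
    # Fetch the page and decode the bytes into a unicode object
    html = urllib2.urlopen(url).read().decode("utf-8")
    # Extract the content we want
    data = re.search(par, html).group(2)
    print type(data)
    # Note: this has no effect, because the encoded result is discarded
    data.encode('gb2312')
    # b is a byte string (str) holding UTF-8 bytes
    b = '天气预报'
    print type(b)
    # The implicit decode of b now uses UTF-8 instead of ASCII, so this works
    c = b + '\n' + data
    print c
weather()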

Running it again:

➜  /test sudo python test.py
<type 'unicode'>
<type 'str'>

天气预报

汕头市今天实况：20度多云，湿度：57%，东风：2级。白天：20度,多云。夜间：晴，13度，天气偏凉了，墨迹天气建议您穿上厚些的外套或是保暖的羊毛衫，年老体弱者可以选择保暖的摇粒绒外套。

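In my experience, the "universal fixes" for garbled Chinese text that circulate online do not hold up, because the right fix depends on which types and encodings are involved. I ran into exactly this recently: I tried many suggestions from the web without success, and only after reading up on the subject myself did I realize that the remedy has to match the cause.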
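One way to match the remedy to the cause, offered here as an alternative sketch rather than as the method used in this post, is to convert every byte string to text explicitly with the encoding it is known to be in, and only then combine values. The helper to_text and the sample values are hypothetical; the code should run on Python 2.7 and Python 3.

```python
# -*- coding: utf-8 -*-
# Illustrative sketch: normalize to text explicitly instead of relying on
# a process-wide default encoding. to_text is a hypothetical helper.

def to_text(value, encoding='utf-8'):
    """Return value as text, decoding it first if it is a byte string."""
    if isinstance(value, bytes):      # bytes is an alias of str on Python 2
        return value.decode(encoding)
    return value

label = u'天气预报'.encode('utf-8')    # byte string, known to be UTF-8
data = u'多云'                         # already text

combined = to_text(label) + u'\n' + to_text(data)
print(combined)

# A byte string in another encoding needs that encoding's name.
gbk_word = u'晴'.encode('gbk')
print(to_text(gbk_word, 'gbk'))
```

The design choice is that each decode states its encoding at the point where the bytes enter the program, so nothing depends on interpreter-wide state.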

Here is a Python script that fetches the source of a web page:

➜  /test sudo cat file.py
#!/usr/bin/python
#_*_ coding:UTF-8 _*_
import urllib,urllib2
import re
url = 'http://sports.sohu.com/nba.shtml' # URL to fetch
par = '20180125.*\">(.*?)</a></li>'
req = urllib2.Request(url)
response = urllib2.urlopen(req).read()
#response = unicode(response,'GBK').encode('UTF-8')
print type(response)
print response

The problem:

When fetching a Chinese-language web page, the Chinese text comes out garbled when printed:

➜  /test sudo python file.py
special.wait({
itemspaceid : 99999,
form:"bigView",
adsrc : 200,
order : 1,
max_turn : 1,
spec :{
onBeforeRender: function(){
},
onAfterRender: function(){
},
isCloseBtn:true//�Ƿ��йرհ�ť
}
});

Solution:

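Viewing the page source shows that it declares charset=GBK, so the bytes have to be converted in Python: decode them as GBK into unicode, then encode as UTF-8 for output.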

➜  /test sudo cat file.py
#!/usr/bin/python
#_*_ coding:UTF-8 _*_
import urllib,urllib2
import re
url = 'http://sports.sohu.com/nba.shtml' # URL to fetch
par = '20180125.*\">(.*?)</a></li>'
req = urllib2.Request(url)
response = urllib2.urlopen(req).read()
response = unicode(response,'GBK').encode('UTF-8')
print type(response)
print response

➜  /test sudo python file.py
special.wait({
itemspaceid : 99999,
form:"bigView",
adsrc : 200,
order : 1,
max_turn : 1,
spec :{
onBeforeRender: function(){
},
onAfterRender: function(){
},
isCloseBtn:true//是否有关闭按钮
}
});

The garbled Chinese text is now fixed.
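The fix above hard-codes GBK after reading the page source by eye. As a rough sketch of automating that step, the code below looks for a charset declaration in the raw bytes and decodes with it. This is an assumption-laden illustration, not part of the original script: sniff_charset, decode_page, and the sample markup are all hypothetical, and real pages may also declare the charset in the HTTP Content-Type header, which is not handled here.

```python
# -*- coding: utf-8 -*-
# Illustrative sketch: pick the decoder from the page's own charset declaration.
import re

# Matches e.g. charset=GBK or charset="utf-8" in the raw bytes.
CHARSET_RE = re.compile(br'''charset\s*=\s*["']?([\w-]+)''', re.IGNORECASE)

def sniff_charset(raw, default='utf-8'):
    """Return the charset named near the top of the page, or the default."""
    match = CHARSET_RE.search(raw[:2048])
    if match:
        return match.group(1).decode('ascii')
    return default

def decode_page(raw, default='utf-8'):
    """Decode raw page bytes to text using the declared charset."""
    charset = sniff_charset(raw, default)
    try:
        return raw.decode(charset, 'replace')
    except LookupError:               # unknown charset name
        return raw.decode(default, 'replace')

# Stand-in for urlopen(...).read(): a GBK-encoded fragment.
sample = u'<meta charset="GBK"><p>是否有关闭按钮</p>'.encode('gbk')
page = decode_page(sample)
print(sniff_charset(sample))
print(page)
```

With the text decoded this way, it can be encoded to whatever the terminal or output file expects, as the corrected file.py does with UTF-8.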

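To print a dict or list that contains Chinese as readable text rather than \uXXXX escapes, pass ensure_ascii=False to json.dumps (the encoding argument shown here exists only in Python 2):

import json

# Print a dict
person = {'name': '张三'}
print json.dumps(person, encoding="UTF-8", ensure_ascii=False)    # prints {"name": "张三"}

# Print a list
people = [{'name': '张三'}]
print json.dumps(people, encoding="UTF-8", ensure_ascii=False)    # prints [{"name": "张三"}]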
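For comparison, here is a small sketch of the same idea that should also run on Python 3, where json.dumps does not accept an encoding argument. It is an added illustration with hypothetical variable names, showing the effect of ensure_ascii on the output.

```python
# -*- coding: utf-8 -*-
# Illustrative sketch: effect of ensure_ascii on json.dumps output.
import json

person = {u'name': u'张三'}

escaped = json.dumps(person)                        # default: non-ASCII is \u-escaped
readable = json.dumps(person, ensure_ascii=False)   # keeps the characters as-is

print(escaped)
print(readable)

# Both strings are valid JSON for the same data.
same_data = json.loads(escaped) == json.loads(readable)
print(same_data)
```

Either form parses back to the same data; ensure_ascii=False only matters for how the JSON text reads to a person or to a tool that expects raw UTF-8.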
