数据提取

通过之前的学习，我们已经可以通过多种方式获取到数据了，可能是直接可以使用的 HTML，也可能是 JS、CSS 代码、多媒体资源等。

接着我们需要从中提取出对我们有价值的数据，这一步是文本预处理。提取的方式有根据网页节点结构、JS 语法、标签属性、纯文本规律等。

正则表达式：根据文字规律，解析最快

Xpath：根据网页节点路径，解析较快

CSS：根据网页的 css 项，可读性更强

Beatifulsoup：根据属性、节点、css ，解析最慢

parsel：根据正则、网页节点路径、属性、节点、css 提取，解析较快

当然，除了对数据除了藏在网页之中，也有可能藏在 json 中。介绍 2 个处理 json 类型的模块。

json：python 标准库，只兼容标准 json，速度快。

execjs：兼容性最好，并且支持转换后执行 js，但是速度较慢。

正则表达式

正则表达式作为多编程语言中的数据匹配工具，实用又简单，预计学习时长 8 小时。这里送上学习笔记和思维导图。

经典示例

import re

# findall
target = 'life is short, i learn python.'
result = re.findall('python', target)
result1 = re.findall('java', target)
# findall是re库的一个重要方法，第一个参数是匹配规则，第二个参数是要匹配的目标字符串，还有第三个参数，我们之后讲，findall返回的结果是一个列表。
# result这行代码的意思是从target中匹配'python',如果匹配到就返回，没有匹配到就返回空列表。
print(result)# 得到的结果是['python']
print(result1)# 得到的结果是[]


# 元字符
target = 'abc acc aec agc adc aic'
result = re.findall('a[de]c', target)
# 这一行中的[de]表示这个位置上的字符是d或者是e都可以匹配出来
print(result)# 得到的结果是['aec', 'adc']

result = re.findall('a[b‐z]c', target)
# 这一行中的[b‐z]表示这个位置上的字符在b‐z范围内都可以匹配出来
print(result)# 得到的结果是['abc', 'acc', 'aec', 'agc', 'adc', 'aic']

result = re.findall('a[^c‐z]c', target)
# 这一行中的[^c‐z]表示这个位置上的字符不在c‐z范围内都可以匹配出来，注意是不在
print(result)# 得到的结果是['abc']


# 示例
text = '我住在3号楼666,我的电话号码是17606000003你后面有事给我打电话，打不通就打17327567890。实在不行就打固定电话010-7788'
result = re.findall('\d{3}[\d-]\d*',text)
# \d{3}代表至少3个数字起匹配（区号和电话号码都满足）
# [\d-]代表后面跟着的可以是数字（电话号码），也可以是-
# \d*代表后面的数字我都要
print(result)#结果是['17606000003', '17327567890', '010-7788']


# 分组
line = "Cats are smarter than dogs"
matchObj = re.match( r'(.*) are (.*?) .*', line, re.M|re.I)
#re.M表示多行匹配，影响 ^ 和 $
#re.I 使匹配对大小写不敏感
if matchObj:
   print ("matchObj.group() : ", matchObj.group())#返回所有组
   print ("matchObj.group(1) : ", matchObj.group(1)) # 返回组1【注意不是从0开始】
   print ("matchObj.group(2) : ", matchObj.groups())# 返回所有组的元组形式
else:
   print ("No match!!")


# 替换与检索sub
phone = "2004-959-559 # 这是一个国外电话号码"
# 删除字符串中的 Python注释
num = re.sub(r'#.*$', "", phone)
print ("电话号码是: ", num)
# 删除非数字(-)的字符串
num = re.sub(r'\D', "", phone)
print ("电话号码是 : ", num)

# 将匹配的数字乘以 2
def double(matched):
    value = int(matched.group('value'))
    return str(value * 2)
s = 'A23G4HFD567'
print(re.sub('(?P<value>\d+)', double, s))


#贪婪与非贪婪
content = '发布于2018/12/23'
result = re.findall('.*?(\d.*\d)', content)
# 这里的?表示的就是非贪婪模式，第一个.*会尽可能少地去匹配内容，因为后面跟的是\d，所以碰见第一个数字就终止了。
print(result)

result = re.findall('.*(\d.*\d)', content)
# 这里的第一个.*后面没有添加问号，表示的就是贪婪模式，第一个.*会尽可能多地去匹配
#内容，后面跟的是\d，碰见第一个数字并不一定会终止，当它匹配到2018的2的时候，发现剩#下的内容依然满足(\d.*\d)，所以会一直匹配下去，直到匹配到12后面的/的时候，发现剩下
#的23依然满足(\d.*\d)，但是如果再匹配下去，匹配到23的2的话，剩下的3就不满足
#(\d.*\d)了，所以第一个.*就会停止匹配，(\d.*\d)最终匹配到的结果就只剩下23了。
print(result)

result = re.findall('.*?(\d.*?\d)', content)
# 这里的第一个.*?表示非贪婪模式(非贪婪模式就是尽可能少地去匹配字符)，匹配到2018前面的'于'之后就停止了
# 括号里的.*?也是表示非贪婪模式，括号里的内容从2018的2开始匹配，因为后面一个数字
#是0，那么也就满足了(\d.*?\d)，所以就直接返回结果了，同样的，接下来的18也是这样，一
#直匹配到23才结束
print(result)

Xpath

Xpath 的基本语法

更多实例可以查看菜鸟教程的Xpath 实例

表达式	描述
nodename	选取此节点的所有子节点
/	从根节点选取（取子节点）
//	从匹配选择的当前节点选择文档中的节点，而不考虑它们的位置（取子孙节点）
.	选取当前节点
..	选取当前节点的父节点
@	选取属性
*	匹配任何元素节点
@*	匹配任何属性节点
node()	匹配任何类型的节点
/bookstore/*	选取 bookstore 元素的所有子元素
//*	选取文档中的所有元素
//title[@*]	选取所有带有属性的 title 元素
text()	提取文本信息

下面是一些具体的例子，另外通过在路径表达式中使用"|"运算符，您可以选取若干个路径。譬如：

//title | //price 选取文档中的所有 title 和 price 元素。

表达式	描述
/bookstore/book[1]	选取属于 bookstore 子元素的第一个 book 元素
/bookstore/book[last()]	选取属于 bookstore 子元素的最后一个 book 元素
/bookstore/book[last()-1]	选取属于 bookstore 子元素的倒数第二个 book 元素
/bookstore/book[position()< 3]	选取最前面的两个属于 bookstore 元素的子元素的 book 元素
//title[@lang]	选取所有拥有名为 lang 的属性的 title 元素
//title[@lang='eng']	选取所有 title 元素，且这些元素拥有值为 eng 的 lang 属性
/bookstore/book[price>35.00]	选取 bookstore 元素的所有 book 元素，且其中的 price 元素的值须大于 35.00
/bookstore/book[price>35.00]//title	选取 bookstore 元素中的 book 元素的所有 title 元素，且其中的 price 元素的值须大于 35.00

CSS

CSS 的基本语法

更多实例可以查看CSS 选择器参考手册

选择器	例子	例子描述
.class	.intro	选择 class="intro" 的所有元素
.class1.class2	.name1.name2	选择 class 属性中同时有 name1 和 name2 的所有元素
.class1 .class2	.name1 .name2	选择作为类名 name1 元素后代的所有类名 name2 元素
#id	#firstname	选择 id="firstname" 的元素
*	*	选择所有元素
element	p	选择所有 p 元素
element.class	p.intro	选择 class="intro" 的所有 p 元素
element,element	div, p	选择所有 div 元素和所有 p 元素
element element	div p	选择 div 元素内的所有 p 元素
element>element	div > p	选择父元素是 div 的所有 p 元素
element+element	div + p	选择紧跟 div 元素的首个 p 元素
element1~element2	p ~ ul	选择前面有 p 元素的每个 ul 元素
[attribute]	[target]	选择带有 target 属性的所有元素
[attribute=value]	[target=_blank]	选择带有 target="_blank" 属性的所有元素
[attribute~=value]	[title~=flower]	选择 title 属性包含单词 "flower" 的所有元素
[attribute\|=value]	[lang\|=en]	选择 lang 属性值以 "en" 开头的所有元素。
[attribute^=value]	a[href^="https"]	选择其 src 属性值以 "https" 开头的每个 a 元素
[attribute$=value]	a[href$=".pdf"]	选择其 src 属性以 ".pdf" 结尾的所有 a 元素
[attribute*=value]	a[href*="w3schools"]	选择其 href 属性值中包含 "abc" 子串的每个 a 元素
:active	a:active	选择活动链接
::after	p::after	在每个 p 的内容之后插入内容
::before	p::before	在每个 p 的内容之前插入内容
:checked	input:checked	选择每个被选中的 input 元素
:default	input:default	选择默认的 input 元素
:disabled	input:disabled	选择每个被禁用的 input 元素
:empty	p:empty	选择没有子元素的每个 p 元素（包括文本节点）
:enabled	input:enabled	选择每个启用的 input 元素
:first-child	p:first-child	选择属于父元素的第一个子元素的每个 p 元素
::first-letter	p::first-letter	选择每个 p 元素的首字母
::first-line	p::first-line	选择每个 p 元素的首行
:first-of-type	p:first-of-type	选择属于其父元素的首个 p 元素的每个 p 元素
:focus	input:focus	选择获得焦点的 input 元素
:fullscreen	:fullscreen	选择处于全屏模式的元素
:hover	a:hover	选择鼠标指针位于其上的链接
:in-range	input:in-range	选择其值在指定范围内的 input 元素
:indeterminate	input:indeterminate	选择处于不确定状态的 input 元素
:invalid	input:invalid	选择具有无效值的所有 input 元素
:lang(language)	p:lang(it)	选择 lang 属性等于 "it"（意大利）的每个 p 元素
:last-child	p:last-child	选择属于其父元素最后一个子元素每个 p 元素
:last-of-type	p:last-of-type	选择属于其父元素的最后 p 元素的每个 p 元素
:link	a:link	选择所有未访问过的链接
:not(selector)	:not(p)	选择非 p 元素的每个元素
:nth-child(n)	p:nth-child(2)	选择属于其父元素的第二个子元素的每个 p 元素
:nth-last-child(n)	p:nth-last-child(2)	同上，从最后一个子元素开始计数
:nth-of-type(n)	p:nth-of-type(2)	选择属于其父元素第二个 p 元素的每个 p 元素
:nth-last-of-type(n)	p:nth-last-of-type(2)	同上，但是从最后一个子元素开始计数
:only-of-type	p:only-of-type	选择属于其父元素唯一的 p 元素的每个 p 元素
:only-child	p:only-child	选择属于其父元素的唯一子元素的每个 p 元素
:optional	input:optional	选择不带 "required" 属性的 input 元素
:out-of-range	input:out-of-range	选择值超出指定范围的 input 元素
::placeholder	input::placeholder	选择已规定 "placeholder" 属性的 input 元素
:read-only	input:read-only	选择已规定 "readonly" 属性的 input 元素
:read-write	input:read-write	选择未规定 "readonly" 属性的 input 元素
:required	input:required	选择已规定 "required" 属性的 input 元素
:root	:root	选择文档的根元素
::selection	::selection	选择用户已选取的元素部分
:target	#news:target	选择当前活动的 #news 元素
:valid	input:valid	选择带有有效值的所有 input 元素
:visited	a:visited	选择所有已访问的链接

Beatifulsoup

BS4 解析库用法详解

Beatifulsoup 解析器

这个模块属于模块名与下载名不一致，安装时需要注意。另外需要安装几个网页解析器，推荐使用 lxml 作为解析器，因为效率更高

pip install beautifulsoup4
pip install lxml
pip install html5lib

安装完成之后可以通过如下方式解析网页数据

html_doc = """
<html><head><title>"c语言中文网"</title></head>
<body>
<p class="title"><b>c.biancheng.net</b></p>
<p class="website">一个学习编程的网站
<a href="http://c.biancheng.net/python/" id="link1">python教程</a>
<a href="http://c.biancheng.net/c/" id="link2">c语言教程</a>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'lxml')

print(soup.prettify()) #prettify()用于格式化输出html/xml文档

Beatifulsoup 对象

文档树中的每个节点都是 Python 对象，这些对象大致分为四类

Tag：标签类，HTML 文档中所有的标签都可以看做 Tag 对象。
NavigableString：字符串类，指的是标签中的文本内容，使用 text、string、strings 来获取文本内容。
BeautifulSoup：表示一个 HTML 文档的全部内容，您可以把它当作一个人特殊的 Tag 对象。
Comment：表示 HTML 文档中的注释内容以及特殊字符串，它是一个特殊的 NavigableString。

from bs4 import BeautifulSoup
soup = BeautifulSoup('<p class="Web site url"><b>c.biancheng.net</b></p>', 'html.parser')
#获取整个p标签的html代码
print(soup.p)
#获取b标签
print(soup.p.b)
#获取p标签内容，使用NavigableString类中的string、text、get_text()
print(soup.p.text)
#返回一个字典，里面是多有属性和值
print(soup.p.attrs)
#查看返回的数据类型
print(type(soup.p))
#根据属性，获取标签的属性值，返回值为列表
print(soup.p['class'])
#给class属性赋值,此时属性值由列表转换为字符串
soup.p['class']=['Web','Site']
print(soup.p)

Beatifulsoup 节点

Tag 对象提供了许多遍历 tag 节点的属性，比如 contents、children 用来遍历子节点；parent 与 parents 用来遍历父节点；而 next_sibling 与 previous_sibling 则用来遍历兄弟节点。

#coding:utf8
from bs4 import BeautifulSoup
html_doc = """
<html><head><title>"c语言中文网"</title></head>
<body>
<p class="title"><b>c.biancheng.net</b></p>
<p class="website">一个学习编程的网站</p>
<a href="http://c.biancheng.net/python/" id="link1">python教程</a>,
<a href="http://c.biancheng.net/c/" id="link2">c语言教程</a> and
"""
soup = BeautifulSoup(html_doc, 'html.parser')
body_tag=soup.body
print(body_tag)
#以列表的形式输出，所有子节点
print(body_tag.contents)

find_all()与 find()

find_all() 方法用来搜索当前 tag 的所有子节点，并判断这些节点是否符合过滤条件，最后以列表形式将符合条件的内容返回，语法格式如下：

find_all( name , attrs , recursive , text , limit )

参数说明： name：查找所有名字为 name 的 tag 标签，字符串对象会被自动忽略。 attrs：按照属性名和属性值搜索 tag 标签，注意由于 class 是 Python 的关键字吗，所以要使用 "class_"。 recursive：find_all() 会搜索 tag 的所有子孙节点，设置 recursive=False 可以只搜索 tag 的直接子节点。 text：用来搜文档中的字符串内容，该参数可以接受字符串、正则表达式、列表、True。 limit：由于 find_all() 会返回所有的搜索结果，这样会影响执行效率，通过 limit 参数可以限制返回结果的数量。

find() 方法与 find_all() 类似，不同之处在于 find_all() 会将文档中所有符合条件的结果返回，而 find() 仅返回一个符合条件的结果，所以 find() 方法没有 limit 参数。

使用 find() 时，如果没有找到查询标签会返回 None，而 find_all() 方法返回空列表。

from bs4 import BeautifulSoup
import re
html_doc = """
<html><head><title>"c语言中文网"</title></head>
<body>
<p class="title"><b>c.biancheng.net</b></p>
<p class="website">一个学习编程的网站</p>
<a href="http://c.biancheng.net/python/" id="link1">python教程</a>
<a href="http://c.biancheng.net/c/" id="link2">c语言教程</a>
<a href="http://c.biancheng.net/django/" id="link3">django教程</a>
<p class="vip">加入我们阅读所有教程</p>
<a href="http://vip.biancheng.net/?from=index" id="link4">成为vip</a>
"""
#创建soup解析对象
soup = BeautifulSoup(html_doc, 'html.parser')

# find_all 语法
#查找所有a标签并返回
print(soup.find_all("a"))
#查找前两条a标签并返回
print(soup.find_all("a",limit=2))
#按照标签属性以及属性值查找 HTML 文档
print(soup.find_all("p",class_="website"))
print(soup.find_all(id="link4"))
#列表行书查找tag标签
print(soup.find_all(['b','a']))
#正则表达式匹配id属性值
print(soup.find_all('a',id=re.compile(r'.\d')))
print(soup.find_all(id=True))
#True可以匹配任何值，下面代码会查找所有tag，并返回相应的tag名称
for tag in soup.find_all(True):
    print(tag.name,end=" ")
#输出所有以b开始的tag标签
for tag in soup.find_all(re.compile("^b")):
    print(tag.name)

# find 语法
#查找所有a标签并返回
print(soup.find("a"))
#查找前两条a标签并返回
print(soup.find("a",limit=2))
#按照标签属性以及属性值查找 HTML 文档
print(soup.find("p",class_="website"))
print(soup.find(id="link4"))
#列表行书查找tag标签
print(soup.find(['b','a']))
#正则表达式匹配id属性值
print(soup.find('a',id=re.compile(r'.\d')))
print(soup.find(id=True))
#True可以匹配任何值，下面代码会查找所有tag，并返回相应的tag名称
for tag in soup.find(True):
    print(tag.name,end=" ")
#输出所有以b开始的tag标签
for tag in soup.find(re.compile("^b")):
    print(tag.name)

CSS 选择器

BS4 支持大部分的 CSS 选择器，比如常见的标签选择器、类选择器、id 选择器，以及层级选择器。Beautiful Soup 提供了一个 select() 方法，通过向该方法中添加选择器，就可以在 HTML 文档中搜索到与之对应的内容。

html_doc = """
<html><head><title>"c语言中文网"</title></head>
<body>
<p class="title"><b>c.biancheng.net</b></p>
<p class="website">一个学习编程的网站</p>
<a href="http://c.biancheng.net/python/" id="link1">python教程</a>
<a href="http://c.biancheng.net/c/" id="link2">c语言教程</a>
<a href="http://c.biancheng.net/django/" id="link3">django教程</a>
<p class="vip">加入我们阅读所有教程</p>
<a href="http://vip.biancheng.net/?from=index" id="link4">成为vip</a>
<p class="introduce">介绍:
<a href="http://c.biancheng.net/view/8066.html" id="link5">关于网站</a>
<a href="http://c.biancheng.net/view/8092.html" id="link6">关于站长</a>
</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')
#根据元素标签查找
print(soup.select('title'))
#根据属性选择器查找
print(soup.select('a[href]'))
#根据类查找
print(soup.select('.vip'))
#后代节点查找
print(soup.select('html head title'))
#查找兄弟节点
print(soup.select('p + a'))
#根据id选择p标签的兄弟节点
print(soup.select('p ~ #link3'))
#nth-of-type(n)选择器，用于匹配同类型中的第n个同级兄弟元素
print(soup.select('p ~ a:nth-of-type(1)'))
#查找子节点
print(soup.select('p > a'))
print(soup.select('.introduce > #link5'))

parsel

parsel 相当于 Beatifulsoup 的升级版，速度更快，而且后期可以无缝对接 Scrapy，因为经常需要结合 css 和 xpath 两种语法来提取，所以可以查看这个对比

基本用法

from parsel import Selector

html = '''
<div>
    <ul>
         <li class="item-0">first item</li>
         <li class="item-1"><a href="link2.html">second item</a></li>
         <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
         <li class="item-1 active"><a href="link4.html">fourth item</a></li>
         <li class="item-0"><a href="link5.html">fifth item</a></li>
     </ul>
 </div>
'''
# 基本用法
selector = Selector(text=html)
items = selector.css('.item-0')
items2 = selector.xpath('//li[contains(@class, "item-0")]')

## 提取文本
for item in selector.css('.item-0'):
    text = item.xpath('.//text()') # // 是所有子孙节点，text()是提取文本
    print(text.get())# get 方法的作用是从 SelectorList 里面提取第一个 Selector 对象，然后输出其中的结果。
    print(text.getall())# getall 方法的作用是从 SelectorList 里面提取所有 Selector 对象，然后输出其中的结果。

## 提取属性
'''
用 css 和 xpath 方法实现。我们根据同时包含 item-0 和 active 这两个 class
来选取第三个 li 节点，然后进一步选取了里面的 a 节点
'''
result = selector.css('.item-0.active a::attr(href)').get() # ::attr()
print(result) # link3.html
result = selector.xpath('//li[contains(@class, "item-0") and contains(@class, "active")]/a/@href').get() # /@
print(result) # link3.html

## 正则提取
result = selector.css('.item-0').re('link.*') # 匹配包含 link 的所有结果。
print(result)
# ['link3.html"><span class="bold">third item</span></a></li>', 'link5.html">fifth item</a></li>']

parsel 命名空间

多个相同的 XML 文档被加载时，计算机不能正确区分。所以需要给不同的 XML 加上命名空间来区分。

命名空间是在元素的开始标签的 xmlns 属性中定义的。命名空间声明的语法是

xmlns:前缀="URI"
xmlns:gd="http://schemas.google.com/g/2005"

命名空间 URI 不会被解析器用于查找信息。如果像想要提取命名空间的 url，需要先移除命名空间。

import requests
from parsel import Selector
text = requests.get('https://feeds.feedburner.com/PythonInsider').text
sel = Selector(text=text, type='xml') # 使用xml解析器

# 我们可以尝试选择所有对象，然后看到它不起作用 （因为 Atom XML 命名空间混淆了这些节点）
sel.xpath("//link") # []

# 但是一旦我们调用Selector.remove_namespaces方法，就可以访问所有节点 直接通过他们的名字
sel.remove_namespaces()  # 移除命名空间
sel.xpath("//link")
# [<Selector xpath='//link' data='<link rel="alternate" type="text/html...'>,
#<Selector xpath='//link' data='<link rel="next" type="application/at...'>,
# ...]


# 为什么默认情况下不始终调用命名空间删除过程?
# 删除命名空间需要迭代和修改 文档，默认情况下执行的操作成本相当高 对于所有文档。
# 在某些情况下，实际上可能需要使用命名空间，在 某些元素名称在命名空间之间发生冲突的情况。虽然这些情况非常罕见

## 如果存在命名空间，也可以通过命名空间来选取对应的数据
sel.xpath("//a:entry/a:author/g:image/@src", namespaces={"a": "http://www.w3.org/2005/Atom","g": "http://schemas.google.com/g/2005"}).getall()

parsel 将 CSS 转换为 XPath

from parsel import css2xpath
css2xpath('h1.title')
"descendant-or-self::h1[@class and contains(concat(' ', normalize-space(@class), ' '), ' title ')]"
css2xpath('.profile-data') + '//h2'
"descendant-or-self::*[@class and contains(concat(' ', normalize-space(@class), ' '), ' profile-data ')]//h2"

parsel 使用技巧

包括错误的 XPath 语法、XPath 和 css 组合使用

from scrapy import Selector
sel = Selector(text='<a href="#">Click here to go to the <strong>Next Page</strong></a>')
xp = lambda x: sel.xpath(x).extract() # let's type this only once
xp('//a//text()') # take a peek at the node-set
  # [u'Click here to go to the ', u'Next Page']
xp('string(//a//text())')  # convert it to a string
  # [u'Click here to go to the ']


xp('//a[1]') # selects the first a node
[u'<a href="#">Click here to go to the <strong>Next Page</strong></a>']
xp('string(//a[1])') # converts it to string
[u'Click here to go to the Next Page']

xp("//a[contains(., 'Next Page')]//text()") # good
#['Click here to go to the ', 'Next Page']
xp("//a[contains(.//text(), 'Next Page')]") # bad
# []

xp("substring-after(//a, 'Next ')")# good
# [u'Page']
xp("substring-after(//a//text(), 'Next ')")# bad
# [u'']

## 组合使用
sel = Selector(text='<p class="content-author">Someone</p><p class="content text-wrap">Some content</p>')
xp = lambda x: sel.xpath(x).getall()
sel.css('.content').xpath('@class').getall()
# ['content text-wrap']

parsel 对比 bs4

对比项包括：标签+class 属性、标签+多个 class 属性、标签+非 class 的属性键值、提取指定属性、选取指定节点。

可以看到能用 bs4 实现的，parsel 都可以实现，并且速度更快。 bs4 代码更多，但是可读性较好。parsel 语法更简洁，但是可读性较差。

from parsel import Selector
from bs4 import BeautifulSoup
from timeit import timeit

 # https://www.kickstarter.com/projects/cloverpress/pixiv-presents-artists-in-taiwan-and-korea?ref=section-homepage-featured-project
with open('1.html','r')as f:
    html = f.read()

def test1():
    sel = Selector(text=html)
    Title = sel.css('h2.project-name').xpath('.//text()').get()
    First_Page_Picture = sel.css('img.aspect-ratio--object::attr(src)').get()
    Pledged_Amount = sel.css('span.money')[4].xpath('.//text()').get()
    Due_Time = sel.css('span[data-test-id=deadline-exists]').xpath('.//text()').get()
    Time_Left = sel.css('span.block.type-16.type-28-md.bold.dark-grey-500')[1].xpath('.//text()').get()
    return Due_Time ,Time_Left,Pledged_Amount,Title,First_Page_Picture


def test2():
    newSoup = BeautifulSoup(html, "html.parser")
    Title = newSoup.find('h2',class_ = 'project-name').text
    First_Page_Picture = newSoup.find('img',class_ = 'aspect-ratio--object')["src"]
    Pledged_Amount = newSoup.find_all('span',class_ = 'money')[4].text
    Due_Time = newSoup.find('span',{'data-test-id': 'deadline-exists'}).text
    Time_Left = newSoup.find_all('span',class_= 'block type-16 type-28-md bold dark-grey-500')[1].text
    return Due_Time ,Time_Left,Pledged_Amount,Title,First_Page_Picture

print(test1() == test2()) # True

t1 = timeit('test1()', 'from __main__ import test1', number=10)
print(t1) # 0.3

t2 = timeit('test2()', 'from __main__ import test2', number=10)
print(t2) # 2

json

json 模块提供了 python->json 以及 json->python 两种格式，转换规则如下

JSON ->	Python ->	JSON
object -- 对象	dict	object -- 对象
array	list	array
string	str	string
number (int)	int	number
number (real)	float	number
TRUE	true
FALSE	false
	TRUE	true
	FALSE	false
null	None	null
	tuple	array

注意：JSON 中的键-值对中的键永远是 str 类型的。当一个对象被转化为 JSON 时，字典中所有的键都会被强制转换为字符串。这所造成的结果是字典被转换为 JSON 然后转换回字典时可能和原来的不相等。换句话说，如果 x 具有非字符串的键，则有 loads(dumps(x)) != x。

json 模块还有一些其他参数可以控制：编码形式、格式化输出等，不过很少用到

json 官方模块文档

json.load 与 json.dump

json.load 与 json.dump 是基于文件的转换

import json

data = {
    "name": "Satyam kumar",
    "place": "patna",
    "skills": [
        "Raspberry pi",
        "Machine Learning",
        "Web Development"
    ],
    "email": "xyz@gmail.com",
    "projects": [
        "Python Data Mining",
        "Python Data Science"
    ]
}
with open("data_file.json", "w") as write:
    json.dump(data, write)

with open("data_file.json", "r") as read_content:
    print(json.load(read_content))

json.loads 与 json.dumps

json.load 与 json.dump 是直接基于数据的转换

import json

# JSON string:
# Multi-line string
data = """{
    "Name": "Jennifer Smith",
    "Contact Number": 7867567898,
    "Email": "jen123@gmail.com",
    "Hobbies":["Reading", "Sketching", "Horse Riding"]
    }"""

# parse data:
res_p = json.loads(data)
print(type(res_p)) # <class 'dict'>

res_j = json.dumps(res_p)
print(type(res_j)) # <class 'str'>

execjs

execjs 可以使用 execjs.eval(data)的形式把数据转化为真正的 json 对象，并通过 call()方法执行这段 js 代码。

import execjs

data = """{
    "Name": "Jennifer Smith",
    "Contact Number": 7867567898,
    "Email": "jen123@gmail.com",
    "Hobbies":["Reading", "Sketching", "Horse Riding"]
    }"""

res_p = execjs.eval(data)
print(type(res_p)) #dict


res_j = execjs.compile(res_p)
print(type(res_j._source)) #dict

execjs.eval("'red yellow blue'.split(' ')")
#['red', 'yellow', 'blue']

ctx = execjs.compile("""
    function add(x, y) {
        return x + y;
    }
""")
ctx.call("add", 1, 2)
#3

数据提取​

正则表达式​

Xpath​

CSS​

Beatifulsoup​

Beatifulsoup 解析器​

Beatifulsoup 对象​

Beatifulsoup 节点​

find_all()与 find()​

CSS 选择器​

parsel​

基本用法​

parsel 命名空间​

parsel 将 CSS 转换为 XPath​

parsel 使用技巧​

parsel 对比 bs4​

json​

json.load 与 json.dump​

json.loads 与 json.dumps​

execjs​