pySpider

Web scraping basics…

0x01 Beautiful Soup

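Beautiful Soup is a Python library for extracting data from HTML and XML files. It works with the parser of your choice to provide idiomatic ways of navigating, searching, and modifying the document.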

Parsers

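Parser	Usage	Advantages	Disadvantages
Python standard library	BeautifulSoup(markup, "html.parser")	Built into Python; moderate speed; tolerant of malformed markup
lxml HTML parser	BeautifulSoup(markup, "lxml")	Fast; tolerant of malformed markup	Requires a C library
lxml XML parser	BeautifulSoup(markup, "lxml-xml") or BeautifulSoup(markup, "xml")	Fast; supports XML	Requires a C library
html5lib	BeautifulSoup(markup, "html5lib")	Best tolerance of malformed markup	Slow (pure Python, with no external C extension)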
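
Because lxml and html5lib are optional installs, one possible way to choose a parser is to try the faster one first and fall back to the built-in one. The sketch below assumes a hypothetical helper named make_soup, which is not part of Beautiful Soup. It relies on bs4 raising FeatureNotFound when the requested parser is not installed.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup):
    """Hypothetical helper: prefer lxml, fall back to html.parser."""
    try:
        return BeautifulSoup(markup, "lxml")
    except FeatureNotFound:
        # lxml is not installed; the standard-library parser always is.
        return BeautifulSoup(markup, "html.parser")


soup = make_soup("<p class='title'>Hello</p>")
print(soup.p.get_text())  # Hello
```

Different parsers can build different trees from malformed markup, so naming the parser explicitly keeps results consistent across machines.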

Usage

from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

soup = BeautifulSoup(html_doc, 'html.parser')
# Tag name
soup.title              # <title>The Dormouse's story</title>
soup.title.name         # 'title'

# Tag attributes
soup.p                  # <p class="title"><b>The Dormouse's story</b></p>
soup.p.attrs            # {'class': ['title']}
soup.p['class']         # ['title'] (class is multi-valued, so a list is returned)

# Text
soup.title.string       # "The Dormouse's story"
soup.title.text         # "The Dormouse's story"
soup.title.get_text()   # "The Dormouse's story"

# Parent
soup.title.parent       # <head><title>The Dormouse's story</title></head>
soup.title.parent.name  # 'head'
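
The accessors above behave a little differently once a tag has several classes or nested children. The following is a minimal sketch on a made-up one-line document (the markup is an assumption for illustration, not taken from the example above).

```python
from bs4 import BeautifulSoup

# Made-up markup: a tag with two classes and mixed text/element children.
markup = '<p class="story intro">Once <b>upon</b> a time</p>'
p = BeautifulSoup(markup, "html.parser").p

print(p["class"])    # ['story', 'intro']
print(p.string)      # None, because <p> has more than one child
print(p.get_text())  # Once upon a time
print(p.get("id"))   # None; .get() avoids a KeyError for missing attributes
```

In short, .string is only useful when a tag wraps a single piece of text, while get_text() concatenates all nested text.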

Methods

find()

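find() returns only the first tag that matches the criteria, as a single object (or None when nothing matches).
import re  # needed for re.compile below

soup.find('a')
soup.find('a', class_='xxx')
soup.find('a', title='xxx')
soup.find('a', id='xxx')
soup.find('a', id=re.compile(r'xxx'))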
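
To make the placeholder 'xxx' calls concrete, here is a small sketch that runs find() against a shortened version of the sisters markup. The variable names are illustrative only.

```python
import re

from bs4 import BeautifulSoup

html = (
    '<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>'
    '<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>'
)
soup = BeautifulSoup(html, "html.parser")

first = soup.find("a", class_="sister")             # first match only
by_regex = soup.find("a", id=re.compile(r"k2$"))    # regex is matched with search()
missing = soup.find("a", id="no-such-id")           # no match

print(first["id"])          # link1
print(by_regex.get_text())  # Lacie
print(missing)              # None
```

Since find() can return None, it is worth checking the result before reading attributes from it.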

find_all()

find_all() returns a list containing every object that matches the criteria.
soup.find_all('a')
soup.find_all('a', class_='wang')
soup.find_all('a', id=re.compile(r'xxx'))
soup.find_all('a', limit=2)
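
A common use of find_all() is collecting one attribute from every match. The sketch below assumes a shortened version of the sisters markup and is meant only as an illustration.

```python
from bs4 import BeautifulSoup

html = (
    '<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>'
    '<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>'
    '<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>'
)
soup = BeautifulSoup(html, "html.parser")

# Every <a class="sister"> tag, in document order.
hrefs = [a["href"] for a in soup.find_all("a", class_="sister")]
# Stop searching after the first two matches.
first_two = [a.get_text() for a in soup.find_all("a", limit=2)]
# No match gives an empty list rather than None.
none_found = soup.find_all("div")

print(hrefs)
print(first_two)   # ['Elsie', 'Lacie']
print(none_found)  # []
```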

0x02 lxml

XPath is a language for locating information in XML documents.

Concepts

Nodes

Node types include element, attribute, text, namespace, and document (root) nodes.

Node relationships

Parent
Children
Siblings
Ancestors
Descendants
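
These relationships map onto XPath axes and onto lxml element methods. The sketch below uses a made-up two-link document to show one way of walking from a node to its parent, siblings, ancestors, and descendants.

```python
from lxml import etree

# Made-up markup for illustration.
root = etree.HTML(
    '<html><body><p class="story">'
    '<a id="link1">Elsie</a><a id="link2">Lacie</a>'
    '</p></body></html>'
)
a = root.xpath('//a[@id="link1"]')[0]

parent_tag = a.getparent().tag                          # parent element
sibling_ids = a.xpath('following-sibling::a/@id')       # later siblings
ancestor_tags = [e.tag for e in a.xpath('ancestor::*')] # every enclosing element
descendant_text = root.xpath('//p/descendant::a/text()')

print(parent_tag)       # p
print(sibling_ids)      # ['link2']
print(ancestor_tags)    # the html, body, and p tags
print(descendant_text)  # ['Elsie', 'Lacie']
```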

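XPath syntax

Expression	Description
//	Selects matching nodes at any depth (descendants), not only direct children
/	Selects from the root node
/text()	Gets the text content at the current path
|	Combines several paths, e.g. //p | //div
.	Selects the current node
..	Selects the parent of the current node
@	Selects attributes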

from lxml import etree

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""


selector = etree.HTML(html_doc)

selector.xpath('//p[@class="title"]/b/text()')
# ["The Dormouse's story"]

selector.xpath('//p[@class="story"]/a/@href')
# ['http://example.com/elsie', 'http://example.com/lacie', 'http://example.com/tillie']
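
The queries above cover //, /text(), and @. As a complement, here is a rough sketch of the remaining expressions from the syntax table (|, ., .., and an absolute / path) on a made-up fragment. Note that etree.HTML wraps a fragment in html and body elements, which is why the absolute path starts with /html/body.

```python
from lxml import etree

# Made-up fragment for illustration.
doc = etree.HTML(
    '<div><p class="title">Title</p><span>note</span>'
    '<p class="story"><a href="/x">X</a></p></div>'
)

union_tags = [e.tag for e in doc.xpath('//p | //span')]    # union of two paths
span_text = doc.xpath('/html/body/div/span/text()')        # absolute path from the root

a = doc.xpath('//a')[0]
own_href = a.xpath('./@href')       # "." is the current node
parent_class = a.xpath('../@class') # ".." is its parent

print(union_tags)    # the two p elements and the span
print(span_text)     # ['note']
print(own_href)      # ['/x']
print(parent_class)  # ['story']
```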

@ Zhang zhiyang · Monday, Jan 1, 0001 · 2 minute read · Update at Monday, Jan 1, 0001
