Web Scraping Basics…
0x01 Beautiful Soup
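Beautiful Soup is a Python library for extracting data from HTML and XML files. It works with the parser of your choice to provide idiomatic ways of navigating, searching, and modifying the parse tree.
Parsers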
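Parser | Usage | Advantages | Disadvantages |
---|---|---|---|
Python standard library | BeautifulSoup(markup, "html.parser") | Built into Python; moderate speed; tolerant of malformed documents | |
lxml HTML parser | BeautifulSoup(markup, "lxml") | Fast; tolerant of malformed documents | Requires an external C library |
lxml XML parser | BeautifulSoup(markup, ["lxml-xml"]) or BeautifulSoup(markup, "xml") | Fast; supports XML | Requires an external C library |
html5lib | BeautifulSoup(markup, "html5lib") | Best tolerance of malformed documents | Slow; external Python dependency |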
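Since lxml and html5lib have to be installed separately, a script may want to fall back to the built-in parser when they are missing. The following is a minimal sketch of that idea; `make_soup` and its `preferred` parameter are hypothetical names introduced here, not part of Beautiful Soup.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred=("lxml", "html.parser")):
    """Hypothetical helper: parse with the first parser in `preferred` that is installed."""
    for parser in preferred:
        try:
            return BeautifulSoup(markup, parser)
        except FeatureNotFound:
            # Beautiful Soup raises FeatureNotFound when the requested parser
            # is not installed; move on to the next candidate.
            continue
    raise RuntimeError("none of the requested parsers is available")


# Deliberately unclosed tags: every parser in the table repairs them,
# although the exact tree each one builds can differ.
soup = make_soup("<p>Hello<b>world")
print(soup.b.get_text())
```

Because different parsers may build different trees from the same broken markup, it is usually safer to name the parser explicitly than to rely on whichever one happens to be installed.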
Usage
from bs4 import BeautifulSoup
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
soup = BeautifulSoup(html_doc, 'html.parser')
# Tag name
soup.title # <title>The Dormouse's story</title>
soup.title.name # 'title'
# Tag attributes
soup.p # <p class="title"><b>The Dormouse's story</b></p>
soup.p.attrs # {'class': ['title']}
soup.p['class'] # ['title'] (class is multi-valued, so a list is returned)
# Text
soup.title.string # "The Dormouse's story"
soup.title.text # "The Dormouse's story"
soup.title.get_text() # "The Dormouse's story"
# Parent
soup.title.parent # <head><title>The Dormouse's story</title></head>
soup.title.parent.name # 'head'
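The attribute and text accessors above have a few edge cases worth seeing side by side. This is a small sketch on a shortened copy of the sample document (the trimmed markup is an assumption made to keep the example short): index access versus `.get()`, the multi-valued `class` attribute, and how `.string` differs from `get_text()`.

```python
from bs4 import BeautifulSoup

# Shortened version of the sample document, for illustration only.
html_doc = """
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
</p>
"""
soup = BeautifulSoup(html_doc, "html.parser")

link = soup.a
href = link["href"]          # index access raises KeyError if the attribute is missing
missing = link.get("title")  # .get() returns None instead of raising
classes = link["class"]      # class is multi-valued, so this is a list

# .string follows a chain of single children: <p> -> <b> -> text.
title_text = soup.p.string

# The story paragraph has several children (whitespace, <a>, whitespace),
# so .string is None there, while get_text() joins all nested text.
story = soup.find("p", class_="story")
story_string = story.string
story_text = story.get_text(strip=True)

print(href, missing, classes, title_text, story_string, story_text)
```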
Methods
- find()
find() returns only the first tag that matches, as a single object (or None if nothing matches).
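import re  # needed for the re.compile() filter below
soup.find('a')  # first <a> tag
soup.find('a', class_='xxx')  # first <a> whose class includes 'xxx'
soup.find('a', title='xxx')  # first <a> whose title attribute is 'xxx'
soup.find('a', id='xxx')  # first <a> whose id is 'xxx'
soup.find('a', id=re.compile(r'xxx'))  # first <a> whose id matches the regex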
- find_all()
find_all() returns a list containing every object that matches.
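soup.find_all('a')  # every <a> tag
soup.find_all('a', class_='wang')  # every <a> whose class includes 'wang'
soup.find_all('a', id=re.compile(r'xxx'))  # every <a> whose id matches the regex
soup.find_all('a', limit=2)  # stop after the first two matches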
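The difference between the two methods matters most when nothing matches: `find()` gives back None, while `find_all()` gives back an empty list. Here is a short sketch on a trimmed copy of the sample links (the trimmed markup and the variable names are assumptions for illustration).

```python
import re

from bs4 import BeautifulSoup

html_doc = """
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
</p>
"""
soup = BeautifulSoup(html_doc, "html.parser")

# find() returns None when nothing matches, so guard before using the result.
table = soup.find("table")
first_link = soup.find("a", class_="sister")
first_href = first_link["href"] if first_link is not None else None

# find_all() always returns a list-like result, possibly empty.
no_tables = soup.find_all("table")
names = [a.get_text() for a in soup.find_all("a")]
first_two = [a["id"] for a in soup.find_all("a", limit=2)]

# A compiled regex is searched for inside the attribute value.
numbered = [a["id"] for a in soup.find_all("a", id=re.compile(r"^link[23]$"))]

print(table, first_href, len(no_tables), names, first_two, numbered)
```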
0x02 lxml
XPath is a language for locating information in XML documents.
Concepts
Nodes
Element, attribute, text, namespace, and document (root) nodes
Node relationships
- Parent
- Children
- Siblings
- Ancestors
- Descendants
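The relationships listed above map fairly directly onto the lxml element API. Below is a rough sketch on a tiny made-up XML document (the bookstore markup is a hypothetical example, not taken from the original notes).

```python
from lxml import etree

# Hypothetical XML used only to illustrate the node relationships.
root = etree.fromstring(
    "<bookstore><book><title>Harry Potter</title><price>29.99</price></book></bookstore>"
)
book = root[0]
title = book[0]

parent = title.getparent().tag                         # parent of <title>
children = [child.tag for child in book]               # children of <book>
sibling = title.getnext().tag                          # next sibling of <title>
ancestors = [a.tag for a in title.iterancestors()]     # nearest ancestor first
descendants = [d.tag for d in root.iterdescendants()]  # everything below the root

# The same relationships can also be expressed as XPath axes.
ancestors_via_xpath = {e.tag for e in title.xpath("ancestor::*")}

print(parent, children, sibling, ancestors, descendants, ancestors_via_xpath)
```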
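XPath syntax
Expression | Description |
---|---|
// | Selects matching nodes at any depth, regardless of their position |
/ | Selects from the root node |
/text() | Gets the text content at the current path |
\| | Selects several node sets at once, e.g. //p\|//div |
. | Selects the current node |
.. | Selects the parent of the current node |
@ | Selects attributes |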
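To make the table concrete, here is a sketch that exercises each expression once against a shortened copy of the sample document (the trimmed markup and the variable names are assumptions for illustration).

```python
from lxml import etree

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
</p>
</body></html>
"""
tree = etree.HTML(html_doc)

title = tree.xpath("/html/head/title/text()")  # / : absolute path from the root
links = tree.xpath("//a")                      # // : <a> elements at any depth
ids = tree.xpath("//a/@id")                    # @ : attribute values
union = tree.xpath("//title | //b")            # | : union of two node sets

first = links[0]
own_text = first.xpath("./text()")             # . : relative to the current node
parent_class = first.xpath("../@class")        # .. : parent of the current node

print(title, len(links), ids, [e.tag for e in union], own_text, parent_class)
```

Note that `xpath()` returns a list even when only one node matches, so a single value has to be taken out by index.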
from lxml import etree
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
selector = etree.HTML(html_doc)
selector.xpath('//p[@class="title"]/b/text()')
# ["The Dormouse's story"]
selector.xpath('//p[@class="story"]/a/@href')
# ['http://example.com/elsie', 'http://example.com/lacie', 'http://example.com/tillie']
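Querying text and attributes in separate `xpath()` calls gives two independent lists, which can drift out of alignment when some element lacks the attribute. One possible approach is to select the elements first and read each field per element. In this sketch the third link has had its `href` removed on purpose (a modification of the sample document made for illustration), and `records` is a name introduced here.

```python
from lxml import etree

# Sample links, with the href of the third one removed to show the mismatch.
html_doc = """
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a class="sister" id="link3">Tillie</a>
</p>
"""
selector = etree.HTML(html_doc)

# Two separate queries: three texts but only two hrefs.
texts = selector.xpath('//p[@class="story"]/a/text()')
hrefs = selector.xpath('//p[@class="story"]/a/@href')

# Per-element extraction keeps each link's fields together.
records = []
for a in selector.xpath('//p[@class="story"]/a'):
    records.append({
        "id": a.get("id"),
        "text": a.xpath("string(.)"),  # string() yields a single str, not a list
        "href": a.get("href"),         # None when the attribute is missing
    })

print(len(texts), len(hrefs))
for record in records:
    print(record)
```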