pySpider
@ Zhang zhiyang · Monday, Jan 1, 0001 · 2 minute read · Update at Monday, Jan 1, 0001

Web scraping basics…

0x01 Beautiful Soup

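Beautiful Soup is a Python library for extracting data from HTML and XML files. It works with the parser of your choice to provide idiomatic ways of navigating, searching, and modifying the parsed document.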

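Parsers

Parser | Usage | Advantages | Disadvantages
Python standard library | BeautifulSoup(markup, "html.parser") | Built into Python; moderate speed; tolerant of malformed markup | (none listed)
lxml HTML parser | BeautifulSoup(markup, "lxml") | Fast; tolerant of malformed markup | Requires a C library
lxml XML parser | BeautifulSoup(markup, "lxml-xml") or BeautifulSoup(markup, "xml") | Fast; supports XML | Requires a C library
html5lib | BeautifulSoup(markup, "html5lib") | Best tolerance of malformed markup; does not depend on a C extension | Slow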
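
Which parsers are usable depends on what is installed, so a script may want to prefer a fast parser and fall back to the built-in one. The following is a minimal sketch of that idea, relying on Beautiful Soup raising bs4.FeatureNotFound when a requested parser is missing. The helper name make_soup and the preference order are illustrative assumptions, not part of the library.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred=("lxml", "html5lib", "html.parser")):
    """Parse markup with the first parser in `preferred` that is installed.

    Hypothetical helper: returns (soup, parser_name).
    """
    for parser in preferred:
        try:
            return BeautifulSoup(markup, parser), parser
        except FeatureNotFound:
            # This parser is not installed; try the next one.
            continue
    raise RuntimeError("none of the requested parsers is available")


# Deliberately unclosed tags: every parser in the table repairs them.
soup, used = make_soup("<p>Hello<b>world")
print(used)
print(soup.b.get_text())  # world
```

Since "html.parser" ships with Python, putting it last guarantees the loop succeeds whenever bs4 itself is installed.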

Usage

from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

soup = BeautifulSoup(html_doc, 'html.parser')
# Tag name
soup.title              # <title>The Dormouse's story</title>
soup.title.name         # 'title'

# Tag attributes
soup.p                  # <p class="title"><b>The Dormouse's story</b></p>
soup.p.attrs            # {'class': ['title']}
soup.p['class']         # ['title'] (class is multi-valued, so a list is returned)

# Text
soup.title.string       # "The Dormouse's story"
soup.title.text         # "The Dormouse's story"
soup.title.get_text()   # "The Dormouse's story"

# Parent
soup.title.parent       # <head><title>The Dormouse's story</title></head>
soup.title.parent.name  # 'head'
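
The three text accessors above agree only because <title> holds a single string. As a small sketch of where they differ, the snippet below uses a made-up <p> tag with mixed content (the markup and its attribute values are assumptions for illustration): .string gives None when a tag has more than one child, while get_text() joins all nested text.

```python
from bs4 import BeautifulSoup

# Illustrative markup, not taken from the page above.
snippet = '<p class="story intro">Once <a href="/elsie" id="link1">Elsie</a> lived</p>'
soup = BeautifulSoup(snippet, "html.parser")
p = soup.p

print(p["class"])        # ['story', 'intro']  (multi-valued attribute -> list)
print(p.string)          # None  (p has several children, so no single string)
print(p.get_text())      # Once Elsie lived
print(p.a["href"])       # /elsie
print(p.a.get("title"))  # None  (.get() avoids a KeyError for missing attributes)
```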

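Methods

  • find()
find() returns only the first tag that matches the given filters, as a single Tag object, or None when nothing matches.
import re  # needed for the re.compile() filters below
soup.find('a')
soup.find('a', class_='xxx')
soup.find('a', title='xxx')
soup.find('a', id='xxx')
soup.find('a', id=re.compile(r'xxx'))
  • find_all()
find_all() returns a list of every tag that matches the given filters; the list is empty when nothing matches.
soup.find_all('a')
soup.find_all('a', class_='wang')
soup.find_all('a', id=re.compile(r'xxx'))
soup.find_all('a', limit=2)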
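
To make the difference between the two methods concrete, here is a short runnable sketch against a trimmed copy of the sample links. The id value "no-such-id" and the pattern link[12] are made up for the example.

```python
import re
from bs4 import BeautifulSoup

html = """
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p>
"""
soup = BeautifulSoup(html, "html.parser")

first = soup.find("a", class_="sister")                    # first match only
missing = soup.find("a", id="no-such-id")                  # no match -> None
all_links = soup.find_all("a")                             # every match
by_regex = soup.find_all("a", id=re.compile(r"link[12]"))  # regex filter on id
limited = soup.find_all("a", limit=2)                      # stop after two matches

print(first["id"])                       # link1
print(missing)                           # None
print(len(all_links))                    # 3
print([a.get_text() for a in by_regex])  # ['Elsie', 'Lacie']
print([a["id"] for a in limited])        # ['link1', 'link2']
```

Because find() can return None, check its result before indexing into it.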

0x02 lxml

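XPath is a language for locating information in XML documents. lxml evaluates XPath expressions through the xpath() method of a parsed element.

Concepts

Nodes

Element, attribute, text, namespace, and document (root) nodes

Node relationships

  • Parent
  • Children
  • Sibling
  • Ancestor
  • Descendant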
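
The relationships listed above can be sketched with lxml, using getparent(), plain iteration over an element, and the XPath axes ancestor, descendant, preceding-sibling and following-sibling. The markup below is a cut-down, made-up version of the sample document.

```python
from lxml import etree

root = etree.HTML(
    '<html><body><p class="story">'
    '<a id="link1">Elsie</a><a id="link2">Lacie</a><a id="link3">Tillie</a>'
    '</p></body></html>'
)
link2 = root.xpath('//a[@id="link2"]')[0]

# Parent: the element that directly contains link2.
parent = link2.getparent().tag

# Children: iterating an element yields its child elements.
children = [child.get("id") for child in link2.getparent()]

# Siblings: nodes sharing the same parent, before and after link2.
siblings = [el.get("id") for el in
            link2.xpath('preceding-sibling::a | following-sibling::a')]

# Ancestors: parent, parent's parent, and so on up to the root element.
ancestors = [el.tag for el in link2.xpath('ancestor::*')]

# Descendants: children, children's children, and so on.
descendants = [el.tag for el in root.xpath('//body/descendant::*')]

print(parent)       # p
print(children)     # ['link1', 'link2', 'link3']
print(siblings)     # ['link1', 'link3']
print(ancestors)    # ['html', 'body', 'p']
print(descendants)  # ['p', 'a', 'a', 'a']
```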

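XPath syntax

Expression   Description
//           Selects matching nodes at any depth, regardless of where they sit in the document
/            Selects from the root node when it starts a path; between steps it selects direct children
/text()      Returns the text content of the node at the current path
|            Selects the union of several node sets, e.g. //p | //div
.            Selects the current node
..           Selects the parent of the current node
@            Selects an attribute
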
from lxml import etree

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""


selector = etree.HTML(html_doc)

selector.xpath('//p[@class="title"]/b/text()')
# ["The Dormouse's story"]

selector.xpath('//p[@class="story"]/a/@href')
# ['http://example.com/elsie', 'http://example.com/lacie', 'http://example.com/tillie']
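
The remaining expressions from the syntax table (|, ., .. and the difference between / and //) can be tried the same way. This is a small sketch on a shortened copy of the sample document; the variable names are illustrative.

```python
from lxml import etree

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
</p>
</body></html>
"""
selector = etree.HTML(html_doc)

# "|" unions two node sets; results come back in document order.
tags = [el.tag for el in selector.xpath('//title | //b')]

# A leading "/" is anchored at the root; "//" matches at any depth.
direct = [str(v) for v in selector.xpath('/html/body/p/@class')]
anywhere = [str(v) for v in selector.xpath('//@id')]

# "." and ".." are relative to the element xpath() is called on.
rows = []
for a in selector.xpath('//a[@class="sister"]'):
    text = str(a.xpath('./text()')[0])           # text of the current <a>
    link_id = str(a.xpath('./@id')[0])           # attribute of the current <a>
    parent_class = str(a.xpath('../@class')[0])  # attribute of its parent <p>
    rows.append((text, link_id, parent_class))

print(tags)      # ['title', 'b']
print(direct)    # ['title', 'story']
print(anywhere)  # ['link1', 'link2']
print(rows)      # [('Elsie', 'link1', 'story'), ('Lacie', 'link2', 'story')]
```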