Notes on Regex-Based Web Scraping in Python

Anonymous, 2020-06-26, filed under Safe

I. re.match(): matching starts at the beginning of the string. For a string such as hello, the pattern has to match from the first character, h.

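1. re.match() with the pattern '^hello.*Demo$' matches the entire string: ^hello anchors the start, .* covers everything in between, and Demo$ anchors the end.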

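import re

content = "hello 123 4567 World_This is a regex Demo"
# ^hello anchors the start, Demo$ anchors the end, .* matches everything between
result = re.match(r'^hello.*Demo$', content)
print(result.group())  # prints the whole matched string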
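
Note that re.match() returns None when the pattern does not match at the start of the string, in which case calling .group() directly raises AttributeError. A minimal sketch of guarding against that, reusing the string above (the helper name match_or_none is hypothetical, not part of the re module):

```python
import re

content = "hello 123 4567 World_This is a regex Demo"

def match_or_none(pattern, text):
    """Return the full matched text, or None when re.match() finds nothing."""
    m = re.match(pattern, text)
    return m.group() if m else None

full = match_or_none(r'^hello.*Demo$', content)
miss = match_or_none(r'^ello.*Demo$', content)  # the string does not start with "ello"

print(full)
print(miss)
```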

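2. Capture groups: () and group(1). Parentheses capture part of the matched string, for example a run of digits with (\d+).

group() returns the whole match; group(1) returns the text captured by the first (), group(2) by the second, and so on.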

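import re

content = "hello 123 4567 World_This is a regex Demo"
# Raw string (r'...') keeps \s and \d from being treated as string escapes
result = re.match(r'^hello\s(\d+)\s(\d+)\sWorld.*Demo$', content)
print(result.group(2))  # second capture group: 4567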
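
To make the relationship between group(), group(n) and the captured values concrete, here is a small sketch on the same string. It also uses groups(), a standard method on match objects that returns all captures as a tuple:

```python
import re

content = "hello 123 4567 World_This is a regex Demo"
m = re.match(r'^hello\s(\d+)\s(\d+)\sWorld.*Demo$', content)

whole = m.group()    # the entire match, same as m.group(0)
first = m.group(1)   # text captured by the first ()
second = m.group(2)  # text captured by the second ()
both = m.groups()    # all captures as a tuple

print(whole)
print(first, second)
print(both)
```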

3. \s matches exactly one whitespace character. If there may be several spaces, for example between hello and 123, use \s+ instead.
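
A quick sketch of the difference, assuming a made-up input with four spaces between hello and 123:

```python
import re

spaced = "hello    123"  # hypothetical input: four spaces before the digits

single = re.match(r'^hello\s(\d+)', spaced)  # \s consumes one space, then \d+ fails
plus = re.match(r'^hello\s+(\d+)', spaced)   # \s+ consumes all four spaces

print(single)
print(plus.group(1))
```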

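4. .* matches whitespace or any other run of characters, but it is greedy: it consumes as much as it can, which starves the groups that follow it, as in .*(\d+). When replacing \s+ with a wildcard, use the non-greedy .*? instead.

4.1 Greedy mode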

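import re

content = "hello 123 4567 World_This is a regex Demo"
# Greedy .* grabs as much as possible before the first group
result = re.match(r'^hello.*(\d+)\s(\d+)\sWorld.*Demo$', content)
print(result.group(1))

Output: 3 (the greedy .* swallows " 12" and leaves only the last digit for the first group)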

4.2 Non-greedy mode

import re

content = "hello 123 4567 World_This is a regex Demo"
# Non-greedy .*? grabs as little as possible, so (\d+) gets the full number
result = re.match(r'^hello.*?(\d+).*?(\d+)\sWorld.*Demo$', content)
print(result.group(1))

Output: 123
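
One way to see exactly what the wildcard consumed is to wrap it in its own capture group. The sketch below is a variation on the two patterns above (shortened to stop after World), not the original code:

```python
import re

content = "hello 123 4567 World_This is a regex Demo"

# Greedy: (.*) takes as much as it can while the rest can still match
greedy = re.match(r'^hello(.*)(\d+)\s(\d+)\sWorld', content)

# Non-greedy: (.*?) takes as little as it can
lazy = re.match(r'^hello(.*?)(\d+)\s(\d+)\sWorld', content)

print(repr(greedy.group(1)), repr(greedy.group(2)))
print(repr(lazy.group(1)), repr(lazy.group(2)))
```

In the greedy case the wildcard captures the leading space and the first two digits, so the digit group is left with a single character. In the non-greedy case the wildcard captures only the space.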

5. Capturing 123 4567 as a single group with (.*?)

import re

content = "hello 123 4567 World_This is a regex Demo"
# (.*?) expands only until \s+World can match, so it captures both numbers
result = re.match(r'^hello\s+(.*?)\s+World.*Demo$', content)
print(result.group(1))

Output: 123 4567

To match special characters literally, escape them with a backslash: $5.00 becomes \$5\.00.
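
The escaping rule can be sketched as follows. The sample sentence is made up for illustration; re.escape() is the standard-library helper that adds the backslashes automatically:

```python
import re

price_text = "Total: $5.00 today"  # hypothetical sample text

unescaped = re.search('$5.00', price_text)     # $ means end of string here, so no match
manual = re.search(r'\$5\.00', price_text)     # hand-escaped pattern

escaped_pattern = re.escape("$5.00")           # lets the library do the escaping
auto = re.search(escaped_pattern, price_text)

print(unescaped)
print(manual.group())
print(escaped_pattern)
print(auto.group())
```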

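II. re.search(): scans the whole string and returns the first match. For hello, the pattern does not have to begin at h; it can begin at e.

Other articles online tend to be muddled on this point: they show a single re.search example without spelling out how re.search differs from re.match, which makes it hard for beginners to really understand the difference between the two.

1. Reusing the earlier example to capture "123 4567"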

import re

content = "hello 123 4567 World_This is a regex Demo"
# The pattern starts at "ello"; re.match() would need it to start at "h"
result = re.search(r'ello\s+(.*?)\s+World.*Demo$', content)
print(result.group(1))

Output: 123 4567
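
To put the two functions side by side, the sketch below runs the same pattern through both. re.match() fails because the string does not begin with ello, while re.search() finds the match starting at index 1:

```python
import re

content = "hello 123 4567 World_This is a regex Demo"
pattern = r'ello\s+(.*?)\s+World.*Demo$'

matched = re.match(pattern, content)    # only tries position 0, where "h" is
searched = re.search(pattern, content)  # tries every position until one matches

print(matched)
print(searched.group(1))
print(searched.start())
```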