Posted 2024-02-26 · Updated 2024-03-03 · an hour read (about 11309 words)

python web scraping

http

uri url

URL stands for Uniform Resource Locator. A URL is nothing more than the address of a given unique resource on the Web. In theory, each valid URL points to a unique resource.
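
As a small illustration, Python's standard `urllib.parse` module can split a URL into its components. The URL below is made up for this sketch and does not come from the post.

```python
from urllib.parse import urlsplit

# Hypothetical URL used only for illustration
url = "https://example.com:8443/docs/index.html?lang=en&page=2#intro"

parts = urlsplit(url)
print(parts.scheme)    # https
print(parts.hostname)  # example.com
print(parts.port)      # 8443
print(parts.path)      # /docs/index.html
print(parts.query)     # lang=en&page=2
print(parts.fragment)  # intro
```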


http https

tls/ssl

Transport Layer Security (TLS) is a cryptographic protocol designed to provide communications security over a computer network. The protocol is widely used in applications such as email, instant messaging, and voice over IP, but its use in securing HTTPS remains the most publicly visible.

TLS builds on the now-deprecated SSL (Secure Sockets Layer) specifications (1994, 1995, 1996).

chrome developer tool
network

Status. The HTTP response code.

Type. The resource type.

Initiator. What caused a resource to be requested. Clicking a link in the Initiator column takes you to the source code that caused the request.

Time. How long the request took.

Waterfall. A graphical representation of the different stages of the request. Hover over a Waterfall to see a breakdown.

detail

Headers. Use this tab to inspect HTTP headers.

Preview. A basic rendering of the HTML is shown.

Response. The HTML source code is shown.

Timing. A breakdown of the network activity for this resource is shown.

http request

header
content-type: the Internet media type (MIME type) of the body
HTML -> text/html
GIF -> image/gif
JSON -> application/json
XML -> text/xml
form with file upload -> multipart/form-data
form data -> application/x-www-form-urlencoded
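
A rough sketch of how these values show up in code: the standard library can parse a Content-Type header value, and `urllib.parse.urlencode` produces the body format named by application/x-www-form-urlencoded. The header value and the form fields are invented for the example.

```python
from email.message import Message
from urllib.parse import urlencode

# Parse a Content-Type header value (made-up sample value)
msg = Message()
msg["Content-Type"] = "text/html; charset=UTF-8"
media_type = msg.get_content_type()   # media type without its parameters
charset = msg.get_content_charset()   # charset parameter, lowercased
print(media_type, charset)            # text/html utf-8

# Body of a form submission sent as application/x-www-form-urlencoded
form_body = urlencode({"kw": "dog", "page": 1})
print(form_body)                      # kw=dog&page=1
```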

http response

Information responses
100 Continue
This interim response indicates that the client should continue the request or ignore the response if the request is already finished.

101 Switching Protocols
This code is sent in response to an Upgrade request header from the client and indicates the protocol the server is switching to.

102 Processing (WebDAV)
This code indicates that the server has received and is processing the request, but no response is available yet.

103 Early Hints
This status code is primarily intended to be used with the Link header, letting the user agent start preloading resources while the server prepares a response or preconnect to an origin from which the page will need resources.

Successful responses
200 OK
The request succeeded. The meaning of “success” depends on the HTTP method:

GET: The resource has been fetched and transmitted in the message body.
HEAD: The representation headers are included in the response without any message body.
PUT or POST: The resource describing the result of the action is transmitted in the message body.
TRACE: The message body contains the request message as received by the server.
201 Created
The request succeeded, and a new resource was created as a result. This is typically the response sent after POST requests, or some PUT requests.

202 Accepted
The request has been received but not yet acted upon. It is noncommittal, since there is no way in HTTP to later send an asynchronous response indicating the outcome of the request. It is intended for cases where another process or server handles the request, or for batch processing.

203 Non-Authoritative Information
This response code means the returned metadata is not exactly the same as is available from the origin server, but is collected from a local or a third-party copy. This is mostly used for mirrors or backups of another resource. Except for that specific case, the 200 OK response is preferred to this status.

204 No Content
There is no content to send for this request, but the headers may be useful. The user agent may update its cached headers for this resource with the new ones.

205 Reset Content
Tells the user agent to reset the document which sent this request.

206 Partial Content
This response code is used when the Range header is sent from the client to request only part of a resource.

207 Multi-Status (WebDAV)
Conveys information about multiple resources, for situations where multiple status codes might be appropriate.

208 Already Reported (WebDAV)
Used inside a dav:propstat response element to avoid repeatedly enumerating the internal members of multiple bindings to the same collection.

226 IM Used (HTTP Delta encoding)
The server has fulfilled a GET request for the resource, and the response is a representation of the result of one or more instance-manipulations applied to the current instance.

Redirection messages
300 Multiple Choices
The request has more than one possible response. The user agent or user should choose one of them. (There is no standardized way of choosing one of the responses, but HTML links to the possibilities are recommended so the user can pick.)

301 Moved Permanently
The URL of the requested resource has been changed permanently. The new URL is given in the response.

302 Found
This response code means that the URI of requested resource has been changed temporarily. Further changes in the URI might be made in the future. Therefore, this same URI should be used by the client in future requests.

303 See Other
The server sent this response to direct the client to get the requested resource at another URI with a GET request.

304 Not Modified
This is used for caching purposes. It tells the client that the response has not been modified, so the client can continue to use the same cached version of the response.

305 Use Proxy (deprecated)
Defined in a previous version of the HTTP specification to indicate that a requested response must be accessed by a proxy. It has been deprecated due to security concerns regarding in-band configuration of a proxy.

306 unused
This response code is no longer used; it is just reserved. It was used in a previous version of the HTTP/1.1 specification.

307 Temporary Redirect
The server sends this response to direct the client to get the requested resource at another URI with the same method that was used in the prior request. This has the same semantics as the 302 Found HTTP response code, with the exception that the user agent must not change the HTTP method used: if a POST was used in the first request, a POST must be used in the second request.

308 Permanent Redirect
This means that the resource is now permanently located at another URI, specified by the Location: HTTP Response header. This has the same semantics as the 301 Moved Permanently HTTP response code, with the exception that the user agent must not change the HTTP method used: if a POST was used in the first request, a POST must be used in the second request.

Client error responses
400 Bad Request
The server cannot or will not process the request due to something that is perceived to be a client error (e.g., malformed request syntax, invalid request message framing, or deceptive request routing).

401 Unauthorized
Although the HTTP standard specifies “unauthorized”, semantically this response means “unauthenticated”. That is, the client must authenticate itself to get the requested response.

402 Payment Required (experimental)
This response code is reserved for future use. The initial aim for creating this code was using it for digital payment systems, however this status code is used very rarely and no standard convention exists.

403 Forbidden
The client does not have access rights to the content; that is, it is unauthorized, so the server is refusing to give the requested resource. Unlike 401 Unauthorized, the client’s identity is known to the server.

404 Not Found
The server cannot find the requested resource. In the browser, this means the URL is not recognized. In an API, this can also mean that the endpoint is valid but the resource itself does not exist. Servers may also send this response instead of 403 Forbidden to hide the existence of a resource from an unauthorized client. This response code is probably the most well known due to its frequent occurrence on the web.

405 Method Not Allowed
The request method is known by the server but is not supported by the target resource. For example, an API may not allow calling DELETE to remove a resource.

406 Not Acceptable
This response is sent when the web server, after performing server-driven content negotiation, doesn’t find any content that conforms to the criteria given by the user agent.

407 Proxy Authentication Required
This is similar to 401 Unauthorized but authentication is needed to be done by a proxy.

408 Request Timeout
This response is sent on an idle connection by some servers, even without any previous request by the client. It means that the server would like to shut down this unused connection. This response is used much more since some browsers, like Chrome, Firefox 27+, or IE9, use HTTP pre-connection mechanisms to speed up surfing. Also note that some servers merely shut down the connection without sending this message.

409 Conflict
This response is sent when a request conflicts with the current state of the server.

410 Gone
This response is sent when the requested content has been permanently deleted from server, with no forwarding address. Clients are expected to remove their caches and links to the resource. The HTTP specification intends this status code to be used for “limited-time, promotional services”. APIs should not feel compelled to indicate resources that have been deleted with this status code.

411 Length Required
Server rejected the request because the Content-Length header field is not defined and the server requires it.

412 Precondition Failed
The client has indicated preconditions in its headers which the server does not meet.

413 Payload Too Large
The request entity is larger than the limits defined by the server. The server might close the connection or return a Retry-After header field.

414 URI Too Long
The URI requested by the client is longer than the server is willing to interpret.

415 Unsupported Media Type
The media format of the requested data is not supported by the server, so the server is rejecting the request.

416 Range Not Satisfiable
The range specified by the Range header field in the request cannot be fulfilled. It’s possible that the range is outside the size of the target URI’s data.

417 Expectation Failed
This response code means the expectation indicated by the Expect request header field cannot be met by the server.

418 I’m a teapot
The server refuses the attempt to brew coffee with a teapot.

421 Misdirected Request
The request was directed at a server that is not able to produce a response. This can be sent by a server that is not configured to produce responses for the combination of scheme and authority that are included in the request URI.

422 Unprocessable Content (WebDAV)
The request was well-formed but was unable to be followed due to semantic errors.

423 Locked (WebDAV)
The resource that is being accessed is locked.

424 Failed Dependency (WebDAV)
The request failed due to failure of a previous request.

425 Too Early (experimental)
Indicates that the server is unwilling to risk processing a request that might be replayed.

426 Upgrade Required
The server refuses to perform the request using the current protocol but might be willing to do so after the client upgrades to a different protocol. The server sends an Upgrade header in a 426 response to indicate the required protocol(s).

428 Precondition Required
The origin server requires the request to be conditional. This response is intended to prevent the ‘lost update’ problem, where a client GETs a resource’s state, modifies it and PUTs it back to the server, when meanwhile a third party has modified the state on the server, leading to a conflict.

429 Too Many Requests
The user has sent too many requests in a given amount of time (“rate limiting”).

431 Request Header Fields Too Large
The server is unwilling to process the request because its header fields are too large. The request may be resubmitted after reducing the size of the request header fields.

451 Unavailable For Legal Reasons
The user agent requested a resource that cannot legally be provided, such as a web page censored by a government.

Server error responses
500 Internal Server Error
The server has encountered a situation it does not know how to handle.

501 Not Implemented
The request method is not supported by the server and cannot be handled. The only methods that servers are required to support (and therefore that must not return this code) are GET and HEAD.

502 Bad Gateway
This error response means that the server, while working as a gateway to get a response needed to handle the request, got an invalid response.

503 Service Unavailable
The server is not ready to handle the request. Common causes are a server that is down for maintenance or that is overloaded. Note that together with this response, a user-friendly page explaining the problem should be sent. This response should be used for temporary conditions and the Retry-After HTTP header should, if possible, contain the estimated time before the recovery of the service. The webmaster must also take care about the caching-related headers that are sent along with this response, as these temporary condition responses should usually not be cached.

504 Gateway Timeout
This error response is given when the server is acting as a gateway and cannot get a response in time.

505 HTTP Version Not Supported
The HTTP version used in the request is not supported by the server.

506 Variant Also Negotiates
The server has an internal configuration error: the chosen variant resource is configured to engage in transparent content negotiation itself, and is therefore not a proper end point in the negotiation process.

507 Insufficient Storage (WebDAV)
The method could not be performed on the resource because the server is unable to store the representation needed to successfully complete the request.

508 Loop Detected (WebDAV)
The server detected an infinite loop while processing the request.

510 Not Extended
Further extensions to the request are required for the server to fulfill it.

511 Network Authentication Required
Indicates that the client needs to authenticate to gain network access.
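
The first digit of a status code gives its class, which is usually all a scraper needs in order to decide what to do next. A minimal sketch using the standard `http.HTTPStatus` enum; the helper name `status_class` is made up for this example.

```python
from http import HTTPStatus

def status_class(code: int) -> str:
    """Map a status code to the name of its class (1xx to 5xx)."""
    classes = {
        1: "informational",
        2: "successful",
        3: "redirection",
        4: "client error",
        5: "server error",
    }
    return classes.get(code // 100, "unknown")

for code in (200, 301, 404, 429, 503):
    # HTTPStatus(code).phrase is the standard reason phrase, e.g. "Not Found"
    print(code, HTTPStatus(code).phrase, "->", status_class(code))
```

In a crawler, 429 and 503 are the usual candidates for waiting and retrying, while most other 4xx codes point to a problem in the request itself.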

response header

content-type:
application/x-javascript -> JavaScript file

Web basics

html css js

html dom
document object model

css selector

Selector	Example	Example description
.class	.intro	Selects all elements with class="intro"
.class1.class2	.name1.name2	Selects all elements with both name1 and name2 set within its class attribute
.class1 .class2	.name1 .name2	Selects all elements with name2 that is a descendant of an element with name1
#id	#firstname	Selects the element with id="firstname"
*	*	Selects all elements
element	p	Selects all `<p>` elements
element.class	p.intro	Selects all `<p>` elements with class=”intro”
element,element	div, p	Selects all `<div>` elements and all `<p>` elements
element element	div p	Selects all `<p>` elements inside `<div>` elements
element>element	div > p	Selects all `<p>` elements where the parent is a `<div>` element
element+element	div + p	Selects the first `<p>` element that is placed immediately after `<div>` elements
element1~element2	p ~ ul	Selects every `<ul>` element that is preceded by a `<p>` element
[attribute]	[target]	Selects all elements with a target attribute
[attribute=value]	[target="_blank"]	Selects all elements with target="_blank"
[attribute~=value]	[title~="flower"]	Selects all elements with a title attribute containing the word "flower"
[attribute|=value]	[lang|="en"]	Selects all elements with a lang attribute value equal to "en" or starting with "en-"
[attribute^=value]	a[href^="https"]	Selects every `<a>` element whose href attribute value begins with "https"
[attribute$=value]	a[href$=".pdf"]	Selects every `<a>` element whose href attribute value ends with ".pdf"
[attribute*=value]	a[href*="w3schools"]	Selects every `<a>` element whose href attribute value contains the substring "w3schools"
:active	a:active	Selects the active link
::after	p::after	Insert something after the content of each `<p>` element
::before	p::before	Insert something before the content of each `<p>` element
:checked	input:checked	Selects every checked `<input>` element
:default	input:default	Selects the default `<input>` element
:disabled	input:disabled	Selects every disabled `<input>` element
:empty	p:empty	Selects every `<p>` element that has no children (including text nodes)
:enabled	input:enabled	Selects every enabled `<input>` element
:first-child	p:first-child	Selects every `<p>` element that is the first child of its parent
::first-letter	p::first-letter	Selects the first letter of every `<p>` element
::first-line	p::first-line	Selects the first line of every `<p>` element
:first-of-type	p:first-of-type	Selects every `<p>` element that is the first `<p>` element of its parent
:focus	input:focus	Selects the input element which has focus
:fullscreen	:fullscreen	Selects the element that is in full-screen mode
:hover	a:hover	Selects links on mouse over
:in-range	input:in-range	Selects input elements with a value within a specified range
:indeterminate	input:indeterminate	Selects input elements that are in an indeterminate state
:invalid	input:invalid	Selects all input elements with an invalid value
:lang(language)	p:lang(it)	Selects every `<p>` element with a lang attribute equal to “it” (Italian)
:last-child	p:last-child	Selects every `<p>` element that is the last child of its parent
:last-of-type	p:last-of-type	Selects every `<p>` element that is the last `<p>` element of its parent
:link	a:link	Selects all unvisited links
::marker	::marker	Selects the markers of list items
:not(selector)	:not(p)	Selects every element that is not a `<p>` element
:nth-child(n)	p:nth-child(2)	Selects every `<p>` element that is the second child of its parent
:nth-last-child(n)	p:nth-last-child(2)	Selects every `<p>` element that is the second child of its parent, counting from the last child
:nth-last-of-type(n)	p:nth-last-of-type(2)	Selects every `<p>` element that is the second `<p>` element of its parent, counting from the last child
:nth-of-type(n)	p:nth-of-type(2)	Selects every `<p>` element that is the second `<p>` element of its parent
:only-of-type	p:only-of-type	Selects every `<p>` element that is the only `<p>` element of its parent
:only-child	p:only-child	Selects every `<p>` element that is the only child of its parent
:optional	input:optional	Selects input elements with no “required” attribute
:out-of-range	input:out-of-range	Selects input elements with a value outside a specified range
::placeholder	input::placeholder	Selects input elements with the “placeholder” attribute specified
:read-only	input:read-only	Selects input elements with the “readonly” attribute specified
:read-write	input:read-write	Selects input elements with the “readonly” attribute NOT specified
:required	input:required	Selects input elements with the “required” attribute specified
:root	:root	Selects the document’s root element
::selection	::selection	Selects the portion of an element that is selected by a user
:target	#news:target	Selects the current active #news element (clicked on a URL containing that anchor name)
:valid	input:valid	Selects all input elements with a valid value
:visited	a:visited	Selects all visited links

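from urllib.request import urlopen

url = "http://www.baidu.com"
# The with block closes the response automatically
with urlopen(url) as resp:
    html_body = resp.read().decode("utf-8")
# Write with an explicit encoding so the file does not depend on the platform default
with open("mybaidu.html", mode="w", encoding="utf-8") as f:
    f.write(html_body)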

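Server rendering

Before AJAX, in the Web 1.0 era, almost every application was rendered on the server (this is not the same thing as today's server-side rendering). Page rendering went roughly like this: the browser requests the page URL; the server receives the request, queries the database, and feeds the data into back-end templates (PHP, ASP, JSP, etc.), which render HTML fragments; the server then assembles those fragments into a complete HTML document and returns it to the browser. At this point the browser holds a complete HTML text that was assembled dynamically by the server, and it renders that HTML into the page.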


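Client-side rendering
After the front end and back end were separated, web pages started to be treated as standalone applications (SPA, Single Page Application). The front-end team took over all page rendering, and the back-end team only provides APIs for querying and processing data. The rough flow is: the browser requests the URL, and the front-end server immediately returns an empty static HTML file (no database queries or template assembly needed). That HTML file loads the JavaScript scripts and CSS style sheets needed to render the page. Once the browser has the HTML file, it loads the scripts and style sheets and runs the scripts; the scripts call the APIs provided by the back-end service to fetch data, and JavaScript then renders that data into the page dynamically, completing the display.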


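Server-side rendering
Server-side rendering. The overall flow is somewhat similar to client-side rendering. The browser requests the URL; the front-end server receives the request and, depending on the URL, requests data from the back-end server. When that completes, the front-end server assembles an HTML text that already carries the actual data and returns it to the browser. The browser starts rendering the page as soon as it has the HTML; at the same time it loads and runs the JavaScript, which binds events to the elements on the page and makes it interactive. When the user interacts with the page, for example by navigating to the next page, the browser runs JavaScript that requests data from the back-end server and then renders the page dynamically with the result.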
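
For scraping, the practical difference is where the data lives: with server rendering it is already in the HTML, while with client-side rendering it usually arrives later through an API call. A minimal sketch of that check, using two invented HTML strings in place of real responses (the helper name `data_in_html` is also made up):

```python
def data_in_html(html: str, keyword: str) -> bool:
    """Return True if the keyword is already present in the raw HTML source."""
    return keyword in html

# Invented samples: a server-rendered page and an empty SPA shell
server_rendered = "<html><body><ul><li>Movie A</li></ul></body></html>"
spa_shell = "<html><body><div id='app'></div><script src='/app.js'></script></body></html>"

print(data_in_html(server_rendered, "Movie A"))  # True: parse the HTML directly
print(data_in_html(spa_shell, "Movie A"))        # False: look for the API request in the Network tab
```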


Response headers

cookie
token

Request headers

user-agent
referer: the page the request came from (the address of the previous page)
cookie

GET
POST

import requests

url = "https://sogou.com/web?query=周杰伦"

my_headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36"
}
resp = requests.get(url, headers=my_headers)
print(resp)
html_resp = resp.text
print(html_resp)
resp.close()

import requests

url = "https://fanyi.baidu.com/sug"
word = input("English word: ")
dat = {
    "kw": word
}
resp = requests.post(url, data=dat)
print(resp.json())
resp.close()

string

the most common string methods

s.lower(), s.upper() – returns the lowercase or uppercase version of the string
s.strip() – returns a string with whitespace removed from the start and end
s.isalpha()/s.isdigit()/s.isspace()… – tests if all the string chars are in the various character classes
s.startswith('other'), s.endswith('other') – tests if the string starts or ends with the given other string
s.find('other') – searches for the given other string (not a regular expression) within s, and returns the first index where it begins or -1 if not found
s.replace('old', 'new') – returns a string where all occurrences of 'old' have been replaced by 'new'
s.split('delim') – returns a list of substrings separated by the given delimiter. The delimiter is not a regular expression, it's just text. 'aaa,bbb,ccc'.split(',') -> ['aaa', 'bbb', 'ccc']. As a convenient special case s.split() (with no arguments) splits on all whitespace chars.
s.join(list) – opposite of split(), joins the elements in the given list together using the string as the delimiter. e.g. '---'.join(['aaa', 'bbb', 'ccc']) -> aaa---bbb---ccc
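
The methods above can be tried on a small sample string. The string is arbitrary and chosen only for this sketch.

```python
s = "  Hello, World  "

clean = s.strip()                         # whitespace removed from both ends
print(clean)                              # Hello, World
print(clean.lower(), clean.upper())       # hello, world HELLO, WORLD
print(clean.startswith("Hello"), clean.endswith("World"))  # True True
print(clean.find("World"), clean.find("xyz"))              # 7 -1
print(clean.replace("l", "L"))            # HeLLo, WorLd
print("aaa,bbb,ccc".split(","))           # ['aaa', 'bbb', 'ccc']
print("---".join(["aaa", "bbb", "ccc"]))  # aaa---bbb---ccc
print("123".isdigit(), "abc".isalpha(), " \t".isspace())   # True True True
```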

str.rsplit(sep=None, maxsplit=-1)
Return a list of the words in the string, using sep as the delimiter string. If maxsplit is given, at most maxsplit splits are done, the rightmost ones. If sep is not specified or None, any whitespace string is a separator. Except for splitting from the right, rsplit() behaves like split() which is described in detail below.

str.split(sep=None, maxsplit=-1)
Return a list of the words in the string, using sep as the delimiter string. If maxsplit is given, at most maxsplit splits are done (thus, the list will have at most maxsplit+1 elements). If maxsplit is not specified or -1, then there is no limit on the number of splits (all possible splits are made).

If sep is given, consecutive delimiters are not grouped together and are deemed to delimit empty strings (for example, '1,,2'.split(',') returns ['1', '', '2']). The sep argument may consist of multiple characters (for example, '1<>2<>3'.split('<>') returns ['1', '2', '3']). Splitting an empty string with a specified separator returns [''].
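
A short comparison of split() and rsplit() with maxsplit, plus the edge cases described above. The file name is an arbitrary sample.

```python
path = "archive.backup.tar.gz"

left = path.split(".", 1)     # one split, counted from the left
right = path.rsplit(".", 1)   # one split, counted from the right
print(left)                   # ['archive', 'backup.tar.gz']
print(right)                  # ['archive.backup.tar', 'gz']

print("a  b   c".split())     # ['a', 'b', 'c']  (runs of whitespace are one separator)
print("a,,b".split(","))      # ['a', '', 'b']   (explicit separators are not grouped)
print("".split(","))          # ['']
```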

Strings (Unicode vs bytes)

To convert a regular Python string to bytes, call the encode() method on the string. Going the other direction, the byte string decode() method converts encoded plain bytes to a unicode string:

> ustring = 'A unicode \u018e string \xf1'
> b = ustring.encode('utf-8')
> b
b'A unicode \xc6\x8e string \xc3\xb1'  ## bytes of utf-8 encoding. Note the b-prefix.
> t = b.decode('utf-8')                ## Convert bytes back to a unicode string
> t == ustring                         ## It's the same as the original, yay!
True

urllib

from urllib import request
url_1 = 'http://www.baidu.com/'
resp = request.urlopen(url_1)
print(resp)

real_url = resp.geturl()
print(real_url)  # http://www.baidu.com/

resp_code = resp.getcode()
print(resp_code)  # 200

html_source_bytes = resp.read()  # bytes
html_source = html_source_bytes.decode()  # str
print(html_source)

urllib.request headers

from urllib import request

url_1 = 'http://httpbin.org/get'
my_header = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36'
}

req = request.Request(url=url_1, headers=my_header)

resp = request.urlopen(req)

html_source_bytes = resp.read()  # bytes
html_source = html_source_bytes.decode()  # str
print(html_source)
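
To check what a Request object will send without touching the network, its headers and method can be inspected directly. A minimal sketch: the User-Agent value `my-crawler/0.1` and the form body are placeholders invented for this example. Note that urllib stores header names with only the first letter capitalized, so the lookup key is `User-agent`.

```python
from urllib import request

req = request.Request(
    url="http://httpbin.org/get",
    headers={"User-Agent": "my-crawler/0.1", "Referer": "http://httpbin.org/"},
)
print(req.get_method())              # GET, because no data is attached
print(req.get_header("User-agent"))  # my-crawler/0.1
print(req.header_items())            # all headers as (name, value) pairs

# Attaching a bytes body turns the request into a POST
post_req = request.Request("http://httpbin.org/post", data=b"kw=dog")
print(post_req.get_method())         # POST
```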

urllib.parse.urlencode

# URL-encode the query string
# https://www.baidu.com/s?wd=%E8%B5%B5%E4%B8%BD%E9%A2%96
# https://www.baidu.com/s?wd=赵丽颖


from urllib import parse, request

url_1 = 'http://www.baidu.com/s?'
params_1 = {
    'wd': '赵丽颖',
    'ie': 'utf-8',
}
# Encode a dict or sequence of two-element tuples into a URL query string
params_encoded = parse.urlencode(params_1)
url_0 = url_1 + params_encoded
print(url_0)
# http://www.baidu.com/s?wd=%E8%B5%B5%E4%B8%BD%E9%A2%96&ie=utf-8
# request.urlopen(url_0)
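
For a single value rather than a whole dict, `parse.quote` and `parse.unquote` do the same percent-encoding, and `parse.parse_qs` reverses `urlencode`. A small sketch reusing the search term from above:

```python
from urllib import parse

word = "赵丽颖"
quoted = parse.quote(word)            # percent-encode the UTF-8 bytes
print(quoted)                         # %E8%B5%B5%E4%B8%BD%E9%A2%96
print(parse.unquote(quoted) == word)  # True

# parse_qs decodes a query string into a dict of lists
query = parse.urlencode({"wd": word, "ie": "utf-8"})
decoded = parse.parse_qs(query)
print(decoded)                        # {'wd': ['赵丽颖'], 'ie': ['utf-8']}
```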

example: baidutieba

from urllib import request, parse
import os
import time
import random

class BaiduTiebaSpider():

    def __init__(self):
        self.base_url = 'https://tieba.baidu.com/f?'
        self.params = {
            "kw": "赵丽颖吧",
            "ie": "utf-8",
            "pn": 0,
        }
        self.my_header = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36'
        }
        self.data_folder = 'baidutieba_data'
        # Create the output folder up front so save_html() does not fail
        os.makedirs(self.data_folder, exist_ok=True)
    def get_html(self, url):
        req = request.Request(url, headers=self.my_header)
        resp = request.urlopen(req)
        html_source = resp.read().decode()
        return html_source
    def parse_html(self):
        pass
    def save_html(self, filename, content):
        # Write as UTF-8 explicitly so Chinese text is saved correctly on any platform
        with open(filename, 'w', encoding='utf-8') as f:
            f.write(content)

    def run(self):
        name = input('tieba name:')
        start_page = int(input('start page:'))
        end_page = int(input('end page:'))
        self.params['kw'] = name
        for page in range(start_page, end_page + 1):
            self.params['pn'] = (page - 1) * 50  # pn advances by 50 per page
            params_encode = parse.urlencode(self.params)
            url_0 = self.base_url + params_encode
            html_content = self.get_html(url_0)
            self.save_html(f'{self.data_folder}/{name}_page_{page}.html', html_content)
            print(f'page {page} finished!')
            time.sleep(random.randint(1, 3))


if __name__ == '__main__':
    spider = BaiduTiebaSpider()
    spider.run()

requests

import requests

headers

headers = {
    'User-Agent': '',
    'Cookie': '',
}

requests.get(url, headers=headers)

get

params = {
    'key1': 'value1'
}

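# verify: True (default) checks the TLS certificate; False skips the check
# timeout: seconds to wait before giving up
# proxies: route requests through the given proxy servers
proxies = {
    'http': 'http://IP:PORT',
    'https': 'https://IP:PORT',
}

resp = requests.get(url, params=params, verify=False, timeout=5, proxies=proxies)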

post

data = {
    'key1': 'value1',
}

requests.post(url, data=data)

response

resp = requests.get(url)

resp.encoding = "utf-8"  # encoding used to decode resp.text
resp.text   ## html (str)

resp.content  ## raw bytes, e.g. b'asdfdsf'
resp.content.decode()

resp.json()  # JSON string -> Python dict
dic = json.loads(resp.text)  # same result via the json module (needs import json)

Parsing

re

Method/Attribute	Purpose
match()	Determine if the RE matches at the beginning of the string.
search()	Scan through a string, looking for any location where this RE matches.
findall()	Find all substrings where the RE matches, and returns them as a list.
finditer()	Find all substrings where the RE matches, and returns them as an iterator.

Match object instances

Method/Attribute	Purpose
group()	Return the string matched by the RE
start()	Return the starting position of the match
end()	Return the ending position of the match
span()	Return a tuple containing the (start, end) positions of the match
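
A compact sketch that exercises each of the methods listed above on one sample string. The text is invented for the example.

```python
import re

p = re.compile(r"\d+")
text = "tel: 10086, alt: 10010"

print(p.match(text))        # None: the string does not start with digits
m = p.search(text)          # first match anywhere in the string
print(m.group(), m.start(), m.end(), m.span())  # 10086 5 10 (5, 10)
print(p.findall(text))      # ['10086', '10010']
spans = [x.span() for x in p.finditer(text)]
print(spans)                # [(5, 10), (17, 22)]
```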

Compilation flags let you modify some aspects of how regular expressions work. Flags are available in the re module under two names, a long name such as IGNORECASE and a short, one-letter form such as I.

re.S
re.DOTALL
Makes the '.' special character match any character at all, including a newline; without this flag, '.' will match anything except a newline.
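
The effect is easiest to see on text that spans two lines, which is the usual situation when matching HTML. A minimal sketch with an invented snippet:

```python
import re

text = "<p>line one\nline two</p>"

without_flag = re.findall(r"<p>(.*?)</p>", text)
with_flag = re.findall(r"<p>(.*?)</p>", text, re.S)

print(without_flag)         # []: '.' stops at the newline, so nothing matches
print(with_flag)            # ['line one\nline two']
print(re.S == re.DOTALL)    # True: two names for the same flag
```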

named groups: instead of referring to them by numbers, groups can be referenced by a name.
The syntax for a named group is one of the Python-specific extensions: (?P<name>...). name is, obviously, the name of the group.

The match object methods that deal with capturing groups all accept either integers that refer to the group by number or strings that contain the desired group’s name.

import re
p = re.compile(r'(?P<word>\b\w+\b)')
m = p.search('(((( Lots of punctuation )))')
m.group('word')
# 'Lots'
m.group(1)
# 'Lots'

import re

result_list = re.findall(r"\d+", "我的电话号码是：10086")
print(result_list)

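Not very efficient (findall() builds the whole result list at once); finditer() returns an iterator instead.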

#  iterator
it = re.finditer(r"\d+", "我的电话号码是：10086")
for i in it:
    print(i.group())


# search() returns only the first match; group() returns the matched text
s = re.search(r"\d+", "我的电话号码是：10086，我女友的电话是：10010")
print(s.group())

Precompiling the regex

obj = re.compile(r"\d+")

it = obj.finditer("10086，我女友的电话是：10010")
for i in it:
    print(i.group())

Extracting fields

# Extract fields with named groups
s = """
<div class='jay'><span id='1'>郭麒麟</span></div>
<div class='jj'><span id='2'>宋铁</span></div>
<div class='jolin'><span id='3'>大聪明</span></div>
"""
# re.S lets . match newlines as well
obj = re.compile(r"<div class='.*?'><span id='(?P<id>\d+)'>(?P<name>.*?)</span></div>", re.S)

result = obj.finditer(s)
for it in result:
    print(it.group("name"))
    print(it.group("id"))

In Python 3, the file must be opened in untranslated text mode with the parameters 'w', newline='' (empty string); otherwise the csv module writes \r\r\n on Windows, where the default text mode translates each \n into \r\n.

import csv
it = obj.finditer(html_content)
with open("data05.csv", mode="w", newline='') as f:
    csvwriter = csv.writer(f)
    for i in it:
        dic = i.groupdict()
        dic["year"] = dic['year'].strip()
        print(dic.values())
        csvwriter.writerow(dic.values())
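
The fragment above relies on an `obj` pattern and `html_content` defined elsewhere. A self-contained sketch of the same idea (named groups, then groupdict(), then csv.writer with newline='') is shown here; the sample markup, the pattern, and the temporary file path are all invented for illustration.

```python
import csv
import os
import re
import tempfile

# Invented sample markup and pattern, for illustration only
html_content = """
<li><b>Movie A</b><i> 1994 </i></li>
<li><b>Movie B</b><i> 2001 </i></li>
"""
obj = re.compile(r"<li><b>(?P<name>.*?)</b><i>(?P<year>.*?)</i></li>", re.S)

path = os.path.join(tempfile.mkdtemp(), "demo.csv")
with open(path, mode="w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    for m in obj.finditer(html_content):
        row = m.groupdict()                # {'name': ..., 'year': ...}
        row["year"] = row["year"].strip()  # drop the padding around the year
        writer.writerow(row.values())

# Read the file back to confirm what was written
with open(path, newline="", encoding="utf-8") as f:
    rows = list(csv.reader(f))
print(rows)  # [['Movie A', '1994'], ['Movie B', '2001']]
```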

re

import re

content = 'ADBABDF ABVA BVAAB'
r_list = re.findall(r'AB', content, re.S)
print(r_list)  # ['AB', 'AB', 'AB']

re_pattern = re.compile(r'AB', re.S)
result = re_pattern.findall(content)
print(result)  # ['AB', 'AB', 'AB']

"?" (non-greedy matching)

html_content = """
<div><p>hello world</p></div>
<div><p>hello world!</p></div>
"""

re_pattern_1 = re.compile(r'<div><p>.*</p></div>', re.S)
result_1 = re_pattern_1.findall(html_content)
print(result_1)  # ['<div><p>hello world</p></div>\n<div><p>hello world!</p></div>']


re_pattern_2 = re.compile(r'<div><p>.*?</p></div>', re.S)
result_2 = re_pattern_2.findall(html_content)
print(result_2)  # ['<div><p>hello world</p></div>', '<div><p>hello world!</p></div>']

group

html = 'A B C D'
pattern = re.compile(r'\w+\s+\w+')
r_list = pattern.findall(html)
print(r_list)  # ['A B', 'C D']

html = 'A B C D'
pattern = re.compile(r'(\w+)\s+\w+')
r_list = pattern.findall(html)
print(r_list)  # ['A', 'C']

html = 'A B C D'
pattern = re.compile(r'(\w+)\s+(\w+)')
r_list = pattern.findall(html)
print(r_list)  # [('A', 'B'), ('C', 'D')]

example

import re

html = """
<div class="animal">
    <p class="name">
        <a href="" title="Tiger"></a>
    </p>
    <p class="content">
        two tigers two tigers run fast
    </p>
</div>
<div class="animal">
    <p class="name">
        <a href="" title="Rabbit"></a>
    </p>
    <p class="content">
        small white rabbit white and white
    </p>
</div>
"""
# The <a> tag carries href before title, so skip any attributes between "<a " and title=
re_pattern = re.compile(r'<div class="animal">.*?<a .*?title="(.*?)".*?<p class="content">(.*?)</p>.*?</div>', re.S)
tuple_s = re_pattern.findall(html)
for t in tuple_s:
    result_1 = t[0]
    print(result_1)
    result_2 = t[1].strip()
    print(result_2)

example: movie.douban.com top250

import csv
from urllib import request
import re
import time
import random

class MaoyanSpider():

    def __init__(self):
        self.base_url = 'https://movie.douban.com/top250?start='
        self.start = "0"
        self.my_header = {
            'Referer': 'https://movie.douban.com/top250',
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36'
        }
        self.data_folder = 'douban_data'

        self.f = open('douban_data/douban.csv', 'w', newline='')
        self.writer = csv.writer(self.f)
        #

    def get_html(self, url):
        req = request.Request(url, headers=self.my_header)
        resp = request.urlopen(req)
        html_source = resp.read().decode()
        return html_source
    def parse_html(self, html_content):
        regex = r"""<div class="info">.*?<span class="title">(?P<title>.*?)</span>.*?<p.*?>(?P<people>.*?)<br>(?P<meta_1>.*?)</p>.*?<span class="rating_num".*?>(?P<stars>.*?)</span>.*?<span>(?P<comments_n>.*?)</span>"""
        re_pattern =re.compile(regex, re.S)
        r_list = re_pattern.findall(html_content)
        return r_list

    def save_html(self, r_list):
        for r in r_list:
            l = []
            for i in range(len(r)):
                l.append(r[i].strip())
            self.writer.writerow(l)


    def run(self):
        for page in range(4):
            self.start = str(page * 25)
            url_0 = self.base_url + self.start
            html_content = self.get_html(url_0)
            r_list = self.parse_html(html_content)
            self.save_html(r_list)
            print(f'page {page} finished!')
            time.sleep(random.randint(1, 3))
        self.f.close()


if __name__ == '__main__':
    spider = MaoyanSpider()
    spider.run()

import csv
from urllib import request
import re
import time
import random

class MaoyanSpider():

    def __init__(self):
        self.base_url = 'https://movie.douban.com/top250?start='
        self.start = "0"
        self.my_header = {
            'Referer': 'https://movie.douban.com/top250',
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36'
        }
        self.data_folder = 'douban_data'

        self.data = []

    def get_html(self, url):
        req = request.Request(url, headers=self.my_header)
        resp = request.urlopen(req)
        html_source = resp.read().decode()
        return html_source
    def parse_html(self, html_content):
        regex = r"""<div class="info">.*?<span class="title">(?P<title>.*?)</span>.*?<p.*?>(?P<people>.*?)<br>(?P<meta_1>.*?)</p>.*?<span class="rating_num".*?>(?P<stars>.*?)</span>.*?<span>(?P<comments_n>.*?)</span>"""
        re_pattern =re.compile(regex, re.S)
        r_list = re_pattern.findall(html_content)

        for r in r_list:
            l = []
            for i in range(len(r)):
                l.append(r[i].strip())
            self.data.append(l)

    def save_html(self):
        # for r in r_list:
        #     data_row = list(r)
        #     print(data_row)
        #     with open(filename, 'a', newline='') as f:
        #         csv_writer = csv.writer(f)
        #         csv_writer.writerow(data_row)

        # with open(filename, 'a', newline='') as f:
        #     csv_writer = csv.writer(f)
        #     csv_writer.writerows(r_list)


        with open('douban_data/douban.csv', 'w', newline='') as f:
            writer = csv.writer(f)
            writer.writerows(self.data)

    def run(self):
        for page in range(4):
            self.start = str(page * 25)
            url_0 = self.base_url + self.start
            html_content = self.get_html(url_0)
            self.parse_html(html_content)

            print(f'page {page} finished!')
            time.sleep(random.randint(1, 3))
        self.save_html()


if __name__ == '__main__':
    spider = MaoyanSpider()
    spider.run()

example: baidu image

import os.path
import re
import urllib.parse
import requests
import time
import random


class BaiduImageSpider:
    def __init__(self):
        self.url = "https://image.baidu.com/search/index?"
        self.my_header = {
            'Referer': 'https://www.baidu.com',
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36'
        }
        self.cookies = {
            'BIDUPSID': '08BC8B371D41816A15AD22D8AC4405C2',
            'BDRCVFR[dG2JNJb_ajR]': 'mk3SLVN4HKm',
            'BAIDUID': '08BC8B371D41816A13367C0125B857DA:FG=1',
            'userFrom': 'null',
            'BAIDUID_BFESS': '08BC8B371D41816A13367C0125B857DA:FG=1',
            'BDRCVFR[-pGxjrCMryR]': 'mk3SLVN4HKm',
            'ab_sr': '1.0.1_ZjRkZDI3ZDU5MTczYTFkMTNkMTZjM2ZiZDMxMWZiZDI3Y2M3ZTVlNTZlMjBhNDVkOWE0YmIzMzY3MjA2ZDNkYmU2NDU3ODIyYjkwZmExNWE5ZmM3NzI4ZTk3ZWU5MTQ3MTM5NzVlMTRhNThjNjYzYWRmMTcyZDAxOGMwZjM2MGE1YjE0ZjRmNzhkNzYyNGY1YWZkNTUxNDg2YTBmZTYyOA==',
        }
        self.n = 1


    def run(self):

        key_word = input("keyword:")
        if not key_word:
            key_word = '赵丽颖'
        params = {
            "tn": "baiduimage",
            "word": key_word,
        }
        params = urllib.parse.urlencode(params)
        url = f"{self.url}{params}"
        print(url)
        base_html = requests.get(url, headers=self.my_header, cookies=self.cookies)
        re_pattern_0 = re.compile(r"'imgData'(.*?)'fcadData'", re.S)
        result_0 = re_pattern_0.search(base_html.text)
        re_pattern_1 = re.compile(r'"thumbURL":.*?"(.*?)".*?"replaceUrl"', re.S)
        thumb_url_list = re_pattern_1.findall(result_0.group())

        for thumb_url in thumb_url_list:
            self.save_image(thumb_url, key_word)

    def save_image(self, url, key_word):
        print(url)
        resp = requests.get(url, headers=self.my_header, cookies=self.cookies)
        filename = f'{key_word}_{self.n}.jpg'
        directory_path = f'baidu_images/{key_word}/'
        if not os.path.exists(directory_path):
            os.makedirs(directory_path)
        with open(f"{directory_path}{filename}", 'wb') as f:
            f.write(resp.content)
        self.n += 1
        print(f"{filename} downloaded")
        time.sleep(random.randint(1, 4))

if __name__ == '__main__':
    spider = BaiduImageSpider()
    spider.run()

lxml xpath

<html>
<head>
 <title>My page</title>
</head>
<body>
 <h2>Welcome to my page</h2>
 <a href="www.example.com">page</a> 
    <p>This is the first paragraph</p>
 <h2>Hello World</h2>
</body>
</html>

For getting the text inside the <p> tag,

XPath : html/body/p/text()
Result : This is the first paragraph

For getting the value of the href attribute of the anchor (<a>) tag,

XPath : html/body/a/@href
Result: www.example.com

For getting the value inside the second <h2> tag,

XPath : html/body/h2[2]/text()
Result: Hello World
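
The three paths above can be tried without installing anything. This is only a sketch: it uses the standard library's xml.etree.ElementTree rather than lxml, and ElementTree supports just a subset of XPath (no text() or @attr steps), so the text and the attribute are read with .text and .get() instead. The page is re-typed here as well-formed XML.

```python
import xml.etree.ElementTree as ET

# The sample page from above, written as well-formed XML.
page = """
<html>
<head><title>My page</title></head>
<body>
 <h2>Welcome to my page</h2>
 <a href="www.example.com">page</a>
 <p>This is the first paragraph</p>
 <h2>Hello World</h2>
</body>
</html>
"""

root = ET.fromstring(page)  # root is the <html> element

# html/body/p/text(): find the element, then read its text
p_text = root.find('body/p').text

# html/body/a/@href: find the element, then read the attribute
href = root.find('body/a').get('href')

# html/body/h2[2]/text(): positions are 1-based, as in XPath
second_h2 = root.find('body/h2[2]').text

print(p_text)
print(href)
print(second_h2)
```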

Note: //h1/text() returns a list, not a single string.

Specifying a complete path with / as separator
title = root.xpath('/html/body/div/div/div[2]/h1')

is the full path to my blog title. Notice how we request the 2nd element of the third set of div elements using div[2] – xpath arrays are one-based, not zero-based.

Specifying a path with wildcards using //
This expression also finds the title but the preamble of /html/body/div/div is absorbed by the // wildcard match:

title = root.xpath('//div[2]/h1')

Specifying an element by attribute
We can select elements which have particular attribute values:

tagcloud = root.xpath('//*[@class="tagcloud"]')

This selects the tag cloud on my blog by selecting elements that have the class attribute "tagcloud".

Select via a parent or sibling relationship
Sometimes we want to select elements by their relationship to another element, for example:

subtitle = root.xpath('//h1[contains(@class,"header_title")]/../h2')

this selects the h1 title of my blog (SomeBeans) then navigates to the parent with .. and selects the sibling h2 element (the subtitle “the makings of a small casserole”).

The same effect can be achieved with the following-sibling keyword:

subtitle = root.xpath('//h1[contains(@class,"header_title")]/following-sibling::h2')

from lxml import etree

tree = etree.parse("data10.html")

# Alternative queries to try (XPath positions start at 1, not 0):
# result = tree.xpath("/html/body/ul/li[1]/a/text()")
# result = tree.xpath("/html/body/ol/li/a[@href='dapao']/text()")
# result = tree.xpath("/book/author//nick/text()")
# result = tree.xpath("/book/author/*/nick/text()")

result = tree.xpath("/html/body/ol/li")
for li in result:
    # "./" makes the path relative to the current <li>
    print(li.xpath("./a/text()"))
    print(li.xpath("./a/@href"))
print(result)

The fromstring() function
The fromstring() function is the easiest way to parse a string:

>>> some_xml_data = "<root>data</root>"

>>> root = etree.fromstring(some_xml_data)
>>> print(root.tag)
root
>>> etree.tostring(root)
b'<root>data</root>'

The XML() function
The XML() function behaves like the fromstring() function, but is commonly used to write XML literals right into the source:

There is also a corresponding function HTML() for HTML literals.

root = etree.HTML("<p>data</p>")

The parse() function
The parse() function is used to parse from files and file-like objects.

example

from lxml import etree
html = ''
parse_html = etree.HTML(html)
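# "//" must be followed by a node test; use * to match any element
r_list = parse_html.xpath('//*[@class="name"]/text()')
# div_list = parse_html.xpath('//*[@class="name_1"]/div')
# xpath() returns a list, so query each element of div_list in turn:
# r_list = [div.xpath('.//*[@class="name_2"]/img/@src') for div in div_list]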

xpath

//tagname
at any level of the parent element
//tagname[1]
//tagname[@attributeName="value"]
contains()
//tagname[contains(@attributeName,'value')]
starts-with()
and or
//tagname[(expression 1) and (expression 2)]

get text
//h1/text()

/
the children
//
all the children within any level
.
current
..
parent
*
any elements

css selector

xpath: /html/body/p
CSS selector: html > body > p

Basic CSS Selectors Cheatsheet

Selector	Description	Example	Explanation
Tag Selector	Selects elements based on their tag name.	p	Selects all `<p>` elements.
Class Selector	Selects elements based on their class name.	.example	Selects all elements with the class name “example”.
ID Selector	Selects an element based on its ID.	#example	Selects the element with the ID “example”.
Attribute Selector	Selects elements based on their attribute and value.	[type="text"]	Selects all elements with the attribute "type" and the value "text".
Descendant Selector	Selects elements that are descendants of another element.	div p	Selects all `<p>` elements that are descendants of a `<div>` element.
Child Selector	Selects elements that are direct children of another element.	ul > li	Selects all `<li>` elements that are direct children of a `<ul>` element.
Pseudo-Class Selector	Selects elements based on their state or position in the document.	a:hover	Selects all `<a>` elements when the mouse is over them.

There are many pseudo-class selectors, some of which are described in this table.

Pseudo-class Selector	Description
:hover	Selects an element when the mouse pointer is over it.
:active	Selects an element when it is being activated (for example, clicked).
:visited	Selects a link that has been visited by the user.
:focus	Selects an element when it has focus (e.g. a form field that is being typed into).
:first-child	Selects the first child element of its parent.
:last-child	Selects the last child element of its parent.
:nth-child(n)	Selects the nth child element of its parent.
:nth-of-type(n)	Selects the nth element of its type among its siblings.
:last-of-type	Selects the last occurrence of an element type among its siblings.

The CSS expression below shows how to select the first div of the body element.

html > body > div:nth-of-type(1)

<html>
    <body>
        <div>This one</div>
        <div>not This one</div>
        <div>not This one</div>
    </body>
</html>

The next-sibling combinator (+) separates two selectors and matches the second element only if it immediately follows the first element, and both are children of the same parent element.

<ul>
  <li>One</li>
  <li>Two!</li>
  <li>Three</li>
</ul>

select the <li>Two!</li>

li:first-of-type + li {
  color: red;
}

Select by attribute value containing
input[class*="example"]

Select by attribute value starting with
input[id^="example"]

Select by attribute value ending with
a[href$="example"]

XPath to CSS Selector Conversion

Equivalency	XPath Notation	CSS Selector
Select by element type	//div	div
Select by class name	`//div[@class="example"]`	div.example
Select by ID	`//*[@id="example"]`	#example
Select by attribute	`//input[@name="example"]`	`input[name="example"]`
Select by attribute value containing	`//input[contains(@class, "example")]`	`input[class*="example"]`
Select by attribute value starting with	`//input[starts-with(@id, "example")]`	`input[id^="example"]`
Select by attribute value ending with	`//a[ends-with(@href, "example")]` (XPath 2.0 only; lxml implements XPath 1.0)	`a[href$="example"]`
Select by following sibling	//div/following-sibling::p	div ~ p
Select by descendant	//div//p	div p
Select the first p child	`//div/p[1]`	div > p:first-of-type
Select the last p child	`//div/p[last()]`	div > p:last-of-type

parsel.Selector

$ pip install parsel

.xpath() and .css() methods return a SelectorList instance, which is a list of new selectors.

If you want to extract only the first matched element, you can call the selector .get()

from parsel import Selector
html_text = "<html><body><h1>Hello, Parsel!</h1></body></html>"
html_selector = Selector(text=html_text)
html_selector.css('h1')
# [<Selector query='descendant-or-self::h1' data='<h1>Hello, Parsel!</h1>'>]
html_selector.xpath('//h1')  # the same, but now with XPath
# [<Selector query='//h1' data='<h1>Hello, Parsel!</h1>'>]

selecting the text inside the title tag:

selector.xpath('//title/text()')
# [<Selector query='//title/text()' data='Example website'>]

selector.css('title::text')
# [<Selector query='descendant-or-self::title/text()' data='Example website'>]

To actually extract the textual data, you must call the selector .get() or .getall() methods, as follows:

selector.xpath('//title/text()').getall()
# ['Example website']
selector.xpath('//title/text()').get()
# 'Example website'

query for attributes using .attrib property of a Selector:

[img.attrib['src'] for img in selector.css('img')]
# ['image1_thumb.jpg', 'image2_thumb.jpg', 'image3_thumb.jpg', 'image4_thumb.jpg', 'image5_thumb.jpg']

As a shortcut, .attrib is also available on SelectorList directly; it returns attributes for the first matching element:

selector.css('img').attrib['src']
# 'image1_thumb.jpg'

from parsel import Selector
selector_1 = Selector(text=resp.text)  # resp: a requests response fetched earlier
# an XPath expression belongs in .xpath(), not .css()
text_1 = selector_1.xpath('//div[@class="example"]').get()

bs4

page = BeautifulSoup(content, "html.parser")
# div_item = page.find("div", class_="item")
ol = page.find("ol", attrs={"class": "grid_view"})
# print(div_item)
lis = ol.find_all("li")

a = li.find("a")
a.get("href")

img = article.find_all("img")[0]
img_src = img.get("src")
img_resp = requests.get(img_src, headers=my_headers)
img_name = img_src.split("/")[-1]
with open(f"data08/{img_name}", mode="wb") as f1:
    f1.write(img_resp.content)
img_resp.close()

bs4 lxml

html = requests.get("https://www.google.com/search?q=minecraft", headers=headers)
soup = BeautifulSoup(html.text, "lxml")

for result in soup.select(".tF2Cxc"):
    title = result.select_one(".DKV0Md").text
    link = result.select_one(".yuRUbf a")["href"]
    displayed_link = result.select_one(".lEBKkf span").text
    snippet = result.select_one(".lEBKkf span").text

    print(f"{title}\n{link}\n{displayed_link}\n{snippet}\n")

bs4

Getting information:

html = """
<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title>baidu</title>
</head>
<body link="#0000cc">
    <div id="wrapper">
        <div id="head">
            <div class="head_wrapper">
                <div id="u1">
                    <a href="http://news.baidu.com" class="mnav" name="tj_trnews"><!--news--></a>
                    <a href="http://news.baidu.com" class="mnav" name="tj_trnews">news</a>
                    <a href="https://www.hao123.com" class="mnav" name="tj_trhao123">hao123</a>
                    <a href="http://map.baidu.com" class="mnav" name="tj_trmap">map</a>
                    <a href="http://v.baidu.com" class="mnav" name="tj_trvideo">video</a>
                    <a href="http://tieba.baidu.com" class="mnav" name="tj_trtieba">tieba</a>
                    <a href="http://www.baidu.com/more/" class="bri" name="tj_briicon" style="">more</a>
                </div>
            </div>
        </div>
    </div>
</body>
</html>
"""
bs = BeautifulSoup(html, "lxml")
first_a_link = bs.find(name="a")
print(first_a_link)
# <a class="mnav" href="http://news.baidu.com" name="tj_trnews"><!--news--></a>

Node name
print(first_a_link.name) # "a"

Node attributes

print(first_a_link.attrs)  # dictionary: {'href': 'http://news.baidu.com', 'class': ['mnav'], 'name': 'tj_trnews'}
print(first_a_link.attrs["href"])  # http://news.baidu.com

Node text content
print(first_a_link.string)  # "news" (a Comment object here, because the first <a> only contains <!--news-->)

Nested selection

first_div_element = bs.find(name="div")
a_in_div = first_div_element.find(name="a")
print(a_in_div) #<a class="mnav" href="http://news.baidu.com" name="tj_trnews"><!--news--></a>

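find / find_all

find_all(name="", attrs={}, text="")

name: the node (tag) name
attrs: the node attributes

Common attributes such as id and class can be passed directly as keyword arguments:

print(bs.find(id="head"))
print(bs.find(class_="mnav"))

text: the node text content (Beautiful Soup 4.4.0 and later call this argument string)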

a_link = bs.find_all("a", attrs={"href": "http://news.baidu.com"})
# [<a class="mnav" href="http://news.baidu.com" name="tj_trnews"><!--news--></a>,
# <a class="mnav" href="http://news.baidu.com" name="tj_trnews">news</a>]
print(a_link)   # list

beautifulsoup

install packages:

requests
beautifulsoup4
lxml

bs4

import requests
from bs4 import BeautifulSoup

url = 'https://www.google.com'  # requests needs the scheme

result = requests.get(url)

content = result.text

soup = BeautifulSoup(content, 'lxml')

# soup.find('tagname', class_='')
# soup.find('tagname', id='')
# soup.find('tagname')
#
# soup.find_all('h2')

subscript = soup.find('div', class_='full-script').get_text(separator="\n", strip=True)

def get_text(self,
             separator: str = "",
             strip: bool = False,
             types: tuple[Type[NavigableString], ...] = ...) -> str
Get all child strings of this PageElement, concatenated using the given separator.
Params:
separator – Strings will be concatenated using this separator.
strip – If True, strings will be stripped before being concatenated.
types – A tuple of NavigableString subclasses. Any strings of a subclass not found in this list will be ignored. Although there are exceptions, the default behavior in most cases is to consider only NavigableString and CData objects. That means no comments, processing instructions, etc.
Returns:
A string.

movie_urls = movie_list.find_all('a', href=True)

links = []
for link in movie_urls:
    links.append(link['href'])

subslikescript.com

import requests
from bs4 import BeautifulSoup

root = 'https://subslikescript.com'
# website = 'https://subslikescript.com/movies?page=2'
website = f'{root}/movies'
# website = 'https://subslikescript.com/movie/Titanic-120338'

result = requests.get(website, timeout=5, verify=False)
content = result.text
soup = BeautifulSoup(content, 'lxml')
# print(soup.prettify())

# pagination
nav = soup.find('ul', class_='pagination')
pages = nav.find_all('li', class_='page-item')[-2].get_text(strip=True)
print(pages)
pages = 2

links = []
for page in range(1, int(pages) + 1):
    website = f'{root}/movies?page={page}'

    result = requests.get(website, timeout=5, verify=False)
    content = result.text
    soup = BeautifulSoup(content, 'lxml')
    # print(soup.prettify())

    movie_list = soup.find('ul', class_='scripts-list')
    movie_urls = movie_list.find_all('a', href=True)


    for link in movie_urls:
        links.append(link['href'])
    # print(links)

for link in links:
    try:
        website = f"{root}/{link}"

        result = requests.get(website, timeout=10, verify=False)
        content = result.text
        soup = BeautifulSoup(content, 'lxml')

        article = soup.find('article', class_='main-article')
        title = article.find('h1').get_text()
        print(title)
        subscript = article.find('div', class_='full-script').get_text(separator="\n", strip=True)
        # print(subscript)

        with open(f'subslikescript_com/{title}.txt', 'w') as file:
            file.write(subscript)
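    except Exception as exc:
        # skip pages that fail to download or parse, but report which one
        print(f"skipped {link}: {exc}")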

css select

print(bs.select("div"))
print(bs.select("div#head"))
print(bs.select("a.mnav"))
print(bs.select('a[class="mnav"]'))
print(bs.select('div a'))

pandas

import pandas
film_names = ["无间道", "霸王别姬", "楚门的世界"]
film_scores = ["9.38", "9.0", "9.1"]
df = pandas.DataFrame()
df["film_name"] = film_names
df["score"] = film_scores

df.to_excel("films.xlsx", index=False)  # index=False drops the index column

json

json_string = resp.text  # resp.text is already a str holding JSON

json_dict = json.loads(json_string)  # JSON string -> dict (the encoding argument was removed in Python 3.9)

dict -> JSON string
json.dumps()
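
A small round trip may make the two directions clearer. This is a sketch using only the standard library; the payload is made up and stands in for the text of a real response.

```python
import json

# Hypothetical payload, standing in for resp.text.
json_string = '{"name": "café", "tags": ["dog", "pet"], "age": 3}'

data = json.loads(json_string)  # JSON string -> dict

ascii_out = json.dumps(data)                      # non-ASCII characters are escaped as \uXXXX
utf8_out = json.dumps(data, ensure_ascii=False)   # non-ASCII characters are kept as they are

print(data["tags"])
print(ascii_out)
print(utf8_out)
```

ensure_ascii=False is the same option the json.dump() call in the next example passes, so the saved file stays readable when it contains non-ASCII text.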

example

response = requests.post(url, data=data, headers=headers)

json_dict = response.json()
# json_list = response.json()

fp = open('./dog.json', 'w', encoding='utf-8')
json.dump(json_dict, fp=fp, ensure_ascii=False)
# json.dump(json_list, fp=fp, ensure_ascii=False)
fp.close()

session

import requests

url = "https://passport.17k.com/ck/user/login"
session = requests.session()
data = {
    "loginName": "some username",
    "password": "some password",
}
login_resp = session.post(url, data=data)
print(login_resp.cookies)

shelf_resp = session.get("https://user.17k.com/ck/author2/shelf?page=1&appKey=2406394919")
print(shelf_resp.json())

login_resp.close()
shelf_resp.close()

my_header = {
    "Cookie": "some cookie from web browser"
}

shelf_d_resp = requests.get("https://user.17k.com/ck/author2/shelf?page=1&appKey=2406394919", headers=my_header)
print(shelf_d_resp.json())
shelf_d_resp.close()

referer

my_headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36",
    "Referer": url,
}

Installing pycrypto

The Visual C++ Redistributable installs Microsoft C and C++ (MSVC) runtime libraries. These libraries are required by many applications built by using Microsoft C and C++ tools. If your app uses those libraries, a Microsoft Visual C++ Redistributable package must be installed on the target system before you install your app.

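The Microsoft C++ Build Tools provide the MSVC toolset through a scriptable, standalone installer, without requiring Visual Studio. They are recommended if you build C++ libraries and applications targeting Windows from the command line (for example, in a continuous integration workflow).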

On Windows 7, installing pycrypto fails with: ucrt\inttypes.h(26): error C2061: syntax error: identifier 'intmax_t'
1. Copy stdint.h from C:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\include to C:\Program Files (x86)\Windows Kits\10\Include\10.0.18362.0\ucrt
2. Edit inttypes.h in C:\Program Files (x86)\Windows Kits\10\Include\10.0.18362.0\ucrt and change #include <stdint.h> to #include "stdint.h", so that it picks up the stdint.h copied in step 1.

pycrypto is no longer maintained (see pycrypto.org); pycryptodome is the maintained replacement for pycrypto.

Video

m3u8

https%3A%2F%2Fnew.1080pzy.co%2F20230116%2F34sxZOJQ%2Findex.m3u8

https://new.1080pzy.co/20230116/34sxZOJQ/index.m3u8
#EXTM3U
#EXT-X-STREAM-INF:PROGRAM-ID=1,BANDWIDTH=1100000,RESOLUTION=960x540
/20230116/34sxZOJQ/1100kb/hls/index.m3u8

https://new.1080pzy.co/20230116/34sxZOJQ/1100kb/hls/index.m3u8
#EXTM3U
#EXT-X-VERSION:3
#EXT-X-TARGETDURATION:4
#EXT-X-PLAYLIST-TYPE:VOD
#EXT-X-MEDIA-SEQUENCE:0
#EXTINF:3.086,
https://hey05.cjkypo.com/20230116/34sxZOJQ/1100kb/hls/c16ZdUp5.ts
#EXTINF:2.085,
https://hey05.cjkypo.com/20230116/34sxZOJQ/1100kb/hls/zAiBKR1T.ts
#EXTINF:2.085,
https://hey05.cjkypo.com/20230116/34sxZOJQ/1100kb/hls/na7WHAiK.ts

The data is UTF-8 encoded bytes escaped with URL quoting, so you want to decode, with urllib.parse.unquote(), which handles decoding from percent-encoded data to UTF-8 bytes and then to text, transparently:

>>> from urllib.parse import unquote
>>> url = 'example.com?title=%D0%BF%D1%80%D0%B0%D0%B2%D0%BE%D0%B2%D0%B0%D1%8F+%D0%B7%D0%B0%D1%89%D0%B8%D1%82%D0%B0'
>>> unquote(url)
'example.com?title=правовая+защита'
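
Putting the two steps together: decode the percent-encoded playlist address, then turn every non-comment line of a playlist into an absolute URL. The sketch below uses only the standard library and the playlist text shown above; the helper name media_uris is made up for illustration, and no network request is made.

```python
from urllib.parse import unquote, urljoin


def media_uris(playlist_text, playlist_url):
    """Return absolute URLs for every non-comment line of an m3u8 playlist."""
    uris = []
    for line in playlist_text.splitlines():
        line = line.strip()
        if not line or line.startswith('#'):
            continue  # skip blank lines and #EXT... tags
        # urljoin resolves relative paths and leaves absolute URLs untouched
        uris.append(urljoin(playlist_url, line))
    return uris


encoded = 'https%3A%2F%2Fnew.1080pzy.co%2F20230116%2F34sxZOJQ%2Findex.m3u8'
master_url = unquote(encoded)

master_text = """#EXTM3U
#EXT-X-STREAM-INF:PROGRAM-ID=1,BANDWIDTH=1100000,RESOLUTION=960x540
/20230116/34sxZOJQ/1100kb/hls/index.m3u8
"""

print(master_url)
print(media_uris(master_text, master_url))
```

The same helper can be applied to the second-level playlist, where the lines are already absolute .ts URLs and come back unchanged.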

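ffmpeg usage syntax:

ffmpeg -f concat -safe 0 -i D:\ProgramData\study\mov\order.m3u8 -c copy D:\ProgramData\study\mov\hello.mp4

In more detail:

-f concat: -f normally sets a file format, e.g. -f psp (output in the PSP-specific format); when followed by concat, it tells ffmpeg to use the concat method and join the listed files.
-safe 0: relaxes the file-name checks, so long paths, spaces and non-ANSI characters are accepted.
-i D:\ProgramData\study\mov\order.m3u8: -i is followed by the input file name. It can also be a file that lists the input file names, such as order.m3u8 here, provided it follows this format:

file 'D:\ProgramData\study\mov\tsfiles\MQJ9iKoM.ts'
file 'D:\ProgramData\study\mov\tsfiles\8LeDe7Wu.ts'
-c copy D:\ProgramData\study\mov\hello.mp4: -c selects the codec for the output file; copy means the streams are copied directly without re-encoding.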
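
The list file and the command line can be generated from Python. This is a sketch under stated assumptions: the function names and the file names (order.txt, hello.mp4, tsfiles/...) are made up, and the code only builds the list and the argument list without running ffmpeg.

```python
from pathlib import Path


def concat_list_lines(ts_paths):
    """One "file '<path>'" line per segment, in playback order."""
    return [f"file '{path}'" for path in ts_paths]


def write_concat_list(ts_paths, list_path):
    """Write the list file that ffmpeg's -i option will read."""
    text = "\n".join(concat_list_lines(ts_paths)) + "\n"
    Path(list_path).write_text(text, encoding="utf-8")


def concat_command(list_path, output_path):
    """Argument list for: ffmpeg -f concat -safe 0 -i <list> -c copy <output>."""
    return ["ffmpeg", "-f", "concat", "-safe", "0",
            "-i", str(list_path), "-c", "copy", str(output_path)]


# Hypothetical file names, for illustration only.
print(concat_list_lines(["tsfiles/0001.ts", "tsfiles/0002.ts"]))
print(concat_command("order.txt", "hello.mp4"))
```

With ffmpeg installed and on the PATH, the returned list could be passed to subprocess.run(cmd, check=True); passing a list avoids shell quoting problems with spaces in paths.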

Concurrency

from threading import Thread

def func():
    for i in range(1000):
        print("func", i)

if __name__ == '__main__':
    thread_1 = Thread(target=func)
    thread_1.start()
    for i in range(1000):
        print("main", i)

from threading import Thread
class MyThread(Thread):
    def run(self):
        for i in range(1000):
            print("child thread", i)


if __name__ == '__main__':
    thread_1 = MyThread()
    thread_1.start()
    for i in range(1000):
        print("main", i)

Thread and process pools

from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor

def fn(name):
    for i in range(1000):
        print(name, i)


if __name__ == '__main__':
    with ThreadPoolExecutor(8) as t_pool:
        for i in range(100):
            t_pool.submit(fn, name=f"thread{i}")

    # the with-block exits only after every task in the pool has finished; execution continues from here
    print("finished!")
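
submit() also returns a Future, which is how results (and exceptions) come back from the pool. A minimal sketch, with a made-up work() function standing in for a download or parse step:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def work(n):
    # Stand-in for a download or parse step (hypothetical).
    return n * n


def run_all(numbers, max_workers=8):
    results = {}
    with ThreadPoolExecutor(max_workers) as pool:
        # Remember which input each future belongs to.
        futures = {pool.submit(work, n): n for n in numbers}
        # as_completed() yields futures as they finish, in no fixed order.
        for fut in as_completed(futures):
            results[futures[fut]] = fut.result()  # re-raises any exception from work()
    return results


if __name__ == '__main__':
    print(run_all(range(5)))
```

Keying the results by input keeps the output independent of the order in which the threads happen to finish.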

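Coroutines

Blocking
requests.get(url)
Until the network request returns its data, the program is blocked.

Coroutines:
When the program hits an I/O operation, it can switch to another task instead of waiting.
At the micro level, a single thread switches from one task to the next, and the switch is triggered by I/O.
At the macro level, several tasks appear to run at the same time, i.e. asynchronous multitasking.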

DeprecationWarning: The explicit passing of coroutine objects to asyncio.wait() is deprecated since Python 3.8, and scheduled for removal in Python 3.11.

The asyncio.wait() documentation obviously says nothing about this or what you’re supposed to do instead, but as far as I can figure out you replace asyncio.wait([a, b]) with asyncio.gather(a, b).

import time
import asyncio


async def func1():
    print("hello world 1")
    # time.sleep(4)
    await asyncio.sleep(4)
    print("hello world 1")


async def func2():
    print("hello world 2")
    # time.sleep(3)
    await asyncio.sleep(3)
    print("hello world 2")


async def func3():
    print("hello world 3")
    # time.sleep(2)
    await asyncio.sleep(2)
    print("hello world 3")

async def main():
    # # Schedule three calls *concurrently*:
    # L = await asyncio.gather(
    #     func1(),
    #     func2(),
    #     func3(),
    # )
    # print(L)

    await asyncio.wait([
        asyncio.create_task(func1()),
        asyncio.create_task(func2()),
        asyncio.create_task(func3()),
    ])
t1 = time.time()
asyncio.run(main())
t2 = time.time()
print(t2 - t1)
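
For comparison, here is the asyncio.gather() form mentioned in the note above, as a minimal sketch. Unlike asyncio.wait(), gather() accepts coroutine objects directly and returns their results in the order the arguments were passed, regardless of which finishes first. The fetch() coroutine and its short delays are made up for illustration.

```python
import asyncio


async def fetch(name, delay):
    await asyncio.sleep(delay)  # stands in for a network call
    return f"{name} done"


async def main():
    # The coroutines run concurrently; results keep argument order.
    return await asyncio.gather(
        fetch("a", 0.03),
        fetch("b", 0.02),
        fetch("c", 0.01),
    )


results = asyncio.run(main())
print(results)  # ['a done', 'b done', 'c done']
```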

import aiohttp
import asyncio

urls = [
    "https://p.qqan.com/up/2023-9/16951819146781660.jpg",
    "https://p.qqan.com/up/2023-9/16950082886215315.jpg",
    "https://p.qqan.com/up/2023-9/16957095513204923.jpg"
]

async def aio_download(url):
    # "sdfa".rsplit()
    name = url.rsplit("/", 1)[1]
    async with aiohttp.ClientSession() as session:
        async with session.get(url) as resp:
            # resp.text()
            # resp.json()
            # await resp.content.read()
            ## aiofiles could be used here to read and write files asynchronously
            with open(f"data21/{name}", mode="wb") as f:
                f.write(await resp.content.read())

    print(f"{name}")

async def main():
    tasks = []
    for url in urls:
        task = asyncio.create_task(aio_download(url))
        tasks.append(task)
    await asyncio.wait(tasks)


if __name__ == '__main__':
    asyncio.run(main())

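AES-decrypting TS segments

import aiofiles
from Crypto.Cipher import AES  # from pycryptodome (or the older pycrypto)


async def dec_ts(name, key):
    # key: the raw AES key as bytes; the IV here is sixteen ASCII '0' characters
    aes = AES.new(key=key, IV=b'0000000000000000', mode=AES.MODE_CBC)
    async with aiofiles.open(f"data23/{name}", mode="rb") as f1,\
        aiofiles.open(f"data23/temp_{name}", mode="wb") as f2:
        bs = await f1.read()
        await f2.write(aes.decrypt(bs))
    print(f"{name} decrypt finished")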

Merging TS segments

import os

# macOS / Linux
def merge_ts():
    file_list = []
    with open("data15/temp2.m3u8", mode="r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#"):
                continue  # skip blank lines and #EXT... tags
            name = line.rsplit("/", 1)[1]
            file_list.append(f"data23/temp_{name}")
    s = " ".join(file_list)
    os.system(f"cat {s} > movie.mp4")

# Windows
# The .ts files in the folder must be named so that they sort in playback order,
# otherwise the segments of the merged video will be scrambled. Note that the order
# is plain string order, so '10.ts' sorts before '9.ts'; if the files are named
# with numbers, pad them with leading zeros.
def merge_ts2():
    os.system('copy /b ' + r'C:\Users\lcf\Documents\learning\xiaoyuan\data23\*.ts ' + r'C:\Users\lcf\Documents\learning\xiaoyuan\data23\new.ts')
    print("merge finished")
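
The ordering problem described in the comment above is easy to reproduce. The sketch below shows string order versus numeric order and one way to produce zero-padded names; the file names and the numeric_key helper are made up for illustration.

```python
import re


def numeric_key(filename):
    """Sort key: the first run of digits in the name, as an integer."""
    m = re.search(r'\d+', filename)
    return int(m.group()) if m else -1


names = ['10.ts', '9.ts', '1.ts', '2.ts']

string_order = sorted(names)                    # ['1.ts', '10.ts', '2.ts', '9.ts']
numeric_order = sorted(names, key=numeric_key)  # ['1.ts', '2.ts', '9.ts', '10.ts']

# Zero-padded names sort correctly even as plain strings.
padded = sorted(f"{numeric_key(n):04d}.ts" for n in names)

print(string_order)
print(numeric_order)
print(padded)
```

Renaming the downloaded segments to the padded form before running copy /b is one way to make the wildcard merge safe.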

selenium

pip install selenium

install chrome driver
For Selenium older than 4.6.0, download the ChromeDriver build that matches your Chrome version and put it somewhere on the PATH, for example next to python.exe or in the Scripts folder.

With Selenium 4.6.0 and later you do not need to pass executable_path or download a driver yourself: the built-in Selenium Manager downloads and manages the driver when you do not specify one.

Selenium Manager provides automated driver management for: Google Chrome, Mozilla Firefox, Microsoft Edge.
Selenium Manager is invoked transparently by the Selenium bindings when:
No browser driver is detected on the PATH
No third party driver manager is being used

from selenium.webdriver import Chrome, ChromeOptions
chrome_options = ChromeOptions()
chrome_options.add_experimental_option("detach", True)
web_browser = Chrome(options=chrome_options)
web_browser.get("https://www.baidu.com")
print(web_browser.title)

With WebDriverWait, you don’t really have to take that into account. It will wait only as long as necessary until the desired element shows up (or it hits a timeout).

A WebElement is a Selenium object representing an HTML element.

There are many actions that you can perform on those objects, here are the most useful:

Accessing the text of the element with the property element.text
Clicking the element with element.click()
Accessing an attribute with element.get_attribute('class')
Sending text to an input with element.send_keys('mypassword')

WebDriver provides two main methods for finding elements.

find_element
find_elements

Type	Description	DOM Sample	Example
By.ID	Searches for elements based on their HTML ID	`<div id="myID">`	find_element(By.ID, "myID")
By.NAME	Searches for elements based on their name attribute	`<input name="myNAME">`	find_element(By.NAME, "myNAME")
By.XPATH	Searches for elements based on an XPath expression	`<span>My <a>Link</a></span>`	find_element(By.XPATH, "//span/a")
By.LINK_TEXT	Searches for anchor elements based on a match of their text content	`<a>My Link</a>`	find_element(By.LINK_TEXT, "My Link")
By.PARTIAL_LINK_TEXT	Searches for anchor elements based on a sub-string match of their text content	`<a>My Link</a>`	find_element(By.PARTIAL_LINK_TEXT, "Link")
By.TAG_NAME	Searches for elements based on their tag name	`<h1>`	find_element(By.TAG_NAME, "h1")
By.CLASS_NAME	Searches for elements based on their HTML classes	`<div class="myCLASS">`	find_element(By.CLASS_NAME, "myCLASS")
By.CSS_SELECTOR	Searches for elements based on a CSS selector	`<span>My <a>Link</a></span>`	find_element(By.CSS_SELECTOR, "span > a")

# from selenium.webdriver import Chrome, ChromeOptions
import time
from selenium.webdriver.chrome.options import Options
from selenium.webdriver import Chrome
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys

chrome_options = Options()
chrome_options.add_experimental_option("detach", True)
web_browser = Chrome(options=chrome_options)


url = "https://www.lagou.com/"
web_browser.get(url)
# print(web_browser.title)

# with Selenium it is fine to use an XPath copied straight from Chrome DevTools
location = web_browser.find_element(by=By.XPATH, value='//*[@id="changeCityBox"]/p[1]/a')
location.click()

time.sleep(3)

search_input = web_browser.find_element(by=By.XPATH, value='//*[@id="search_input"]')
search_input.send_keys("python", Keys.ENTER)

## run JavaScript in the page from Selenium
web_browser.execute_script("""
let a = document.getElementsByClassName("un-login-banner")[0];
if (a) {
a.style.display = "none";
}
""")


time.sleep(2)
#    //*[@id="jobList"]/div[1]/div[1]/div[1]/div[2]/div[1]/a
jobs = web_browser.find_elements(by=By.XPATH, value='//*[@id="jobList"]/div[1]/div')
for job in jobs:
    job_name = job.find_element(By.XPATH, './/*[@id="openWinPostion"]')
    # company_name = job.find_element(By.XPATH, './div[1]/div[2]/div[1]/a')
    # print(job_name.text, company_name.text)
    job_name.click()
    ## switch to the newly opened tab
    time.sleep(2)
    web_browser.switch_to.window(web_browser.window_handles[-1])
    job_detail = web_browser.find_element(By.XPATH,'//*[@id="job_detail"]/dd[2]/div').text
    print(job_detail)
    ## close tab
    web_browser.close()
    web_browser.switch_to.window(web_browser.window_handles[0])
    # break
# web_browser.quit()
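
The fixed `time.sleep(...)` pauses in the script above always wait the full duration, even when the page is ready sooner, and still fail when the page is slower than expected. Selenium's `WebDriverWait`, used in the audible.com script in this post, polls instead. The idea behind polling can be sketched in plain Python. Note that `wait_until` is a hypothetical helper written for illustration, not part of Selenium.

```python
import time


def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Call `condition` repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once `timeout` seconds pass.
    (Hypothetical helper for illustration; Selenium ships WebDriverWait for this.)
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll_interval)
```

With a live browser, one way to use it would be to pass a lambda that calls `web_browser.find_elements(...)`, since `find_elements` returns an empty list (falsy) while nothing matches.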

browser = Chrome()
browser.get("")
html_source = browser.page_source

from selenium.webdriver import Chrome, ChromeOptions

option = ChromeOptions()

# Drop the "enable-automation" switch so Chrome does not show the automation info bar
option.add_experimental_option('excludeSwitches', ['enable-automation'])
driver = Chrome(options=option)

selenium

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import Select
import time
import pandas as pd

website = 'https://www.adamchoi.co.uk/overs/detailed'
country = 'Spain'

driver = webdriver.Chrome()
driver.get(website)

time.sleep(5)

all_matches_button = driver.find_element(by=By.XPATH, value="//label[@analytics-event='All matches']")
all_matches_button.click()

dropdown = Select(driver.find_element(by=By.ID, value='country'))
dropdown.select_by_visible_text(country)

time.sleep(5)

matches = driver.find_elements(by=By.TAG_NAME, value='tr')

dates = []
home_team = []
score = []
away_team = []
for match in matches:
    # print(match.text)
    date = match.find_element(by=By.XPATH, value='./td[1]').text
    print(date)
    dates.append(date)
    home_team.append(match.find_element(by=By.XPATH, value='./td[2]').text)
    score.append(match.find_element(by=By.XPATH, value='./td[3]').text)
    away_team.append(match.find_element(by=By.XPATH, value='./td[4]').text)


df = pd.DataFrame({
    'date': dates,
    'home_team': home_team,
    'score': score,
    'away_team': away_team,
})
df.to_csv(f"www_adamchoi_co_uk/{country}_football_data.csv", index=False)
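
`DataFrame.to_csv` does not create missing parent folders, so the call above fails if the `www_adamchoi_co_uk/` directory does not exist yet. Below is a minimal standard-library sketch of creating the folder first and then writing the rows. `save_rows` is a hypothetical helper, and it uses the `csv` module instead of pandas only to stay dependency-free.

```python
import csv
from pathlib import Path


def save_rows(rows, out_path, fieldnames):
    """Write a list of dicts to a CSV file, creating parent folders if needed.

    Hypothetical helper for illustration; `fieldnames` sets the column order.
    """
    out_path = Path(out_path)
    # Create e.g. "www_adamchoi_co_uk/" when it is missing; no error if it exists
    out_path.parent.mkdir(parents=True, exist_ok=True)
    with out_path.open("w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
    return out_path
```

With pandas, the same effect comes from running `Path(path).parent.mkdir(parents=True, exist_ok=True)` before `df.to_csv(path, index=False)`.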

headless mode

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument('--headless=new')
# Recent Selenium 4 releases no longer accept the driver path as a positional argument
driver = webdriver.Chrome(options=options)

audible.com

import time

from selenium import webdriver
from selenium.webdriver.common.by import By

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions

# from selenium.webdriver.chrome.options import Options

import pandas

# website = 'https://www.audible.com/search'
website = 'https://www.audible.com/adblbestsellers'

# options = Options()
# options.add_argument('--headless=new')
# options.add_argument('--window-size=1920,1080')

# driver = webdriver.Chrome(options=options)
driver = webdriver.Chrome()
driver.get(website)
driver.maximize_window()

# time.sleep(5)

# pagination

pagination_li_s = (WebDriverWait(driver,10)
             .until(expected_conditions.presence_of_all_elements_located(
    (By.XPATH, '//ul[contains(@class,"pagingElements")]/li'))))

# pagination_li_s = driver.find_elements(by=By.XPATH, value='//ul[contains(@class,"pagingElements")]/li')
last_page = int(pagination_li_s[-2].text)
print(last_page)

book_titles = []
book_authors = []
book_lengths = []

for page in range(1, last_page + 1):
    # time.sleep(5)

    # container = (WebDriverWait(driver,10)
    #              .until(expected_conditions.presence_of_element_located(
    #     (By.XPATH, '//*[@id="center-3"]/div/div/div/span[2]/ul'))))
    # book_list = container.find_elements(by=By.XPATH, value='./li')
    book_list = (WebDriverWait(driver, 5)
                 .until(expected_conditions.presence_of_all_elements_located(
        (By.XPATH, '//*[@id="center-3"]/div/div/div/span[2]/ul/li'))))

    # book_list = driver.find_elements(by=By.XPATH, value='//*[@id="center-3"]/div/div/div/span[2]/ul/li')

    for book in book_list:
        title = book.find_element(by=By.XPATH, value=".//h3/a").text
        print(title)
        author = book.find_element(by=By.XPATH, value=".//li[contains(@class,'authorLabel')]/span").text
        length = book.find_element(by=By.XPATH, value=".//li[contains(@class,'runtimeLabel')]/span").text
        book_titles.append(title)
        book_authors.append(author)
        book_lengths.append(length)

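    # Only move on while more pages remain; there is nothing to click through to after the last page
    if page < last_page:
        next_page = driver.find_element(By.XPATH, '//*[contains(@class,"nextButton")]')
        next_page.click()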

df = pandas.DataFrame({
    "book_titles": book_titles,
    "book_authors": book_authors,
    "book_lengths": book_lengths,
})

df.to_csv('www_audible_com/books.csv', index=False)

twitter.com login

from selenium import webdriver
from selenium.webdriver.common.by import By
import time

web = 'https://twitter.com/'

driver = webdriver.Chrome()
driver.get(web)

time.sleep(3)

login = driver.find_element(By.XPATH, '//*[@data-testid="loginButton"]')
login.click()

time.sleep(5)

username = driver.find_element(By.XPATH, '//*[@name="text"]')
username.send_keys("email@qq.com")

next_btn = driver.find_element(By.XPATH, '//*[@role="dialog"]/div/div/div[2]/div[2]/div/div/div/div[6]')
next_btn.click()

time.sleep(5)
# Twitter flagged unusual account activity, so it asks for the username before the password
username_check = driver.find_element(By.XPATH, '//*[@name="text"]')
username_check.send_keys("username")

next_btn = driver.find_element(By.XPATH, '//*[@role="dialog"]/div/div/div[2]/div[2]/div[2]/div/div/div/div/div')
next_btn.click()

time.sleep(4)

password = driver.find_element(By.XPATH, '//*[@name="password"]')
password.send_keys("password123")

next_btn = driver.find_element(By.XPATH, '//*[@role="dialog"]/div/div/div[2]/div[2]/div[2]/div/div[1]/div/div/div/div')
next_btn.click()


time.sleep(100)

Quickly generate a dict from copied text

regex: (.*): (.*)
replace: "$1": "$2",

i: 好人
from: auto
to: 
dictResult: true
keyid: webfanyi

"i": "好人",
"from": "auto",
"to": "",
"dictResult": "true",
"keyid": "webfanyi",

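The editor find-and-replace above can also be done in Python, which saves the manual step when form data or headers are pasted from DevTools. This is a minimal sketch: `text_to_dict` is a hypothetical helper, and it uses a non-greedy key pattern (split at the first colon) rather than the greedy `(.*): (.*)` shown above.

```python
import re

# Key: everything before the first colon; value: the rest of the line
LINE_PATTERN = re.compile(r"^(.*?):\s*(.*)$")


def text_to_dict(text):
    """Turn 'key: value' lines into a dict of strings (hypothetical helper)."""
    result = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        match = LINE_PATTERN.match(line)
        if match:
            result[match.group(1).strip()] = match.group(2).strip()
    return result


copied = """i: 好人
from: auto
to:
dictResult: true
keyid: webfanyi"""

print(text_to_dict(copied))
```

All values come back as strings, so `dictResult` is `"true"`, matching the editor output above.
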
md5

from hashlib import md5

def md5_string(string_0):
    # Hash the UTF-8 bytes of the string and return the hex digest
    s = md5()
    s.update(string_0.encode())
    return s.hexdigest()
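
A quick way to check an MD5 helper is to compare it against well-known digests. The sketch below uses a standalone function, `md5_hex` (a hypothetical name), and also shows that feeding the data in several `update()` calls gives the same digest as hashing it in one go.

```python
from hashlib import md5


def md5_hex(text, encoding="utf-8"):
    """Return the MD5 hex digest of a string (hypothetical helper name)."""
    return md5(text.encode(encoding)).hexdigest()


# Well-known reference digests
print(md5_hex("hello"))  # 5d41402abc4b2a76b9719d911017c592
print(md5_hex(""))       # d41d8cd98f00b204e9800998ecf8427e

# Incremental hashing: update() calls are equivalent to hashing the concatenation
h = md5()
h.update("he".encode())
h.update("llo".encode())
incremental_digest = h.hexdigest()
print(incremental_digest == md5_hex("hello"))  # True
```

The encoding matters: the same text encoded as UTF-8 or GBK yields different bytes and therefore different digests, which is a common reason a computed signature does not match what a site expects.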
