qiangfeng python

Network request library -> browser client

Web server
WSGI: The Web Server Gateway Interface is a simple calling convention for web servers to forward requests to web applications or frameworks written in the Python programming language.
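
To make the calling convention concrete, here is a minimal sketch of a WSGI application. The names `application` and `call_app` are illustrative choices, not from the notes; `call_app` simply invokes the callable the way a server would, without opening a socket.

```python
from wsgiref.util import setup_testing_defaults


def application(environ, start_response):
    """A minimal WSGI application: the server calls this once per request."""
    body = f"Hello from {environ['PATH_INFO']}\n".encode("utf-8")
    headers = [
        ("Content-Type", "text/plain; charset=utf-8"),
        ("Content-Length", str(len(body))),
    ]
    start_response("200 OK", headers)  # status line and headers go first
    return [body]  # the body is an iterable of byte strings


def call_app(app, path="/"):
    """Hypothetical helper: call a WSGI app directly with a fake environ."""
    environ = {}
    setup_testing_defaults(environ)  # fills in the keys WSGI requires
    environ["PATH_INFO"] = path
    captured = {}

    def start_response(status, response_headers, exc_info=None):
        captured["status"] = status
        captured["headers"] = response_headers

    body = b"".join(app(environ, start_response))
    return captured["status"], captured["headers"], body


status, headers, body = call_app(application, "/home.html")
print(status, body)
```

To serve the same callable over HTTP during development, `wsgiref.simple_server.make_server("", 8000, application).serve_forever()` from the standard library should be enough.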

Browser client
WebKit is a browser engine developed by Apple and primarily used in its Safari web browser, as well as all web browsers on iOS and iPadOS.


HTTP Requests
Start line
HTTP requests are messages sent by the client to initiate an action on the server. Their start line contains three elements:

  1. An HTTP method, a verb (like GET, PUT or POST) or a noun (like HEAD or OPTIONS), that describes the action to be performed. For example, GET indicates that a resource should be fetched, while POST means that data is pushed to the server (creating or modifying a resource, or generating a temporary document to send back).

  2. The request target, usually a URL or an absolute path; the protocol, port, and domain are usually implied by the request context. The format of the request target varies between HTTP methods.

  3. The HTTP version, which defines the structure of the remaining message, acting as an indicator of the expected version to use for the response.
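
The three elements listed above can be pulled apart with a few lines of Python. This is only a sketch for HTTP/1.x start lines; `StartLine` and `parse_start_line` are hypothetical names introduced for illustration.

```python
from typing import NamedTuple


class StartLine(NamedTuple):
    method: str
    target: str
    version: str


def parse_start_line(line: str) -> StartLine:
    """Split an HTTP/1.x request start line into its three elements."""
    parts = line.strip().split(" ")
    if len(parts) != 3:
        raise ValueError(f"malformed start line: {line!r}")
    method, target, version = parts
    if not version.startswith("HTTP/"):
        raise ValueError(f"unexpected version: {version!r}")
    return StartLine(method, target, version)


start = parse_start_line("GET /home.html HTTP/1.1")
print(start.method, start.target, start.version)
```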

HTTP request header
The start line and HTTP headers of the HTTP message are collectively known as the head of the request, whereas its payload is known as the body.

GET /home.html HTTP/1.1
Host: developer.mozilla.org
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.9; rv:50.0) Gecko/20100101 Firefox/50.0
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-US,en;q=0.5
Accept-Encoding: gzip, deflate, br
Referer: https://developer.mozilla.org/testpage.html
Connection: keep-alive
Upgrade-Insecure-Requests: 1
If-Modified-Since: Mon, 18 Jul 2016 02:36:04 GMT
If-None-Match: "c561c68d0ba92bbeb8b0fff2a9199f722e3a621a"
Cache-Control: max-age=0
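
As a rough sketch of how the head and body are told apart: in HTTP/1.x the head ends at the first blank line. The snippet below reuses a few headers from the request above; `split_message` is a hypothetical helper, and it keeps only the last value if a header is repeated.

```python
RAW_REQUEST = (
    "GET /home.html HTTP/1.1\r\n"
    "Host: developer.mozilla.org\r\n"
    "Referer: https://developer.mozilla.org/testpage.html\r\n"
    "Connection: keep-alive\r\n"
    "\r\n"
)


def split_message(raw: str):
    """Return (start_line, headers, body) for a raw HTTP/1.x message."""
    head, _, body = raw.partition("\r\n\r\n")  # blank line ends the head
    start_line, *header_lines = head.split("\r\n")
    headers = {}
    for line in header_lines:
        # Split on the first colon only, since values may contain colons.
        name, _, value = line.partition(":")
        # Header names are case-insensitive, so normalise to lower case.
        headers[name.strip().lower()] = value.strip()
    return start_line, headers, body


start_line, headers, body = split_message(RAW_REQUEST)
print(start_line)
print(headers["host"], headers["connection"])
print(repr(body))  # a GET request normally has an empty body
```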

Body
The final part of the request is its body. Not all requests have one: requests fetching resources, like GET, HEAD, DELETE, or OPTIONS, usually don't need one. Some requests send data to the server in order to update it, as is often the case with POST requests (containing HTML form data).

Bodies can be broadly divided into two categories:

  • Single-resource bodies, consisting of one single file, defined by the two headers: Content-Type and Content-Length.
  • Multiple-resource bodies, consisting of a multipart body, each part containing a different bit of information. This is typically associated with HTML Forms.
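
For the single-resource case, the sketch below builds a URL-encoded form body and the two headers that describe it. The form fields are made up for illustration; a multipart body would need boundaries between parts and is not shown here.

```python
from urllib.parse import urlencode

# Hypothetical form data, encoded as a single-resource body.
form = {"name": "Ada", "lang": "python"}
body = urlencode(form).encode("ascii")

headers = {
    "Content-Type": "application/x-www-form-urlencoded",
    "Content-Length": str(len(body)),  # length in bytes, not characters
}

print(body)
print(headers)
```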

HTTP Responses

Status line
The start line of an HTTP response, called the status line, contains the following information:

  • The protocol version, usually HTTP/1.1.
  • A status code, indicating success or failure of the request. Common status codes are 200, 404, or 302.
  • A status text: a brief, purely informational, textual description of the status code to help a human understand the HTTP message.

A typical status line looks like: HTTP/1.1 404 Not Found.
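
A status line can be split the same way as a request start line. The sketch below assumes a well-formed HTTP/1.x status line; `parse_status_line` is a hypothetical helper, while `http.HTTPStatus` is the standard-library table of status codes.

```python
from http import HTTPStatus


def parse_status_line(line: str):
    """Split a status line into (version, code, status text)."""
    # maxsplit=2 keeps multi-word status texts such as "Not Found" intact.
    version, code, reason = line.strip().split(" ", 2)
    return version, int(code), reason


version, code, reason = parse_status_line("HTTP/1.1 404 Not Found")
print(version, code, reason)

# The standard library knows the conventional status text for each code.
print(HTTPStatus(code).phrase)

# 2xx codes indicate success; 404 is a client error.
is_success = 200 <= code < 300
print(is_success)
```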


Network requests:

  • urllib
    urllib is a package that collects several modules for working with URLs: urllib.request, urllib.error, urllib.parse, and urllib.robotparser.
  • requests
    Requests is an HTTP client library for the Python programming language. It is one of the most popular Python libraries not included with Python itself, largely because of its elegant mapping of the HTTP protocol onto Python's object-oriented semantics.
    It is implemented as a wrapper around urllib3, another third-party Python HTTP library.
  • selenium
  • appium (mobile apps)
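
As a small illustration of the standard-library option, the sketch below builds a URL and a request object with urllib without sending anything over the network. The URL and User-Agent string are placeholders chosen for the example.

```python
from urllib.parse import parse_qs, urlencode, urlsplit
from urllib.request import Request

# Build a URL with an encoded query string (example.org is a placeholder).
url = "https://example.org/search?" + urlencode({"q": "python", "page": 2})

# Take the URL apart again.
parts = urlsplit(url)
print(parts.scheme, parts.netloc, parts.path)
query = parse_qs(parts.query)
print(query)

# Describe the request; nothing is sent until urlopen() is called.
req = Request(url, headers={"User-Agent": "notes-demo/0.1"}, method="GET")
print(req.get_method(), req.full_url)
# urllib stores header names capitalised as "User-agent".
print(req.get_header("User-agent"))

# Sending it would look like this (left commented out: needs network access):
# from urllib.request import urlopen
# with urlopen(req, timeout=10) as resp:
#     html = resp.read().decode("utf-8")
```

With the third-party requests library installed, the rough equivalent is `requests.get(url, headers={...}, timeout=10)`.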

Data parsing

  • re
    This module provides regular expression matching operations similar to those found in Perl.
  • xpath
    XPath (XML Path Language) is an expression language designed to support the query or transformation of XML documents.
    The XPath language is based on a tree representation of the XML document, and provides the ability to navigate around the tree, selecting nodes by a variety of criteria.
  • bs4
  • json
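
The sketch below runs three of the parsers listed above on made-up sample data. It uses only the standard library: `xml.etree.ElementTree` supports a limited subset of XPath, so full XPath queries would need a third-party package such as lxml, and bs4 (also third-party) is not shown.

```python
import json
import re
import xml.etree.ElementTree as ET

# Made-up sample document for illustration.
DOC = """<books>
  <book id="1"><title>Fluent Python</title><price>39.90</price></book>
  <book id="2"><title>Python Cookbook</title><price>45.00</price></book>
</books>"""

# re: pull values out with a pattern, treating the document as plain text.
prices = re.findall(r"<price>([\d.]+)</price>", DOC)
print(prices)

# XPath (ElementTree subset): navigate the tree and select nodes by criteria.
root = ET.fromstring(DOC)
titles = [node.text for node in root.findall("./book/title")]
second_title = root.find("./book[@id='2']/title").text
print(titles, second_title)

# json: decode an API-style response into Python objects.
record = json.loads('{"title": "Fluent Python", "tags": ["python", "book"]}')
print(record["tags"])
```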

Data storage

  • pymysql
  • mongodb
  • elasticsearch

Multitasking

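  • Multithreading threading/queue
    Preemptive multitasking is one way an operating system implements multitasking, as opposed to cooperative multitasking. In a cooperative environment, the next process is scheduled only when the current process voluntarily gives up its time slice. In a preemptive environment, the operating system alone decides the scheduling: it can take the time slice away from a long-running process and give it to other processes.
  • Coroutines asyncio | gevent/eventlet
    Coroutines are very similar to threads. However, coroutines are cooperatively multitasked, whereas typical threads are preemptively multitasked at the kernel level.
    Cooperative multitasking is a form of multitasking (the technique that lets a computer work on several programs at once). Unlike preemptive multitasking, where the operating system decides when to switch tasks, it requires each running program to periodically yield its right to execute so that the next program can run. Because scheduling depends on the programs cooperating, it is called cooperative multitasking.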
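
The two styles can be compared side by side in a short sketch. The `fetch` functions below are stand-ins that sleep instead of doing real network I/O, and all names are illustrative. In the threaded version the operating system decides when each worker runs; in the asyncio version each coroutine hands control back explicitly at `await`.

```python
import asyncio
import queue
import threading
import time


def fetch(url: str) -> str:
    """Stand-in for a blocking network call (hypothetical)."""
    time.sleep(0.05)
    return f"fetched {url}"


def threaded_crawl(urls, workers=3):
    """Preemptive style: worker threads pull URLs from a shared queue."""
    tasks = queue.Queue()
    for url in urls:
        tasks.put(url)
    results = []
    lock = threading.Lock()

    def worker():
        while True:
            try:
                url = tasks.get_nowait()
            except queue.Empty:
                return  # no work left
            page = fetch(url)
            with lock:  # protect the shared list
                results.append(page)

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return sorted(results)  # completion order is not deterministic


async def fetch_async(url: str) -> str:
    """Stand-in for a non-blocking network call (hypothetical)."""
    await asyncio.sleep(0.05)  # yields control to the event loop
    return f"fetched {url}"


async def async_crawl(urls):
    """Cooperative style: coroutines run on one thread and yield at await."""
    return list(await asyncio.gather(*(fetch_async(u) for u in urls)))


URLS = [f"https://example.org/page/{i}" for i in range(6)]
thread_results = threaded_crawl(URLS)
async_results = asyncio.run(async_crawl(URLS))
print(thread_results == async_results)
```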

Anti-scraping

  • UA (user-agent)
  • Login cookies
  • Request rate -> IP proxies
  • CAPTCHA
  • Dynamic JS (splash/api)