grab

Subpackages

Submodules

Package Contents

Classes

Grab
Document – Network response.
Request

Attributes

DataNotFound
__version__

class grab.Grab(transport: None | BaseTransport | type[BaseTransport] = None, **kwargs: Any)
property doc: None | Document
__slots__ = ['proxylist', 'config', 'transport', 'request_method', 'cookies', 'meta', '_doc']
document_class: type[grab.document.Document]
clonable_attributes = ['proxylist']
process_transport_option(transport: None | BaseTransport | type[BaseTransport], default_transport: type[grab.base_transport.BaseTransport]) grab.base_transport.BaseTransport
clone(**kwargs: Any) Grab

Create a clone of the Grab instance.

The cloned instance has the same state: cookies, referrer, and response document data.

Parameters

**kwargs – settings that override those of the cloned Grab instance
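
The clone-with-overrides behaviour described above can be sketched with the standard library alone. This is an illustration, not Grab's implementation: MiniClient and its config dictionary are hypothetical stand-ins for a configurable client.

```python
import copy
from typing import Any


class MiniClient:
    """Hypothetical stand-in for a configurable client; not the Grab class."""

    def __init__(self, **config: Any) -> None:
        self.config = dict(config)

    def clone(self, **overrides: Any) -> "MiniClient":
        # Deep-copy so that nested values are not shared with the original.
        new = MiniClient(**copy.deepcopy(self.config))
        # Keyword arguments override the copied settings.
        new.config.update(overrides)
        return new


base = MiniClient(timeout=10, headers={"Accept": "text/html"})
other = base.clone(timeout=3)
# Changing the clone must not leak back into the original.
other.config["headers"]["Accept"] = "application/json"
print(base.config)
print(other.config)
```

The deep copy is the design point: a shallow copy would let the clone and the original share the nested headers dictionary.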

dump_config() collections.abc.MutableMapping[str, Any]

Make a clone of the current config.

load_config(config: grab.types.GrabConfig) None

Configure grab instance with external config object.

setup(**kwargs: Any) None

Set up Grab instance configuration.

prepare_request() grab.request.Request

Configure everything needed to make a real network request.

This method is called before the real request is performed via the transport extension.

create_request_from_config(config: collections.abc.MutableMapping[str, Any]) grab.request.Request
log_request(req: grab.request.Request, extra: str = '') None

Send request details to logging system.

find_redirect_url(doc: grab.document.Document) None | str
request(url: None | str = None, **kwargs: Any) grab.document.Document

Perform a network request.

You can specify Grab settings in **kwargs. Any keyword argument is passed to self.config.

Returns: a Document object.

submit(make_request: bool = True, **kwargs: Any) None | Document

Submit the current form.

Parameters

make_request – if False, the Grab instance is configured with the form post data but the request is not performed

For details, see the Document.submit() method.

process_request_result(req: grab.request.Request) grab.document.Document

Process the result of a real request performed via the transport extension.

reset_temporary_options() None
change_proxy(random: bool = True) None

Set a random proxy from the proxylist.
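
A rough sketch of proxy rotation, for illustration only. ProxyRotator, pick and the proxy addresses are hypothetical. That a non-random choice means "take the next proxy in order" is an assumption made for this sketch, not something this page states.

```python
import itertools
import random
from typing import List


class ProxyRotator:
    """Hypothetical helper; not part of Grab."""

    def __init__(self, proxies: List[str]) -> None:
        if not proxies:
            raise ValueError("proxy list is empty")
        self._proxies = list(proxies)
        self._cycle = itertools.cycle(self._proxies)

    def pick(self, use_random: bool = True) -> str:
        if use_random:
            # Uniform random choice from the whole list.
            return random.choice(self._proxies)
        # Assumption: the non-random mode walks the list in order.
        return next(self._cycle)


rotator = ProxyRotator(["proxy1.example:8080", "proxy2.example:8080"])
print(rotator.pick())
print(rotator.pick(use_random=False))
```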

classmethod common_headers() dict[str, str]

Build the headers that a typical browser sends.

make_url_absolute(url: str, resolve_base: bool = False) str

Make the URL absolute, using the previous request URL as the base URL.
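
Resolving a relative URL against a base URL can be sketched with urllib.parse.urljoin from the standard library. The helper name absolutize and the example URLs are hypothetical, and the sketch does not model the resolve_base parameter.

```python
from urllib.parse import urljoin


def absolutize(url: str, previous_url: str) -> str:
    """Resolve `url` against the URL of the previous request."""
    return urljoin(previous_url, url)


previous_url = "https://example.com/catalog/page1.html"
print(absolutize("item?id=5", previous_url))
print(absolutize("/about", previous_url))
# An already absolute URL is returned unchanged.
print(absolutize("https://other.example/x", previous_url))
```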

clear_cookies() None

Clear all remembered cookies.

__getstate__() dict[str, Any]
__setstate__(state: collections.abc.Mapping[str, Any]) None
class grab.Document(body: None | bytes = None, *, document_type: None | str = 'html', head: None | bytes = None, headers: None | email.message.Message = None, encoding: None | str = None, code: None | int = None, url: None | str = None, cookies: None | CookieJar = None)

Network response.

property status: None | int
property json: Any

Return the response body deserialized from JSON.

property pyquery: Any

Return pyquery handler.

property body: None | bytes
property tree: lxml.etree._Element

Return the DOM tree of the document, built with the HTML DOM builder.

property form: lxml.html.FormElement

Return the document's default form.

If no form was selected manually, the form with the largest number of input elements is selected.

The form value is just an lxml.html form element.

Example:

g.request('some URL')
# Choose form automatically
print(g.form)

# And now choose form manually
g.choose_form(1)
print(g.form)
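
The automatic choice (the form with the largest number of input elements) can be illustrated without lxml. The sketch below uses html.parser from the standard library. FormInputCounter, pick_default_form and the sample page are hypothetical, and Grab itself works on an lxml tree, so treat this as a demonstration of the selection rule only.

```python
from html.parser import HTMLParser
from typing import List


class FormInputCounter(HTMLParser):
    """Count <input> elements inside each <form>, in document order."""

    def __init__(self) -> None:
        super().__init__()
        self.counts: List[int] = []
        self._inside_form = False

    def handle_starttag(self, tag, attrs):
        if tag == "form":
            self.counts.append(0)
            self._inside_form = True
        elif tag == "input" and self._inside_form:
            self.counts[-1] += 1

    def handle_endtag(self, tag):
        if tag == "form":
            self._inside_form = False


def pick_default_form(html: str) -> int:
    """Return the zero-based index of the form with the most inputs."""
    parser = FormInputCounter()
    parser.feed(html)
    if not parser.counts:
        raise ValueError("document contains no forms")
    # On a tie, max() keeps the earliest form.
    return max(range(len(parser.counts)), key=parser.counts.__getitem__)


PAGE = """
<form action="/search"><input name="q"></form>
<form action="/signup">
  <input name="email"><input name="password"><input type="submit">
</form>
"""
print(pick_default_form(PAGE))
```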
__slots__ = ['document_type', 'code', 'head', '_bytes_body', 'headers', 'url', 'cookies', 'encoding',...
__call__(query: str) selection.SelectorList[lxml.etree._Element]
select(*args: Any, **kwargs: Any) selection.SelectorList[lxml.etree._Element]
process_encoding(encoding: None | str = None) str

Process an explicitly defined encoding or auto-detect it.

If the encoding is explicitly defined, ensure that it is a valid encoding that Python can handle. If the encoding is not specified, auto-detect it.

Raises unicodec.InvalidEncodingName if the explicitly set encoding is invalid.
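
Checking that an encoding name is one Python can handle can be sketched with codecs.lookup. This is not how Grab does it (the text above refers to the unicodec package). resolve_encoding is a hypothetical helper, ValueError stands in for InvalidEncodingName, and the fallback replaces real auto-detection, which is out of scope here.

```python
import codecs
from typing import Optional


def resolve_encoding(encoding: Optional[str], fallback: str = "utf-8") -> str:
    """Validate an explicit encoding name or fall back to a default."""
    if encoding is None:
        # Placeholder for auto-detection.
        return fallback
    try:
        # codecs.lookup() normalises aliases and rejects unknown names.
        return codecs.lookup(encoding).name
    except LookupError as ex:
        raise ValueError(f"invalid encoding name: {encoding!r}") from ex


print(resolve_encoding("UTF8"))
print(resolve_encoding(None))
```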

copy() Document
save(path: str) None

Save the response body to a file.

url_details() urllib.parse.SplitResult

Return the result of the urlsplit function applied to the response URL.

query_param(key: str) str

Return the value of a parameter in the query string.
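
Both operations map onto urllib.parse. The sketch below is a standalone illustration. The function name first_query_value and the example URL are hypothetical, and raising KeyError for a missing key is a choice made for this sketch.

```python
from urllib.parse import parse_qs, urlsplit


def first_query_value(url: str, key: str) -> str:
    """Return the first value of `key` in the query string of `url`."""
    details = urlsplit(url)            # SplitResult(scheme, netloc, path, query, fragment)
    values = parse_qs(details.query)   # maps each key to a list of values
    return values[key][0]              # KeyError if the key is absent


url = "https://example.com/search?q=grab&page=2"
print(urlsplit(url).netloc)
print(first_query_value(url, "page"))
```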

browse() None

Save the response to a temporary file and open it in a GUI browser.

__getstate__() collections.abc.Mapping[str, Any]

Reset cached lxml objects, which cannot be pickled.

__setstate__(state: collections.abc.Mapping[str, Any]) None
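
The general pattern behind this pair of methods (drop cached objects that cannot be pickled, then restore the remaining state) can be sketched as follows. SlimDoc and its two slots are hypothetical and far simpler than Document.

```python
import pickle


class SlimDoc:
    """Hypothetical document-like class; not Grab's Document."""

    __slots__ = ["body", "_tree"]

    def __init__(self, body: bytes) -> None:
        self.body = body
        self._tree = None  # lazily built cache, may hold unpicklable objects

    def __getstate__(self):
        # Copy every slot, then blank the cache so pickling cannot fail on it.
        state = {name: getattr(self, name) for name in self.__slots__}
        state["_tree"] = None
        return state

    def __setstate__(self, state):
        for name, value in state.items():
            setattr(self, name, value)


if __name__ == "__main__":
    doc = SlimDoc(b"<html></html>")
    doc._tree = lambda: None  # a lambda stands in for an unpicklable cache
    restored = pickle.loads(pickle.dumps(doc))
    print(restored.body, restored._tree)
```

After the round trip the cache is empty and can be rebuilt on demand from the body.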

text_search(anchor: str | bytes) bool

Search for the substring in the response body.

Parameters
  • anchor – string to search for

  • byte – if False, anchor should be a unicode string and the search is performed in response.unicode_body(); otherwise anchor should be a byte string and the search is performed in response.body

Return True if the substring is found, else False.

text_assert(anchor: str | bytes) None

Raise a DataNotFound exception if the anchor is not found.

text_assert_any(anchors: list[str | bytes]) None

Raise a DataNotFound exception if none of the anchors is found.
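
A minimal sketch of the substring checks above, assuming that a str anchor is matched against the decoded body and a bytes anchor against the raw body. The functions contains and assert_any and the local DataNotFound class are stand-ins defined for this example, not Grab's code.

```python
from typing import Iterable, Union

Anchor = Union[str, bytes]


class DataNotFound(Exception):
    """Stand-in for grab.DataNotFound."""


def contains(body: bytes, anchor: Anchor, encoding: str = "utf-8") -> bool:
    if isinstance(anchor, str):
        # Text anchors are compared against the decoded body.
        return anchor in body.decode(encoding)
    # Byte anchors are compared against the raw body.
    return anchor in body


def assert_any(body: bytes, anchors: Iterable[Anchor]) -> None:
    if not any(contains(body, anchor) for anchor in anchors):
        raise DataNotFound("none of the anchors was found")


body = "<h1>Привет, Grab</h1>".encode("utf-8")
print(contains(body, "Привет"))
print(contains(body, b"<h1>"))
assert_any(body, ["missing", b"Grab"])
```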

rex_text(regexp: str | bytes | Pattern[str] | Pattern[bytes], flags: int = 0, default: Any = NULL) Any

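Return the content of the first matching group of the regexp found in the response body.

rex_search(regexp: str | bytes | Pattern[str] | Pattern[bytes], flags: int = 0, default: Any = NULL) Any

Search for the regular expression in the response body.

Return the found match object or None.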

rex_assert(rex: str | bytes | Pattern[str] | Pattern[bytes]) None

Raise a DataNotFound exception if the regular expression rex is not found.
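
The "first matching group, or a default" pattern can be sketched with the re module. The sentinel NULL, the local DataNotFound class and the function first_group are stand-ins defined here. Raising when no default is supplied is an assumption about the intended behaviour, not a statement about Grab's code.

```python
import re
from typing import Any

NULL = object()  # sentinel meaning "no default supplied"


class DataNotFound(Exception):
    """Stand-in for grab.DataNotFound."""


def first_group(body: str, pattern: str, flags: int = 0, default: Any = NULL) -> Any:
    match = re.search(pattern, body, flags)
    if match is not None:
        return match.group(1)
    if default is NULL:
        # Assumption: without a default, a missing match is an error.
        raise DataNotFound(f"pattern not found: {pattern!r}")
    return default


body = "<title>Grab docs</title>"
print(first_group(body, r"<title>(.+?)</title>"))
print(first_group(body, r"<h1>(.+?)</h1>", default=None))
```

A dedicated sentinel is used instead of None so that callers can pass None as a legitimate default.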

get_body_chunk() None | bytes
unicode_body() None | str

Return the response body as a unicode string.

set_body(body: bytes) None
classmethod wrap_io(inp: bytes | str) StringIO | BytesIO
classmethod _build_dom(content: bytes | str, mode: str, encoding: str) lxml.etree._Element
build_html_tree() lxml.etree._Element
build_xml_tree() lxml.etree._Element
choose_form(number: None | int = None, xpath: None | str = None, name: None | str = None, **kwargs: Any) None

Set the default form.

Parameters
  • number – number of the form (starting from zero)

  • id – value of the “id” attribute

  • name – value of the “name” attribute

  • xpath – XPath query

Raises
  • DataNotFound – if the form is not found

  • GrabMisuseError – if the method is called without parameters

The selected form is available via the form attribute of the Grab instance. All form methods work with the default form.

Examples:

# Select second form
g.choose_form(1)

# Select by id
g.choose_form(id="register")

# Select by name
g.choose_form(name="signup")

# Select by xpath
g.choose_form(xpath='//form[contains(@action, "/submit")]')
get_cached_form() None | FormElement

Get the form that has already been selected.

Returns None if no form has been selected yet.

This is mainly for testing, to avoid triggering pylint warnings about accessing a protected member.

set_input(name: str, value: Any) None

Set the value of a form element by its name attribute.

Parameters
  • name – name of the element

  • value – value to set on the element

To check or uncheck a checkbox, pass a boolean value.

Example:

g.set_input('sex', 'male')

# Check the checkbox
g.set_input('accept', True)
set_input_by_id(_id: str, value: Any) None

Set the value of form element by its id attribute.

Parameters
  • _id – id of element

  • value – value which should be set to element

set_input_by_number(number: int, value: Any) None

Set the value of form element by its number in the form.

Parameters
  • number – number of element

  • value – value which should be set to element

set_input_by_xpath(xpath: str, value: Any) None

Set the value of a form element by XPath.

Parameters
  • xpath – XPath query that selects the element

  • value – value which should be set to element

process_extra_post(post_items: list[tuple[str, Any]], extra_post_items: collections.abc.Sequence[tuple[str, Any]]) list[tuple[str, Any]]
clean_submit_controls(post: collections.abc.MutableMapping[str, Any], submit_name: None | str) None
get_form_request(submit_name: None | str = None, url: None | str = None, extra_post: None | Mapping[str, Any] | Sequence[tuple[str, Any]] = None, remove_from_post: None | Sequence[str] = None) tuple[str, str, bool, collections.abc.Sequence[tuple[str, Any]]]

Submit default form.

Parameters
  • submit_name – name of button which should be “clicked” to submit form

  • url – explicitly specify form action url

  • extra_post – (dict or list of pairs) additional form data which will override data automatically extracted from the form.

  • remove_from_post – list of keys to remove from the submitted data

Following input elements are automatically processed:

  • input[type="hidden"] - default value

  • select - value of last option

  • radio - ???

  • checkbox - ???

Multipart forms are correctly recognized by grab library.
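
How extra_post and remove_from_post could combine with the fields extracted from a form is sketched below. build_post and the field names are hypothetical. Applying the overrides first and the removals second is an assumption for this sketch, so check the library source before relying on that order.

```python
from collections.abc import Mapping
from typing import Any, List, Optional, Sequence, Tuple, Union

Pairs = Sequence[Tuple[str, Any]]


def build_post(
    form_fields: Mapping,
    extra_post: Optional[Union[Mapping, Pairs]] = None,
    remove_from_post: Optional[Sequence[str]] = None,
) -> List[Tuple[str, Any]]:
    post = dict(form_fields)
    # Accept either a mapping or a sequence of (key, value) pairs.
    if isinstance(extra_post, Mapping):
        extra_items = list(extra_post.items())
    else:
        extra_items = list(extra_post or [])
    for key, value in extra_items:
        post[key] = value  # extra data overrides the extracted value
    for key in remove_from_post or []:
        post.pop(key, None)  # silently ignore keys that are not present
    return list(post.items())


fields = {"csrf": "abc", "name": "", "promo": "1"}
print(build_post(fields, extra_post={"name": "Ann"}, remove_from_post=["promo"]))
```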

build_fields_to_remove(fields: collections.abc.Mapping[str, Any], form_inputs: collections.abc.Sequence[lxml.html.HtmlElement]) set[str]
process_form_fields(fields: collections.abc.MutableMapping[str, Any]) None
form_fields() collections.abc.MutableMapping[str, lxml.html.HtmlElement]

Return fields of default form.

Fill some fields with reasonable values.

choose_form_by_element(xpath: str) None
grab.DataNotFound
exception grab.GrabError

Bases: Exception

All custom Grab exceptions should be subclasses of this class.

exception grab.GrabMisuseError

Bases: GrabError

Indicates incorrect usage of the Grab API.

exception grab.GrabNetworkError(*args: Any, **kwargs: Any)

Bases: OriginalExceptionGrabError

Raised in case of a network error.

exception grab.GrabTimeoutError(*args: Any, **kwargs: Any)

Bases: GrabNetworkError

Raised when the configured timeout for the request has elapsed.
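
Because the exceptions form a hierarchy, handlers should be ordered from the most specific class to the most general. The sketch below redefines the classes locally as stand-ins so that it runs without the library. That OriginalExceptionGrabError derives from GrabError is an assumption here, and classify is a hypothetical helper.

```python
class GrabError(Exception):
    """Stand-in for grab.GrabError."""


class GrabMisuseError(GrabError):
    """Stand-in for grab.GrabMisuseError."""


class OriginalExceptionGrabError(GrabError):
    """Stand-in; assumed here to derive from GrabError."""


class GrabNetworkError(OriginalExceptionGrabError):
    """Stand-in for grab.GrabNetworkError."""


class GrabTimeoutError(GrabNetworkError):
    """Stand-in for grab.GrabTimeoutError."""


def classify(exc: Exception) -> str:
    try:
        raise exc
    except GrabTimeoutError:
        return "timeout"        # most specific handler first
    except GrabNetworkError:
        return "network"
    except GrabError:
        return "other grab error"


print(classify(GrabTimeoutError()))
print(classify(GrabNetworkError()))
print(classify(GrabMisuseError()))
```

If the GrabNetworkError handler came first, it would also catch GrabTimeoutError and the timeout branch would never run.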

class grab.Request(method: str, url: str, *, headers: None | dict[str, Any] = None, timeout: None | int | Timeout = None, cookies: None | dict[str, Any] = None, encoding: None | str = None, proxy_type: None | str = None, proxy: None | str = None, proxy_userpwd: None | str = None, fields: Any = None, body: None | bytes = None, multipart: None | bool = None, document_type: None | str = None)
get_full_url() str
_process_timeout_param(value: None | float | Timeout) grab.util.timeout.Timeout
grab.__version__ = 0.6.41