grab.spider

Subpackages

Submodules

Package Contents

Classes

Spider

Asynchronous scraping framework.

Task

Task for spider.

class grab.spider.Spider(task_queue: None | BaseTaskQueue = None, thread_number: None | int = None, network_try_limit: None | int = None, task_try_limit: None | int = None, priority_mode: str = 'random', meta: None | dict[str, Any] = None, config: None | dict[str, Any] = None, parser_requests_per_process: int = 10000, parser_pool_size: int = 1, network_service: None | BaseNetworkService = None, grab_transport: None | BaseTransport | type[BaseTransport] = None)[source]

Asynchronous scraping framework.

spider_name
initial_urls :list[str] = []
collect_runtime_event(name: str, value: None | str) None
setup_queue(*_args: Any, **_kwargs: Any) None

Set up queue.

add_task(task: grab.spider.task.Task, queue: None | BaseTaskQueue = None, raise_error: bool = False) bool

Add task to the task queue.

stop() None

Instruct spider to stop processing new tasks and start shutting down.

load_proxylist(source: str | BaseProxySource, source_type: None | str = None, proxy_type: str = 'http', auto_init: bool = True, auto_change: bool = True) None

Load proxy list.

Parameters
  • source – Proxy source. Accepts string (file path, url) or BaseProxySource instance.

  • source_type – The type of the specified source. Should be one of the following: ‘text_file’ or ‘url’.

  • proxy_type – Should be one of the following: ‘socks4’, ‘socks5’ or ‘http’.

  • auto_change – If set to True, proxies are rotated automatically: a random proxy from the list is assigned to each request.

Each line of the proxy source should be in one of the following formats:

  • ip:port

  • ip:port:login:password
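The two line formats above can be parsed with a helper like this (an illustrative stand-in, not part of the grab API; the name `parse_proxy_line` and the returned dict layout are assumptions):

```python
def parse_proxy_line(line: str) -> dict:
    """Parse one proxy-list line: ip:port or ip:port:login:password."""
    parts = line.strip().split(":")
    if len(parts) == 2:
        host, port = parts
        return {"host": host, "port": int(port),
                "login": None, "password": None}
    if len(parts) == 4:
        host, port, login, password = parts
        return {"host": host, "port": int(port),
                "login": login, "password": password}
    raise ValueError("Invalid proxy line: %r" % line)
```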

render_stats() str
prepare() None

Do additional spider customization here.

This method runs before the spider starts working.

shutdown() None

Override this method to do some final actions after parsing has been done.

update_grab_instance(grab: grab.base.Grab) None

Update config of any Grab instance created by the spider.


create_grab_instance(**kwargs: Any) grab.base.Grab
task_generator() collections.abc.Iterator[grab.spider.task.Task]

You can override this method to load new tasks.

It is called whenever the number of tasks in the task queue drops below the number of threads multiplied by 2. This keeps memory usage bounded even when the total number of tasks is large.
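The refill condition can be modelled as follows (a simplified sketch using a plain list as the queue; the name `refill_queue` is hypothetical, not grab API):

```python
from collections.abc import Iterator
from typing import Any


def refill_queue(queue: list, thread_number: int,
                 task_generator: Iterator[Any]) -> None:
    # Pull new tasks only while the queue holds fewer than
    # thread_number * 2 items, so memory stays bounded even
    # when the generator could yield millions of tasks.
    while len(queue) < thread_number * 2:
        try:
            queue.append(next(task_generator))
        except StopIteration:
            break
```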

check_task_limits(task: grab.spider.task.Task) tuple[bool, str]

Check that the task’s network-try and task-try counters do not exceed their limits.

Returns:

  • on success: (True, None)

  • on error: (False, reason)
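The return contract can be sketched like this (an illustrative stand-in operating on bare counters; the reason strings are assumptions, not the exact values grab produces):

```python
def check_task_limits(network_try_count: int, task_try_count: int,
                      network_try_limit: int = 10,
                      task_try_limit: int = 10):
    # Mirrors the documented contract:
    # (True, None) on success, (False, reason) on error.
    if task_try_count > task_try_limit:
        return False, "task-try-count"
    if network_try_count > network_try_limit:
        return False, "network-try-count"
    return True, None
```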

generate_task_priority() int
process_initial_urls() None
get_task_from_queue() None | Literal[True] | Task
setup_grab_for_task(task: grab.spider.task.Task) grab.base.Grab
is_valid_network_response_code(code: int, task: grab.spider.task.Task) bool

Test if response is valid.

A valid response is handled by the associated task handler; a failed response is processed by the error handler.

process_parser_error(func_name: str, task: grab.spider.task.Task, exc_info: tuple[type[Exception], Exception, types.TracebackType]) None
find_task_handler(task: grab.spider.task.Task) collections.abc.Callable[Ellipsis, Any]
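Handler lookup in grab's spider follows a naming convention: a task named "foo" is dispatched to a method named task_foo. The sketch below imitates that lookup on a plain object (the `LookupError` stands in for grab's NoTaskHandler, and `DemoSpider` is a hypothetical class):

```python
def find_task_handler(spider: object, task_name: str):
    # Look up a method named "task_<name>" on the spider;
    # the real framework raises NoTaskHandler when it is missing.
    handler = getattr(spider, "task_" + task_name, None)
    if handler is None:
        raise LookupError("no handler for task %r" % task_name)
    return handler


class DemoSpider:
    def task_initial(self, grab, task):
        return "parsed"
```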
log_network_result_stats(res: grab.spider.service.network.NetworkResult, task: grab.spider.task.Task) None
process_grab_proxy(task: grab.spider.task.Task, grab: grab.base.Grab) None

Assign new proxy from proxylist to the task.

change_active_proxy(task: grab.spider.task.Task, grab: grab.base.Grab) None
get_task_queue() grab.spider.queue_backend.base.BaseTaskQueue
is_idle_estimated() bool
is_idle_confirmed(services: list[grab.spider.service.base.BaseService]) bool

Test if spider is fully idle.

WARNING: As a side effect, this pauses all services in order to read queue state unaffected by running services.

The spider is fully idle when all of the following conditions are met:

  • all services are paused, i.e. they do not modify queues

  • all queues are empty

  • the task generator is exhausted
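The three conditions combine as a simple conjunction, which can be sketched as follows (an illustrative model on bare booleans; the name `is_fully_idle` is hypothetical):

```python
def is_fully_idle(services_paused: list,
                  queues_empty: list,
                  task_generator_done: bool) -> bool:
    # All three documented conditions must hold at the same time.
    return (all(services_paused)
            and all(queues_empty)
            and task_generator_done)
```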

run() None
shutdown_services(services: list[grab.spider.service.base.BaseService]) None
log_failed_network_result(res: grab.spider.service.network.NetworkResult) None
log_rejected_task(task: grab.spider.task.Task, reason: str) None
get_fallback_handler(task: grab.spider.task.Task) None | Callable[..., Any]
srv_process_service_result(result: Task | None | Exception | dict[str, Any], task: grab.spider.task.Task, meta: None | dict[str, Any] = None) None

Process result submitted from any service to task dispatcher service.

The result can be one of:

  • None

  • Task instance

  • ResponseNotValid-based exception

  • arbitrary exception

  • network response: {ok, ecode, emsg, exc, grab, grab_config_backup}

Exceptions can come only from the parser service; they always carry meta {“from”: “parser”, “exc_info”: <…>}.

srv_process_network_result(result: grab.spider.service.network.NetworkResult, task: grab.spider.task.Task) None
srv_process_task(task: grab.spider.task.Task) None
exception grab.spider.SpiderError[source]

Bases: grab.errors.GrabError

Base class for Spider exceptions.

exception grab.spider.SpiderMisuseError[source]

Bases: SpiderError

Improper usage of Spider framework.

exception grab.spider.FatalError[source]

Bases: SpiderError

Fatal error which should stop parsing process.

exception grab.spider.SpiderInternalError[source]

Bases: SpiderError

Raised when an error is thrown by internal spider logic, e.g. during spider class discovery or CLI handling.

exception grab.spider.NoTaskHandler[source]

Bases: SpiderError

Raised when no handler is found to process a network response.

exception grab.spider.NoDataHandler[source]

Bases: SpiderError

Raised when no handler is found to process a Data object.

class grab.spider.Task(name: None | str = None, url: None | str = None, grab: None | Grab = None, grab_config: None | GrabConfig = None, priority: None | int = None, priority_set_explicitly: bool = True, network_try_count: int = 0, task_try_count: int = 1, valid_status: None | list[int] = None, use_proxylist: bool = True, delay: None | int = None, raw: bool = False, callback: None | Callable[..., None] = None, fallback_name: None | str = None, store: None | dict[str, Any] = None, disable_cache: bool = False, refresh_cache: bool = False, cache_timeout: None | int = None, **kwargs: Any)[source]

Bases: BaseTask

Task for spider.

process_init_url_grab_options(url: None | str, grab: None | Grab, grab_config: None | GrabConfig) None
get(key: str, default: Any = None) Any

Return the value of the attribute, or the given default if the attribute does not exist.

process_delay_option(delay: None | float) None
setup_grab_config(grab_config: grab.types.GrabConfig) None
test_clone_options_integrity(url: None | str, grab: None | Grab, grab_config: None | GrabConfig) None
clone(**kwargs: Any) Task

Clone Task instance.

Reset network_try_count, increase task_try_count. Reset priority attribute if it was not set explicitly.
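The documented clone semantics can be illustrated with a minimal stand-in (the class `TaskSketch` is hypothetical and mimics only the counter and priority behaviour described above, not the full Task API):

```python
import copy


class TaskSketch:
    """Illustrative stand-in for grab.spider.Task clone semantics."""

    def __init__(self, name, url, priority=None,
                 priority_set_explicitly=True,
                 network_try_count=0, task_try_count=1):
        self.name = name
        self.url = url
        self.priority = priority
        self.priority_set_explicitly = priority_set_explicitly
        self.network_try_count = network_try_count
        self.task_try_count = task_try_count

    def clone(self, **kwargs):
        task = copy.copy(self)
        task.__dict__.update(kwargs)
        # Documented semantics: reset network tries, count one more task try.
        task.network_try_count = 0
        task.task_try_count = self.task_try_count + 1
        # Priority is reset unless it was set explicitly.
        if not task.priority_set_explicitly:
            task.priority = None
        return task
```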

__repr__() str

Return repr(self).

__lt__(other: Task) bool

Return self<value.

__eq__(other: object) bool

Return self==value.