Module grab.spider¶
- class grab.spider.base.Spider(thread_number=None, network_try_limit=None, task_try_limit=None, request_pause=<object object>, priority_mode='random', meta=None, only_cache=False, config=None, args=None, parser_requests_per_process=10000, parser_pool_size=1, network_service='threaded', grab_transport='urllib3', transport=None)[source]¶
Asynchronous scraping framework.
- check_task_limits(task)[source]¶
Check that task’s network & try counters do not exceed limits.
Returns: (True, None) on success; (False, reason) on error.
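The (ok, reason) return convention can be sketched in plain Python. The helper and the Task stand-in below are illustrative, not grab's actual implementation; the limit names mirror the Spider constructor arguments:

```python
from dataclasses import dataclass

@dataclass
class Task:
    # Hypothetical counters; grab's real Task object tracks similar state.
    task_try_count: int = 0
    network_try_count: int = 0

def check_task_limits(task, task_try_limit=10, network_try_limit=10):
    """Return (True, None) if the task may run, else (False, reason)."""
    if task.task_try_count > task_try_limit:
        return False, 'task-try-count'
    if task.network_try_count > network_try_limit:
        return False, 'network-try-count'
    return True, None
```

A task that failed too many times is rejected with a short reason string, which the spider can log or route to an error handler.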
- is_valid_network_response_code(code, task)[source]¶
Decide whether the response can be handled by the usual task handler, or whether the task failed and should be processed as an error.
- load_proxylist(source, source_type=None, proxy_type='http', auto_init=True, auto_change=True)[source]¶
Load proxy list.
- Parameters
source – Proxy source. Accepts a string (file path, URL) or a
BaseProxySource instance.
source_type – The type of the specified source. Should be one of the following: ‘text_file’ or ‘url’.
proxy_type – Should be one of the following: ‘socks4’, ‘socks5’ or ‘http’.
auto_change – If set to True then automatic random proxy rotation will be used.
- Proxy source format should be one of the following (for each line):
ip:port
ip:port:login:password
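Parsing the two supported line formats is straightforward; the parse_proxy_line helper below is an illustrative sketch, not part of grab's API:

```python
def parse_proxy_line(line):
    """Parse 'ip:port' or 'ip:port:login:password' into a dict."""
    parts = line.strip().split(':')
    if len(parts) == 2:
        host, port = parts
        return {'host': host, 'port': int(port),
                'login': None, 'password': None}
    if len(parts) == 4:
        host, port, login, password = parts
        return {'host': host, 'port': int(port),
                'login': login, 'password': password}
    raise ValueError('Invalid proxy line: %s' % line)
```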
- prepare()[source]¶
You can do additional spider customization here before it has started working. Simply redefine this method in your Spider class.
- process_next_page(grab, task, xpath, resolve_base=False, **kwargs)[source]¶
Generate task for next page.
- Parameters
grab – Grab instance
task – Task object which should be assigned to next page url
xpath – XPath expression that selects the list of URLs
**kwargs – extra settings for new task object
Example:
self.process_next_page(grab, task, '//div[@class="topic"]/a/@href')
- setup_queue(backend='memory', **kwargs)[source]¶
Setup queue.
- Parameters
backend – Backend name. Should be one of the following: ‘memory’, ‘redis’ or ‘mongo’.
kwargs – Additional credentials for backend.
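Backend selection can be pictured as a name-to-factory lookup, with the extra kwargs forwarded to the chosen backend's constructor. The registry and queue class below are stand-ins illustrating the pattern, not grab's real queue backends:

```python
from collections import deque

class MemoryTaskQueue:
    # Minimal in-memory FIFO queue; grab's actual backends also
    # support task priorities and persistent storage.
    def __init__(self, **kwargs):
        self.queue = deque()

    def put(self, task):
        self.queue.append(task)

    def get(self):
        return self.queue.popleft()

    def size(self):
        return len(self.queue)

# Hypothetical registry mapping backend names to factories.
QUEUE_BACKENDS = {'memory': MemoryTaskQueue}

def setup_queue(backend='memory', **kwargs):
    """Instantiate the queue backend named by ``backend``."""
    try:
        return QUEUE_BACKENDS[backend](**kwargs)
    except KeyError:
        raise ValueError('Unknown queue backend: %s' % backend)
```

With a 'redis' or 'mongo' backend, the kwargs would carry connection credentials (host, port, database name) instead of being ignored.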
- shutdown()[source]¶
You can override this method to do some final actions after parsing has been done.
- stop()[source]¶
Set an internal flag that signals the spider to stop processing new tasks and shut down.