
A Scrapy extension module that sends data to Elasticsearch in bulk format.


###scrapy-elasticsearch-extension

A Scrapy extension with the following functionality:

- bulk export of data to Elasticsearch
- deletion of outdated documents

###required modules

[pyes](http://pyes.readthedocs.org/en/latest/)


###installation

General information can be found in the [Scrapy Extensions installation guide](http://doc.scrapy.org/en/latest/topics/extensions.html).

Add the following entry to the **EXTENSIONS** setting in your Scrapy settings:

```
EXTENSIONS = {
    'scrapyes.Sender': 1000,
}
```

###configuration

The module can be configured per project in your Scrapy settings with the following options:

```
ELASTICSEARCH_SERVER = "localhost"
ELASTICSEARCH_PORT = 9200
ELASTICSEARCH_INDEX = "sixx"
ELASTICSEARCH_TYPE = "text"
ELASTICSEARCH_BULK_SIZE = 10
SCRAPYES_ENABLED = True
```
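
As a rough illustration of batching, the sketch below assumes that **ELASTICSEARCH_BULK_SIZE** is the number of items buffered before one bulk request is sent. The `BulkBuffer` class and its `send` callback are hypothetical names used only for this example; they are not taken from the extension's source.

```python
class BulkBuffer:
    """Hypothetical buffer: collects items and hands them to `send` in batches."""

    def __init__(self, send, bulk_size=10):
        self.send = send            # callable that receives a list of items
        self.bulk_size = bulk_size  # assumed meaning of ELASTICSEARCH_BULK_SIZE
        self.items = []

    def add(self, item):
        self.items.append(item)
        if len(self.items) >= self.bulk_size:
            self.flush()

    def flush(self):
        # Send whatever is buffered; call once more when the spider closes
        # so that a final partial batch is not lost.
        if self.items:
            self.send(self.items)
            self.items = []


sent_batches = []
buffer = BulkBuffer(sent_batches.append, bulk_size=10)
for n in range(25):
    buffer.add({"n": n})
buffer.flush()  # pushes the remaining 5 items
```

With 25 items and a bulk size of 10, `sent_batches` ends up holding three batches of 10, 10 and 5 items.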

### index configuration

The index used for Elasticsearch insertion can be configured per spider by [initializing an attribute on the spider](http://doc.scrapy.org/en/latest/topics/spiders.html#spider-arguments) named `index` and passing the desired value when the spider job is scheduled.
Example:
```
curl http://192.168.33.10:6800/schedule.json -d project=psd_search_crawler \
-d spider=sixx_spider \
-d index=my_index

```
If the index is not configured on the running spider, the value of the **ELASTICSEARCH_INDEX** setting will be used.
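
The fallback described above can be sketched in a few lines. This is only an illustration: `resolve_index` is a hypothetical helper, the settings are modelled as a plain dict, and `SixxSpider` is a stand-in class rather than a real `scrapy.Spider` subclass, so the snippet runs without Scrapy installed.

```python
class SixxSpider:
    """Stand-in for a scrapy.Spider subclass (assumption for this sketch)."""
    name = "sixx_spider"

    def __init__(self, index=None, **kwargs):
        # Spider arguments such as `-d index=my_index` arrive as keyword
        # arguments; storing one makes it an attribute on the spider.
        self.index = index


def resolve_index(spider, settings):
    """Hypothetical helper: the spider attribute wins, otherwise the setting."""
    return getattr(spider, "index", None) or settings["ELASTICSEARCH_INDEX"]


settings = {"ELASTICSEARCH_INDEX": "sixx"}
print(resolve_index(SixxSpider(index="my_index"), settings))  # my_index
print(resolve_index(SixxSpider(), settings))                  # sixx
```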

If the item declares an `id` field, it will be used to update the matching document in Elasticsearch.
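
To make the role of the `id` field concrete, here is a minimal sketch that builds a request body in the newline-delimited format of the Elasticsearch bulk API. It is an assumption that the extension, which relies on pyes, produces something equivalent; `bulk_body` is a hypothetical function, and `_type` only applies to older Elasticsearch versions that still support mapping types.

```python
import json


def bulk_body(items, index, doc_type):
    """Hypothetical helper: serialize items as bulk `index` actions."""
    lines = []
    for item in items:
        doc = dict(item)
        meta = {"_index": index, "_type": doc_type}
        if "id" in doc:
            # Reusing the item's id means a later run overwrites the same
            # document instead of creating a new one.
            meta["_id"] = doc["id"]
        lines.append(json.dumps({"index": meta}))  # action line
        lines.append(json.dumps(doc))              # document source line
    # The bulk API expects the body to end with a newline.
    return "\n".join(lines) + "\n"


body = bulk_body(
    [{"id": "a1", "title": "first"}, {"title": "no id"}],
    index="sixx",
    doc_type="text",
)
print(body)
```

Items without an `id` get no `_id` in the action line, so Elasticsearch generates one for them.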


### deleting outdated documents

If documents are indexed with the fields `spider_name` and `last_indexed`,
those indexed before the latest run of the spider
will be removed when the spider closes, provided the spider has
finished its task.
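
The cleanup rule can be expressed as an Elasticsearch query. The sketch below is an assumption about how such a query might look, not the extension's actual code: `stale_documents_query` and `should_purge` are hypothetical helpers, `last_indexed` is assumed to hold a timestamp comparable with the run start, and the check relies on Scrapy reporting the close reason `finished` when a spider completes normally.

```python
def stale_documents_query(spider_name, run_started):
    """Hypothetical helper: match this spider's documents from earlier runs."""
    return {
        "query": {
            "bool": {
                "must": [
                    {"term": {"spider_name": spider_name}},
                    {"range": {"last_indexed": {"lt": run_started}}},
                ]
            }
        }
    }


def should_purge(close_reason):
    # Only clean up after a complete crawl; an aborted run has not
    # refreshed every document, so deleting would drop valid data.
    return close_reason == "finished"


if should_purge("finished"):
    query = stale_documents_query("sixx_spider", "2015-01-01T00:00:00")
    print(query)
```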


Source Distribution

ScrapyEs-0.23.tar.gz (2.4 kB)
