dcard-spider

A spider for Dcard through its newest API.

Project description

Get post and forum resources through the practical API used by the Dcard website.

Feature

Embraces asynchronous tasks and multithreading. All work is done in parallel or in a coroutine-like fashion, because a spider needs speed.
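To illustrate what fetching in parallel can look like, here is a minimal sketch built on the standard library's concurrent.futures (Python 3.2+). It is not dcard-spider's actual implementation: fetch_post and fetch_all are hypothetical names, and fetch_post only simulates a network request.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_post(post_id):
    # Hypothetical stand-in for a network request; a real spider
    # would call the HTTP API here.
    return {'id': post_id, 'title': 'post %d' % post_id}


def fetch_all(post_ids, max_workers=8):
    # executor.map runs fetch_post on worker threads and returns
    # the results in the same order as post_ids.
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        return list(executor.map(fetch_post, post_ids))


posts = fetch_all([101, 102, 103])
print([post['id'] for post in posts])
```

Threads suit this kind of workload because each task spends most of its time waiting on the network rather than computing.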

Installation

$ pip install dcard-spider

Dependencies

Python 2.6+, Python 3.3+

Example

from dcard import Dcard


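# The function name means "first filter for titles containing the works keyword".
# Non-ASCII identifiers require Python 3; use an ASCII name on Python 2.
def 先過濾出標題含有作品關鍵字(metas):
    return [meta['id'] for meta in metas if '#作品' in meta['title']]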


if __name__ == '__main__':

    dcard = Dcard()

    ids = dcard.forums('photography').get_metas(pages=3, callback=先過濾出標題含有作品關鍵字)
    posts = dcard.posts(ids).get(comments=False, links=False)

    resources = posts.parse_resources(constraints={'likeCount': '>=20'})

    status = posts.download(resources)
    print('Download succeeded!' if all(status) else 'Something went wrong; the download is incomplete.')

Usage

Basic

Get forum information (metadata)
- Use the no_school parameter to control whether school forums are included.

forums = dcard.forums.get()
forums = dcard.forums.get(no_school=True)

print(len(forums))

Get article information (metadata) for a forum; one page holds 30 articles
- Use pages to specify the number of pages
- Articles can be sorted in two ways: new / popular

ariticle_metas = dcard.forums('funny').get_metas(pages=5, sort='new')
ariticle_metas = dcard.forums('funny').get_metas(pages=1, sort='popular')

print(len(ariticle_metas))
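The relationship between pages and the number of metas returned can be sketched as follows. This is only an illustration under assumptions: collect_metas and fake_fetch_page are hypothetical names, and the fake fetcher stands in for real API requests.

```python
PAGE_SIZE = 30  # the text above states that one page holds 30 articles


def fake_fetch_page(forum, sort, page):
    # Hypothetical stand-in for one API request; returns PAGE_SIZE fake metas.
    start = page * PAGE_SIZE
    return [{'id': start + i, 'forum': forum, 'sort': sort}
            for i in range(PAGE_SIZE)]


def collect_metas(forum, pages=1, sort='new', fetch_page=fake_fetch_page):
    # Only the two sort orders named in the text are accepted.
    if sort not in ('new', 'popular'):
        raise ValueError("sort must be 'new' or 'popular'")
    metas = []
    for page in range(pages):
        metas.extend(fetch_page(forum, sort, page))
    return metas


print(len(collect_metas('funny', pages=5, sort='new')))  # 150
```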

Fetch the details of one or several articles in a single call (full text, quoted links, all comments)

# Pass an article ID or a single meta => returns a single article
article = dcard.posts(224341009).get()
article = dcard.posts(ariticle_metas[0]).get()

# Pass several article IDs or several metas => returns a list of articles
ids = [meta['id'] for meta in ariticle_metas]
articles = dcard.posts(ids).get()
articles = dcard.posts(ariticle_metas).get()

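Download resources found in articles (currently supports images from imgur links in the text)
Constraints can be added to keep only the articles that meet the conditions before parsing
Multiple constraints can be combined
By default, the images of each article are saved in a new folder named after the article title (#article ID)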

resources = posts.parse_resources(constraints={'likeCount': '>=100'})
resources = posts.parse_resources(constraints={'likeCount': '>=20', 'commentCount': '>10'})

status = posts.download(resources)
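One way constraint strings such as '>=20' could be interpreted is sketched below. This is an assumption about the idea, not the library's actual parser: parse_constraint and satisfies are hypothetical helpers, and the sample posts are made up.

```python
import operator
import re

# Map each comparison symbol to the matching function.
OPERATORS = {
    '>=': operator.ge, '<=': operator.le, '==': operator.eq,
    '>': operator.gt, '<': operator.lt,
}
# Two-character symbols are listed first so '>=' is not read as '>'.
PATTERN = re.compile(r'^\s*(>=|<=|==|>|<)\s*(\d+)\s*$')


def parse_constraint(expression):
    match = PATTERN.match(expression)
    if match is None:
        raise ValueError('unsupported constraint: %r' % expression)
    symbol, number = match.groups()
    return OPERATORS[symbol], int(number)


def satisfies(post, constraints):
    # A post passes only if every constraint holds.
    for field, expression in constraints.items():
        compare, threshold = parse_constraint(expression)
        if not compare(post[field], threshold):
            return False
    return True


posts = [
    {'id': 1, 'likeCount': 25, 'commentCount': 12},
    {'id': 2, 'likeCount': 25, 'commentCount': 10},
    {'id': 3, 'likeCount': 5, 'commentCount': 40},
]
constraints = {'likeCount': '>=20', 'commentCount': '>10'}
print([post['id'] for post in posts if satisfies(post, constraints)])  # [1]
```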

Advanced

A custom callback function can be supplied to process the data before the return value is received (filter / reduce data).

def collect_ids(metas):
    return [meta['id'] for meta in metas]


# The function name means "title contains the picture keyword".
def 標題含有圖片關鍵字(metas):
    return [meta['id'] for meta in metas if '#圖' in meta['title']]


ids = dcard.forums('funny').get_metas(pages=5, callback=collect_ids)
ids = dcard.forums('funny').get_metas(pages=5, callback=標題含有圖片關鍵字)

print(len(ids))
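The callback mechanism described above can be pictured with this small sketch. The get_metas function here is a hypothetical stand-in that takes already-fetched data; it is not the library's method and makes no requests.

```python
def get_metas(raw_metas, callback=None):
    # Hypothetical stand-in for a fetch method: raw_metas plays the
    # role of the data returned by the API.
    if callback is None:
        return raw_metas
    # The callback sees the data first; its result is what the caller gets.
    return callback(raw_metas)


def collect_ids(metas):
    return [meta['id'] for meta in metas]


def picture_ids(metas):
    return [meta['id'] for meta in metas if '#圖' in meta['title']]


sample = [{'id': 1, 'title': '#圖 sunset'}, {'id': 2, 'title': 'question'}]
print(get_metas(sample, callback=collect_ids))  # [1, 2]
print(get_metas(sample, callback=picture_ids))  # [1]
```

Running the filter inside the callback means the caller never has to hold the full metadata list when only the IDs are needed.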

When crawling articles, the three parameters content, links, and comments let you skip information you do not need, which speeds up the spider.

posts = dcard.posts(ids).get(comments=False, links=False)
print(len(posts))
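Why skipping parts saves time can be seen in a sketch where each part of an article costs one request. This is an assumption made for illustration: plan_requests is a hypothetical helper, and the real library may split its requests differently.

```python
def plan_requests(post_id, content=True, links=True, comments=True):
    # Hypothetical request planner: lists the parts that would be fetched.
    # The part names mirror the three parameters described above.
    wanted = [('content', content), ('links', links), ('comments', comments)]
    return [(post_id, part) for part, enabled in wanted if enabled]


print(plan_requests(224341009))
# [(224341009, 'content'), (224341009, 'links'), (224341009, 'comments')]
print(plan_requests(224341009, comments=False, links=False))
# [(224341009, 'content')]
```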

Release history

0.3.0 (May 26, 2017)
0.2.14a0, pre-release (Apr 17, 2017)
0.2.13 (Dec 9, 2016)
0.2.12 (Dec 9, 2016)
0.2.11 (Aug 7, 2016)
0.2.9 (Aug 1, 2016)
0.2.6 (Jul 26, 2016)
0.2.5 (Jul 26, 2016)
0.2.2 (May 6, 2017)
0.2.1 (Jul 25, 2016)
0.2.0 (Jul 25, 2016), this version
0.1.0 (Jul 20, 2016)

Download files

Source Distribution

dcard-spider-0.2.0.zip (15.6 kB)

Uploaded Jul 25, 2016 Source

Hashes for dcard-spider-0.2.0.zip
Algorithm	Hash digest
SHA256	`2a4668e5d106b50ee4940dfdd881bcda1ce53168961938e6cf7f8805ae045498`
MD5	`98904e92d5b7f1578972939c247cfe2c`
BLAKE2b-256	`40d1d995f6470ea1b30b67328c2a9d93048a41dda017f8a5d86147cb20276f69`
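A downloaded archive can be checked against the SHA256 digest listed above with Python's hashlib. The sketch below assumes the file was saved as dcard-spider-0.2.0.zip in the current directory; sha256_of and verify are hypothetical helper names.

```python
import hashlib
import os

# SHA256 digest copied from the table above.
EXPECTED_SHA256 = '2a4668e5d106b50ee4940dfdd881bcda1ce53168961938e6cf7f8805ae045498'


def sha256_of(path, chunk_size=65536):
    # Read the file in chunks so large archives do not need to fit in memory.
    digest = hashlib.sha256()
    with open(path, 'rb') as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b''):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, expected=EXPECTED_SHA256):
    return sha256_of(path) == expected


if __name__ == '__main__':
    archive = 'dcard-spider-0.2.0.zip'  # assumed local file name
    if os.path.exists(archive):
        print('OK' if verify(archive) else 'digest mismatch')
    else:
        print('archive not found: %s' % archive)
```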