A Python library manipulate html
Project description
ReadMe
Installation
$ pip3 install nvhtml
refer to Installation
Documentation
refer to https://nvhtml.readthedocs.io
License
MIT
Quickstart
refer to Usage
Features
search html via tags-pathlist
search html via breadth,depth
search html via tag
width first traverse html
description matrix of html
depth first traverse html (sax)
beautify html
- get html node via relationship:
ancestor,descendants,parent,grand-parent,children,
siblings,lsib,rcib,preceding-sib,following-sib,
lcin,rcin
References
elist
mlconjug(for docs template format)
Installation
From pip
To install nvhtml, run this command in your terminal:
$ pip3 install nvhtml
From sources
The sources for nvhtml can be downloaded from the github repo:
$ git clone git@github.com:ihgazni2/nvhtml.git
Once you have a copy of the source, you can install it with:
$ python setup.py install
Usage
Command Line
rshtml write rule
- # pound:
comment
- - dash:
attrib
- . dot:
text and tail
- | vbar:
text and tail content can multiline
rshtml
NVHTML-BENCH# cat tstrs.rshtml html head meta -http-equiv X-UA-Compatible -content IE=edge,chrome=1 meta -name viewport -content user-scalable=yes, initial-scale=1.0, minimum-scale=1.0, maximum-scale=3.0 link body div -id menu-item-27961 -class qtranxs-lang-menu-item menu-item-object-custom .text |hello |hihihi .tail |this is a tail li div li div #comment .text |this is acomment NVHTML-BENCH# NVHTML-BENCH# nvhtml_rsh2html -input tstrs.rshtml NVHTML-BENCH# ls -l | egrep tstrs.rshtml -rw-r--r-- 1 root root 597 Aug 19 11:36 tstrs.rshtml -rw-r--r-- 1 root root 658 Aug 19 11:56 tstrs.rshtml.html NVHTML-BENCH# cat tstrs.rshtml.html <html> <head> <meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1"> </meta> <meta name="viewport" content="user-scalable=yes, initial-scale=1.0, minimum-scale=1.0, maximum-scale=3.0"> </meta> <link> </link> </head> <body> <div id="menu-item-27961" class="qtranxs-lang-menu-item menu-item-object-custom"> hello hihihi <li> </li> </div> this is a tail <div> <li> </li> </div> <div> </div> </body> <!-- this is acomment --> </html> NVHTML-BENCH# vice versa NVHTML-BENCH# nvhtml_html2rsh -input tstrs.html
vertical show
with color
NVHTML-BENCH# nvhtml_struct_show -input disp.html color: True -------------------------- | html | -------------------------- | head | body | -------------------------- | | div | div | -------------------------- | | | div | div | --------------------------
no color
#if two big, disable color , and open it in editor such as notepad ++ NVHTML-BENCH# nvhtml_struct_show -input opis.html -color "no" > html_txt_tb
nvrsh_struct_show
TEST# nvrsh_struct_show -input jobj2.rshtml color: True ------------------------------- | html | ------------------------------- | body | ------------------------------- | im-dict-root | ------------------------------- | l | t | s | ------------------------------- | o0 | o1 | o0 | o1 | u0 | u1 | ------------------------------- | a0 | a1 | t0 | t1 | s0 | s1 | ------------------------------- NVHTML# cat TEST/jobj2.rshtml im-dict-root l o0 a0 o1 a1 t o0 t0 o1 t1 s u0 s0 u1 s1 NVHTML#
beautify
nvhtml_beauty -input opis.html vim opis.html.out.html # nvhtml_beauty -h usage: nvhtml_beauty [-h] [-input INPUT_HTML_FILE] [-output OUTPUT_HTML_FILE] [-codec INPUT_CODEC] optional arguments: -h, --help show this help message and exit -input --input_html_file input html file name -output --output_html_file output html file name -codec --input_codec input html file codec
search with loc
NVHTML-BENCH# nvhtml_loc -h usage: nvhtml_loc [-h] [-input INPUT_HTML_FILE] [-codec INPUT_CODEC] [-loc MAT_LOC] optional arguments: -h, --help show this help message and exit -input --input_html_file input html file name -codec --input_codec input html file codec -loc --mat_loc depth,breadth NVHTML-BENCH# nvhtml_loc -input opis.html -loc 11,2 <------json----- { 'pl': [ 'html', 'body', 'div', 'div', 'header', 'div', 'nav', 'div', 'ul', 'li', 'ul', 'li' ], 'breadth': 2, 'depth': 11, 'pbreadth': 2, 'samepl_sibseq': 0, 'samepl_breadth': 0, 'tag': 'li', 'sibseq': 0, 'attrib': { 'id': 'menu-item-22951', 'class': 'menu-item menu-item-type-post_type menu-item-object-page menu-item-22951' }, 'text': None, 'tail': '\n\t', 'text_intag': '' } ------json----> <----------html-------------- <li id="menu-item-22951" class="menu-item menu-item-type-post_type menu-item-object-page menu-item-22951"> <a href="https://opistobranquis.info/en/home/presentacio/"> <span> Presentation </span> </a> </li> <----------html--------------
tag search with depth
NVHTML-BENCH# nvhtml_tag -h usage: nvhtml_tag [-h] [-input INPUT_HTML_FILE] [-codec INPUT_CODEC] [-tag HTML_TAG] [-which WHICH_TAG] [-sdepth START_LEVEL_DEPTH] [-edepth END_LEVEL_DEPTH] optional arguments: -h, --help show this help message and exit -input --input_html_file input html file name -codec --input_codec input html file codec -tag --html_tag html tag for search -which --which_tag sequence of tag-array -sdepth --start_level_depth start level depth -edepth --end_level_depth end level depth NVHTML-BENCH# NVHTML-BENCH# nvhtml_tag -input opis.html.out.html -tag img -sdepth 3 -edepth 6 <img alt="Twitter" src="https://opistobranquis.info/wp-content/themes/tempera/images/socials/Twitter.png"> </img> <img alt="Facebook" src="https://opistobranquis.info/wp-content/themes/tempera/images/socials/Facebook.png"> </img> <img alt="Twitter" src="https://opistobranquis.info/wp-content/themes/tempera/images/socials/Twitter.png"> </img> <img alt="Facebook" src="https://opistobranquis.info/wp-content/themes/tempera/images/socials/Facebook.png"> </img> <img style="float: right; display: none" class="loading" src="https://opistobranquis.info/wp-content/plugins/jetpack/modules/sharedaddy/images/loading.gif" alt="loading" width="16" height="16"> </img> NVHTML-BENCH#
html to db
NVHTML-BENCH# nvhtml_sqlite -input opis.html db: ./opis.html.sqlite.db table: tb_html NVHTML-BENCH# NVHTML-BENCH# sqlite3 opis.html.sqlite.db SQLite version 3.22.0 2018-01-22 18:45:57 Enter ".help" for usage hints. sqlite> sqlite> .table tb_html sqlite> sqlite> .schema tb_html CREATE TABLE IF NOT EXISTS "tb_html" ( "index" INTEGER, "_pl" TEXT, "_breadth" TEXT, "_depth" TEXT, "_pbreadth" TEXT, "_samepl_sibseq" TEXT, "_samepl_breadth" TEXT, "_tag" TEXT, "_sibseq" TEXT, "_text" TEXT, "_tail" TEXT, "class" TEXT, "href" TEXT, "id" TEXT, "style" TEXT, "type" TEXT, "src" TEXT, "rel" TEXT, "target" TEXT, "title" TEXT, "content" TEXT, "alt" TEXT, "media" TEXT, "name" TEXT, "align" TEXT, "property" TEXT, "role" TEXT, "value" TEXT, "data-shared" TEXT, "hreflang" TEXT, "for" TEXT, "aria-current" TEXT, "colspan" TEXT, "method" TEXT, "action" TEXT, "sizes" TEXT, "placeholder" TEXT, "height" TEXT, "width" TEXT, "http-equiv" TEXT, "autocomplete" TEXT, "data-layout" TEXT, "data-orig-file" TEXT, "data-href" TEXT, "lang" TEXT, "data-image-title" TEXT, "data-recalc-dims" TEXT, "data-attachment-id" TEXT, "data-text" TEXT, "data-flxmap" TEXT, "size" TEXT, "data-image-description" TEXT, "data-medium-file" TEXT, "async" TEXT, "language" TEXT, "srcset" TEXT, "data-comments-opened" TEXT, "data-large-file" TEXT, "data-via" TEXT, "defer" TEXT, "data-permalink" TEXT, "data-image-meta" TEXT, "data-noptimize" TEXT, "data-url" TEXT, "data-orig-size" TEXT ); CREATE INDEX "ix_tb_html_index"ON "tb_html" ("index"); sqlite> sqlite> sqlite> SELECT src FROM tb_html WHERE _tag=="img" AND src like "%acebook.png"; https://opistobranquis.info/wp-content/themes/tempera/images/socials/Facebook.png https://opistobranquis.info/wp-content/themes/tempera/images/socials/Facebook.png https://opistobranquis.info/wp-content/themes/tempera/images/socials/Facebook.png https://opistobranquis.info/wp-content/themes/tempera/images/socials/Facebook.png sqlite> sqlite> sqlite> sqlite> sqlite> SELECT href FROM tb_html WHERE _tag=="link" AND href like "%.com"; //s0.wp.com //c0.wp.com //i0.wp.com //i1.wp.com //i2.wp.com sqlite>
html to dir
NVHTML-BENCH# mkdir TMP NVHTML-BENCH# nvhtml_dir -input opis.html -wkdir TMP NVHTML-BENCH# tree -fdL 4 TMP | head TMP └── TMP/html.0 ├── TMP/html.0/body.1 │ ├── TMP/html.0/body.1/<comment>.91 │ ├── TMP/html.0/body.1/div.90 │ │ ├── TMP/html.0/body.1/div.90/<comment>.4 │ │ ├── TMP/html.0/body.1/div.90/<comment>.7 │ │ ├── TMP/html.0/body.1/div.90/div.0 │ │ ├── TMP/html.0/body.1/div.90/div.1 │ │ ├── TMP/html.0/body.1/div.90/div.2 NVHTML-BENCH# NVHTML-BENCH# tree -fdL 4 TMP | tail ├── TMP/html.0/head.0/style.45 ├── TMP/html.0/head.0/style.55 ├── TMP/html.0/head.0/style.56 ├── TMP/html.0/head.0/style.57 ├── TMP/html.0/head.0/style.58 ├── TMP/html.0/head.0/style.78 ├── TMP/html.0/head.0/style.79 └── TMP/html.0/head.0/title.7 138 directories NVHTML-BENCH# ls -l TMP/html.0/body.1/div.90/div.2 total 36 drwxr-xr-x 3 root root 4096 Aug 11 02:49 a.3 drwxr-xr-x 3 root root 4096 Aug 11 02:49 a.4 -rw-r--r-- 1 root root 7 Aug 11 02:49 attrib.class -rw-r--r-- 1 root root 7 Aug 11 02:49 attrib.id -rw-r--r-- 1 root root 538 Aug 11 02:49 outter_html -rw-r--r-- 1 root root 3 Aug 11 02:49 tag -rw-r--r-- 1 root root 1 Aug 11 02:49 tail -rw-r--r-- 1 root root 4 Aug 11 02:49 text -rw-r--r-- 1 root root 8 Aug 11 02:49 text_intag NVHTML-BENCH# more TMP/html.0/body.1/div.90/div.2/attrib.id srights NVHTML-BENCH# more TMP/html.0/body.1/div.90/div.2/attrib.class socials NVHTML-BENCH# more TMP/html.0/body.1/div.90/div.2/outter_html <div class="socials" id="srights"> <a target="_blank" href="https://twitter.com/InfoOpk" class="socialicons social -Twitter external" title="Twitter"> <img alt="Twitter" src="https://opistobranquis.info/wp-content/themes/t empera/images/socials/Twitter.png"/> </a> <a target="_blank" href="https://www.facebook.com/OPK.Opistobranquis/" class="s ocialicons social-Facebook external" title="Facebook"> <img alt="Facebook" src="https://opistobranquis.info/wp-content/themes/ tempera/images/socials/Facebook.png"/> </a></div> NVHTML-BENCH# NVHTML-BENCH# ls -al TMP/html.0/body.1/div.90/div.2 | egrep " \.[a-z]" -rw-r--r-- 1 root root 1 Aug 11 02:49 .breadth -rw-r--r-- 1 root root 1 Aug 11 02:49 .depth -rw-r--r-- 1 root root 27 Aug 11 02:49 .mkdir_pth -rw-r--r-- 1 root root 2 Aug 11 02:49 .pbreadth -rw-r--r-- 1 root root 18 Aug 11 02:49 .pl -rw-r--r-- 1 root root 1 Aug 11 02:49 .samepl_breadth -rw-r--r-- 1 root root 1 Aug 11 02:49 .samepl_sibseq -rw-r--r-- 1 root root 1 Aug 11 02:49 .sibseq NVHTML-BENCH# NVHTML-BENCH# NVHTML-BENCH# NVHTML-BENCH# more TMP/html.0/body.1/div.90/div.2/.breadth 2 NVHTML-BENCH# more TMP/html.0/body.1/div.90/div.2/.depth 3 NVHTML-BENCH# more TMP/html.0/body.1/div.90/div.2/.pbreadth 90 NVHTML-BENCH# more TMP/html.0/body.1/div.90/div.2/.pl /html/body/div/div NVHTML-BENCH# NVHTML-BENCH# more TMP/html.0/body.1/div.90/div.2/.samepl_breadth 2 NVHTML-BENCH# more TMP/html.0/body.1/div.90/div.2/.samepl_sibseq 2 NVHTML-BENCH# more TMP/html.0/body.1/div.90/div.2/.sibseq 2 NVHTML-BENCH#
find all
NVHTML-BENCH# nvhtml_find_all -input opis.html -attrib "http-equiv" [ 'X-UA-Compatible', 'Content-Type' ] NVHTML-BENCH# NVHTML-BENCH# nvhtml_find_all -input opis.html -attrib "href" | egrep "jorunna-e" 'https://opistobranquis.info/en/guia/nudibranchia/doridina/doridoidei/doridoidea/jorunna-efe/', 'https://opistobranquis.info/en/guia/nudibranchia/doridina/doridoidei/doridoidea/jorunna-evansi/', NVHTML-BENCH# NVHTML-BENCH# nvhtml_find_all -input opis.html common attribs: [ '_pl', '_breadth', '_depth', '_pbreadth', '_samepl_sibseq', '_samepl_breadth', '_tag', '_sibseq', '_text', '_tail' ] attrib_names:frequency { 'class': 947, 'href': 810, 'id': 181, 'style': 80, 'type': 78, 'src': 55, 'rel': 49, 'target': 41, 'title': 36, 'content': 23, 'alt': 19, 'media': 17, 'name': 15, 'align': 13, 'property': 12, 'role': 9, 'value': 7, 'hreflang': 4, 'data-shared': 4, 'colspan': 3, 'for': 3, 'aria-current': 3, 'sizes': 3, 'action': 3, 'method': 3, 'placeholder': 2, 'width': 2, 'http-equiv': 2, 'height': 2, 'data-permalink': 1, 'data-recalc-dims': 1, 'srcset': 1, 'size': 1, 'data-layout': 1, 'data-orig-size': 1, 'language': 1, 'data-medium-file': 1, 'data-href': 1, 'data-image-description': 1, 'data-image-title': 1, 'data-orig-file': 1, 'defer': 1, 'data-flxmap': 1, 'data-noptimize': 1, 'data-image-meta': 1, 'lang': 1, 'data-url': 1, 'data-large-file': 1, 'autocomplete': 1, 'data-via': 1, 'async': 1, 'data-comments-opened': 1, 'data-attachment-id': 1, 'data-text': 1 } NVHTML-BENCH#
nvhtml_wfs_udlrpls
NVHTML-BENCH#nvhtml_wfs_udlrpls -input xxx.html
nvhtml_wfs_dulrpls
NVHTML-BENCH#nvhtml_wfs_dulrpls -input xxx.html
Examples
relation get
html_str = fs.rfile("./test.html") root = LXHTML(html_str) ele = engine.xpath(root,"//div",5) engine.parent(ele) engine.grand_parent(ele) engine.ancestors(ele) engine.parent(ele) engine.grand_parent(ele) engine.ancestors(ele) engine.lsib(ele) engine.rsib(ele) engine.lcin(ele) engine.rcin(ele) engine.siblings(ele) engine.descendants(ele,5,6) engine.layer(ele) engine.breadth(ele) engine.depth(ele) engine.pathlist(ele)
description matrix
html_str = fs.rfile("./test.html") root = LXHTML(html_str) wfs = engine.WFS(root) pobj(wfs.mat[3][1])
width-first-traverse
html_str = fs.rfile("./test.html") root = LXHTML(html_str) pls = engine.wfspls(root) utils.parr(pls[:10])
depth-first-traverse
import lxml.sax html_str = fs.rfile("./test.html") root = LXHTML(html_str) dfs = engine.DFS() lxml.sax.saxify(root, dfs) utils.parr(dfs.pls[:5]) utils.parr(dfs.pls[-10:])
beautify
html_str = fs.rfile("./test.html") root = LXHTML(html_str) html_str = engine.beautify(root) print(html_str[:480])
lvsrch
[
'a',
'abbr',
'acronym',
'address',
'applet',
'area',
'arguments',
'article',
'aside',
'audio',
'b',
'base',
'basefont',
'bdi',
'bdo',
'big',
'blockquote',
'body',
'br',
'button',
'canvas',
'caption',
'center',
'cite',
'code',
'col',
'colgroup',
'command',
'datalist',
'dd',
'del_',
'details',
'dfn',
'dialog',
'dir',
'div',
'dl',
'dt',
'elel',
'em',
'embed',
'engine',
'fieldset',
'figcaption',
'figure',
'font',
'footer',
'form',
'frame',
'frameset',
'h1',
'h2',
'h3',
'h4',
'h5',
'h6',
'head',
'header',
'hr',
'html',
'i',
'iframe',
'img',
'input',
'ins',
'isindex',
'kbd',
'keygen',
'label',
'legend',
'li',
'link',
'map',
'mark',
'menu',
'menuitem',
'meta',
'meter',
'nav',
'noframes',
'noscript',
'object',
'ol',
'optgroup',
'option',
'output',
'p',
'param',
'pre',
'progress',
'q',
'rp',
'rt',
'ruby',
's',
'samp',
'script',
'section',
'select',
'small',
'source',
'span',
'srch',
'strike',
'strong',
'style',
'sub',
'summary',
'sup',
'table',
'tbody',
'td',
'textarea',
'tfoot',
'th',
'thead',
'time',
'title',
'tr',
'track',
'tt',
'u',
'ul',
'utils',
'var',
'video',
'wbr',
'xmp'
]
engine
[
'BEAUTIFY',
'ContentHandler',
'DFS',
'WFS',
'ancestor',
'ancestors',
'beautify',
'between_levels_cond_func',
'breadth',
'child',
'children',
'copy',
'default_wfs_handler',
'depth',
'descendants',
'descendants_pls',
'dfs_traverse',
'dfspls',
'disconnect',
'elel',
'extract_pls',
'following_sibs',
'grand_parent',
'html',
'init_cls_wfs_arguments',
'is_leaf',
'layer',
'layer_wfs_handler',
'lcin',
'leaf_descendants',
'leaf_descendants_pls',
'loc',
'loc2node',
'lsib',
'lxe',
'lxml',
'nonleaf_descendants',
'nonleaf_descendants_pls',
'parent',
'pathlist',
'plget',
'preceding_sibs',
'rcin',
're',
'rootnode',
'rsib',
'samepl_breadth',
'samepl_siblings',
'samepl_sibseq',
'siblings',
'sibseq',
'source',
'text_intag',
'txtize',
'utils',
'wfs_traverse',
'wfspls',
'xpath',
'xpath_levels'
]
History
0.0.1 (2018-12-18)
first commit