piedomains

Predict the category of a domain based on its name and content.

Project description

The model was trained on the Shallalist dataset, using scraped homepages of the domains listed in it. The package predicts a domain's category from the domain name, its text content, and a screenshot of the domain.

Install

We strongly recommend installing piedomains inside a Python virtual environment (see the venv documentation).

pip install piedomains
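
To confirm the setup after installing, a minimal sketch like the following checks whether the interpreter is running inside a virtual environment and whether piedomains is installed. The helper names in_virtualenv and installed_version are illustrative and are not part of piedomains.

```python
import sys
from importlib import metadata


def in_virtualenv() -> bool:
    """Return True when the interpreter runs inside a venv-style environment."""
    # Inside a venv, sys.prefix points at the environment while
    # sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


def installed_version(package: str):
    """Return the installed version string of `package`, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    print("virtual environment active:", in_virtualenv())
    print("piedomains version:", installed_version("piedomains"))
```

If the second line prints None, the package is not visible to the interpreter that ran the script, which usually means pip installed it into a different environment.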

General API

domain.pred_shalla_cat_with_text(input)

What it does:

Predicts the category based on the domain name and text content.

Input

List of domains (optional; if omitted, html_path is required)

html_path: path where the HTML files are stored (optional; if omitted, the list of domains is required)

Option to use the latest model (optional)

Output

Returns a pandas DataFrame with the predicted label and probabilities.

from piedomains import domain
domains = [
    "forbes.com",
    "xvideos.com",
    "last.fm",
    "facebook.com",
    "bellesa.co",
    "marketwatch.com"
]
# with only domains
result = domain.pred_shalla_cat_with_text(domains)
# with html path where htmls are stored (offline mode)
result = domain.pred_shalla_cat_with_text(html_path="path/to/htmls")
# with domains and html path, html_path will be used to store htmls
result = domain.pred_shalla_cat_with_text(domains, html_path="path/to/htmls")
print(result)
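
The Examples output further down shows a text_domain_probs column holding one dictionary of per-category probabilities per domain. As a hedged sketch of how such a dictionary could be summarized, the helper below ranks categories by probability. The name top_categories and the probability values are hypothetical, and whether pred_shalla_cat_with_text returns the same column name as pred_shalla_cat is an assumption.

```python
def top_categories(probs, k=3):
    """Return the k (category, probability) pairs with the highest probability."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    return ranked[:k]


# Hypothetical probabilities for one domain; the real dictionaries
# cover every category the model knows about.
example_probs = {
    "adv": 0.01,
    "aggressive": 0.02,
    "news": 0.57,
    "finance": 0.30,
    "shopping": 0.10,
}

print(top_categories(example_probs, k=2))
# [('news', 0.57), ('finance', 0.3)]
```

On a result DataFrame, the same helper could be applied per row with result["text_domain_probs"].apply(top_categories), assuming that column is present.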

domain.pred_shalla_cat_with_images(input)

What it does:

Predicts the category based on the domain name and a screenshot of the domain.

Input

List of domains (optional; if omitted, image_path is required)

image_path: path where the images are stored (optional; if omitted, the list of domains is required)

Option to use the latest model (optional)

Output

Returns a pandas DataFrame with the predicted label and probabilities.

from piedomains import domain
domains = [
    "forbes.com",
    "xvideos.com",
    "last.fm",
    "facebook.com",
    "bellesa.co",
    "marketwatch.com"
]
# with only domains
result = domain.pred_shalla_cat_with_images(domains)
# with image path where images are stored (offline mode)
result = domain.pred_shalla_cat_with_images(image_path="path/to/images")
# with domains and image path, image_path will be used to store images
result = domain.pred_shalla_cat_with_images(domains, image_path="path/to/images")
print(result)
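
Screenshot-based probabilities can be low, as the img_label_prob values in the Examples output show. The sketch below flags domains whose probability falls under a cutoff so they can be reviewed by hand. It works on plain dictionaries, such as those produced by result.to_dict("records"). The helper name flag_low_confidence and the 0.5 cutoff are assumptions for illustration, not part of piedomains, and the sample values are copied from the Examples output.

```python
def flag_low_confidence(rows, prob_key, threshold=0.5):
    """Return the names of rows whose probability under prob_key is below threshold."""
    return [row["name"] for row in rows if row[prob_key] < threshold]


# Values copied from the Examples output of this document.
rows = [
    {"name": "forbes.com", "img_pred_label": "recreation", "img_label_prob": 0.911997},
    {"name": "last.fm", "img_pred_label": "shopping", "img_label_prob": 0.416521},
    {"name": "marketwatch.com", "img_pred_label": "recreation", "img_label_prob": 0.366329},
]

print(flag_low_confidence(rows, "img_label_prob"))
# ['last.fm', 'marketwatch.com']
```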

domain.pred_shalla_cat(input)

What it does:

Predicts the category based on the domain name, text content, and a screenshot of the domain.

Input

List of domains (optional; if omitted, html_path and image_path are required)

html_path: path where the HTML files are stored (optional; if omitted, the list of domains is required)

image_path: path where the images are stored (optional; if omitted, the list of domains is required)

Option to use the latest model (optional)

Output

Returns a pandas DataFrame with the predicted label and probabilities.

from piedomains import domain
domains = [
    "forbes.com",
    "xvideos.com",
    "last.fm",
    "facebook.com",
    "bellesa.co",
    "marketwatch.com"
]
# with only domains
result = domain.pred_shalla_cat(domains)
# with html path where htmls are stored (offline mode)
result = domain.pred_shalla_cat(html_path="path/to/htmls")
# with image path where images are stored (offline mode)
result = domain.pred_shalla_cat(image_path="path/to/images")
print(result)
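
Because pred_shalla_cat reports a text-based and an image-based prediction side by side, the two can disagree. A minimal sketch for listing such domains follows, using the text_pred_label and img_pred_label columns shown in the Examples output. The helper name find_disagreements is hypothetical, and the rows are plain dictionaries of the kind result.to_dict("records") would produce.

```python
def find_disagreements(rows):
    """Return the names of rows where the text and image labels differ."""
    return [
        row["name"]
        for row in rows
        if row["text_pred_label"] != row["img_pred_label"]
    ]


# Labels copied from the Examples output of this document.
rows = [
    {"name": "forbes.com", "text_pred_label": "news", "img_pred_label": "recreation"},
    {"name": "xvideos.com", "text_pred_label": "porn", "img_pred_label": "porn"},
    {"name": "marketwatch.com", "text_pred_label": "finance", "img_pred_label": "recreation"},
]

print(find_disagreements(rows))
# ['forbes.com', 'marketwatch.com']
```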

Examples

from piedomains import domain
domains = [
    "forbes.com",
    "xvideos.com",
    "last.fm",
    "facebook.com",
    "bellesa.co",
    "marketwatch.com"
]
result = domain.pred_shalla_cat(domains)
print(result)

Output:

                name text_pred_label  text_label_prob img_pred_label  \
0       forbes.com            news         0.575000     recreation
1      xvideos.com            porn         0.897716           porn
2          last.fm           music         0.229545       shopping
3     facebook.com      recreation         0.200815           porn
4       bellesa.co            porn         0.962932       shopping
5  marketwatch.com         finance         0.790576     recreation

  img_label_prob  used_domain_content  used_domain_screenshot  \
0        0.911997                 True                    True
1        0.755726                 True                    True
2        0.416521                 True                    True
3        0.274597                 True                    True
4        0.374870                 True                    True
5        0.366329                 True                    True

                                  text_domain_probs  \
0  {'adv': 0.010590500641848523, 'aggressive': 0....
1  {'adv': 0.002181818181818182, 'aggressive': 9....
2  {'adv': 0.002181818181818182, 'aggressive': 0....
3  {'adv': 0.006381039197812215, 'aggressive': 0....
4  {'adv': 0.00021545223423966907, 'aggressive': ...
5  {'adv': 0.0007271669575334497, 'aggressive': 9...

                                    img_domain_probs
0  {'adv': 9.541013423586264e-05, 'aggressive': 1...
1  {'adv': 0.00041423083166591823, 'aggressive': ...
2  {'adv': 0.008832501247525215, 'aggressive': 0....
3  {'adv': 0.027437569573521614, 'aggressive': 0....
4  {'adv': 0.0008953566430136561, 'aggressive': 3...
5  {'adv': 0.007870808243751526, 'aggressive': 0....
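
The output keeps the text and image predictions separate. One simple way to reduce them to a single label, sketched below, is to take the prediction with the higher probability. This rule is an assumption made for illustration: it is not something piedomains does, and the probabilities of the two models may not be directly comparable. The helper name pick_label is hypothetical.

```python
def pick_label(row):
    """Return (label, probability, source) from the more confident prediction."""
    if row["text_label_prob"] >= row["img_label_prob"]:
        return row["text_pred_label"], row["text_label_prob"], "text"
    return row["img_pred_label"], row["img_label_prob"], "image"


# Values copied from the output above.
rows = [
    {
        "name": "forbes.com",
        "text_pred_label": "news",
        "text_label_prob": 0.575000,
        "img_pred_label": "recreation",
        "img_label_prob": 0.911997,
    },
    {
        "name": "marketwatch.com",
        "text_pred_label": "finance",
        "text_label_prob": 0.790576,
        "img_pred_label": "recreation",
        "img_label_prob": 0.366329,
    },
]

for row in rows:
    print(row["name"], pick_label(row))
# forbes.com ('recreation', 0.911997, 'image')
# marketwatch.com ('finance', 0.790576, 'text')
```

For forbes.com this rule picks the image label over the text label, which shows why the combined result should be checked rather than trusted blindly.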

Authors

Rajashekar Chintalapati and Gaurav Sood

Contributor Code of Conduct

The project welcomes contributions from everyone! In fact, it depends on it. To maintain this welcoming atmosphere, and to collaborate in a fun and productive way, we expect contributors to the project to abide by the Contributor Code of Conduct.

License

The package is released under the MIT License.

Release history

0.0.19 (Apr 28, 2023)
0.0.18 (Apr 20, 2023)
0.0.17 (Apr 17, 2023)
0.0.16 (Apr 14, 2023)
0.0.15 (Apr 14, 2023)
0.0.14 (Apr 13, 2023)
0.0.13 (Apr 13, 2023)
0.0.12 (Apr 13, 2023)
0.0.11 (Feb 5, 2023)
0.0.10 (Feb 4, 2023), this version
0.0.9 (Feb 4, 2023)
0.0.8 (Jan 29, 2023)
0.0.7 (Jan 29, 2023)
0.0.6 (Jan 28, 2023)
0.0.5 (Jan 28, 2023)
0.0.4 (Oct 28, 2022)
0.0.3 (Oct 28, 2022)
0.0.2 (May 4, 2022)
0.0.1 (May 3, 2022)

Download files

Source Distribution

piedomains-0.0.10.tar.gz (2.9 MB), uploaded Feb 4, 2023

Built Distribution

piedomains-0.0.10-py2.py3-none-any.whl (3.0 MB), uploaded Feb 4, 2023, Python 2 and Python 3

Hashes for piedomains-0.0.10.tar.gz

Algorithm	Hash digest
SHA256	`ba8ac22c293466dc5e2f7ea3e47bf34817adc1ae847e71a5c79b1ee4cfeb7385`
MD5	`29decbce029ac6c8fdaff72660791017`
BLAKE2b-256	`d11945113c817d1dc5af53c2665060604fa6789b8aafb4304684b4abc3e172a1`

Hashes for piedomains-0.0.10-py2.py3-none-any.whl

Algorithm	Hash digest
SHA256	`1a33508194c6c6eeb7b301e3e396212def980b878467c3d1ef08fc1687001978`
MD5	`392a6890b828570b72db5d09233af581`
BLAKE2b-256	`ab7cc04a0fc60412d3d1849f6b5641211731871abdcdead20c86d5c7b896d78b`