Skip to main content

Sumav is a fully automated labeling tool that assigns each file a family name based on AV labels.

Project description

Sumav

Sumav is a fully automated labeling tool that assigns each file a family name based on AV labels. Creates a relationship graph between AV label tokens through VirusTotal file feed (similar to ML training), and then infers family name from AV label for one input file (similar to ML serving). Image

Quickstart with docker-compose

We assume that you already have docker-compose installed.

NOTICE: We provide a sql file of pregenerated Suamv graph for a quick start, but for the best performance, you should update Sumav graph periodically.

$ git clone https://github.com/bestksw/sumav.git
$ cd sumav
$ zip -F sumav_paper_graph.zip --out full-archive.zip && unzip full-archive.zip && rm full-archive.zip
$ docker-compose up -d --no-build
$ docker exec -it sumav_sumav_1 sumav run select '["PUP/Win32.Dealply.C3316715", "Win32:DealPly-AJ [Adw]", "a variant of Win32/DealPly.RC potentially unwanted", null]'
dealply

Install

Prerequsite

  • Above Python 3.5 with Ubuntu 16.04
  • Above PostgreSQL 11 server and clients

Tested on PostgreSQL 11.1 and Python 3.8 with Ubuntu 20.04 Desktop

pip install

pip3 install sumav

How to use

You should build your own graph from VT filefeed/apiv2 before run Sumav.

Create graph

CMD

$ sumav
usage: sumav [-h] [-l {n,c,e,w,i,d}] [-v] [-s] {build,migrate,run} ...

positional arguments:
  {build,migrate,run}   command to run

optional arguments:
  -h, --help            show this help message and exit
  -l {n,c,e,w,i,d}, --logging_level {n,c,e,w,i,d}
                        n:none, c:critical, e:error, w:warn, i:info, d:debug
  -v, --version         show program's version number and exit
  -s, --show-config     show configs came from environment variables. (refer
                        to /home/bestksw/workspace_sumav/Sumav/sumav/conf.py)
...

$ sumav build
usage: sumav build [-h] [-p] {vt,none} ...

positional arguments:
  {vt,none}             preprocess data from
    vt                  virustotal
    none                skip preprocess

optional arguments:
  -h, --help            show this help message and exit
  -p, --preprocess-only

$ PSQL_HOST=172.31.10.10 PSQL_DB=sumav_exp sumav build vt /srv/vt_file_feed  # Takes one day per a month of data
...

$ PSQL_HOST=172.31.10.10 PSQL_DB=sumav_exp sumav run 
usage: sumav run [-h] {select,compare,similar} ...

positional arguments:
  {select,compare,similar}
                        query methods

optional arguments:
  -h, --help            show this help message and exit

$ sumav run select '["PUP/Win32.Dealply.C3316715", "Win32:DealPly-AJ [Adw]", "a variant of Win32/DealPly.RC potentially unwanted", null]'
dealply

API

$ PSQL_HOST=172.31.10.10 PSQL_DB=sumav_exp python3
>>> import sumav
>>> from sumav.conf import get_conf, psql_conf
>>> help(sumav)  # API documentation
>>> get_conf()   # Can see configuration
...
>>> # Convert VT file feed to Sumav DB
>>> from_vt = FromVT(**psql_conf)
>>> from_vt.convert('/srv/vt_file_feed')
...
>>> from_vt.close() # Close connection
>>> # Build graph
>>> builder = SumavGraphBuilder(**psql_conf)
>>> builder.build_graph(processes=30, skip_build_token_node=False)
...
>>> builder.close() # Close connection

Run Sumav

$ PSQL_HOST=172.31.10.10 PSQL_DB=sumav_exp python3
>>> import sumav
>>> from sumav.conf import get_conf, psql_conf
>>> get_conf()   # Can see configuration
...
>>> # If you want to designate dumped graph, execute below line
>>> # psql_conf['database'] = 'sumav_srv_200101-200601'
>>> searcher = sumav.SumavGraphSearcher(**psql_conf)
>>> # searcher.load_dumped_graph()  # Automatically load dumped graph came from remote
>>> dn = ['PUP/Win32.Dealply.C3316715', 'Win32:DealPly-AJ [Adw]',
          'a variant of Win32/DealPly.RC potentially unwanted',
          None, 'DealPly Updater (PUA)', None, None]
>>> searcher.get_representative_token(av_labels=dn)
dealply

License

Apache 2.0 + You must notify "Sumav is used" in your project even if only its outputs are used.

Paper

Kim, Sangwon, et al. "Sumav: Fully automated malware labeling." ICT Express (2022). Download pdf

Kim, Sangwon, et al. "Sumav: A Scheme for Selecting a Keyword from Detection Names using Conditional Probability." 2020 International Conference on Information and Communication Technology Convergence (ICTC). IEEE, 2020. Download pdf

Patent

KR102199704B1

US11,232,200B2

Authors

  • Sangwon Kim
  • Wookhyun Jung
  • Eltak Kim

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distributions

No source distribution files available for this release.See tutorial on generating distribution archives.

Built Distribution

sumav-1.0.3-py3-none-any.whl (3.2 MB view hashes)

Uploaded Python 3

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page