MAG Archiver

MAG Archiver is an Azure Function App that automatically archives Microsoft Academic Graph (MAG) releases so that they can be transferred to other cloud services.

Status

This is a proof of concept; the functionality for archiving and compressing each MAG release has not been implemented yet.

Setup

The following instructions explain how to set up MAG Archiver.

Dependencies

  • Install Azure CLI
  • Install Azure Functions Core Tools
  • Create an Azure Storage Account
    • Region: choose an Azure region that is close to the other cloud provider that you want to transfer the data to.
    • Access tier: hot (so that containers can be deleted without the early deletion fees that apply to cooler tiers)
    • Create container: mag-snapshots
    • Under Blob Service > Lifecycle Management > Code view: add the life-cycle rules from lifecycle-rules.json
      • These rules move blobs to the cool tier after 30 days and delete them after 61 days.
  • Create an Azure Function App
    • Take note of your function app name; you will need it later.
    • Under Settings > Configuration > Application settings, add the following Application settings (name: value):
      • STORAGE_ACCOUNT_NAME: the name of your storage account.
      • STORAGE_ACCOUNT_KEY: the key for your storage account.
      • TARGET_CONTAINER: mag-snapshots
  • Subscribe to Microsoft Academic Graph on Azure storage
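
The life-cycle rules described above can be sketched as a policy document. This is a minimal sketch and not the contents of lifecycle-rules.json: the rule name, the prefix filter and the use of the `tierToCool` action are assumptions, so compare the output with the file in the repository before pasting it into the Code view.

```python
import json


def build_lifecycle_policy(prefix="mag-snapshots/", tier_after_days=30, delete_after_days=61):
    """Build a Blob Storage lifecycle management policy as a plain dict.

    The rule name and the prefix filter are illustrative assumptions.
    """
    return {
        "rules": [
            {
                "enabled": True,
                "name": "mag-snapshots-expiry",  # hypothetical rule name
                "type": "Lifecycle",
                "definition": {
                    "actions": {
                        "baseBlob": {
                            # Move blobs to a cheaper tier once they are old enough.
                            "tierToCool": {"daysAfterModificationGreaterThan": tier_after_days},
                            # Delete blobs after the retention period.
                            "delete": {"daysAfterModificationGreaterThan": delete_after_days},
                        }
                    },
                    # Only block blobs under the assumed prefix are affected.
                    "filters": {"blobTypes": ["blockBlob"], "prefixMatch": [prefix]},
                },
            }
        ]
    }


if __name__ == "__main__":
    print(json.dumps(build_lifecycle_policy(), indent=2))
```

Generating the policy from the two day counts keeps the 30 and 61 day thresholds in one place if they ever need to change.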

Deploy to Azure

To deploy mag-archiver, follow the instructions below.

Set up Azure account

Make sure that the Azure account that your Function App is deployed to is set as the default.

To do this, list your accounts and copy the id of the account that should be the default account:

az account list

Set the account to the Azure account that your Function App is deployed to:

az account set -s <insert your account id here>

Check that the correct account is set; your account should appear in the output:

az account show

Deploy the Function App

Clone the project:

git clone git@github.com:The-Academic-Observatory/mag-archiver.git

Enter the function app folder:

cd mag-archiver

Deploy the function:

func azure functionapp publish <your function app name> --python

Architecture

The architecture of MAG Archiver is illustrated via the deployment and process view diagrams below.

Process View

The MAG subscription adds each new MAG release as a new Azure Blob storage container in the user's Azure Storage account.

An Azure Function App runs every 10 minutes and checks to see if any new MAG release containers have been added.
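
The check for new release containers could look roughly like the sketch below. It assumes that release containers are named like `mag-2020-05-21` and that the names of already-seen releases are available as a list; both the naming pattern and the function name are illustrative, not taken from the project. The schedule string is the NCRONTAB form of "every 10 minutes" for a timer trigger.

```python
import re

# NCRONTAB schedule for a timer trigger: second 0 of every 10th minute.
SCHEDULE = "0 */10 * * * *"

# Assumed naming scheme for MAG release containers, e.g. "mag-2020-05-21".
RELEASE_PATTERN = re.compile(r"^mag-\d{4}-\d{2}-\d{2}$")


def find_new_releases(container_names, known_releases):
    """Return release containers that have not been seen before, oldest first."""
    candidates = {name for name in container_names if RELEASE_PATTERN.match(name)}
    # ISO-style dates sort chronologically when sorted as strings.
    return sorted(candidates - set(known_releases))
```

Because the function only compares names, it can be run on every timer tick without side effects; containers that do not match the pattern, such as mag-snapshots, are ignored.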

[Figure: process view diagram]

Metadata about which MAG releases have been discovered and processed is stored in an Azure Table Storage table called MagReleases. The MagReleases table also lets the Apache Airflow MAG workflow query which MAG releases have finished processing and where in the Azure Blob storage container they can be downloaded from. A shared access signature (SAS) with read-only privileges gives the Apache Airflow MAG workflow access to the table.
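
To illustrate how a consumer such as the Airflow workflow might use this metadata, here is a minimal in-memory sketch. The field names and the state values are hypothetical; the actual schema of the MagReleases table is not described in this document.

```python
from dataclasses import dataclass


@dataclass
class MagRelease:
    partition_key: str  # Table Storage PartitionKey
    row_key: str        # Table Storage RowKey, here the release name
    state: str          # hypothetical states: "discovered", "done"
    target_folder: str  # folder in the shared container holding the files


def finished_releases(releases):
    """Return the releases whose processing has completed, ordered by name."""
    done = [release for release in releases if release.state == "done"]
    return sorted(done, key=lambda release: release.row_key)
```

In a real deployment the list of rows would come from a query against the table using the read-only SAS, rather than from objects built in memory.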

When the Function App finds a new MAG release, it copies the files from the release container into a shared container called mag-snapshots, under a folder with the same name as the container they were copied from. After the files have been copied, the original container is deleted.
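
The folder layout that results from this copy step can be sketched as a pure naming function. The helper names and the example blob paths are illustrative assumptions; the sketch only plans the copy and does not call any Azure API. `TARGET_CONTAINER` is read from the environment because application settings are exposed to the function as environment variables.

```python
import os


def target_blob_name(source_container, blob_name):
    """Place a blob under a folder named after the container it came from."""
    return f"{source_container}/{blob_name.lstrip('/')}"


def plan_copy(source_container, blob_names, env=os.environ):
    """Return (source, target) path pairs; nothing is copied or deleted here."""
    target_container = env.get("TARGET_CONTAINER", "mag-snapshots")
    return [
        (
            f"{source_container}/{name}",
            f"{target_container}/{target_blob_name(source_container, name)}",
        )
        for name in blob_names
    ]
```

For example, planning a copy of `mag/Papers.txt` from a container named `mag-2020-05-21` yields the target path `mag-snapshots/mag-2020-05-21/mag/Papers.txt`. Deleting the source container should only happen after every planned copy has been confirmed.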

The Function App copies the MAG files to a shared container so that the Apache Airflow MAG workflow only needs to hold a single SAS token for blob access: the one for the shared container. In the future, the copying of files by the Function App can be replaced by a service that compresses the files, as shown in the diagram above.

A total of two SAS tokens are shared: one for the MagReleases table and one for the mag-snapshots container.
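
As a rough illustration of how the two tokens are consumed, the sketch below builds the two URLs a client would use. The function names are hypothetical and the tokens in the usage example are placeholders; the sketch does not generate SAS tokens, it only appends existing ones to the standard table and blob endpoints of a storage account.

```python
def with_sas(base_url, sas_token):
    """Append a SAS token (with or without a leading '?') to a resource URL."""
    return f"{base_url}?{sas_token.lstrip('?')}"


def consumer_endpoints(account_name, table_sas, container_sas,
                       table="MagReleases", container="mag-snapshots"):
    """Return the two URLs a read-only consumer needs, one per SAS token."""
    return {
        "table": with_sas(f"https://{account_name}.table.core.windows.net/{table}", table_sas),
        "container": with_sas(f"https://{account_name}.blob.core.windows.net/{container}", container_sas),
    }
```

Calling `consumer_endpoints("myaccount", "sv=TABLE", "sv=BLOB")` returns one URL for the MagReleases table and one for the mag-snapshots container, which matches the two-token setup described above.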

Deployment View

The deployment view below shows what services are used and where they are deployed.

[Figure: deployment view diagram]

Download files

Source Distribution

mag-archiver-2020.12.0.tar.gz (85.2 kB)
