
Jina is the cloud-native neural search solution powered by state-of-the-art AI and deep learning.


An easier way to build neural search in the cloud


Jina is a deep learning-powered search framework for building cross-/multi-modal search systems (e.g. text, images, video, audio) in the cloud.

⏱️ Time Saver - Bootstrap an AI-powered system in just a few minutes.

🧠 First-Class AI Models - The design pattern for neural search systems, with first-class support for state-of-the-art AI models.

🌌 Universal Search - Large-scale indexing and querying of any kind of data on multiple platforms: video, image, long/short text, music, source code, etc.

☁️ Cloud Ready - Decentralized architecture with cloud-native features out-of-the-box: containerization, microservice, scaling, sharding, async IO, REST, gRPC.

🧩 Plug & Play - Easily extendable with Pythonic interface.

❤️ Made with Love - Quality first, never compromises, maintained by a full-time, venture-backed team.

Installation

On Linux/macOS with Python 3.7/3.8:

pip install -U jina

To install Jina with extra dependencies or on a Raspberry Pi, please refer to the documentation. Windows users can run Jina via the Windows Subsystem for Linux. We welcome community help with native Windows support.

In a Docker Container

Our universal Docker image supports multiple architectures (including x64, x86, arm-64/v7/v6) and is ready to use:

docker run jinaai/jina --help

Jina "Hello, World!" 👋🌍

As a starter, you can try "Hello, World" - a simple demo of image neural search for Fashion-MNIST. No extra dependencies needed, just run:

jina hello-world

...or even easier for Docker users, no install required:

docker run -v "$(pwd)/j:/j" jinaai/jina hello-world --workdir /j && open j/hello-world.html  
# replace "open" with "xdg-open" on Linux

It downloads the Fashion-MNIST training and test datasets and tells Jina to index 60,000 images from the training set. Then it randomly samples images from the test set as queries and asks Jina to retrieve relevant results. The whole process takes about 1 minute and ends by opening a web page that shows the query results.

Intrigued? Play with different options:

jina hello-world --help

Get Started

Create

Jina provides a high-level Flow API to simplify building search and index workflows. To create a new Flow:

from jina.flow import Flow
f = Flow().add()

This creates a simple Flow with one Pod. You can chain multiple .add() in a Flow.

Visualize

To visualize it, simply chain it with .plot(). If you are using a Jupyter notebook, it will render a flowchart inline.

Gateway is the entrypoint of the Flow.

Feed Data

Let's try sending some random data to it via the index functions:

import numpy
from jina.proto import jina_pb2

with f:
    f.index_ndarray(numpy.random.random([4, 2]), output_fn=print)  # index ndarray data, document sliced on first dimension
    f.index_lines(['hello world!', 'goodbye world!'])  # index textual data, each element is a document
    f.index_files(['/tmp/*.mp4', '/tmp/*.pdf'])  # index files and wildcard globs, each file is a document
    f.index((jina_pb2.Document() for _ in range(10)))  # index raw Jina Documents

To use a Flow, open it with the with context manager, just as you would open a file in Python. output_fn is the callback function invoked each time a batch is done. In the example above, the Flow simply passes each message through and prints the result. The whole data stream is asynchronous and efficient.
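To make the context-manager and callback pattern concrete, here is a minimal pure-Python sketch. ToyFlow, its batch_size argument, and its index method are hypothetical stand-ins written for illustration; they are not part of Jina and only mimic the shape of the calls above.

```python
from typing import Callable, Iterable, List, Optional


class ToyFlow:
    """Hypothetical stand-in for a Flow; NOT part of Jina."""

    def __init__(self, batch_size: int = 2) -> None:
        self.batch_size = batch_size
        self.is_open = False

    def __enter__(self) -> "ToyFlow":
        # A real Flow would start its Pods here.
        self.is_open = True
        return self

    def __exit__(self, exc_type, exc_value, traceback) -> None:
        # A real Flow would shut its Pods down here, even after an error.
        self.is_open = False

    def index(self, docs: Iterable[str],
              output_fn: Optional[Callable[[List[str]], None]] = None) -> None:
        if not self.is_open:
            raise RuntimeError("open the Flow with a 'with' block first")
        batch: List[str] = []
        for doc in docs:
            batch.append(doc)
            if len(batch) == self.batch_size:
                if output_fn:
                    output_fn(batch)  # callback fires once per finished batch
                batch = []
        if batch and output_fn:
            output_fn(batch)  # flush the final, possibly smaller batch


received: List[List[str]] = []
flow = ToyFlow(batch_size=2)
with flow:
    flow.index(["a", "b", "c", "d", "e"], output_fn=received.append)
```

Here the callback collects batches instead of printing them, which is often a convenient way to inspect what a callback receives.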

Add Logic

To add logic to the Flow, use the uses keyword to attach an Executor to a Pod. uses accepts several types of values: a class name, a Docker image, a YAML file or inline YAML, or a built-in shortcut.

f = (Flow().add(uses='MyBertEncoder')  # a class name of a Jina Executor
           .add(uses='jinahub/pretrained-cnn:latest')  # a Dockerized Jina Pod
           .add(uses='myencoder.yaml')  # a YAML serialization of a Jina Executor
           .add(uses='!WaveletTransformer | {freq: 20}')  # an inline YAML config
           .add(uses='_pass'))  # a built-in shortcut executor

The power of Jina lies in its decentralized architecture: each .add() creates a new Pod, and these Pods can run in a local thread or process, in a remote process, inside a Docker container, or even inside a remote Docker container.

Inter & Intra Parallelism

Chaining .add() creates a sequential Flow. To introduce parallelism, specify the needs parameter:

f = (Flow().add(name='p1', needs='gateway')
           .add(name='p2', needs='gateway')
           .add(name='p3', needs='gateway')
           .needs(['p1','p2', 'p3'], name='r1').plot())

p1, p2, and p3 now subscribe to Gateway and do their work in parallel. The final .needs() blocks until all of these Pods have finished their work. Note that parallelism can also be achieved inside a Pod with parallel:

f = (Flow().add(name='p1', needs='gateway')
           .add(name='p2', needs='gateway')
           .add(name='p3', parallel=3)
           .needs(['p1','p3'], name='r1').plot())
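One way to reason about a needs topology is as a dependency graph. The sketch below is plain Python, not Jina code: it takes the first Flow in this section (p1, p2, and p3 each needing gateway, and r1 needing all three) and groups the Pods into levels that could run side by side. The execution_levels helper is a hypothetical name introduced for illustration.

```python
from typing import Dict, List


def execution_levels(needs: Dict[str, List[str]]) -> List[List[str]]:
    """Group Pods into levels; Pods in the same level have no mutual dependency."""
    remaining = {pod: set(deps) for pod, deps in needs.items()}
    levels: List[List[str]] = []
    while remaining:
        # A Pod is ready once everything it needs has already been scheduled.
        ready = sorted(pod for pod, deps in remaining.items() if not deps)
        if not ready:
            raise ValueError("cycle detected in 'needs'")
        levels.append(ready)
        for pod in ready:
            del remaining[pod]
        for deps in remaining.values():
            deps.difference_update(ready)
    return levels


topology = {
    "gateway": [],
    "p1": ["gateway"],
    "p2": ["gateway"],
    "p3": ["gateway"],
    "r1": ["p1", "p2", "p3"],
}
levels = execution_levels(topology)
```

The middle level contains p1, p2, and p3 together, which mirrors the statement that they work in parallel while r1 waits for all of them.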

That's all you need to know for understanding the magic behind hello-world. Now let's dive into it.

Breakdown of hello-world

Customize Encoder

Let's first build a naive image encoder that embeds images into vectors using an orthogonal projection. To do that, we simply inherit from BaseImageEncoder: a base class from the jina.executors.encoders module. We then override its __init__() and encode() methods.

import numpy as np
from jina.executors.encoders import BaseImageEncoder

class MyEncoder(BaseImageEncoder):

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        np.random.seed(1337)
        H = np.random.rand(784, 64)
        u, s, vh = np.linalg.svd(H, full_matrices=False)
        self.oth_mat = u @ vh

    def encode(self, data: 'np.ndarray', *args, **kwargs):
        return (data.reshape([-1, 784]) / 255) @ self.oth_mat
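As a quick sanity check of the idea behind this encoder, the following NumPy-only sketch rebuilds the same kind of projection matrix outside of Jina and verifies that its 64 columns are orthonormal. The variable names and the five random test images are assumptions made for illustration.

```python
import numpy as np

rng = np.random.RandomState(1337)
H = rng.rand(784, 64)

# Reduced SVD: u is (784, 64) with orthonormal columns, vh is (64, 64) orthogonal.
u, s, vh = np.linalg.svd(H, full_matrices=False)
oth_mat = u @ vh  # shape (784, 64)

# The columns of oth_mat are orthonormal, so this Gram matrix is the identity.
gram = oth_mat.T @ oth_mat

# Five fake 28x28 grayscale images with pixel values in [0, 255].
images = rng.randint(0, 256, size=(5, 28, 28))
embeddings = (images.reshape([-1, 784]) / 255) @ oth_mat  # shape (5, 64)
```

Dropping the singular values and keeping only u @ vh is what makes the columns orthonormal rather than merely random.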

Jina provides a family of Executor classes that capture frequently used algorithmic components in neural search. This family consists of encoders, indexers, crafters, evaluators, and classifiers, each with a well-designed interface. You can find the list of all 107 built-in executors here. If they do not meet your needs, inheriting from one of them is the easiest way to bootstrap your own executor. Simply use our Jina Hub CLI via:

pip install jina[hub] && jina hub new

Test Encoder in Flow

Let's test our encoder in the Flow with some synthetic data:

def validate(docs):
    assert len(docs) == 100
    assert GenericNdArray(docs[0].embedding).value.shape == (64,)

f = Flow().add(uses='MyEncoder')

with f:
    f.index_ndarray(np.random.random([100, 28, 28]), output_fn=validate, callback_on='docs')

All good! Our validate function confirms that all one hundred 28x28 synthetic images have been embedded, each into a 64-dimensional vector (100x64 in total).

Parallelism & Batching

With a larger input, you can experiment with batch_size and parallel a bit.

f = Flow().add(uses='MyEncoder', parallel=10)

with f:
    f.index_ndarray(np.random.random([60000, 28, 28]), batch_size=1024)
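To get a feel for these numbers, the sketch below works out how 60,000 documents split into batches of 1024. The round-robin assignment of batches to the 10 parallel replicas is purely an assumption for illustration; Jina may schedule batches differently.

```python
from typing import List


def batch_sizes(num_docs: int, batch_size: int) -> List[int]:
    """Sizes of consecutive batches; the last one may be smaller."""
    full, rest = divmod(num_docs, batch_size)
    return [batch_size] * full + ([rest] if rest else [])


sizes = batch_sizes(60000, 1024)
num_batches = len(sizes)   # 58 full batches plus one partial batch
last_batch = sizes[-1]     # 60000 - 58 * 1024

# ASSUMPTION: batches are handed to replicas in round-robin order.
parallel = 10
batches_per_replica = [len(sizes[i::parallel]) for i in range(parallel)]
```

With these settings every replica receives either five or six batches, so a much larger batch_size would leave some of the 10 replicas idle.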

Add Data Indexer

Now we need to add an indexer to store all the embeddings and the pictures for later retrieval. Jina provides a simple numpy-powered vector indexer, NumpyIndexer, and a key-value indexer, BinaryPbIndexer. We can combine them in a single YAML file.

!CompoundIndexer
components:
  - !NumpyIndexer
    with:
      index_filename: vec.gz
  - !BinaryPbIndexer
    with:
      index_filename: chunk.gz
metas:
  workspace: ./

The ! prefix tags a structure with a class name; the with keyword defines the arguments for initializing an object of that class. Essentially, the above YAML config is equivalent to the following Python code:

from jina.executors.indexers import CompoundIndexer
from jina.executors.indexers.vector import NumpyIndexer
from jina.executors.indexers.keyvalue import BinaryPbIndexer

a = NumpyIndexer(index_filename='vec.gz')
b = BinaryPbIndexer(index_filename='chunk.gz')  # same filename as in the YAML above
c = CompoundIndexer()
c.components = lambda: [a, b]
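The division of labor between a vector indexer and a key-value indexer can be sketched in plain Python. The three Toy classes below are hypothetical and are not Jina's implementation: the vector part maps a query vector to document ids, and the key-value part maps those ids back to stored content.

```python
from typing import Dict, List, Sequence


class ToyVectorIndexer:
    """Hypothetical: finds the ids of the nearest stored vectors by brute force."""

    def __init__(self) -> None:
        self.ids: List[int] = []
        self.vectors: List[Sequence[float]] = []

    def add(self, doc_id: int, vector: Sequence[float]) -> None:
        self.ids.append(doc_id)
        self.vectors.append(vector)

    def query(self, vector: Sequence[float], top_k: int) -> List[int]:
        def sq_dist(other: Sequence[float]) -> float:
            return sum((a - b) ** 2 for a, b in zip(vector, other))

        ranked = sorted(zip(self.ids, self.vectors), key=lambda p: sq_dist(p[1]))
        return [doc_id for doc_id, _ in ranked[:top_k]]


class ToyKeyValueIndexer:
    """Hypothetical: stores the document content under its id."""

    def __init__(self) -> None:
        self.store: Dict[int, str] = {}

    def add(self, doc_id: int, payload: str) -> None:
        self.store[doc_id] = payload

    def query(self, doc_id: int) -> str:
        return self.store[doc_id]


class ToyCompoundIndexer:
    """Hypothetical: writes to both components and chains them when querying."""

    def __init__(self, vec: ToyVectorIndexer, kv: ToyKeyValueIndexer) -> None:
        self.vec, self.kv = vec, kv

    def add(self, doc_id: int, vector: Sequence[float], payload: str) -> None:
        self.vec.add(doc_id, vector)
        self.kv.add(doc_id, payload)

    def query(self, vector: Sequence[float], top_k: int) -> List[str]:
        return [self.kv.query(i) for i in self.vec.query(vector, top_k)]


index = ToyCompoundIndexer(ToyVectorIndexer(), ToyKeyValueIndexer())
index.add(1, [0.0, 0.0], "shirt")
index.add(2, [1.0, 1.0], "sneaker")
index.add(3, [5.0, 5.0], "bag")
hits = index.query([0.9, 0.9], top_k=2)
```

Keeping the two stores separate means the vector side only ever handles ids and numbers, while the bulky document content stays in the key-value side.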

Compose Flow in Python/YAML

Now add our indexer YAML file to the Flow with .add(uses=...). Let's also add two shards to the indexer to improve its scalability:

f = Flow().add(uses='MyEncoder', parallel=2).add(uses='myindexer.yml', shards=2, separated_workspace=True).plot()
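As a rough mental model of what two shards imply, the sketch below splits documents across two partitions and merges per-shard results into one global ranking. The modulo assignment, the made-up scores, and the helper names are assumptions for illustration; they do not describe how Jina routes or merges data internally.

```python
import heapq
from itertools import islice
from typing import Dict, List, Tuple

NUM_SHARDS = 2


def shard_for(doc_id: int, num_shards: int = NUM_SHARDS) -> int:
    # ASSUMPTION: documents are assigned by id modulo the shard count.
    return doc_id % num_shards


def merge_top_k(per_shard: List[List[Tuple[float, int]]],
                top_k: int) -> List[Tuple[float, int]]:
    """Merge per-shard (score, doc_id) lists, each sorted by descending score."""
    merged = heapq.merge(*per_shard, key=lambda hit: hit[0], reverse=True)
    return list(islice(merged, top_k))


# Made-up relevance scores for six documents.
scores: Dict[int, float] = {0: 0.5, 1: 0.8, 2: 0.1, 3: 0.2, 4: 0.9, 5: 0.7}

# Each shard only sees and ranks its own documents.
shards: List[List[Tuple[float, int]]] = [[] for _ in range(NUM_SHARDS)]
for doc_id, score in scores.items():
    shards[shard_for(doc_id)].append((score, doc_id))
for shard in shards:
    shard.sort(reverse=True)

top3 = merge_top_k(shards, top_k=3)
```

Each shard ranks only its own subset, so the final merge step is what produces a single global top-k list.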

When the number of arguments grows, constructing a Flow in Python can become cumbersome. You can instead move all arguments into a single myflow.yml as follows:

!Flow
pods:
  encode:
    uses: MyEncoder
    parallel: 2
  index:
    uses: myindexer.yml
    shards: 2
    separated_workspace: true

And then load it in Python via:

f = Flow.load_config('myflow.yml')

Search via Query Flow

Querying a Flow is very similar to indexing. Simply load the query Flow and switch from f.index to f.search. Say you want to retrieve the top 50 documents that are most similar to your query and then plot them in an HTML page:

f = Flow.load_config('flows/query.yml')
with f:
    f.search_ndarray(shuffle=True, size=128, output_fn=plot_in_html, top_k=50)
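To illustrate what a top_k setting means for a similarity search, here is a small standard-library sketch that ranks stored vectors by cosine similarity to a query and keeps the best k. The choice of cosine similarity, the toy vectors, and the function names are assumptions for illustration, not a description of the metric the query Flow uses.

```python
import heapq
import math
from typing import Dict, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def top_k_ids(query: Sequence[float],
              index: Dict[str, Sequence[float]], top_k: int) -> List[str]:
    """Return the ids of the top_k vectors most similar to the query."""
    return heapq.nlargest(top_k, index, key=lambda doc_id: cosine(query, index[doc_id]))


toy_index = {
    "a": [1.0, 0.0],
    "b": [0.9, 0.1],
    "c": [0.0, 1.0],
    "d": [-1.0, 0.0],
}
best = top_k_ids([1.0, 0.0], toy_index, top_k=2)
```

heapq.nlargest avoids sorting the whole collection when only a few results are needed.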

REST Interface of Query Flow

In practice, the query Flow and the client (i.e. the data sender) are often physically separated. Moreover, the client may prefer a REST API over gRPC when querying. You can set port_expose to the public port and turn on REST support via rest_api=True:

f = Flow(port_expose=45678, rest_api=True)

with f:
    f.block()
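On the client side, a REST query boils down to an HTTP POST with a JSON body. The sketch below only builds such a request with the standard library and does not send it. The endpoint path /api/search, the payload fields top_k and data, and the "text:..." data format are assumptions that may not match your Jina version, so check the documentation before relying on them.

```python
import json
from typing import List
from urllib import request


def build_search_request(host: str, port: int,
                         data: List[str], top_k: int) -> request.Request:
    # ASSUMPTION: the endpoint path and the payload field names are illustrative.
    payload = {"top_k": top_k, "data": data}
    return request.Request(
        url=f"http://{host}:{port}/api/search",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_search_request("0.0.0.0", 45678, ["text:hello world"], top_k=5)
# request.urlopen(req) would send it; omitted here because it needs a running Flow.
```

Building the request separately from sending it keeps the example runnable without a server and makes the payload easy to inspect.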

That is the essence of jina hello-world. It is just a taste of what Jina can do. We’re really excited to see what you do with Jina! You can easily create a Jina project from templates with one terminal command:

pip install jina[hub] && jina hub new --type app

This creates a Python entrypoint, YAML configs and a Dockerfile. You can start from there.

Tutorials

Jina 101 Concept Illustration Book, Copyright by Jina AI Limited   

Jina 101: First Things to Learn About Jina

Tutorials, each marked with its level:

  • 🐣 Build an NLP Semantic Search System - search South Park scripts and practice with Flows and Pods
  • 🐣 My First Jina App - use cookiecutter to bootstrap a Jina app
  • 🐣 Fashion Search with Query Language - spice up the Hello-World with Query Language
  • 🕊 Use Chunk to search Lyrics - split documents in order to search on a fine-grained level
  • 🕊 Mix and Match images and captions - search across modalities to get images from captions and vice versa
  • 🚀 Scale Up Video Semantic Search - improve performance using prefetching and sharding

Documentation

Documentation is built on every push, merge, and release of Jina's master branch.

The Basics

Reference

Are you a "Doc"-star? Join us! We welcome all kinds of improvements on the documentation.

Documentation for older versions is archived here.

Contributing

We welcome all kinds of contributions from the open-source community, individuals and partners. We owe our success to your active involvement.

Contributors ✨

All Contributors

Community

  • Code of conduct - play nicely with the Jina community
  • Slack workspace - join #general on our Slack to meet the team and ask questions
  • YouTube channel - subscribe to the latest video tutorials, release demos, webinars and presentations.
  • LinkedIn - get to know Jina AI as a company and find job opportunities
  • Twitter - follow and interact with us using hashtag #JinaSearch
  • Company - learn more about our company and how we are fully committed to open-source.

Open Governance

GitHub milestones lay out the path to Jina's future improvements.

As part of our open governance model, we host Jina's Engineering All Hands in public. This Zoom meeting takes place on the second Tuesday of each month, 14:00-15:30 (CET). Everyone can join via the following calendar invite.

The meeting will also be live-streamed and later published to our YouTube channel.

Join Us

Jina is an open-source project. We are hiring full-stack developers, evangelists, and PMs to build the next neural search ecosystem in open source.

License

Copyright (c) 2020 Jina AI Limited. All rights reserved.

Jina is licensed under the Apache License, Version 2.0. See LICENSE for the full license text.


Source Distribution

jina-0.7.8.tar.gz (258.0 kB view hashes)
