What are Vectors? Breaking down how a disruptive technology is silently powering applications we use every day

22 September 2020 - 6 mins reading time

What are Vectors?

(also known as: embedding, latent space representation, vector embeddings)

Vector embeddings are meaningful numerical representations of rich data in multi-dimensional space. Vectors can be used to represent any kind of data, such as images, text, audio, video and users.

'This is a text' --> [0.123, 0.345, 0.567, 0.768, 0.89]

A common way to obtain a vector is through deep learning embeddings. For example, to obtain image vectors we first train a convolutional neural network for image classification. Then, once the network is trained, instead of taking the predicted labels for an image, we take the outputs from one of the layers. The output of this layer is a vector, and it represents the image. This vector can be used for vector search, which enables use cases such as reverse image search and visual product recommendations.
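To make the "take the outputs from one of the layers" idea concrete, here is a toy sketch in plain Python. The tiny two-layer network and its weights are made up for illustration (they stand in for a trained convolutional network, which would be far larger), and the function names `classify` and `embed` are hypothetical.

```python
def relu(values):
    """Element-wise ReLU activation."""
    return [max(0.0, v) for v in values]

def dense(inputs, weights, biases):
    """One fully connected layer: weights has one row per output unit."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]

# Made-up weights standing in for a trained network:
# 4 input values -> 3 hidden units -> 2 class scores.
HIDDEN_W = [[0.5, -0.2, 0.1, 0.0],
            [0.0, 0.3, -0.1, 0.4],
            [-0.3, 0.0, 0.2, 0.1]]
HIDDEN_B = [0.0, 0.1, 0.0]
OUTPUT_W = [[1.0, -1.0, 0.5],
            [-1.0, 1.0, 0.5]]
OUTPUT_B = [0.0, 0.0]

def classify(pixels):
    """Normal use of the network: return the index of the predicted label."""
    hidden = relu(dense(pixels, HIDDEN_W, HIDDEN_B))
    scores = dense(hidden, OUTPUT_W, OUTPUT_B)
    return scores.index(max(scores))

def embed(pixels):
    """Embedding use: stop at the hidden layer and return its output."""
    return relu(dense(pixels, HIDDEN_W, HIDDEN_B))

image = [1.0, 0.0, 0.5, 0.2]   # a pretend 4-pixel image
label = classify(image)        # what the classifier predicts
vector = embed(image)          # what we store and search on
```

The same input goes through the same first layer in both functions; the only difference is where we stop reading the network.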

What is Vector Search?

(also known as: nearest neighbor search, KNN, neural search, distance search)

Once data is represented as vectors, you can calculate the similarities between them by calculating the distance between the vectors in the multidimensional space. Vector search is the process of calculating the distance between a search query vector and all the vectors stored in a database, to find the most similar vector and the data that it represents.
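As a rough illustration of that process, here is a brute-force search in plain Python. It uses cosine similarity as the measure (one common choice among several) and a hand-written toy database; the function names and the 3-dimensional vectors are made up for the example.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def vector_search(query, database, top_k=3):
    """Compare the query with every stored vector and return the top_k
    (id, score) pairs, most similar first."""
    scored = [(doc_id, cosine_similarity(query, vec))
              for doc_id, vec in database.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]

# Toy 3-dimensional vectors; real embeddings usually have hundreds of dimensions.
database = {
    "cat photo": [0.9, 0.1, 0.0],
    "dog photo": [0.8, 0.3, 0.1],
    "car photo": [0.0, 0.2, 0.9],
}
results = vector_search([1.0, 0.2, 0.0], database, top_k=2)
```

Comparing against every stored vector is fine for a handful of items, but the cost grows with the size of the database, which is part of why production systems need more engineering than this.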

Since we can't visualise the high dimensions that vectors can represent, we show an example of the space by reducing vectors down to two or three dimensions.
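One common way to do this kind of reduction is principal component analysis (PCA). The method behind the plots in this post is not stated, so treat the following as an illustrative sketch only: it finds the top two directions of variation with power iteration and projects each vector onto them. All function names and the toy data are assumptions.

```python
import math
import random

def mat_vec(matrix, vector):
    """Multiply a square matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]

def top_components(vectors, n_components=2, iterations=200, seed=0):
    """Return (mean, components): the leading principal directions, found by
    power iteration on the covariance matrix with deflation."""
    rng = random.Random(seed)
    dim = len(vectors[0])
    mean = [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]
    centered = [[v[i] - mean[i] for i in range(dim)] for v in vectors]
    cov = [[sum(row[i] * row[j] for row in centered) / len(vectors)
            for j in range(dim)] for i in range(dim)]
    components = []
    for _component in range(n_components):
        direction = [rng.uniform(-1, 1) for _i in range(dim)]
        for _step in range(iterations):
            direction = mat_vec(cov, direction)
            norm = math.sqrt(sum(x * x for x in direction))
            direction = [x / norm for x in direction]
        eigenvalue = sum(d * c for d, c in zip(direction, mat_vec(cov, direction)))
        components.append(direction)
        # Deflation: remove the direction just found so the next pass finds a new one.
        cov = [[cov[i][j] - eigenvalue * direction[i] * direction[j]
                for j in range(dim)] for i in range(dim)]
    return mean, components

def project(vectors, mean, components):
    """Map each vector to its coordinates along the chosen components."""
    return [[sum((v[i] - mean[i]) * c[i] for i in range(len(v))) for c in components]
            for v in vectors]

# Toy 4-dimensional data that mostly varies along one diagonal direction.
vectors = []
for k in range(20):
    t = k - 9.5
    wiggle = 1.0 if k % 2 == 0 else -1.0
    vectors.append([t, t, wiggle, 0.0])

mean, components = top_components(vectors, n_components=2)
points_2d = project(vectors, mean, components)   # ready for a 2D scatter plot
```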

How are vectors and vector search being used today?

Google Search, YouTube Recommendations and Spotify Discovery are all services we use on a daily basis. They were all created by large teams at billion-dollar tech companies, and they are all powered by vectors and vector search.

For example, let's have a quick look at the architecture of Spotify's Discover Weekly:

Slide from Spotify's presentation: https://www.slideshare.net/MrChrisJohnson/from-idea-to-execution-spotifys-discover-weekly/29-Discover_Weekly_Data_Flow

Let's focus on the right-hand side, as this is the core vector and vector search part of the system, and break it down:

Audio

1) 'Raw Audio' feeds into 'Batch Audio Models'. Here raw audio is fed into machine learning models for audio. The embeddings from these models are then pooled and extracted to form Audio Vectors. These vectors can represent features such as tempo, key and time signature, so that when a vector search is performed with an audio vector, similar-sounding songs rank higher.

NLP Model

2) 'News, Blogs and Text' and 'Track metadata' feed into 'Batch NLP Models'. Here text about the song and artist crawled from blogs, news sites and elsewhere on the internet, alongside the metadata of the song itself such as lyrics, title and description, is fed into NLP models such as Word2Vec or BERT to form Word Vectors. Then, when a vector search is performed, songs with lyrics of similar meaning and with similar critiques rank higher.

Collaborative Filtering Model

3) 'Play logs' and 'Track metadata' feed into 'Batch CF Models' (collaborative filtering). Collaborative filtering models predict what a user will like on the basis of the preferences of other, similar users. Think of Amazon's "people also bought" feature. User behaviour, song metadata and feedback are the main inputs fed into machine learning models such as autoencoders or matrix factorisation, which then allow the users and songs to be represented as vectors.
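To give a feel for how matrix factorisation turns play logs into vectors, here is a minimal sketch in plain Python trained with stochastic gradient descent. It is not Spotify's implementation: the toy "liked / not liked" data, the function names and the hyperparameters are all assumptions chosen for illustration.

```python
import random

def predict(user_vec, song_vec):
    """Predicted preference is the dot product of a user vector and a song vector."""
    return sum(a * b for a, b in zip(user_vec, song_vec))

def factorise(interactions, n_users, n_songs, dim=2,
              epochs=1000, lr=0.05, reg=0.01, seed=0):
    """Learn one vector per user and per song so that their dot products
    approximate the observed (user, song, rating) triples."""
    rng = random.Random(seed)
    users = [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(n_users)]
    songs = [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(n_songs)]
    for _epoch in range(epochs):
        for u, s, rating in interactions:
            err = rating - predict(users[u], songs[s])
            for k in range(dim):
                u_k, s_k = users[u][k], songs[s][k]
                # Gradient step on squared error with a small L2 penalty.
                users[u][k] += lr * (err * s_k - reg * u_k)
                songs[s][k] += lr * (err * u_k - reg * s_k)
    return users, songs

# Toy taste matrix: users 0-1 like songs 0-1, users 2-3 like songs 2-3.
taste = [[1, 1, 0, 0],
         [1, 1, 0, 0],
         [0, 0, 1, 1],
         [0, 0, 1, 1]]
# Hide one entry (user 0, song 1) to mimic a song the user has not played yet.
interactions = [(u, s, float(taste[u][s]))
                for u in range(4) for s in range(4) if (u, s) != (0, 1)]

user_vectors, song_vectors = factorise(interactions, n_users=4, n_songs=4)
unseen_score = predict(user_vectors[0], song_vectors[1])   # song user 0 never played
disliked_score = predict(user_vectors[0], song_vectors[2]) # song user 0 did not like
```

The learned `user_vectors` and `song_vectors` live in the same space, so a user vector can be used directly as the query in a vector search over songs.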

By combining these 3 vectors and performing a multi-vector search, Spotify is able to give you recommendations for songs that are not only similar sounding but also similar in meaning, similarly reviewed and similarly listened to, which is what makes the feature so addictive.
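How the three vectors are actually combined is not described in the slide, so the following is only one plausible sketch: score each song separately on each vector field and rank by a weighted sum of the scores. The field names (`audio`, `text`, `cf`), the weights and the toy vectors are all hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def multi_vector_search(query, songs, weights, top_k=2):
    """Rank songs by a weighted sum of per-field similarities to the query."""
    scored = []
    for name, vectors in songs.items():
        score = sum(weight * cosine_similarity(query[field], vectors[field])
                    for field, weight in weights.items())
        scored.append((name, score))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]

# Toy 2-dimensional vectors for each of the three fields.
songs = {
    "song A": {"audio": [1.0, 0.0], "text": [1.0, 0.0], "cf": [1.0, 0.0]},
    "song B": {"audio": [1.0, 0.0], "text": [0.0, 1.0], "cf": [0.0, 1.0]},
    "song C": {"audio": [0.0, 1.0], "text": [0.0, 1.0], "cf": [0.0, 1.0]},
}
query = {"audio": [1.0, 0.0], "text": [1.0, 0.0], "cf": [1.0, 0.0]}
weights = {"audio": 0.4, "text": 0.3, "cf": 0.3}

ranking = multi_vector_search(query, songs, weights, top_k=2)
```

Here song B sounds like the query but differs on the other two fields, so it ranks below song A, which matches on all three.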

Search is not the only application for vectors

  • Clustering - group similar vectors into a cluster and then analyse the cluster by aggregating metrics in each cluster.
Here is an example of clustering image vectors
  • Anomaly detection - instead of finding the most similar vectors, find the most dissimilar ones: vectors that lie far from the core clusters.
  • Explainable AI - extract the vectors from deep learning models to explain predictions.
  • Prediction Accuracy Improvement - by feeding more features, such as vectors from images, text and audio, into prediction pipelines, accuracy can be improved, in some cases drastically.

These are just a few examples of the many more applications of vectors.
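The clustering and anomaly detection ideas above can be sketched together in a few lines of plain Python: run a basic k-means to group the vectors, then flag the vector that sits farthest from its own cluster centre. The evenly spaced starting centroids, the function name and the toy 2-dimensional data are simplifying assumptions, not a recommendation for production use.

```python
import math

def kmeans(vectors, k, iterations=10):
    """Basic k-means. Starting centroids are taken at evenly spaced positions
    in the input list, which keeps this sketch deterministic."""
    n = len(vectors)
    centroids = [list(vectors[i * n // k]) for i in range(k)]
    labels = [0] * n
    for _step in range(iterations):
        # Assign each vector to its nearest centroid.
        labels = [min(range(k), key=lambda c: math.dist(v, centroids[c]))
                  for v in vectors]
        # Move each centroid to the mean of its members (skip empty clusters).
        for c in range(k):
            members = [v for v, label in zip(vectors, labels) if label == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids

vectors = [
    [0.0, 0.0], [0.0, 1.0], [1.0, 0.0],        # one tight group
    [10.0, 10.0], [10.0, 11.0], [11.0, 10.0],  # another tight group
    [5.0, 30.0],                               # a stray point
]
labels, centroids = kmeans(vectors, k=2)

# Anomaly score: distance from each vector to the centre of its own cluster.
distances = [math.dist(v, centroids[label]) for v, label in zip(vectors, labels)]
most_anomalous = max(range(len(vectors)), key=lambda i: distances[i])
```

The stray point still gets assigned to a cluster, but its distance to that cluster's centre is several times larger than anyone else's, which is what marks it as an anomaly.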

Building a Vector system

Great technology like vectors should be considered whenever rich data like text and images is involved. However, this technology is not accessible to all, as there is a large amount of engineering and maintenance involved in building a vector system. Here is a simplified example of building a vector search system:

There are a lot of moving pieces involved, and this is before even considering training and updating the machine learning model, or integrating features like facets and filtering that we enjoy in traditional search engines.

Furthermore, there are still a lot of use cases beyond search with vectors, which then require an aggregation engine, machine learning pipelines for clustering, anomaly detection, etc.

Is there an easier way to utilise vectors without having to build all this out?

Introducing Vector AI, a Vector-as-a-Service platform that gives developers, data scientists and enterprises access to vectors, with a magical experience for building production-grade vector-based applications (https://gh.vctr.ai/).

With Vector AI, it's all set up for you and ready to use in just a couple of lines of code. Here is a quick example using the SQuAD dataset from Hugging Face's datasets library:

!pip install datasets
!pip install vectorai

#load squad dataset from huggingface's datasets library
import datasets
documents = [{'_id':str(n), **d} for n, d in enumerate(datasets.load_dataset('squad')['validation'])]

#insert the documents into vectorai and search it
#(username and api_key are your Vector AI credentials and must be defined first)
from vectorai.client import ViClient
vi_client = ViClient(username, api_key)

from vectorai.models.deployed import ViText2Vec
text_encoder = ViText2Vec(username, api_key)

collection_name = "nlp_quickstart"
vi_client.insert_documents(
  collection_name, 
  documents, models={'question':text_encoder.encode}
)

vi_client.search(
  collection_name,
  text_encoder.encode('who was the winner for nfl fifty'),
  'question_vector_'
)

Here is another example of building a reverse audio search engine:

#create the audio dataset
documents = []
for i in range(1, 1001):
    documents.append({
        'audio': 'https://vecsearch-bucket.s3.us-east-2.amazonaws.com/voices/common_voice_en_{}.wav'.format(i),
        'name' : 'common_voice_en_{}.wav'.format(i),
        '_id': i
    })

#insert the documents into vectorai and search it
#(username, api_key and url must be defined first; url is assumed to be the Vector AI endpoint)
from vectorai.client import ViClient
vi_client = ViClient(username, api_key, url)

from vectorai.models.deployed import ViAudio2Vec
audio_encoder = ViAudio2Vec(username, api_key, url)

collection_name = "audio_quickstart"
vi_client.insert_documents(collection_name, documents, models={'audio':audio_encoder.encode})

vi_client.search(collection_name, audio_encoder.encode(documents[0]['audio']),
    'audio_vector_', page_size=5)

In conclusion, top billion-dollar businesses are already utilising vectors and deriving value from them. Your business and products could make use of vectors today to extract more value and build better features. The excuse of putting vectors in the "too difficult" bucket no longer applies with Vector AI. Reach out today and we'll tell you 5 different ways you can improve your KPIs with vectors.

Here are a few more examples of top companies using vectors to power different applications:

Sign up at https://getvectorai.com/ to start building your own vector-based applications and to receive updates when we go deep and break down how to recreate the applications above.