At the recent ITDevCon in Rome, I introduced Vector Databases in my talk.
The topic of Vectors and their use, especially for semantic search, has fascinated me since I first read about it. Even today, not least because of my ongoing Machine Learning studies, Vectors keep raising questions that I intend to answer (for myself).
For the benefit of those reading this article, let's proceed in order.
What are Vectors?
Vectors are "the multidimensional representation of an object."
Each dimension encodes specific information about the object and, together with the others, makes up the Embedding (another name for the vector) of the object itself.
Taking up the example I proposed during my talk, let's consider the following sentence:
"L'ITDevCon è la conferenza su Delphi più attesa nell'anno"
Thanks to an LLM, we obtain the corresponding vector for this sentence.
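To make this concrete, here is a minimal sketch of how such an embedding could be obtained in Python with the sentence-transformers library. The model name is an illustrative choice (a multilingual model, since the sentence is in Italian), not necessarily the one used in the talk.

```python
# A minimal sketch: obtaining an embedding for the sentence.
# Assumes the sentence-transformers package; the model name is an
# illustrative choice, not the one from the talk.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

sentence = "L'ITDevCon è la conferenza su Delphi più attesa nell'anno"
embedding = model.encode(sentence)

print(embedding.shape)   # (384,) for this particular model
print(embedding[:5])     # the first few dimensions of the vector
```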
As mentioned, each element of the vector encodes a semantic particularity of the object, in this case of the text. For example, one dimension might loosely relate to the topic (a Delphi conference), another to the language, another to the overall sentiment; no single dimension is readable in isolation, but together they capture the meaning of the sentence.
Similarly, we would obtain such a vector for an image, an email, a PDF, an audio track...
...and all of this using models specific to text, images, or audio: pre-trained and performant.
So far, so good: I pass text or an image to a pre-trained model (a transformer like BERT for text, a network like ResNet for images) and get the embedding.
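For the image side, a similar hedged sketch, assuming torch/torchvision and a ResNet-50 with its classification head removed; the file name product.jpg is a placeholder.

```python
# A minimal sketch: obtaining an image embedding with a pre-trained ResNet.
# Assumes torch and torchvision; "product.jpg" is a placeholder path.
import torch
from torchvision import models, transforms
from PIL import Image

# Load ResNet-50 pre-trained on ImageNet and drop the final classifier,
# so the network outputs a 2048-dimensional feature vector, not class scores.
resnet = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
resnet.fc = torch.nn.Identity()
resnet.eval()

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

image = Image.open("product.jpg").convert("RGB")
with torch.no_grad():
    embedding = resnet(preprocess(image).unsqueeze(0)).squeeze(0)

print(embedding.shape)  # torch.Size([2048])
```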
Let's now imagine one of those scenarios we analyze daily: a classic RDBMS supporting a business management system, in which products are certainly stored. Products are at the center of the business activity and, for many possible reasons, each of them requires an embedding process so that it can be handled in a new business process where semantic search is fundamental.
Features
A product, generally, is a well-defined entity: it has a code, description, price, unit of measure, supplier, category, subcategory, image...
...but it also has correlations: it can be linked to other products as complementary; it can be related to others by type of use or by how it is manufactured. It has specific behaviors. It carries information about customer interactions, from sales or through social channels.
To generalize: beyond all the characteristics (description, price...) that define it, it carries information that is useful and necessary for accurate and precise semantic search.
These pieces of information are called Features.
Features cannot be ignored. And they particularly cannot be ignored when, as in this example case, products are subject to analysis and prediction through Machine Learning systems. Even if, in these cases, embedding might not be the ultimate goal.
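To make the idea concrete, here is a minimal, hypothetical sketch of a product record carrying both its defining attributes and its relational/behavioral features; every field and value is invented for illustration.

```python
# A minimal sketch: a product record carrying defining attributes
# plus relational/behavioral features. All fields are hypothetical.
from dataclasses import dataclass, field

@dataclass
class Product:
    code: str
    description: str
    price: float
    category: str
    # Correlations: codes of complementary products.
    complementary: list[str] = field(default_factory=list)
    # Behavioral features from customer interactions.
    views_before_purchase: float = 0.0
    avg_seconds_on_page: float = 0.0

    def as_document(self) -> str:
        """Serialize the product into a text document suitable for embedding."""
        return (
            f"{self.description}. Category: {self.category}. "
            f"Often bought with: {', '.join(self.complementary) or 'none'}."
        )

p = Product("A100", "Ergonomic mechanical keyboard", 89.0, "peripherals",
            complementary=["A101", "A205"], views_before_purchase=4.2)
print(p.as_document())
```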
Features and their extraction: a fundamental process
But how do we extract these features?
The process, called Feature Engineering, requires a deep understanding of both the application domain and data processing techniques. For example, from a review's text, we could extract the general sentiment (positive, negative), the most frequent keywords, the text length, the presence of specific technical terms (as seen in the previous example).
And let's not forget behavioral data: the number of times a product is viewed before purchase, the average time spent on its page, the products viewed in the same session. These are all features that tell a story about the relationship between the product and the customer.
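As a rough illustration of what such an extraction could look like, here is a deliberately naive sketch; the term lists and the sentiment heuristic are placeholders, not a production approach.

```python
# A minimal sketch of feature extraction from a review's text.
# The sentiment heuristic and term lists are naive placeholders.
from collections import Counter
import re

TECHNICAL_TERMS = {"delphi", "rdbms", "embedding", "api"}   # hypothetical
POSITIVE = {"great", "excellent", "love"}
NEGATIVE = {"bad", "poor", "broken"}

def extract_features(review: str) -> dict:
    words = re.findall(r"[a-zA-Z']+", review.lower())
    counts = Counter(words)
    score = sum(counts[w] for w in POSITIVE) - sum(counts[w] for w in NEGATIVE)
    return {
        "sentiment": "positive" if score >= 0 else "negative",
        "top_keywords": [w for w, _ in counts.most_common(3)],
        "text_length": len(words),
        "has_technical_terms": any(w in TECHNICAL_TERMS for w in words),
    }

print(extract_features("Great keyboard, I love the Delphi-themed keycaps"))
```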
Feature extraction is not a simple process. Nor is it brief.
The key to feature extraction lies in balancing the quantity of information with its relevance. Not all features have the same weight, and part of the challenge lies in selecting those that best represent our object in the specific context in which we will use them.
In the embedding creation process, the features are then processed and transformed into the vector dimensions we talked about earlier, contributing to a rich and meaningful representation of our product.
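One simple way to picture this step, sketched under the assumption that we concatenate a text embedding with scaled numeric features; in practice, weighting and normalization deserve far more care.

```python
# A minimal sketch: turning features into extra vector dimensions by
# concatenating them with a text embedding. All values are illustrative.
import numpy as np

text_embedding = np.random.rand(384)   # stand-in for a real model's output

# Numeric features, roughly scaled to [0, 1] so that no single
# feature dominates the distance metric during search.
price, views, avg_seconds = 89.0, 4.2, 37.0
numeric = np.array([price / 500.0, views / 10.0, avg_seconds / 120.0])

# The enriched embedding: semantic dimensions plus feature dimensions.
product_vector = np.concatenate([text_embedding, numeric])
print(product_vector.shape)  # (387,)
```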
The Challenge
The race towards 'Artificial Intelligence' that companies are running opens new scenarios for those who, like me, like us, design and develop software solutions for business. But it also makes us vulnerable to failure when the proposed solution is not the result of real knowledge of the domain and of the rules the domain imposes.
It has always been and always will be this way: domain analysis is fundamental to our work.
And in the field of Artificial Intelligence, a very broad and technical subject, far removed from everyday practice and often from the professional training we have had so far, pitfalls and failure are always just around the corner.
This is why we need to be aware that this is a field that must be approached with respect and humility. It's not just simple programming. It cannot be, and it should not be.
#codinglikeacoder