Exploring Azure AI Search: Building a document analysis system
Navigating the evolving landscape of AI solutions is crucial for modern organizations aiming to streamline operations and bolster decision-making capabilities. This article delves into the development of a robust document analysis system using Azure AI Search, exploring key aspects such as semantic search implementation, data ingestion challenges, and strategies for maximizing user adoption. By examining both technological advancements and practical considerations, we uncover insights essential for successfully deploying AI-driven solutions in organizational settings.
Business problem
Growing interest in AI solutions
Recently, interest in AI solutions has been growing significantly. Companies worldwide are beginning to recognize the benefits of using Large Language Models (LLMs) to improve everyday processes such as document processing and content retrieval.
Improving these processes reduces manual labor, makes processing and content retrieval faster and more effective, and can consequently lower an organization’s operational costs.
Developing a document analysis system
To address market needs, we decided to create a system for analyzing large sets of documents using our company’s internal resources, such as HR documents and a database of completed projects.
In this article, we share our experiences, critically assess existing market solutions, evaluate technological limitations, present a cost analysis, and identify potential organizational barriers that may hinder introducing such a tool to end users.
Theoretical Background
Azure AI Search
The service was first released in March 2015 and has gone through several changes since then. Although it started out as Azure Search, more people probably recognize its second name, Azure Cognitive Search, announced in October 2019. The third (and latest) rebranding, to Azure AI Search, took place quite recently, in November 2023.
Azure AI Search lets users maintain and search information in their knowledge base, quickly and easily adding new documents, even semi-structured ones. It supports full-text search based on Apache Lucene; in this article, however, we want to focus on semantic search using vectors.
Vector search was first published in preview in June 2023. Azure AI Search cannot calculate vectors itself, but it allows developers to integrate smoothly with external services. It is no surprise that it fully supports other Azure services such as Azure OpenAI, providing access to vectorization with embedding models like text-embedding-ada-002 or the newer text-embedding-3-small and text-embedding-3-large. Users can easily save these vectors in the index and use them to search for phrases with similar meaning.
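As a rough sketch of what that integration looks like from the application side, the snippet below calls an Azure OpenAI embedding deployment directly using the openai Python package; the endpoint, key and deployment name are placeholders, not values from our setup.

```python
import os
from openai import AzureOpenAI  # pip install openai>=1.0

# Hypothetical endpoint and deployment name; replace with your own resource.
client = AzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version="2024-02-01",
)

def embed(text: str) -> list[float]:
    """Return the embedding vector for a single piece of text."""
    response = client.embeddings.create(
        model="text-embedding-3-large",  # name of the embedding deployment
        input=text,
    )
    return response.data[0].embedding

vector = embed("How do we handle employee onboarding?")
print(len(vector))  # 3072 dimensions for text-embedding-3-large
```

In production, Azure AI Search can call the same deployment for you through an embedding skill, so client-side code like this is mainly needed when you manage the ingestion pipeline yourself.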
Implementing a natural language question-answering tool
In our case, we attempted to implement a tool enabling users to ask questions in natural language and receive answers grounded in our internal knowledge base. We chose this problem because it is a commonly faced issue in many organizations. While any knowledge base is better than no base at all, large datasets get increasingly difficult to navigate and search, becoming less and less efficient with each new document.
If full-text search is sufficient, all documents may be imported from common sources like Storage Account or CosmosDB with just a few clicks in the Azure panel. Also, if vector search with default settings is just enough, it can now be done almost automatically using a new option in the panel. However, we needed more control over the process, and we also wanted to try out hybrid search with semantic ranking (more on that later), so we had to configure document ingestion manually.
Document ingestion and chunking
The most important part of document ingestion is chunking. It is almost always required because embedding models impose a hard limit on the length of text encoded into a single vector. In the case of our embedding model, text-embedding-3-large, the limit is 8,191 tokens, which corresponds to roughly 32,000 characters, or around 12 pages. Bear in mind that this is just the upper limit; the actual chunk length should satisfy several requirements:
- Text cannot be too long, as it will become too general, nor too short, because it may fail to describe the broader context of the information.
- Embedding guidelines also suggest keeping length similar to typical query length in order to improve the quality of search results.
- Another important factor was that our knowledge base is written in Polish, which typically produces a higher token count per word than English.
The default chunking strategy offered by Azure AI Search is fixed-length: it simply splits the body of a document into parts of equal character count, regardless of their content. While this approach is fast and simple to implement, it may lead to situations where the context of a piece of information is split between chunks, even with overlap in place. Another approach we tested was splitting documents by subsections. It may still create multiple parts if a chapter is very long, but it is much more likely to produce chunks containing complete context. This approach, however, requires data with some kind of self-contained units, be it sections, chapters or at least paragraphs, which is not always the case in real-life datasets.
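For illustration, here is a minimal fixed-length chunker with overlap written in plain Python; the chunk size and overlap values are arbitrary and would need tuning against your own data and the embedding model’s token limit.

```python
def chunk_text(text: str, chunk_size: int = 2000, overlap: int = 200) -> list[str]:
    """Split text into fixed-length character chunks with a sliding overlap.

    Sizes are measured in characters here for simplicity; a production pipeline
    would more likely count tokens (e.g. with a tokenizer such as tiktoken)
    to stay safely under the 8,191-token embedding limit.
    """
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        chunks.append(text[start:end])
        if end == len(text):
            break
        start = end - overlap  # slide back so context spans chunk boundaries
    return chunks
```

A section-based splitter works the same way conceptually, except that the boundaries come from headings or paragraph breaks rather than a fixed character count.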
It is also worth mentioning that the Azure AI Search ingestion pipeline can process images. They may be sent to an Azure Computer Vision resource to extract additional information such as metadata or even image captions. Such data can later be used to enrich documents in the Azure AI Search index and be searched just like the textual body of a document. Unfortunately, image extraction did not work well on our documents, which are full of screenshots and diagrams (more on that later).
Deeper data extraction
The next step in our journey was to further enrich our chunks by extracting data hidden deeper in the text, for example names of people or organizations and key phrases. It may sound trivial that we simply separated our colleagues’ names into yet another collection. However, names of all kinds are not ordinary words and follow completely different rules in the languages we use every day. Such data might, and actually should, be used to help users find the information they need, just in a different way. Key phrases are also very helpful, as they often capture the main topic, summarizing it in just a few words, and they tend to contain exactly the words users are likely to search for. During configuration, it is really important to keep in mind the language that will most likely appear in the data source, because extraction quality depends on it. A word that is a verb in one language might be a common first name on the other end of the world. Take, for example, the word “Ken”: in the US it is a first name (perhaps best known as the name of a toy), while in Japanese the same syllable is a reading of the character meaning “to see” or “to look at”. If the language is unknown at design time or may vary between documents, we are again provided with a simple, seamless integration with Azure AI Language, which can detect the language and, if needed, provide translations.
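To give a feel for this enrichment step outside the built-in skillset, here is a hedged sketch using the azure-ai-textanalytics Python SDK to detect the language, recognize named entities and extract key phrases; the endpoint, key and sample sentence are made up, and the equivalent built-in Azure AI Search skills do the same work declaratively.

```python
import os
from azure.core.credentials import AzureKeyCredential
from azure.ai.textanalytics import TextAnalyticsClient  # pip install azure-ai-textanalytics

client = TextAnalyticsClient(
    endpoint=os.environ["LANGUAGE_ENDPOINT"],  # hypothetical environment variables
    credential=AzureKeyCredential(os.environ["LANGUAGE_KEY"]),
)

chunk = "Ken from iteo prepared the Azure AI Search proof of concept in March."

# Detect the document language first, then pass it to the later calls.
language = client.detect_language([chunk])[0].primary_language.iso6391_name

entities = client.recognize_entities([chunk], language=language)[0]
for entity in entities.entities:
    print(entity.category, entity.text)  # e.g. Person -> Ken, Organization -> iteo

key_phrases = client.extract_key_phrases([chunk], language=language)[0]
print(key_phrases.key_phrases)
```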
Creating and utilizing embeddings in vector space
The final step in the ingestion pipeline is creating vectors, also called embeddings, for our text. In short, an embedding model creates a numerical representation of the text it receives. In fact, a vector can be understood as a point in a multidimensional space. It is important to mention that humans perceive three dimensions, while our points are located in a space of more than 3,000 dimensions! While it is impossible to accurately present such points to users by any means, computers handle them with ease. Using well-tested mathematical algorithms, we can quickly find the nearest neighbors in the vector space. Given that all points were produced by the same embedding model, we are guaranteed that data located close together really does have semantically similar meaning, regardless of whether it came from text, an image or any other data representation.
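To make the nearest-neighbor idea concrete, here is a tiny brute-force similarity search in NumPy; the vectors are made up and only four-dimensional for readability, and real indexes use approximate algorithms such as HNSW instead of scanning everything.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity: close to 1.0 means the vectors point the same way."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Pretend these came from the same embedding model (text-embedding-3-large
# would give 3,072 dimensions instead of 4).
index = {
    "vacation policy chunk": np.array([0.9, 0.1, 0.0, 0.2]),
    "project summary chunk": np.array([0.1, 0.8, 0.3, 0.0]),
    "onboarding checklist":  np.array([0.8, 0.2, 0.1, 0.3]),
}
query = np.array([0.85, 0.15, 0.05, 0.25])  # embedding of the user's question

ranked = sorted(index.items(), key=lambda kv: cosine_similarity(query, kv[1]), reverse=True)
for name, _ in ranked:
    print(name)  # the most semantically similar chunk comes first
```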
Index schema design and configuration
Last but not least, all this data needs a place to be kept. We need to create a schema for our index. To do that, we have to define all the properties of the documents we processed, such as chunk text, organization names, people’s names, key phrases and a bunch of metadata like file location. Every single field should then be configured: we need to specify which pieces of information can be retrieved, filtered or sorted, and how many dimensions the vectors will have.
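As a rough sketch of such a schema, the definition below uses the azure-search-documents Python SDK (version 11.4 or newer); the index name, field names and HNSW profile are illustrative rather than our exact configuration.

```python
import os
from azure.core.credentials import AzureKeyCredential
from azure.search.documents.indexes import SearchIndexClient
from azure.search.documents.indexes.models import (
    HnswAlgorithmConfiguration, SearchField, SearchFieldDataType, SearchIndex,
    SearchableField, SimpleField, VectorSearch, VectorSearchProfile,
)

fields = [
    SimpleField(name="chunk_id", type=SearchFieldDataType.String, key=True),
    SearchableField(name="content"),
    SearchableField(name="people", collection=True, filterable=True),
    SearchableField(name="organizations", collection=True, filterable=True),
    SearchableField(name="key_phrases", collection=True),
    SimpleField(name="source_path", type=SearchFieldDataType.String, filterable=True),
    SearchField(
        name="content_vector",
        type=SearchFieldDataType.Collection(SearchFieldDataType.Single),
        searchable=True,
        vector_search_dimensions=3072,               # text-embedding-3-large
        vector_search_profile_name="default-profile",
    ),
]

vector_search = VectorSearch(
    algorithms=[HnswAlgorithmConfiguration(name="hnsw")],
    profiles=[VectorSearchProfile(name="default-profile",
                                  algorithm_configuration_name="hnsw")],
)

index_client = SearchIndexClient(
    endpoint=os.environ["SEARCH_ENDPOINT"],          # hypothetical environment variables
    credential=AzureKeyCredential(os.environ["SEARCH_KEY"]),
)
index_client.create_or_update_index(
    SearchIndex(name="knowledge-chunks", fields=fields, vector_search=vector_search)
)
```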
While this might sound time-consuming and unnecessary, it has a direct influence on index size and search performance. The reason is quite simple: Azure AI Search prepares all the background metadata required to deliver the selected functionality. It is really tempting to leave everything enabled because it might be required at some point. However, the approach that worked well for us was to start with the bare minimum and upgrade it incrementally whenever a requirement arose. It is hard to predict what will be needed in the future, and index size can quickly become a problem, especially on the lower service tiers of Azure AI Search. Again, all of this is handled automatically by the service if you choose to import knowledge from existing sources using its tools; we only went down this path to see how it works and to try the advanced features.
Initial indexing might take a good while — even for around a hundred documents, processing takes a few minutes. Of course, it all depends on the complexity of the pipeline.
Querying and Search Mechanics in Azure AI Search
When our data is finally loaded into an index, we can start running queries against it. Azure AI Search has a dedicated UI in the Azure panel; however, a REST API call is preferred.
Quite surprisingly, this service does not use SQL, GQL or even Kusto. All queries need to be prepared in the form of a JSON object. While the simplest possible query requires only a single text property, there are many other possible parameters. A bunch of them are responsible for semantic search. For example, a typical hybrid query includes 10 parameters.
A comprehensive description of every possible configuration setting would require a separate blog post, so let’s focus on the most important parts. The first thing to note is that full-text search and vector search actually run in parallel (and can be used separately, if needed). Every result is scored by relevance so that the best matches show up first. Unfortunately, the two methods use completely different metrics, and their scores cannot be compared directly. Because of that, AI Search also calculates another score using Reciprocal Rank Fusion (RRF), which combines both rankings in a normalized manner.
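A minimal hybrid query through the azure-search-documents Python SDK might look roughly like the sketch below, assuming an index like the one outlined earlier and a precomputed query embedding; the index name and field names are again only illustrative.

```python
import os
from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient
from azure.search.documents.models import VectorizedQuery

search_client = SearchClient(
    endpoint=os.environ["SEARCH_ENDPOINT"],      # hypothetical environment variables
    index_name="knowledge-chunks",
    credential=AzureKeyCredential(os.environ["SEARCH_KEY"]),
)

question = "How do we handle employee onboarding?"
question_vector = embed(question)  # e.g. the embed() helper sketched earlier

results = search_client.search(
    search_text=question,                        # full-text (BM25) part
    vector_queries=[VectorizedQuery(
        vector=question_vector,
        k_nearest_neighbors=5,
        fields="content_vector",
    )],
    select=["content", "source_path"],
    top=5,
)

for result in results:
    # For hybrid queries, @search.score is the fused RRF score.
    print(result["@search.score"], result["content"][:80])
```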
Semantic search and on-the-fly vector calculation
Another feature worth mentioning is that we can send our semantic prompt as plain text. Normally, this type of search requires a vector that can be compared with the values stored in the index. While supplying that vector yourself is still allowed, with proper configuration Azure AI Search is capable of sending the prompt to the embedding model and calculating the vector value on the fly.
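With an integrated vectorizer configured on the index, the SDK exposes this as a text-based vector query; the sketch below assumes a recent azure-search-documents version (11.6 or newer) and reuses the illustrative names from the previous snippets.

```python
from azure.search.documents.models import VectorizableTextQuery

# The service vectorizes the text itself using the vectorizer attached to the
# index, so no client-side embedding call is needed.
results = search_client.search(
    search_text="How do we handle employee onboarding?",
    vector_queries=[VectorizableTextQuery(
        text="How do we handle employee onboarding?",
        k_nearest_neighbors=5,
        fields="content_vector",
    )],
    top=5,
)
```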
The result will include the most relevant documents with all their properties (or just the ones we explicitly asked for). Typically, results consist of pieces of text that are somehow similar to the query. While Azure AI Search’s job ends here, this response is still difficult, if not impossible, for the end user to consume directly. It should be further processed by another application or by a language model. A very popular example of such processing is Retrieval Augmented Generation (more commonly known as RAG).
Potential challenges
Cost considerations
So far, we have focused on the things we could do thanks to the aforementioned technologies and services. However, one of the main goals of this post was to critically assess and review these tools. That’s why this text would be incomplete without a section dedicated to the problems we had to face while implementing our knowledge base solution. Our solution was designed, among other things, to cut operational costs, so let’s start with pricing.
Compared to other popular services Microsoft provides, Azure AI Search is quite expensive. Of course, there is a free tier, but the performance it offers is only suitable for demo and development purposes. To prepare anything usable in a real-life application, at least the Basic tier is required. At the time of writing, it costs 70€ per month, but depending on the size of the knowledge base, the price can jump quickly. The next tier, Standard, scales from 230€ up to 1872€ per month. An honorable mention goes to the top tier, named “Storage optimized”, which costs more than 5000€ per month.
An interesting fact is that this price is not affected by usage. There are additional costs for some features, but the base price does not change, regardless of whether the service was used or not. Also, the service tier cannot be changed after creation, so you have to delete and recreate the whole resource to move between tiers. AI Search cannot be deallocated, turned off or temporarily disabled by any other means.
On top of that, we need to keep in mind the costs of the services used during ingestion. These depend entirely on the specific use case, but every new document will trigger additional calls, be it Language AI for translations, Computer Vision for image processing or a Function App for virtually anything you might wish to do with your data. Thankfully, documents that have already been indexed will be skipped. On the other hand, depending on the data source, AI Search might not be able to recognize that a document has changed, so handling updates may require additional logic. With that in mind, deleting the service and recreating it later may be a viable option if startup time is not critical and the application will not be needed for a longer period, especially on the more expensive tiers.
Data quality and technical challenges
A whole other topic is the quality of the data. While Microsoft is happy to claim that with Azure AI Search data quality is no longer an issue, the saying coined in the early days of computing, “garbage in, garbage out”, still holds. The reason is that quality still influences the development process and, ultimately, the quality of the results. It is true that you can simply index any supported text document (even with images) and AI Search will be able to search through it efficiently. It is also true, however, that there are several limitations and caveats we had to face, even in our internal use case. Some of them were minor; for example, Computer Vision could not handle the images in our knowledge base, often producing captions like “screenshot from a mobile phone” or, in the case of an e-commerce site listing, “gallery of bikes”. Moreover, the embedding models offered by OpenAI have a training data cutoff in 2021, so they may struggle with phrases that gained a new or changed meaning more recently, like the names of new technologies in our case.
Having said that, the major issues we had to overcome were maximizing the amount of extracted information and finding the optimal chunking strategy. In the first case, understanding the input data is the key factor. If the data is unknown or highly unstructured, the development process can quickly degenerate into a costly and time-consuming trial-and-error exercise, becoming harder and harder to maintain with every single improvement. The latter issue was even more difficult: finding an acceptable solution took us many hours and countless indexer runs. Unfortunately, this seems to be a universal problem; the sources, discussions and questions we found about it all start with “it depends” and offer only vague guidelines.
User Adoption and Expectations
There is one other problem, which we only realized when we released the app internally. Semantic search, RAG and all the other technologies we used are really powerful, but users need to understand their capabilities and limitations. When we checked the search history and the issues posted by users (technical and non-technical alike), we found that they usually don’t use the tool’s full potential, or use it in the wrong way. Logs showed that people often search for bare keywords, just as they would query a SQL database or a search bar opened with CTRL+F in a text editor. While they often found relevant documents, they did not use any of the advantages our solution offers; they would do just fine with a simple (and much, much cheaper) SQL database or an existing knowledge base solution like Atlassian Confluence. The other group we were able to isolate consisted of users who expected this solution to be better than, or at least as good as, GPT-4, just based on internal knowledge. Some of them were surprised that answers to their prompts were not found. Others were confused when seemingly irrelevant changes to a query, like removing determiners or prepositions such as “any”, “with” or “each”, produced completely different results. We understood that introducing such a tool also requires making users aware of its advantages and teaching them how to make the most of it.
Conclusion
So, we’ve already outlined the foundational steps in developing a document analysis system leveraging Azure AI Search, emphasizing its capabilities and challenges. We also pointed out several aspects which need to be carefully assessed before choosing this service as a foundation for a knowledge management system. Next time we’ll delve into Retrieval Augmented Generation (RAG), exploring how this advanced technique enhances search result relevance and usability, further advancing the utility of AI in information retrieval and analysis.
Originally published at https://iteo.com on June 26, 2024.