The fast-paced, ever-growing, and continuously changing content of the Internet makes it cumbersome to retrieve vital information on specific subjects. Analysts face the challenge of keeping track of a growing number of new data sources and of being able to search them effectively.
To increase the efficiency and quality of this work, we developed an assistance system, the so-called Data Fusion Platform. Its core element, the Federated Search, retrieves a rich, subject-related set of results by simultaneously exploring information on various platforms, e.g., Google, Yahoo, Yandex, Naver, Bing, Ahmia, DuckDuckGo, Twitter, and many more, and by using Artificial Intelligence to post-process all retrieved information, as described in this article.
The Federated Search can take care of some Internet challenges while simplifying individuals’ or companies’ searches.
However, the quality of the search results heavily depends on the search query itself. A specific, context-related, and exact search query is likely to retrieve the expected results, even if the search query covers multiple search contexts at once. But finding and developing good keywords can be cumbersome for an individual as well because it requires domain knowledge, as shown in the following section.
Let’s say you are interested in Tesla’s accidents. The basic keywords for our Federated Search might look like:
This search will now run a query against all data sources connected to our Federated Search and retrieve information linked to the keywords.
By analyzing the first results, you will see that this search is not “wide” enough. Thus, you will extend the keywords with some synonyms:
This technique is known as a synonym lookup and is often used to optimize queries.
But synonym lookups are not sufficient for many scenarios. The following examples show main keywords and their contextual extensions that cannot be derived by using simple synonym lookups.
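A synonym lookup can be sketched in a few lines. The synonym table below is purely illustrative and is not the list used by the Federated Search; the function name `expand_query` is likewise an assumption made for this example.

```python
# Hypothetical synonym table for illustration only; a real system
# would typically query a thesaurus or dictionary service instead.
SYNONYMS = {
    "accident": ["crash", "collision"],
}


def expand_query(keywords, synonyms=SYNONYMS):
    """Return the keywords plus any known synonyms, without duplicates."""
    expanded = []
    for keyword in keywords:
        candidates = [keyword, *synonyms.get(keyword.lower(), [])]
        for term in candidates:
            if term not in expanded:
                expanded.append(term)
    return expanded


print(expand_query(["Tesla", "accident"]))
# ['Tesla', 'accident', 'crash', 'collision']
```

The lookup only widens the query with words of the same meaning; it has no notion of region, time, or domain.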
These examples show that optimizing a query cannot be done just by adding more synonyms. You always need the full context to optimize queries and unfortunately, the context depends on multiple factors, such as region, time, or language spoken.
Thus, the question is:
Is there a smart way for our Federated Search to help analysts develop high-quality keywords without being experts in the targeted domain or language?
Fortunately, Artificial Intelligence and its Machine Learning techniques can generate new words from a given context. These keywords can then be used to form a new search query for a Federated Search.
Thus, we conducted a master’s thesis together with the Computer Science Department of the Nuremberg Institute of Technology (THN GSO) to optimize keywords for our Federated Search solution, with the goal of improving users’ searches and retrieving more relevant search results. In this article, we share some insights from that research.
In the following technical description, the technical term search term is used instead of the term keyword.
Optimizing search terms does not mean merely generating new search terms from the search results’ content. Rather, search term optimization covers the generation of search terms that share the same search context, the evaluation of each search term’s relevance, and the recommendation of these search terms to a user. The search term generation can be done with simple Natural Language Processing (NLP) techniques like Bag-of-Words (BoW) or Term Frequency-Inverse Document Frequency (TF-IDF), or with more advanced, neural-network-based NLP techniques like Facebook’s fastText library.
The resulting term models hold new, potential, context-based search terms for the Federated Search and are therefore used for recommendation purposes. When a user enters a search term into the Search Editor of the Federated Search, the most similar and relevant search terms are displayed so that the user can expand the search query. The search term relevance is determined by evaluating users’ click interactions with search results in the Federated Search. Additionally, to simplify and optimize the search term recommendation, a thesaurus is used, which holds the generated and entered search terms together with their relevance.
The NLP models are fed with the Federated Search results. Because various data platforms such as webpages, news articles, and tweets are supported, the search results’ content contains a wide range of symbols, special characters, and other non-text elements (e.g., hashtags, emojis, GIFs, pictures, and links), as well as simple spelling mistakes. People can usually still read and understand sentences that contain these elements. However, NLP techniques require clean character- or token-based input to be able to analyze the content and create models.
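As a rough illustration of what such cleaning might involve, the following sketch strips links, mentions, emojis, and punctuation from a raw result and returns lowercase tokens. The regular expressions and the function name `prepare` are assumptions for this example, not the platform's actual preparation pipeline, and spelling correction is left out entirely.

```python
import re

# Assumed cleaning rules for illustration; real pipelines are usually richer.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")
NON_WORD_RE = re.compile(r"[^\w\s]|_")  # punctuation, '#', emojis, underscores


def prepare(text):
    """Drop links and mentions, strip symbols, and return lowercase tokens."""
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    text = NON_WORD_RE.sub(" ", text)  # '#Autopilot' becomes 'Autopilot'
    return text.lower().split()


raw = "Tesla #Autopilot crash \U0001F697 https://example.com/a?b=1 via @user!!"
print(prepare(raw))
# ['tesla', 'autopilot', 'crash', 'via']
```

Hashtag words are kept on purpose in this sketch, since they often carry the very context the term model is supposed to learn.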
Therefore, a data preparation process is mandatory to identify and clean the raw content for the model training. The prepared content is used directly to train a new term model. A distinctive feature of the training is that the prepared search results’ content is grouped by the search term that was used, which ensures that the generated search terms share the same search context. BoW and TF-IDF create a word-based n-gram model, which covers search queries with a single search term or multiple search terms.
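The grouping idea can be sketched with a small, dependency-free TF-IDF model. Here all results retrieved for one search term are treated as a single document, and word uni- and bigrams are scored per group. The function names, the input structure, and the smoothed IDF formula (the same smoothing scikit-learn applies by default) are assumptions for this example rather than the thesis implementation.

```python
import math
from collections import Counter


def word_ngrams(tokens, n_max=2):
    """Yield word n-grams of length 1..n_max as space-joined strings."""
    for n in range(1, n_max + 1):
        for i in range(len(tokens) - n + 1):
            yield " ".join(tokens[i:i + n])


def build_term_model(results_by_term, n_max=2):
    """Map each search term to {candidate term: TF-IDF score}.

    results_by_term: {search term: [token list per search result]}
    """
    counts = {
        term: Counter(g for doc in docs for g in word_ngrams(doc, n_max))
        for term, docs in results_by_term.items()
    }
    n_groups = len(counts)
    group_freq = Counter(g for c in counts.values() for g in c)
    model = {}
    for term, c in counts.items():
        total = sum(c.values())
        model[term] = {
            g: (freq / total) * (math.log((1 + n_groups) / (1 + group_freq[g])) + 1)
            for g, freq in c.items()
        }
    return model


def top_terms(model, search_term, k=3):
    """Return the k highest-scoring candidate terms for a search term."""
    scores = model[search_term]
    return sorted(scores, key=scores.get, reverse=True)[:k]


results = {
    "tesla accident": [["autopilot", "crash", "investigation"], ["autopilot", "recall"]],
    "angela merkel": [["chancellor", "germany"], ["chancellor", "election", "investigation"]],
}
model = build_term_model(results)
print(top_terms(model, "tesla accident", k=1))
# ['autopilot']
```

Because "investigation" occurs in both groups, its score is lower than that of terms found in only one search context, which is the effect the grouping is meant to achieve.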
fastText, in contrast, creates a character-based n-gram model to cover search queries of various lengths. This improves the semantic similarity estimates for words and makes it possible to identify similar terms that are not part of the trained model.
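To see why character n-grams help with unseen words, it is enough to look at how a word is decomposed. The sketch below wraps a word in `<` and `>` boundary markers and extracts all n-grams of 3 to 6 characters, which is assumed here to match fastText's usual defaults. The overlap measure is only an illustration: fastText itself sums learned vectors of hashed n-grams and does not compute a set overlap.

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Return the character n-grams of a word wrapped in boundary markers."""
    wrapped = f"<{word}>"
    return [
        wrapped[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(wrapped) - n + 1)
    ]


def shared_ngram_ratio(a, b):
    """Jaccard overlap of two words' n-gram sets (illustrative only)."""
    grams_a, grams_b = set(char_ngrams(a)), set(char_ngrams(b))
    return len(grams_a & grams_b) / len(grams_a | grams_b)


print(char_ngrams("where", 3, 3))
# ['<wh', 'whe', 'her', 'ere', 're>']
print(shared_ngram_ratio("merkel", "berkel") > shared_ngram_ratio("merkel", "tesla"))
# True
```

A word that never appeared in the training data still shares many n-grams with known words, so a vector can be composed for it. The same property also explains why look-alike strings end up close to each other.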
How are the term models updated or refreshed? BoW and TF-IDF are fast and straightforward techniques that compute a new term model almost immediately, even from a small dataset. This characteristic allows a fresh model to be trained directly after each Federated Search execution. fastText, on the other hand, requires a large dataset and a comparatively long training time. However, because of its character-based n-grams, even an outdated fastText model still identifies similar search terms. Therefore, fastText models can, and should, be trained on demand.
The search term relevance evaluation requires user click interactions from the front-end. Therefore, Query Click Logs are introduced, which hold the user’s click action, the search result, and the related search term. The Query Click Logs are attached to the search results in the front-end. Each type of click action has its own weight, which is applied before the evaluation. This enables a quick and straightforward relevance evaluation. The continuous acquisition of user click interactions and evaluation of search term relevance ensures that the search terms are kept up to date.
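A weighted click evaluation of this kind could look roughly as follows. The action names, their weights, and the `QueryClickLog` fields are hypothetical, since the article does not list the actual actions or values.

```python
from collections import defaultdict
from dataclasses import dataclass

# Hypothetical click actions and weights, chosen only for illustration.
ACTION_WEIGHTS = {"open": 1.0, "bookmark": 2.0, "dismiss": -1.0}


@dataclass(frozen=True)
class QueryClickLog:
    search_term: str  # search term that produced the result
    result_id: str    # identifier of the clicked search result
    action: str       # click action performed by the user


def evaluate_relevance(logs, weights=ACTION_WEIGHTS):
    """Sum the weighted click actions per search term."""
    relevance = defaultdict(float)
    for log in logs:
        relevance[log.search_term] += weights.get(log.action, 0.0)
    return dict(relevance)


logs = [
    QueryClickLog("autopilot crash", "r1", "open"),
    QueryClickLog("autopilot crash", "r2", "bookmark"),
    QueryClickLog("tesla recall", "r3", "dismiss"),
]
print(evaluate_relevance(logs))
# {'autopilot crash': 3.0, 'tesla recall': -1.0}
```

Re-running the evaluation as new logs arrive keeps the relevance scores current without retraining any model.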
The search term recommendation depends on the user-entered search terms, the generated, context-based search terms (term model and thesaurus), and the algorithm that identifies the most similar and relevant search terms. Similarity functions like the Jaro-Winkler similarity compare two strings and compute a similarity score. The higher the similarity score, the more similar the entered search term and the candidate search term are. Additionally, the evaluated relevance score is used to filter and sort these most similar search terms. This simple two-level approach yields a quick and reliable search term recommendation.
As before, the user interacts with the Federated Search of the Data Fusion Platform. To make the new feature available, a search term recommendation view was added to the Federated Search, which shows the most similar and relevant search terms. Every time a new search term is added to the Search Editor, the most similar and relevant search terms are recommended. When a user clicks on a recommended search term, the search query is expanded automatically, so the next executed search includes both the entered and the recommended search terms.
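The two levels can be sketched as follows: a plain-Python Jaro-Winkler similarity selects the candidates, and the relevance score orders them. The thesaurus structure (candidate term mapped to relevance), the similarity threshold of 0.8, and the function name `recommend` are assumptions for this example.

```python
def jaro(s1, s2):
    """Jaro similarity of two strings, in the range 0..1."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if not len1 or not len2:
        return 0.0
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1, matched2 = [False] * len1, [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Matched characters that appear in a different order count as transpositions.
    k = out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    t = out_of_order / 2
    return (matches / len1 + matches / len2 + (matches - t) / matches) / 3


def jaro_winkler(s1, s2, prefix_scale=0.1):
    """Jaro similarity boosted for a common prefix of up to four characters."""
    score = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return score + prefix * prefix_scale * (1 - score)


def recommend(entered, thesaurus, min_similarity=0.8, limit=5):
    """Level 1: keep similar terms. Level 2: sort them by relevance."""
    entered = entered.lower()
    similar = []
    for term, relevance in thesaurus.items():
        similarity = jaro_winkler(entered, term.lower())
        if term.lower() != entered and similarity >= min_similarity:
            similar.append((relevance, similarity, term))
    similar.sort(reverse=True)
    return [term for _, _, term in similar[:limit]]


thesaurus = {"tesla accident": 3.0, "tesla autopilot": 5.0, "angela merkel": 4.0}
print(recommend("tesla", thesaurus))
# ['tesla autopilot', 'tesla accident']
```

Jaro-Winkler rewards a shared prefix, which suits a search editor in which users type the beginning of a term first.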
Furthermore, with each search and search result interaction, the Search Term Optimization improves and optimizes itself.
(Acquisition of new Data → Creation of new Term Models → Evaluation of the latest Search Term Relevances → Recommendation of Search Terms.)
In short, the TF-IDF and BoW techniques performed best when it comes to creating a new term model and recommending search terms. The term model creation finished almost instantly, and the model could be used directly for recommendation purposes. Additionally, TF-IDF ranks the search terms by their relevance within the most recently retrieved search results, which slightly improved the recommendation. Both techniques’ recommendations narrowed down or expanded the search context, as visible in the screenshot above. fastText’s search term recommendations, however, did not perform as well. The reason might be insufficient data and is under investigation.
The fastText term model training took considerably longer. Additionally, the search term recommendation only returned terms that closely resemble the entered search term and did not narrow down or expand the search context (e.g., search term: ‘Angela Merkel’; recommendations: ‘Derkel’, ‘Berkel’, ‘Merkel’). However, the poor recommendation results might have been caused by the small dataset. Nevertheless, fastText can be used if the term model is trained on a larger dataset, trained on demand, and persisted in an internal database. Then, for the search term recommendation, the term model just needs to be loaded.
Copyright © 2023, Traversals Analytics and Intelligence GmbH. All Rights Reserved.