When we started the first real-world trials of our Global Disaster Alerting System at the end of 2020, we realized how essential it is to visualize information on maps. Geospatial visualization gives a much better understanding of a situation and reveals details that would otherwise stay hidden.
For all our use cases, whether monitoring global disasters and crisis situations or Competitive Intelligence, Strategic Intelligence and Medical Intelligence, we collect and analyze huge amounts of Publicly Available Information (PAI) or Open-Source Intelligence (OSINT). Unfortunately, most of the collected information contains no machine-readable geographic coordinates at all. The geographic information in the text might be expressed as a paraphrase or metaphor that cannot be placed directly on a map. It would take a score of well-trained and experienced analysts to assign geographic coordinates to these expressions manually, which would be a huge and costly workload.
To show the complexity of the problem, we prepared some examples. Let’s stay in the realm of disaster analysis and look at the following events:
1) A major fire occurred in Frankfurt a.M., Germany.
2) @RadioFrankfurtOder: A major fire occurred in Frankfurt.
3) ’n Groot brand het in Frankfort voorgekom. (English: A major fire occurred in Frankfort.)
These examples show that analysis tasks can quickly become very challenging. You can probably solve the first two if you understand the context and are familiar with the country. The first one refers to Frankfurt am Main, a major city in the German federal state of Hesse. The second one refers to Frankfurt an der Oder, a major city in the German federal state of Brandenburg. But can you also solve the third one?
The text is written in Afrikaans and mentions Frankfort, a small city in South Africa named after Frankfurt am Main. Obviously, identifying the wrong Frankfurt can cause a lot of confusion and lead to wrong decisions or recommendations.
In general, identifying geographic locations can get really challenging when the:
Traversals specializes in information analysis and offers solutions in the areas of Disaster Monitoring, Competitive Intelligence and Vendor Risk Management. Our job is to automate things whenever possible and to ease our users’ daily routines. That’s why we have been working intensively on the geocoding challenge over the past few months, trying out various services and ultimately building our own.
This article shares some insights into that work.
Retrieving geographic coordinates from multimodal information is a multistep process. In this section, we focus on textual information only and describe the steps needed to implement a basic processing workflow.
Recall the first example of this article, in which Frankfurt was mentioned within the sentence. In the first step, we use NLP-based Named-Entity Recognition (NER) to identify all locations mentioned in the given sentence.
NER is a subtask of information extraction that locates named entities mentioned in unstructured text and classifies them into predefined categories such as names of people, organizations, places, etc.
The result is a list of strings containing all identified locations.
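The input and output of this first step can be sketched without committing to a particular NER library. The toy tagger below is only a dictionary lookup standing in for a trained NER model, not the model we use; the names `KNOWN_LOCATIONS` and `extract_locations` are hypothetical and chosen for illustration.

```python
import re

# Hypothetical stand-in for a trained NER model: a tiny set of known place names.
KNOWN_LOCATIONS = {"frankfurt", "frankfort"}


def extract_locations(sentence: str) -> list[str]:
    """Return every token of `sentence` that the toy tagger labels as a location."""
    # Split into runs of letters (Unicode-aware), dropping digits and punctuation.
    tokens = re.findall(r"[^\W\d_]+", sentence)
    return [tok for tok in tokens if tok.lower() in KNOWN_LOCATIONS]


print(extract_locations("@RadioFrankfurtOder: A major fire occurred in Frankfurt."))
# ['Frankfurt']
print(extract_locations("'n Groot brand het in Frankfort voorgekom."))
# ['Frankfort']
```

Note that the handle @RadioFrankfurtOder is not tagged at all, although it is exactly the piece of context that points to Frankfurt an der Oder.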
Getting back to our first example, Frankfurt was the only location extracted. The second step is to convert this location name into geographic coordinates. There are various SaaS solutions capable of doing this:
Instead of using one of these services, you can also implement your own. GeoNames gives you a nice foundation for this. If you take the full package, you will get more than 12 million geographic names, including their alternative names and geographic coordinates. We have seen implementations that store the geographic names in a searchable database and use it as a look-up service. If you do it that way, you need a database that supports fuzzy search to cope with spelling errors.
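A minimal sketch of such a look-up service, assuming an in-memory dictionary instead of a real database and Python's standard `difflib` for the fuzzy matching. The `GAZETTEER` excerpt, the `lookup` function, and the cutoff value are hypothetical; the coordinates are rounded and for illustration only.

```python
from difflib import get_close_matches

# Hypothetical excerpt of a gazetteer: name -> list of (label, latitude, longitude).
# Coordinates are rounded approximations, for illustration only.
GAZETTEER = {
    "Frankfurt": [
        ("Frankfurt am Main, DE", 50.1, 8.7),
        ("Frankfurt (Oder), DE", 52.3, 14.6),
    ],
    "Frankfort": [
        ("Frankfort, ZA", -27.3, 28.5),
    ],
}


def lookup(name: str, cutoff: float = 0.9) -> list[tuple[str, float, float]]:
    """Return all candidate places whose gazetteer name is similar to `name`."""
    # get_close_matches ranks the keys by similarity and drops those below `cutoff`.
    keys = get_close_matches(name, list(GAZETTEER), n=3, cutoff=cutoff)
    return [place for key in keys for place in GAZETTEER[key]]


for query in ("Frankfurt", "Frankfrt", "Berlin"):
    print(query, "->", [label for label, _, _ in lookup(query)])
```

In this sketch, the exact name "Frankfurt" still returns two candidates, and the misspelling "Frankfrt" is equally similar to Frankfurt and Frankfort, so all three places come back. The string alone cannot decide which one is meant.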
We have decided against this approach. The reasons will become clear in the next section.
Since the idea is pretty straightforward, it was also our starting point months ago. Similar approaches have been discussed in the open-source community. After extensive testing with real-world data, in close cooperation with our customers, we realized that this approach did not work for us, for the following reasons:
If you know us already, you will know that whatever we do, we do multilingually, in 90+ languages. That means geocoding should also work in languages like Arabic, Russian, Chinese, Hausa, … and it should scale to process 100,000 sentences per day easily.
We haven’t seen any working solution (which does not mean there is none), so we started to implement our own.
As a result, we now have a modular approach that runs fully on-premises and is part of our Data Fusion Platform.
Are we done? No, we are not. We are constantly improving the geocoding solution by analyzing random samples of our results. As this workflow is now part of our Federated Search, we can collect real data at any time.
While implementing the first version of our self-learning bounding boxes, we came up with new ideas that will increase the context-awareness of the geocoding solution.
This means: stay tuned!
Copyright © 2021, Traversals Analytics and Intelligence GmbH. All Rights Reserved.