How AI/NLP and Geocoding Support Monitoring of Global Conflicts

Key Findings

  • Converting text into geographic locations on a map is crucial for any kind of OSINT analysis.
  • The task is challenging for human analysts and calls for smart automation in the background.
  • It can be automated with modern AI/NLP combined with a few well-designed processing steps.
  • A lot of sample data is needed to identify the corner cases and improve overall quality.

What is Geocoding?

When we started the first real-world trials of our Global Disaster Alerting System at the end of 2020, we realized how essential it is to visualize information on maps. A geospatial visualization adds considerable value, giving a much better understanding of a situation and revealing details that would otherwise remain hidden.

 

For all our use cases, be it monitoring of global disasters and crisis situations, Competitive Intelligence, Strategic Intelligence or Medical Intelligence, we collect and analyze huge amounts of Publicly Available Information (PAI) or Open-Source Intelligence (OSINT). Unfortunately, most of the collected information does not contain any machine-readable geographic coordinates at all. The geographic information contained in the text is often expressed as a paraphrase or metaphor that cannot be directly transposed to a map. Manually assigning geographic coordinates to these expressions would take a large team of well-trained and experienced analysts and result in a huge and costly workload.

 

To show the complexity of the problem, we prepared some examples. Let’s stay in the realm of disaster analysis and look at the following events:

 

1) A major fire occurred in Frankfurt a.M., Germany.

2) @RadioFrankfurtOder: A major fire occurred in Frankfurt.

3) ‘N Groot brand het in Frankfort voorgekom. (Engl.: A major fire occurred in Frankfort.)

 

These examples show that analysis tasks can quickly become very challenging. You can probably solve the first two if you understand the context and are familiar with the country: the first refers to Frankfurt am Main, a major city in the German federal state of Hesse; the second to Frankfurt an der Oder, a city in the German federal state of Brandenburg. But can you also solve the third one?

 

The text is written in Afrikaans and refers to Frankfort, a small city in South Africa named after Frankfurt am Main. Obviously, picking the wrong Frankfurt can cause a lot of confusion and lead to wrong decisions or recommendations.

 

In general, identifying geographic locations can get really challenging when the:

 

  • information is expressed in a foreign language or slang,
  • information is about countries you are not familiar with,
  • information is incomplete,
  • amount of information is simply too large to handle manually.

 

Traversals specializes in information analysis and offers solutions in the areas of Disaster Monitoring, Competitive Intelligence and Vendor Risk Management. Our job is to automate things wherever possible and to ease our users’ daily routines. That is why we have worked intensively on the geocoding challenge over the past few months, trying out various services and finally building our own.

 

This article shares some insights from that work.

A Basic Workflow to Bring Text to the Map

Retrieving geographic coordinates for multimodal information is a multi-step process. In this section, we focus on textual information only and describe the steps needed to implement a basic processing workflow.

Step 1 – Using AI/NLP to Extract Location Entities 

If you recall the first example of this article, Frankfurt was mentioned within the sentence. In the first step, we use NLP-based Named-Entity Recognition (NER) to identify all locations mentioned in a given sentence.

 

NER is a sub-task of information extraction that locates named entities mentioned in unstructured text and classifies them into predefined categories such as names of people, organizations, places, etc.

 

There are various SaaS-based NER systems available, e.g. from Google or Microsoft, or you can set up your own system using the well-known Stanford NER or the spaCy library.

 

The result is a list of all location strings identified in the text.
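To make this concrete, here is a minimal sketch using the open-source spaCy library and one of its pre-trained English pipelines; the exact entities returned depend on the model version:

    import spacy

    # Load a pre-trained English pipeline
    # (install via: python -m spacy download en_core_web_sm)
    nlp = spacy.load("en_core_web_sm")

    doc = nlp("A major fire occurred in Frankfurt a.M., Germany.")

    # Keep only entities tagged as geopolitical entities (GPE) or locations (LOC)
    locations = [ent.text for ent in doc.ents if ent.label_ in ("GPE", "LOC")]
    print(locations)  # typically something like ['Frankfurt', 'Germany']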

Step 2 – Converting Location Entities into Geographic Coordinates

Getting back to our first example, Frankfurt was the one and only location extracted. The second step is to convert this textual location into geographic coordinates. There are various SaaS solutions capable of doing this.
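As a generic illustration (not a recommendation of any specific provider), the OpenStreetMap-based Nominatim service can be queried through the Python geopy library:

    from geopy.geocoders import Nominatim

    # Nominatim is OpenStreetMap's public geocoding service; a descriptive
    # user agent is required by its usage policy.
    geolocator = Nominatim(user_agent="geocoding-demo")

    location = geolocator.geocode("Frankfurt")
    if location is not None:
        print(location.address, location.latitude, location.longitude)
        # Without further context, the service simply picks one of the many Frankfurts.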

 

 

Instead of using one of these services, you can also implement your own. GeoNames provides a good foundation for this: the full dataset contains more than 12 million geographic names, including alternative names and geographic coordinates. We have seen implementations that store the geographic names in a searchable database and use it as a look-up service. If you go that way, you need a database that supports fuzzy search to cope with spelling errors.
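For illustration, here is a minimal sketch of such a look-up service, using Python's standard difflib for fuzzy matching and a tiny hand-made gazetteer in place of the full GeoNames dump (names and coordinates are approximate and chosen for this example):

    import difflib

    # Tiny stand-in for the GeoNames dataset; coordinates are approximate.
    GAZETTEER = {
        "Frankfurt am Main": (50.11, 8.68),
        "Frankfurt (Oder)": (52.34, 14.55),
        "Frankfort": (-27.28, 28.50),  # Free State, South Africa
    }

    def lookup(name, cutoff=0.6):
        """Fuzzy look-up of a (possibly misspelled) location name."""
        matches = difflib.get_close_matches(name, GAZETTEER.keys(), n=3, cutoff=cutoff)
        return [(m, GAZETTEER[m]) for m in matches]

    print(lookup("Frnkfurt"))  # the misspelled input still yields Frankfurt candidates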

 

We have decided against this approach. The reasons will become clear in the next section.

Drawbacks and Problems of this Approach

Since the idea is pretty straightforward, it was also our starting point months ago, and similar approaches have been discussed in the open-source community. After extensive testing with real-world data and in close cooperation with our customers, we realized that this approach did not work for us. Here is why:

 

  • Using Named-Entity Recognition (NER) to identify geographic locations in text is a good basic approach. Why it often fails is shown by the following example: “There is a big fire. Please close your windows and stay safe. #Frankfurt #fire #staysafe”. The locations are not part of the sentence and consequently cannot be identified by NER. What you often see on Twitter is that locations are appended at the end of the tweet as hashtags. Using hashtags is probably the only reliable standard on social media ;).
  • Using a SaaS-based geocoding service is a simple starting point. However, especially the cheaper geocoding services seem to be built for a very specific use case: converting a complete, standardized address with street name, postal code, … into geographic coordinates. Have you ever seen someone writing a full standardized address on Twitter or Reddit?
  • Some of the SaaS solutions had problems with spelling errors and returned no results, e.g. for the misspelled location “Frnkfurt”. With the geographic-name-based approach, this problem can be mitigated by adding a fuzzy search.
  • Almost all approaches we have seen failed with incomplete location names. Close to our headquarters there is a forest commonly called “Reichswald”. Its official names, however, are “Sebalder Reichswald” and “Lorenzer Reichswald”, names hardly anyone uses. Fuzzy matching does not help here either.
  • We also saw interesting results for abbreviations such as “NRW”, the short name of the German federal state of North Rhine-Westphalia. One of the SaaS solutions returned a full match for a location in Sudan…!
  • What about Santa Cruz? How many places called Santa Cruz do you know? There is a Santa Cruz in Bolivia, Argentina and the United States, as well as Santa Cruz de Tenerife. Which one should the geocoding service select? And it gets even worse if someone writes it as a hashtag, #SantaCruz: the missing blank looks like a spelling error, so many systems fail and return 0 results instead of the possible 20 (see the normalization sketch after this list).
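To illustrate the hashtag issue, here is a minimal normalization sketch (not our production pipeline) that extracts hashtags and splits camel-case tags like #SantaCruz back into separate words before geocoding:

    import re

    def hashtag_candidates(text):
        """Extract hashtags and add a split variant for camel-case tags."""
        candidates = []
        for tag in re.findall(r"#(\w+)", text):
            candidates.append(tag)
            split = re.sub(r"(?<=[a-z])(?=[A-Z])", " ", tag)  # "SantaCruz" -> "Santa Cruz"
            if split != tag:
                candidates.append(split)
        return candidates

    tweet = "There is a big fire. Please close your windows and stay safe. #SantaCruz #fire #staysafe"
    print(hashtag_candidates(tweet))
    # ['SantaCruz', 'Santa Cruz', 'fire', 'staysafe']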

 

If you already know us, you will know that whatever we do, we do multilingually, covering 90+ languages. That means geocoding should also work for languages like Arabic, Russian, Chinese, Hausa, … and it should scale to easily process 100,000 sentences per day.

 

We have not seen any working solution (which does not mean there is none), so we started implementing our own.

How we Modified the Basic Workflow

  • We selected one of the NER libraries and extended the approach to extract more possible location candidates, including all hashtags and similar expressions. The downside is that this easily multiplied the number of location candidates by a factor of 5, putting a lot of pressure on the geocoding itself.
  • We decided against an indexing database like Elasticsearch. Dropping the database allowed us to implement a more complex fuzzy-search approach of our own; the downside was reduced performance, but we were still able to meet our Service-Level Objective (SLO).
  • Getting rid of the additional database also allowed us to implement clean autoscaling for the pure geocoding function: in case of increased load, we simply let the function scale out. It also simplified operations, as no backups are required and the deployment is fully managed by our DevOps chain.
  • What really increased the performance and quality of the new workflow was the introduction of self-learning bounding boxes. By analyzing the full sentence, we can identify additional hints and derive bounding boxes from them. These bounding boxes improve performance and eliminate ambiguities like those in the Santa Cruz example (a simplified sketch of the filtering idea follows after this list).
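To illustrate the bounding-box idea (leaving the self-learning part aside), here is a simplified sketch: candidate coordinates for an ambiguous name are filtered against a bounding box derived from other hints in the sentence. All names and coordinates are approximate and chosen for this example.

    from dataclasses import dataclass

    @dataclass
    class Candidate:
        name: str
        lat: float
        lon: float

    # Hypothetical candidates returned for "Santa Cruz" (approximate coordinates)
    CANDIDATES = [
        Candidate("Santa Cruz de la Sierra, Bolivia", -17.8, -63.2),
        Candidate("Santa Cruz, California, USA", 36.97, -122.03),
        Candidate("Santa Cruz de Tenerife, Spain", 28.47, -16.25),
    ]

    def filter_by_bbox(candidates, south, west, north, east):
        """Keep only candidates that fall inside a context-derived bounding box."""
        return [c for c in candidates
                if south <= c.lat <= north and west <= c.lon <= east]

    # If other hints in the sentence point to the Canary Islands, a rough bounding
    # box around the archipelago keeps only the Tenerife match.
    print(filter_by_bbox(CANDIDATES, south=27.5, west=-18.5, north=29.5, east=-13.0))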

 

As a result, we now have a truly modular approach that runs fully on-premise and is part of our Data Fusion Platform.

Are we satisfied now?

No, we are not. We are constantly improving the geocoding solution by post-analyzing random samples of its results. Since this workflow is now part of our Federated Search, we can collect real data at any time.

 

While implementing the first version of our self-learning bounding boxes, we already identified new ideas that will further increase the context-awareness of the geocoding solution.

 

This means: stay tuned!

Copyright © 2024, Traversals Analytics and Intelligence GmbH. All Rights Reserved.