Cyril de Kock, NLP Data Scientist
Reading time: 8 minutes ⏱
This blogpost will explain how to use the BigARTM topic modelling library, how to preprocess the data for topic modelling using spaCy and how to calculate coherence using the Gensim package. In other words, it shows how to train a topic model in an unsupervised manner on a custom dataset.
Introduction
The ultimate goal at Findest [1] is to realise ten thousand innovations within ten years. We help other companies and individuals innovate by finding technologies, materials and knowledge that they require. The technology scouts within Findest utilise IGOR^AI to locate the relevant technologies and deliver the results in a compact overview.
To find the right technologies the scouts have to sift through millions of scientific papers, patents and websites. They do this by composing queries which filter the search space based on functions, properties and keywords. A basic example of such a query is:
Functions: Desalinate salt water OR remove salt.
Keywords: seawater, desalination, separation technique, salt, sustainable, membrane, electrodialysis, Reverse electrodialysis, Reverse Osmosis
This query attempts to find technologies that can desalinate salt water. Notice how each function separated by an OR forms an action-object pair. The query is refined by the addition of relevant keywords. The results could be tuned further by adding keywords that the search should or should not match, such as diesel or wastewater. The query and found technologies can be examined here.
Problem statement
Such a query returns thousands to tens of thousands of results. This makes it difficult for a user to get a feeling for which types of technologies are present and in which fields. The number of search results is too large for a human to process, and users may struggle to come up with additional keywords to filter the results. Filtering down more than a hundred documents is an arduous task and could lead to the user missing the right information.
At the Findest development team we devised a topic modelling approach to provide filters to use in search. We hypothesised the modelling should provide the following benefits:
Most topic modelling approaches take a (large) dataset of articles or tweets and attempt to discover narrative structure within the data. This often comes with a lot of manual fine-tuning to discover the best possible topic distribution. An example of this is the research done by the Pew Research Center [2]. Our use case, however, requires training a new model for each query issued. This brings with it technical challenges which we will discuss in the following paragraphs.
Data
Since searches can run up to hundreds of thousands of results we limit the scope of the topic modelling. We only consider the first thousand hits sorted by relevance. This is a relatively small number of documents for topic modelling. However, our users are seeking a few key results among hundreds of technologies and reducing the scope is essential in finding the correct answers. Furthermore, we restrict ourselves to modelling scientific articles as these are the most relevant to finding new technology insights.
Our data consists of the titles and abstracts of the scientific articles retrieved upon issuing the query. An example of a document is the paper “Reverse osmosis in the desalination of brackish water and sea water” [3]. The title and abstract body are concatenated into one feature per document.
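The data step above can be sketched in a few lines. This is a minimal illustration, assuming the hits arrive as a list of dictionaries already sorted by relevance; the field names `title` and `abstract` and the helper `build_documents` are hypothetical and not part of IGOR^AI.

```python
def build_documents(hits, limit=1000):
    """Concatenate title and abstract for the first `limit` hits.

    `hits` is assumed to be sorted by relevance already; the keys
    "title" and "abstract" are hypothetical field names.
    """
    documents = []
    for hit in hits[:limit]:
        # One text feature per document: title followed by abstract.
        documents.append(f'{hit["title"]} {hit["abstract"]}'.strip())
    return documents


hits = [
    {"title": "Reverse osmosis in desalination", "abstract": "We review membranes."},
    {"title": "Electrodialysis", "abstract": "Salt removal with ion exchange."},
]
documents = build_documents(hits, limit=1)
```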
Topic model
Our method of choice is the Additive Regularisation of Topic Models (ARTM) model from the paper “Tutorial on Probabilistic Topic Modelling: Additive Regularisation for Stochastic Matrix Factorisation” [4]. The model has an implementation on GitHub as a Python module [5]. The library also has a Latent Dirichlet Allocation [6] implementation in case you want to compare how it fares against the ARTM model. The difference between ARTM and LDA is that ARTM does not rely on additional probabilistic assumptions about the data, such as Dirichlet priors. Instead, ARTM adds a weighted sum of regularisers to the optimisation criterion, each of which is differentiable with respect to the model parameters.
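In the notation commonly used in the ARTM literature, the regularised criterion can be written as below. Here $n_{dw}$ is the count of term $w$ in document $d$, $\phi_{wt} = p(w \mid t)$ and $\theta_{td} = p(t \mid d)$ are the entries of the matrices $\Phi$ and $\Theta$, $R_i$ are the regularisers and $\tau_i$ their weights.

```latex
\sum_{d \in D} \sum_{w \in d} n_{dw} \ln \sum_{t \in T} \phi_{wt}\,\theta_{td}
\;+\; \sum_{i=1}^{k} \tau_i \, R_i(\Phi, \Theta)
\;\to\; \max_{\Phi,\,\Theta}
```

The first term is the log-likelihood of the collection; with every $\tau_i$ set to zero the criterion reduces to plain PLSA.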
Preprocessing
ARTM only accepts datasets in UCI Bag of Words (BOW) [7] or Vowpal Wabbit [8] format, and thus requires any tokenisation and preprocessing to be completed beforehand. A convenient way of getting the data in the right format is to use the CountVectorizer class of scikit-learn. The vectoriser class produces a feature matrix in BOW form which is well suited for use by ARTM.
Every element is preprocessed to include only topically relevant information. This includes:
We also remove a total of 1,192 stopwords from the text. This list is composed of scikit-learn’s English stopword list [9], the list of scientific stopwords curated by Seinecle [10] and a list of stopwords we added ourselves. Lastly, we exclude any terms that occur in more than 90% of documents as well as those that occur in fewer than 1% of documents.
To efficiently preprocess the data we start by putting all documents through spaCy’s pipe functionality and then tokenise the data by putting it through our custom preprocessor function. We provide a custom preprocessor function to the CountVectorizer to ensure the object works well with the output of spaCy. The code shows the functions used to execute all of these steps:
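A minimal sketch of such a pipeline follows; it is not the exact code used at Findest. It assumes a tokeniser-only spaCy pipeline (`spacy.blank("en")`), a tiny placeholder stopword set `EXTRA_STOPWORDS` instead of the full 1,192-word list, and unigrams plus bigrams (`ngram_range=(1, 2)`), which is inferred from topic words such as "reverse osmosis". The helper names `spacy_tokenizer` and `vectorise` are illustrative.

```python
import spacy
from sklearn.feature_extraction.text import CountVectorizer

# Tokeniser-only English pipeline (assumption); a trained model would be
# needed for lemmas or part-of-speech tags.
nlp = spacy.blank("en")

# Placeholder for a custom stopword list (illustrative entries only).
EXTRA_STOPWORDS = {"study", "paper", "results"}


def spacy_tokenizer(doc):
    """Turn a spaCy Doc into lowercase tokens, dropping stopwords and non-words."""
    return [
        token.lower_
        for token in doc
        if token.is_alpha
        and not token.is_stop
        and token.lower_ not in EXTRA_STOPWORDS
    ]


def vectorise(texts):
    """Return the document-term count matrix and the fitted CountVectorizer."""
    docs = list(nlp.pipe(texts))  # process all texts in batches with spaCy
    cv = CountVectorizer(
        preprocessor=lambda doc: doc,  # hand the Doc to the tokenizer untouched
        tokenizer=spacy_tokenizer,
        token_pattern=None,  # not used when a tokenizer is supplied
        ngram_range=(1, 2),  # unigrams and bigrams (assumption)
        min_df=0.01,  # drop terms in fewer than 1% of documents
        max_df=0.9,  # drop terms in more than 90% of documents
    )
    return cv.fit_transform(docs), cv


texts = [
    "Reverse osmosis is used for the desalination of sea water.",
    "Electrodialysis removes salt from brackish water.",
    "Membrane fouling limits reverse osmosis in water treatment.",
]
data, cv = vectorise(texts)
```

In this toy corpus "water" occurs in every document and is therefore removed by `max_df`, while the bigram "reverse osmosis" survives.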
Training the model
Now that we have our preprocessed data in BOW format we can almost train our topic model. Initialising an ARTM model requires the data to be divided into batches and requires the user to specify the number of desired topics. This is not desirable for live topic modelling and later we will explain how we tackled this problem. For now, the ARTM model is initialised as follows:
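A hedged sketch of such an initialisation with the BigARTM Python package is given here. The choice of regularisers and the `tau` values are placeholders for illustration, not the settings used at Findest, and the helper names `topic_names_for` and `build_model` are hypothetical.

```python
def topic_names_for(num_topics):
    """Names of the form topic1, topic2, ..., one per topic."""
    return [f"topic{i}" for i in range(1, num_topics + 1)]


def build_model(dictionary, num_topics):
    """Create an ARTM model; regularisers and tau values are placeholders."""
    # Imported here so the helper above can be used where BigARTM is not installed.
    import artm

    return artm.ARTM(
        topic_names=topic_names_for(num_topics),
        dictionary=dictionary,
        cache_theta=True,
        # Track the five most probable tokens per topic.
        scores=[artm.TopTokensScore(name="TopTokensScore", num_tokens=5)],
        regularizers=[
            # A negative tau sparsens the topic-word matrix Phi.
            artm.SmoothSparsePhiRegularizer(name="SparsePhi", tau=-0.1),
            # Pushes topics towards using different words.
            artm.DecorrelatorPhiRegularizer(name="DecorrelatorPhi", tau=1e4),
        ],
    )


names = topic_names_for(3)
```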
The model is initialised with a few regularisers we estimate suit our purposes. You can find all regularisers and their descriptions in the ARTM documentation [11]. We also add the TopTokensScore as a metric to keep track of, as it can later be used to retrieve topic representations. Here topic_names is a list of names, each representing one topic. We simply went with topic1, topic2, etc. The dictionary is the dataset dictionary as produced by the BatchVectorizer object of the ARTM package. This class conveniently splits your data into batches for the ARTM model to learn from. Example usage can be seen below:
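One way to build the batches from a CountVectorizer result is sketched here, assuming BigARTM's `bow_n_wd` input format, which expects a terms-by-documents count matrix and an index-to-token vocabulary. The helper names `vocabulary_dict` and `make_batches` are hypothetical.

```python
def vocabulary_dict(feature_names):
    """Map each column index of the count matrix to its token."""
    return dict(enumerate(feature_names))


def make_batches(data, cv):
    """Build an artm.BatchVectorizer and its dictionary from CountVectorizer output.

    `data` is the sparse document-term matrix returned by cv.fit_transform().
    """
    # Imported here so vocabulary_dict() can be used without BigARTM installed.
    import artm
    import numpy as np

    # BigARTM expects terms in rows and documents in columns, hence the transpose.
    n_wd = np.asarray(data.todense()).T
    vocabulary = vocabulary_dict(cv.get_feature_names_out())
    batch_vectorizer = artm.BatchVectorizer(
        data_format="bow_n_wd", n_wd=n_wd, vocabulary=vocabulary
    )
    return batch_vectorizer, batch_vectorizer.dictionary


vocab = vocabulary_dict(["membrane", "osmosis", "reverse osmosis"])
```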
Here data is the result of the earlier preprocessing and cv is the CountVectorizer object we used to transform our data. Now that we have our model we can train it with the following one-liner:
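A training call of this kind can be sketched with BigARTM's `fit_offline` method. The wrapper `train` and the default of ten passes over the collection are assumptions for illustration.

```python
def train(model, batch_vectorizer, num_passes=10):
    """Fit an ARTM model offline; the number of passes is a placeholder value."""
    model.fit_offline(
        batch_vectorizer=batch_vectorizer,
        num_collection_passes=num_passes,
    )
    return model
```

More passes give the model more iterations over the whole collection at the cost of training time, which matters when a model is trained per query.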
The topic representations generated by the trained model can be easily retrieved using the TopTokensScore metric we added earlier. This will get us the five most representative words for each of the topics generated by the model. This is illustrated in the following code fragment:
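A short sketch of that retrieval, assuming the score was registered under the name "TopTokensScore" and that `model` is a fitted `artm.ARTM` instance. The helper `topic_representations` is hypothetical.

```python
def topic_representations(model, score_name="TopTokensScore"):
    """Return {topic name: list of top tokens} from the latest training pass."""
    # score_tracker keeps the history of each registered score;
    # last_tokens holds the tokens from the most recent pass.
    last_tokens = model.score_tracker[score_name].last_tokens
    return {topic: list(tokens) for topic, tokens in last_tokens.items()}
```

The number of tokens per topic is fixed when the score is created, for example with `num_tokens=5`.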
The resulting topic representations are:
We can see clear distinctions between the representations. Each topic seems to focus on a specific niche and there is no overlap in words between the topics. Attentive readers may have noticed we never specified the number of topics we wanted the model to converge to. That is because we determine the number of topics automatically. The next section explains how.
Estimating N topics
To determine the optimal number of topics to initialise the model with, we iterate over a range of values and pick the N that provides the highest coherence. Coherence is a metric that estimates how interpretable topics are based on how well topics are supported by the data. It is calculated based on how often words appear together in the corpus and how they appear in the topic representations. We use the Gensim implementation of Coherence [12] which requires the topic representations (e.g. Topic 0 is represented by the words ‘osmosis’, ‘reverse’, ‘reverse osmosis’, ‘treatment’, ‘osmosis membrane’) and the tokenised documents in the corpus. Our CountVectorizer only gives us the BOW and does not save the tokenised documents. Luckily, we can get the tokeniser function used by the vectoriser and use it to tokenise our data once again as follows:
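The re-tokenisation can be sketched with scikit-learn's `build_analyzer()`. Using the analyser rather than `build_tokenizer()` is an assumption made here, because the analyser also applies lowercasing, stopword removal and n-gram generation, so bigrams such as "reverse osmosis" appear in the tokenised documents. The helper `tokenise_like` is hypothetical.

```python
from sklearn.feature_extraction.text import CountVectorizer


def tokenise_like(cv, raw_documents):
    """Tokenise documents exactly as the given CountVectorizer does."""
    analyzer = cv.build_analyzer()
    return [analyzer(doc) for doc in raw_documents]


docs = ["Reverse osmosis membrane", "Membrane fouling in reverse osmosis"]
cv = CountVectorizer(ngram_range=(1, 2)).fit(docs)
tokenised = tokenise_like(cv, docs)
```

Note that the analyser does not apply the `min_df` and `max_df` pruning, so the tokenised documents may contain terms that are missing from the fitted vocabulary.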
We can calculate the Coherence as shown:
The function returns a value between zero and one where higher values indicate a stronger coherence, which in turn indicates a better model. Now that we have a method to determine the best model we can iteratively decide which number of topics best suits our purposes. This is done by simply looping over a range of topic counts while keeping track of the best coherence score and the topic count that produced it.
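The search loop can be sketched independently of any library. Here `train_fn` and `coherence_fn` are hypothetical callables standing in for the training and coherence steps described above, and the candidate range is a placeholder.

```python
def pick_num_topics(candidates, train_fn, coherence_fn):
    """Return (best_n, best_score, best_model) over the candidate topic counts.

    train_fn(n) trains a model with n topics; coherence_fn(model) scores it.
    """
    best = None
    for n in candidates:
        model = train_fn(n)
        score = coherence_fn(model)
        # Keep the candidate with the highest coherence seen so far.
        if best is None or score > best[1]:
            best = (n, score, model)
    return best


# Toy stand-ins: the "model" is just a label and the scores are made up.
toy_scores = {2: 0.41, 3: 0.57, 4: 0.49}
best_n, best_score, best_model = pick_num_topics(
    range(2, 5),
    train_fn=lambda n: f"model-{n}",
    coherence_fn=lambda model: toy_scores[int(model.split("-")[1])],
)
```

Because a model is trained for every candidate, the size of the range directly determines how long a user waits for topics.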
Concluding remarks
The process of automatically determining the number of topics is likely less optimal than tweaking the results under human supervision. However, the described approach to topic modelling does allow us to perform fast, unsupervised topic modelling of the scientific abstracts returned by a query to IGOR^AI. In turn, this allows users of the technology scouting service to filter the search results and to learn new keywords from the generated topics, resulting in a better search experience and swifter discovery of technologies. We hope this blogpost was both clear and instructive and helps you develop your own topic modelling approach.
References