Through an agreement with Dimensions Science, the SCImago group was able to download a copy of Dimensions in JSON format. From the Scopus and Dimensions data of April 2020, the SCImago group created a relational database for internal use that allows massive computations that would otherwise be infeasible. The analysis that was the objective of this study required a matching procedure between the Dimensions and Scopus databases.

To this end, we applied a method developed in the SCImago group to match PATSTAT NPL references with documents.

This method has two phases: a broad generation of candidate pairs, followed by a second phase of pair validation. In this case, a modification was made: rather than validating only after all candidates had been generated, a validation procedure was applied as soon as a set of candidate pairs was available, accepting as valid the matches that exceeded a certain threshold. This reduced the combinatorial variability of the subsequent rounds of candidate generation.
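The interleaving of generation and validation can be illustrated with a minimal Python sketch. It is not the SCImago implementation: the function name `match_in_phases`, the shape of the `phases` callables, and the greedy highest-score-first acceptance are assumptions made only for illustration.

```python
def match_in_phases(scopus_ids, dimensions_ids, phases, score, threshold):
    """Generate candidates phase by phase, validating after each phase.

    `phases` is a list of hypothetical generator functions, ordered from the
    most specific to the broadest; each receives the still-unmatched ids.
    `score` is a hypothetical function returning the total score of a pair.
    """
    open_s, open_d = set(scopus_ids), set(dimensions_ids)
    accepted = {}   # scopus id -> dimensions id
    held_back = {}  # (scopus id, dimensions id) -> score, kept for later use
    for generate in phases:
        # Materialise and score the candidates before the open sets shrink.
        scored = [(score(s, d), s, d) for s, d in generate(open_s, open_d)]
        # Assumption: validate the highest-scoring candidates first.
        for total, s, d in sorted(scored, reverse=True):
            if total > threshold and s in open_s and d in open_d:
                accepted[s] = d
                open_s.discard(s)  # matched records leave the waiting lists,
                open_d.discard(d)  # so later phases search fewer records
            else:
                held_back[(s, d)] = total  # saved rather than discarded
    return accepted, held_back
```

Because accepted records are removed before the next phase runs, a later and broader phase only has to combine the records that are still waiting to be matched.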

The pairs that did not exceed the threshold were not discarded; they were saved in case, at the end, their records were still unpaired and they were the candidates with the greatest similarity. In more detail, our procedure began with the normalization of the fields to facilitate pairing. This is the case with journals such as PLOS One or Frontiers, for instance. We then started to generate candidate pairs. The phases were centered on the following: One of these conditions: (1) Same year of publication, title with a high degree of similarity, and the same DOI.

As can be seen, there are conditions that include some previous phases. It should be borne in mind that each candidate pair generation phase is followed by a validation. So the first phases are quite specific; they generate a small number of candidate pairs, most of which are accepted and come to constitute the majority of the definitively matched pairs.

In this way, the lists of documents waiting to be matched are reduced, allowing for broader searches in the following phases without greatly increasing the computational cost. Logically, the rate of success in the candidate pairs decreases from phase to phase. The last three were compared both numerically and alphanumerically. The comparison of each field generated a numerical score corresponding, with some adjustments, to the number of matching characters; the Levenshtein distance was used to compute it.
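As a rough illustration of a Levenshtein-based field score, the sketch below approximates the number of matching characters as the length of the longer string minus the edit distance. This formula is an assumption: the adjustments actually applied are not specified here, and the names `levenshtein` and `field_score` are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance (insertions, deletions, substitutions) between a and b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (ca != cb),  # substitution or match
            ))
        previous = current
    return previous[-1]


def field_score(a: str, b: str) -> int:
    """Assumed proxy for the number of matching characters in one field."""
    return max(len(a), len(b)) - levenshtein(a, b)


print(field_score("kitten", "sitting"))  # edit distance 3, longer length 7 -> 4
```

Under this proxy, longer fields that agree contribute larger scores, so a matching title weighs more than a matching volume number.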

Once the coincidence score had been calculated for each field, we multiplied the field scores to obtain the total score. The individual field scores are never zero, because a single zero would make the total score zero.
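A small Python sketch of the product rule follows. The function name `total_score` and the `floor` parameter are hypothetical; the default of 1.0 for a field that does not coincide is only one possible choice and is used here as an assumption.

```python
import math


def total_score(field_scores, floor=1.0):
    """Multiply the per-field scores into a total score.

    Any non-positive field score is replaced by `floor` (an assumed value),
    so that one non-matching field cannot force the product to zero.
    """
    return math.prod(s if s > 0 else floor for s in field_scores)


# Three fields scoring 10 each: the sum would be 30, the product is 1000.
print(total_score([10, 10, 10]))  # 1000
# One field fails to match and falls back to the floor of 1.0.
print(total_score([10, 0, 10]))   # 100.0
```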

In case of nonessential fields, the field score may be unity if the field is considered to be nonessential, 0. In either database, some fields of some records may be empty. Because the field scores are multiplied, coincidence in several fields increases the total score geometrically rather than arithmetically. Once the candidate pairs of a phase had been scored, we took as matched the pairs with a total score greater than 1,000 in which neither the Scopus nor the Dimensions record scored higher in any other pair.
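The acceptance rule can be sketched in Python as follows. The function name `accept_matches` and the dictionary layout are hypothetical. The sketch reads "scores higher" literally, so a pair tied with another pair at the top is still accepted here; that tie handling is an assumption.

```python
def accept_matches(scored_pairs, threshold=1000):
    """Return the pairs accepted in one validation step.

    `scored_pairs` maps (scopus_id, dimensions_id) to the pair's total score.
    A pair is accepted when its score exceeds `threshold` and neither of its
    records has a strictly higher score in any other candidate pair.
    """
    best_s, best_d = {}, {}
    for (s, d), total in scored_pairs.items():
        best_s[s] = max(best_s.get(s, 0), total)  # best score per Scopus record
        best_d[d] = max(best_d.get(d, 0), total)  # best score per Dimensions record
    return [
        (s, d)
        for (s, d), total in scored_pairs.items()
        if total > threshold and total == best_s[s] and total == best_d[d]
    ]


candidates = {
    ("s1", "d1"): 5000,
    ("s1", "d2"): 1200,  # above the threshold, but s1 scores higher with d1
    ("s2", "d2"): 1500,
    ("s3", "d3"): 800,   # below the threshold
}
print(accept_matches(candidates))  # [('s1', 'd1'), ('s2', 'd2')]
```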

The total score threshold of 1,000 was set after sampling and verifying that under these conditions no mismatched pair was found.

Once the five phases had been carried out, a repechage operation was initiated for the rejected candidate pairs. This accepted pairs, down to a total score of 50, in which both components obtained a lower score in all their other pairs. Also accepted were pairs with a score greater than 300 in which one of the components had another pair with exactly the same score. The latter was done because both databases contain some duplicated records.
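One possible reading of the repechage rules is sketched below in Python. The function name `repechage`, the inclusive lower bound of 50, and the interpretation that a tie must be at the top score of the records involved are all assumptions, not details confirmed by the text.

```python
from collections import defaultdict


def repechage(rejected, low=50, tie_min=300):
    """Second-chance acceptance over previously rejected candidate pairs.

    `rejected` maps (scopus_id, dimensions_id) to the pair's total score.
    """
    by_s, by_d = defaultdict(list), defaultdict(list)
    for (s, d), total in rejected.items():
        by_s[s].append(total)
        by_d[d].append(total)

    accepted = []
    for (s, d), total in rejected.items():
        others_s = list(by_s[s])
        others_s.remove(total)  # drop this pair's own score once
        others_d = list(by_d[d])
        others_d.remove(total)
        others = others_s + others_d  # scores of every competing pair
        strictly_best = all(o < total for o in others)
        tied_best = (all(o <= total for o in others)
                     and any(o == total for o in others))
        # Rule 1: unique best pair for both records, down to `low` (assumed inclusive).
        # Rule 2: tied best pair, accepted only above `tie_min` (duplicate records).
        if (strictly_best and total >= low) or (tied_best and total > tie_min):
            accepted.append((s, d))
    return accepted
```

In this reading, a tie above 300 accepts both tied pairs, which is consistent with one record being matched to two duplicated records in the other database.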

The general results are given in Table 1. It is true that, even though our study includes more years than that of Visser et al. The number of matched pairs grows from year to year, and in Scopus, the percentage of matches also grows.

This is not the case for Dimensions, however, due to the great growth this database experienced from year to year.



